I have a dataset in data frame format,
but I need to convert it to a matrix for use in a recommender system.
My data format:
col1 col2 col3
1 name 1 5.9
2 name 1 7.9
3 name 1 10
4 name 1 9
5 name 1 8.4
1 name 2 6
2 name 2 8.5
3 name 2 10
4 name 2 9.3
This is what I want:
name 1 name 2
1 5.9 6
2 7.9 8.5
3 10 10
4 9 9.3
5 8.4 NA (missing value, autofill "NA")
For the data you shared, the following base R solution works (as long as your data frame is called df and has the columns Hotel_Name and Hotel_Rating):
do.call(cbind, lapply(split(df$Hotel_Rating, df$Hotel_Name), `[`,
seq(max(table(df$Hotel_Name)))))
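To make the padding step concrete, here is a minimal self-contained sketch of the same idea. The column names Hotel_Name and Hotel_Rating are taken from the answer's code, and the sample values are copied from the question; n_max and m are names introduced here for illustration only.

```r
# Sample data from the question; column names assumed from the answer's code
df <- data.frame(
  Hotel_Name   = rep(c("name 1", "name 2"), times = c(5, 4)),
  Hotel_Rating = c(5.9, 7.9, 10, 9, 8.4, 6, 8.5, 10, 9.3)
)

# Size of the largest group; shorter groups are padded up to this length
n_max <- max(table(df$Hotel_Name))

# Split the ratings by hotel, then index each vector with 1:n_max.
# Indexing past the end of a vector returns NA, which does the padding.
m <- do.call(cbind, lapply(split(df$Hotel_Rating, df$Hotel_Name), `[`, seq(n_max)))
m
```

The result is a numeric matrix with one column per hotel, so it can be passed to functions that expect a matrix rather than a data frame.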
I have a wide table with more than 22 columns. This table is the result of fuzzymatch and that's why it's in wide format. The column names are shown below (in order) (I will try to create a sample data frame for better demonstration):
[1] "shift_date.x" "shift" "ageyrs" "site" "level"
[6] "crowded_shift" "time" "dd" "AE" "ageyrs_start"
[11] "ageyrs_end" "time_start" "time_end" "shift_date.y" "shift_n"
[16] "ageyrs_n" "site_n" "level_n" "crowded_shift_n" "los_n"
[21] "dd_n" "AE_n"
What I want to do is to break this data frame starting from column 14 to the end ("shift_date.y" to "AE_n") and add it as new rows to the bottom of first section of table (change it to long format). The problem is that the first section has 13 columns but the second part has 8 and I am not sure how I can combine them (that's why probably subsetting and rbind don't work).
As an example, imagine we have the following data frame:
shift <- c (2,1,0)
ageyrs <- c(12.2,13,14)
site <- c(0,1,3)
level <- c (1,5,6)
ageyrs_s <- c (2,4,5)
ageyrs_n <- c (4,6,8)
shift2 <- c (2,1,0)
ageyrs2 <- c(12.2,13,14)
site2 <- c(0,1,3)
level2 <- c (1,5,6)
a <- data.frame(shift, ageyrs, site, level, ageyrs_s, ageyrs_n, shift2, ageyrs2, site2, level2)
shift ageyrs site level ageyrs_s ageyrs_n shift2 ageyrs2 site2 level2
1 2 12.2 0 1 2 4 2 12.2 0 1
2 1 13.0 1 5 4 6 1 13.0 1 5
3 0 14.0 3 6 5 8 0 14.0 3 6
Now I want to break this data frame at the "shift2" column and create a data frame like the one shown below:
shift ageyrs site level ageyrs_s ageyrs_n
1 2 12.2 0 1 2 4
2 1 13.0 1 5 4 6
3 0 14.0 3 6 5 8
4 2 12.2 0 1 NA NA
5 1 13.0 1 5 NA NA
6 0 14.0 3 6 NA NA
Any suggestions on how to resolve this?
We can use split.default from base R to split the data into a list of data.frames by column name, and then convert it back to a single data.frame after unlisting the list elements:
nm1 <- sub("\\d+$", "", names(a))
lst1 <- lapply(split.default(a, nm1),
unlist, use.names = FALSE)
out <- data.frame(lapply(lst1, `length<-`, max(lengths(lst1))))[unique(nm1)]
-output
out
# shift ageyrs site level ageyrs_s ageyrs_n
#1 2 12.2 0 1 2 4
#2 1 13.0 1 5 4 6
#3 0 14.0 3 6 5 8
#4 2 12.2 0 1 NA NA
#5 1 13.0 1 5 NA NA
#6 0 14.0 3 6 NA NA
Or using tidyverse
library(dplyr)
library(tidyr)
library(stringr)
a %>%
rename_at(vars(shift:level), ~ str_c(., '1')) %>%
pivot_longer(cols = -c(ageyrs_s, ageyrs_n), names_to = c(".value", 'grp'),
names_sep = "(?<=[a-z])(?=[0-9])")
Try this. You can use bind_rows() and setNames() to define common names so that the values can be joined properly:
library(dplyr)
#Code
newa <- a %>% select(shift:ageyrs_n) %>%
bind_rows(a %>% select(shift2:level2) %>% setNames(gsub('2','',names(.))))
Output:
shift ageyrs site level ageyrs_s ageyrs_n
1 2 12.2 0 1 2 4
2 1 13.0 1 5 4 6
3 0 14.0 3 6 5 8
4 2 12.2 0 1 NA NA
5 1 13.0 1 5 NA NA
6 0 14.0 3 6 NA NA
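If you would rather avoid extra packages, the same stack-and-pad idea can be sketched in base R. This is only an illustration under the assumption that the second block starts at column 7 and that its names differ from the first block only by a trailing "2"; first, second and missing_cols are names made up for this sketch.

```r
# Example data frame from the question
a <- data.frame(
  shift = c(2, 1, 0), ageyrs = c(12.2, 13, 14), site = c(0, 1, 3),
  level = c(1, 5, 6), ageyrs_s = c(2, 4, 5), ageyrs_n = c(4, 6, 8),
  shift2 = c(2, 1, 0), ageyrs2 = c(12.2, 13, 14), site2 = c(0, 1, 3),
  level2 = c(1, 5, 6)
)

first  <- a[, 1:6]    # columns shift .. ageyrs_n
second <- a[, 7:10]   # columns shift2 .. level2

# Drop the trailing "2" so the names line up with the first block
names(second) <- sub("2$", "", names(second))

# Add the columns that exist only in the first block, filled with NA
missing_cols <- setdiff(names(first), names(second))
second[missing_cols] <- NA

# Stack the two blocks, with the second one reordered to match the first
out <- rbind(first, second[names(first)])
out
```

rbind() needs both data frames to have the same columns, which is why the missing columns are added explicitly; bind_rows() does that padding for you.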
Probably this is not that complex, but I couldn't figure out how to write a concise title explaining it:
I'm trying to use the aggregate function in R to return (1) the lowest value of a given column (val) by category (cat.2) in a data frame and (2) the value of another column (cat.1) on the same row. I know how to do part #1, but I can't figure out part #2.
The data:
cat.1<-c(1,2,3,4,5,1,2,3,4,5)
cat.2<-c(1,1,1,2,2,2,2,3,3,3)
val<-c(10.1,10.2,9.8,9.7,10.5,11.1,12.5,13.7,9.8,8.9)
df<-data.frame(cat.1,cat.2,val)
> df
cat.1 cat.2 val
1 1 1 10.1
2 2 1 10.2
3 3 1 9.8
4 4 2 9.7
5 5 2 10.5
6 1 2 11.1
7 2 2 12.5
8 3 3 13.7
9 4 3 9.8
10 5 3 8.9
I know how to use aggregate to return the minimum value for each cat.2:
> aggregate(df$val, by=list(df$cat.2), FUN=min)
Group.1 x
1 1 9.8
2 2 9.7
3 3 8.9
The second part of it, which I can't figure out, is to return the value in cat.1 on the same row of df where aggregate found min(df$val) for each cat.2. Not sure I'm explaining it well, but this is the intended result:
> ...
Group.1 x cat.1
1 1 9.8 3
2 2 9.7 4
3 3 8.9 5
Any help much appreciated.
If we need the output after the aggregate, we can merge it with the original dataset:
merge(aggregate(df$val, by=list(df$cat.2), FUN=min),
df, by.x = c('Group.1', 'x'), by.y = c('cat.2', 'val'))
# Group.1 x cat.1
#1 1 9.8 3
#2 2 9.7 4
#3 3 8.9 5
But, this can be done more easily with dplyr by using slice to slice the rows with the min value of 'val' after grouping by 'cat.2'
library(dplyr)
df %>%
group_by(cat.2) %>%
slice(which.min(val))
# A tibble: 3 x 3
# Groups: cat.2 [3]
# cat.1 cat.2 val
# <dbl> <dbl> <dbl>
#1 3 1 9.8
#2 4 2 9.7
#3 5 3 8.9
Or with data.table
library(data.table)
setDT(df)[, .SD[which.min(val)], cat.2]
Or in base R, this can be done with ave
df[with(df, val == ave(val, cat.2, FUN = min)),]
# cat.1 cat.2 val
#3 3 1 9.8
#4 4 2 9.7
#10 5 3 8.9
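One difference between these approaches shows up when a group has a tied minimum: the ave comparison keeps every tied row, while which.min keeps only the first. The small data frame below is a hypothetical variant of the question's data with a tie added, used only to illustrate this.

```r
# Hypothetical data: group 1 has two rows sharing the minimum value 9.8
df <- data.frame(
  cat.1 = c(1, 2, 3, 4),
  cat.2 = c(1, 1, 2, 2),
  val   = c(9.8, 9.8, 10.5, 9.7)
)

# ave-based filter: keeps every row whose val equals its group minimum
all_ties <- df[with(df, val == ave(val, cat.2, FUN = min)), ]

# which.min-based filter: keeps exactly one row (the first minimum) per group
first_only <- do.call(
  rbind,
  lapply(split(df, df$cat.2), function(g) g[which.min(g$val), ])
)

nrow(all_ties)    # 3 rows: both tied rows of group 1 plus one row of group 2
nrow(first_only)  # 2 rows: one per group
```

Which behavior is right depends on whether you want one row per group or all rows that achieve the minimum.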
I have a data frame containing an id column and other variables, and also a list of some ids. I want to extract the rows of the data frame whose id appears in the list.
data frame
id value time
1 12 1.0
1 14 1.6
4 18 2.0
6 9 3.6
3 11 4.2
5 12 0.8
list
1,3,4
Result
id value time
1 12 1.0
1 14 1.6
3 11 4.2
4 18 2.0
As @Sotos explained, this can be done as follows using %in%:
Data[Data$id %in% list,]
# id value time
# 1: 1 12 1.0
# 2: 1 14 1.6
# 3: 4 18 2.0
# 4: 3 11 4.2
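Note that %in% keeps the rows in their original order, whereas the desired result in the question is sorted by id. A minimal self-contained sketch, assuming the data is a plain data.frame and the ids are stored in a numeric vector (called ids here to avoid masking the base function list):

```r
# Data from the question
Data <- data.frame(
  id    = c(1, 1, 4, 6, 3, 5),
  value = c(12, 14, 18, 9, 11, 12),
  time  = c(1.0, 1.6, 2.0, 3.6, 4.2, 0.8)
)
ids <- c(1, 3, 4)

# Keep only the rows whose id is one of the wanted ids (original row order)
subset_rows <- Data[Data$id %in% ids, ]

# Optionally sort by id to match the layout of the desired result
sorted_rows <- subset_rows[order(subset_rows$id), ]
sorted_rows
```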
I am trying to get means from a column in a data frame based on a unique value. In this example, I want the mean of column b and column c for each unique value in column a. I thought the .(a) would make it calculate by unique value in a (it does give the unique values of a), but it just gives the mean of the whole column b or c.
df2<-data.frame(a=seq(1:5),b=c(1:10), c=c(11:20))
simVars <- c("b", "c")
for ( var in simVars ){
print(var)
dat = ddply(df2, .(a), summarize, mean_val = mean(df2[[var]])) ## my script
assign(var, dat)
}
c
a mean_val
1 15.5
2 15.5
3 15.5
4 15.5
5 15.5
How can I have it take an average for the column based on the unique value from column a?
thanks
You don't need a loop. Just calculate the means of b and c within a single call to ddply and the means will be calculated separately for each value of a. And, as @Gregor said, you don't need to re-specify the data frame name inside mean():
library(plyr)
ddply(df2, .(a), summarise,
mean_b=mean(b),
mean_c=mean(c))
a mean_b mean_c
1 1 3.5 13.5
2 2 4.5 14.5
3 3 5.5 15.5
4 4 6.5 16.5
5 5 7.5 17.5
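The original loop returned 15.5 for every group because mean(df2[[var]]) looks up the column in the full data frame rather than in the per-group piece, so each group gets the overall mean of c. For comparison, here is a sketch of the same grouped means in base R with aggregate, which needs no extra packages; res is a name introduced here for illustration.

```r
# Data from the question (a is recycled to length 10)
df2 <- data.frame(a = seq(1:5), b = c(1:10), c = c(11:20))

# Mean of b and c for each value of a, using the formula interface
res <- aggregate(cbind(b, c) ~ a, data = df2, FUN = mean)
res

# The value the original loop produced: the overall mean of column c
mean(df2$c)
```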
UPDATE: To get separate data frames for each column of means:
# Add a few additional columns to the data frame
df2 = data.frame(a=seq(1:5),b=c(1:10), c=c(11:20), d=c(21:30), e=c(31:40))
# New data frame with means by each level of column a
library(dplyr)
dfmeans = df2 %>%
group_by(a) %>%
summarise(across(everything(), mean)) # replaces the deprecated summarise_each(funs(mean)); needs dplyr >= 1.0
# Separate each column of means into a separate data frame and store it in a list:
means.list = lapply(names(dfmeans)[-1], function(x) {
cbind(dfmeans[,"a"], dfmeans[,x])
})
means.list
[[1]]
a b
1 1 3.5
2 2 4.5
3 3 5.5
4 4 6.5
5 5 7.5
[[2]]
a c
1 1 13.5
2 2 14.5
3 3 15.5
4 4 16.5
5 5 17.5
[[3]]
a d
1 1 23.5
2 2 24.5
3 3 25.5
4 4 26.5
5 5 27.5
[[4]]
a e
1 1 33.5
2 2 34.5
3 3 35.5
4 4 36.5
5 5 37.5
I have a data frame that looks like this:
site date var dil
1 A 7.4 2
2 A 6.5 2
1 A 7.3 3
2 A 7.3 3
1 B 7.1 1
2 B 7.7 2
1 B 7.7 3
2 B 7.4 3
I need to add a column called wt to this data frame that contains the weighting factor needed to calculate the weighted mean. This weighting factor has to be derived for each combination of site and date.
The approach I'm using is to first build a function that calculates the weighting factor:
> weight <- function(dil){
dil/sum(dil)
}
then apply the function for each combination of site and date
> df$wt <- ddply(df,.(date,site),.fun=weight)
but I get this error message:
Error in FUN(X[[1L]], ...) :
only defined on a data frame with all numeric variables
You are almost there. Modify your code to use the transform function. This allows you to add columns to the data.frame inside ddply:
weight <- function(x) x/sum(x)
ddply(df, .(date,site), transform, weight=weight(dil))
site date var dil weight
1 1 A 7.4 2 0.40
2 1 A 7.3 3 0.60
3 2 A 6.5 2 0.40
4 2 A 7.3 3 0.60
5 1 B 7.1 1 0.25
6 1 B 7.7 3 0.75
7 2 B 7.7 2 0.40
8 2 B 7.4 3 0.60
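As an alternative that keeps the rows in their original order and adds the column directly, the same per-group weights can be sketched with base R's ave. The column name wt follows the question; wmean is a hypothetical name for the per-group weighted means computed at the end.

```r
# Data from the question
df <- data.frame(
  site = c(1, 2, 1, 2, 1, 2, 1, 2),
  date = c("A", "A", "A", "A", "B", "B", "B", "B"),
  var  = c(7.4, 6.5, 7.3, 7.3, 7.1, 7.7, 7.7, 7.4),
  dil  = c(2, 2, 3, 3, 1, 2, 3, 3)
)

# Weight of each row within its (date, site) group: dil / sum(dil)
df$wt <- ave(df$dil, df$date, df$site, FUN = function(x) x / sum(x))

# Weighted mean of var per group is the sum of var * wt within the group
df$contrib <- df$var * df$wt
wmean <- aggregate(contrib ~ site + date, data = df, FUN = sum)
wmean
```

Because the weights sum to 1 within each group, summing var * wt gives the same result as weighted.mean(var, dil) computed per group.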