Probably this is not that complex, but I couldn't figure out how to write a concise title explaining it:
I'm trying to use the aggregate function in R to return (1) the lowest value of a given column (val) by category (cat.2) in a data frame and (2) the value of another column (cat.1) on the same row. I know how to do part #1, but I can't figure out part #2.
The data:
cat.1<-c(1,2,3,4,5,1,2,3,4,5)
cat.2<-c(1,1,1,2,2,2,2,3,3,3)
val<-c(10.1,10.2,9.8,9.7,10.5,11.1,12.5,13.7,9.8,8.9)
df<-data.frame(cat.1,cat.2,val)
> df
cat.1 cat.2 val
1 1 1 10.1
2 2 1 10.2
3 3 1 9.8
4 4 2 9.7
5 5 2 10.5
6 1 2 11.1
7 2 2 12.5
8 3 3 13.7
9 4 3 9.8
10 5 3 8.9
I know how to use aggregate to return the minimum value for each cat.2:
> aggregate(df$val, by=list(df$cat.2), FUN=min)
Group.1 x
1 1 9.8
2 2 9.7
3 3 8.9
The second part of it, which I can't figure out, is to return the value in cat.1 on the same row of df where aggregate found min(df$val) for each cat.2. Not sure I'm explaining it well, but this is the intended result:
> ...
Group.1 x cat.1
1 1 9.8 3
2 2 9.7 4
3 3 8.9 5
Any help much appreciated.
If we need the output after the aggregate, we can merge it with the original dataset:
merge(aggregate(df$val, by=list(df$cat.2), FUN=min),
df, by.x = c('Group.1', 'x'), by.y = c('cat.2', 'val'))
# Group.1 x cat.1
#1 1 9.8 3
#2 2 9.7 4
#3 3 8.9 5
This can be done more easily with dplyr, by grouping by 'cat.2' and using slice to keep the row with the minimum 'val' in each group:
library(dplyr)
df %>%
group_by(cat.2) %>%
slice(which.min(val))
# A tibble: 3 x 3
# Groups: cat.2 [3]
# cat.1 cat.2 val
# <dbl> <dbl> <dbl>
#1 3 1 9.8
#2 4 2 9.7
#3 5 3 8.9
Or with data.table
library(data.table)
setDT(df)[, .SD[which.min(val)], cat.2]
Or in base R, this can be done with ave
df[with(df, val == ave(val, cat.2, FUN = min)),]
# cat.1 cat.2 val
#3 3 1 9.8
#4 4 2 9.7
#10 5 3 8.9
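One difference between the variants above: which.min keeps only the first row per group when the minimum is tied, while the ave comparison keeps every tied row. If exactly one row per group is wanted in base R, a minimal sketch (the names ord and res are only illustrative, not from the answer) is to sort by group and value and keep the first row of each group:

```r
df <- data.frame(
  cat.1 = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
  cat.2 = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3),
  val   = c(10.1, 10.2, 9.8, 9.7, 10.5, 11.1, 12.5, 13.7, 9.8, 8.9)
)

# Sort by group, then by value, so the minimum comes first within each group
ord <- df[order(df$cat.2, df$val), ]

# duplicated() flags every row after the first one of each group
res <- ord[!duplicated(ord$cat.2), ]
res
```

With the sample data this should keep original rows 3, 4 and 10, the same rows as the other approaches.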
I have a dataset in data frame format, but I need to convert it to a matrix for a recommender system.
my data format:
col1 col2 col3
1 name 1 5.9
2 name 1 7.9
3 name 1 10
4 name 1 9
5 name 1 8.4
1 name 2 6
2 name 2 8.5
3 name 2 10
4 name 2 9.3
This is what I want:
name 1 name 2
1 5.9 6
2 7.9 8.5
3 10 10
4 9 9.3
5 8.4 NA (missing value, autofill "NA")
For the data you shared, the following base R solution works (as long as your data frame is called df):
do.call(cbind, lapply(split(df$Hotel_Rating, df$Hotel_Name), `[`,
seq(max(table(df$Hotel_Name)))))
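To see what that one-liner does, here is a self-contained sketch. The column names Hotel_Name and Hotel_Rating are taken from the answer's code; the data frame itself is rebuilt from the question's sample, so treat it as an assumption about the real data. split makes one vector of ratings per name, and indexing each vector with 1..n (n being the largest group size) pads the shorter ones with NA:

```r
df <- data.frame(
  Hotel_Name   = c(rep("name 1", 5), rep("name 2", 4)),
  Hotel_Rating = c(5.9, 7.9, 10, 9, 8.4, 6, 8.5, 10, 9.3)
)

# Size of the largest group: the number of rows the matrix needs
n <- max(table(df$Hotel_Name))

# One vector of ratings per name
by_name <- split(df$Hotel_Rating, df$Hotel_Name)

# Indexing past the end of a vector yields NA, which pads the short groups
m <- do.call(cbind, lapply(by_name, `[`, seq_len(n)))
m
```

The result is a numeric matrix with one column per name; the fifth row of "name 2" is NA because that group has only four ratings.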
I am coding in R.
I have a table like :
region;2012;2013;2014;2015
1;2465;245;2158;645
2;44;57;687;564
3;545;784;897;512
...
And I want to transform it into :
region;value;annee
1;2465;2012
1;245;2013
1;2158;2014
1;645;2015
2;44;2012
...
Do you know how I can do it ?
First, read the data:
dat <- read.csv2(text = "region;2012;2013;2014;2015
1;2465;245;2158;645
2;44;57;687;564
3;545;784;897;512",
check.names = FALSE)
The data frame can be converted into the long format with gather from package tidyr.
library(tidyr)
dat_long <- gather(dat, key = "annee", value = "value", -region)
The result:
region annee value
1 1 2012 2465
2 2 2012 44
3 3 2012 545
4 1 2013 245
5 2 2013 57
6 3 2013 784
7 1 2014 2158
8 2 2014 687
9 3 2014 897
10 1 2015 645
11 2 2015 564
12 3 2015 512
You can also produce the ;-separated result of your question:
write.csv2(dat_long, "", row.names = FALSE, quote = FALSE)
This results in:
region;annee;value
1;2012;2465
2;2012;44
3;2012;545
1;2013;245
2;2013;57
3;2013;784
1;2014;2158
2;2014;687
3;2014;897
1;2015;645
2;2015;564
3;2015;512
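If adding a package is not an option, the same reshaping can be sketched in base R with stack. This is only one possible alternative, not part of the answer above; the helper names year_cols and stacked are made up for the example. It also sorts by region and year, which is the row order shown in the question:

```r
dat <- read.csv2(
  text = "region;2012;2013;2014;2015\n1;2465;245;2158;645\n2;44;57;687;564\n3;545;784;897;512",
  check.names = FALSE
)

# Every column except region holds one year of values
year_cols <- setdiff(names(dat), "region")

# stack() puts all year columns into one 'values' column,
# with the source column name in 'ind'
stacked <- stack(dat[year_cols])

dat_long <- data.frame(
  region = rep(dat$region, times = length(year_cols)),
  value  = stacked$values,
  annee  = as.integer(as.character(stacked$ind))
)

# Sort by region, then year, to match the layout asked for in the question
dat_long <- dat_long[order(dat_long$region, dat_long$annee), ]
rownames(dat_long) <- NULL
dat_long
```

For new tidyr code, the tidyr documentation recommends pivot_longer() over gather(), which is no longer under active development.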
An example to answer the question:
olddata_wide
#> subject sex control cond1 cond2
#> 1 1 M 7.9 12.3 10.7
#> 2 2 F 6.3 10.6 11.1
#> 3 3 F 9.5 13.1 13.8
#> 4 4 M 11.5 13.4 12.9
library(tidyr)
# The arguments to gather():
# - data: Data object
# - key: Name of new key column (made from names of data columns)
# - value: Name of new value column
# - ...: Names of source columns that contain values
# - factor_key: Treat the new key column as a factor (instead of character vector)
data_long <- gather(olddata_wide, condition, measurement, control:cond2, factor_key=TRUE)
data_long
#> subject sex condition measurement
#> 1 1 M control 7.9
#> 2 2 F control 6.3
#> 3 3 F control 9.5
#> 4 4 M control 11.5
#> 5 1 M cond1 12.3
#> 6 2 F cond1 10.6
#> 7 3 F cond1 13.1
#> 8 4 M cond1 13.4
#> 9 1 M cond2 10.7
#> 10 2 F cond2 11.1
#> 11 3 F cond2 13.8
#> 12 4 M cond2 12.9
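The example above prints olddata_wide without creating it. Below is a sketch that rebuilds it from the printed values and does the same conversion with base R's reshape, in case tidyr is not available. The vector conds is just a helper name for this example:

```r
olddata_wide <- data.frame(
  subject = 1:4,
  sex     = c("M", "F", "F", "M"),
  control = c(7.9, 6.3, 9.5, 11.5),
  cond1   = c(12.3, 10.6, 13.1, 13.4),
  cond2   = c(10.7, 11.1, 13.8, 12.9)
)

# Columns that hold the measurements in the wide layout
conds <- c("control", "cond1", "cond2")

data_long <- reshape(
  olddata_wide,
  direction = "long",
  varying   = conds,          # wide columns to stack
  v.names   = "measurement",  # name of the new value column
  timevar   = "condition",    # name of the new key column
  times     = conds,          # key values, one per wide column
  idvar     = "subject"
)

# reshape() builds row names like "1.control"; drop them for plain numbering
rownames(data_long) <- NULL
data_long
```

Unlike gather with factor_key=TRUE, this leaves condition as a character column; wrap it in factor(..., levels = conds) if the ordering matters.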
I have a data frame (df) like this, which is just a sample:
group value
1 12.1
1 10.3
1 NA
1 11.0
1 13.5
2 11.7
2 NA
2 10.4
2 9.7
Namely,
df<-data.frame(group=c(1,1,1,1,1,2,2,2,2), value=c(12.1, 10.3, NA, 11.0, 13.5, 11.7, NA, 10.4, 9.7))
Desired output is:
group value order
1 12.1 3
1 10.3 1
1 NA NA
1 11.0 2
1 13.5 4
2 11.7 3
2 NA NA
2 10.4 2
2 9.7 1
Namely, I want to find the rank of each "value" within its "group", starting from the smallest value.
How can I do that in R? I would be very glad for any help. Thanks a lot.
We could use ave from base R to create the rank column ("order1") of "value" by "group". If the rows with NA in "value" should also be NA in the rank column, that can be set explicitly afterwards (df$order1[is.na(df$value)] <- NA):
df$order1 <- with(df, ave(value, group, FUN=rank))
df$order1[is.na(df$value)] <- NA
Or using data.table
library(data.table)
setDT(df)[, order1:=rank(value)* NA^(is.na(value)), by = group][]
# group value order1
#1: 1 12.1 3
#2: 1 10.3 1
#3: 1 NA NA
#4: 1 11.0 2
#5: 1 13.5 4
#6: 2 11.7 3
#7: 2 NA NA
#8: 2 10.4 2
#9: 2 9.7 1
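For reference, a self-contained run of the ave approach on the question's data. As far as I can tell from ?rank, the default is na.last = "keep", so NA values already get rank NA and the explicit NA assignment acts as a safeguard rather than a requirement; it is kept here so the intent stays visible:

```r
df <- data.frame(
  group = c(1, 1, 1, 1, 1, 2, 2, 2, 2),
  value = c(12.1, 10.3, NA, 11.0, 13.5, 11.7, NA, 10.4, 9.7)
)

# Rank 'value' separately within each 'group'
df$order1 <- with(df, ave(value, group, FUN = rank))

# Make sure rows with a missing value also have a missing rank
df$order1[is.na(df$value)] <- NA
df
```

Note that rank averages tied values by default; pass a different ties.method through an anonymous function if integer ranks are needed.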
You can use the rank() function applied to each group at a time to get your desired result. My solution for doing this is to write a small helper function and call that function in a for loop. I'm sure there are other more elegant means using various R libraries but here is a solution using only base R.
df <- read.table('~/Desktop/stack_overflow28283818.csv', sep = ',', header = T)
#helper function
rankByGroup <- function(df = NULL, grp = 1)
{
rank(df[df$group == grp, 'value'])
}
# Remove NAs
df.na <- df[is.na(df$value),]
df.0 <- df[!is.na(df$value),]
# For loop over groups to list the ranks
for(grp in unique(df.0$group))
{
df.0[df.0$group == grp, 'order'] <- rankByGroup(df.0, grp)
print(grp)
}
# Append NAs
df.na$order <- NA
df.out <- rbind(df.0,df.na)
#re-sort for ordering given in OP (probably not really required)
df.out <- df.out[order(as.numeric(rownames(df.out))),]
This gives exactly the output desired, although I suspect that maintaining the position of the NAs in the data may not be necessary for your application.
> df.out
group value order
1 1 12.1 3
2 1 10.3 1
3 1 NA NA
4 1 11.0 2
5 1 13.5 4
6 2 11.7 3
7 2 NA NA
8 2 10.4 2
9 2 9.7 1
I am trying to get the mean of a column in a data frame for each unique value of another column. In this example, I want the mean of column b and of column c for each unique value in column a. I thought .(a) would make ddply calculate by unique value of a (it does give the unique values of a), but I just get the mean of the whole column b or c.
df2<-data.frame(a=seq(1:5),b=c(1:10), c=c(11:20))
simVars <- c("b", "c")
for ( var in simVars ){
print(var)
dat = ddply(df2, .(a), summarize, mean_val = mean(df2[[var]])) ## my script
assign(var, dat)
}
c
a mean_val
1 15.5
2 15.5
3 15.5
4 15.5
5 15.5
How can I have it take an average for the column based on the unique value from column a?
thanks
You don't need a loop. Just calculate the means of b and c within a single call to ddply and the means will be calculated separately for each value of a. And, as @Gregor said, you don't need to re-specify the data frame name inside mean():
ddply(df2, .(a), summarise,
mean_b=mean(b),
mean_c=mean(c))
a mean_b mean_c
1 1 3.5 13.5
2 2 4.5 14.5
3 3 5.5 15.5
4 4 6.5 16.5
5 5 7.5 17.5
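To make the cause of the original 15.5 explicit: df2[[var]] always refers to the whole column of the full data frame, not to the piece ddply is currently working on, so every group gets the overall mean. A small base R sketch of both the problem and a grouped alternative (aggregate is swapped in here so the example runs without plyr; it is not the method from the answer):

```r
df2 <- data.frame(a = rep(1:5, times = 2), b = 1:10, c = 11:20)

# The full column, regardless of any grouping: this is the 15.5 from the question
overall_c <- mean(df2[["c"]])
overall_c

# Group means of b and c for each value of a
res <- aggregate(cbind(b, c) ~ a, data = df2, FUN = mean)
res
```

The grouped result should match the ddply output above, with the columns named b and c instead of mean_b and mean_c.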
UPDATE: To get separate data frames for each column of means:
# Add a few additional columns to the data frame
df2 = data.frame(a=seq(1:5),b=c(1:10), c=c(11:20), d=c(21:30), e=c(31:40))
# New data frame with means by each level of column a
library(dplyr)
dfmeans = df2 %>%
group_by(a) %>%
summarise(across(everything(), mean)) # summarise_each() and funs() are deprecated in current dplyr
# Separate each column of means into a separate data frame and store it in a list:
means.list = lapply(names(dfmeans)[-1], function(x) {
cbind(dfmeans[,"a"], dfmeans[,x])
})
means.list
[[1]]
a b
1 1 3.5
2 2 4.5
3 3 5.5
4 4 6.5
5 5 7.5
[[2]]
a c
1 1 13.5
2 2 14.5
3 3 15.5
4 4 16.5
5 5 17.5
[[3]]
a d
1 1 23.5
2 2 24.5
3 3 25.5
4 4 26.5
5 5 27.5
[[4]]
a e
1 1 33.5
2 2 34.5
3 3 35.5
4 4 36.5
5 5 37.5
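The same list of per-column data frames can also be sketched without dplyr. This is a hedged base R variant, not the answer's code; value_cols is a helper name introduced for the example, and the list is named so each result can be pulled out by column name:

```r
df2 <- data.frame(a = rep(1:5, times = 2), b = 1:10, c = 11:20,
                  d = 21:30, e = 31:40)

# Every column except the grouping column
value_cols <- setdiff(names(df2), "a")

# One aggregate() call per column; df2[col] keeps the column name in the result
means.list <- lapply(value_cols, function(col) {
  aggregate(df2[col], by = df2["a"], FUN = mean)
})
names(means.list) <- value_cols

means.list$d
```

Keeping the results in a named list avoids the assign() call from the question, which creates one loose variable per column.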
I have a dataframe (imported from a csv file) as follows
moose loose hoose
2 3 8
1 3 4
5 4 2
10 1 4
I would like R to add a column with the mean of each row and then remove all rows where that mean is < 4. With the mean column added, I get:
moose loose hoose mean
2 3 8 4.3
1 3 4 2.6
5 4 2 3.6
10 1 4 5
which should then end up as:
moose loose hoose mean
2 3 8 4.3
10 1 4 5
How can I do this in R?
dat2 <- subset(transform(dat1, Mean=round(rowMeans(dat1),1)), Mean >=4)
dat2
# moose loose hoose Mean
#1 2 3 8 4.3
#4 10 1 4 5.0
Using data.table
setDT(dat1)[, Mean:=rowMeans(.SD)][Mean>=4]
# moose loose hoose Mean
#1: 2 3 8 4.333333
#2: 10 1 4 5.000000
I will assume your data is called d. Then you run:
d$mean <- rowMeans(d) ## create a new column with the mean of each row
d <- d[d$mean >= 4, ] ## filter the data using this column in the condition
I suggest you read about creating variables in a data.frame, and filtering data. These are very common operations that you can use in many many contexts.
You could also use within, which allows you to assign/remove columns and then returns the transformed data. Start with df,
> df
# moose loose hoose
#1 2 3 8
#2 1 3 4
#3 5 4 2
#4 10 1 4
> within(d <- df[rowMeans(df) >= 4, ], { means <- round(rowMeans(d), 1) })
# moose loose hoose means
#1 2 3 8 4.3
#4 10 1 4 5.0
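One pitfall with all of the rowMeans variants: once the mean column exists, calling rowMeans on the whole data frame again would average the mean column as well. A minimal sketch that avoids this by naming the columns to average (the data frame name dat1 follows the first answer; value_cols is a made-up helper):

```r
dat1 <- data.frame(
  moose = c(2, 1, 5, 10),
  loose = c(3, 3, 4, 1),
  hoose = c(8, 4, 2, 4)
)

# Average only the measurement columns, so rerunning this is safe
value_cols <- c("moose", "loose", "hoose")
dat1$mean <- rowMeans(dat1[value_cols])

# Keep rows whose mean is at least 4 (that is, drop rows with mean < 4)
dat2 <- dat1[dat1$mean >= 4, ]
dat2
```

The unrounded mean is used for the filter; round it afterwards (for example round(dat2$mean, 1)) if only the display should change.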