How to group by column? - r

I have a dataframe of student scores. Instead of getting an overall average score for every student, I need the average scores by "course type" for every student; for example, courses a, c, d are one type, and courses b, e are another. I do this with the following code, but it doesn't feel very "R"-like:
x <- data.frame(a=c(1,2,3), b=c(4,5,6), c=c(6,7,8),
d=c(7,8,9), e=c(10, 11, 12))
group <- data.frame(no=c(1,2,1,1,2), name=c("a", "b", "c", "d","e"))
> x
a b c d e
1 1 4 6 7 10
2 2 5 7 8 11
3 3 6 8 9 12
> group
no name
1 1 a
2 2 b
3 1 c
4 1 d
5 2 e
I think this is rather clumsy:
x.1 <- x[,as.character(group$name[group$no==1])]
x.2 <- x[,as.character(group$name[group$no==2])]
mean.by.no <- data.frame(x.1.mean=apply(x.1, 1, mean),
x.2.mean=apply(x.2, 1, mean))

If mean.by.no is the expected result, we can split the 'name' column by 'no' (from the 'group' dataset) to get a list. Using one of the apply family of functions (lapply/sapply/vapply), we can use each list element as a column index into 'x' and get the mean of each row (rowMeans):
vapply(with(group, split(as.character(name), no)),
function(y) rowMeans(x[y]), numeric(nrow(x)))
# 1 2
#[1,] 4.666667 7
#[2,] 5.666667 8
#[3,] 6.666667 9
Or, using tapply, we can get the mean with a grouping index for the rows and columns:
# the group number of each column, spread over every cell of 'x' via col(x)
indx <- xtabs(no ~ name, group)[col(x)]
# mean by (group, row), transposed so rows correspond to students
t(tapply(as.matrix(x), list(indx, row(x)), FUN = mean))
# 1 2
#1 4.666667 7
#2 5.666667 8
#3 6.666667 9
Or another option would be to convert 'x' from 'wide' to 'long' format using melt from data.table, after converting the 'data.frame' to a 'data.table' (setDT). Set the key column to 'name' (setkey(...)), join with 'group', and get the mean grouped by 'no' and 'rn' (the row-number column created by keep.rownames=TRUE). If needed, the output can be converted back to 'wide' format using dcast:
library(data.table) # v1.9.5+
dL <- setkey(melt(setDT(x, keep.rownames=TRUE), id.vars='rn',
   variable.name='name')[, name := as.character(name)],
   name)[group[2:1]][, mean(value), by = list(no, rn)]
dcast(dL, rn ~ paste0('mean', no), value.var='V1')[, rn := NULL][]
# mean1 mean2
#1: 4.666667 7
#2: 5.666667 8
#3: 6.666667 9

There's probably a more elegant way to do this. Note that melting 'x' directly loses the student (row) identity, so we first add a student id, then melt, merge in the group assignments, and summarize per student and group:
library(reshape)
library(plyr)
x <- data.frame(a=c(1,2,3), b=c(4,5,6), c=c(6,7,8), d=c(7,8,9), e=c(10, 11, 12))
group <- data.frame(no=c(1,2,1,1,2), name=c("a", "b", "c", "d","e"))
x$student <- seq_len(nrow(x))
a <- melt(x, id="student")
names(a) <- c("student", "name", "score")
b <- merge(a, group, by="name")
c <- ddply(b, c("student", "no"), summarize, meanscore=mean(score))
> c
  student no meanscore
1       1  1  4.666667
2       1  2  7.000000
3       2  1  5.666667
4       2  2  8.000000
5       3  1  6.666667
6       3  2  9.000000

Related

Row operations on selected columns based on substring in data.table

I would like to apply a function to selected columns whose names match two different substrings. I've found this post related to my question, but I couldn't get an answer from there.
Here is a reproducible example with my failed attempt. For the sake of this example, I want to do a row-wise operation where I take the sum of all columns starting with v and subtract the average of all columns starting with f.
Update: the proposed solution must (a) use the := operator to take advantage of data.table's fast assignment by reference, and (b) be flexible to operations other than mean and sum, which I used here just for the sake of simplicity.
library(data.table)
# generate data
dt <- data.table(id= letters[1:5],
v1= 1:5,
v2= 1:5,
f1= 11:15,
f2= 11:15)
dt
#> id v1 v2 f1 f2
#> 1: a 1 1 11 11
#> 2: b 2 2 12 12
#> 3: c 3 3 13 13
#> 4: d 4 4 14 14
#> 5: e 5 5 15 15
# what I've tried
dt[, Y := sum( .SDcols=names(dt) %like% "v" ) - mean( .SDcols=names(dt) %like% "f" ) by = id]
We melt the dataset into 'long' format using the measure argument, get the difference between the sum of 'v' and the mean of 'f' grouped by 'id', join the result back to the original dataset on the 'id' column, and assign (:=) the 'V1' column to the 'Y' variable:
dt[melt(dt, measure = patterns("^v", "^f"), value.name = c("v", "f"))[
, sum(v) - mean(f), id], Y :=V1, on = .(id)]
dt
# id v1 v2 f1 f2 Y
#1: a 1 1 11 11 -9
#2: b 2 2 12 12 -8
#3: c 3 3 13 13 -7
#4: d 4 4 14 14 -6
#5: e 5 5 15 15 -5
Or another option is with Reduce, after creating indices of the 'v' and 'f' columns:
nmv <- which(startsWith(names(dt), "v"))
nmf <- which(startsWith(names(dt), "f"))
lf <- length(nmf)  # number of 'f' columns, the divisor for their mean
dt[, Y := Reduce(`+`, .SD[, nmv, with = FALSE]) - (Reduce(`+`, .SD[, nmf, with = FALSE])/lf)]
rowSums and rowMeans combined with grep can also accomplish this, wrapped in := so the column is added by reference:
dt[, Y := rowSums(dt[, grep("^v", names(dt)), with = FALSE]) -
          rowMeans(dt[, grep("^f", names(dt)), with = FALSE])]

dummy variable based on unique column interaction [duplicate]

This question already has answers here:
Pasting elements of two vectors alphabetically
(5 answers)
Closed 2 years ago.
I have the following data and wish to create an $ID variable for each unique interaction between two columns
DATE <- c('V', 'V', 'W', 'W', 'X', 'X', 'Y', 'Y', 'Z', 'Z')
SEX <- rep(1:2, 5)
Blood_T1 <- c(3,4,3,3,4,3,1,6,3,4)
Blood_T2 <- c(4,3,3,3,3,4,6,1,4,3)
df1 <- data.frame(DATE, SEX, Blood_T1, Blood_T2)
When grouping by $DATE, I want to create a new dummy variable for each unique combination of $Blood_T1 and $Blood_T2, regardless of their order.
I can't use the sum of the two columns as the ID, as it does not always produce unique combinations: for example, the pairs (3,4) and (1,6) both sum to 7.
I have tried the following commands but have not yet hit the nail on the head:
with(df1, interaction(Blood_T1, Blood_T2))
as.numeric(as.factor(with(df1, paste(Blood_T1, Blood_T2))))
transform(df1, Cluster_ID = as.numeric(interaction(Blood_T1, Blood_T2, drop=TRUE)))
You can actually sort the individual pairs ($Blood_T1 and $Blood_T2) and paste them together which is already a kind of ID
apply(df1, 1, function(x) paste(sort(x[3:4]), collapse = ""))
#[1] "34" "34" "33" "33" "34" "34" "16" "16" "34" "34"
If you want to further reduce it, you can treat it as a factor and obtain the numeric value
as.numeric(as.factor(apply(df1, 1, function(x) paste(sort(x[3:4]), collapse = ""))))
#[1] 3 3 2 2 3 3 1 1 3 3
You could throw in DATE too, if that is necessary
apply(df1, 1, function(x) paste(sort(x[c(1,3:4)]), collapse = ""))
#[1] "34V" "34V" "33W" "33W" "34X" "34X" "16Y" "16Y" "34Z" "34Z"
We can try with data.table. Convert the 'data.frame' to a 'data.table' (setDT(df1)), get the pmin and pmax of the 'Blood_T1' and 'Blood_T2' columns, paste them together, and match the values against the unique elements to create 'Unique_ID'. Then we group by 'DATE' and combine the sums of 'Blood_T1' and 'Blood_T2' to create the 'Sum' column:
library(data.table)
setDT(df1)[, Unique_ID := {
i1 <- paste(pmin(Blood_T1, Blood_T2), pmax(Blood_T1, Blood_T2))
match(i1, unique(i1))}]
df1[, Sum := c(sum(Blood_T1), sum(Blood_T2)), DATE][]
# DATE SEX Blood_T1 Blood_T2 Unique_ID Sum
#1: V 1 3 4 1 7
#2: V 2 4 3 1 7
#3: W 1 3 3 2 6
#4: W 2 3 3 2 6
#5: X 1 4 3 1 7
#6: X 2 3 4 1 7
#7: Y 1 1 6 3 7
#8: Y 2 6 1 3 7
#9: Z 1 3 4 1 7
#10: Z 2 4 3 1 7
The above can also be implemented in base R, i.e. a vectorized approach:
i1 <- with(df1, paste(pmin(Blood_T1, Blood_T2), pmax(Blood_T1, Blood_T2)))
df1$Unique_ID <- match(i1, unique(i1))

How to group in data.table with overlapping value?

I have a question relating to data.table in R.
I am working with acceleration data and need to generate features from the raw data. I want to group the data into 2-second windows. That is easy when the windows don't overlap: generate one more column indicating the group of each 2-second window and group with by (see the sketch after the data below).
However, I want overlapping windows. For example, my raw data is this:
a=data.table(x = c(1:10), y= c(2:11), z = c(5), second=rep(c(1:5),each=2))
x y z second
1: 1 2 5 1
2: 2 3 5 1
3: 3 4 5 2
4: 4 5 5 2
5: 5 6 5 3
6: 6 7 5 3
7: 7 8 5 4
8: 8 9 5 4
9: 9 10 5 5
10: 10 11 5 5
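For reference, the non-overlapping version I mentioned is a one-liner; a minimal sketch (column names as in the example above):
# non-overlapping: one group per value of 'second'
a[, .(Mean = mean(unlist(.SD))), by = second, .SDcols = c("x", "y", "z")]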
Now, I want to calculate the mean of the x, y, z columns over each overlapping 2-second window: seconds 1 and 2, 2 and 3, 3 and 4, 4 and 5.
I could use for loops, but since I have a huge dataset, that would take a long time. Do you know how to do it with just data.table tools?
Thanks so much
Here's another way:
ag = data.table(
second = c(1:2, 2:3, 3:4, 4:5),
g = rep(paste(1:4, 2:5, sep="-"), each=2)
)
a[ag, on="second"][, mean(unlist(.SD)), by=g, .SDcols=x:z]
# g V1
# 1: 1-2 3.666667
# 2: 2-3 5.000000
# 3: 3-4 6.333333
# 4: 4-5 7.666667
I'm sure you could write ag less manually, but it's not clear to me what the rules behind it are.
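For what it's worth, here is a sketch of one way to build ag programmatically, assuming every window spans a consecutive pair of seconds:
# pair each second with the next one
secs <- sort(unique(a$second))
ag <- data.table(
  second = rep(head(secs, -1), each = 2) + c(0, 1),
  g = rep(paste(head(secs, -1), tail(secs, -1), sep = "-"), each = 2)
)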
Generally, if you are computing statistics across columns, then your data is not well-formatted. If you have time, I'd suggest reading about making data "tidy".
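For instance, a sketch reusing the join above: melting the joined table turns the cross-column mean into an ordinary grouped aggregation over a single value column.
melt(a[ag, on = "second"], id.vars = c("second", "g"))[, .(Mean = mean(value)), by = g]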
As there are exactly 2 observations per 'second', we create lead versions of the 'x', 'y', 'z' columns (shifted up by 2 rows, i.e. one second ahead), drop the incomplete last rows with na.omit, then unlist the Subset of Data.table (.SD) and take the mean, grouped by the window label:
nm1 <- c("x", "y", "z")
na.omit(a[, paste0(nm1, 2) := lapply(.SD, function(x) shift(x, 2,
type = "lead")), .SDcols = nm1])[, .(Mean = mean(unlist(.SD))),
.(second = paste0(second, "-", second + 1))]
# second Mean
#1: 1-2 3.666667
#2: 2-3 5.000000
#3: 3-4 6.333333
#4: 4-5 7.666667
Or a slightly more compact option would be
library(dplyr)
cbind(a[second!= last(second)], a[second!= first(second)])[
,.(Mean = mean(unlist(.SD))), .(second = paste0(second, "-", second+1))]
# second Mean
#1: 1-2 3.666667
#2: 2-3 5.000000
#3: 3-4 6.333333
#4: 4-5 7.666667
Or another option would be to place the two offset subsets in a list, rbind them into a single dataset (idcol=TRUE adds an '.id' column), create a window index 'id1', and then take either the overall mean after unlisting the .SDcols or the individual mean of each column:
dt1 <- rbindlist(list(a[second!= last(second)],
a[second!= first(second)]), idcol=TRUE)[, id1:= as.numeric(gl(.N, 2, .N)), .id][]
Get the mean for each column by 'second'
dt1[, lapply(.SD, mean), .(second = paste0(id1, "-", id1 + 1)), .SDcols = x:z]
Get the whole mean by 'second'
dt1[, mean(unlist(.SD)), .(second = paste0(id1, "-", id1 +1)), .SDcols = x:z]

R: How to sum up (aggregate) values of dfs according to column criteria within a list?

I want to sum up all values from the same country within each df of a list of dfs (df-wise!). Here is some example data:
df1 <- data.frame(CNTRY = c("A", "B", "C"), Value=c(3,1,4))
df2 <- data.frame(CNTRY = c("A", "B", "C", "C"),Value=c(3,5,8,7))
dfList <- list(df1, df2)
names(dfList) <- c("111.2000", "112.2000")
My list consists of dfs (only dfs) with different numbers of rows but the same column structure. The list names are a mixture of article IDs and years; there are more than 1000 dfs.
Now my question: how can I sum up or aggregate the country values in each df?
My expected result is:
$`111.2000`
CNTRY Value
1 A 3
2 B 1
3 C 4
$`112.2000`
CNTRY Value
1 A 3
2 B 5
3 C 15
I tried aggregate(Value ~ CNTRY, data=dfList, FUN=sum), which gives an error, since CNTRY and Value are not objects but columns within the list elements. Any ideas? Thanks in advance.
Use lapply() to apply the aggregate() function over dfList:
lapply(dfList, function(x) aggregate(Value ~ CNTRY, x, sum))
# $`111.2000`
# CNTRY Value
# 1 A 3
# 2 B 1
# 3 C 4
#
# $`112.2000`
# CNTRY Value
# 1 A 3
# 2 B 5
# 3 C 15
A list of DFs is better than a bunch of DFs in the wild, but a single DF is even better:
library(data.table)
DF = rbindlist(dfList, idcol="id")
id CNTRY Value
1: 111.2000 A 3
2: 111.2000 B 1
3: 111.2000 C 4
4: 112.2000 A 3
5: 112.2000 B 5
6: 112.2000 C 8
7: 112.2000 C 7
From there, you can aggregate using data.table syntax:
DTres <- DF[, .(Value = sum(Value)), by=.(id, CNTRY)]
id CNTRY Value
1: 111.2000 A 3
2: 111.2000 B 1
3: 111.2000 C 4
4: 112.2000 A 3
5: 112.2000 B 5
6: 112.2000 C 15
From here, you can do things like
dcast(DTres, id ~ CNTRY)
id A B C
1: 111.2000 3 1 4
2: 112.2000 3 5 15
I'm sure there's some way to do this in base R as well, but I'd say don't bother.
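That said, for completeness, here is a rough base R sketch of the same approach, assuming the list names should become the 'id' column:
DF2 <- do.call(rbind, Map(cbind, id = names(dfList), dfList))  # stack with an id column
res <- aggregate(Value ~ id + CNTRY, DF2, sum)                 # sum per id and country
xtabs(Value ~ id + CNTRY, res)                                 # rough analogue of the dcast step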

Reduce dataset based on value

I have a dataset
dtf<-data.frame(id=c("A","A","A","A","B","B","B","B"), value=c(2,4,6,8,4,6,8,10))
For every id, the values are sorted in ascending order.
I want to reduce dtf to include only the first row per id whose value exceeds a specified limit: one row per id, namely the first one whose value exceeds the limit.
For this example and for the limit of 5 the dtf should reduce to :
A 6
B 6
Is there a nice way to do this?
Thanks a lot
It could be done with aggregate:
dtf<-data.frame(id=c("A","A","A","A","B","B","B","B"), value=c(2,4,6,8,4,6,8,10))
limit <- 5
aggregate(value ~ id, dtf, function(x) x[x > limit][1])
The result:
id value
1 A 6
2 B 6
Update: A solution for multiple columns:
An example data frame, dtf2:
dtf2 <- data.frame(id=c("A","A","A","A","B","B","B","B"),
value=c(2,4,6,8,4,6,8,10),
col3 = letters[1:8],
col4 = 1:8)
A solution including ave:
with(dtf2, dtf2[ave(value, id, FUN = function(x) cumsum(x > limit)) == 1, ])
The result:
id value col3 col4
3 A 6 c 3
6 B 6 f 6
Here is a "nice" option using data.table:
library(data.table)
DT <- data.table(dtf, key = "id")
DT[value > 5, head(.SD, 1), by = key(DT)]
# id value
# 1: A 6
# 2: B 6
And, in the spirit of sharing, an option using sqldf which might be nice depending on whether you feel more comfortable with SQL.
sqldf("select id, min(value) as value from dtf where value > 5 group by id")
# id value
# 1 A 6
# 2 B 6
Update: Unordered source data, and a data.frame with multiple columns
Based on your comments to some of the answers, it seems like there might be a chance that your "value" column might not be ordered like it is in your example, and that there are other columns present in your data.frame.
Here are two alternatives for those scenarios, one with data.table, which I find easiest to read and is most likely the fastest, and one with a typical "split-apply-combine" approach that is commonly needed for such tasks.
First, some sample data:
dtf2 <- data.frame(id = c("A","A","A","A","B","B","B","B"),
value = c(6,4,2,8,4,10,8,6),
col3 = letters[1:8],
col4 = 1:8)
dtf2 # Notice that the value column is not ordered
# id value col3 col4
# 1 A 6 a 1
# 2 A 4 b 2
# 3 A 2 c 3
# 4 A 8 d 4
# 5 B 4 e 5
# 6 B 10 f 6
# 7 B 8 g 7
# 8 B 6 h 8
Second, the data.table approach:
library(data.table)
DT <- data.table(dtf2)
DT # Verify that the data are not ordered
# id value col3 col4
# 1: A 6 a 1
# 2: A 4 b 2
# 3: A 2 c 3
# 4: A 8 d 4
# 5: B 4 e 5
# 6: B 10 f 6
# 7: B 8 g 7
# 8: B 6 h 8
DT[order(value)][value > 5, head(.SD, 1), by = "id"]
# id value col3 col4
# 1: A 6 a 1
# 2: B 6 h 8
Third, base R's common "split-apply-combine" approach:
do.call(rbind,
lapply(split(dtf2, dtf2$id),
function(x) x[x$value > 5, ][which.min(x$value[x$value > 5]), ]))
# id value col3 col4
# A A 6 a 1
# B B 6 h 8
Another approach with aggregate:
> aggregate(value~id, dtf[dtf[,'value'] > 5,], min)
id value
1 A 6
2 B 6
This relies on the values being sorted in ascending order within each id, so that the minimum value above the limit (the entry returned by min) is also the first one to exceed it.
Might as well add an alternative with plyr and head:
library(plyr)
dtf<-data.frame(id=c("A","A","A","A","B","B","B","B"), value=c(2,4,6,8,4,6,8,10))
limit <- 5
result <- ddply(dtf, "id", function(x) head(x[x$value > limit ,],1) )
> result
id value
1 A 6
2 B 6
This depends on your data.frame being sorted:
threshold <- 5
foo <- dtf[dtf$value > threshold, ]  # keep only rows exceeding the limit
# row 1, plus the first row at each change of id
foo[c(1, which(diff(as.numeric(as.factor(foo$id))) > 0) + 1), ]
