Grouping bottom scores by two variables - r

I have a table that looks like this
uid gid score
1 a 5
1 a 8
1 a 9
1 b 2
1 b 7
2 a 5
2 a 9
.
.
.
But with many more entries for each user and group.
I want to get a table that has a row for each uid/gid pairing that is the mean of their bottom 5 scores.
This was trivial in Excel using pivot tables, but I need to do some analysis that R is much better for.
So I want my result to look like
uid gid top5avg
1 a 4.3
1 b 5.7
2 a 3.5
2 b 6.8
.
.
.
with one row for each uid gid pair and then the average of the top five scores for that uid/gid pair.

This is even more trivial in R, assuming your data frame is called dat and you really meant the top 5 scores (even though your example suggests the bottom 5):
library(plyr)
ddply(dat, .(uid, gid), summarise, top5avg = mean(tail(sort(score), 5)))
Note that this code assumes that there will be at least 5 observations in each group; with fewer, tail() returns all of them and the mean is taken over whatever is there.

If your data was called dat this would work:
aggregate(score~uid+gid, data=dat, function(x) mean(sort(x)[1:5]))
EDIT:
If you meant the opposite (top 5) of what I had, as indicated by Joran (I was confused too), then use rev as in:
aggregate(score~uid+gid, data=dat, function(x) mean(rev(sort(x))[1:5]))
Or use the tail suggestion Joran made.
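One difference between the two forms is worth spelling out: sort(x)[1:5] pads with NA when a group has fewer than five scores, so that group's mean comes out as NA, while head() and tail() just return what is there. A minimal sketch on made-up data (the values below are invented for illustration, not taken from the question):

```r
# Toy data, invented for illustration: group (1, "a") has 6 scores,
# group (1, "b") has only 2.
dat <- data.frame(
  uid   = c(1, 1, 1, 1, 1, 1, 1, 1),
  gid   = c("a", "a", "a", "a", "a", "a", "b", "b"),
  score = c(5, 8, 9, 1, 3, 7, 2, 7)
)

# Bottom 5 per group; head() tolerates groups with fewer than 5 rows.
bottom5 <- aggregate(score ~ uid + gid, data = dat,
                     FUN = function(x) mean(head(sort(x), 5)))

# Top 5 per group; tail() behaves the same way.
top5 <- aggregate(score ~ uid + gid, data = dat,
                  FUN = function(x) mean(tail(sort(x), 5)))

# Indexing with [1:5] yields NA for the short group.
indexed <- aggregate(score ~ uid + gid, data = dat,
                     FUN = function(x) mean(sort(x)[1:5]))
```

Which of the three is appropriate depends on whether a group with fewer than five scores should be averaged anyway or flagged as missing.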

And the data.table solution
library(data.table)
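dat <- as.data.table(dat) # setkey() needs a data.table, not a plain data.frame
setkey(dat, uid, gid, score) # sorts by uid, gid, then score, so tail() picks the largest scores
sol <- dat[, list(avg5 = mean(tail(score, 5))), by = 'uid,gid']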

Related

Removing/collapsing duplicate rows in R

I am using the following R code, which I copied from elsewhere (https://support.bioconductor.org/p/70133/). Seems to work great for what I hope to do (which is remove/collapse duplicates from a dataset), but I do not understand the last line. I would like to know on what basis the duplicates are removed/collapsed. It was commented it was based on the median absolute deviation (MAD), but I am not following that. Could anyone help me understand this, please?
Probesets=paste("a",1:200,sep="")
Genes=sample(letters,200,replace=T)
Value=rnorm(200)
X=data.frame(Probesets,Genes,Value)
X=X[order(X$Value,decreasing=T),]
Y=X[which(!duplicated(X$Genes)),]
Are you sure you want to remove those rows where the Genes values are duplicated? That's at least what this code does:
Y=X[which(!duplicated(X$Genes)),]
Thus, Y contains only unique Genes values. If you compare nrow(Y) and length(unique(X$Genes)) you will see that the result is the same:
nrow(Y); length(unique(X$Genes))
[1] 26
[1] 26
If you want to remove rows that contain duplicate values across all columns, which is arguably the definition of a duplicate row, then you can do this:
Y=X[!duplicated(X),]
To see how it works consider this example:
df <- data.frame(
a = c(1,1,2,3),
b = c(1,1,3,4)
)
df
a b
1 1 1
2 1 1
3 2 3
4 3 4
df[!duplicated(df),]
a b
1 1 1
3 2 3
4 3 4
Your code is keeping the records containing maximum value per gene.
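To see why the last line keeps the maximum per gene: duplicated() flags every occurrence of a value after the first, so once the rows are sorted by decreasing Value, the first row of each gene is the one with the largest Value. Nothing in the snippet itself computes a MAD; the selection uses whatever is stored in the Value column. A small check of this, as a sketch (the seed and the tapply() comparison are additions for illustration):

```r
set.seed(1)  # arbitrary seed so the toy data is reproducible
X <- data.frame(
  Probesets = paste0("a", 1:200),
  Genes     = sample(letters, 200, replace = TRUE),
  Value     = rnorm(200)
)

# Sort so that the largest Value of each gene comes first ...
X <- X[order(X$Value, decreasing = TRUE), ]
# ... then keep only the first row seen for each gene.
Y <- X[!duplicated(X$Genes), ]

# Independent check: per-gene maximum computed directly.
max_per_gene <- tapply(X$Value, X$Genes, max)

# TRUE if the kept rows carry exactly the per-gene maxima.
same <- isTRUE(all.equal(Y$Value,
                         as.vector(max_per_gene[as.character(Y$Genes)])))
```

If Value held a per-probeset MAD instead of random numbers, the same two lines would keep the probeset with the highest MAD for each gene.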

How can I extract and merge corresponding row variables for designated column in data frame if the row has string variable in R?

So I have this data frame in R, and I'd like to make a bar plot of the terms in one column via info <- table(df$ForPlot). But first I need to merge the corresponding value from the Name column into that column IF that row of the column I'd like to plot has text (some rows have 2 terms, some have 1 and others have none). So for example from this:
ID Name ForPlot
1 cool
2 nice ready soft
3 fast
4 slow party
5 good low
6 bad
7 true yo fit
8 false
I need a function or a practical way of accomplishing this:
ID Name ForPlot
1 cool
2 nice nice ready soft
3 fast
4 slow slow party
5 good good low
6 bad
7 true true yo fit
8 false
So ONLY if my "ForPlot" column has a string, the corresponding value from the "Name" column should be extracted and merged. Any ideas?
UPDATE So I thought I knew how to plot the frequencies via info <- table(df$ForPlot), which I thought would take the frequencies of all the different words in ForPlot, and then run a bar plot of that. I was wrong. Instead it counted the entire string of each row (multiple words) as one value. Any ideas on how to make a bar plot from a column with multiple values?
You can do it with ifelse
df$ForPlot <- ifelse(df$ForPlot != "", paste(df$Name, df$ForPlot), "")
> df
#Name ForPlot
#1 cool
#2 nice nice ready soft
#3 fast
#4 slow slow party
#5 good good low
#6 bad
#7 true true yo fit
#8 false
EDIT: Updated the answer as per @Robert Dove's comment
Here is a way:
i <- df$ForPlot != ''
df$ForPlot[i] <- paste(df$Name[i], df$ForPlot[i])
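Both approaches above assume Name and ForPlot are character columns; if they were read in as factors, converting them with as.character() first is probably needed. A self-contained sketch, with the data frame rebuilt by hand from the question's example (so the exact construction is an assumption):

```r
# Data frame typed in from the question's example.
df <- data.frame(
  ID      = 1:8,
  Name    = c("cool", "nice", "fast", "slow", "good", "bad", "true", "false"),
  ForPlot = c("", "ready soft", "", "party", "low", "", "yo fit", ""),
  stringsAsFactors = FALSE   # keep text columns as character
)

# Rows whose ForPlot cell actually contains text.
i <- nzchar(df$ForPlot)

# Prefix those cells with the row's Name; empty cells stay empty.
df$ForPlot[i] <- paste(df$Name[i], df$ForPlot[i])
```

Rows 2, 4, 5 and 7 pick up their Name as a prefix, and the other four rows are left untouched.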
We can use data.table. We convert the 'data.frame' to a 'data.table' (setDT(df1)); using the logical condition (ForPlot != '') in 'i', we assign to 'ForPlot' by pasting the 'Name' and 'ForPlot' columns. This should be very fast as we are assigning in place.
library(data.table)
setDT(df1)[ForPlot!='', ForPlot:= paste(Name, ForPlot)]
df1
# ID Name ForPlot
#1: 1 cool
#2: 2 nice nice ready soft
#3: 3 fast
#4: 4 slow slow party
#5: 5 good good low
#6: 6 bad
#7: 7 true true yo fit
#8: 8 false
Update
If we need a bar plot of the word frequency after the transformation, we can split the 'ForPlot' column by space (strsplit), unlist the output list, use table to get the frequency and then plot with barplot.
barplot(table(unlist(strsplit(df1$ForPlot, ' '))))
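A sketch of the same idea without the plot, to show what the table contains. The vector below is typed in from the expected result in the question, and the nzchar() filter is an extra guard (an addition, not part of the answer above) against empty strings:

```r
# ForPlot values after the merge, typed in from the question's expected output.
for_plot <- c("", "nice nice ready soft", "", "slow slow party",
              "good good low", "", "true true yo fit", "")

# Split every cell on single spaces and flatten into one vector of words.
words <- unlist(strsplit(for_plot, " ", fixed = TRUE))

# Drop any empty strings left over from blank cells or doubled spaces.
words <- words[nzchar(words)]

# Count how often each word occurs.
freq <- table(words)
```

Passing freq to barplot() then draws one bar per word rather than one bar per row.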

Getting corresponding values from data.frame

My problem is that I can't really put my problem into words, which makes it hard to google, so I am asking here. I hope you can shed light on my issue:
I got a data.frame like this:
6 4
5 2
3 6
0 7
0 2
1 3
6 0
1 1
As you noticed, in the first column 0 appears two times, 1 two times and so on. What I would like to do is get all the corresponding values for one number, say 0, from the second column (in this example 7 and 2), preferably as a data.frame.
I know the attempt with df$V2[which(df$V1==0)], however since the first column might have over 100 rows I can't really use this. Do you guys have a good solution?
Maybe some words regarding the background of this question: I need to process this data, i.e. get the mean of the second column for all 0's in the first columns, or get min/max values.
Regards
Here is a solution using dplyr:
library(dplyr)
df %>% group_by(V1) %>% summarize(ME = mean(V2))
Using your data (with some temporary names attached)
txt <- "6 4
5 2
3 6
0 7
0 2
1 3
6 0
1 1"
df <- read.table(text = txt)
names(df) <- paste0("Var", seq_len(ncol(df)))
Coerce the first column to be a factor
df <- transform(df, Var1 = factor(Var1))
Then you can use aggregate() with a nice formula interface
aggregate(Var2 ~ Var1, data = df, mean)
aggregate(Var2 ~ Var1, data = df, max)
aggregate(Var2 ~ Var1, data = df, min)
For example:
> aggregate(Var2 ~ Var1, data = df, mean)
Var1 Var2
1 0 4.5
2 1 2.0
3 3 6.0
4 5 2.0
5 6 2.0
Or use the default interface:
with(df, aggregate(Var2, list(Var1), FUN = mean))
> with(df, aggregate(Var2, list(Var1), FUN = mean))
Group.1 x
1 0 4.5
2 1 2.0
3 3 6.0
4 5 2.0
5 6 2.0
But the output is nicer from the formula interface.
Using data.table
library(data.table)
setDT(df)[, list(mean=mean(V2), max= max(V2), min=min(V2)), by = V1]
First, what exactly is the issue with the solution you suggest? Is it a question of efficiency? Frankly the code you present is close to optimal [1].
For the general case, you're probably looking at a split-apply-combine action, to apply a function to subsets of the data based on some differentiator. As @teucer points out, dplyr (and its ancestor, plyr) is designed for exactly this, as is data.table. In vanilla R, you would tend to use by or aggregate (or split and sapply for more advanced usage) for the same task. For example, to compute group means, you would do
by(df$V2, df$V1, mean)
or
aggregate(df, list(type=df$V1), mean)
Or even
sapply(split(df$V2, df$V1), mean)
[1] The code can be simplified to df$V2[df$V1 == 0] or df[df$V1 == 0,] as well.
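Since the question mentions needing the mean or the min/max for each value, here is a base R sketch that gets all three for every distinct value in a single call. It reuses the question's numbers with read.table's default column names V1 and V2; the do.call(data.frame, ...) step is one common way to flatten the matrix column that aggregate() returns:

```r
# The question's data, read with default column names V1 and V2.
df <- read.table(text = "6 4
5 2
3 6
0 7
0 2
1 3
6 0
1 1")

# One row per distinct V1 value, with several summaries of V2.
stats <- aggregate(V2 ~ V1, data = df,
                   FUN = function(x) c(mean = mean(x), min = min(x), max = max(x)))

# aggregate() stores the three summaries as one matrix column;
# this expands it into ordinary columns V2.mean, V2.min and V2.max.
stats <- do.call(data.frame, stats)

# The raw values per group are also available as a named list.
values_by_group <- split(df$V2, df$V1)
```

values_by_group[["0"]] holds the 7 and 2 from the question, so there is no need to repeat the subsetting command once per distinct value.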
Thanks all for your replies. I decided to go for the dplyr solution posted by teucer and eipi10. Since I have a third (and maybe even a fourth) column, this solution seems to be pretty easy to use (just adding V3 to group_by).
Since some are asking what's wrong with df$V2[which(df$V1==0)]: I was maybe a bit unclear when saying "rows"; what I actually meant was "values". Let's assume I had n distinct values in the first column: I would have to use the command n times, once for each distinct value, and store the n resulting vectors.

create new dataframe based on 2 columns

I have a large dataset "totaldata" containing multiple rows relating to each animal. Some of them are LactationNo 1 readings, and others are LactationNo 2 readings. I want to extract all animals that have readings from both LactationNo 1 and LactationNo 2 and store them in another dataframe "lactboth"
There are 16 other columns of variables of varying types in each row that I need to preserve in the new dataframe.
I have tried merge, aggregate and %in%, but perhaps I'm using them incorrectly eg.
(lactboth <- totaldata[totaldata$LactationNo %in% c(1,2), ])
Animal Id is column 1, and lactationno is column 2. I can't figure out how to select only those AnimalId with LactationNo=1&2
Have also tried
lactboth <- totaldata[ which(totaldata$LactationNo==1 & totaldata$LactationNo ==2), ]
I feel like this should be simple, but couldn't find an example to follow quite the same. Help appreciated!!
If I understand your question correctly, then your dataset looks something like this:
AnimalId LactationNo
1 A 1
2 B 2
3 E 2
4 A 2
5 E 2
and you'd like to select animals that have both lactation numbers 1 and 2 (like A in this particular example). If that's the case, then you can use merge to get the IDs of those animals:
# the trailing [,"AnimalId"] keeps only the ID column, so the result is a vector of IDs
lactboth <- merge(totaldata[totaldata$LactationNo == 1,],
                  totaldata[totaldata$LactationNo == 2,],
                  by.x="AnimalId",
                  by.y="AnimalId")[,"AnimalId"]
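The merge above returns only the IDs. To keep every row (and all the other columns) for those animals, one option is to intersect the two ID sets and filter with %in%. A minimal sketch on invented data, where the Yield column is a hypothetical stand-in for the 16 other variables:

```r
# Toy data, invented for illustration; Yield stands in for the other columns.
totaldata <- data.frame(
  AnimalId    = c("A", "B", "E", "A", "E"),
  LactationNo = c(1, 2, 2, 2, 2),
  Yield       = c(10, 12, 9, 11, 8)
)

# IDs seen with lactation 1 and IDs seen with lactation 2.
ids1 <- totaldata$AnimalId[totaldata$LactationNo == 1]
ids2 <- totaldata$AnimalId[totaldata$LactationNo == 2]

# Animals present in both sets.
both <- intersect(ids1, ids2)

# All rows, with all columns, for those animals.
lactboth <- totaldata[totaldata$AnimalId %in% both, ]
```

Here only animal A qualifies, so lactboth contains its two rows with every column preserved.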

R - aggregate factor/character variable

I have this kind of data frame :
df<- data.frame(cluster=c('1','1','2','3','3','3'), class=c('A','B','C','B','B','C'))
I would like to get for each cluster (1,2,3), the class which appears the most often. In case of a tie, it would also be great to get an info, as for example the combination of the classes (or if not possible just have NA).
So for my example, I would like to have something like this as result:
cluster class.max
1 'A B' (or NA)
2 'C'
3 'B'
Maybe I should use aggregate() but don't know how.
rank has ways of dealing with ties:
aggregate(class~cluster,df,function(x) paste(names(table(x)[rank(-1*table(x),ties.method="min")==1]),collapse=" "))
cluster class
1 1 A B
2 2 C
3 3 B
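An alternative that avoids rank, sketched under the assumption that a space-separated string is wanted for ties: tabulate each cluster and keep every class whose count equals the maximum. The helper name modal_classes is made up for this example:

```r
df <- data.frame(cluster = c('1', '1', '2', '3', '3', '3'),
                 class   = c('A', 'B', 'C', 'B', 'B', 'C'))

# Return every class that reaches the highest count, joined by a space.
modal_classes <- function(x) {
  counts <- table(as.character(x))  # as.character() drops unused factor levels
  paste(names(counts)[counts == max(counts)], collapse = " ")
}

# Apply the helper once per cluster.
res <- aggregate(class ~ cluster, data = df, FUN = modal_classes)
```

Returning NA for ties instead would mean checking inside the helper whether more than one class reaches the maximum.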
