Reduce dataset based on value - R

I have a dataset:
dtf <- data.frame(id=c("A","A","A","A","B","B","B","B"), value=c(2,4,6,8,4,6,8,10))
For every id, the values are sorted in ascending order.
I want to reduce dtf to include only the first row per id where the value exceeds a specified limit: one row per id, namely the row where the value first exceeds the limit.
For this example, with a limit of 5, dtf should reduce to:
A 6
B 6
Is there a nice way to do this?
Thanks a lot

It could be done with aggregate:
dtf<-data.frame(id=c("A","A","A","A","B","B","B","B"), value=c(2,4,6,8,4,6,8,10))
limit <- 5
aggregate(value ~ id, dtf, function(x) x[x > limit][1])
The result:
id value
1 A 6
2 B 6
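To see what the anonymous function does, here is its effect on a single group (a minimal sketch using the values for id "A"; like the answer itself, it relies on the values being sorted in ascending order):
x <- c(2, 4, 6, 8)  # values for id "A"
x[x > limit]        # 6 8 -- all values above the limit
x[x > limit][1]     # 6   -- the first of them (also the smallest, since x is sorted)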
Update: A solution for multiple columns:
An example data frame, dtf2:
dtf2 <- data.frame(id = c("A","A","A","A","B","B","B","B"),
                   value = c(2,4,6,8,4,6,8,10),
                   col3 = letters[1:8],
                   col4 = 1:8)
A solution using ave:
with(dtf2, dtf2[ave(value, id, FUN = function(x) cumsum(x > limit)) == 1, ])
The result:
id value col3 col4
3 A 6 c 3
6 B 6 f 6
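The trick is that cumsum(x > limit) counts, within each id, how many values seen so far exceed the limit, so it equals 1 exactly at the first row above the limit. A small sketch of the intermediate result for id "A":
x <- c(2, 4, 6, 8)
x > limit          # FALSE FALSE TRUE TRUE
cumsum(x > limit)  # 0 0 1 2 -- equals 1 only at the first value above the limit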

Here is a "nice" option using data.table:
library(data.table)
DT <- data.table(dtf, key = "id")
DT[value > 5, head(.SD, 1), by = key(DT)]
# id value
# 1: A 6
# 2: B 6
And, in the spirit of sharing, an option using sqldf, which might be nice if you feel more comfortable with SQL.
sqldf("select id, min(value) as value from dtf where value > 5 group by id")
# id value
# 1 A 6
# 2 B 6
Update: Unordered source data, and a data.frame with multiple columns
Based on your comments on some of the answers, it seems that your "value" column might not always be ordered as it is in your example, and that your data.frame may contain other columns.
Here are two alternatives for those scenarios, one with data.table, which I find easiest to read and is most likely the fastest, and one with a typical "split-apply-combine" approach that is commonly needed for such tasks.
First, some sample data:
dtf2 <- data.frame(id = c("A","A","A","A","B","B","B","B"),
                   value = c(6,4,2,8,4,10,8,6),
                   col3 = letters[1:8],
                   col4 = 1:8)
dtf2 # Notice that the value column is not ordered
# id value col3 col4
# 1 A 6 a 1
# 2 A 4 b 2
# 3 A 2 c 3
# 4 A 8 d 4
# 5 B 4 e 5
# 6 B 10 f 6
# 7 B 8 g 7
# 8 B 6 h 8
Second, the data.table approach:
library(data.table)
DT <- data.table(dtf2)
DT # Verify that the data are not ordered
# id value col3 col4
# 1: A 6 a 1
# 2: A 4 b 2
# 3: A 2 c 3
# 4: A 8 d 4
# 5: B 4 e 5
# 6: B 10 f 6
# 7: B 8 g 7
# 8: B 6 h 8
DT[order(value)][value > 5, head(.SD, 1), by = "id"]
# id value col3 col4
# 1: A 6 a 1
# 2: B 6 h 8
Third, a typical base R "split-apply-combine" approach:
do.call(rbind,
        lapply(split(dtf2, dtf2$id),
               function(x) x[x$value > 5, ][which.min(x$value[x$value > 5]), ]))
# id value col3 col4
# A A 6 a 1
# B B 6 h 8

Another approach with aggregate:
> aggregate(value~id, dtf[dtf[,'value'] > 5,], min)
id value
1 A 6
2 B 6
This relies on the values being sorted in ascending order: only then is the minimum of the values above the limit also the first value to exceed it.
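To illustrate the difference on unsorted values (a hypothetical vector, not part of the original data):
v <- c(8, 4, 6)  # unsorted values for one id
min(v[v > 5])    # 6 -- the smallest value above the limit
v[v > 5][1]      # 8 -- the first value above the limit in row order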

Might as well: an alternative with plyr and head:
library(plyr)
dtf<-data.frame(id=c("A","A","A","A","B","B","B","B"), value=c(2,4,6,8,4,6,8,10))
limit <- 5
result <- ddply(dtf, "id", function(x) head(x[x$value > limit, ], 1))
> result
id value
1 A 6
2 B 6

This depends on your data.frame being sorted:
threshold <- 5
foo <- dtf[dtf$value > threshold, ]
foo[c(1, which(diff(as.numeric(as.factor(foo$id))) > 0) + 1), ]
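To unpack the indexing, here are the intermediate steps (a small sketch, assuming the foo built above):
codes <- as.numeric(as.factor(foo$id))  # 1 1 2 2 2 -- ids as integer codes
diff(codes)                             # 0 1 0 0   -- nonzero where the id changes
which(diff(codes) > 0) + 1              # 3         -- first row of each later id
Prepending 1 then selects the first remaining row of every id.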

Related

Arranging data.frame's columns based on a reference vector [duplicate]

I have a data.frame that looks like this:
which has 1000+ columns with similar names.
And I have a vector of those column names that looks like this:
The vector is sorted by the cluster_id (which goes up to 11).
I want to sort the columns in the data frame such that the columns are in the order of the names in the vector.
A simple example of what I want:
Data:
A B C
1 2 3
4 5 6
Vector:
c("B","C","A")
Sorted:
B C A
2 3 1
5 6 4
Is there a fast way to do this?
UPDATE, with reproducible data added by OP:
df <- read.table(h=T, text="A B C
1 2 3
4 5 6")
vec <- c("B", "C", "A")
df[vec]
Results in:
B C A
1 2 3 1
2 5 6 4
As OP desires.
How about:
df[df.clust$mutation_id]
Where df is the data.frame you want to sort the columns of and df.clust is the data frame that contains the vector with the column order (mutation_id).
This basically treats df as a list and uses standard vector indexing techniques to re-order it.
Brodie's answer does exactly what you're asking for. However, you imply that your data are large, so I will provide an alternative using "data.table", which has a function called setcolorder that will change the column order by reference.
Here's a reproducible example.
Start with some simple data:
mydf <- data.frame(A = 1:2, B = 3:4, C = 5:6)
matches <- data.frame(X = 1:3, Y = c("C", "A", "B"), Z = 4:6)
mydf
# A B C
# 1 1 3 5
# 2 2 4 6
matches
# X Y Z
# 1 1 C 4
# 2 2 A 5
# 3 3 B 6
Provide proof that Brodie's answer works:
out <- mydf[matches$Y]
out
# C A B
# 1 5 1 3
# 2 6 2 4
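One caution worth adding (my observation, not part of the original answer): in R versions before 4.0, matches$Y is a factor by default, and indexing by a factor uses its integer codes rather than its labels. It happens to work here only because the factor levels coincide with the column order; indexing by characters is safer:
out <- mydf[as.character(matches$Y)]  # robust whether Y is factor or character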
Show a more memory-efficient way to do the same thing:
library(data.table)
setDT(mydf)
mydf
# A B C
# 1: 1 3 5
# 2: 2 4 6
setcolorder(mydf, as.character(matches$Y))
mydf
# C A B
# 1: 5 1 3
# 2: 6 2 4
A5C1D2H2I1M1N2O1R2T1's solution didn't work for my data (I have a similar problem to Yilun Zhang's), so I found another option:
mydf <- data.frame(A = 1:2, B = 3:4, C = 5:6)
# A B C
# 1 1 3 5
# 2 2 4 6
matches <- c("B", "C", "A") #desired order
mydf_reorder <- mydf[, match(matches, colnames(mydf))]
colnames(mydf_reorder)
#[1] "B" "C" "A"
match() finds the positions of the elements of its first argument in its second:
match(matches, colnames(mydf))
#[1] 2 3 1
I hope this can offer another solution if anyone is having problems!

Speed up data.frame rearrangement

I have a data frame with coordinates ("start","end") and labels ("group"):
a <- data.frame(start=1:4, end=3:6, group=c("A","B","C","D"))
a
start end group
1 1 3 A
2 2 4 B
3 3 5 C
4 4 6 D
I want to create a new data frame in which each label is assigned to every element of the integer sequence spanned by its start and end coordinates:
V1 V2
1 1 A
2 2 A
3 3 A
4 2 B
5 3 B
6 4 B
7 3 C
8 4 C
9 5 C
10 4 D
11 5 D
12 6 D
The following code works but it is extremely slow with wide ranges:
df <- data.frame()
for (i in 1:dim(a)[1]) {
  s <- seq(a[i, 1], a[i, 2])
  df <- rbind(df, data.frame(s, rep(a[i, 3], length(s))))
}
colnames(df) <- c("V1", "V2")
How can I speed this up?
You can try data.table:
library(data.table)
setDT(a)[, start:end, by = group]
which gives
group V1
1: A 1
2: A 2
3: A 3
4: B 2
5: B 3
6: B 4
7: C 3
8: C 4
9: C 5
10: D 4
11: D 5
12: D 6
Obviously this would only work if you have one row per group, which it seems you have here.
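If you do have several rows per group, one possible variant (my sketch, not part of the original answer) builds one sequence per row with Map and unlists them within each group:
a[, .(V1 = unlist(Map(seq, start, end))), by = group]  # a is already a data.table here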
If you want a very fast solution in base R, you can manually create the data.frame in two steps:
Use mapply to create a list of your ranges from "start" to "end".
Use rep + lengths to repeat the "groups" column to the expected number of rows.
The base R approach shared here won't depend on having only one row per group.
Try:
temp <- mapply(":", a[["start"]], a[["end"]], SIMPLIFY = FALSE)
data.frame(group = rep(a[["group"]], lengths(temp)),
           values = unlist(temp, use.names = FALSE))
If you're doing this a lot, just put it in a function:
myFun <- function(indf) {
  temp <- mapply(":", indf[["start"]], indf[["end"]], SIMPLIFY = FALSE)
  data.frame(group = rep(indf[["group"]], lengths(temp)),
             values = unlist(temp, use.names = FALSE))
}
Then, if you want some sample data to try it with:
set.seed(1)
a <- data.frame(start=1:4, end=sample(5:10, 4, TRUE), group=c("A","B","C","D"))
x <- do.call(rbind, replicate(1000, a, FALSE))
y <- do.call(rbind, replicate(100, x, FALSE))
Note that this does seem to slow down as the number of unique values in "group" increases.
(In other words, the "data.table" approach will make the most sense in general. I'm just sharing a possible base R alternative that should be considerably faster than your existing approach.)
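If you want to check the speedup yourself, here is a quick, informal timing with the sample objects above (exact numbers will of course vary by machine):
system.time(res_small <- myFun(x))  # ~4,000 input rows
system.time(res_large <- myFun(y))  # ~400,000 input rows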

R: How to sum up (aggregate) values of dfs according to column criteria within a list?

I want to sum up all values from the same country in a list of data frames (per data frame!). Here is some example data:
df1 <- data.frame(CNTRY = c("A", "B", "C"), Value=c(3,1,4))
df2 <- data.frame(CNTRY = c("A", "B", "C", "C"), Value = c(3,5,8,7))
dfList <- list(df1, df2)
names(dfList) <- c("111.2000", "112.2000")
My list consists of data frames (only data frames) with different numbers of rows but the same column structure. The list names are a mixture of article ID and year; there are more than 1000 data frames.
Now my question: how can I sum up or aggregate the country values in all data frames?
My expected result is:
$`111.2000`
CNTRY Value
1 A 3
2 B 1
3 C 4
$`112.2000`
CNTRY Value
1 A 3
2 B 5
3 C 15
I tried aggregate(Value ~ CNTRY, data=dfList, FUN=sum), which gives an error, since CNTRY and Value are not objects but columns within the list's elements. Any ideas? Thanks in advance.
Use lapply() to apply the aggregate() call over dfList.
lapply(dfList, function(x) aggregate(Value ~ CNTRY, x, sum))
# $`111.2000`
# CNTRY Value
# 1 A 3
# 2 B 1
# 3 C 4
#
# $`112.2000`
# CNTRY Value
# 1 A 3
# 2 B 5
# 3 C 15
A list of DFs is better than a bunch of DFs in the wild, but a single DF is even better:
library(data.table)
DF = rbindlist(dfList, idcol="id")
id CNTRY Value
1: 111.2000 A 3
2: 111.2000 B 1
3: 111.2000 C 4
4: 112.2000 A 3
5: 112.2000 B 5
6: 112.2000 C 8
7: 112.2000 C 7
From there, you can aggregate using data.table syntax:
DTres <- DF[, .(Value = sum(Value)), by=.(id, CNTRY)]
id CNTRY Value
1: 111.2000 A 3
2: 111.2000 B 1
3: 111.2000 C 4
4: 112.2000 A 3
5: 112.2000 B 5
6: 112.2000 C 15
From here, you can do things like
dcast(DTres, id ~ CNTRY)
id A B C
1: 111.2000 3 1 4
2: 112.2000 3 5 15
I'm sure there's some way to do this in base R as well, but I'd say don't bother.
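For completeness, a base R sketch of the same combine-then-aggregate idea (assuming dfList from above; xtabs both sums and reshapes to wide in one step):
DF2 <- do.call(rbind, Map(cbind, id = names(dfList), dfList))  # stack with an id column
xtabs(Value ~ id + CNTRY, DF2)                                 # sums of Value by id and CNTRY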

How to group by column?

I have a data frame of student scores. Instead of getting an overall average score for every student, I need the average score by "course type" for every student; for example, courses a, c, d are one type, and courses b and e are another. I do this with the following code, but it is not "R" enough:
x <- data.frame(a=c(1,2,3), b=c(4,5,6), c=c(6,7,8),
                d=c(7,8,9), e=c(10, 11, 12))
group <- data.frame(no=c(1,2,1,1,2), name=c("a", "b", "c", "d","e"))
> x
a b c d e
1 1 4 6 7 10
2 2 5 7 8 11
3 3 6 8 9 12
> group
no name
1 1 a
2 2 b
3 1 c
4 1 d
5 2 e
I think this is somewhat stupid:
x.1 <- x[,as.character(group$name[group$no==1])]
x.2 <- x[,as.character(group$name[group$no==2])]
mean.by.no <- data.frame(x.1.mean=apply(x.1, 1, mean),
                         x.2.mean=apply(x.2, 1, mean))
If mean.by.no is the expected result, we could split the 'name' column by 'no' (from the 'group' dataset) to get a list. Using one of the apply family functions (lapply/sapply/vapply), we can use that output as a column index into 'x' and get the mean of each row (rowMeans).
vapply(with(group, split(as.character(name), no)),
       function(y) rowMeans(x[y]), numeric(nrow(x)))
# 1 2
#[1,] 4.666667 7
#[2,] 5.666667 8
#[3,] 6.666667 9
Or, using tapply, we can get the mean using a grouping index for rows and columns.
indx <- xtabs(no~name, group)[col(x)]
t(tapply(as.matrix(x), list(indx, row(x)), FUN=mean))
# 1 2
#1 4.666667 7
#2 5.666667 8
#3 6.666667 9
Or another option would be to convert 'x' from 'wide' to 'long' format using melt from data.table, after converting the 'data.frame' to a 'data.table' (setDT). Set the key column to 'name' (setkey(...)) and get the mean grouped by 'no' and 'rn' (the row-number column created by keep.rownames=TRUE). If needed, the output can be converted back to 'wide' format using dcast.
library(data.table)#v1.9.5+
dL <- setkey(melt(setDT(x, keep.rownames=TRUE), id.var='rn',
                  variable.name='name')[, name := as.character(name)],
             name)[group[2:1]][, mean(value), by = list(no, rn)]
dcast(dL, rn ~ paste0('mean', no), value.var='V1')[, rn := NULL][]
# mean1 mean2
#1: 4.666667 7
#2: 5.666667 8
#3: 6.666667 9
There's probably a more elegant way of doing this, but:
library(reshape)
library(plyr)
x <- data.frame(a=c(1,2,3), b=c(4,5,6), c=c(6,7,8), d=c(7,8,9), e=c(10, 11, 12))
group <- data.frame(no=c(1,2,1,1,2), name=c("a", "b", "c", "d","e"))
a <- melt(x)
names(a) <- c("name", "score")
b <- merge(a, group, by = "name")
# note: "c" shadows base::c() here; a different name would be safer
c <- ddply(b, c("no"), summarize, meanscore = mean(score))
> c
no meanscore
1 1 5.666667
2 2 8.000000
