ifelse in data.table aggregate result - r

I have a question that I really need your help with:
set.seed(1111)
s<-rep(seq(1,4),5)
a<-runif(20,0.2,0.6)
b<-runif(20,0.4,0.7)
b[6:8]<-NA
c<-runif(20,4,7)
d<-data.table(s,a,b,c)
setkey(d,s)
The data looks like this:
s a b c
1: 1 0.3862011 0.4493240 6.793058
2: 1 0.4955267 0.4187441 4.708561
3: 1 0.4185155 0.5916827 6.810053
4: 1 0.5003833 0.5403744 5.948629
5: 1 0.5667312 0.5634135 6.880848
6: 2 0.3651699 0.5263655 5.721908
7: 2 0.5905308 NA 6.863213
8: 2 0.2560464 0.4649180 5.745656
9: 2 0.4533625 0.5077432 5.958526
10: 2 0.4228027 0.4340407 5.115065
11: 3 0.5628013 0.6517352 6.252962
12: 3 0.5519840 NA 4.875669
13: 3 0.2006761 0.6418540 5.452210
14: 3 0.5472671 0.4503713 6.962282
15: 3 0.5601675 0.5195013 6.666593
16: 4 0.2548422 0.6962112 5.535579
17: 4 0.2467137 NA 6.680080
18: 4 0.4995830 0.6793684 6.334579
19: 4 0.2637452 0.4078512 6.076039
20: 4 0.5063548 0.4055017 5.287291
If I do a simple sum, using s as the key, it returns a nice table summarizing the result:
d[,sum(c),by=s]
s V1
1: 1 31.14115
2: 2 29.40437
3: 3 30.20972
4: 4 29.91357
However, if my data.table command contains an ifelse statement, I do not get a similar table:
d2<-d[,ifelse(a<b,"NA",sum(c)),by=s]
d2
s V1
1: 1 NA
2: 1 31.1411493057385
3: 1 NA
4: 1 NA
5: 1 31.1411493057385
6: 2 NA
7: 2 NA
8: 2 NA
9: 2 NA
10: 2 NA
11: 3 NA
12: 3 NA
13: 3 NA
14: 3 30.2097161230631
15: 3 30.2097161230631
16: 4 NA
17: 4 NA
18: 4 NA
19: 4 NA
20: 4 29.9135677714366
Is it possible for the ifelse statement to return a result like the simple sum table, i.e. one row per key holding the unique non-NA value?
Thanks a lot!

I am not entirely certain what you are looking for, but I think you just want to use the a<b condition as the row selector in your data.table, which is done by using it as the first argument in the brackets:
> d[a<b, sum(c), by = s]
s V1
1: 1 19.6
2: 2 22.5
3: 3 11.7
4: 4 17.9
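Note that the rows where b is NA simply drop out here: unlike data.frame subsetting, data.table treats NA in a logical i as FALSE. A minimal illustration (a sketch, not part of the original answer):
library(data.table)
x <- data.table(v = 1:3)
x[c(TRUE, NA, FALSE)]  # one row: the NA position is dropped, not returned as an NA row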

The plyr equivalent filters the rows first and then summarizes by s:
library(plyr)
ddply(d[a<b], .(s), summarize, tot=sum(c))
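The same filter-then-aggregate pattern in dplyr, plyr's successor (a sketch, assuming dplyr is installed):
library(dplyr)
d %>% filter(a < b) %>% group_by(s) %>% summarise(tot = sum(c))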

There is a simple and fast solution based on a conditional sum using which():
d[, .( sum_c = sum(c[which( a < b)]) ), by=s]
# s sum_c
# 1: 1 19.552
# 2: 2 22.541
# 3: 3 11.705
# 4: 4 17.946
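The which() wrapper matters here because b contains NAs: subsetting with a logical vector that contains NA produces NA elements, while which() keeps only the TRUE positions. A quick base-R check:
v <- c(1, 2, 3)
cond <- c(TRUE, NA, FALSE)
sum(v[cond])         # NA, because v[NA] contributes an NA element
sum(v[which(cond)])  # 1, which(cond) returns only index 1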
The advantage of this structure over the other answers presented so far is that it allows you to calculate different aggregations in the same call using different conditions, for example:
d[, .( sum_c = sum(c[which( a < b)]),
sum_a = sum(c[which( c < 6)]) ), by=s]
# s sum_c sum_a
# 1: 1 19.552 10.657
# 2: 2 22.541 22.541
# 3: 3 11.705 10.328
# 4: 4 17.946 10.823
There is a benchmark of the speed of this solution compared to other approaches in a similar question, here.

Related

For each value of one column, find which is the last value of another vector that is lower

Finding the last position of a vector that is less than a given value is fairly straightforward (see e.g. this question).
But doing this row by row for a column in a data.frame or data.table is horribly slow. For example, we can do it like this (which is OK on small data, but not good on big data):
library(data.table)
set.seed(123)
x = sort(sample(20,5))
# [1] 6 8 15 16 17
y = data.table(V1 = 1:20)
y[, last.x := tail(which(x <= V1), 1), by = 1:nrow(y)]
# V1 last.x
# 1: 1 NA
# 2: 2 NA
# 3: 3 NA
# 4: 4 NA
# 5: 5 NA
# 6: 6 1
# 7: 7 1
# 8: 8 2
# 9: 9 2
# 10: 10 2
# 11: 11 2
# 12: 12 2
# 13: 13 2
# 14: 14 2
# 15: 15 3
# 16: 16 4
# 17: 17 5
# 18: 18 5
# 19: 19 5
# 20: 20 5
Is there a fast, vectorised way to get the same thing? Preferably using data.table or base R.
You may use findInterval:
y[ , last.x := findInterval(V1, x)]
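One difference from the row-by-row version: findInterval returns 0, not NA, for values below the first breakpoint. If you want the NAs back, a small follow-up assignment fixes that:
y[last.x == 0, last.x := NA]  # values below min(x) have no preceding element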
Slightly more convoluted using cut. But on the other hand, you get the NAs right away:
y[ , last.x := as.numeric(cut(V1, c(x, Inf), right = FALSE))]
Pretty simple in base R (note this relies on y being sorted and containing every value of x):
x<-c(6L, 8L, 15L, 16L, 17L)
y<-1:20
cumsum(y %in% x)
[1] 0 0 0 0 0 1 1 2 2 2 2 2 2 2 3 4 5 5 5 5

Data.table selecting columns by name, e.g. using grepl

Say I have the following data.table:
dt <- data.table("x1"=c(1:10), "x2"=c(1:10),"y1"=c(10:1),"y2"=c(10:1), desc = c("a","a","a","b","b","b","b","b","c","c"))
I want to sum the columns starting with an 'x', and sum the columns starting with a 'y', by desc. At the moment I do this by:
dt[,.(Sumx=sum(x1,x2), Sumy=sum(y1,y2)), by=desc]
which works, but I would like to refer to all columns with "x" or "y" by their column names, e.g. using grepl().
Please could you advise me how to do so? I think I need to use with=FALSE, but I cannot get it to work in combination with by=desc.
One-liner:
melt(dt, id="desc", measure.vars=patterns("^x", "^y"), value.name=c("x","y"))[,
  lapply(.SD, sum), by=desc, .SDcols=x:y]
Long version (by #Frank):
First, you probably don't want to store your data like that. Instead...
m = melt(dt, id="desc", measure.vars=patterns("^x", "^y"), value.name=c("x","y"))
desc variable x y
1: a 1 1 10
2: a 1 2 9
3: a 1 3 8
4: b 1 4 7
5: b 1 5 6
6: b 1 6 5
7: b 1 7 4
8: b 1 8 3
9: c 1 9 2
10: c 1 10 1
11: a 2 1 10
12: a 2 2 9
13: a 2 3 8
14: b 2 4 7
15: b 2 5 6
16: b 2 6 5
17: b 2 7 4
18: b 2 8 3
19: c 2 9 2
20: c 2 10 1
Then you can do...
setnames(m[, lapply(.SD, sum), by=desc, .SDcols=x:y], 2:3, paste0("Sum", c("x", "y")))[]
# desc Sumx Sumy
#1: a 12 54
#2: b 60 50
#3: c 38 6
For more on improving the data structure you're working with, read about tidying data.
Using mget with grep is an option: grep("^x", ...) returns the column names starting with x, mget gets the column data, unlist flattens the result, and then you can calculate the sum:
dt[,.(Sumx=sum(unlist(mget(grep("^x", names(dt), value = T)))),
Sumy=sum(unlist(mget(grep("^y", names(dt), value = T))))), by=desc]
# desc Sumx Sumy
#1: a 12 54
#2: b 60 50
#3: c 38 6
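A closely related sketch (assuming the column names are collected up front with grep) subsets .SD instead of calling mget:
xcols <- grep("^x", names(dt), value = TRUE)
ycols <- grep("^y", names(dt), value = TRUE)
dt[, .(Sumx = sum(unlist(.SD[, xcols, with = FALSE])),
       Sumy = sum(unlist(.SD[, ycols, with = FALSE]))), by = desc]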

How to replace NAs of a variable with values from another dataframe

I hope this one isn't stupid.
I have two data frames with the variables ID and gender/sex. In df1 there are NAs; in df2 the variable is complete. I want to complete the column in df1 with the values from df2.
(In df1 the variable is called "gender". In df2 it is called "sex".)
Here is what I tried so far:
#example-data
ID<-seq(1,30,by=1)
df1<-as.data.frame(ID)
df2<-df1
df1$gender<-c(NA,"2","1",NA,"2","2","2","2","2","2",NA,"2","1","1",NA,"2","2","2","2","2","1","2","2",NA,"2","2","2","2","2",NA)
df2$sex<-c("2","2","1","2","2","2","2","2","2","2","2","2","1","1","2","2","2","2","2","2","1","2","2","2","2","2","2","2","2","2")
#Approach 1:
NAs.a <- is.na(df1$gender)
df1$gender[NAs.a] <- df2[match(df1$ID[NAs.a], df2$ID),]$sex
#Approach 2 (I like dplyr a lot, perhaps there's a way to use it):
library("dplyr")
temp<-df2 %>% select(ID,gender)
#EDIT:
#df<-left_join(df1$gender,df2$gender, by="ID")
df<-left_join(df1,df2, by="ID")
Thank you very much.
Here's a quick solution using data.table's binary join; this updates only gender from sex and leaves all the rest of the columns untouched:
library(data.table)
setkey(setDT(df1), ID)
df1[df2, gender := i.sex][]
# ID gender
# 1: 1 2
# 2: 2 2
# 3: 3 1
# 4: 4 2
# 5: 5 2
# 6: 6 2
# 7: 7 2
# 8: 8 2
# 9: 9 2
# 10: 10 2
# 11: 11 2
# 12: 12 2
# 13: 13 1
# 14: 14 1
# 15: 15 2
# 16: 16 2
# 17: 17 2
# 18: 18 2
# 19: 19 2
# 20: 20 2
# 21: 21 1
# 22: 22 2
# 23: 23 2
# 24: 24 2
# 25: 25 2
# 26: 26 2
# 27: 27 2
# 28: 28 2
# 29: 29 2
# 30: 30 2
This would probably be the simplest with base R (note it relies on the rows of df1 and df2 being aligned by ID, as they are in the example):
idx <- is.na(df1$gender)
df1$gender[idx] = df2$sex[idx]
You could do
df1 %>% select(ID) %>% left_join(df2, by = "ID")
# ID sex
#1 1 2
#2 2 2
#3 3 1
#4 4 2
#5 5 2
#6 6 2
#.. ..
This assumes, as in the example, that all IDs from df1 are also present in df2 and have sex/gender information there.
If you have other columns in your data you could also try this instead:
df1 %>% select(-gender) %>% left_join(df2[c("ID", "sex")], by = "ID")
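If df1 has other columns you want to keep as well, a dplyr sketch using coalesce() fills only the gaps in gender (coalesce takes the first non-NA value position-wise):
library(dplyr)
df1 %>%
  left_join(df2[c("ID", "sex")], by = "ID") %>%
  mutate(gender = coalesce(gender, sex)) %>%
  select(-sex)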

How to apply a function with if statement in ddply or any kind of apply()?

First let's generate some sample data and load the plyr and data.table packages:
library("plyr", lib.loc="~/R/win-library/3.1")
library("data.table", lib.loc="~/R/win-library/3.1")
x<-seq(1:12)
y<-rep(seq(1:4),3)
z<-c(rep("a",6),rep("b",6))
t<-rep(seq(2005,length.out=6),2)
df<-data.table(t,x,y,z)
setkey(df,z,t)
This will yield the following table:
t x y z
1: 2005 1 1 a
2: 2006 2 2 a
3: 2007 3 3 a
4: 2008 4 4 a
5: 2009 5 1 a
6: 2010 6 2 a
7: 2005 7 3 b
8: 2006 8 4 b
9: 2007 9 1 b
10: 2008 10 2 b
11: 2009 11 3 b
12: 2010 12 4 b
Now the job is: separate this data.frame into two small data sets according to z. In each set, if y > lag(y,k=1) (i.e. y > previous y), then apply the function i=y/lag(y,k=1); otherwise, apply i=-y/lag(y,k=1).
The approach I tried is following:
#####define a function f
f<-function(x,y)
{ if (y>lag(y,k=1)) {i<-y/lag(y,k=1)}
else{i<--y/lag(y,k=1)}
return (i)
}
#######using ddply to apply function to subset
v<-ddply(df,.(z),summarize,i=f(x,y))
However, this returns error messages saying:
Error in attributes(column) <- a :
invalid time series parameters specified
In addition: Warning messages:
1: In if (y > lag(y, k = 1)) { :
the condition has length > 1 and only the first element will be used
2: In if (y > lag(y, k = 1)) { :
the condition has length > 1 and only the first element will be used
I think I made some mistake in the coding, and more importantly, it seems that my if statement isn't being applied element by element inside the function. Does anyone have an idea how to correct this problem?
Thank you very much for your help in advance!
Update:
The desired result is something like this:
t x y z i
1: 2005 1 1 a NA
2: 2006 2 2 a 2.000000
3: 2007 3 3 a 1.500000
4: 2008 4 4 a 1.333333
5: 2009 5 1 a -0.250000
6: 2010 6 2 a 2.000000
1: 2005 7 3 b NA
2: 2006 8 4 b 1.333333
3: 2007 9 1 b -0.250000
4: 2008 10 2 b 2.000000
5: 2009 11 3 b 1.500000
6: 2010 12 4 b 1.333333
Thanks again!
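For completeness, a data.table sketch that produces the desired column: the errors come from stats::lag, which expects a time series, and from if(), which only inspects the first element of a vector condition. ifelse() is the vectorised alternative, and shift() (data.table 1.9.6+) gives the previous value within each group:
df[, i := ifelse(y > shift(y), y / shift(y), -y / shift(y)), by = z]
The first row of each group gets NA automatically, matching the desired result above.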

update data.table subset with function

I have a data.table
dt2 <- data.table(urn=1:10,freq=0, freqband="")
dt2$freqband = NA
dt2$freq <- 1:7 #does give a warning message
## urn freq freqband
## 1: 1 1 NA
## 2: 2 2 NA
## 3: 3 3 NA
## 4: 4 4 NA
## 5: 5 5 NA
## 6: 6 6 NA
## 7: 7 7 NA
## 8: 8 1 NA
## 9: 9 2 NA
##10: 10 3 NA
I also have a function that I want to use to group my freq column:
fn_GetFrequency <- function(numgifts) {
if (numgifts <5) return("<5")
if (numgifts >=5) return("5+")
return("ERROR")
}
I want to set the freqband column based on this function. In some cases it will apply to all records, in others to a subset. My current approach (for a subset) is:
dt2[dt2$urn < 9, freqband := fn_GetFrequency(freq)]
using this approach I get the warning:
Warning message:
In if (numgifts < 5) return("<5") :
the condition has length > 1 and only the first element will be used
Then it sets all the records to "<5" rather than the correct value. I figure I need some sort of lapply/sapply/etc. function, but I haven't been able to grasp how they work well enough to use them here.
Any help would be greatly appreciated.
EDIT: How might you do this if you use a function that requires 2 parameters?
UPDATED to include the output of dt2 after my attempted update:
urn freq freqband
1: 1 1 <5
2: 2 2 <5
3: 3 3 <5
4: 4 4 <5
5: 5 5 <5
6: 6 6 <5
7: 7 7 <5
8: 8 1 <5
9: 9 2 NA
10: 10 3 NA
UPDATE: I tried the following too, and it delivered the desired output while letting me keep a function I can call in other places in the code:
dt2[dt2$urn < 9, freqband := sapply(freq, fn_GetFrequency)]
A vectorised fix is to write the function with ifelse(), which operates on the whole column at once:
> fn_GetFrequency <- function(numgifts) {
+ ifelse (numgifts <5, "<5", "5+")
+ }
> dt2[dt2$urn < 9, freqband := fn_GetFrequency(freq)]
> dt2
urn freq freqband
1: 1 1 <5
2: 2 2 <5
3: 3 3 <5
4: 4 4 <5
5: 5 5 5+
6: 6 6 5+
7: 7 7 5+
8: 8 1 <5
9: 9 2 NA
10: 10 3 NA
For multiple bands (which I'm sure has been asked before) you should use the findInterval function. And I'm doing it the data.table way rather than the data.frame way:
dt2[ urn==8, freq := -1 ] # and something to test the <0 condition
dt2[ urn <= 8, freqband := c("ERROR", "<5", "5+")[
findInterval(freq,c(-Inf, 0, 5 ,Inf))] ]
dt2
urn freq freqband
1: 1 1 <5
2: 2 2 <5
3: 3 3 <5
4: 4 4 <5
5: 5 5 5+
6: 6 6 5+
7: 7 7 5+
8: 8 -1 ERROR
9: 9 2 NA
10: 10 3 NA
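For the two-parameter follow-up in the EDIT: the same sapply idea generalises to mapply(), which walks two (or more) argument vectors in parallel. A sketch with a hypothetical two-argument band function:
fn_band <- function(numgifts, cutoff) {
  if (numgifts < cutoff) return("<5")
  "5+"
}
dt2[urn < 9, freqband := mapply(fn_band, freq, 5)]  # the scalar cutoff is recycled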
