Data.table: operation with group-shifted data

Consider the following data.table:
library(data.table)
DT <- data.table(year = c(2011, 2012, 2013, 2011, 2012, 2013, 2011, 2012, 2013),
                 level = c(137, 137, 137, 136, 136, 136, 135, 135, 135),
                 valueIn = c(13, 30, 56, 11, 25, 60, 8, 27, 51))
I would like to have the following output:
DT <- data.table(year = c(2011, 2012, 2013, 2011, 2012, 2013, 2011, 2012, 2013),
                 level = c(137, 137, 137, 136, 136, 136, 135, 135, 135),
                 valueIn = c(13, 30, 56, 11, 25, 60, 8, 27, 51),
                 valueOut = c(12, 27.5, 58, 9.5, 26, 55.5, NA, NA, NA))
In other words, within each year I want to average each value with the value at the next lower level: (valueIn[level] + valueIn[level - 1]) / 2. For example, the first value is calculated like this: (13 + 11)/2 = 12.
For the moment, I do that with a for loop in which I create a data.table subset for each level:
levelDtList <- list()
levels <- sort(unique(DT$level))  # each level once, ascending
for (this.level in levels) {
  levelDt <- DT[level == this.level]
  if (this.level == min(levels)) {
    valueOut <- NA
  } else {
    levelM1Data <- levelDtList[[this.level - 1]]
    valueOut <- (levelDt$valueIn + levelM1Data$valueIn) / 2
  }
  levelDt$valueOut <- valueOut
  levelDtList[[this.level]] <- levelDt
}
datatable <- rbindlist(levelDtList)  # NULL list slots are skipped
This is ugly and quite slow, so I am looking for a better, faster, data.table-based solution.

Use the shift function with type = 'lead' to get the next value, then sum and divide by two:
DT[, valueOut := (valueIn + shift(valueIn, type = 'lead'))/2, by = year]
you get:
   year level valueIn valueOut
1: 2011   137      13     12.0
2: 2012   137      30     27.5
3: 2013   137      56     58.0
4: 2011   136      11      9.5
5: 2012   136      25     26.0
6: 2013   136      60     55.5
7: 2011   135       8       NA
8: 2012   135      27       NA
9: 2013   135      51       NA
With all the parameters of the shift function specified:
DT[, valueOut := (valueIn + shift(valueIn, n = 1L, fill = NA, type = 'lead'))/2, by = year]

We can also use shift with Reduce:
DT[, valueOut := Reduce(`+`, shift(valueIn, type = "lead", 0:1))/2, by = year]
DT
#   year level valueIn valueOut
#1: 2011   137      13     12.0
#2: 2012   137      30     27.5
#3: 2013   137      56     58.0
#4: 2011   136      11      9.5
#5: 2012   136      25     26.0
#6: 2013   136      60     55.5
#7: 2011   135       8       NA
#8: 2012   135      27       NA
#9: 2013   135      51       NA
It is also easier to generalize, as shift can take a vector of n values.
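For instance, a sketch (the valueOutNext2 column name is made up for illustration) that averages each value with the next two within a year:
# pass a vector of n values to shift() to get several leads at once
DT[, valueOutNext2 := Reduce(`+`, shift(valueIn, n = 0:2, type = "lead"))/3, by = year]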

If you don't mind using dplyr, the year is the thing that relates your items, and the structure shown is representative of reality, then this could work for you:
library(dplyr)
DT %>% group_by(year) %>% mutate(valueOut = (valueIn + lead(valueIn)) / 2)


Summing a dataframe based on another dataframe

I have daily rainfall data from 10 locations across 10 years:
set.seed(123)
df <- data.frame(loc.id = rep(1:10, each = 10*365),
                 years = rep(rep(2001:2010, each = 365), times = 10),
                 day = rep(rep(1:365, times = 10), times = 10),
                 rain = runif(10*10*365, min = 0, max = 35))
I have a separate data frame with index days that I want to use to sum the rainfall in df:
df.ref <- data.frame(loc.id = rep(1:10, each = 10),
                     years = rep(2001:2010, times = 10),
                     index1 = rep(250, times = 10*10),
                     index2 = sample(260:270, size = 10*10, replace = T),
                     index3 = sample(280:290, size = 10*10, replace = T),
                     index4 = sample(291:300, size = 10*10, replace = T))
df.ref
   loc.id years index1 index2 index3 index4
1:      1  2001    250    264    280    296
2:      1  2002    250    269    284    298
3:      1  2003    250    268    289    293
4:      1  2004    250    266    281    295
5:      1  2005    250    260    289    293
What I want to do is, for each row in df.ref, use the index values to sum the rainfall in df from index1 to index2, from index1 to index3, and from index1 to index4. For example:
using df.ref, for loc.id == 1 and years == 2001, sum the rainfall in df from day 250 to 264, from 250 to 280, and from 250 to 296 (as shown in df.ref);
similarly, for loc.id == 1 and years == 2002, sum the rainfall from day 250 to 269, from 250 to 284, and from 250 to 298.
I did this:
library(dplyr)
ptm <- proc.time()
dat <- df.ref %>% left_join(df)
index1.cal <- dat %>% group_by(loc.id, years) %>%
  filter(day >= index1 & day <= index2) %>% summarise(sum.rain1 = sum(rain))
index2.cal <- dat %>% group_by(loc.id, years) %>%
  filter(day >= index1 & day <= index3) %>% summarise(sum.rain2 = sum(rain))
index3.cal <- dat %>% group_by(loc.id, years) %>%
  filter(day >= index1 & day <= index4) %>% summarise(sum.rain3 = sum(rain))
all.index <- index1.cal %>% left_join(index2.cal) %>% left_join(index3.cal)
proc.time() - ptm
   user  system elapsed
   2.36    0.64    3.06
I am looking to make my code faster, since my actual df.ref is quite large. Could anyone advise me how to make this quicker?
A non-equi join from the data.table package can be both faster and more memory-efficient than dplyr::left_join.
For each row in df.ref, find all the rain values in df that have day between index1 and index2, then sum rain by loc.id and years.
library(data.table)
setDT(df); setDT(df.ref)
df1 <- unique(df[df.ref
                 , .(rain)
                 , on = .(loc.id, years, day >= index1, day <= index2)
                 , by = .EACHI][
                 , sum_1 := sum(rain), by = .(loc.id, years)][
                 # remove all redundant columns (both non-equi join columns are named day)
                 , day := NULL][
                 , day := NULL][
                 , rain := NULL])
df2 <- unique(df[df.ref
                 , .(rain)
                 , on = .(loc.id, years, day >= index1, day <= index3)
                 , by = .EACHI][
                 , sum_2 := sum(rain), by = .(loc.id, years)][
                 , day := NULL][
                 , day := NULL][
                 , rain := NULL])
df3 <- unique(df[df.ref
                 , .(rain)
                 , on = .(loc.id, years, day >= index1, day <= index4)
                 , by = .EACHI][
                 , sum_3 := sum(rain), by = .(loc.id, years)][
                 , day := NULL][
                 , day := NULL][
                 , rain := NULL])
Merge all three data.tables together
df1[df2, on = .(loc.id, years)][
df3, on = .(loc.id, years)]
    loc.id years     sum_1    sum_2    sum_3
 1:      1  1950 104159.11 222345.4 271587.1
 2:      1  1951 118689.90 257450.2 347624.3
 3:      1  1952  99262.27 212923.7 280877.6
 4:      1  1953  72435.50 192072.7 251593.6
 5:      1  1954 104021.19 242525.3 326463.4
 6:      1  1955  93436.32 232653.1 304921.4
 7:      1  1956  89122.79 190424.4 255535.0
 8:      1  1957 135658.11 262918.7 346361.4
 9:      1  1958  80064.18 220454.8 292966.4
10:      1  1959 114231.19 273181.0 349489.2
11:      2  1950  94360.69 238296.8 301751.8
12:      2  1951  93845.50 195273.7 289686.0
13:      2  1952 107692.53 245019.4 308093.7
14:      2  1953  86650.14 257225.1 332674.1
15:      2  1954 104085.83 238859.4 286350.7
16:      2  1955 101602.16 223107.3 300958.4
17:      2  1956  73912.77 198087.2 276590.1
18:      2  1957 117780.86 228299.8 305348.5
19:      2  1958  98625.45 220902.6 291583.7
20:      2  1959 109851.38 266745.2 324246.8
[ reached getOption("max.print") -- omitted 81 rows ]
Compare processing time and memory used
> time_dplyr; time_datatable
   user  system elapsed
   2.17    0.27    2.61
   user  system elapsed
   0.45    0.00    0.69

  rowname      Class  MB
1     dat data.frame 508
2     df3 data.table  26
3     df2 data.table  20
4     df1 data.table   9
When testing for about 100 years of data, dplyr used more than 50 GB of memory while data.table consumed only 5 GB. dplyr also took about 4 times longer to finish.
For the ~100-year test, df and df.ref had the following structure:
'data.frame': 3650000 obs. of 4 variables:
$ loc.id: int 1 1 1 1 1 1 1 1 1 1 ...
$ years : int 1860 1860 1860 1860 1860 1860 1860 1860 1860 1860 ...
$ day : int 1 2 3 4 5 6 7 8 9 10 ...
$ rain : num 10.1 27.6 14.3 30.9 32.9 ...
'data.frame': 3650000 obs. of 6 variables:
$ loc.id: int 1 1 1 1 1 1 1 1 1 1 ...
$ years : int 1860 1861 1862 1863 1864 1865 1866 1867 1868 1869 ...
$ index1: num 250 250 250 250 250 250 250 250 250 250 ...
$ index2: int 270 265 262 267 266 265 262 268 260 268 ...
$ index3: int 290 287 286 289 281 285 286 285 284 283 ...
$ index4: int 298 297 296 295 298 294 296 298 298 300 ...
> time_dplyr; time_datatable
   user  system elapsed
 95.010  33.704 128.722
   user  system elapsed
 26.175   3.147  29.312

  rowname      Class    MB
1     dat data.frame 50821
2     df3 data.table  2588
3     df2 data.table  2004
4     df1 data.table   888
5  df.ref data.table    97
6      df data.table    70
When I increased the number of years to 150, dplyr failed even on an HPC cluster node with 256 GB of RAM:
Error in left_join_impl(x, y, by_x, by_y, aux_x, aux_y, na_matches) :
negative length vectors are not allowed
Calls: %>% ... left_join -> left_join.tbl_df -> left_join_impl -> .Call
Execution halted
Here's a starting point that will be much faster; figuring out the rest should be trivial.
library(data.table)
setDT(df)
df[df.ref, on = .(loc.id, years, day >= index1, day <= index2), sum(rain), by = .EACHI]
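A sketch (my own extension, not from the answer above) that loops over the three upper bounds using the character form of on; since by = .EACHI returns one row per row of df.ref in the same order, each per-row sum can be assigned straight back:
setDT(df.ref)
for (i in 1:3) {
  upper <- paste0("index", i + 1)  # index2, index3, index4
  res <- df[df.ref,
            on = c("loc.id", "years", "day>=index1", paste0("day<=", upper)),
            sum(rain), by = .EACHI]
  df.ref[, paste0("sum_", i) := res$V1]
}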

reshape wide to long using data.table with multiple columns

I have a dataframe in a wide format, like below. I want to reshape it from wide to long using data.table's melt function. In a simple case I could split the data in two and rbind the two datasets, but in my case there are multiple test(i)/test(i)gr column pairs, so there must be a better and more efficient way to do this. Thanks in advance.
from =>
id <- c("106E1258", "106E2037", "104E1182", "105E1248", "105E1470", "10241247", "10241703")
yr <- c(2017, 2017, 2015, 2016, 2016, 2013, 2013)
finalgr <- c(72, 76, 75, 71, 75, 77, 78)
test01 <- c("R0560", "R0066", "R0308", "R0129", "R0354", "R0483", "R0503")
test01gr <- c(73, 74, 67, 80, 64, 80, 70)
test02 <- c("R0660", "R0266", "R0302", "R0139", "R0324", "R0383", "R0503")
test02gr <- c(71, 54, 67, 70, 68, 81, 61)
dt <- data.frame(id = id, yr = yr,
                 finalgr = finalgr,
                 test01 = test01, test01gr = test01gr,
                 test02 = test02, test02gr = test02gr)
to=>
id2 <- c("106E1258", "106E1258", "104E1182", "104E1182")
yr2 <- c(2017, 2017, 2015, 2015)
finalgr <- c(72, 72, 75, 75)
testid <- c("R0560", "R0660", "R0308", "R0302")
testgr <- c(73, 71, 67, 67)
dt2 <- data.frame(id = id2, yr = yr2, finalgr = finalgr, testid = testid, testgr = testgr)
You indeed should use melt:
library(data.table)
setDT(dt)
melt(dt, id.vars = c('id', 'yr', 'finalgr'),
     measure.vars = list(testid = c('test01', 'test02'),
                         testgr = c('test01gr', 'test02gr')))
#           id   yr finalgr variable testid testgr
#  1: 106E1258 2017      72        1  R0560     73
#  2: 106E2037 2017      76        1  R0066     74
#  3: 104E1182 2015      75        1  R0308     67
#  4: 105E1248 2016      71        1  R0129     80
#  5: 105E1470 2016      75        1  R0354     64
#  6: 10241247 2013      77        1  R0483     80
#  7: 10241703 2013      78        1  R0503     70
#  8: 106E1258 2017      72        2  R0660     71
#  9: 106E2037 2017      76        2  R0266     54
# 10: 104E1182 2015      75        2  R0302     67
# 11: 105E1248 2016      71        2  R0139     70
# 12: 105E1470 2016      75        2  R0324     68
# 13: 10241247 2013      77        2  R0383     81
# 14: 10241703 2013      78        2  R0503     61
If there are many more test columns, you can use patterns:
melt(dt, id.vars = c('id', 'yr', 'finalgr'),
     measure.vars = patterns(testid = 'test[0-9]+$', testgr = 'test[0-9]+gr'))
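With the list or patterns form of measure.vars, the variable column holds the pair index as a factor; if you want a plain integer test number instead, a small follow-up (sketch):
out <- melt(dt, id.vars = c('id', 'yr', 'finalgr'),
            measure.vars = patterns(testid = 'test[0-9]+$', testgr = 'test[0-9]+gr'))
out[, variable := as.integer(variable)]  # 1 for test01, 2 for test02, ...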

Combine data from previous week with current week in R

Hi, I have the dataframe given below:
s.no  week number  Income
   1            6      52
   2            7      74
   3            8      13
   4            9      60
I need a new column Total_Income added to this dataframe, computed as:
100% of Income in the current week +
60% of Income in the previous week +
30% of Income in the week before that
The output should look like:
s.no  week number  Income  Total_Income
   1            6      52  52
   2            7      74  74 + 0.6(52) = 105.2
   3            8      13  13 + 0.6(74) + 0.3(52) = 73
   4            9      60  60 + 0.6(13) + 0.3(74) = 90
Kindly help
This assumes that the dataframe data is ordered by week number; if not, begin with data %>% arrange(week_number) (note the column is renamed to remove the space).
library(dplyr)
data %>%
  mutate(Total_Income = Income + 0.6 * lag(Income, default = 0) +
           0.3 * lag(Income, n = 2, default = 0))
A base R option with sapply: for every value in week_number, find the Income for that week along with the previous two weeks, and sum them with the appropriate weights.
with(df, sapply(week_number, function(x) {
  sum(Income[week_number == x],
      0.6 * Income[week_number == x - 1],
      0.3 * Income[week_number == x - 2])
}))
#[1]  52.0 105.2  73.0  90.0
We can use data.table and do this in a single line
library(data.table)
setDT(d)[, Total_Income := Reduce(`+`, Map(`*`, shift(Income, n = 0:2, fill = 0), c(1, 0.6, 0.3)))]
d
#   s.no weeknumber Income Total_Income
#1:    1          6     52         52.0
#2:    2          7     74        105.2
#3:    3          8     13         73.0
#4:    4          9     60         90.0
Or we can do a cross product
c(crossprod(do.call(rbind, shift(d$Income, n = 0:2, fill = 0)), c(1, 0.6, 0.3)))
#[1] 52.0 105.2 73.0 90.0
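Another base R sketch uses stats::filter, which computes exactly this kind of one-sided weighted sum; padding the series with two zeros keeps the first weeks from being NA:
x <- c(0, 0, d$Income)  # zero-pad so weeks 1 and 2 have no missing lags
as.numeric(tail(stats::filter(x, c(1, 0.6, 0.3), sides = 1), -2))
#[1]  52.0 105.2  73.0  90.0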

reshape wide into long while splitting

I am looking for reshaping:
  ID p2012 p2010 p2008 p2006 c2012 c2010 c2008 c2006
1  1   160   162   163   165  37.3  37.3  37.1  37.1
2  2   163   164   164   163   2.6   2.6   2.6   2.6
into:
  ID year   p    c
1  1 2006 165 37.1
2  1 2008 163 37.1
3  1 2010 162 37.3
4  1 2012 160 37.3
5  2 2006 163  2.6
6  2 2008 164  2.6
7  2 2010 164  2.6
8  2 2012 163  2.6
I am new to R and have been trying around with the melt and dcast functions, but there are just too many twists for me at this stage. Help would be much appreciated!
A dput of my df:
structure(list(ID = 1:2, p2012 = c(160L, 163L), p2010 = c(162L, 164L), p2008 = 163:164, p2006 = c(165L, 163L), c2012 = c(37.3, 2.6), c2010 = c(37.3, 2.6), c2008 = c(37.1, 2.6), c2006 = c(37.1, 2.6)), .Names = c("ID", "p2012", "p2010", "p2008", "p2006", "c2012", "c2010", "c2008", "c2006"), class = "data.frame", row.names = c(NA, -2L))
An alternative to shadow's answer is to use the reshape function:
reshape(d, direction = 'long', varying = list(2:5, 6:9), v.names = c("p", "c"),
        idvar = "ID", times = c(2012, 2010, 2008, 2006))
This assumes that you know the column indices of the p and c columns beforehand (or add additional code to figure them out). Furthermore, the times vector above could be found using something similar to the gsub call in shadow's answer.
Which one to use is probably a matter of taste.
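For instance, a sketch of deriving both the column list and the times vector from the column names (assuming the p/c prefix pattern above):
p.cols <- grep("^p[0-9]+$", names(d), value = TRUE)
c.cols <- grep("^c[0-9]+$", names(d), value = TRUE)
reshape(d, direction = 'long', varying = list(p.cols, c.cols),
        v.names = c("p", "c"), idvar = "ID",
        times = as.numeric(sub("^p", "", p.cols)))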
You probably have to melt the data first, then split the variable and the year and then dcast to your final data.frame.
require(reshape2)
# melt data.frame
dfmelt <- melt(df, id.vars = "ID", variable.name = "var.year")
# split "var.year" into new variables "var" and "year"
dfmelt[, "var"] <- gsub("[0-9]", "", as.character(dfmelt[, "var.year"]))
dfmelt[, "year"] <- as.numeric(gsub("[a-zA-Z]", "", as.character(dfmelt[, "var.year"])))
# cast to data with a column for each var name
dcast(dfmelt, ID + year ~ var, value.var = "value")
You can also use the following solution from tidyr. You don't actually need to use regular expressions if "p" or "c" is always the first letter of the column names:
library(tidyr)
library(dplyr) # only loaded for the %>% operator
dat %>%
  gather(key, value, p2012:c2006) %>%
  separate(key, c("category", "year"), 1) %>%
  spread(category, value)
  ID year    c   p
1  1 2006 37.1 165
2  1 2008 37.1 163
3  1 2010 37.3 162
4  1 2012 37.3 160
5  2 2006  2.6 163
6  2 2008  2.6 164
7  2 2010  2.6 164
8  2 2012  2.6 163
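For reference, a sketch with the newer tidyr API (pivot_longer, tidyr >= 1.0.0) that replaces the gather/separate/spread chain with a single call:
library(tidyr)
pivot_longer(dat, -ID,
             names_to = c(".value", "year"),
             names_pattern = "([a-z])([0-9]+)")
# year comes out as character; wrap in as.numeric() if needed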

creating new column after joining two data.tables

I have two data.tables, main and metrics, both keyed by cid.
I want to add to table main the average of each of several values located in metrics.
However, I would like to filter by code, only averaging those rows in metrics with a given code.
> metrics
    cid code  DZ value1 value2
1: 1001    A 101      8     21
2: 1001    B 102     11     26
3: 1001    A 103     17     25
4: 1002    A 104     25     39
5: 1002    B 105      6     30
6: 1002    A 106     23     40
7: 1003    A 107     27     32
8: 1003    B 108     16     37
9: 1003    A 109     14     42
# DESIRED OUTPUT
> main
    cid A.avg.val1 A.avg.val2 B.avg.val1 B.avg.val2
1: 1001       12.5       23.0         11         26
2: 1002       24.0       39.5          6         30
3: 1003       20.5       37.0         16         37
# SAMPLE DATA
library(data.table)
set.seed(1)
main <- data.table(cid = 1e3 + 1:3, key = "cid")
metrics <- data.table(cid = rep(1e3 + 1:3, each = 3), code = rep(c("A", "B", "A"), 3),
                      DZ = 101:109, value1 = sample(30, 9), value2 = sample(20:50, 9),
                      key = "cid")
code.filters <- c("A", "B")
These lines get the desired output, but I am having difficulty assigning the new columns back into main. (Also, doing it programmatically would be preferred.)
main[metrics[code == code.filters[[1]]]][, list(mean(c(value1))), by = cid]
main[metrics[code == code.filters[[1]]]][, list(mean(c(value2))), by = cid]
main[metrics[code == code.filters[[2]]]][, list(mean(c(value1))), by = cid]
main[metrics[code == code.filters[[2]]]][, list(mean(c(value2))), by = cid]
Additionally, can someone explain why the following line only takes the last value in each group?
main[metrics[ code=="A"], A.avg.val1 := mean(c(value1))]
You don't need main. You can get it directly from metrics as follows:
tmp.dt <- metrics[, list(A.avg.val1 = mean(value1[code == "A"]),
                         A.avg.val2 = mean(value2[code == "A"]),
                         B.avg.val1 = mean(value1[code == "B"]),
                         B.avg.val2 = mean(value2[code == "B"])), by = cid]
#     cid A.avg.val1 A.avg.val2 B.avg.val1 B.avg.val2
# 1: 1001       12.5       23.0         11         26
# 2: 1002       24.0       39.5          6         30
# 3: 1003       20.5       37.0         16         37
If you still want to subset with main, just do:
main <- data.table(cid = c(1001:1002))
tmp.dt[main]
#     cid A.avg.val1 A.avg.val2 B.avg.val1 B.avg.val2
# 1: 1001       12.5       23.0         11         26
# 2: 1002       24.0       39.5          6         30
I would do this in two steps. First, get your means, then reshape the data
foo <- main[metrics]
bar <- foo[, list(val1 = mean(value1),
val2 = mean(value2)),
by=c('cid', 'code')]
library(reshape2)
bar.melt <- melt(bar, id.var=c('cid', 'code'))
dcast(data=bar.melt,
cid ~ code + variable)
But really, I'd probably leave the data in the "long" format because I find it much easier to work with!
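Alternatively, a sketch that stays in data.table by using its own dcast method with multiple value.var (available in data.table >= 1.9.6); the column names come out as val1_A, val2_B, etc. rather than the exact names above:
avg <- metrics[, .(val1 = mean(value1), val2 = mean(value2)), by = .(cid, code)]
dcast(avg, cid ~ code, value.var = c("val1", "val2"))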
Working off of @Arun's answer, the following gets the desired results:
invisible(
  sapply(code.filters, function(cf)
    main[metrics[code == cf, list(avgv1 = mean(value1), avgv2 = mean(value2)), by = cid],
         paste0(cf, c(".avg.val1", ".avg.val2")) := list(avgv1, avgv2)]
))
> main
    cid A.avg.val1 A.avg.val2 B.avg.val1 B.avg.val2
1: 1001       12.5       23.0         11         26
2: 1002       24.0       39.5          6         30
3: 1003       20.5       37.0         16         37
