Is my way of duplicating rows in data.table efficient?

I have monthly data in one data.table and annual data in another data.table and now I want to match the annual data to the respective observation in the monthly data.
My approach is as follows: Duplicating the annual data for every month and then join the monthly and annual data. And now I have a question regarding the duplication of rows. I know how to do it, but I'm not sure if it is the best way to do it, so some opinions would be great.
Here is an example data.table DT for my annual data and how I currently duplicate the rows:
library(data.table)
DT <- data.table(ID = paste(rep(c("a", "b"), each=3), c(1:3, 1:3), sep="_"),
values = 10:15,
startMonth = seq(from=1, by=2, length=6),
endMonth = seq(from=3, by=3, length=6))
DT
ID values startMonth endMonth
[1,] a_1 10 1 3
[2,] a_2 11 3 6
[3,] a_3 12 5 9
[4,] b_1 13 7 12
[5,] b_2 14 9 15
[6,] b_3 15 11 18
#1. Alternative
DT1 <- DT[, list(MONTH=startMonth:endMonth), by="ID"]
setkey(DT, ID)
setkey(DT1, ID)
DT1[DT]
ID MONTH values startMonth endMonth
a_1 1 10 1 3
a_1 2 10 1 3
a_1 3 10 1 3
a_2 3 11 3 6
[...]
The last join is exactly what I want. However, DT[, list(MONTH=startMonth:endMonth), by="ID"] already does everything I want except adding the other columns of DT, so I was wondering if I could get rid of the last three lines of my code, i.e. the setkey and join operations. It turns out you can. Just do the following:
#2. Alternative: More intuitive and just one line of code
DT[, list(MONTH=startMonth:endMonth, values, startMonth, endMonth), by="ID"]
ID MONTH values startMonth endMonth
a_1 1 10 1 3
a_1 2 10 1 3
a_1 3 10 1 3
a_2 3 11 3 6
...
This, however, only works because I hardcoded the column names into the list expression. In my real data, I do not know the names of all columns in advance, so I was wondering if I could just tell data.table to return the column MONTH that I compute as shown above and all the other columns of DT. .SD seemed to be able to do the trick, but:
DT[, list(MONTH=startMonth:endMonth, .SD), by="ID"]
Error in `[.data.table`(DT, , list(YEAR = startMonth:endMonth, .SD), by = "ID") :
maxn (4) is not exact multiple of this j column's length (3)
So to summarize, I know how to do it, but I was wondering if this is the best way, because I'm still struggling a little with the syntax of data.table and often read in posts and on the wiki that there are good and bad ways of doing things. Also, I don't quite get why I get an error when using .SD. I thought it was just an easy way to tell data.table that you want all columns. What am I missing?

Looking at this I realized that the answer was only possible because ID was a unique key (without duplicates). Here is another answer with duplicates. But, by the way, some NA seem to creep in. Could this be a bug? I'm using v1.8.7 (commit 796).
library(data.table)
DT <- data.table(x=c(1,1,1,1,2,2,3),y=c(1,1,2,3,1,1,2))
DT[,rep:=1L][c(2,7),rep:=c(2L,3L)] # duplicate row 2 and triple row 7
DT[,num:=1:.N] # to group each row by itself
DT
x y rep num
1: 1 1 1 1
2: 1 1 2 2
3: 1 2 1 3
4: 1 3 1 4
5: 2 1 1 5
6: 2 1 1 6
7: 3 2 3 7
DT[,cbind(.SD,dup=1:rep),by="num"]
num x y rep dup
1: 1 1 1 1 1
2: 2 1 1 1 NA # why these NA?
3: 2 1 1 2 NA
4: 3 1 2 1 1
5: 4 1 3 1 1
6: 5 2 1 1 1
7: 6 2 1 1 1
8: 7 3 2 3 1
9: 7 3 2 3 2
10: 7 3 2 3 3
Just for completeness, a faster way is to rep the row numbers and then take the subset in one step (no grouping and no use of cbind or .SD) :
DT[rep(num,rep)]
x y rep num
1: 1 1 1 1
2: 1 1 2 2
3: 1 1 2 2
4: 1 2 1 3
5: 1 3 1 4
6: 2 1 1 5
7: 2 1 1 6
8: 3 2 3 7
9: 3 2 3 7
10: 3 2 3 7
where in this example data the column rep happens to be the same name as the rep() base function.

Great question. What you tried was very reasonable. Assuming you're using v1.7.1 it's now easier to make list columns. In this case it's trying to make one list column out of .SD (3 items) alongside the MONTH column of the 2nd group (4 items). I'll raise it as a bug [EDIT: now fixed in v1.7.5], thanks.
In the meantime, try :
DT[, cbind(MONTH=startMonth:endMonth, .SD), by="ID"]
ID MONTH values startMonth endMonth
a_1 1 10 1 3
a_1 2 10 1 3
a_1 3 10 1 3
a_2 3 11 3 6
...
Also, just to check you've seen roll=TRUE? Typically you'd have just one startMonth column (irregular with gaps) and then just roll join to it. Your example data has overlapping month ranges though, so that complicates it.
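To illustrate the rolling-join idea mentioned above, here is a minimal sketch. The tables annual and monthly and their values are made up for illustration, and the sketch assumes non-overlapping ranges defined by a single startMonth column, which is not the case in the question's example data.

```r
library(data.table)

# Hypothetical annual data: each value applies from startMonth until the next startMonth
annual <- data.table(startMonth = c(1L, 4L, 9L), values = c(10L, 11L, 12L))

# Hypothetical monthly data
monthly <- data.table(MONTH = 1:10)

# For every MONTH, take the annual row with the latest startMonth <= MONTH
res <- annual[monthly, on = .(startMonth = MONTH), roll = TRUE]
res
```

In the result the join column keeps the name startMonth but holds the monthly values, so no rows need to be duplicated beforehand.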

Here is a function I wrote which mimics disaggregate (I needed something that handled complex data). It might be useful for you, if it isn't overkill. To expand only rows, set the argument fact to c(1,12) where 12 would be for 12 'month' rows for each 'year' row.
zexpand<-function(inarray, fact=2, interp=FALSE, ...) {
fact<-as.integer(round(fact))
switch(as.character(length(fact)),
'1' = xfact<-yfact<-fact,
'2'= {xfact<-fact[1]; yfact<-fact[2]},
{xfact<-fact[1]; yfact<-fact[2];warning(' fact is too long. First two values used.')})
if (xfact < 1) { stop('fact[1] must be > 0') }
if (yfact < 1) { stop('fact[2] must be > 0') }
# new nonloop method, seems to work just ducky
bigtmp <- matrix(rep(t(inarray), each=xfact), nrow(inarray), ncol(inarray)*xfact, byrow=TRUE)
#does column expansion
bigx <- t(matrix(rep((bigtmp),each=yfact),ncol(bigtmp),nrow(bigtmp)*yfact,byrow=TRUE))
return(invisible(bigx))
}

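A fast and succinct way to do it is to repeat each row index once per month. Note the + 1: startMonth:endMonth contains endMonth - startMonth + 1 values.
DT[rep(1:nrow(DT), endMonth - startMonth + 1)]
We can also enumerate the rows within each group:
dd <- DT[rep(1:nrow(DT), endMonth - startMonth + 1)]
dd[, nn := 1:.N, by = ID]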
dd
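As a sketch of how the row-repetition approach could also produce the MONTH column from the question, the example below uses a cut-down version of the question's DT. It assumes ID is unique per row; the helper name n_months is made up for illustration. Each range startMonth:endMonth contains endMonth - startMonth + 1 months, hence the + 1L.

```r
library(data.table)

# Cut-down version of the question's annual data (ID assumed unique)
DT <- data.table(ID = c("a_1", "a_2"), values = 10:11,
                 startMonth = c(1L, 3L), endMonth = c(3L, 6L))

# Number of months covered by each row, both endpoints included
n_months <- DT$endMonth - DT$startMonth + 1L

# Repeat each row index once per month, keeping all columns without naming them
expanded <- DT[rep(seq_len(nrow(DT)), n_months)]

# Within each ID, count up from startMonth to get the month
expanded[, MONTH := startMonth + seq_len(.N) - 1L, by = ID]
expanded
```

This avoids grouping during the expansion itself; only the cheap MONTH computation is done by group.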

Related

How to crosstabulate the missings with data.table

Say we have this toy example:
prueba <- data.table(aa=1:7,bb=c(1,2,NA, NA, 3,1,1),
cc=c(1,2,NA, NA, 3,1,1) , YEAR=c(1,1,1,2,2,2,2))
aa bb cc YEAR
1: 1 1 1 1
2: 2 2 2 1
3: 3 NA NA 1
4: 4 NA NA 2
5: 5 3 3 2
6: 6 1 1 2
7: 7 1 1 2
I want to create a table with the values of something by YEAR.
In this simple example I will just ask for the table that says how many missing and non-missing I have.
This is an ugly way to do it, specifying everything by hand:
prueba[,.(sum(is.na(.SD)),sum(!is.na(.SD))), by=YEAR]
Although it doesn't label the new columns automatically, the output shows that I have 2 missing and 7 non-missing values for year 1, and 2 missing and 10 non-missing values for year 2:
YEAR V1 V2
1: 1 2 7
2: 2 2 10
It works but what I would really like is to be able to use table() or some data.table equivalent command instead of specifying by hand every term. That would be much more efficient if I have many of them or if we don't know them beforehand.
I've tried with:
prueba[,table(is.na(.SD)), by=YEAR]
but it doesn't work, I get this:
YEAR V1
1: 1 7
2: 1 2
3: 2 10
4: 2 2
How can I get the same format as above?
I've tried as.data.table, unlist, lapply, and other things, without luck. I think some people use dcast but I don't know how to use it here.
Is there a simple way to do it?
My real table is very large.
Is it better to use the names of the columns instead of .SD?
You can convert the table to a list if you want it as two separate columns:
prueba[, as.list(table(is.na(.SD))), by=YEAR]
# YEAR FALSE TRUE
# 1: 1 7 2
# 2: 2 10 2
I suggest not using TRUE and FALSE as column names though.
prueba[, setNames(as.list(table(is.na(.SD))), c('notNA', 'isNA'))
, by = YEAR]
# YEAR notNA isNA
# 1: 1 7 2
# 2: 2 10 2
Another option is to add a new column and then dcast
na_summ <- prueba[, table(is.na(.SD)), by = YEAR]
na_summ[, vname := c('notNA', 'isNA'), YEAR]
dcast(na_summ, YEAR ~ vname, value.var = 'V1')
# YEAR isNA notNA
# 1: 1 2 7
# 2: 2 2 10
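On the question of naming columns versus using .SD, a minimal sketch of two variants follows. The result names na_by_col and na_subset are made up for illustration; the first counts missing values per column, the second restricts the count to chosen columns through .SDcols.

```r
library(data.table)

prueba <- data.table(aa = 1:7, bb = c(1, 2, NA, NA, 3, 1, 1),
                     cc = c(1, 2, NA, NA, 3, 1, 1), YEAR = c(1, 1, 1, 2, 2, 2, 2))

# Count missing values separately for each non-grouping column, by YEAR
na_by_col <- prueba[, lapply(.SD, function(x) sum(is.na(x))), by = YEAR]
na_by_col

# Restrict the count to chosen columns with .SDcols (bb and cc here)
na_subset <- prueba[, .(isNA = sum(is.na(.SD)), notNA = sum(!is.na(.SD))),
                    by = YEAR, .SDcols = c("bb", "cc")]
na_subset
```

With .SDcols the column list can be built programmatically, so the names do not have to be known in advance.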

Wrapping cumulative sum from a set starting row in R

I have a data frame that looks a bit like this:
wt <- data.frame(region = c(rep("A", 5), rep("B", 5)), time = c(1:5, 1:5),
start = c(rep(2,5), rep(4, 5)), value = rep(1, 10))
The values in the value column could be any numbers (I am working in a very large data set), but each region will be over an equal-length time series and have a single starting point.
I want to perform a cumulative sum within each region that begins accumulating at the starting point, continues forward in the time series, and wraps to the rows before the starting point in the time series.
The full data table, WITH the intended result, would look like this:
region time start value result
A 1 2 1 5
A 2 2 1 1
A 3 2 1 2
A 4 2 1 3
A 5 2 1 4
B 1 4 1 3
B 2 4 1 4
B 3 4 1 5
B 4 4 1 1
B 5 4 1 2
A simple transformation of the time column followed by cumsum does not work, since the function cares about row order and not any particular factor.
With that in mind, I am operating on a huge data table, and runtime is absolutely a concern, so any solution must avoid re-ordering rows.
Ideas of how to do this? Thanks in advance.
EDIT: Consider time to be a cycle such as hours in a day - and for example, if the start time is 2, that means observations start at one instance of time 2 and end at the next time 1.
We can do this in an efficient way with data.table
library(data.table)
setDT(wt)[time>=start, result := seq_len(.N), region]
wt[, Max := max(result, na.rm = TRUE), region]
wt[is.na(result), result := Max +seq_len(.N) , region][, Max := NULL][]
# region time start value result
#1: A 1 2 1 5
#2: A 2 2 1 1
#3: A 3 2 1 2
#4: A 4 2 1 3
#5: A 5 2 1 4
#6: B 1 4 1 3
#7: B 2 4 1 4
#8: B 3 4 1 5
#9: B 4 4 1 1
#10: B 5 4 1 2
akrun's solution works for the example I gave (hence I accepted it as the answer), but here's a version that works for any values in the value column:
library(data.table)
setDT(wt)[time>=start, result := cumsum(value), region]
wt[, Max := max(result, na.rm = TRUE), region]
wt[is.na(result), result := Max +cumsum(value) , region][, Max := NULL][]
Just adding the... unfortunately named cumsum function in place of a calculated sequence.
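For comparison, here is a sketch of a single-pass variant that does not create a temporary Max column. It is not from either answer above, and it assumes the rows are already ordered by time within each region. The idea: rows at or after the start get the running sum minus everything before the start, and rows before the start get the running sum plus everything from the start onward.

```r
library(data.table)

# Example data with non-constant values in region A
wt <- data.table(region = rep(c("A", "B"), each = 5), time = rep(1:5, 2),
                 start = rep(c(2, 4), each = 5), value = c(1:5, rep(1, 5)))

wt[, result := {
  cs  <- cumsum(value)             # running sum in row order
  pre <- sum(value[time < start])  # total of the rows before the start
  # at/after start: drop the pre-start total; before start: add the post-start total
  ifelse(time >= start, cs - pre, cs + sum(value) - pre)
}, by = region]
wt
```

The rows are never reordered, which matches the constraint stated in the question.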

Summing the number of times a value appears in either of 2 columns

I have a large data set - around 32mil rows. I have information on the telephone number, the origin of the call, and the destination.
For each telephone number, I want to count the number of times it appeared either as Origin or as Destination.
An example data table is as follows:
library(data.table)
dt <- data.table(Tel=seq(1,5,1), Origin=seq(1,5,1), Destination=seq(3,7,1))
Tel Origin Destination
1: 1 1 3
2: 2 2 4
3: 3 3 5
4: 4 4 6
5: 5 5 7
I have working code, but it takes too long for my data since it involves a for loop. How can I optimize it?
Here it is:
for (i in unique(dt$Tel)){
index <- (dt$Origin == i | dt$Destination == i)
dt[dt$Tel ==i, "N"] <- sum(index)
}
Result:
Tel Origin Destination N
1: 1 1 3 1
2: 2 2 4 1
3: 3 3 5 2
4: 4 4 6 2
5: 5 5 7 2
Here N shows that Tel=1 and Tel=2 each appear once, while Tel=3, 4 and 5 each appear twice.
We can do a melt and match
dt[, N := melt(dt, id.var = "Tel")[, tabulate(match(value, Tel))]]
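Another option is to loop through columns 2 and 3, use %in% to check whether each value of 'Tel' is present in that column, add up the resulting logical vectors with Reduce and +, and assign (:=) the sum to 'N'. Note that this counts the number of columns in which each 'Tel' appears rather than the number of occurrences, which gives the same result for this example.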
dt[, N := Reduce(`+`, lapply(.SD, function(x) Tel %in% x)), .SDcols = 2:3]
dt
# Tel Origin Destination N
#1: 1 1 3 1
#2: 2 2 4 1
#3: 3 3 5 2
#4: 4 4 6 2
#5: 5 5 7 2
A second method constructs a temporary data.table which is then joined to the original. This is longer and likely less efficient than @akrun's, but can be useful to see.
# get temporary data.table as the sum of origin and destination frequencies
temp <- setnames(data.table(table(unlist(dt[, .(Origin, Destination)], use.names=FALSE))),
c("Tel", "N"))
# turn the variables into integers (Tel comes from the names of the table above, and is thus character)
temp <- temp[, lapply(temp, as.integer)]
Now, join the original table on
dt <- temp[dt, on="Tel"]
dt
Tel N Origin Destination
1: 1 1 1 3
2: 2 1 2 4
3: 3 2 3 5
4: 4 2 4 6
5: 5 2 5 7
You can get the desired column order using setcolorder
setcolorder(dt, c("Tel", "Origin", "Destination", "N"))
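A variation on the temporary-table idea is an update join, which adds N to dt by reference and so keeps the original column order. This is a sketch, not part of the answers above; the name long is made up for illustration. Like the melt approach, it counts a row twice if its Origin equals its Destination, unlike the loop in the question.

```r
library(data.table)

dt <- data.table(Tel = seq(1, 5, 1), Origin = seq(1, 5, 1), Destination = seq(3, 7, 1))

# Stack both columns into one and count how often each number occurs
long <- data.table(Tel = c(dt$Origin, dt$Destination))[, .N, by = Tel]

# Update join: copy the count into dt by reference (unmatched Tel would get NA)
dt[long, N := i.N, on = "Tel"]
dt
```

Because the column is assigned in place, no setcolorder call is needed afterwards.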

How can I reshape my dataframe?

I have a huge data frame, that in a simple version it looks like this:
trials=c("1","2","3","4","5","6","7","8","9","10")
co =c(rep ("1",10))
stim=c("8","9","11","2","4","7","8","1","12","16")
ansbin=c("1","0","1","0","0","1","0","1","1","0")
stim.1=c("11","2","11","7","4","3","9","1","4","16")
ansbin.1=c("0","0","1","0","0","1","0","1","1","1")
trials.1=c("1","2","3","4","5","6","7","8","9","10")
co.1 =c(rep ("2",10))
stim1.1=c("11","2","11","2","5","7","8","15","17","10")
ansbin1.1=c("1","1","1","0","0","1","1","1","0","1")
stim2.1=c("11","2","14","1","4","8","9","10","4","12")
ansbin2.1=c("0","1","1","0","0","1","0","0","1","0")
ID<- data.frame(trials,co,stim,ansbin,stim.1,ansbin.1,trials.1,co.1,stim1.1,ansbin1.1,stim2.1,ansbin2.1)
View(ID)
Now I would like to form my new data.frame in the way that "stim", "stim.1","stim1.1" and "stim2.1" are under the same column called "stimulus", and the same thing for the answers: I would like all "ansbin", "ansbin.1", "ansbin1.1" and "ansbin2.1" under the same column called "answers".
"trials" and "trials.1" should also end up in the same column, with the "co" column telling them apart.
I tried to use "reshape" like this:
df<-reshape(ID, direction="long",
idvar=c("trials", "co"),
varying= c("stim","stim.1", "stim1.1","stim2.1","ansbin","ansbin.1","ansbin1.1","ansbin2.1"),
v.names=c("stimulus","answer"),
timevar="num"
)
but I get problems and warnings every time. I think the problem is linked to the column names.
Can you help me?
Thank you in advance! :)
Here's the approach I would take:
library(data.table)
melt(
rbindlist(split.default(ID, cumsum(grepl("^trials", names(ID))))),
measure.vars = patterns("^stim", "^ansbin"), value.name = c("stim", "ansbin"))
# trials co variable stim ansbin
# 1: 1 1 1 8 1
# 2: 2 1 1 9 0
# 3: 3 1 1 11 1
# 4: 4 1 1 2 0
# 5: 5 1 1 4 0
# ---
# 36: 6 2 2 8 1
# 37: 7 2 2 9 0
# 38: 8 2 2 10 0
# 39: 9 2 2 4 1
# 40: 10 2 2 12 0
Basically, it sounds like you're looking at two rounds of "reshaping":
1. Stacking the columns from "trials" to the second set of "ansbin" on top of each other. I've done that with the rbindlist(split.default(...)) part of my answer.
2. Stacking each resulting pair of "stim" and "ansbin" columns on top of each other. I've done that with the melt(...) part of my answer.
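The first round can be looked at in isolation. The sketch below uses a small made-up data frame (wide) rather than the full ID data, and passes use.names = FALSE explicitly so that the blocks are stacked by position even though their column names differ.

```r
library(data.table)

# Hypothetical wide frame with two repeated blocks of columns
wide <- data.frame(trials = 1:2, co = 1, stim = c(8, 9),
                   trials.1 = 1:2, co.1 = 2, stim.1 = c(11, 2))

# A new block starts at every column whose name begins with "trials"
block_id <- cumsum(grepl("^trials", names(wide)))
block_id

# Split the columns by block, then stack the blocks by position
blocks <- split.default(wide, block_id)
stacked <- rbindlist(blocks, use.names = FALSE)
stacked
```

The stacked result takes its column names from the first block, which is why the later melt step can match on "^stim" and "^ansbin".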
Consider building a list of reshaped data frames, one for each set (co, trials, stimulus, and answers), then merging them together. However, because co and trials each have only two columns while the latter two each have four, consider repeating columns prior to reshaping:
ID$co2 <- ID$co
ID$co3 <- ID$co.1
ID$trials.2 <- ID$trials
ID$trials.3 <- ID$trials.1
df_list <- lapply(c("co", "trials", "stim", "ans"), function(s)
reshape(ID, direction="long",
varying= grep(s, names(ID)),
v.names=c(s),
drop = grep(paste0("^", s), names(ID), invert=TRUE),
timevar="num",
new.row.names = 1:1000)
)
# CHAIN MERGE
finaldf <- Reduce(function(x, y) merge(x, y, by=c('id', 'num')), df_list)
finaldf <- with(finaldf, finaldf[order(num, id),]) # SORT DATAFRAME
rownames(finaldf) <- NULL # RESET ROWNAMES
head(finaldf)
# id num co trials stim ans
# 1 1 1 1 1 8 1
# 2 2 1 1 2 9 0
# 3 3 1 1 3 11 1
# 4 4 1 1 4 2 0
# 5 5 1 1 5 4 0
# 6 6 1 1 6 7 1

Using R: Make a new column that counts the number of times 'n' conditions from 'n' other columns occur

I have columns 1 and 2 (ID and value). Next I would like a count column that lists the # of times that the same value occurs per id. If it occurs more than once, it will obviously repeat the value. There are other variables in this data set, but the new count variable needs to be conditional only on 2 of them. I have scoured this blog, but I can't find a way to make the new variable conditional on more than one variable.
ID Value Count
1 a 2
1 a 2
1 b 1
2 a 2
2 a 2
3 a 1
3 b 3
3 b 3
3 b 3
Thank you in advance!
You can use ave:
df <- within(df, Count <- ave(ID, list(ID, Value), FUN=length))
You can use ddply from plyr package:
library(plyr)
df1<-ddply(df,.(ID,Value), transform, count1=length(ID))
>df1
ID Value Count count1
1 1 a 2 2
2 1 a 2 2
3 1 b 1 1
4 2 a 2 2
5 2 a 2 2
6 3 a 1 1
7 3 b 3 3
8 3 b 3 3
9 3 b 3 3
> identical(df1$Count,df1$count1)
[1] TRUE
Update: As suggested by @Arun, you can replace transform with mutate if you are working with a large data.frame.
Of course, data.table also has a solution!
data[, Count := .N, by = list(ID, Value)]
The built-in constant, ".N", is a length 1 vector reporting the number of observations in each group.
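Because := adds the column by reference, the original rows and dimensions are retained. A join back to your initial data would only be needed if you computed .N as a separate summary (without :=), and the data must be a data.table rather than a plain data.frame.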
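A minimal runnable sketch of the data.table approach on the question's sample data follows. The object name dt is made up for illustration, and as.data.table is used so the original data.frame is left untouched.

```r
library(data.table)

# Sample data from the question, without the Count column
df <- data.frame(ID = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
                 Value = c("a", "a", "b", "a", "a", "a", "b", "b", "b"))

# Convert a copy to data.table and add the group size to every row by reference
dt <- as.data.table(df)
dt[, Count := .N, by = .(ID, Value)]
dt
```

Every original row is kept, with Count repeated within each ID/Value group.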

Resources