LHS:RHS vs functional in data.table - r

Why does the functional ':=' not aggregate unique rows using 'by', while LHS:RHS does aggregate using 'by'? Below are 20 rows of sample data with 58 variables in .csv form; copying and pasting them into a file and reading it as comma-delimited works. I am still trying to find the best way to post sample data to SO. The two variants of my code are:
prodMatrix <- so.sample[, ':=' (Count = .N), by = eval(names(so.sample)[2:28])]
---this version does not aggregate the rowID using by---
prodMatrix <- so.sample[, (Count = .N), by = eval(names(so.sample)[2:28])]
---this version does aggregate the rowID using by---
"CID","NetIncome_length_Auto Advantage","NetIncome_length_Certificates","NetIncome_length_Comm. Share Draft","NetIncome_length_Escrow Shares","NetIncome_length_HE Fixed","NetIncome_length_HE Variable","NetIncome_length_Holiday Club","NetIncome_length_IRA Certificates","NetIncome_length_IRA Shares","NetIncome_length_Indirect Balloon","NetIncome_length_Indirect New","NetIncome_length_Indirect RV","NetIncome_length_Indirect Used","NetIncome_length_Loanline/CR","NetIncome_length_New Auto","NetIncome_length_Non-Owner","NetIncome_length_Personal","NetIncome_length_Preferred Plus Shares","NetIncome_length_Preferred Shares","NetIncome_length_RV","NetIncome_length_Regular Shares","NetIncome_length_S/L Fixed","NetIncome_length_S/L Variable","NetIncome_length_SBA","NetIncome_length_Share Draft","NetIncome_length_Share/CD Secured","NetIncome_length_Used Auto","NetIncome_sum_Auto Advantage","NetIncome_sum_Certificates","NetIncome_sum_Comm. Share Draft","NetIncome_sum_Escrow Shares","NetIncome_sum_HE Fixed","NetIncome_sum_HE Variable","NetIncome_sum_Holiday Club","NetIncome_sum_IRA Certificates","NetIncome_sum_IRA Shares","NetIncome_sum_Indirect Balloon","NetIncome_sum_Indirect New","NetIncome_sum_Indirect RV","NetIncome_sum_Indirect Used","NetIncome_sum_Loanline/CR","NetIncome_sum_New Auto","NetIncome_sum_Non-Owner","NetIncome_sum_Personal","NetIncome_sum_Preferred Plus Shares","NetIncome_sum_Preferred Shares","NetIncome_sum_RV","NetIncome_sum_Regular Shares","NetIncome_sum_S/L Fixed","NetIncome_sum_S/L Variable","NetIncome_sum_SBA","NetIncome_sum_Share Draft","NetIncome_sum_Share/CD Secured","NetIncome_sum_Used Auto","totNI","Count","totalNI"
93,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,-123.2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,212.97,0,0,0,-71.36,0,0,0,49.01,0,0,67.42,6,404.52
114,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,4,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,14.54,0,0,0,0,0,-285.44,0,0,0,49.01,0,0,-221.89,90,-19970.1
1112,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,60.23,0,0,0,0,-101.55,0,-71.36,0,0,0,98.02,0,0,-14.66,28,-410.48
5366,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-71.36,0,0,0,49.01,0,0,-22.35,77631,-1735052.85
6078,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,7,0,0,0,1,0,0,0,0,0,0,0,0,-17.44,0,0,0,0,0,0,0,14.54,0,0,0,0,0,-499.52,0,0,0,49.01,0,0,-453.41,3,-1360.23
11684,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-71.36,0,0,0,49.01,0,0,-22.35,77631,-1735052.85
47358,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,-14.43,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-71.36,0,0,0,0,0,0,-85.79,3194,-274013.26
193761,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-101.55,0,-71.36,0,0,0,49.01,0,0,-123.9,9973,-1235654.7
232530,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-71.36,0,0,0,49.01,0,0,-22.35,77631,-1735052.85
604897,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-71.36,0,0,0,49.01,0,0,-22.35,77631,-1735052.85
1021309,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-71.36,0,0,0,0,0,0,-71.36,43262,-3087176.32
1023633,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-71.36,0,0,0,0,0,0,-71.36,43262,-3087176.32
1029726,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,60.23,0,0,0,0,0,0,-71.36,0,0,0,49.01,0,0,37.88,8688,329101.44
1040005,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-71.36,0,0,0,49.01,0,0,-22.35,77631,-1735052.85
1040092,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-71.36,0,0,0,49.01,0,0,-22.35,77631,-1735052.85
1064453,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,14.54,0,212.97,0,0,0,-142.72,0,0,0,0,0,0,84.79,49,4154.71
1067508,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,-123.2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-71.36,0,0,0,0,0,0,-194.56,4162,-809758.72
1080303,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-71.36,0,0,0,0,0,0,-71.36,43262,-3087176.32
1181005,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,2,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-101.55,0,-142.72,0,0,0,98.02,0,0,-146.25,614,-89797.5
1200484,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-101.55,0,-285.44,0,0,0,0,0,0,-386.99,50,-19349.5

Because := operates by reference: it does not make an in-memory copy of your dataset but updates it in place.
An aggregated dataset, by contrast, is necessarily a new object, distinct from the original unaggregated data.
You can read more about this in the Reference semantics vignette.
This is a design concept in data.table: := is used to update by reference, while the other forms (.(), list() or a direct expression) are used to query data, and a query is not a by-reference operation. A by-reference operation cannot aggregate rows; it can only calculate aggregates and put them into the dataset in place. A query can aggregate the dataset because its result is not the same object in memory as the original data.table.
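The difference can be seen on a small table. This is a minimal sketch, not the asker's data: the table dt and its columns id, a and b are hypothetical stand-ins for so.sample and its grouping columns.

```r
library(data.table)

# Hypothetical toy data: an ID column plus two grouping columns
# whose combinations repeat across rows.
dt <- data.table(id = 1:6,
                 a  = c(0, 0, 1, 1, 1, 0),
                 b  = c(1, 1, 0, 0, 0, 1))

# Query form: returns a NEW data.table with one row per (a, b) group.
agg <- dt[, .(Count = .N), by = .(a, b)]

# Update form: adds Count to dt in place, so dt keeps all six rows.
dt[, Count := .N, by = .(a, b)]

# If a collapsed table is wanted after the update, drop the ID
# column and deduplicate explicitly.
collapsed <- unique(dt[, .(a, b, Count)])

print(agg)
print(dt)
```

Here agg and collapsed hold one row per group, while dt still has every original row with the group count repeated on each.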

Related

R data.table best way to modify dynamically selected columns

I have a data.table DT which contains many columns. I need to replace a dynamically selected set of columns (named in the string vector 'percentVars') with their percentages relative to another column called 'Awake'. I have tried the following expressions.
# I am a Matlab user and new to R. This is R equivalent of how I would do this in Matlab
DT[,percentVars]=DT[,(percentVars)]/DT$Awake*100
#looks elegant, and perhaps more efficient?
DT[,(percentVars):= .SD/Awake*100, .SDcols=percentVars]
#why I see people always use lapply with := operator, what's the difference compared to above?
DT[,(percentVars):= lapply(.SD, function(z) z/Awake*100), .SDcols=percentVars]
# otherwise loop over percentVars with set?
for (col in percentVars) set(DT,,col,DT[,..col]/DT$Awake*100)
Which expression is better (memory- and computation-efficient, works by reference, vectorized, etc.)? Is there a better way to do this?
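For comparison, the two by-reference variants from the question can be run side by side on toy data. This is only a sketch: the columns REM and Deep and their values are hypothetical, and the set() loop reads each column with [[ rather than DT[, ..col] so that a plain vector is passed as the value.

```r
library(data.table)

# Hypothetical toy data; only the column name Awake comes from the question.
DT <- data.table(Awake = c(10, 20, 40),
                 REM   = c(1, 5, 10),
                 Deep  = c(2, 4, 8))
percentVars <- c("REM", "Deep")

# Variant 1: := with lapply over .SD, restricted to the selected columns.
DT1 <- copy(DT)
DT1[, (percentVars) := lapply(.SD, function(z) z / Awake * 100),
    .SDcols = percentVars]

# Variant 2: loop over the columns with set(), which also updates in place.
DT2 <- copy(DT)
for (col in percentVars) {
  set(DT2, j = col, value = DT2[[col]] / DT2[["Awake"]] * 100)
}

print(DT1)
print(all.equal(DT1, DT2))
```

Both variants modify the table in place and give the same result here; which one is faster on real data would have to be measured.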

Most efficient way (fastest) to modify a data.frame using indexing

A short introduction to the question:
I am developing an ecophysiological model, and I use a reference class list called S that stores every object the model needs for input/output (e.g. meteo, physiological parameters, etc.).
This list contains 5 objects (see example below):
- two dataframes, S$Table_Day (the outputs from the model) and S$Met_c (the meteo input), which both have variables in columns and observations (input or output) in rows.
- a list of parameters S$Parameters.
- a matrix
- a vector
The model runs many functions with a daily time step. Each day is computed in a for loop that runs from the first day i=1 to the last day i=n. This list is passed to the functions, which often take data from S$Met_c and/or S$Parameters as input and compute something that is stored in S$Table_Day, using indexes (the ith day). S is a Reference Class list because Reference Classes avoid copy on modification, which is very important considering the number of computations.
The question itself:
As the model is very slow, I am trying to decrease computation time by micro-benchmarking different solutions.
Today I found something surprising when comparing two solutions to store my data. Storing data by indexing in one of the preallocated dataframes takes longer than storing it into an undeclared vector. After reading this, I thought preallocating memory was always faster, but it seems that R performs more operations while modifying by index (probably comparing the length, type, etc.).
My question is: is there a better way to perform such operations? In other words, is there a way for me to use/store the inputs/outputs more efficiently (in a data.frame, a list of vectors or something else) to keep track of all computations of each day? For example, would it be better to use many vectors (one for each variable) and regroup them into more complex objects (e.g. a list of dataframes) at the end?
By the way, am I right to use Reference Classes to avoid copying the big objects in S while passing it to functions and modifying it from within them?
Reproducible example for the comparison:
# Packages needed for the benchmark and its plot:
library(microbenchmark)
library(ggplot2)
SimulationClass <- setRefClass("Simulation",
                               fields = list(Table_Day = "data.frame",
                                             Met_c = "data.frame",
                                             PerCohortFruitDemand_c = "matrix",
                                             Parameters = "list",
                                             Zero_then_One = "vector"))
S= SimulationClass$new()
# Initializing the table with dummy numbers :
S$Table_Day= data.frame(one= 1:10000, two= rnorm(n = 10000), three= runif(n = 10000),Bud_dd= rep(0,10000))
S$Met_c= data.frame(DegreeDays= rnorm(n=10000, mean = 10, sd = 1))
# f1: store the cumulative sum in a new local vector
f1= function(i){
  a= cumsum(S$Met_c$DegreeDays[i:(i-1000)])
}
# f2: store the cumulative sum by index in the preallocated data.frame
f2= function(i){
  S$Table_Day$Bud_dd[(i-1000):i]= cumsum(S$Met_c$DegreeDays[i:(i-1000)])
}
# With i = 1000 the index ranges include 0, which R silently drops.
res= microbenchmark(f1(1000),f2(1000),times = 10000)
autoplot(res)
And the result (the autoplot image is not reproduced here):
Also if someone has any experience in programming such models, I am deeply interested in any advice for model development.
I read more about the question, and I'll just write here for posterity some of the solutions that were proposed in other posts.
Apparently, both reading and writing are worth considering when trying to reduce the computation time of assignment to a data.frame by index.
The sources are all found in other discussions:
How to optimize Read and Write to subsections of a matrix in R (possibly using data.table)
Faster i, j matrix cell fill
Time in getting single elements from data.table and data.frame objects
Several solutions appeared relevant:
- Use a matrix instead of a data.frame if possible, to leverage in-place modification (Advanced R).
- Use a list instead of a data.frame, because [<-.data.frame is not a primitive function (Advanced R).
- Write functions in C++ and use Rcpp (from this source).
- Use .subset2 to read instead of [ (third source).
- Use data.table as recommended by @JulienNavarre and @Emmanuel-Lin and the different sources, and use either set for a data.frame or := if using a data.table is not a problem.
- Use [[ instead of [ when possible (index by one value only). This one is not very effective and very restrictive, so I removed it from the following comparison.
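To make the read and write options above concrete, here is a minimal sketch on a hypothetical four-row data.frame (the column names are borrowed from the example; the values are made up). It only shows that the access paths return the same data, not how fast they are.

```r
library(data.table)

# Hypothetical toy data.frame standing in for the model tables.
df <- data.frame(DegreeDays = c(10, 11, 12, 13), Bud_dd = 0)
i <- 2:4

# Three ways to read the same values from a column:
x1 <- df$DegreeDays[i]                 # $ then index
x2 <- .subset2(df, "DegreeDays")[i]    # .subset2 skips method dispatch
x3 <- df[["DegreeDays"]][i]            # [[ then index

# Writing by reference with data.table::set(), which also accepts
# a plain data.frame.
set(df, i = i, j = "Bud_dd", value = cumsum(x2))

print(df)
```

All three reads return the same vector, so they can be swapped freely inside a benchmark to isolate the cost of the access path.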
Here is the analysis of performance using the different solutions:
The code:
# Loading packages :
library(data.table)
library(microbenchmark)
library(ggplot2)
# Creating dummy data :
SimulationClass <- setRefClass("Simulation",
fields = list(Table_Day = "data.frame",
Met_c= "data.frame",
PerCohortFruitDemand_c="matrix",
Parameters= "list",
Zero_then_One="vector"))
S= SimulationClass$new()
S$Table_Day= data.frame(one= 1:10000, two= rnorm(n = 10000), three= runif(n = 10000),Bud_dd= rep(0,10000))
S$Met_c= data.frame(DegreeDays= rnorm(n=10000, mean = 10, sd = 1))
# Transforming data objects into simpler forms :
mat= as.matrix(S$Table_Day)
Slist= as.list(S$Table_Day)
Metlist= as.list(S$Met_c)
MetDT= as.data.table(S$Met_c)
SDT= as.data.table(S$Table_Day)
# Setting up the functions for the tests :
f1= function(i){
S$Table_Day$Bud_dd[i]= cumsum(S$Met_c$DegreeDays[i])
}
f2= function(i){
mat[i,4]= cumsum(S$Met_c$DegreeDays[i])
}
f3= function(i){
mat[i,4]= cumsum(.subset2(S$Met_c, "DegreeDays")[i])
}
f4= function(i){
Slist$Bud_dd[i]= cumsum(.subset2(S$Met_c, "DegreeDays")[i])
}
f5= function(i){
Slist$Bud_dd[i]= cumsum(Metlist$DegreeDays[i])
}
f6= function(i){
set(S$Table_Day, i=as.integer(i), j="Bud_dd", cumsum(S$Met_c$DegreeDays[i]))
}
f7= function(i){
set(S$Table_Day, i=as.integer(i), j="Bud_dd", MetDT[i,cumsum(DegreeDays)])
}
f8= function(i){
SDT[i,Bud_dd := MetDT[i,cumsum(DegreeDays)]]
}
i= 6000:6500
res= microbenchmark(f1(i),f3(i),f4(i),f5(i),f7(i),f8(i), times = 10000)
autoplot(res)
And the resulting autoplot (image not reproduced here):
With f1 the reference base assignment, f2 using a matrix instead of a data.frame, f3 using the combination of .subset2 and matrix, f4 using a list and .subset2, f5 using two lists (both reading and writing), f6 using data.table::set, f7 using data.table::set and data.table for the cumulative sum, and f8 using data.table :=.
As we can see, the best solution is to use lists for both reading and writing. It is pretty surprising to see that data.table is the worst solution. I believe I did something wrong with it, because it is supposed to be the best. If you can improve it, please tell me.

SparkR gapply - function returns a multi-row R dataframe

Let's say I want to execute something as follows:
library(SparkR)
...
df <- read.parquet(<some_address>)
gapply(
  df,
  df$column1,
  function(key, x) {
    return(data.frame(x, newcol1=f1(x), newcol2=f2(x)))
  }
)
where the return of the function has multiple rows. To be clear, the examples in the documentation (which sadly echoes much of the Spark documentation where the examples are trivially simple) don't help me identify whether this will be handled as I expect.
I would expect that, for k groups created in the DataFrame with n_k output rows per group, the result of the gapply() call would have sum(1..k, n_k) rows, where the key value is replicated for each of the n_k rows of each group k. However, the schema field suggests to me that this is not how it will be handled; in fact it suggests that it will want the result pushed into a single row.
Hopefully this is clear, albeit theoretical (I'm sorry I can't share my actual code example). Can someone verify or explain how such a function will actually be treated?
Exact expectations regarding input and output are clearly stated in the official documentation:
Apply a function to each group of a SparkDataFrame. The function is to be applied to each group of the SparkDataFrame and should have only two parameters: grouping key and R data.frame corresponding to that key. The groups are chosen from SparkDataFrames column(s). The output of function should be a data.frame.
Schema specifies the row format of the resulting SparkDataFrame. It must represent R function’s output schema on the basis of Spark data types. The column names of the returned data.frame are set by user. Below is the data type mapping between R and Spark.
In other words, your function should take a key and a data.frame of the rows corresponding to that key, and return a data.frame that can be represented using Spark SQL types, with its schema provided as the schema argument. There is no restriction on the number of rows. You could, for example, apply an identity transformation as follows:
df <- as.DataFrame(iris)
gapply(df, "Species", function(k, x) x, schema(df))
the same way as aggregations:
gapply(df, "Species",
function(k, x) {
dplyr::summarize(dplyr::group_by(x, Species), max(Sepal_Width))
},
structType(
structField("species", "string"),
structField("max_s_width", "double"))
)
although in practice you should prefer aggregations directly on DataFrame (groupBy %>% agg).

How to quickly split values in column to create a table for plotting in R

I was wondering if anyone could offer any advice on speeding the following up in R.
I've got a table in a format like this
chr1, A, G, v1,v2,v3;w1,w2,w3, ...
...
The header is
chr, ref, alt, sample1, sample2 ...(many samples)
In each row, for each sample, I've got 3 values for v and 3 values for w, separated by ";"
I want to extract v1 and w1 for each sample and make a table that can be plotted using ggplot. It would look like this
chr, ref, alt, sam, v1, w1
I am doing this with strsplit and rbind, one by one, like the following
varsam <- c()
for(i in 1:n.var){
  chrm <- variants[i,1]
  ref <- as.character(variants[i,3])
  alt <- as.character(variants[i,4])
  amp <- as.character(variants[i,5])
  for(j in 1:n.sam){
    vs <- strsplit(as.character(vcftable[i,j+6]), split=":")[[1]]
    vsc <- strsplit(vs[1], split=",")[[1]]
    vsp <- strsplit(vs[2], split=",")[[1]]
    varsam <- rbind(varsam, c(chrm, pos, ref, j, vsc[1], vsp[1]))
  }
}
This is very slow as you would expect. Any idea how to speed this up?
As noted by others, the first thing you need is some timings, so that you have a baseline to compare against when you optimize. This would be my first step:
- Create some timings.
- Play around with different aspects of your code to see where most of the time is being spent.
- Basic timing can be done with the system.time() function.
Beyond that, there are some candidates you might consider to improve performance, but get the timings first so that you have something to compare against.
- The dplyr library contains a mutate function that can be used to create new columns, e.g. mynewtablewithextracolumn <- mutate(table, v1 = whatever you want it to be). In that statement, simply insert the calculation for each value of the new column v1. There are lots of examples on the internet.
- To use dplyr, you need to call library(dplyr) in your code.
- You may need to run install.packages("dplyr") if it is not already installed.
- To use dplyr, it may be best to convert your table into the type dplyr works with, e.g. if your current table is a data frame, use table = tbl_df(df) (as_tibble(df) in recent dplyr versions) to create a table.
As noted, these are just some possible areas. The important thing is to get timings and explore the performance, both to find the best place to focus and to make sure you can measure any improvement.
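As a sketch of what such a timing comparison could look like, the following uses system.time() on a hypothetical workload (building a three-column table row by row). The function names and the workload are illustrative assumptions, not the asker's code.

```r
# Hypothetical workload size.
n <- 2000

# Pattern from the question: grow the result with rbind on every iteration.
grow_by_rbind <- function(n) {
  out <- c()
  for (i in seq_len(n)) out <- rbind(out, c(i, i * 2, i * 3))
  out
}

# Alternative to compare against: allocate once, then fill by row index.
fill_preallocated <- function(n) {
  out <- matrix(NA_real_, nrow = n, ncol = 3)
  for (i in seq_len(n)) out[i, ] <- c(i, i * 2, i * 3)
  out
}

t_rbind <- system.time(a <- grow_by_rbind(n))
t_fill  <- system.time(b <- fill_preallocated(n))

# Elapsed seconds for each approach; the numbers depend on the machine.
print(c(rbind = t_rbind[["elapsed"]], preallocated = t_fill[["elapsed"]]))
```

Checking that both approaches return the same values before comparing their timings keeps the comparison honest.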
Thanks for the comments. I think I've found a way to improve this.
I first used melt from the "reshape" package to convert my input table to
chr, ref, alt, variable
I can then use apply to modify "variable", each row of which contains a concatenated string. This achieves good speed.
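The melt-then-split idea can be sketched as follows. Note the swapped-in tools: this uses data.table's melt and tstrsplit rather than the "reshape" package and apply, and the input table, its values and the ";" separator between the v and w triplets are assumptions based on the description in the question.

```r
library(data.table)

# Hypothetical toy input: one row per variant, one column per sample,
# each cell holding "v1,v2,v3;w1,w2,w3".
variants <- data.table(
  chr = c("chr1", "chr2"),
  ref = c("A", "C"),
  alt = c("G", "T"),
  sample1 = c("1,2,3;4,5,6", "7,8,9;10,11,12"),
  sample2 = c("13,14,15;16,17,18", "19,20,21;22,23,24")
)

# Wide to long: one row per (variant, sample) pair.
long <- melt(variants, id.vars = c("chr", "ref", "alt"),
             variable.name = "sam", value.name = "cell",
             variable.factor = FALSE)

# Split each cell into its v and w triplets in one vectorized call,
# then keep only the first value of each triplet.
long[, c("v", "w") := tstrsplit(cell, ";", fixed = TRUE)]
long[, v1 := as.numeric(sub(",.*", "", v))]
long[, w1 := as.numeric(sub(",.*", "", w))]
long[, c("cell", "v", "w") := NULL]

print(long)
```

The result has the columns chr, ref, alt, sam, v1 and w1, which is the long layout ggplot expects.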

Update data.table column changes data type

I am testing a small scale scenario before rolling it out in a larger production environment and am experiencing a strange occurrence.
I have 2 data sets:
dtL <- data.table(URN=c(1,2,3,4,5), DonorType=c("Cash","RG","Emergency","Emergency","Cash"))
dtL[,c("EmergVal","EmergDate") := list(as.numeric(NA),as.Date(NA))]
setkey(dtL,URN)
dtR <- data.table(URN = c(1,1,1,2,3,3 ,3 ,4,4, 4,4,5),
class=c(5,5,5,1,5,40,40,5,40,5,40,5),
xx= c(25,50,25,10,100,20,25,20,40,35,20,25),
xdate=as.Date(c("2013-01-01","2013-06-05","2014-05-27","2014-10-14",
"2014-06-09","2014-04-07","2014-10-16",
"2014-07-16","2014-10-21","2014-10-22","2014-09-18","2013-12-19")))
setkey(dtR,URN)
I want to update dtL where the DonorType is equal to "Emergency", but only for a subset of records from dtR. I have seen Update subset of data.table based on join and have used that as the foundation for my solution.
dtL[dtR[class==40,list(maxxx=max(xx)),by=URN],
EmergVal := ifelse(DonorType=="Emergency",i.maxxx,as.numeric(NA))]
dtL[dtR[class==40,list(maxdate=max(xdate)),by=URN],
EmergDate := ifelse(DonorType=="Emergency",as.Date(i.maxdate),as.Date(NA)),nomatch=0]
I don't get any errors; however, when I look at the data in dtL afterwards, the data type of EmergDate has changed to num rather than what it originally was (i.e. Date).
So three questions:
Why has it changed the data type (especially when it is a Date when first created in dtL, and I tell it to produce a date in my ifelse statement)?
How do I get it to keep the Date type when I assign it? Or will I have to do some post-assignment conversion/casting?
Is there a clean way to do my assignment of EmergVal and EmergDate in a single statement, given that I don't have a DonorType field in dtR and don't want to add one (so I can't use a multiple key for the join)?
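On the first two questions, a likely cause (offered as a sketch, not a definitive diagnosis of the code above) is that base R's ifelse() builds its result from the test argument and does not carry over the class of yes or no, so a Date comes back as a plain number. data.table's fifelse(), available in data.table 1.12.4 and later, keeps the attributes of yes. The values below are illustrative only.

```r
library(data.table)

d    <- as.Date("2014-10-16")
flag <- c(TRUE, FALSE)

# Base ifelse() drops the Date class: the result is a plain numeric vector.
base_res <- ifelse(flag, d, as.Date(NA))

# fifelse() keeps the class of `yes`, so the result is still a Date.
fast_res <- fifelse(flag, d, as.Date(NA))

print(class(base_res))
print(class(fast_res))
```

Another option that avoids ifelse altogether is to restrict the rows first (for example by filtering on DonorType) and assign the Date column only for those rows.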
