I have a data.table DT which contains many columns. I have to assign percentages of dynamically selected set of columns (stored in string vector 'percentVars') with respect to another column called 'Awake' to themselves. I have tried following expressions.
# I am a Matlab user and new to R. This is R equivalent of how I would do this in Matlab
DT[,percentVars]=DT[,(percentVars)]/DT$Awake*100
#looks elegant, and perhaps more efficient?
DT[,(percentVars):= .SD/Awake*100, .SDcols=percentVars]
#why I see people always use lapply with := operator, what's the difference compared to above?
DT[,(percentVars):= lapply(.SD, function(z) z/Awake*100), .SDcols=percentVars]
# otherwise loop over percentVars with set?
for (col in percentVars) set(DT,,col,DT[,..col]/DT$Awake*100)
Which expression is better (memory and computational efficient, works by reference, vectorized etc)? Is there a better way to do this?
Related
I'm working with some data that has hundreds of covariates, so I decided to write some functions to make pre-processing much faster and cleaner (like scaling certain numeric variables). An important part of all of these functions is type-checking the columns before I apply a particular function to them.
Here is my function for scaling continuous columns:
# rm (vector): names of columns not to be scaled
scale.continuous <- function(df, rm=NULL) {
cols <- setdiff(colnames(df), rm)
for(col in cols) {
if(is.numeric(df[,col])){
df[,col] <- as.numeric(scale(df[,col]))
}
}
df
}
This works perfectly fine if I load the data frame using read.csv(), but the data I have is huge so the speed boost of using read_csv() from readr/tidyverse is significant. Unfortunately, if I load my data using read_csv() all of my functions break.
I narrowed down the issue to the type-checking, specifically when type-checking a column I am accessing by a string of its column name. Here's some code to demonstrate what I mean:
# When using read.csv()
> is.numeric(df$col)
[1] TRUE
> is.numeric(df[,"col"])
[1] TRUE
# When using read_csv()
> is.numeric(df$col)
[1] TRUE
> is.numeric(df[,"col"])
[1] FALSE
I realized the issue here was that indexing the dataframe with a string the way I do above returns a tibble instead of a regular list like other methods of indexing do. What I don't understand is why this behavior exists, why as.numeric() (or any type-check) does not work with a tibble and in general why there is this difference in the way the default and tidyverse dataframes are constructed. Also, it would be nice to know if there is a parameter I can change in read_csv() that will make the behavior of this type of indexing the same as with a default dataframe.
I should mention, I realize there are probably better ways of writing this code (for example, just using df$"col" to index fixes the issue), but I still don't understand what the root of the issue was with my first approach. I am now working with much larger data sets that require much more involved pre-processing than what I have been used to in the past so I want to have as complete an understanding of the data structures I am using as possible.
Tibbles have a slightly different default behaviour than regular data frames when using the [ extracting function which can be a bit of a gotcha. Specifically df[,"col"] on a tibble will return a one column tibble whereas on a regular data frame it will return a vector. So you need to use:
df[["col"]]
Or explicitly state that you want to coerce to the lowest dimension and do:
df[, "col", drop = TRUE]
From the documentation:
df[, j] returns a tibble; it does not automatically extract the column
inside. df[, j, drop = FALSE] is the default.
I am trying to do some regression matching using the following code:
library(data.table)
colCount = 3
Test = data.table(ID=c(1,2,3),R1=c(1,1,1),R2=c(1,2,3),R3=c(4,2,1))
Compare = data.table(ID=c(1,2,3),R1=c(2,1,2),R2=c(6,2,3),R3=c(1,1,4))
# Example, in real run, I know colCount will give the arbitrary number of data columns in Test and Compare; I will not know their names
for (i in 1:nrow(Compare)){
Test[,paste0('Factor_',i):=sum(Compare[i,2:(1+colCount)]*.SD)/sum(.SD*.SD),.SDcols=2:(1+colCount),by=1:nrow(Test)]
Test[,paste0('Error_',i) :=sum((Compare[i,2:(1+colCount)]-.SD*paste0('Factor_',i))^2),.SDcols=2:(1+colCount),by=1:nrow(Test)]
}
The paste on the second to last line does not work as I was hoping; is there a better way to refer to columns with generated names?
Also more generally, is there a smarter way to do this? The sum-by-row method I'm doing seems way too complicated, but I couldn't get it working using matrix math or lapply instead
I need to do a quality control in a dataset with more than 3000 variables (columns). However, I only want to apply some conditions in a couple of them. A first step would be to replace outliers by NA. I want to replace the observations that are greater or smaller than 3 standard deviations from the mean by NA. I got it, doing column by column:
height = ifelse(abs(height-mean(height,na.rm=TRUE)) <
3*sd(height,na.rm=TRUE),height,NA)
And I also want to create other variables based on different columns. For example:
data$CGmark = ifelse(!is.na(data$mark) & !is.na(data$height) ,
paste(data$age, data$mark,sep=""),NA)
An example of my dataset would be:
name = factor(c("A","B","C","D","E","F","G","H","H"))
height = c(120,NA,150,170,NA,146,132,210,NA)
age = c(10,20,0,30,40,50,60,NA,130)
mark = c(100,0.5,100,50,90,100,NA,50,210)
data = data.frame(name=name,mark=mark,age=age,height=height)
data
I have tried this (for one condition):
d1=names(data)
list = c("age","height","mark")
ntraits=length(list)
nrows=dim(data)[1]
for(i in 1:ntraits){
a=list[i]
b=which(d1==a)
d2=data[,b]
for (j in 1:nrows){
d2[j] = ifelse(abs(d2[j]-mean(d2,na.rm=TRUE)) < 3*sd(d2,na.rm=TRUE),d2[j],NA)
}
}
Someone told me that I am not storing d2. How can I create for loops to apply the conditions I want? I know that there are similar questions but i didnt get it yet. Thanks in advance.
You pretty much wrote the answer in your first line. You're overthinking this one.
First, it's good practice to encapsulate this kind of operation in a function. Yes, function dispatch is a tiny bit slower than otherwise, but the code is often easier to read and debug. Same goes for assigning "helper" variables like mean_x: the cost of assigning the variable is very, very small and absolutely not worth worrying about.
NA_outside_3s <- function(x) {
mean_x <- mean(x)
sd_x <- sd(x,na.rm=TRUE)
x_outside_3s <- abs(x - mean(x)) < 3 * sd_x
x[x_outside_3s] <- NA # no need for ifelse here
x
}
of course, you can choose any function name you want. More descriptive is better.
Then if you want to apply the function to very column, just loop over the columns. That function NA_outside_3s is already vectorized, i.e. it takes a logical vector as an argument and returns a vector of the same length.
cols_to_loop_over <- 1:ncol(my_data) # or, some subset of columns.
for (j in cols_to_loop_over) {
my_data[, j] <- NA_if_3_sd(my_data[, j])
}
I'm not sure why you wrote your code the way you did (and it took me a minute to even understand what you were trying to do), but looping over columns is usually straightforward.
In my comment I said not to worry about efficiency, but once you understand how the loop works, you should rewrite it using lapply:
my_data[cols_to_loop_over] <- lapply(my_data[cols_to_loop_over], NA_outside_3s)
Once you know how the apply family of functions works, they are very easy to read if written properly. And yes, they are somewhat faster than looping, but not as much as they used to be. It's more a matter of style and readability.
Also: do NOT name a variable list! This masks the function list, which is an R built-in function and a fairly important one at that. You also shouldn't generally name variables data because there is also a data function for loading built-in data sets.
I am trying to benefit from data.table fast grouping to fill a matrix (or do other stuff externally from the data.table).
For example, I have a data.table like this:
DT = data.table(x_id=rep(c(1,2),c(100,100)),x_value = rnorm(200))
setkey(DT,x_id)
(representing two different time-series)
I want to put the same information a matrix of 100 rows and 2 columns.
I tried
A = matrix(NA,100,2)
DT[,{A[,.GRP] = x_value},by=x_id]
But it doesn't work. This raises two questions for me: (I was unable to find help in the doc)
1) Is there a nice way (without loops) to transform the data.table into the matrix.
2) Generally speaking, can we assign value to outside variables in the j environment.
Many thanks for your help.
Try:
DT[,A[,.GRP] <<- x_value,by=x_id]
<<- assigns through to the global environment, which is what you need to do since the data.table expressions are evaluated in a child environment that doesn't contain A.
I would add this is a fairly odd way to use data.table. If you are guaranteed that each group has the same number of rows, then all you need to do is (assuming you have already sorted by x_id:
A <- matrix(DT[, x_value], 100)
Which takes advantage of the underlying vector-like nature of matrices.
I just started using R, and came across data.table. I found it brilliant.
A very naive question: Can I ignore data.frame to use data.table to avoid syntax confusion between two packages?
From the data.table FAQ
FAQ 1.8 OK, I'm starting to see what data.table is about, but why didn't you enhance data.frame in R? Why does it have to be a new package?
As FAQ 1.1 highlights, j in [.data.table is fundamentally
different from j in [.data.frame. Even something as simple as
DF[,1] would break existing code in many packages and user code.
This is by design, and we want it to work this way for more
complicated syntax to work. There are other differences, too (see FAQ
2.17).
Furthermore, data.table inherits from data.frame. It is a
data.frame, too. A data.table can be passed to any package that
only accepts data.frame and that package can use [.data.frame
syntax on the data.table.
We have proposed enhancements to R wherever possible, too. One of
these was accepted as a new feature in R 2.12.0 :
unique() and match() are now faster on character vectors where all elements are in the global CHARSXP cache and have unmarked
encoding (ASCII). Thanks to Matthew Dowle for suggesting improvements
to the way the hash code is generated in unique.c.
A second proposal was to use memcpy in duplicate.c, which is much
faster than a for loop in C. This would improve the way that R copies
data internally (on some measures by 13 times). The thread on r-devel
is here : http://tolstoy.newcastle.edu.au/R/e10/devel/10/04/0148.html.
What are the smaller syntax differences between data.frame and data.table
DT[3] refers to the 3rd row, but DF[3] refers to the 3rd column
DT[3, ] == DT[3], but DF[ , 3] == DF[3] (somewhat confusingly in data.frame, whereas data.table is consistent)
For this reason we say the comma is optional in DT, but not optional in DF
DT[[3]] == DF[, 3] == DF[[3]]
DT[i, ], where i is a single integer, returns a single row, just like DF[i, ], but unlike a matrix single-row subset which returns a vector.
DT[ , j] where j is a single integer returns a one-column data.table, unlike DF[, j] which returns a vector by default
DT[ , "colA"][[1]] == DF[ , "colA"].
DT[ , colA] == DF[ , "colA"] (currently in data.table v1.9.8 but is about to change, see release notes)
DT[ , list(colA)] == DF[ , "colA", drop = FALSE]
DT[NA] returns 1 row of NA, but DF[NA] returns an entire copy of DF containing NA throughout. The symbol NA is type logical in R and is therefore recycled by [.data.frame. The user's intention was probably DF[NA_integer_]. [.data.table diverts to this probable intention automatically, for convenience.
DT[c(TRUE, NA, FALSE)] treats the NA as FALSE, but DF[c(TRUE, NA, FALSE)] returns
NA rows for each NA
DT[ColA == ColB] is simpler than DF[!is.na(ColA) & !is.na(ColB) & ColA == ColB, ]
data.frame(list(1:2, "k", 1:4)) creates 3 columns, data.table creates one list column.
check.names is by default TRUE in data.frame but FALSE in data.table, for convenience.
stringsAsFactors is by default TRUE in data.frame but FALSE in data.table, for efficiency. Since a global string cache was added to R, characters items are a pointer to the single cached string and there is no longer a performance benefit of converting to factor.
Atomic vectors in list columns are collapsed when printed using ", " in data.frame, but "," in data.table with a trailing comma after the 6th item to avoid accidental printing of large embedded objects.
In [.data.frame we very often set drop = FALSE. When we forget, bugs can arise in edge cases where single columns are selected and all of a sudden a vector is returned rather than a single column data.frame. In [.data.table we took the opportunity to make it consistent and dropped drop.
When a data.table is passed to a data.table-unaware package, that package is not concerned with any of these differences; it just works.
Small caveat
There will possibly be cases where some packages use code that falls down when given a data.frame, however, given that data.table is constantly being maintained to avoid such problems, any problems that may arise will be fixed promptly.
For example
see this question and prompt response
From the NEWS for v 1.8.2
base::unname(DT) now works again, as needed by plyr::melt(). Thanks to
Christoph Jaeckel for reporting. Test added.
An as.data.frame method has been added for ITime, so that ITime can be passed to ggplot2
without error, #1713. Thanks to Farrel Buchinsky for reporting. Tests added.
ITime axis labels are still displayed as integer seconds from midnight; we don't know why ggplot2
doesn't invoke ITime's as.character method. Convert ITime to POSIXct for ggplot2, is one approach.