Equivalent to ddply(...,transform,...) in data.table - r

I have the following code using ddply from plyr package:
ddply(mtcars,.(cyl),transform,freq=length(cyl))
The data.table version of this is :
DT<-data.table(mtcars)
DT[,freq:=.N,by=cyl]
How can I extend this to more than one function? For example, I want to compute two new columns with both ddply and data.table:
ddply(mtcars,.(cyl),transform,freq=length(cyl),sum=sum(mpg))
DT[,list(freq=.N,sum=sum(mpg)),by=cyl]
But data.table returns only three columns: cyl, freq, and sum. I can work around this by listing the remaining columns explicitly:
DT[,list(freq=.N,sum=sum(mpg),mpg,disp,hp,drat,wt,qsec,vs,am,gear,carb),by=cyl]
But I have a large number of variables in my real data and I want all of them to be kept, as in ddply(..., transform, ...). Is there a shortcut in data.table, like using := when there is only one function (as above), or something like paste(names(mtcars),collapse=",") within data.table?
Note: I also have a large number of functions to run, so I can't repeat := many times (though I would prefer that if lapply can be applied here).

Use backquoted := like this...
DT[ , `:=`( freq = .N , sum = sum(mpg) ) , by=cyl ]
head( DT , 3 )
# mpg cyl disp hp drat wt qsec vs am gear carb freq sum
#1: 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 7 138.2
#2: 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 7 138.2
#3: 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 11 293.3

Also useful in some situations:
newvars <- c("freq","sum")
DT[, `:=`(eval(newvars), list(.N,sum(mpg)))]
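For the case raised in the question's note, where many functions have to be applied, one possible sketch combines the parenthesized character-vector form of := with lapply over a named list of functions. The list funs and the "mpg_" prefix are illustrative assumptions, not part of the original answers.

```r
library(data.table)

DT <- as.data.table(mtcars)

# Hypothetical set of summary functions to apply to mpg within each cyl group
funs <- list(sum = sum, mean = mean, max = max)

# One new column name per function (the "mpg_" prefix is an arbitrary choice)
newvars <- paste0("mpg_", names(funs))

# Each function returns one value per group; := recycles it over the
# group's rows, so all original columns are kept
DT[, (newvars) := lapply(funs, function(f) f(mpg)), by = cyl]

head(DT, 3)
```

Adding another summary then only means adding an entry to funs, rather than another argument to `:=`.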

Related

Create subset of ranges and individual items

I'm using R and have a dataset of psychological test data (~3000 entries). The data are all dyadic, from male-female partners (though this shouldn't matter here). I'm creating a new data frame with just the variables of interest. Most of them are not listed sequentially in the original data, so I select them by name as below:
new_df <- subset(data, select=c("MQ4", "FQ4", #RX STATUS
"MQ9", "FQ9", #ETHNICITY
"MQ10", "FQ10", #RACE
"MQ465", "FQ465", #SEX
"MQ13", "FQ13", #GENDER
"MQ14", "FQ14", #SEXORIENT
"MQ180", "MQ181", "MQ182", "MQ182" ### HERE IS WHERE I NEED HELP
))
However, about 150 of the items are listed sequentially, and I'd like to select them without writing out "MQ180" through "MQ310" one by one. I've been trying to find a way to select this range in addition to the individual items I already select. This is what I'm currently trying:
new_df <- subset(data, select=c("MQ4", "FQ4", #RX STATUS
"MQ9", "FQ9", #ETHNICITY
"MQ10", "FQ10", #RACE
"MQ465", "FQ465", #SEX
"MQ13", "FQ13", #GENDER
"MQ14", "FQ14", #SEXORIENT
163:310 ### HERE IS WHERE I NEED HELP
))
One option:
dplyr::select(mtcars, "cyl", 5:8)
This subsets the mtcars data frame to just the cyl column and the 5th through 8th columns:
cyl drat wt qsec vs
Mazda RX4 6 3.90 2.620 16.46 0
Mazda RX4 Wag 6 3.90 2.875 17.02 0
Datsun 710 4 3.85 2.320 18.61 1
Here's a base R alternative, though there's probably a better way (naming the first argument keeps the column name cyl in the result):
cbind(cyl = mtcars[, 'cyl'], mtcars[, 5:8])
mtcars originally:
5 6 7 8
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
In the select argument of subset, we can use names(data) to turn the positions into column names:
subset(data, select=c("MQ4", "FQ4", #RX STATUS
"MQ9", "FQ9", #ETHNICITY
"MQ10", "FQ10", #RACE
"MQ465", "FQ465", #SEX
"MQ13", "FQ13", #GENDER
"MQ14", "FQ14", #SEXORIENT
names(data)[163:310]
))
The issue arises because a vector can hold only a single class. When character and integer values are combined with c(), the integers are coerced to character, so subset looks for columns named "163", "164", and so on instead of using them as position indexes.
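The coercion can be seen directly on mtcars. This is a minimal sketch in base R; the variable names sel, cols, and res are only for illustration.

```r
# Mixing a name with positions coerces everything to character
sel <- c("cyl", 5:8)
class(sel)   # "character"
sel          # "cyl" "5" "6" "7" "8"

# Converting the positions to names first keeps the selection valid
cols <- c("cyl", names(mtcars)[5:8])
res <- mtcars[cols]
names(res)   # "cyl" "drat" "wt" "qsec" "vs"
```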

How to use a short script to eliminate all but one duplicate column variables based on the prefix of the colname

I want to know how to use a short script to eliminate all but one of a set of duplicate columns, based on the prefix of the column name, without typing out by hand the variables I want to remove.
For example, I created repeats of the mtcars$am variable, called am1, am2, am3, and am4, in a data frame called mtcars_example_2. I then removed the original am variable from mtcars_example_2.
I can eliminate all variables with the prefix "am" except am1, storing the result in a new data frame called mtcars_example_3, using the code below, which lists by hand every variable to remove:
## long way of removing all variable with am prefix that were not am1
mtcars_example_3 <-
mtcars_example_2 %>%
select(
-c(
"am2", "am3", "am4"
)
)
But this seems like the long way of doing this. Is there a faster way that does not require me to individually type the names of each of the variables that I want to remove from the data?
Is this possible? If so, how can this be done?
Thanks ahead of time.
Here is the code for the example:
# example data
## loads packages
library(tidyverse)
## creates mtcars_example data
mtcars_example_1 <- data.frame(mtcars)
mtcars_example_2 <- data.frame(mtcars_example_1)
## creates duplicate variables, based on am variable
mtcars_example_2$am1 <- mtcars_example_1$am
mtcars_example_2$am2 <- mtcars_example_1$am
mtcars_example_2$am3 <- mtcars_example_1$am
mtcars_example_2$am4 <- mtcars_example_1$am
## removes original variable
mtcars_example_2 <-
mtcars_example_2 %>%
select(
-c(
"am"
)
)
## long way of removing all variable with am prefix that were not am1
mtcars_example_3 <-
mtcars_example_2 %>%
select(
-c(
"am2", "am3", "am4"
)
)
You can remove all the variables that start with am but keep am1 :
library(dplyr)
mtcars_example_2 %>% select(-starts_with('am'), am1) %>% head
# mpg cyl disp hp drat wt qsec vs gear carb am1
#Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 4 4 1
#Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 4 4 1
#Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 4 1 1
#Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 3 1 0
#Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 3 2 0
#Valiant 18.1 6 225 105 2.76 3.460 20.22 1 3 1 0
Depending on your actual scenario you can also use regex to remove columns.
mtcars_example_2 %>% select(-matches('am[2-4]')) %>% head
We could also do
library(dplyr)
mtcars_example_2 %>%
select(-contains('am'), am1)
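For comparison, the same prefix-based removal can be sketched in base R without dplyr. The object names df, drop, and res are illustrative; the data is rebuilt from mtcars in the same way as in the question.

```r
# Rebuild the example data from the question
df <- mtcars
for (i in 1:4) df[[paste0("am", i)]] <- df$am
df$am <- NULL

# Flag columns that start with "am", except the one copy to keep
drop <- startsWith(names(df), "am") & names(df) != "am1"

# Keep every column that is not flagged
res <- df[!drop]
names(res)
```

Note that startsWith only matches at the beginning of a name, like starts_with, whereas contains('am') would also match a column with "am" anywhere in its name.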

Cannot use a variable named with numbers in R

I have some dataframes named as:
1_patient
2_patient
3_patient
Now I am not able to access its variables. For example:
I am not able to obtain:
2_patient$age
If I press tab when writing the name, it automatically gets quoted, but I am still unable to use it.
Do you know how can I solve this?
Naming an object with a leading digit is not recommended, but we can use backquotes to refer to such an object and extract values from it:
`1_patient`$age
If there is more than one such object, we can use mget to return the objects in a list and then extract the 'age' column by looping over the list with lapply:
mget(ls(pattern = "^\\d+_mtcars$"))
#$`1_mtcars`
# mpg cyl disp hp drat wt qsec vs am gear carb
#Mazda RX4 21 6 160 110 3.9 2.620 16.46 0 1 4 4
#Mazda RX4 Wag 21 6 160 110 3.9 2.875 17.02 0 1 4 4
lapply(mget(ls(pattern = "^\\d+_patient$")), `[[`, 'age')
Using a small reproducible example
data(mtcars)
`1_mtcars` <- head(mtcars, 2)
1_mtcars$mpg
# Error: unexpected input in "1_"
`1_mtcars`$mpg
#[1] 21 21
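An alternative that avoids backquotes at each use is to keep the data frames in a named list from the start. The sketch below uses made-up age values purely for illustration; the list name patients is also an assumption.

```r
# Hypothetical data: two small data frames stored under non-syntactic names
patients <- list(
  "1_patient" = data.frame(age = c(34, 51)),
  "2_patient" = data.frame(age = c(28, 45))
)

# Character indexing works with any name, including ones starting with a digit
patients[["2_patient"]]$age

# Extract the age column from every element at once
ages <- lapply(patients, `[[`, "age")
ages
```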

Apply variable function to columns in data.table

I'm wondering if there's a way to apply a function in a string variable to .SD cols in a data.table.
I can generalize all other parts of function calls using a data.table, including input and output columns, which I'm very happy about. But the final piece seems to be applying a variable function to a data.table, which is something I believe I've done before with dplyr and do.call.
mtcars <- as.data.table(mtcars)
returnNames <- "calculatedColumn"
SDnames <- c("mpg","hp")
myfunc <- function(data) {
print(data)
return(data[,1]*data[,2])
}
This obviously works:
mtcars[,eval(returnNames) := myfunc(.SD),.SDcols = SDnames,by = cyl]
But if I want to apply a dynamic function, something like this does not work:
functionCall <- "myfunc"
mtcars[,eval(returnNames) := lapply(.SD,eval(functionCall)),.SDcols = SDnames,by = cyl]
I get this error:
Error in `[.data.table`(mtcars, , `:=`(eval(returnNames), lapply(.SD, : attempt to apply non-function
Is using "apply" with "eval" the right idea, or am I on the wrong track entirely?
You don't want lapply. Since myfunc takes a data.table with multiple columns, you just want to pass such a data.table to the function as a single object.
To look up the function from its name, you need get instead of eval; eval of a character string just returns the string.
On the left-hand side of :=, you can put the character vector in parentheses; eval isn't needed.
mtcars[, (returnNames) := get(functionCall)(.SD)
, .SDcols = SDnames
, by = cyl]
head(mtcars)
# mpg cyl disp hp drat wt qsec vs am gear carb calculatedColumn
# 1: 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 2310.0
# 2: 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 2310.0
# 3: 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 2120.4
# 4: 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 2354.0
# 5: 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 3272.5
# 6: 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 1900.5
The code above was run after the following code
mtcars <- as.data.table(mtcars)
returnNames <- "calculatedColumn"
SDnames <- c("mpg","hp")
myfunc <- function(data) {
print(data)
return(data[,1]*data[,2])
}
functionCall <- "myfunc"
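The difference between eval and get on a function name can be checked in isolation. The sketch below is a simplified variant of the setup above: myfunc omits the print call and uses [[ ]] indexing, and match.fun is shown as one possible alternative to get; both changes are assumptions for illustration, not part of the original answer.

```r
library(data.table)

dt <- as.data.table(mtcars)
returnNames <- "calculatedColumn"
SDnames <- c("mpg", "hp")

# Simplified version of myfunc: product of the first two columns
myfunc <- function(data) data[[1]] * data[[2]]
functionCall <- "myfunc"

is.function(eval(functionCall))  # FALSE: eval just returns the string
is.function(get(functionCall))   # TRUE: get looks up the object by name

# match.fun also resolves a function from its name
f <- match.fun(functionCall)
dt[, (returnNames) := f(.SD), .SDcols = SDnames, by = cyl]
head(dt$calculatedColumn, 3)
```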

Using 'mutate_' to sum a bunch of columns row-wise

In this blog post, Paul Hiemstra shows how to sum up two columns using dplyr::mutate_. Copy/paste-ing relevant parts:
library(lazyeval)
f = function(col1, col2, new_col_name) {
mutate_call = lazyeval::interp(~ a + b, a = as.name(col1), b = as.name(col2))
mtcars %>% mutate_(.dots = setNames(list(mutate_call), new_col_name))
}
allows one to then do:
head(f('wt', 'mpg', 'hahaaa'))
Great!
I followed up with a question (see comments) as to how one could extend this to 100 columns, since it wasn't quite clear (to me) how one could do it without having to type all the names using the above method. Paul was kind enough to indulge me and provided this answer (thanks!):
# data
df = data.frame(matrix(1:100, 10, 10))
names(df) = LETTERS[1:10]
# answer
sum_all_rows = function(list_of_cols) {
summarise_calls = sapply(list_of_cols, function(col) {
lazyeval::interp(~col_name, col_name = as.name(col))
})
df %>% select_(.dots = summarise_calls) %>% mutate(ans1 = rowSums(.))
}
sum_all_rows(LETTERS[sample(1:10, 5)])
I'd like to improve this answer on these points:
The other columns are gone. I'd like to keep them.
It uses rowSums() which has to coerce the data.frame to a matrix which I'd like to avoid.
Also I'm not sure if the use of . within non-do() verbs is encouraged? Because . within mutate() doesn't seem to adapt to just those rows when used with group_by().
And most importantly, how can I do the same using mutate_() instead of mutate()?
I found this answer, which addresses point 1, but unfortunately, both dplyr answers use rowSums() along with mutate().
PS: I just read Hadley's comment under that answer. IIUC, 'reshape to long form + group by + sum + reshape to wide form' is the recommended dplyr way for these types of operations?
Here's a different approach:
library(dplyr); library(lazyeval)
f <- function(df, list_of_cols, new_col) {
df %>%
mutate_(.dots = ~Reduce(`+`, .[list_of_cols])) %>%
setNames(c(names(df), new_col))
}
head(f(mtcars, c("mpg", "cyl"), "x"))
# mpg cyl disp hp drat wt qsec vs am gear carb x
#1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 27.0
#2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 27.0
#3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 26.8
#4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 27.4
#5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 26.7
#6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 24.1
Regarding your points:
Other columns are kept
It doesn't use rowSums
You are specifically asking for a row-wise operation here so I'm not sure (yet) how a group_by could do any harm when using . inside mutate/mutate_
It makes use of mutate_
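The Reduce idea at the core of this answer does not depend on mutate_ and can be sketched in base R as well. The function name add_row_sum is hypothetical; it keeps the other columns and avoids rowSums in the same way.

```r
# Add a column holding the element-wise sum of the chosen columns
add_row_sum <- function(df, cols, new_col) {
  # df[cols] is a list of columns; Reduce adds them pairwise, element by element
  df[[new_col]] <- Reduce(`+`, df[cols])
  df
}

res <- add_row_sum(mtcars, c("mpg", "cyl"), "x")
head(res$x, 3)  # 27.0 27.0 26.8
```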
