I want to lapply two functions on a data set conditional on the value of a specific variable.
first_function <- function(x) {return (x + 0)}
second_function <- function(x) {return (x + 1)}
df <- data.frame(Letters = c("A","B","B"), Numbers = 1:3)
Something like:
df <- lapply(df, if(df$Letters=="A") first_function else second_function )
To produce:
df_desired <- data.frame(Letters = c("A","B","B"), Numbers = c(1,3,4))
You can do it with dplyr and purrr. Obviously this is a basic function, but you should be able to build on it for your needs:
library(dplyr)
library(purrr)
calc <- function(y, x){
first_function <- function(x) {return (x + 0)}
second_function <- function(x) {return (x + 1)}
if(y == "A")
return(first_function(x))
return(second_function(x))
}
df <- data.frame(Letters = c("A","B","B"), Numbers = 1:3)
df %>%
mutate(Numbers = map2_dbl(Letters, Numbers, ~calc(.x,.y)))
Letters Numbers
1 A 1
2 B 3
3 B 4
>(df_desired <- data.frame(Letters = c("A","B","B"), Numbers = c(1,3,4)))
Letters Numbers
1 A 1
2 B 3
3 B 4
BENCHMARKING
I am not a data.table expert (feel free to add one), so I did not include a data.table solution here. But @R Yoda is correct: although the purrr solution reads nicely, and future you will find it easier to read and extend, it is not that fast. I liked the ifelse approach, so I added case_when, which is easier to scale when dealing with multiple functions. Here are a couple of solutions:
library(dplyr)
library(purrr)
library(microbenchmark)
first_function <- function(x) {return (x + 0)}
second_function <- function(x) {return (x + 1)}
calc <- function(y, x){
if(y == "A")
return(first_function(x))
return(second_function(x))
}
df <- data.frame(Letters = rep(c("A","B","B"),1000), Numbers = 1:3)
basic <- function(){
data.frame(df$Letters, apply(df, 1, function(row) {
num <- as.numeric(row['Numbers'])
if (row['Letters'] == 'A') first_function(num) else second_function(num)
}))
}
dplyr_purrr <- function(){
df %>%
mutate(Numbers = map2_dbl(Letters, Numbers, ~calc(.x,.y)))
}
dplyr_case_when <- function(){
df %>%
mutate(Numbers = case_when(
Letters == "A" ~ first_function(Numbers),
TRUE ~ second_function(Numbers)))
}
map_list <- function(){
data.frame(df$Letters, map2_dbl(df$Letters, df$Numbers, ~calc(.x, .y)))
}
within_mapply <- function(){
within(df, Numbers <- mapply(Letters, Numbers,
FUN = function(x, y){
switch(x,
"A" = first_function(y),
"B" = second_function(y))
}))
}
within_ifelse <- function(){
within(df, Numbers <- ifelse(Letters == "A",
first_function(Numbers),
second_function(Numbers)))
}
within_case_when <- function(){
within(df, Numbers <- case_when(
Letters == "A" ~ first_function(Numbers),
TRUE ~ second_function(Numbers)))
}
(mbm <- microbenchmark(
basic(),
dplyr_purrr(),
dplyr_case_when(),
map_list(),
within_mapply(),
within_ifelse(),
within_case_when(),
times = 1000
))
Unit: microseconds
expr min lq mean median uq max neval cld
basic() 12816.427 24028.3375 27719.8182 26741.7770 29417.267 277756.650 1000 f
dplyr_purrr() 9682.884 17817.0475 20072.2752 19736.8445 21767.001 48344.265 1000 e
dplyr_case_when() 1098.258 2096.2080 2426.7183 2325.7470 2625.439 9039.601 1000 b
map_list() 8764.319 16873.8670 18962.8540 18586.2790 20599.000 41524.564 1000 d
within_mapply() 6718.368 12397.1440 13806.1752 13671.8120 14942.583 24958.390 1000 c
within_ifelse() 279.796 586.6675 690.1919 653.3345 737.232 8131.292 1000 a
within_case_when() 470.155 955.8990 1170.4641 1070.5655 1219.284 46736.879 1000 a
The simple way to do this with *apply would be to put the whole logic (with the conditional and the two functions) into another function and use apply with MARGIN=1 to pass the data in row by row (lapply will pass in the data by column):
apply(df, 1, function(row) {
num <- as.numeric(row['Numbers'])
if (row['Letters'] == 'A') first_function(num) else second_function(num)
})
[1] 1 3 4
The problem with this approach, as @r2evans points out in the comment below, is that when you use apply with a heterogeneous data.frame (in this case, Letters is type factor while Numbers is type integer), each row is passed into the applied function as a vector, which can only have a single type, so everything in the row is coerced to the same type (in this case character). This is why it's necessary to use as.numeric(row['Numbers']) to turn Numbers back into type numeric. Depending on your data, this could be a simple fix (as above) or it could make things much more complicated and bug-prone. Either way, @akrun's solution is much better, since it preserves each variable's original data type.
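The coercion is easy to see on the three-row example. This is a minimal base R sketch; stringsAsFactors = FALSE is set here only so the result is the same across R versions, and the names row_types and col_types are made up for illustration.

```r
# Same toy data as above; stringsAsFactors = FALSE is an assumption made for reproducibility
df <- data.frame(Letters = c("A", "B", "B"), Numbers = 1:3,
                 stringsAsFactors = FALSE)

# apply() converts the data frame to a matrix first, so every row arrives as character
row_types <- apply(df, 1, function(row) typeof(row))

# Column-wise access keeps each column's own type
col_types <- sapply(df, typeof)

print(row_types)
print(col_types)
```

Every entry of row_types is "character", while col_types still reports Numbers as "integer".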
lapply has difficulty in this case because it's column-based. However, you can try transposing your data with t() and using lapply if you insist. Here I provide two ways, which use mapply and ifelse:
df$Letters <- as.character(df$Letters)
# Method 1
within(df, Numbers <- mapply(Letters, Numbers, FUN = function(x, y){
switch(x, "A" = first_function(y),
"B" = second_function(y))
}))
# Method 2
within(df, Numbers <- ifelse(Letters == "A",
first_function(Numbers),
second_function(Numbers)))
Both methods produce the same output:
# Letters Numbers
# 1 A 1
# 2 B 3
# 3 B 4
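If there are more than two functions, nesting ifelse calls gets unwieldy. One possible alternative, sketched here in base R, is to keep the functions in a named list and apply each one to its own group of rows. The names funs, groups and results are hypothetical, and the sketch assumes every value of Letters has an entry in funs.

```r
first_function <- function(x) x + 0
second_function <- function(x) x + 1

df <- data.frame(Letters = c("A", "B", "B"), Numbers = 1:3,
                 stringsAsFactors = FALSE)

# Lookup table: one function per value of Letters (assumed to cover all values)
funs <- list(A = first_function, B = second_function)

# Split the numbers by letter, apply the matching function to each group,
# then put the results back in the original row order
groups <- split(df$Numbers, df$Letters)
results <- Map(function(f, x) f(x), funs[names(groups)], groups)
df$Numbers <- unsplit(results, df$Letters)

print(df)
```

Each function is called once per group rather than once per row, so it stays vectorised; a letter without an entry in funs would make the Map call fail.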
Here is a data.table variant for better performance in case of many data rows (but also showing an implicit conversion problem):
library(data.table)
setDT(df) # fast conversion from data.frame to data.table
df[ Letters == "A", Numbers := first_function(Numbers) ]
df[!(Letters == "A"), Numbers := second_function(Numbers)] # issues a warning, see below
df
# Letters Numbers
# 1: A 1
# 2: B 3
# 3: B 4
The issued warning is:
Warning message: In [.data.table(df, !(Letters == "A"),
:=(Numbers, second_function(Numbers))) : Coerced 'double' RHS to
'integer' to match the column's type; may have truncated precision.
Either change the target column ['Numbers'] to 'double' first (by
creating a new 'double' vector length 3 (nrows of entire table) and
assign that; i.e. 'replace' column), or coerce RHS to 'integer' (e.g.
1L, NA_[real|integer]_, as.*, etc) to make your intent clear and for
speed. Or, set the column type correctly up front when you create the
table and stick to it, please.
The reason is that the data.frame column Numbers is an integer
> str(df)
'data.frame': 3 obs. of 2 variables:
$ Letters: Factor w/ 2 levels "A","B": 1 2 2
$ Numbers: int 1 2 3
but the functions return a double, because adding the double constants 0 and 1 (rather than the integers 0L and 1L) promotes the integer column to double:
> typeof(first_function(df$Numbers))
[1] "double"
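A small sketch of the type promotion, in base R only. first_function_int is a hypothetical variant, not part of the original question, that uses an integer constant and therefore keeps the integer type.

```r
first_function <- function(x) x + 0       # 0 is a double constant
first_function_int <- function(x) x + 0L  # hypothetical variant: 0L is an integer constant

type_double <- typeof(first_function(1:3))
type_integer <- typeof(first_function_int(1:3))

print(type_double)
print(type_integer)
```

With the integer variant the assignment by reference would not need a type coercion; alternatively, the column could be converted to double up front.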
I want my function to be able to take a value or a column name. How can I do this with data.table?
library(data.table)
df <- data.table(a = c(1:5),
b = c(5:1),
c = c(1, 3, 5, 3, 1))
myfunc <- function(val) {
df[a >= val]
}
# This works:
myfunc(2)
# This does not work:
myfunc("c")
If I define my function as:
myfunc <- function(val) {
df[a >= get(val)]
}
# This doesn't work:
myfunc(2)
# This works:
myfunc("c")
What is the best way to resolve this?
Edit: To be clear, I want the results to be the same as:
# myfunc(2)
df %>%
filter(a >= 2)
# myfunc("c")
df %>%
filter(a >= c)
EDIT:
Thanks all for the responses, I think I like dww's answer the best.
I wish it was as easy as in dplyr, where I can do:
myfunc <- function(val) {
df %>%
filter(a >= {{val}})
}
# Both work:
myfunc(2)
myfunc(c)
If you build and parse the whole expression, then you can evaluate it in its entirety. For example
myfunc <- function(val) {
df[eval(parse(text=paste("a >= ", val)))]
}
Though relying on a function that lets you mix values and variable names in the same parameter might be dangerous. Especially in the case where you actually wanted to match on character values rather than variable names. If you passed in the whole expression you could do
myfunc <- function(expr) {
expr <- substitute(expr)
df[eval(expr)]
}
myfunc(a>=3)
myfunc(a>=c)
The question did not actually define the desired behavior, so we assume that df must be a data.table, that if a character string is passed then the column of that name should be returned, and that if a number is passed then those rows whose a column exceeds that number should be returned.
Define an S3 generic and methods for character and default.
myfunc <- function(x, data = df) UseMethod("myfunc")
myfunc.character <- function(x, data = df) data[[x]]
myfunc.default <- function(x, data = df) data[a > x]
myfunc(2)
## a b c
## 1: 3 3 5
## 2: 4 2 3
## 3: 5 1 1
myfunc("c")
## [1] 1 3 5 3 1
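If the goal is instead the filtering behaviour from the question's edit (rows where a >= 2, or rows where a >= c), one way to sketch it is to resolve a character argument to a column before comparing. This example uses a plain data.frame rather than a data.table so that it runs without extra packages, and myfunc_sketch is a made-up name. It carries the ambiguity already mentioned above: a character argument is always treated as a column name.

```r
df <- data.frame(a = 1:5, b = 5:1, c = c(1, 3, 5, 3, 1))

myfunc_sketch <- function(val, data = df) {
  # A character argument is looked up as a column; anything else is used as a value
  threshold <- if (is.character(val)) data[[val]] else val
  data[data$a >= threshold, , drop = FALSE]
}

res_value <- myfunc_sketch(2)     # rows where a >= 2
res_column <- myfunc_sketch("c")  # rows where a >= c

print(res_value)
print(res_column)
```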
So, I know my title is a little bit confusing but I was hoping you could help me out here.
I have this data frame df where one column is a RNA sequence alignment. The class of this column is a character.
Then I have these other columns: "Allele_1" and "Allele_2", which represent the variants at a single position in the RNA sequence (column 1); that position is given by the "Position" column. However, those positions do not count the "-" characters. For instance, in row 2 (Position 5, sequence U--ACCGU--G----UAUUUGAU--CTAD) the allele is the fifth letter, the G at string index 7, and NOT the fifth character, the C at string index 5.
sequence Allele_1 Allele_2 Position
UAAGGCUCA----UAGGCAGAU--AUaa A U 3
U--ACCGU--G----UAUUUGAU--CTAD C G 5
cctaACCGU-UUAGCC---------T U C 2
The length of the sequence in column 1 can be variable.
What I want to do is to replace specific letters of the character in specific locations given by "position" and the replacement is given by "Allele_1" and "Allele_2". For instance if the position matches "Allele_2", then I want to replace it by "Allele_2" and vice-versa.
I have tried:
substr(df[,"sequence"],
start = df[,"Position"],
stop = df[,"Position"]) <- df[,"Allele_1"]
However, because my position column does not take the "-" into account, it replaces in the wrong place. For instance, back to row 2, it replaces the C at string index 5 of U--ACCGU--G----UAUUUGAU--CTAD instead of the G at string index 7.
Also, I haven't figured out how to do the "if the position matches "Allele_2", then I want to replace it by "Allele_2" and vice-versa" part.
sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.1 LTS
Really hoping that you can help me figure this out!!
Cheers!
UPDATE:
Sorry, it's supposed to be "if the position matches "Allele_1", then I want to replace it by "Allele_2" and vice-versa" and not "Allele_2", then I want to replace it by "Allele_2".
Here are two options. Both are case-sensitive and thus don't replace anything in the third sequence. If you don't want them to be, wrap the appropriate variables in the ifelses in toupper.
strsplit
You can split each sequence into a vector of letters, against which you can then check equality directly. Implemented in mapply, the multivariate version of sapply:
df$new_seq <- mapply(function(seq, a1, a2, pos){
seq <- strsplit(seq, '')[[1]] # split into letters
to_replace <- seq[seq != '-'][pos] # identify allele to replace
# assign appropriate replacement to subset
seq[seq != '-'][pos] <- ifelse(a1 == to_replace,
a2, ifelse(a2 == to_replace,
a1, to_replace))
paste(seq, collapse = '') # reassemble vector to string
}, df$sequence, df$Allele_1, df$Allele_2, df$Position)
df
## sequence Allele_1 Allele_2 Position new_seq
## 1 UAAGGCUCA----UAGGCAGAU--AUaa A U 3 UAUGGCUCA----UAGGCAGAU--AUaa
## 2 U--ACCGU--G----UAUUUGAU--CTAD C G 5 U--ACCCU--G----UAUUUGAU--CTAD
## 3 cctaACCGU-UUAGCC---------T U C 2 cctaACCGU-UUAGCC---------T
If you prefer, you can break the operation into multiple steps, assigning the result of each to a variable.
sub (regex)
If you're comfortable with regex, you can assemble expressions to extract the allele in question and then replace it with the appropriate replacement:
df$to_replace <- mapply(function(seq, pos){
sub(paste0('(?:-*(?:\\w)-*){', pos - 1, '}(\\w).*'), '\\1', seq)
}, df$sequence, df$Position)
df$new_seq <- mapply(function(seq, pos, a1, a2, to_rpl){
replacement <- ifelse(to_rpl == a1, a2, ifelse(to_rpl == a2, a1, to_rpl))
sub(paste0('((?:-*(?:\\w)-*){', pos - 1, '})\\w(.*)'),
paste0('\\1', replacement, '\\2'),
seq)
}, df$sequence, df$Position, df$Allele_1, df$Allele_2, df$to_replace)
df[-5]
## sequence Allele_1 Allele_2 Position new_seq
## 1 UAAGGCUCA----UAGGCAGAU--AUaa A U 3 UAUGGCUCA----UAGGCAGAU--AUaa
## 2 U--ACCGU--G----UAUUUGAU--CTAD C G 5 U--ACCCU--G----UAUUUGAU--CTAD
## 3 cctaACCGU-UUAGCC---------T U C 2 cctaACCGU-UUAGCC---------T
Data
I created a first, small dataframe, and a larger more realistic data frame.
df <- data.frame(sequence = c('UAAGGCUCA----UAGGCAGAU--AUaa',
'U--ACCGU--G----UAUUUGAU--CTAD',
'cctaACCGU-UUAGCC---------T'),
Allele_1 = c('A', 'C', 'U'),
Allele_2 = c('U', 'G', 'C'),
Position = c(3, 5, 2))
df_big <- data.frame(sequence = rep(c(paste(rep('UAAGGCUCA----UAGGCAGAU--AUaa', 100), collapse=''),
paste(rep('U--ACCGU--G----UAUUUGAU--CTAD', 100), collapse=''),
paste(rep('cctaACCGU-UUAGCC---------T', 100), collapse='')), 100),
Allele_1 = rep(c('A', 'C', 'U'), 100),
Allele_2 = rep(c('U', 'G', 'C'), 100),
Position = rep(c(1000, 1000, 1000), 100))
Functions
I created a function to return the 'new' position after the multiple alignment (ie not counting -), that works with vectors, as follows; also a faster but less safe function:
find_pos <- function(string, pos) {
vectorized <- data.frame(string=string, pos=pos)
sapply(seq_len(nrow(vectorized)), function(i) {
guess <- vectorized[i,'pos']; oldguess <- 0
offset <- -1; trynum <- 0
while(offset != 0 & trynum < 500) {
offset <- nchar(gsub('[^-]', '', substr(vectorized[i,'string'], oldguess+1, guess)))
oldguess <- guess
guess <- oldguess + offset
trynum <- trynum+1
}
return(guess)
})
}
find_pos_unsafe <- function(string, pos) {
sapply(seq_along(string), function(i) {
guess <- pos[i]; oldguess <- 0
offset <- -1
while(offset != 0) {
offset <- nchar(gsub('[^-]', '', substr(string[i], oldguess+1, guess)))
oldguess <- guess
guess <- oldguess + offset
}
return(guess)
})
}
The first can be used with different length variables as follows (but incurs an overhead for this flexibility):
> find_pos(string, 1:5)
[1] 1 4 5 6 7
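For comparison, the same mapping from an ungapped position to a string index can be sketched without a loop, by listing the indices of all non-gap characters and picking the pos-th one. gapped_index is a hypothetical helper, not one of the benchmarked functions, and it returns NA when pos exceeds the number of letters.

```r
gapped_index <- function(sequence, pos) {
  mapply(function(s, p) {
    # String indices of every character that is not a gap
    letter_idx <- which(strsplit(s, "")[[1]] != "-")
    letter_idx[p]
  }, sequence, pos, USE.NAMES = FALSE)
}

idx <- gapped_index(rep("U--ACCGU--G----UAUUUGAU--CTAD", 5), 1:5)
print(idx)
```

This splits the whole string for every row, so it is likely to be slower than the iterative search on long sequences.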
For benchmarking purposes I've wrapped the other code required to get the solution in a function. Again, two forms, one calling the faster matching function and one calling the safer version:
ds440 <- function(df) {
pos <- find_pos(df$sequence, df$Position)
toswap <- ifelse(substr(df$sequence, pos, pos)==df$Allele_1, as.character(df$Allele_2), #if A1, A2
ifelse(substr(df$sequence, pos, pos)==df$Allele_2, as.character(df$Allele_1), #If A2, A1
substr(df$sequence, pos, pos))) # else keep same
df$replaced <- as.character(df$sequence)
substr(df$replaced, pos, pos) <- as.character(toswap)
df
}
ds440_quick <- function(df) {
pos <- find_pos_unsafe(df$sequence, df$Position)
toswap <- ifelse(substr(df$sequence, pos, pos)==df$Allele_1, as.character(df$Allele_2), #if A1, A2
ifelse(substr(df$sequence, pos, pos)==df$Allele_2, as.character(df$Allele_1), #If A2, A1
substr(df$sequence, pos, pos))) # else keep same
df$replaced <- as.character(df$sequence)
substr(df$replaced, pos, pos) <- toswap
df
}
Other functions adapted from @alistaire:
alistaire_split <- function(df) {
df$new_seq <- mapply(function(seq, a1, a2, pos){
seq <- strsplit(seq, '')[[1]] # split into letters
to_replace <- seq[seq != '-'][pos] # identify allele to replace
# assign appropriate replacement to subset
seq[seq != '-'][pos] <- ifelse(a1 == to_replace,
a2, ifelse(a2 == to_replace,
a1, to_replace))
paste(seq, collapse = '') # reassemble vector to string
}, as.character(df$sequence), as.character(df$Allele_1), as.character(df$Allele_2), df$Position)
df
}
alistaire_sub <- function(df) {
df$to_replace <- mapply(function(seq, pos){
sub(paste0('(?:-*(?:\\w)-*){', pos - 1, '}(\\w).*'), '\\1', seq)
}, df$sequence, df$Position)
df$new_seq <- mapply(function(seq, pos, a1, a2, to_rpl){
replacement <- ifelse(to_rpl == a1, a2, ifelse(to_rpl == a2, a1, to_rpl))
sub(paste0('((?:-*(?:\\w)-*){', pos - 1, '})\\w(.*)'),
paste0('\\1', replacement, '\\2'),
seq)
}, df$sequence, df$Position, df$Allele_1, df$Allele_2, df$to_replace)
df[-5]
}
Results
Note the case sensitivity in the matching above. Use toupper or similar before checking for equality if you don't care about case.
ds440(df)
sequence Allele_1 Allele_2 Position replaced
1 UAAGGCUCA----UAGGCAGAU--AUaa A U 3 UAUGGCUCA----UAGGCAGAU--AUaa
2 U--ACCGU--G----UAUUUGAU--CTAD C G 5 U--ACCCU--G----UAUUUGAU--CTAD
3 cctaACCGU-UUAGCC---------T U C 2 cctaACCGU-UUAGCC---------T
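To illustrate the case-sensitivity note, here is a small sketch of the swap rule on its own. swap_allele and its ignore_case argument are made up for this example; the three inputs are the characters found at the target positions of the three example rows.

```r
swap_allele <- function(current, a1, a2, ignore_case = FALSE) {
  # Compare on an upper-cased copy when case should be ignored
  key <- if (ignore_case) toupper(current) else current
  ifelse(key == a1, a2, ifelse(key == a2, a1, current))
}

current <- c("A", "G", "c")
a1 <- c("A", "C", "U")
a2 <- c("U", "G", "C")

strict <- swap_allele(current, a1, a2)
relaxed <- swap_allele(current, a1, a2, ignore_case = TRUE)

print(strict)
print(relaxed)
```

In the case-sensitive call the lower-case "c" in row 3 is left alone; with ignore_case = TRUE it matches Allele_2 and is swapped to "U".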
Benchmarking
Functions were defined above.
library(microbenchmark)
microbenchmark(ds440(df), ds440_quick(df), alistaire_split(df), alistaire_sub(df))
Unit: microseconds
expr min lq mean median uq max neval cld
ds440(df) 727.157 747.0065 801.3529 764.1475 781.6400 3658.410 100 c
ds440_quick(df) 339.784 354.7140 364.4484 364.1440 370.7955 421.400 100 b
alistaire_split(df) 138.929 144.9500 150.6806 148.5890 153.8570 261.136 100 a
alistaire_sub(df) 815.853 833.6630 855.2414 844.0770 857.2130 1499.370 100 c
microbenchmark(ds440(df_big), ds440_quick(df_big), alistaire_split(df_big))
Unit: milliseconds
expr min lq mean median uq max neval cld
ds440(df_big) 195.4233 199.2827 204.4030 204.9039 208.0195 216.1879 100 c
ds440_quick(df_big) 136.5985 139.3442 143.7216 145.1585 146.7837 153.9201 100 a
alistaire_split(df_big) 138.9117 146.3977 150.8772 148.2299 151.0723 278.7308 100 b
The clear winner for time in the small example is @alistaire's split function. However, as the df gets bigger, the alistaire_sub function breaks (regex won't handle >999) and my ds440_quick function actually works out slightly faster.
I am trying to pass all columns from a data.frame matching a criterion to a function within the summarize function of dplyr as follows:
df %>% group_by(Version, Type) %>%
summarize(mcll(TrueClass, starts_with("pred")))
Error: argument is of length zero
Is there a way to do this? A working example follows:
Build a simulated data.frame of sample predictions. These are interpreted as the output of a classification algorithm.
library(dplyr)
nrow <- 40
ncol <- 4
set.seed(567879)
getProbs <- function(i) {
p <- runif(i)
return(p / sum(p))
}
df <- data.frame(matrix(NA, nrow, ncol))
for (i in seq(nrow)) df[i, ] <- getProbs(ncol)
names(df) <- paste0("pred.", seq(ncol))
Add a column indicating the true class:
df$TrueClass <- factor(ceiling(runif(nrow, min = 0, max = ncol)))
Add categorical columns for subsetting:
df$Type <- c(rep("a", nrow / 2), rep("b", nrow / 2))
df$Version <- rep(1:4, times = nrow / 4)
Now I want to calculate the multiclass log loss for these predictions using the function below:
mcll <- function (act, pred)
{
if (class(act) != "factor") {
stop("act must be a factor")
}
pred[pred == 0] <- 1e-15
pred[pred == 1] <- 1 - 1e-15
dummies <- model.matrix(~act - 1)
if (nrow(dummies) != nrow(pred)) {
return(0)
}
return(-1 * (sum(dummies * log(pred)))/length(act))
}
This is easily done with the entire data set:
act <- df$TrueClass
pred <- df %>% select(starts_with("pred"))
mcll(act, pred)
But I want to use dplyr's group_by to calculate mcll for each subset of the data:
df %>% group_by(Version, Type) %>%
summarize(mcll(TrueClass, starts_with("pred")))
Ideally I could do this without changing the mcll() function, but I am open to doing that if it simplifies the other code.
Thanks!
EDIT: Note that the input to mcll is a vector of true values and a matrix of probabilities with one column for each "pred" column. For each subset of data, mcll should return a scalar. I can get exactly what I want with the code below, but I was hoping for something within the context of dplyr.
mcll_df <- data.frame(matrix(ncol = 3, nrow = 8))
names(mcll_df) <- c("Type", "Version", "mcll")
count = 1
for (ver in unique(df$Version)) {
for (type in unique(df$Type)) {
subdat <- df %>% filter(Type == type & Version == ver)
val <- mcll(subdat$TrueClass, subdat %>% select(starts_with("pred")))
mcll_df[count, ] <- c(Type = type, Version = ver, mcll = val)
count = count + 1
}
}
head(mcll_df)
Type Version mcll
1 a 1 1.42972507510096
2 b 1 1.97189000832723
3 a 2 1.97988830406062
4 b 2 1.21387875938737
5 a 3 1.30629638026735
6 b 3 1.48799237895462
This is easy to do using data.table:
library(data.table)
setDT(df)[, mcll(TrueClass, .SD), by = .(Version, Type), .SDcols = grep("^pred", names(df))]
# Version Type V1
#1: 1 a 1.429725
#2: 2 a 1.979888
#3: 3 a 1.306296
#4: 4 a 1.668330
#5: 1 b 1.971890
#6: 2 b 1.213879
#7: 3 b 1.487992
#8: 4 b 1.171286
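For comparison, the same per-group calculation can be sketched in base R with split() and sapply(). The toy data below is made up for this example (3 classes, 12 rows, grouped by Type only), and mcll is a condensed copy of the function from the question.

```r
# Multiclass log loss, condensed from the function in the question
mcll <- function(act, pred) {
  if (!is.factor(act)) stop("act must be a factor")
  pred[pred == 0] <- 1e-15
  pred[pred == 1] <- 1 - 1e-15
  dummies <- model.matrix(~ act - 1)
  if (nrow(dummies) != nrow(pred)) return(0)
  -sum(dummies * log(pred)) / length(act)
}

# Toy data (an assumption for this sketch): each row of pred sums to 1
set.seed(1)
pred <- matrix(runif(12 * 3), nrow = 12)
pred <- pred / rowSums(pred)
toy <- data.frame(pred)
names(toy) <- paste0("pred.", 1:3)
toy$TrueClass <- factor(rep(1:3, times = 4))
toy$Type <- rep(c("a", "b"), each = 6)

# One mcll value per group: split the rows, then pass all pred columns together
pred_cols <- grep("^pred", names(toy), value = TRUE)
by_group <- sapply(split(toy, toy$Type), function(g) {
  mcll(g$TrueClass, as.matrix(g[pred_cols]))
})
print(by_group)
```

The key point is that each group's prediction columns reach mcll together as one matrix, so the function returns a single scalar per group.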
I had to change the mcll function a little bit but then it worked. The problem is occurring with the second if statement. You are telling the function to get nrow(pred), but if you are summarizing over multiple columns you are actually only supplying a vector each time (because each column gets analyzed separately). Additionally, I switched the order of the arguments being entered into the function.
mcll <- function (pred, act)
{
if (class(act) != "factor") {
stop("act must be a factor")
}
pred[pred == 0] <- 1e-15
pred[pred == 1] <- 1 - 1e-15
dummies <- model.matrix(~act - 1)
if (nrow(dummies) != length(pred)) { # the main change is here
return(0)
}
return(-1 * (sum(dummies * log(pred)))/length(act))
}
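The switch from nrow() to length() matters because nrow() returns NULL for a plain vector, and a comparison against NULL has length zero, which if() cannot handle. A minimal base R illustration (the variable names are arbitrary):

```r
v <- c(0.2, 0.8)           # a plain vector, as supplied column by column
m <- matrix(v, nrow = 1)   # the same values as a one-row matrix

vector_nrow <- nrow(v)     # NULL: vectors have no dim attribute
vector_len <- length(v)
matrix_nrow <- nrow(m)

# Comparing with NULL gives logical(0), which is what breaks an if() condition
comparison <- vector_nrow != 2

print(vector_nrow)
print(comparison)
```

NROW() is a base R alternative that treats a vector as a one-column matrix and would return 2 for v.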
From there we can use the summarise_each function.
df %>% group_by(Version,Type) %>% summarise_each(funs(mcll(., TrueClass)), matches("pred"))
Version Type pred.1 pred.2 pred.3 pred.4
(int) (chr) (dbl) (dbl) (dbl) (dbl)
1 1 a 1.475232 1.972779 1.743491 1.161984
2 1 b 2.030829 1.331629 1.397577 1.484865
3 2 a 1.589256 1.740858 1.898906 2.005511
I checked this against a subset of the data and it looks like it works.
mcll(df$pred.1[which(df$Type=="a" & df$Version==1)],
df$TrueClass[which(df$Type=="a" & df$Version==1)])
[1] 1.475232 #pred.1 mcll when Version equals 1 and Type equals a.
I have a data.table with columns of different data types. My goal is to select only numeric columns and replace NA values within these columns by 0.
I am aware that replacing na-values with zero goes like this:
DT[is.na(DT)] <- 0
To select only numeric columns, I found this solution, which works fine:
DT[, as.numeric(which(sapply(DT,is.numeric))), with = FALSE]
I can achieve what I want by assigning
DT2 <- DT[, as.numeric(which(sapply(DT,is.numeric))), with = FALSE]
and then do:
DT2[is.na(DT2)] <- 0
But of course I would like to have my original DT modified by reference. With the following, however:
DT[, as.numeric(which(sapply(DT,is.numeric))), with = FALSE]
[is.na(DT[, as.numeric(which(sapply(DT,is.numeric))), with = FALSE])]<- 0
I get
"Error in [.data.table([...] i is invalid type (matrix)"
What am I missing?
Any help is much appreciated!!
We can use set
for(j in seq_along(DT)){
set(DT, i = which(is.na(DT[[j]]) & is.numeric(DT[[j]])), j = j, value = 0)
}
Or create an index for the numeric columns, loop through it and set the NA values to 0
ind <- which(sapply(DT, is.numeric))
for(j in ind){
set(DT, i = which(is.na(DT[[j]])), j = j, value = 0)
}
data
set.seed(24)
DT <- data.table(v1= c(NA, 1:4), v2 = c(NA, LETTERS[1:4]), v3=c(rnorm(4), NA))
I wanted to explore and possibly improve on the excellent answer given above by @akrun. Here's the data he used in his example:
library(data.table)
set.seed(24)
DT <- data.table(v1= c(NA, 1:4), v2 = c(NA, LETTERS[1:4]), v3=c(rnorm(4), NA))
DT
#> v1 v2 v3
#> 1: NA <NA> -0.5458808
#> 2: 1 A 0.5365853
#> 3: 2 B 0.4196231
#> 4: 3 C -0.5836272
#> 5: 4 D NA
And the two methods he suggested to use:
fun1 <- function(x){
for(j in seq_along(x)){
set(x, i = which(is.na(x[[j]]) & is.numeric(x[[j]])), j = j, value = 0)
}
}
fun2 <- function(x){
ind <- which(sapply(x, is.numeric))
for(j in ind){
set(x, i = which(is.na(x[[j]])), j = j, value = 0)
}
}
I think the first method above is really genius as it exploits the fact that NAs are typed.
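One detail worth spelling out about the first method: is.numeric() returns a single TRUE or FALSE for the whole column, and that value is recycled against the element-wise is.na() result. For a non-numeric column the combined condition is FALSE everywhere, so which() returns no rows and set() has nothing to change. A small base R sketch with made-up vectors:

```r
num_col <- c(NA, 1, 2)
chr_col <- c(NA, "A", "B")

# is.numeric() is a scalar, so it switches the whole is.na() mask on or off
rows_num <- which(is.na(num_col) & is.numeric(num_col))
rows_chr <- which(is.na(chr_col) & is.numeric(chr_col))

print(rows_num)
print(rows_chr)
```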
First of all, even though .SD is not available in i argument, it is possible to pull the column name with get(), so I thought I could sub-assign data.table this way:
fun3 <- function(x){
nms <- names(x)[sapply(x, is.numeric)]
for(j in nms){
x[is.na(get(j)), (j):=0]
}
}
Generic case, of course would be to rely on .SD and .SDcols to work only on numeric columns
fun4 <- function(x){
nms <- names(x)[sapply(x, is.numeric)]
x[, (nms):=lapply(.SD, function(i) replace(i, is.na(i), 0)), .SDcols=nms]
}
But then I thought to myself, "Hey, who says we can't go all the way to base R for this sort of operation?" Here's a simple lapply() with a conditional statement, wrapped in setDT():
fun5 <- function(x){
setDT(
lapply(x, function(i){
if(is.numeric(i))
i[is.na(i)]<-0
i
})
)
}
Finally, we could use the same idea of a conditional to limit the columns on which we apply set():
fun6 <- function(x){
for(j in seq_along(x)){
if (is.numeric(x[[j]]) )
set(x, i = which(is.na(x[[j]])), j = j, value = 0)
}
}
Here are the benchmarks:
microbenchmark::microbenchmark(
for.set.2cond = fun1(copy(DT)),
for.set.ind = fun2(copy(DT)),
for.get = fun3(copy(DT)),
for.SDcol = fun4(copy(DT)),
for.list = fun5(copy(DT)),
for.set.if =fun6(copy(DT))
)
#> Unit: microseconds
#> expr min lq mean median uq max neval cld
#> for.set.2cond 59.812 67.599 131.6392 75.5620 114.6690 4561.597 100 a
#> for.set.ind 71.492 79.985 142.2814 87.0640 130.0650 4410.476 100 a
#> for.get 553.522 569.979 732.6097 581.3045 789.9365 7157.202 100 c
#> for.SDcol 376.919 391.784 527.5202 398.3310 629.9675 5935.491 100 b
#> for.list 69.722 81.932 137.2275 87.7720 123.6935 3906.149 100 a
#> for.set.if 52.380 58.397 116.1909 65.1215 72.5535 4570.445 100 a
You need tidyverse purrr function map_if along with ifelse to do the job in a single line of code.
library(tidyverse)
library(data.table)
set.seed(24)
DT <- data.table(v1= sample(c(1:3,NA),20,replace = T), v2 = sample(c(LETTERS[1:3],NA),20,replace = T), v3=sample(c(1:3,NA),20,replace = T))
The single line of code below takes a DT with numeric and non-numeric columns and operates only on the numeric columns, replacing the NAs with 0:
DT %>% map_if(is.numeric,~ifelse(is.na(.x),0,.x)) %>% as.data.table
So, tidyverse can be less verbose than data.table sometimes :-)
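For readers without either package, roughly the same result can be sketched in base R on a plain data.frame. Note that this makes a modified copy rather than updating by reference, so it does not meet the original requirement for a data.table; the data and the names dat and is_num are made up for this example.

```r
dat <- data.frame(v1 = c(NA, 1:4),
                  v2 = c(NA, LETTERS[1:4]),
                  v3 = c(0.5, NA, 1.5, 2, NA),
                  stringsAsFactors = FALSE)

# Pick out the numeric columns, then replace NA with 0 in those columns only
is_num <- vapply(dat, is.numeric, logical(1))
dat[is_num] <- lapply(dat[is_num], function(col) replace(col, is.na(col), 0))

print(dat)
```

Because the replacement value 0 is a double, the integer column v1 becomes double here; using 0L for integer columns would keep their type.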
I want to assign multiple variables in a single line in R. Is it possible to do something like this?
values # initialize some vector of values
(a, b) = values[c(2,4)] # assign a and b to values at 2 and 4 indices of 'values'
Typically I want to assign about 5-6 variables in a single line, instead of having multiple lines. Is there an alternative?
I put together an R package zeallot to tackle this very problem. zeallot includes an operator (%<-%) for unpacking, multiple, and destructuring assignment. The LHS of the assignment expression is built using calls to c(). The RHS of the assignment expression may be any expression which returns or is a vector, list, nested list, data frame, character string, date object, or custom objects (assuming there is a destructure implementation).
Here is the initial question reworked using zeallot (latest version, 0.0.5).
library(zeallot)
values <- c(1, 2, 3, 4) # initialize a vector of values
c(a, b) %<-% values[c(2, 4)] # assign `a` and `b`
a
#[1] 2
b
#[1] 4
For more examples and information one can check out the package vignette.
There is a great answer on the Struggling Through Problems Blog
This is taken from there, with very minor modifications.
USING THE FOLLOWING THREE FUNCTIONS
(Plus one for allowing for lists of different sizes)
# Generic form
'%=%' = function(l, r, ...) UseMethod('%=%')
# Binary Operator
'%=%.lbunch' = function(l, r, ...) {
Envir = as.environment(-1)
if (length(r) > length(l))
warning("RHS has more args than LHS. Only first", length(l), "used.")
if (length(l) > length(r)) {
warning("LHS has more args than RHS. RHS will be repeated.")
r <- extendToMatch(r, l)
}
for (II in 1:length(l)) {
do.call('<-', list(l[[II]], r[[II]]), envir=Envir)
}
}
# Used if LHS is larger than RHS
extendToMatch <- function(source, destin) {
s <- length(source)
d <- length(destin)
# Assume that destin is a length when it is a single number and source is not
if(d==1 && s>1 && !is.null(as.numeric(destin)))
d <- destin
dif <- d - s
if (dif > 0) {
source <- rep(source, ceiling(d/s))[1:d]
}
return (source)
}
# Grouping the left hand side
g = function(...) {
List = as.list(substitute(list(...)))[-1L]
class(List) = 'lbunch'
return(List)
}
Then to execute:
Group the left hand side using the new function g()
The right hand side should be a vector or a list
Use the newly-created binary operator %=%
# Example Call; Note the use of g() AND `%=%`
# Right-hand side can be a list or vector
g(a, b, c) %=% list("hello", 123, list("apples, oranges"))
g(d, e, f) %=% 101:103
# Results:
> a
[1] "hello"
> b
[1] 123
> c
[[1]]
[1] "apples, oranges"
> d
[1] 101
> e
[1] 102
> f
[1] 103
Example using lists of different sizes:
Longer Left Hand Side
g(x, y, z) %=% list("first", "second")
# Warning message:
# In `%=%.lbunch`(g(x, y, z), list("first", "second")) :
# LHS has more args than RHS. RHS will be repeated.
> x
[1] "first"
> y
[1] "second"
> z
[1] "first"
Longer Right Hand Side
g(j, k) %=% list("first", "second", "third")
# Warning message:
# In `%=%.lbunch`(g(j, k), list("first", "second", "third")) :
# RHS has more args than LHS. Only first2used.
> j
[1] "first"
> k
[1] "second"
Consider using functionality included in base R.
For instance, create a 1 row dataframe (say V) and initialize your variables in it. Now you can assign to multiple variables at once V[,c("a", "b")] <- values[c(2, 4)], call each one by name (V$a), or use many of them at the same time (values[c(5, 6)] <- V[,c("a", "b")]).
If you get lazy and don't want to go around calling variables from the dataframe, you could attach(V) (though I personally don't ever do it).
# Initialize values
values <- 1:100
# V for variables
V <- data.frame(a=NA, b=NA, c=NA, d=NA, e=NA)
# Assign elements from a vector
V[, c("a", "b", "e")] = values[c(2,4, 8)]
# Also other class
V[, "d"] <- "R"
# Use your variables
V$a
V$b
V$c # OOps, NA
V$d
V$e
Here is my idea. The syntax is probably quite simple:
`%tin%` <- function(x, y) {
mapply(assign, as.character(substitute(x)[-1]), y,
MoreArgs = list(envir = parent.frame()))
invisible()
}
c(a, b) %tin% c(1, 2)
This gives:
> a
Error: object 'a' not found
> b
Error: object 'b' not found
> c(a, b) %tin% c(1, 2)
> a
[1] 1
> b
[1] 2
This is not well tested, though.
A potentially dangerous (inasmuch as using assign is risky) option would be to Vectorize assign:
assignVec <- Vectorize("assign",c("x","value"))
#.GlobalEnv is probably not what one wants in general; see below.
assignVec(c('a','b'),c(0,4),envir = .GlobalEnv)
a b
0 4
> b
[1] 4
> a
[1] 0
Or I suppose you could vectorize it yourself manually with your own function using mapply that maybe uses a sensible default for the envir argument. For instance, Vectorize will return a function with the same environment properties of assign, which in this case is namespace:base, or you could just set envir = parent.env(environment(assignVec)).
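A minimal sketch of that manual vectorisation, assuming that the caller's frame is the sensible default for envir. assign_many is a hypothetical name.

```r
assign_many <- function(names, values, envir = parent.frame()) {
  force(envir)  # resolve the caller's environment before doing anything else
  stopifnot(length(names) == length(values))
  # Call assign(name, value, envir = envir) once per name/value pair
  invisible(mapply(assign, names, values, MoreArgs = list(envir = envir)))
}

assign_many(c("a", "b"), c(0, 4))
print(a)
print(b)
```

Because envir defaults to parent.frame(), calling assign_many inside another function creates the variables in that function's frame instead of the global environment.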
As others explained, there doesn't seem to be anything built in. ...but you could design a vassign function as follows:
vassign <- function(..., values, envir=parent.frame()) {
vars <- as.character(substitute(...()))
values <- rep(values, length.out=length(vars))
for(i in seq_along(vars)) {
assign(vars[[i]], values[[i]], envir)
}
}
# Then test it
vals <- 11:14
vassign(aa,bb,cc,dd, values=vals)
cc # 13
One thing to consider though is how to handle the cases where you e.g. specify 3 variables and 5 values or the other way around. Here I simply repeat (or truncate) the values to be of the same length as the variables. Maybe a warning would be prudent. But it allows the following:
vassign(aa,bb,cc,dd, values=0)
cc # 0
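The warning mentioned above could look roughly like this. vassign_checked is a hypothetical variant of vassign; it captures the variable names with substitute(list(...)) and warns whenever the number of values differs from the number of variables.

```r
vassign_checked <- function(..., values, envir = parent.frame()) {
  force(envir)
  # Capture the unevaluated variable names as character strings
  vars <- vapply(as.list(substitute(list(...)))[-1L], deparse, character(1))
  if (length(values) != length(vars)) {
    warning(length(vars), " variables but ", length(values),
            " values; values are recycled or truncated")
  }
  values <- rep(values, length.out = length(vars))
  for (i in seq_along(vars)) {
    assign(vars[[i]], values[[i]], envir = envir)
  }
  invisible(vars)
}

vassign_checked(aa, bb, values = 1:2)  # lengths match, no warning
print(aa)
print(bb)
```

A call such as vassign_checked(cc, dd, values = 0) still assigns 0 to both variables, but now reports the length mismatch.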
list2env(setNames(as.list(rep(2,5)), letters[1:5]), .GlobalEnv)
Served my purpose, i.e., assigning five 2s into first five letters.
Had a similar problem recently and here was my try using purrr::walk2
purrr::walk2(letters,1:26,assign,envir =parent.frame())
https://stat.ethz.ch/R-manual/R-devel/library/base/html/list2env.html:
list2env(
list(
a=1,
b=2:4,
c=rpois(10,10),
d=gl(3,4,LETTERS[9:11])
),
envir=.GlobalEnv
)
If your only requirement is to have a single line of code, then how about:
> a<-values[2]; b<-values[4]
I'm afraid that the elegant solution you are looking for (like c(a, b) = c(2, 4)) unfortunately does not exist. But don't give up, I'm not sure! The nearest solution I can think of is this one:
attach(data.frame(a = 2, b = 4))
or if you are bothered with warnings, switch them off:
attach(data.frame(a = 2, b = 4), warn.conflicts = FALSE)
But I suppose you're not satisfied with this solution, I wouldn't be either...
R> values = c(1,2,3,4)
R> a <- values[2]; b <- values[3]; c <- values[4]
R> a
[1] 2
R> b
[1] 3
R> c
[1] 4
Another version with recursion:
let <- function(..., env = parent.frame()) {
f <- function(x, ..., i = 1) {
if(is.null(substitute(...))){
if(length(x) == 1)
x <- rep(x, i - 1);
stopifnot(length(x) == i - 1)
return(x);
}
val <- f(..., i = i + 1);
assign(deparse(substitute(x)), val[[i]], env = env);
return(val)
}
f(...)
}
example:
> let(a, b, 4:10)
[1] 4 5 6 7 8 9 10
> a
[1] 4
> b
[1] 5
> let(c, d, e, f, c(4, 3, 2, 1))
[1] 4 3 2 1
> c
[1] 4
> f
[1] 1
My version:
let <- function(x, value) {
mapply(
assign,
as.character(substitute(x)[-1]),
value,
MoreArgs = list(envir = parent.frame()))
invisible()
}
example:
> let(c(x, y), 1:2 + 3)
> x
[1] 4
> y
[1] 5
Combining some of the answers given here + a little bit of salt, how about this solution:
assignVec <- Vectorize("assign", c("x", "value"))
`%<<-%` <- function(x, value) invisible(assignVec(x, value, envir = .GlobalEnv))
c("a", "b") %<<-% c(2, 4)
a
## [1] 2
b
## [1] 4
I used this to add the R section here: http://rosettacode.org/wiki/Sort_three_variables#R
Caveat: It only works for assigning global variables (like <<-). If there is a better, more general solution, pls. tell me in the comments.
For a named list, use
list2env(mylist, environment())
For instance:
mylist <- list(foo = 1, bar = 2)
list2env(mylist, environment())
will add foo = 1, bar = 2 to the current environment, and overwrite any objects with those names. This is equivalent to
mylist <- list(foo = 1, bar = 2)
foo <- mylist$foo
bar <- mylist$bar
This works in a function, too:
f <- function(mylist) {
list2env(mylist, environment())
foo * bar
}
mylist <- list(foo = 1, bar = 2)
f(mylist)
However, it is good practice to name the elements you want to include in the current environment, lest you overwrite another object... so preferably write
list2env(mylist[c("foo", "bar")], environment())
Finally, if you want different names for the new imported objects, write:
list2env(`names<-`(mylist[c("foo", "bar")], c("foo2", "bar2")), environment())
which is equivalent to
foo2 <- mylist$foo
bar2 <- mylist$bar
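The renaming step can also be written with setNames(), which some readers may find easier to parse than the `names<-` call. A minimal sketch; the extra element baz is made up to show that unselected elements are not imported.

```r
mylist <- list(foo = 1, bar = 2, baz = 3)

# Select two elements, rename them, and import them into the current environment
list2env(setNames(mylist[c("foo", "bar")], c("foo2", "bar2")), environment())

print(foo2)
print(bar2)
```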