I am trying to write a function that would mutate multiple columns in a df and produce a new column for each recoded variable. In this case the mutation I am running is to subtract each element in the column from 15. I was able to write the following code for three columns, which worked, but in the future I want to run something like this over 20+ columns and writing out each new column name (as you do in mutate) seems burdensome.
I can't seem to get lapply to work with a recode or mutate function to produce new columns.
df2 <- mutate(df1, new_col1 = 15-old_col1,
new_col2 = 15 - old_col2, new_col3 = 15 - old_col3)
A data.table solution, assuming you want to mutate all of the columns* (see below for a more flexible version).
*as @sb0709 mentions in the comments, mutate_all would do this as well.
library( data.table )
df <- data.table( old_col_1 = 20:24,
old_col_2 = 55:59,
old_col_3 = rnorm( 5, 100, 30 ) )
df[ , sub( "old", "new", names( df ) ) := lapply( .SD, function(x) 15-x ) ]
Which gives:
R> df
old_col_1 old_col_2 old_col_3 new_col_1 new_col_2 new_col_3
1: 20 55 86.29104 -5 -40 -71.29104
2: 21 56 144.21564 -6 -41 -129.21564
3: 22 57 104.84574 -7 -42 -89.84574
4: 23 58 93.18084 -8 -43 -78.18084
5: 24 59 104.96188 -9 -44 -89.96188
If you want to select fewer than all of the columns, subset both the names vector and the .SD list in the same way. For example, to run your mutation on only columns 2 and 3:
df[ , sub( "old", "new", names( df )[2:3] ) := lapply( .SD[,2:3], function(x) 15-x ) ]
Which instead gives:
R> df
old_col_1 old_col_2 old_col_3 new_col_2 new_col_3
1: 20 55 138.28667 -40 -123.28667
2: 21 56 69.03836 -41 -54.03836
3: 22 57 147.39790 -42 -132.39790
4: 23 58 88.15505 -43 -73.15505
5: 24 59 28.96437 -44 -13.96437
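For comparison, the same idea can be sketched in base R without data.table. This is only an illustration: the sample values and the names old_cols and new_cols are assumptions, not part of the question.

```r
# Hypothetical sample data; column names follow the question's pattern.
df1 <- data.frame(old_col1 = 20:24,
                  old_col2 = 55:59,
                  old_col3 = c(10, 20, 30, 40, 50))

# Pick the columns to recode and derive the matching new names.
old_cols <- grep("^old_", names(df1), value = TRUE)
new_cols <- sub("^old", "new", old_cols)

# lapply() returns one recoded vector per column; assigning that list to
# the new names appends the columns and leaves the originals untouched.
df2 <- df1
df2[new_cols] <- lapply(df1[old_cols], function(x) 15 - x)
```

Because the new names are computed from the old ones, the same three lines should work unchanged for 20+ columns.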
Suppose, I have a dataframe, df, and I want to create a new column called "c" based on the addition of two existing columns, "a" and "b". I would simply run the following code:
df$c <- df$a + df$b
But I also want to do this for many other columns. So why won't my code below work?
# Reproducible data:
martial_arts <- data.frame(gym_branch=c("downtown_a", "downtown_b", "uptown", "island"),
day_boxing=c(5,30,25,10),day_muaythai=c(34,18,20,30),
day_bjj=c(0,0,0,0),day_judo=c(10,0,5,0),
evening_boxing=c(50,45,32,40), evening_muaythai=c(50,50,45,50),
evening_bjj=c(60,60,55,40), evening_judo=c(25,15,30,0))
# Creating a list of the new column names of the columns that need to be added to the martial_arts dataframe:
pattern<-c("_boxing","_muaythai","_bjj","_judo")
d<- expand.grid(paste0("martial_arts$total",pattern))
# Creating lists of the columns that will be added to each other:
e<- names(martial_arts %>% select(day_boxing:day_judo))
f<- names(martial_arts %>% select(evening_boxing:evening_judo))
# Writing a function and using mapply:
kick_him <- function(d,e,f){d <- rowSums(martial_arts[ , c(e, f)], na.rm=T)}
mapply(kick_him,d,e,f)
Now, mapply produces the correct results in terms of the addition:
> mapply(kick_him,d,e,f)
Var1 <NA> <NA> <NA>
[1,] 55 84 60 35
[2,] 75 68 60 15
[3,] 57 65 55 35
[4,] 50 80 40 0
But it doesn't add the new columns to the martial_arts dataframe. The function in theory should do the following
martial_arts$total_boxing <- martial_arts$day_boxing + martial_arts$evening_boxing
...
...
martial_arts$total_judo <- martial_arts$day_judo + martial_arts$evening_judo
and add four new total columns to martial_arts.
So what am I doing wrong?
The assignment is the problem here: instead of passing martial_arts$total_boxing as a string, use the bare column name "total_boxing", and put it on the left-hand side of the Map/mapply call. Since the OP already stored the 'martial_arts$' prefix in the Var1 column of 'd', we strip that prefix and then do the assignment:
kick_him <- function(e,f){rowSums(martial_arts[ , c(e, f)], na.rm=TRUE)}
martial_arts[sub(".*\\$", "", d$Var1)] <- Map(kick_him, e, f)
Check the dataset now:
> martial_arts
gym_branch day_boxing day_muaythai day_bjj day_judo evening_boxing evening_muaythai evening_bjj evening_judo total_boxing total_muaythai total_bjj total_judo
1 downtown_a 5 34 0 10 50 50 60 25 55 84 60 35
2 downtown_b 30 18 0 0 45 50 60 15 75 68 60 15
3 uptown 25 20 0 5 32 45 55 30 57 65 55 35
4 island 10 30 0 0 40 50 40 0 50 80 40 0
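If you are free to rebuild the name vectors, a variation is to derive all three sets of names from the sport suffixes, so no prefix has to be stripped afterwards. This is a sketch only; sports, day_cols, evening_cols and total_cols are names I made up for the illustration.

```r
# Data from the question.
martial_arts <- data.frame(
  gym_branch = c("downtown_a", "downtown_b", "uptown", "island"),
  day_boxing = c(5, 30, 25, 10), day_muaythai = c(34, 18, 20, 30),
  day_bjj = c(0, 0, 0, 0), day_judo = c(10, 0, 5, 0),
  evening_boxing = c(50, 45, 32, 40), evening_muaythai = c(50, 50, 45, 50),
  evening_bjj = c(60, 60, 55, 40), evening_judo = c(25, 15, 30, 0)
)

# Build matching name vectors from the suffixes (hypothetical helper names).
sports <- c("boxing", "muaythai", "bjj", "judo")
day_cols <- paste0("day_", sports)
evening_cols <- paste0("evening_", sports)
total_cols <- paste0("total_", sports)

# Map() walks the two name vectors in parallel and returns a list of sums;
# assigning the list to new column names is what adds the columns.
martial_arts[total_cols] <- Map(
  function(d, e) rowSums(martial_arts[, c(d, e)], na.rm = TRUE),
  day_cols, evening_cols
)
```

The key point is the same as above: the function only returns values, and the assignment to the data frame happens outside it.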
I would like to add together the numbers stored in col3 of the data below.
I have tried using gsub to insert a + between the numbers and evaluate the result in R.
I have also tried using separate and then summing across the resulting columns.
train <- data.table(col1=c(rep('a0001',4),rep('b0002',4)), col2=c(seq(1,4,1),seq(1,4,1)), col3=c("12 43 543 1232 43 543", "","","","15 24 85 64 85 25 46","","658 1568 12 584 15684",""))
I would like the result to be the sum of the numbers in col3, by row, as in col4:
result<-data.frame(col1=c("a0001","b0002"), col3=c("12 43 543 1232 43 543", "","","","15 24 85 64 85 25 46","","658 1568 12 584 15684",""),col4=c("2416",'18850'))
Grouped by 'col1', we can split by the space, unlist, convert to numeric, get the sum and assign (:=) to create a new column:
train[, col4 := sum(as.numeric(unlist(strsplit(col3, ' '))), na.rm = TRUE), col1]
Or another option is scan
train[, col4 := sum(scan(text = col3, what = numeric(), quiet = TRUE)), col1]
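Both versions above give the total per 'col1' group. If you want to see the per-row sums first, here is a base R sketch on the same strings. The names row_sums and group_sums are just for illustration.

```r
# Same values as in the question, kept as plain vectors.
col1 <- c(rep("a0001", 4), rep("b0002", 4))
col3 <- c("12 43 543 1232 43 543", "", "", "",
          "15 24 85 64 85 25 46", "", "658 1568 12 584 15684", "")

# Split each string on spaces and sum the pieces; an empty string splits
# into zero pieces, so its sum is 0.
row_sums <- vapply(strsplit(col3, " ", fixed = TRUE),
                   function(parts) sum(as.numeric(parts)),
                   numeric(1))

# Add the row sums up within each col1 group.
group_sums <- tapply(row_sums, col1, sum)
```

Here group_sums should match the col4 values the question asks for.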
I have one column with 950 numbers. I want to sum rows 1:40 and place the result in a new column on row 50, then sum rows 2:41 and place it on row 51 in the new column, and so on. How do I do this?
You can use the function RcppRoll::roll_sum()
Hope this helps:
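r <- 50 # row that receives the first sum, sum(1:40)
df1 <- data.frame(c1 = 1:951)
v1 <- RcppRoll::roll_sum(df1$c1, n=40) # v1[k] is sum(c1[k:(k+39)])
df1$c2 <- c(rep(NA, r - 1), v1[1:(nrow(df1) - r + 1)])
View(df1) # in RStudio
The last few rolling sums would land past the final row, so they are dropped here; you decide what should happen with them.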
You can use RcppRoll::roll_sum() and dplyr::lag()...
df <- data.frame(v = 1:950)
library(dplyr)
library(RcppRoll)
range <- 40 # how many values to sum, i.e. window size
offset <- 10 # e.g sum(1:40) goes to row 50
df <- mutate(df, roll_sum = RcppRoll::roll_sum(lag(v, n = offset),
n = range, fill = NA, align = "right"))
df[(range+offset):(range+offset+5), ]
# v roll_sum
# 50 50 820
# 51 51 860
# 52 52 900
# 53 53 940
# 54 54 980
# 55 55 1020
sum(1:range); sum(2:(range+1))
# [1] 820
# [1] 860
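If you would rather not depend on RcppRoll, the same shifted rolling sum can be sketched in base R with cumsum(). The variable names (window, target, win_sums, out) are mine, and this assumes the column has no NA values.

```r
v <- 1:950
window <- 40   # how many values to sum
target <- 50   # row that should receive sum(v[1:40])

# Rolling sums from a cumulative sum: win_sums[k] is sum(v[k:(k + window - 1)]).
cs <- cumsum(v)
win_sums <- cs[window:length(v)] - c(0, cs[seq_len(length(v) - window)])

# Shift the sums down so the first one lands on row `target`; sums that
# would fall past the last row are dropped.
out <- rep(NA_real_, length(v))
n_fit <- length(v) - target + 1
out[target:length(v)] <- win_sums[seq_len(n_fit)]

df <- data.frame(v = v, roll_sum = out)
```

Rows 1 to 49 stay NA, row 50 holds the sum of rows 1:40, row 51 the sum of rows 2:41, and so on.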
Here is a code to generate a data.frame :
ref_variables=LETTERS[1:10]
row=100
d0=seq(1:100)
for (i in seq_along(ref_variables)){
dtemp=sample(seq(1:row),row,TRUE)
d0=data.frame(d0,dtemp)
}
d0[,1]=NULL
names(d0)=ref_variables
I have a dataset, data.frame or data.table, whatever.
Let's say I want to modify the columns 2 to 4 by dividing each of them by the first one. Of course, I can write a loop like this:
columns_name_to_divide=c("B","C","H")
column_divisor="A"
for (i in seq_along(columns_name_to_divide)){
d0[columns_name_to_divide[i]] = d0[columns_name_to_divide[i]] / d0[column_divisor]
}
But is there a more elegant way to do it?
> d0[2:4] <- d0[,2:4]/d0[,1]
This replaces the original values with the result of dividing columns 2, 3 and 4 by column 1. The rest remain the same.
If you want to create 3 new columns in d0 instead, assign to new column positions. This does not replace the original values in columns 2, 3 and 4; the calculated values go into columns 11, 12 and 13 respectively.
> dim(d0)
# [1] 100 10
> d0[11:13] <- d0[,2:4]/d0[,1]
> dim(d0)
# [1] 100 13
To round the new values to 2 decimal places, wrap the division in round() like below:
> d0[2:4] <- round(d0[,2:4]/d0[,1],2) # Original values substituted at 2,3,4
# OR
> d0[11:13] <- round(d0[,2:4]/d0[,1],2) # New columns added, original columns are untouched.
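The question selects the columns by name (B, C and H) rather than by position, so here is a name-based sketch of the same division in base R. The small random data frame and the copy called original are only there to make the example self-contained.

```r
# Small stand-in for the question's d0 (hypothetical values).
set.seed(1)
d0 <- as.data.frame(matrix(sample(1:100, 50, replace = TRUE), ncol = 10))
names(d0) <- LETTERS[1:10]
original <- d0  # kept only so the result can be compared afterwards

columns_name_to_divide <- c("B", "C", "H")
column_divisor <- "A"

# lapply() divides each selected column by the divisor column; assigning
# the list back by name overwrites just those columns.
d0[columns_name_to_divide] <- lapply(
  d0[columns_name_to_divide],
  function(x) x / d0[[column_divisor]]
)
```

The right-hand side is evaluated before the assignment, so this is safe even if the divisor column were among the columns being overwritten.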
We can use set from data.table, which makes this more efficient because the overhead of [.data.table is avoided when it is called multiple times (though that hardly matters in this case).
library(data.table)
setDT(d0)
for(j in columns_name_to_divide){
set(d0, i = NULL, j = j, value = d0[[j]]/d0[[column_divisor]])
}
Or using lapply
setDT(d0)[, (columns_name_to_divide) := lapply(.SD, `/`,
d0[[column_divisor]]), .SDcols = columns_name_to_divide]
Or an option using dplyr (note that mutate_each_ and funs are deprecated in current dplyr releases):
library(dplyr)
library(magrittr)
d0 %<>%
mutate_each_(funs(./d0[[column_divisor]]), columns_name_to_divide)
head(d0)
# A B C D E F G H I J
#1 60 0.4000000 1.1500000 6 86 27 19 0.150000 94 97
#2 11 0.6363636 0.3636364 25 52 44 82 8.818182 84 68
#3 80 0.8750000 1.1375000 72 34 56 69 0.125000 34 17
#4 77 0.3116883 1.0259740 9 44 87 61 1.064935 79 40
#5 18 0.3333333 5.0555556 60 69 62 89 2.166667 21 34
#6 42 1.3333333 2.3095238 61 20 87 95 1.428571 78 63
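Since mutate_each_() and funs() are deprecated in current dplyr, here is a sketch of the same step with across(), assuming dplyr 1.0.0 or later. The tiny data frame and the helper vector divisor are made up for the example.

```r
library(dplyr)

# Hypothetical small data frame with the same column names as above.
d0 <- data.frame(A = c(2, 4, 5),
                 B = c(4, 8, 10),
                 C = c(1, 2, 5),
                 H = c(6, 4, 20),
                 D = c(7, 8, 9))

columns_name_to_divide <- c("B", "C", "H")
column_divisor <- "A"

# Pull the divisor out once, then divide every selected column by it.
divisor <- d0[[column_divisor]]
d0 <- d0 %>%
  mutate(across(all_of(columns_name_to_divide), ~ .x / divisor))
```

all_of() is used so that the character vector of names is treated as an explicit selection.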
Benchmarks
set.seed(42)
d1 <- as.data.frame(matrix(sample(1:9, 1e7*7, replace=TRUE), ncol=7))
d2 <- copy(d1)
d3 <- copy(d1)
system.time({
d2 %<>%
mutate_each(funs(./d2[["V2"]]), V4:V7)
})
# user system elapsed
# 0.52 0.39 0.91
system.time({
d1[,4:7] <- d1[,4:7]/d1[,2]
})
# user system elapsed
# 1.72 0.72 2.44
system.time({
setDT(d3)
for(j in 4:7){
set(d3, i = NULL, j = j, value = d3[[j]]/d3[["V2"]])
}
})
# user system elapsed
# 0.32 0.16 0.47
You can do this:
# sample data for reproducible example:
df <- data.frame(V1=rep(10,5),
V2=rep(20,5),
V3=rep(30,5),
V4=rep(40,5),
V5=rep(50,5))
library(data.table)
cols <- names(df)[2:4] # columns to divide
col1 <- names(df)[1] # divisor column
setDT(df)[, (cols) := lapply (cols, function(x) get(x) / get(col1) )]
I am trying to apply a function over all rows and columns of two dataframes, but I don't know how to do it with apply.
I think the following script explains what I intend to do and the way I tried to solve it. Any advice would be warmly appreciated! Please note that simplefunction is only an example function to keep things simple.
# some data and a function
df1<-data.frame(name=c("aa","bb","cc","dd","ee"),a=sample(1:50,5),b=sample(1:50,5),c=sample(1:50,5))
df2<-data.frame(name=c("aa","bb","cc","dd","ee"),a=sample(1:50,5),b=sample(1:50,5),c=sample(1:50,5))
simplefunction<-function(a,b){a+b}
# apply on a single row
simplefunction(df1[1,2],df2[1,2])
# apply over all columns
apply(?)
## apply over all columns and rows
# create df to receive results
df3<-df2
# loop it
for (i in 2:5)df3[i]<-apply(?)
My first mapply answer!! For your simple example you have...
mapply( FUN = `+` , df1[,-1] , df2[,-1] )
# a b c
# [1,] 60 35 75
# [2,] 57 39 92
# [3,] 72 71 48
# [4,] 31 19 85
# [5,] 47 66 58
You can extend it like so...
mapply( FUN = function(x,y,z,etc){ simplefunctioncodehere} , df1[,-1] , df2[,-1] , ... other dataframes here )
The dataframes will be passed in order to the function, so in this example df1 would be x, df2 would be y and z and etc would be some other dataframes that you specify in that order. Hopefully that makes sense. mapply works column by column: it takes the first column of every dataframe and applies the function to those vectors, then the second column of every dataframe, and so on. Because the function receives whole columns, a vectorised function such as + handles all rows at once.
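To make the column-by-column behaviour concrete, and to fill the df3 from the question, here is a small sketch with fixed numbers instead of sample(). The values are made up, and Map() is used for the assignment because it returns a list of columns.

```r
# Hypothetical two-row data so the result is easy to check by eye.
df1 <- data.frame(name = c("aa", "bb"), a = c(1, 2), b = c(10, 20))
df2 <- data.frame(name = c("aa", "bb"), a = c(3, 4), b = c(30, 40))
simplefunction <- function(a, b) { a + b }

# Each call receives one whole column from each data frame, so the
# length seen inside the function equals the number of rows.
lengths_seen <- mapply(function(x, y) length(x), df1[-1], df2[-1])

# Map() returns a list with one result vector per column pair; assigning
# it to df3[-1] fills every column except the first (name).
df3 <- df2
df3[-1] <- Map(simplefunction, df1[-1], df2[-1])
```

No explicit loop over rows or columns is needed as long as simplefunction is vectorised.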
You can also use Reduce:
set.seed(45) # for reproducibility; must be set before df1 and df2 are created
Reduce(function(x,y) { x + y}, list(df1[, -1], df2[,-1]))
# a b c
# 1 53 22 23
# 2 64 28 91
# 3 19 56 51
# 4 38 41 53
# 5 28 42 30
You can just do :
df1[,-1] + df2[,-1]
Which gives :
a b c
1 52 24 37
2 65 63 62
3 31 90 89
4 90 35 33
5 51 33 45