I am struggling with creating multiple columns in a DRY way.
I have searched Google and Stack Exchange and I am still struggling with the code below.
df <- data.frame(red = 1:10, blue=seq(1,30,3))
myfunction <- function(x){
log(x) + 10
}
df$green <- myfunction(df$red)
df$yellow <- myfunction(df$blue)
My questions are:
how can I create the green and yellow columns using a for loop?
how can I create the green and yellow using an apply function?
I've spent a bit of time working on these kinds of things. Most of the time you're going to want to either know all the names of the new variables or have them work in an orderly pattern so you can just paste together the names with an indexing variable.
df <- data.frame(red = 1:10, blue=seq(1,30,3))
myfunction <- function(x){
log(x) + 10
}
newcols = apply(X = df, MARGIN = 2, FUN = myfunction)
colnames(newcols) = c("green","yellow")
df = cbind(df,newcols)
# Alternative
df <- data.frame(red = 1:10, blue=seq(1,30,3))
colors = c("green", "yellow")
for(i in 1:length(colors)){
df[,colors[i]] = myfunction(df[,i])
}
As pointed out by Sotos, apply is slower than lapply. So I believe the optimal solution is:
df[,c("green","yellow")] <- lapply(df[c("red","blue")], myfunction)
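When the new names follow an orderly pattern, as described above, they can be built with paste0 rather than typed by hand. This is only a sketch: the "_log" suffix and the names src and new_names are made up for illustration and are not part of the original question.

```r
df <- data.frame(red = 1:10, blue = seq(1, 30, 3))
myfunction <- function(x) {
  log(x) + 10
}

# Hypothetical naming scheme: derive each new name from its source column
src <- c("red", "blue")
new_names <- paste0(src, "_log")

# lapply returns one vector per source column; assigning the list
# to new column names appends them to the data frame
df[new_names] <- lapply(df[src], myfunction)
str(df)
```

Restricting lapply to df[src] keeps the call safe to re-run after the data frame has already gained extra columns.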
I have a dataset with x number of columns, consisting of groups of test results, for example test1_1, test1_2 etc. Each set of tests has a different number of test results associated with it so the actual numbers aren't the same across each test. The final column is my target variable. I'm looking to establish which tests are correlated with the target variable, but I also want to create datasets for each set of tests. I'm also going to be plotting correlation plots of each test against the target variable. I suspect I could probably achieve all of this in a few lines of code within a for/while loop, however, I'm not sure where to begin.
Using lapply this could be achieved like so:
library(dplyr)
library(corrplot)
set.seed(42)
dataset <- data.frame(
test1_1 = runif(20),
test1_2 = runif(20),
test2_1 = runif(20),
test2_2 = runif(20),
Target = runif(20)
)
test_cols <- gsub("_\\d+$", "", names(dataset))
test_cols <- test_cols[grepl("^test", test_cols)]
test_cols <- unique(test_cols)
test_cols <- setNames(test_cols, test_cols)
test_fun <- function(x, test) {
x <- x %>%
select((starts_with(test)) | matches("Target"))
cor(x)
}
cor_test <- lapply(test_cols, test_fun, x = dataset)
cplot <- lapply(cor_test, corrplot)
This is similar to @stefan's answer, using split.default to split the columns by the pattern in the column names.
tmp <- dplyr::select(dataset, -Target)
list_plot <- lapply(split.default(tmp, sub('_.*', '', names(tmp))), function(x) {
corrplot::corrplot(cor(cbind(x, Target = dataset$Target)))
})
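To see what split.default produces before any plotting, the sketch below builds the per-test datasets and computes each column's correlation with Target using base R only. The names tests, groups and cor_with_target are illustrative assumptions.

```r
set.seed(42)
dataset <- data.frame(
  test1_1 = runif(20),
  test1_2 = runif(20),
  test2_1 = runif(20),
  test2_2 = runif(20),
  Target = runif(20)
)

# Drop the target, then group the remaining columns by their prefix
tests <- dataset[setdiff(names(dataset), "Target")]
groups <- split.default(tests, sub("_.*", "", names(tests)))

# groups is a named list with one data frame per test set
names(groups)

# Correlation of every column in each set with the target
cor_with_target <- lapply(groups, function(g) sapply(g, cor, y = dataset$Target))
cor_with_target
```

Each element of groups can then be used as its own dataset or passed on to a plotting function.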
reprod:
df1 <- data.frame(X = c(0:9), Y = c(10:19))
df2 <- data.frame(X = c(0:9), Y = c(10:19))
df3 <- data.frame(X = c(0:9), Y = c(10:19))
list_of_df <- list(A = df1, B = df2, C = df3)
list_of_df
I'm trying to apply the rollmean function from zoo to every 'Y' column in this list of dataframes.
I've tried lapply with no success. It seems that no matter which way I spin it, there is no way to get around specifying the data frame you want to apply it to at some point.
This does one of the dataframes
roll_mean <- rollmean(list_of_df$A, 2)
roll_mean
obviously this doesn't work:
roll_mean1 <- rollmean(list_of_df, 2)
roll_mean1
I also tried this:
subset(may not be necessary)
Sub1 <- lapply(list_of_df, "[", 2)
roll_mean1 <- rollmean(Sub1, 2)
roll_mean1
there doesn't seem to be a way to do it without having to
specify the particular dataframe in the rollmean function
lapply(list_of_df), function(x) rollmean(list_of_df, 2))
for loop? also no success
For (i in list_of_df) {roll_mean1 <- rollmean(Sub1, 2)
Exp
}
Stating the obvious but I'm very new to coding in general and would appreciate some pointers.
It has occurred to me that even if it did work, the column that has been averaged would be one value shorter than the rest of the dataframe; how would I get around that?
The question says at one point that rollmean should be applied only to Y, and at another that roll_mean <- rollmean(list_of_df$A, 2) works, but that call processes all columns.
1) Assuming that you want to apply rollmean to all columns:
Use lapply like this:
lapply(list_of_df, rollmean, 2)
This also works:
for(i in seq_along(list_of_df)) list_of_df[[i]] <- rollmean(list_of_df[[i]], 2)
2) If you only want to apply it to the Y column:
lapply(list_of_df, transform, Y = rollmean(Y, 2, fill = NA))
or
for(i in seq_along(list_of_df)) {
list_of_df[[i]]$Y <- rollmean(list_of_df[[i]]$Y, 2, fill = NA)
}
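On the length concern raised in the question: fill = NA pads the result so it has as many values as the input. The sketch below stores the rolling mean in a new column so the lengths can be compared directly. The column name Y_roll and the choice of align = "right" (which places the padding at the start) are assumptions made for illustration.

```r
library(zoo)

list_of_df <- list(
  A = data.frame(X = 0:9, Y = 10:19),
  B = data.frame(X = 0:9, Y = 10:19)
)

smoothed <- lapply(list_of_df, function(d) {
  # fill = NA pads the result back to nrow(d);
  # align = "right" puts the padding in the first row
  d$Y_roll <- rollmean(d$Y, k = 2, fill = NA, align = "right")
  d
})

smoothed$A
```

Without fill, rollmean(d$Y, 2) returns one value fewer than nrow(d), so assigning it to a column fails.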
I would like to use the apply family instead of a for loop.
My for loop is nested and contains several vectors and a list, for which I am unsure how to input as parameters with apply.
Codes <- c("A","B","C")
Samples <- c("A","A","B","B","B","C")
Samples_Names <- c("A1","A2","B1","B2","B3","C1")
Samples_folder <- c("Alpha","Alpha","Beta","Beta","Beta","Charlie")
Df <- list(data.frame(T1 = c(1,2,3)), data.frame(T1 = c(1,2,3)), data.frame(T1 = c(1,2,3)))
for (i in 1:length(Codes)){
for (j in 1:length(Samples)) {
if(Codes[i] == Samples[j]) {
write_csv(Df[[i]], path = paste0(Working_Directory,Samples_folder[j],"/",Samples_Names[j],".csv"))
}
}
}
This will give an output of A1, A2 in Alpha; B1, B2, B3 in Beta; and C1 in Charlie.
Since you are looking to just use write_csv, we can use pwalk from purrr to accomplish this over the three equal size vectors. No need to include the loop on Codes, as for each iteration in the apply we can write_csv the dataset corresponding to where Samples is found in Codes.
I shortened Working_Directory to WD.
library(purrr)
library(readr)  # provides write_csv
pwalk(list(Samples, Samples_folder, Samples_Names),
function(x, y, z) write_csv(Df[[match(x, Codes)]], path = paste0(WD, y, "/", z, ".csv")))
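Since the question asked for the apply family, roughly the same thing can be sketched in base R with mapply. Two substitutions are assumed here: write.csv stands in for write_csv, and a temporary directory (out_dir) replaces Working_Directory so the example is self-contained.

```r
Codes <- c("A", "B", "C")
Samples <- c("A", "A", "B", "B", "B", "C")
Samples_Names <- c("A1", "A2", "B1", "B2", "B3", "C1")
Samples_folder <- c("Alpha", "Alpha", "Beta", "Beta", "Beta", "Charlie")
Df <- list(data.frame(T1 = c(1, 2, 3)), data.frame(T1 = c(1, 2, 3)), data.frame(T1 = c(1, 2, 3)))

# Assumption: write under a temporary directory instead of Working_Directory
out_dir <- file.path(tempdir(), "apply_demo")
invisible(lapply(unique(Samples_folder), function(folder) {
  dir.create(file.path(out_dir, folder), recursive = TRUE, showWarnings = FALSE)
}))

# Iterate over the three equal-length vectors in parallel;
# match() picks the data frame whose code matches the sample
written <- mapply(function(code, folder, name) {
  target <- file.path(out_dir, folder, paste0(name, ".csv"))
  write.csv(Df[[match(code, Codes)]], target, row.names = FALSE)
  target
}, Samples, Samples_folder, Samples_Names, USE.NAMES = FALSE)

written
```

The function returns each path so the result doubles as a record of what was written.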
My understanding regarding the difference between the merge() function (in base R) and the join() functions of plyr and dplyr is that join() is faster and more efficient when working with "large" data sets.
Is there some way to determine a threshold for when to use join() over merge(), without using a heuristic approach?
I am sure you will be hard pressed to find a "hard and fast" rule around when to switch from one function to another. As others have mentioned, there is a set of tools in R to help you measure performance. object.size and system.time are two such functions that look at memory usage and performance time, respectively. One general approach is to measure the two directly over an arbitrarily expanding data set. Below is one attempt at this. We will create a data frame with an 'id' column and a random set of numeric values, allowing the data frame to grow and measuring how it changes. I'll use inner_join here as you mentioned dplyr. We will measure time as "elapsed" time.
library(tidyverse)
set.seed(424)
#number of rows in a cycle
growth <- c(100,1000,10000,100000,1000000,5000000)
#empty lists
n <- 1
l1 <- c()
l2 <- c()
#test for inner join in dplyr
for(i in growth){
x <- data.frame("id" = 1:i, "value" = rnorm(i,0,1))
y <- data.frame("id" = 1:i, "value" = rnorm(i,0,1))
test <- inner_join(x,y, by = c('id' = 'id'))
l1[[n]] <- object.size(test)
print(system.time(test <- inner_join(x,y, by = c('id' = 'id')))[3])
l2[[n]] <- system.time(test <- inner_join(x,y, by = c('id' = 'id')))[3]
n <- n+1
}
#empty lists
n <- 1
l3 <- c()
l4 <- c()
#test for merge
for(i in growth){
x <- data.frame("id" = 1:i, "value" = rnorm(i,0,1))
y <- data.frame("id" = 1:i, "value" = rnorm(i,0,1))
test <- merge(x,y, by = c('id'))
l3[[n]] <- object.size(test)
# print(object.size(test))
print(system.time(test <- merge(x,y, by = c('id')))[3])
l4[[n]] <- system.time(test <- merge(x,y, by = c('id')))[3]
n <- n+1
}
#plotting output (some coercing may happen, so be it)
plot <- bind_rows(data.frame("size_bytes" = l3, "time_sec" = l4, "id" = "merge"),
data.frame("size_bytes" = l1, "time_sec" = l2, "id" = "inner_join"))
plot$size_MB <- plot$size_bytes/1000000
ggplot(plot, aes(x = size_MB, y =time_sec, color = id)) + geom_line()
merge seems to perform worse out of the gate, and the difference really takes off at around 20 MB. Is this the final word on the matter? No. But such testing can give you an idea of how to choose a function.
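A single system.time call can be noisy, so one possible refinement is to repeat each measurement and take the median. The helper median_elapsed below is a made-up name, and the data size and repetition count are arbitrary choices for the sketch.

```r
# Hypothetical helper: run f() several times and return the median elapsed time
median_elapsed <- function(f, reps = 5) {
  median(replicate(reps, system.time(f())[["elapsed"]]))
}

set.seed(424)
n <- 1e4
x <- data.frame(id = 1:n, value = rnorm(n))
y <- data.frame(id = 1:n, value = rnorm(n))

t_merge <- median_elapsed(function() merge(x, y, by = "id"))
t_merge

# Only time inner_join if dplyr is installed
if (requireNamespace("dplyr", quietly = TRUE)) {
  t_join <- median_elapsed(function() dplyr::inner_join(x, y, by = "id"))
  print(t_join)
}
```

The timings depend on the machine and package versions, so they are best read as relative comparisons.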
I am new to data analytics and am learning R. I have a few very basic questions that I am not very clear about. I hope to find some help here. Please bear with me, I'm still learning.
I wrote a small function to perform basic exploratory analysis on a data set with 9 variables out of which 8 are of Int/Numeric type and 1 is Factor. The function is like this :
out <- function(x)
{
c <- class(x)
na.len <- length(which(is.na(x)))
m <- mean(x, na.rm = TRUE)
s <- sd(x, na.rm = TRUE)
uc <- m+3*s
lc <- m-3*s
return(c(classofvar = c, noofNA = na.len, mean=m, stdev=s, UpperCap = uc, LowerCap = lc))
}
And I apply it to the data set using :
stats <- apply(train, 2, FUN = out)
But the output file has all the class of variables as Character and all the Means as NA. After some head hurting, I figured that the problem is due to the Factor variable. I converted it to Numeric using this :
train$MonthlyIncome=as.numeric(as.character(train$MonthlyIncome))
It worked fine. But I am confused: if I use the above function without first looking at the dataset, it won't work. How can I handle this situation?
When should I consider creating dummy variables?
Thank you in advance, and I hope the questions are not too silly!
Note that c() results in a vector, and all elements within the vector must be of the same class. If the elements have different classes, then c() uses the least complex class which is able to hold all information. E.g. numeric and integer will result in numeric. character and integer will result in character.
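A quick way to see this coercion at the console (the values below are arbitrary and only illustrate the rule):

```r
class(c(1L, 2.5))   # "numeric"
class(c(1L, "a"))   # "character"

# Mixing a class name with numbers, as the original function does,
# turns every element into a character string
mixed <- c(classofvar = "integer", noofNA = 0, mean = 5.5)
mixed
is.character(mixed)
```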
Use a list or a data.frame if you need different classes.
out <- function(x)
{
c <- class(x)
na.len <- length(which(is.na(x)))
m <- mean(x, na.rm = TRUE)
s <- sd(x, na.rm = TRUE)
uc <- m+3*s
lc <- m-3*s
return(data.frame(classofvar = c, noofNA = na.len, mean=m, stdev=s, UpperCap = uc, LowerCap = lc))
}
sum(is.na(x)) is faster than length(which(is.na(x)))
Use lapply to run the function on each variable. Use do.call to append the resulting dataframes.
stats <- do.call(
rbind,
lapply(train, out)
)
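To make the function usable without inspecting the data first, one option is to check is.numeric inside it and return NA for the numeric summaries otherwise. This is a sketch under assumptions: out_safe and the toy train data frame are invented here and are not the asker's data. Note also that apply() converts a data frame to a matrix first, so one factor column makes every column character; lapply avoids that by working column by column.

```r
# Toy stand-in for the asker's data: one numeric column, one factor column
train <- data.frame(
  age = c(25, 40, NA, 31),
  income = factor(c("1000", "2500", "1800", NA))
)

out_safe <- function(x) {
  is_num <- is.numeric(x)
  # Only compute numeric summaries when they are meaningful
  m <- if (is_num) mean(x, na.rm = TRUE) else NA_real_
  s <- if (is_num) sd(x, na.rm = TRUE) else NA_real_
  data.frame(
    classofvar = class(x)[1],
    noofNA = sum(is.na(x)),
    mean = m,
    stdev = s,
    UpperCap = m + 3 * s,
    LowerCap = m - 3 * s,
    stringsAsFactors = FALSE
  )
}

stats <- do.call(rbind, lapply(train, out_safe))
stats
```

Whether a factor should instead be converted to numeric, as in the question, depends on what the column actually holds.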