replacing nested for loop with lapply() - r

I have seen this example, which explains how to replace a nested for loop with the lapply() function. However, I do not fully understand what is happening in the nested for loop.
According to my understanding, for every country and every year the for loop creates two new variables called tempX and tempY, but what happens in the last line of the loop body?
What is the purpose of variable1 and variable2?
# Generate random data:
allCountries <- LETTERS[1:10]
allYears <- 1990:2012
myData <- expand.grid(allCountries, allYears) # create a dataframe with all possible combinations
colnames(myData) <- c("Country", "Year")
myData$variable1 <- rnorm(nrow(myData))
myData$variable2 <- rnorm(nrow(myData))
# Silly function to perform
myFunction <- function(x, y){
x * y - x / y
}
### Doubly-nested loop ###
myData$computedFigure <- NA # Make an "empty" variable in my data.frame
for(ii in allCountries){
for(jj in allYears){
tempX <- myData[myData$Country == ii & myData$Year == jj, c("variable1")]
tempY <- myData[myData$Country == ii & myData$Year == jj, c("variable2")]
# "Save" results into appropriate location in my data.frame
myData[myData$Country == ii & myData$Year == jj, c("computedFigure")] <- myFunction(tempX, tempY)
}
}
### Simple lapply() approach ###
computedFigureList <- lapply(1:nrow(myData), function(x){
tempX <- myData[x, c("variable1")]
tempY <- myData[x, c("variable2")]
# "Save" results into appropriate location in my data.frame
myFunction(tempX, tempY)
})
myData$computedFigure2 <- unlist(computedFigureList)
with(myData, plot(computedFigure, computedFigure2))

In the last line of the loop, myData[myData$Country == ii & myData$Year == jj, c("computedFigure")] <- myFunction(tempX, tempY), the function myFunction is applied to tempX and tempY, and the result is stored in the computedFigure column, in the row for that country and year.
variable1 and variable2 are random values that serve as example data in myData; they are the inputs x and y of myFunction.
The for loops iterate over every combination of country and year.
The two approaches (for loop and lapply) do not build the result in the same way. lapply returns a list containing only the results of myFunction, which then has to be flattened with unlist() and assigned to a column. The for loops write each result directly into the data frame. The resulting values are the same, which is what the final plot checks.

You don't actually need nested *apply functions. You can use outer + diag to compute computedFigure, which gives the same results as the nested for loop. Note that outer builds a full nrow(myData) x nrow(myData) matrix only to keep its diagonal.
myData$computedFigure <- diag(with(myData,outer(variable1,variable2,myFunction)))
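Because myFunction only uses arithmetic operators, it is already vectorized, so it can probably be called on the two columns directly. The following is a minimal sketch under that assumption; the names direct, viaLapply and viaOuter are introduced here for illustration only, and the seed is arbitrary.

```r
# Rebuild the example data (arbitrary seed for reproducibility)
set.seed(42)
myData <- expand.grid(Country = LETTERS[1:10], Year = 1990:2012)
myData$variable1 <- rnorm(nrow(myData))
myData$variable2 <- rnorm(nrow(myData))

myFunction <- function(x, y) {
  x * y - x / y
}

# Direct vectorized call: one call for all rows
direct <- myFunction(myData$variable1, myData$variable2)

# Row-by-row lapply(), as in the question
viaLapply <- unlist(lapply(seq_len(nrow(myData)), function(i) {
  myFunction(myData$variable1[i], myData$variable2[i])
}))

# outer() + diag(), as in the answer above
viaOuter <- diag(with(myData, outer(variable1, variable2, myFunction)))

# Both comparisons should be TRUE
all.equal(direct, viaLapply)
all.equal(direct, viaOuter)
```

The direct call avoids both the loop and the intermediate matrix, but it only works when the function accepts whole vectors.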

Related

create nested lists in r

I have a dataframe, which I want to use for a regression analysis. The data looks like this:
COMPANY <- c(rep("COMPANY_1",24),rep("COMPANY_2",24),rep("COMPANY_3",24))
YEAR <- rep(rep(2014:2019, each = 4),3)
SEASON <- rep(c("SPRING","SUMMER","AUTUMN","WINTER"),18)
X <- sample(100:1000,72)
Y <- sample(10:100,72)
df_ALL <- data.frame(COMPANY, YEAR, SEASON, X, Y)
However, I do not want to base my analysis solely on this data set, but also on a variety of subsets (for example: only Company 1, only Winter, only Company 1 in Winter, etc.). I succeeded in doing this by creating a nested list and then performing regressions (plm) on each dataframe in the nested list using lapply. The regression results are then stored in a second nested list, from which I can easily access them.
However, the process by which I created the nested lists seems very unprofessional and error-prone to me. Here is the code I used to create the nested list:
nested_list <- vector(mode="list", length=2)
nested_list <- setNames(nested_list, c("2014-2016","2017-2019"))
for (i in 1:2) {
nested_list[[i]] <- vector(mode = "list", length = 5)
nested_list[[i]] <- setNames(nested_list[[i]],c("ALL_SEASONS","SPRING","SUMMER","AUTUMN","WINTER"))
for (j in 1:5) {
nested_list[[i]][[j]] <- vector(mode="list",length=4)
nested_list[[i]][[j]] <- setNames(nested_list[[i]][[j]],c("ALL_COMPANIES","COMPANY_1","COMPANY_2","COMPANY_3"))
}
}
nested_list[["2014-2016"]][["ALL_SEASONS"]][["ALL_COMPANIES"]] <- subset(df_ALL, YEAR >= 2014 & YEAR <= 2016)
nested_list[["2017-2019"]][["ALL_SEASONS"]][["ALL_COMPANIES"]] <- subset(df_ALL, YEAR >= 2017 & YEAR <= 2019)
for (i in 1:2) {
nested_list[[i]][["ALL_SEASONS"]][["COMPANY_1"]] <- subset(nested_list[[i]][["ALL_SEASONS"]][["ALL_COMPANIES"]],COMPANY == "COMPANY_1")
nested_list[[i]][["ALL_SEASONS"]][["COMPANY_2"]] <- subset(nested_list[[i]][["ALL_SEASONS"]][["ALL_COMPANIES"]],COMPANY == "COMPANY_2")
nested_list[[i]][["ALL_SEASONS"]][["COMPANY_3"]] <- subset(nested_list[[i]][["ALL_SEASONS"]][["ALL_COMPANIES"]],COMPANY == "COMPANY_3")
for (k in 1:4) {
nested_list[[i]][["SPRING"]][[k]] <- subset(nested_list[[i]][["ALL_SEASONS"]][[k]], SEASON == "SPRING")
nested_list[[i]][["SUMMER"]][[k]] <- subset(nested_list[[i]][["ALL_SEASONS"]][[k]], SEASON == "SUMMER")
nested_list[[i]][["AUTUMN"]][[k]] <- subset(nested_list[[i]][["ALL_SEASONS"]][[k]], SEASON == "AUTUMN")
nested_list[[i]][["WINTER"]][[k]] <- subset(nested_list[[i]][["ALL_SEASONS"]][[k]], SEASON == "WINTER")
}
}
Is there a more elegant way to create a nested list? Or is the whole process of creating a nested list and then using lapply not recommended? What would be the alternative?
You can do it using split, which divides a data frame into a list of data frames according to a grouping variable.
Here I introduced another column based on the years and named its values according to your preference. Then I split the rest using the existing columns.
df_ALL$split <- ifelse(df_ALL$YEAR > 2016, "2017-2019", "2014-2016")
nested_list <- lapply(split(df_ALL, df_ALL$split), function(x) {
y <- split(x, x$SEASON)
return(lapply(y, function(x) {
split(x, x$COMPANY)
}))
})
Created on 2020-11-05 by the reprex package (v0.3.0)
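As a rough illustration of how the result of the split approach can be used, the sketch below rebuilds the example data and pulls out one subset by name. The column name PERIOD and the variable winter_c1 are made up for this example. Note that this approach does not create the "ALL_SEASONS" or "ALL_COMPANIES" levels from the question.

```r
set.seed(1)  # arbitrary seed
COMPANY <- c(rep("COMPANY_1", 24), rep("COMPANY_2", 24), rep("COMPANY_3", 24))
YEAR <- rep(rep(2014:2019, each = 4), 3)
SEASON <- rep(c("SPRING", "SUMMER", "AUTUMN", "WINTER"), 18)
X <- sample(100:1000, 72)
Y <- sample(10:100, 72)
df_ALL <- data.frame(COMPANY, YEAR, SEASON, X, Y)

# Label each row with its period (PERIOD is a hypothetical column name)
df_ALL$PERIOD <- ifelse(df_ALL$YEAR > 2016, "2017-2019", "2014-2016")

# Three levels of nesting: period -> season -> company
nested_list <- lapply(split(df_ALL, df_ALL$PERIOD), function(p) {
  lapply(split(p, p$SEASON), function(s) split(s, s$COMPANY))
})

# Access one subset by name
winter_c1 <- nested_list[["2017-2019"]][["WINTER"]][["COMPANY_1"]]
nrow(winter_c1)
```

Each company has one row per year and season in this data, so the subset above holds one row for each of the three years in the period.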

Nested for loop to functions and lapply

I am trying to rewrite a nested for loop as functions plus the code that calls them. The code below works as it is with the for loops, and it also runs when I put it inside a function. But I am trying to avoid for loops within my function and use lapply instead. How can I create the function and its respective calling code using lapply?
Code with for loop:
df <- data.frame(actual=c("reaok_oc giade_len","reaok_oc giade_len reaok_oc giade_len"),
Predicted = c("giade_len","reaok_oc giade_len reaok_oc giade_len"))
df[] <- lapply(df, as.character)
str(df)
all_acc<-NULL
for(s in 1:nrow(df)){
sub_df1<-df[s,]
actual_words<-unlist(strsplit(sub_df1$actual," "))
all_count<-0
for(g in 1:length(actual_words)){
count_len<-ifelse(grep(actual_words[g],sub_df1$Predicted),1,0)
all_count<-sum(all_count,count_len)
}
sub_acc<-all_count/length(actual_words)
all_acc<-c(all_acc,sub_acc)
}
df$trans_acc<-all_acc
sensitivity=sum(df$trans_acc)/nrow(df)
sensitivity
Here is my non-working attempt to split this into functions called through lapply:
a1 <- function(df){
sub_df1<-df[s,]
actual_words<-unlist(strsplit(sub_df1$actual," "))
all_count<-0
}
a2 <- function(df){
count_len<-ifelse(grep(actual_words[g],sub_df1$Predicted),1,0)
all_count<-sum(all_count,count_len)
sub_acc<-all_count/length(actual_words)
all_acc<-c(all_acc,sub_acc)
df$trans_acc<-all_acc
sensitivity=sum(df$trans_acc)/nrow(df)
sensitivity
}
lapply(1:nrow(df) FUN = a1, lapply(1:length(actual_words) FUN = a2, actual_words,sub_aa1))
In base R it is usually best to find solutions that are 'vectorized' (only one R function call) rather than 'iterated' (one call for each element). So for instance
for(s in 1:nrow(df)){
sub_df1<-df[s,]
actual_words<-unlist(strsplit(sub_df1$actual," "))
...
involves nrow(df) calls to strsplit(), but
actual <- strsplit(df$actual, " ")
involves just one but performs the same transformation.
I think also that when you say
for(g in 1:length(actual_words)){
count_len<-ifelse(grep(actual_words[g],sub_df1$Predicted),1,0)
all_count<-sum(all_count,count_len)
}
really you are just looking for exact matches between actual words and predicted words. So you could split the predicted words
predicted <- strsplit(df$Predicted, " ")
and calculate sum(actual[[1]] %in% predicted[[1]]), and so on. Write this as a function
actual_in_predicted <- function(actual, predicted) {
sum(actual %in% predicted)
}
A 'for' loop might iterate over each element of actual and predicted
all_count <- integer()
for (i in 1:nrow(df))
all_count[[i]] <- actual_in_predicted(actual[[i]], predicted[[i]])
but it's better to use mapply() to iterate over each element of actual and predicted
all_count <- mapply(actual_in_predicted, actual, predicted)
Your variable all_acc is this vector of numbers divided by the number of actual words in each comparison
all_acc <- all_count / lengths(actual)
The complete revised code uses a function to compare actual and predicted words in each row, and uses mapply() rather than an explicit loop to iterate over the rows.
actual_in_predicted <- function(actual, predicted) {
sum(actual %in% predicted)
}
actual <- strsplit(df$actual, " ")
predicted <- strsplit(df$Predicted, " ")
all_count <- mapply(actual_in_predicted, actual, predicted)
all_acc <- all_count / lengths(actual)
df$trans_acc <- all_acc
sensitivity <- sum(df$trans_acc) / nrow(df)
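To check that the mapply() version agrees with the original loop on the example data, here is a self-contained sketch. It uses mean() in place of sum() divided by lengths(), and the name loop_acc is introduced only for the comparison. Keep in mind that %in% tests exact word matches while grep() tests substrings, so the two may differ on other data.

```r
df <- data.frame(
  actual = c("reaok_oc giade_len", "reaok_oc giade_len reaok_oc giade_len"),
  Predicted = c("giade_len", "reaok_oc giade_len reaok_oc giade_len"),
  stringsAsFactors = FALSE
)

actual <- strsplit(df$actual, " ")
predicted <- strsplit(df$Predicted, " ")

# Fraction of actual words that appear among the predicted words, per row
all_acc <- mapply(function(a, p) mean(a %in% p), actual, predicted)
sensitivity <- mean(all_acc)

# Substring-based version mirroring the grep() logic of the original loop
loop_acc <- vapply(seq_len(nrow(df)), function(s) {
  words <- strsplit(df$actual[s], " ")[[1]]
  mean(vapply(words, grepl, logical(1), x = df$Predicted[s], fixed = TRUE))
}, numeric(1))

all_acc      # 0.5 1.0
sensitivity  # 0.75
all.equal(loop_acc, all_acc)
```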
Perhaps, we can use separate_rows
library(dplyr)
library(tidyr)
library(stringr)
df %>%
separate_rows(actual, sep="_") %>%
summarise(perc = mean(str_detect(Predicted, actual)))
# perc
#1 0.75
It can be wrapped into a function
f1 <- function(data, act, pred) {
data %>%
separate_rows({{act}}, sep="_") %>%
summarise(perc = mean(str_detect({{pred}}, {{act}})))
}
f1(df, actual, Predicted)
# perc
#1 0.75

lapply() changing global variable in R

Using R, I wanted to save each variable's value when running lapply().
Below is what I have tried so far:
list_C <- list()
list_D <- list()
n <- 1
data_partition <- split(data, with(data, paste(A, B, sep=":")))
final_result <- lapply(data_partition,
function(dat) {
if(... condition ...) {
<Some R codes to run>
list_C[[n]] <- dat$C
list_D[[n]] <- dat$D
n <- n + 1
}
})
However, after running the code, 'n' remains just '1' and there's no change. How can I change the variable of 'n' to get the right saving lists of 'list_C' and 'list_D'?
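One likely explanation is that assignments made with <- inside the anonymous function only create local copies, so n, list_C and list_D in the calling environment are never touched. A common alternative is to return the values from the function and collect them afterwards. The sketch below assumes a small made-up data frame and a placeholder condition, since the real data and condition are not shown in the question.

```r
# Hypothetical example data with the columns used in the question
data <- data.frame(
  A = c("a", "a", "b", "b"),
  B = c("x", "x", "y", "y"),
  C = 1:4,
  D = 5:8
)
data_partition <- split(data, with(data, paste(A, B, sep = ":")))

# Return the values instead of modifying variables outside the function
results <- lapply(data_partition, function(dat) {
  if (nrow(dat) > 0) {  # placeholder for the real condition
    list(C = dat$C, D = dat$D)
  } else {
    NULL
  }
})

# Drop the partitions that did not satisfy the condition
results <- Filter(Negate(is.null), results)

# Collect one list per variable, named by partition
list_C <- lapply(results, `[[`, "C")
list_D <- lapply(results, `[[`, "D")
```

Using <<- inside the function would also modify the outer variables, but returning values keeps the function free of side effects and makes the counter n unnecessary.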

Using R to loop through vector and copy some sequences to data.frame

I want to search through a vector for the sequence of strings "hello" "world". When I find this sequence, I want to copy it, including the 10 elements before and after, as a row in a data.frame to which I'll apply further analysis.
My problem: I get an error "new column would leave holes after existing columns". I'm new to coding, so I'm not sure how to manipulate data.frames. Maybe I need to create rows in the loop?
This is what I have:
df = data.frame()
i <- 1
for(n in 1:length(v))
{
if(v[n] == 'hello' & v[n+1] == 'world')
{
df[i,n-11:n+11] <- v[n-10:n+11]
i <- i+1
}
}
Thanks!
Maybe this helps:
indx <- which(v1[-length(v1)]=='hello'& v1[-1]=='world')
lst <- Map(function(x,y) {s1 <- seq(x,y)
v1[s1[s1>0 & s1 <= length(v1)]]}, indx-10, indx+11)
len <- max(sapply(lst, length))
d1 <- as.data.frame(do.call(rbind,lapply(lst, `length<-`, len)))
data
set.seed(496)
v1 <- sample(c(letters[1:3], 'hello', 'world'), 100, replace=TRUE)
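Part of the trouble with the loop in the question is probably operator precedence: in R, : binds more tightly than + and -, so n-10:n+11 is not the window from n-10 to n+11. A small sketch with an arbitrary value of n (the names parsed and window are illustrative only):

```r
n <- 15

# Parsed as n - (10:n) + 11, which gives 16 15 14 13 12 11
parsed <- n - 10:n + 11

# The intended window needs parentheses: 5, 6, ..., 26
window <- (n - 10):(n + 11)

parsed
window
length(window)  # 22 elements: 10 before, the two matched words, 10 after
```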

subsetting a list of data frames using a for loop

My question is why does the last statement "a <- ..." work to give me a subset of that data frame within the list, but when I try to automate the process with a for loop through all data frames in the list I am met with all kinds of warnings and not the answer I am looking for??
time <- c(1:20)
temp <- c(2,3,4,5,6,2,3,4,5,6,2,3,4,5,6,2,3,4,5,6)
data <- data.frame(time,temp)
tmp <- c(1,diff(data[[2]]))
tmp2 <- tmp < 0
tmp3 <- cumsum(tmp2)
data1 <- split(data, tmp3)
#this does not work. I want to automate the successful process below through all data frames in the list "data1"
for(i in 1:length(data1)){
finale[i] <- subset(data1[[i]], data1[[i]][,2] > 3)
}
#this works to give me a part of what I want
a <- subset(data1[[1]], data1[[1]][,2] >3)
You may want to try lapply:
lapply(data1, function(x) subset(x, x[,2]>3))
Same result using a for loop
finale <- vector("list", length(data1))
for(i in 1:length(data1)){
finale[[i]] <- subset(data1[[i]], data1[[i]][,2] > 3)
}
It works because finale is created as a list before the loop and each element is assigned with [[i]]. It didn't work for you because finale was not set up as a list and the results were assigned with [i] rather than [[i]].
You're trying to save a data.frame (a 2D object) in an element of a vector (a 1D object). Just define finale as a list and the code will work:
time <- c(1:20)
temp <- c(2,3,4,5,6,2,3,4,5,6,2,3,4,5,6,2,3,4,5,6)
data <- data.frame(time,temp)
tmp <- c(1,diff(data[[2]]))
tmp2 <- tmp < 0
tmp3 <- cumsum(tmp2)
data1 <- split(data, tmp3)
# this works: finale is a list, and each data frame is stored with [[i]]
finale <- vector(mode='list')
for(i in 1:length(data1)){
finale[[i]] <- subset(data1[[i]], data1[[i]][,2] > 3) # Use [[i]] instead of [i]
}
To combine everything into one data.frame:
finale <- do.call(rbind, finale)
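Putting the pieces together, a compact version might look like the sketch below. It renames data to dat and introduces the names groups, filtered and combined for illustration only.

```r
dat <- data.frame(time = 1:20, temp = rep(c(2, 3, 4, 5, 6), times = 4))

# Start a new group each time temp drops
groups <- cumsum(c(1, diff(dat$temp)) < 0)
data1 <- split(dat, groups)

# Keep only the rows with temp > 3 in every group
filtered <- lapply(data1, function(x) subset(x, temp > 3))

# Stack the filtered groups into a single data.frame
combined <- do.call(rbind, filtered)
nrow(combined)
```

With this data there are four groups of five rows, and three rows in each group have temp above 3.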

Resources