I have a dataframe, which I want to use for a regression analysis. The data looks like this:
COMPANY <- c(rep("COMPANY_1",24),rep("COMPANY_2",24),rep("COMPANY_3",24))
YEAR <- rep(rep(2014:2019, each = 4),3)
SEASON <- rep(c("SPRING","SUMMER","AUTUMN","WINTER"),18)
X <- sample(100:1000,72)
Y <- sample(10:100,72)
df_ALL <- data.frame(COMPANY, YEAR, SEASON, X, Y)
However, I do not want to base my analysis solely on this data set, but also on a variety of subsets (for example: only Company 1, only Winter, only Company 1 in Winter, etc.). I succeeded in doing this by creating a nested list and then performing regressions (plm) on each data frame in the nested list using lapply. The regression results are then stored in a second nested list, from where I can easily access them.
However, the process in which I created the nested lists seems to be very unprofessional and error-prone to me. Here is the code which I used for creating the nested list:
nested_list <- vector(mode="list", length=2)
nested_list <- setNames(nested_list, c("2014-2016","2017-2019"))
for (i in 1:2) {
nested_list[[i]] <- vector(mode = "list", length = 5)
nested_list[[i]] <- setNames(nested_list[[i]],c("ALL_SEASONS","SPRING","SUMMER","AUTUMN","WINTER"))
for (j in 1:5) {
nested_list[[i]][[j]] <- vector(mode="list",length=4)
nested_list[[i]][[j]] <- setNames(nested_list[[i]][[j]],c("ALL_COMPANIES","COMPANY_1","COMPANY_2","COMPANY_3"))
}
}
nested_list[["2014-2016"]][["ALL_SEASONS"]][["ALL_COMPANIES"]] <- subset(df_ALL, YEAR >= 2014 & YEAR <= 2016)
nested_list[["2017-2019"]][["ALL_SEASONS"]][["ALL_COMPANIES"]] <- subset(df_ALL, YEAR >= 2017 & YEAR <= 2019)
for (i in 1:2) {
nested_list[[i]][["ALL_SEASONS"]][["COMPANY_1"]] <- subset(nested_list[[i]][["ALL_SEASONS"]][["ALL_COMPANIES"]],COMPANY == "COMPANY_1")
nested_list[[i]][["ALL_SEASONS"]][["COMPANY_2"]] <- subset(nested_list[[i]][["ALL_SEASONS"]][["ALL_COMPANIES"]],COMPANY == "COMPANY_2")
nested_list[[i]][["ALL_SEASONS"]][["COMPANY_3"]] <- subset(nested_list[[i]][["ALL_SEASONS"]][["ALL_COMPANIES"]],COMPANY == "COMPANY_3")
for (k in 1:4) {
nested_list[[i]][["SPRING"]][[k]] <- subset(nested_list[[i]][["ALL_SEASONS"]][[k]], SEASON == "SPRING")
nested_list[[i]][["SUMMER"]][[k]] <- subset(nested_list[[i]][["ALL_SEASONS"]][[k]], SEASON == "SUMMER")
nested_list[[i]][["AUTUMN"]][[k]] <- subset(nested_list[[i]][["ALL_SEASONS"]][[k]], SEASON == "AUTUMN")
nested_list[[i]][["WINTER"]][[k]] <- subset(nested_list[[i]][["ALL_SEASONS"]][[k]], SEASON == "WINTER")
}
}
Is there a more elegant way to create a nested list? Or is the whole process of creating a nested list and then using lapply not recommended? What would be the alternative then?
You can do it using split, which divides a data frame into a list of data frames according to a grouping variable.
Here I introduced another column based on the years and labeled its values according to your preference. Then I split the data by that column, and each piece further by SEASON and by COMPANY.
df_ALL$split <- ifelse(df_ALL$YEAR > 2016, "2017-2019", "2014-2016")
nested_list <- lapply(split(df_ALL, df_ALL$split), function(x) {
y <- split(x, x$SEASON)
return(lapply(y, function(x) {
split(x, x$COMPANY)
}))
})
Created on 2020-11-05 by the reprex package (v0.3.0)
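If the three-level nesting is not essential, a flat list keyed by all three grouping variables may be easier to loop over. The following is only a sketch under assumptions: `flat_list` and `fits` are names made up for this illustration, and `lm` stands in for `plm` so that the example runs with base R alone.

```r
# Rebuild the example data (seed added only for reproducibility)
set.seed(1)
COMPANY <- rep(c("COMPANY_1", "COMPANY_2", "COMPANY_3"), each = 24)
YEAR    <- rep(rep(2014:2019, each = 4), 3)
SEASON  <- rep(c("SPRING", "SUMMER", "AUTUMN", "WINTER"), 18)
df_ALL  <- data.frame(COMPANY, YEAR, SEASON,
                      X = sample(100:1000, 72), Y = sample(10:100, 72))
df_ALL$split <- ifelse(df_ALL$YEAR > 2016, "2017-2019", "2014-2016")

# One split call over all three variables; element names are joined with "."
flat_list <- split(df_ALL, list(df_ALL$split, df_ALL$SEASON, df_ALL$COMPANY))

# Fit one model per subset (lm is an assumption standing in for plm)
fits <- lapply(flat_list, function(d) lm(Y ~ X, data = d))

# Access a single result by its combined name
coef(fits[["2014-2016.WINTER.COMPANY_1"]])
```

Note that neither this sketch nor the nested split above produces the "ALL_SEASONS" or "ALL_COMPANIES" levels; those would still have to be added separately.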
I am working on a project in which I need to filter my list if it has certain values for each of my IDs but unfortunately, it isn't working.
So I have one list with 12 different matrixes with the same columns but diff
library(tidyverse)
trajectory_C <- list()
trajectory_D <- list()
file_list_C <- list.files(pattern=".trajectory_C.csv")
file_list_D <- list.files(pattern=".trajectory_D.csv")
for (i in 1:length(file_list_C)) {
trajectory_C[[i]] <- read.csv(file_list_C[i])
}
for (i in 1:length(file_list_D)) {
trajectory_D[[i]] <- read.csv(file_list_D[i])
}
So my two lists are trajectory_D and trajectory_C. I then created two other lists and I saved the unique values of a certain column called "ID" and added a validation column to it.
unique_ID_C <- list()
unique_ID_D <- list()
for (i in 1:12) {
unique_ID_C[[i]] <- unique(trajectory_C[[i]]["ID"])
unique_ID_D[[i]] <- unique(trajectory_D[[i]]["ID"])
}
for (i in 1:12) {
Turning <- matrix(data=0,nrow = length(unique_ID_C[[i]]), ncol = 1)
unique_ID_C[[i]] <- cbind(unique_ID_C[[i]],Turning)
names(unique_ID_C[[i]]) <- c("ID","Validation")
}
What I want to do now is find out, for each of the twelve elements of my list, whether each unique ID has certain values (28 and 29) in the variable "Segment".
for (i in 1:12) {
for (ID in unique_ID_C[[1]]) {
c <- unique(trajectory_C[[i]][trajectory_C[[i]]["ID"] == ID,"Segment"])
unique_ID_C[[i]][unique_ID_C[[i]]["ID"] == ID,2] <- ifelse(any(28 == c) == TRUE & any(29 == c) == TRUE,1,0)
}
}
I am new to programming and this is the first time I am using lists, so this might be my problem.
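One possible way to express the check described above is to split the Segment values by ID and test each group. This is a minimal sketch on made-up toy data, not the original CSV files; `validation_C` is a hypothetical name, and the columns are assumed to be called ID and Segment as in the question.

```r
# Toy stand-in for the list of trajectory data frames (assumed structure)
trajectory_C <- list(
  data.frame(ID = c(1, 1, 2, 2, 3), Segment = c(28, 29, 28, 30, 29)),
  data.frame(ID = c(4, 4, 4, 5),    Segment = c(29, 28, 28, 28))
)

# For every data frame: one row per ID, Validation = 1 if that ID
# has both segment 28 and segment 29, otherwise 0
validation_C <- lapply(trajectory_C, function(traj) {
  has_both <- sapply(split(traj$Segment, traj$ID),
                     function(s) all(c(28, 29) %in% s))
  data.frame(ID = names(has_both),
             Validation = as.integer(has_both),
             row.names = NULL)
})

validation_C[[1]]
```

Because `lapply` walks over the list itself, the sketch does not depend on there being exactly 12 elements.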
I have written the following very simple while loop in R.
i=1
while (i <= 5) {
print(10*i)
i = i+1
}
I would like to save the results to a dataframe that will be a single column of data. How can this be done?
You may try the following (if you want to keep the while loop):
df1 <- c()
i=1
while (i <= 5) {
print(10*i)
df1 <- c(df1, 10*i)
i = i+1
}
as.data.frame(df1)
df1
1 10
2 20
3 30
4 40
5 50
Or
df1 <- data.frame()
i=1
while (i <= 5) {
df1[i,1] <- 10 * i
i = i+1
}
df1
If you already have a data frame (let's call it dat), you can create a new, empty column in the data frame, and then assign each value to that column by its row number:
# Make a data frame with column `x`
n <- 5
dat <- data.frame(x = 1:n)
# Fill the column `y` with the "missing value" `NA`
dat$y <- NA
# Run your loop, assigning values back to `y`
i <- 1
while (i <= 5) {
result <- 10*i
print(result)
dat$y[i] <- result
i <- i+1
}
Of course, in R we rarely need to write loops like this. Normally, we use vectorized operations to carry out tasks like this faster and more succinctly:
n <- 5
dat <- data.frame(x = 1:n)
# Same result as your loop
dat$y <- 10 * (1:n)
Also note that, if you really did need a loop instead of a vectorized operation, that particular while loop could also be expressed as a for loop.
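For illustration, the same loop written as a for loop might look like the sketch below; it reuses the `dat`, `n`, and `y` names from the examples above.

```r
n <- 5
dat <- data.frame(x = 1:n)
dat$y <- NA

# The counter is handled by for(), so no manual increment is needed
for (i in seq_len(n)) {
  dat$y[i] <- 10 * i
}

dat
```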
I recommend consulting an introductory book or other guide to data manipulation in R. Data frames are very powerful and their use is a necessary and essential part of programming in R.
I have hundreds of data frames. I need to perform ANOVA RM tests on each of these data frames. The output should be one single data frame with the mean of each p-value.
I tried:
# create data frames
df1 <- data.frame(replicate(16,sample(-10:10,10,rep=TRUE)))
df2 <- data.frame(replicate(16,sample(-10:10,10,rep=TRUE)))
df3 <- data.frame(replicate(16,sample(-10:10,10,rep=TRUE)))
Group <- c(rep("A",8),rep("B",8))
Time <- c(rep("before",4),rep("after",4),rep("before",4),rep("after",4))
Name <- rep(rep(1:4, 4))
conds <- data.frame(Name,Time,Group)
#create list
list <- list(df1,df2,df3)
#for loop ANOVA repeated measures
for ( i in list){
data <- cbind(conds,i)
t=NULL
name <- colnames(data)[4:ncol(data)]
for(i in 4:ncol(data)) { z <- aov(data[,i] ~ Group*Time+Error(Name/(Group*Time)), data=data)
sz <- as.list(summary(z))
t <- as.data.frame(c(t,sz[4]$`Error: Name:Group:Time`[[1]]$`Pr(>F)`[1]))
t
}
}
mean(t)
R is a vectorized language, so explicit for loops can often be avoided. Here you could use a nested sapply approach.
When you build the list of data frames, name the elements (df1=, and so on), so that the result later shows which data frame each value belongs to.
(And don't use list as an object name, since it is confusing: there is also a list function. data, df and friends are also "bad" names; you can always check with e.g. ?list whether a name is already taken.)
list1 <- list(df1=df1, df2=df2, df3=df3)
res <- sapply(list1, function(x) {
dat <- cbind(conds, x)
sapply(dat[-(1:3)], function(y) {
z <- aov(y ~ Group*Time + Error(Name/(Group*Time)), data=dat)
sz <- summary(z)
p <- sz$`Error: Name:Group:Time`[[1]][1, 5]
p
})
})
From the resulting matrix we take the column means.
colMeans(res)
# df1 df2 df3
# 0.4487419 0.4806528 0.4847789
Data:
set.seed(42)
df1 <- data.frame(replicate(16,sample(-10:10,16,rep=TRUE)))
df2 <- data.frame(replicate(16,sample(-10:10,16,rep=TRUE)))
df3 <- data.frame(replicate(16,sample(-10:10,16,rep=TRUE)))
conds <- data.frame(Name=c(rep("A",8),rep("B",8)),
Time=c(rep("before",4),rep("after",4),
rep("before",4),rep("after",4)),
Group=rep(1:4, 4))
I have seen this example, which explains how to replace a nested for loop with the lapply() function. However, I do not fully understand what is happening in the nested for loop.
According to my understanding, the for loop creates, for every country and every year, two new variables called tempX and tempY. But what happens in the last line of the for loop body?
What is the purpose of variable1 and variable2?
# Generate random data:
allCountries <- LETTERS[1:10]
allYears <- 1990:2012
myData <- expand.grid(allCountries, allYears) # create a dataframe with all possible combinations
colnames(myData) <- c("Country", "Year")
myData$variable1 <- rnorm(nrow(myData))
myData$variable2 <- rnorm(nrow(myData))
# Silly function to perform
myFunction <- function(x, y){
x * y - x / y
}
### Doubly-nested loop ###
myData$computedFigure <- NA # Make an "empty" variable in my data.frame
for(ii in allCountries){
for(jj in allYears){
tempX <- myData[myData$Country == ii & myData$Year == jj, c("variable1")]
tempY <- myData[myData$Country == ii & myData$Year == jj, c("variable2")]
# "Save" results into appropriate location in my data.frame
myData[myData$Country == ii & myData$Year == jj, c("computedFigure")] <- myFunction(tempX, tempY)
}
}
### Simple lapply() approach ###
computedFigureList <- lapply(1:nrow(myData), function(x){
tempX <- myData[x, c("variable1")]
tempY <- myData[x, c("variable2")]
# "Save" results into appropriate location in my data.frame
myFunction(tempX, tempY)
})
myData$computedFigure2 <- unlist(computedFigureList)
with(myData, plot(computedFigure, computedFigure2))
In the last line of the loop, myData[myData$Country == ii & myData$Year == jj, c("computedFigure")] <- myFunction(tempX, tempY), the function myFunction is applied and the result is stored in the computedFigure column of the matching row.
variable1 and variable2 are just random example data in myData; they are passed to myFunction as x and y.
The for loops iterate over every combination of country and year.
The two versions (for loop and lapply) do not return the same kind of object. lapply returns a list holding only the results of myFunction, which then has to be unlisted and assigned to a column, while the for loops write the results directly into the data frame.
You don't need nested *apply functions; you can use outer + diag to compute computedFigure, which gives the same results as the nested for loop.
myData$computedFigure <- diag(with(myData,outer(variable1,variable2,myFunction)))
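Since myFunction consists only of vectorized arithmetic, calling it directly on the two columns should give the same values without building the full n-by-n matrix that outer creates. A small sketch that checks this on the example data; `direct` and `via_outer` are names introduced here for the comparison.

```r
# Recreate the example data
allCountries <- LETTERS[1:10]
allYears <- 1990:2012
myData <- expand.grid(allCountries, allYears)
colnames(myData) <- c("Country", "Year")
myData$variable1 <- rnorm(nrow(myData))
myData$variable2 <- rnorm(nrow(myData))

myFunction <- function(x, y) {
  x * y - x / y
}

# Direct vectorized call: one result per row
direct <- myFunction(myData$variable1, myData$variable2)

# outer + diag version for comparison
via_outer <- diag(with(myData, outer(variable1, variable2, myFunction)))

all.equal(direct, via_outer)
```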
I have a list of data frames (generated by permuting the column order of an initial data frame) to which I would like to apply complicated calculations using group_by_at() and mutate(). It works well with a single data frame but fails in a for loop, since mutate requires the name of the data frame, and so do some of my calculations. So I thought, well, let's create a list of different data frames all having the same name and loop over the initial sequence of names. Unfortunately the trick does not work and I get the following message:
Error: object of type 'closure' is not subsettable.
Here is the self contained example showing all my steps. I think the problem comes from mutate. So, how could I force the use of for loop with mutate?
data <- read.table(text = 'obs gender ageclass weight year subdata income
1 F 1 10 yearA sub1 1000
2 M 2 25 yearA sub1 1200
3 M 2 5 yearB sub2 1400
4 M 1 11 yearB sub1 1350',
header = TRUE)
library(dplyr)
library(GiniWegNeg)
dataA <- select(data, gender, ageclass)
dataB <- select(data, -gender, -ageclass)
rm(data)
# Generate permutation of indexes based on the number of column in dataA
library(combinat)
index <- permn(ncol(dataA))
# Attach dataA to the previous list of index
res <- lapply(index, function(x) dataA[x])
# name my list keeping track of permutation order in dataframe name
names(res) <- unlist(lapply(res,function(x) sprintf('data%s',paste0(toupper(substr(colnames(x),1,1)),collapse = ''))))
# Create a list containing the name of each data.frame name
NameList <- unlist(lapply(res,function(x) sprintf('data%s',paste0(toupper(substr(colnames(x),1,1)),collapse = ''))))
# Define as N the number of columns/permutation/dataframes
N <- length(res)
# Merge res and dataB for all permutation of dataframes
res <- lapply(res,function(x) cbind(x,dataB))
# Change the name of res so that all data frames are named data
names(res) <- rep("data", N)
# APPLY FOR LOOP TO ALL DATAFRAMES
for (j in NameList){
runCalc <- function(data, y){
data <- data %>%
group_by_at(1) %>%
mutate(Income_1 = weighted.mean(income, weight))
data <- data %>%
group_by_at(2) %>%
mutate(Income_2 = weighted.mean(income, weight))
gini <- c(Gini_RSV(data$Income_1, data$weight), Gini_RSV(data$Income_2,data$weight))
Gini <- data.frame(gini)
colnames(Gini) <- c("Income_1","Income_2")
rownames(Gini) <- c(paste0("Gini_", y))
return(Gini)
}
runOtherCalc <- function(df, y){
Contrib <- (1/5) * df$Income_1 + df$Income_2
Contrib <- data.frame(Contrib)
colnames(Contrib) <- c("myresult")
rownames(Contrib) <- c(paste0("Contrib_", y))
return(Contrib)
}
# Run runCalc over dataframe data by year
df1_List <- lapply(unique(data$year), function(i) {
byperiod <- subset(data, year == i)
runCalc(byperiod, i)
})
# runCalc returns df which then passes to runOtherCalc, again by year
df1_OtherList <- lapply(unique(data$year), function(i) {
byperiod <- subset(data, year == i)
df <- runCalc(byperiod, i)
runOtherCalc(df, i)
})
# Run runCalc over dataframe data by subdata
df2_List <- lapply(unique(data$subdata), function(i) {
bysubdata <- subset(data, subdata == i)
runCalc(bysubdata, i)
})
# runCalc returns df which then passes to runOtherCalc, again by subdata
df2_OtherList <- lapply(unique(data$subdata), function(i) {
bysubdata <- subset(data, subdata == i)
df <- runCalc(bysubdata, i)
runOtherCalc(df, i)
})
# Return all results in separate frames, then append by row in 2 frames
Gini_df1 <- do.call(rbind, df1_List)
Contrib_df1 <- do.call(rbind,df1_OtherList)
Gini_df2 <- do.call(rbind, df2_List)
Contrib_df2 <- do.call(rbind,df2_OtherList)
Gini <- rbind(Gini_df1, Gini_df2)
Contrib <- rbind(Contrib_df1, Contrib_df2)
}
Admittedly, the R error below is a bit cryptic. It means you are trying to subset a function (a "closure"), which usually happens when the object you meant to use does not exist and its name resolves to a function instead.
Error: object of type 'closure' is not subsettable.
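Specifically, it comes from your lapply call: data is not defined anywhere globally (only as an argument inside runCalc), because you removed it above with rm(data). The name data therefore resolves to the built-in function data(), which cannot be subset with $.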
dfList <- lapply(unique(data$year), function(i) {
byperiod <- subset(data, year == i)
runCalc(byperiod, i)
})
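The effect can be reproduced in isolation. The sketch below assumes a fresh session in which no other object called `data` exists; `result` is a name used only for this illustration.

```r
# Create a data frame called data, then remove it again
data <- data.frame(year = c("yearA", "yearB"))
rm(data)

# The name now resolves to the function utils::data()
is.function(data)

# Subsetting a function with $ fails; capture the message instead of stopping
result <- tryCatch(data$year, error = function(e) conditionMessage(e))
result
```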
By the way, the lapply...unique...subset pattern can be replaced with the underused base R grouping function, by().
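As a small, self-contained illustration of by(), here is a sketch on toy data that is not taken from the question; `toy`, `by_year`, and `combined` are hypothetical names.

```r
toy <- data.frame(
  year   = c("yearA", "yearA", "yearB", "yearB"),
  income = c(1000, 1200, 1400, 1350)
)

# by() splits toy by year and applies the function to each sub data frame
by_year <- by(toy, toy$year, function(sub) {
  data.frame(year = sub$year[1], mean_income = mean(sub$income))
})

# Stack the per-group results into one data frame
combined <- do.call(rbind, by_year)
combined
```

This does the work of `unique`, `subset`, and `lapply` in a single call, with the groups taken from the second argument.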
Gathering from your text and code, I believe you intend to run a year grouping on each dataframe of your list, res. Then consider two by calls, wrapped in a larger function that receives as a parameter a dataframe, df. Then run lapply across all items of list to return a new list of nested dataframe pairs.
# SECONDARY FUNCTIONS
runCalc <- function(data) {
data <- data %>%
group_by_at(1) %>%
mutate(Income_1 = weighted.mean(income, weight))
data <- data %>%
group_by_at(2) %>%
mutate(Income_2 = weighted.mean(income, weight))
Gini <- data.frame(
year = data$year[[1]],
Income_1 = unname(Gini_RSV(data$Income_1, data$weight)),
Income_2 = unname(Gini_RSV(data$Income_2, data$weight)),
row.names = paste0("Gini_", data$year[[1]])
)
return(Gini)
}
runOtherCalc <- function(df){
Contrib <- data.frame(
myresult = (1/5) * df$Income_1 + df$Income_2,
row.names = paste0("Contrib_", df$year[[1]])
)
return(Contrib)
}
# PRIMARY FUNCTION
runDfOperations <- function(df) {
gList <- by(df, df$year, runCalc)
gTmp <- do.call(rbind, gList)
cList <- by(gTmp, gTmp$year, runOtherCalc)
cTmp <- do.call(rbind, cList)
gTmp$year <- NULL
return(list(gTmp, cTmp))
}
# RETURNS NESTED LIST OF TWO DFs FOR EACH ORIGINAL DF
new_res <- lapply(res, runDfOperations)
# SEPARATE LISTS IF NEEDED (EQUAL LENGTH)
Gini <- lapply(new_res, "[[", 1)
Contrib <- lapply(new_res, "[[", 2)