Loop over several dataframes to do several actions in R - r

I have several dataframes (dataframe_1, dataframe_2...) that I want to loop over in order to apply the same operations to each of them. These operations are:
Select specific columns:
dataframe_1 <- dataframe_1[, c("Column_1", "Column_2")]
Rename the columns:
dataframe_1 <- rename(dataframe_1, New_Name_for_Column_1 = Column_1)
Create new columns. For example, by using the ifelse() function:
dataframe_1$Column_3 <- ifelse(dataframe_1$Column_1 == 5, 1, 0)
I have tested the code on some dataframes individually without errors.
However, if I execute the following loop:
list_dataframes = list(dataframe_1, dataframe_2)
for (dataframe in 1:length(list_dataframes)){
  dataframe <- dataframe[, c("Column_1", "Column_2")]
  dataframe <- rename(dataframe, New_Name_for_Column_1 = Column_1)
  dataframe$Column_3 <- ifelse(dataframe$Column_1 == 5, 1, 0)
}
The following error arises:
Error in dataframe[, c("Column_1", "Column_2", :
incorrect number of dimensions
(All dataframes have the same column names.)
Any idea?
Thanks!

You are not iterating over the list of dataframes, but rather over a sequence 1:length(list_dataframes). Consider the following for illustration:
a = list("a", "b")
for (i in a){print(i)}            # prints the elements: "a", then "b"
for (i in 1:length(a)){print(i)}  # prints the indices: 1, then 2
In your code, you need to explicitly access the list elements like this:
library(dplyr)  # provides rename()
list_dataframes = list(dataframe_1, dataframe_2)
for (df_number in seq_along(list_dataframes)){
  list_dataframes[[df_number]] <- list_dataframes[[df_number]][, c("Column_1", "Column_2")]
  list_dataframes[[df_number]] <- rename(list_dataframes[[df_number]], New_Name_for_Column_1 = Column_1)
  # Column_1 has been renamed by this point, so test the new name (and compare with ==, not =)
  list_dataframes[[df_number]]$Column_3 <- ifelse(list_dataframes[[df_number]]$New_Name_for_Column_1 == 5, 1, 0)
}
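The same steps can also be wrapped in a function and applied with lapply(), which avoids the index bookkeeping. This is only a minimal base-R sketch under some assumptions: the two small data frames are made up, the helper name prepare_df is hypothetical, and names()<- stands in for dplyr's rename() so that no package is needed.

```r
# Made-up example data; the real data frames will differ.
dataframe_1 <- data.frame(Column_1 = c(5, 2), Column_2 = c("a", "b"), Other = 1:2)
dataframe_2 <- data.frame(Column_1 = c(1, 5), Column_2 = c("c", "d"), Other = 3:4)

# Hypothetical helper: select the columns, add the flag, then rename.
prepare_df <- function(df) {
  df <- df[, c("Column_1", "Column_2")]
  # Create the flag before renaming, while Column_1 still exists.
  df$Column_3 <- ifelse(df$Column_1 == 5, 1, 0)
  names(df)[names(df) == "Column_1"] <- "New_Name_for_Column_1"
  df
}

list_dataframes <- list(dataframe_1 = dataframe_1, dataframe_2 = dataframe_2)
list_dataframes <- lapply(list_dataframes, prepare_df)
```

The original objects are left untouched; the processed versions live in the list and can be reached with, for example, list_dataframes$dataframe_1.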

The code for (dataframe in 1:length(list_dataframes)) iterates over the integer vector c(1, 2), so on each pass the variable dataframe holds a single number, not a data frame. A length-one vector has no rows or columns, which is why two-index subsetting such as dataframe[, c("Column_1", "Column_2")] fails with "incorrect number of dimensions". Use the number to index the list instead: list_dataframes[[dataframe]][, c("Column_1", "Column_2")]

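You could try iterating over the dataframes with purrr::map_dfr(), which applies the steps to each one and row-binds the results into a single dataframe, e.g.
list_dataframes = list(dataframe_1, dataframe_2)
library(dplyr)
library(purrr)
list_dataframes %>%
  map_dfr(~ .x %>%
    select(Column_1, Column_2) %>%
    rename(New_Name_for_Column_1 = Column_1) %>%
    # the column has been renamed, so refer to it by its new name
    mutate(Column_3 = ifelse(New_Name_for_Column_1 == 5, 1, 0)))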

Related

Add a Column created Within a Function to a dataframe in R

I have searched and tried multiple previously asked questions that might be similar to my question, but none worked.
I have a dataframe in R called df2, a column called df2$col. I created a function to take the df, the df$col, and two parameters that are names for two new columns I want created and worked on within the function. After the function finishes running, I want a return df with the two new columns included. I get the two columns back indeed, but they are named after the placeholders in the function shell. See below:
df2 = data.frame(col = c(1, 3, 4, 5),
col1 = c(9, 6, 8, 3),
col2 = c(8, 2, 8, 4))
The function I created takes col and does something to it, then returns the transformed col as well as the two newly created columns:
no_way <- function(df, df_col_name, df_col_flagH, df_col_flagL) {
lo_perc <- 2
hi_perc <- 6
df$df_col_flagH <- as.factor(ifelse(df_col_name<lo_perc, 1, 0))
df$df_col_flagL <- as.factor(ifelse(df_col_name>hi_perc, 1, 0))
df_col_name <- df_col_name + 1.4
df_col_name <- df_col_name * .12
return(df)
}
When I call the function, no_way(df2, col, df$new_col, df$new_col2), instead of getting a df with col, col1, col2, new_col1, new_col2, I get the first three right but get the parametric names for the last two. So something like df, col, col1, col2, df_col_flagH, df_col_flagL. I essentially want the function to return the df with the new columns' names I give it when I am calling it. Please help.
I don't see what your function is trying to do, but this might point you in the right direction:
no_way <- function(df = df2, df_col_name = "col", df_col_flagH = "col1", df_col_flagL = "col2") {
lo_perc <- 2
hi_perc <- 6
df[[df_col_flagH]] <- as.factor(ifelse(df[[df_col_name]] < lo_perc, 1, 0)) # as.factor?
df[[df_col_flagL]] <- as.factor(ifelse(df[[df_col_name]] > hi_perc, 1, 0))
df[[df_col_name]] <- (df[[df_col_name]] + 1.4) * 0.12 # Do in one step
return(df)
}
I needed to call the function with the new column names as strings instead:
no_way(mball, 'TEAM_BATTING_H', 'hi_TBH', 'lo_TBH')
Additionally, I had to use brackets around the target column in my function.
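The difference between the two forms can be seen in a small sketch: df$name always uses the literal text after the $, while df[[name]] uses the string stored in the variable. The variable new_name and the column name "flag" below are made up for illustration.

```r
df2 <- data.frame(col = c(1, 3, 4, 5))
new_name <- "flag"     # hypothetical column name held in a variable

df2$new_name <- 0      # creates a column literally called "new_name"
df2[[new_name]] <- 1   # creates a column called "flag"

names(df2)             # "col" "new_name" "flag"
```

This is why the column names have to be passed to the function as strings and used with double brackets.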

Using a loop to select a column names from a list

I've been struggling with column selection with lists in R. I've loaded a bunch of csv's (all with different column names and different number of columns) with the goal of extracting all the columns that have the same name (just phone_number, subregion, and phonetype) and putting them together into a single data frame.
I can get the columns I want out of one list element with this;
var<-data[[1]] %>% select("phone_number","Subregion", "PhoneType")
But I cannot select the columns from all the elements in the list this way, just one at a time.
I then tried a for loop that looks like this:
new.function <- function(a) {
for(i in 1:a) {
tst<-datas[[i]] %>% select("phone_number","Subregion", "PhoneType")
}
print(tst)
}
But when I try:
new.function(5)
I'll only get the columns from the 5th element.
I know this might seem like a noob question for most, but I am struggling to learn lists and loops and R. I'm sure I'm missing something very easy to make this work. Thank you for your help.
Another way you could do this is to make a function that extracts your columns and apply it to all data.frames in your list with lapply:
library(dplyr)
extractColumns = function(x){
select(x,"phone_number","Subregion", "PhoneType")
#or x[,c("phone_number","Subregion","PhoneType")]
}
final_df = lapply(data,extractColumns) %>% bind_rows()
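If dplyr is not available, the same idea can be sketched in base R with lapply() and do.call(rbind, ...). The list data_list and its contents are made up for illustration; only the three column names come from the question.

```r
# Made-up list of data frames with different extra columns.
data_list <- list(
  data.frame(extra_a = 1:2, phone_number = c("111", "222"),
             Subregion = c("north", "south"), PhoneType = c("cell", "land")),
  data.frame(phone_number = "333", Subregion = "east",
             PhoneType = "cell", extra_b = TRUE)
)

wanted <- c("phone_number", "Subregion", "PhoneType")

# Keep only the wanted columns from each element, then stack the pieces.
selected <- lapply(data_list, function(x) x[, wanted, drop = FALSE])
final_df <- do.call(rbind, selected)
```

Note that rbind() expects every piece to have the same column names, which the selection step guarantees here.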
The way you have your loop set up currently is only saving the last iteration of the loop because tst is not set up to store more than a single value and is overwritten with each step of the loop.
You can establish tst as a list first with:
tst <- list()
Then in your code be explicit that each step is saved as a separate element of the list by adding brackets and an index to tst. Here is a full example, written the way you were doing it.
library(dplyr)  # provides select() and %>%
# Example data.frame that could be in datas
df_1 <- data.frame("not_selected" = rep(0, 5),
                   "phone_number" = rep("1-800", 5),
                   "Subregion" = rep("earth", 5),
                   "PhoneType" = rep("flip", 5))
# Another bare data.frame that could be in datas
df_2 <- data.frame("also_not_selected" = rep(0, 5),
                   "phone_number" = rep("8675309", 5),
                   "Subregion" = rep("mars", 5),
                   "PhoneType" = rep("razr", 5))
# datas is a list of data.frames; we want to pull only specific columns from all of them
datas <- list(df_1, df_2)
# Function for looping through the first 'a' elements
new.function <- function(a) {
  # Create the list inside the function to store the new data.frames once columns are selected
  tst <- list()
  for(i in 1:a) {
    tst[[i]] <- datas[[i]] %>% select("phone_number","Subregion", "PhoneType")
  }
  tst  # return the whole list instead of only printing it
}
# Proof of concept for 2 elements
new.function(2)

Passing dataframe as argument to function

I am writing a function to process data from a huge dataframe (row by row) which always has the same column names. So I want to pass the dataframe itself as an argument and read out the information I need from the individual rows. However, when I try to use it as an argument, I can't read the information from it for some reason.
Dataframe:
DF <- data.frame("Name" = c("A","B"), "SN" = 1:2, "Age" = c("21,34,456,567,23,123,34", "15,345,567,3,23,45,67,76,34,34,55,67,78,3"))
My code:
List <- do.call(list, Map(function(DT) {
DT <- as.data.frame(DT)
aa <- as.numeric(strsplit(DT$Age, ","))
mean.aa <- mean(aa)
},
DF))
Trying this I get a list with the column names, but all Values are NULL.
Expected output :
My expected output is a list with length equal to the number of rows in the data frame. Under each list index there should be another list with the age of the corresponding row (and, later, other values from the same row of the data frame).
DF <- apply(data.frame("Name" = c("A","B"), "SN" = 1:2, "Age" = c("21,34,456,567,23,123,34", "15,345,567,3,23,45,67,76,34,34,55,67,78,3"), "mean.aa" = c(179.7143, 100.8571)), 1, as.list)
What am I doing wrong?
Here is one way :
DF <- data.frame("Name" = c("A","B"), "SN" = 1:2, "Age" = c("21,34,456,567,23,123,34", "15,345,567,3,23,45,67,76,34,34,55,67,78,3"))
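apply(DF, 1, function(row){
  # apply() passes each row as a named character vector
  aa <- as.numeric(strsplit(row["Age"], ",")[[1]])
  row["mean.aa"] <- mean(aa)  # stored as a character string, like the other values
  as.list(row)
})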
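Because apply() converts the data frame to a character matrix first, every value in the result above is a string. If the original column types matter, one alternative is to loop over row indices with lapply(). This is only a sketch of that idea, not the answerer's method; the name rows is made up.

```r
DF <- data.frame(Name = c("A", "B"), SN = 1:2,
                 Age = c("21,34,456,567,23,123,34",
                         "15,345,567,3,23,45,67,76,34,34,55,67,78,3"),
                 stringsAsFactors = FALSE)

rows <- lapply(seq_len(nrow(DF)), function(i) {
  row <- as.list(DF[i, ])   # one-row data frame to list, column types kept
  ages <- as.numeric(strsplit(row$Age, ",")[[1]])
  row$mean.aa <- mean(ages)
  row
})
```

Here rows[[1]]$SN stays an integer and rows[[1]]$mean.aa is numeric, so later calculations do not need to convert back from text.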

Serial Subsetting in R

I am working with large datasets. I have to extract values from one dataset, and the identifiers for those values are stored in another dataset. So basically I am subsetting twice for each value of one category. For multiple categories, I have to combine such double-subsetted values. So I am doing something similar to what is shown below, but I think there must be a better way to do it.
example datasets
set.seed(1)
df <- data.frame(number= seq(5020, 5035, 1), value =rnorm(16, 20, 5),
type = rep(c("food", "bar", "sleep", "gym"), each = 4))
df2 <- data.frame(number= seq(5020, 5035, 1), type = rep(LETTERS[1:4], 4))
extract value for grade A
asub_df2 <-subset(df2, type == "A" )
asub_df <-subset(df, number == asub_df2$number)
new_a <- cbind(asub_df, grade = rep(c("A"),nrow(asub_df)))
similarly extract value for grade B in new_b and combine to do any analysis.
can we use
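You can split 'df2' and use lapply. Matching is done with %in% rather than ==, because == would recycle the shorter vector and compare position by position instead of testing membership.
Filter(Negate(is.null),
  lapply(split(df2, df2$type), function(x) {
    x1 <- subset(df, number %in% x$number)
    if(nrow(x1)>0) {
      transform(x1, grade=x$type[1])
    }
  }))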
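Since both datasets share the number identifier, a join is another way to attach the grade to every row in one step. The sketch below uses base merge(); as an assumption for readability, the second column of df2 is named grade here instead of type, so that it does not clash with the type column of df.

```r
set.seed(1)
df <- data.frame(number = seq(5020, 5035, 1), value = rnorm(16, 20, 5),
                 type = rep(c("food", "bar", "sleep", "gym"), each = 4))
# Assumption: the lookup column is called grade rather than type.
df2 <- data.frame(number = seq(5020, 5035, 1), grade = rep(LETTERS[1:4], 4))

# Join on the shared identifier; every row of df gains its grade.
combined <- merge(df, df2, by = "number")

# One data frame per grade, if separate pieces are still needed.
by_grade <- split(combined, combined$grade)
```

With the grade stored as a column of combined, most analyses can work on the single data frame directly and the split step becomes optional.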

Referring to a data frame by a variable name when creating a new column in R

I have a series of ten data frames containing two columns, x and y. I want to add a new column to each data frame containing the name of the data frame. The problem I am running into is how to refer to the data frame using a variable so I can perform this task iteratively. In addition to just referring to it by the variable name, I have also tried get() as follows:
for(i in 1:10){
name <- paste(substr(fileList, 3, 7),i, sep = "")
assign(newName, as.data.frame(get(name)))
get(newName)$Species = c(paste(substr(fileList, 3, 7),i, sep = ""))
}
However, I get the following error when I do so:
Error in get(newName)$Species = c(paste(substr(fileList[a], 3, 7), i, :
could not find function "get<-"
Is there another way to phrase the column assignment command so that I can get around this error, or is the solution more complex?
Here are three different options if you put all your data frames into a named list:
df_list <- list(a = data.frame(x = 1:5),
b = data.frame(x = 1:5))
#Option 1
for (i in seq_along(df_list)){
df_list[[i]][,'Species'] <- names(df_list)[i]
}
#Option 2
tmp <- do.call(rbind,df_list)
tmp$Species <- rep(names(df_list),times = sapply(df_list,nrow))
split(tmp,tmp$Species)
#Option 3
mapply(function(x,y) {x$Species <- y; x},df_list,names(df_list),SIMPLIFY = FALSE)
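If the data frames currently exist as separate objects, the named list these options rely on can be built with mget(), which looks objects up by name. A minimal sketch, assuming two made-up data frames called site_1 and site_2:

```r
# Hypothetical data frames living as separate objects in the workspace.
site_1 <- data.frame(x = 1:3, y = 4:6)
site_2 <- data.frame(x = 7:8, y = 9:10)

# mget() returns a list named after the objects it fetched.
df_list <- mget(c("site_1", "site_2"))

# Map() pairs each data frame with its name and adds the new column.
df_list <- Map(function(d, nm) { d$Species <- nm; d }, df_list, names(df_list))
```

The vector of names passed to mget() could also be generated with paste0() when the objects follow a numbered pattern.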

Resources