Nested for loops displays only last iteration - r

I have created a function that filters a dataframe based on the unique values of 2 different columns. I'm trying to loop through the unique values of the cell type column (3) as well as the unique values of another column called Cell_line (2). I've created two lists to hold this information and am using a nested loop to step through each. The new dataframe seems to hold only the last iteration of each list (cell_lines and types) and omits the other outputs. How can I obtain these as well? Either a list of dataframes or a single dataframe with all the information bound together would work.
### My function takes a few arguments and gives a new dataframe
myfunction <- function(data_frame,type,Cell) {
  prism <- data_frame %>% # use the argument rather than the global df2
    ungroup() %>%
    filter(., TYPE == type & Cell_Line == Cell) %>%
    pivot_wider(., id_cols = c("Treatment_rep","value","lipid"),
                names_from = Treatment_rep, values_from = value)
  prism$Cell_line <- Cell
  prism
}
### I'm attempting to feed these into my function iteratively
cell_lines <- unique(df2$Cell_Line) ## list of 3 types
types <- unique(df2$TYPE) ### list of 2 types
### nested loop
for (i in 1:length(types)) {
  for (j in 1:length(cell_lines)) {
    newdf <- myfunction(data_frame = df2, type = types[i], Cell = cell_lines[j])
  }
}

newdf is overwritten on every pass of the loop, so only the last result survives. Store each result in a list instead:
dflist <- list()
for (i in seq_along(types)) {
  for (j in seq_along(cell_lines)) {
    newdf <- myfunction(data_frame = df2, type = types[i], Cell = cell_lines[j])
    dflist[[length(dflist) + 1]] <- newdf # append this result to the list
  }
}
And if afterwards you want to bind them all together:
df_total <- do.call(rbind, dflist)
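As a rough alternative sketch, the same collection can be built without growing a list by hand: enumerate every type and cell-line pair with expand.grid, then call the function once per pair with lapply. The names toy_df and filter_pair below are hypothetical stand-ins for df2 and myfunction (they are not from the question), and only base R is used.

```r
# Toy stand-in for df2 (hypothetical data, not from the question)
toy_df <- data.frame(
  TYPE      = rep(c("A", "B"), each = 6),
  Cell_Line = rep(c("c1", "c2", "c3"), times = 4),
  value     = 1:12,
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for myfunction(): keep rows for one type and one cell line
filter_pair <- function(data_frame, type, Cell) {
  data_frame[data_frame$TYPE == type & data_frame$Cell_Line == Cell, ]
}

# One row per (type, cell line) combination
combos <- expand.grid(
  type = unique(toy_df$TYPE),
  Cell = unique(toy_df$Cell_Line),
  stringsAsFactors = FALSE
)

# Call the function once per combination; the result is a list of data frames
dflist <- lapply(seq_len(nrow(combos)), function(k) {
  filter_pair(toy_df, combos$type[k], combos$Cell[k])
})

# Bind everything into a single data frame
df_total <- do.call(rbind, dflist)
```

Because lapply returns one element per combination, nothing is overwritten and there is no index bookkeeping to get wrong.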

Related

Obtaining a vector with sapply and use it to remove rows from dataframes in a list with lapply

I have a list with dataframes:
df1 <- data.frame(id = seq(1:10), name = LETTERS[1:10])
df2 <- data.frame(id = seq(11:20), name = LETTERS[11:20])
mylist <- list(df1, df2)
I want to remove rows from each dataframe in the list based on a condition (in this case, the value stored in column id). I create an empty vector where I will store the ids:
ids_to_remove <- c()
Then I apply my function:
sapply(mylist, function(df) {
rows_above_th <- df[(df$id > 8),] # select the rows from each df above a threshold
a <- rows_above_th$id # obtain the ids of the rows above the threshold
ids_to_remove <- append(ids_to_remove, a) # append each id to the vector
},
simplify = T
)
However, with or without simplify = T, this returns a matrix, while my desired output (ids_to_remove) would be a vector containing the ids, like this:
ids_to_remove <- c(9,10,9,10)
Because lastly I would use it in this way on single dataframes:
for(i in 1:length(ids_to_remove)){
mylist[[1]] <- mylist[[1]] %>%
filter(!id == ids_to_remove[i])
}
And like this on the whole list (which is not working and I don't get why):
i = 1
lapply(mylist,
function(df) {
for(i in 1:length(ids_to_remove)){
df <- df %>%
filter(!id == ids_to_remove[i])
i = i + 1
}
} )
I guess the errors may be in the append part of the sapply and maybe in the indexing of the lapply. I played around a bit but still couldn't find the errors (or a better way to do this).
EDIT: original data has 70 dataframes (in a list) for a total of 2 million rows
If you are using sapply/lapply you want to avoid trying to change the values of global variables. Instead, you should return the values you want. For example, generate a vector of IDs to remove for each item in the list, collected as a list:
ids_to_remove <- lapply(mylist, function(df) {
  rows_above_th <- df[(df$id > 8),] # select the rows from each df above a threshold
  rows_above_th$id # obtain the ids of the rows above the threshold
})
And then you can use that list with your data list and mapply to iterate over the two lists together:
mapply(function(data, ids) {
  data %>% dplyr::filter(!id %in% ids)
}, mylist, ids_to_remove, SIMPLIFY=FALSE)
Using base R
Map(\(x, y) subset(x, !id %in% y), mylist, ids_to_remove)
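A self-contained sketch of the same idea in base R may help to check the result. It assumes both data frames have ids 1 to 10, which is what seq(1:10) and seq(11:20) both evaluate to in the question; the names cleaned and flat_ids are introduced here for illustration only.

```r
# Same shape as the question's data: both id columns run from 1 to 10
mylist <- list(
  data.frame(id = 1:10, name = LETTERS[1:10]),
  data.frame(id = 1:10, name = LETTERS[11:20])
)

# One vector of ids per data frame, returned rather than appended to a global
ids_to_remove <- lapply(mylist, function(df) df$id[df$id > 8])

# Walk the data frames and their id vectors in parallel
cleaned <- Map(function(df, ids) df[!df$id %in% ids, ], mylist, ids_to_remove)

# If a single flat vector is wanted, as in the question, unlist the list
flat_ids <- unlist(ids_to_remove)
```

Keeping the ids as a list, one element per data frame, means each data frame is filtered only by its own ids; unlist is needed only when a single vector is really required.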

Using Dataframe to Automatically create a list of values based off Subproduct

df <- data.frame("date"=
1:4,"product"=c("B","B","A","A"),"subproduct"=c("1","2","x","y"),"actuals"=1:4)
#creates df1,df2,dfx,dfy
for(i in unique(df$subproduct)) {
nam <- paste("df", i, sep = ".")
assign(nam, df[df$subproduct==i,])
}
# CREATES LIST OF DATAFRAMES
# How do I make this so i don't have to manually type list(df.,df.,df.)
list_df <- list(df.1,df.2,df.x,df.y) %>%
lapply( function(x) x[(names(x) %in% c("date", "actuals"))])
# creates df1,df2,df3,df4 only dates and actuals, removes the other column names
for (i in 1:length(list_df)) {
assign(paste0("df", i), as.data.frame(list_df[[i]]))
}
The first for loop creates one df object per unique subproduct. For the list() call, I don't want to have to type df.1, df.2, df.x, df.y and so on by hand: with 100 unique subproducts in my data that would not be practical. How would I best do this? (1st question)
The last for loop creates separate dataframe objects with only date and actuals, which will be used to create a time series for each. How can I put the values of these objects into a single dataframe or a list of dfs? (2nd question)
We can use mget to return the values of the objects whose names are returned by ls. The pattern matches object names that start with 'df' followed by a '.' and one or more alphanumeric characters
mget(ls(pattern = '^df\\.[[:alnum:]]+$'))
If the OP wanted to create those objects in a different env
new_env <- new.env()
list2env(mget(ls(pattern = '^df\\.[[:alnum:]]+$')), envir = new_env)
If we want to create new objects from scratch, do a group_split on the 'subproduct' column, set the names accordingly, and create multiple objects (list2env - not recommended though)
library(dplyr)
library(stringr)
df %>%
group_split(subproduct) %>%
setNames(str_c('df.', c(1, 2, 'x', 'y'))) %>%
list2env(.GlobalEnv)
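If the separate objects are not needed at all, a minimal base R sketch is to split the data frame directly into a named list, keeping only the date and actuals columns. This is an alternative to the approach above rather than a restatement of it; the 'df.' prefix on the names simply mirrors the question.

```r
df <- data.frame(
  date = 1:4,
  product = c("B", "B", "A", "A"),
  subproduct = c("1", "2", "x", "y"),
  actuals = 1:4
)

# Split only the columns of interest, one list element per subproduct
list_df <- split(df[c("date", "actuals")], df$subproduct)

# Optional: prefix the names so they match the df.1, df.x style of the question
names(list_df) <- paste0("df.", names(list_df))

# Individual pieces are then reached by name, without assign() or get()
df_x <- list_df[["df.x"]]
```

This scales to any number of subproducts, since nothing has to be typed out per group.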

for loops nested in R

I have a dataset dt that stores a list of dataset names. I need to use them to create new datasets by selecting some variables, then use the datasets I just created to repeat the same process, and so on.
The first and second rows refer to data that is already available.
The third row uses available data to create a new dataset.
The fourth row uses the dataset just created to create another new dataset.
The final output is a list of datasets.
I would appreciate any help or suggestions.
dt <- data.frame(name = c("mtcars","iris", "mtcars_new","mtcars_new_1"),
data_source = c("mtcars","iris", "mtcars","mtcars_new"),
variable = c("","","mpg,cyl,am,hp","mpg,cyl"), stringsAsFactors = FALSE)
> dt
          name data_source      variable
1       mtcars      mtcars
2         iris        iris
3   mtcars_new      mtcars mpg,cyl,am,hp
4 mtcars_new_1  mtcars_new       mpg,cyl
dt_list <- list(mtcars, iris)
names(dt_list ) <- c("mtcars","iris")
# The final list of datasets
final_dt <- list(mtcars, iris, mtcars_new, mtcars_new_1)
So far, with the loop below, I get only the mtcars_new dataset, and I don't know how to add it back to the list and continue looping to get mtcars_new_1 and so on. I have many datasets, and I don't know how many times I would need to loop through the nested data.
mtcars_new <- data.frame()
for(i in 1:nrow(dt)){
if(dt$data_source[[i]] %in% names(dt_list) && !dt$name[[i]] %in% names(dt_list)){
check <- eval(parse(text = dt$data_source[[i]]))
var <- c(unlist(strsplit(dt$variable[[i]],",")))
mtcars_new <- check[, colnames(check) %in% var]
}
}
This will produce the desired output shown. Since the fourth loop uses the data created in the third loop, you need to have a way to append the results of each loop to a growing list of available data sets. Then within each loop find which one is the right starting data set from the available list.
dt <- data.frame(name = c("mtcars","iris", "mtcars_new","mtcars_new_1"),
data_source = c("mtcars","iris", "mtcars","mtcars_new"),
variable = c("","","mpg,cyl,am,hp","mpg,cyl"), stringsAsFactors = FALSE)
input_data_sets <- list(mtcars, iris)
names(input_data_sets) <- c("mtcars","iris")
final_data_sets <- list()
for(i in 1:nrow(dt)) {
  available_data_sets <- c(input_data_sets, final_data_sets) #Grows a list of all available data sets
  num_to_use <- which(dt$data_source[[i]] == names(available_data_sets)) #finds the right list member to use
  temp <- available_data_sets[num_to_use][[1]]
  var <- c(unlist(strsplit(dt$variable[[i]],",")))
  if (length(var) == 0) var <- names(temp) #an empty variable string means keep every column
  temp <- list(subset(temp, select = var)) #keep only the desired variables
  names(temp) <- dt$name[i] #assign the name provided
  final_data_sets <- c(final_data_sets, temp) #add to list of final data sets which will be the output. Anything listed here will become part of the available list in the next loop
}
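The same idea can be sketched more compactly by keeping one named list and looking each source up by name. The helper build_data_sets and its argument names are hypothetical, introduced here only for illustration, and the sketch assumes every data_source appears either in the inputs or in an earlier row of the spec.

```r
# Hypothetical helper: build each data set in `spec` from one that already exists
build_data_sets <- function(spec, inputs) {
  available <- inputs  # named list that grows as new data sets are created
  for (i in seq_len(nrow(spec))) {
    source_df <- available[[spec$data_source[i]]]
    vars <- unlist(strsplit(spec$variable[i], ","))
    if (length(vars) == 0) vars <- names(source_df)  # empty string: keep all columns
    available[[spec$name[i]]] <- source_df[, vars, drop = FALSE]
  }
  available[spec$name]  # return the data sets in the order listed in the spec
}

dt <- data.frame(
  name = c("mtcars", "iris", "mtcars_new", "mtcars_new_1"),
  data_source = c("mtcars", "iris", "mtcars", "mtcars_new"),
  variable = c("", "", "mpg,cyl,am,hp", "mpg,cyl"),
  stringsAsFactors = FALSE
)

final_dt <- build_data_sets(dt, list(mtcars = mtcars, iris = iris))
column_counts <- sapply(final_dt, ncol)
```

Because each new data set is stored under its own name as soon as it is built, later rows can refer to it without knowing in advance how deep the chain goes.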

How to add column with a specific value over a list of dataframes

I have a list of dataframes. I want to add a column to each dataframe with a fixed value. Here is the planned input of the function:
ar_data <- add_cols(ar_data, c("Data_source1", "Data_source2", "Data_source3"))
For example, the first dataframe in the list of dataframes (ar_data) needs a column added (the column is to be named 'type' across all dataframes in the list) with a value of "Data_source1". The second dataframe in the list will have a column added with a value of "Data_source2", and so on...
Here is my attempted function:
add_cols <- function(data, col_value) {
data <- map2(data, col_value, function(x, y) x['type'] = y)
return(data)
}
However it is not working as planned. Any ideas why?
You can use Map :
add_cols <- function(data, col_value) {
  Map(cbind, data, type = col_value)
  #Using `map2` from `purrr`
  #purrr::map2(data, col_value, ~cbind(.x, type = .y))
}
add_cols(ar_data, c("Data_source1", "Data_source2", "Data_source3"))
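In your attempt, the anonymous function returns the result of the assignment (the value y) rather than the modified data frame, so you need to return the data frame after adding the column. This should work: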
data <- map2(data, col_value, function(x, y) {x['type'] = y;x})
We can also use lapply
lapply(seq_along(data), function(i) transform(data[[i]], type = col_value[i]))
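A small self-contained check of the approach, using a made-up three-element list in place of ar_data (the column x and the name tagged are assumptions for illustration):

```r
# Hypothetical stand-in for ar_data
ar_data <- list(
  data.frame(x = 1:2),
  data.frame(x = 3:4),
  data.frame(x = 5:6)
)
labels <- c("Data_source1", "Data_source2", "Data_source3")

# Add the column and return the whole data frame from the function
tagged <- Map(function(df, label) {
  df$type <- label
  df
}, ar_data, labels)

# Stack the tagged pieces to confirm each row carries its own source label
stacked <- do.call(rbind, tagged)
```

The important detail is the final df inside the function: without it, the function would return only the assigned value and the data frame would be lost.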

Using a loop to select a column names from a list

I've been struggling with column selection with lists in R. I've loaded a bunch of csv's (all with different column names and different number of columns) with the goal of extracting all the columns that have the same name (just phone_number, subregion, and phonetype) and putting them together into a single data frame.
I can get the columns I want out of one list element with this;
var<-data[[1]] %>% select("phone_number","Subregion", "PhoneType")
But I cannot select the columns from all the elements in the list this way, just one at a time.
I then tried a for loop that looks like this:
new.function <- function(a) {
for(i in 1:a) {
tst<-datas[[i]] %>% select("phone_number","Subregion", "PhoneType")
}
print(tst)
}
But when I try:
new.function(5)
I'll only get the columns from the 5th element.
I know this might seem like a noob question for most, but I am struggling to learn lists and loops and R. I'm sure I'm missing something very easy to make this work. Thank you for your help.
Another way you could do this is to make a function that extracts your columns and apply it to all data.frames in your list with lapply:
library(dplyr)
extractColumns = function(x){
  select(x,"phone_number","Subregion", "PhoneType")
  #or x[,c("phone_number","Subregion","PhoneType")]
}
final_df = lapply(data,extractColumns) %>% bind_rows()
The way you have your loop set up currently is only saving the last iteration of the loop because tst is not set up to store more than a single value and is overwritten with each step of the loop.
You can establish tst as a list first with:
tst <- list()
Then in your code be explicit that each step is saved as a separate element in the list by adding brackets and an index to tst. Here is a full example, done the way you were doing it.
#Example data.frame that could be in datas
df_1 <- data.frame("not_selected" = rep(0, 5),
"phone_number" = rep("1-800", 5),
"Subregion" = rep("earth", 5),
"PhoneType" = rep("flip", 5))
# Another bare data.frame that could be in datas
df_2 <- data.frame("also_not_selected" = rep(0, 5),
"phone_number" = rep("8675309", 5),
"Subregion" = rep("mars", 5),
"PhoneType" = rep("razr", 5))
# Datas is a list of data.frames, we want to pull only specific columns from all of them
datas <- list(df_1, df_2)
#create list to store new data.frames in once columns are selected
tst <- list()
#Function for looping through 'a' elements
new.function <- function(a) {
  for(i in 1:a) {
    tst[[i]] <- datas[[i]] %>% select("phone_number","Subregion", "PhoneType")
  }
  print(tst)
}
#Proof of concept for 2 elements
new.function(2)
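For comparison, a minimal base R sketch that selects the shared columns from every element and stacks the results into one data frame, with no explicit loop. The data here are cut-down, made-up versions of the example data frames above, and the names wanted, selected and combined are introduced for illustration.

```r
# Cut-down stand-ins for the data frames in datas
datas <- list(
  data.frame(not_selected = 0, phone_number = "1-800",
             Subregion = "earth", PhoneType = "flip"),
  data.frame(also_not_selected = 0, phone_number = "8675309",
             Subregion = "mars", PhoneType = "razr")
)

wanted <- c("phone_number", "Subregion", "PhoneType")

# Keep only the shared columns in every list element
selected <- lapply(datas, function(d) d[, wanted, drop = FALSE])

# Stack the pieces into a single data frame
combined <- do.call(rbind, selected)
```

Since every element of selected has the same three columns, rbind can stack them directly even though the original data frames had different layouts.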
