R lapply on list of dataframes resetting rownames

You can reset the rownames in a data frame by running
>rownames(df) <- NULL
I have a list of dataframes and want to reset the rownames on every dataframe in the list. I tried
>newlist <- llply(mylist, function(df) { rownames(df) <- NULL })
But it doesn't work: it returns a list of NULLs and the original remains unchanged.

This is a job for the base function lapply; you don't need to load plyr. You also need to make sure that your anonymous function returns something.
df1 <- data.frame(a=1:10)
rownames(df1) <- letters[1:10]
df2 <- data.frame(b=1:10)
rownames(df2) <- LETTERS[1:10]
mylist <- list(df1,df2)
mylist <- lapply(mylist,function(DF) {rownames(DF) <- NULL; DF})
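To see why the original attempt returned NULLs, here is a minimal sketch (the names reset_bad and reset_good are made up for illustration). A function returns the value of its last expression, and the value of an assignment is the assigned value, here NULL. The data frame passed in is also only modified as a local copy.

```r
# Hypothetical helper names, used only for this illustration.
reset_bad  <- function(df) { rownames(df) <- NULL }      # last expression is the assignment
reset_good <- function(df) { rownames(df) <- NULL; df }  # explicitly returns the data frame

df <- data.frame(a = 1:3, row.names = c("x", "y", "z"))

bad  <- reset_bad(df)   # the value of the assignment, i.e. NULL
good <- reset_good(df)  # the modified copy

is.null(bad)    # TRUE
rownames(good)  # default row names "1" "2" "3"
rownames(df)    # the original is untouched: "x" "y" "z"
```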

Use rownames<- :
newlist <- lapply(mylist, "rownames<-", NULL)
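This works because `rownames<-` is an ordinary function that takes the object and the new value and returns the modified object. A small sketch of calling it directly, on a made-up data frame:

```r
df <- data.frame(a = 1:3, row.names = c("x", "y", "z"))

# Calling the replacement function directly returns a modified copy,
# which is what lapply collects for each list element.
out <- `rownames<-`(df, NULL)

rownames(out)  # "1" "2" "3"
rownames(df)   # unchanged: "x" "y" "z"
```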

Related

Pass a function input as column name to data.frame function

I have a function taking a character input. Within the function, I want to use the data.frame() function. Within the data.frame() function, one column name should be the function's character input.
I tried it like this and it didn't work:
frame_create <- function(data, character_input){
...
some_vector <- c(1:50)
temp_frame <- data.frame(character_input = some_vector, ...)
return(temp_frame)
}
Either assign the name afterwards with names(), or use setNames(), because = does not evaluate its left-hand side. With tidyverse constructors such as tibble() or lst(), the column can be named with := and !!.
frame_create <- function(data, character_input){
some_vector <- 1:50
temp_frame <- data.frame(some_vector)
names(temp_frame) <- character_input
return(temp_frame)
}
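For the setNames route mentioned above, a minimal base R sketch could look like this. The data argument is dropped for brevity, and the column name "my_col" is only an example value.

```r
frame_create <- function(character_input) {
  some_vector <- 1:50
  # setNames() returns the data frame with its column names replaced
  setNames(data.frame(some_vector), character_input)
}

res <- frame_create("my_col")
names(res)  # "my_col"
```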
Can you explain your requirement for using a function to create a new dataframe column? If you have a dataframe df and you want to make a copy with a new column appended then the trivial solution is:
df2 <- df
df2$new_col <- 1:50
Example of merging multiple dataframes in R:
cars1 <- mtcars
cars2 <- cars1
cars3 <- cars2
list1 <- list(cars1, cars2, cars3)
all_cars <- Reduce(rbind, list1)

Obtaining a vector with sapply and use it to remove rows from dataframes in a list with lapply

I have a list with dataframes:
df1 <- data.frame(id = seq(1:10), name = LETTERS[1:10])
df2 <- data.frame(id = seq(11:20), name = LETTERS[11:20])
mylist <- list(df1, df2)
I want to remove rows from each dataframe in the list based on a condition (in this case, the value stored in column id). I create an empty vector where I will store the ids:
ids_to_remove <- c()
Then I apply my function:
sapply(mylist, function(df) {
rows_above_th <- df[(df$id > 8),] # select the rows from each df above a threshold
a <- rows_above_th$id # obtain the ids of the rows above the threshold
ids_to_remove <- append(ids_to_remove, a) # append each id to the vector
},
simplify = T
)
However, with or without simplify = T, this returns a matrix, while my desired output (ids_to_remove) would be a vector containing the ids, like this:
ids_to_remove <- c(9,10,9,10)
Because lastly I would use it in this way on single dataframes:
for(i in 1:length(ids_to_remove)){
mylist[[1]] <- mylist[[1]] %>%
filter(!id == ids_to_remove[i])
}
And like this on the whole list (which is not working and I don't get why):
i = 1
lapply(mylist,
function(df) {
for(i in 1:length(ids_to_remove)){
df <- df %>%
filter(!id == ids_to_remove[i])
i = i + 1
}
} )
I guess the errors may be in the append part of the sapply and maybe in the indexing of the lapply. I played around a bit but still couldn't find the errors (or a better way to do this).
EDIT: original data has 70 dataframes (in a list) for a total of 2 million rows
If you are using sapply/lapply, avoid trying to change the values of global variables. Instead, return the values you want. For example, generate the vector of IDs to remove for each item in the list, collected as a list:
ids_to_remove <- lapply(mylist, function(df) {
rows_above_th <- df[(df$id > 8),] # select the rows from each df above a threshold
rows_above_th$id # obtain the ids of the rows above the threshold
})
And then you can use that list with your data list and mapply to iterate the two lists together
mapply(function(data, ids) {
data %>% dplyr::filter(!id %in% ids)
}, mylist, ids_to_remove, SIMPLIFY=FALSE)
Using base R (the \(x, y) lambda shorthand requires R 4.1.0 or later)
Map(\(x, y) subset(x, !id %in% y), mylist, ids_to_remove)
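A self-contained base R sketch of the points above, under the assumption that both data frames have ids 1 to 10 (which is what seq(11:20) in the question actually produces). It shows that append() inside the anonymous function only changes a local copy, and that unlist() gives the flat vector the question asked for.

```r
df1 <- data.frame(id = 1:10, name = LETTERS[1:10])
df2 <- data.frame(id = 1:10, name = LETTERS[11:20])
mylist <- list(df1, df2)

# Assigning inside the function creates a local variable; the global one stays empty.
ids_to_remove <- c()
invisible(lapply(mylist, function(df) {
  ids_to_remove <- append(ids_to_remove, df$id[df$id > 8])
}))
length(ids_to_remove)  # 0

# Return the ids instead, one vector per data frame.
ids_per_df <- lapply(mylist, function(df) df$id[df$id > 8])
ids_flat <- unlist(ids_per_df)  # 9 10 9 10

# Drop the matching rows, pairing each data frame with its own ids.
filtered <- Map(function(df, ids) df[!df$id %in% ids, ], mylist, ids_per_df)
```

On 70 data frames the same pattern applies unchanged, since each data frame is filtered once with a vectorised %in% rather than once per id.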

Merge dataframes stored in two lists of the same length

I have two long lists of large dataframes that are equal in length. I want to merge Dataframe1 (from list1) with Dataframe1 (from list2) and Dataframe2 (from list1) with Dataframe2 (from list2) etc...
Below is a minimal reproducible example and some attempts.
#### EXAMPLE
#Create Dataframes (rbind turns each vector into a row: 2 rows x 5 columns, matching the names set below)
df_1 <- data.frame(rbind(c("Bah",NA,2,3,4),c("Bug",NA,5,6,NA)))
df_2 <- data.frame(rbind(c("Blu",7,8,9,10),c(NA,NA,NA,12,13)))
df_3 <- data.frame(rbind(c("Bah",NA,21,32,43),c("Rgh",NA,51,63,NA)))
df_4 <- data.frame(rbind(c("Gar",7,8,9,10),c("Ghh",NA,NA,121,131)))
#Create Lists
list1 <- list(df_1,df_2)
list2 <- list(df_3,df_4)
#Set column and row names for each dataframe
colnames(list1[[1]]) <- c("SampleID","Measure1","Measure2","Measure3","Measure4")
colnames(list1[[2]]) <- c("SampleID","Measure1","Measure2","Measure3","Measure4")
colnames(list2[[1]]) <- c("SampleID","Measure1","Measure2","Measure3","Measure4")
colnames(list2[[2]]) <- c("SampleID","Measure1","Measure2","Measure3","Measure4")
rownames(list1[[1]]) <- c("1","2")
rownames(list1[[2]]) <- c("1","2")
rownames(list2[[1]]) <- c("1","2")
rownames(list2[[2]]) <- c("1","2")
My desired output is a list of the same length as the input lists but with each dataframe merged by position into a single dataframe. The following yields my desired output for the dataframes and list but is low throughput.
#### DESIRED OUTPUT
DesiredOutput_DF1_Format <- merge(list1[[1]],list2[[1]], all = TRUE, by = "SampleID")
DesiredOutput_DF2_Format <- merge(list1[[2]],list2[[2]], all = TRUE, by = "SampleID")
DesiredOutput_List <- list(DesiredOutput_DF1_Format, DesiredOutput_DF2_Format)
How can I generate an output list in my desired format in a highthroughput way using an apply-like approach?
#### ATTEMPTS
#Attempt1:
attempt1 <- mapply(cbind, list1, list2, simplify=FALSE)
#Attempt2:
My instinct is to use `lapply`, but I can't figure out how to make it iterate through two lists simultaneously.
#Attempt3: Works but the order of the output list appears inverted. This is not intuitive, though it is easily corrected... There has to be a cleaner way.
output_list <- list()
dataset_iterator <- 1:length(list1)
for (x in dataset_iterator) {
df1 <- data.frame(list1[[x]])
df2 <- data.frame(list2[[x]])
df_merged <- data.frame(merge(df1, df2, by = "SampleID", all=TRUE))
output_list <- append(output_list, list(df_merged), 0)
}
Based on the code shown, we may need Map (or mapply with SIMPLIFY = FALSE)
out <- Map(merge, list1, list2, MoreArgs = list(all = TRUE, by = "SampleID"))
Checking against the expected output:
> identical(DesiredOutput_List, out)
[1] TRUE
Or using tidyverse
library(purrr)
library(dplyr)
map2(list1, list2, full_join, by = "SampleID")
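As a small runnable illustration of the Map approach, here is a sketch on made-up data frames (the columns x and y and their values are invented for the example). Map walks the two lists in parallel and passes the MoreArgs entries to every merge call, so the output keeps the input order.

```r
# Illustrative data, not the data from the question.
list1 <- list(data.frame(SampleID = c("a", "b"), x = 1:2),
              data.frame(SampleID = c("c", "d"), x = 3:4))
list2 <- list(data.frame(SampleID = c("b", "e"), y = c(10, 20)),
              data.frame(SampleID = c("c", "d"), y = c(30, 40)))

# Element i of the result is merge(list1[[i]], list2[[i]], by = "SampleID", all = TRUE)
out <- Map(merge, list1, list2, MoreArgs = list(by = "SampleID", all = TRUE))

length(out)      # 2
nrow(out[[1]])   # 3: "a" and "e" appear in only one input, "b" in both
names(out[[1]])  # "SampleID" "x" "y"
```

The inverted order in Attempt 3 comes from the third argument of append: after = 0 inserts each new element at the front of the list.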

iterate over a set of dataframes in R

Let's say I have a set of dataframes: df1, df2, df3, df4. I want to apply some sort of behaviour over each of these dataframes. Rather than copying the code repeatedly, I want to do this through some sort of for loop. For example, let's say I want to take the df and reassign it so that the first column becomes the row names. The normal way I'd do this is:
df1_b <- df1[,-1]
rownames(df1_b) <- df1[,1]
How would I go about doing this to all four dataframes that I have? I imagine I'd need to somehow group the dataframes into a single set and then do something like
for (i in set) {
i+"_b" <- i[,-1]
rownames(i_b) <- i[,1]
}
I tried to do this by combining them with c():
df_set <- c(df1, df2, df3, df4)
for (i in df_set) {
i+"_b" <- i[,-1]
rownames(i_b) <- i[,1]
}
But of course that didn't work (I'm pretty sure R does not do string concatenation like this).
Any help would be appreciated!
We can use mget to get the values of the multiple objects into a list and then do the processing in the list by looping over the list with lapply
lst1 <- lapply(mget(paste0("df", 1:4)), function(x) {
row.names(x) <- x[,1]
x[,-1]
})
If we want to change the original objects (not recommended)
list2env(lst1, .GlobalEnv)
Or another option is tidyverse
library(purrr)
library(tibble)
library(dplyr)
mget(ls(pattern = "^df\\d+$")) %>%
map(~ .x %>%
column_to_rownames(names(.)[1]))
You can apply a function this way for example:
# getting some dummy data
df1 <- mtcars
df2 <- mtcars
df3 <- mtcars
df4 <- mtcars
lst <- list(df1, df2, df3, df4)
# example of applying the function row.names to the data
Map(row.names, lst)
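A self-contained sketch of the mget/lapply approach on two small made-up data frames (the column names key and v are assumptions). It adds drop = FALSE, which matters when only one column is left after removing the first: without it, x[, -1] would return a plain vector rather than a data frame.

```r
df1 <- data.frame(key = c("a", "b"), v = 1:2)
df2 <- data.frame(key = c("c", "d"), v = 3:4)

# mget() collects the objects into a list named after the variables.
lst <- mget(c("df1", "df2"))

lst_b <- lapply(lst, function(x) {
  rownames(x) <- x[[1]]     # first column becomes the row names
  x[, -1, drop = FALSE]     # keep a data frame even with one column left
})

names(lst_b)         # "df1" "df2"
rownames(lst_b$df1)  # "a" "b"
```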

Rbinding large list of dataframes after I did some data cleaning on the list

My problem is that I can't merge a large list of dataframes before doing some data cleaning, but my data cleaning seems to be missing from the list.
I have 43 xlsx-files, which I've put in a list.
Here's my code for that part:
file.list <- list.files(recursive=T,pattern='*.xlsx')
dat = lapply(file.list, function(i){
x = read.xlsx(i, sheet=1, startRow=2, colNames = T,
skipEmptyCols = T, skipEmptyRows = T)
# Create column with file name
x$file = i
# Return data
x
})
I then did some data cleaning. Some of the dataframes had some empty columns that weren't skipped in the loading and some columns I just didn't need.
Example of how I removed one column (X1) from all dataframes in the list:
dat <- lapply(dat, function(x) { x["X1"] <- NULL; x })
I also applied column names:
colnames <- c("ID", "UDLIGNNR","BILAGNR", "AKT", "BA",
"IART", "HTRANS", "DTRANS", "BELOB", "REGD",
"BOGFD", "AFVBOGFD", "VALORD", "UDLIGND",
"UÅ", "AFSTEMNGL", "NRBASIS", "SPECIFIK1",
"SPECIFIK2", "SPECIFIK3", "PERIODE","FILE")
dat <- lapply(dat, setNames, colnames)
My problem is, when I open the list or look at the elements in the list, my data cleaning is missing.
And I can't bind the dataframes before the data cleaning since they don't all have the same structure.
What am I doing wrong here?
EDIT: Sample data*
# Sample data
a <- c("a","b","c")
b <- c(1,2,3)
X1 <- c("", "","")
c <- c("a","b","c")
X2 <- c(1,2,3)
X1 <- c("", "","")
df1 <- data.frame(a,b,c,X1)
df2 <- data.frame(a,b,c,X1,X2)
# Putting in list
dat <- list(df1,df2)
# Removing unwanted columns
dat <- lapply(dat, function(x) { x["X1"] <- NULL; x })
dat <- lapply(dat, function(x) { x["X2"] <- NULL; x })
# Setting column names
colnames <- c("Alpha", "Beta", "Gamma")
dat <- lapply(dat, setNames, colnames)
# Merging dataframes
df <- do.call(rbind,dat)
So I've just found that with my sample data this goes smoothly.
I had to reopen the list in View-mode to see the changes I made. That doesn't change the fact that when writing to csv and reopening, all the data cleaning is missing (I haven't tried this with my sample data).
I am wondering if it's because I've changed the merge?
# My merge when I wrote this question:
df <- do.call("rbindlist", dat)
# My merge now:
df <- do.call(rbind,dat)
When I use my real data it doesn't go as smoothly, so I guess the sample data is bad. I don't know what I'm doing wrong, so I can't give better sample data.
The message I get when merging with rbind:
error in rbind(deparse.level ...) numbers of columns of arguments do not match
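That message means at least one data frame in the list has a different number of columns from the others. One possible way to narrow it down, sketched on made-up data (the column names a, b and X9 are assumptions, and keeping a fixed set of columns is only one option), is to count the columns per element before binding:

```r
# Illustrative list: the second data frame carries one extra, empty column.
dat <- list(data.frame(a = 1:2, b = 3:4),
            data.frame(a = 5:6, b = 7:8, X9 = c("", "")))

# Column count per data frame; any differing value points at the offender.
n_cols <- vapply(dat, ncol, integer(1))
n_cols  # 2 3

# Keep only the columns that are wanted, by name, then bind.
keep <- c("a", "b")
dat_clean <- lapply(dat, function(x) x[, keep, drop = FALSE])
df <- do.call(rbind, dat_clean)
dim(df)  # 4 2
```

Selecting columns by name rather than deleting them one at a time also catches stray columns that were not anticipated.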
