R: use grep to clean a column in a list of lists

I have a large data set stored as a list of lists that may be simplified thus:
list1 <- list(1,"bob", "age=14;years")
list2 <- list(2,"bill", "age=24;years")
list3 <- list(3,"bert", "age=36;years")
data.list <- list(list1, list2, list3)
I wish to clean the third column such that I have only the numeric value of age.
This can be done with the following function that returns a new list:
clean <- function(x){
x <- as.numeric(gsub('.*age=(.*?);.*','\\1', x[3]))
}
data.age <- lapply(data.list, clean)
But how may I either
a) directly clean the column to return the value
or
b) replace the original column [3] with the data.age column?

Your function returns only the cleaned value because its last expression is the assignment. Assign the result to x[[3]] and return x, so that the whole list comes back:
clean <- function(x){
x[[3]] <- as.numeric(gsub('.*age=(.*?);.*','\\1', x[[3]]))
x
}
data.age <- lapply(data.list, clean)
This should do the trick: each inner list keeps its first two elements, and the third becomes the numeric age.
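As a quick check, here is a minimal sketch that runs the corrected function on the sample data from the question and also covers option (a), pulling the ages out directly. The names data.clean and ages, and the stricter pattern used for ages, are illustrative assumptions and not part of the original answer.

```r
# Sample data from the question
list1 <- list(1, "bob", "age=14;years")
list2 <- list(2, "bill", "age=24;years")
list3 <- list(3, "bert", "age=36;years")
data.list <- list(list1, list2, list3)

# (b) Replace the third element in place and return the whole inner list
clean <- function(x) {
  x[[3]] <- as.numeric(gsub('.*age=(.*?);.*', '\\1', x[[3]]))
  x
}
data.clean <- lapply(data.list, clean)

# (a) Extract only the ages as a plain numeric vector
ages <- sapply(data.list, function(x) {
  as.numeric(sub(".*age=([0-9]+);.*", "\\1", x[[3]]))
})

str(data.clean[[1]])
ages
```

Note that x[3] returns a one-element list while x[[3]] returns the element itself, which is why the corrected function uses double brackets.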

Related

How to change column names of many dataframes in R?

I would like to make the same changes to the column names of many dataframes. Here's an example:
library(stringr)  # provides str_replace_all
ChangeNames <- function(x) {
colnames(x) <- toupper(colnames(x))
colnames(x) <- str_replace_all(colnames(x), pattern = "_", replacement = ".")
return(x)
}
files <- list(mtcars, nycflights13::flights, nycflights13::airports)
lapply(files, ChangeNames)
I know that lapply only changes a copy. How do I change the underlying dataframe? I want to still use each dataframe separately.
Create a named list (dplyr::lst names each element after its variable), apply the function, and use list2env to write the results back over the original dataframes.
library(nycflights13)
files <- dplyr::lst(mtcars, flights, airports)
result <- lapply(files, ChangeNames)
list2env(result, .GlobalEnv)
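The same idea can be sketched without any packages. The data frames df_a and df_b, the helper change_names, and the separate environment env are made up for illustration; the example writes into a fresh environment so that nothing in your workspace is overwritten, whereas the answer above targets .GlobalEnv.

```r
# Two small, hypothetical data frames standing in for the real ones
df_a <- data.frame(col_one = 1:2, col_two = 3:4)
df_b <- data.frame(my_id = 1:3)

# Base R equivalent of the renaming function: upper-case, then "_" -> "."
change_names <- function(x) {
  names(x) <- gsub("_", ".", toupper(names(x)), fixed = TRUE)
  x
}

# The list must be named: list2env uses the names as variable names
frames <- list(df_a = df_a, df_b = df_b)
result <- lapply(frames, change_names)

# Write the renamed copies into an environment (use .GlobalEnv to
# overwrite the originals, as in the answer above)
env <- new.env()
list2env(result, envir = env)

names(env$df_a)  # "COL.ONE" "COL.TWO"
```

The key point is the names on the list: an unnamed list gives list2env nothing to assign to.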

Add file names to columns in a list of dataframes

I am importing a list of several dataframes using a custom function. I want to take the name of the imported file (e.g. file1 from file1.csv) and add it onto all of the column names in that dataframe. In this example, all column names will look like this:
# Column names as they are
q1 q2 q3
# Column names with added name of the file they come from
q1_file1 q2_file1 q3_file1
This is what I've tried, but it doesn't work (the list ends up having 0 dataframes):
my_function<- function (x) {
df <- read.csv(x)
tag <- sub('\\.csv$', '', x)
colnames(df) <- paste0(tag, colnames(df))
}
lapply(my_list, my_function)
Thanks!
It can be done by adding the tag as a new column (note that the function has to return df):
#Code
tucson_function<- function (x) {
df <- read.csv(x)
tag <- sub('\\.csv$', '', x)
df$tag <- tag
return(df)
}
Or, to put the tag into the column names themselves (here as a prefix):
#Code
tucson_function<- function (x) {
df <- read.csv(x)
tag <- sub('\\.csv$', '', x)
names(df) <- paste0(tag,'.',names(df))
return(df)
}
We can use transform with tools::file_path_sans_ext to create a column
my_function<- function(x) {
df <- read.csv(x)
transform(df, tag = tools::file_path_sans_ext(x))
}
and then call the function with lapply
lapply(my_list, my_function)
In the OP's function, the issue is that a function returns the value of its last expression, and here that is the column-names assignment, so the result is the vector of new names and not the data. We need to return the data, i.e. 'df':
my_function<- function (x) {
df <- read.csv(x)
tag <- sub('\\.csv$', '', x)
colnames(df) <- paste0(colnames(df), "_", tag)  # q1 -> q1_file1, as asked in the question
df
}
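To see the whole thing end to end, here is a self-contained sketch that writes two small CSV files to a temporary folder and reads them back with the file name appended to every column. The file names, the helper add_suffix, and the use of basename (so that the folder is not part of the tag) are assumptions made for this example.

```r
# Create two throwaway CSV files in a temporary folder
tmp <- tempfile("demo_")
dir.create(tmp)
write.csv(data.frame(q1 = 1, q2 = 2, q3 = 3),
          file.path(tmp, "file1.csv"), row.names = FALSE)
write.csv(data.frame(q1 = 4, q2 = 5, q3 = 6),
          file.path(tmp, "file2.csv"), row.names = FALSE)

my_list <- list.files(tmp, pattern = "\\.csv$", full.names = TRUE)

# Read one file and append its base name (without extension) to each column
add_suffix <- function(path) {
  df <- read.csv(path)
  tag <- tools::file_path_sans_ext(basename(path))
  names(df) <- paste0(names(df), "_", tag)
  df  # return the data frame, not the result of the assignment
}

dfs <- lapply(my_list, add_suffix)
# Name the list elements after the files as well
names(dfs) <- tools::file_path_sans_ext(basename(my_list))

names(dfs$file1)  # "q1_file1" "q2_file1" "q3_file1"
```

Using basename matters when full.names = TRUE, because otherwise the directory ends up inside the column names.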

Using lapply variable in read.csv

I'm just getting used to using lapply and I've been trying to figure out how I can use names from a vector to append within filenames I am calling, and naming new dataframes. I understand I can use paste to call the files of interest, but not sure I can create the new dataframes with the _var name appended.
site_list <- c("site_a","site_b", "site_c","site_d")
lapply(site_list,
function(var) {
# paste0 is used because paste would insert spaces into the file path
all_var <- read.csv(paste0("I:/Results/",var,"_all.csv"))
tbl_var <- read.csv(paste0("I:/Results/",var,"_tbl.csv"))
rsid_var <- read.csv(paste0("I:/Results/",var,"_rsid.csv"))
return(var)
})
With lapply, it usually makes more sense to apply a function to each list element and return a list, in which the results are stored and can be named. For example, using split to group each site's files so that they are processed together:
files <- list.files(path= "I:/Results/", pattern = "site_[abcd]_.*csv", full.names = TRUE)
files <- split(files, gsub(".*site_([abcd]).*", "\\1", files))
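processFiles <- function(x){
  # fixed = TRUE matches the file suffix literally, not as a regular expression
  all <- read.csv(x[grep("_all.csv", x, fixed = TRUE)])
  rsid <- read.csv(x[grep("_rsid.csv", x, fixed = TRUE)])
  tbl <- read.csv(x[grep("_tbl.csv", x, fixed = TRUE)])
  # do more stuff here; as a placeholder, return the three data frames
  list(all = all, rsid = rsid, tbl = tbl)
}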
res <- lapply(files, processFiles)
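If you prefer to start from the vector of site names, as in the question, a minimal sketch could look like the following. It builds dummy files in a temporary folder so that it runs anywhere; results_dir, kinds, and read_site are hypothetical names, and the real I:/Results/ folder is not touched.

```r
# Dummy files in a temporary folder stand in for I:/Results/
results_dir <- tempfile("results_")
dir.create(results_dir)
site_list <- c("site_a", "site_b")
kinds <- c("all", "tbl", "rsid")
for (s in site_list) {
  for (k in kinds) {
    write.csv(data.frame(site = s, kind = k),
              file.path(results_dir, paste0(s, "_", k, ".csv")),
              row.names = FALSE)
  }
}

# Read the three files of one site into a named list
read_site <- function(site) {
  paths <- file.path(results_dir, paste0(site, "_", kinds, ".csv"))
  setNames(lapply(paths, read.csv), kinds)
}

# One named entry per site, each holding its three data frames
res <- setNames(lapply(site_list, read_site), site_list)

res$site_a$tbl
```

Instead of creating variables such as all_site_a, the data frames are reached as res$site_a$all, which keeps everything in one object and is easy to loop over later.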

Usage of iteration element within name pattern

I would like to iterate with a for loop through a list, applying the following function to all list elements:
new_x = do.call("rbind",mget(ls(pattern = "^x.*")))
where x is a certain name pattern of a dataframe.
How do I iterate through a list where the list element i is the name pattern for my function?
The goal would be to get something like this:
for (i in filenames){
i = do.call("rbind",mget(ls(pattern = "^i.*")))
}
So my question is basically how to use i within a name pattern, so that I can use the loop to rbind together the separate parts of a dataframe: xpart1, xpart2, xpart3 to x; ypart1, ypart2, ypart3 to y; and so on.
Thank you in advance!
If we use a for loop, one option is:
v1 <- ls(pattern = "^x.*")
lst1 <- vector('list', length(v1))
for(i in seq_along(v1)){
lst1[[i]] <- get(v1[i])
}
do.call(rbind, lst1)
Or if we need to use i to create the pattern, we can use paste
lst1 <- vector('list', length(filenames))
names(lst1) <- filenames
for(i in filenames){
lst1[[i]] <- get(ls(pattern = paste0("^", i, ".*")))  # assumes exactly one object matches
}
do.call(rbind, lst1)
NOTE: get returns the value of a single object, whereas mget returns several objects in a list. The for loop above assumes that each pattern matches exactly one object, so get is sufficient there.
Based on the OP's clarification, we can also use mget
xs <- paste0("xpart", 1:100)
ys <- paste0("ypart", 1:100)
xsdat <- do.call(rbind, mget(xs))
ysdat <- do.call(rbind, mget(ys))
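To combine both ideas (a loop over name prefixes, with several parts per prefix), one possible sketch is below. The tiny data frames, the "part<number>" naming scheme, and the result list combined are assumptions based on the question's description; the environment is passed explicitly so that the lookup does not depend on where the code is run.

```r
# Hypothetical pieces: xpart1, xpart2 belong to x; ypart1, ypart2 to y
xpart1 <- data.frame(v = 1:2)
xpart2 <- data.frame(v = 3:4)
ypart1 <- data.frame(v = 5)
ypart2 <- data.frame(v = 6)

filenames <- c("x", "y")
env <- environment()  # the environment that holds the pieces

combined <- list()
for (i in filenames) {
  # Build the pattern from i, e.g. "^xpart[0-9]+$"
  part_names <- ls(envir = env, pattern = paste0("^", i, "part[0-9]+$"))
  # mget fetches all matching objects; rbind stacks them
  combined[[i]] <- do.call(rbind, mget(part_names, envir = env))
}

combined$x
```

Keep in mind that ls returns names in alphabetical order, so with ten or more parts xpart10 would come before xpart2; sort the names numerically first if row order matters.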

Run a loop based on vector elements

I have to run a loop based on certain vector values. Some example code and data is shown below:
library(tibble)  # provides rownames_to_column
list_store <- list()
vec <- c(3,2,3)
data_list <- lapply(list(head(mtcars,10), head(mtcars,15), head(mtcars,20), head(mtcars, 9),
head(mtcars,14), head(mtcars,18), head(mtcars,20), head(mtcars,10)),
function(x) rownames_to_column(x))
data_list1 <- lapply(list(head(mtcars,7), head(mtcars,8), head(mtcars,10)), function(x) rownames_to_column(x))
result <- lapply(data_list, function(i){
list_store[[length(list_store) + 1]] <- merge(i, data_list1[[1]], all.y = TRUE)
})
What I want is to merge the first three dataframes of data_list with the first dataframe of data_list1, the next two with the second, and the last three with the third. My code merges every element of data_list with the first element of data_list1, but I want the element of data_list1 to change according to vec.
I could write a loop that keeps track of i, j and so on, but I would like to know whether there is a more efficient way.
We repeat each position of 'vec' as many times as its value (rep(seq_along(vec), vec)) and use the result as a grouping index to split 'data_list' into 3 groups, each a list of dataframes. Map then pairs each group with the corresponding element of 'data_list1'; within each pair, lapply merges every dataframe of the group with that element. Finally, do.call with c flattens the nested result back into the normal list structure of 'data_list'.
do.call(c,
Map(function(x,y) lapply(x, function(dat)
merge(dat, y, all.y = TRUE)),
split(data_list, rep(seq_along(vec), vec)),
data_list1))
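A small sketch with made-up data makes the grouping step easier to follow. The lists left and right are hypothetical stand-ins for data_list and data_list1, and unname is added here only to drop the group-derived names from the final list.

```r
vec <- c(3, 2, 3)

# Grouping index: position 1 three times, position 2 twice, position 3 three times
grp <- rep(seq_along(vec), vec)
grp  # 1 1 1 2 2 3 3 3

# Eight small data frames to be merged against three others
left  <- lapply(1:8, function(k) data.frame(id = 1:3, left = k))
right <- lapply(1:3, function(k) data.frame(id = 1:2, right = k * 10))

merged <- do.call(c,
  Map(function(x, y) lapply(x, function(dat) merge(dat, y, all.y = TRUE)),
      split(left, grp),
      right))
merged <- unname(merged)

# Which element of 'right' each result was merged with
sapply(merged, function(d) d$right[1])  # 10 10 10 20 20 30 30 30
```

The sum of vec has to equal the length of the list being split, otherwise split recycles or truncates the grouping index.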
