Add file names to columns in a list of dataframes - r

I am importing a list of several dataframes using a custom function. I want to take the name of the imported file (e.g. file1 from file1.csv) and add it onto all of the column names in that dataframe. In this example, all column names will look like this:
# Column names as they are
q1 q2 q3
# Column names with added name of the file they come from
q1_file1 q2_file1 q3_file1
This is what I've tried, but it doesn't work (the list ends up having 0 dataframes):
my_function<- function (x) {
df <- read.csv(x)
tag <- sub('\\.csv$', '', x)
colnames(df) <- paste0(tag, colnames(df))
}
lapply(my_list, my_function)
Thanks!

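Either add the file name as a new column and return the data frame:
#Code
tucson_function<- function (x) {
  df <- read.csv(x)
  tag <- sub('\\.csv$', '', x)
  df$tag <- tag
  return(df)
}
Or append the file name to the column names:
#Code
tucson_function<- function (x) {
  df <- read.csv(x)
  tag <- sub('\\.csv$', '', x)
  # name first, then the tag, to get q1_file1 as asked in the question
  names(df) <- paste0(names(df),'_',tag)
  return(df)
}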

We can use transform with tools::file_path_sans_ext to create a column
my_function<- function(x) {
df <- read.csv(x)
transform(df, tag = tools::file_path_sans_ext(x))
}
and then call the function with lapply
lapply(my_list, my_function)
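In the OP's function the issue is that the last expression is the column names assignment, so the function returns the assigned value (the character vector of new names, invisibly) instead of the data frame. We need to return the data, i.e. 'df', as the last expression. Also note that paste0(tag, colnames(df)) gives file1q1; to get q1_file1 the column name goes first, followed by an underscore and the tag.
my_function<- function (x) {
  df <- read.csv(x)
  tag <- sub('\\.csv$', '', x)
  colnames(df) <- paste0(colnames(df), '_', tag)
  df
}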
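As a minimal, self-contained sketch of the approach above: the example writes two throwaway CSV files to a temporary directory, reads each one, and appends the file name to every column name. The file names file1.csv and file2.csv, the folder name, and the helper name add_file_suffix are made up for illustration. Using basename() so that a directory in the path does not leak into the column names is an assumption about what is wanted.

```r
# Hypothetical demo data: two small CSV files in a temporary directory
tmp_dir <- file.path(tempdir(), "suffix_demo")
dir.create(tmp_dir, showWarnings = FALSE)
write.csv(data.frame(q1 = 1:2, q2 = 3:4, q3 = 5:6),
          file.path(tmp_dir, "file1.csv"), row.names = FALSE)
write.csv(data.frame(q1 = 7:8, q2 = 9:10, q3 = 11:12),
          file.path(tmp_dir, "file2.csv"), row.names = FALSE)

# Read one file and append its name (no directory, no extension) to each column
add_file_suffix <- function(path) {
  df <- read.csv(path)
  tag <- tools::file_path_sans_ext(basename(path))
  colnames(df) <- paste0(colnames(df), "_", tag)
  df  # the last expression is the function's return value
}

my_list <- list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE)
dfs <- lapply(my_list, add_file_suffix)
# Name the list elements after the files as well
names(dfs) <- tools::file_path_sans_ext(basename(my_list))

colnames(dfs[["file1"]])
```

Naming the list elements is optional, but it makes it easier to pick out a data frame by file later on.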

Related

List elements getting overwritten in for loop R?

I have a bunch of csv files that I'm trying to read into R all at once, with each data frame from a csv becoming an element of a list. The loops largely work, but they keep overwriting the list elements. So, for example, if I loop over the first 2 files, both list[[1]] and list[[2]] will contain the data frame for the second file.
#function to open one group of files named with "cores"
open_csv_core<- function(year, orgtype){
file<- paste(year, "/coreco.core", year, orgtype, ".csv", sep = "")
df <- read.csv(file)
names(df) <- tolower(names(df))
df <- df[df$ntee1 %in% c("C","D"),]
df<- df[!(df$nteecc %in% c("D20","D40", "D50", "D60", "D61")),]
return(df)
}
#function to open one group of files named with "nccs"
open_csv_nccs<- function(year, orgtype){
file2<- paste(year, "/nccs.core", year, orgtype, ".csv", sep="")
df2 <- read.csv(file2)
names(df2) <- tolower(names(df2))
df2 <- df2[df2$ntee1 %in% c("C","D"),]
df2<- df2[!(df2$nteecc %in% c("D20","D40", "D50", "D60", "D61")),]
return(df2)
}
#############################################################################
yrpc<- list()
yrpf<- list()
yrco<- list()
fname<- vector()
file_yrs<- as.character(c(1989:2019))
for(i in 1:length(file_yrs)){
fname<- list.files(path = file_yrs[i], pattern = NULL)
#accessing files in a folder and assigning to the proper function to open them based on how the file is named
for(j in 1:length(fname)){
if(grepl("pc.csv", fname[j])==T) {
if(grepl("nccs", fname[j])==T){
a <- open_csv_nccs(file_yrs[j], "pc")
yrpc[[paste0(file_yrs[i], "pc")]] <- a
} else {
b<- open_csv_core(file_yrs[j], "pc")
yrpc[[paste0(file_yrs[i], "pc")]] <- b
}
} else if (grepl("pf.csv", fname[j])==T){
if(grepl("nccs", fname[j])==T){
c <- open_csv_nccs(file_yrs[j], "pf")
yrpf[[paste0(file_yrs[i], "pf")]] <- c
} else {
d<- open_csv_core(file_yrs[j], "pf")
yrpf[[paste0(file_yrs[i], "pf")]] <- d
}
} else {
if(grepl("nccs", fname[j])==T){
e<- open_csv_nccs(file_yrs[j], "co")
yrco[[paste0(file_yrs[i], "co")]] <- e
} else {
f<- open_csv_core(file_yrs[j], "co")
yrco[[paste0(file_yrs[i], "co")]] <- f
}
}
}
}
Actually, both of your csv reading functions do exactly the same thing,
except that the paths are different.
If you list your files with paths that include the folder instead of
just the file names, you don't need to reconstruct the paths the way
you do. This is possible with full.names = TRUE in list.files().
The second point is that there never seems to be both a "nccs.core" file
and a "coreco.core" file for the same year and type, so they are mutually
exclusive. No logic is needed to distinguish those cases, which simplifies the code.
The third point is that you just want to separate the data frames by file type ("pc", "pf", "co") and year.
Instead of creating 3 lists, one per type, I would create a single results list that contains an inner list for each type.
I would solve it like this:
years <- as.character(1989:2019) # character, so a year works as a folder name and as a list name
path_to_type <- function(path) sub(".*(pc|pf|co)\\.csv$", "\\1", path)
res <- list("pc" = list(),
            "pf" = list(),
            "co" = list())
invisible(lapply(years, function(year) {
  files <- list.files(path = year, pattern = "\\.csv$", full.names = TRUE)
  lapply(files, function(path) {
    print(path) # just to signal that the path is getting processed
    df <- read.csv(path)
    file_type <- path_to_type(path)
    names(df) <- tolower(names(df))
    df <- df[df$ntee1 %in% c("C", "D"), ]
    df <- df[!(df$nteecc %in% c("D20", "D40", "D50", "D60", "D61")), ]
    # <<- updates res in the enclosing environment; a plain <- would only change a local copy
    res[[file_type]][[year]] <<- df
  })
}))
Now you can pull a data frame out of the results list by file type and year (the year as a character name),
e.g.:
res[["co"]][["1995"]]
res[["pf"]][["2018"]]
And so on.
Actually, the results of the lapply() calls in this case are
not interesting. Just the content of res ... (result list).
It seems that in your for(j in 1:length(fname)){... loop you are creating one of six variables, a to f, on each pass, and you're reusing these variable names, so they are getting overwritten. Note also that the reading functions are called with file_yrs[j], although j counts the files within a folder; the year is indexed by i, so file_yrs[i] is probably what was meant.
The "correct" way to do this is to use lapply in place of the for loop. Pass the vector of files and the required function (e.g. open_csv_core) to lapply, and the return value that you get back is a list of the results.
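To illustrate the point about return values, here is a small sketch with made-up data. No files are read, and load_one is a hypothetical stand-in for a function such as open_csv_core. An assignment made inside the function passed to lapply() only changes a local copy, so the list has to be built from what lapply() returns.

```r
# Hypothetical stand-in for a CSV-reading function
load_one <- function(year) data.frame(year = as.integer(year))

years <- c("1989", "1990", "1991")

# Assigning inside the anonymous function modifies a local copy of 'res' only
res <- list()
invisible(lapply(years, function(y) res[[y]] <- load_one(y)))
length(res)  # still 0

# Using the return value of lapply() gives one element per input
res <- lapply(years, load_one)
names(res) <- years
res[["1990"]]$year
```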

R use grep to clean column in list of lists

I have a large data set stored as a list of lists that may be simplified thus:
list1 <- list(1,"bob", "age=14;years")
list2 <- list(2,"bill", "age=24;years")
list3 <- list(3,"bert", "age=36;years")
data.list <- list(list1, list2, list3)
I wish to clean the third column such that I have only the numeric value of age.
This can be done with the following function that returns a new list:
clean <- function(x){
x <- as.numeric(gsub('.*age=(.*?);.*','\\1', x[3]))
}
data.age <- lapply(data.list, clean)
But how may I either
a) directly clean the column to return the value
or
b) replace the original column [3] with the data.age column?
You need to return the list back in your function, so modify your function as:
clean <- function(x){
x[[3]] <- as.numeric(gsub('.*age=(.*?);.*','\\1', x[[3]]))
x
}
data.age <- lapply(data.list, clean)
should do the trick.
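A quick way to check the result, using the sample data from the question. The pattern here is a slightly stricter variant that is my own assumption (it only accepts digits after age=), and vapply() is used to pull the cleaned ages out as a plain numeric vector.

```r
list1 <- list(1, "bob", "age=14;years")
list2 <- list(2, "bill", "age=24;years")
list3 <- list(3, "bert", "age=36;years")
data.list <- list(list1, list2, list3)

# Replace the third element of each inner list with the numeric age
clean <- function(x) {
  x[[3]] <- as.numeric(sub("^.*age=([0-9]+);.*$", "\\1", x[[3]]))
  x  # return the whole modified list, not just the age
}
data.list <- lapply(data.list, clean)

# Collect the ages as a numeric vector
ages <- vapply(data.list, function(x) x[[3]], numeric(1))
ages
```

Assigning the result back to data.list covers option b) from the question; the other elements of each inner list are left as they were.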

Overwriting result with for loop in R

I have a number of csv files and my goal is to find the number of complete cases for a file or set of files given by id argument. My function should return a data frame with column id specifying the file and column obs giving the number of complete cases for this id. However, my function overwrites the previous value of nobs in each loop and the resulting data frame gives me only its last value. Do you have any idea how to get the value of nobs for each value of id?
myfunction<-function(id=1:20) {
files<-list.files(pattern="*.csv")
myfiles = do.call(rbind, lapply(files, function(x) read.csv(x,stringsAsFactors = FALSE)))
for (i in id) {
good<-complete.cases(myfiles)
newframe<-myfiles[good,]
cases<-newframe[newframe$ID %in% i,]
nobs<-nrow(cases)
}
clean<-data.frame(id,nobs)
clean
}
Thanks.
We can do everything inside lapply(), something like below (not tested):
myfunction <- function(id = 1:20) {
  # pattern is a regular expression, so escape the dot and anchor the extension
  files <- list.files(pattern = "\\.csv$")[id]
  do.call(rbind,
          lapply(files, function(x) {
            df <- read.csv(x, stringsAsFactors = FALSE)
            df <- df[complete.cases(df), ]
            data.frame(ID = x, nobs = nrow(df))
          }))
}
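A runnable sketch of the same idea, under a few assumptions that are not in the question: the files live in a directory passed as a hypothetical dir argument, id indexes the sorted file list, and the demo files are generated on the fly. Each file is read on its own, so nobs is computed per file and nothing is overwritten.

```r
# Count complete cases per file; 'dir' and the demo files below are illustrative
count_complete <- function(dir, id = 1:2) {
  files <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)[id]
  do.call(rbind, lapply(seq_along(files), function(k) {
    df <- read.csv(files[k], stringsAsFactors = FALSE)
    data.frame(id = id[k], nobs = sum(complete.cases(df)))
  }))
}

# Demo data: file 1 has one incomplete row, file 2 has none
demo_dir <- file.path(tempdir(), "complete_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(a = c(1, NA, 3), b = c(4, 5, 6)),
          file.path(demo_dir, "001.csv"), row.names = FALSE)
write.csv(data.frame(a = c(1, 2), b = c(3, 4)),
          file.path(demo_dir, "002.csv"), row.names = FALSE)

result <- count_complete(demo_dir, id = 1:2)
result
```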

How to save results from for loop on list into a new list under "i" vector name?

I have the following code:
final_results <- list()
myfunc <- function(v1) {
deparse(substitute(v1))
}
for (i in mylist) {
...calculations...
tmp_results <- as.data.frame(cbind(effcrs,weights))
colnames(tmp_results) <- c('efficiency',names(inputs),
names(outputs)) # header
rownames(tmp_results) <- namesDMU[,1]
#Save to list
name_in_list <- myfunc(i)
dea_results[[name_in_list]] <- tmp_results
}
The above code loops through a list of data frames. I would like each result yielded from the loop to be stored in a separate list under the same name as the original element of mylist (i.e. the name of i).
I tried using deparse(substitute()). When I apply it to an individual item in mylist, it looks like this:
myfunc(standard_DEA$'2010-11-11')
[1] "standard_DEA$\"2010-11-11\""
I don't know what the issue is. At the moment it saves everything under the name "i" and replaces all vectors so the end result is a list of 1.
Thank you in advance
This looks like a job for dplyr's do():
library(dplyr)
function_which_returns_dataframe = function(i) {
  ...calculations...
  tmp_results <- as.data.frame(cbind(effcrs,weights))
  colnames(tmp_results) <- c('efficiency',names(inputs),
                             names(outputs)) # header
  rownames(tmp_results) <- namesDMU[,1]
  tmp_results
}
# tibble() replaces the deprecated data_frame(); grouping by name alone is enough
tibble(mylist = mylist,
       name = names(mylist)) %>%
  group_by(name) %>%
  do(function_which_returns_dataframe(.$mylist[[1]]))
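For comparison, a base R sketch with invented data (the toy summary stands in for the elided calculations). deparse(substitute(i)) captures the expression as it was written at the call site, which inside the loop is just the symbol i. Iterating over names(mylist), or letting lapply() carry the names over, avoids that.

```r
# Invented stand-in for the list of data frames
mylist <- list("2010-11-11" = data.frame(x = 1:3),
               "2010-11-12" = data.frame(x = 4:6))

# Option 1: loop over the names and index the list by name
final_results <- list()
for (nm in names(mylist)) {
  df <- mylist[[nm]]
  tmp_results <- data.frame(total = sum(df$x))  # placeholder calculation
  final_results[[nm]] <- tmp_results
}

# Option 2: lapply() keeps the names of its input
final_results2 <- lapply(mylist, function(df) data.frame(total = sum(df$x)))

names(final_results)
names(final_results2)
```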

Referring to 'i'th element of list in R

I've got a problem in R.
I have loaded files from folder (as filelist) using this method:
ff <- list.files(path=" ", full.names=TRUE)
myfilelist <- lapply(ff, read.table)
names(myfilelist) <- list.files(path=" ", full.names=FALSE)
In myfilelist the data frames are named A1.txt, A2.txt, A3.txt, etc.
Now I would like to use the 'i'th element of the list to change my data, for example
to delete, in each data frame, the rows whose sum is 0.
I tried:
A1 <- A1[which(rowSums(A1) > 0),]
and it works.
How can I do it for all A[i] at once?
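Try this code (assign the result, since lapply returns a new list and does not change myfilelist in place):
myfilelist <- lapply(myfilelist, function(x) {
  x <- x[which(rowSums(x) > 0),]
  return(x)
})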
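A small check with made-up data frames in place of the files that were read in. lapply() leaves its input untouched and carries the list names over to the result, so each cleaned data frame can still be picked out by file name.

```r
# Made-up stand-ins for the data frames read from A1.txt, A2.txt, ...
myfilelist <- list(
  A1.txt = data.frame(a = c(0, 1, 2), b = c(0, 3, 0)),
  A2.txt = data.frame(a = c(0, 0), b = c(0, 5))
)

# Keep only rows whose sum is greater than 0, for every data frame
cleaned <- lapply(myfilelist, function(x) x[which(rowSums(x) > 0), , drop = FALSE])

nrow(cleaned[["A1.txt"]])
nrow(cleaned[["A2.txt"]])
```

The drop = FALSE argument is an addition of mine; it keeps the result a data frame even when a file has a single column.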
