I've got a problem in R.
I have loaded files from folder (as filelist) using this method:
ff <- list.files(path=" ", full.names=TRUE)
myfilelist <- lapply(ff, read.table)
names(myfilelist) <- list.files(path=" ", full.names=FALSE)
In myfilelist I have dataframe name as: A1.txt, A2.txt, A3.txt.. etc
Now I would like to use the 'i'th element of list to change my data, for example
with each data frame delete rows the sum of which = 0.
I tried:
A1 <- A1[which(rowSums(A1) > 0),]
and it works.
How can I do it for all A[i] at once?
Try this code:
lapply(myfilelist, function(x) {
x <- x[which(rowSums(x) > 0),]
return(x)
})
Related
I have a bunch of csv files that I'm trying to read into R all at once, with each data frame from a csv becoming an element of a list. The loops largely work, but they keep overriding the list elements. So, for example, if I loop over the first 2 files, both data frames in list[[1]] and list[[2]] will contain the data frame for the second file.
#function to open one group of files named with "cores"
open_csv_core<- function(year, orgtype){
file<- paste(year, "/coreco.core", year, orgtype, ".csv", sep = "")
df <- read.csv(file)
names(df) <- tolower(names(df))
df <- df[df$ntee1 %in% c("C","D"),]
df<- df[!(df$nteecc %in% c("D20","D40", "D50", "D60", "D61")),]
return(df)
}
#function to open one group of files named with "nccs"
open_csv_nccs<- function(year, orgtype){
file2<- paste(year, "/nccs.core", year, orgtype, ".csv", sep="")
df2 <- read.csv(file2)
names(df2) <- tolower(names(df2))
df2 <- df2[df2$ntee1 %in% c("C","D"),]
df2<- df2[!(df2$nteecc %in% c("D20","D40", "D50", "D60", "D61")),]
return(df2)
}
#############################################################################
yrpc<- list()
yrpf<- list()
yrco<- list()
fname<- vector()
file_yrs<- as.character(c(1989:2019))
for(i in 1:length(file_yrs)){
fname<- list.files(path = file_yrs[i], pattern = NULL)
#accessing files in a folder and assigning to the proper function to open them based on how the file is named
for(j in 1:length(fname)){
if(grepl("pc.csv", fname[j])==T) {
if(grepl("nccs", fname[j])==T){
a <- open_csv_nccs(file_yrs[j], "pc")
yrpc[[paste0(file_yrs[i], "pc")]] <- a
} else {
b<- open_csv_core(file_yrs[j], "pc")
yrpc[[paste0(file_yrs[i], "pc")]] <- b
}
} else if (grepl("pf.csv", fname[j])==T){
if(grepl("nccs", fname[j])==T){
c <- open_csv_nccs(file_yrs[j], "pf")
yrpf[[paste0(file_yrs[i], "pf")]] <- c
} else {
d<- open_csv_core(file_yrs[j], "pf")
yrpf[[paste0(file_yrs[i], "pf")]] <- d
}
} else {
if(grepl("nccs", fname[j])==T){
e<- open_csv_nccs(file_yrs[j], "co")
yrco[[paste0(file_yrs[i], "co")]] <- e
} else {
f<- open_csv_core(file_yrs[j], "co")
yrco[[paste0(file_yrs[i], "co")]] <- f
}
}
}
}
Actually, both of your csv reading functions do exactly the same,
except that the paths are different.
If you find a way to list your files with abstract paths instead of relative
paths (just the file names), you wouldn't need to reconstruct the paths like
you do. This is possible by full.names = TRUE in list.files().
The second point is, it seems there is never from same year and same type
a "nccs.core" file in addition to a "coreco.core" file. So they are mutually
exclusive. So then, there is no logics necessary to distinguish those cases, which simplifies our code.
The third point is, you just want to separate the data frames by filetype ("pc", "pf", "co") and years.
Instead of creating 3 lists for each type, I would create one res-ults list, which contains for each type an inner list.
I would solve this like this:
years <- c(1989:2019)
path_to_type <- function(path) gsub(".*(pc|pf|co)\\.csv", "\\1", path)
res <- list("pc" = list(),
"pf" = list(),
"co" = list())
lapply(years, function(year) {
files <- list.files(path = year, pattern = "\\.csv", full.names = TRUE)
dfs <- lapply(files, function(path) {
print(path) # just to signal that the path is getting processed
df <- read.csv(path)
file_type <- path_to_type(path)
names(df) <- tolower(names(df))
df <- df[df$ntee1 %in% c("C", "D"), ]
df <- df[!(df$nteecc %in% c("D20", "D40", "D50", "D60", "D61")), ]
res[[file_type]][[year]] <- df
})
})
Now you can call from result's list by file_type and year
e.g.:
res[["co"]][[1995]]
res[["pf"]][[2018]]
And so on.
Actually, the results of the lapply() calls in this case are
not interesting. Just the content of res ... (result list).
It seems that in your for(j in 1:length(fname)){... you are creating one of 4 variable a, b, c or d. And you're reusing these variable names, so they are getting overwritten.
The "correct" way to do this is to use lapply in place of the for loop. Pass the list of files, and the required function (i.e. open_csv_core, etc) to lapply, and the return value that you get back is a list of the results.
file.names <- list.files(path = 'mypath')
file.names <- paste("mypath", file.names, sep="/")
for(i in 1:length(file.names))
{
assign(paste("Frame",i,""), read.table(file.names[i], sep="", header=FALSE))
}
My above code reads files from a directory and adds them to a data frame. I have thousands of these files. The question is how can i get all the data frames that i create for each file and average each value across all data frames. Its just like when you have a 100x 100 matrix of 1000 files (dataframes) you just want one 100 x 100 matrix with average values across the dataframes. Any help is really appreciated. I have been stuck for a while with this.
The following code seem to do the trick. Thanks to #Gregor
X <- NULL
mylist <- list()
args = commandArgs(trailingOnly=TRUE)
# test if there is at least one argument: if not, return an error
if (length(args)==0){
stop("At least one argument must be supplied (input file).n", call.=FALSE)
} else if (length(args)==1){
file.names <- list.files(path =args[1],pattern=".gdat")
file.names <- paste(args[1], file.names, sep="/")
args[2] <- paste(args[1], "avg.txt", sep="/")
for(i in 1:length(file.names))
{mylist[i] <- list(read.table(file.names[i], sep="", header=FALSE))}
X <- Reduce("+", mylist) / length(mylist) #this is the funx that averages across dataframes
write.table(X, file=args[2], sep="\t",row.names=FALSE, quote=FALSE)
}
This question already has answers here:
What's wrong with my function to load multiple .csv files into single dataframe in R using rbind?
(6 answers)
Closed 5 years ago.
I am quite new to R and I need some help. I have multiple csv files labeled from 001 to 332. I would like to combine all of them into one data.frame. This is what I have done so far:
filesCSV <- function(id = 1:332){
fileNames <- paste(id) ## I want fileNames to be a vector with the names of all the csv files that I want to join together
for(i in id){
if(i < 10){
fileNames[i] <- paste("00",fileNames[i],".csv",sep = "")
}
if(i < 100 && i > 9){
fileNames[i] <- paste("0", fileNames[i],".csv", sep = "")
}
else if (i > 99){
fileNames[i] <- paste(fileNames[i], ".csv", sep = "")
}
}
theData <- read.csv(fileNames[1], header = TRUE) ## here I was trying to create the data.frame with the first csv file
for(a in 2:332){
theData <- rbind(theData,read.csv(fileNames[a])) ## here I wanted to use a for loop to cycle through the names of files in the fileNames vector, and open them all and add them to the 'theData' data.frame
}
theData
}
Any help would be appreciated, Thanks!
Hmm it looks roughly like your function should already be working. What is the issue?
Anyways here would be a more idiomatic R way to achieve what you want to do that reduces the whole function to three lines of code:
Construct the filenames:
infiles <- sprintf("%03d.csv", 1:300)
the %03d means: insert an integer value d padded to length 3 zeroes (0). Refer to the help of ?sprintf() for details.
Read the files:
res <- lapply(infiles, read.csv, header = TRUE)
lapply maps the function read.csv with the argument header = TRUE to each element of the vector "infiles" and returns a list (in this case a list of data.frames)
Bind the data frames together:
do.call(rbind, res)
This is the same as entering rbind(df1, df2, df3, df4, ..., dfn) where df1...dfn are the elments of the list res
You were very close; just needed ideas to append 0s to files and cater for cases when the final data should just read the csv or be an rbind
filesCSV <- function(id = 1:332){
library(stringr)
# Append 0 ids in front
fileNames <- paste(str_pad(id, 3, pad = "0"),".csv", sep = "")
# Initially the data is NULL
the_data <- NULL
for(i in seq_along(id)
{
# Read the data in dat object
dat <- read.csv(fileNames[i], header = TRUE)
if(is.null(the_data) # For the first pass when dat is NULL
{
the_data <- dat
}else{ # For all other passes
theData <- rbind(theData, dat)
}
}
return(the_data)
}
I would like to design a function. Say I have files file1.csv, file2.csv, file3.csv, ..., file100.csv. I only want to read some of them every time I call the function by specifying an integer vector id, e.g., id = 1:10, then I will read file1.csv,...,file10.csv.
After reading those csv files, I would like to row combine them into a single variable. All csv files have the same column structure.
My code is below:
namelist <- list.files()
for (i in id) {
assign(paste0( "file", i ), read.csv(namelist[i], header=T))
}
As you can see, after I read in all the data matrix, I stuck at combining them since they all have different variable names.
You should read in each file as an element of a list. Then you can combine them as follows:
namelist <- list.files()
df <- vector("list", length = length(id))
for (i in id) {
df[[i]] <- read.csv(namelist[i], header = TRUE)
}
df <- do.call("rbind", df)
Or more concisely:
df <- do.call(rbind, lapply(list.files(), read.csv))
I do this, which is more R like without the for loop:
## assuming you have a folder full of .csv's to merge
filenames <- list.files()
all_files <- Reduce(rbind, lapply(filenames, read.csv))
If I understand correctly what you want to do then this is all you need:
namelist <- list.files()
singlevar = c()
for (i in id) {
singlevar = rbind(singlevar, read.csv(namelist[i], header=T))
}
Since in the end you want one single object to contain all the partial information from the single files, rbind as you go.
I am new to R program and currently working on a set of financial data. Now I got around 10 csv files under my working directory and I want to analyze one of them and apply the same command to the rest of csv files.
Here are all the names of these files: ("US%10y.csv", "UK%10y.csv", "GER%10y.csv","JAP%10y.csv", "CHI%10y.csv", "SWI%10y.csv","SOA%10y.csv", "BRA%10y.csv", "CAN%10y.csv", "AUS%10y.csv")
For example, because the Date column in CSV files are Factor so I need to change them to Date format:
CAN <- read.csv("CAN%10y.csv", header = T, sep = ",")
CAN$Date <- as.character(CAN$Date)
CAN$Date <- as.Date(CAN$Date, format ="%m/%d/%y")
CAN_merge <- merge(all.dates.frame, CAN, all = T)
CAN_merge$Bid.Yield.To.Maturity <- NULL
all.dates.frame is a data frame of 731 consecutive days. I want to merge them so that each file will have the same number of rows which later enables me to combine 10 files together to get a 731 X 11 master data frame.
Surely I can copy and paste this code and change the file name, but is there any simple approach to use apply or for loop to do that ???
Thank you very much for your help.
This should do the trick. Leave a comment if a certain part doesn't work. Wrote this blind without testing.
Get a list of files in your current directory ending in name .csv
L = list.files(".", ".csv")
Loop through each of the name and reads in each file, perform the actions you want to perform, return the data.frame DF_Merge and store them in a list.
O = lapply(L, function(x) {
DF <- read.csv(x, header = T, sep = ",")
DF$Date <- as.character(CAN$Date)
DF$Date <- as.Date(CAN$Date, format ="%m/%d/%y")
DF_Merge <- merge(all.dates.frame, CAN, all = T)
DF_Merge$Bid.Yield.To.Maturity <- NULL
return(DF_Merge)})
Bind all the DF_Merge data.frames into one big data.frame
do.call(rbind, O)
I'm guessing you need some kind of indicator, so this may be useful. Create a indicator column based on the first 3 characters of your file name rep(substring(L, 1, 3), each = 731)
A dplyr solution (though untested since no reproducible example given):
library(dplyr)
file_list <- c("US%10y.csv", "UK%10y.csv", "GER%10y.csv","JAP%10y.csv", "CHI%10y.csv", "SWI%10y.csv","SOA%10y.csv", "BRA%10y.csv", "CAN%10y.csv", "AUS%10y.csv")
can_l <- lapply(
file_list
, read.csv
)
can_l <- lapply(
can_l
, function(df) {
df %>% mutate(Date = as.Date(as.character(Date), format ="%m/%d/%y"))
}
)
# Rows do need to match when column-binding
can_merge <- left_join(
all.dates.frame
, bind_cols(can_l)
)
can_merge <- can_merge %>%
select(-Bid.Yield.To.Maturity)
One possible solution would be to read all the files into R in the form of a list, and then use lapply to to apply a function to all data files. For example:
# Create vector of file names in working direcotry
files <- list.files()
files <- files[grep("csv", files)]
#create empty list
lst <- vector("list", length(files))
#Read files in to list
for(i in 1:length(files)) {
lst[[i]] <- read.csv(files[i])
}
#Apply a function to the list
l <- lapply(lst, function(x) {
x$Date <- as.Date(as.character(x$Date), format = "%m/%d/%y")
return(x)
})
Hope it's helpful.