I have a folder containing a bunch of CSV files titled "yob1980", "yob1981", "yob1982", etc.
I have to use a for loop to go through each file and put its contents into a data frame; the columns in the data frame should be "1980", "1981", "1982", etc.
Here is what I have:
file_list <- list.files()
temp = list.files(pattern="*.txt")
babynames <- do.call(rbind, lapply(temp, read.csv, header = FALSE))
names(babynames) <- c("Name", "Gender", "Count")
I feel like I need a for loop, but I'm not sure how to loop through the files. Anyone point me in the right direction?
My favourite way to do this is using ldply from the plyr package. It has the advantage of returning a dataframe, so you don't need to do the rbind step afterwards:
library(plyr)
babynames <- ldply(.data = list.files(pattern = "\\.txt$"),
                   .fun = read.csv,
                   header = FALSE,
                   col.names = c("Name", "Gender", "Count"))
As an added benefit, you can parallelise the import very easily, making importing large multi-file datasets quite a bit faster:
library(plyr)
library(doMC)
registerDoMC(cores = 4)
babynames <- ldply(.data = list.files(pattern = "\\.txt$"),
                   .fun = read.csv,
                   header = FALSE,
                   col.names = c("Name", "Gender", "Count"),
                   .parallel = TRUE)
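One caveat: doMC relies on forking, so it is not available on Windows. A cross-platform alternative, assuming the doParallel package is installed, is to swap the registration step (a minimal sketch; the ldply call itself is unchanged):
library(doParallel)
registerDoParallel(cores = 4)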
Changing the above slightly to include a Year column in the resulting data frame, you can create a function first, then execute that function within ldply in the same way you would execute read.csv:
readFun <- function(filename) {
  # read in the data
  data <- read.csv(filename,
                   header = FALSE,
                   col.names = c("Name", "Gender", "Count"))
  # add a "Year" column by removing both "yob" and ".txt" from the file name
  data$Year <- gsub("yob|\\.txt", "", filename)
  return(data)
}
# execute that function across all files, outputting a data frame
doMC::registerDoMC(cores = 4)
babynames <- plyr::ldply(.data = list.files(pattern = "\\.txt$"),
                         .fun = readFun,
                         .parallel = TRUE)
This will give you your data in a concise and tidy way, which is how I'd recommend moving forward from here. While it is possible to then separate each year's data into its own column, it's likely not the best way to go.
Note: depending on your preference, it may be a good idea to convert the Year column to, say, integer class. But that's up to you.
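For example, a one-line way to do that conversion (assuming the Year values are plain digit strings after the gsub above):
babynames$Year <- as.integer(babynames$Year)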
Using purrr
library(tidyverse)
files <- list.files(path = "./data/", pattern = "\\.csv$")
df <- files %>%
  map(function(x) {
    read.csv(paste0("./data/", x))
  }) %>%
  reduce(rbind)
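As a side note, recent purrr versions can collapse the map() plus reduce(rbind) pair into a single step, since map_dfr() row-binds as it maps; a sketch of the same import:
df <- files %>%
  map_dfr(~ read.csv(paste0("./data/", .x)))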
A for loop might be more appropriate than lapply in this case.
file_list <- list.files(pattern = "\\.txt$")
data_list <- vector("list", length = length(file_list))
for (i in seq_along(file_list)) {
  filename <- file_list[[i]]
  # Read data in
  df <- read.csv(filename, header = FALSE, col.names = c("Name", "Gender", "Count"))
  # Extract year from the filename
  year <- gsub("yob|\\.txt", "", filename)
  df[["Year"]] <- year
  # Add the data frame to data_list
  data_list[[i]] <- df
}
babynames <- do.call(rbind, data_list)
Consider an anonymous function within an lapply():
files <- list.files(pattern = "\\.txt$")
dfList <- lapply(files, function(i) {
  df <- read.csv(i, header = FALSE, col.names = c("Name", "Gender", "Count"))
  df$Year <- gsub("yob|\\.txt", "", i)
  return(df)
})
finaldf <- do.call(rbind, dfList)
Related
I discovered R a couple of years ago, and it has been very handy for cleaning up data frames, preparing data, and handling other basic tasks.
Now I would like to try using R to apply basic treatments to many different files stored in different folders at once.
Here is the script I would like to turn into a single function that loops through my folders "dataset_2006" and "dataset_2007" and does all the work.
library(dplyr)
library(readr)
library(sf)
library(purrr)
setwd("C:/Users/Downloads/global_data/dataset_2006")
shp2006 <- list.files(pattern = 'data_2006.*\\.shp$', full.names = TRUE)
listOfShp <- lapply(shp2006, st_read)
combinedShp <- do.call(what = sf:::rbind.sf, args=listOfShp)
#import and merge CSV files into one data frame
folderfiles <- list.files(pattern = 'csv_2006_.*\\.csv$', full.names = TRUE)
csv_data <- folderfiles %>%
set_names() %>%
map_dfr(.f = read_delim,
delim = ";",
.id = "file_name")
new_shp_2006 <- merge(combinedShp, csv_data, by = "ID") %>% filter(label %in% c("AR45T", "GK879"))
st_write(new_shp_2006, "new_shp_2006.shp", overwrite = TRUE)
setwd("C:/Users/Downloads/global_data/dataset_2007")
shp2007 <- list.files(pattern = 'data_2007.*\\.shp$', full.names = TRUE)
listOfShp <- lapply(shp2007, st_read)
combinedShp <- do.call(what = sf:::rbind.sf, args=listOfShp)
#import and merge CSV files into one data frame
folderfiles <- list.files(pattern = 'csv_2007_.*\\.csv$', full.names = TRUE)
csv_data <- folderfiles %>%
set_names() %>%
map_dfr(.f = read_delim,
delim = ";",
.id = "file_name")
new_shp_2007 <- merge(combinedShp, csv_data, by = "ID") %>% filter(label %in% c("AR45T", "GK879"))
st_write(new_shp_2007, "new_shp_2007.shp", overwrite = TRUE)
This is easy to achieve with a for loop over multiple items. To allow wildcards, we can also use the function Sys.glob():
myfunction <- function(directories) {
  for (dir in Sys.glob(directories)) {
    # do something with a single dir
    print(dir)
  }
}

# you can specify multiple directories manually:
myfunction(c('C:/Users/Downloads/global_data/dataset_2006',
             'C:/Users/Downloads/global_data/dataset_2007'))

# or use a wildcard to automatically get all files/directories that match the pattern:
myfunction('C:/Users/Downloads/global_data/dataset_200*')
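To sketch how the per-year code from the question might move inside that loop (untested; it assumes the packages loaded in the question, that the year can be recovered from the directory name, and that the file names follow the patterns shown above):
myfunction <- function(directories) {
  for (dir in Sys.glob(directories)) {
    # extract e.g. "2006" from ".../dataset_2006" (assumes the only digits are the year)
    year <- gsub("\\D", "", basename(dir))
    # read and combine this year's shapefiles
    shp_files <- list.files(dir, pattern = paste0('data_', year, '.*\\.shp$'), full.names = TRUE)
    combinedShp <- do.call(what = sf:::rbind.sf, args = lapply(shp_files, st_read))
    # import and merge this year's CSV files into one data frame
    csv_files <- list.files(dir, pattern = paste0('csv_', year, '_.*\\.csv$'), full.names = TRUE)
    csv_data <- csv_files %>%
      set_names() %>%
      map_dfr(.f = read_delim, delim = ";", .id = "file_name")
    # merge, filter, and write out, as in the original script
    new_shp <- merge(combinedShp, csv_data, by = "ID") %>%
      filter(label %in% c("AR45T", "GK879"))
    st_write(new_shp, file.path(dir, paste0("new_shp_", year, ".shp")), append = FALSE)
  }
}
Because every path is built with full.names = TRUE and file.path(), the setwd() calls are no longer needed, which also makes the function easier to reuse.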
I have three file lists:
filesl <- list.files(pattern = 'RP-L-.*\\.xls', full.names = TRUE)
filesr <- list.files(pattern = 'RP-R-.*\\.xls', full.names = TRUE)
filesv <- list.files(pattern = 'RP-V-.*\\.xls', full.names = TRUE)
And I've attempted to read and merge two of them as follows:
data <- map2_dfr(filesl, filesr, ~ {
  ldat <- readxl::read_excel(.x)
  rdat <- readxl::read_excel(.y)
  df <- rbind.data.frame(ldat, rdat, by = c('ID', 'GR', 'SES', 'COND'))
})
If I would like to add the third list, filesv, what am I supposed to use as the function? I guess I should use one of the purrr functions that iterate over multiple elements (such as pmap, imap, and so on), but I have not been able to adapt the commands properly.
Any suggestions are appreciated.
Thanks in advance.
data <- pmap_dfr(list(filesl, filesr, filesv), ~ {
  ldat <- readxl::read_excel(..1)
  rdat <- readxl::read_excel(..2)
  vdat <- readxl::read_excel(..3)
  # note: rbind() has no `by` argument; it simply stacks the rows
  rbind.data.frame(ldat, rdat, vdat)
})
pmap takes a list of lists to iterate over. The arguments can then be referred to as ..1, ..2, etc. in the anonymous function.
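If the ..1/..2/..3 shorthand feels opaque, the same thing can be written with a named anonymous function (a sketch, under the question's assumption that the three lists are the same length and aligned):
data <- pmap_dfr(list(filesl, filesr, filesv), function(l, r, v) {
  rbind.data.frame(readxl::read_excel(l),
                   readxl::read_excel(r),
                   readxl::read_excel(v))
})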
I have roughly 50000 .rda files. Each contains a dataframe named results with exactly one row. I would like to append them all into one dataframe.
I tried the following, which works, but is slow:
root_dir <- paste(path, "models/", sep="")
files <- paste(root_dir, list.files(root_dir), sep="")
load(files[1])
results_table = results
rm(results)
for (i in 2:length(files)) {
  print(paste("We are at step ", i, sep = ""))
  load(files[i])
  results_table <- bind_rows(list(results_table, results))
  rm(results)
}
Is there a more efficient way to do this?
Using .rds files is a little easier, but if we are limited to .rda, the following might be useful. I'm not certain whether it is faster than what you have done:
library(purrr)
library(dplyr)
library(tidyr)
## make and write some sample data to .rda
x <- 1:10
fake_files <- function(x){
  df <- tibble(x = x)
  save(df, file = here::here(paste0(as.character(x), ".rda")))
  return(NULL)
}
purrr::map(x, ~fake_files(x = .x))

## map and load the .rda files into a single tibble
load_rda <- function(file) {
  foo <- load(file = file) # foo just captures the names of the loaded objects
  return(df) # note: df is the name of the object each .rda file contains
}
rda_files <- tibble(files = list.files(path = here::here(""),
                                       pattern = "\\.rda$",
                                       full.names = TRUE)) %>%
  mutate(data = pmap(., ~load_rda(file = .x))) %>%
  unnest(data)
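If the tibble-plus-unnest scaffolding isn't needed, the load step can be written more directly with map_dfr() (a sketch, with the same assumption that every .rda file contains an object named df):
rda_files <- list.files(path = here::here(""), pattern = "\\.rda$", full.names = TRUE)
all_data <- purrr::map_dfr(rda_files, load_rda)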
This is untested code but should be pretty efficient:
root_dir <- paste(path, "models/", sep = "")
files <- paste(root_dir, list.files(root_dir), sep = "")
data_list <- lapply(files, function(f) {
  message("loading file: ", f)
  name <- load(f) # load() returns the name of the loaded object
  return(eval(parse(text = name))) # return the object whose name is stored in `name`
})
results_table <- data.table::rbindlist(data_list)
data.table::rbindlist is very similar to dplyr::bind_rows but a little faster.
Current dilemma: I have a massive data frame that I am trying to break down into smaller files based on a partial string match on the column names. I have made a script that works great for this:
library(dplyr)

df <- read.csv("file.csv", header = TRUE, sep = ",")
newdf <- select(df, matches('threshold1'))
write.csv(newdf, "threshold1.file.csv", row.names = FALSE)
The problem is that I have hundreds of thresholds to break apart into separate files. There must be a way I can loop this script to create all the files for me rather than manually editing the script to say threshold2, threshold3, etc.
You can try to solve it with lapply.
# Function that splits and saves the data.frame
split_df <- function(threshold, df){
  newdf <- select(df, matches(threshold))
  write.csv(newdf,
            paste(threshold, ".file.csv", sep = ""), row.names = FALSE)
  return(threshold)
}

df <- read.csv("file.csv", header = TRUE, sep = ",")

# Number of thresholds
N <- 100
threshold_l <- paste("threshold", 1:N, sep = "")

lapply(threshold_l, split_df, df = df)
I need to run the same set of code for multiple CSV files, i.e. do the same thing with a macro. Below is the code I am executing, but the results are not coming out properly: it reads the data in 2-D format while I need to work with it in 3-D format.
lf <- list.files(path = "D:/THD/data", pattern = ".csv",
                 full.names = TRUE, recursive = TRUE, include.dirs = TRUE)
ds <- lapply(lf, read.table)
I don't know if this is going to be useful, but one of the ways I do it is:
## Step 1: read the files into a list
mycsv <- dir(pattern = "\\.csv$")
n <- length(mycsv)
mylist <- vector("list", n)
for (i in 1:n) mylist[[i]] <- read.csv(mycsv[i], header = TRUE)
Then I usually just use an apply function to change things. For example:
## Change column names
mylist <- lapply(mylist, function(x) {names(x) <- c("type","date","v1","v2","v3","v4","v5","v6","v7","v8","v9","v10","v11","v12","v13","v14","v15","v16","v17","v18","v19","v20","v21","v22","v23","v24","total"); return(x)})

## Change the type column to weekday/weekend
mylist <- lapply(mylist, function(x) {
  f <- c("we", "we", "wd", "wd", "wd", "wd", "wd")
  x$type <- rep(f, 52, length.out = 365)
  return(x)
})
and so on.
Then, after all the changes, I save with the following code. (It is also sometimes useful to split the original file name and save each file under a name built from part of it, so that I can track individual files later.)
## for example, some of my files had a pattern in the file name, such as "201_E424220_N563500.csv", so I split on "_" and save under a new name like this:
mylist <- lapply(1:length(mylist), function(i) {
  mylist.i <- mylist[[i]]
  s <- strsplit(mycsv[i], "_", fixed = TRUE)[[1]]
  d <- cbind(mylist.i[, c("type", "date")], ID = s[1], Easting = s[2], Northing = s[3],
             mylist.i[, 3:ncol(mylist.i)])
  return(d)
})

for (i in 1:n)
  write.csv(file = paste("file", i, ".csv", sep = ""), mylist[[i]], row.names = FALSE)
I hope this will help. When you get some time, please read about the plyr package; it offers lots of data-analysis options, including apply functions such as:
## l_ply split list, apply function and discard result
## ldply split list, apply function and return result in data frame
## laply split list, apply function and return result in an array
For example, you can use ldply to read all your CSV files and return a data frame, something like:
data <- ldply(list.files(pattern = "\\.csv$"), function(fname) {
  j <- read.csv(fname, header = TRUE)
  return(j)
})
So here data will be a single data frame holding the contents of all your CSV files.
Thanks, Ayan