R iteration over files from different directories

I have data in different folders that I need to import and transform in a loop using purrr. The paths and names of the csv files follow the pattern below:
data/csd-alberta/
data/csd-ontario/
data/csd-pei/
data/csd-bc/
# for all of the provinces
c(alberta, bc, newbruns, newfoundland, nova, nunavut, nw, ont, pei, qc, sask, yukon)
There are many csv files in each province folder, but the main datasets I want to import all start with 98. For example:
# note that all data sets must begin with 98 and end with .csv.
csd_alberta_raw <- read_csv("csd-alberta/98-1.csv")
csd_bc_raw <- read_csv("csd-bc/98-2.csv")
csd_ont_raw <- read_csv("csd-ont/98-3.csv")
There are other csv files in each folder, so I only need to import the ones that start with 98.
I would like to use purrr and map_df to integrate the data transformation for all the files, since they all have the same columns and require the same data cleaning. But I'm not sure how to do this across all of the directories while also specifying the pattern for the csv files.

You can use the following approach:
Use list.files to get the complete paths of the files in all the folders that match a specific pattern ('^98.*\\.csv$').
Use map_df to read all the files and combine them. I have also included a new column called file which identifies the file each row came from.
filenames <- list.files('data/', recursive = TRUE, full.names = TRUE, pattern = '^98.*\\.csv$')
# name the vector so that .id records the file path rather than a position index
combine_data <- purrr::map_df(purrr::set_names(filenames), readr::read_csv, .id = 'file')
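Since the file column then holds the full path, a province column can be derived from it afterwards; a small sketch, assuming the data/csd-<province>/ layout from the question:
library(dplyr)
# the parent folder name (e.g. "csd-alberta") identifies the province
combine_data <- combine_data %>%
  mutate(province = basename(dirname(file)))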

Related

Copy-paste files to folders that have matching names using R

I am trying to copy files to various folders that have matching filenames.
Here's an extract of the filenames:
20201026_ABCD.txt
20201026_XYZ.txt
20201027_ABCD.txt
20201027_POR.txt
20201028_ABCD.txt
20201028_PQR.txt
I want to create folders that have just the date components from the files above. I have managed to get that far based on the code below:
setwd("C:/Projects/TEST")
library(stringr)
filenames<-list.files(path = "C:/Projects/TEST", pattern = NULL)
#create a variable that contains all the desired filenames
foldernames.unique<-unique(str_extract(filenames,"[0-9]{1,8}"))
#create folders based on this variable
foldernames.unique<-paste("dates/",foldernames.unique,sep='')
lapply(foldernames.unique,dir.create,recursive = TRUE)
Now, how do I copy 20201026_ABCD.txt and 20201026_XYZ.txt to the folder 20201026, so on and so forth?
Now you just need to use file.rename to move the files. First I'll change things a bit to capture the non-unique folder names so I don't have to recalculate them. How about this:
srcfolder <- "C:/Projects/TEST"
filenames <- list.files(path = srcfolder, pattern = NULL)
#create a variable that contains the desired foldername for each file
foldernames <- file.path("dates", str_extract(filenames,"[0-9]{1,8}"))
foldernames.unique <- unique(foldernames)
#create folders based on unique values of variable
lapply(foldernames.unique, dir.create, recursive = TRUE)
# Now move files
file.rename(file.path(srcfolder, filenames), file.path(foldernames, filenames))
We just build the file paths with file.path(), which is a bit more robust than paste().
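Note that file.rename() moves the files. Since the question asks to copy them, file.copy() takes the same from/to vectors; a drop-in swap, if keeping the originals in place is the goal:
# copy instead of move, leaving the originals in place
file.copy(file.path(srcfolder, filenames), file.path(foldernames, filenames))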

Reading in multiple srt files

I'd like to read in multiple srt files in R. I can read them into a list, but I need to load them sequentially in the order they were created in the file directory.
I'd also like to make a column telling which file each row comes from, so I can tell which data came from file 1, file 2, etc.
I can read them in as a list, but the files are named like "1 - FileTest"; "2 - FileTest", "#10 FileTest",... etc.
The list then loads like 1, 10, 11... etc., even though, if I arrange the files in my file directory, file 11 was created after file 9, for instance. I should just need a parameter for them to load sequentially, so that when I put them in a dataframe they appear in chronological order.
list_of_files <- list.files(path = path,
                            pattern = "*.srt",
                            full.names = TRUE)
Files <- lapply(list_of_files, srt.read)
Files <- data.frame(matrix(unlist(Files), byrow=T),stringsAsFactors=FALSE)
The files load in, but not in chronological order, so it is difficult to tell which data is associated with which file.
I have approximately 150 files, so being able to compile them into a single dataframe would be very helpful. Thanks!
Consider extracting the metadata of the files with file.info (includes created/modified time, file size, owner, group, etc.). Then order the resulting data frame by created date/time, and finally import the .srt files with the ordered list of files:
raw_list_of_files <- list.files(path = path,
                                pattern = "\\.srt$",  # regex, not a glob
                                full.names = TRUE)
# CREATE DATA FRAME OF FILE INFO
meta_df <- file.info(raw_list_of_files)
# SORT BY CREATED DATE/TIME
meta_df <- with(meta_df, meta_df[order(ctime),])
# IMPORT DATA FRAMES IN ORDERED FILES
srt_list <- lapply(row.names(meta_df), srt.read)
final_df <- data.frame(matrix(unlist(srt_list), byrow = TRUE),
                       stringsAsFactors = FALSE)
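The question also asked for a column identifying the source file; a minimal sketch of one way to add it, assuming srt.read() returns a data frame per file as above:
# tag each file's rows with the file it came from, then combine
ordered_files <- row.names(meta_df)
srt_list <- lapply(ordered_files, srt.read)
tagged_list <- Map(function(df, f) transform(df, file = basename(f)),
                   srt_list, ordered_files)
final_df <- do.call(rbind, tagged_list)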

Using list.files to read many shape files and then merge them into one big file

I have more than 1000 shape files in a directory, and I want to select only 10 of them whose names are already known to me as follows:
15TVN44102267_Polygons.shp, 15TVN44102275_Polygons.shp
15TVN44102282_Polygons.shp, 15TVN44102290_Polygons.shp
15TVN44102297_Polygons.shp, 15TVN44102305_Polygons.shp
15TVN44102312_Polygons.shp, 15TVN44102320_Polygons.shp
15TVN44102327_Polygons.shp, 15TVN44102335_Polygons.shp
First I want to read only these shape files using the list.files command, and then merge them into one big file. I tried the following command, but it failed. I would appreciate any assistance from the community.
setwd('D:/LiDAR/CHM_tree_objects')
files <- list.files(pattern="15TVN44102267_Polygons|
15TVN44102275_Polygons| 15TVN44102282_Polygons|
15TVN44102290_Polygons| 15TVN44102297_Polygons|
15TVN44102305_Polygons| 15TVN44102312_Polygons|
15TVN44102320_Polygons| 15TVN44102327_Polygons|
15TVN44102335_Polygons| 15TVN44102342_Polygons|
15TVN44102350_Polygons| 15TVN44102357_Polygons",
recursive = TRUE, full.names = TRUE)
Here's a slightly different approach. If you already know the location of the files and their file names, you don't need to use list.files:
library(sf)
baseDir <- '/temp/r/'
filenames <- c('Denisonia-maculata.shp', 'Denisonia-devisi.shp')
filepaths <- paste(baseDir, filenames, sep='')
# Read each shapefile and return a list of sf objects
listOfShp <- lapply(filepaths, st_read)
# Look to make sure they're all in the same CRS
unique(sapply(listOfShp, st_crs))
# Combine the list of sf objects into a single object
combinedShp <- do.call(what = sf:::rbind.sf, args=listOfShp)
combinedShp will then be an sf object that has all the features in your individual shapefiles. You can then write that out to a single file in your chosen format with st_write.
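If you do want to go through list.files, the original attempt failed because the newlines and spaces inside the quoted pattern become part of the regular expression. A sketch of building the pattern programmatically from the known names (only three tiles shown here):
# build an alternation pattern from the known tile IDs, with no stray whitespace
wanted <- c("15TVN44102267", "15TVN44102275", "15TVN44102282")  # extend as needed
pattern <- paste0("(", paste(wanted, collapse = "|"), ")_Polygons\\.shp$")
files <- list.files("D:/LiDAR/CHM_tree_objects", pattern = pattern,
                    recursive = TRUE, full.names = TRUE)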

How to call several variables in a for loop in R?

I have several .csv files of data stored in a directory, and I need to import all of them into R.
Each .csv has two columns when imported into R. However, the 1001st row needs to be stored as a separate variable for each of the .csv files (it corresponds to an expected value which was stored here during the simulation; I want it to be outside of the main data).
So far I have the following code to import my .csv files as matrices.
#Load all .csv in directory into list
dataFiles <- list.files(pattern="*.csv")
for(i in dataFiles) {
  # read all of the csv files
  name <- gsub("-", ".", i)
  name <- gsub(".csv", "", name)
  i <- paste(".\\", i, sep = "")
  assign(name, read.csv(i, header = T))
}
This produces several matrices with the naming convention "sim_data_L_mu" where L and mu are parameters from the simulation. How can I remove the 1001st row (which has a number in the first column, and the second column is null) from each matrix and store it as a variable named "sim_data_L_mu_EV"? The main problem I have is that I do not know how to call all of the newly created matrices in my for loop.
Couldn't post long code in comments so am writing here:
# Use dialog to select folder (choose.dir() is Windows-only)
# Full names are required to access files that are not in the current working directory
file_list <- list.files(path = choose.dir(), pattern = "\\.csv$", full.names = TRUE)
big_list <- lapply(file_list, function(z){
  df <- read.csv(z)
  # with header = TRUE, data row 1000 corresponds to line 1001 of the file;
  # adjust the index if your files have no header
  scalar <- df[1000, 1]
  # drop that row so it stays outside the main data, as requested
  df <- df[-1000, ]
  return(list(df, scalar))
})
To access the scalar value from the third file, you can use
big_list[[3]][[2]]
The elements in big_list follow the order of file_list so you always know which file the data comes from.
If you use data.table::fread() instead of read.csv, you can play around with assigning column names, selecting which rows/columns to read, etc. It's also considerably faster for large data files.
Hope this helps!
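If looking elements up by filename is easier than by position, the list can also be named; a small addition to the code above (the filename shown is hypothetical):
# name the list elements after their source files
names(big_list) <- basename(file_list)
big_list[["sim_data_2_0.5.csv"]][[2]]  # hypothetical file name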

How to store a folder containing over 30 zipped files into a variable in R

I used the package 'GDELTtools' to download data from GDELT. The data was downloaded; however, no variable was stored in the global environment. I want to store the data in a dataframe variable so I can analyze it.
The folder contains over 30 zipped files, and every zipped file contains one csv. I need to store all these csvs in one variable in the global environment of R. I hope this can be done.
Thank you in advance!
Haven't written R for a while, so I will try my best.
Read the comments carefully, because they explain the procedure.
Relevant documentation to check: unzip, read.csv, merging data frames, creating an empty data frame, and concatenating strings.
According to the GDELTtools docs, you can easily specify the download folder by providing local.folder = "~/gdeltdata" as a parameter to the GetGDELT() function.
After that, you can use the list.files("path/to/files/directory") function to obtain the vector of file names used in the explanation code below. Check the docs for more examples and explanation.
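For instance, a minimal sketch of the download step; only local.folder comes from the docs as cited above, the date arguments are assumptions, so check ?GetGDELT for the exact signature:
library(GDELTtools)
# assumed argument names for the date range
gdelt_df <- GetGDELT(start.date = "2020-01-01", end.date = "2020-01-31",
                     local.folder = "~/gdeltdata")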
# set the path for the unzip output
outDir <- "C:\\Users\\Name\\Documents\\unzipfolder"
# path where the zip files are stored
relativePath <- "C:\\path\\to\\my\\directory\\"
# create a variable to store all the paths to the zip files in a vector
zipPaths <- vector()
# since we have 30 files we should iterate through them;
# I assume you have a vector of file names in the variable fileNamesZip
for (name in fileNamesZip) {
  # use paste0() to concatenate the strings
  zipfilepath <- paste0(relativePath, name, ".zip")
  # append the file path; the result must be assigned back
  zipPaths <- append(zipPaths, zipfilepath)
}
# now we have a vector which contains all the paths to the zip files.
# unzip() extracts one archive at a time, so loop over the vector (read the official docs)
lapply(zipPaths, unzip, exdir = outDir)
# initialize a dataframe for all the data; the placeholder columns below
# must be replaced with the datatypes of your csv columns
total <- data.frame(Doubles = double(),
                    Ints = integer(),
                    Factors = factor(),
                    Logicals = logical(),
                    Characters = character(),
                    stringsAsFactors = FALSE)
# now it's time to read the csv files and collect them into one dataframe.
# again, I assume you have a vector of file names in the variable fileNamesCSV
for (name in fileNamesCSV) {
  # create the csv file path
  csvfilepath <- paste0(outDir, "\\", name, ".csv")
  # read the data from the csv file into a dataframe
  dataFrame <- read.csv(file = csvfilepath, header = TRUE, sep = ",")
  # dataframes can be stacked with rbind() only if they have identical columns
  total <- rbind(total, dataFrame)
}
Something potentially much simpler:
list.files() lists the files in a directory
readr::read_csv() will automatically unzip files as necessary
dplyr::bind_rows() will combine data frames
So try:
lf <- list.files(pattern="\\.zip")
dfs <- lapply(lf,readr::read_csv)
result <- dplyr::bind_rows(dfs)
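If you also want to record which zip each row came from, bind_rows() accepts a named list and writes the names to an id column:
names(dfs) <- basename(lf)
result <- dplyr::bind_rows(dfs, .id = "file")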
