Reading in multiple srt files - r

I'd like to read multiple srt files into R. I can read them into a list, but I need to load them sequentially in the order they were created in the file directory.
I'd also like to add a column indicating which file each row came from, so I can tell which data came from file 1, file 2, etc.
I can read them in as a list, but the files have names like "1 - FileTest", "2 - FileTest", "#10 FileTest", etc.
The list then loads as 1, 10, 11, ... even though, if I arrange the files in my file directory, file 11 was created after file 9, for instance. I should just need a parameter to load them sequentially, so that when I put them in a dataframe they appear in chronological order.
list_of_files <- list.files(path = path,
                            pattern = "*.srt",
                            full.names = TRUE)
Files <- lapply(list_of_files, srt.read)
Files <- data.frame(matrix(unlist(Files), byrow = TRUE), stringsAsFactors = FALSE)
The files load in, but not in chronological order, and it is difficult to tell which data is associated with which file. I have approximately 150 files, so being able to compile them into a single dataframe would be very helpful. Thanks!

Consider extracting the files' metadata with file.info (which includes created/modified time, file size, owner, group, etc.). Then order the resulting data frame by created date/time, and finally import the .srt files using the ordered list of files:
raw_list_of_files <- list.files(path = path,
                                pattern = "*.srt",
                                full.names = TRUE)
# CREATE DATA FRAME OF FILE INFO
meta_df <- file.info(raw_list_of_files)
# SORT BY CREATED DATE/TIME
meta_df <- with(meta_df, meta_df[order(ctime), ])
# IMPORT DATA FRAMES IN ORDER OF CREATED FILES
srt_list <- lapply(row.names(meta_df), srt.read)
final_df <- data.frame(matrix(unlist(srt_list), byrow = TRUE),
                       stringsAsFactors = FALSE)
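To also get the file-identifier column the question asks for, here is a minimal sketch that replaces the matrix/unlist step above. It assumes srt.read returns a data frame per file; the source_file column name is just illustrative:
srt_list <- lapply(row.names(meta_df), srt.read)
names(srt_list) <- basename(row.names(meta_df))
# add a source_file column to each data frame, then stack them in order
tagged <- Map(function(df, nm) { df$source_file <- nm; df },
              srt_list, names(srt_list))
final_df <- do.call(rbind, tagged)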

Related

Merging csv files whose names start with the same string using R

I have a number of csv files in the working directory. Some of these files share a string (e.g. ny, nj, etc.) at the beginning of their names.
What I want to do is to import and merge the csv files that share a string. I have searched and seen people suggesting regex, but I am not sure if that is the best way to go. I appreciate any help with this.
Here's a function that may be more efficient than a for loop, though there may be more elegant solutions.
Since I don't know what your csv files contain, I created several dummy files with a few columns ("A", "B", and "C"). I also don't know what you would merge by; in this example I merge by column "A".
Given the ambiguity in the files, I have edited this to include both merge and bind approaches, depending on what is needed.
To test these functions, create a few CSV files in a folder (I created NJ_1.csv, NJ_2.csv, NJ_3.csv, NY_1.csv, and NY_2.csv, each with columns A, B, and C).
For all options, this code needs to be run.
setwd("insert path where folder with csv files is located")
library(dplyr)
OPTION 1:
If you want to merge files containing different data with a unique identifier.
Example: one file contains temperature and one file contains precipitation for a given geographic location
importMerge <- function(x, mergeby){
  temp <- list.files(pattern = paste0("^", x))  # files whose names start with x
  files <- lapply(temp, read.csv)
  merge <- files %>% Reduce(function(dtf1, dtf2) left_join(dtf1, dtf2, by = mergeby), .)
  return(merge)
}
NJmerge <- importMerge("NJ", "A")
NYmerge <- importMerge("NY", "A")
OPTION 2:
If you want to bind files containing the same columns.
Example: files contain both temperature and precipitation, and each file is a given geographic location. Note: all columns need to have the same names in each file.
importBind <- function(x){
  temp <- list.files(pattern = paste0("^", x))  # files whose names start with x
  files <- lapply(temp, read.csv)
  bind <- do.call("rbind", files)
  return(bind)
}
NJbind <- importBind("NJ")
NYbind <- importBind("NY")
OPTION 3:
If you want to bind only certain columns from files containing the same column names.
Example: files contain temperature and precipitation, along with other columns that aren't needed, and each file is a given geographic location. Note: all columns need to have the same names in each file. Since the default for keeps is NULL, leaving it out is equivalent to Option 2 above.
importBindKeep <- function(x, keeps = NULL){ # default is to keep all columns
  temp <- list.files(pattern = paste0("^", x))  # files whose names start with x
  files <- lapply(temp, read.csv)
  # if you want to keep only a few columns, subset each data frame by name
  if(!is.null(keeps)) files <- lapply(files, "[", , keeps)
  bind <- do.call("rbind", files)
  return(bind)
}
NJbind.keeps <- importBindKeep("NJ", keeps = c("A","B")) # keep only columns A and B
NYbind.keeps <- importBindKeep("NY", keeps = c("A","B"))
See How to import multiple .csv files at once? and Simultaneously merge multiple data.frames in a list, for more information.

R iteration over files from different directories

I have data in different folders that I need to import and transform in a loop using purrr. The paths and names of the csv files follow the pattern below:
data/csd-alberta/
data/csd-ontario/
data/csd-pei/
data/csd-bc/
# for all of the provinces
c(alberta, bc, newbruns, newfoundland, nova, nunavut, nw, ont, pei, qc, sask, yukon)
There are many csv files in each province folder, but the main dataset I want to import starts with 98. For example:
# note that all data sets must begin with 98 and end with .csv.
csd_alberta_raw <- read_csv("csd-alberta/98-1.csv")
csd_bc_raw <- read_csv("csd-bc/98-2.csv")
csd_ont_raw <- read_csv("csd-ont/98-3.csv")
There are other csv files in the folders, so I only need to import the ones that start with 98.
I would like to use purrr and map_df to integrate the data transformation for all the files, since they all have the same columns and require the same data cleaning. But I'm not sure how to do this across the whole directory, and also how to specify the pattern for the csv files.
You can use the following:
Use list.files to get the complete paths of the filenames in all the folders matching a specific pattern ('^98.*\\.csv$').
Use map_df to read all the files and combine them. I have also included a new column called file which identifies the file the data comes from.
filenames <- list.files('data/', recursive = TRUE, full.names = TRUE, pattern = '^98.*\\.csv$')
combine_data <- purrr::map_df(filenames, readr::read_csv, .id = 'file')
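One caveat: when the input vector is unnamed, the .id column stores sequence numbers ("1", "2", ...) rather than paths. A small tweak if you want the actual file names in that column, by naming the vector first:
names(filenames) <- basename(filenames)  # .id then stores these names
combine_data <- purrr::map_df(filenames, readr::read_csv, .id = 'file')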

How to remove certain columns in multiple files in R?

Everyone, I want to remove certain columns in multiple files (csv). For example, I have 50 files, and I want to delete columns a, b, and c in every file. The point is that I don't know how to get the files, save the change in every single file, and keep the original file names.
library(tidyverse)
library(here)  # needed for here()
# I want to delete some columns which contain messy code
# input a list of files
df <- list.files(here("Data"), pattern = ".csv", full.names = TRUE) %>%
  lapply(read_csv) %>%                   # read csv
  lapply(subset, select = -c(a, b, c))   # remove the messy columns
write.csv(df, file = here())
# I want to save the change in the original files, but I don't know how to do it.
Read all the files (if all the files are in the working directory) directly into a list and process it.
files <- list.files()  # if you want to read all the files in the working directory
lst2 <- lapply(files, function(x) read.table(x, header = TRUE))
lst2 <- lapply(lst2, function(x) x[setdiff(names(x), c("a", "b", "c"))])
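To also write each cleaned table back over its original file, as the question asks, a sketch pairing each element of lst2 with its file name (assuming the files are plain csv):
# write each cleaned data frame back to the file it came from
Map(function(dat, f) write.csv(dat, f, row.names = FALSE), lst2, files)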

How to store a folder containing over 30 zipped files into a variable in R

I used the package 'GDELTtools' to download data from GDELT. The data was downloaded; however, no variable was stored in the global environment. I want to store the data in a dataframe variable so I can analyze it.
The folder contains over 30 zipped files, and every zipped file contains one csv. I need to store all these csvs in one variable in the global environment of R. I hope this can be done.
Thank you in advance!
Haven't written R for a while, so I will try my best.
Read the comments carefully, because they explain the procedure.
I will attach links with more information on: unzip, read.csv, merging data frames, empty data frames, and concatenating strings.
According to the docs of GDELTtools, you can easily specify the download folder by providing local.folder="~/gdeltdata" as a parameter to the GetGDELT() function.
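A minimal sketch of that call (argument names per the GDELTtools docs; the dates are placeholders):
library(GDELTtools)
# downloads the raw files into local.folder and returns the data as a data frame
dat <- GetGDELT(start.date = "2013-04-01", end.date = "2013-04-02",
                local.folder = "~/gdeltdata")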
After that, you can use list.files("path/to/files/directory") to obtain a vector of the file names used in the explanation code below. Check the docs for more examples and explanation.
# set path for the unzip output
outDir <- "C:\\Users\\Name\\Documents\\unzipfolder"
# relative path where the zip files are stored
relativePath <- "C:\\path\\to\\my\\directory\\"
# create a variable to store all the paths to the zip files in a vector
zipPaths <- vector()
# since we have 30 files we should iterate through them
# I assume you have a vector with the file names in the variable fileNamesZip
for (name in fileNamesZip) {
  # use paste0() to concatenate strings
  zipfilepath <- paste0(relativePath, name, ".zip")
  # append the file path; note that append() does not modify in place
  zipPaths <- append(zipPaths, zipfilepath)
}
# now we have a vector which contains all the paths to the zip files
# unzip() extracts one archive at a time, so iterate over zipPaths (read the official docs)
for (zp in zipPaths) unzip(zipfile = zp, exdir = outDir)
# initialize a data frame for all the data; you must provide datatypes for the columns
total <- data.frame(Doubles = double(),
                    Ints = integer(),
                    Factors = factor(),
                    Logicals = logical(),
                    Characters = character(),
                    stringsAsFactors = FALSE)
# now it's time to store the data by reading the csv files into data frames
# again, I assume you have a vector with the file names in the variable fileNamesCSV
for (name in fileNamesCSV) {
  # create the csv file path
  csvfilepath <- paste0(outDir, "\\", name, ".csv")
  # read data from the csv file and store it in a data frame
  dataFrame <- read.csv(file = csvfilepath, header = TRUE, sep = ",")
  # data frames merge only if they share structure; specify the column names to merge by
  # (to simply stack rows instead, rbind() may be what you want)
  total <- merge(total, dataFrame, by = c("Name1", "Name2"))
}
Something potentially much simpler:
list.files() lists the files in a directory
readr::read_csv() will automatically unzip files as necessary
dplyr::bind_rows() will combine data frames
So try:
lf <- list.files(pattern="\\.zip")
dfs <- lapply(lf,readr::read_csv)
result <- dplyr::bind_rows(dfs)
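If you also want to track which zip each row came from, bind_rows can store list names in an id column. A small sketch, with "source" as a hypothetical column name:
names(dfs) <- basename(lf)                       # name each data frame by its file
result <- dplyr::bind_rows(dfs, .id = "source")  # .id stores those names in a column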

How to merge many databases in R?

I have a huge database from a telescope at the institute where I currently work. The telescope saves every single day in a file: it takes values for each of the 8 channels it measures every 10 seconds, and every day starts at 00:00 and finishes at 23:59, unless there was a connection error, in which case there are 2 or more files for a single day.
Also, the database has measurement mistakes, missing data, repeated values, etc.
File extensions are .sn1 for days saved in one single file, and .sn1, .sn2, .sn3, ... for days saved in multiple files. All the files have the same number of rows and variables. Besides that, there are 2 formats of database: one has a sort of header that uses the first 5 lines of the file, the other one doesn't.
Every month has its own folder containing its days, and these folders are saved in the year they belong to, so for 10 years I'm talking about more than 3000 files, and to be honest I had never worked with .sn1 files before.
I have code to merge 2 or a handful of files into 1, but this time I have thousands of files (which is far more than I've used before, and also the reason why I can't provide a simple example), and I would like to write a program that merges all of the files into 1 huge database, so I can get a better sample from it.
I have an Excel extension that lists all the file locations in a specific folder; can I use a list like this to put all the files together?
Suggestions were too long for a comment, so I'm posting them as an answer here.
It appears that you are able to read the files into R (at least one at a time), so I'm not getting into that.
Multiple locations: if you have a list of all the locations, you can search those locations for just the files you need. You mentioned an Excel file (let's call it paths.csv; it has only one column with the directory locations):
library(data.table)
all_directories <- fread("paths.csv", col.names = "paths")
# Focusing on only .sn1 files to begin with
file_names <- dir(path = all_directories$paths[1], pattern = ".sn1")
# Getting the full path for each file
file_names <- paste(all_directories$paths[1], file_names, sep = "/")
Reading all the files: I created a space-delimited dummy file and gave it the extension ".sn1", and I was able to read it properly with data.table::fread(). If you're able to open the files using Notepad or something similar, it should work for you too. More information is needed on how the files with the different headers can be distinguished from one another: do they follow a naming convention, or have different extensions (which appears to be the case)? Focusing on the files with 5 rows of headers/other info for now.
read_func <- function(fname){
  dat <- fread(fname, sep = " ", skip = 5)
  dat$file_name <- fname # add file name as a variable - to use for sorting the big dataset
  dat # return the data explicitly; the assignment above is not a useful return value
}
# Get all files into a list
data_list <- lapply(file_names, read_func)
# Merge list to get one big dataset
dat <- rbindlist(data_list, use.names = TRUE, fill = TRUE)
Doing all of the above will give you a dataset for all the files that have the extension ".sn1" in the first directory from your list of directories (paths.csv). You can enclose all of this in a function and use lapply over all the different directories to get a list wherein each element is a dataset of all such files.
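A sketch of that wrapper, assuming read_func and all_directories are defined as above:
# one dataset per directory; lapply over every directory in paths.csv
read_dir <- function(dir_path){
  fn <- dir(path = dir_path, pattern = ".sn1", full.names = TRUE)
  rbindlist(lapply(fn, read_func), use.names = TRUE, fill = TRUE)
}
data_by_dir <- lapply(all_directories$paths, read_dir)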
To include files with ".sn2", ".sn3" ... extensions you can modify the call as below:
ptrns <- paste(sapply(1:5, function(z) paste0("\\.sn", z)), collapse = "|")
# "\\.sn1|\\.sn2|\\.sn3|\\.sn4|\\.sn5" (dots escaped so they match literally)
dir(all_directories$paths[1], pattern = ptrns)
Here's the simplified version that should work for all file extensions in all directories right away; it might take some time if the files are too large, etc. You may want to consider doing this in chunks instead.
# Assuming only one column with no header. sep is set to ";" since by default fread may treat
# spaces as separators. You can use any other symbol that is unlikely to appear in the location names.
# We need the output to be a vector so we can use `lapply` without any unwanted behaviour.
paths_vec <- as.character(fread("paths.csv", sep = ";", select = 1, header = FALSE)$V1)
# Get all file names (incl. location)
file_names <- unlist(lapply(paths_vec, function(z){
  ptrns <- paste(sapply(1:5, function(q) paste0("\\.sn", q)), collapse = "|")
  inter <- dir(z, pattern = ptrns)
  return(paste(z, inter, sep = "/"))
}))
# Get all data in a single data.table using read_func previously defined
dat <- rbindlist(lapply(file_names, read_func), use.names = TRUE, fill = TRUE)
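Since read_func carries the file name through as a column, the combined dataset can then be ordered by source file, as intended:
setorder(dat, file_name)  # data.table: sort rows by the file they came from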
