How to merge many databases in R?

I have a huge database from a telescope at the institute where I am currently working. The telescope saves one file per day, recording values for each of its 8 channels every 10 seconds; each day runs from 00:00 to 23:59 unless there was a connection error, in which case there are 2 or more files for a single day.
The database also has measurement errors, missing data, repeated values, etc.
The file extension is .sn1 for days saved in a single file, and .sn1, .sn2, .sn3, ... for days saved in multiple files. All files have the same number of rows and variables. In addition, there are 2 database formats: one has a sort of header that uses the first 5 lines of the file, the other doesn't.
Every month has its own folder containing its days, and these folders are grouped by year, so for 10 years I'm talking about more than 3000 files; to be honest, I had never worked with .sn1 files before.
I have code to merge 2 or a handful of files into 1, but this time I have thousands of files (far more than I've worked with before, which is also why I can't provide a simple example), and I would like to write a program that merges all of the files into 1 huge database so I can get a better sample from it.
I have an Excel extension that lists all the file locations in a specific folder; can I use a list like this to put all the files together?

Suggestions were too long for a comment, so I'm posting them as an answer here.
It appears that you are able to read the files into R (at least one at a time) so I'm not getting into that.
Multiple locations: if you have a list of all the locations, you can search those locations for just the files you need. You mentioned an Excel file (let's call it paths.csv; it has only one column with the directory locations):
library(data.table)
all_directories <- fread("paths.csv", col.names = "paths")
# Focusing on only .sn1 files to begin with
file_names <- dir(path = all_directories$paths[1], pattern = ".sn1")
# Getting the full path for each file
file_names <- paste(all_directories$paths[1], file_names, sep = "/")
Reading all the files: I created a space-delimited dummy file and gave it the extension ".sn1" - I was able to read it properly with data.table::fread(). If you're able to open the files using Notepad or something similar, this should work for you too. More information is needed on how the files with different headers can be distinguished from one another - do they follow a naming convention, or do they have different extensions (which appears to be the case)? Focusing on the files with 5 rows of headers/other info for now.
read_func <- function(fname){
  dat <- fread(fname, sep = " ", skip = 5)
  dat$file_name <- fname # Add file name as a variable - to use for sorting the big dataset
  dat # Return the data, otherwise the function would return the file name
}
# Get all files into a list
data_list <- lapply(file_names, read_func)
# Merge list to get one big dataset
dat <- rbindlist(data_list, use.names = T, fill = T)
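Since some of your files apparently have no header block at all, here is a hedged sketch of a reader that tries to detect the 5-line header itself. The field-count check is an assumption on my part - adapt it to however the two formats actually differ:
read_func_any <- function(fname){
  first_lines <- readLines(fname, n = 6)
  # Assumption: a header block has a different number of space-separated
  # fields on line 1 than a data row on line 6
  has_header <- length(strsplit(first_lines[1], " ")[[1]]) !=
    length(strsplit(first_lines[6], " ")[[1]])
  dat <- fread(fname, sep = " ", skip = if (has_header) 5 else 0)
  dat$file_name <- fname
  dat
}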
Doing all of the above will give you a dataset for all the files that have the extension ".sn1" in the first directory from your list of directories (paths.csv). You can enclose all of this in a function and use lapply over all the different directories to get a list wherein each element is a dataset of all such files.
To include files with ".sn2", ".sn3" ... extensions you can modify the call as below:
ptrns <- paste(sapply(1:5, function(z) paste(".sn",z,sep = "")), collapse = "|")
# ".sn1|.sn2|.sn3|.sn4|.sn5"
dir(all_directories$paths[1], pattern = ptrns)
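One small caveat (my addition, not part of the original suggestion): "." in a regex matches any character, so a stricter pattern would escape the dot and anchor it at the end of the file name:
# Stricter pattern: escape the dot and anchor at the end of the name
ptrns <- paste0("\\.sn", 1:5, "$", collapse = "|")
# "\\.sn1$|\\.sn2$|\\.sn3$|\\.sn4$|\\.sn5$"
dir(all_directories$paths[1], pattern = ptrns)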
Here's the simplified version that should work for all file extensions in all directories right away - it might take some time if the files are large, so you may want to consider doing this in chunks instead (see the sketch after the code below).
# Assuming only one column with no header. sep is set to ";" since by default fread may treat spaces
# as separators. You can use any other symbol that is unlikely to be present in the location names
# We need the output to be a vector so we can use `lapply` without any unwanted behaviour
paths_vec <- as.character(fread("paths.csv", sep = ";", select = 1, header = F)$V1)
# Get all file names (incl. location)
file_names <- unlist(lapply(paths_vec, function(z){
  ptrns <- paste(sapply(1:5, function(q) paste(".sn", q, sep = "")), collapse = "|")
  inter <- dir(z, pattern = ptrns)
  return(paste(z, inter, sep = "/"))
}))
# Get all data in a single data.table using read_func previously defined
dat <- rbindlist(lapply(file_names, read_func), use.names = T, fill = T)
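If memory does become a problem, here is a rough sketch of the chunked approach mentioned above, assuming you are happy to save one intermediate .rds file per directory (the "merged_chunk.rds" name is just a placeholder):
process_directory <- function(dir_path){
  ptrns <- paste(sapply(1:5, function(q) paste(".sn", q, sep = "")), collapse = "|")
  fls <- dir(dir_path, pattern = ptrns)
  fls <- paste(dir_path, fls, sep = "/")
  dat <- rbindlist(lapply(fls, read_func), use.names = T, fill = T)
  saveRDS(dat, paste(dir_path, "merged_chunk.rds", sep = "/"))  # placeholder name
  invisible(NULL)
}
lapply(paths_vec, process_directory)
# Later, combine the per-directory chunks
dat <- rbindlist(lapply(paste(paths_vec, "merged_chunk.rds", sep = "/"), readRDS),
                 use.names = T, fill = T)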

Related

Naming a .csv with text that can be updated each year

I'm looking for a way to automate the updating of file names. The code will be used annually to download several .csv files. I would like to be able to change the 2020_2021 portion of the name to whatever assessment year it is (i.e. 2021_2022, 2022_2023, etc.) at the beginning of the script so the file names don't have to be updated manually.
write.csv(SJRML_00010,
          file = "SJRML__00010_2020_2021.csv")
write.csv(SJRML_00095,
          file = "SJRML_00095_2020_2021.csv")
write.csv(SJRML_00480,
          file = "SJRML_00480_2020_2021.csv")
lastyear <- 2020
prevassessment <- sprintf("%i_%i", lastyear, lastyear+1)
nextassessment <- sprintf("%i_%i", lastyear+1, lastyear+2)
prevassessment
# [1] "2020_2021"
filenames <- c("SJRML__00010_2020_2021.csv", "SJRML_00095_2020_2021.csv")
gsub(prevassessment, nextassessment, filenames, fixed = TRUE)
# [1] "SJRML__00010_2021_2022.csv" "SJRML_00095_2021_2022.csv"
You can do the gsub on a vector of filenames or one at a time, however you are implementing your processing.
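For example (a small sketch reusing the objects above; the SJRML_* data frames are the ones from the question):
new_filenames <- gsub(prevassessment, nextassessment, filenames, fixed = TRUE)
write.csv(SJRML_00010, file = new_filenames[1])
write.csv(SJRML_00095, file = new_filenames[2])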
To create a .csv with a name that can be updated
Year <- "_2020"
then
write.csv(file_name, paste0("file_name", Year,".csv"))
This returns file_name_2020.csv

Reading in multiple srt files

I'd like to read in multiple srt files in R. I can read them into a list but I need to load them in sequentially by the way they were created in the file directory.
I'd also like to make a column to tell which file they come from. So I can tell which data came from file 1, file 2.. etc.
I can read them in as a list, but the files are named like "1 - FileTest", "2 - FileTest", "#10 FileTest", ... etc.
The list then loads like 1, 10, 11, ... etc., even though in my file directory file 11 was created after file 9, for instance. I should just need a parameter for them to load sequentially, so that when I put them in a dataframe they appear in chronological order.
list_of_files <- list.files(path = path,
                            pattern = "*.srt",
                            full.names = TRUE)
Files <- lapply(list_of_files, srt.read)
Files <- data.frame(matrix(unlist(Files), byrow = T), stringsAsFactors = FALSE)
The files load in, but not in chronological order, and it is difficult to tell which data is associated with which file.
I have approximately 150 files so being able to compile them into a single dataframe would be very helpful. Thanks!
Consider extracting metadata of the files with file.info (which includes created/modified time, file size, owner, group, etc.). Then order the resulting data frame by created date/time, and finally import the .srt files using the ordered list of files:
raw_list_of_files <- list.files(path = path,
                                pattern = "*.srt",
                                full.names = TRUE)
# CREATE DATA FRAME OF FILE INFO
meta_df <- file.info(raw_list_of_files)
# SORT BY CREATED DATE/TIME
meta_df <- with(meta_df, meta_df[order(ctime), ])
# IMPORT DATA FRAMES IN ORDERED FILES
srt_list <- lapply(row.names(meta_df), srt.read)
final_df <- data.frame(matrix(unlist(srt_list), byrow = TRUE),
                       stringsAsFactors = FALSE)
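If creation times turn out to be unreliable, another option (my own sketch, not part of the answer above) is to sort by the number embedded in each file name ("1 - FileTest", "#10 FileTest", ...) and to record the source file in a column, which also covers the second part of the question:
list_of_files <- list.files(path = path, pattern = "*.srt", full.names = TRUE)
# Assumption: the only run of digits in each name is its sequence number
file_num <- as.numeric(gsub("\\D", "", basename(list_of_files)))
ordered_files <- list_of_files[order(file_num)]
read_srt_tagged <- function(f) {
  x <- srt.read(f)  # srt.read() as used in the question
  df <- data.frame(matrix(unlist(x), byrow = TRUE), stringsAsFactors = FALSE)
  df$source_file <- basename(f)  # column identifying the originating file
  df
}
final_df <- do.call(rbind, lapply(ordered_files, read_srt_tagged))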

Using R to merge many large CSV files across sub-directories

I have over 300 large CSV files with the same filename, each in a separate sub-directory, that I would like to merge into a single dataset using R. I'm asking for help on how to remove columns I don't need in each CSV file, while merging in a way that breaks the process down into smaller chunks that my memory can more easily handle.
My objective is to create a single CSV file that I can then import into STATA for further analysis using code I have already written and tested on one of these files.
Each of my CSVs is itself rather large (about 80 columns, many of which are unnecessary, and each file has tens to hundreds of thousands of rows), and there are almost 16 million observations in total, or roughly 12GB.
I have written some code which manages to do this successfully for a test case of two CSVs. The challenge is that neither my work nor my personal computers have enough memory to do this for all 300+ files.
The code I have tried is here:
library(here) ## loads package used to find files
( allfiles = list.files(path = here("data"),        ## creates a list of the files, read as [1], [2], ... [n]
                        pattern = "candidates.csv", ## identifies the relevant files
                        full.names = TRUE,          ## returns the full file path
                        recursive = TRUE) )         ## searches in sub-directories
read_fun = function(path) {
  test = read.csv(path, header = TRUE)
  test
} ### reads a single file
( test = read.csv(allfiles[1], header = TRUE) ) ### tests that file [1] can be read
library(purrr) ### loads package providing map_dfr
library(dplyr) ### loads package used alongside map_dfr
( combined_dat = map_dfr(allfiles, read_fun) )
I expect the result to be a single RDS file, and this works for the test case. Unfortunately, the amount of memory this process requires when looking at 15.5m observations across all my files causes RStudio to crash, and no RDS file is produced.
I am looking for help on how to 1) reduce the load on my memory by stripping out some of the variables in my CSV files I don't need (columns with headers junk1, junk2, etc); and 2) how to merge in a more manageable way that merges my CSV files in sequence, either into a few RDS files to themselves be merged later, or through a loop cumulatively into a single RDS file.
However, I don't know how to proceed with these - I am still new to R, and any help on how to proceed with both 1) and 2) would be much appreciated.
Thanks,
Twelve GB is quite a bit for one object. It's probably not practical to use a single RDS or CSV unless you have far more than 12GB of RAM. You might want to look into using a database, a technology that is made for this kind of thing. I'm sure Stata can also interact with databases. You might also want to read up on how to interact with large CSVs using various strategies and packages.
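As a rough illustration of the database route (my sketch, not the method this answer goes on to use; it assumes the RSQLite package and the same csv_paths / junk-column setup defined in the code below, and the "candidates" table and file names are placeholders):
library(DBI)
library(RSQLite)
library(readr)
library(dplyr)
con <- dbConnect(SQLite(), "candidates.sqlite")  # on-disk SQLite file (placeholder name)
for (p in csv_paths) {
  df <- read_csv(p, col_types = cols(.default = "c"))  # read every column as character
  df <- select(df, -starts_with("junk"))               # drop the junk columns
  dbWriteTable(con, "candidates", df, append = TRUE)   # append to one big table
}
dbDisconnect(con)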
Creating a large CSV isn't at all difficult. Just remember that you have to work with said giant CSV sometime in the future, which probably will be difficult. To create a large CSV, just process each component CSV individually and then append them to your new CSV. The following reads in each CSV, removes unwanted columns, and then appends the resulting dataframe to a flat file:
library(dplyr)
library(readr)
library(purrr)
load_select_append <- function(path) {
  # Read in CSV. Let every column be of class character.
  df <- read_csv(path, col_types = paste(rep("c", 82), collapse = ""))
  # Remove variables beginning with "junk"
  df <- select(df, -starts_with("junk"))
  # If file exists, append to it without column names, otherwise create with
  # column names.
  if (file.exists("big_csv.csv")) {
    write_csv(df, "big_csv.csv", col_names = F, append = T)
  } else {
    write_csv(df, "big_csv.csv", col_names = T)
  }
}
# Get the paths to the CSVs.
csv_paths <- list.files(path = "dir_of_csvs",
                        pattern = "\\.csv.*",
                        recursive = T,
                        full.names = T)
# Apply function to each path.
walk(csv_paths, load_select_append)
When you're ready to work with your CSV you might want to consider using something like the ff package, which enables interaction with on-disk objects. You are somewhat restricted in what you can do with an ffdf object, so eventually you'll have to work with samples:
library(ff)
df_ff <- read.csv.ffdf(file = "big_csv.csv")
df_samp <- df_ff[sample.int(nrow(df_ff), size = 100000),]
df_samp <- mutate(df_samp, ID = factor(ID))
summary(df_samp)
#### OUTPUT ####
values ID
Min. :-9.861 17267 : 6
1st Qu.: 6.643 19618 : 6
Median :10.032 40258 : 6
Mean :10.031 46804 : 6
3rd Qu.:13.388 51269 : 6
Max. :30.465 52089 : 6
(Other):99964
As far as I know, chunking and on-disk interactions are not possible with RDS or RDA, so you are stuck with flat files (or you go with one of the other options I mentioned above).

CSV with multiple datasets/different-number-of-columns

Similar to How can you read a CSV file in R with different number of columns, I have some complex CSV-files. Mine are from SAP BusinessObjects and hold challenges different to those of the quoted question. I want to automate the capture of an arbitrary number of datasets held in one CSV file. There are many CSV-files, but let's start with one of them.
Given: One CSV file containing several flat tables.
Wanted: Several dataframes or other structure holding all data (S4?)
The method so far:
get line numbers of header data by counting number of columns
get headers by reading every line index held in vector described above
read data by calculating skip and nrows between data sets in index described by header lines as above.
give the read data column names from read header
I need help getting on the right track: avoiding loops and making the code more readable and compact when reading the headers and datasets.
These CSVs are formatted as normal CSVs, except that they contain a more or less arbitrary number of subtables. The structure is different for each dataset I export. In the current example I will suppose there are five tables included in the CSV.
In order to give you an idea, here is some fictitious sample data with line numbers. Separator and quote have been stripped:
1: n, Name, Species, Description, Classification
2: 90, Mickey, Mouse, Big ears, rat
3: 45, Minnie, Mouse, Big bow, rat
...
16835: Code, Species
16836: RT, rat
...
22673: n, Code, Country
22674: 1, RT, Murica
...
33211: Activity, Code, Descriptor
33212: running, RU, senseless activity
...
34749: Last update
34750: 2017/05/09 02:09:14
There are a number of ways to go about reading each dataset. What I have come up with so far:
filepath <- file.path(paste0(Sys.getenv("USERPROFILE"), "\\SAMPLE.CSV"))
# Make a vector with column number per line
fieldVector <- utils::count.fields(filepath, sep = ",", quote = "\"")
# Make a vector with unique number of fields in file
nFields <- base::unique(fieldVector)
# Make a vector with indices for position of new dataset
iHeaders <- base::match(nFields, fieldVector)
With this, I can do things like:
header <- utils::read.csv2(filepath, header = FALSE, sep = ",", quote = "\"", skip = iHeaders[4] - 1, nrows = 1)
data <- utils::read.csv2(filepath, header = FALSE, sep = ",", quote = "\"", skip = iHeaders[4], nrows = iHeaders[5] - iHeaders[4] - 1)
names(data) <- unlist(header)
As in the intro of this post, I have made a couple of functions which makes it easier to get headers for each dataset:
Headers <- GetHeaders(filepath, iHeaders)
colnames(data) <- Headers[[4]]
I have two functions now - one is GetHeader, which captures one line from the file with utils::read.csv2 while ensuring safe headernames (no æøå % etc).
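GetHeader isn't shown in the question; a rough sketch of what it might look like, assuming make.names() is close enough for the "safe header names" part:
GetHeader <- function(filepath, linenum) {
  hdr <- utils::read.csv2(filepath, header = FALSE, sep = ",", quote = "\"",
                          skip = linenum - 1, nrows = 1,
                          stringsAsFactors = FALSE)
  # make.names() turns the raw header strings into syntactically valid names
  make.names(unlist(hdr), unique = TRUE)
}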
The other returns a list of string vectors holding all headers:
GetHeaders <- function(filepath, linenums) {
  # init an empty list of length(linenums)
  l.headers <- vector(mode = "list", length = length(linenums))
  for(i in seq_along(linenums)) {
    # read.csv2(filepath, skip = linenums[i]-1, nrows = 1)
    l.headers[[i]] <- GetHeader(filepath, linenums[i])
  }
  l.headers
}
What I struggle with is how to read in all possible datasets in one go. The last set in particular is hard to wrap my head around if I write a common function, because for it I only know the line number of the header and not the number of data lines that follow.
Also, what is the best data structure to hold data like this? The data in the subtables are all related to each other (they can be used to normalize parts of the data). I understand that I must do manual work for each CSV I read, but since I have to read TONS of these files, some common functions to structure them in a predictable manner at each pass would be excellent.
Before you answer, please keep in mind that, no, using a different export format is not an option.
Thank you so much for any pointers. I am a beginner in R and haven't completely wrapped my head around all possible solutions in this particular domain.
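For what it's worth, a minimal untested sketch of reading every sub-table in one pass, using the iHeaders vector above and treating the end of the file as the boundary of the last block (the table_ names are placeholders):
iBounds <- c(iHeaders, length(fieldVector) + 1)  # sentinel just past the last line
datasets <- lapply(seq_along(iHeaders), function(i) {
  hdr <- utils::read.csv2(filepath, header = FALSE, sep = ",", quote = "\"",
                          skip = iBounds[i] - 1, nrows = 1,
                          stringsAsFactors = FALSE)
  dat <- utils::read.csv2(filepath, header = FALSE, sep = ",", quote = "\"",
                          skip = iBounds[i],
                          nrows = iBounds[i + 1] - iBounds[i] - 1,
                          stringsAsFactors = FALSE)
  names(dat) <- make.names(unlist(hdr), unique = TRUE)
  dat
})
# A plain named list keeps the related sub-tables together for later joins
names(datasets) <- paste0("table_", seq_along(datasets))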

Run separate functions on multiple elements of list based on regex criteria in data.frame

The following works, but I'm missing a functional programming technique, indexing, or a better way of structuring my data. After a month, it will take a bit to remember exactly how this works instead of being easy to maintain. It seems like a workaround when it shouldn't be. I want to use regex to decide which function to use for expected groups of files. When a new file format comes along, I can write the read function, then add the function along with the regex to the data.frame to run it alongside all the rest.
I have different formats of Excel and csv files that need to be read in and standardized. I want to maintain a list or data.frame of the file name regex and appropriate function to use. Sometimes there will be new file formats that won't be matched, and old formats without new files. But then it gets complicated which is something I would prefer to avoid.
# files to read in based on filename
fileexamples <- data.frame(
  filename = c('notanyregex.xlsx', 'regex1today.xlsx', 'regex2today.xlsx', 'nomatch.xlsx',
               'regex1yesterday.xlsx', 'regex2yesterday.xlsx', 'regex3yesterday.xlsx'),
  readfunctionname = NA
)
# regex and corresponding read function
filesourcelist <- read.table(header = T,stringsAsFactors = F,text = "
greptext readfunction
'.*regex1.*' 'readsheettype1'
'.*nonematchthis.*' 'readsheetwrench'
'.*regex2.*' 'readsheettype2'
'.*regex3.*' 'readsheettype3'
")
# list of grepped files
fileindex <- lapply(filesourcelist$greptext, function(greptext, files){
  grepmatches <- grep(pattern = greptext, x = data.frame(files)[, 1], ignore.case = T)
}, files = fileexamples$filename)
# run function on files based on fileindex from grep
for(i in 1:length(fileindex)){
  fileexamples[fileindex[[i]], 'readfunctionname'] <- filesourcelist$readfunction[i]
}
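As one possible simplification (a sketch, not a drop-in replacement): look up the matching regex for each file directly and dispatch through match.fun(), so the mapping lives entirely in filesourcelist and no intermediate index list is needed. The read functions named there are assumed to exist when the files are actually read:
fileexamples$readfunctionname <- sapply(fileexamples$filename, function(f) {
  # which regex patterns match this file name?
  hit <- which(sapply(filesourcelist$greptext, grepl, x = f, ignore.case = TRUE))
  if (length(hit) == 0) NA_character_ else filesourcelist$readfunction[hit[1]]
})
# Reading would then look something like:
# results <- Map(function(f, fun) if (!is.na(fun)) match.fun(fun)(f),
#                fileexamples$filename, fileexamples$readfunctionname)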
