How do I load multiple CSV into DataFrames in Julia? - julia

I already know how to load a single CSV into a DataFrame:
using CSV
using DataFrames
df = DataFrame(CSV.File("C:\\Users\\username\\Table_01.csv"))
How would I do this when I have several CSV files, e.g. Table_01.csv, Table_02.csv, Table_03.csv?
Would I create a bunch of empty DataFrames and use a for loop to fill them? Or is there an easier way in Julia? Many thanks in advance!

If you want multiple data frames (not a single data frame holding the data from multiple files) there are several options.
Let me start with the simplest approach using broadcasting:
dfs = DataFrame.(CSV.File.(["Table_01.csv", "Table_02.csv", "Table_03.csv"]))
or
dfs = #. DataFrame(CSV.File(["Table_01.csv", "Table_02.csv", "Table_03.csv"]))
or (with a bit of more advanced stuff, using function composition):
(DataFrame∘CSV.File).(["Table_01.csv", "Table_02.csv", "Table_03.csv"])
or using chaining:
CSV.File.(["Table_01.csv", "Table_02.csv", "Table_03.csv"]) .|> DataFrame
Now other options are map as it was suggested in the comment:
map(DataFrame∘CSV.File, ["Table_01.csv", "Table_02.csv", "Table_03.csv"])
or just use a comprehension:
[DataFrame(CSV.File(f)) for f in ["Table_01.csv", "Table_02.csv", "Table_03.csv"]]
(I am listing the options to show different syntactic possibilities in Julia)

This is how I have done it, but there might be an easier way.
using DataFrames, Glob
import CSV
function readcsvs(path)
files=glob("*.csv", path) #Vector of filenames. Glob allows you to use the asterisk.
numfiles=length(files) #Number of files to read.
tempdfs=Vector{DataFrame}(undef, numfiles) #Create a vector of empty dataframes.
for i in 1:numfiles
tempdfs[i]=CSV.read(files[i]) #Read each CSV into its own dataframe.
end
masterdf=outerjoin(tempdfs..., on="Column In Common") #Join the temporary dataframes into one dataframe.
end

A simple solution where you don't have to explicitly enter filenames:
using CSV, Glob, DataFrames
path = raw"C:\..." # directory of your files (raw is useful in Windows to add a \)
files=glob("*.csv", path) # to load all CSVs from a folder (* means arbitrary pattern)
dfs = DataFrame.( CSV.File.( files ) ) # creates a list of dataframes
# add an index column to be able to later discern the different sources
for i in 1:length(dfs)
dfs[i][!, :sample] .= i # I called the new col sample
end
# finally, if you want, reduce your collection of dfs via vertical concatenation
df = reduce(vcat, dfs)

Related

Iterating over CSVs to different dataframes based on file names

I have a dataframe that contains the names of a bunch of .CSV files. It looks how it does in the snippet below:
What I'm trying to do is convert each of these .CSVs into a dataframe that appends the results of each. What I'm trying to do is create three different dataframes based on what's in the file names:
Create a dataframe with all results from .CSV files with -callers- in its file name
Create a dataframe with all results from .CSV files with -results in its filename
Create a dataframe with all results from .CSV files with -script_results- in its filename
The command to actually convert the .CSV file into a dataframe looks like this if I were using the first .CSV in the dataframe below:
data <- aws.s3::s3read_using(read.csv, object = "s3://abc-testtalk/08182020-testpilot-arizona-results-08-18-2020--08-18-2020-168701001.csv")
But what I'm trying to do is:
Iterate ALL the .csv files under Key using the s3read_using function
Put them in three separate dataframes based on the file names as listed above
Key
08182020-testpilot-arizona-results-08-18-2020--08-18-2020-168701001.csv
08182020-testpilot-arizona-results-08-18-2020--08-18-2020-606698088.csv
08182020-testpilot-arizona-script_results-08-18-2020--08-18-2020-114004469.csv
08182020-testpilot-arizona-script_results-08-18-2020--08-18-2020-450823767.csv
08182020-testpilot-iowa-callers-08-18-2020-374839084.csv
08182020-testpilot-maine-callers-08-18-2020-396935866.csv
08182020-testpilot-maine-results-08-18-2020--08-18-2020-990912614.csv
08182020-testpilot-maine-script_results-08-18-2020--08-18-2020-897037786.csv
08182020-testpilot-michigan-callers-08-18-2020-367670258.csv
08182020-testpilot-michigan-follow-ups-08-18-2020--08-18-2020-049435266.csv
08182020-testpilot-michigan-results-08-18-2020--08-18-2020-544974900.csv
08182020-testpilot-michigan-script_results-08-18-2020--08-18-2020-239089219.csv
08182020-testpilot-nevada-callers-08-18-2020-782329503.csv
08182020-testpilot-nevada-results-08-18-2020--08-18-2020-348644934.csv
08182020-testpilot-nevada-script_results-08-18-2020--08-18-2020-517037762.csv
08182020-testpilot-new-hampshire-callers-08-18-2020-134150800.csv
08182020-testpilot-north-carolina-callers-08-18-2020-739838755.csv
08182020-testpilot-pennsylvania-callers-08-18-2020-223839956.csv
08182020-testpilot-pennsylvania-results-08-18-2020--08-18-2020-747438886.csv
08182020-testpilot-pennsylvania-script_results-08-18-2020--08-18-2020-546894204.csv
08182020-testpilot-virginia-callers-08-18-2020-027531377.csv
08182020-testpilot-virginia-follow-ups-08-18-2020--08-18-2020-419338697.csv
08182020-testpilot-virginia-results-08-18-2020--08-18-2020-193170030.csv
Create 3 empty dataframes. You will probably also need to indicate column names matching column names from each of the file you want to append:
results <- data.frame()
script_results <- data.frame()
callers <- data.frame()
Then iterate over file_name and read it into data object. Conditionally on what pattern ("-results-", "-script_results-" or "-caller-" is contanied in the name of each file, it will be appended to the correct dataframe:
for (file in file_name) {
data <- aws.s3::s3read_using(read.csv, object = paste0("s3://abc-testtalk/", file))
if (grepl(file, "-results-")) { results <- rbind(results, data)}
if (grepl(file, "-script_results-")) { script_results <- rbind(script_results, data)}
if (grepl(file, "-callers-")) { callers <- rbind(callers, data)}
}
As an alternative to #JohnFranchak's recommendation for map_dfr (which likely works just fine), the method that I referenced in comments would look something like this:
alldat <- lapply(setNames(nm = dat$file_name),
function(obj) aws.s3::s3read_using(read.csv, object = obj))
callers <- do.call(rbind, alldat[grepl("-callers-", names(alldat))])
results <- do.call(rbind, alldat[grepl("-results-", names(alldat))])
script_results <- do.call(rbind, alldat[grepl("-script_results-", names(alldat))])
others <- do.call(rbind, alldat[!grepl("-(callers|results|script_results)-", names(alldat))])
The do.call(rbind, ...) part is analogous to dplyr::bind_rows and data.table::rbindlist in that it accepts a list of frames, and the result is a single frame. Some differences:
do.call(rbind, ...) really requires all columns to exist in all frames, in the same order. It's not hard to enforce this externally (e.g., adding missing columns, rearranging), but it's not automatic.
data.table::rbindlist will complain for the same conditions (missing columns or different order), but it has fill= and use.names= arguments that need to be set TRUE.
dplyr::bind_rows will fill and row-bind by-name by default, without message or warning. (I don't agree that a default of silence is good all of the time, but it is the simplest.)
Lastly, my use of setNames(nm=..) is merely to assign the filename to each object. This is not strictly necessary since we still have dat$file_name, but I've found that with two separate objects, it is feasible to accidentally change (delete, append, or reorder) one of them and not the other, so I prefer to keep the names and the objects (frames) perfectly tied together. These two calls are relatively the same in the resulting named-list:
lapply(setNames(nm = dat$file_name), ...)
sapply(dat$file_name, ..., simplify = FALSE)

Create list of files in directory, apply (lapply) a custom function to each, and cbind results to new file

rewrote in attempt to simplify my problem statement.
I am using R V1.3.959 and relatively new to R overall. I have a custom excel form, which means the objects are in various cells in excel and the variable is also in some cell. I have over 1000 of these forms as product specs. I read in only 1 file and created a function called tidy.form to pull data out and then cbind into new file as below.
read_customer_file = "C:/Users/..../FABRIC TECHNICAL SUBMISSION AGREEMENT J123abd.xlsx"
product_tech <- read_excel(read_customer_file, sheet = "Form") %>% clean_names()
#function for make form tidy
form.extract <- function(tidy.form) {
#extract the object / data point looking for but with entire column
fabric.supplier.name <- product_tech[c( 0,5)]
#extract the specific row in the column with the data point desired
fabric.supplier.name <- slice(fabric.supplier.name, 3,0)
#rename column to correct variable
colnames(fabric.supplier.name)[colnames(fabric.supplier.name) == "x5"] <- "fabric.supplier.name"
combine <- cbind(date, fabric.supplier.name, address)
return(combine)
}
Now I need a way to read in all of the xlsx files from a directory and do the same thing for each.
I figured out how to read the file names in through:
files <- list.files(path="C:/Users/me/productspecfolder", pattern="*.xlsx", full.names=TRUE, recursive=FALSE)
However I am stuck at how to loop / lapply through my list.files and apply the function tidy.form to each.
Any help would be so much appreciated!

how to use "for loop" to write multiple .csv file names?

Does anyone know the best way to carry out a "for loop" that would read in different subject id's and append them to the name of an exported csv?
As an example, I have multiple output files from an electrocardiogram software program (each file belongs to one individual). The files are named C800_HR.bdf.evt, C801_HR.bdf.evt, C802_HR.bdf.evt etc. Each file gets read into r and then has a script applied to calculate heart rate variability. At the end of the script, I need to add a loop that will extract the subject id (e.g., C800, C801, C802) and write a new file name for each individual so that it becomes C800_RtoR.csv. Essentially, I would like to avoid changing the syntax every time I read in and export a file name.
I am currently using the following syntax to read in multiple files:
>setwd("/Users/kmpc/Downloads")
>myhrvdata <-lapply(Sys.glob("C8**_HR.bdf.evt"), read.delim)
Try this out:
cardio_files <- list.files(pattern = "C8\\d{2}_HR.bdf.evt")
subject_ids <- sub("^(C8\\d{2})_.*", "\\1" cardio_files)
myList <- lapply(cardio_files, read.delim)
## do calculations on the list
for (i in names(myList)) {
write.csv(myList[[i]], paste0(subject_ids[i], "_RtoR.csv"))
}
The only thing is, you have to deal with using a list when doing your calculations. You could combine them to a single data.frame, but it would be best to leave it as a list to write the files at the end.
Consider generalizing your process by creating a function that: 1) reads in file, 2) processes data, 3) outputs to csv. Then have lapply call the defined method iteratively across all Sys.glob items and even return a list of calculated data frames.
proc_heart_rate <- function(f_name) {
# READ IN .evt FILE INTO df
df <- read.delim(f_name)
# CALCULATE HEART RATE VARIABILITY WITH df
...
# OUTPUT df TO CSV
subject_id <- gsub("\\_.*", "", f_name)
write.csv(df, paste0(subject_id, "_RtoR.csv"))
# RETURN df FOR OTHER USES
return(df)
}
# LIST OF DATA FRAMES WITH CALCULATIONS
myhrvdata_list <-lapply(Sys.glob("C8**_HR.bdf.evt"), proc_heart_rate)

Extracting a single cell value from multiple csv files in R and

I have 500 csv. files with data that looks like:
sample data
I want to extract one cell (e.g. B4 or 0.477) per a csv file and combine those values into a single csv. What are some recommendations on how to do this easily?
You can try something like this
all.fi <- list.files("/path/to/csvfiles", pattern=".csv", full.names=TRUE) # store names of csv files in path as a string vector
library(readr) # package for read_lines and write_lines
ans <- sapply(all.fi, function(i) { eachline <- read_lines(i, n=4) # read only the 4th line of the file
ans <- unlist(strsplit(eachline, ","))[2] # split the string on commas, then extract the 2nd element of the resulting vector
return(ans) })
write_lines(ans, "/path/to/output.csv")
I can not add a comment. So, I will write my comment here.
Since your data is very large and it is very difficult to load it individually, then try this: Importing multiple .csv files into R. It is similar to the first part of your problem. For second part, try this:
You can save your data as a data.frame (as with the comment of #Bruno Zamengo) and then you can use select and merge functions in R. Then, you can easily combine them in single csv file. With select and merge functions you can select all the values you need and them combine them. I used this idea in my project. Do not forget to use lapply.

R: import identically formatted data, make a column corresponding to the original filename, vertically concatenate?

I am working with some ocean sensors that were deployed at different depths. Each sensor recorded several parameters (time, temperature, oxygen) at different depths, and each outputted an identically formatted file which I have renamed to 'top.csv', 'mid.csv', bot.csv' (for top, middle, bottom).
I currently have only three files, but will eventually have more so I want to set this up iteratively. Optimally I would have something set up such that:
R will import all csv files from a specified directory
It will add a column to each data frame called "depth" with the name of the original file.
rbind them into a single data frame.
I am able to do steps 1 and 3 with the two lines below. The first line gets the file names from a specific directory that match the pattern, while the second line uses lapply nested in do.call to read all the files and vertically concatenate.
files = list.files('./data/', pattern="*.csv")
oxygenData= do.call(rbind, lapply(files, function(x) read.csv(paste('./data/',x)))
The justification to end up with a single data file is to plot them easier, as such:
ggplot(data = oxygenData, aes(x = time, y = oxygen, group = depth, color = depth))+geom_line()
Also, would dealing with this kind of data be easier with data.table? Thank you!
You can accomplish this by building your own function:
myFunc <- function(fileName) {
# read in file
temp <- read.csv(paste0("<filePath>/", fileName), as.is=TRUE)
# assign file name
temp$fileName <- fileName
# return data.frame
temp
}
Note that you could generalize myFunc by adding a second argument that takes the file path, allowing the directory to be set dynamically. Next, put this into lapply to get a list of data.frames:
myList <- lapply(fileNameVector, myFunc)
Finally, append the files using do.call and rbind.
res <- do.call(rbind, myList)

Resources