Average multiple CSV files into one averaged file in R

I have approximately 300 CSV files with columns for wind speed, temperature, pressure, etc.; each row is a different time from 2007 to 2012. Each file is from a different location. I want to combine all the files into one that is the average of all 300. The new file would have the same number of rows and columns as each individual file, but each cell would be the average of the corresponding cells across all 300 files. Is there an easy way to do this?

Following this post, you could read all the files into a list (here I've assumed they're named weather*.csv):
# pattern is a regular expression, not a shell glob
csvs <- lapply(list.files(pattern = "^weather.*\\.csv$"), read.csv)
All that remains is to take the average of all those data frames. You might try something like:
Reduce("+", csvs) / length(csvs)
If you wanted to only add a subset of the columns, you could pass Reduce a list of data frames with the appropriate subset of the columns. For instance, if you wanted to remove the first column from each, you could do something like:
Reduce("+", lapply(csvs, "[", -1)) / length(csvs)
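As a minimal sketch of the approach above, assuming every file has the same rows in the same order and a non-numeric first column (called `time` here, a hypothetical name), the snippet below builds three small files in a temporary directory and averages them cell by cell. The file names and values are made up for illustration.

```r
# Create three small example files in a temporary directory (made-up data).
tmp <- file.path(tempdir(), "avg_demo")
dir.create(tmp, showWarnings = FALSE)
for (i in 1:3) {
  df <- data.frame(time = c("t1", "t2"),
                   wind = c(1, 2) * i,
                   temp = c(10, 20) + i)
  write.csv(df, file.path(tmp, paste0("weather", i, ".csv")), row.names = FALSE)
}

# Read every matching file; the pattern is a regular expression, not a glob.
paths <- list.files(tmp, pattern = "^weather.*\\.csv$", full.names = TRUE)
csvs  <- lapply(paths, read.csv)

# Average the numeric columns cell by cell, dropping the first (time) column.
avg_values <- Reduce("+", lapply(csvs, "[", -1)) / length(csvs)

# Put the time column from the first file back in front of the averages.
averaged <- cbind(csvs[[1]][1], avg_values)
```

Note that `Reduce("+", ...)` propagates missing values: if a cell is NA in any one file, the averaged cell is NA as well.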

Related

How to extract a common column from multiple tsv files and combine them into one dataframe in R?

I want to extract a common column named "framewise_displacement" from 162 tsv files named by subject ID number (e.g., sub-CC123_timeseries.tsv, sub-CC124_timeseries.tsv, etc.), which have different numbers of columns but the same number of rows, and merge them into a single dataframe.
In the new dataframe, each column should be the "framewise_displacement" column from one subject's file, labelled with the subject ID, and the rows should be the same as in the original files.
I tried to use the vroom function in R, but it failed because the files have different numbers of columns.
I also tried this code, but the output stacked all the columns into one single column.
files = fs::dir_ls(path = "Documents/subject_timeseries", glob = "*.tsv")
merged_df <- map_df(files, ~vroom(.x, col_select=c(framewise_displacement)))
What should I do to merge them into one dataframe with the desired column side by side?
Any suggestions would be appreciated.
Many thanks!!!
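One possible way to get the columns side by side, sketched here in base R rather than with vroom: read only the wanted column from each file and collect the results with `sapply`, which returns one column per file when every file has the same number of rows. The two example files and their contents are invented for illustration, and the subject ID is taken from the file name on the assumption that it is everything before `_timeseries.tsv`.

```r
# Build two small example files with different sets of columns (made-up data).
tmp <- file.path(tempdir(), "fd_demo")
dir.create(tmp, showWarnings = FALSE)
write.table(data.frame(a = 1:3, framewise_displacement = c(0.1, 0.2, 0.3)),
            file.path(tmp, "sub-CC123_timeseries.tsv"),
            sep = "\t", row.names = FALSE, quote = FALSE)
write.table(data.frame(framewise_displacement = c(0.4, 0.5, 0.6), b = 4:6, c = 7:9),
            file.path(tmp, "sub-CC124_timeseries.tsv"),
            sep = "\t", row.names = FALSE, quote = FALSE)

files <- list.files(tmp, pattern = "_timeseries\\.tsv$", full.names = TRUE)

# Pull the one column of interest out of each file; sapply binds the
# equal-length vectors column-wise into a matrix.
fd <- sapply(files, function(f) read.delim(f)[["framewise_displacement"]])

# Label each column with the subject ID derived from the file name.
colnames(fd) <- sub("_timeseries\\.tsv$", "", basename(files))
merged_df <- as.data.frame(fd)
```

Because the column names contain a hyphen, they need backticks or `[[ ]]` when accessed, for example `merged_df[["sub-CC123"]]`.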

Importing multiple JSON files into a single data frame in R

Hi,
I want to import JSON files from a folder into an R data frame (as a single matrix). I have about 40000 JSON files with one observation each and differing numbers of variables.
I tried the following code:
library(rjson)
jsonresults_all <- list.files("mydata", pattern="*.json", full.names=TRUE)
myJSON <- lapply(jsonresults_all, function(x) fromJSON(file=x))
myJSONmat <- as.data.frame(myJSON)
I want my data frame to have about 40000 observations (rows) and some 175 variables (columns), with some variable values being NA.
But I get a single row, with each observation appended to the right.
Many thanks for your suggestions.
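A minimal base R sketch of one way to get one row per observation, assuming each parsed file is a flat named list of scalar values. Here `records` is a made-up stand-in for the list that `lapply(jsonresults_all, ...)` returns, and `to_row` is a hypothetical helper name.

```r
# `records` stands in for the list returned by lapply(..., fromJSON):
# one named list per file, not all with the same fields (made-up values).
records <- list(
  list(id = 1, name = "a"),
  list(id = 2, score = 3.5),
  list(id = 3, name = "c", score = 1.0)
)

# Every field name that occurs in at least one record.
all_fields <- unique(unlist(lapply(records, names)))

# Turn one record into a one-row data frame, filling absent fields with NA.
to_row <- function(rec) {
  row <- lapply(all_fields, function(f) if (is.null(rec[[f]])) NA else rec[[f]])
  names(row) <- all_fields
  as.data.frame(row, stringsAsFactors = FALSE)
}

# Stack the one-row data frames: one row per record, one column per field.
myJSONmat <- do.call(rbind, lapply(records, to_row))
```

Calling `rbind` on 40000 one-row data frames can be slow; the sketch is meant to show the shape of the result rather than the fastest route to it.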

Merging CSV files of the same names from different folders into one file

I have 14 years of precipitation data for different meteo stations (more than 400) in the structure as follows for years 2008-2021:
/2008/Meteo_2008-01/249180090.csv
/2008/Meteo_2008-02/249180090.csv
/2008/Meteo_2008-03/249180090.csv ... and so on for the rest of the months.
/2009/Meteo_2009-01/249180090.csv
/2009/Meteo_2009-02/249180090.csv
/2009/Meteo_2009-03/249180090.csv ... and so on for the rest of the months.
I have a structure like that until 2021. 249180090.csv - that stands for the station code, as I wrote above I have more than 400 stations.
In the CSV file, there are data on daily precipitation for desired rainfall station.
I would like to create one CSV file for EVERY STATION covering the years 2008 to 2021, containing the merged precipitation data from January to December. The name of each CSV file should contain the station number.
Would someone be kind and help me how can I do that in a loop? My goal is not to create just a one file out of all data, but a separate CSV file for every meteo station. On the forum, I have found a question, which was solving relatively similar problem but merging all data just into one file, without sub-division into separate files.
The problem can be split into parts:
1. Identify all files in all subfolders of the working directory by using list.files(..., recursive = TRUE).
2. Keep only the csv files.
3. Import them all into R, for example by mapping read.csv over all paths.
4. Join everything into a single data frame, for example with reduce and bind_rows (assuming that all csvs have the same structure).
5. Split this single data frame according to station code, for example with group_split().
6. Write the split data frames to csv, for example by mapping write.csv.
This way you can avoid using for loops.
library(here)
library(stringr)
library(purrr)
library(dplyr)

# Identify all files below the project root
filenames <- list.files(here(), recursive = TRUE, full.names = TRUE)

# Limit to csv files ("\\.csv$" matches a literal ".csv" at the end of the name)
csv_files <- filenames[str_detect(filenames, "\\.csv$")]

# Read them all in and join them into a single data frame
joined <- csv_files |>
  map(read.csv) |>
  reduce(bind_rows)

# Split into one data frame per station
# (assumes every csv contains a station_code column)
split_df <- joined |> group_split(station_code)

# Save each data frame to a separate csv; walk() is map() for side effects
walk(split_df, function(df) {
  write.csv(df, paste0(df$station_code[1], "_combined.csv"), row.names = FALSE)
})
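If the CSV files themselves carry no station column, the station code can be taken from the file name instead. A minimal base R sketch under that assumption follows; the folder contents, the `day` and `precip` columns, and the output folder are all invented for illustration.

```r
# Build a tiny example tree: <year>/Meteo_<year>-<month>/<station>.csv (made-up data).
root <- file.path(tempdir(), "meteo_demo")
unlink(root, recursive = TRUE)
for (ym in c("2008/Meteo_2008-01", "2008/Meteo_2008-02")) {
  dir.create(file.path(root, ym), recursive = TRUE)
  for (station in c("249180090", "249180100")) {
    write.csv(data.frame(day = 1:2, precip = c(0.5, 1.5)),
              file.path(root, ym, paste0(station, ".csv")), row.names = FALSE)
  }
}

# Find every csv and group the paths by station code
# (the file name without its extension).
paths <- list.files(root, pattern = "\\.csv$", recursive = TRUE, full.names = TRUE)
by_station <- split(paths, sub("\\.csv$", "", basename(paths)))

# Write one combined file per station into a separate output folder, so the
# results are not picked up again as input if the script is re-run.
out_dir <- file.path(tempdir(), "meteo_combined")
dir.create(out_dir, showWarnings = FALSE)
for (station in names(by_station)) {
  combined <- do.call(rbind, lapply(by_station[[station]], read.csv))
  write.csv(combined, file.path(out_dir, paste0(station, "_combined.csv")),
            row.names = FALSE)
}
```

Since `list.files` returns paths in alphabetical order, the months end up in chronological order within each station's file as long as the folder names follow the `Meteo_YYYY-MM` pattern.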

Pull subset of large data by multiple specific characters

I have a large database which divides files into chunks for ease of analysis/storage, and I am trying to extract rows matching multiple specific values, stored as character strings in a single column, to take a chunk of the overall data for further analysis.
Within these files I am interested in pulling ALL rows in which column "Cat" equals any one of a set of character values (the set is different for each pull and each file).
Files are set up (for example) as:
2001_x.sas
2002_x.sas
....
2018_x.sas
Currently, I am doing the following:
# Create a list of files; adjust the pattern to choose specific files with similar names
x <- list.files(pattern = "_x\\.sas")
# Read and subset files when Cat is C21, C98 or D27, etc.
z <- lapply(x, function(x) {
  a <- read.sas(x)
  subset(a, Cat == "C21" | Cat == "C98" | Cat == "D27")
})
# Bind the data frames into a master data frame
y <- bind_rows(z)
y is then a really nice pull from multiple files at once. The advantage of this, as the total dataset is several terabytes, is that this works within the individual files and doesn't overwhelm the memory on my desktop.
The problem is that I can't always write out the Cat comparison with just three values. Sometimes I need to input hundreds of values, which is very tedious. I have tried replacing this with lists or vectors.
Ideally, I'd like the code to look more like this, if you know what I mean, but this doesn't work:
b <- # list or vector with the character values of interest
z <- lapply(x, function(x) {
  a <- read.sas(x)
  subset(a, Cat == any(b))
})
y <- bind_rows(z)
The idea is that any row whose Cat equals a value in b would be included in the subset. However, I've only been able to get this to work with explicit Cat == comparisons joined by the or operator.
Thanks!
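For matching a column against a whole vector of values, R provides the `%in%` operator, which returns TRUE for every element found anywhere in the vector. A minimal sketch, with a small made-up data frame standing in for the contents of one file:

```r
# Values of interest; in practice this vector could hold hundreds of codes.
b <- c("C21", "C98", "D27")

# Stand-in for one file's contents (made-up data in place of read.sas(x)).
a <- data.frame(Cat = c("C21", "A01", "D27", "B12"), value = 1:4)

# Keep the rows whose Cat matches any element of b.
kept <- subset(a, Cat %in% b)
```

Inside the `lapply` call this would read `subset(a, Cat %in% b)` in place of the chain of `==` comparisons, so each file is still filtered on its own before the results are bound together.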

Rbind Excel files from a list of vectors of file names in R

I have web-scraped ~1000 Excel files into a specific folder on my computer.
I then listed these files, which returned a character vector, chr [1:1049].
I then grouped these files by similar names, so that every 6 consecutive files belong to one group.
This returned a list of 175 elements, each holding the 6 file names of one group.
I am confused about how to run a loop that would merge/rbind the 6 files in each group from that list. I would also need to remove the first row, but I know how to do that part with read.xlsx.
My code so far is
setwd("C:\\Users\\ewarren\\OneDrive\\Documents\\Reservoir Storage")
files <- list.files()
file_groups <- split(files, ceiling(seq_along(files)/6))
with
for (i in file_groups) {
print(i)
}
returning each group of file names
The files for example are:
files
They are each composed of two columns, date and amount.
I need to add a third column to each that is the reservoir name.
That way, when all the rows from all the files are combined, there is a date, an amount, and a reservoir. If I combine them all at once without the reservoir, I won't know which rows belong to which reservoir.
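You can use startRow = 2 in read.xlsx to skip the first row.
For merging the groups of files: if each file name has an identifier, e.g. x, that matches the other files in its group but not the files in other groups,
you can make a list of that group's files with group1 <- list.files(pattern = "x"),
then read each file and combine the resulting data frames with do.call(rbind, ...). Since the goal is to stack rows, rbind is the function to use here rather than cbind, and it has to be applied to the data that was read in, not to the file names.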
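A minimal sketch of looping over the groups, assuming each group corresponds to one reservoir. The helper `read_group` and the reservoir labels are hypothetical, and the demo reads small made-up CSV files so that it runs without extra packages. For the Excel files, the reader could instead be something like `function(p) openxlsx::read.xlsx(p, startRow = 2)`, assuming read.xlsx comes from the openxlsx package.

```r
# Read every file in one group, stack the rows, and tag them with a reservoir name.
# `reader` is whatever function reads a single file into a data frame.
read_group <- function(paths, reader, reservoir) {
  combined <- do.call(rbind, lapply(paths, reader))
  combined$reservoir <- reservoir
  combined
}

# Demo files (made-up data): four CSVs with a date and an amount column.
tmp <- file.path(tempdir(), "reservoir_demo")
dir.create(tmp, showWarnings = FALSE)
files <- file.path(tmp, paste0("file", 1:4, ".csv"))
for (f in files) {
  write.csv(data.frame(date = c("2020-01-01", "2020-01-02"), amount = c(5, 6)),
            f, row.names = FALSE)
}

# Groups of two here; the question uses groups of six.
file_groups <- split(files, ceiling(seq_along(files) / 2))

# Hypothetical reservoir labels, one per group.
reservoirs <- paste0("reservoir_", names(file_groups))

# Process each group, then stack all groups into one data frame.
parts <- lapply(seq_along(file_groups), function(i) {
  read_group(file_groups[[i]], reader = read.csv, reservoir = reservoirs[i])
})
all_rows <- do.call(rbind, parts)
```

Keeping the reader as an argument means the grouping and labelling logic can be checked on small files before pointing it at the real Excel data.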
