Reading multiple .csv files with varying names from URL - r

I'm trying to read multiple .csv files from a URL starting with http. All files can be found on the same website. Generally, the structure of the file names is: yyyy_mm_dd_location_XX.csv
Now, there are three different locations (let's say locA, locB, locC), each of which has a file for every day of the month. So, the file names would be e.g. "2009_10_01_locA_XX.csv", "2009_10_02_locA_XX.csv", and so forth.
The structure, meaning the number of columns, is the same for all csv files; however, the length (number of rows) is not.
I'd like to combine all these files into one csv file but have problems reading them from the website due to the changing names.
Thanks a lot for any ideas!

Here is a way to programmatically generate the names of the files and then download them with download.file(). Since no reproducible example was given with the question, you will need to change the URL in the code to the actual HTTP location of the files.
library(lubridate)
startDate <- as.Date("2019-10-01", "%Y-%m-%d")
dateVec <- startDate + 0:4 # create additional dates by adding integers
downloadFileNames <- unlist(lapply(dateVec, function(x) {
  locs <- c("locA", "locB", "locC")
  # zero-pad month and day to match the yyyy_mm_dd naming pattern
  sprintf("%04d_%02d_%02d_%s_XX", year(x), month(x), day(x), locs)
}))
head(downloadFileNames)
We print the head() of the vector to show the correct naming pattern.
> head(downloadFileNames)
[1] "2019_10_1_locA_XX" "2019_10_1_locB_XX" "2019_10_1_locC_XX"
[4] "2019_10_2_locA_XX" "2019_10_2_locB_XX" "2019_10_2_locC_XX"
>
Next, we'll create a directory to store the files, and download them.
# create a subdirectory to store the files
if(!dir.exists("./data")) dir.create("./data")
# download files, as https://www.example.com/2019_10_01_locA_XX.csv
# to ./data/2019_10_01_locA_XX.csv, etc.
result <- lapply(downloadFileNames, function(x) {
  download.file(paste0("https://www.example.com/", x, ".csv"),
                destfile = paste0("./data/", x, ".csv"))
})
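If some date/location combinations do not exist on the server, download.file() will stop the loop with an error. One way to let the loop continue past missing files is to wrap the call in tryCatch(); the URL is the same placeholder as above:
result <- lapply(downloadFileNames, function(x) {
  tryCatch(
    download.file(paste0("https://www.example.com/", x, ".csv"),
                  destfile = paste0("./data/", x, ".csv")),
    error = function(e) message("skipping ", x, ": ", conditionMessage(e))
  )
})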
Once the files are downloaded, we can use list.files() to retrieve the path names, read the data with read.csv(), and combine them into a single data frame with do.call().
theFiles <- list.files("./data", pattern = "\\.csv$", full.names = TRUE)
dataList <- lapply(theFiles, read.csv)
data <- do.call(rbind, dataList)
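Because all files share the same columns (only the number of rows differs), rbind() stacks them cleanly. Since the question asks for a single combined csv file, the result can then be written back out; the file name here is just an example:
write.csv(data, "./data/combined_locations.csv", row.names = FALSE)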

Related

Renaming files in folder with R

I have many files in my working directory with the same name followed by a number as such "name_#.csv". They each contain the same formatted time series data.
The file names are very long, so when I import them into data frames the data frame names are super long too. I'd like to rename each one as "df_#" so that I can then write another function to plot each individually, or quickly plot, for example, the first three without typing the very long name.
I don't want to concatenate anything onto the current names, but rather replace each name completely, ending with the file's number in the list as the loop iterates through the files.
Here is an example I have so far.
name = list.files(pattern="*.csv")
for (i in 1:length(name)) assign(paste0("df", name[i]), read.csv(name[i], skip = 15 ))
This is just adding a 'df' to the front and not changing the whole name.
I'm also not sure if it makes sense to proceed this way. Essentially my data is three replicates of time series data on the same sample and I eventually want to take three at a time and plot them on the same graph and so forth until the end of the files.
You can name the object in your global environment, without renaming the original file in the folder, by telling R which name to assign in the loop. A few modifications to your original code will do it. For instance:
# Define file path to desired folder
file_path <- "Desktop/SO Example/" # example file path, though it could be the working directory
# Your code to get CSV file names in the folder
name <- list.files(path = file_path, pattern="*.csv")
# Modify the loop to assign a new name
for(x in seq_along(name)){
  assign(paste0("df_", x),
         read.csv(paste0(file_path, name[x]), skip = 15))
}
This will load the data as df_1, df_2, etc. I believe in your assign function you were using paste0("df", name[i]), which concatenates "df" with the filename in position i, not the value of i in the loop - that is why you were getting df prepended to each name on import.
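Once the data frames are named df_1, df_2, and so on, they can be retrieved programmatically with get(). As a minimal sketch, plotting the first three might look like this, assuming each file has columns named time and value (substitute your actual column names):
for (i in 1:3) {
  df <- get(paste0("df_", i))                                  # fetch the data frame by its constructed name
  plot(df$time, df$value, type = "l", main = paste0("df_", i)) # column names are assumptions
}
An alternative worth considering is to keep the data frames in a named list (e.g. built with lapply()) instead of using assign(); that avoids get() entirely and makes it easy to loop over any subset of files.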

How can I write multiple csv files in a specific directory and then merge them into a single csv?

I am trying to extract a subset from multiple .grb2 files in the same file path and write it to separate csv files. I am using the following code, which does the job and stores the csv files in the same directory as the .grb2 files.
library(raster) # brick() comes from the raster package
path <- "file path"
input.file.names <- dir(path, pattern = ".grb2")
output.file.names <-
  paste0(tools::file_path_sans_ext(input.file.names), ".csv")
for(i in 1:length(input.file.names)){
  GRIB <- brick(input.file.names[i])
  GRIB <- as.array(GRIB)
  tmp2m.6hr <- GRIB[46,13,c(1:20)]
  str(tmp2m.6hr)
  tmp2m.data <- data.frame(tmp2m.6hr)
  write.csv(tmp2m.data, output.file.names[i])
}
My first question is this: how can I store the csv files in a different directory than the .grb2 files?
My .grb2 files, and thus the resulting csv files, end in four different types, i.e. 00.grb2, 06.grb2, 12.grb2, 18.grb2. The resulting csv files have the following form:
[screenshot of one of the resulting csv files]
My second question is: how can I merge all my 00.csv, 06.csv, 12.csv, 18.csv files (each category in the same column) in a single csv file in a directory of my choice with the following headrs: 00_tmp2m.6hr, 06_tmp2m.6hr, 12_tmp2m.6hr, 18_tmp2m.6hr, and also create a fifth column with the average of the other four? The result that I want is the following:
[screenshot of the desired merged output]
As I'm not an experienced user this is too complicated for me. I would very much appreciate any assistance with this.
For your first question, you might try specifying the path using a relative reference to the folder, as in write.csv(tmp2m.data, paste0("./myfolder/", output.file.names[i])).
Your second question might be easier if you read the data back in and then write your results as a new file. You might also want to take a look at write.table() and its append argument (write.csv() ignores attempts to set append).
Also, you might get a better answer by creating a minimal example.
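To make the second part more concrete, here is a rough sketch under a few assumptions: the csv files produced above live in one folder (the paths below are placeholders), their names end in 00.csv, 06.csv, 12.csv or 18.csv, each contains a tmp2m.6hr column, and the four resulting columns end up the same length:
hours <- c("00", "06", "12", "18")
# collect the tmp2m.6hr values from every csv belonging to each forecast hour
cols <- lapply(hours, function(h) {
  files <- list.files("path/to/csv/folder",            # placeholder folder
                      pattern = paste0(h, "\\.csv$"), full.names = TRUE)
  unlist(lapply(files, function(f) read.csv(f)$tmp2m.6hr))
})
merged <- as.data.frame(do.call(cbind, cols))
names(merged) <- paste0(hours, "_tmp2m.6hr")
merged$average <- rowMeans(merged)                     # fifth column with the row-wise mean
write.csv(merged, "path/to/output/merged_tmp2m.csv", row.names = FALSE)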

Extracting one text files from multiple zip archives in R

I am trying to extract one text file from each of the zip files located in one folder. Then I want to combine those text files into one dataframe.
The folder has multiple Zip files:
pf_0915.zip
pf_0914.zip
pf_0913.zip
.....
Inside of those zip files are multiple text files. I am only interested in the one called abc.txt. This is a fixed width format file without a header. I have already set up a read for this file using read_fwf. Since all the extracted text files have the same name, it might be better to rename them according to the name of their archive, i.e. the abc.txt from pf_0915.zip could be called abc_0915.txt. Once they are all read they should be combined into a large file called abcCombined.txt.
Or as each new abc.txt file is read, we could add it to the abcCombined.txt.
I have tried various versions of unzip() and unz() without much success. This was done without looping through all the zip files. And finally, since this directory contains many zip files, are there ways to read only some of them by using pattern matching like grep? I would, for example, be interested in reading only September files, those .._09...txt.
Any hints would be appreciated.
The following:
Creates a vector of the files in a directory
Uses the list parameter to unzip() to see the metadata for the contents
Builds a regular expression to find only the target file (I did that in the event your use-case generalizes to a broader pattern)
Tests if any of the files meet your criteria
Keeps only those files into a resultant vector
Iterates over that vector and
Extracts only the target file into a temporary directory
Reads it into a data.frame
Ultimately binds the individual data.frames into one big one
You can write out the resultant combined data.frame however you wish.
library(purrr)

target_dir <- "so"
extract_file <- "abc.txt"

list.files(target_dir, full.names=TRUE) %>%
  keep(~any(grepl(sprintf("^%s$", extract_file), unzip(., list=TRUE)$Name))) %>%
  map_df(function(x) {
    td <- tempdir()
    read.fwf(unzip(x, extract_file, exdir=td), widths=c(4,1,4,2))
  }) -> combined_df
The version below just expands some of the shortcuts in the one above:
only_files_with_this_name <- function(zip_path, name) {
  zip_contents <- unzip(zip_path, list=TRUE)
  look_for <- sprintf("^%s$", name)
  any(grepl(look_for, zip_contents$Name))
}

list.files(target_dir, full.names=TRUE) %>%
  keep(only_files_with_this_name, name=extract_file) %>%
  map_df(function(x) {
    td <- tempdir()
    file_in_zip <- unzip(x, extract_file, exdir=td)
    dat <- read.fwf(file_in_zip, widths=c(4,1,4,2))
    unlink(file_in_zip)  # remove the extracted copy, then return the data
    dat
  }) -> combined_df
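The question also asks about reading only some of the archives, e.g. the September ones. One hedged way to do that with this approach is to filter the zip file names first; the "_09" fragment below is an assumption about how the month appears in your archive names (pf_09xx.zip):
september_zips <- list.files(target_dir, full.names=TRUE) %>%
  keep(~grepl("_09", basename(.)))   # keep only archives whose name contains "_09"
september_zips can then be piped into the same keep()/map_df() pipeline shown above.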
I can't comment because of my low reputation, so here is a partial answer:
If you know the file name within the various zips the syntax to get just that file would be something like the following:
my_data<-read.csv(unz("pf_0915.zip","abc.txt"))
This is the code for a csv obviously, not a fixed width text, but if you already have that set up, it'll be something like
my_data <- read_fwf(unz("pf_0915.zip","abc.txt"), ... )
with all your other parameters in the ...
You can do this in a loop if you have many zips, and accumulate them in a data frame, data table, whatever structure floats your boat...
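A rough sketch of that loop, under the assumption that base read.fwf with the example widths used above is close enough to your actual fixed width setup (the zip folder path is a placeholder):
zips <- list.files("path/to/zips", pattern = "\\.zip$", full.names = TRUE)
abcCombined <- do.call(rbind, lapply(zips, function(z) {
  read.fwf(unz(z, "abc.txt"), widths = c(4, 1, 4, 2))   # read abc.txt straight from the archive
}))
write.table(abcCombined, "abcCombined.txt", row.names = FALSE)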

To stack up results in one masterfile in R

Using this script I have created a specific folder for each csv file and then saved all my further analysis results in this folder. The name of the folder and the csv file are the same. The csv files are stored in the main/master directory.
Now, I have created a csv file in each of these folders which contains a list of all the fitted values.
I would now like to do the following:
Set the working directory to the particular filename
Read fitted values file
Add a row/column stating the name of the site/ unique ID
Add it to the masterfile which is stored in the main directory with a title specifying site name/filename. It can be stacked by rows or by columns it doesn't really matter.
Come to the main directory to pick the next file
Repeat the loop
Using merge(), rbind(), or cbind() combines all the data under one set of column names. I want to keep all the sites separate for comparison at a later stage.
This is what I'm using at the moment and I'm lost on how to proceed further.
setwd( "path") # main directory
path <-"path" # need this for convenience while switching back to main directory
# import all files and create a character type array
files <- list.files(path=path, pattern="*.csv")
for(i in seq(1, length(files), by = 1)){
fileName <- read.csv(files[i]) # repeat to set the required working directory
base <- strsplit(files[i], ".csv")[[1]] # getting the filename
setwd(file.path(path, base)) # setting the working directory to the same filename
master <- read.csv(paste(base,"_fiited_values curve.csv"))
# read the fitted value csv file for the site and store it in a list
}
I want to construct a for loop to make one master file with the files in different directories. I do not want to merge all under one column name.
For example, If I have 50 similar csv files and each had two columns of data, I would like to have one csv file which accommodates all of it; but in its original format rather than appending to the existing row/column. So then I will have 100 columns of data.
Please tell me what further information can I provide?
For reading a group of files from a number of different directories, with path names patha, pathb, and pathc:
paths = c('patha','pathb','pathc')
files = unlist(sapply(paths, function(path) list.files(path,pattern = "*.csv", full.names = TRUE)))
listContainingAllFiles = lapply(files, read.csv)
If you want to be really quick about it, you can grab fread from data.table:
library(data.table)
listContainingAllFiles = lapply(files, fread)
Either way this will give you a list of all objects, kept separate. If you want to join them together vertically/horizontally, then:
do.call(rbind, listContainingAllFiles)
do.call(cbind, listContainingAllFiles)
EDIT: NOTE, the latter (cbind) only makes sense if the rows of the different files actually correspond to each other. It makes far more sense to just create a field tracking which location the data is from.
If you want to include the names of the files as the way of determining sample location (I don't see where you're getting this info from in your example), then you want to do this as you read in the files, so:
listContainingAllFiles = lapply(files, function(file) {
  data.frame(filename = file, read.csv(file))
})
then later you can split that column to get your details (Assuming of course you have a standard naming convention)
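As a hedged example of that splitting step, assuming a naming convention like site_01_rep_2.csv where the second underscore-separated token identifies the site:
combined <- do.call(rbind, listContainingAllFiles)
# strip directory and extension, then split the remaining name on "_"
name_parts <- strsplit(tools::file_path_sans_ext(basename(as.character(combined$filename))), "_")
combined$site <- sapply(name_parts, `[`, 2)   # second token = site id under the assumed convention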

How to use R to Iterate through Subfolders and bind CSV files of the same ID?

I am stuck. I need a way to iterate through a bunch of subfolders in a directory, pull out 4 .csv files , bind the contents of those 4 .csv files, then write out the new .csv to a new directory using the name of the initial subfolder as the name of the new .csv.
I know R could do this. But I am stuck at how to iterate across the subfolders and bind the csv files together. My obstacle is that each subfolder contains the same 4 .csv files using the same 8-digit ids. For example, subfolder A contains 09061234.csv, 09061345.csv, 09061456.csv, and 09061560.csv. Subfolder B contains 09061234.csv, 09061345.csv, 09061456.csv, and 09061560.csv. (...). There are 42 subfolders, and hence 168 csv files with the same names. I want to compact the files down to 42.
I can use list.files to retrieve all the subfolders. But then what?
##Get Files from directory
TF = "H:/working/TC/TMS/Counts/June09"
##List Sub folders
SF <- list.files(TF)
##List of File names inside folders
FN <- list.files(SF)
#Returns list of 168 filenames
###?????###
#How to iterate through each subfolder, read each 8-digit integer id file,
#bind them all together into one single csv,
#Then write to new directory using
#the name of the subfolder as the name of the new csv?
There is probably a way to do this easily but I am a noob with R. Something involving functions, paste and write.table perhaps? Any hints/help/suggestions are greatly appreciated. Thanks!
You can use the recursive=T option of list.files():
lapply(c('1234','1345','1456','1560'), function(x){
  sources.files <- list.files(path = TF,
                              recursive = T,
                              pattern = paste('0906', x, '\\.csv$', sep = ''),
                              full.names = T)
  ## you read all files with this id and bind them
  dat <- do.call(rbind, lapply(sources.files, read.csv))
  ### write the aggregated file for this id
  write.csv(dat, paste('agg', x, '.csv', sep = ''), row.names = FALSE)
})
After some tweaking of agstudy's code, I came up with the solution I was ultimately after. There were a couple of missing pieces that are more due to the nature of my specific problem, so I am leaving agstudy's answer as "accepted".
Turns out a function really wasn't needed. At least not for now. If I need to perform this same task again, I will create a function out of it. For now, I can solve this particular problem without it.
Also, for my instance, I needed a conditional "if" statement to handle any non-csv files that may have lived in the subfolders. By adding an if statement, R throws warnings and skips any files that are not comma-separated.
Code:
##Define directory path##
TF = "H:/working/TC/TMS/Counts/June09"
##Define the list of file ids to search for##
x <- c('1234' ,'1345','1456','1560')
##List of subfolder files whose names start with "0906" and contain one of the ids##
SF <- list.files(TF, recursive=T,
                 pattern=paste0("0906(", paste(x, collapse="|"), ")"),
                 full.names=T)
##Use an if statement to skip over the non-csv files in each folder##
dat <- do.call(rbind, lapply(SF, function(f){
  if (grepl("\\.csv$", f)) read.csv(f) else NULL
}))
#the warnings thrown are ok--these are generated due to the fact that some of the folders contain .xls files
write.table(dat, file="H:/working/TC/TMS/June09Output/June09Batched.csv", row.names=FALSE, sep=",")
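For reference, the per-subfolder variant described in the question (bind the four csv files inside each subfolder and write one output csv named after the subfolder) could be sketched roughly like this; the output directory is a placeholder and the "0906" id pattern is the same assumption as above:
subfolders <- list.dirs(TF, recursive = FALSE)
outdir <- "H:/working/TC/TMS/June09Output"          # placeholder output directory
for (sf in subfolders) {
  csvs <- list.files(sf, pattern = "^0906.*\\.csv$", full.names = TRUE)
  combined <- do.call(rbind, lapply(csvs, read.csv))
  write.csv(combined,
            file = file.path(outdir, paste0(basename(sf), ".csv")),
            row.names = FALSE)
}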
