Naming a .csv with text that can be updated each year - r

I'm looking for a way to automate the updating of file names. The code will be used annually to download several .csv files. I would like to be able to change the 2020_2021 portion of the name to whatever assessment year it is (e.g. 2021_2022, 2022_2023, etc.) at the beginning of the script, so the file names don't have to be updated manually.
write.csv(SJRML_00010,
file = "SJRML__00010_2020_2021.csv")
write.csv(SJRML_00095,
file = "SJRML_00095_2020_2021.csv")
write.csv(SJRML_00480,
file = "SJRML_00480_2020_2021.csv")

lastyear <- 2020
prevassessment <- sprintf("%i_%i", lastyear, lastyear+1)
nextassessment <- sprintf("%i_%i", lastyear+1, lastyear+2)
prevassessment
# [1] "2020_2021"
filenames <- c("SJRML__00010_2020_2021.csv", "SJRML_00095_2020_2021.csv")
gsub(prevassessment, nextassessment, filenames, fixed = TRUE)
# [1] "SJRML__00010_2021_2022.csv" "SJRML_00095_2021_2022.csv"
You can run the gsub on a whole vector of filenames or on one at a time, however you are implementing your processing.

To create a .csv with a name that can be updated:
Year <- "_2020"
then
write.csv(file_name, paste0("file_name", Year, ".csv"))
This returns file_name_2020.csv.
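Combining the two ideas, a minimal sketch that sets the assessment string once at the top of the script and reuses it for every file (the object names are taken from the question):
# Update this one line each assessment year
assessment <- "2021_2022"
write.csv(SJRML_00010, file = paste0("SJRML__00010_", assessment, ".csv"))
write.csv(SJRML_00095, file = paste0("SJRML_00095_", assessment, ".csv"))
write.csv(SJRML_00480, file = paste0("SJRML_00480_", assessment, ".csv"))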

Related

Select specific csv files according to their file name in R

I work with the R language. I have 8031 csv files, and the file name of the first file is Mean_SST_1997-12-31; every file name follows the same pattern, the only change being the date. The dates range from 1997.12.31 to 2019.12.31, and every single day has a csv file. What I need to do is select the files for a specific month, for example February. Can anyone help me?
Pull in the files, extract and parse the dates, and then select the ones you want:
library(lubridate)
# get the file names
# use whatever directory or regex you need, see `?list.files` for help
files = data.frame(fn = list.files(pattern = "Mean_SST_.*"))
# pull everything after the last `_` and
# convert it to date in year/month/day order
files$date = ymd(sub(".*_", "", files$fn))
# select the files you want based on date and read them in
feb_data = lapply(files[month(files$date) == 2, "fn"], read.csv)
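If you then want those files as one data frame rather than a list, one extra step works (a sketch, assuming all the February files share the same columns):
# Stack the list of data frames into a single data frame
feb_all = do.call(rbind, feb_data)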

How to merge many databases in R?

I have this huge database from a telescope at the institute where I am currently working. The telescope saves each day in its own file; it takes values for each of the 8 channels it measures, every 10 seconds, and every day starts at 00:00 and finishes at 23:59, unless there was a connection error, in which case there are 2 or more files for one single day.
The database also has measurement mistakes, missing data, repeated values, etc.
File extensions are .sn1 for days saved in one single file, and .sn1, .sn2, .sn3, ... for days saved in multiple files. All the files have the same number of rows and variables. Besides that, there are 2 database formats: one has a sort of header that uses the first 5 lines of the file, the other doesn't.
Every month has its own folder containing its days, and these folders are saved in the year they belong to, so for 10 years I'm talking about more than 3000 files, and to be honest I had never worked with .sn1 files before.
I have code to merge 2 or a handful of files into 1, but this time I have thousands of files (which is far more than I've worked with before, and also the reason why I can't provide a simple example), and I would like to write a program that merges all of the files into 1 huge database, so I can get a better sample from it.
I have an Excel extension that lists all the file locations in a specific folder; can I use a list like this to put all the files together?
Suggestions were too long for a comment, so I'm posting them as an answer here.
It appears that you are able to read the files into R (at least one at a time) so I'm not getting into that.
Multiple Locations: If you have a list of all the locations, you can search those locations for just the files you need. You mentioned an Excel file (let's call it paths.csv; it has only one column with the directory locations):
library(data.table)
all_directories <- fread("paths.csv", col.names = "paths")
# Focussing on only .sn1 files to begin with
file_names <- dir(path = all_directories$paths[1], pattern = ".sn1")
# Getting the full path for each file
file_names <- paste(all_directories$paths[1], file_names, sep = "/")
Reading all the files: I created a space-delimited dummy file and gave it the extension ".sn1" - I was able to read it properly with data.table::fread(). If you're able to open the files using Notepad or something similar, it should work for you too. More information is needed on how the files with different headers can be distinguished from one another - do they follow a naming convention, or do they have different extensions (which appears to be the case)? Focusing on the files with 5 rows of headers/other info for now.
read_func <- function(fname){
  dat <- fread(fname, sep = " ", skip = 5)
  dat$file_name <- fname # Add file name as a variable - to use for sorting the big dataset
  dat # Return the data explicitly - otherwise the function returns the file name instead
}
# Get all files into a list
data_list <- lapply(file_names, read_func)
# Merge list to get one big dataset
dat <- rbindlist(data_list, use.names = T, fill = T)
Doing all of the above will give you a dataset for all the files that have the extension ".sn1" in the first directory from your list of directories (paths.csv). You can enclose all of this in a function and use lapply over all the different directories to get a list wherein each element is a dataset of all such files.
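As a sketch of that wrapper (assuming the objects defined above, with read_func unchanged):
# Hypothetical helper: read and stack every .sn1 file in one directory
read_directory <- function(directory){
  file_names <- dir(path = directory, pattern = ".sn1")
  file_names <- paste(directory, file_names, sep = "/")
  rbindlist(lapply(file_names, read_func), use.names = T, fill = T)
}
# One dataset per directory, collected in a list
data_by_dir <- lapply(all_directories$paths, read_directory)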
To include files with ".sn2", ".sn3" ... extensions you can modify the call as below:
ptrns <- paste(sapply(1:5, function(z) paste(".sn",z,sep = "")), collapse = "|")
# ".sn1|.sn2|.sn3|.sn4|.sn5"
dir(all_directories$paths[1], pattern = ptrns)
Here's the simplified version that should work for all file extensions in all directories right away - might take some time if the files are too large etc. You may want to consider doing this in chunks instead.
# Assuming only one column with no header. sep is set to ";" since by default fread may treat spaces
# as separators. You can use any other symbol that is unlikely to be present in the location names
# We need the output to be a vector so we can use `lapply` without any unwanted behaviour
paths_vec <- as.character(fread("paths.csv", sep = ";", select = 1, header = F)$V1)
# Get all file names (incl. location)
file_names <- unlist(lapply(paths_vec, function(z){
  ptrns <- paste(sapply(1:5, function(q) paste(".sn", q, sep = "")), collapse = "|")
  inter <- dir(z, pattern = ptrns)
  return(paste(z, inter, sep = "/"))
}))
# Get all data in a single data.table using read_func previously defined
dat <- rbindlist(lapply(file_names, read_func), use.names = T, fill = T)
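If the two file formats can't reliably be told apart by name or extension, here is a crude sketch that peeks at the first line of each file and only skips the 5 header rows when they are present (the all-numeric check on the first row is an assumption about what the data rows look like):
read_func_auto <- function(fname){
  # Split the first line of the file into space-delimited fields
  first_line <- strsplit(readLines(fname, n = 1), " ")[[1]]
  # If any field fails to parse as a number, assume the 5-line header is present
  has_header <- anyNA(suppressWarnings(as.numeric(first_line)))
  dat <- fread(fname, sep = " ", skip = if (has_header) 5 else 0)
  dat$file_name <- fname
  dat
}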

How can I loop over multiple files when the function argument is different each time?

I'm trying to extract sea surface temperature data from a series of .nc files.
So I have one folder containing the 30 downloaded .nc files all written like "1981.nc", "1982.nc" and so on.
But rather than load them all in individually, I want to loop over each one and calculate the mean temperature for each file, so I'd have 30 temperature values at the end.
The problem is that the year in the date arguments has to change for each file. I thought of including something like years <- substr(filenames, 1, 4), which extracts the year from the file names, but it doesn't work.
I was thinking of something along the following lines:
library(ncdf4)
setwd("C:\\Users\\Desktop\\sst")
source("C:\\Users\\Desktop\\NOAA_OISST_ncdf4.R")
out.file<-""
filenames <- dir(pattern =".nc")
years<-substr(filenames, 1,4)
lst <- vector("list", length(filenames ))
for (i in 1:length(filenames)) {
  ssts = extractOISSTdaily(filenames[i], "C:\\Users\\Desktop\\lsmask.oisst.v2.nc",
                           lonW=350, lonE=351, latS=52, latN=56,
                           date1='years[i]-11-23', date2='years[i]-12-31')
  mean(ssts)
}
The extractOISSTdaily function to do the extracting is described here: http://lukemiller.org/index.php/2014/11/extracting-noaa-sea-surface-temperatures-with-ncdf4/
The .nc files are here: https://www.esrl.noaa.gov/psd/data/gridded/data.noaa.oisst.v2.highres.html#detail
Does this work?
# Get filenames
filenames <- dir(pattern =".nc")
# Mean SSTs
m.ssts <- NULL
# Loop through filenames
for (i in filenames) {
  # Get year (assuming form of filename is, e.g., 1981.nc)
  year <- sub(".nc", "", i)
  # Do whatever this function does
  ssts <- extractOISSTdaily(i, "C:\\Users\\Desktop\\lsmask.oisst.v2.nc",
                            lonW=350, lonE=351, latS=52, latN=56,
                            date1=paste(year, "-11-23", sep = ""),
                            date2=paste(year, "-12-31", sep = ""))
  # Profit!
  m.ssts <- c(m.ssts, mean(ssts))
}
The code works by first collecting all filenames in the current directory with the extension .nc and creating an empty object in which to store the mean SSTs. The for loop goes through the filenames in turn, stripping off the file extension to get the year (i.e., 1981.nc becomes 1981) by substituting an empty string in place of .nc. Next, the netCDF data for the specified interval is placed in ssts; the interval is created by pasting together the current year with the desired month and day. Finally, the mean is calculated and appended to the m.ssts object. As the OP notes, the last step should actually read m.ssts <- c(m.ssts, mean(ssts, na.rm = TRUE)) to allow for NAs in the data.
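For reference, the same loop can be written more compactly (a sketch, assuming extractOISSTdaily and the land-sea mask path from above):
m.ssts <- sapply(filenames, function(f) {
  year <- sub(".nc", "", f)
  ssts <- extractOISSTdaily(f, "C:\\Users\\Desktop\\lsmask.oisst.v2.nc",
                            lonW=350, lonE=351, latS=52, latN=56,
                            date1=paste(year, "-11-23", sep = ""),
                            date2=paste(year, "-12-31", sep = ""))
  mean(ssts, na.rm = TRUE)
})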

Create a function that returns a data frame from CSV files?

I am trying to make a function that outputs a data frame from 8 different CSV files. They all have the same variables and the same sort of data; the only difference is the year. I have tried to write out the function, but I can't seem to make it work. I am thinking lapply would work, but I am not sure how to incorporate it.
These are the instructions:
Write a function named 'air' that takes a 'year' argument and returns a data.frame containing that data for that year, suppressing the automatic conversion to factors.
path <- "C:/Users/Lacy Macc/Downloads/adviz/"
files <- list.files(path=path, pattern="*.csv")
for(y in files)
air <- function(year){
if (!exists(""))
}
}
If the filenames of each file varied, you might need to use list.files and search through the filenames to identify one matching the year. But with a fixed filename scheme, all you need to do is insert the year at the appropriate point in the filename:
path <- "C:/Users/Lacy Macc/Downloads/adviz/"
year <- 2013
file_path <- paste0(path, "ad_viz_plotval_data-", year, ".csv")
I have left out the full details of how to convert this into a function that takes in the year as I suspect this might be a homework Q.
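For what it's worth, the read step that suppresses the automatic conversion to factors is the standard read.csv argument (a sketch; wrapping it into the air(year) function is left as the exercise):
dat <- read.csv(file_path, stringsAsFactors = FALSE)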

lapply r to one column of a csv file

I have a folder with several hundred csv files. I want to use lapply to calculate the mean of one column within each csv file and save that value into a new csv file that would have two columns: Column 1 would be the name of the original file; Column 2 would be the mean value for the chosen field from the original file. Here's what I have so far:
setwd("C:/~~~~")
list.files()
filenames <- list.files()
read_csv <- lapply(filenames, read.csv, header = TRUE)
dataset <- lapply(filenames[1], mean)
write.csv(dataset, file = "Expected_Value.csv")
Which gives the error message:
Warning message: In mean.default("2pt.csv"[[1L]], ...) : argument is not numeric or logical: returning NA
So I think I have 2 (at least) problems that I cannot figure out.
First, why doesn't R recognize that column 1 is numeric? I double- and triple-checked the csv files and I'm sure this column is numeric.
Second, how do I get the output file to return two columns the way I described above? I haven't gotten far with the second part yet.
I wanted to get the first part to work first. Any help is appreciated.
I didn't use lapply but have done something similar. Hope this helps!
idx <- 1:2 ## modify as per need
## create empty dataframe
df <- NULL
## list directory from where all files are to be read
directory <- "C:/mydir/"
## read all file names from directory
x <- as.character(list.files(directory, pattern = 'csv'))
xpath <- paste(directory, x, sep = "")
## For loop to read each file and save metric and file name
for (i in idx) {
  file <- read.csv(xpath[i], header = T, sep = ",")
  first_col <- file[, 1]
  d <- NULL
  d$mean <- mean(first_col)
  d$filename <- x[i]
  df <- rbind(df, d)
}
### write all output to csv
write.csv(df, file = "C:/mydir/final.csv")
The resulting CSV file looks like this:
mean filename
1999.000661 hist_03082015.csv
1999.035121 hist_03092015.csv
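Since the question asked specifically for lapply, here is a compact sketch of the same idea (it assumes the first column of every file is numeric; the paths are illustrative):
filenames <- list.files("C:/mydir/", pattern = "csv", full.names = TRUE)
## Mean of the first column of each file
means <- lapply(filenames, function(f) mean(read.csv(f)[[1]], na.rm = TRUE))
## Two columns: original file name and its mean
result <- data.frame(filename = basename(filenames), mean = unlist(means))
write.csv(result, file = "C:/mydir/Expected_Value.csv", row.names = FALSE)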
Thanks for the two answers. After much review, it turns out that there was a much easier way to accomplish my goal. The csv files that I had were originally in one file; I split them into multiple files by location. At the time, I thought this was necessary to calculate the mean for each type. Clearly, that was a mistake. I went back to the original file and used aggregate. Code:
setwd("C:/~~")
allshots <- read.csv("All_Shots.csv", header=TRUE)
EV <- aggregate(allshots$points, list(Location = allshots$Loc), mean)
write.csv(EV, file= "EV_location.csv")
This was a simple solution. Thanks again for the answers. I'll need to get better at lapply for future projects, so they were not a waste of time.
