Creating a date variable from file names in R

I need some help creating a dataset in R where each observation contains a latitude, longitude, and date. Right now, I have a list of roughly 2,000 files gridded by lat/long, and each file contains observations for one date. Ultimately, what I need to do is combine all of these files into one file in which each observation carries a date variable pulled from the name of its source file.
So for instance, a file is named "MERRA2_400.tavg1_2d_flx_Nx.20120217.SUB.nc". I want all observations from that file to contain a date variable for 02/17/2012.
That "nc" extension describes a netCDF file, which can be read into R as follows:
library(RNetCDF)
setwd("~/Desktop/Thesis Data")
p1a<-"MERRA2_300.tavg1_2d_flx_Nx.20050101.SUB.nc"
pid<-open.nc(p1a)
dat<-read.nc(pid)
I know the ldply command can be useful for extracting and designating a new variable from the file name. But I need to create a loop that combines all the files in the 'Thesis Data' folder above (set as my working directory) and gives them date variables in the process.
I have been attempting this using two separate loops. The first loop uploads files one by one, creates a date variable from the file name, and then resaves them into a new folder. The second loop concatenates all files in that new folder. I have had little luck with this strategy.
[screenshot of View(dat), showing the contents of the file read in above]
As you can hopefully see in this picture, which describes the data file uploaded above, each file contains a time variable, but that time variable has one observation, which is 690, in each file. So I could replace that variable with the date within the file name, or I could create a new variable - either works.
Any help would be much appreciated!

I do not have any experience working with .nc files, but what I think you need to do, in broad strokes, is this:
filenames <- list.files(path = ".") # Creates a character vector of all file names in working directory
Create an empty dataframe with column names:
final_data <- data.frame(matrix(ncol = ..., nrow = 0)) # enter number of columns you will have in the final dataset
colnames(final_data) <- c("...", "...", "...", ...) # create column names
For each filename, read in the file, create a date column, and append the result to final_data:
for (i in filenames) {
  pid <- open.nc(i)
  dat <- read.nc(pid)
  close.nc(pid)
  date <- ... # use regex to pull the date out of i and convert it with as.Date()
  dat$date <- date
  # read.nc() returns a list; coerce it to a data frame before binding
  # (you may need extra reshaping depending on the file's structure)
  final_data <- rbind(final_data, as.data.frame(dat))
}
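For the file names shown in the question, one possible way to fill in that date line (a sketch that assumes the 8-digit date always sits immediately before ".SUB.nc") is:
i <- "MERRA2_400.tavg1_2d_flx_Nx.20120217.SUB.nc"
# capture the 8-digit block before ".SUB.nc" and parse it as a date
date <- as.Date(sub(".*\\.(\\d{8})\\.SUB\\.nc$", "\\1", i), format = "%Y%m%d")
date
# [1] "2012-02-17"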

Related

How to store a folder containing over 30 zipped files into a variable in r

I used the package 'GDELTtools' to download data from GDELT. The data was downloaded; however, no variable was stored in the global environment. I want to store the data in a dataframe variable so I can analyze it.
The folder contains over 30 zipped files. Every zipped file contains one csv. I need to store all these csvs in one variable in the global environment of R. I hope this can be done.
Thank you in advance!
Haven't written R for a while so I will try my best.
Read the comments carefully, because they explain the procedure.
Check the documentation for: unzip(), read.csv(), merging dataframes, creating an empty dataframe, and concatenating strings.
According to the GDELTtools docs, you can specify the download folder by providing local.folder="~/gdeltdata" as a parameter to the GetGDELT() function.
After that, you can use list.files("path/to/files/directory") to obtain a vector of file names, which is used in the explanation code below. Check the docs for more examples and explanation.
# set path to of unzip output
outDir <-"C:\\Users\\Name\\Documents\\unzipfolder"
# relative path where zip files are stored
relativePath <- "C:\\path\\to\\my\\directory\\"
# create a variable to store all the paths to the zip files in a vector
zipPaths <- vector()
# since we have 30 files we should iterate through
# I assume you have a vector with the zip file names (without extension) in the variable fileNamesZip
for (name in fileNamesZip) {
# use paste0() to concatenate the directory, file name, and extension
zipfilepath <- paste0(relativePath, name, ".zip")
# append the filepath (remember to assign the result back to zipPaths)
zipPaths <- append(zipPaths, zipfilepath)
}
# now we have a vector which contains all the paths to zip files
# unzip() extracts one archive at a time, so loop over zipPaths (read the official docs)
for (zp in zipPaths) unzip(zp, exdir=outDir)
# initialize a dataframe for all the data. You must provide datatypes for the columns.
total <- data.frame(Doubles=double(),
                    Ints=integer(),
                    Factors=factor(),
                    Logicals=logical(),
                    Characters=character(),
                    stringsAsFactors=FALSE)
# now its time to store data by reading csv files and storing them into dataframe.
# again, I assume you have a vector with the csv file names (without extension) in the variable fileNamesCSV
for (name in fileNamesCSV) {
# create the csv file path
csvfilepath <- file.path(outDir, paste0(name, ".csv"))
# read data from the csv file and store it in a dataframe
dataFrame <- read.csv(file=csvfilepath, header=TRUE, sep=",")
# you will be able to combine dataframes only if they are equal in structure. Specify the column names to merge by,
# or use rbind(total, dataFrame) if you simply want to stack the rows.
total <- merge(total, dataFrame, by=c("Name1","Name2"))
}
Something potentially much simpler:
list.files() lists the files in a directory
readr::read_csv() will automatically unzip files as necessary
dplyr::bind_rows() will combine data frames
So try:
lf <- list.files(pattern="\\.zip")
dfs <- lapply(lf,readr::read_csv)
result <- dplyr::bind_rows(dfs)
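If you also want to remember which zip each row came from (not part of the original question, but often useful with per-file data like this), bind_rows() can add that as a column when the list is named:
names(dfs) <- lf
result <- dplyr::bind_rows(dfs, .id = "source_file")  # "source_file" holds the originating file name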

Create a function that returns a data frame from CSV files?

I am trying to make a function that outputs a dataframe from 8 different CSV files. They all have the same variables and the same sort of data. The only difference between them is the year. I have tried to write out the function, but I can't seem to make it work. I am thinking lapply would work, but I am not sure how to incorporate it.
These are the instructions:
Write a function named 'air' that takes a 'year' argument and returns a data.frame containing that data for that year, suppressing the automatic conversion to factors.
path <- "C:/Users/Lacy Macc/Downloads/adviz/"
files <- list.files(path=path, pattern="*.csv")
for(y in files)
air <- function(year){
if (!exists(""))
}
}
If the filenames of each file varied, you might need to use list.files and search through the filenames to identify one matching the year. But with a fixed filename scheme, all you need to do is insert the year at the appropriate point in the filename:
path <- "C:/Users/Lacy Macc/Downloads/adviz/"
year <- 2013
file_path <- paste0(path, "ad_viz_plotval_data-", year, ".csv")
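To actually read the file while "suppressing the automatic conversion to factors", as the assignment asks, read.csv() takes stringsAsFactors = FALSE; the object name below is only an example:
air_data <- read.csv(file_path, stringsAsFactors = FALSE)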
I have left out the full details of how to convert this into a function that takes in the year as I suspect this might be a homework Q.

How to create a dataframe whose name is stored in a vector

I have to run an R script every month; it reads a .csv file into a dataframe and performs some manipulations on it.
The name of this dataframe needs to be dynamic for example:
df_jan for January, df_feb for February and so on
I have created a character vector which contains the required dataframe name, using the paste() and Sys.Date() functions.
I want to automate this code, so I don't want to rename this dataframe every time I run the script.
Now, how do I read the .csv into this dataframe?
Currently I'm loading the file into a dataframe 'df' and using the assign() function to give it the required name. Is there a better method to accomplish the same thing?
Thanks
create.df <- function(path){
  assign(paste0("df_", format(Sys.Date(), "%b")),
         read.csv(path),
         envir = .GlobalEnv)
}
Then call create.df() with the path to your .csv file.
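For example (the file name here is only a placeholder), running the function in January creates an object called df_Jan in the global environment:
create.df("monthly_data.csv")
# creates df_Jan, df_Feb, ... depending on the month the script runs (abbreviation follows your locale)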

read multiple files into r, no column names

I have a number of .txt files, with the data comma separated. There are no headers. Each contains the same information, but by different years: the name, the gender and the number of names.
I can read them all in in one rbind okay, but I lose the year information - the year is contained only in the file name... y1920.txt, y1995.txt, y2002.txt and so on.
I am very new to R.
To rbind them, I used do.call(rbind, file), where file is the list of data.frames.
plyr has a nice workflow for this, assuming your files are all in the current working directory:
library(plyr)
years <- ldply(list.files(pattern="y\\d{4}\\.txt"),
               function(file){
                 data <- read.csv(file, header=FALSE)
                 # strip the "y" prefix and ".txt" suffix, leaving just the year
                 data$date <- gsub("y", "", gsub("\\.txt", "", file))
                 data
               })
If you want to specify your files instead, e.g. files <- c("y1995.txt", "y1996.txt"), you can replace the first argument to ldply (list.files(...)) with files instead.
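Since the files have no header row, read.csv() labels the data columns V1, V2, V3; if you want descriptive names (the names below are only guesses based on the question) and a numeric year, you can add:
names(years)[1:3] <- c("name", "gender", "count")  # hypothetical column names
years$date <- as.integer(years$date)               # turn the extracted year into a number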

Change file name when using write.table() according to the name of the third column in data frame

I wrote a script in R that merges and modifies some csv data and then saves the resulting data frame using write.table(). When it saves the file, it adds the current date to the name of the file. The third column of the resulting data frame is always country specific, so I was wondering if there is a way to include the country name in the file name, based on the country code (the name of the third column), when using write.table.
For example, if the name of the third column is "it", I want to add "Italy" to the name of the csv file using write.table.
Import a list of country names and codes into R. (It would be wise to do this at the very top of your script, outside your processing loop, so you don't read in the data over and over for each dataset being written out to .csv. The rest of the code goes just before your current write.table() command.)
library(RCurl)
csv_src <- getURL("https://raw.githubusercontent.com/umpirsky/country-list/master/country/cldr/en/country.csv")
world <- read.csv(text=csv_src, header=TRUE)
Get name of third column in your data with country codes:
countrycode <- colnames(yourdata)[3]
Extract corresponding country name:
country_idx <- grep(pattern=countrycode, x=world$iso, ignore.case = TRUE)
country <- world$name[country_idx]
Attach country name to csv filename (Replace "..." with whatever other tags you want appended to the output filename. Otherwise remove "...")
csv_name <- paste0("...",country, ".csv")
Write out your data to file:
write.table(x=yourdata, file=csv_name)
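Since the question mentions that the current date is already added to the file name, the "..." part could, for instance, be built like this (the "output_" prefix is only a placeholder):
csv_name <- paste0("output_", Sys.Date(), "_", country, ".csv")  # yields something like "output_<today's date>_Italy.csv"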
Good luck :-)
