Reading multiple uneven .csv files into one dataframe with R

I need help with the following; please let me know your comments.
My objective:
I have multiple .csv files in one location.
All the .csv files have different numbers of rows (m) and columns (n), i.e. m != n.
All the files cover almost the same dates (calendar day and time stamps, e.g. 04/01/2016 7:01), but the interesting point is that some files have some timestamps missing.
All .csv files have the following columns in common: Open, High, Low, Close, Date.
My objective is to import only the "Close" column from every .csv file, even though each file has a different number of rows because some timestamp data is missing in some files.
If a timestamp's data is missing but the previous value is present, repeat the previous value.
If a timestamp's data is missing and the previous one is missing too, put NA there. This only applies to the first few data points.
Here is my plan:
Reading/writing files: we'll need to implement logic to read the files in a certain fashion and then write separate files for different sets of instruments.
Inconsistent time series: you'll notice that the time series is not consistent and continuous for some securities, so you need to generate your own datetime stamps and then fill in data against each datestamp (wherever available).
Missing data points: there will be certain timestamps for which you don't have data; make your time series continuous by filling in the missing points with data from the previous timestamp.

Maybe try
library(data.table)

read_in <- function(csv) {
  f <- read.csv(csv)
  f <- f[!is.na(f$Date), ]       # drop rows whose timestamp is missing
  f[, c("Date", "Close")]        # keep only the columns we need
}
l <- lapply(csv_list, read_in)
df <- rbindlist(l)               # stack every file into one data.table
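The question also asks for forward-filling of missing timestamps, which the snippet above doesn't handle. One way to do that (a sketch, assuming the Date column parses with the format shown in the question and using zoo::na.locf for the fill; align_close and all_times are illustrative names) is to merge each file against a master timestamp sequence:

library(zoo)

# Illustrative helper: align one file's Close series to a master
# sequence of timestamps, forward-filling gaps
align_close <- function(csv, all_times) {
  f <- read.csv(csv)
  f$Date <- as.POSIXct(f$Date, format = "%m/%d/%Y %H:%M")
  merged <- merge(data.frame(Date = all_times),
                  f[, c("Date", "Close")],
                  by = "Date", all.x = TRUE)
  # Repeat the previous value; leading gaps stay NA
  zoo::na.locf(merged$Close, na.rm = FALSE)
}

# Master sequence covering the full trading window (adjust as needed)
all_times <- seq(as.POSIXct("2016-04-01 07:00"),
                 as.POSIXct("2016-04-01 16:00"), by = "min")
closes <- sapply(csv_list, align_close, all_times = all_times)
df <- data.frame(Date = all_times, closes)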

Related

Efficient way to compile NC4 file information from separate files in R

I am currently trying to compile temperature information from the WFDE5 data set, which is quite large, and I am struggling to find an efficient way to meet my goals. My main goals are to:
Determine the max temperature of each individual day for each individual grid cell
Change the time step from hourly to daily and from UTC to MST
The data set is stored in monthly NC4 files and contains the temperature data in a three-dimensional matrix (time × lat × lon). My main question is whether there is an efficient way to compile this data to meet my goals, or to manipulate the NC4 files to be easier to work with (somehow merge the monthly files into one mega file?).
I have tried two rather convoluted ways to catch holes between months (example: due to the time conversion, some dates end up spanning two months, which requires me to read in the next file and then continue reading the data).
My first try was to read one month/file at a time, using pmax() to get the max value of each grid cell while comparing time steps across 24 hours, and then repeating the process. I have been using ncvar_get() with start and count to read only one time step at a time. To catch days that span two months, I wrote a convoluted function to merge the two, calculating the number of one-hour periods left in one month and how many would be needed from the next.
My second try still involved pmax(), but I used a different method to fill in the holes between months: I built a date vector from the time variable for each hourly time step and matched entries falling on the same day. While this seems better, it still has to read multiple NC4 files, which gets very convoluted compared to just reading one NC4 file with all the needed information.
In the end, I tested a few cases, and both solutions seem to work but run extremely slowly and seem very overcomplicated to me. I was wondering if anyone had suggestions on how to better set up the NC4 files for reading and time conversion.
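For reference, the start/count reading pattern described above looks roughly like this. This is only a sketch: the file name WFDE5_Tair_201601.nc, the variable name Tair, the lon × lat × time dimension order as read by ncvar_get, and the time dimension name are all assumptions.

library(ncdf4)

nc <- nc_open("WFDE5_Tair_201601.nc")   # hypothetical monthly file
nt <- nc$dim$time$len                   # hourly time steps (assumed dim name)
day_max <- NULL
for (t in seq_len(nt)) {
  # Read a single hourly slice; count = -1 means "the whole dimension"
  slice <- ncvar_get(nc, "Tair", start = c(1, 1, t), count = c(-1, -1, 1))
  # Running per-cell maximum for the current day
  day_max <- if (is.null(day_max)) slice else pmax(day_max, slice, na.rm = TRUE)
  if (t %% 24 == 0) {
    # ... store day_max for this (UTC) day here, then reset
    day_max <- NULL
  }
}
nc_close(nc)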

How do I use a for loop to open .ncdf files and average a matrix variable that has different values over all the files? (Using R Programming)

I'm currently trying to compute an averaged matrix over all values of a specific air-quality variable (ColumnAmountNO2TropCloudScreened) stored in different .nc4 files. The only way I was able to do it was by listing all the files, opening them using lapply, creating a separate NO2 variable for every .nc4 file, and then applying abind to all of the variables. Even though this worked, it took a lot of time to type out different names for the NO2 variables (NO2_1, NO2_2, NO2_3, etc.) and the index of each file in the original list ([[1]], [[2]], [[3]], etc.).
I am trying to write code that's smarter and easier than just typing in a bunch of numbers. I have all the original .nc4 files listed, and I'm trying to loop over the files to open them and get the 'ColumnAmountNO2TropCloudScreened' matrix from each, so that I can then average them. However, I am having no luck. Would someone know what is wrong with this code or with my thinking about it? Thanks.
I'm trying the following code:
# Load libraries
library(ncdf4)
library(abind)
library(plot.matrix)
# Set working directory
setwd("~/researchdatasets/2020")
# Declare data frame
df <- NULL
# List all files in the directory
files1 <- list.files(pattern = '\\.nc4$', full.names = FALSE)
# Loop to open files, get NO2 variables
for (i in seq_along(files1)) {
  nc_data <- nc_open(files1[i])
  NO2_var <- ncvar_get(nc_data, 'ColumnAmountNO2TropCloudScreened')
  nc_close(nc_data)
}
# Average variables
list_NO2 <- apply(abind::abind(NO2_var, along = 3), 1:2, mean, na.rm = TRUE)
NCO's ncra averages variables across all input files with, e.g.,
ncra in*.nc out.nc
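Alternatively, staying in R: the loop in the question overwrites NO2_var on every iteration, so only the last file's matrix survives, and abind then stacks just that single matrix. Collecting one matrix per file in a list and binding once afterwards fixes this (a sketch, assuming every file shares the same grid):

library(ncdf4)
library(abind)

files1 <- list.files(pattern = '\\.nc4$')
# Keep one NO2 matrix per file instead of overwriting a single variable
NO2_list <- lapply(files1, function(f) {
  nc_data <- nc_open(f)
  m <- ncvar_get(nc_data, 'ColumnAmountNO2TropCloudScreened')
  nc_close(nc_data)
  m
})
# Stack along a third dimension and average each grid cell across files
NO2_mean <- apply(abind(NO2_list, along = 3), 1:2, mean, na.rm = TRUE)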

Values in column change when reading in Excel file

I am trying to read in an Excel file, which has a column of timestamps, using the readxl package. For this particular file the times are irregularly spaced, e.g. 00:01, 00:03, 00:04, 00:08, 00:10, and I need these timestamps to read into R correctly. Instead, the timestamps turn into seemingly random decimals.
I looked in the Excel file (which is output by a different program) and the column type within Excel is "custom". When I convert that "custom" column to "text", it shows the decimals that are actually stored and being read into R. Is there a way to load the timestamps instead of the decimals?
I have already tried using col_types to read the column as text or integers, but it still reads the numbers in as decimals and not as timestamps.
df <- readxl::read_xlsx(
  "./Data/LAFLAC_60sec_9999Y1_2019-06-28.xlsx",
  range = readxl::cell_cols("J:CE"),
  col_names = TRUE
)
The decimals represent the fraction of a day after midnight, so 1:00 AM is 0.041667, or 1/24.
In R, you can convert those numbers back into timestamps in a variety of ways.
See this page for more info:
https://stackoverflow.com/a/14484019/6912825
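For instance, one way to convert those day fractions back into clock times (a minimal sketch; time_frac is an illustrative name for the numeric column readxl returns):

# Illustrative Excel day fractions: 00:01, 00:03, and 01:00 after midnight
time_frac <- c(0.000694, 0.002083, 0.041667)

# Fraction of a day -> seconds since midnight -> "HH:MM" labels
secs <- round(time_frac * 24 * 60 * 60)
format(as.POSIXct(secs, origin = "1970-01-01", tz = "UTC"), "%H:%M")
# "00:01" "00:03" "01:00"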

R - Writing data to CSV in a loop

I have a loop that goes through a list of variables (in a csv), accesses a database and extracts the relevant data. It does this for 4 different time periods (which depend on the variables).
I am trying to get R to write this data to a csv, but currently I can only get it to store the data for the last variable, in 4 different csv files, because it overwrites the previous variable each time.
I'd like to have all of the data for these variables for one time period in the same file/sheet (so either 4 sheets or 4 csv files, each with all of the data on it). This is because I need to do some data manipulation on the variables before I feed them into the next loop of the script.
I'd like it to be something like this, but I need 4 separate sheets/files so I can cover each time period:
date/time | var1 | var2 | ... | varn
I would post the code, but even posting only the relevant loop and none of the surrounding code would be ~150 lines. I am not familiar with R (I can follow the script but struggle to write my own); I inherited this project and don't have long to work on it.
Note: each variable is recorded at a different frequency; some have only one data point an hour, others one every minute, so I will need to match them up based on the time recorded (to the nearest minute).
EDIT: I hope I've explained this clearly enough.
Four different .csv files would be easiest, because you could do something like the following in your loop (note that write.csv takes the object to write first, then the file name; period.data is an illustrative name for the current period's data frame):
# 'period.data' (illustrative) holds the data for the current time period
outfile.name <- paste0('Sales', year.of.data, '.csv')
write.csv(period.data, file = outfile.name, row.names = FALSE)
You could also append the data into one data.frame and then export it all at once into one sheet. You won't be able to export multiple sheets to a .csv, because the CSV format has no concept of sheets.
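If you would rather build up one file per time period across loop iterations, appending works too. A sketch using write.table, since write.csv ignores the append argument (period.data and the file name are illustrative):

out <- "period1.csv"   # one such file per time period
write.table(period.data, file = out, sep = ",",
            row.names = FALSE,
            col.names = !file.exists(out),   # write the header only once
            append = file.exists(out))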

Nested for loops in R to parse csv files?

Edit: I've corrected the typo in the code (a copy-and-paste error). I can't add an example of the csv files, as it's too complex to model in a simple example (I tried...).
I've spent hours looking through similarly titled questions to solve a for-loop problem in R, and have tried a lot of different approaches, but I'm having no luck.
I have many different csv files, each of which has a set of 10 separate strings (variables) identifying a specific row (e.g., names = c("Delta values", "Scream factor", "nightmare mode")). Two rows below such a string, I need the max value of that row of data. I can write a loop that scans a single csv file for one such value using the following.
Test files: test1.csv, test2.csv, test3.csv, test4.csv
files <- list.files(pattern = "\\.csv$")
DF <- NULL
for (i in files) {
  dat <- read.csv(i, header = FALSE, stringsAsFactors = FALSE)
  # Find the cell containing the label, then take the row two below it
  index <- which(dat == "Delta values", arr.ind = TRUE)
  row <- as.numeric(rownames(dat)[index[1]])
  aver <- dat[row + 2, ]
  # unlist() so the one-row data frame coerces cleanly to numeric
  p <- max(na.omit(as.numeric(unlist(aver))))
  DF <- rbind(DF, p)
  colnames(DF) <- dat[index]
}
However, my problem comes in trying to generalize this. I want a data frame whose rows are labeled by the file each value was retrieved from (not by "p"), and I want to loop over the files to retrieve the next several variables, appending to the same data frame, so that I end up with one row per file, with the filename and each variable in its own column (see the sketch below).
I'm pretty sure I need a nested loop over the values I want to retrieve, as computed by "p", but I can't find good examples of iterating that way while appending new variables to a growing data frame and keeping the row numbering consistent by file.
Please help!
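One way to generalize this without deeply nested loops (a sketch, assuming each label appears at most once per file; labels and extract_max are illustrative names):

labels <- c("Delta values", "Scream factor", "nightmare mode")
files  <- list.files(pattern = "\\.csv$")

# Illustrative helper: max of the row two below a given label, NA if absent
extract_max <- function(dat, label) {
  index <- which(dat == label, arr.ind = TRUE)
  if (nrow(index) == 0) return(NA)
  vals <- as.numeric(unlist(dat[index[1, "row"] + 2, ]))
  max(na.omit(vals))
}

rows <- lapply(files, function(f) {
  dat <- read.csv(f, header = FALSE, stringsAsFactors = FALSE)
  sapply(labels, function(lab) extract_max(dat, lab))
})
# One row per file, one column per variable, filenames in their own column
DF <- data.frame(file = files, do.call(rbind, rows))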
