Adding multiple data sets and plotting the average in R

I am using R to analyze multiple large data sets. I am trying to add a few of them together and average them to make a plot. They need to be combined on matching dates, but the data sets are not all the same length and did not start or end at the same time. How would I go about adding them together while accounting for the differences in dates? My first idea is to use an if statement that checks whether date == date, but I'm not sure of the correct way to read in all the files in the folder for comparison.
I have a script that plots one data set at a time and am simply trying to amend it to accomplish this new analysis:
library(openair)
filedir <-"C:/Users/dfmcg/Documents/Thesisfiles/NE"
myfiles <-c(list.files(path = filedir))
paste(filedir,myfiles,sep = '/')
npsfiles<-c(paste(filedir,myfiles,sep = '/'))
print(npsfiles)
for (i in npsfiles[1:3]){
x <- substr(i,54,61)
y<-paste(paste('C:/Users/dfmcg/Documents/Thesisfiles/NEavg',x,sep='/'), 'png', sep='')
png(filename = y)
timeozone<-import(i,date="DATE",date.format = "%m/%d/%Y %H",header=TRUE,na.strings="-999")
ozoneavg <- timeAverage(timeozone, pollutant = c("O3"), avg.time = "month")
timePlot(ozoneavg,pollutant=c("O3"), main = x)
dev.off()
}
Here is some of the data:
ABBR,DATE,O3,SWS,VWS,SWD,VWD,SDWD,TMP,RH,RNF,SOL
SHEN-BM,05/01/1983 00,-999,-999,-999,,-999,-999,-999,-999,-999,-999
SHEN-BM,05/01/1983 01,-999,-999,-999,,-999,-999,-999,-999,-999,-999
SHEN-BM,05/01/1983 02,-999,-999,-999,,-999,-999,-999,-999,-999,-999

Your question is not very clear. Since I am not sure exactly how you would like to add the data frames together and what to average, here is a generic attempt at an answer.
To read multiple files in and merge them into one large data frame:
#read 3 files, treating the -999 placeholder as missing (NA)
basefilename<-"oa_test"
npsfiles<-lapply(1:3, function(i) {read.csv(paste0(basefilename,i,".csv"), na.strings="-999")})
#merge files into one dataframe
df<-do.call(rbind, npsfiles)
#fix date column
df$DATE<-as.POSIXct(df$DATE, format="%m/%d/%Y %H")
You could use the import function from the openair package here instead of read.csv.
Now, once you have all the data in one data frame, the dplyr package makes it easy to group the data by the various variables and compute descriptive statistics on the groups:
library(dplyr)
#group by DATE and average (na.rm=TRUE skips missing values)
ozoneavedate<-summarize(group_by(df, DATE), mean(O3, na.rm=TRUE))
#group by ABBR and sum
ozonesumabbr<-summarize(group_by(df, ABBR), sum(O3, na.rm=TRUE))
#group by ABBR and DATE and average
ozoneavedateabbr<-summarize(group_by(df, ABBR, DATE), mean(O3, na.rm=TRUE))
Hope this helps.
In the future, providing some sample data and a description of what you hope to achieve goes a long way toward getting help.
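For the specific case in the question (series that start and end at different times), here is a minimal base R sketch of averaging by matching timestamp. The three small data frames, the site names, and the helper make_site are made up for illustration. With real files, each data frame would come from read.csv or openair's import.

```r
# Hypothetical toy data: three sites with different start/end hours
make_site <- function(abbr, start, n, o3) {
  data.frame(
    ABBR = abbr,
    DATE = seq(as.POSIXct(start, tz = "UTC"), by = "hour", length.out = n),
    O3   = o3
  )
}
site_a <- make_site("SITE-A", "1983-05-01 00:00", 4, c(10, 20, NA, 40))
site_b <- make_site("SITE-B", "1983-05-01 01:00", 4, c(30, 30, 30, 30))
site_c <- make_site("SITE-C", "1983-05-01 02:00", 2, c(60, 90))

# Stack the sites; rows are matched by DATE in the aggregation step,
# so the series do not need to share a start or end time
all_sites <- rbind(site_a, site_b, site_c)

# Mean O3 per timestamp over whichever sites reported a value
# (the formula interface drops rows where O3 is NA)
o3_mean <- aggregate(O3 ~ DATE, data = all_sites, FUN = mean)

# Number of sites contributing to each mean
o3_n <- aggregate(O3 ~ DATE, data = all_sites, FUN = length)
o3_mean$n_sites <- o3_n$O3

print(o3_mean)
```

The result has one row per timestamp that appears in any of the inputs, so it can be passed straight to a plotting call such as plot(o3_mean$DATE, o3_mean$O3, type = "l"). The n_sites column shows how many series each average is based on.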

Related

Row bind then resave particular CSV files in a directory based on matching index column values in R

I am working with a large number of CSV files housed in the same directory. This time series data comes from sensors deployed at various locations over multiple time periods. Each file has a siteID column with the name of the location. I want to iteratively row bind (in chronological order) files that come from the same location, then resave them such that there is only one file per location, rather than multiple files with the same siteID covering different time periods. Here is a (hopefully) reproducible setup:
# Reproducible set up:
library(tidyverse)
# making fake folder where all these CSVs are contained
dir.create("test_dir_1")
# create first of two datetime columns covering different time periods
datetime <- as.POSIXct(c("2022-07-15 15:25:00", "2022-07-16 15:30:00", "2022-07-17 15:35:00"))
# making fake column for data from the site
temperature <- 1:3
# making data frames for first time period and adding siteID column
sensor_SFM01_2_20220715_20220717 <- data.frame(datetime, temperature) %>%
mutate(siteID = "SFM01_2")
sensor_04M06_1_20220715_20220717 <- data.frame(datetime, temperature) %>%
mutate(siteID = "04M06_1")
sensor_20M04_2_20220715_20220717 <- data.frame(datetime, temperature) %>%
mutate(siteID = "20M04_2")
# now reassigning datetime to make data frames from the second time period
datetime<- as.POSIXct(c("2022-07-17 15:45:00", "2022-07-18 15:50:00", "2022-07-19 15:55:00"))
# making data frames for second time period and adding siteID column
sensor_SFM01_2_20220717_20220719 <- data.frame(datetime, temperature) %>%
mutate(siteID = "SFM01_2")
sensor_04M06_1_20220717_20220719 <- data.frame(datetime, temperature) %>%
mutate(siteID = "04M06_1")
sensor_20M04_2_20220717_20220719 <- data.frame(datetime, temperature) %>%
mutate(siteID = "20M04_2")
# saving files to directory
write_csv(sensor_SFM01_2_20220715_20220717, "test_dir_1/sensor_SFM01_2_20220715_20220717.csv")
write_csv(sensor_SFM01_2_20220717_20220719, "test_dir_1/sensor_SFM01_2_20220717_20220719.csv")
write_csv(sensor_04M06_1_20220715_20220717, "test_dir_1/sensor_04M06_1_20220715_20220717.csv")
write_csv(sensor_04M06_1_20220717_20220719, "test_dir_1/sensor_04M06_1_20220717_20220719.csv")
write_csv(sensor_20M04_2_20220715_20220717, "test_dir_1/sensor_20M04_2_20220715_20220717.csv")
write_csv(sensor_20M04_2_20220717_20220719, "test_dir_1/sensor_20M04_2_20220717_20220719.csv")
The following code worked for me using your sample data, though it assumes that all the data frames in the global environment have the site ID in their names (as in the example data). Good luck!
# Define site IDs
ssites <- c("04M06", "20M04", "SFM01")
#' Put all data frames in the global environment with
#' the same site ID in the name in a list
llist <- list()
for(i in ssites){
# i is the site ID string itself, so use it directly as the pattern
llist[[i]] <- mget(ls(pattern = i))
}
#row bind all the same site IDs
bind_list <- lapply(llist, function(x) do.call(rbind, x))
# export to one CSV per site, named after the site ID
for(i in names(bind_list)){
write.csv(bind_list[[i]], paste0("test_dir_1/site_",i,".csv"), row.names = FALSE)
}
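If the data only exist as files on disk rather than as objects in the global environment, a similar result can be sketched in base R by reading everything from the folder and grouping on the siteID column itself. The folder, file names, and sample rows below are hypothetical and written to a temporary directory.

```r
# Hypothetical folder and files, mirroring the layout in the question
in_dir <- file.path(tempdir(), "sensor_csvs")
dir.create(in_dir, showWarnings = FALSE)
for (site in c("SFM01_2", "04M06_1")) {
  early <- data.frame(datetime = c("2022-07-15 15:25:00", "2022-07-16 15:30:00"),
                      temperature = 1:2, siteID = site)
  late  <- data.frame(datetime = c("2022-07-17 15:45:00", "2022-07-18 15:50:00"),
                      temperature = 3:4, siteID = site)
  write.csv(early, file.path(in_dir, paste0("sensor_", site, "_20220715_20220717.csv")),
            row.names = FALSE)
  write.csv(late, file.path(in_dir, paste0("sensor_", site, "_20220717_20220719.csv")),
            row.names = FALSE)
}

# Read every CSV in the folder and stack them into one data frame
files <- list.files(in_dir, pattern = "\\.csv$", full.names = TRUE)
all_rows <- do.call(rbind, lapply(files, read.csv, stringsAsFactors = FALSE))

# Split by the siteID column (not the file name) and sort each piece by time
by_site <- split(all_rows, all_rows$siteID)
by_site <- lapply(by_site, function(d) d[order(as.POSIXct(d$datetime, tz = "UTC")), ])

# Write one combined file per site into a separate output folder
out_dir <- file.path(in_dir, "combined")
dir.create(out_dir, showWarnings = FALSE)
for (site in names(by_site)) {
  write.csv(by_site[[site]], file.path(out_dir, paste0(site, ".csv")), row.names = FALSE)
}
```

Sorting on the parsed datetime means the chronological order does not depend on how the input files happen to be named.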

How can I add a column with mutate () to each of the multiple data sets I read?

I am a beginner in R and am currently learning how to do data wrangling on multiple data sets.
Right now I read 55 CSV data sets with 300 rows each using the following code:
Rawdata <- list.files(pattern = "*.csv")
for(i in 1:length(Rawdata)){
assign(Rawdata[i],read.csv(Rawdata[i], header = TRUE)[1:300])
}
Each data set has variables "acc_X_value", "acc_Y_value", and "acc_Z_value".
I failed to add a column with mutate() to these data sets. I want to show the average of these variables in a new column. Any ideas? Thank you!
Usually it is better to keep related things in lists rather than use assign to store them in the global environment. I would do it something like this:
library(tidyverse)
# "\\.csv$" is a regular expression matching file names that end in .csv
rawData <- map(list.files(pattern = "\\.csv$"), read_csv)
newData <- map(rawData, mutate, average = (acc_X_value + acc_Y_value + acc_Z_value) / 3)
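The same keep-everything-in-a-list idea can also be sketched in base R. The two small data frames below are hypothetical stand-ins for the files read from disk. Only the three column names come from the question.

```r
# Hypothetical stand-ins for the data frames read from the CSV files
rawData <- list(
  a.csv = data.frame(acc_X_value = c(1, 2), acc_Y_value = c(3, 4), acc_Z_value = c(5, 6)),
  b.csv = data.frame(acc_X_value = c(0, 3), acc_Y_value = c(0, 3), acc_Z_value = c(3, 3))
)

# Add the row-wise average of the three axes to every data frame in the list
acc_cols <- c("acc_X_value", "acc_Y_value", "acc_Z_value")
newData <- lapply(rawData, function(d) {
  d$average <- rowMeans(d[, acc_cols])
  d
})

newData[["a.csv"]]$average
```

Because the list is named, each data frame can still be looked up by its file name, which is what assign was being used for in the question.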

Creating time series in R

I have a CSV file containing data as follows-
date, group, integer_value
The dates run from 01-January-2013 to 31-October-2015 for the 20 groups contained in the data.
I want to create a time series for each of the 20 groups. But the dates are not continuous and have sporadic gaps, hence
group4series <- ts(group4, frequency = 365.25, start = c(2013,1,1))
works from a programming point of view but is not correct due to the gaps in the data.
How can I use the 'date' column of the data to create the time series instead of the usual 'frequency' parameter of 'ts()' function?
Thanks!
You could use zoo::zoo instead of ts.
Since you don't provide sample data, let's generate daily data, and remove some days to introduce "gaps".
set.seed(2018)
dates <- seq(as.Date("2015/12/01"), as.Date("2016/07/01"), by = "1 day")
dates <- dates[sample(length(dates), 100)]
We construct a sample data.frame
df <- data.frame(
dates = dates,
val = cumsum(runif(length(dates))))
To turn df into a zoo timeseries, you can do the following
library(zoo)
ts <- with(df, zoo(val, dates))
Let's plot the timeseries
plot.zoo(ts)
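If a regularly spaced series per group is still needed (for example to pass to ts), one option is to put each group on a complete daily calendar so the gaps become explicit NA values. The following is a base R sketch of that idea, not part of the zoo approach above. The column names follow the question, while the data are randomly generated for illustration.

```r
# Hypothetical data in the question's layout: date, group, integer_value
set.seed(1)
all_days <- seq(as.Date("2013-01-01"), as.Date("2013-01-31"), by = "day")
dat <- rbind(
  data.frame(date = sort(sample(all_days, 20)), group = "g1", integer_value = 1:20),
  data.frame(date = sort(sample(all_days, 25)), group = "g2", integer_value = 1:25)
)

# One data frame per group
by_group <- split(dat, dat$group)

# Put each group on a complete daily calendar; missing days become NA
regular <- lapply(by_group, function(g) {
  calendar <- data.frame(date = seq(min(g$date), max(g$date), by = "day"))
  merge(calendar, g[, c("date", "integer_value")], by = "date", all.x = TRUE)
})

# Each element now has exactly one row per day between its first and last date
sapply(regular, nrow)
```

The same split step also works with zoo: calling zoo(g$integer_value, g$date) inside the lapply would give one irregular series per group without filling the gaps.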

R year quarter for in loop

I am looking to loop over my R data frame, which is indexed by year-quarter, and run a rolling regression for every year-quarter. I then use the coefficients from this model to fit values that are one quarter ahead. Is there a quarterly date format in R that I could use for this?
I had a similar issue in Stata (see the question "Stata year-quarter for loop"), but am revisiting it in R. Does R have a notion of year-quarters that can easily be used in a loop? For example, one possibly roundabout way is
months.list <- c("03","06","09","12")
years.list <- c(1992:2007)
## Loop over the month and years
for(yidx in years.list)
{
for(midx in months.list)
{
}
}
I see the zoo package has some functions, but I am not sure which one fits my case. Something along the following lines would be ideal:
for (yqidx in 1992Q1:2007Q4){
z <- lm(y ~ x, data = mydata <= yqidx )
}
When I do the look-ahead, I need to handle it so that the predicted value is computed for the next quarter, that is yqidx + 1, so that 2000Q4 moves to 2001Q1.
If all you need help on is how to generate quarters,
require(data.table)
require(zoo)
months.list <- c("03","06","09","12")
years.list <- c(1992:2007)
#The next line of code generates all the month-year combinations.
df<-expand.grid(year=years.list,month=months.list)
#Then, we paste together the year and month with a day so that we get dates like "2007-03-01". Pass that to as.Date, and pass the result to as.yearqtr.
df$Date=as.yearqtr(as.Date(paste0(df$year,"-",df$month,"-01")))
df<-df[order(df$Date),]
Then you can use loops if you'd like. I'd personally consider using data.table like so:
require(data.table)
require(zoo)
DT<-data.table(expand.grid(year=years.list,month=months.list))
DT<-DT[order(year,month)]
DT[,Date:=as.yearqtr(as.Date(paste0(year,"-",month,"-01")))]
#Generate fake x values.
DT[,X:=rnorm(64)]
#Generate time index.
DT[,t:=1:64]
#Generate fake Y values that depend on X and the time index.
DT[,Y:=X+rnorm(64)+t]
#Get rid of the year and month columns -unneeded.
DT[,c("year","month"):=NULL]
#Create a second data.table to hold all your models.
Models<-data.table(Date=DT$Date,Index=1:64)
#Generate your (rolling) models. I am assuming you want to use all past observations in each model.
Models[,Model:=list(list(lm(data=DT[1:Index],Y~X+t))),by=Index]
#You can access an individual model thusly:
Models[5,Model]
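For the look-ahead step in the question, here is a base R sketch of an expanding-window loop that fits on all quarters up to the current one and predicts the following quarter. The data frame mydata, its columns, and the starting point first_fit are assumptions made for illustration. Because the quarters are held in an ordered sequence, moving from 2000Q4 to 2001Q1 is just the next element.

```r
# Hypothetical quarterly data: one observation per quarter, 1992Q1 to 2007Q4
set.seed(42)
qstart <- seq(as.Date("1992-01-01"), as.Date("2007-10-01"), by = "quarter")
mydata <- data.frame(
  qstart = qstart,
  yq     = paste0(format(qstart, "%Y"), quarters(qstart)),  # e.g. "1992Q1"
  x      = rnorm(length(qstart))
)
mydata$y <- 2 * mydata$x + rnorm(nrow(mydata))

# Start once a few quarters are available; stop one quarter before the end
# so that a "next quarter" always exists
first_fit <- 8
preds <- data.frame(yq = character(0), predicted = numeric(0), actual = numeric(0))
for (k in first_fit:(nrow(mydata) - 1)) {
  train <- mydata[mydata$qstart <= qstart[k], ]      # all data up to quarter k
  ahead <- mydata[mydata$qstart == qstart[k + 1], ]  # the following quarter
  fit <- lm(y ~ x, data = train)
  preds <- rbind(preds, data.frame(
    yq        = ahead$yq,
    predicted = unname(predict(fit, newdata = ahead)),
    actual    = ahead$y
  ))
}
head(preds)
```

With zoo's yearqtr class the same step can be written arithmetically, since adding 0.25 to a yearqtr value moves it forward by one quarter.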

Extract data from a by-timeseries object

Let's start from the end: the R output will be read in Tableau to create a dashboard, and therefore I need the R output to look a certain way. With that in mind, I'm starting with a data frame in R with n groups of time series. I want to run auto.arima (or another forecasting method from the forecast package) on each by group. I'm using the by function to do that, but I'm not attached to that approach; it's just what seemed to do the job for an R beginner like me.
The output I need would append a (say) 1 period forecast to the original data frame, filling in the date (variable t) and by variable (variable class).
If possible I'd like the approach to generalize to multiple by variables (i.e class_1,...class_n,).
#generate fake data
t<-seq(as.Date("2012/1/1"), by = "month", length.out = 36)
class<-rep(c("A","B"),each=18)
set.seed(1234)
metric<-as.numeric(arima.sim(model=list(order=c(2,1,1),ar=c(0.5,.3),ma=0.3),n=35))
df <- data.frame(t,class,metric)
df$type<-"ORIGINAL"
#sort of what I'd like to do
library(forecast)
ts<-ts(df$metric)
ts<-by(df$metric,df$class,auto.arima)
#extract forecast and relevant other pieces of data
#???
#what I'd like to look like
t<-as.Date(c("2013/7/1","2015/1/1"))
class<-rep(c("A","B"),each=1)
metric<-c(1.111,2.222)
dfn <- data.frame(t,class,metric)
dfn$type<-"FORECAST"
dfinal<-rbind(df,dfn)
I'm not attached to the how-to, as long as it starts with a data frame that looks like what I described, and outputs a data frame like the output I described.
Your description is a little vague, but something along these lines should work:
library(data.table)
dt = data.table(df)
dt[, {result = auto.arima(metric);
rbind(.SD,
list(seq(t[.N], length.out = 2, by = '1 month')[2], result$sigma2, "FORECAST"))},
by = class]
I arbitrarily chose to fill in the sigma^2, since it wasn't clear which variable(s) you want there.
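To append an actual one-step-ahead forecast rather than sigma^2, the per-group step can be sketched in base R as below. Note the assumption: a fixed ARIMA(1,1,0) fitted with stats::arima stands in for auto.arima here so that the example has no package dependencies. The fake data are the same as in the question.

```r
# Same fake data as in the question
t <- seq(as.Date("2012/1/1"), by = "month", length.out = 36)
class <- rep(c("A", "B"), each = 18)
set.seed(1234)
metric <- as.numeric(arima.sim(model = list(order = c(2, 1, 1), ar = c(0.5, 0.3), ma = 0.3), n = 35))
df <- data.frame(t, class, metric, type = "ORIGINAL")

# Fit one model per class and build a one-row forecast data frame for each.
# ASSUMPTION: a fixed ARIMA(1,1,0) from stats stands in for auto.arima().
one_ahead <- lapply(split(df, df$class), function(g) {
  fit <- arima(g$metric, order = c(1, 1, 0), method = "ML")
  data.frame(
    t      = seq(max(g$t), by = "month", length.out = 2)[2],  # next month
    class  = g$class[1],
    metric = as.numeric(predict(fit, n.ahead = 1)$pred),
    type   = "FORECAST"
  )
})

# Append the forecast rows to the original data
dfinal <- rbind(df, do.call(rbind, one_ahead))
tail(dfinal, 2)
```

To group on several by variables, the split call can take a list of columns, for example split(df, list(df$class_1, df$class_2), drop = TRUE), where class_1 and class_2 are the hypothetical column names from the question.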
