Time series SparkR missing value - r

I'm working with time series in SparkR and I have a question.
After some operations I ended up with something like this, where DayHour represents the day and the hour of the ID's Value.
DayHour ID Value
01 00 4704 10
01 01 4705 11
.
.
.
04 23 4705 12
The problem is that I have some gaps. Here, for example, 01 01 and 01 02 are missing:
DayHour ID Value
01 00 4704 13
01 03 4704 12
I have to fill the gaps in the whole dataset like this:
DayHour ID Value
01 00 4704 13
01 01 4704 0
01 02 4704 0
01 03 4704 12
For each ID I have to fill the gaps with rows containing the missing DayHour, the ID, and Value = 0.
A solution in either R or SparkR would be useful.

I represented your data in a data frame df_r:
>df_r <- data.frame(DayHour=c("01 00","01 01","01 02","01 03","01 06","01 07"),
ID = c(4704,4705,4705,4706,4706,4706),Value=c(10,11,12,13,14,15))
> df_r
DayHour ID Value
1 01 00 4704 10
2 01 01 4705 11
3 01 02 4705 12
4 01 03 4706 13
5 01 06 4706 14
6 01 07 4706 15
Here the hours 01 04 and 01 05 are missing (between 01 03 and 01 06).
# Remove the space so that "01 00" becomes "0100"
df_r$DayHour <- sub(" ", "", df_r$DayHour)
# Build every DayHour in sequence for days 01-04 and hours 00-23
x <- 0:23
y <- 1:4
all_day_hour <- data.frame(Hour = rep(x, 4), Day = rep(y, each = 24))
all_day_hour$Hour <- sprintf("%02d", all_day_hour$Hour)
all_day_hour$Day <- sprintf("%02d", all_day_hour$Day)
all_day_hour_1 <- transform(all_day_hour, DayHour = paste0(Day, Hour))
# Keep only the DayHour column
all_day_hour_1 <- all_day_hour_1["DayHour"]
# Loop over the IDs, merge each subset with the full sequence, and fill the gaps
library(dplyr)
df.new <- data.frame()
factors <- unique(df_r$ID)
for (i in seq_along(factors)) {
  df_r1 <- filter(df_r, ID == factors[i])
  # Full outer merge: DayHour values missing for this ID come back as NA rows
  df_data1 <- merge(df_r1, all_day_hour_1, by = "DayHour", all = TRUE)
  # The merged-in rows would otherwise have ID = NA
  df_data1$ID <- factors[i]
  df_data1$Value[is.na(df_data1$Value)] <- 0
  df.new <- rbind(df.new, df_data1)
}
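The same gap filling can probably be done without a loop. The following is a minimal base R sketch, not a SparkR solution: it builds every (ID, DayHour) pair with expand.grid and left-joins the observed values onto it. The toy data, the day range 01-04 and the names grid and filled are assumptions made for illustration.

```r
# Toy data in the same shape as df_r, with the space already removed from DayHour
df_r <- data.frame(
  DayHour = c("0100", "0103", "0101"),
  ID      = c(4704, 4704, 4705),
  Value   = c(13, 12, 11)
)

# Every DayHour for days 01-04 and hours 00-23 (the range is an assumption)
all_day_hour <- sprintf("%02d%02d", rep(1:4, each = 24), rep(0:23, times = 4))

# Every (ID, DayHour) pair that should exist in the result
grid <- expand.grid(DayHour = all_day_hour, ID = unique(df_r$ID),
                    stringsAsFactors = FALSE)

# Left join: every grid row is kept, pairs without a measurement get NA
filled <- merge(grid, df_r, by = c("ID", "DayHour"), all.x = TRUE)
filled$Value[is.na(filled$Value)] <- 0
filled <- filled[order(filled$ID, filled$DayHour), ]
```

Because the grid already carries the ID, no ID has to be patched in after the merge.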

Related

ggplot by group does not get expected outcomes

I have a data frame oz.sim.long with three columns (see below). The Times column should be the x axis in ggplot, i.e. hours from 00:30-23:00. The Month column defines the groups (03 to 08). The Ozone column holds the values to plot.
> oz.sim.long
# A tibble: 144 x 3
Times Month Ozone
<chr> <chr> <fct>
1 00:30 03 44.45481
2 00:30 04 49.43994
3 00:30 05 50.86507
4 00:30 06 48.97589
5 00:30 07 46.31845
6 00:30 08 44.78662
7 01:30 03 44.47265
8 01:30 04 49.46492
9 01:30 05 50.83062
10 01:30 06 48.79744
# … with 134 more rows
Here is my plotting code, which gives an unexpected result. Any ideas?
simul.plt <- ggplot(data = oz.sim.long, aes(x=Times, y=Ozone)) +
geom_point(aes(shape=Month,color=Month)) +
geom_smooth(aes(color=Month, linetype=Month), method = 'auto', se = F) +
labs(x='Times',y='Ozone (ppb)')
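One possible cause, judging only from the printed tibble, is the column types: Ozone is a factor and Times is a character column, so both axes are treated as discrete. The following base R sketch shows one way to convert them to numbers before plotting. The toy data frame oz and the new Hour column are assumptions for illustration, not part of the original data.

```r
# Toy data with the same column types as the printed tibble
oz <- data.frame(
  Times = c("00:30", "00:30", "01:30", "01:30"),
  Month = c("03", "04", "03", "04"),
  Ozone = factor(c("44.45481", "49.43994", "44.47265", "49.46492")),
  stringsAsFactors = FALSE
)

# as.numeric() on a factor returns the level codes, so go through character first
oz$Ozone <- as.numeric(as.character(oz$Ozone))

# Turn "HH:MM" into decimal hours so the x axis can be continuous
hm <- strsplit(oz$Times, ":", fixed = TRUE)
oz$Hour <- sapply(hm, function(p) as.numeric(p[1]) + as.numeric(p[2]) / 60)
```

With numeric columns, mapping x to Hour and y to Ozone should let geom_smooth fit one curve per Month.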

Converting Month character to date for time series without "0" before Month

How do I convert this data set into a time series format in R? Let's call the data set Bob. This is what it looks like:
1/2013 25
2/2013 865
3/2013 26
4/2013 33
5/2013 74
6/2013 24
Are you looking for something like this?
> dat <- read.table(text = "1/2013 25
2/2013 865
3/2013 26
4/2013 33
5/2013 74
6/2013 24
", header=FALSE) # your data
> ts(dat$V2, start=c(2013, 1), frequency = 12) # time series object
Jan Feb Mar Apr May Jun
2013 25 865 26 33 74 24
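If the start month should not be hard-coded, it can be read from the first row instead. This is a small base R sketch under the assumption that the first column always has the form month/year and that the rows are consecutive months. The name bob_ts is made up for illustration.

```r
dat <- read.table(text = "1/2013 25
2/2013 865
3/2013 26
4/2013 33
5/2013 74
6/2013 24", header = FALSE)

# Split the first label "1/2013" into month and year
parts <- as.integer(strsplit(as.character(dat$V1[1]), "/", fixed = TRUE)[[1]])

# ts() expects start = c(year, period within the year)
bob_ts <- ts(dat$V2, start = c(parts[2], parts[1]), frequency = 12)
```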
Assuming that your starting point is the data frame DF defined reproducibly in the Note at the end, this converts it to a zoo series z as well as a ts series tt.
library(zoo)
z <- read.zoo(DF, FUN = as.yearmon, format = "%m/%Y")
tt <- as.ts(z)
z
## Jan 2013 Feb 2013 Mar 2013 Apr 2013 May 2013 Jun 2013
## 25 865 26 33 74 24
tt
## Jan Feb Mar Apr May Jun
## 2013 25 865 26 33 74 24
Note
Lines <- "1/2013 25
2/2013 865
3/2013 26
4/2013 33
5/2013 74
6/2013 24"
DF <- read.table(text = Lines)

compare to next row group data.frame - count per group

I am pretty new to R and I have the following problem that I am trying to solve.
I would like to count the number of times that a (single) wet day follows a dry day per month, averaged over all the years. The data is stored in a data.frame. Or, to put it simply:
I want to count the number of times that the following row (x+1) has a value > 0 when row x has a value of zero, per group (Month), averaged over all years.
I first thought I could do it the same way as in the Stack Overflow question "compare to next row group data.table". Unfortunately I got the error:
Error in `[.data.frame`(weatherdata, , `:=`(PCPnextdat, PCP[match(Date + : unused argument (by = Month)
when executing the following task:
weatherdata[, PCPnextdat := PCP[match(Date + 1, Date)] , by=Month]
The important columns in the data file, let's call it weatherdata, have the following structure and hold data for 36 years, from 01Jan1979 to 31July2014:
Date Year Month Day PCP
1979-01-01 1979 01 01 0.000
1979-01-02 1979 01 02 0.987 <---- FIRST DAY
1979-01-03 1979 01 03 0.876
1979-01-04 1979 01 04 0.000
1979-01-05 1979 01 05 0.234 <---- SECOND DAY
1979-01-06 1979 01 06 0.000
1979-01-07 1979 01 07 0.123 <----- THIRD DAY
1979-01-08 1979 01 08 1.899
So in this example the number of wet days that follow dry days is 3.
I already found a way to make a new column with the next row's precipitation data (x+1), by using:
weatherdataPCP.next <- weatherdata..5341$PCP[c(2:12986,1)]
This would give:
Date Year Month Day PCP PCP.next
1979-01-01 1979 01 01 0.000 0.987 <--- ONE
1979-01-02 1979 01 02 0.987 0.876
1979-01-03 1979 01 03 0.876 0.000
1979-01-04 1979 01 04 0.000 0.234 <--- TWO
1979-01-05 1979 01 05 0.234 0.000
1979-01-06 1979 01 06 0.000 0.123 <--- THREE
1979-01-07 1979 01 07 0.123 1.899
1979-01-08 1979 01 08 1.899 0.000
What I would like to end up with is:
Month dry.wet.p.month
01 9.23
02 12.14
03 9.51
04 8.71
05 13.11
06 9.09
07 6.55
08 7.22
09 10.67
10 4.23
11 5.67
12 7.54
All help/tips/tricks are appreciated :) !
Here's a data.table option for what I think you're looking for. First, count the dry-to-wet transitions per Year and Month. Then take the mean of those counts per Month.
library(data.table)
# dt is your data (weatherdata); setDT() converts the data.frame to a data.table by reference
setDT(dt)
# shift(PCP == 0) is TRUE where the previous row in the group was dry
dt[, list(drywetpermonth = sum(PCP > 0 & shift(PCP == 0), na.rm = TRUE)),
   by = list(Year, Month)][
   , list(drywetpermonth = mean(drywetpermonth)), by = Month]
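For comparison, here is a base R sketch of the same two-step idea (count per Year and Month, then average per Month), using ave and aggregate instead of data.table. It uses the eight example days from the question as toy data. The helper dry_then_wet and the column name dry.wet are assumptions for illustration.

```r
# The eight example days from the question
weatherdata <- data.frame(
  Date = as.Date("1979-01-01") + 0:7,
  PCP  = c(0, 0.987, 0.876, 0, 0.234, 0, 0.123, 1.899)
)
weatherdata$Year  <- format(weatherdata$Date, "%Y")
weatherdata$Month <- format(weatherdata$Date, "%m")

# TRUE where the day is wet and the previous day in the same group was dry
dry_then_wet <- function(pcp) c(FALSE, head(pcp, -1) == 0) & pcp > 0

# Flag the transitions within each Year/Month group (stored as 0/1)
weatherdata$dry.wet <- ave(weatherdata$PCP, weatherdata$Year, weatherdata$Month,
                           FUN = dry_then_wet)

# Step 1: count per Year and Month; step 2: average the counts per Month
per_year_month <- aggregate(dry.wet ~ Year + Month, data = weatherdata, FUN = sum)
per_month <- aggregate(dry.wet ~ Month, data = per_year_month, FUN = mean)
```

As with the grouped data.table version, a wet day on the first of a month is not counted, because the previous day belongs to a different group.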

Converting 10 minute data to hourly average using if condition in R

Similar questions may have been asked before. I'm new to R and unable to use the other methods. I have one month of data at 10-minute intervals; an example is below. The first column is the day, the second is the hour.
> 01 00 10 2,8
01 00 20 2,4
01 00 30 2,4
01 00 40 2,1
01 00 50 2,3
01 01 00 1,9
01 01 10 2
I tried to write code that calculates the hourly average whenever the first column (day) and second column (hour) are equal, because some values are missing. I tried this code but it does not work.
for(i in 1:4314) {
if(mydata1[i,1] == mydata1[i+1,1] && (mydata1[i,2]= mydata1[i+1,2])){
while(mydata1[i,2] != mydata1[i+1,2]){sum(mydata1[i,4])}}
else {
print(mean(sum(mydata1[i,4])))
}
}
Thanks.
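This is very easy with the dplyr package.
Let's give your data some names:
names(mydata) = c("day", "hour", "minute", "value")
library(dplyr)
# value must be numeric; if it was read as text because of the decimal comma, re-read the file with dec = ","
group_by(mydata, day, hour) %>%
  summarize(hourly.mean = mean(value))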
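The same hourly average can also be sketched in base R. The example below reads the sample rows from the question with dec = "," so that values such as 2,8 become numeric, and then averages per day and hour with aggregate. The column names are the ones suggested above; the name hourly is made up for illustration.

```r
# Sample rows from the question; dec = "," handles the decimal comma
mydata <- read.table(text = "01 00 10 2,8
01 00 20 2,4
01 00 30 2,4
01 00 40 2,1
01 00 50 2,3
01 01 00 1,9
01 01 10 2", dec = ",", col.names = c("day", "hour", "minute", "value"))

# One mean per (day, hour); missing 10-minute rows simply do not contribute
hourly <- aggregate(value ~ day + hour, data = mydata, FUN = mean)
```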

Seasonality by day of month

I want to check for seasonality in a time series by the day of the month.
The problem is that the months are not of equal length: there are months with 31, 30 and 28 days.
When declaring the ts object I can only specify a fixed frequency, so it won't be correct.
> x <- data.frame(d = as.Date("2013-01-01") + 1:365 , v = runif(365))
> tapply(as.numeric(format(x$d,"%d")) , format(x$d,"%m") , max)
01 02 03 04 05 06 07 08 09 10 11 12
31 28 31 30 31 30 31 31 30 31 30 31
How can I create a time series object in R that I can later decompose and check for seasonality?
Is it possible to create a pivot table and convert it into a ts?
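As a rough illustration of the pivot table idea, the base R sketch below arranges the values in a day-by-month table with tapply and averages across months for each day of the month. It is not a formal decomposition, only a first look at a possible day-of-month pattern. The dates here start at 0:364 so that exactly the year 2013 is covered, and the names pivot and day_profile are assumptions.

```r
set.seed(1)
x <- data.frame(d = as.Date("2013-01-01") + 0:364, v = runif(365))
x$day   <- as.integer(format(x$d, "%d"))
x$month <- format(x$d, "%m")

# 31 x 12 table; days that do not exist (e.g. 30 February) stay NA
pivot <- tapply(x$v, list(x$day, x$month), mean)

# Average value for each day of the month, ignoring the missing cells
day_profile <- rowMeans(pivot, na.rm = TRUE)
```

The unequal month lengths show up only as NA cells, so no fixed frequency has to be chosen for this summary.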
