ggplot by group does not get expected outcomes

I have a data frame oz.sim.long with three columns (see below). The Times column should be the x axis in ggplot, i.e. the hours from 00:30-23:00. Month is the grouping column (03 to 08). Ozone is the column to plot.
> oz.sim.long
# A tibble: 144 x 3
Times Month Ozone
<chr> <chr> <fct>
1 00:30 03 44.45481
2 00:30 04 49.43994
3 00:30 05 50.86507
4 00:30 06 48.97589
5 00:30 07 46.31845
6 00:30 08 44.78662
7 01:30 03 44.47265
8 01:30 04 49.46492
9 01:30 05 50.83062
10 01:30 06 48.79744
# … with 134 more rows
Here is my plotting code, but I get an unexpected outcome. Any ideas?
simul.plt <- ggplot(data = oz.sim.long, aes(x=Times, y=Ozone)) +
geom_point(aes(shape=Month,color=Month)) +
geom_smooth(aes(color=Month, linetype=Month), method = 'auto', se = F) +
labs(x='Times',y='Ozone (ppb)')
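
A likely cause, judging only from the tibble header above: Ozone is stored as a factor (<fct>) and Times as character (<chr>), so ggplot treats both axes as discrete, and with a discrete x every Times/Month combination becomes its own one-point group that geom_smooth cannot fit. The following is a minimal sketch of one possible fix under that assumption. The toy data frame and the Hour column are illustrative, not from the original post, and geom_line stands in for geom_smooth because the toy data has only two points per month.

```r
# Toy stand-in for oz.sim.long (values copied from the printout above)
oz.sim.long <- data.frame(
  Times = rep(c("00:30", "01:30"), each = 3),
  Month = rep(c("03", "04", "05"), times = 2),
  Ozone = factor(c("44.45481", "49.43994", "50.86507",
                   "44.47265", "49.46492", "50.83062")),
  stringsAsFactors = FALSE
)

# A factor must go through as.character() first; as.numeric() alone
# would return the internal level codes instead of the ozone values.
oz.sim.long$Ozone <- as.numeric(as.character(oz.sim.long$Ozone))

# Turn "HH:MM" into decimal hours so the x axis is continuous
# (Hour is a hypothetical helper column).
hm <- strsplit(oz.sim.long$Times, ":", fixed = TRUE)
oz.sim.long$Hour <- vapply(
  hm,
  function(p) as.numeric(p[1]) + as.numeric(p[2]) / 60,
  numeric(1)
)

# Plot only if ggplot2 is installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  simul.plt <- ggplot(oz.sim.long, aes(x = Hour, y = Ozone, color = Month)) +
    geom_point(aes(shape = Month)) +
    geom_line(aes(linetype = Month)) +
    labs(x = "Hour of day", y = "Ozone (ppb)")
}
```

On the full 144-row data, geom_line could be swapped back for geom_smooth(aes(linetype = Month), se = FALSE) once both columns are numeric.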

Related

Is it possible to convert year-week date format to the first day of the week?

I have dates in Year-Week format. Is it possible to convert them to the first day of the week, i.e. 201553 becomes 2015-12-28 and 201601 becomes 2016-01-04?
I found here how to do it; however, it does not work correctly on my dates. Could you help me do it without the ISOweek package?
date<-c(201553L, 201601L, 201602L, 201603L, 201604L, 201605L, 201606L,
201607L, 201608L, 201609L)
as.POSIXct(paste(date, "0"),format="%Y%u %w")

Here's one way:
date <- data.frame(first = c(201553L, 201601L, 201602L, 201603L, 201604L, 201605L, 201606L,
201607L, 201608L, 201609L))
First, separate the week and the year from the integer:
library(stringr)
library(dplyr)
# Characters 5-6 are the week, characters 1-4 are the year
date <- date %>% mutate(week = str_sub(first, 5, 6))
date <- date %>% mutate(year = str_sub(first, 1, 4))
Then use the aweek package to find the date:
library(aweek)
date <- date %>% mutate(actual_date = get_date(week = week, year = year))
first week year actual_date
1 201553 53 2015 2015-12-28
2 201601 01 2016 2016-01-04
3 201602 02 2016 2016-01-11
4 201603 03 2016 2016-01-18
5 201604 04 2016 2016-01-25
6 201605 05 2016 2016-02-01
7 201606 06 2016 2016-02-08
8 201607 07 2016 2016-02-15
9 201608 08 2016 2016-02-22
10 201609 09 2016 2016-02-29
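
If no extra package is wanted at all, the same conversion can be sketched in base R. This is a minimal sketch, assuming the input follows ISO 8601 week numbering (week 1 is the week containing 4 January, and weeks start on Monday); iso_week_start is a hypothetical helper name, not a function from any package.

```r
# Convert integer year-week values (e.g. 201553) to the Monday of that ISO week.
iso_week_start <- function(yw) {
  year <- yw %/% 100
  week <- yw %% 100
  # 4 January always falls in ISO week 1
  jan4 <- as.Date(sprintf("%d-01-04", year))
  # "%u" gives the ISO weekday: 1 = Monday ... 7 = Sunday
  wday <- as.integer(format(jan4, "%u"))
  week1_monday <- jan4 - (wday - 1)
  week1_monday + (week - 1) * 7
}

date <- c(201553L, 201601L, 201602L, 201609L)
first_day <- iso_week_start(date)
print(first_day)
```

For the values in the question this agrees with the aweek output above, e.g. 201553 maps to 2015-12-28 and 201609 to 2016-02-29.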

Adding hour and 0 count where it is missing from data [duplicate]

My data frame looks like this. If there is no data for an hour, there isn't even a row for that hour of the day. The hours in the data go from 0 to 23, representing the 24 hours in a day. Is there a way to add the missing hours for each date with a zero count, maybe using a second data frame as a lookup?
df
date hour count
2018-01-15 08 4682
2018-01-15 09 406
2018-01-16 05 3359
2018-01-16 06 11926
2018-01-16 07 42602
I would like the dataframe to look like this:
df
date hour count
2018-01-15 01 0
2018-01-15 02 0
2018-01-15 03 0
2018-01-15 04 0
2018-01-15 05 0
2018-01-15 06 0
2018-01-15 07 0
2018-01-15 08 4682
2018-01-15 09 406
2018-01-15 10 0
....
2018-01-16 05 3359
2018-01-16 06 11926
2018-01-16 07 42602
2018-01-16 08 0
2018-01-16 09 0
2018-01-16 10 0
2018-01-16 11 0
....

As mentioned by others, you could use dplyr and tidyr.
For your specific column names, this comes down to:
library(dplyr)
library(tidyr)
data = "date hour count
2018-01-15 08 4682
2018-01-15 09 406
2018-01-16 05 3359
2018-01-16 06 11926
2018-01-16 07 42602"
df <- read.table(text=data, header = T)
df
# The hours in the data run from 0 to 23, so complete over 0:23
df %>%
group_by(date) %>%
complete(hour = full_seq(0:23, 1), fill = list(count = 0))
Which yields:
# A tibble: 48 x 3
# Groups: date [2]
date hour count
<fct> <dbl> <dbl>
1 2018-01-15 0. 0.
2 2018-01-15 1. 0.
3 2018-01-15 2. 0.
4 2018-01-15 3. 0.
5 2018-01-15 4. 0.
6 2018-01-15 5. 0.
7 2018-01-15 6. 0.
8 2018-01-15 7. 0.
9 2018-01-15 8. 4682.
10 2018-01-15 9. 406.
# ... with 38 more rows

You can use expand.grid to get the Cartesian product of the column values, and then an update join from the data.table package:
library('data.table')
df2 <- expand.grid(date = unique(df1$date), hour = 0:23, count = 0L, stringsAsFactors = FALSE)
setDT(df2)[df1, count := i.count, on = .(date, hour)]
Alternatively, use the cross join CJ in data.table to create df2:
df2 <- CJ(date = unique(df1$date), hour = 0:23, count = 0L)
df2[df1, count := i.count, on = .(date, hour)]
Data:
df1 <- read.table(text='2018-01-15 08 4682
2018-01-15 09 406
2018-01-16 05 3359
2018-01-16 06 11926
2018-01-16 07 42602 ', stringsAsFactors = FALSE)
colnames(df1) <- c('date', 'hour', 'count')
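
For comparison, the same fill can be sketched with base R only, using expand.grid() and merge(). This is an illustrative variant rather than part of the answer above; the names grid and full are made up for the example.

```r
# Sample data from the question (hours stored as integers)
df1 <- data.frame(
  date  = c("2018-01-15", "2018-01-15", "2018-01-16", "2018-01-16", "2018-01-16"),
  hour  = c(8L, 9L, 5L, 6L, 7L),
  count = c(4682L, 406L, 3359L, 11926L, 42602L),
  stringsAsFactors = FALSE
)

# Every date paired with every hour 0..23
grid <- expand.grid(date = unique(df1$date), hour = 0:23,
                    stringsAsFactors = FALSE)

# Left join: keep all grid rows, pull in counts where they exist
full <- merge(grid, df1, by = c("date", "hour"), all.x = TRUE)

# Hours without data come back as NA; set them to zero
full$count[is.na(full$count)] <- 0L

# Sort by date, then hour
full <- full[order(full$date, full$hour), ]
head(full, 10)
```

The design is the same as in the data.table version: build the complete date-by-hour grid first, then join the observed counts onto it.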

avgokmts returning incorrect maximum rain from ok mesonet data

I'm using the okmesonet package to get rainfall data. I've tried using avgokmts from this package to calculate the rainfall for each day, but I'm getting nonsensical values.
Get rain data for Norman, OK (cumulative rain in mm over a day at 5 min intervals)
library(okmesonet)
rainDat <- okmts(begintime="2016-06-21 00:00:00", endtime="2016-07-04 00:00:00",
station="NRMN", variables="RAIN", localtime=TRUE)
Calculate the max rain per day
avgokmts(rainDat, by="day", metric="max")
Which returns these values
STID STNM DAY MONTH YEAR RAIN Time Date
1 NRMN 121 21 06 2016 0.00 23:55:00 2016-06-22
2 NRMN 121 22 06 2016 0.25 23:55:00 2016-06-23
3 NRMN 121 23 06 2016 59.70 23:55:00 2016-06-24
4 NRMN 121 24 06 2016 0.00 23:55:00 2016-06-25
5 NRMN 121 25 06 2016 0.00 23:55:00 2016-06-26
6 NRMN 121 26 06 2016 0.00 23:55:00 2016-06-27
7 NRMN 121 27 06 2016 0.00 23:55:00 2016-06-28
8 NRMN 121 28 06 2016 0.00 23:55:00 2016-06-29
9 NRMN 121 29 06 2016 0.00 23:55:00 2016-06-30
10 NRMN 121 30 06 2016 28.19 23:55:00 2016-07-01
11 NRMN 121 01 07 2016 0.00 23:55:00 2016-07-02
12 NRMN 121 02 07 2016 0.51 23:55:00 2016-07-03
13 NRMN 121 03 07 2016 0.00 23:55:00 2016-07-04
14 NRMN 121 04 07 2016 0.00 00:00:00 2016-07-04
But these rainfall values very clearly don't match up with the rainfall as graphed below (peak rainfall occurs on June 27th and July 3rd).
plot(rainDat$TIME, rainDat$RAIN, xlab="Date", ylab="Cumulative Daily Rain (mm)")
Why isn't avgokmts working in this case? Is there an error in how I'm calling the function? Is there an alternative way to calculate daily rainfall using this dataset?

I'm pretty sure that the pkg author did not deal with the UTC<->CDT conversions properly for the precip readings. Here's a fragile way to get the max precip per day if you are using a single station. Extending the procedure to multiple stations should just be a matter of adding one more group_by() variable.
library(okmesonet)
library(dplyr)
library(ggplot2)
library(gridExtra)
rainDat <- okmts(begintime="2016-06-21 00:00:00",
endtime="2016-07-04 00:00:00",
station="NRMN",
variables="RAIN",
localtime=TRUE)
# Use the pkg calculation -------------------------------------------------
pkg_calc <- avgokmts(rainDat, by="day", metric="max")
# Begin our own calculations ----------------------------------------------
rainDat <- mutate(rainDat, day=format(TIME, "%Y-%m-%d"))
day_precip_max <- function(x) {
  # Time of the last reading (23:55) on the previous local day
  prev_day_last_reading_time <- as.POSIXct(sprintf("%s 23:55:00", x$day[1]), tz="America/Chicago") -
    as.difftime(1, units="days")
  prev_day_last_reading <- rainDat[rainDat$TIME==prev_day_last_reading_time, "RAIN"]
  # No previous day in the data (first day of the series): subtract nothing
  if (length(prev_day_last_reading) == 0) prev_day_last_reading <- 0
  x <- mutate(x, RAIN=RAIN - prev_day_last_reading)
  # tibble() replaces the deprecated data_frame()
  tibble(
    STID=x$STID[1], STNM=x$STNM[1],
    DAY=substr(x$day[1], 9, 10),
    MONTH=substr(x$day[1], 6, 7),
    YEAR=substr(x$day[1], 1, 4),
    RAIN=max(x$RAIN)
  )
}
new_calc <- group_by(rainDat, day) %>% do(day_precip_max(.)) %>% ungroup()
# Convert to POSIXct for common plotting axis ------------------------------
pkg_calc <- mutate(pkg_calc, day=as.POSIXct(sprintf("%s-%s-%s 23:55:00", YEAR, MONTH, DAY), tz="America/Chicago"))
new_calc <- mutate(new_calc, day=as.POSIXct(sprintf("%s-%s-%s 23:55:00", YEAR, MONTH, DAY), tz="America/Chicago"))
grid.arrange(
ggplot(rainDat, aes(x=TIME, y=RAIN)) +
geom_point() +
scale_x_datetime(date_breaks="1 day", date_labels="%d") +
labs(x=NULL, y="Rain", title="Raw readings")
,
ggplot(pkg_calc, aes(x=day, y=RAIN)) +
geom_point() +
scale_x_datetime(date_breaks="1 day", date_labels="%d", limits=range(rainDat$TIME)) +
labs(x=NULL, y="Rain", title="Package aggregation (max)")
,
ggplot(new_calc, aes(x=day, y=RAIN)) +
geom_point() +
scale_x_datetime(date_breaks="1 day", date_labels="%d", limits=range(rainDat$TIME)) +
labs(x=NULL, y="Rain", title="Manual aggregation (max)")
,
ncol=1
)
I have the plot displaying the max reading at 23:55:00.
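
A simpler aggregation can be sketched if one assumes (this is not confirmed in the thread) that the cumulative RAIN value resets at 00:00 UTC rather than at local midnight. Under that assumption, grouping the readings by their UTC date and taking the maximum gives one total per accumulation period. The toy data frame below is synthetic and only mimics the shape of the okmts() output; utc_day and daily_max are illustrative names.

```r
# Synthetic cumulative readings every 6 hours over two UTC days
toy <- data.frame(
  TIME = seq(as.POSIXct("2016-06-21 00:00:00", tz = "UTC"),
             by = "6 hours", length.out = 8),
  RAIN = c(0, 1.5, 1.5, 4.0,   # first UTC day, accumulates to 4.0
           0, 0, 2.5, 2.5)     # second UTC day, accumulates to 2.5
)

# Label each reading with its UTC calendar date, whatever the display time zone
toy$utc_day <- format(toy$TIME, "%Y-%m-%d", tz = "UTC")

# The daily total is the largest cumulative value within each UTC day
daily_max <- aggregate(RAIN ~ utc_day, data = toy, FUN = max)
daily_max
```

With real data, TIME would come from okmts(); the only change is formatting it with tz = "UTC" before grouping.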

compare to next row group data.frame - count per group

I am pretty new to R and I have the following problem that I am trying to solve.
I would like to count the number of times that a wet day (just one) follows a dry day per month, averaged over all the years. The data is stored in a data.frame. Or, to put it simply:
I want to count the number of times that the following row (x+1) has a value > 0 when row x has a value of zero, per group (Month), averaged over all years.
I first thought that I could do it the same way as in the Stack Overflow question "compare to next row group data.table". Unfortunately I got the error:
Error in `[.data.frame`(weatherdata, , `:=`(PCPnextdat, PCP[match(Date + : unused argument (by = Month)
when executing the following:
weatherdata[, PCPnextdat := PCP[match(Date + 1, Date)] , by=Month]
The important columns in the data file, let's call it weatherdata, have the following structure and hold data for 36 years, from 01Jan1979 to 31July2014:
Date Year Month Day PCP
1979-01-01 1979 01 01 0.000
1979-01-02 1979 01 02 0.987 <---- FIRST DAY
1979-01-03 1979 01 03 0.876
1979-01-04 1979 01 04 0.000
1979-01-05 1979 01 05 0.234 <---- SECOND DAY
1979-01-06 1979 01 06 0.000
1979-01-07 1979 01 07 0.123 <----- THIRD DAY
1979-01-08 1979 01 08 1.899
So in this example the number of wet days that follow dry days is 3.
I already found a way to make a new column with the next day's precipitation (x+1), using:
weatherdata$PCP.next <- weatherdata$PCP[c(2:12986,1)]
This would give:
Date Year Month Day PCP PCP.next
1979-01-01 1979 01 01 0.000 0.987 <--- ONE
1979-01-02 1979 01 02 0.987 0.876
1979-01-03 1979 01 03 0.876 0.000
1979-01-04 1979 01 04 0.000 0.234 <--- TWO
1979-01-05 1979 01 05 0.234 0.000
1979-01-06 1979 01 06 0.000 0.123 <--- THIRD
1979-01-07 1979 01 07 0.123 1.899
1979-01-08 1979 01 08 1.899 0.000
What I would like to end up with is:
Month dry.wet.p.month
01 9.23
02 12.14
03 9.51
04 8.71
05 13.11
06 9.09
07 6.55
08 7.22
09 10.67
10 4.23
11 5.67
12 7.54
All help/tips/tricks are appreciated :) !

Here's a data.table option for what I think you're looking for, where dt is your weather data. First, count the dry-to-wet transitions per Year and Month. Then take the mean of those counts per Month.
library(data.table)
setDT(dt)
dt[, list(drywetpermonth = sum(PCP > 0 & shift(PCP == 0), na.rm = TRUE)),
by = list(Year, Month)][
, list(drywetpermonth = mean(drywetpermonth)), by = Month]
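
The same two-step logic can be sketched in base R on a small toy data set. The 1979 values are the eight rows from the question; the 1980 values are invented purely so that there is something to average. The column dry_wet and the helper vector prev_pcp are illustrative names, and the sketch assumes the rows are already sorted by date.

```r
weatherdata <- data.frame(
  Year  = rep(c(1979, 1980), each = 8),
  Month = "01",
  PCP   = c(0, 0.987, 0.876, 0, 0.234, 0, 0.123, 1.899,  # 1979: from the question
            0, 0, 0.5, 0.2, 0, 0, 0, 1.1)                # 1980: invented values
)

# Previous day's precipitation within each Year/Month block (NA on the first day)
prev_pcp <- ave(weatherdata$PCP, weatherdata$Year, weatherdata$Month,
                FUN = function(x) c(NA, head(x, -1)))

# TRUE where a wet day directly follows a dry day
weatherdata$dry_wet <- !is.na(prev_pcp) & prev_pcp == 0 & weatherdata$PCP > 0

# Step 1: count transitions per Year and Month
per_year_month <- aggregate(dry_wet ~ Year + Month, data = weatherdata, FUN = sum)

# Step 2: average those counts per Month
per_month <- aggregate(dry_wet ~ Month, data = per_year_month, FUN = mean)
per_month
```

As in the data.table answer, a transition that straddles a month boundary is not counted, because the lag is taken inside each Year/Month group.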

Time series SparkR missing value

I'm working with SparkR on time series and I have a question.
After some operations I got something like this, where DayHour represents the day and the hour of the ID's Value.
DayHour ID Value
01 00 4704 10
01 01 4705 11
.
.
.
04 23 4705 12
The problem is that I have some gaps, e.g. 01 01 and 01 02 are missing here:
DayHour ID Value
01 00 4704 13
01 03 4704 12
I have to fill the gaps in the whole dataset like this:
DayHour ID Value
01 00 4704 13
01 01 4704 0
01 02 4704 0
01 03 4704 12
For each ID, I have to fill the gaps with the missing DayHour, the ID, and Value = 0.
A solution in either R or SparkR would be useful.

I represented your data in a data frame df_r:
df_r <- data.frame(DayHour=c("01 00","01 01","01 02","01 03","01 06","01 07"),
ID = c(4704,4705,4705,4706,4706,4706),Value=c(10,11,12,13,14,15))
df_r
DayHour ID Value
1 01 00 4704 10
2 01 01 4705 11
3 01 02 4705 12
4 01 03 4706 13
5 01 06 4706 14
6 01 07 4706 15
where, for example, the hours 01 04 and 01 05 are missing.
# Remove the white space from DayHour
df_r$DayHour <- sub(" ", "", df_r$DayHour)
# Build every DayHour value in sequence (days 01-04, hours 00-23)
x <- 0:23
y <- 1:4
all_day_hour <- data.frame(Hour = rep(x,4), Day = rep(y,each=24))
all_day_hour$Hour <- sprintf("%02d", all_day_hour$Hour)
all_day_hour$Day <- sprintf("%02d", all_day_hour$Day)
all_day_hour_1 <- transform(all_day_hour,DayHour=paste0(Day,Hour))
# Keep only the DayHour column
all_day_hour_1 <- all_day_hour_1[c(3)]
# Loop over the IDs and fill the gaps for each one
library(dplyr)
df.new <- data.frame()
factors <- unique(df_r$ID)
for (i in seq_along(factors))
{
df_r1 <- filter(df_r, ID == factors[i])
# Merge with the full DayHour sequence; missing hours come back as NA
df_data1 <- merge(df_r1, all_day_hour_1, by="DayHour", all=TRUE)
# Fill in the ID and set Value to 0 for the added rows
df_data1$ID <- factors[i]
df_data1$Value[is.na(df_data1$Value)] <- 0
df.new <- rbind(df.new, df_data1)
}
