ggvis: plotting data in multiple series - r

Here is what I have:
A data frame which contains a date field, and a number of summary statistics.
Here's what I want:
I want a chart that allows me to compare the time series week over week, to see how the performance of the process this week compares to the previous one, for example.
What I have done so far:
##Get the week day name to display
summaryData$WeekDay <- format(summaryData$Date, format = '%A')
##Get the week number to differentiate the weeks
summaryData$Week <- format(summaryData$Date, format = '%V')
summaryData %>%
ggvis(x = ~WeekDay, y = ~Referrers) %>%
layer_lines(stroke = ~Week)
I expected it to create a chart with multiple coloured lines, each one representing a week in my data set. It does not do what I expect.

Try looking at reshaper to convert your data with a factor variable for each week, or split up the data with a dplyr::lag() command.
A general way of plotting multiple columns in ggvis is to use the following format:
summaryData %>%
ggvis() %>%
layer_lines(x = ~WeekDay, y = ~Referrers) %>%
layer_lines(x = ~WeekDay, y = ~Other)
I hope this helps

Related

How to separate a time series panel by the number of missing observations at the end?

Consider a set of time series having the same length. Some have missing data in the end, due to the product being out of stock, or due to delisting.
If the series contains at least four missing observations (in my case it is value = 0 and not NA) at the end, I consider the series as delisted.
In my time series panel, I want to separate the series with delisted id's from the other ones and create two different dataframes based on this separation.
I created a simple reprex to illustrate the problem:
library(tidyverse)
library(lubridate)
data <- tibble(id = as.factor(c(rep("1",24),rep("2",24))),
date = rep(c(ymd("2013-01-01")+ months(0:23)),2),
value = c(c(rep(1,17),0,0,0,0,2,2,3), c(rep(9,20),0,0,0,0))
)
I am searching for a pipeable tidyverse solution.
Here is one possibility to find delisted ids
data %>%
group_by(id) %>%
mutate(delisted = all(value[(n()- 3):n()] == 0)) %>%
group_by(delisted) %>%
group_split()
In the end I use group_split to split the data into two parts: one containing delisted ids and the other one contains the non-delisted ids.
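A minimal sketch of the same idea that ends with two named data frames instead of a list. It rebuilds the reprex with base `seq()` for the dates, assumes rows should be sorted by date within each id, and uses `tail(value, 4)` in place of the index arithmetic. The names `flagged`, `delisted_df`, and `active_df` are made up for illustration.

```r
library(dplyr)

# Same shape as the reprex: two ids, 24 monthly observations each
data <- tibble(
  id    = factor(rep(c("1", "2"), each = 24)),
  date  = rep(seq(as.Date("2013-01-01"), by = "month", length.out = 24), 2),
  value = c(rep(1, 17), 0, 0, 0, 0, 2, 2, 3, rep(9, 20), 0, 0, 0, 0)
)

flagged <- data %>%
  arrange(id, date) %>%                      # make sure "the end" really is the latest dates
  group_by(id) %>%
  mutate(delisted = all(tail(value, 4) == 0)) %>%  # TRUE if the last four values are all 0
  ungroup()

# Two separate data frames (hypothetical names)
delisted_df <- filter(flagged, delisted)
active_df   <- filter(flagged, !delisted)
```

Here id 1 has zeros in the middle but not at the end, so only id 2 lands in `delisted_df`.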

Subset data by months in r

I have a data frame of daily precipitation data that runs from Jan-1980 to Dec-2017. I have aggregated monthly averages (as seen in the image). How would I go about examining certain months (e.g. compare all Decembers)?
You could select the Decembers and draw a plot from them:
library(dplyr)
library(ggplot2)
dd.agg %>%
# filter all Decembers
dplyr::filter(mo == "12") %>%
# change name of last column
dplyr::rename(precipitation = 3) %>%
# change year to integer just in case it is char
dplyr::mutate(yr = as.integer(yr)) %>%
# order by year just in case it is unordered
dplyr::arrange(yr) %>%
# draw a bar chart
ggplot2::ggplot(aes(x = yr, y = precipitation)) +
ggplot2::geom_col()

How to set unit for diff()?

I have a data frame with several different variables (e.g. location, species, date & time). I'm trying to find the difference between two timestamps within the same column, according to location and species.
What my data frame looks like:
dat <- data.frame(
location = c("A","A","A","B","B","B","C","C","C"),
ID = c("x","y","x","x","x","y","y","x","x"),
datetime = c("2019-09-02 11:33:00","2019-09-03 10:00:00","2019-08-23 14:22:34","2019-09-12 12:18:00","2019-09-15 09:40:00","2019-09-15 09:40:00","2019-09-15 10:05:00","2019-08-23 13:58:18","2019-09-16 09:34:00"))
I grouped my data frame by location and ID and calculated the time difference with this:
Data1 <- Data %>% group_by(location, ID)
Data2<-mutate(Data1,diff:=c(1000, diff(datetime)))
This successfully gives me the time difference, but for some reason they're randomly in different units (seconds, minutes, hours). I tried this instead:
Data2<-mutate(Data1,diff:=c(1000, diff(datetime, units="mins")))
but the output doesn't change. Is there a way to set the units, and if not is there an alternative way to get the time difference in a data frame sorted by specific variables?
We can use difftime(), which takes a units argument:
library(dplyr)
Data1 %>%
mutate(diff = difftime(datetime, lag(datetime,
default = first(datetime)), units = 'mins'))
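A self-contained sketch of the difftime approach using the sample data from the question. It assumes the datetime column must first be converted from character to POSIXct (the `tz = "UTC"` choice is an assumption) and that rows should be ordered by time within each location/ID group. The name `res` is made up for illustration.

```r
library(dplyr)

dat <- data.frame(
  location = c("A","A","A","B","B","B","C","C","C"),
  ID = c("x","y","x","x","x","y","y","x","x"),
  datetime = c("2019-09-02 11:33:00","2019-09-03 10:00:00","2019-08-23 14:22:34",
               "2019-09-12 12:18:00","2019-09-15 09:40:00","2019-09-15 09:40:00",
               "2019-09-15 10:05:00","2019-08-23 13:58:18","2019-09-16 09:34:00")
)

res <- dat %>%
  mutate(datetime = as.POSIXct(datetime, tz = "UTC")) %>%  # assumed time zone
  group_by(location, ID) %>%
  arrange(datetime, .by_group = TRUE) %>%                  # oldest first within each group
  mutate(diff = difftime(datetime,
                         lag(datetime, default = first(datetime)),
                         units = "mins")) %>%              # force one unit everywhere
  ungroup()
```

Because the first row of each group is compared with itself, its difference is 0 minutes rather than an arbitrary placeholder.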

Plotting only 1 hourly datapoint (1 per day) alongside hourly points (24 per day) in R Studio

I am a bit stuck with some code. Of course I would appreciate a piece of code which sorts my dilemma, but I am also grateful for hints of how to sort that out.
Here goes:
First of all, I installed the packages (ggplot2, lubridate, and openxlsx).
The relevant part:
I extract a file from the website of an Italian gas TSO:
Storico_G1 <- read.xlsx(xlsxFile = "http://www.snamretegas.it/repository/file/Info-storiche-qta-gas-trasportato/dati_operativi/2017/DatiOperativi_2017-IT.xlsx",sheet = "Storico_G+1", startRow = 1, colNames = TRUE)
Then I created a data frame with the variables I want to keep:
Storico_G1_df <- data.frame(Storico_G1$pubblicazione, Storico_G1$IMMESSO, Storico_G1$`SBILANCIAMENTO.ATTESO.DEL.SISTEMA.(SAS)`)
Then change the time format:
Storico_G1_df$pubblicazione <- ymd_h(Storico_G1_df$Storico_G1.pubblicazione)
Now the struggle begins. In this example I would like to chart the two time series with two different Y axes, because their ranges are very different. That is not really a problem as such, because with the melt function and ggplot I can achieve it. However, there are NAs in one column, and I don't know how to work around that. In the incomplete (SAS) column I mainly care about the data point at 16:00, so ideally I would have hourly points on one chart and only one data point per day (at 16:00) on the second chart. I attached an unrelated example picture of the chart style I mean; in that chart both series have equally many data points, so it works fine.
Grateful for any hints.
Take care
library(lubridate)
library(ggplot2)
library(openxlsx)
library(dplyr)
# Use na.strings; it looks like NAs are encoded in several ways in the dataset
storico.xl <- read.xlsx(xlsxFile = "http://www.snamretegas.it/repository/file/Info-storiche-qta-gas-trasportato/dati_operativi/2017/DatiOperativi_2017-IT.xlsx",
sheet = "Storico_G+1", startRow = 1,
colNames = TRUE,
na.strings = c("NA","N.D.","N.D"))
#Select and rename the crazy column names
storico.g1 <- data.frame(storico.xl) %>%
select(pubblicazione, IMMESSO, SBILANCIAMENTO.ATTESO.DEL.SISTEMA..SAS.)
names(storico.g1) <- c("date_hour","immesso","sads")
# the date column looks like it is in ymd_h format
storico.g1 <- storico.g1 %>% mutate(date_hour = ymd_h(date_hour))
#Not sure exactly what you want to plot, but here is each point by hour
ggplot(storico.g1, aes(x= date_hour, y = immesso)) + geom_line()
# To work per day, group by the calendar date derived from date_hour
# (the count column lets you check that there are 24 points per day),
# then feed the new columns into ggplot
storico.g1 %>%
group_by(date = as.Date(date_hour)) %>%
summarise(count = n(),
daily.immesso = sum(immesso)) %>%
ggplot(aes(x = date, y = daily.immesso)) + geom_line()
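For the 16:00 data point mentioned in the question, one option is to filter on the hour before plotting. The sketch below uses a small made-up data frame (`toy`) that only mimics the column names of `storico.g1`; the values are invented purely to show the filtering step, and the names `sads_16`, `p_hourly`, and `p_daily` are hypothetical.

```r
library(dplyr)
library(lubridate)
library(ggplot2)

# Made-up stand-in for storico.g1: 48 hourly rows, sads only present at 16:00
toy <- tibble(
  date_hour = seq(ymd_h("2017-01-01 00"), by = "hour", length.out = 48),
  immesso   = seq_len(48),
  sads      = ifelse(hour(date_hour) == 16, 100, NA_real_)
)

# Keep one row per day: the 16:00 observation, dropping missing values
sads_16 <- toy %>%
  filter(hour(date_hour) == 16, !is.na(sads))

# Hourly series on one chart, the single daily point on another
p_hourly <- ggplot(toy, aes(x = date_hour, y = immesso)) + geom_line()
p_daily  <- ggplot(sads_16, aes(x = date_hour, y = sads)) + geom_point()
```

Keeping the two series in separate plot objects avoids the dual-axis problem; they can be stacked afterwards with whichever layout tool is preferred.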

Finding Avg/Sum of a Column Value

I have a nice jitter plot of my data, but I want to dig further into the data by finding the mean, sum, median, etc.
I don't know the syntax to separate the data by column value.
My data frame consists of 2 variables: Year (2010-2017) and Followers (numeric).
Code I used:
ggplot(MyData, aes(factor(Date), Followers)) +
geom_jitter(aes(color = factor(Date)))
This separated each Numeric data point into categorized groups of each year.
I was able to use sum(MyData$Followers) to get total Followers for all years.
I was also able to use count(MyData, 'Date') to get the frequency for each year.
But I'm not sure how to combine them to get total followers/avg followers for each individual year.
You can use dplyr:
df <- MyData %>%
group_by(Year) %>%
summarize(Mean = mean(Followers), Count = n())
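A small runnable sketch of the same pattern that also covers the total and median asked about in the question. The data here is made up for illustration; only the column names Year and Followers come from the question, and `by_year` is a hypothetical name.

```r
library(dplyr)

# Made-up illustrative data with the question's column names
MyData <- tibble(
  Year      = c(2010, 2010, 2011, 2011, 2011),
  Followers = c(10, 30, 5, 15, 40)
)

by_year <- MyData %>%
  group_by(Year) %>%
  summarise(
    Total  = sum(Followers),     # total followers per year
    Mean   = mean(Followers),    # average followers per year
    Median = median(Followers),
    Count  = n()                 # n() takes no arguments
  )
```

Note that `n()` counts the rows in the current group, so it is called without a column name.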
