Group R table entries by month

I have a CSV file with two columns, a Unix timestamp and a version string. What I ultimately want to achieve is to group the data by month and plot it, so that the individual months are the entries on the x axis, and a line is plotted for each unique version string, where the y axis values represent the number of hits in that month.
Here is a small example CSV:
timestamp,version
1434974143,1.0.0
1435734004,1.1.0
1435734304,1.0.0
1435735386,1.2.0
I'm new to R, so I ran into several problems. First, I successfully read the CSV with
mydata <- read.csv("data.csv")
and figured out an ugly function that converts a single timestamp into an R date:
as_time <- function(val){
  return(head(as.POSIXct(as.numeric(as.character(val)), origin = "1970-01-01", tz = "GMT")))
}
But none of the several apply functions seemed to work on the table column.
So how do I create a data structure, that groups the version hits by month, and can be plotted later?
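For the apply problem specifically: no apply is needed, because as.POSIXct is vectorised and converts the whole column at once. A minimal sketch, assuming the CSV shown above:
mydata <- read.csv("data.csv")
# timestamp is read as numeric, so the whole column converts in one call
mydata$date <- as.POSIXct(mydata$timestamp, origin = "1970-01-01", tz = "GMT")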

It's easier than you think!
You are essentially looking for the hist function.
# Let's make some mock data
# Set the random seed for reproducibility
set.seed(12345)
my.data <- data.frame(timestamp = runif(1000, 1420000000, 1460000000),
                      version = sample(1:5, 1000, replace = TRUE))
my.data$timestamp <- as.POSIXct(my.data$timestamp, origin = "1970-01-01")
# Histogram of the data, irrespective of version
hist(my.data$timestamp, "month")
# If you want to see the version then split the data first...
my.data.split <- split(my.data, my.data$version)
# Then apply hist to each subset, without plotting
# (lapply rather than sapply, so the result stays a list for do.call)
counts <- lapply(my.data.split, function(x) {
  h <- hist(x$timestamp, br = "month", plot = FALSE)
  h$counts
})
# Bind the per-version counts into a matrix (one row per version) and plot
counts <- do.call("rbind", counts)
barplot(counts, beside = TRUE)
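To label the bar groups by month and identify the versions, you can reuse the break points from the overall histogram. A sketch building on the code above (it assumes every version spans the same months, so the breaks line up):
# Month labels from the full histogram's midpoints
h0 <- hist(my.data$timestamp, "month", plot = FALSE)
labels <- format(as.POSIXct(h0$mids, origin = "1970-01-01"), "%b %y")
barplot(counts, beside = TRUE, names.arg = labels,
        legend.text = rownames(counts), las = 2)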

You can use the as.yearmon function from the zoo package to get year/month formats:
library(zoo)
dat$yearmon <- as.yearmon(as.POSIXct(dat$timestamp, origin = "1970-01-01", tz = "GMT"))
Then it depends on what you want to do with your data. For example, the number of version hits per month, using dplyr (thanks to @Frank for fixing):
library(dplyr)
dat %>%
  group_by(yearmon, version) %>%
  summarise(hits = n())
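From there, the line-per-version plot the question describes could look like this; a sketch assuming ggplot2 and zoo's scale_x_yearmon, where monthly_hits is just an illustrative name:
library(ggplot2)
monthly_hits <- dat %>%
  group_by(yearmon, version) %>%
  summarise(hits = n())
ggplot(monthly_hits, aes(x = yearmon, y = hits, colour = version)) +
  geom_line() +
  scale_x_yearmon()  # yearmon axis support provided by zoo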

Related

Break down date into year and month

I have a df with the date under column "månad" ("month") in the format 2020M07.
When I plot this the x axis gets very crowded, so instead of plotting one continuous line I want to create one series per year, with the x axis only containing the months.
To do this I need a column in my df with only the year, so I can group on that variable (in ggplot), AND I also need a column with only the month (as x input for ggplot). How do I use the "månad" column to achieve this? Is there something like Excel's LEFT function that can be used with dplyr's mutate? Or how else would I do this?
Maybe my idea isn't the best, so feel free to respond both to my suggested solution and with a better one if you have it!
Thanks!
EDIT: dump of the df in its current state.
Code:
library(pxweb)
library(tidyverse)
library(astsa)
library(forecast)
library(scales)
library(plotly)
library(zoo)
library(lubridate)
# PXWEB query
pxweb_query_list_BAS <-
  list("Region"       = c("22"),
       "Kon"          = c("1+2"),
       "SNI2007"      = c("A-U+US"),
       "ContentsCode" = c("000005F3"))
# Download data
px_data_BAS <-
  pxweb_get(url = "https://api.scb.se/OV0104/v1/doris/sv/ssd/START/AM/AM0210/AM0210B/ArbStDoNMN",
            query = pxweb_query_list_BAS)
# Convert to data.frame
df_syss_natt <- as.data.frame(px_data_BAS, column.name.type = "text", variable.value.type = "text") %>%
  rename(syss_natt = 'sysselsatta efter bostadens belägenhet') %>%
  filter(månad > 2020)
# Plot data
ggplot(df_syss_natt, aes(x = månad, y = syss_natt, group = 1)) +
  geom_point() +
  geom_line(color = "red")
Very crowded output with the current single series (plot omitted).
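Since no answer is recorded above, here is a minimal sketch of the LEFT-style split being asked for, using substr inside mutate (it assumes månad is a character column in the 2020M07 format):
library(dplyr)
library(ggplot2)
df_syss_natt <- df_syss_natt %>%
  mutate(year  = substr(månad, 1, 4),   # "2020M07" -> "2020"
         month = substr(månad, 6, 7))   # "2020M07" -> "07"
# One series per year, months on the x axis
ggplot(df_syss_natt, aes(x = month, y = syss_natt, colour = year, group = year)) +
  geom_line()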

Simulating samples from Gamma distribution in R

I'm having trouble with a programming assignment.
From the previous questions, I have a list of 49 elements, each of which is sample data of size 10000.
For the last question, I have to calculate the mean of the first n sample values, with n between one and ten thousand, within each dataset.
I then have to plot these running averages for each dataset.
I've been trying to create lists/vectors of the running averages, but it's not working out.
Is there anything I can do?
Function for running average:
run_avg <- function(x, n_max){
  a <- c(1:n_max)
  # For each cutoff y, take the mean of the first y values
  r_avg <- sapply(a, FUN = function(y) mean(x[1:y]))
  return(r_avg)
}
In your case, n_max should equal 10000.
This function then creates the running averages for one dataset.
It then has to be applied to all datasets. You could use lapply for this if your datasets are stored in a list; another approach could be a loop or something like that.
Edit: I see that your datasets are in a list, so simply use:
lapply(my_list, run_avg, n_max = 10000)
The running averages can be computed with the following.
res <- lapply(x, function(y){
  sapply(seq_along(y), function(k) mean(y[1:k]))
})
Then, to get the resulting list into a format better suited for plotting with the ggplot2 package, turn it into a data frame first, with the row names as a column.
df_res <- do.call(cbind.data.frame, res)
names(df_res) <- paste("Mean", seq_len(ncol(df_res)), sep = ".")
df_res <- cbind(df_res, id = as.integer(row.names(df_res)))
Now reshape from wide to long and plot.
library(tidyverse)
df_res %>%
  pivot_longer(
    cols = starts_with("Mean"),
    names_to = "Vector",
    values_to = "Mean"
  ) %>%
  ggplot(aes(id, Mean, colour = Vector)) +
  geom_point() +
  geom_line()
Test data.
set.seed(1234)
list_size <- 4 # 49 in the question
samp_size <- 20 # 10000 in the question
x <- lapply(seq.int(list_size), function(i) rgamma(samp_size, shape = i))
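After running the code above on this test data, a quick sanity check (a sketch): the first running mean of each vector must equal that vector's first element.
stopifnot(all(vapply(seq_along(res),
                     function(i) res[[i]][1] == x[[i]][1],
                     logical(1))))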

Plotting only 1 hourly datapoint (1 per day) alongside hourly points (24 per day) in RStudio

I am a bit stuck with some code. Of course I would appreciate a piece of code that sorts out my dilemma, but I am also grateful for hints on how to sort it out.
Here goes:
First of all, I installed the packages (ggplot2, lubridate, and openxlsx).
The relevant part:
I extract a file from an Italian gas TSO's website:
Storico_G1 <- read.xlsx(xlsxFile = "http://www.snamretegas.it/repository/file/Info-storiche-qta-gas-trasportato/dati_operativi/2017/DatiOperativi_2017-IT.xlsx",sheet = "Storico_G+1", startRow = 1, colNames = TRUE)
Then I created a data frame with the variables I want to keep:
Storico_G1_df <- data.frame(Storico_G1$pubblicazione, Storico_G1$IMMESSO, Storico_G1$`SBILANCIAMENTO.ATTESO.DEL.SISTEMA.(SAS)`)
Then change the time format:
Storico_G1_df$pubblicazione <- ymd_h(Storico_G1_df$Storico_G1.pubblicazione)
Now the struggle begins. In this example I would like to chart the two time series with two different y axes, because the ranges are very different. This is not really a problem as such, because with the melt function and ggplot I can achieve that. However, since there are NAs in one column, I don't know how to work around them. Since, in the incomplete (SAS) column, I mainly care about the data point at 16:00, I would ideally have hourly points on one chart and only one data point per day on the second chart (at said 16:00). I attached an unrelated example pic of the chart style I mean; however, in the attached chart I have equally many data points on both charts, and hence it works fine.
Grateful for any hints.
Take care
library(lubridate)
library(ggplot2)
library(openxlsx)
library(dplyr)
# Use na.strings: it looks like NAs can take several values in this dataset
storico.xl <- read.xlsx(xlsxFile = "http://www.snamretegas.it/repository/file/Info-storiche-qta-gas-trasportato/dati_operativi/2017/DatiOperativi_2017-IT.xlsx",
                        sheet = "Storico_G+1", startRow = 1,
                        colNames = TRUE,
                        na.strings = c("NA", "N.D.", "N.D"))
# Select and rename the crazy column names
storico.g1 <- data.frame(storico.xl) %>%
  select(pubblicazione, IMMESSO, SBILANCIAMENTO.ATTESO.DEL.SISTEMA..SAS.)
names(storico.g1) <- c("date_hour","immesso","sads")
# The date column is in the format ymd_h
storico.g1 <- storico.g1 %>% mutate(date_hour = ymd_h(date_hour))
#Not sure exactly what you want to plot, but here is each point by hour
ggplot(storico.g1, aes(x= date_hour, y = immesso)) + geom_line()
# For each day you can group; as.Date() truncates the POSIXct timestamps to days
# (note: for POSIXct input, as.Date()'s second argument is a time zone, not a format)
# You can check there are 24 points per day
# Feed the new columns into the ggplot
storico.g1 %>%
  group_by(date = as.Date(date_hour)) %>%
  summarise(count = n(),
            daily.immesso = sum(immesso)) %>%
  ggplot(aes(x = date, y = daily.immesso)) + geom_line()
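The question also asks about keeping only the 16:00 SAS value each day; a minimal sketch building on the columns created above (lubridate's hour() extracts the hour of day):
# One point per day: the 16:00 observation of the sparse SAS series
storico.g1 %>%
  filter(hour(date_hour) == 16) %>%
  ggplot(aes(x = date_hour, y = sads)) +
  geom_point() +
  geom_line()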

Moving averages

I have daily data for over 100 years that looks like
01.01.1956 12
02.01.1956 9
03.01.1956 -12
04.01.1956 7
etc.
I wish to calculate the 30-year running average for this huge dataset. I tried converting the data into a time series, but still can't figure out how to go about it. I would prefer a simple method that works with a data.frame.
I guess the preparation is the difficult part, considering some leap years.
So I will show one way of preparing the data before using the already mentioned runmean function from the caTools package.
First we create example data (which is not necessary for you, but helps the understanding).
Second, I divide the data frame into a list of data frames, one for each year, and take the mean value of each year. These two steps could be done at once, but I think the separated way is easier to understand and to adapt.
# Example data
Days <- seq(as.Date("1958-01-01"), as.Date("2015-12-31"), by = "days")
Values <- runif(length(Days))
DF <- data.frame(Days = Days, Values = Values)
# Start of script
Years <- format(DF$Days, "%Y")
UniqueYears <- unique(format(DF$Days, "%Y"))
# Create subsets by year:
# for every unique year, look up which elements of Days fall into that year
YearlySubset <- lapply(UniqueYears, function(x){
  DF[which(Years == x), ]
})
YearlyMeanValues <- sapply(YearlySubset, function(x){
  mean(x$Values)
})
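As mentioned, the two steps could also be done at once; a one-line equivalent using tapply (a sketch):
# Mean of Values per calendar year, in a single step
YearlyMeanValues <- tapply(DF$Values, format(DF$Days, "%Y"), mean)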
Now the running mean is applied:
#install.packages("caTools")
require(caTools)
RM <- data.frame(Years = UniqueYears, RunningMean30y = runmean(YearlyMeanValues, 30))
In case I didn't get you right at first and you want a running mean for every day within about 30 years, you could of course simply do:
RM <- cbind(DF, runmean(DF$Values, 365 * 30))
And considering your problems creating a time series:
DF[ , 1] <- as.Date(DF[ , 1], format = "%Y.%m.%d")
I would also suggest exploring RcppRoll in combination with dplyr, which provides a fairly convenient solution for calculating rolling averages, sums, and the like.
Code
# Libs
library(RcppRoll) # 'roll'-ing functions for R vectors and matrices.
library(dplyr) # data grammar (convenience)
library(zoo) # time series (convenience)
library(magrittr) # compound assignment pipe-operator (convenience)
# Data
data("UKgas")
## Convert to data frame to make example better
UKgas <- data.frame(Y = as.matrix(UKgas), date = time(UKgas))
# Calculations
UKgas %<>%
# To make example more illustrative I converted the data to a quarterly format
mutate(date = as.yearqtr(date)) %>%
arrange(date) %>%
# The window size can be changed to reflect any period
mutate(roll_mean = roll_mean(Y, n = 4, align = "right", fill = NA))
Notes
As the data provided in the example was fairly modest I used quarterly UK gas consumption data available via the data function in the utils package.
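A quick sanity check on the window alignment (a sketch): with align = "right" and fill = NA, the first three roll_mean entries are NA, and the fourth equals the plain mean of the first four quarterly values.
stopifnot(all(is.na(UKgas$roll_mean[1:3])))
stopifnot(isTRUE(all.equal(UKgas$roll_mean[4], mean(UKgas$Y[1:4]))))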

R Zoo - aggregating many records with same time entry

I consistently need to take transaction data and aggregate it by Day, Week, Month, Quarter, Year - essentially, it's time-series data. I started to apply zoo/xts to my data in hopes I could aggregate the data faster, but I either don't fully understand the packages' purpose or I'm trying to apply it incorrectly.
In general, I would like to calculate the number of orders and the number of products ordered by category, by time period (day, week, month, etc).
# Create the data
clients <- 1:10
dates <- seq(as.Date("2012/1/1"), as.Date("2012/9/1"), "days")
categories <- LETTERS[1:5]
products <- data.frame(numProducts = 1:10,
                       category = sample(categories, 1000, replace = TRUE),
                       clientID = sample(clients, 1000, replace = TRUE),
                       OrderDate = sample(dates, 1000, replace = TRUE))
I could do this with plyr and reshape, but I think this is a roundabout way to do so.
#Aggregate by date and category
products.day <- ddply(products, .(OrderDate, category), summarize, numOrders = length(numProducts), numProducts = sum(numProducts))
#Aggregate by Month and category
products.month <- ddply(products, .(Month = months(OrderDate), Category = category), summarize, numOrders = length(numProducts), numProducts = sum(numProducts))
#Make a wide-version of the data frame
products.month.wide <- cast(products.month, Month~Category, sum)
I tried to apply zoo to the data like so:
products.TS <- aggregate(products$numProducts, yearmon, mean)
It returned this error:
Error in aggregate.data.frame(as.data.frame(x), ...) :
'by' must be a list
I've read the zoo vignettes and documentation, but every example that I've found only shows 1 record/row/entry per time entry.
Do I have to pre-aggregate the data I want to time-series on? I was hoping that I could simply group by the fields I want, then have the months or quarters get added to the data frame incrementally to the X-axis.
Is there a better approach to aggregating this or a more appropriate package?
products$numProducts is a vector, not a zoo object. You'd need to create a zoo object before you can use method dispatch to call aggregate.zoo.
pz <- with(products, zoo(numProducts, OrderDate))
products.TS <- aggregate(pz, as.yearmon, mean)
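To also get the by-category breakdown the question asks for, one option is read.zoo with split, which builds one column per category; a sketch assuming the products data frame above:
library(zoo)
# One column per category; duplicate (date, category) pairs are summed
z <- read.zoo(products[, c("OrderDate", "category", "numProducts")],
              split = "category", aggregate = sum)
# Monthly product totals per category (na.rm skips days with no orders)
monthly <- aggregate(z, as.yearmon, sum, na.rm = TRUE)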
