Substituting dates with number of days in time series - r

I have the following data on student scores from several pretests taken before their actual exam.
a<-(c("2013-02-25","2013-03-13","2013-04-24","2013-05-12","2013-07-12","2013-08-11","actual_exam_date"))
b<-c(300,230,400,NA,NA,NA,"2013-04-30")
c<-c(NA,260,410,420,NA,NA,"2013-05-30")
d<-c(300,230,400,NA,370,390,"2013-08-30")
df<-as.data.frame(rbind(b,c,d))
colnames(df)<-a
rownames(df)<-(c("student 1","student 2","student 3"))
The actual data set is much larger. Since the dates vary so much, while the timing of the pretests relative to the exam is fairly similar across students, I would rather convert the dates into the number of days before the exam and use those as the new column names. I understand that this will merge some of the columns, which is OK. How can I do that?

This is another good use case for reshape2, because you want to go to long form for plotting. For example:
# you are going to need the student id as a field
df$student_id <- row.names(df)
library('reshape2')
df2 <- melt(df, id.vars = c('student_id','actual_exam_date'),
variable.name = 'pretest_date',
value.name = 'pretest_score')
# drop empty observations
df2 <- df2[!is.na(df2$pretest_score),]
# these need to be dates
df2$actual_exam_date <- as.Date(df2$actual_exam_date)
df2$pretest_date <- as.Date(df2$pretest_date)
# date difference
df2$days_before_exam <- as.integer(df2$actual_exam_date - df2$pretest_date)
# scores need to be numeric
df2$pretest_score <- as.numeric(df2$pretest_score)
# now you can make some plots
library('ggplot2')
ggplot(df2, aes(x = days_before_exam, y = pretest_score, col=student_id) ) +
geom_line(lwd=1) + scale_x_reverse() +
geom_vline(xintercept = 0, linetype = 'dashed', lwd = 1) +
ggtitle('Pretest Performance') + xlab('Days Before Exam') + ylab('Pretest Score')

Here is one way to approach this one. I am sure there are many others. I commented the code to explain what is going on at each step:
# Load the libraries you need
library(tidyr)
library(dplyr)
library(ggplot2)
# Construct data frame you provided
a <- (c("2013-02-25","2013-03-13","2013-04-24","2013-05-12","2013-07-12","2013-08-11","actual_exam_date"))
b <- c(300,230,400,NA,NA,NA,"2013-04-30")
c <- c(NA,260,410,420,NA,NA,"2013-05-30")
d <- c(300,230,400,NA,370,390,"2013-08-30")
df <- as.data.frame(rbind(b,c,d))
colnames(df) <- a
# Add student IDs as a column instead of row names and move them to first position
df$StudentID <- row.names(df)
row.names(df) <- NULL
df <- select(df, StudentID, everything())
# Gather date columns as 'categories' with score as the new column value
newdf <- df %>% gather(Date, Score, -actual_exam_date, -StudentID) %>% arrange(StudentID)
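# Convert dates stored as text (or factors) into actual dates so we can compute days to exam
newdf$actual_exam_date <- as.Date(as.character(newdf$actual_exam_date))
newdf$Date <- as.Date(as.character(newdf$Date))
# Scores were read in as text as well, so make them numeric before plotting
newdf$Score <- as.numeric(as.character(newdf$Score))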
# Create a new column of days before exam per student ID (group) and filter
# out dates with missing scores for each student
newdf <- newdf %>% group_by(StudentID) %>% mutate(daysBeforeExam = as.integer(difftime(actual_exam_date, Date, units = 'days'))) %>% filter(!is.na(Score))
# Plot the trends using ggplot
ggplot(newdf, aes(x = daysBeforeExam, y = Score, col = StudentID, group = StudentID)) + geom_line(size = 1) + geom_point(size = 2)
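Both answers above stop at the long format, which is what plotting needs. If the wide layout asked for in the question (days before the exam as column names) is still wanted, the long data can be spread back out. The following is only a sketch: it rebuilds the example data with numeric scores directly, and it uses tidyr's pivot_longer() and pivot_wider() rather than the melt() and gather() calls used in the answers.

```r
library(dplyr)
library(tidyr)

# Example data from the question, rebuilt with numeric scores and a Date column
df <- data.frame(
  student_id = c("student 1", "student 2", "student 3"),
  `2013-02-25` = c(300, NA, 300),
  `2013-03-13` = c(230, 260, 230),
  `2013-04-24` = c(400, 410, 400),
  `2013-05-12` = c(NA, 420, NA),
  `2013-07-12` = c(NA, NA, 370),
  `2013-08-11` = c(NA, NA, 390),
  actual_exam_date = as.Date(c("2013-04-30", "2013-05-30", "2013-08-30")),
  check.names = FALSE
)

wide <- df %>%
  # long form: one row per student and pretest date
  pivot_longer(cols = -c(student_id, actual_exam_date),
               names_to = "pretest_date", values_to = "pretest_score") %>%
  filter(!is.na(pretest_score)) %>%
  # replace the calendar date by the number of days before the exam
  mutate(days_before_exam = as.integer(actual_exam_date - as.Date(pretest_date))) %>%
  select(student_id, days_before_exam, pretest_score) %>%
  # back to wide form, with days before the exam as column names
  pivot_wider(names_from = days_before_exam, values_from = pretest_score)

print(wide)
```

In this small example every student-and-day combination is unique, so each cell holds a single score. With real data, two pretests can fall on the same number of days before the exam for one student; pivot_wider() would then need a values_fn (for example mean) to combine them.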

Related

Plotting multiple slope plots for multiple variables (With a for loop) (Facing issues with key redundancy)

Although I managed to draw multiple slope plots with fake data (see the reproducible example below), I am having trouble adapting the code to my real data, and I keep running into errors due to key redundancy.
First, some context: I have a dataset with numerous "_x" and "_y" variables, which are measures at time 1 and time 2. They are recorded in columns, since each entry has a time1 and a time2. I would like to plot the slope for each individual, making one plot per variable (that is, per pair of variables).
With some help, I managed to do this for one set of variables in the following reproducible example, which has no "_x" or "_y" column names. Yet when I try to adapt this code (using select to take just those columns instead of the whole dataset, renaming the columns to mimic the example, changing the regex, and so on), I keep hitting key redundancy errors:
"Error in spread():
! Each row of output must be identified by a unique combination of keys.
Keys are shared for 195 rows:"
I suspect that this is because some values in my data are identical, but with the Id column that shouldn't be a problem, and I don't quite understand what I can do to solve it.
The foo example:
library(tidyverse)
Id <- rep(1:10)
a = c(5,10,15,12,13,25,12,13,11,9)
b = c(8,14,20,13,19,29,15,19,20,11)
c = c(10,14,20,1.5,9,21,13,21,11,10)
d = c(15,9,20,14,12,5,12,13,12,30)
group = as.factor( rep(1:2,each=5) )
data = data.frame(Id,a,b,c,d,group)
case_mapping <- data.frame(
key = c("a", "b", "c", "d"),
key2 = c("x1", "x2", "y1", "y2")
)
data %>%
gather(key, val, c(a:d)) %>%
left_join(case_mapping, by = "key") %>%
select(-key) %>%
extract(key2, into = c("key", "order"), "([a-z])([0-9])") %>%
spread(key, val) %>%
ggplot() +
aes(x, y, group = Id, color = group) + xlab("Age")+ #ggtitle(paste("Variable")+
geom_point() +
geom_line()
And now an example of my data.
library(tidyverse)
Id <- rep(1:10)
var1_x = c(5,10,15,12,13,25,12,13,11,9)
var2_x = c(8,14,20,13,19,29,NA,19,20,11) # just adding some nas.
var3_x = c(10,14,20,1.5,9,21,13,21,11,10)
var1_y = var1_x+3
var2_y = var2_x*2
var3_y = c(10,14,20,1.5,9,21,13,21,11,10) #same, just to see.
age1 = c(15,9,20,14,12,5,12,13,12,30)
age2 = c(18,19,24,16,15,9,16,19,14,37)
group = as.factor( rep(1:2,each=5) )
data = data.frame(Id,var1_x,var2_x,var3_x, var1_y,var2_y,var3_y,age1,age2,group)
Now I create a for loop so that I can pair the variables correctly.
First we create two character vectors with the _x and _y column names:
sub_x = colnames(data)[2:4] # sub x
sub_y = colnames(data)[5:7] # suby
And now we should be able to implement the for loop.
for( i in 1:length(sub_x)) {
# We define the matching keys.
case_mapping <- data.frame(
key = c(sub_x[i],sub_y[i], "age1", "age2"),
key2 = c("x1", "x2", "y1", "y2")
)
# And now we should be able to plot this.
data %>%
gather(key, val, c(!!sym(sub_x[i]), !!sym(sub_y[i]), age1,age2 )) %>%
left_join(case_mapping, by = "key") %>%
select(-key) %>%
extract(key2, into = c("key", "order"), "([a-z])([0-9])") %>%
spread(key, val) %>%
ggplot() +
aes(x, y, group = Id, color = group) +
xlab("Age")+
geom_point() +
geom_line()
}
Yet this doesn't give me any results, and when I try to tweak it, it throws errors from gather. I hope you can help me understand what I am doing wrong.
P.S.: Sorry if my grammar is not fully correct; English is my second language.
Edit to clarify:
I intend to plot something like this for every variable (and if there is a way to label each slope with its ID, that would be really nice, so I don't have to look up in the data which one it corresponds to).
EDIT 2
With the help of Tjebo I have somewhat "solved it", but I still need to automate, via dplyr, the construction of this data_long2 from the data_long provided.
data_long2 <- data.frame( Id = rep(data_long$Id,2), Group = rep(data_long$group,2), Var= rep(data_long$var,2) , Valueage= c(data_long$age1,data_long$age2), Valuevar= c(data_long$x,data_long$y) )
ggplot(data_long2) +
## I've removed the grouping by ID, because there was only one observation per ID
aes(Valueage, Valuevar, color=Id) +
geom_point() +
geom_line(aes(group = Id))+
# geom_line() +
## you can for example facet by your new variable column
facet_grid(~Var)
#> Warning: Removed 1 rows containing missing values (geom_point).
And changing color to group
I think you might be overcomplicating things. As far as I understand, you struggle with reshaping your data and then plotting all variables, correct?
Below is one approach that makes use of the new-ish pivot_longer for reshaping (it has amazing functionality, especially with regard to "multiple gatherings") and then faceting instead of looping.
Update
You basically need to pivot longer twice
library(tidyverse)
Id <- rep(1:10)
var1_x = c(5,10,15,12,13,25,12,13,11,9)
var2_x = c(8,14,20,13,19,29,NA,19,20,11) # just adding some nas.
var3_x = c(10,14,20,1.5,9,21,13,21,11,10)
var1_y = var1_x+3
var2_y = var2_x*2
var3_y = c(10,14,20,1.5,9,21,13,21,11,10) #same, just to see.
age1 = c(15,9,20,14,12,5,12,13,12,30)
age2 = c(18,19,24,16,15,9,16,19,14,37)
group = as.factor( rep(1:2,each=5) )
data = data.frame(Id,var1_x,var2_x,var3_x, var1_y,var2_y,var3_y,age1,age2,group)
data_long <-
data %>%
## make use of the cool pivot_longer function
pivot_longer(cols = matches("_[xy]$"),
names_to = c("var", ".value"),
names_pattern = "(.*)_(.*)") %>%
## now make even longer! all y (currently confusingly called x and y) belong into one column
## and all x (currently called age1 and age2) in another column
## this is easier with a similar pattern in both, therefore renaming
## note the .value name is switched when compared with the first pivoting
rename(y1= x, y2 = y) %>%
pivot_longer(
matches("[12]$"),
names_to = c(".value", "set"),
names_pattern = "(.+)([0-9]+)"
)
ggplot(data_long) +
## I've removed the grouping by ID, because there was only one observation per ID
aes(age, y, color = as.character(Id)) +
geom_point() +
geom_line() +
## you can for example facet by your new variable column
facet_grid(~var)
#> Warning: Removed 2 rows containing missing values (geom_point).
To create each plot separately in a loop:
## split by your new variable and run a loop to create a list of plots
ls_p <- lapply(split(data_long, data_long$var), function(.x){
ggplot(.x) +
## I've removed the grouping by ID, because there was only one observation per ID
aes(age, y, color = as.character(Id)) +
geom_point() +
geom_line() +
## you can for example facet by your new variable column
facet_grid(~var)
} )
## you can then either print them separately or all together, e.g. with patchwork
patchwork::wrap_plots(ls_p) + patchwork::plot_layout(ncol = 1)
#> Warning: Removed 2 rows containing missing values (geom_point).
#> Warning: Removed 2 row(s) containing missing values (geom_path).
Created on 2022-05-31 by the reprex package (v2.0.1)
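A note on why the for loop in the question shows nothing: inside a loop (or a function), R does not auto-print values, so a ggplot object that is built but never printed is never drawn. Here is a minimal sketch with made-up data (the data frame and the column names var1, var2 and age are illustrative only) that prints each plot explicitly and also keeps it in a list:

```r
library(ggplot2)

# Small made-up data set, for illustration only
data <- data.frame(Id = 1:3,
                   var1 = c(5, 10, 15),
                   var2 = c(8, 14, 20),
                   age = c(15, 9, 20))

plots <- list()
for (v in c("var1", "var2")) {
  # .data[[v]] looks up the column whose name is stored in v
  p <- ggplot(data, aes(x = age, y = .data[[v]])) +
    geom_point() +
    ylab(v)
  print(p)         # without print(), nothing is drawn inside the loop
  plots[[v]] <- p  # keep the plot for later use
}
```

Keeping the plots in a list also makes it easy to combine them afterwards, for example with patchwork::wrap_plots() as in the answer above.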

R: Creative visualization in RStudio

I am at the final stages of a project in which I have been comparing the appraisal price with the sold price of different properties. The complete code for data collection and tidying is below.
At this stage I am looking at different ways to visualize my data. However, I am quite new to this, so my question is whether anyone has any "new" or special ways of visualizing data that they find useful or intuitive. I have given a couple of examples of what I am able to produce now using ggplot.
Additionally: right now my visualizations plot all 1275 observations every time. I would also like to visualize the mean and median of the Percentage, Sold and Tax variables, which are the ones I am most interested in; for example, the mean value of the Percentage column by year.
Appreciate any help!
Complete code:
#Step 1: Load needed library
library(tidyverse)
library(rvest)
library(jsonlite)
library(stringi)
library(dplyr)
library(data.table)
library(ggplot2)
#Step 2: Access the URL of where the data is located
url <- "https://www.forsvarsbygg.no/ListApi/ListContent/78635/SoldEstates/0/10/"
#Step 3: Direct JSON as format of data in URL
data <- jsonlite::fromJSON(url, flatten = TRUE)
#Step 4: Access all items in API
totalItems <- data$TotalNumberOfItems
#Step 5: Summarize all data from API
allData <- paste0('https://www.forsvarsbygg.no/ListApi/ListContent/78635/SoldEstates/0/', totalItems,'/') %>%
jsonlite::fromJSON(., flatten = TRUE) %>%
.[1] %>%
as.data.frame() %>%
rename_with(~str_replace(., "ListItems.", ""), everything())
#Step 6: Remove columns that are not needed
allData <- allData[, -c(1,4,8,9,11,12,13,14,15)]
#Step 7: remove whitespace and change to numeric in columns SoldAmount and Tax
#https://stackoverflow.com/questions/71440696/r-warning-argument-is-not-an-atomic-vector-when-attempting-to-remove-whites/71440806#71440806
allData[c("Tax", "SoldAmount")] <- lapply(allData[c("Tax", "SoldAmount")], function(z) as.numeric(gsub(" ", "", z)))
#Step 8: Remove rows where value is NA
#https://stackoverflow.com/questions/4862178/remove-rows-with-all-or-some-nas-missing-values-in-data-frame
alldata <- allData %>%
filter(if_all(where(is.numeric),
~ !is.na(.)))
#Step 9: Remove values below 10000 NOK on SoldAmount and Tax.
alldata <- alldata %>%
filter_all(any_vars(is.numeric(.) & . > 10000))
#Step 10: Calculate percentage change between tax and sold amount and create new column with percent change
#df %>% mutate(Percentage = number/sum(number))
alldata_Percent <- alldata %>% mutate(Percentage = (SoldAmount-Tax)/Tax)
Visualization
# Plot Percentage difference based on County
ggplot(data=alldata_Percent,mapping = aes(x = Percentage, y = County)) +
geom_point(size = 1.5)
#Plot County with both Date and Percentage difference
theme_set(new = ggthemes::theme_economist())
p <- ggplot(data = alldata_Percent,
mapping = aes(x = Date, y = Percentage, colour = County)) +
geom_line(na.rm = TRUE) +
geom_point(na.rm = TRUE)
p
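For the mean and median per year mentioned in the question, one possible approach is to summarise before plotting. This is a sketch on a small made-up data frame, not on the real API data: it assumes that Date is (or has been converted to) a Date object, which may not match the format the API actually returns, and the names toy and yearly are illustrative.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for alldata_Percent; only the two columns used here
toy <- data.frame(
  Date = as.Date(c("2019-03-01", "2019-09-15", "2020-02-10",
                   "2020-11-30", "2021-06-05")),
  Percentage = c(0.10, 0.30, -0.05, 0.15, 0.40)
)

# One row per year with the mean, the median and the number of sales
yearly <- toy %>%
  mutate(Year = as.integer(format(Date, "%Y"))) %>%
  group_by(Year) %>%
  summarise(mean_pct = mean(Percentage, na.rm = TRUE),
            median_pct = median(Percentage, na.rm = TRUE),
            n = n(),
            .groups = "drop")

# Bar chart of the yearly mean
p <- ggplot(yearly, aes(x = factor(Year), y = mean_pct)) +
  geom_col() +
  labs(x = "Year", y = "Mean percentage difference")
print(p)
```

The same summarise() call can be grouped by County instead of (or in addition to) Year, and median_pct can be plotted in place of mean_pct.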

How can I create a running median of diel cycle from multiyear data?

I think this problem may be of interest to others who deal with data smoothing of long-term environmental variables.
I have a dataset structured as below:
Columns:
Date Hour_Min Y(response variable)
These data are hourly, and I need to create a moving average of the diel cycle, but categorized by the Hour_Min. In other words, if I were to use a 31 day window, for a given day the running average data point for Hour_Min 00:00 would take the average of the day in question with the data points from Hour_Min 00:00 for the previous and the following 15 days. This would then repeat for that day's hour 1:00, etc. through the dataframe.
Unfortunately the data also have many NAs, which is problematic for moving window averages, although I think that can be solved using rollapply from the zoo package.
One approach I tried was to use tidyr's spread function to switch from long to wide format, to create a dataframe like this:
Date Y_Hour_Min_0000 Y_Hour_Min_0100 Y_Hour_Min_0200 etc...
If I could change the format in this way, I could then create new columns of running averages of each Y_Hour_Min_.... column. I would then need to gather everything together back to long format (another task I'm not sure how to approach).
However, I wasn't able to get the spread function to work so that it kept Date as a grouping variable associated with each Y_Hour_Min_.... column.
Another, possibly more elegant solution would be if there is a way to create a single new column in one step, using some combination of rollapply and custom function.
Any thoughts on how to implement code for this task will be greatly appreciated. Below I have a simple code to simulate my dataset:
Simulated data:
### Packages used below (%>%, mutate, separate):
library(dplyr)
library(tidyr)
### Create vector of hours/dates:
date <- seq(as.POSIXct("2016-01-01 00:00"), as.POSIXct("2016-12-30 23:00"), by="hour")
### Create vector of noisy sine function:
d <- 365
n <- 24*d # number of data points
t <- seq(from = 0, to = 2*d*pi, length.out=24*d)
a <- 6
b <- 1
c.norm <- rnorm(n)
amp <- 3
y <- a*sin(b*t)+c.norm*amp+15
### Randomly insert NAs into data:
ind <- which(y %in% sample(y, 1000))
y[ind]<-NA
### Create test dataframe:
df <- data.frame(dt = date, y = y) %>%
separate(dt, c("date", "hour_min"), sep=" ") %>%
mutate(date = as.Date(date))
I think this could work:
EDIT: Simplified code by adding fill = NA parameter to rollapply() function as suggested in the comments.
# rollapply() comes from the zoo package
library(zoo)
# add a complete date + time stamp
df$date_time <- paste(df$date, df$hour_min)
# make new column to store median data
df$median_y <- NA
# set rolling median width
width_roll <- 31
# do a rolling median for each hour, one at a time
# add NAs where no median can be calculated
for (i in levels(factor(df$hour_min))) {
df[df$hour_min == i, "median_y"] <- rollapply(df[df$hour_min == i, "y"],
width = width_roll,
median,
na.rm = TRUE,
fill = NA)
}
The approach is just to use the rollapply() function as you suggested, but only on one particular hour at a time. Then each of these is placed back into a new column in turn.
Here's an example for just one hour over the whole year, which makes it easier to visualize the median smoothing.
# Examples:
# plot one hour plus rolling median over time
# here i = "23:00:00"
plot(x = as.POSIXct(df[df$hour_min == i, "date_time"]),
y = df[df$hour_min == i, "y"],
type = "l",
col = "blue",
ylab = "y values",
xlab = i)
lines(x = as.POSIXct(df[df$hour_min == i, "date_time"]),
y = df[df$hour_min == i, "median_y"],
lwd = 3)
legend("topleft",
legend = c("raw", "median"),
col = c("blue", "black"),
lwd = 3)
Plot for a single hour
This is for everything (lots of data so not so easy to see but looks like it worked).
# plot all the data
plot(x = as.POSIXct(df$date_time),
y = df$y,
type = "l",
col = "blue",
ylab = "y values",
xlab = "Date")
lines(x = as.POSIXct(df$date_time),
y = df$median_y,
lwd = 3)
legend("topleft",
legend = c("raw", "median"),
col = c("blue", "black"),
lwd = 3)
Plot for all data
I'll take a crack at it, but it's not perfect. Hoping someone can come in and improve on it.
TL;DR:
df2 <- df %>% slice(-7441) %>% spread(hour_min, y)
mov_avg <- function(x) {c(rep(NA, 15), rollapply(x, width = list(-15:15), FUN = mean, align="center", na.rm=T), rep(NA, 15))}
avgs <- as.data.frame(matrix(unlist(lapply(df2[,2:ncol(df2)], mov_avg)), nrow = nrow(df2), byrow = FALSE))
colnames(avgs) <- paste0("avg_", colnames(df2[,2:ncol(df2)]))
final_df <- cbind(df2, avgs) %>%
gather(2:ncol(.), key = "hour_min", value = "value") %>%
arrange(date, hour_min)
In Depth:
Starting from your starting point, I added set.seed(1) so we can all follow along with the same data.
Your Initial Starting Point:
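### Packages used below (%>%, separate, spread, gather; rollapply from zoo):
library(dplyr)
library(tidyr)
library(zoo)
### Create vector of hours/dates:
set.seed(1)
date <- seq(as.POSIXct("2016-01-01 00:00"), as.POSIXct("2016-12-30 23:00"), by="hour")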
### Create vector of noisy sine function:
d <- 365
n <- 24*d # number of data points
t <- seq(from = 0, to = 2*d*pi, length.out=24*d)
a <- 6
b <- 1
c.norm <- rnorm(n)
amp <- 3
y <- a*sin(b*t)+c.norm*amp+15
### Randomly insert NAs into data:
ind <- which(y %in% sample(y, 1000))
y[ind]<-NA
### Create test dataframe:
df <- data.frame(dt = date, y = y) %>%
separate(dt, c("date", "hour_min"), sep=" ") %>%
mutate(date = as.Date(date))
The first thing was to do what you suggested and try the wide format. Normally I think this problem would be best solved by using dplyr's group_by on the hour_min column and doing the rolling average there, but I'm not sure how to do that.
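The first thing I noticed is that one hour is duplicated on one day: there are two observations for 1am, which breaks spread, so I removed one of them using slice(-7441). (This is consistent with the clocks going back at the end of daylight saving time, so the row to drop will depend on your time zone.)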
So let's spread your df.
df2 <- df %>% slice(-7441) %>% spread(hour_min, y)
As we can see, the dataframe is now 365 observations long (dates) and 25 columns wide (date + 24 hours):
dim(df2)
[1] 365 25
The next step, which is where this isn't perfect, uses rollapply. We can give rollapply a width = list(-15:15), which looks 15 days into the past and 15 days into the future and averages all 31 days together. The problem is that the first 15 days don't have a past 15, and the last 15 days don't have a future 15, so I padded these with NAs. I'm hoping someone can fix this part of my answer.
I created a custom function to do this:
mov_avg <- function(x) {c(rep(NA, 15), rollapply(x, width = list(-15:15), FUN = mean, align="center", na.rm=T), rep(NA, 15))}
If we just do the rollapply we will get a vector of length 335. I padded 15 in front and back to get us to our needed 365.
Next we want to lapply that function across our entire dataframe. That will give us a list of 24 vectors of length 365. We then want to turn that into a dataframe and bind it to our current dataframe.
Lastly we gather all of the columns back into the long format and arrange
avgs <- as.data.frame(matrix(unlist(lapply(df2[,2:ncol(df2)], mov_avg)), nrow = nrow(df2), byrow = FALSE))
colnames(avgs) <- paste0("avg_", colnames(df2[,2:ncol(df2)]))
final_df <- cbind(df2, avgs) %>%
gather(2:ncol(.), key = "hour_min", value = "value") %>%
arrange(date, hour_min)
I hope this helps.
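The answer above mentions two loose ends: doing the rolling average with group_by on hour_min, and the manual NA padding. One possible way to handle both is sketched below; it is not a tested drop-in for the real data. It assumes time stamps in UTC (which sidesteps the duplicated 1am row) and builds hour_min with format() instead of separate(). The names dt, smoothed and avg_y are illustrative.

```r
library(dplyr)
library(zoo)

set.seed(1)
# Hourly time stamps in UTC (assumption), so no daylight-saving duplicates
dt <- seq(as.POSIXct("2016-01-01 00:00", tz = "UTC"),
          as.POSIXct("2016-12-30 23:00", tz = "UTC"), by = "hour")
# Noisy sine wave with 1000 randomly placed NAs, similar to the question
y <- 6 * sin(seq(0, 2 * 365 * pi, length.out = length(dt))) +
  3 * rnorm(length(dt)) + 15
y[sample(length(y), 1000)] <- NA

df <- data.frame(date = as.Date(dt),
                 hour_min = format(dt, "%H:%M"),
                 y = y)

smoothed <- df %>%
  arrange(hour_min, date) %>%
  group_by(hour_min) %>%
  # 31-day centred window within each hour of the day;
  # fill = NA pads the first and last 15 days, so no manual rep(NA, 15)
  mutate(avg_y = rollapply(y, width = 31, FUN = mean, na.rm = TRUE,
                           fill = NA, align = "center")) %>%
  ungroup() %>%
  arrange(date, hour_min)

head(smoothed)
```

Replacing fill = NA with partial = TRUE would instead average over whatever part of the window is available at the two ends, at the cost of a shorter effective window there. Swapping mean for median gives the running median from the question title.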

how to display grouped values in r?

I have data in form: date, key, value, n,
where:
date is the first date and time when a variable key got a specific value.
key is the variable name
value is a value
n is the number of subsequent occurrences of the same value
For example, if a has a value of 20 from 8am to 11am on 2017-01-01, and there are four recordings during that time span, its n value for 2017-01-01 8am would be 4. The reason the data is highly aggregated like this is that there are billions of rows of data.
This is a small example:
r1 <- c("2017-01-01 08:00:00","a",20,5)
r2 <- c("2017-01-01 08:00:00","b",10,20)
r3 <- c("2017-01-01 14:00:00","a",35,4)
dat <- rbind(r1,r2,r3)
colnames(dat) <- c("Date","Key","Value","n")
My goal is to show the value distributions over time, using different plots including lines (for time series).
As the amount of data is huge, I'm looking for an effective way of ungrouping this kind of data (i.e. replicating the value n-times) or displaying the data as it is.
Here is how I would ungroup the data, using a dplyr chain. As you can see, it is quite similar to Roman's comment.
r1 <- c("2017-01-01 08:00:00","a",20,5)
r2 <- c("2017-01-01 08:00:00","b",10,20)
r3 <- c("2017-01-01 14:00:00","a",35,4)
dat <- as.data.frame(rbind(r1,r2,r3),stringsAsFactors = F)
colnames(dat) <- c("Date","Key","Value","n")
library(dplyr)
dat %>% mutate(n = as.numeric(n)) %>%
do(.[rep(1:nrow(.), .$n),])
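As an alternative sketch of the same ungrouping step, tidyr provides uncount(), which repeats each row according to a weights column. The example below rebuilds the small data set with proper column types (an assumption, since the question stores everything as text); the name expanded is illustrative.

```r
library(tidyr)

# Example data from the question, with typed columns
dat <- data.frame(
  Date = as.POSIXct(c("2017-01-01 08:00:00", "2017-01-01 08:00:00",
                      "2017-01-01 14:00:00"), tz = "UTC"),
  Key = c("a", "b", "a"),
  Value = c(20, 10, 35),
  n = c(5, 20, 4)
)

# Repeat every row n times; the n column is dropped by default
expanded <- uncount(dat, weights = n)
nrow(expanded)
```

With billions of underlying rows, expanding like this may not be feasible in memory, in which case plotting the aggregated rows directly and mapping n to an aesthetic is the lighter option.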
You could do this:
library(ggplot2)
dat <- as.data.frame(dat)
dat$Date <- as.character(dat$Date)
dat$n <- as.numeric(dat$n)
dat$Value <- as.numeric(dat$Value)
ggplot(dat) +
geom_point(aes(x = Date, y = Value, color = Key, stroke = n)) +
expand_limits(y = 0)

Using functionals instead of for loops to identify sequential changes in a vector

My data look like this:
I want to identify which "downward trend" each observation is part of, so I can group them and do things like make this graph:
My logic for distinguishing "downward trends" is that they end when the next observation has a higher measurement.
I've written a loop to do this, but I'm wondering if there's a better way to do it with one of the apply functions or something like them.
##Create sample data
df <- data.frame(timestamp = seq(1:20),
measurement = seq(10, 1, by = -1))
## This is the for loop I'm hoping to improve
df$downward.trend.seq <- 0
seq <- 1
for(i in 1:nrow(df)){
df$downward.trend.seq[i] <- seq
if (i < nrow(df) && df$measurement[i] < df$measurement[i+1]) {
seq <- seq + 1
}
}
## Code for plots
library(ggplot2)
library(dplyr)
ggplot(df, aes(x = timestamp, y = measurement)) + geom_point()
ggplot(df, aes(x = timestamp, y = measurement, group = downward.trend.seq)) + geom_line(aes(color=downward.trend.seq %>% factor))
You can use which and diff to identify where the downward trend changes occur, and use cumsum to fill out the group membership.
# set up new column with all 0s
df$downward.trend.seq <- 0
# use diff to identify indices to change to 1
df$downward.trend.seq[which(c(NA, diff(df$measurement)) > 0)] <- 1
# use cumsum to fill in proper group membership
df$downward.trend.seq <- cumsum(df$downward.trend.seq)
Here is a dplyr solution
df %>% mutate(data_group = cumsum( c(0, diff(measurement)) > 0 ))
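This takes the cumulative sum of a logical vector (TRUE wherever the measurement increased) and assigns the result to data_group. Note that the groups are numbered from 0 here, whereas the loop in the question starts at 1.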
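To check that the vectorised version really reproduces the loop, the two can be compared directly. This is a small sketch in base R on the sample data from the question; trend_loop and trend_vec are illustrative names, and 1 is added to the cumsum result so the numbering matches the loop.

```r
# Sample data: two downward runs from 10 to 1
df <- data.frame(timestamp = 1:20,
                 measurement = rep(10:1, times = 2))

# Loop version, as in the question
trend_loop <- integer(nrow(df))
current <- 1L
for (i in seq_len(nrow(df))) {
  trend_loop[i] <- current
  if (i < nrow(df) && df$measurement[i] < df$measurement[i + 1]) {
    current <- current + 1L
  }
}

# Vectorised version: a new group starts wherever the measurement increases
trend_vec <- cumsum(c(0, diff(df$measurement)) > 0) + 1L

identical(trend_loop, trend_vec)
```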

Resources