R: casting multiple columns according to season

I have this dataset (run the dput output in the R console to have a look at it):
structure(list(Staz = c("Carmagnola", "Chieri", "Chivasso", "Ivrea",
"Moncalieri", "Orbassano"), Year = c(2004, 2004, 2004, 2004,
2004, 2004), Season = c("Autumn", "Autumn", "Autumn", "Autumn",
"Autumn", "Autumn"), Avg_T = c(11.7361111111111, 11.7361111111111,
11.7361111111111, 11.7361111111111, 11.7361111111111, 11.7361111111111
), Min_T = c(7.27222222222222, 7.27222222222222, 7.27222222222222,
7.27222222222222, 7.27222222222222, 7.27222222222222), Max_T = c(16.6722222222222,
16.6722222222222, 16.6722222222222, 16.6722222222222, 16.6722222222222,
16.6722222222222), Moisture = c(69.6388888888889, 69.6388888888889,
69.6388888888889, 69.6388888888889, 69.6388888888889, 69.6388888888889
), Rain = c(79.2, 79.2, 79.2, 79.2, 79.2, 79.2), Year_Bef = c(2004,
2004, 2004, 2004, 2004, 2004), Year_Bef_Two = c(2004, 2004, 2004,
2004, 2004, 2004)), .Names = c("Staz", "Year", "Season", "Avg_T",
"Min_T", "Max_T", "Moisture", "Rain", "Year_Bef", "Year_Bef_Two"
), row.names = c(NA, 6L), class = "data.frame")
From what you can see there is a variable named 'Season', defining the season of the data. I would like to split the weather variables ('Avg_T', Min_T', 'Max_T', 'Moisture', 'Rain') for every season, but having them in the same row. So, I would have just one row per study area for every year, containing information about seasonal data.
I tried to do that with the 'cast' and 'dcast' commands in the 'reshape' and 'reshape2' packages but it didn't work.
May somebody help me?
Thanks,
Jacopo

First, let's say your data lives in df. I rbind df to itself and change the season in the latter half to 'Summer' so that more than one season is present:
df <- rbind(df, df)
df$Season[7:12] <- 'Summer'
Then I get rid of the last two columns in df (they don't seem to be doing anything):
df <- df[, -c(9, 10)]
Now, we're ready to use the reshape function:
r_df <- reshape(df, timevar = 'Season', idvar = c('Staz', 'Year'), direction = 'wide')
I think that should give you what you're looking for.
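For reference, the dcast route the question mentions can also work, but it needs data.table's dcast, which (unlike reshape2::dcast) accepts several value.var columns at once. A sketch with a small toy version of the data (the exact values here are illustrative, not from the question):

```r
library(data.table)

# Toy version of the question's data: two stations, two seasons.
df <- data.frame(Staz   = rep(c("Chieri", "Ivrea"), times = 2),
                 Year   = 2004,
                 Season = rep(c("Autumn", "Summer"), each = 2),
                 Avg_T  = c(11.7, 11.7, 22.1, 22.1),
                 Rain   = c(79.2, 79.2, 30.5, 30.5))

# One row per station/year; one set of columns per season
# (names come out as Avg_T_Autumn, Avg_T_Summer, ...).
wide <- dcast(as.data.table(df), Staz + Year ~ Season,
              value.var = c("Avg_T", "Rain"))
print(wide)
```

With all five weather variables the call is the same, just with a longer value.var vector.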


R time series ggtsdisplay function won't run but I can do time series, ACF and PACF separately

I'm using the following libraries to try and work with time series data.
First I install fpp3 because it has the aus_airpassengers dataset. Here is the dput for that dataset or else you could get it from the package. It just has two columns, one for Year and one for Passengers.
library(fpp3)
aus_airpassengers <- structure(list(Year = c(1970, 1971, 1972, 1973, 1974, 1975, 1976,
1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987,
1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998,
1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009,
2010, 2011), Passengers = c(7.3187, 7.3266, 7.7956, 9.3846, 10.6647,
11.0551, 10.8643, 11.3065, 12.1223, 13.0225, 13.6488, 13.2195,
13.1879, 12.6015, 13.2368, 14.4121, 15.4973, 16.8802, 18.8163,
15.1143, 17.5534, 21.8601, 23.8866, 26.9293, 26.8885, 28.8314,
30.0751, 30.9535, 30.1857, 31.5797, 32.577569, 33.477398, 39.021581,
41.386432, 41.596552, 44.657324, 46.951775, 48.728837, 51.488427,
50.026967, 60.640913, 63.3603103378)), row.names = c(NA, -42L
), key = structure(list(.rows = structure(list(1:42), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = c(NA, -1L), class = c("tbl_df",
"tbl", "data.frame")), index = structure("Year", ordered = TRUE), index2 = "Year", interval = structure(list(
year = 1, quarter = 0, month = 0, week = 0, day = 0, hour = 0,
minute = 0, second = 0, millisecond = 0, microsecond = 0,
nanosecond = 0, unit = 0), .regular = TRUE, class = c("interval",
"vctrs_rcrd", "vctrs_vctr")), class = c("tbl_ts", "tbl_df", "tbl",
"data.frame"))
aus_airpassengers <- aus_airpassengers %>%
filter(Year >= 1970) %>%
filter(Year <= 2011)
I use autoplot which I think is from the ggplot2 package to see how passengers change over these 42 years.
autoplot(aus_airpassengers) +
labs(title="Air Passengers from 1970 to 2011") +
labs(x ="Date") +
labs(y = "Passenger numbers (1000's)")
Now I'm trying to use the ggtsdisplay function from the forecast package in order to plot the time series along with the ACF and PACF plots.
library(forecast)
ggtsdisplay(aus_airpassengers)
But it's giving me this error:
Error in ggtsdisplay(aus_airpassengers) :
ggtsdisplay is only for univariate time series
I looked online for what this error means, and I came across posts saying I probably have more than one value at the same time index. But that's not the case here. Both of these checks evaluate to 42, meaning each Year in the dataset is unique (and I can just look over the dataset to confirm it):
length(unique(aus_airpassengers$Year))
nrow(aus_airpassengers)
I'm also confused because if I use the feasts library to plot the ACF and PACF plots it works just fine.
library(feasts)
aus_airpassengers %>%
ACF(Passengers) %>%
autoplot()
aus_airpassengers %>%
PACF(Passengers) %>%
autoplot()
Does anyone know what's up with ggtsdisplay?
You are using ggtsdisplay from the forecast package which is designed for ts objects. You need to use gg_tsdisplay from the feasts package (which is loaded when you load fpp3).
In general, if you are using tsibbles and fable, you should not be loading the forecast package.
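A minimal sketch of the suggested fix; plot_type = "partial" is one way to get the ACF and PACF panels alongside the series:

```r
library(fpp3)  # loads feasts (gg_tsdisplay) and the aus_airpassengers data

# gg_tsdisplay() is the tsibble-aware counterpart of forecast::ggtsdisplay():
# it plots the series together with its ACF and PACF.
aus_airpassengers %>%
  gg_tsdisplay(Passengers, plot_type = "partial")
```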

Accidental column deletion - What is the proper way of creating data.table backups and deleting data.table columns

When I used to remove columns, I would always do something like:
DT[, Tax:=NULL]
Sometimes to make a backup, I would do something like
DT2 <- DT
But just a second ago this happened:
library(data.table)
DT <- structure(list(Province = c(1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2,
3), Tax = c(2000, 3000, 1500, 3200, 2000, 1500, 4000, 2000, 2000,
1000, 2000, 1500), year = c(2000, 2000, 2000, 2001, 2001, 2001,
2002, 2002, 2002, 2003, 2003, 2003)), row.names = c(NA, -12L), class = c("tbl_df",
"tbl", "data.frame"))
DT2 <- structure(list(Province = c(1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2,
3), Tax = c(2000, 3000, 1500, 3200, 2000, 1500, 4000, 2000, 2000,
1000, 2000, 1500), year = c(2000, 2000, 2000, 2001, 2001, 2001,
2002, 2002, 2002, 2003, 2003, 2003)), row.names = c(NA, -12L), class = c("tbl_df",
"tbl", "data.frame"))
setDT(DT)
setDT(DT2)
DT2 <- DT
# Removes Tax in BOTH datasets !!
DT2[, Tax:=NULL]
I remember something about this when starting to learn about data.table, but obviously this is not really desirable (for me at least).
What is the proper way to deal with this without accidentally deleting columns?
(Moved from comments.)
data.table uses referential semantics (it modifies in place, rather than copy-on-write like most of R), so your assignment DT2 <- DT means that both variables point to the same data. This is one of the gotchas of memory-efficient, in-place operations: if you goof, you lose it. Anything that protects you against this kind of mistake will be memory-inefficient, since it keeps one (or more) extra copies of the data sitting around.
If you need DT2 to be a different dataset, then use
DT2 <- copy(DT)
after which DT2[,Tax:=NULL] will not affect DT.
I find MattDowle's answer here to be informative/helpful (though that question explicitly asked about copy, not just the behavior you mentioned).
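To see the aliasing directly, data.table exposes an address() helper; a quick sketch:

```r
library(data.table)

DT  <- data.table(Province = 1:3, Tax = c(2000, 3000, 1500))
DT2 <- DT        # plain assignment: both names point at the same table
DT3 <- copy(DT)  # deep copy: an independent table

address(DT) == address(DT2)  # TRUE  -> modifying DT2 by reference hits DT too
address(DT) == address(DT3)  # FALSE -> DT3 is safe to modify

DT3[, Tax := NULL]
"Tax" %in% names(DT)         # TRUE  -> DT untouched by the delete on DT3
```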

Exclude incomplete classes from ggplot

I have a data frame and from it I'm plotting some trend lines, however, I want to exclude data where there aren't complete records (i.e. if the dose of drug C is NA in 2002, then I don't want C included on the plot at all). How do I achieve this in R?
Reproducible Example
df <- data.frame(year=c(2001, 2002, 2003, 2004, 2001, 2002, 2003, 2004, 2001, 2002, 2003, 2004),
dose=c(500, 600, 750, 550, 300, 330, 350, 390, 100, NA, 250, 125),
drug=c("A", "A", "A", "A", "B", "B", "B", "B", "C", "C", "C", "C"))
library(ggplot2)
ggplot(df) + geom_line(aes(x = year, y = dose, color = drug))
The tidyverse approach:
library(tidyverse)
ggplot(df %>% group_by(drug) %>% filter(!any(is.na(dose)))) +
geom_line(aes(x = year, y = dose, color = drug))
It groups by drug (group_by) and keeps only the drugs for which there is not (!) any NA dose.
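The same filter can be written in base R, in case the tidyverse isn't available. A sketch using ave() to flag, per drug, whether any dose is missing:

```r
df <- data.frame(year = rep(2001:2004, 3),
                 dose = c(500, 600, 750, 550, 300, 330, 350, 390, 100, NA, 250, 125),
                 drug = rep(c("A", "B", "C"), each = 4))

# ave() applies any(is.na(...)) within each drug group and recycles the
# result over that group's rows, so any drug with a missing dose drops
# out entirely.
complete_drugs <- df[!ave(is.na(df$dose), df$drug, FUN = any), ]
unique(complete_drugs$drug)  # "A" "B"
```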

Interpolate Time Series Using Weighted Loess in R

I have a dataset of multiple lakes with water level elevations through time. The observations are not regularly spaced and have many large gaps. Further, some of the older observations may be of lower or unknown quality. I created a separate model that does a reasonably good job of predicting water levels across time, but still misses the actual observations by varying amounts.
I would like to create a third imputed/interpolated set of data in which the solution is:
informed by the modeled values where observations are missing
crosses the highly weighted observations
and is informed by the lower weighted observations
So far, I have used the fable package's TSLM->interpolate to perform this. It works reasonably well, but I cannot see a way to introduce weighting into the process. Further, it relies too heavily on the global coefficients and intercepts, making it a bit too volatile when the modeled value significantly misses the observed. I am thinking that I need some sort of weighted loess that relies on local coefficients and can accommodate weighting.
library(dplyr)
library(tsibble)
library(fable)
library(ggplot2)
test_data <- data.frame(obs_year = c(2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009,2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009),
site_name = c("Lake1","Lake1","Lake1","Lake1","Lake1","Lake1","Lake1","Lake1","Lake1","Lake1","Lake2","Lake2","Lake2","Lake2","Lake2","Lake2","Lake2","Lake2","Lake2","Lake2"),
observed = c(100,200,NA, NA, NA, NA, 220, NA, NA, 125, NA,NA,425, NA, 475, NA, 450, 450, 475, 500),
weights = c(1,1,NA, NA, NA, NA, 2, NA, NA, 2, NA,NA,2, NA, 1, NA, 2, 2, 2, 2),
modeled = c(110,120,165,150, 200, 225, 240, 250, 150, 130, 450,430,415,400, 425, 450, 460, 460, 470, 490))
test_tsibble <- as_tsibble(test_data, key = site_name, index = obs_year)
tslm_interpolate <- test_tsibble %>%
group_by(site_name) %>%
model(lm = TSLM(observed~modeled)) %>%
fabletools::interpolate(test_tsibble)
tslm_interpolate <- left_join(tslm_interpolate, test_data, by = c("site_name", "obs_year")) %>%
dplyr::select(obs_year, site_name, observed = observed.y, imputed = observed.x, modeled, weights)
tslm_interpolate %>%
ggplot(aes(x=obs_year))+
geom_line(aes(y = imputed), color = "blue")+
geom_line(aes(y = modeled), color = "red")+
geom_point(aes(y = observed), color = "green")+
facet_wrap(~site_name, scales = "free_y")
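One direction to explore (an illustration, not from the post): stats::loess() accepts a weights argument, so the observed points (at their stated weights) can be stacked with the modeled values (at a small assumed weight, here 0.25) and smoothed per lake. The helper name, the model weight, and the span are all assumptions for the sketch:

```r
# Stack observed (stated weights) and modeled (small assumed weight) points,
# then fit a weighted loess per lake and predict at every year.
loess_impute <- function(d, model_weight = 0.25) {
  pts <- rbind(data.frame(year = d$obs_year, value = d$observed, w = d$weights),
               data.frame(year = d$obs_year, value = d$modeled,  w = model_weight))
  pts <- pts[!is.na(pts$value), ]
  fit <- loess(value ~ year, data = pts, weights = pts$w, span = 0.9)
  predict(fit, newdata = data.frame(year = d$obs_year))
}

test_data <- data.frame(
  obs_year  = rep(2000:2009, 2),
  site_name = rep(c("Lake1", "Lake2"), each = 10),
  observed  = c(100, 200, NA, NA, NA, NA, 220, NA, NA, 125,
                NA, NA, 425, NA, 475, NA, 450, 450, 475, 500),
  weights   = c(1, 1, NA, NA, NA, NA, 2, NA, NA, 2,
                NA, NA, 2, NA, 1, NA, 2, 2, 2, 2),
  modeled   = c(110, 120, 165, 150, 200, 225, 240, 250, 150, 130,
                450, 430, 415, 400, 425, 450, 460, 460, 470, 490))

imputed <- unlist(lapply(split(test_data, test_data$site_name), loess_impute))
```

Raising model_weight pulls the curve toward the model where observations are sparse; lowering it makes the fit hug the weighted observations more tightly.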

Sum over rows (rollapply) with time decay

This is a follow-on question to one I posted earlier (see Sum over rows with multiple changing conditions R data.table for more details). I want to calculate how many times the 3 subjects have experienced an event in the last 5 years, so I have been summing over a rolling window using rollapply from the zoo package. This assumes that the experience 5 years ago is as important as the experience 1 year ago (same weighting), so now I want to include a time decay for the experience that enters the sum. This basically means that the experience 5 years ago should not enter the sum with the same weighting as the experience 1 year ago.
In my case I want an age-dependent decay (though for other applications faster or slower decays, such as square roots or squares, could make sense).
For example, let's assume I have the following data (building on the previous data for clarity):
mydf <- data.frame (Year = c(2000, 2001, 2002, 2004, 2005,
2007, 2000, 2001, 2002, 2003,
2003, 2004, 2005, 2006, 2006, 2007),
Name = c("Tom", "Tom", "Tom", "Fred", "Gill",
"Fred", "Gill", "Gill", "Tom", "Tom",
"Fred", "Fred", "Gill", "Fred", "Gill", "Gill"))
# Create an indicator for the experience
mydf$Ind <- 1
# Load require packages
library(data.table)
library(zoo)
# Set data.table
setDT(mydf)
setkey(mydf, Name,Year)
# Perform cartesian join to calculate experience. I2 is the new experience indicator
m <- mydf[CJ(unique(Name),seq(min(Year)-5, max(Year))),allow.cartesian=TRUE][,
list(Ind = unique(Ind), I2 = sum(Ind,na.rm=TRUE)),
keyby=list(Name,Year)]
# This is the approach I have been taking so far. Note that it is a simple rolling sum of I2
m[,Exp := rollapply(I2, 5, function(x) sum(head(x,-1)),
align = 'right', fill=0),by=Name]
So the question now is: how can I include an age-dependent decay in this calculation? To model it, I need to divide each experience by its age before it enters the sum.
I have been trying to get it to work using something along these lines:
m[,Exp_age := rollapply(I2, 5, function(x) sum(head(x,-1)/(tail((Year))-head(Year,-1))),
align = 'right', fill=0),by=Name]
But it does not work. I think my main problem is that I cannot get the age of the experience right so I can divide by the age in the sum. The result should look like the Exp_age column in the myres data.frame below
myres <- data.frame(Name = c("Fred", "Fred", "Fred", "Fred", "Fred",
"Gill", "Gill", "Gill", "Gill", "Gill", "Gill",
"Tom", "Tom", "Tom", "Tom", "Tom"),
Year = c(2003, 2004, 2004, 2006, 2007, 2000, 2001, 2005,
2005, 2006, 2007, 2000, 2001, 2002, 2002, 2003),
Ind = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
Exp = c(0, 1, 1, 3, 4, 0, 1, 1, 1, 2, 3, 0, 1, 2, 2, 4),
Exp_age = c(0, 1, 1, 1.333333333, 1.916666667, 0, 1, 0.45,
0.45, 2.2, 2, 0, 1, 1.5, 1.5, 2.833333333))
Any pointers would be greatly appreciated!
If I understand you correctly, you are trying to do a rollapply with width = 5, and rather than a simple sum you want a weighted sum, where the weights reflect the age of the experience within the 5-year window. I would do this: first set the key in your data.table so that it is properly ordered by Name and Year; then you know that the last item in your x variable is the youngest and the first item is the oldest (you do this in your code already). I can't quite tell which way you want the weights to go (whether the youngest or the oldest should get the greatest weight), but you get the point:
setkey(m, Name, Year)
my_fun <- function(x) { w <- seq_along(x); sum(x * w) }
m[,Exp_age:=rollapply(I2, width=5, by=1, fill=NA, FUN=my_fun, by.column=FALSE, align="right") ,by=Name]
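To get the 1/age decay the question describes (rather than the linear weights above), the same rollapply pattern works with a decay function. A sketch on a toy vector; decay_fun and Exp_decay are illustrative names, and this assumes one contiguous row per year within each Name, which the CJ join in the question guarantees:

```r
library(zoo)

# With align = "right" and width = 5, head(x, -1) holds the 4 prior
# years, oldest first, so their ages are 4, 3, 2, 1: divide each prior
# indicator by its age before summing.
decay_fun <- function(x) {
  prior <- head(x, -1)
  sum(prior / rev(seq_along(prior)))
}

# Toy run: an event every year.
rollapply(c(1, 1, 1, 1, 1, 1), width = 5, FUN = decay_fun,
          align = "right", fill = 0)
# -> 0.000000 0.000000 0.000000 0.000000 2.083333 2.083333

# In the data.table from the question this would slot in as:
# m[, Exp_decay := rollapply(I2, 5, decay_fun, align = "right", fill = 0), by = Name]
```

The 2.083333 is 1/4 + 1/3 + 1/2 + 1, i.e. four prior events discounted by their ages.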
