This is a follow-up to a question I posted earlier (see "Sum over rows with multiple changing conditions R data.table" for more details). I want to calculate how many times the 3 subjects have experienced an event in the last 5 years, so I have been summing over a rolling window using rollapply from the zoo package. This assumes that an experience 5 years ago is as important as an experience 1 year ago (same weighting), so now I want to include a time decay for the experience that enters the sum: an experience 5 years ago should not enter the sum with the same weight as an experience 1 year ago.
In my case I want to include an age-dependent decay (although for other applications faster or slower decays, such as square roots or squares, could be possible).
For example, let's assume I have the following data (I build on the previous data for clarity):
mydf <- data.frame (Year = c(2000, 2001, 2002, 2004, 2005,
2007, 2000, 2001, 2002, 2003,
2003, 2004, 2005, 2006, 2006, 2007),
Name = c("Tom", "Tom", "Tom", "Fred", "Gill",
"Fred", "Gill", "Gill", "Tom", "Tom",
"Fred", "Fred", "Gill", "Fred", "Gill", "Gill"))
# Create an indicator for the experience
mydf$Ind <- 1
# Load required packages
library(data.table)
library(zoo)
# Set data.table
setDT(mydf)
setkey(mydf, Name,Year)
# Perform cartesian join to calculate experience. I2 is the new experience indicator
m <- mydf[CJ(unique(Name),seq(min(Year)-5, max(Year))),allow.cartesian=TRUE][,
list(Ind = unique(Ind), I2 = sum(Ind,na.rm=TRUE)),
keyby=list(Name,Year)]
# This is the approach I have been taking so far. Note that it is a simple rolling sum of I2
m[,Exp := rollapply(I2, 5, function(x) sum(head(x,-1)),
align = 'right', fill=0),by=Name]
So the question now is: how can I include an age-dependent decay in this calculation? To model this I need to divide each experience by its age before it enters the sum.
I have been trying to get it to work using something along these lines:
m[,Exp_age := rollapply(I2, 5, function(x) sum(head(x,-1)/(tail((Year))-head(Year,-1))),
align = 'right', fill=0),by=Name]
But it does not work. I think my main problem is that I cannot get the age of the experience right so I can divide by the age in the sum. The result should look like the Exp_age column in the myres data.frame below
myres <- data.frame(Name = c("Fred", "Fred", "Fred", "Fred", "Fred",
"Gill", "Gill", "Gill", "Gill", "Gill", "Gill",
"Tom", "Tom", "Tom", "Tom", "Tom"),
Year = c(2003, 2004, 2004, 2006, 2007, 2000, 2001, 2005,
2005, 2006, 2007, 2000, 2001, 2002, 2002, 2003),
Ind = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
Exp = c(0, 1, 1, 3, 4, 0, 1, 1, 1, 2, 3, 0, 1, 2, 2, 4),
Exp_age = c(0, 1, 1, 1.333333333, 1.916666667, 0, 1, 0.45,
0.45, 2.2, 2, 0, 1, 1.5, 1.5, 2.833333333))
Any pointers would be greatly appreciated!
If I understand you correctly, you are trying to do a rollapply with width=5 and rather than do a simple sum, you want to do a weighted sum. The weights are the age of the experience relative to the 5 year window. I would do this: first set the key in your data.table so that it has proper increasing order by Name, then you know that the last item in your x variable is the youngest and the first item is the oldest (you do this in your code already). I can't quite tell which way you want the weights to go (youngest to have greatest weight or oldest) but you get the point:
setkey(m, Name, Year)
my_fun = function(x) { w = 1:length(x); sum(x*w)}
m[,Exp_age:=rollapply(I2, width=5, by=1, fill=NA, FUN=my_fun, by.column=FALSE, align="right") ,by=Name]
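If the decay should be "divide by age", as the question describes, the weights would be 1/age rather than 1:length(x). Here is a minimal base R sketch of that idea, assuming one count per year with no gaps (which the cartesian join provides) and a window of 5 whose last element is the current year. The helper name decayed_sum and the hard-coded counts for Tom are only for illustration.

```r
# Per-year event counts for Tom, 1995..2007 (zero-padded, as after the cartesian join)
i2    <- c(0, 0, 0, 0, 0, 1, 1, 2, 1, 0, 0, 0, 0)
years <- 1995:2007

# Hypothetical helper: weight each past count in one window by 1 / age
decayed_sum <- function(x) {
  past <- head(x, -1)           # drop the current year, as in the Exp column
  age  <- rev(seq_along(past))  # oldest entry gets the largest age, newest gets 1
  sum(past / age)
}

width <- 5
exp_age <- vapply(seq_along(i2), function(i) {
  if (i < width) return(0)      # incomplete window, mirrors fill = 0
  decayed_sum(i2[(i - width + 1):i])
}, numeric(1))

# Decayed experience for Tom in 2002 and 2003
round(exp_age[years %in% c(2002, 2003)], 4)
```

For Tom this gives 1.5 in 2002 and 2.8333 in 2003, which agrees with the Tom rows of myres. The same helper could presumably be passed to rollapply(I2, 5, decayed_sum, align = 'right', fill = 0) inside the by=Name call. Note that some Gill values in myres (e.g. 0.45 = 1/4 + 1/5) seem to include an experience aged 5, which would correspond to a width of 6 here.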
I have a data set that consists of an ID, years and an index:
ID = c("ABW", "ABW", "FRA", "FRA", "FRA", "GER", "GER", "GER")
year = c(2000, 2005, 2000, 2002, 2008, 2005, 2008, 2010)
index = c(NA, NA, 4, NA, 8, NA, 6, NA)
df <- data.frame(ID, year, index)
I am trying to interpolate the missing values in the index but I want the interpolation to be restricted by the ID - e.g. I want R to interpolate the index for all rows with the ID "FRA" and then start the interpolation over again for all rows with the ID "GER". And if there are no values at all for a specific ID (like for the ID "ABW") then I want R to return no interpolated values either.
I have tried this code (but it does not take the ID into consideration):
df <- df %>% mutate(index = na.approx(index, rule = 2))
After the interpolation I want my index column to look like this:
index = c(NA, NA, 4, 6, 8, 6, 6, 6)
Does anyone know how I can do this?
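One possible way to restrict the interpolation to each ID is sketched below in base R, under a few assumptions: rows are already ordered by year within each ID, the interpolation is by row position (which is what the expected output implies, since FRA goes 4, 6, 8), and a group with a single known value is filled with that value. The helper name fill_by_position is made up for this sketch.

```r
ID    <- c("ABW", "ABW", "FRA", "FRA", "FRA", "GER", "GER", "GER")
year  <- c(2000, 2005, 2000, 2002, 2008, 2005, 2008, 2010)
index <- c(NA, NA, 4, NA, 8, NA, 6, NA)
df <- data.frame(ID, year, index)

# Hypothetical helper: interpolate one group's values by row position
fill_by_position <- function(y) {
  ok <- !is.na(y)
  if (!any(ok)) return(y)                          # nothing known: leave all NA
  if (sum(ok) == 1) return(rep(y[ok], length(y)))  # one known value: carry it
  # rule = 2 extends the nearest known value beyond the ends
  approx(which(ok), y[ok], xout = seq_along(y), rule = 2)$y
}

# ave() applies the helper separately within each ID
df$index <- ave(df$index, df$ID, FUN = fill_by_position)
df
```

On the sample data this leaves ABW as NA and fills FRA and GER within their own rows only. To interpolate proportionally to the year gaps instead, the positions passed to approx() would have to be replaced by the years.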
When I used to remove columns, I would always do something like:
DT[, Tax:=NULL]
Sometimes to make a backup, I would do something like
DT2 <- DT
But just a second ago this happened:
library(data.table)
DT <- structure(list(Province = c(1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2,
3), Tax = c(2000, 3000, 1500, 3200, 2000, 1500, 4000, 2000, 2000,
1000, 2000, 1500), year = c(2000, 2000, 2000, 2001, 2001, 2001,
2002, 2002, 2002, 2003, 2003, 2003)), row.names = c(NA, -12L), class = c("tbl_df",
"tbl", "data.frame"))
DT2 <- structure(list(Province = c(1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2,
3), Tax = c(2000, 3000, 1500, 3200, 2000, 1500, 4000, 2000, 2000,
1000, 2000, 1500), year = c(2000, 2000, 2000, 2001, 2001, 2001,
2002, 2002, 2002, 2003, 2003, 2003)), row.names = c(NA, -12L), class = c("tbl_df",
"tbl", "data.frame"))
setDT(DT)
setDT(DT2)
DT2 <- DT
# Removes Tax in BOTH datasets !!
DT2[, Tax:=NULL]
I remember something about this when starting to learn about data.table, but obviously this is not really desirable (for me at least).
What is the proper way to deal with this without accidentally deleting columns?
(Moved from comments.)
Since data.table uses reference semantics (in-place modification, not copy-on-modify like most of R), your assignment DT2 <- DT means that both variables point to the same data, and := then changes that one shared object. This is one of the gotchas with "memory-efficient operations" that rely on in-place work: if you goof, you lose it. Any approach that protects you against this kind of mistake will be memory-inefficient, since it keeps one (or more) copies of the data sitting around.
If you need DT2 to be a different dataset, then use
DT2 <- copy(DT)
after which DT2[,Tax:=NULL] will not affect DT.
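A small sketch of the difference, using a cut-down version of the data (the variable names alias and backup are just for illustration):

```r
library(data.table)

DT <- data.table(Province = c(1, 2, 3), Tax = c(2000, 3000, 1500))

alias  <- DT        # second name for the same object, no copy is made
backup <- copy(DT)  # independent copy

alias[, Tax := NULL]  # modifies the shared object in place

names(DT)      # Tax is gone here too
names(backup)  # Tax is still present
```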
I find Matt Dowle's answer here to be informative/helpful (though the question explicitly asked about copy, not just the behavior you mentioned).
I'm quite new to time series and I'm wondering whether there is a function similar to sma() from the smooth package for fitting a weighted moving average (WMA) to my time series.
I would like to fit a WMA model with the weights
weights <- (1/35)*c(-3, 12, 17, 12, -3)
I'm able to calculate the WMA values with the filter() function, but I would love to get output similar to that of the sma() function (including e.g. residuals, BIC, ...)
Income <- c(44649, 47507, 49430, 51128, 54453, 58712, 60533,
63091, 63563, 62857, 62481, 63685, 65596)
Year <- c(2000, 2001, 2002, 2003, 2004, 2005, 2006,
2007, 2008, 2009, 2010, 2011, 2012)
df <- data.frame(Year, Income)
library(smooth)
# simple moving average model
(sma5_fit <- sma(df$Income, order = 5))
# weighted moving average
wma5 <- filter(df$Income, filter = (1/35)*c(-3, 12, 17, 12, -3), sides = 2)
Any suggestions are welcome!
EDIT:
It would also be nice to calculate the first 2 and last 2 values of the weighted moving average. For now, I have to calculate them by hand with the following code (the weights come from Kendall's Time Series book):
n <- length(Income)
wma5[n-1] <- sum(1/35 * c(2, -8, 12, 27, 2) * c(Income[(n-4):(n)]))
wma5[n] <- sum(1/70 * c(-1, 4, -6, 4, 69) * c(Income[(n-4):(n)]))
wma5[2] <- sum(1/35 * c(2, 27, 12, -8, 2) * c(Income[1:5]))
wma5[1] <- sum(1/70 * c(69, 4, -6, 4, -1) * c(Income[1:5]))
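The filter() call and the four hand-computed end points could be wrapped in one function, roughly as in the sketch below. It only reuses the weights given above; the function name wma5_full is made up, and the "residuals" are simply observed minus smoothed values, not the output of a fitted model, so this does not provide a BIC.

```r
# Hypothetical helper: 5-term weighted moving average with the end-point
# weights quoted above, returning a plain numeric vector
wma5_full <- function(y) {
  n <- length(y)
  stopifnot(n >= 5)
  # interior values; stats:: avoids clashes with other packages' filter()
  out <- as.numeric(stats::filter(y, filter = c(-3, 12, 17, 12, -3) / 35,
                                  sides = 2))
  first5 <- y[1:5]
  last5  <- y[(n - 4):n]
  out[1]     <- sum(c(69, 4, -6, 4, -1) / 70 * first5)
  out[2]     <- sum(c(2, 27, 12, -8, 2) / 35 * first5)
  out[n - 1] <- sum(c(2, -8, 12, 27, 2) / 35 * last5)
  out[n]     <- sum(c(-1, 4, -6, 4, 69) / 70 * last5)
  out
}

Income <- c(44649, 47507, 49430, 51128, 54453, 58712, 60533,
            63091, 63563, 62857, 62481, 63685, 65596)

wma_fit   <- wma5_full(Income)
wma_resid <- Income - wma_fit  # observed minus smoothed
```

One quick sanity check is that every weight set sums to 1, so a straight line such as 1:13 should come back unchanged.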
I have this dataset (run it in the command line, to have a look at it)
structure(list(Staz = c("Carmagnola", "Chieri", "Chivasso", "Ivrea",
"Moncalieri", "Orbassano"), Year = c(2004, 2004, 2004, 2004,
2004, 2004), Season = c("Autumn", "Autumn", "Autumn", "Autumn",
"Autumn", "Autumn"), Avg_T = c(11.7361111111111, 11.7361111111111,
11.7361111111111, 11.7361111111111, 11.7361111111111, 11.7361111111111
), Min_T = c(7.27222222222222, 7.27222222222222, 7.27222222222222,
7.27222222222222, 7.27222222222222, 7.27222222222222), Max_T = c(16.6722222222222,
16.6722222222222, 16.6722222222222, 16.6722222222222, 16.6722222222222,
16.6722222222222), Moisture = c(69.6388888888889, 69.6388888888889,
69.6388888888889, 69.6388888888889, 69.6388888888889, 69.6388888888889
), Rain = c(79.2, 79.2, 79.2, 79.2, 79.2, 79.2), Year_Bef = c(2004,
2004, 2004, 2004, 2004, 2004), Year_Bef_Two = c(2004, 2004, 2004,
2004, 2004, 2004)), .Names = c("Staz", "Year", "Season", "Avg_T",
"Min_T", "Max_T", "Moisture", "Rain", "Year_Bef", "Year_Bef_Two"
), row.names = c(NA, 6L), class = "data.frame")
As you can see, there is a variable named 'Season', defining the season of the data. I would like to split the weather variables ('Avg_T', 'Min_T', 'Max_T', 'Moisture', 'Rain') by season, but keep them in the same row. So I would have just one row per study area for every year, containing the information for all seasons.
I tried to do that with the 'cast' and 'dcast' commands in the 'reshape' and 'reshape2' packages but it didn't work.
Could somebody help me?
Thanks,
Jacopo
First, let's say your data lives in df. I rbind df to df and change the season in the latter half of df to summer so that we have more than 1 season present:
df <- rbind(df, df)
df[7:12,]$Season = 'Summer'
Then I get rid of the last two columns in df (they don't seem to be doing anything):
df = df[, -c(9,10)]
Now, we're ready to use the reshape function:
r_df <- reshape(df, timevar = 'Season', idvar = c('Staz', 'Year'), direction = 'wide')
I think that should give you what you're looking for.
I'm trying to run svd on some data and I'm getting an error. I saw another post suggesting that this might happen when one or more of the columns are all 0, but this is not the case here. Can someone please explain what is going on and how to fix this? Note that this is a subset of a much larger dataset. Thank you.
year <- c(2015, 2015, 2015, 2015, 2015, 2015)
week <- c(1, 1, 1, 1, 1, 1)
flight_type_name <- c("Commercial", "Filler", "Label", "Commercial", "Filler", "Filler")
userdata_country <- c("NO", "SG", "NI", "None", "CA", "GT")
platform <- c("iphone", "linux", "iphone", "linux", "web", "ipad")
num_users <- c(26726, 2, 161, 1, 4316, 577)
impressions <- c(135019, 0, 312, 0, 37014, 11492)
clicks <- c(407, 2, 2, 2, 59, 25)
ctr <- data.frame(year, week, flight_type_name, userdata_country, platform, num_users, impressions, clicks)
svd(ctr)
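A likely cause is that svd() expects a numeric matrix, while ctr is a data frame that also contains character columns (flight_type_name, userdata_country, platform). A minimal sketch of one workaround follows, assuming num_users is meant to be numeric and that it is acceptable to leave the categorical columns out of the decomposition:

```r
ctr <- data.frame(
  platform    = c("iphone", "linux", "iphone", "linux", "web", "ipad"),
  num_users   = c(26726, 2, 161, 1, 4316, 577),
  impressions = c(135019, 0, 312, 0, 37014, 11492),
  clicks      = c(407, 2, 2, 2, 59, 25)
)

# Keep only the numeric columns and convert them to a matrix
num_cols <- vapply(ctr, is.numeric, logical(1))
m <- as.matrix(ctr[, num_cols])

s <- svd(m)
s$d  # singular values
```

If the categorical variables are supposed to contribute, they would first need to be encoded numerically (for example as dummy variables); whether that is meaningful depends on the analysis.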