filling missing values time series data in R - r

I am trying to expand yearly values in my panel data to year-quarter values, that is, repeat each yearly value for every quarter of that year.
For example, I want the income value repeated for the year-quarters 2000Q1, 2000Q2, 2000Q3, 2000Q4, 2001Q1, ... , 2001Q4, so the data frame would have the columns id, year-quarter, income.
I use a two-step approach but have some issues to handle. If the starting (first-quarter) value of a year is missing, then the remaining quarters of that year need to be missing (NA) too.
Case 1:
annual_data <- data.frame(
person=c(1, 1, 1, 2, 2,2),
year=c(2010, 2011, 2012, 2010, 2011, 2012),
income=c(4, 10, 13, 1, NA, 30)
)
Case 2:
annual_data <- data.frame(
person=c(1, 1, 1, 2, 2,2),
year=c(2010, 2011, 2012, 2010, 2011, 2012),
income=c(4, 10, 13, NA, NA, 30)
)
In the first step, I expand the data to quarterly, as described in the question "interpolating in R yearly time series data with quarterly values", using a function such as:
expand <- function(x) {
years <- min(x$year):max(x$year)
quarters <- 1:4
grid <- expand.grid(quarter=quarters, year=years)
x$quarter <- 1
merged <- grid %>% left_join(x, by=c('year', 'quarter'))
merged$person <- x$person[1]
return(merged)
}
Then I used zoo::na.locf inside dplyr::mutate:
quarterlydata <- annual_data %>% group_by(person) %>% do(expand(.))
testdata <- quarterlydata %>% group_by(person) %>% mutate(ynew=zoo::na.locf(income))
but haven't had much luck, as it carries the previous non-missing value forward into every missing value. That is,
Case 1: it fills all values, so the income of 1 for person 2 gets copied to both 2010 and 2011, when it should be copied to 2010 only and 2011 should be NA.
For case 2: I get
Error: incompatible size (%d), expecting %d (the group size) or 1.
Any thoughts on what I am missing?

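For case 1 you are missing year in your group_by. With the code you have, the data are grouped by person only, so na.locf runs over all of a person's years and carries the last non-missing value across year boundaries. Grouping by person and year confines the fill to the four quarters of a single year: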
testdata <- quarterlydata %>%
group_by(person, year) %>%
mutate(ynew=zoo::na.locf(income, na.rm=FALSE))
With the output:
> tail(testdata, 13)
Source: local data frame [13 x 5]
Groups: person, year
quarter year person income ynew
1 4 2012 1 NA 13
2 1 2010 2 1 1
3 2 2010 2 NA 1
4 3 2010 2 NA 1
5 4 2010 2 NA 1
6 1 2011 2 NA NA
7 2 2011 2 NA NA
8 3 2011 2 NA NA
9 4 2011 2 NA NA
10 1 2012 2 30 30
11 2 2012 2 NA 30
12 3 2012 2 NA 30
13 4 2012 2 NA 30
For case 2, as you might already infer from the code above, you must set na.rm to FALSE. Otherwise na.locf drops the leading NAs that it cannot fill, the result is shorter than the group, and mutate fails with the size error.
So using exactly the same code for case 2 we will have the output:
> tail(testdata, 13)
Source: local data frame [13 x 5]
Groups: person, year
quarter year person income ynew
1 4 2012 1 NA 13
2 1 2010 2 NA NA
3 2 2010 2 NA NA
4 3 2010 2 NA NA
5 4 2010 2 NA NA
6 1 2011 2 NA NA
7 2 2011 2 NA NA
8 3 2011 2 NA NA
9 4 2011 2 NA NA
10 1 2012 2 30 30
11 2 2012 2 NA 30
12 3 2012 2 NA 30
13 4 2012 2 NA 30
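Since every quarter simply repeats its own year's value, the expand-then-fill steps can also be collapsed into a cross join. The following is a minimal base R sketch under that assumption, not the approach used in the answer above; the names quarterly and yearqtr are made up for illustration. It relies on merge() returning the Cartesian product when by has length zero.

```r
# Case 1 data from the question
annual_data <- data.frame(
  person = c(1, 1, 1, 2, 2, 2),
  year   = c(2010, 2011, 2012, 2010, 2011, 2012),
  income = c(4, 10, 13, 1, NA, 30)
)

# Cross join: every annual row is paired with quarters 1 to 4,
# so an NA income stays NA in all four quarters of that year
quarterly <- merge(annual_data, data.frame(quarter = 1:4), by = NULL)

# Restore a readable order and build a year-quarter label
quarterly <- quarterly[order(quarterly$person, quarterly$year, quarterly$quarter), ]
quarterly$yearqtr <- paste0(quarterly$year, "Q", quarterly$quarter)
```

Because no value is ever carried across a year boundary, the same code handles Case 2 without any na.rm setting.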

Related

R - delete consecutive (ONLY) duplicates

I need to eliminate rows from a data frame based on the repetition of values in a given column, but only those that are consecutive.
For example, for the following data frame:
df = data.frame(x=c(1,1,1,2,2,4,2,2,1))
df$y <- c(10,11,30,12,49,13,12,49,30)
df$z <- c(1,2,3,4,5,6,7,8,9)
x y z
1 10 1
1 11 2
1 30 3
2 12 4
2 49 5
4 13 6
2 12 7
2 49 8
1 30 9
I would need to eliminate rows with consecutive repeated values in the x column, keep the last repeated row, and maintain the structure of the data frame:
x y z
1 30 3
2 49 5
4 13 6
2 49 8
1 30 9
Following directions from help and some other posts, I have tried using the duplicated function:
df[ !duplicated(x,fromLast=TRUE), ] # which gives me this:
x y z
1 1 10 1
6 4 13 6
7 2 12 7
9 1 30 9
NA NA NA NA
NA.1 NA NA NA
NA.2 NA NA NA
NA.3 NA NA NA
NA.4 NA NA NA
NA.5 NA NA NA
NA.6 NA NA NA
NA.7 NA NA NA
NA.8 NA NA NA
Not sure why I get the NA rows at the end (wasn't happening with a similar table I was testing), but works only partially on the values.
I have also tried using the data.table package as follows:
library(data.table)
dt <- as.data.table(df)
setkey(dt, x)
dt[J(unique(x)), mult ='last']
Works great, but it eliminates ALL duplicates from the data frame, not just those that are consecutive, giving something like this:
x y z
1 30 9
2 49 8
4 13 6
Please, forgive if cross-posting. I tried some of the suggestions but none worked for eliminating only those that are consecutive.
I would appreciate any help.
Thanks
How about:
df[cumsum(rle(df$x)$lengths),]
Explanation:
rle(df$x)
gives you the run lengths and values of consecutive duplicates in the x variable. Then:
rle(df$x)$lengths
extracts the lengths. Finally:
cumsum(rle(df$x)$lengths)
gives the row indices which you can select using [.
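To make the three steps above concrete, here is a small sketch that prints each intermediate result for the x column of the question's data frame. The variable names runs and last_rows are only illustrative.

```r
x <- c(1, 1, 1, 2, 2, 4, 2, 2, 1)

runs <- rle(x)
runs$values    # 1 2 4 2 1  (the value of each consecutive run)
runs$lengths   # 3 2 1 2 1  (how long each run is)

# The cumulative sum of the run lengths is the position of the
# last element of every run, i.e. the rows to keep
last_rows <- cumsum(runs$lengths)
last_rows      # 3 5 6 8 9
```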
EDIT: for fun, here's a microbenchmark of the answers given so far. rle is mine; consec is the answer given by @James, which I think is the most fundamentally direct one and the one I would "accept"; dp is the dplyr answer given by @Nik.
#> Unit: microseconds
#> expr min lq mean median uq max
#> rle 134.389 145.4220 162.6967 154.4180 172.8370 375.109
#> consec 111.411 118.9235 136.1893 123.6285 145.5765 314.249
#> dp 20478.898 20968.8010 23536.1306 21167.1200 22360.8605 179301.213
rle performs better than I thought it would.
You just need to check that no duplicate follows a value, i.e. x[i+1] != x[i], and note that the last value will always be kept.
df[c(df$x[-1] != df$x[-nrow(df)],TRUE),]
x y z
3 1 30 3
5 2 49 5
6 4 13 6
8 2 49 8
9 1 30 9
A cheap solution with dplyr that I could think of:
Method:
library(dplyr)
df %>%
mutate(id = lag(x, 1),
decision = if_else(x != id, 1, 0),
final = lead(decision, 1, default = 1)) %>%
filter(final == 1) %>%
select(-id, -decision, -final)
Output:
x y z
1 1 30 3
2 2 49 5
3 4 13 6
4 2 49 8
5 1 30 9
This will even work if your data has the same x value at the bottom
New Input:
df2 <- df %>% add_row(x = 1, y = 10, z = 12)
df2
x y z
1 1 10 1
2 1 11 2
3 1 30 3
4 2 12 4
5 2 49 5
6 4 13 6
7 2 12 7
8 2 49 8
9 1 30 9
10 1 10 12
Use same method:
df2 %>%
mutate(id = lag(x, 1),
decision = if_else(x != id, 1, 0),
final = lead(decision, 1, default = 1)) %>%
filter(final == 1) %>%
select(-id, -decision, -final)
New Output:
x y z
1 1 30 3
2 2 49 5
3 4 13 6
4 2 49 8
5 1 10 12
Here is a data.table solution. The trick is to create a shifted version of x with the shift function and compare it with x
library(data.table)
dattab <- as.data.table(df)
dattab[x != shift(x = x, n = 1, fill = -999, type = "lead")] # edited to add closing )
This way you compare each value of x with its immediately following value and throw out where they match. Make sure to set fill to something that is not in x in order for correct handling of the last value.

Define variable iteratively in data table in r

I am trying to find a faster solution to defining a variable iteratively, i.e., the next row of the variable depends on the previous row. For example, suppose I have the following data.table:
tmp <- data.table(type = c("A", "A", "A", "B", "B", "B"),
year = c(2011, 2012, 2013, 2011, 2012, 2013),
alpha = c(1,1,1,2,2,2),
beta = c(3,3,3,4,4,4),
pred = c(1,NA,NA,2,NA, NA))
For each type (A and then B), I want to solve for pred going forward, where pred for type A for the year 2012 is:
pred_2012_A = alpha + beta * pred_2011_A
and the pred for 2013 for type A continues:
pred_2013_A = alpha + beta * pred_2012_A
I have a solution using a for loop to go through type and create a variable to store the previous value and use the "by" command in data table to loop through the year as such:
for(i in c("A", "B")){
tmp.val <- tmp[type == i & year == 2011]$pred # initial value for type i
tmp[year > 2011 & type == i, pred := {
tmp.val <- alpha + beta * tmp.val
}, by = year]
}
Ultimately, the original data table looks like:
type year alpha beta pred
1: A 2011 1 3 1
2: A 2012 1 3 NA
3: A 2013 1 3 NA
4: B 2011 2 4 2
5: B 2012 2 4 NA
6: B 2013 2 4 NA
And the updated table looks like:
type year alpha beta pred
1: A 2011 1 3 1
2: A 2012 1 3 4
3: A 2013 1 3 13
4: B 2011 2 4 2
5: B 2012 2 4 10
6: B 2013 2 4 42
My question here is if there is a faster way to implement this without the for loop. Is there a way to implement this routine in one data table statement that is faster than using the for loop? My real usage has many more types and many more years to compute, so a faster implementation would be greatly appreciated.
Thank you.
You can just do the math:
tmp[, pred := pred[1]*beta^(1:.N-1) + alpha*cumsum(c(0, beta[1]^(0:(.N-2)))), by=type]
# type year alpha beta pred
# 1: A 2011 1 3 1
# 2: A 2012 1 3 4
# 3: A 2013 1 3 13
# 4: B 2011 2 4 2
# 5: B 2012 2 4 10
# 6: B 2013 2 4 42
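The one-liner works because unrolling pred[t] = alpha + beta * pred[t-1] gives pred[t] = pred[0] * beta^t + alpha * (1 + beta + ... + beta^(t-1)). As a quick sanity check, the sketch below compares that closed form with a plain iteration for type A; the names iterative and closed are illustrative, and the values are the ones from the question.

```r
alpha <- 1
beta  <- 3
pred0 <- 1
n     <- 3   # number of years, including the starting year

# Step-by-step recursion: each value is alpha + beta * previous value
iterative <- Reduce(function(prev, i) alpha + beta * prev,
                    seq_len(n - 1), accumulate = TRUE, init = pred0)

# Closed form: geometric growth of the start value plus a geometric sum of alpha
closed <- pred0 * beta^(0:(n - 1)) + alpha * cumsum(c(0, beta^(0:(n - 2))))

all.equal(iterative, closed)
```

Note that the closed form as written assumes at least two rows per type, since 0:(n - 2) counts downward when n is 1.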
Comment. In my opinion, the data structure in the OP is flawed. Alpha and beta are clearly attributes of the type, not something that is varying from row to row. It should start with:
typeDT = data.table(
type=c("A","B"),
year.start = 2011L,
year.end=2013,
a = 1:2,
b = 3:4,
pred0 = 1:2
)
# type year.start year.end a b pred0
# 1: A 2011 2013 1 3 1
# 2: B 2011 2013 2 4 2
With this structure, you could expand to your data set naturally:
typeDT[, {
year = year.start:year.end
n = length(year)
p = pred0*b^(0:(n-1)) + a*cumsum(c(0, b^(0:(n-2))))
.(year = year, pred = p)
}, by=type]
# type year pred
# 1: A 2011 1
# 2: A 2012 4
# 3: A 2013 13
# 4: B 2011 2
# 5: B 2012 10
# 6: B 2013 42
A bit hacky but bear with me, it only takes two iterations.
df <- read.table(text = "type year alpha beta pred
1: A 2011 1 3 1
2: A 2012 1 3 NA
3: A 2013 1 3 NA
4: B 2011 2 4 2
5: B 2012 2 4 NA
6: B 2013 2 4 NA", header = T)
df2 <- df
while(any(is.na(df2$pred))){
df2$pred <- df2$alpha + df2$beta*dplyr::lag(df2$pred) # dplyr::lag shifts the vector by one row; stats::lag would not
df2$pred[which(!is.na(df$pred))] <- df$pred[which(!is.na(df$pred))]
}
The solution is correct
df2
type year alpha beta pred
1: A 2011 1 3 1
2: A 2012 1 3 4
3: A 2013 1 3 13
4: B 2011 2 4 2
5: B 2012 2 4 10
6: B 2013 2 4 42

matching only if target 'cell' is NA

I have the following two data.frames opcat and polity.
opcat <- data.frame(country = rep(LETTERS[1:5]), date.ratification = c(2003,2004,2005,NA,NA), date.accession = c(NA,NA,NA,2000,2006))
opcat
polity <- data.frame(year = rep((2000:2007), 7), country = rep(LETTERS[1:7],8), polity.score = sample(10, 56, replace=TRUE))
polity <- polity[order(polity$country, polity$year),]
polity
I want to insert the polity.score from the data frame polity into the data.frame opcat for the year in which a country 'ratified' (= date.ratification) or 'acceded' (= date.accession).
for ratification
opcat$polity.score <- polity$polity.score[match(interaction(opcat$country, opcat$date.ratification), interaction(polity$country, polity$year))]
opcat
country date.ratification date.accession polity.score
1 A 2003 NA 10
2 B 2004 NA 2
3 C 2005 NA 10
4 D NA 2000 NA
5 E NA 2006 NA
for accession
opcat$polity.score <- polity$polity.score[match(interaction(opcat$country, opcat$date.accession), interaction(polity$country, polity$year))]
opcat
country date.ratification date.accession polity.score
1 A 2003 NA NA
2 B 2004 NA NA
3 C 2005 NA NA
4 D NA 2000 9
5 E NA 2006 7
A country has either a date for ratification or for accession (not both). Since the matching for accession would fill the results for ratification with NA, I tried the following modification:
opcat$polity.score[is.na(opcat$date.ratification)] <- polity$polity.score[match(interaction(opcat$country, opcat$date.accession), interaction(polity$country, polity$year))]
opcat
But this doesn't work. I get the error message "number of items to replace is not a multiple of replacement length". How can I match the scores into the same variable without overwriting them?
The final result should be
country date.ratification date.accession polity.score
1 A 2003 NA 10
2 B 2004 NA 2
3 C 2005 NA 10
4 D NA 2000 9
5 E NA 2006 7
I would think that this shouldn't be that difficult.
Many thanks.
Add the year to opcat, and perform a standard merge:
opcat$year <- with(opcat, ifelse(is.na(date.ratification), date.accession, date.ratification))
merge(opcat,polity)
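If you would rather keep the match-based approach and fill only the cells that are still NA, the replacement vector has to be subset with the same logical index as the target. The attempt in the question fails because it assigns five matched values into two slots. A minimal sketch follows; the helper name lookup and the deterministic placeholder scores are assumptions made here so that the result is reproducible.

```r
opcat <- data.frame(country = LETTERS[1:5],
                    date.ratification = c(2003, 2004, 2005, NA, NA),
                    date.accession    = c(NA, NA, NA, 2000, 2006))

# Placeholder polity data: one row per country-year, scores 1..56
polity <- expand.grid(year = 2000:2007, country = LETTERS[1:7],
                      stringsAsFactors = FALSE)
polity$polity.score <- seq_len(nrow(polity))

# Hypothetical helper: score for each opcat country in the given years
lookup <- function(years) {
  key_opcat  <- paste(opcat$country, years)
  key_polity <- paste(polity$country, polity$year)
  polity$polity.score[match(key_opcat, key_polity)]
}

opcat$polity.score <- lookup(opcat$date.ratification)

# Fill only the cells that are still NA, subsetting both sides alike
still_na <- is.na(opcat$polity.score)
opcat$polity.score[still_na] <- lookup(opcat$date.accession)[still_na]
opcat
```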

Split and Diff function in R

I have a data frame called data. I am splitting the data using split function by an attribute called KEY.
data <- split(data, data$KEY);
After splitting the data frame by KEY, we get the data for individual firms; the original data frame held the data for all the firms in the universe. Each split has two columns, year and sales. For each split, I have to calculate the incremental sales for each year. For instance, given the data 2002 - 10, 2003 - 12, 2004 - 15, 2005 - 20, I would like to get 2003 - 2, 2004 - 3, 2005 - 5.
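I have written a function, called mod_sale, to perform the job mentioned:
mod_sale <- function(data) {
  data <- data[order(data$year), ]  # sort by year; the sorted result must be assigned
  sale_data <- diff(data$SALE)      # year-over-year change in sales
  data <- data[-1, ]                # the first year has no previous value
  data$SALE <- sale_data
  return(data)
}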
Currently, I am using for loop:
for(key in names(data)){
a <- try(mod_sale(data[[key]]))
if(class(a) == "try-error") next;
mod_data <- rbind(mod_data,a)};
I think there is some way I can use sapply (and maybe plyr too). Can someone help me improve this R code? I am not sure how the sapply code would go.
sapply(data, mod_sale)
Any help would be appreciated. Thanks.
Edit:
Here is a data example:
a <- data.frame();
key <- c(1,1,1,1,2,2,2,2,2,3,3,3);
sales <- c(12,12,15,8,3,6,3,9,9,12,3,7);
year <- c(2002,2003,2004,2005,2001,2002,2003,2004,2005,2003,2004,2005);
ovar <- runif(12,5.0,7.5);
a <- data.frame(key,sales,year,ovar)
In the resultant data.frame, I am expecting incremental sales rather than real sales. Obviously, we will lose 3 data points for 3 key; one for each starting year, as we are taking difference. So there will be three less rows in the resultant data.frame, which would have columns key,diff(sales),year, and ovar.
This is what I would have done:
a$diffsales <- ave( a$sales, a$key, FUN=function(x) c(NA, diff(x) ) )
a
key sales year ovar diffsales
1 1 12 2002 6.845177 NA
2 1 12 2003 6.328153 0
3 1 15 2004 6.872669 3
4 1 8 2005 6.098920 -7
5 2 3 2001 7.154824 NA
6 2 6 2002 6.110810 3
7 2 3 2003 5.906624 -3
8 2 9 2004 5.214369 6
9 2 9 2005 5.818218 0
10 3 12 2003 5.354354 NA
11 3 3 2004 6.728992 -9
12 3 7 2005 7.412213 4
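The question expects the first year of each key to be dropped rather than kept as NA. One way to get there from the ave() result is sketched below; the random ovar column is left out so the output is reproducible, and the explicit sort is an added safeguard for data that is not already ordered by key and year.

```r
a <- data.frame(
  key   = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3),
  sales = c(12, 12, 15, 8, 3, 6, 3, 9, 9, 12, 3, 7),
  year  = c(2002, 2003, 2004, 2005, 2001, 2002, 2003, 2004, 2005,
            2003, 2004, 2005)
)

# diff() assumes the rows of each key are in year order
a <- a[order(a$key, a$year), ]

# Incremental sales within each key, NA for each key's first year
a$diffsales <- ave(a$sales, a$key, FUN = function(x) c(NA, diff(x)))

# Drop the first year of every key (three rows fewer)
result <- a[!is.na(a$diffsales), ]
```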
I appreciate the attempt to display what you'd tried. Thank you.
In the future, try to provide a small example, like this:
df <- data.frame(year = 2001:2010,
sale = sample(20,10))
df <- rbind(df,df,df)
df$key <- rep(letters[1:3],each = 10)
That makes it much clearer what your data look like, and it makes it very easy for people trying to answer. The easier you make it for us, the faster+better answers you'll get.
I'd recommend sorting before splitting:
#Sort first (already sorted, but you get the idea)
df <- df[order(df$key,df$year),]
df_split <- split(df,df$key)
You don't actually want to use sapply. (Try it and see.) You just want lapply:
out <- lapply(df_split,function(x) {x$sale_diff <- c(NA,diff(x$sale)); x[-1,]})
You'd put it all together again using:
do.call(rbind,out)
You're right, plyr or data.table could also do this. I'll leave those examples to others.
Using data.table:
library(data.table)
dt = data.table(a)
dt[, sale_diff := c(NA, diff(sales)), by = key]
dt
# key sales year ovar sale_diff
# 1: 1 12 2002 7.416857 NA
# 2: 1 12 2003 5.625818 0
# 3: 1 15 2004 5.018934 3
# 4: 1 8 2005 6.671986 -7
# 5: 2 3 2001 6.242739 NA
# 6: 2 6 2002 6.297763 3
# 7: 2 3 2003 6.482124 -3
# 8: 2 9 2004 6.724256 6
# 9: 2 9 2005 5.071265 0
#10: 3 12 2003 6.136681 NA
#11: 3 3 2004 6.974392 -9
#12: 3 7 2005 6.517553 4

Is there already a function to subtract different variables in subsequent quarters?

I have an unbalanced quarterly panel data set with missing values. I want to subtract variable A2 from A1 in subsequent quarters. Note that I do not want differences of A2 itself; I want to subtract DIFFERENT variables from each other. Differences should be calculated separately for every uid. Also, quarters that span a change of year, like Q4 1999 and Q1 2000, count as subsequent.
I am really not sure whether I should concatenate my time index here, since packages like zoo only take one index. But that's not the problem here. Here is some example data:
structure(list(uid = c(1, 1, 1, 2, 2, 3, 3, 3), tndx = c(1999.4,
2000.1, 2000.2, 1999.4, 2000.1, 2000.1, 2000.2, 2000.3), A1 = c(2,
2, 2, 10, 11, 1, 1, 1), A2 = c(3, 3, 3, 14, 14, 2, 100, 2)), .Names = c("uid",
"tndx", "A1", "A2"), row.names = c(NA, -8L), class = "data.frame")
# which results in
uid tndx A1 A2
1 1 1999.4 2 3
2 1 2000.1 2 3
3 1 2000.2 2 3
4 2 1999.4 10 14
5 2 2000.1 11 14
6 3 2000.1 1 2
7 3 2000.2 1 100
8 3 2000.3 1 2
If you prefer a separated index, use this example:
# Thx Andrie!
x2 <- data.frame(x, colsplit(x$tndx, "\\.", names=c("year", "qtr")))
Is there a good way to solve this with reshape2, plyr or even base or would you rather write a custom function?
Note, it is also possible that some uid occurs only once. Obviously you can't calculate a lagged difference then. Still I need to check for that and create an NA then.
We split it on the uid using by and within the function that operates on each set of rows for a single uid, we create a zoo object, z, using yearqtr class for the index. Then we merge the time series with an empty series having all the desired quarters including any missing intermediate quarters giving zm and perform the computation giving zz. Finally we convert to the data.frame form on the way out:
library(zoo)
to.yearqtr <- function(x) as.yearqtr(trunc(x) + (10*(x-trunc(x))-1)/4)
DF <- do.call("rbind", by(x, x$uid, function(x) {
# columns of x are: uid tndx A1 A2
z <- zoo(x[c("A1", "A2")], to.yearqtr(x$tndx))
zm <- merge(z, zoo(, seq(start(z), end(z), 1/4)))
zz <- with(zm, cbind(zm, `A1 - A2 lag` = A1 - lag(A2, -1)))
if (ncol(zz) <= ncol(z)) zz$`A1 - A2 lag` <- NA # append NA if col not added
data.frame(uid = x[1, 1], tndx = time(zz), coredata(zz), check.names = FALSE)
}))
which gives this:
> DF
uid tndx A1 A2 result A1 - A2 lagged
1.1 1 1999 Q4 2 3 NA NA
1.2 1 2000 Q1 2 2 NA -1
1.3 1 2000 Q2 2 3 NA 0
2.1 2 1999 Q4 2 4 NA NA
2.2 2 2000 Q1 NA NA NA NA
2.3 2 2000 Q2 NA NA NA NA
2.4 2 2000 Q3 NA NA NA NA
2.5 2 2000 Q4 NA NA NA NA
2.6 2 2001 Q1 3 4 NA NA
3.1 3 2000 Q1 1 2 NA NA
3.2 3 2000 Q2 1 NA NA -1
3.3 3 2000 Q3 1 2 NA NA
EDIT: Completely re-did the solution based on further discussion. Note that this not only adds an extra column but it also converts the index to "yearqtr" class and adds the extra missing rows.
EDIT: Some minor simplifications in the by function.
I wasn't entirely clear what you wanted because you didn't include a "right answer". If you want to subtract one lagged variable from another unlagged variable, you can do that with indexing that is offset. (You do need to pad the result if you want it to be put back into the data frame.)
x$A1lagA2 <- ave(x[, c("A1", "A2")], x$uid, FUN=function(z) {
with(z, c(NA, A1[2:NROW(z)] -A2[1:(NROW(z)-1)]) ) } )[[1]]
x
uid tndx A1 A2 A1lagA2
1 1 1999.4 2 3 NA
2 1 2000.1 2 3 -1
3 1 2000.2 2 3 -1
4 2 1999.4 10 14 NA
5 2 2000.1 11 14 -3
6 3 2000.1 1 2 NA
7 3 2000.2 1 100 -1
8 3 2000.3 1 2 -99
You do get annoying duplicate extra columns with ave() when its argument is multicolumn, but I just took the first one.
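The offset indexing above treats neighbouring rows as subsequent quarters, which does not hold if a uid has a gap, and it also assumes each uid has at least two rows. Below is a base R sketch that checks adjacency explicitly. It assumes the rows are sorted by uid and tndx and that tndx encodes the quarter as the first decimal digit, as in the example data; the names qidx, prev_ok and prev_A2 are illustrative.

```r
x <- data.frame(
  uid  = c(1, 1, 1, 2, 2, 3, 3, 3),
  tndx = c(1999.4, 2000.1, 2000.2, 1999.4, 2000.1, 2000.1, 2000.2, 2000.3),
  A1   = c(2, 2, 2, 10, 11, 1, 1, 1),
  A2   = c(3, 3, 3, 14, 14, 2, 100, 2)
)

# Turn year.quarter into a running quarter count, so that
# 1999 Q4 and 2000 Q1 differ by exactly 1
year <- floor(x$tndx)
qtr  <- round((x$tndx - year) * 10)
qidx <- year * 4 + qtr

# A row has a usable predecessor only if the previous row belongs to
# the same uid and is the immediately preceding quarter
prev_ok <- c(FALSE, x$uid[-1] == x$uid[-nrow(x)] & diff(qidx) == 1)
prev_A2 <- c(NA, x$A2[-nrow(x)])

x$A1lagA2 <- ifelse(prev_ok, x$A1 - prev_A2, NA)
x
```

A uid that occurs only once gets NA automatically, because its single row never has a predecessor with the same uid.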

Resources