Reshaping repeated measures data in R from wide to long

I need to convert a "wide" dataframe of annually repeated measures on individuals into "long" format, so that I can fit models like lm(y_year2 ~ x_year1) as well as lm(z_year2 ~ y_year2).
I can get it into the format I want "by hand", but I cannot figure out how to melt/dcast it into that shape.
Below I've illustrated what I'm doing with some simulated data.
The data frame looks like this in wide format, one individual per row:
ID SITE L_03 M_03 R_03 L_04 M_04 R_04 L_05 M_05 R_05
1 forest X a YES Y b YES Z c NO
2 forest ...
I'd like it in LONG format:
ID SITE L_year1 L_year2 M_year1 M_year2 R_year1 R_year2 year1 year2
1 forest X Y a b YES YES 03 04
1 forest Y Z b c YES NO 04 05
2 forest ...
2 forest ...
Some simulated data (L and M are numeric (length and mass), R is a Yes/No factor (reproductive); 3 years of repeated measurements, 2003-2005):
ID <- 1:10
SITE <- c(rep("forest", 3), rep("swamp", 3), rep("field", 4))
L_03 <- round(rnorm(10, 100, 1), 3)
M_03 <- round(10 + L_03 * 0.25 + rnorm(10, 0, 1), 3)
R_03 <- sample(c("Yes", "No"), 10, replace = TRUE)
L_04 <- round(2 + L_03 * 1.25 + rnorm(10, 1, 10), 3)
M_04 <- round(10 + L_04 * 0.25 + rnorm(10, 0, 10), 3)
R_04 <- sample(c("Yes", "No"), 10, replace = TRUE)
L_05 <- round(2 + L_04 * 1.25 + rnorm(10, 1, 10), 3)
M_05 <- round(10 + L_05 * 0.25 + abs(rnorm(10, 0, 10)), 3)
R_05 <- sample(c("Yes", "No"), 10, replace = TRUE)
rm_data <- data.frame(ID, SITE, L_03, M_03, R_03, L_04, M_04, R_04,
                      L_05, M_05, R_05)
Approach 1: my ad hoc reshaping "by hand" with rbind.
First, make a subset with the 2003 and 2004 data, then another with 2004 and 2005:
rm_data1 <- cbind(rm_data[, c(1, 2, 3:5, 6:8)], rep(2003, 10), rep(2004, 10))
rm_data2 <- cbind(rm_data[, c(1, 2, 6:8, 9:11)], rep(2004, 10), rep(2005, 10))
names(rm_data1)[3:10] <- c("L1", "M1", "R1", "L2", "M2", "R2", "yr1", "yr2")
names(rm_data2)[3:10] <- c("L1", "M1", "R1", "L2", "M2", "R2", "yr1", "yr2")
data3 <- rbind(rm_data1, rm_data2)
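With data3 in this shape, the models sketched at the top can be fit directly, e.g. (one example pairing of the question's y_year2 and x_year1):
summary(lm(M2 ~ L1, data = data3))  # year-2 mass modelled on year-1 length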
Approach 2: I'd like to do this with reshape/melt/dcast. I can't figure out whether I can use dcast directly on the wide data frame or, once I melt it, how to dcast it into the format I want.
library(reshape2)
rm_measure_vars <- c("L_03", "M_03", "R_03", "L_04", "M_04","R_04", "L_05", "M_05", "R_05")
rm_data_melt <- melt(data = rm_data, id.vars = c("ID", "SITE"), measure.vars = rm_measure_vars, value.name = "data")
I add a designator of the year the measurement was taken to the melted data:
obs_year <- gsub("(.*)([0-9]{2})", "\\2", rm_data_melt$variable)
rm_data_melt <- cbind(rm_data_melt, obs_year)
The dcast seems like it should be something like this, but it is not yet what I need:
dcast(data = rm_data_melt, formula = ID + SITE + obs_year ~ variable)
ID SITE obs_year L_03 M_03 R_03 L_04 M_04 R_04 L_05 M_05 R_05
1 1 forest 03 99.96 35.364 No <NA> <NA> <NA> <NA> <NA> <NA>
2 1 forest 04 <NA> <NA> <NA> 129.595 47.256 Yes <NA> <NA> <NA>
3 1 forest 05 <NA> <NA> <NA> <NA> <NA> <NA> 177.607 58.204 Yes
Any suggestions would be greatly appreciated

I gave it a try. The reshape is the easy part; the rest needs some semi-manual handling, I believe. The following should give you what you want:
output <- reshape(rm_data, idvar = c("ID", "SITE"), varying = 3:11,
                  v.names = c("L_", "M_", "R_"), direction = "long")
output <- output[order(output$ID), ]  # reshape stacks rows by time, so sort to make each ID's years consecutive
output$time <- output$time + 2        # to get the year
names(output)[3:6] <- c("year1", "L_year1", "M_year1", "R_year1")
output$year2 <- output$year1 + 1
rownames(output) <- NULL
sapply(output[, 4:6], function(x) {
  i <- ncol(output) + 1
  output[, i] <<- x[c(2:length(x), NA)]  # shift each measure up one row, i.e. the next year's value
  names(output)[i] <<- sub("1", "2", names(output)[i - 4])
})
output <- output[, c(1, 2, 4, 8, 5, 9, 6, 10, 3, 7)]  # rearrange columns as necessary
output <- output[output$year1 != 5, ]  # the last year has no following year, so drop those rows
Hope this helps!
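For completeness, the melt/dcast route from Approach 2 can be finished with a self-merge that pairs consecutive years. This is a sketch, not part of the original answer; note that melt coerces the mixed numeric/factor measure columns to character, so L and M need converting back with as.numeric() before modelling.
library(reshape2)
m <- melt(rm_data, id.vars = c("ID", "SITE"))   # values coerced to character here
m$measure <- sub("_[0-9]+$", "", m$variable)    # "L", "M", "R"
m$year    <- sub("^[A-Z]+_", "", m$variable)    # "03", "04", "05"
w <- dcast(m, ID + SITE + year ~ measure, value.var = "value")
w2 <- w
w2$year <- sprintf("%02d", as.numeric(w2$year) - 1)  # shift back one year so it aligns with the previous one
out <- merge(w, w2, by = c("ID", "SITE", "year"), suffixes = c("_year1", "_year2"))
names(out)[names(out) == "year"] <- "year1"
out$year2 <- sprintf("%02d", as.numeric(out$year1) + 1)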

Install the onetree package:
devtools::install_github("yikeshu0611/onetree")
library(onetree)
Three steps, using the onetree package.
Step 1: reshape the data to long format.
long1 <- reshape_toLong(data = rm_data,
                        id = "ID",
                        j = "year",
                        value.var.prefix = c("L_", "M_", "R_"))
Step 2: drop year 5, keeping years 3 and 4, and duplicate year as y.
long2 <- long1[long1$year != 5, ]
long2$y <- long2$year
Reshape long2 to wide format by year:
wide1 <- reshape_toWide(data = long2,
                        id = "ID",
                        j = "year",
                        value.var.prefix = c("L_", "M_", "R_", "y"))
Now we have data for years 3 and 4, which are year1 and year2 in your desired output, so we replace 3 with 1 and 4 with 2 in the column names.
colnames(wide1) <- gsub("3", "1", colnames(wide1))
colnames(wide1) <- gsub("4", "2", colnames(wide1))
Step 3: do step 2 again, this time dropping year 3 and keeping years 4 and 5.
long3 <- long1[long1$year != 3, ]
long3$y <- long3$year
wide2 <- reshape_toWide(data = long3,
                        id = "ID",
                        j = "year",
                        value.var.prefix = c("L_", "M_", "R_", "y"))
colnames(wide2) <- gsub("4", "1", colnames(wide2))
colnames(wide2) <- gsub("5", "2", colnames(wide2))
Last, rbind wide1 and wide2:
data <- rbind(wide1, wide2)
data[order(data$ID), ]
ID SITE L_1 M_1 R_1 y1 L_2 M_2 R_2 y2
1 1 forest 100.181 34.279 Yes 3 131.88 50.953 No 4
11 1 forest 131.88 50.953 No 4 158.642 50.255 No 5
2 2 forest 101.645 36.667 Yes 3 123.923 43.915 No 4
12 2 forest 123.923 43.915 No 4 163.81 55.979 No 5
3 3 forest 98.961 33.901 Yes 3 125.928 41.611 No 4
13 3 forest 125.928 41.611 No 4 165.865 57.417 No 5
4 4 swamp 100.807 36.254 No 3 117.856 48.634 Yes 4
14 4 swamp 117.856 48.634 Yes 4 137.487 50.945 No 5
5 5 swamp 99.75 33.881 No 3 132.419 50.563 Yes 4
15 5 swamp 132.419 50.563 Yes 4 168.461 58.373 Yes 5
6 6 swamp 100.463 34.859 Yes 3 122.884 40.301 No 4
16 6 swamp 122.884 40.301 No 4 152.85 57.491 No 5
7 7 field 102.527 34.521 No 3 123.363 35.935 No 4
17 7 field 123.363 35.935 No 4 168 55.692 No 5
8 8 field 99.957 35.236 Yes 3 139.083 34.793 No 4
18 8 field 139.083 34.793 No 4 177.648 62.638 Yes 5
9 9 field 100.16 36.454 No 3 135.468 45.115 Yes 4
19 9 field 135.468 45.115 Yes 4 180.666 57.233 No 5
10 10 field 100.037 35.612 No 3 139.165 46.95 No 4
20 10 field 139.165 46.95 No 4 169.333 55.782 Yes 5

Related

How to loop many factors into one function

I have a large data frame regarding Covid patients. I have included a very simplified version of what this frame looks like.
CovidFake <- data.frame(
  DateReporting = sample(seq(as.Date("2020-10-01"), as.Date("2020-11-01"), by = "day"),
                         50, replace = TRUE),
  Industry = sample(c("Minor or Student", "Educational Services",
                      "Medical Services", "Food Production"), 50, replace = TRUE))
I want to use ggplot to make a graph of daily cases by the patient's industry. I have this function to structure the frame so ggplot can graph it:
library(zoo)
library(dplyr)   # for %>%
library(tidyr)   # for complete() and replace_na()

MainFunction <- function(MainFrame, CatVal){
  Frame <- subset(MainFrame, Industry == CatVal)
  Frame <- as.data.frame(table(Frame$DateReporting))
  colnames(Frame) <- c("Var1", "Freq")
  Frame$Var1 <- as.Date(Frame$Var1, "%Y-%m-%d")
  Frame <- Frame %>% complete(Var1 = seq.Date(as.Date("2020-10-01", "%Y-%m-%d"),
                                              as.Date("2020-11-01", "%Y-%m-%d"),
                                              by = "day"))
  Frame$Freq <- replace_na(Frame$Freq, 0)
  Frame$CumSum <- cumsum(Frame$Freq)
  Frame$Cat <- CatVal
  Frame$SevenDayAverage <- rollmean(Frame$Freq, 7, fill = NA, align = "right")
  colnames(Frame) <- c("Date", "DailyCases", "CumSum", "Industry", "SevenDayAve")
  Frame <- subset(Frame, Date >= "2020-03-13")
  return(Frame)
}
I need to create a frame that has all of these industries, so I've been doing something like this:
IndGraph <- rbind(MainFunction(CovidFake, "Minor or Student"),
                  MainFunction(CovidFake, "Educational Services"),
                  MainFunction(CovidFake, "Medical Services"),
                  MainFunction(CovidFake, "Food Production"))
The true frame has about 15 industries, so the code gets pretty long and seemingly unnecessarily repetitive. Is there any way to loop all the factors into the function and do this in one go? Or is there a simpler way to structure the frame? I'm new to R, so any and all help is much appreciated.
Thanks!
Using a for loop over the unique industries:
IndGraph <- vector()
for(i in unique(CovidFake$Industry)){
  IndGraph <- rbind(IndGraph, MainFunction(CovidFake, i))
}
Output:
> IndGraph
# A tibble: 128 x 5
Date DailyCases CumSum Industry SevenDayAve
<date> <dbl> <dbl> <chr> <dbl>
1 2020-10-01 0 0 Minor or Student NA
2 2020-10-02 0 0 Minor or Student NA
3 2020-10-03 1 1 Minor or Student NA
4 2020-10-04 0 1 Minor or Student NA
5 2020-10-05 0 1 Minor or Student NA
6 2020-10-06 0 1 Minor or Student NA
7 2020-10-07 1 2 Minor or Student 0.286
8 2020-10-08 1 3 Minor or Student 0.429
9 2020-10-09 2 5 Minor or Student 0.714
10 2020-10-10 0 5 Minor or Student 0.571
# ... with 118 more rows
One option would be:
do.call("rbind", lapply(unique(CovidFake$Industry), FUN = function(x, y = CovidFake) MainFunction(y, x)))

How do I infill non-adjacent rows with sample data from previous rows in R?

I have data containing a unique identifier, a category, and a description.
Below is a toy dataset.
prjnumber <- 1:10
category <- c("based", "trill", "lit", "cold", NA, "epic", NA, NA, NA, NA)
description <- c("skip class", "dunk on brayden", "record deal",
                 "fame and fortune", NA, "female attention", NA, NA, NA, NA)
toy.df <- data.frame(prjnumber, category, description)
> toy.df
prjnumber category description
1 1 based skip class
2 2 trill dunk on brayden
3 3 lit record deal
4 4 cold fame and fortune
5 5 <NA> <NA>
6 6 epic female attention
7 7 <NA> <NA>
8 8 <NA> <NA>
9 9 <NA> <NA>
10 10 <NA> <NA>
I want to randomly sample the 'category' and 'description' columns from rows that have been filled in to use as infill for rows with missing data.
The final data frame would be complete and would only rely on the initial 5 rows which contain data. The solution would preserve between-column correlation.
An expected output would be:
> toy.df
prjnumber category description
1 1 based skip class
2 2 trill dunk on brayden
3 3 lit record deal
4 4 cold fame and fortune
5 5 lit record deal
6 6 epic female attention
7 7 based skip class
8 8 based skip class
9 9 lit record deal
10 10 trill dunk on brayden
complete <- na.omit(toy.df)
toy.df[is.na(toy.df$category), c("category", "description")] <-
  complete[sample(1:nrow(complete), size = sum(is.na(toy.df$category)), replace = TRUE),
           c("category", "description")]
toy.df
# prjnumber category description
# 1 1 based skip class
# 2 2 trill dunk on brayden
# 3 3 lit record deal
# 4 4 cold fame and fortune
# 5 5 lit record deal
# 6 6 epic female attention
# 7 7 cold fame and fortune
# 8 8 based skip class
# 9 9 epic female attention
# 10 10 epic female attention
Though it would seem a little more straightforward if you didn't start with the unique identifiers filled out for the NA rows...
You could try
library(dplyr)
toy.df %>%
  mutate_each(funs(replace(., is.na(.), sample(.[!is.na(.)]))), 2:3)
Based on new information, we may need a numeric index to use in the funs.
toy.df %>%
  mutate(indx = replace(row_number(), is.na(category),
                        sample(row_number()[!is.na(category)], replace = TRUE))) %>%
  mutate_each(funs(.[indx]), 2:3) %>%
  select(-indx)
Using base R to fill in a single field at a time (not preserving the correlation between the fields), use something like:
fields <- c('category', 'description')
for(field in fields){
  missings <- is.na(toy.df[[field]])
  toy.df[[field]][missings] <- sample(toy.df[[field]][!missings], sum(missings), TRUE)
}
and to fill them in simultaneously (preserving the correlation between the fields) use something like:
missings <- apply(toy.df[, fields], 1, function(x) any(is.na(x)))
toy.df[missings, fields] <- toy.df[!missings, fields][sample(sum(!missings),
                                                             sum(missings),
                                                             TRUE), ]
and of course, to avoid the implicit for loop in the apply(x, 1, fun), you could use:
rowAny <- function(x) rowSums(x) > 0
missings <- rowAny(is.na(toy.df[, fields]))  # rowSums needs the logical is.na() matrix
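The same correlation-preserving infill can also be sketched with dplyr (assuming dplyr >= 1.0 and character, not factor, columns): draw one donor row index per row and reuse it for both fields.
library(dplyr)
donors <- filter(toy.df, !is.na(category))   # rows with data to sample from
toy.df <- toy.df %>%
  mutate(draw = sample(nrow(donors), n(), replace = TRUE),
         # the same draw indexes both columns, so the pairing is preserved
         category    = coalesce(category, donors$category[draw]),
         description = coalesce(description, donors$description[draw])) %>%
  select(-draw)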

R - group_by utilizing splinefun

I am trying to group my data by Year and CountyID, then use splinefun (cubic spline interpolation) on the subset data. I am open to ideas; however, the splinefun is a must and cannot be changed.
Here is the code I am trying to use:
age <- seq(from = 0, by = 5, length.out = 18)
TOT_POP <- df %.%
  group_by(unique(df$Year), unique(df$CountyID)) %.%
  splinefun(age, c(0, cumsum(df$TOT_POP)), method = "hyman")
Here is a sample of my data (Year = 2010:2013, Agegrp = 1:17, and CountyID covers every county in the US):
CountyID Year Agegrp TOT_POP
1001 2010 1 3586
1001 2010 2 3952
1001 2010 3 4282
1001 2010 4 4136
1001 2010 5 3154
What I am doing is taking Agegrp 1:17, where each group represents 5 years of age, and splitting it into individual ages 0-84. splinefun allows me to do this while providing a level of mathematical rigour, i.e., it provides a population total for each single year of age in each individual county in the US.
Lastly, the splinefun code by itself does work, but within the group_by function it does not; it produces:
Error: wrong result size (4), expected 68 or 1.
The splinefun code the way I am using it works like this:
TOT_POP <- splinefun(age, c(0, cumsum(df$TOT_POP)), method = "hyman")
TOT_POP <- pmax(0, diff(TOT_POP(c(0:85))))
This was tested on one CountyID during one Year. I need to iterate this process over "x" number of years and roughly 3200 counties.
# Reproducible data set
set.seed(22)
df = data.frame(CountyID = rep(1001:1005, each = 100),
                Year = rep(2001:2010, each = 10),
                Agegrp = sample(1:17, 500, replace = TRUE),
                TOT_POP = rnorm(500, 10000, 2000))

# Convert Agegrp to age
df$Agegrp = df$Agegrp * 5
colnames(df)[3] = "age"

# Make a spline function for every CountyID-Year combination
split.dfs = split(df, interaction(df$CountyID, df$Year))
spline.funs = lapply(split.dfs, function(x) splinefun(x[, "age"], x[, "TOT_POP"]))

# Use the spline functions to interpolate populations for all ages between 0 and 85
new.split.dfs = list()
for(i in 1:length(split.dfs)) {
  new.split.dfs[[i]] = data.frame(CountyID = split.dfs[[i]]$CountyID[1],
                                  Year = split.dfs[[i]]$Year[1],
                                  age = 0:85,
                                  TOT_POP = spline.funs[[i]](0:85))
}
# Does this do what you want? If so, then it will be
# easier for others to work from here
# > head(new.split.dfs[[1]])
# CountyID Year age TOT_POP
# 1 1001 2001 0 909033.4
# 2 1001 2001 1 833999.8
# 3 1001 2001 2 763181.8
# 4 1001 2001 3 696460.2
# 5 1001 2001 4 633716.0
# 6 1001 2001 5 574829.9
# > tail(new.split.dfs[[2]])
# CountyID Year age TOT_POP
# 81 1002 2001 80 10201.693
# 82 1002 2001 81 9529.030
# 83 1002 2001 82 8768.306
# 84 1002 2001 83 7916.070
# 85 1002 2001 84 6968.874
# 86 1002 2001 85 5923.268
First, I believe I was using the wrong wording in what I was trying to achieve, my apologies; group_by actually wasn't going to solve the issue. However, I was able to solve the problem using two functions and ddply. Here is the code that solved the issue:
library(plyr)

interpolate <- function(x, ageVector){
  result <- splinefun(ageVector, c(0, cumsum(x)), method = "hyman")
  diff(result(c(0:85)))
}

mainFunc <- function(df){
  age <- seq(from = 0, by = 5, length.out = 18)
  colNames <- setdiff(colnames(df),
                      c("Year", "CountyID", "AgeGrp"))
  colWiseSpline <- colwise(interpolate, .cols = true,
                           age)(df[, colNames])
  cbind(data.frame(Year = df$Year[1],
                   County = df$CountyID[1],
                   Agegrp = 0:84),
        colWiseSpline)
}

CompleteMainRaw <- ddply(.data = df,
                         .variables = .(CountyID, Year),
                         .fun = mainFunc)
The code now takes each county by year and runs the splinefun on that subset of the population data. At the same time it creates a data.frame with the results, i.e., it splits the data from 17 age groups into 85 single years of age while apportioning the totals appropriately, which is what splinefun does.
Thanks!
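For reference, a sketch of the same per-group interpolation with current dplyr (assuming dplyr >= 0.8, the question's column names, and that every CountyID-Year group has all 17 age groups with non-negative counts, so the cumulative series is monotone as "hyman" requires):
library(dplyr)
age_breaks <- seq(from = 0, by = 5, length.out = 18)  # 0, 5, ..., 85
result <- df %>%
  arrange(CountyID, Year, Agegrp) %>%
  group_by(CountyID, Year) %>%
  group_modify(~ {
    # spline the cumulative population, then first-difference it into single ages
    f <- splinefun(age_breaks, c(0, cumsum(.x$TOT_POP)), method = "hyman")
    data.frame(Agegrp = 0:84, TOT_POP = pmax(0, diff(f(0:85))))
  }) %>%
  ungroup()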

Creating a long table from a wide table using merged.stack (or reshape)

I have a data frame that looks like this:
ID rd_test_2011 rd_score_2011 mt_test_2011 mt_score_2011 rd_test_2012 rd_score_2012 mt_test_2012 mt_score_2012
1 A 80 XX 100 NA NA BB 45
2 XX 90 NA NA AA 80 XX 80
I want to write a script that would, for IDs that don't have NA's in the yy_test_20xx columns, create a new data frame with the subject and year taken from the column title, plus the test name and the test score. So, in this example, ID 1 would have three entries. Expected output would look like this:
ID Subject Test Score Year
1 rd A 80 2011
1 mt XX 100 2011
1 mt BB 45 2012
2 rd XX 90 2011
2 rd AA 80 2012
2 mt XX 80 2012
I've tried both reshape and various forms of merged.stack, which "work" in the sense that I get output that is on the road to being right, but I can't understand the inputs well enough to get all the way there:
library(splitstackshape)
merged.stack(x, id.vars='id', var.stubs=c("rd_test","mt_test"), sep="_")
I've had more success (gotten closer) with reshape:
y <- reshape(x, idvar = "id", ids = 1:nrow(x),
             times = grep("test", names(x), value = TRUE), timevar = "year",
             varying = list(grep("test", names(x), value = TRUE),
                            grep("score", names(x), value = TRUE)),
             direction = "long", v.names = c("test", "score"),
             new.row.names = NULL)
This will get your data into the right format:
df.long = reshape(df, idvar = "ID", ids = 1:nrow(df),
                  times = grep("Test", names(df), value = TRUE), timevar = "Year",
                  varying = list(grep("Test", names(df), value = TRUE),
                                 grep("Score", names(df), value = TRUE)),
                  direction = "long", v.names = c("Test", "Score"),
                  new.row.names = NULL)
Then omitting NA:
df.long = df.long[!is.na(df.long$Test),]
Then splitting Year to remove Test_:
df.long$Year = sapply(strsplit(df.long$Year, "_"), `[`, 2)
And ordering by ID:
df.long[order(df.long$ID),]
ID Year Test Score
1 1 2011 A 80
5 1 2012 XX 100
2 2 2011 XX 90
9 2 2013 AA 80
6 3 2012 A 10
3 4 2011 A 50
7 4 2012 XX 60
10 4 2013 AA 99
4 5 2011 C 50
8 5 2012 A 75
Using reshape:
dat.long <- reshape(dat, direction = "long",
                    varying = list(c(2, 4, 6), c(3, 5, 7)),
                    times = 2011:2013, timevar = "Year",
                    sep = "_", v.names = c("Test", "Score"))
dat.long[complete.cases(dat.long), ]
ID Year Test Score id
1.2011 1 2011 A 80 1
2.2011 2 2011 XX 90 2
4.2011 4 2011 A 50 4
5.2011 5 2011 C 50 5
1.2012 1 2012 XX 100 1
3.2012 3 2012 A 10 3
4.2012 4 2012 XX 60 4
5.2012 5 2012 A 75 5
2.2013 2 2013 AA 80 2
4.2013 4 2013 AA 99 4
Considering your update, I've entirely rewritten this answer. View the history if you want to see the old version.
The main problem is that your data is "double wide", in a way. Thus, you can actually solve your problem by reshaping in the "long" direction twice. Alternatively, use melt and *cast to melt your data into a very long format and then convert that to a semi-wide format.
However, I would still suggest "splitstackshape" (and not just because I wrote it). It can handle this problem fine, but it needs you to rearrange the names of your data. The part of the name that will result in the names of the new columns should come first. In your example, that means "test" and "score" should be the first part of the variable name.
For this, we can use some gsub to rearrange the existing names.
library(splitstackshape)
setnames(mydf, gsub("(rd|mt)_(score|test)_(.*)", "\\2_\\1_\\3", names(mydf)))
names(mydf)
# [1] "ID" "test_rd_2011" "score_rd_2011" "test_mt_2011"
# [5] "score_mt_2011" "test_rd_2012" "score_rd_2012" "test_mt_2012"
# [9] "score_mt_2012"
out <- merged.stack(mydf, "ID", var.stubs=c("test", "score"), sep="_")
setnames(out, c(".time_1", ".time_2"), c("Subject", "Year"))
out[complete.cases(out), ]
# ID Subject Year test score
# 1: 1 mt 2011 XX 100
# 2: 1 mt 2012 BB 45
# 3: 1 rd 2011 A 80
# 4: 2 mt 2012 XX 80
# 5: 2 rd 2011 XX 90
# 6: 2 rd 2012 AA 80
For the benefit of others, "mydf" in this answer is defined as:
mydf <- structure(list(ID = 1:2, rd_test_2011 = c("A", "XX"),
rd_score_2011 = c(80L, 90L), mt_test_2011 = c("XX", NA),
mt_score_2011 = c(100L, NA), rd_test_2012 = c(NA, "AA"),
rd_score_2012 = c(NA, 80L), mt_test_2012 = c("BB", "XX"),
mt_score_2012 = c(45L, 80L)),
.Names = c("ID", "rd_test_2011", "rd_score_2011", "mt_test_2011",
"mt_score_2011", "rd_test_2012", "rd_score_2012", "mt_test_2012",
"mt_score_2012"), class = "data.frame", row.names = c(NA, -2L))

general lag in time series panel data

I have a dataset akin to this
User Date Value
A 2012-01-01 4
A 2012-01-02 5
A 2012-01-03 6
A 2012-01-04 7
B 2012-01-01 2
B 2012-01-02 3
B 2012-01-03 4
B 2012-01-04 5
I want to create a lag of Value, respecting User.
User Date Value Value.lag
A 2012-01-01 4 NA
A 2012-01-02 5 4
A 2012-01-03 6 5
A 2012-01-04 7 6
B 2012-01-01 2 NA
B 2012-01-02 3 2
B 2012-01-03 4 3
B 2012-01-04 5 4
I've done it very inefficiently in a loop
df$value.lag1 <- NA
levs <- levels(as.factor(df$User))
levs
for (i in 1:length(levs)) {
  temper <- subset(df, User == as.numeric(levs[i]))
  temper <- rbind(NA, temper[-nrow(temper), ])
  df$value.lag1[df$User == as.numeric(as.character(levs[i]))] <- temper
}
But this is very slow. I've looked at using by and tapply, but not figured out how to make them work.
I don't think XTS or TS will work because of the User element.
Any suggestions?
You can use ddply: it cuts a data.frame into pieces and transforms each piece.
d <- data.frame(
  User = rep(LETTERS[1:3], each = 10),
  Date = seq.Date(Sys.Date(), length = 30, by = "day"),
  Value = rep(1:10, 3)
)
library(plyr)
d <- ddply(
  d, .(User), transform,
  # This assumes that the data is sorted
  Value = c(NA, Value[-length(Value)])
)
I think the easiest way, especially considering doing further analysis, is to convert your data frame to the pdata.frame class from the plm package.
After the conversion, the diff() and lag() operators can be used to create panel differences and lags:
library(plm)
df <- pdata.frame(df, index = c("id", "date"))
df <- transform(df, l_value = lag(value, 1))
For a panel without missing obs this is an intuitive solution:
df <- data.frame(id = c(1, 1, 1, 1, 1, 2, 2),
                 date = c(1992, 1993, 1991, 1990, 1994, 1992, 1991),
                 value = c(4.1, 4.5, 3.3, 5.3, 3.0, 3.2, 5.2))
df <- df[with(df, order(id, date)), ]                    # sort by id and then by date
df$l_value <- c(NA, df$value[-length(df$value)])         # new var with data displaced by 1 unit
df$l_value[df$id != c(NA, df$id[-length(df$id)])] <- NA  # NA out rows where current and lagged id differ
df
id date value l_value
4 1 1990 5.3 NA
3 1 1991 3.3 5.3
1 1 1992 4.1 3.3
2 1 1993 4.5 4.1
5 1 1994 3.0 4.5
7 2 1991 5.2 NA
6 2 1992 3.2 5.2
I stumbled over a similar problem and wrote a function.
# df needs to be a structured, balanced panel data set sorted by id and date
# NB: the function deletes the row where the NA value would have been.
df <- data.frame(id = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
                 date = c(1992, 1993, 1991, 1990, 1994, 1992, 1991, 1994, 1990, 1993),
                 value = c(4.1, 4.5, 3.3, 5.3, 3.0, 3.2, 5.2, 5.3, 3.4, 5.6))

# sort panel data set
library(dplyr)
df <- arrange(df, id, date)

# Function arguments:
#   a = df
#   b = colname(s) of the variable(s) that you want to lag
#   q = number of lag years
#   t = colname of the date/time column
retraso <- function(a, b, q, t){
  sto <- max(as.numeric(unique(a[[t]])))
  sta <- min(as.numeric(unique(a[[t]])))
  yo <- a[which(a[[t]] >= (sta + q)), ]
  la <- function(a, d, t, sto, sta){
    ja <- data.frame(a[[d]], a[[t]])
    colnames(ja) <- c(d, t)
    ja <- ja[which(ja[[t]] <= (sto - q)), 1]
    return(ja)
  }
  for (i in 1:length(b)){
    yo[[b[i]]] <- la(a, b[i], t, sto, sta)
  }
  return(yo)
}

# lag df 1 year
df <- retraso(df, "value", 1, "date")
If you don't have gaps in the time variable, do
df %>% group_by(User) %>% mutate(value_lag = lag(value, order_by = Date))
If you do have gaps in the time variable, see this answer:
https://stackoverflow.com/a/26108191/3662288
Similarly, you could use tapply:
# Create data
user <- c(rep('A', 4), rep('B', 4))
date <- rep(seq(as.Date('2012-01-01'), as.Date('2012-01-04'), 1), 2)
value <- c(4:7, 2:5)
df <- data.frame(user, date, value)

# Get lagged values; note x[-length(x)], so the shift stays within each user's subset
df$value.lag <- unlist(tapply(df$value, df$user, function(x) c(NA, x[-length(x)])))
The idea is exactly the same: take value, split it by user, and then run a function on each subset. The unlist brings it back into vector format.
Provided the table is ordered by User and Date, this can be done with zoo. The trick is not to specify an index at this point.
library(zoo)
df <- read.table(text = "User Date Value
A 2012-01-01 4
A 2012-01-02 5
A 2012-01-03 6
A 2012-01-04 7
B 2012-01-01 2
B 2012-01-02 3
B 2012-01-03 4
B 2012-01-04 5", header = TRUE, as.is = TRUE, sep = " ")
out <- zoo(df)
Value.lag <- lag(out, -1)[out$User == lag(out$User)]
res <- merge.zoo(out, Value.lag)
res <- res[, -(4:5)]  # to remove extra columns
User.out Date.out Value.out Value.Value.lag
1 A 2012-01-01 4 <NA>
2 A 2012-01-02 5 4
3 A 2012-01-03 6 5
4 A 2012-01-04 7 6
5 B 2012-01-01 2 <NA>
6 B 2012-01-02 3 2
7 B 2012-01-03 4 3
8 B 2012-01-04 5 4
The collapse package, now available on CRAN, provides the most general C/C++ based solution to (fully identified) panel lags, leads, differences and growth rates / log differences. It has the generic functions flag, fdiff and fgrowth and the associated lag/lead, difference and growth operators L, F, D and G. So to lag a panel dataset, it is sufficient to type:
L(data, n = 1, by = ~ idvar, t = ~ timevar, cols = 4:8)
which means: Compute 1 lag of columns 4 through 8 of data, identified by idvar and timevar. Multiple ID and time-variables can be supplied i.e. ~ id1 + id2, and sequences of lags and leads can also be computed on each column (i.e. n = -1:3 computes one lead and 3 lags). The same thing can also be done more programmatically with flag:
flag(data[4:8], 1, data$idvar, data$timevar)
Both of these options compute in under 1 millisecond on typical datasets (<30,000 obs.). Large-data performance is similar to data.table's shift. Similar programming applies to differences fdiff / D and growth rates fgrowth / G. These functions are all S3 generic and have vector / time-series, matrix / ts-matrix and data.frame methods, as well as plm::pseries, plm::pdata.frame and grouped_df methods. Thus they can be used together with plm classes for panel data and with dplyr.
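For comparison, a data.table sketch of the same grouped lag (assuming data.table is installed; not from the original answers):
library(data.table)
setDT(df)                 # convert in place
setorder(df, User, Date)  # sort so the shift is chronological
df[, Value.lag := shift(Value, 1L), by = User]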
