general lag in time series panel data - r

I have a dataset akin to this
User Date Value
A 2012-01-01 4
A 2012-01-02 5
A 2012-01-03 6
A 2012-01-04 7
B 2012-01-01 2
B 2012-01-02 3
B 2012-01-03 4
B 2012-01-04 5
I want to create a lag of Value, respecting User.
User Date Value Value.lag
A 2012-01-01 4 NA
A 2012-01-02 5 4
A 2012-01-03 6 5
A 2012-01-04 7 6
B 2012-01-01 2 NA
B 2012-01-02 3 2
B 2012-01-03 4 3
B 2012-01-04 5 4
I've done it very inefficiently in a loop
df$value.lag1<-NA
levs<-levels(as.factor(df$User))
levs
for (i in 1:length(levs)) {
temper<- subset(df,User==as.numeric(levs[i]))
temper<- rbind(NA,temper[-nrow(temper),])
df$value.lag1[df$User==as.numeric(as.character(levs[i]))]<- temper
}
But this is very slow. I've looked at using by and tapply, but not figured out how to make them work.
I don't think XTS or TS will work because of the User element.
Any suggestions?

You can use ddply: it cuts a data.frame into pieces and transforms each piece.
d <- data.frame(
User = rep( LETTERS[1:3], each=10 ),
Date = seq.Date( Sys.Date(), length=30, by="day" ),
Value = rep(1:10, 3)
)
library(plyr)
d <- ddply(
d, .(User), transform,
# This assumes that the data is sorted by Date within each User
Value.lag = c( NA, Value[-length(Value)] )
)

I think the easiest way, especially if you plan to do further analysis, is to convert your data frame to the pdata.frame class from the plm package.
After the conversion, the diff() and lag() operators can be used to create panel differences and lags.
library(plm)
df<-pdata.frame(df,index=c("id","date"))
df<-transform(df, l_value=lag(value,1))

For a panel without missing obs this is an intuitive solution:
df <- data.frame(id = c(1, 1, 1, 1, 1, 2, 2),
date = c(1992, 1993, 1991, 1990, 1994, 1992, 1991),
value = c(4.1, 4.5, 3.3, 5.3, 3.0, 3.2, 5.2))
df<-df[with(df, order(id,date)), ] # sort by id and then by date
df$l_value=c(NA,df$value[-length(df$value)]) # create a new var with data displaced by 1 unit
df$l_value[df$id != c(NA, df$id[-length(df$id)])] =NA # NA data with different current and lagged id.
df
id date value l_value
4 1 1990 5.3 NA
3 1 1991 3.3 5.3
1 1 1992 4.1 3.3
2 1 1993 4.5 4.1
5 1 1994 3.0 4.5
7 2 1991 5.2 NA
6 2 1992 3.2 5.2

I stumbled over a similar problem and wrote a function.
#df needs to be a structured balanced paneldata set sorted by id and date
#OBS the function deletes the row where the NA value would have been.
df <- data.frame(id = c(1, 1, 1, 1, 1, 2, 2,2,2,2),
date = c(1992, 1993, 1991, 1990, 1994, 1992, 1991
,1994,1990,1993),
value = c(4.1, 4.5, 3.3, 5.3, 3.0, 3.2, 5.2,5.3,3.4,5.6))
# sort paneldata set
library(dplyr)
df<-arrange(df,id,date)
#Function
# a=df
# b=colname of variable/variables that you want to lag
# q=number of lag years
# t=colname of date/time column
retraso<-function(a,b,q,t){
sto<-max(as.numeric(unique(a[[t]])))
sta<-min(as.numeric(unique(a[[t]])))
yo<-a[which(a[[t]]>=(sta+q)),]
la<-function(a,d,t,sto,sta){
ja<-data.frame(a[[d]],a[[t]])
colnames(ja)<-c(d,t)
ja<-ja[which(ja[[t]]<=(sto-q)),1]
return(ja)
}
for (i in 1:length(b)){
yo[[b[i]]] <-la(a,b[i],t,sto,sta)
}
return(yo)
}
#lag df 1 year
df<-retraso(df,"value",1,"date")

If you don't have gaps in the time variable, do
library(dplyr)
df %>% group_by(User) %>% mutate(value_lag = lag(Value, order_by = Date))
If you do have gaps in the time variable, see this answer
https://stackoverflow.com/a/26108191/3662288
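One way to handle gaps, sketched here in base R and not necessarily what the linked answer does, is to join the data to a copy of itself whose dates are shifted forward by one day. A row then only receives a lag when the previous calendar day actually exists for that user. The example data and the name prev are made up for illustration.

```r
# Example panel with a gap: user A has no row for 2012-01-03
df <- data.frame(
  User  = c("A", "A", "A", "B", "B"),
  Date  = as.Date(c("2012-01-01", "2012-01-02", "2012-01-04",
                    "2012-01-01", "2012-01-02")),
  Value = c(4, 5, 7, 2, 3)
)

# Copy of the data with each date moved one day forward:
# the value observed on day t becomes the lag for day t + 1
prev <- data.frame(User = df$User, Date = df$Date + 1, Value.lag = df$Value)

# Left join on (User, Date); rows without a previous day get NA
out <- merge(df, prev, by = c("User", "Date"), all.x = TRUE)
out <- out[order(out$User, out$Date), ]
out
```

The row for A on 2012-01-04 gets NA instead of the value from 2012-01-02, which is the difference from a purely positional lag.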

Similarly, you could use tapply
# Create Data
user = c(rep('A',4),rep('B',4))
date = rep(seq(as.Date('2012-01-01'),as.Date('2012-01-04'),1),2)
value = c(4:7,2:5)
df = data.frame(user,date,value)
# Get lagged values
df$value.lag = unlist(tapply(df$value, df$user, function(x) c(NA,x[-length(x)])))
The idea is exactly the same: take value, split it by user, and then run a function on each subset. The unlist brings it back into vector format. This relies on the rows already being sorted by user, since tapply returns the groups in factor-level order.
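A closely related base R sketch uses ave(), which applies a function within each group and returns the result in the original row order, so no unlist is needed. The helper name lag1 is an assumption for this illustration, and the rows are assumed to be sorted by date within each user.

```r
df <- data.frame(
  User  = rep(c("A", "B"), each = 4),
  Date  = rep(seq(as.Date("2012-01-01"), by = "day", length.out = 4), 2),
  Value = c(4:7, 2:5)
)

# Make sure each user's rows are in date order before shifting
df <- df[order(df$User, df$Date), ]

# Hypothetical helper: shift a vector down by one position
lag1 <- function(x) c(NA, x[-length(x)])

# ave() runs lag1 separately for every User and keeps row order
df$Value.lag <- ave(df$Value, df$User, FUN = lag1)
df
```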

Provided the table is ordered by User and Date, this can be done with zoo. The trick is not to specify an index at this point.
library(zoo)
df <-read.table(text="User Date Value
A 2012-01-01 4
A 2012-01-02 5
A 2012-01-03 6
A 2012-01-04 7
B 2012-01-01 2
B 2012-01-02 3
B 2012-01-03 4
B 2012-01-04 5", header=TRUE, as.is=TRUE,sep = " ")
out <-zoo(df)
Value.lag <-lag(out,-1)[out$User==lag(out$User)]
res <-merge.zoo(out,Value.lag)
res <-res[,-(4:5)] # to remove extra columns
User.out Date.out Value.out Value.Value.lag
1 A 2012-01-01 4 <NA>
2 A 2012-01-02 5 4
3 A 2012-01-03 6 5
4 A 2012-01-04 7 6
5 B 2012-01-01 2 <NA>
6 B 2012-01-02 3 2
7 B 2012-01-03 4 3
8 B 2012-01-04 5 4

The collapse package now available on CRAN provides the most general C/C++ based solution to (fully-identified) panel-lags, leads, differences and growth rates / log differences. It has the generic functions flag, fdiff and fgrowth and associated lag / lead, difference and growth operators L, F, D and G. So to lag a panel dataset, it is sufficient to type:
L(data, n = 1, by = ~ idvar, t = ~ timevar, cols = 4:8)
which means: Compute 1 lag of columns 4 through 8 of data, identified by idvar and timevar. Multiple ID and time-variables can be supplied i.e. ~ id1 + id2, and sequences of lags and leads can also be computed on each column (i.e. n = -1:3 computes one lead and 3 lags). The same thing can also be done more programmatically with flag:
flag(data[4:8], 1, data$idvar, data$timevar)
Both of these options run in under 1 millisecond on typical datasets (<30,000 obs.). Large data performance is similar to data.table's shift. Similar programming applies to differences fdiff / D and growth rates fgrowth / G. These functions are all S3 generic and have vector / time-series, matrix / ts-matrix, data.frame, as well as plm::pseries and plm::pdata.frame and grouped_df methods. Thus they can be used together with plm classes for panel data and with dplyr.

Related

index a dataframe with repeated values according to vector

I am trying to average values in different months over vectors of dates. Basically, I have a dataframe with monthly values of a variable, and I'm trying to get a representative average of the experienced values for samples that sometimes span month boundaries.
I've ended up with a dataframe of monthly values, and vectors of the representative number of "month-year" combinations of every sampling duration (e.g. if a sample was out from Jan 28, 2000 to Feb 1, 2000, the vector would show 4 values of Jan 2000, 1 value of Feb 2000). Later I'm going to average the values with these weights, so it's important that the returned variable values appear in representative numbers.
I am having trouble figuring out how to index the dataframe pulling the representative value repeatedly. See below.
# data frame of monthly values
reprex_df <-
tribble(
~my, ~value,
"2000-01", 10,
"2000-02", 11,
"2000-03", 15,
"2000-04", 9,
"2000-05", 13
) %>%
as.data.frame()
# vector of month-year dates from Jan 28 to Feb 1:
reprex_vec <- c("2000-01","2000-01","2000-01","2000-01","2000-02")
# I want to index the df using the vector to get a return vector of
# January value*4, Feb value*1, or 10, 10, 10, 10, 11
# I tried this:
reprex_df[reprex_df$my %in% reprex_vec,"value"]
# but %in% only returns each value once ("10 11", not "10 10 10 10 11").
# is there a different way I should be indexing to account for repeated values?
# eventually I will take an average, e.g.:
mean(reprex_df[reprex_df$my %in% reprex_vec,"value"])
# but I want this average to equal 10.2 for mean(c(10,10,10,10,11)), not 10.5 for mean(c(10,11))
Simple tidy solution with inner_join:
dplyr::inner_join(reprex_df, data.frame(my = reprex_vec), by = "my")$value
in base R:
merge(reprex_df, list(my = reprex_vec))
my value
1 2000-01 10
2 2000-01 10
3 2000-01 10
4 2000-01 10
5 2000-02 11
Perhaps use match from base R to get the index
reprex_df[match(reprex_vec, reprex_df$my),]
my value
1 2000-01 10
1.1 2000-01 10
1.2 2000-01 10
1.3 2000-01 10
2 2000-02 11
Another base R option using setNames
with(
reprex_df,
data.frame(
my = reprex_vec,
value = setNames(value, my)[reprex_vec]
)
)
gives
my value
1 2000-01 10
2 2000-01 10
3 2000-01 10
4 2000-01 10
5 2000-02 11
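Since the end goal is a weighted average, here is a small base R sketch that shows both routes: expanding the values with match and averaging, or counting how often each month occurs and passing the counts to weighted.mean. The variable names idx, vals, w and wm are assumptions made for this example.

```r
reprex_df <- data.frame(
  my    = c("2000-01", "2000-02", "2000-03", "2000-04", "2000-05"),
  value = c(10, 11, 15, 9, 13)
)
reprex_vec <- c("2000-01", "2000-01", "2000-01", "2000-01", "2000-02")

# Route 1: look up the row of every element of the vector, repeats included
idx  <- match(reprex_vec, reprex_df$my)
vals <- reprex_df$value[idx]
mean(vals)

# Route 2: count occurrences per month and use them as weights
w  <- table(factor(reprex_vec, levels = reprex_df$my))
wm <- weighted.mean(reprex_df$value, as.numeric(w))
wm
```

Both routes give the same average; the second avoids building the expanded vector, which may matter for long sampling durations.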

Time intervals between resightings of several individuals

In R, I need to calculate several time interval variables between resightings of marked individuals. I have a dataset similar to this:
ID Time Day Month
a 11.15 13 6
a 12.35 13 6
a 10.02 14 6
a 19.30 15 6
a 20.46 15 6
.
.
.
b 11.12 8 7
etc
In which each ID represents a different animal marked for individual recognition, and each row contains the date and time in which it was relocated.
For each individual, I need to calculate the number of days on which it was observed, the mean and standard deviation of the number of relocations per day, and the mean and standard deviation of the days elapsed between relocations (including 0 days between observations on the same day).
Ideally, I need to obtain a data frame such as this:
ID N.Obs N.days mean.Obs.per.Day m.O.D.sd mean.days.elapsed mde.sd
a 27 7 4.2 1.1 1.5 0.5
b 32 5 3.4 0.4 3.2 0.7
c 17 6 4.4 0.2 4.5 1.2
d etc
I've been doing it with the tapply function and transferring the results to Excel, but I am sure there must be relatively simple code that could help me speed up the process in R.
The OP has requested to aggregate 6 statistics per ID. Four of them can be directly aggregated by grouping by ID. Two (mean.Obs.per.Day and m.O.D.sd) need to be grouped by date and ID first.
Unfortunately, the time stamps are split up in three different fields, Time, Day, and Month with the year missing. As four of the statistics are based on dates, we need to construct a Date column which combines Day, Month, and a dummy year.
The code below utilises the data.table and lubridate packages for efficiency.
library(data.table)
# coerce to data.table and add Date column
# (year is left at make_date()'s default, 1970, which serves as the dummy year)
setDT(DF)[, Date := lubridate::make_date(month = Month, day = Day)]
# aggregate by ID,
# use temporary variable to hold the day differences between resightings
agg_per_id <- DF[, {
tmp <- as.numeric(diff(Date))
.(N.Obs = .N, N.days = uniqueN(Date),
mean.days.elapsed = mean(tmp),
mde.sd = sd(tmp))
} , by = ID]
# aggregate by Date and ID
agg_per_day_and_id <- DF[, .N, by = .(ID, Date)][
, .(mean.Obs.per.Day = mean(N), m.O.D.sd = sd(N)), by = ID]
# join partial results
result <- agg_per_day_and_id[agg_per_id, on = "ID"]
# reorder columns (for comparison with expected result)
setcolorder(result, c("ID", "N.Obs", "N.days", "mean.Obs.per.Day",
"m.O.D.sd", "mean.days.elapsed", "mde.sd"))
result
ID N.Obs N.days mean.Obs.per.Day m.O.D.sd mean.days.elapsed mde.sd
1: a 5 3 1.666667 0.5773503 0.5 0.5773503
2: b 1 1 1.000000 NA NaN NA
Note that the figures differ from the expected result of the OP due to different input data.
Data
As far as provided by the OP
DF <- readr::read_table(
"ID Time Day Month
a 11.15 13 6
a 12.35 13 6
a 10.02 14 6
a 19.30 15 6
a 20.46 15 6
b 11.12 8 7"
)
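For readers without data.table or lubridate, the same idea (build a date from Day and Month with a dummy year, then aggregate per ID) can be sketched in base R. The dummy year 2001 and the names DayNo, gaps and res are assumptions for this illustration, and only four of the six statistics are shown.

```r
DF <- data.frame(
  ID    = c("a", "a", "a", "a", "a", "b"),
  Day   = c(13, 13, 14, 15, 15, 8),
  Month = c(6, 6, 6, 6, 6, 7)
)

# Combine Month and Day with a dummy (non-leap) year into a Date,
# then keep it as a plain day number for easy differencing
DF$Date  <- as.Date(sprintf("2001-%02d-%02d", DF$Month, DF$Day))
DF$DayNo <- as.numeric(DF$Date)

# Day differences between consecutive resightings, per individual
gaps <- tapply(DF$DayNo, DF$ID, diff)

res <- data.frame(
  ID                = names(gaps),
  N.Obs             = as.vector(tapply(DF$DayNo, DF$ID, length)),
  N.days            = as.vector(tapply(DF$DayNo, DF$ID,
                                       function(d) length(unique(d)))),
  mean.days.elapsed = sapply(gaps, mean),
  mde.sd            = sapply(gaps, sd)
)
res
```

Individuals with a single sighting have no gaps, so their mean is NaN and their standard deviation is NA, as in the data.table result.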

Using ifelse statement when concatenating elements of a date variable

I am attempting to use two ifelse statements to create a new date variable that makes a series of assumptions to fill in the gaps of an existing date variable. Here is an example of what I mean:
id EffectiveDate EffectiveYear ED_NA EY_NA NewEffectiveDate
1 a 1972-10-05 1972 FALSE FALSE 1972-10-05
2 a <NA> 1985 TRUE FALSE 1985-01-01
3 a 1988-11-12 1988 FALSE FALSE 1988-11-12
4 b 2011-09-05 2011 FALSE FALSE 2011-09-05
5 b <NA> NA TRUE TRUE 2011-09-05
6 b <NA> 2012 TRUE FALSE 2012-01-01
7 c 2012-11-11 2012 FALSE FALSE 2012-11-11
8 c 2013-05-15 2013 FALSE FALSE 2013-05-15
quick code for id:EY_NA =
id <- c("a","a","a","b","b","b","c","c")
EffectiveDate <- c("1972-10-05",NA,"1988-11-12","2011-09-05",NA,NA,"2012-11-11","2013-05-15")
EffectiveYear <- c(1972,1985,1988,2011,NA,2012,2012,2013)
tdat <- data.frame(id, EffectiveDate, EffectiveYear)
tdat$ED_NA <- is.na(tdat$EffectiveDate)
tdat$EY_NA <- is.na(tdat$EffectiveYear)
What I'm trying to create in this example is the "NewEffectiveDate" variable. In plain English, what I want is, where EffectiveDate data are missing BUT EffectiveYear data are not missing, assume NewEffectiveDate is equal to January 1 of the EffectiveYear. If EffectiveDate AND EffectiveYear data are missing, assume the prior observation's EffectiveDate. Last, of course, if EffectiveDate data are not missing, select EffectiveDate.
Here is the latest code I used to attempt to solve the problem:
tdat %>% mutate(NewEffectiveDate = ifelse(ED_NA == 1 & EY_NA == 0,
as.Date(paste(EffectiveYear, 1, 1, sep="-")),
ifelse(ED_NA == 1 & EY_NA == 1),
as.Date(lag(EffectiveDate)),
EffectiveDate
))
When I try this particular code, I get an error message that reads: Error: unused arguments (as.Date(c(NA, 1, NA, 2, 3, NA, NA, 4)), c(1, NA, 2, 3, NA, NA, 4, 5))
I searched for similar questions with queries like "ifelse concatenate date" and some variations thereof, but haven't found anything that seems to apply to this particular problem.
I am very new to R (and CLIs, for that matter), so I apologize in advance if I'm overlooking a perfectly obvious solution. The transition from Excel to R has been interesting, but often painful when it comes to doing what seem like relatively straightforward tasks (though the dplyr package has been tremendously helpful).
id <- c("a","a","a","b","b","b","c","c")
EffectiveDate <- c("1972-10-05",NA,"1988-11-12","2011-09-05",NA,NA,"2012-11-11","2013-05-15")
EffectiveYear <- c(1972,1985,1988,2011,NA,2012,2012,2013)
tdat <- data.frame(id, EffectiveDate, EffectiveYear,
stringsAsFactors=FALSE)
library(dplyr)
library(zoo)
tdat %>%
mutate(NewEffectiveDate = ifelse(!is.na(EffectiveDate),
EffectiveDate,
ifelse(is.na(EffectiveDate) & !is.na(EffectiveYear),
paste0(EffectiveYear, "-01-01"),
NA)),
NewEffectiveDate = as.Date(na.locf(NewEffectiveDate, na.rm = FALSE)))
This should give you what you need. I recommend using na.locf (last observation carried forward) from the zoo package rather than trying to deal with the previous date issue.
You can do
tdat$EffectiveDate <- as.Date(tdat$EffectiveDate)
tdat %>% mutate(NewEffectiveDate = as.Date(
ifelse(!is.na(EffectiveDate), EffectiveDate,
ifelse(!is.na(EffectiveYear), as.Date(paste(EffectiveYear, 1, 1, sep="-")),
lag(EffectiveDate))),
origin = "1970-01-01" # ifelse() returns plain numbers; older R versions need an explicit origin
)) -> res
res
# id EffectiveDate EffectiveYear NewEffectiveDate
# 1 a 1972-10-05 1972 1972-10-05
# 2 a <NA> 1985 1985-01-01
# 3 a 1988-11-12 1988 1988-11-12
# 4 b 2011-09-05 2011 2011-09-05
# 5 b <NA> NA 2011-09-05
# 6 b <NA> 2012 2012-01-01
# 7 c 2012-11-11 2012 2012-11-11
# 8 c 2013-05-15 2013 2013-05-15
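There is a problem with the brackets in your ifelse block: you closed the second ifelse early, so it has no yes or no argument, and the first ifelse received an extra argument.
With the brackets fixed it runs. Note that ifelse drops the Date class, so EffectiveDate should be converted to a Date beforehand and the result converted back with as.Date:
tdat$EffectiveDate <- as.Date(tdat$EffectiveDate)
tdat %>% mutate(NewEffectiveDate = as.Date(ifelse(ED_NA == 1 & EY_NA == 0,
as.Date(paste(EffectiveYear, 1, 1, sep="-")),
ifelse(ED_NA == 1 & EY_NA == 1,
lag(EffectiveDate),
EffectiveDate)), origin = "1970-01-01"))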
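Because ifelse strips the Date class, an alternative worth considering is plain index assignment, which keeps the column a Date throughout. The following is a minimal base R sketch of the three rules from the question; the names new_date and use_year are assumptions for this example.

```r
tdat <- data.frame(
  id = c("a", "a", "a", "b", "b", "b", "c", "c"),
  EffectiveDate = as.Date(c("1972-10-05", NA, "1988-11-12", "2011-09-05",
                            NA, NA, "2012-11-11", "2013-05-15")),
  EffectiveYear = c(1972, 1985, 1988, 2011, NA, 2012, 2012, 2013)
)

# Rule 3: start from the existing dates
new_date <- tdat$EffectiveDate

# Rule 1: date missing but year known -> January 1 of that year
use_year <- is.na(new_date) & !is.na(tdat$EffectiveYear)
new_date[use_year] <- as.Date(paste0(tdat$EffectiveYear[use_year], "-01-01"))

# Rule 2: both missing -> carry the previous row's date forward
for (i in which(is.na(new_date))) {
  if (i > 1) new_date[i] <- new_date[i - 1]
}

tdat$NewEffectiveDate <- new_date
tdat
```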

Grouping and conditions without loop (big data)

I have several observations of the same groups, and for each observation I have a year.
dat = data.frame(group = rep(c("a","b","c"),each = 3), year = c(2000, 1996, 1975, 2002, 2010, 1980, 1990,1986,1995))
group year
1 a 2000
2 a 1996
3 a 1975
4 b 2002
5 b 2010
6 b 1980
7 c 1990
8 c 1986
9 c 1995
For each observation, I would like to know whether another observation of the same group can be found that meets given conditions relative to the focal observation, e.g.: "Is there any other observation (than the focal one) that was made during the last 6 years (counting back from the focal year) in the same group?"
Ideally the dataframe should be like that
group year six_years
1 a 2000 1 # there is another member of group a that is year = 1996 (2000-6 = 1994, this value is inside the threshold)
2 a 1996 0
3 a 1975 0
4 b 2002 0
5 b 2010 0
6 b 1980 0
7 c 1990 1
8 c 1986 0
9 c 1995 1
Basically for each row we should look into the subset of groups, and see if any(dat$year == conditions). It is very easy to do with a for loop, but it's of no use here : the dataframe is massive (several millions of row) and a loop would take forever.
I am searching for an efficient way with vectorized functions or a fast package.
Thanks !
EDITED
Actually, thinking about it, you will probably have a lot of recurring year/group combinations, in which case it is much quicker to pre-calculate the frequencies using count(), which is also a plyr function:
9M rows took ~4sec
require(plyr)
dat <- data.frame(group = sample(c("a","b","c"),size=9000000,replace=TRUE),
year = sample(c(2000, 1996, 1975, 2002, 2010, 1980, 1990,1986,1995),size=9000000,replace=TRUE))
test<-function(y,g,df){
d<-df[df$year>=y-6 &
df$year<y &
df$group== g,]
return(nrow(d))
}
rollup<-function(){
summ<-count(dat) # add a frequency to each combination
return(ddply(summ,.(group,year),transform,t=test(as.numeric(year),group,summ)*freq))
}
system.time(rollup())
user system elapsed
3.44 0.42 3.90
My dataset had too many different groups, and the plyr option proposed by Troy was too slow.
I found a hack (experts would probably say "an ugly one") with package data.table : the idea is to merge the data.table with itself quickly with the fast merge function. It gives every possible combination between a given year of a group and all others years from the same group.
Then proceed with an ifelse for every row with the condition you're looking for.
Finally, aggregate everything with a sum function to know how many times every given years can be found in a given timespan relative to another year.
On my computer, it took a few milliseconds, instead of the probable hours that plyr was going to take
library(data.table)
dat = data.table(group = rep(c("a","b","c"),each = 3), year = c(2000, 1996, 1975, 2002, 2010, 1980, 1990,1986,1995), key = "group")
Produces this :
group year
1 a 2000
2 a 1996
3 a 1975
4 b 2002
5 b 2010
6 b 1980
7 c 1990
8 c 1986
9 c 1995
Then :
z = merge(dat, dat, by = "group", all = T, allow.cartesian = T) # super fast
z$sixyears = ifelse(z$year.y >= z$year.x - 6 & z$year.y < z$year.x, 1, 0) # creates a 0/1 column for our condition
z$sixyears = as.numeric(z$sixyears) # we want to sum this up after
z$year.y = NULL # useless column now
z2 = z[ , list(sixyears = sum(sixyears)), by = list(group, year.x)]
Years with another year of the same group in the last six years are given a "1":
group year x
1 a 1975 0
2 b 1980 0
3 c 1986 0
4 c 1990 1 # e.g. here there is another "c" which was in the timespan 1990 -6 ..
5 c 1995 1 # <== this one. This one too has another reference in the last 6 years, two rows above.
6 a 1996 0
7 a 2000 1
8 b 2002 0
9 b 2010 0
Icing on the cake : it deals with NA seamlessly.
Here's another possibility also using data.table but including diff().
library(data.table)
dat <- data.table(group = rep(c("a","b","c"), each = 3),
year = c(2000, 1996, 1975, 2002, 2010, 1980, 1990,1986,1995),
key = "group")
valid_case <- subset(dat[,list(valid_case = diff(year)), by=key(dat)],
abs(valid_case)<6)
dat$valid_case <- ifelse(dat$group %in% valid_case$group, 1, 0)
I am not sure how this compares in terms of speed or NA handling (I think it should be fine with NAs since they propagate in diff() and abs()), but I certainly find it more readable. Note that it flags every row of a group that contains a qualifying pair of consecutive rows, rather than only the focal rows. Joins are really fast in data.table, but I'd have to think avoiding them altogether helps. There's probably a more idiomatic way to do the condition in the ifelse statement using data.table joins. That could potentially speed things up, although in my experience %in% has never been the limiting factor.
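A join-free variant can also be sketched in base R: after sorting the distinct group/year pairs, the closest earlier year in the same group is simply the previous row, so one vectorized comparison answers the question for every row. This is only a sketch under the question's stated condition (another year in the same group that is earlier by at most 6 years); the names u, prev_year and key are made up here.

```r
dat <- data.frame(
  group = rep(c("a", "b", "c"), each = 3),
  year  = c(2000, 1996, 1975, 2002, 2010, 1980, 1990, 1986, 1995)
)

# Distinct (group, year) pairs, sorted so earlier years come first
u <- unique(dat[order(dat$group, dat$year), ])

# Previous (strictly earlier) year within the same group
prev_year <- ave(u$year, u$group, FUN = function(y) c(NA, y[-length(y)]))

# 1 if the closest earlier year lies within the 6-year window
u$six_years <- as.integer(!is.na(prev_year) & u$year - prev_year <= 6)

# Map the flag back to the original rows without changing their order
key <- paste(dat$group, dat$year, sep = "|")
dat$six_years <- u$six_years[match(key, paste(u$group, u$year, sep = "|"))]
dat
```

Only the nearest earlier year needs checking, because if it falls outside the window every older year does too.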

Reshaping repeated measures data in R wide to long

I need to convert a "wide" dataframe of annually repeated measures on individuals into "long" format so that I can model it like lm(y_year2 ~ x_year1) as well as lm(z_year2 ~ y_year2)
I can get it into the format I want "by hand" but cannot figure out how to melt/dcast it into the shape I want
Below I've illustrated what I'm doing with some simulated data
The dataframe is like this in wide format, one individual per line
ID SITE L_03 M_03 R_03 L_04 M_04 R_04 L_05 M_05 R_05
1 forest X a YES Y b YES Z c NO
2 forest ...
I'd like it in LONG format:
ID SITE L_year1 L_year2 M_year1 M_year2 R_year1 R_year2 year1 year2
1 forest X Y a b YES YES 03 04
1 forest Y Z b c YES NO 04 05
2 forest ...
2 forest ...
Some Simulated data:
L and M are numeric (length & mass), R is a Yes/No factor (reproductive), 3 years of repeated measurements (2003-2005)
ID <- 1:10; SITE <- c(rep("forest",3), rep("swamp",3), rep("field",4))
L_03 <- round(rnorm(10, 100, 1),3) ; M_03 <- round((10 + L_03*0.25 + rnorm(10, 0, 1)), 3)
R_03 <- sample(c("Yes", "No"), 10, replace = TRUE) ; L_04 <- round((2 + L_03*1.25 + rnorm(10, 1,10)), 3)
M_04 <- round((10 + L_04*0.25 + rnorm(10, 0,10)), 3) ;R_04 <- sample(c("Yes", "No"), 10, replace = TRUE)
L_05 <- round((2 + L_04*1.25 + rnorm(10, 1,10)),3) ; M_05 <- round((10 + L_05*0.25 + abs(rnorm(10, 0,10))),3)
R_05 <- sample(c("Yes", "No"), 10, replace = TRUE); rm_data <- data.frame(ID, SITE, L_03, M_03, R_03, L_04, M_04,R_04, L_05, M_05, R_05)
Approach 1: My ad hoc reshaping "by hand" with rbind
1st, make subset with 2003 & 2004 data, then another w/ 2004 & 2005
rm_data1 <- cbind(rm_data[ ,c(1,2,3:5, 6:8)], rep(2003,10), rep(2004,10))
rm_data2 <- cbind(rm_data[ ,c(1,2,6:8, 9:11)],rep(2004,10), rep(2005,10))
names(rm_data1)[3:10]<- c("L1", "M1", "R1", "L2", "M2", "R2", "yr1", "yr2")
names(rm_data2)[3:10]<- c("L1", "M1", "R1", "L2", "M2", "R2", "yr1", "yr2")
data3 <- rbind(rm_data1, rm_data2)
Approach 2?: I'd like to do this with reshape/melt/dcast. I can't figure out if I can use dcast directly on the wide dataframe or, once I melt it, how to dcast it into the format I want.
library(reshape2)
rm_measure_vars <- c("L_03", "M_03", "R_03", "L_04", "M_04","R_04", "L_05", "M_05", "R_05")
rm_data_melt <- melt(data = rm_data, id.vars = c("ID", "SITE"), measure.vars = rm_measure_vars, value.name = "data")
I add a designator of the year the measurement was taken to the melted data
obs_year <- gsub("(.*)([0-9]{2})", "\\2", rm_data_melt$variable)
rm_data_melt <- cbind(rm_data_melt, obs_year)
The dcast seems like it should be something like this, but this is not yet what I need
dcast(data = rm_data_melt, formula = ID + SITE + obs_year ~ variable)
ID SITE obs_year L_03 M_03 R_03 L_04 M_04 R_04 L_05 M_05 R_05
1 1 forest 03 99.96 35.364 No <NA> <NA> <NA> <NA> <NA> <NA>
2 1 forest 04 <NA> <NA> <NA> 129.595 47.256 Yes <NA> <NA> <NA>
3 1 forest 05 <NA> <NA> <NA> <NA> <NA> <NA> 177.607 58.204 Yes
Any suggestions would be greatly appreciated
I gave it some try. The reshape is the easy part. The rest needs some semi-manual handling, I believe. The following should give you what you want.
output <- reshape(rm_data, idvar=c("ID","SITE"), varying=3:11,
v.names=c("L_","M_","R_"), direction="long")
output <- output[order(output$ID, output$time), ] # reshape stacks rows by time; sort so each ID's years are adjacent
output$time <- output$time + 2 # to get the year
names(output)[3:6] <- c("year1", "L_year1", "M_year1", "R_year1")
output$year2 <- output$year1+1
rownames(output) <- c()
sapply(output[,4:6], function(x) {
i <- ncol(output)+1
output[,i] <<- x[c(2:length(x), NA)] # next row = next year of the same ID
names(output)[i] <<- sub("1","2",names(output)[i-4])
})
# drop the last year (it has no following year) and rearrange columns as necessary
output <- output[output$year1 < 5, c(1,2,4,8,5,9,6,10,3,7)]
Hope this helps!
Install the onetree package.
devtools::install_github("yikeshu0611/onetree")
library(onetree)
Three steps, using the onetree package
1st step
reshape the data to long format
long1=reshape_toLong(data = rm_data,
id = "ID",
j = "year",
value.var.prefix = c("L_","M_","R_"))
2nd step
drop year 5 and keep years 3 and 4; duplicate year as y
long2=long1[long1$year!=5,]
long2$y=long2$year
reshape long2 to wide format by year
wide1=reshape_toWide(data = long2,
id = "ID",
j = "year",
value.var.prefix = c("L_","M_","R_","y")
)
Now we have data for year 3 and year 4, which are year1 and year2 in your target data.
So we replace 3 with 1 and 4 with 2 in the column names.
colnames(wide1)=gsub(3,1,colnames(wide1))
colnames(wide1)=gsub(4,2,colnames(wide1))
3rd step
repeat the 2nd step, this time dropping year 3 and keeping years 4 and 5.
long3=long1[long1$year!=3,]
long3$y=long3$year
wide2=reshape_toWide(data = long3,
id = "ID",
j = "year",
value.var.prefix = c("L_","M_","R_","y")
)
colnames(wide2)=gsub(4,1,colnames(wide2))
colnames(wide2)=gsub(5,2,colnames(wide2))
Last step
rbind wide1 and wide2
data=rbind(wide1,wide2)
data[order(data$ID),]
ID SITE L_1 M_1 R_1 y1 L_2 M_2 R_2 y2
1 1 forest 100.181 34.279 Yes 3 131.88 50.953 No 4
11 1 forest 131.88 50.953 No 4 158.642 50.255 No 5
2 2 forest 101.645 36.667 Yes 3 123.923 43.915 No 4
12 2 forest 123.923 43.915 No 4 163.81 55.979 No 5
3 3 forest 98.961 33.901 Yes 3 125.928 41.611 No 4
13 3 forest 125.928 41.611 No 4 165.865 57.417 No 5
4 4 swamp 100.807 36.254 No 3 117.856 48.634 Yes 4
14 4 swamp 117.856 48.634 Yes 4 137.487 50.945 No 5
5 5 swamp 99.75 33.881 No 3 132.419 50.563 Yes 4
15 5 swamp 132.419 50.563 Yes 4 168.461 58.373 Yes 5
6 6 swamp 100.463 34.859 Yes 3 122.884 40.301 No 4
16 6 swamp 122.884 40.301 No 4 152.85 57.491 No 5
7 7 field 102.527 34.521 No 3 123.363 35.935 No 4
17 7 field 123.363 35.935 No 4 168 55.692 No 5
8 8 field 99.957 35.236 Yes 3 139.083 34.793 No 4
18 8 field 139.083 34.793 No 4 177.648 62.638 Yes 5
9 9 field 100.16 36.454 No 3 135.468 45.115 Yes 4
19 9 field 135.468 45.115 Yes 4 180.666 57.233 No 5
10 10 field 100.037 35.612 No 3 139.165 46.95 No 4
20 10 field 139.165 46.95 No 4 169.333 55.782 Yes 5
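The "by hand" approach from the question generalizes to any number of years by looping over consecutive year pairs and stacking the pieces, with no extra package. The sketch below uses a tiny made-up data set with the same column naming scheme; years, vars, pairs and long are names chosen for this example.

```r
# Small stand-in for rm_data with the same column layout
rm_data <- data.frame(
  ID = 1:2, SITE = c("forest", "swamp"),
  L_03 = c(100, 101), M_03 = c(35, 36), R_03 = c("Yes", "No"),
  L_04 = c(127, 128), M_04 = c(42, 43), R_04 = c("No", "No"),
  L_05 = c(160, 161), M_05 = c(50, 51), R_05 = c("Yes", "Yes")
)

years <- c("03", "04", "05")
vars  <- c("L", "M", "R")

# One data frame per pair of consecutive years (03-04, 04-05, ...)
pairs <- lapply(seq_len(length(years) - 1), function(k) {
  y1 <- years[k]
  y2 <- years[k + 1]
  out <- rm_data[, c("ID", "SITE")]
  for (v in vars) {
    out[[paste0(v, "_year1")]] <- rm_data[[paste0(v, "_", y1)]]
    out[[paste0(v, "_year2")]] <- rm_data[[paste0(v, "_", y2)]]
  }
  out$year1 <- y1
  out$year2 <- y2
  out
})

# Stack the pieces and sort so each individual's rows are adjacent
long <- do.call(rbind, pairs)
long <- long[order(long$ID, long$year1), ]
long
```

Each row then holds a year1/year2 pair for one individual, so models such as lm(L_year2 ~ L_year1, data = long) can be fitted directly.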
