This is a very simple question, but I haven't been able to find a definitive answer, so I thought I would ask it. I use the plm package for dealing with panel data. I am attempting to use the lag function to lag a variable FORWARD in time (the default is to retrieve the value from the previous period, and I want the value from the NEXT). I found a number of old articles/questions (circa 2009) suggesting that this is possible by using k=-1 as an argument. However, when I attempt this, I get an error.
Sample code:
library(plm)
df<-as.data.frame(matrix(c(1,1,1,2,2,3,20101231,20111231,20121231,20111231,20121231,20121231,50,60,70,120,130,210),nrow=6,ncol=3))
names(df)<-c("individual","date","data")
df$date<-as.Date(as.character(df$date),format="%Y%m%d")
df.plm<-pdata.frame(df,index=c("individual","date"))
Lagging:
lag(df.plm$data,0)
##returns
1-2010-12-31 1-2011-12-31 1-2012-12-31 2-2011-12-31 2-2012-12-31 3-2012-12-31
50 60 70 120 130 210
lag(df.plm$data,1)
##returns
1-2010-12-31 1-2011-12-31 1-2012-12-31 2-2011-12-31 2-2012-12-31 3-2012-12-31
NA 50 60 NA 120 NA
lag(df.plm$data,-1)
##returns
Error in rep(1, ak) : invalid 'times' argument
I've also read that plm.data has replaced pdata.frame for some applications in plm. However, plm.data doesn't seem to work with the lag function at all:
df.plm<-plm.data(df,indexes=c("individual","date"))
lag(df.plm$data,1)
##returns
[1] 50 60 70 120 130 210
attr(,"tsp")
[1] 0 5 1
I would appreciate any help. If anyone has another suggestion for a package to use for lagging, I'm all ears. However, I do love plm because it automagically deals with lagging across multiple individuals and skips gaps in the time series.
EDIT2: Lagging forward (i.e., leading values) is implemented in plm CRAN releases >= 1.6-4.
The functions are lead() and lag() (the latter with a negative integer for leading values).
Watch out for any other attached packages that use the same function names. To be safe, you can refer to the function by its full namespace, e.g., plm::lead.
Examples from ?plm::lead:
# First, create a pdata.frame
data("EmplUK", package = "plm")
Em <- pdata.frame(EmplUK)
# Then extract a series, which becomes additionally a pseries
z <- Em$output
class(z)
# compute negative lags (= leading values)
lag(z, -1)
lead(z, 1) # same as line above
identical(lead(z, 1), lag(z, -1)) # TRUE
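Applied to the pdata.frame built in the question (a small sketch, assuming plm >= 1.6-4 is installed), the call that originally errored now returns the leading values:
lag(df.plm$data, -1)       # value from the NEXT period, within each individual
plm::lead(df.plm$data, 1)  # equivalent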
The collapse package on CRAN has a C++-based function flag, along with the associated lag/lead operators L and F. It supports sequences of lags/leads (positive and negative n values) as well as plm's pseries and pdata.frame classes. Performance-wise it is roughly 100x faster than plm and about 10x faster than data.table (the fastest alternative in R at the time of writing). Example:
library(collapse)
pwlddev <- plm::pdata.frame(wlddev, index = c("iso3c", "year"))
head(flag(pwlddev$LIFEEX, -1:1)) # A sequence of lags and leads
F1 -- L1
ABW-1960 66.074 65.662 NA
ABW-1961 66.444 66.074 65.662
ABW-1962 66.787 66.444 66.074
ABW-1963 67.113 66.787 66.444
ABW-1964 67.435 67.113 66.787
ABW-1965 67.762 67.435 67.113
head(L(pwlddev$LIFEEX, -1:1)) # Same as above
head(L(pwlddev, -1:1, cols = 9:12)) # Computing on columns 9 through 12
iso3c year F1.PCGDP PCGDP L1.PCGDP F1.LIFEEX LIFEEX L1.LIFEEX F1.GINI GINI L1.GINI
ABW-1960 ABW 1960 NA NA NA 66.074 65.662 NA NA NA NA
ABW-1961 ABW 1961 NA NA NA 66.444 66.074 65.662 NA NA NA
ABW-1962 ABW 1962 NA NA NA 66.787 66.444 66.074 NA NA NA
ABW-1963 ABW 1963 NA NA NA 67.113 66.787 66.444 NA NA NA
ABW-1964 ABW 1964 NA NA NA 67.435 67.113 66.787 NA NA NA
ABW-1965 ABW 1965 NA NA NA 67.762 67.435 67.113 NA NA NA
F1.ODA ODA L1.ODA
ABW-1960 NA NA NA
ABW-1961 NA NA NA
ABW-1962 NA NA NA
ABW-1963 NA NA NA
ABW-1964 NA NA NA
ABW-1965 NA NA NA
library(microbenchmark)
library(data.table)
microbenchmark(plm_class = flag(pwlddev),
ad_hoc = flag(wlddev, g = wlddev$iso3c, t = wlddev$year),
data.table = qDT(wlddev)[, shift(.SD), by = iso3c])
Unit: microseconds
expr min lq mean median uq max neval cld
plm_class 462.313 512.5145 1044.839 551.562 637.6875 15913.17 100 a
ad_hoc 443.124 519.6550 1127.363 559.817 701.0545 34174.05 100 a
data.table 7477.316 8070.3785 10126.471 8682.184 10397.1115 33575.18 100 b
I had this same problem and couldn't find a good solution in plm or any other package. ddply was tempting (e.g. s5 = ddply(df, .(country,year), transform, lag=lag(df[, "value-to-lag"], lag=3))), but I couldn't get the NAs in my lagged column to line up properly for lags other than one.
I wrote a brute force solution that iterates over the dataframe row-by-row and populates the lagged column with the appropriate value. It's horrendously slow (437.33s for my 13000x130 dataframe vs. 0.012s for turning it into a pdata.frame and using lag) but it got the job done for me. I thought I would share it here because I couldn't find much information elsewhere on the internet.
In the function below:
df is your dataframe. The function returns df with a new column containing the forward values.
group is the column name of the grouping variable for your panel data. For example, I had longitudinal data on multiple countries, and I used "Country.Name" here.
x is the column you want to generate lagged values from, e.g. "GDP"
forwardx is the (new) column that will contain the forward lags, e.g. "GDP.next.year".
lag is the number of periods into the future. For example, if your data were taken in annual intervals, using lag=5 would set forwardx to the value of x five years later.
add_forward_lag <- function(df, group, x, forwardx, lag) {
  for (i in 1:(nrow(df)-lag)) {
    if (as.character(df[i, group]) == as.character(df[i+lag, group])) {
      # put forward observation in forwardx
      df[i, forwardx] <- df[i+lag, x]
    }
    else {
      # end of group, no forward observation
      df[i, forwardx] <- NA
    }
  }
  # last elem(s) in forwardx are NA
  for (j in ((nrow(df)-lag+1):nrow(df))) {
    df[j, forwardx] <- NA
  }
  return(df)
}
See the sample output below, using the built-in DNase dataset. It doesn't make sense in the context of the dataset, but it lets you see what the columns do.
data(DNase)  # DNase is a built-in dataset, not a package
add_forward_lag(DNase, "Run", "density", "lagged_density", 3)
Grouped Data: density ~ conc | Run
Run conc density lagged_density
1 1 0.04882812 0.017 0.124
2 1 0.04882812 0.018 0.206
3 1 0.19531250 0.121 0.215
4 1 0.19531250 0.124 0.377
5 1 0.39062500 0.206 0.374
6 1 0.39062500 0.215 0.614
7 1 0.78125000 0.377 0.609
8 1 0.78125000 0.374 1.019
9 1 1.56250000 0.614 1.001
10 1 1.56250000 0.609 1.334
11 1 3.12500000 1.019 1.364
12 1 3.12500000 1.001 1.730
13 1 6.25000000 1.334 1.710
14 1 6.25000000 1.364 NA
15 1 12.50000000 1.730 NA
16 1 12.50000000 1.710 NA
17 2 0.04882812 0.045 0.123
18 2 0.04882812 0.050 0.225
19 2 0.19531250 0.137 0.207
Given how long this takes, you may want to use a different approach: backwards-lag all of your other variables.
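For example, here is a minimal sketch of that idea, assuming a pdata.frame as in the first question and illustrative predictor names x1 and x2 (not from the original data): instead of leading the outcome, lag every other variable by one period, which gives the same relative alignment.
library(plm)
# df has columns individual, date, y, x1, x2 (illustrative names)
df.plm <- pdata.frame(df, index = c("individual", "date"))
# Rather than leading y, lag the other variables back one period:
df.plm$x1.lag <- lag(df.plm$x1, 1)
df.plm$x2.lag <- lag(df.plm$x2, 1)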
Related
I am wondering if there is a way to extract the results from the adonis function in the vegan package and save them, possibly with write.table.
I mean something other than printing the results to the console and copy-pasting the R2 value into Excel.
This can be especially useful when running adonis iteratively for multiple combinations and saving the result objects into one list, as suggested in this SO answer.
Here is an example of how you can extract the needed parameters from the model. I will use the linked example:
library(vegan)
data(dune)
data(dune.env)
lapply is used here instead of a loop:
results <- lapply(colnames(dune.env), function(x){
  form <- as.formula(paste("dune", x, sep="~"))
  z <- adonis(form, data = dune.env, permutations=99)
  return(as.data.frame(z$aov.tab)) # convert the anova table to a data frame
})
This will produce a list of data frames, each having the form:
> results[[1]]
#output
Df SumsOfSqs MeanSqs F.Model R2 Pr(>F)
A1 1 0.7229518 0.7229518 3.638948 0.1681666 0.01
Residuals 18 3.5760701 0.1986706 NA 0.8318334 NA
Total 19 4.2990219 NA NA 1.0000000 NA
Now you can name the list elements with the appropriate variable:
names(results) <- colnames(dune.env)
Convert to a single data frame:
results <- do.call(rbind, results)
#output
Df SumsOfSqs MeanSqs F.Model R2 Pr(>F)
A1.A1 1 0.7229518 0.7229518 3.638948 0.1681666 0.01
A1.Residuals 18 3.5760701 0.1986706 NA 0.8318334 NA
A1.Total 19 4.2990219 NA NA 1.0000000 NA
Moisture.Moisture 3 1.7281651 0.5760550 3.585140 0.4019903 0.01
Moisture.Residuals 16 2.5708567 0.1606785 NA 0.5980097 NA
Moisture.Total 19 4.2990219 NA NA 1.0000000 NA
Management.Management 3 1.4685918 0.4895306 2.767243 0.3416107 0.01
Management.Residuals 16 2.8304301 0.1769019 NA 0.6583893 NA
Management.Total 19 4.2990219 NA NA 1.0000000 NA
Use.Use 2 0.5531507 0.2765754 1.255190 0.1286690 0.30
Use.Residuals 17 3.7458712 0.2203454 NA 0.8713310 NA
Use.Total 19 4.2990219 NA NA 1.0000000 NA
Manure.Manure 4 1.5238805 0.3809701 2.059193 0.3544714 0.03
Manure.Residuals 15 2.7751414 0.1850094 NA 0.6455286 NA
Manure.Total 19 4.2990219 NA NA 1.0000000 NA
Now you can save it to a CSV or any other format you like:
write.csv(results, "res.csv")
If only the R-squared value is needed, change the lapply call to:
results <- lapply(colnames(dune.env), function(x){
  form <- as.formula(paste("dune", x, sep="~"))
  z <- adonis(form, data = dune.env, permutations=99)
  return(data.frame(name = rownames(z$aov.tab), R2 = z$aov.tab$R2))
})
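As with the full tables above, this list can then be named, stacked into a single table, and saved (the file name is just an example):
names(results) <- colnames(dune.env)
r2_table <- do.call(rbind, results)
write.csv(r2_table, "res_R2.csv")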
I have data as follows:
PERMNO date DLSTCD
10 1983 NA
10 1985 250
10 1986 NA
10 1986 NA
10 1987 240
10 1987 NA
11 1984 NA
11 1984 NA
11 1985 NA
11 1987 NA
12 1984 240
I need to filter rows based on the following criteria:
For each PERMNO, sort the data by date
Parse through the sorted data and delete rows after a company gets delisted (i.e., DLSTCD is not NA)
If the first row corresponds to the company getting delisted, do not include any rows for that company
Based on these criteria, the following is my expected output:
PERMNO date DLSTCD
10 1983 NA
10 1985 250
11 1984 NA
11 1984 NA
11 1985 NA
11 1987 NA
I am using data.table in R to work with this data. The example above is an oversimplified version of my actual data, which contains around 3M rows corresponding to 30k PERMNOs.
I implemented three different methods for doing this, as can be seen here:
r-fiddle: http://www.r-fiddle.org/#/fiddle?id=4GapqSbX&version=3
Below I compare my implementations using a small dataset of 50k rows. Here are my results:
Time Comparison
system.time(dt <- filterbydelistingcode(dt)) # 39.962 seconds
system.time(dt <- filterbydelistcoderowindices(dt)) # 39.014 seconds
system.time(dt <- filterbydelistcodeinline(dt)) # 114.3 seconds
As you can see, all my implementations are extremely inefficient. Can someone please help me implement a much faster version? Thank you.
Edit: Here is a link to a sample dataset of 50k rows that I used for time comparison: https://ufile.io/q9d8u
Also, here is a customized read function for this data:
readdata = function(filename){
  data = read.csv(filename, header=TRUE, colClasses = c(date = "Date"))
  PRCABS = abs(data$PRC)
  mcap = PRCABS * data$SHROUT
  hpr = data$RET
  HPR = as.numeric(levels(hpr))[hpr]
  HPR[HPR==""] = NA
  data = cbind(data, PRCABS, mcap, HPR)
  return(data)
}
data <- readdata('fewdata.csv')
dt <- as.data.table(data)
Here's an attempt in data.table:
dat[
  dat[order(date),
      {
        pos <- match(TRUE, !is.na(DLSTCD));
        (.I <= .I[pos] & pos != 1) | (is.na(pos))
      },
      by=PERMNO]
  $V1]
# PERMNO date DLSTCD
#1: 10 1983 NA
#2: 10 1985 250
#3: 11 1984 NA
#4: 11 1984 NA
#5: 11 1985 NA
#6: 11 1987 NA
Testing it on 2.5 million rows, 400,000 of which have a delisting date:
set.seed(1)
dat <- data.frame(PERMNO=sample(1:22000,2.5e6,replace=TRUE), date=1:2.5e6)
dat$DLSTCD <- NA
dat$DLSTCD[sample(1:2.5e6, 400000)] <- 1
setDT(dat)
system.time({
  dat[
    dat[order(date),
        {
          pos <- match(TRUE, !is.na(DLSTCD));
          (.I <= .I[pos] & pos != 1) | (is.na(pos))
        },
        by=PERMNO]
    $V1]
})
# user system elapsed
# 0.74 0.00 0.76
Less than a second - not bad.
Building on #thelatemail's answer, here are two more variations on the same theme.
In both cases, setkey() first makes things easier to reason with:
setkey(dat,PERMNO,date) # sort by PERMNO, then by date within PERMNO
Option 1 : stack the data you want (if any) from each group
system.time(
  ans1 <- dat[, {
    w = first(which(!is.na(DLSTCD)))
    if (!length(w)) .SD
    else if (w>1) .SD[seq_len(w)]
  }, keyby=PERMNO]
)
user system elapsed
2.604 0.000 2.605
That's quite slow because memory for each group's little piece of the result has to be allocated and populated, only for those pieces to be stacked back into one single result at the end, which takes time and memory.
Option 2 : (closer to the way you phrased the question) find the row numbers to delete, then delete them.
system.time({
  todelete <- dat[, {
    w = first(which(!is.na(DLSTCD)))
    if (length(w)) .I[seq.int(from=if (w==1) 1 else w+1, to=.N)]
  }, keyby=PERMNO]
  ans2 <- dat[ -todelete$V1 ]
})
user system elapsed
0.160 0.000 0.159
That's faster because it only stacks the row numbers to delete, followed by a single bulk operation to remove those rows. Since it groups by the first column of the key, it uses the key to make the grouping faster (groups are contiguous in RAM).
More info about .SD and .I can be found on their manual page (see ?.SD and ?.I).
You can inspect and debug what is happening inside each group just by adding a call to browser() and having a look as follows.
> ans1 <- dat[, {
browser()
w = first(which(!is.na(DLSTCD)))
if (!length(w)) .SD
else if (w>1) .SD[seq_len(w)]
}, keyby=PERMNO]
Browse[1]> .SD # type .SD to look at it
date DLSTCD
1: 21679 NA
2: 46408 1
3: 68378 NA
4: 75362 NA
5: 77690 NA
---
111: 2396559 1
112: 2451629 NA
113: 2461958 NA
114: 2484403 NA
115: 2485217 NA
Browse[1]> w # doesn't exist yet because browser() before that line
Error: object 'w' not found
Browse[1]> w = first(which(!is.na(DLSTCD))) # copy and paste line
Browse[1]> w
[1] 2
Browse[1]> if (!length(w)) .SD else if (w>1) .SD[seq_len(w)]
date DLSTCD
1: 21679 NA
2: 46408 1
Browse[1]> # that is what is returned for this group
Browse[1]> n # or type n to step to next line
debug at #3: w = first(which(!is.na(DLSTCD)))
Browse[2]> help # for browser commands
Let's say you find a problem or bug with one particular PERMNO. You can make the call to browser conditional as follows.
> ans1 <- dat[, {
if (PERMNO==42) browser()
w = first(which(!is.na(DLSTCD)))
if (!length(w)) .SD
else if (w>1) .SD[seq_len(w)]
}, keyby=PERMNO]
Browse[1]> .SD
date DLSTCD
1: 31018 NA
2: 35803 1
3: 37494 NA
4: 50012 NA
5: 52459 NA
---
128: 2405818 NA
129: 2429995 NA
130: 2455519 NA
131: 2478605 1
132: 2497925 NA
Browse[1]>
I have 2 successive zoo time series (the dates of one begin after the other finishes); they have the following form (but much longer and not only NA values):
a:
1979-01-01 1979-01-02 1979-01-03 1979-01-04 1979-01-05 1979-01-06 1979-01-07 1979-01-08 1979-01-09
NA NA NA NA NA NA NA NA NA
b:
1988-08-15 1988-08-16 1988-08-17 1988-08-18 1988-08-19 1988-08-20 1988-08-21 1988-08-22 1988-08-23 1988-08-24 1988-08-25
NA NA NA NA NA NA NA NA NA NA NA
All I want to do is combine them into one time series as a zoo object. It seems to be a basic task, but I am doing something wrong. I use the function merge:
combined <- merge(a, b)
but the result is something in the form:
a b
1980-03-10 NA NA
1980-03-11 NA NA
1980-03-12 NA NA
1980-03-13 NA NA
1980-03-14 NA NA
1980-03-15 NA NA
1980-03-16 NA NA
.
.
which is not a time series, and the lengths don't fit:
> length(a)
[1] 10957
> length(b)
[1] 2557
> length(combined)
[1] 27028
How can I just combine them into one time series with the form of the original ones?
Assuming the series shown reproducibly in the Note at the end, the result of merging the two series has 20 times and 2 columns (one for each series). The individual series have lengths 9 and 11 elements and the merged series is a zoo object with 9 + 11 = 20 rows (since there are no intersecting times) and 2 columns (one for each input) and length 40 (= 20 * 2). Note that the length of a multivariate series is the number of elements in it, not the number of time points.
length(z1)
## [1] 9
length(z2)
## [1] 11
m <- merge(z1, z2)
class(m)
## [1] "zoo"
dim(m)
## [1] 20 2
nrow(m)
## [1] 20
length(index(m))
## [1] 20
length(m)
## [1] 40
If what you wanted is to string them out one after another then use c:
length(c(z1, z2))
## [1] 20
The above are consistent with how merge, c and length work in base R.
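To build the combined series for the original use case, here is a small sketch using the z1 and z2 defined in the Note below:
combined <- c(z1, z2)
class(combined)        # still "zoo"
length(combined)       # 20, i.e., the two input lengths added together
head(index(combined))  # dates from both inputs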
Note:
library(zoo)
z1 <- zoo(rep(NA, 9), as.Date(c("1979-01-01", "1979-01-02", "1979-01-03",
"1979-01-04", "1979-01-05", "1979-01-06", "1979-01-07", "1979-01-08",
"1979-01-09")))
z2 <- zoo(rep(NA, 11), as.Date(c("1988-08-15", "1988-08-16", "1988-08-17",
"1988-08-18", "1988-08-19", "1988-08-20", "1988-08-21", "1988-08-22",
"1988-08-23", "1988-08-24", "1988-08-25")))
I am working with R and the "WGCNA" package. I am doing an integrative analysis of the transcriptome and metabolome.
I have two data.frames, one for the transcriptome data, datExprFemale, and one for the metabolomics data, allTraits, but I am having trouble merging the two data.frames together.
> datExprFemale[1:5, 1:5]
ID gene1 gene2 gene3 gene4
F16 -0.450904880 0.90116800 -2.710879397 0.98942336
F17 -0.304889916 0.70307639 -0.245912838 -0.01089557
F18 0.001696330 0.43059153 -0.177277078 -0.24611398
F19 -0.005428231 0.32838938 0.001070509 -0.31351216
H1 0.183912553 -0.10357460 0.069589703 0.15791036
> allTraits[1:5, 1:5]
IND met1 met2 met3 met4
F15 6546 68465 56465 6548
F17 89916 7639 2838 9557
F20 6330 53 7078 11398
F1 231 938 509 351216
The individuals in allTraits have measurements in datExprFemale, but some individuals in datExprFemale do not occur in allTraits.
Here is what I have tried to merge the two data.frames together:
# First get a vector containing the row names (individual's ID) in datExprFemale
IND=rownames(datExprFemale)
# Get the rows in which two variables have the same individuals
traitRows = match(allTraits$IND, IND)
datTraits = allTraits[traitRows, -1]
This gives me the following:
met1 met2 met3 met4
11 0.0009 0.0559 7.1224 3.3894
12 0.0006 0.0370 10.5776 14.4437
15 0.0011 0.0295 5.7941 19.0225
16 0.0010 0.0531 6.1010 4.7698
17 0.0016 0.0462 7.7819 7.8796
19 0.0011 0.0192 12.7126 9.2564
20 0.0007 0.0502 9.4147 15.3579
21 0.0025 0.0455 8.4129 17.7273
NA NA NA NA NA
NA.1 NA NA NA NA
NA.2 NA NA NA NA
NA.3 NA NA NA NA
NA.4 NA NA NA NA
3 0.0017 0.0375 8.8503 8.7581
7 0.0006 0.0156 7.9272 4.9887
8 0.0011 0.0154 8.4716 8.6515
9 0.0010 0.0306 9.1220 3.5843
As you can see, there are some NA values, but I'm not sure why.
Now when I want to assign the ID of each individual to the corresponding row using the following code:
rownames(datTraits) = allTraits[traitRows, 1]
R gives this error:
Error in `row.names<-.data.frame`(`*tmp*`, value = value) :
duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique values when setting 'row.names':
I'm not sure what I'm doing wrong.
There are a few problems in your code:
In the format you've presented, your datExprFemale does not have rownames, so the match won't work at all.
match is telling you which rows in datExprFemale the individuals in allTraits correspond to, not the rows you need to extract from allTraits (see the small illustration below).
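A tiny illustration of that direction issue, using toy ID vectors (not the original data):
ids_expr  <- c("F16", "F17", "F18")   # like rownames(datExprFemale)
ids_trait <- c("F15", "F17", "F20")   # like allTraits$IND
match(ids_trait, ids_expr)            # NA 2 NA: positions within datExprFemale
which(ids_trait %in% ids_expr)        # 2: rows of allTraits shared with datExprFemale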
Here's the approach I would take:
# First make sure `allTraits` and `datExprFemale` actually have the right rownames
rownames(datExprFemale) = datExprFemale$ID
rownames(allTraits) = allTraits$IND
# Now get the individuals who have both transcriptomic and metabolomic
# measurements
has.both = intersect(rownames(allTraits), rownames(datExprFemale))
# Now pull out the subset of allTraits you want:
allTraits[has.both,]
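A possible follow-up, still assuming the rownames set above: keep only the shared individuals, in the order of the expression data, so the two tables line up row by row.
common    <- intersect(rownames(datExprFemale), rownames(allTraits))
datExpr   <- datExprFemale[common, -1]  # same individuals, same order, ID column dropped
datTraits <- allTraits[common, -1]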
Thanks for your reply. In fact, "datTraits" in the code must be like this:
Insulin_ug_l Glucose_Insulin Leptin_pg_ml Adiponectin Aortic.lesions
F2_3 944 0.42055085 15148.76 14.339 296250
F2_14 632 0.67088608 6188.74 15.439 486313
F2_15 3326 0.16746843 18400.26 11.124 180750
F2_19 426 0.89671362 8438.70 16.842 113000
F2_20 2906 0.15691672 41801.54 13.498 166750
F2_23 920 0.58804348 24133.54 14.511 234000
F2_24 1895 0.24538259 52360.00 13.813 267500
F2_26 7293 0.09090909 126880.00 14.118 198000
F2_37 653 0.65849923 17100.00 12.470 121000
F2_42 1364 0.35703812 99220.00 14.531 110000
in which rows are individuals and columns are metabolites. This variable contains the individuals who are present in both the transcriptomics and metabolomics files.
As for the code, I copied it from the WGCNA tutorial.
Thanks for any suggestions,
Behzad
I have panel data with "entity" and "year". I have a column "x" whose values I treat as a time series. I want to create a new column "xp" where, for each "entity" and each "year", I store the value obtained from a forecast based on the previous 5 years. If there are fewer than 5 previous values available, xp = NA.
For the sake of generality, the forecast is the output of a function built in R from a couple of predefined functions found in packages like "forecast". If it is easier with a specific function, let's use forecast(auto.arima(x.L5:x.L1), h=1).
For now, I use data.table in R because it is so much faster for all the other manipulations I make on my dataset.
However, what I want to do is not data.table 101 and I struggle with it.
I would so much appreciate a bit of your time to help me on that.
Thanks.
Here is an extract of what I would like to do:
entity year x xp
1 1980 21 NA
1 1981 23 NA
1 1982 32 NA
1 1983 36 NA
1 1984 38 NA
1 1985 45 42.3 =f((21,23,32,36,38))
1 1986 50 48.6 =f((23,32,36,38,45))
2 1991 2 NA
2 1992 4 NA
2 1993 6 NA
2 1994 8 NA
2 1995 10 NA
2 1996 12 12.4 =f((2,4,6,8,10))
2 1997 14 13.9 =f((4,6,8,10,12))
...
As suggested by Eddi, I found a way using rollapply:
DT <- data.table(mydata)
DT <- DT[order(entity, year)]
DT[, xp := rollapply(.SD$x, 5, timeseries, align = "right", fill = NA), by = "entity"]
with:
# requires the forecast package for auto.arima() and forecast()
library(forecast)
timeseries <- function(x){
  fit <- auto.arima(x)
  value <- as.data.frame(forecast(fit, h = 1))[1, 1]
  return(value)
}
For a sample of mydata, it works perfectly. However, when I use the whole dataset (150k lines), after some computing time, I get the following error message:
Error in seq.default(start.at,NROW(data),by = by) : wrong sign in 'by' argument
Where does it come from?
Can it come from the "5" parameter in rollapply and from the specifics of certain entities in the dataset (not enough data...)?
Thanks again for your time and help.
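One quick way to check that last suspicion is to count observations per entity (a sketch, assuming the DT built above); entities with fewer than 5 rows cannot supply a full 5-observation window for rollapply:
DT[, .N, by = entity][N < 5]   # entities with fewer than 5 observations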