When a 0 in x occurs at an odd-numbered position among the zeros, how to assign the id values between this zero and the next zero to a new variable ref

x<-c(0,0,1,1,0,1,1,1,0,1,1,0,1,1)
aaa<-data.frame(x)
aaa$id<-1:nrow(aaa)
When a 0 in x occurs at an odd-numbered position among the zeros, I want to assign the id values between this zero and the next zero to a new variable ref. The desired result is the ref column shown below.

A base R option:
aaa$ref <- with(aaa, ifelse(cumsum(x == 0) %% 2, id, NA))
aaa
# x id ref
# 1 0 1 1
# 2 0 2 NA
# 3 1 3 NA
# 4 1 4 NA
# 5 0 5 5
# 6 1 6 6
# 7 1 7 7
# 8 1 8 8
# 9 0 9 NA
# 10 1 10 NA
# 11 1 11 NA
# 12 0 12 12
# 13 1 13 13
# 14 1 14 14
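
To see why this works, here are the intermediate values behind the ifelse() call, computed from the example x above:

cumsum(x == 0)       # 1 2 2 2 3 3 3 3 4 4 4 5 5 5 -> running count of zeros seen so far
cumsum(x == 0) %% 2  # 1 0 0 0 1 1 1 1 0 0 0 1 1 1 -> 1 marks segments that start at an odd-numbered zero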

An option using data.table
library(data.table)
i1 <- setDT(aaa)[, grp := rleid(x)][, .I[seq_len(.N) == .N & x==0], grp]$V1
i2 <- unlist(lapply(split(i1, as.integer(gl(length(i1), 2, length(i1)))),
                    function(x) head(x[1]:x[2], -1)))
aaa[!i2, ref := id][, grp := NULL][]
# x id ref
# 1: 0 1 1
# 2: 0 2 NA
# 3: 1 3 NA
# 4: 1 4 NA
# 5: 0 5 5
# 6: 1 6 6
# 7: 1 7 7
# 8: 1 8 8
# 9: 0 9 NA
#10: 1 10 NA
#11: 1 11 NA
#12: 0 12 12
#13: 1 13 13
#14: 1 14 14
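
For reference, with the example data the intermediate index vectors come out as:

i1  # 2 5 9 12       -> index of the last zero in each run of zeros
i2  # 2 3 4 9 10 11  -> rows left as NA; every other row gets ref = id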

Related

Tracking the first incidence of each episode

I am currently using R to process a data set that looks like the following:
age ep
1 0
2 0
3 1
4 1
5 1
6 1
7 0
8 0
9 1
10 1
11 0
I want to create a variable that will keep track of the first occurrence of ep=1 per series of ep=1. These series will have ep=0 prior to the first ep=1 and ep=0 following the last ep=1 of each series.
I would like the data set to look like this after processing:
age ep first
1 0 NA
2 0 NA
3 1 1
4 1 NA
5 1 NA
6 1 NA
7 0 NA
8 0 NA
9 1 1
10 1 NA
11 0 NA
I am working with data.table as this data set is rather large, so I'd prefer a data.table solution; however, if that isn't possible, I can convert to a data frame and use other code. Any assistance would be greatly appreciated.
A fast data.table method ...
library(data.table)
dt <- fread("age ep
1 0
2 0
3 1
4 1
5 1
6 1
7 0
8 0
9 1
10 1
11 0")
dt[!shift(ep) & ep, first := 1]
# or more explicit:
dt[shift(ep) != 1 & ep == 1, first := 1]
dt
# age ep first
# 1: 1 0 NA
# 2: 2 0 NA
# 3: 3 1 1
# 4: 4 1 NA
# 5: 5 1 NA
# 6: 6 1 NA
# 7: 7 0 NA
# 8: 8 0 NA
# 9: 9 1 1
# 10: 10 1 NA
# 11: 11 0 NA
Note: just for clarity, if your object is not already a data.table, you can coerce it to one with:
setDT(dt)
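One edge case to keep in mind: shift(ep) is NA for the first row, so a series that starts with ep = 1 would not get flagged. A small variation of the answer above with an explicit fill value handles that:
dt[!shift(ep, fill = 0) & ep, first := 1]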
Another option using an update join
dt[, first := dt[dt[, .I[1], by=rleid(ep)]$V1][ep == 1][dt, on=.(age), ep]]
dt
# age ep first
# 1: 1 0 NA
# 2: 2 0 NA
# 3: 3 1 1
# 4: 4 1 NA
# 5: 5 1 NA
# 6: 6 1 NA
# 7: 7 0 NA
# 8: 8 0 NA
# 9: 9 1 1
#10: 10 1 NA
#11: 11 0 NA
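The one-liner is fairly dense; roughly, it does the following, shown step by step (starts is an illustrative name):
starts <- dt[dt[, .I[1], by = rleid(ep)]$V1]  # first row of every run of ep
starts <- starts[ep == 1]                     # keep only the runs of ones
dt[, first := starts[dt, on = .(age), ep]]    # join back on age: matched rows get 1, the rest NA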
Using the data provided by @Khaynes.
An approach using fifelse
dt[, first := fifelse(ep == 1 & shift(ep, type = "lag") == 0L, 1L, NA_integer_)]
dt
# age ep first
# 1: 1 0 NA
# 2: 2 0 NA
# 3: 3 1 1
# 4: 4 1 NA
# 5: 5 1 NA
# 6: 6 1 NA
# 7: 7 0 NA
# 8: 8 0 NA
# 9: 9 1 1
# 10: 10 1 NA
# 11: 11 0 NA
Another update-join version, using mult="first" to overwrite only the first matching row in each group:
dt[, rid := rleid(ep)][dt[ep==1], on=.(rid), mult="first", first := 1]
dt
# age ep rid first
# 1: 1 0 1 NA
# 2: 2 0 1 NA
# 3: 3 1 2 1
# 4: 4 1 2 NA
# 5: 5 1 2 NA
# 6: 6 1 2 NA
# 7: 7 0 3 NA
# 8: 8 0 3 NA
# 9: 9 1 4 1
#10: 10 1 4 NA
#11: 11 0 5 NA
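The helper column rid stays in dt after the update join; drop it again if it is not needed:
dt[, rid := NULL]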

Subsetting panel observations

I have a data.table with firm information.
library(data.table)
DT <- fread("
iso Firm GDP year
A 1 1 1
A 2 1 1
A 3 1 1
A 4 1 1
A 5 3 2
A 6 3 2
A 7 3 2
A 8 3 2
B 9 2 1
B 10 2 1
B 11 2 1
B 12 2 1
B 13 4 1
B 14 4 1
B 15 4 1
B 16 4 1",
header = TRUE)
I want to calculate GDPgrowth (per country) from one year to the next, i.e. (new - old) / old, and add it to the dataset. However, if I do:
DT <- DT[,GDPgrowth :=((GDP- shift(GDP))/shift(GDP)), by=iso]
the outcome is zero for most rows, because consecutive firm observations within the same country-year are subtracted from each other.
How can I make sure it calculates for the whole group of firms belonging to the country together?
Desired output:
library(data.table)
DT <- fread("
iso Firm GDP GDPgrowth year
A 1 1 NA 1
A 2 1 NA 1
A 3 1 NA 1
A 4 1 NA 1
A 5 3 2 2
A 6 3 2 2
A 7 3 2 2
A 8 3 2 2
B 9 2 NA 1
B 10 2 NA 1
B 11 2 NA 1
B 12 2 NA 1
B 13 4 1 1
B 14 4 1 1
B 15 4 1 1
B 16 4 1 1",
header = TRUE)
Here is one way, continuing from your current approach:
library(data.table)
DT[,GDPgrowth :=((GDP- shift(GDP))/shift(GDP)), by=iso]
DT[GDPgrowth == 0, GDPgrowth := NA]
DT[, GDPgrowth:= zoo::na.locf(GDPgrowth, na.rm = FALSE), .(iso, year)]
DT
# iso Firm GDP year GDPgrowth
# 1: A 1 1 1 NA
# 2: A 2 1 1 NA
# 3: A 3 1 1 NA
# 4: A 4 1 1 NA
# 5: A 5 3 2 2
# 6: A 6 3 2 2
# 7: A 7 3 2 2
# 8: A 8 3 2 2
# 9: B 9 2 1 NA
#10: B 10 2 1 NA
#11: B 11 2 1 NA
#12: B 12 2 1 NA
#13: B 13 4 1 1
#14: B 14 4 1 1
#15: B 15 4 1 1
#16: B 16 4 1 1
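Note that this relies on a growth of exactly zero never occurring in the data, since genuine zeros would also be replaced by NA and filled over. An alternative sketch (gdp and GDPgrowth2 are illustrative names): compute growth on one row per country-year and join it back to the firm-level rows.
gdp <- unique(DT[, .(iso, year, GDP)])                          # one row per country-year
setorder(gdp, iso, year)                                        # make sure years are ordered within country
gdp[, GDPgrowth2 := (GDP - shift(GDP)) / shift(GDP), by = iso]
DT[gdp, on = .(iso, year), GDPgrowth2 := i.GDPgrowth2]          # update join back to the firm level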
Using dplyr and tidyr::fill, it can be done as:
library(dplyr)
DT %>%
  group_by(iso) %>%
  mutate(GDPgrowth = (GDP - lag(GDP)) / lag(GDP),
         GDPgrowth = replace(GDPgrowth, GDPgrowth == 0, NA)) %>%
  group_by(iso, year) %>%
  tidyr::fill(GDPgrowth)

Calculate price differences in an unbalanced set

I have an unbalanced data frame with dates, localities and prices. I would like to calculate the price difference between localities on each date. Since my data is unbalanced, my first thought was to create rows for the missing localities to balance it.
My data looks like:
library(dplyr)
set.seed(123)
df <- data.frame(date = 1:3,
                 locality = rbinom(21, 3, 0.2),
                 price = rnorm(21, 50, 20))
df %>%
  arrange(date, locality)
> date locality price
1 1 0 60.07625
2 1 0 35.32994
3 1 0 63.69872
4 1 1 54.76426
5 1 1 66.51080
6 1 1 28.28602
7 1 2 47.09213
8 2 0 26.68910
9 2 1 100.56673
10 2 1 48.88628
11 2 1 48.29153
12 2 2 29.02214
13 2 2 45.68269
14 2 2 43.59887
15 3 0 60.98193
16 3 0 75.89527
17 3 0 43.30174
18 3 0 71.41221
19 3 0 33.62969
20 3 1 34.31236
21 3 1 23.76955
To balance the data, I was thinking of something like this:
> date locality price
1 1 0 60.07625
2 1 0 35.32994
3 1 0 63.69872
4 1 1 54.76426
5 1 1 66.51080
6 1 1 28.28602
7 1 2 47.09213
8 1 2 NA
9 1 2 NA
10 2 0 26.68910
10 2 0 NA
10 2 0 NA
11 2 1 100.56673
12 2 1 48.88628
13 2 1 48.29153
14 2 2 29.02214
15 2 2 45.68269
16 2 2 43.59887
etc...
Finally, to get the price differences between pairs of localities, I want something like:
> date diff(price, 0-1) diff(price, 0-2) diff(price, 1-2)
1 1 60.07625-54.76426 60.07625-47.09213 etc...
2 1 35.32994-66.51080 35.32994-NA
3 1 63.69872-28.28602 63.69872-NA
You don't need to balance your data. If you use dcast, it will add the NAs for you.
First, transform the data so that each locality has its own column:
library(data.table)
library(tidyverse)
setDT(df)
df[, rid := rowid(date, locality)]
df2 <- dcast(df, rid + date ~ locality, value.var = 'price')
# rid date 0 1 2
# 1: 1 1 60.07625 54.76426 47.09213
# 2: 1 2 26.68910 100.56673 29.02214
# 3: 1 3 60.98193 34.31236 NA
# 4: 2 1 35.32994 66.51080 NA
# 5: 2 2 NA 48.88628 45.68269
# 6: 2 3 75.89527 23.76955 NA
# 7: 3 1 63.69872 28.28602 NA
# 8: 3 2 NA 48.29153 43.59887
# 9: 3 3 43.30174 NA NA
# 10: 4 3 71.41221 NA NA
# 11: 5 3 33.62969 NA NA
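The helper rid created by rowid(date, locality) numbers the repeated price observations within each (date, locality) cell, which is why df2 has up to five rows per date here (locality 0 has five prices on date 3). A quick check:
df[, .N, by = .(date, locality)]  # the largest count (5) determines the maximum rid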
Then create a data frame to_diff of the locality pairs to compare, and pmap over it to calculate the differences. Here c0_1 corresponds to what you call diff(price, 0-1) in your question.
to_diff <- CJ(0:2, 0:2)[V1 < V2]
pmap(to_diff, ~ df2[[as.character(.x)]] - df2[[as.character(.y)]]) %>%
  setNames(paste0('c', to_diff[[1]], '_', to_diff[[2]])) %>%
  bind_cols(df2[, 1:2])
# A tibble: 11 x 5
# c0_1 c0_2 c1_2 rid date
# <dbl> <dbl> <dbl> <int> <int>
# 1 5.31 13.0 7.67 1 1
# 2 -73.9 -2.33 71.5 1 2
# 3 26.7 NA NA 1 3
# 4 -31.2 NA NA 2 1
# 5 NA NA 3.20 2 2
# 6 52.1 NA NA 2 3
# 7 35.4 NA NA 3 1
# 8 NA NA 4.69 3 2
# 9 NA NA NA 3 3
# 10 NA NA NA 4 3
# 11 NA NA NA 5 3
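For reference, to_diff simply lists the locality pairs to compare:
to_diff
#    V1 V2
# 1:  0  1
# 2:  0  2
# 3:  1  2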

data.table: Select n specific rows before & after other rows meeting a condition

Given the following example data table:
library(data.table)
DT <- fread("grp y exclude
a 1 0
a 2 0
a 3 0
a 4 1
a 5 0
a 7 1
a 8 0
a 9 0
a 10 0
b 1 0
b 2 0
b 3 0
b 4 1
b 5 0
b 6 1
b 7 1
b 8 0
b 9 0
b 10 0
c 5 1
d 1 0")
I want to select

1. by group grp
2. all rows that have y==5
3. and up to two rows before and after each row from 2. within the grouping,
4. but from 3. only those rows that have exclude==0.
Assuming each group has max one row with y==5, this would yield the desired result for 1.-3.:
idx <- -2:2 # 2 rows before match, the matching row itself, and two rows after match
(row_numbers <- DT[, .I[{
  x <- rep(which(y == 5), each = length(idx)) + idx
  x[x > 0 & x <= .N]
}], by = grp]$V1)
# [1] 3 4 5 6 7 12 13 14 15 16 20
DT[row_numbers]
# grp y exclude
# 1: a 3 0
# 2: a 4 1
# 3: a 5 0 # y==5 + two rows before and two rows after
# 4: a 7 1
# 5: a 8 0
# 6: b 3 0
# 7: b 4 1
# 8: b 5 0 # y==5 + two rows before and two rows after
# 9: b 6 1
# 10: b 7 1
# 11: c 5 1 # y==5 + nothing, because the group has only 1 element
However, how do I incorporate 4. so that I get
# grp y exclude
# 1: a 2 0
# 2: a 3 0
# 3: a 5 0
# 4: a 8 0
# 5: a 9 0
# 6: b 2 0
# 7: b 3 0
# 8: b 5 0
# 9: b 8 0
# 10: b 9 0
# 11: c 5 1
? I feel like I'm close, but I've probably been staring at head() and which() for too long now, so I'd be thankful for some fresh ideas.
A bit more simplified:
DT[DT[, rn := .I][exclude==0 | y==5][, rn[abs(.I - .I[y==5]) <= 2], by=grp]$V1]
# grp y exclude rn
#1: a 2 0 2
#2: a 3 0 3
#3: a 5 0 5
#4: a 8 0 7
#5: a 9 0 8
#6: b 2 0 11
#7: b 3 0 12
#8: b 5 0 14
#9: b 8 0 17
#10: b 9 0 18
#11: c 5 1 20
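Note that this adds the helper column rn to DT by reference (it shows up in the output above); remove it afterwards if you don't want it:
DT[, rn := NULL]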
You are very close. This should do it (idx is the -2:2 window you already defined):
row_numbers <- DT[exclude == 0 | y == 5, .I[{
  x <- rep(which(y == 5), each = length(idx)) + idx
  x[x > 0 & x <= .N]
}], by = grp]$V1
DT[row_numbers]

Shifting row values by lag value in another column

I have a rather large dataset and I am interested in "marching" values forward through time based on values from another column. For example, if I have a Value = 3 at Time = 0 and a DesiredShift = 2, I want the 3 to shift down two rows to be at Time = 2. Here is a reproducible example.
Build reproducible fake data
library(data.table)
set.seed(1)
rowsPerID <- 8
dat <- CJ(1:2, 1:rowsPerID)
setnames(dat, c("ID","Time"))
dat[, Value := rpois(.N, 4)]
dat[, Shift := sample(0:2, size=.N, replace=TRUE)]
Fake Data
# ID Time Value Shift
# 1: 1 1 3 2
# 2: 1 2 3 2
# 3: 1 3 4 1
# 4: 1 4 7 2
# 5: 1 5 2 2
# 6: 1 6 7 0
# 7: 1 7 7 1
# 8: 1 8 5 0
# 9: 2 1 5 0
# 10: 2 2 1 1
# 11: 2 3 2 0
# 12: 2 4 2 1
# 13: 2 5 5 2
# 14: 2 6 3 1
# 15: 2 7 5 1
# 16: 2 8 4 1
I want each Value to shift forward according to the Shift column. So the DesiredOutput column for row 3 will be equal to 3, since the value at Time = 1 is Value = 3 and Shift = 2. Row 4 shows 3 + 4 = 7, since the 3 shifts down 2 rows and the 4 shifts down 1 row.
I would like to be able to do this by ID group and hopefully take advantage of data.table, since speed is of interest for this problem.
Desired Result
# ID Time Value Shift DesiredOutput
# 1: 1 1 3 2 NA
# 2: 1 2 3 2 NA
# 3: 1 3 4 1 3
# 4: 1 4 7 2 3+4 = 7
# 5: 1 5 2 2 NA
# 6: 1 6 7 0 7+7 = 14
# 7: 1 7 7 1 2
# 8: 1 8 5 0 7+5 = 12
# 9: 2 1 5 0 5
# 10: 2 2 1 1 NA
# 11: 2 3 2 0 1+2 = 3
# 12: 2 4 2 1 NA
# 13: 2 5 5 2 2
# 14: 2 6 3 1 NA
# 15: 2 7 5 1 3+5=8
# 16: 2 8 4 1 5
I was hoping to get this working using the data.table::shift function, but I am unsure how to make this work using multiple lag parameters.
Try this:
dat[, TargetIndex:= .I + Shift]
toMerge = dat[, list(Out = sum(Value)), by='TargetIndex']
dat[, TargetIndex:= .I]
# dat = merge(dat, toMerge, by='TargetIndex', all=TRUE)
dat[toMerge, on='TargetIndex', DesiredOutput:= i.Out]
> dat
# ID Time Value Shift TargetIndex DesiredOutput
# 1: 1 1 3 2 1 NA
# 2: 1 2 3 2 2 NA
# 3: 1 3 4 1 3 3
# 4: 1 4 7 2 4 7
# 5: 1 5 2 2 5 NA
# 6: 1 6 7 0 6 14
# 7: 1 7 7 1 7 2
# 8: 1 8 5 0 8 12
# 9: 2 1 5 0 9 5
# 10: 2 2 1 1 10 NA
# 11: 2 3 2 0 11 3
# 12: 2 4 2 1 12 NA
# 13: 2 5 5 2 13 2
# 14: 2 6 3 1 14 NA
# 15: 2 7 5 1 15 8
# 16: 2 8 4 1 16 5
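For reference, the intermediate table toMerge (derived from the fake data above) holds, for every target row, the sum of all values shifted onto it; target index 17 falls outside the 16 rows of dat, so that sum is simply dropped by the update join:
toMerge
#     TargetIndex Out
#  1:           3   3
#  2:           4   7
#  3:           6  14
#  4:           7   2
#  5:           8  12
#  6:           9   5
#  7:          11   3
#  8:          13   2
#  9:          15   8
# 10:          16   5
# 11:          17   4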
