I have the following dataset:
df1 <- data.frame(number = c(1,1,0,0,0,0,0,1,1))
In this dataset I want to create a second column that flags the rows of the first column where the first and second lags equal 0 and the first lead equals 1. If that is the case, a 1 is put in the second column at the row where the change from 0 to 1 occurred; otherwise the value is 44. As a result, all rows in the second column of this output should equal 44 except the 8th.
Here is my code. In the comments below I will put a photo of the required result.
df1$t<-ifelse(df1[,1]==1 & lag(df1[,1]==0,1,default = 44) & lag(df1[,1]==0,2,default = 44)
& lead(df1[,1]==1,1,default = 44)
,1,44)
Although the OP has asked for an explanation of why his code does not return the expected result (which is addressed by Gregor's comment), I would like to propose an alternative approach.
If I understand correctly, the OP wants to find all sub-sequences in df1$number which consist of two zeros followed by two ones, i.e., c(0, 0, 1, 1). Then, the row which contains the first one in the sub-sequence should be marked by a 1 while all other rows should get 44 as default value.
As of version v1.12.0 (on CRAN 13 Jan 2019) of data.table, the shift() function recognizes negative lag/lead parameters. This way, a column can be shifted by multiple values in one batch. The row numbers which fulfill the above condition are identified by a subsequent join operation. Finally, df1 is updated selectively using these row numbers:
# use enhanced sample dataset, rows 10 to 21 added
df1 <- data.frame(number = c(1,1,0,0,0,0,0,1,1,0,1,0,1,1,0,0,1,0,0,1,1))
library(data.table)
setDT(df1)[, t := 44] # coerce to data.table, pre-populate result column
# shift and join
idx <- df1[, shift(number, 2:-1)][.(0, 0, 1, 1), on = paste0("V", 1:4), which = TRUE]
df1[idx, t := 1] # selective update
df1
number t
1: 1 44
2: 1 44
3: 0 44
4: 0 44
5: 0 44
6: 0 44
7: 0 44
8: 1 1
9: 1 44
10: 0 44
11: 1 44
12: 0 44
13: 1 44
14: 1 44
15: 0 44
16: 0 44
17: 1 44
18: 0 44
19: 0 44
20: 1 1
21: 1 44
number t
This works essentially like OP's approach, by shifting and comparing with expected values. However, OP's approach requires coding four comparisons and three shift operations, while here the shifting is done in one step and the comparison of all columns is done simultaneously by the join operation in the second step.
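For comparison, the shift-and-compare idea with four comparisons and three shifts can be sketched in base R as follows. This is only a minimal sketch under my own assumptions: the helper vectors lag1, lag2 and lead1 and the result name flag are made up for illustration, and this is not necessarily the fix suggested in Gregor's comment.

```r
# original 9-row sample from the question
number <- c(1, 1, 0, 0, 0, 0, 0, 1, 1)
n <- length(number)

# hypothetical helper vectors: shifted copies padded with NA
lag1  <- c(NA, number[-n])                  # previous value
lag2  <- c(NA, NA, number[seq_len(n - 2)])  # value two rows back
lead1 <- c(number[-1], NA)                  # next value

# four comparisons: 0, 0 before, 1 in the current row, 1 after
hit <- number == 1 & lag1 == 0 & lag2 == 0 & lead1 == 1

# rows where the pattern cannot be evaluated (NA) get the default 44
flag <- ifelse(!is.na(hit) & hit, 1, 44)
flag
```

Note that the numeric column is shifted first and compared afterwards, and that missing neighbours at the edges are handled explicitly instead of through a default value of 44.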
Additional explanations
The shift operation
df1[, shift(number, 2:-1)]
returns
V1 V2 V3 V4
1: NA NA 1 1
2: NA 1 1 0
3: 1 1 0 0
4: 1 0 0 0
5: 0 0 0 0
6: 0 0 0 0
7: 0 0 0 1
8: 0 0 1 1
9: 0 1 1 0
10: 1 1 0 1
11: 1 0 1 0
12: 0 1 0 1
13: 1 0 1 1
14: 0 1 1 0
15: 1 1 0 0
16: 1 0 0 1
17: 0 0 1 0
18: 0 1 0 0
19: 1 0 0 1
20: 0 0 1 1
21: 0 1 1 NA
V1 V2 V3 V4
In the subsequent join operation,
df1[, shift(number, 2:-1)][.(0, 0, 1, 1), on = paste0("V", 1:4), which = TRUE]
which = TRUE asks for returning only the indices of matching rows which are
[1] 8 20
I have a data set with 2 VPs and 350 interval values for each. I am writing a for loop with an if condition to select the cases where a minimum value of VP1 overlaps with the maximum value of VP2.
The data is usually sorted by VP, but I arranged it to sort by minimum since it is a timeframe.
I ran the following code, which worked to assign 0 or 1 when the values overlap the previous item, but it does not account for what the previous item is (i.e., whether the previous item is VP1 or VP2).
for (i in 2:length(df$newvariable)) {
if (df$minimum[i] < df$maximum[i-1]){
df$newvariable[i] <- 0
} else {
df$newvariable[i] <- 1
}
}
I want to say if df$minimum[i] of VP1 < df$maximum[i] of VP2, then df$newvariable = 0. Otherwise, df$newvariable = 1.
I have not been able to find how to make it conditional per each row and loop again. Does anyone have any recommendations?
Many thanks.
Sample Data:
VP xmin xmax
1 0 6
2 0 2
2 6 14
1 14 24
2 20 30
1 30 36
... And so on for 600 or so rows.
Desired Output:
VP xmin xmax newvariable
1 0 6 -
2 0 2 0
2 6 14 1
1 14 24 1
2 20 30 0
1 30 36 1
Now suppose the dataframe has another variable and I subsetted it to look at only one value of that variable. For example, the variable is talking, with values 1 (yes) or 0 (no). I originally subsetted to look at just 0 and created new variables, like quiet_together. However, I want to put these dataframes back together, and the separate dataframes now have added columns. If I want exactly the same thing as described above but with the dataframe kept together (instead of 2 separate ones), how would I specify this for each value of the variable? I want to end up with two new columns based on the xmin and xmax values while accounting for the value of the talking variable. The new columns would be talk_together (for the value 1 of the talking variable) and quiet_together (for the value 0 of the talking variable), when xmin <= xmax for the previous line.
For example:
Sample Data:
VP xmin xmax talking
1 0 6 0
2 0 2 0
2 2 6 1
2 6 14 0
1 6 14 1
2 14 24 1
1 14 20 0
1 20 30 1
2 24 32 0
1 30 32 0
... And so on for 600 or so rows.
Desired Output:
VP xmin xmax talking talk_together quiet_together
1 0 6 0 0 0
2 0 2 0 0 0
2 2 6 1 0 0
2 6 14 0 0 0
1 6 14 1 0 0
1 14 20 0 0 0
2 14 24 1 1 0
1 20 30 1 1 0
2 24 32 0 0 1
1 30 32 0 0 1
You could use lag from dplyr to compare with previous xmax value.
library(dplyr)
df %>% mutate(newvariable = as.integer(xmin >= lag(xmax)))
# VP xmin xmax newvariable
#1 1 0 6 NA
#2 2 0 2 0
#3 2 6 14 1
#4 1 14 24 1
#5 2 20 30 0
#6 1 30 36 1
Or shift with data.table
library(data.table)
setDT(df)[, newvariable := +(xmin >= shift(xmax))]
Base R alternatives are:
df$newvariable <- as.integer(c(NA, df$xmin[-1] >= df$xmax[-nrow(df)]))
and
df$newvariable <- +c(NA, tail(df$xmin, -1) >= head(df$xmax, -1))
With data.table, we can do
library(data.table)
setDT(df)[, newvariable := as.integer(xmin >= shift(xmax))]
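To check these one-liners against the sample data from the question, here is a small self-contained sketch in base R. The data frame is typed in by hand from the six sample rows, so treat it as an assumption about how the real data is laid out.

```r
# sample data from the question, already sorted by xmin
df <- data.frame(
  VP   = c(1, 2, 2, 1, 2, 1),
  xmin = c(0, 0, 6, 14, 20, 30),
  xmax = c(6, 2, 14, 24, 30, 36)
)

# 1 if the interval starts at or after the end of the previous row, else 0;
# the first row has no previous row, so it gets NA
df$newvariable <- as.integer(c(NA, df$xmin[-1] >= df$xmax[-nrow(df)]))
df
```

This reproduces the desired output (NA, 0, 1, 1, 0, 1) but, like the loop in the question, it compares each row with the previous row regardless of which VP that row belongs to.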
There must be an easy, possibly recursive, solution for the following problem. I would very much appreciate it if anyone could help:
I use data.table and RcppRoll to calculate the weekly sales in qualified weeks within the past 26 weeks for each product. With a window of 26, this works fine, as long as # of current week > 26. However, when # of current week is <= 26, I want to use a window of size 26, 25, ..., and so on.
The formula would be: baseline sales = sum over 26 (or fewer) weeks of sales (before the current week, in qualified weeks only), divided by # of qualified weeks
Here is some code to create test data:
library("data.table")
library("RcppRoll")
products <- seq(1:10) #grouping variable
weeks <- seq(1:100) #weeks
sales <- round(rchisq(1000, 2),0) #sales
countweek <- round(runif(1000, 0,1),0) #1, if qualified weeks
data <- as.data.table(cbind(merge(weeks,products,all=T),sales,countweek))
names(data) <- c("week","product","sales","countweek")
data <- data[order(product,week)]
data[,pastsales:=shift(RcppRoll::roll_sumr(sales*countweek,26L,fill=0),1L,0,"lag"),by=.(product)]
data[,rollweekcount:=shift(RcppRoll::roll_sumr(countweek,26L,fill=0),1L,0,"lag"),by=.(product)]
data[,baseline:=pastsales/rollweekcount]
You can see the break at line 26 (week 26) for product 1. After line 26, I get the desired results:
> data[product == 1]
week product sales countweek pastsales rollweekcount baseline
...
20: 20 1 1 0 0 0 NaN
21: 21 1 2 0 0 0 NaN
22: 22 1 1 1 0 0 NaN
23: 23 1 0 0 0 0 NaN
24: 24 1 3 1 0 0 NaN
25: 25 1 5 1 0 0 NaN
26: 26 1 5 1 0 0 NaN
27: 27 1 1 1 44 13 3.384615
28: 28 1 0 1 45 14 3.214286
29: 29 1 5 0 44 14 3.142857
30: 30 1 0 1 44 14 3.142857
31: 31 1 3 1 44 14 3.142857
32: 32 1 4 0 42 14 3.000000
...
You need an "adaptive" window width. I am not sure about RcppRoll, but the more recent versions of data.table have frollsum, which can do this:
data[, pastsales := shift(frollsum(sales*countweek, pmin(1:.N, 26L), adaptive = TRUE),
1L, 0, "lag"),
by = .(product)]
data[, rollweekcount := shift(frollsum(countweek, pmin(1:.N, 26L), adaptive = TRUE),
1L, 0, "lag"),
by = .(product)]
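To see what the adaptive window does, here is a tiny sketch on a toy vector, with a maximum window of 3 instead of 26. The names x, w, adaptive_sum and check are made up for this illustration; the base R loop is only there as a cross-check.

```r
library(data.table)

x <- c(1, 2, 3, 4, 5)

# window width per position: 1, 2, 3, 3, 3 (grows until it reaches the maximum)
w <- pmin(seq_along(x), 3L)

# with adaptive = TRUE, n gives one window width per element of x
adaptive_sum <- frollsum(x, w, adaptive = TRUE)

# base R cross-check: sum the last w[i] values ending at position i
check <- sapply(seq_along(x), function(i) sum(x[(i - w[i] + 1):i]))

adaptive_sum
check
```

The first positions are summed over the shorter windows instead of being filled with 0 or NA, which is what removes the break at week 26.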
My data frame looks like this
personID t1 t2 t3
1 0 11 0
1 0 11 0
2 0 11 13
2 0 11 13
3 0 0 0
3 0 0 0
I need to make sure that each person has one test score above 10. If they do not, they have to be removed from the data frame. I also want to keep track of the lowest score above 10, and add it to a new column.
Thus, the result would look like this:
personID t1 t2 t3 new
1 0 11 0 11
1 0 11 0 11
2 0 11 13 11
2 0 11 13 11
If I was to go the data.table route, I think you could do it with a melt and join:
library(data.table)
setDT(dat)
dat[
melt(dat, id.vars="personID")[value > 10, .(new=min(value)), by=personID],
on="personID"
]
# personID t1 t2 t3 new
#1: 1 0 11 0 11
#2: 1 0 11 0 11
#3: 2 0 11 13 11
#4: 2 0 11 13 11
using data.table
library(data.table)
#convert your data (named DF here) to use data.table syntax
setDT(DF)
DF[ , {
# all scores within ID that are above 10
v = unlist(.SD)
v = v[v > 10]
# confirm acceptance condition: at least one score above 10
if (length(v))
# add new column (lowest score above 10) by appending it to current data
c(.SD, list(new = min(v)))
}, by = personID]
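If data.table is not an option, the same idea can be sketched in base R. This is a minimal sketch that assumes the score columns are named t1, t2 and t3 as in the sample; dat, scores, row_min and res are names chosen for this illustration.

```r
# sample data from the question
dat <- data.frame(
  personID = c(1, 1, 2, 2, 3, 3),
  t1 = rep(0, 6),
  t2 = c(11, 11, 11, 11, 0, 0),
  t3 = c(0, 0, 13, 13, 0, 0)
)

# scores of 10 or less are set to Inf so they can never be the minimum
scores <- as.matrix(dat[, c("t1", "t2", "t3")])
scores[scores <= 10] <- Inf

# lowest qualifying score per row, then per person
row_min <- apply(scores, 1, min)
new <- ave(row_min, dat$personID, FUN = min)

# persons without any score above 10 still have Inf and are dropped
res <- dat[is.finite(new), ]
res$new <- new[is.finite(new)]
res
```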
I am currently dealing with car data. We recorded the speed of the car every 5 minutes, and the data contains a lot of zero values. My question is: how can I segment the data by the zero values and give each non-zero section an ordered number in R?
Let's take some sample data as an example:
sample <- data.frame(
id = 1:15,
speed = c(50,0, 0, 30, 50, 40,0, 0, 25, 30, 50, 0, 30, 50, 40))
I want to add a new column that gives each non-zero section a number (starting from 1), while a run of k or more consecutive zero speeds is numbered as 0.
Specifically for this sample data, let's say k equals 2, then my desired result should be like this dataframe:
sample_new <- data.frame(
id = 1:15,
speed = c(50,0, 0, 0, 50, 40,0, 0, 25, 30, 50, 0, 30, 50, 40),
number = c(1, 0, 0, 0, 2, 2, 0 ,0, 3, 3, 3, 3, 3, 3, 3))
which prints as
id speed number
1 1 50 1
2 2 0 0
3 3 0 0
4 4 0 0
5 5 50 2
6 6 40 2
7 7 0 0
8 8 0 0
9 9 25 3
10 10 30 3
11 11 50 3
12 12 0 3 <- here is the difference
13 13 30 3
14 14 50 3
15 15 40 3
There are more than 1 million rows in my data, so I hope the solution is acceptably fast.
The reason for setting a threshold "k" is that some drivers just leave their GPS on even when they lock the car and go to sleep. On other occasions, where the stop is shorter than k, they just stopped because of a traffic light at a crossroad. I want to focus on the long stops and ignore the short ones.
I hope my question makes sense to you. Thank you.
As processing speed is a concern for the production data set of more than 1 M rows, I suggest using data.table.
It's quite easy to identify the groups of subsequent non-zero entries:
library(data.table)
setDT(sample)[, number := rleid(speed > 0 ) * (speed > 0)][]
id speed number
1: 1 50 1
2: 2 0 0
3: 3 0 0
4: 4 30 3
5: 5 50 3
6: 6 40 3
7: 7 0 0
8: 8 0 0
9: 9 25 5
10: 10 30 5
11: 11 50 5
12: 12 0 0
13: 13 30 7
14: 14 50 7
15: 15 40 7
The group numbers are distinct but aren't numbered consecutively. If this is a requirement, it gets trickier:
setDT(sample)[, number := as.integer(factor(rleid(speed > 0 ) * (speed > 0), exclude = 0))][]
id speed number
1: 1 50 1
2: 2 0 NA
3: 3 0 NA
4: 4 30 2
5: 5 50 2
6: 6 40 2
7: 7 0 NA
8: 8 0 NA
9: 9 25 3
10: 10 30 3
11: 11 50 3
12: 12 0 NA
13: 13 30 4
14: 14 50 4
15: 15 40 4
If really required, the NAs can be replaced by 0 with
setDT(sample)[, number := as.integer(factor(rleid(speed > 0 ) * (speed > 0), exclude = 0))][
is.na(number), number := 0][]
There is an alternative approach
setDT(sample)[, number := {
tmp <- speed > 0
cumsum(tmp - shift(tmp, fill = 0, type = "lag") > 0) * tmp
}][]
id speed number
1: 1 50 1
2: 2 0 0
3: 3 0 0
4: 4 30 2
5: 5 50 2
6: 6 40 2
7: 7 0 0
8: 8 0 0
9: 9 25 3
10: 10 30 3
11: 11 50 3
12: 12 0 0
13: 13 30 4
14: 14 50 4
15: 15 40 4
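The variants above treat every zero as a break and do not yet use the threshold k. One way to add it, as a rough sketch only: mark a zero run as a long stop when it has at least k rows, and start a new section whenever a row that is not a long stop follows a long stop (or is the first row). The table name trips and the helper columns run and long_stop are hypothetical names introduced here.

```r
library(data.table)

k <- 2L  # minimum length of a zero run that counts as a long stop

trips <- data.table(
  id    = 1:15,
  speed = c(50, 0, 0, 30, 50, 40, 0, 0, 25, 30, 50, 0, 30, 50, 40)
)

# runs of consecutive zero / non-zero speeds
trips[, run := rleid(speed > 0)]

# a row belongs to a long stop if its run is a zero run with at least k rows
trips[, long_stop := speed == 0 & .N >= k, by = run]

# a section starts where a non-stop row follows a long stop (or opens the data);
# short zero runs stay inside the surrounding section
trips[, number := cumsum(!long_stop & shift(long_stop, fill = TRUE)) * (!long_stop)]

trips[, .(id, speed, number)]
```

With the sample data and k = 2, the single zero in row 12 stays in section 3, while the two-row zero runs are numbered 0.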
If there is a sample data set as below.
> tmp <- data.table(x=c(1:10),y=(5:14))
> tmp
x y
1: 1 5
2: 2 6
3: 3 7
4: 4 8
5: 5 9
6: 6 10
7: 7 11
8: 8 12
9: 9 13
10: 10 14
I want to keep the two lowest numbers in each column and change all the other numbers to 0,
like
x y
1: 1 5
2: 2 6
3: 0 0
4: 0 0
5: 0 0
6: 0 0
7: 0 0
8: 0 0
9: 0 0
10: 0 0
I think the code should be
tmp[, c("x","y"):=lapply(.SD, function(x) {x[which(!x %in% sort(x)[1:2])] = 0}), .SDcols=c("x","y")]
but it changes all values to 0.
How can I solve this problem?
To expand on my comment, I'd do something like this:
for (j in names(tmp)) {
col = tmp[[j]]
min_2 = sort.int(unique(col), partial=2L)[2L] # 2nd lowest value
set(tmp, i = which(col > min_2), j = j, value = 0L)
}
This loops over all the columns in tmp, and gets the 2nd minimum value for each column using sort.int with partial argument, which is slightly more efficient than using sort (as we don't have to sort the entire data set to find the 2nd minimum value).
Then we use set() to replace those rows where the column value is greater than the 2nd minimum value, for that column, with the value 0.
Maybe you can try
tmp[, lapply(.SD, function(x) replace(x,
!rank(x, ties.method='first') %in% 1:2, 0))]
# x y
#1: 1 5
#2: 2 6
#3: 0 0
#4: 0 0
#5: 0 0
#6: 0 0
#7: 0 0
#8: 0 0
#9: 0 0
#10: 0 0
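A likely reason why an lapply-based attempt like the OP's turns every value into 0 is that a function whose last expression is an assignment returns the assigned value (here 0), not the modified vector. A minimal sketch of that approach with an explicit return value follows; the helper name keep_two_lowest is made up for illustration.

```r
library(data.table)

tmp <- data.table(x = 1:10, y = 5:14)

# hypothetical helper: zero out everything except the two lowest values
keep_two_lowest <- function(x) {
  x[!x %in% sort(x)[1:2]] <- 0L
  x  # return the modified vector, not the value of the assignment
}

# update both columns by reference
tmp[, c("x", "y") := lapply(.SD, keep_two_lowest), .SDcols = c("x", "y")]
tmp
```

Unlike the rank-based version above, this keeps all rows that tie with one of the two lowest values.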