I would like to create a variable "Time" that indicates how many times an ID has shown up within each day, minus 1. In other words, the count is lagged by 1: the first time an ID shows up in a day, "Time" should be left blank, and the second time the same ID shows up on that day it should be 1.
Basically, I want to create the "Time" variable in the example below.
ID Day Time Value
 1   1          0
 1   1    1     0
 1   1    2     0
 1   2          0
 1   2    1     0
 1   2    2     0
 1   2    3     1
 2   1          0
 2   1    1     0
 2   1    2     0
Below is the code I am working on. Have not been successful with it.
data$time<-data.frame(data$ID,count=ave(data$ID==data$ID, data$Day, FUN=cumsum))
We can do this with data.table. Convert the 'data.frame' to a 'data.table' with setDT(df1). Grouped by 'ID' and 'Day', we take the lag of the row sequence, shift(seq_len(.N)), and assign it by reference (:=) as the "Time" column. The first row of each group has no previous row, so it becomes NA.
library(data.table)
setDT(df1)[, Time := shift(seq_len(.N)), .(ID, Day)]
df1
# ID Day Value Time
# 1: 1 1 0 NA
# 2: 1 1 0 1
# 3: 1 1 0 2
# 4: 1 2 0 NA
# 5: 1 2 0 1
# 6: 1 2 0 2
# 7: 1 2 1 3
# 8: 2 1 0 NA
# 9: 2 1 0 1
#10: 2 1 0 2
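To see what shift(seq_len(.N)) does inside each group, the lag can be imitated in base R. This is only a sketch: lag1 is a hypothetical helper written for illustration, not a data.table function, and the vectors simply repeat the example data.

```r
# Hypothetical helper: move every element down by one position and pad the
# front with NA (this imitates the default behaviour of data.table::shift)
lag1 <- function(x) c(NA, head(x, -1))

# A group with three rows: positions 1, 2, 3 become NA, 1, 2
one_group <- lag1(seq_len(3))
print(one_group)

# The same idea applied per ID/Day group with ave()
ID  <- c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L)
Day <- c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L)
Time <- ave(ID, ID, Day, FUN = function(x) lag1(seq_along(x)))
print(Time)
```

The first row of every ID/Day group ends up as NA because nothing precedes it.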
Or with base R
with(df1, ave(Day, Day, ID, FUN= function(x)
ifelse(seq_along(x)!=1, seq_along(x)-1, NA)))
#[1] NA 1 2 NA 1 2 3 NA 1 2
Or without the ifelse
with(df1, ave(Day, Day, ID, FUN= function(x)
NA^(seq_along(x)==1)*(seq_along(x)-1)))
#[1] NA 1 2 NA 1 2 3 NA 1 2
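The version without ifelse relies on an arithmetic rule in R: anything raised to the power 0 is 1, so NA^FALSE is 1 while NA^TRUE stays NA. A small sketch of that mask on a single three-row group (is_first is just an illustrative name):

```r
# TRUE marks the first row of a group
is_first <- c(TRUE, FALSE, FALSE)

# NA^TRUE is NA^1, which is NA; NA^FALSE is NA^0, which is 1
mask <- NA^is_first
print(mask)

# Multiplying by the zero-based position blanks out only the first row
lagged_count <- mask * (seq_along(is_first) - 1)
print(lagged_count)
```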
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L),
Day = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L), Value = c(0L,
0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L)), .Names = c("ID", "Day",
"Value"), row.names = c(NA, -10L), class = "data.frame")
Related
Suppose I have the following data frame with an index date and follow up dates with a "1" as a stop indicator. I want to input the date difference in days into the index row and if no stop indicator is present input the number of days from the index date to the last observation:
id date group indicator
1 15-01-2022 1 0
1 15-01-2022 2 0
1 16-01-2022 2 1
1 20-01-2022 2 0
2 18-01-2022 1 0
2 20-01-2022 2 0
2 27-01-2022 2 0
Want:
id date group indicator stoptime
1 15-01-2022 1 0 NA
1 15-01-2022 2 0 NA
1 16-01-2022 2 1 1
1 20-01-2022 2 0 NA
2 18-01-2022 1 0 NA
2 20-01-2022 2 0 NA
2 27-01-2022 2 0 9
Convert 'date' to Date class. Grouped by 'id', find the position of the first 1 in 'indicator' (if none is found, use the last position, n()), then get the difference in days between the 'date' at that position and the first 'date' of the group.
library(dplyr)
library(lubridate)
df1 %>%
mutate(date = dmy(date)) %>%
group_by(id) %>%
mutate(ind = match(1, indicator, nomatch = n()),
stoptime = case_when(row_number() == ind ~
as.integer(difftime(date[ind], first(date), units = "days"))),
ind = NULL) %>%
ungroup
Output:
# A tibble: 7 × 5
id date group indicator stoptime
<int> <date> <int> <int> <int>
1 1 2022-01-15 1 0 NA
2 1 2022-01-15 2 0 NA
3 1 2022-01-16 2 1 1
4 1 2022-01-20 2 0 NA
5 2 2022-01-18 1 0 NA
6 2 2022-01-20 2 0 NA
7 2 2022-01-27 2 0 9
data
df1 <- structure(list(id = c(1L, 1L, 1L, 1L, 2L, 2L, 2L), date = c("15-01-2022",
"15-01-2022", "16-01-2022", "20-01-2022", "18-01-2022", "20-01-2022",
"27-01-2022"), group = c(1L, 2L, 2L, 2L, 1L, 2L, 2L), indicator = c(0L,
0L, 1L, 0L, 0L, 0L, 0L)), class = "data.frame",
row.names = c(NA,
-7L))
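For readers who want to follow the logic without dplyr or lubridate, here is a rough base R sketch of the same steps. The data frame name dat and the loop structure are choices made for this illustration only; the key piece is match() with nomatch set to the group length.

```r
dat <- data.frame(
  id = c(1L, 1L, 1L, 1L, 2L, 2L, 2L),
  date = c("15-01-2022", "15-01-2022", "16-01-2022", "20-01-2022",
           "18-01-2022", "20-01-2022", "27-01-2022"),
  indicator = c(0L, 0L, 1L, 0L, 0L, 0L, 0L)
)

# Day-month-year strings to Date class
dat$date <- as.Date(dat$date, format = "%d-%m-%Y")
dat$stoptime <- NA_integer_

# Loop over the row numbers belonging to each id
for (rows in split(seq_len(nrow(dat)), dat$id)) {
  # Position of the first 1; fall back to the last row of the group
  ind <- match(1L, dat$indicator[rows], nomatch = length(rows))
  stop_row <- rows[ind]
  # Days between the group's first date and the stop row
  dat$stoptime[stop_row] <- as.integer(dat$date[stop_row] - dat$date[rows[1]])
}
print(dat)
```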
I have a data.frame of the following form:
ID Var1
1 1
1 1
1 3
1 4
1 1
1 0
2 2
2 2
2 6
2 7
2 8
2 0
3 0
3 2
3 1
3 3
3 2
3 4
and I would like to get there:
ID Var1 X
1 1 0
1 1 0
1 3 0
1 4 5
1 1 5
1 0 5
2 2 0
2 2 0
2 6 0
2 7 10
2 8 10
2 0 10
3 0 0
3 2 0
3 1 0
3 3 3
3 2 3
3 4 3
So in words: I'd like to calculate the sum of the variable in a window of 3 rows, and then report the result obtained in the previous window. This should happen per ID, so the first three observations of every ID should be returned as 0, as there is no previous period that could be reported.
For understanding: in the actual dataset each row corresponds to one day and the window = 7, so X is supposed to give the sum of Var1 in the previous week.
I have tried using rollapply, but always ended up with an error. Also, if I understand it correctly, that window would be a rolling window, which is specifically not what I need.
Thanks for your answers!
In rollapply, the width argument can be a list which provides the offsets to use. In this case we want to use the points 3, 2 and 1 back for the first point, 4, 3 and 2 back for the second, 5, 4 and 3 back for the third and then recycle. That is, for a window width of k = 3 we would want the following list of offset vectors:
w <- list(-(3:1), -(4:2), -(5:3))
In general we can write w below in terms of the window width k. ave then invokes rollapply with that width list for each ID.
library(zoo)
k <- 3
w <- lapply(1:k, function(x) seq(to = -x, length = k))
transform(DF, X = ave(Var1, ID, FUN = function(x) rollapply(x, w, sum, fill = 0)))
giving:
ID Var1 X
1 1 1 0
2 1 1 0
3 1 3 0
4 1 4 5
5 1 1 5
6 1 0 5
7 2 2 0
8 2 2 0
9 2 6 0
10 2 7 10
11 2 8 10
12 2 0 10
13 3 0 0
14 3 2 0
15 3 1 0
16 3 3 3
17 3 2 3
18 3 4 3
Note
The input DF in reproducible form is:
DF <- structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L), Var1 = c(1L, 1L, 3L, 4L, 1L,
0L, 2L, 2L, 6L, 7L, 8L, 0L, 0L, 2L, 1L, 3L, 2L, 4L)),
class = "data.frame", row.names = c(NA, -18L))
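As a cross-check on what the offset list achieves, the same "sum of the previous block" can be sketched in base R without zoo. The function name block_lag_sum is hypothetical, and the sketch assumes the rows of each ID are already in time order.

```r
# Hypothetical helper: sum x in consecutive blocks of k rows and report,
# for every row, the sum of the block before it (0 for the first block)
block_lag_sum <- function(x, k) {
  block <- (seq_along(x) - 1) %/% k + 1       # 1,1,1,2,2,2,... for k = 3
  sums <- as.vector(tapply(x, block, sum))    # one sum per block
  prev <- c(0, head(sums, -1))                # shift the sums by one block
  prev[block]                                 # expand back to one value per row
}

ID   <- rep(1:2, each = 6)
Var1 <- c(1, 1, 3, 4, 1, 0, 2, 2, 6, 7, 8, 0)
X <- ave(Var1, ID, FUN = function(x) block_lag_sum(x, 3))
print(X)
```

Unlike a rolling window, every row in a block receives the same value here, which is the behaviour the question asks for.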
We could group by 'ID', create a second grouping column with a window size of 3 using gl, then summarise to get the sum of 'Var1' while keeping the original 'Var1' values in a list column, take the lag of 'X' within each 'ID', and unnest.
library(dplyr) #1.0.0
library(tidyr)
df1 %>%
# // grouping by ID
group_by(ID) %>%
# // create another group added with gl
group_by(grp = as.integer(gl(n(), 3, n())), .add = TRUE) %>%
# // get the sum of Var1, while changing the Var1 in a list
summarise(X = sum(Var1), Var1 = list(Var1)) %>%
# // get the lag of X
mutate(X = lag(X, default = 0)) %>%
# // unnest the list column
unnest(c(Var1)) %>%
select(names(df1), X)
# A tibble: 18 x 3
# Groups: ID [3]
# ID Var1 X
# <int> <int> <dbl>
# 1 1 1 0
# 2 1 1 0
# 3 1 3 0
# 4 1 4 5
# 5 1 1 5
# 6 1 0 5
# 7 2 2 0
# 8 2 2 0
# 9 2 6 0
#10 2 7 10
#11 2 8 10
#12 2 0 10
#13 3 0 0
#14 3 2 0
#15 3 1 0
#16 3 3 3
#17 3 2 3
#18 3 4 3
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L), Var1 = c(1L, 1L, 3L, 4L, 1L,
0L, 2L, 2L, 6L, 7L, 8L, 0L, 0L, 2L, 1L, 3L, 2L, 4L)), class = "data.frame",
row.names = c(NA,
-18L))
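The grouping column built with gl can be inspected on its own. The sketch below uses a group of 7 rows, an arbitrary size picked to show what happens when the row count is not a multiple of the window.

```r
n <- 7   # hypothetical number of rows in one ID
k <- 3   # window size

# gl(n, k, n) yields levels 1..n, each repeated k times, truncated to length n
grp <- as.integer(gl(n, k, n))
print(grp)

# The same block index written with integer division
grp_alt <- (seq_len(n) - 1) %/% k + 1
print(grp_alt)
```

The last, incomplete block simply gets its own index, so its partial sum would be used as-is by the summarise step.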
I am quite a beginner in R, but thanks to the Stack Overflow community I am improving!
However, I am stuck with a problem:
I have a dataset with 5 variables; the relevant ones are:
id_house, which is the id of each household
id_ind, which is an id that takes the value 1 for the first individual in the household, 2 for the next, 3 for the third, and so on
indicator_tb_men, which indicates whether the first person has answered the survey (1 = yes, 0 = no). All other members of the household take the value 0.
id_house id_ind indicator_tb_men
1 1 1
1 2 0
2 1 1
3 1 0
3 2 0
3 3 0
4 1 1
5 1 0
I would like to delete all members of households where the first individual has not answered the survey.
So it would give:
id_house id_ind indicator_tb_men
1 1 1
1 2 0
2 1 1
4 1 1
Using dplyr, here is one way:
library(dplyr)
df %>%
arrange(id_house, id_ind) %>%
group_by(id_house) %>%
filter(first(indicator_tb_men) != 0)
# id_house id_ind indicator_tb_men
# <int> <int> <int>
#1 1 1 1
#2 1 2 NA
#3 2 1 1
#4 4 1 1
data
df <- structure(list(id_house = c(1L, 1L, 2L, 3L, 3L, 3L, 4L, 5L),
id_ind = c(1L, 2L, 1L, 1L, 2L, 3L, 1L, 1L), indicator_tb_men = c(1L,
NA, 1L, 0L, NA, NA, 1L, 0L)), class = "data.frame", row.names = c(NA, -8L))
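In base R we can use nested logical indexing: find the households whose first individual answered, then keep every row belonging to those households.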
df[df$id_house %in% df$id_house[df$id_ind == 1 & df$indicator_tb_men == 1],]
id_house id_ind indicator_tb_men
1 1 1 1
2 1 2 NA
3 2 1 1
7 4 1 1
Data: Using Ronak Shah's data
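Another base R variant, sketched here with ave, broadcasts the first individual's answer to every row of the household. It assumes the rows can be sorted so that id_ind == 1 comes first within each id_house; the name first_answer is made up for this example, and the data follow the 0/1 coding from the question.

```r
df <- data.frame(
  id_house = c(1L, 1L, 2L, 3L, 3L, 3L, 4L, 5L),
  id_ind = c(1L, 2L, 1L, 1L, 2L, 3L, 1L, 1L),
  indicator_tb_men = c(1L, 0L, 1L, 0L, 0L, 0L, 1L, 0L)
)

# Make sure the first individual is the first row of each household
df <- df[order(df$id_house, df$id_ind), ]

# Repeat the first row's indicator for all members of the household
first_answer <- ave(df$indicator_tb_men, df$id_house, FUN = function(x) x[1])

# %in% treats a missing answer as "not answered" instead of returning NA
kept <- df[first_answer %in% 1, ]
print(kept)
```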
I have a data frame with two columns (ident and value). I would like to create a counter that restarts every time the ident value changes and also whenever value changes within an ident. Here is an example to make it clear.
# ident value counter
#--------------------
# 1 0 1
# 1 0 2
# 1 1 1
# 1 1 2
# 1 1 3
# 1 0 1
# 1 1 1
# 1 1 2
# 2 1 1
# 2 0 1
# 2 0 2
# 2 0 3
I've tried the plyr package
ddply(mydf, .(ident, value), transform, .id = seq_along(ident))
Same result with the data.table package.
A data.table alternative with the use of the rleid/rowid functions. With rleid you create a run length id for consecutive values, which can be used as a group. 1:.N or rowid can be used to create the counter. The code:
library(data.table)
# option 1:
setDT(d)[, counter := 1:.N, by = .(ident,rleid(value))]
# option 2:
setDT(d)[, counter := rowid(ident, rleid(value))]
which both give:
> d
ident value counter
1: 1 0 1
2: 1 0 2
3: 1 1 1
4: 1 1 2
5: 1 1 3
6: 1 0 1
7: 1 1 1
8: 1 1 2
9: 2 1 1
10: 2 0 1
11: 2 0 2
12: 2 0 3
With dplyr it is a bit less straightforward:
library(dplyr)
d %>%
group_by(ident, val.gr = cumsum(value != lag(value, default = first(value)))) %>%
mutate(counter = row_number()) %>%
ungroup() %>%
select(-val.gr)
As an alternative to the cumsum-function you could also use rleid from data.table.
Used data:
d <- structure(list(ident = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
value = c(0L, 0L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 0L, 0L)),
.Names = c("ident", "value"), class = "data.frame", row.names = c(NA, -12L))
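If neither package is available, the run id can be approximated in base R. The helper run_id below is a hypothetical stand-in for rleid written for this sketch; it assumes a non-empty vector without NA values.

```r
# Hypothetical stand-in for data.table::rleid: the id increases by one
# every time the value differs from the previous element
run_id <- function(x) cumsum(c(TRUE, x[-1] != x[-length(x)]))

ident <- c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2)
value <- c(0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0)
print(run_id(value))

# Count rows within each combination of ident and run id
counter <- ave(seq_along(value), ident, run_id(value), FUN = seq_along)
print(counter)
```

Grouping by ident as well as the run id is what makes the counter restart at row 9, where value stays 1 but ident changes.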
We can paste the two values together and use the lengths component of rle to get the length of each run of consecutive identical values. We then use sequence to generate the counter.
df$counter <- sequence(rle(paste0(df$ident, df$value))$lengths)
df
# ident value counter
#1 1 0 1
#2 1 0 2
#3 1 1 1
#4 1 1 2
#5 1 1 3
#6 1 0 1
#7 1 1 1
#8 1 1 2
#9 2 1 1
#10 2 0 1
#11 2 0 2
#12 2 0 3
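The two building blocks can be tried on a short vector of pasted keys. The sketch also shows one caveat that is an observation added here, not part of the answer: paste0 without a separator can merge different pairs into the same key once a column has multi-digit values, which paste with its default space separator avoids.

```r
# Keys as produced by pasting ident and value for six rows
key <- c("10", "10", "11", "11", "11", "10")

runs <- rle(key)
print(runs$lengths)             # length of each run of identical keys

counter <- sequence(runs$lengths)
print(counter)                  # restarts at 1 for every run

# Caveat: without a separator, different pairs can collide
collide_paste0 <- paste0(1, 12) == paste0(11, 2)
collide_paste  <- paste(1, 12) == paste(11, 2)
print(c(collide_paste0, collide_paste))
```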
I have the following data frame
id<-c(1,1,1,1,1,1,1,1,2,2,2,2,3,3,3,3)
time<-c(0,1,2,3,4,5,6,7,0,1,2,3,0,1,2,3)
value<-c(1,1,6,1,2,0,0,1,2,6,2,2,1,1,6,1)
d<-data.frame(id, time, value)
The value 6 appears only once for each id. For every id, I would like to remove all rows after the row with the value 6, except the first two rows that follow it.
I have searched and found a similar problem, but I could not adapt its solution myself, so I am using the code from that thread.
In the above case the final data frame should be
id time value
1 0 1
1 1 1
1 2 6
1 3 1
1 4 2
2 0 2
2 1 6
2 2 2
2 3 2
3 0 1
3 1 1
3 2 6
3 3 1
One of the solutions given there seems to get very close to what I need, but I did not manage to adapt it. Could you help me?
library(plyr)
ddply(d, "id",
function(x) {
if (any(x$value == 6)) {
subset(x, time <= x[x$value == 6, "time"])
} else {
x
}
}
)
Thank you very much.
We could use data.table. Convert the 'data.frame' to a 'data.table' with setDT(d). Grouped by the 'id' column, we get the position of the 'value' that is equal to 6 and add 2 to it. We then take the min of that position and the number of rows in the group (.N), so the index never runs past the end of the group, build the sequence with seq, and use it to subset .SD. The if/else condition checks whether there is any 6 in the 'value' column; if not, .SD is returned without any subsetting.
library(data.table)
setDT(d)[, if(any(value==6)) .SD[seq(min(c(which(value==6) + 2, .N)))]
else .SD, by = id]
# id time value
# 1: 1 0 1
# 2: 1 1 1
# 3: 1 2 6
# 4: 1 3 1
# 5: 1 4 2
# 6: 2 0 2
# 7: 2 1 6
# 8: 2 2 2
# 9: 2 3 2
#10: 3 0 1
#11: 3 1 1
#12: 3 2 6
#13: 3 3 1
#14: 4 0 1
#15: 4 1 2
#16: 4 2 5
Or, as @Arun mentioned in the comments, we can use head to subset, which would be faster:
setDT(d)[, if(any(value==6)) head(.SD, which(value==6L)+2L) else .SD, by = id]
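The head version needs no explicit min(..., .N) because head stops at the last available row when asked for more rows than exist. A small base R sketch on one group (grp is an illustrative stand-in for .SD):

```r
# One group with four rows; the 6 sits in row 3
grp <- data.frame(time = 0:3, value = c(1L, 1L, 6L, 1L))

# Position of the 6 plus the two rows to keep after it
n_keep <- which(grp$value == 6L) + 2L
print(n_keep)

# Asking for more rows than exist just returns the whole group
trimmed <- head(grp, n_keep)
print(trimmed)
```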
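Or using dplyr, we group by 'id', get the position of 'value' 6 with which, add 2, cap it at the group size n(), and use the resulting sequence as a numeric index within slice to extract the rows. Because min also receives n(), a group without any 6 keeps all of its rows.
library(dplyr)
d %>%
group_by(id) %>%
slice(seq_len(min(c(which(value == 6) + 2, n()))))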
# id time value
#1 1 0 1
#2 1 1 1
#3 1 2 6
#4 1 3 1
#5 1 4 2
#6 2 0 2
#7 2 1 6
#8 2 2 2
#9 2 3 2
#10 3 0 1
#11 3 1 1
#12 3 2 6
#13 3 3 1
#14 4 0 1
#15 4 1 2
#16 4 2 5
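The same rule can also be sketched in base R on the data frame from the question. The names dq and keep_after, and the target and extra arguments, are hypothetical and exist only in this example.

```r
# Data as given in the question
dq <- data.frame(
  id = c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
  time = c(0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3, 0, 1, 2, 3),
  value = c(1, 1, 6, 1, 2, 0, 0, 1, 2, 6, 2, 2, 1, 1, 6, 1)
)

# Hypothetical helper: keep rows up to the target value plus `extra` more;
# a group without the target is returned unchanged
keep_after <- function(g, target = 6, extra = 2) {
  pos <- match(target, g$value)
  if (is.na(pos)) g else head(g, pos + extra)
}

# Split by id, trim each piece, and stack the pieces again
res <- do.call(rbind, lapply(split(dq, dq$id), keep_after))
rownames(res) <- NULL
print(res)
```

On this input the result has 13 rows, matching the desired output listed in the question.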
data
d <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L,
3L, 3L, 3L, 4L, 4L, 4L), time = c(0L, 1L, 2L, 3L, 4L, 0L, 1L,
2L, 3L, 0L, 1L, 2L, 3L, 0L, 1L, 2L), value = c(1L, 1L, 6L, 1L,
2L, 2L, 6L, 2L, 2L, 1L, 1L, 6L, 1L, 1L, 2L, 5L)), .Names = c("id",
"time", "value"), class = "data.frame", row.names = c(NA, -16L))