I am doing my best to learn R, and this is my first post on this forum.
I currently have a data frame with a populated vector "x" and an unpopulated vector "counter" as follows:
x <- c(NA,1,0,0,0,0,1,1,1,1,0,1)
df <- data.frame("x" = x, "counter" = 0)
x counter
1 NA 0
2 1 0
3 0 0
4 0 0
5 0 0
6 0 0
7 1 0
8 1 0
9 1 0
10 1 0
11 0 0
12 1 0
I am having a surprisingly difficult time trying to write code that will simply populate counter so that counter sums the cumulative, sequential 1s in x, but reverts back to zero when x is zero. Accordingly, I would like counter to calculate as follows per the above example:
x counter
1 NA NA
2 1 1
3 0 0
4 0 0
5 0 0
6 0 0
7 1 1
8 1 2
9 1 3
10 1 4
11 0 0
12 1 1
I have tried using lag() and ifelse(), both with and without for loops, but I seem to be getting further and further from a workable solution. lag() got me close, but the figures did not calculate as expected, and my ifelse() and for-loop attempts ended up with length-1 vectors of NA_real_, NA or 1. I have also considered cumsum() but am not sure how to restrict the range to just the 1s. I have searched and reviewed similar posts, for example "How to add value to previous row if condition is met"; however, I still cannot figure out what I would expect to be a very simple task.
Admittedly, I am at a low point in my early R learning curve and greatly appreciate any help and constructive feedback anyone from the community can provide. Thank you.
You can use:
library(dplyr)
df %>%
  group_by(x1 = cumsum(replace(x, is.na(x), 0) == 0)) %>%
  mutate(counter = (row_number() - 1) * x) %>%
  ungroup() %>%
  select(-x1)
# x counter
# <dbl> <dbl>
# 1 NA NA
# 2 1 1
# 3 0 0
# 4 0 0
# 5 0 0
# 6 0 0
# 7 1 1
# 8 1 2
# 9 1 3
#10 1 4
#11 0 0
#12 1 1
Explaining the steps:
Create a new grouping column (x1): replace NA in x with 0, then increment the group value by 1 (using cumsum) whenever x is 0, so that every 0 starts a new group.
Within each group, subtract 1 from the row number and multiply the result by x. The multiplication keeps counter at 0 where x is 0 and makes counter NA where x is NA.
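The same two steps can be sketched in base R, which makes the intermediate grouping vector visible. This is only an illustration, assuming x is the vector from the question; grp and pos are names introduced here and are not part of the answer above.

```r
x <- c(NA, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1)

# Step 1: treat NA as 0, then start a new group at every 0
grp <- cumsum(replace(x, is.na(x), 0) == 0)

# Step 2: position within each group (1, 2, 3, ...), shifted to start at 0,
# then multiplied by x so that 0 stays 0 and NA stays NA
pos <- ave(seq_along(x), grp, FUN = seq_along)
counter <- (pos - 1) * x

data.frame(x, grp, counter)
```

Printing grp next to x shows why the row number restarts: each 0 opens a new group, and the 1s that follow it fall into that same group.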
Welcome @cpanagakos.
In dplyr::lag it's not possible to use a column that doesn't exist yet.
(It can't refer to itself.)
https://www.reddit.com/r/rstats/comments/a34n6b/dplyr_use_previous_row_from_a_column_thats_being/
For example:
library(tidyverse)
df <- tibble("x" = c(NA, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1))
# error: lag cannot refer to a column that still doesn't exist
df %>%
mutate(counter = case_when(is.na(x) ~ coalesce(lag(counter), 0),
x == 0 ~ 0,
x == 1 ~ lag(counter) + 1))
#> Error: Problem with `mutate()` input `counter`.
#> x object 'counter' not found
#> i Input `counter` is `case_when(...)`.
So, if you have a criterion that "resets" the counter, you need to write an expression that changes the group whenever a reset is needed and then refer to row_number(), which restarts at 1 inside each group (as @Ronald Shah and others suggest):
Create sequential counter that restarts on a condition within panel data groups
df %>%
group_by(x1 = cumsum(!coalesce(x, 0))) %>%
# multiply by x so that counter is NA where x is NA, as in the output below
mutate(counter = (row_number() - 1) * x) %>%
ungroup()
#> # A tibble: 12 x 3
#> x x1 counter
#> <dbl> <int> <dbl>
#> 1 NA 1 NA
#> 2 1 1 1
#> 3 0 2 0
#> 4 0 3 0
#> 5 0 4 0
#> 6 0 5 0
#> 7 1 5 1
#> 8 1 5 2
#> 9 1 5 3
#> 10 1 5 4
#> 11 0 6 0
#> 12 1 6 1
This would be one of the few cases where using a for loop in R could be justified, because the alternatives are conceptually harder to understand.
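For comparison, a minimal for-loop sketch of the same counter is shown here. It assumes that an NA in x should produce NA and also reset the run, which matches the desired output in the question; run is a helper variable introduced only for this illustration.

```r
x <- c(NA, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1)

counter <- numeric(length(x))
run <- 0  # length of the current streak of 1s

for (i in seq_along(x)) {
  if (is.na(x[i])) {
    run <- 0            # assumption: NA breaks the streak
    counter[i] <- NA
  } else if (x[i] == 0) {
    run <- 0            # a 0 resets the streak
    counter[i] <- 0
  } else {
    run <- run + 1      # a 1 extends the streak
    counter[i] <- run
  }
}

counter
```

The loop reads closer to the verbal description of the problem, at the cost of being slower than the grouped version on very long vectors.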
I am trying to reformat longitudinal data for a time to event analysis. In the example data below, I simply want to find the earliest week that the result was “0” for each ID.
The specific issue I am having is how to handle patients who don't convert to 0 and had either all 1's or 2's. In the example data, patient J has all 1's.
#Sample data
have<-data.frame(patient=rep(LETTERS[1:10], each=9),
week=rep(0:8,times=10),
result=c(1,0,2,rep(0,6),1,1,2,1,rep(0,5),1,1,rep(0,7),1,rep(0,8),
1,1,1,1,2,1,0,0,0,1,1,1,rep(0,6),1,2,1,rep(0,6),1,2,rep(0,7),
1,rep(0,8),rep(1,9)))
patient week result
A 0 1
A 1 0
A 2 2
A 3 0
A 4 0
A 5 0
A 6 0
A 7 0
A 8 0
B 0 1
B 1 0
... .....
J 6 1
J 7 1
J 8 1
I am able to do this relatively straightforward process with the following code:
want<-aggregate(have$week, by=list(have$patient,have$result), min)
want<-want[which(want[2]==0),]
but I realize that if someone does not convert to 0, they are excluded (in this example, patient J). Instead, J should be present with a 1 in the second column and an 8 in the third column, but is of course omitted:
print(want)
Group.1 Group.2 x
A 0 1
B 0 4
C 0 2
D 0 1
E 0 6
F 0 3
G 0 3
H 0 2
I 0 1
#But also need
J 1 8
Pursuant to the posting guidelines here, I did work on this and am able to get what I need, though very inelegantly:
mins<-aggregate(have$week, by=list(have$patient,have$result), min)
maxs<-aggregate(have$week, by=list(have$patient,have$result), max)
want<-rbind(mins[which(mins[2]==0),],maxs[which(maxs[2]==1&maxs[3]==8),])
This returns the desired dataset, but the code is clumsy and not sustainable as I work with other datasets (i.e. datasets with different timeframes, since I have to hard-code maxs[3]==8, etc.).
Is there a more elegant or systematic way to approach this data manipulation issue?
We can write a function to select a row from the group.
select_row <- function(result, week) {
if(any(result == 0)) which.max(result == 0) else which.max(week)
}
This function returns the index of the first 0 value if one is present; otherwise it returns the index of the maximum value of week.
We then apply it to all groups:
library(dplyr)
have %>% group_by(patient) %>% slice(select_row(result, week))
# patient week result
# <fct> <int> <dbl>
# 1 A 1 0
# 2 B 4 0
# 3 C 2 0
# 4 D 1 0
# 5 E 6 0
# 6 F 3 0
# 7 G 3 0
# 8 H 2 0
# 9 I 1 0
#10 J 8 1
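The any() guard inside select_row matters because which.max on a logical vector returns 1 when no element is TRUE. The short sketch below illustrates this with two made-up result vectors (converts and never_converts are hypothetical names, not taken from the data above).

```r
# A patient who converts to 0 at position 3
converts <- c(1, 1, 0, 2, 0)
first_zero <- which.max(converts == 0)   # index of the first TRUE

# A patient who never converts: every comparison is FALSE
never_converts <- c(1, 1, 1)
misleading <- which.max(never_converts == 0)  # all FALSE, so the first index is returned

first_zero
misleading
any(never_converts == 0)  # FALSE, which is why select_row() falls back to which.max(week)
```

Without the guard, a patient like J would be reported at their first week rather than their last.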
Let's say I have a tibble.
library(tidyverse)
tib <- as.tibble(list(record = c(1:10),
gender = as.factor(sample(c("M", "F"), 10, replace = TRUE)),
like_product = as.factor(sample(1:5, 10, replace = TRUE))))
tib
# A tibble: 10 x 3
record gender like_product
<int> <fctr> <fctr>
1 1 F 2
2 2 M 1
3 3 M 2
4 4 F 3
5 5 F 4
6 6 M 2
7 7 F 4
8 8 M 4
9 9 F 4
10 10 M 5
I would like to dummy code my data with 1's and 0's so that the data looks more/less like this.
# A tibble: 10 x 8
record gender_M gender_F like_product_1 like_product_2 like_product_3 like_product_4 like_product_5
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0 1 0 0 1 0 0
2 2 0 1 0 0 0 0 0
3 3 0 1 0 1 0 0 0
4 4 0 1 1 0 0 0 0
5 5 1 0 0 0 0 0 0
6 6 0 1 0 0 0 0 0
7 7 0 1 0 0 0 0 0
8 8 0 1 0 1 0 0 0
9 9 1 0 0 0 0 0 0
10 10 1 0 0 0 0 0 1
My workflow would require that I know a range of variables to dummy code (i.e. gender:like_product), but don't want to identify EVERY variable by hand (there could be hundreds of variables). Likewise, I don't want to have to identify every level/unique value of every variable to dummy code. I'm ultimately looking for a tidyverse solution.
I know of several ways of doing this, but none of them that fit perfectly within tidyverse. I know I could use mutate...
tib %>%
mutate(gender_M = ifelse(gender == "M", 1, 0),
gender_F = ifelse(gender == "F", 1, 0),
like_product_1 = ifelse(like_product == 1, 1, 0),
like_product_2 = ifelse(like_product == 2, 1, 0),
like_product_3 = ifelse(like_product == 3, 1, 0),
like_product_4 = ifelse(like_product == 4, 1, 0),
like_product_5 = ifelse(like_product == 5, 1, 0)) %>%
select(-gender, -like_product)
But this would break my workflow rules of needing to specify every dummy coded output.
I've done this in the past with model.matrix, from the stats package.
model.matrix(~ gender + like_product, tib)
Easy and straightforward, but I want a solution in the tidyverse. EDIT: Reason being, I still have to specify every variable, and being able to use select helpers to specify something like gender:like_product would be much preferred.
I think the solution is in purrr
library(purrr)
dummy_code <- function(x) {
lvls <- levels(x)
sapply(lvls, function(y) as.integer(x == y)) %>% as.tibble
}
tib %>%
map_at(c("gender", "like_product"), dummy_code)
$record
[1] 1 2 3 4 5 6 7 8 9 10
$gender
# A tibble: 10 x 2
F M
<int> <int>
1 1 0
2 0 1
3 0 1
4 1 0
5 1 0
6 0 1
7 1 0
8 0 1
9 1 0
10 0 1
$like_product
# A tibble: 10 x 5
`1` `2` `3` `4` `5`
<int> <int> <int> <int> <int>
1 0 1 0 0 0
2 1 0 0 0 0
3 0 1 0 0 0
4 0 0 1 0 0
5 0 0 0 1 0
6 0 1 0 0 0
7 0 0 0 1 0
8 0 0 0 1 0
9 0 0 0 1 0
10 0 0 0 0 1
This attempt produces a list of tibbles, with the exception of the excluded variable record, and I've been unsuccessful at combining them all back into a single tibble. Additionally, I still have to specify every column, and overall it seems clunky.
Any better ideas? Thanks!!
An alternative to model.matrix is using the package recipes. This is still a work in progress and is not yet included in the tidyverse. At some point it might / will be included in the tidyverse packages.
I will leave it up to you to read up on recipes, but in the step step_dummy you can use special selectors from the tidyselect package (installed with recipes) like the selectors you can use in dplyr as starts_with(). I created a little example to show the steps.
Example code below.
Whether this is handier is up to you, as has already been pointed out in the comments. The function bake() uses model.matrix to create the dummies. The difference is mostly in the column names and, of course, in the internal checks done in the underlying code of each separate step.
library(recipes)
library(tibble)
tib <- as.tibble(list(record = c(1:10),
gender = as.factor(sample(c("M", "F"), 10, replace = TRUE)),
like_product = as.factor(sample(1:5, 10, replace = TRUE))))
dum <- tib %>%
recipe(~ .) %>%
step_dummy(gender, like_product) %>%
prep(training = tib) %>%
bake(newdata = tib)
dum
# A tibble: 10 x 6
record gender_M like_product_X2 like_product_X3 like_product_X4 like_product_X5
<int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1. 1. 0. 0. 0.
2 2 1. 1. 0. 0. 0.
3 3 1. 1. 0. 0. 0.
4 4 0. 0. 1. 0. 0.
5 5 0. 0. 0. 0. 0.
6 6 0. 1. 0. 0. 0.
7 7 0. 1. 0. 0. 0.
8 8 0. 0. 0. 1. 0.
9 9 0. 0. 0. 0. 1.
10 10 1. 0. 0. 0. 0.
In case you don't want to load any additional packages, you could also use pivot_wider statements like this:
tib %>%
mutate(dummy = 1) %>%
pivot_wider(names_from = gender, values_from = dummy, values_fill = 0) %>%
mutate(dummy = 1) %>%
pivot_wider(names_from = like_product, values_from = dummy, values_fill = 0, names_glue = "like_product_{like_product}")
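If a dependency-free fallback is acceptable, the idea of "one indicator column per level, for a given set of columns" can also be sketched in base R. The helper dummy_code_cols below is hypothetical (it is not from any package), and the small data frame is made up for the illustration; it is not a tidyverse solution.

```r
# Hypothetical helper: replace each column in `cols` by one 0/1 column per level
dummy_code_cols <- function(data, cols) {
  out <- data[setdiff(names(data), cols)]   # keep the untouched columns
  for (col in cols) {
    f <- as.factor(data[[col]])
    for (lvl in levels(f)) {
      # new column named <column>_<level>, 1 where the row has that level
      out[[paste(col, lvl, sep = "_")]] <- as.integer(f == lvl)
    }
  }
  out
}

toy <- data.frame(record = 1:4,
                  gender = factor(c("F", "M", "M", "F")),
                  like_product = factor(c(2, 1, 2, 3)))

res <- dummy_code_cols(toy, c("gender", "like_product"))
res
```

The column names passed as cols could be computed rather than typed, for instance from a range of positions in names(toy), so that no level or variable has to be listed by hand.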
I have a vector of numbers in a data.frame such as below.
df <- data.frame(a = c(1,2,3,4,2,3,4,5,8,9,10,1,2,1))
I need to create a new column which gives a running count of entries that are greater than their predecessor. The resulting column vector should be this:
0,1,2,3,0,1,2,3,4,5,6,0,1,0
My attempt is to create a "flag" column of diffs to mark when the values are greater.
df$flag <- c(0,diff(df$a)>0)
> df$flag
0 1 1 1 0 1 1 1 1 1 1 0 1 0
Then I can apply some dplyr group/sum magic to almost get the right answer, except that the sum doesn't reset when flag == 0:
df %>% group_by(flag) %>% mutate(run=cumsum(flag))
a flag run
1 1 0 0
2 2 1 1
3 3 1 2
4 4 1 3
5 2 0 0
6 3 1 4
7 4 1 5
8 5 1 6
9 8 1 7
10 9 1 8
11 10 1 9
12 1 0 0
13 2 1 10
14 1 0 0
I don't want to have to resort to a for() loop because I have several of these running sums to compute with several hundred thousand rows in a data.frame.
Here's one way with ave:
ave(df$a, cumsum(c(F, diff(df$a) < 0)), FUN=seq_along) - 1
[1] 0 1 2 3 0 1 2 3 4 5 6 0 1 0
We get a running count within groups defined by diff(df$a) < 0, which marks the positions in the vector that are less than their predecessors. We prepend F with c(F, ...) to account for the first position, which has no predecessor. The cumulative sum of that vector creates a grouping index. The function ave applies a function within each group of that index; we use seq_along for a running count. Since seq_along starts at 1, we subtract 1 from the result, ave(...) - 1, to start from zero.
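To make the explanation concrete, the one-liner can be unpacked into its intermediate vectors. drops, grp and run are names introduced here for illustration only. Note that with diff(a) < 0 as the reset condition, a value equal to its predecessor would continue the run rather than reset it; the example data contains no ties, so this does not show up here.

```r
a <- c(1, 2, 3, 4, 2, 3, 4, 5, 8, 9, 10, 1, 2, 1)

# TRUE where a value is smaller than its predecessor; the first value never is
drops <- c(FALSE, diff(a) < 0)

# Each drop starts a new group: 0 0 0 0 1 1 1 1 1 1 1 2 2 3
grp <- cumsum(drops)

# Running position within each group, shifted to start at 0
run <- ave(a, grp, FUN = seq_along) - 1

data.frame(a, drops, grp, run)
```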
A similar approach using dplyr:
library(dplyr)
df %>%
group_by(cumsum(c(FALSE, diff(a) < 0))) %>%
mutate(row_number() - 1)
You don't need dplyr:
fun <- function(x) {
test <- diff(x) > 0
y <- cumsum(test)
c(0, y - cummax(y * !test))
}
fun(df$a)
[1] 0 1 2 3 0 1 2 3 4 5 6 0 1 0
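The cummax trick in fun is compact, so a step-by-step trace may help. The sketch below repeats the same computation with each intermediate vector named (baseline and out are illustrative names, not part of the answer).

```r
a <- c(1, 2, 3, 4, 2, 3, 4, 5, 8, 9, 10, 1, 2, 1)

test <- diff(a) > 0          # TRUE where a value exceeds its predecessor
y <- cumsum(test)            # running total of increases, never reset

# At each non-increase, record the total reached so far; cummax carries
# that baseline forward until the next non-increase
baseline <- cummax(y * !test)

# Subtracting the baseline restarts the count after every non-increase
out <- c(0, y - baseline)
out
```

The leading 0 accounts for the first element, which has no predecessor and is therefore dropped by diff.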
a <- c(1,2,3,4,2,3,4,5,8,9,10,1,2,1)
f <- c(0, diff(a)>0)
ifelse(f, cumsum(f), f)
That is the version without a reset.
With reset:
unlist(tapply(f, cumsum(c(0, diff(a) < 0)), cumsum))
I would like to use dplyr to go through a dataframe row by row, and if A == 0, then set B to the value of B in the previous row, otherwise leave it unchanged. However, I want "the value of B in the previous row" to refer to the previous row during the computation, not before the computation began, because the value may have changed -- in other words, I'd like changes to propagate downwards. For example, with the following data:
dat <- data.frame(A=c(1,0,0,0,1),B=c(0,1,1,1,1))
A B
1 0
0 1
0 1
0 1
1 1
I would like the result of the computation to be:
result <- data.frame(A=c(1,0,0,0,1),B=c(0,0,0,0,1))
A B
1 0
0 0
0 0
0 0
1 1
If I use something like result <- dat %>% mutate(B = ifelse(A==0,lag(B),B)) then changes won't propagate downwards: result$B will be equal to c(0,0,1,1,1), not c(0,0,0,0,1).
More generally, how do you use dplyr::mutate to create a column that depends on itself (as it updates during the computation, not a copy of what it was before)?
Seems like you want a "last observation carried forward" approach. The most common R implementation is zoo::na.locf which fills in NA values with the last observation. All we need to do to use it in this case is to first set to NA all the B values that we want to fill in:
mutate(dat,
B = ifelse(A == 0, NA, B),
B = zoo::na.locf(B))
# A B
# 1 1 0
# 2 0 0
# 3 0 0
# 4 0 0
# 5 1 1
As to my comment, do note that the only thing mutate does is add the column to the data frame. We could do it just as well without mutate:
result = dat
result$B = with(result, ifelse(A == 0, NA, B))
result$B = zoo::na.locf(result$B)
Whether you use mutate or [ or $ or any other method to access/add the columns is tangential to the problem.
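If adding zoo is not an option, the same carry-forward can be sketched in base R by carrying forward the index of the last row that should keep its own value. This is an illustrative alternative, not what na.locf does internally; keep, idx and filled are names introduced here.

```r
dat <- data.frame(A = c(1, 0, 0, 0, 1), B = c(0, 1, 1, 1, 1))

# Rows that keep their own B value
keep <- dat$A != 0

# For every row, the index of the most recent "keep" row (0 if none yet)
idx <- cummax(ifelse(keep, seq_along(keep), 0L))

# Rows before the first "keep" row have nothing to carry forward
idx[idx == 0] <- NA

filled <- dat$B[idx]
filled
```

Leading rows with A == 0 end up as NA here. With zoo::na.locf, leading NAs are dropped by default (na.rm = TRUE), so inside mutate one would probably want na.rm = FALSE to keep the column length unchanged.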
We could use fill from tidyr after changing the 'B' values to NA that corresponds to 0 in 'A'
library(dplyr)
library(tidyr)
dat %>%
mutate(B = NA^(!A)*B) %>%
fill(B)
# A B
#1 1 0
#2 0 0
#3 0 0
#4 0 0
#5 1 1
NOTE: By default, the .direction (argument in fill) is "down", but it can also take "up" i.e. fill(B, .direction="up")
Here's a solution using grouping and rleid (run-length encoding ID) from data.table. I think it should be faster than the zoo solution, since zoo relies on doing multiple revs and a cumsum, and rleid is blazing fast.
Basically, we only want the last value of the previous group, so we create a grouping variable based on the diff vector of the rleid and add that to the rleid if A == 1. Then we group and take the first B-value of the group for every case where A == 0
library(dplyr)
library(data.table)
dat <- data.frame(A=c(1,0,0,0,1),B=c(0,1,1,1,1))
dat <- dat %>%
mutate(grp = data.table::rleid(A),
grp = ifelse(A == 1, grp + c(diff(grp),0),grp)) %>%
group_by(grp) %>%
mutate(B = ifelse(A == 0, B[1],B)) # EDIT: Always carry forward B on A == 0
dat
Source: local data frame [5 x 3]
Groups: grp [2]
A B grp
<dbl> <dbl> <dbl>
1 1 0 2
2 0 0 2
3 0 0 2
4 0 0 2
5 1 1 3
EDIT: Here's an example with a longer dataset so we can really see the behavior. (Also, switched: it should be "if all A != 1", not "if not all A == 1".)
set.seed(30)
dat <- data.frame(A=sample(0:1,15,replace = TRUE),
B=sample(0:1,15,replace = TRUE))
> dat
A B
1 0 1
2 0 0
3 0 1
4 0 1
5 0 0
6 0 0
7 1 1
8 0 0
9 1 0
10 0 0
11 0 0
12 0 0
13 1 0
14 1 1
15 0 0
Result:
Source: local data frame [15 x 3]
Groups: grp [5]
A B grp
<int> <int> <dbl>
1 0 1 1
2 0 1 1
3 0 1 1
4 0 1 1
5 0 1 1
6 0 1 1
7 1 1 3
8 0 1 3
9 1 0 5
10 0 0 5
11 0 0 5
12 0 0 5
13 1 0 6
14 1 1 7
15 0 1 7