I would like to use dplyr to go through a dataframe row by row, and if A == 0, then set B to the value of B in the previous row, otherwise leave it unchanged. However, I want "the value of B in the previous row" to refer to the previous row during the computation, not before the computation began, because the value may have changed -- in other words, I'd like changes to propagate downwards. For example, with the following data:
dat <- data.frame(A=c(1,0,0,0,1),B=c(0,1,1,1,1))
A B
1 0
0 1
0 1
0 1
1 1
I would like the result of the computation to be:
result <- data.frame(A=c(1,0,0,0,1),B=c(0,0,0,0,1))
A B
1 0
0 0
0 0
0 0
1 1
If I use something like result <- dat %>% mutate(B = ifelse(A == 0, lag(B), B)) then changes won't propagate downwards: result$B will be equal to c(0,0,1,1,1), not c(0,0,0,0,1).
More generally, how do you use dplyr::mutate to create a column that depends on itself (as it updates during the computation, not a copy of what it was before)?
Seems like you want a "last observation carried forward" approach. The most common R implementation is zoo::na.locf which fills in NA values with the last observation. All we need to do to use it in this case is to first set to NA all the B values that we want to fill in:
mutate(dat,
       B = ifelse(A == 0, NA, B),
       B = zoo::na.locf(B))
# A B
# 1 1 0
# 2 0 0
# 3 0 0
# 4 0 0
# 5 1 1
As to my comment, do note that the only thing mutate does is add the column to the data frame. We could do it just as well without mutate:
result = dat
result$B = with(result, ifelse(A == 0, NA, B))
result$B = zoo::na.locf(result$B)
Whether you use mutate or [ or $ or any other method to access/add the columns is tangential to the problem.
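Another way to get the downward propagation the question asks about, without the NA detour, is to fold over the rows with purrr::accumulate (base R's Reduce(..., accumulate = TRUE) works the same way). A minimal sketch:

library(dplyr)
library(purrr)

dat <- data.frame(A = c(1, 0, 0, 0, 1), B = c(0, 1, 1, 1, 1))

# .x is the value carried so far, .y is the current row index:
# keep carrying .x forward whenever A == 0, otherwise restart from B.
dat %>%
  mutate(B = accumulate(seq_len(n())[-1], .init = B[1],
                        ~ if (A[.y] == 0) .x else B[.y]))
#   A B
# 1 1 0
# 2 0 0
# 3 0 0
# 4 0 0
# 5 1 1

This makes the self-reference explicit: each step sees the result of the previous step, which is exactly what a plain mutate(B = ifelse(A == 0, lag(B), B)) cannot do.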
We could use fill from tidyr after changing to NA the 'B' values that correspond to 0 in 'A':
library(dplyr)
library(tidyr)
dat %>%
  mutate(B = NA^(!A) * B) %>%
  fill(B)
# A B
#1 1 0
#2 0 0
#3 0 0
#4 0 0
#5 1 1
NOTE: By default, fill's .direction argument is "down", but it can also take "up", i.e. fill(B, .direction = "up").
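If the NA^(!A) trick reads as too clever, the masking step can be written out explicitly; a sketch of the same pipeline (NA_real_ rather than NA keeps if_else type-stable, since B is numeric):

library(dplyr)
library(tidyr)

dat %>%
  mutate(B = if_else(A == 0, NA_real_, B)) %>%
  fill(B)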
Here's a solution using grouping and rleid (run-length encoding id) from data.table. I think it should be faster than the zoo solution, since zoo relies on doing multiple revs and a cumsum, while rleid is blazing fast.
Basically, we only want the last value of the previous group, so we create a grouping variable based on the diff vector of the rleid and add that to the rleid if A == 1. Then we group and take the first B-value of the group for every case where A == 0.
library(dplyr)
library(data.table)
dat <- data.frame(A=c(1,0,0,0,1),B=c(0,1,1,1,1))
dat <- dat %>%
  mutate(grp = data.table::rleid(A),
         grp = ifelse(A == 1, grp + c(diff(grp), 0), grp)) %>%
  group_by(grp) %>%
  mutate(B = ifelse(A == 0, B[1], B)) # EDIT: Always carry forward B on A == 0
dat
Source: local data frame [5 x 3]
Groups: grp [2]
A B grp
<dbl> <dbl> <dbl>
1 1 0 2
2 0 0 2
3 0 0 2
4 0 0 2
5 1 1 3
EDIT: Here's an example with a longer dataset so we can really see the behavior. (Also, a correction: the condition should be if all A != 1, not if not all A == 1.)
set.seed(30)
dat <- data.frame(A=sample(0:1,15,replace = TRUE),
B=sample(0:1,15,replace = TRUE))
> dat
A B
1 0 1
2 0 0
3 0 1
4 0 1
5 0 0
6 0 0
7 1 1
8 0 0
9 1 0
10 0 0
11 0 0
12 0 0
13 1 0
14 1 1
15 0 0
Result:
Source: local data frame [15 x 3]
Groups: grp [5]
A B grp
<int> <int> <dbl>
1 0 1 1
2 0 1 1
3 0 1 1
4 0 1 1
5 0 1 1
6 0 1 1
7 1 1 3
8 0 1 3
9 1 0 5
10 0 0 5
11 0 0 5
12 0 0 5
13 1 0 6
14 1 1 7
15 0 1 7
I am doing my best to learn R, and this is my first post on this forum.
I currently have a data frame with a populated vector "x" and an unpopulated vector "counter" as follows:
x <- c(NA,1,0,0,0,0,1,1,1,1,0,1)
df <- data.frame("x" = x, "counter" = 0)
x counter
1 NA 0
2 1 0
3 0 0
4 0 0
5 0 0
6 0 0
7 1 0
8 1 0
9 1 0
10 1 0
11 0 0
12 1 0
I am having a surprisingly difficult time writing code that will simply populate counter so that it counts the cumulative, sequential 1s in x but reverts to zero when x is zero. Accordingly, I would like counter to be calculated as follows per the above example:
x counter
1 NA NA
2 1 1
3 0 0
4 0 0
5 0 0
6 0 0
7 1 1
8 1 2
9 1 3
10 1 4
11 0 0
12 1 1
I have tried using lag() and ifelse(), both with and without for loops, but seem to be getting further and further away from a workable solution (while lag got me close, the figures were not calculating as expected... my ifelse and for loops eventually ended up with length-1 vectors of NA_real_, NA or 1). I have also considered cumsum - but I am not sure how to limit the range to just the 1s - and have searched and reviewed similar posts, for example How to add value to previous row if condition is met; however, I still cannot figure out what I would expect to be a very simple task.
Admittedly, I am at a low point in my early R learning curve and greatly appreciate any help and constructive feedback anyone from the community can provide. Thank you.
You can use:
library(dplyr)
df %>%
  group_by(x1 = cumsum(replace(x, is.na(x), 0) == 0)) %>%
  mutate(counter = (row_number() - 1) * x) %>%
  ungroup %>%
  select(-x1)
# x counter
# <dbl> <dbl>
# 1 NA NA
# 2 1 1
# 3 0 0
# 4 0 0
# 5 0 0
# 6 0 0
# 7 1 1
# 8 1 2
# 9 1 3
#10 1 4
#11 0 0
#12 1 1
Explaining the steps -
Create a new column (x1): replace NA in x with 0, and increment the group value by 1 (using cumsum) whenever x = 0.
For each group, subtract 1 from the row number and multiply the result by x. This multiplication is necessary because it keeps counter at 0 where x = 0 and at NA where x is NA.
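For comparison, the same run-length idea can be expressed in base R with rle() and sequence(), without any grouping; a sketch:

x <- c(NA, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1)

r <- rle(replace(x, is.na(x), 0))              # runs of 0s and 1s
counter <- sequence(r$lengths) * rep(r$values, r$lengths)
counter[is.na(x)] <- NA                        # restore NA where x is NA
counter
# [1] NA  1  0  0  0  0  1  2  3  4  0  1

sequence(r$lengths) counts 1, 2, 3, ... within each run, and multiplying by the run's value zeroes out the runs of 0s while leaving the running count inside the runs of 1s.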
Welcome @cpanagakos.
In dplyr::lag it's not possible to use a column that doesn't exist yet.
(It can't refer to itself.)
https://www.reddit.com/r/rstats/comments/a34n6b/dplyr_use_previous_row_from_a_column_thats_being/
For example:
library(tidyverse)
df <- tibble("x" = c(NA, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1))
# error: lag cannot refer to a column that still doesn't exist
df %>%
  mutate(counter = case_when(is.na(x) ~ coalesce(lag(counter), 0),
                             x == 0 ~ 0,
                             x == 1 ~ lag(counter) + 1))
#> Error: Problem with `mutate()` input `counter`.
#> x object 'counter' not found
#> i Input `counter` is `case_when(...)`.
So, if you have a criterion that "resets" the counter, you need to write an expression that changes the group whenever a reset is needed, and then refer to row_number(), which restarts at 1 inside each group (like @Ronald Shah and others suggest):
Create sequential counter that restarts on a condition within panel data groups
df %>%
  group_by(x1 = cumsum(!coalesce(x, 0))) %>%
  mutate(counter = (row_number() - 1) * x) %>%
  ungroup()
#> # A tibble: 12 x 3
#> x x1 counter
#> <dbl> <int> <dbl>
#> 1 NA 1 NA
#> 2 1 1 1
#> 3 0 2 0
#> 4 0 3 0
#> 5 0 4 0
#> 6 0 5 0
#> 7 1 5 1
#> 8 1 5 2
#> 9 1 5 3
#> 10 1 5 4
#> 11 0 6 0
#> 12 1 6 1
This would be one of the few cases where using a for loop in R could be justified: because the alternatives are conceptually harder to understand.
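For completeness, a sketch of that loop version:

x <- c(NA, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1)

counter <- numeric(length(x))
counter[1] <- ifelse(is.na(x[1]), NA, x[1])
for (i in seq_along(x)[-1]) {
  # treat an NA in the previous counter as 0 so the count can restart
  prev <- ifelse(is.na(counter[i - 1]), 0, counter[i - 1])
  counter[i] <- ifelse(is.na(x[i]), NA,
                       ifelse(x[i] == 0, 0, prev + 1))
}
counter
# [1] NA  1  0  0  0  0  1  2  3  4  0  1

The loop is slower than the vectorized answers, but it states the recurrence - previous counter plus one, unless x resets it - directly.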
I'm looking for a command in R by which I can flag the 10 (or n) highest and 10 lowest values. I found this post which does something very similar to what I'm trying to do. The post suggests that is.max could do what I wanted, but I couldn't find that command in the R documentation.
Has it been replaced by another command? Is there any other command in the tidyverse or dplyr I could try?
Thanks!
library(tidyverse)
generate data:
set.seed(666)
rnorm(20) %>% as.data.frame() -> x
# choose breakpoint (e.g. top 10)
n <- 10
x %>%
  arrange(x) %>%
  mutate(rnum = row_number()) %>%
  mutate(bottom_n = ifelse(rnum %in% seq(1, n), 1, 0)) %>%
  mutate(top_n = ifelse(rnum %in% seq(n() - n + 1, n()), 1, 0)) %>%
  select(-rnum)
Here, we first sort the values in ascending order and create a helper column of row numbers. bottom_n = 1 marks the rows whose row number is between 1 and n; top_n = 1 marks the rows whose row number is between n() - n + 1 and n(), where n() is the number of rows.
Output:
. bottom_n top_n
1 -2.21687445 1 0
2 -1.79224083 1 0
3 -1.77023084 1 0
4 -1.72015590 1 0
5 -1.30618526 1 0
6 -0.80251957 1 0
7 -0.58245269 1 0
8 -0.35513446 1 0
9 -0.07582656 1 0
10 -0.04203245 1 0
11 0.13412567 0 1
12 0.34490035 0 1
13 0.75331105 0 1
14 0.75839618 0 1
15 0.78617038 0 1
16 0.85830054 0 1
17 0.86465359 0 1
18 2.01435467 0 1
19 2.02816784 0 1
20 2.15004262 0 1
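If you'd rather not reorder the rows, a rank-based variant gives the same flags while keeping the original row order; a sketch (assuming the single column is first renamed to value; note that min_rank gives tied values the same rank, so exact ties around the cutoff could flag more than n rows):

library(dplyr)

x %>%
  rename(value = 1) %>%
  mutate(bottom_n = +(min_rank(value) <= n),
         top_n    = +(min_rank(desc(value)) <= n))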
Data example:
df <- data.frame("a" = c(1,2,3,4), "b" = c(4,3,2,1), "x_ind" = c(1,0,1,1), "y_ind" = c(0,0,1,1), "z_ind" = c(0,1,1,1) )
> df
a b x_ind y_ind z_ind
1 1 4 1 0 0
2 2 3 0 0 1
3 3 2 1 1 1
4 4 1 1 1 1
I want to add a new column which checks whether, across the columns ending in "_ind", every value in the row is equal to 1; if so it returns 1, otherwise 0. The resulting dataframe should look like:
a b x_ind y_ind z_ind keep
1 1 4 1 0 0 0
2 2 3 0 0 1 0
3 3 2 1 1 1 1
4 4 1 1 1 1 1
I can select the columns using df %>% select(contains("_ind")); however, I am not sure how to do a rowwise operation which checks whether every value in the row is 1, and then append the resulting column back to the original dataframe.
Any help would be appreciated! I'm working with dplyr but would appreciate any solution.
You can use rowSums on the logical matrix df == 1, i.e.
rowSums(df[grepl('_ind', names(df))] == 1) == ncol(df[grepl('_ind', names(df))])
#[1] FALSE FALSE TRUE TRUE
Continuing your dplyr attempt you can do,
df %>%
  select(contains("_ind")) %>%
  mutate(new = rowSums(. == 1) == ncol(.))
# x_ind y_ind z_ind new
#1 1 0 0 FALSE
#2 0 0 1 FALSE
#3 1 1 1 TRUE
#4 1 1 1 TRUE
#OR you can filter directly
df %>%
  select(contains("_ind")) %>%
  filter(rowSums(. == 1) == ncol(.))
# x_ind y_ind z_ind
#1 1 1 1
#2 1 1 1
If you also want to keep the original columns, you can use
df %>%
  filter_at(vars(ends_with('_ind')), all_vars(. == 1))
# a b x_ind y_ind z_ind
#1 3 2 1 1 1
#2 4 1 1 1 1
NOTE: When we use (.), the dot refers to the resulting data frame. In this case, it refers to the columns specified in the condition (i.e. the columns that end with _ind).
Similarly in base R,
df[rowSums(df[grepl('_ind', names(df))] == 1) == ncol(df[grepl('_ind', names(df))]),]
# a b x_ind y_ind z_ind
#3 3 2 1 1 1
#4 4 1 1 1 1
You can use rowwise with c_across in newer versions of dplyr (1.0.0+):
library(dplyr)
df %>% rowwise() %>% mutate(keep = +all(c_across(ends_with('ind')) == 1))
# a b x_ind y_ind z_ind keep
# <dbl> <dbl> <dbl> <dbl> <dbl> <int>
#1 1 4 1 0 0 0
#2 2 3 0 0 1 0
#3 3 2 1 1 1 1
#4 4 1 1 1 1 1
You can use apply with all, using endsWith to get the columns ending with _ind and testing if they are == 1.
df$keep <- +(apply(df[,endsWith(colnames(df), "_ind")]==1, 1, all))
df
# a b x_ind y_ind z_ind keep
#1 1 4 1 0 0 0
#2 2 3 0 0 1 0
#3 3 2 1 1 1 1
#4 4 1 1 1 1 1
or using rowSums
df$keep <- +(rowSums(df[,endsWith(colnames(df), "_ind")]!=1) == 0)
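On recent dplyr versions (1.0.4 or later, where if_all() is available), the row-wise check can also be written without rowwise() or apply(); a sketch:

library(dplyr)

df %>% mutate(keep = +if_all(ends_with("_ind"), ~ .x == 1))
#   a b x_ind y_ind z_ind keep
# 1 1 4     1     0     0    0
# 2 2 3     0     0     1    0
# 3 3 2     1     1     1    1
# 4 4 1     1     1     1    1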
Let's say I have a tibble.
library(tidyverse)
tib <- as.tibble(list(record = c(1:10),
                      gender = as.factor(sample(c("M", "F"), 10, replace = TRUE)),
                      like_product = as.factor(sample(1:5, 10, replace = TRUE))))
tib
# A tibble: 10 x 3
record gender like_product
<int> <fctr> <fctr>
1 1 F 2
2 2 M 1
3 3 M 2
4 4 F 3
5 5 F 4
6 6 M 2
7 7 F 4
8 8 M 4
9 9 F 4
10 10 M 5
I would like to dummy code my data with 1's and 0's so that the data looks more/less like this.
# A tibble: 10 x 8
record gender_M gender_F like_product_1 like_product_2 like_product_3 like_product_4 like_product_5
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0 1 0 0 1 0 0
2 2 0 1 0 0 0 0 0
3 3 0 1 0 1 0 0 0
4 4 0 1 1 0 0 0 0
5 5 1 0 0 0 0 0 0
6 6 0 1 0 0 0 0 0
7 7 0 1 0 0 0 0 0
8 8 0 1 0 1 0 0 0
9 9 1 0 0 0 0 0 0
10 10 1 0 0 0 0 0 1
My workflow would require that I know a range of variables to dummy code (i.e. gender:like_product), but don't want to identify EVERY variable by hand (there could be hundreds of variables). Likewise, I don't want to have to identify every level/unique value of every variable to dummy code. I'm ultimately looking for a tidyverse solution.
I know of several ways of doing this, but none of them fits neatly within the tidyverse. I know I could use mutate...
tib %>%
  mutate(gender_M = ifelse(gender == "M", 1, 0),
         gender_F = ifelse(gender == "F", 1, 0),
         like_product_1 = ifelse(like_product == 1, 1, 0),
         like_product_2 = ifelse(like_product == 2, 1, 0),
         like_product_3 = ifelse(like_product == 3, 1, 0),
         like_product_4 = ifelse(like_product == 4, 1, 0),
         like_product_5 = ifelse(like_product == 5, 1, 0)) %>%
  select(-gender, -like_product)
But this would break my workflow rules of needing to specify every dummy coded output.
I've done this in the past with model.matrix, from the stats package.
model.matrix(~ gender + like_product, tib)
Easy and straightforward, but I want a solution in the tidyverse. EDIT: Reason being, I still have to specify every variable, and being able to use select helpers to specify something like gender:like_product would be much preferred.
I think the solution is in purrr
library(purrr)
dummy_code <- function(x) {
  lvls <- levels(x)
  sapply(lvls, function(y) as.integer(x == y)) %>% as.tibble
}

tib %>%
  map_at(c("gender", "like_product"), dummy_code)
$record
[1] 1 2 3 4 5 6 7 8 9 10
$gender
# A tibble: 10 x 2
F M
<int> <int>
1 1 0
2 0 1
3 0 1
4 1 0
5 1 0
6 0 1
7 1 0
8 0 1
9 1 0
10 0 1
$like_product
# A tibble: 10 x 5
`1` `2` `3` `4` `5`
<int> <int> <int> <int> <int>
1 0 1 0 0 0
2 1 0 0 0 0
3 0 1 0 0 0
4 0 0 1 0 0
5 0 0 0 1 0
6 0 1 0 0 0
7 0 0 0 1 0
8 0 0 0 1 0
9 0 0 0 1 0
10 0 0 0 0 1
This attempt produces a list of tibbles, with the exception of the excluded variable record, and I've been unsuccessful at combining them all back into a single tibble. Additionally, I still have to specify every column, and overall it seems clunky.
Any better ideas? Thanks!!
An alternative to model.matrix is the recipes package. It is still a work in progress and is not yet included in the tidyverse. At some point it might / will be included in the tidyverse packages.
I will leave it up to you to read up on recipes, but in the step_dummy step you can use special selectors from the tidyselect package (installed with recipes), like the selectors you can use in dplyr such as starts_with(). I created a little example to show the steps.
Example code below.
Whether this is handier I will leave up to you, as model.matrix has already been pointed out in the comments. The function bake() uses model.matrix to create the dummies. The difference is mostly in the column names and of course in the internal checks that are being done in the underlying code of all the separate steps.
library(recipes)
library(tibble)
tib <- as.tibble(list(record = c(1:10),
                      gender = as.factor(sample(c("M", "F"), 10, replace = TRUE)),
                      like_product = as.factor(sample(1:5, 10, replace = TRUE))))

dum <- tib %>%
  recipe(~ .) %>%
  step_dummy(gender, like_product) %>%
  prep(training = tib) %>%
  bake(newdata = tib)
dum
dum
# A tibble: 10 x 6
record gender_M like_product_X2 like_product_X3 like_product_X4 like_product_X5
<int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1. 1. 0. 0. 0.
2 2 1. 1. 0. 0. 0.
3 3 1. 1. 0. 0. 0.
4 4 0. 0. 1. 0. 0.
5 5 0. 0. 0. 0. 0.
6 6 0. 1. 0. 0. 0.
7 7 0. 1. 0. 0. 0.
8 8 0. 0. 0. 1. 0.
9 9 0. 0. 0. 0. 1.
10 10 1. 0. 0. 0. 0.
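Note that step_dummy() drops the first factor level by default, which is why gender_F and like_product_X1 are absent above. Newer recipes versions expose a one_hot argument to keep every level (and in those versions bake()'s argument is new_data rather than newdata); a sketch:

library(recipes)

tib %>%
  recipe(~ .) %>%
  step_dummy(gender, like_product, one_hot = TRUE) %>%
  prep(training = tib) %>%
  bake(new_data = tib)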
In case you don't want to load any additional packages, you could also use pivot_wider statements like this:
tib %>%
  mutate(dummy = 1) %>%
  pivot_wider(names_from = gender, values_from = dummy, values_fill = 0) %>%
  mutate(dummy = 1) %>%
  pivot_wider(names_from = like_product, values_from = dummy, values_fill = 0,
              names_glue = "like_product_{like_product}")
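One caveat: as written, the first pivot_wider() names its new columns plain F and M. Passing names_glue there as well restores the gender_ prefix; a sketch:

tib %>%
  mutate(dummy = 1) %>%
  pivot_wider(names_from = gender, values_from = dummy,
              values_fill = 0, names_glue = "gender_{gender}") %>%
  mutate(dummy = 1) %>%
  pivot_wider(names_from = like_product, values_from = dummy,
              values_fill = 0, names_glue = "like_product_{like_product}")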
Is it possible to group and count instances of all other columns using R (dplyr)? For example, The following dataframe
x a b c
1 0 0 0
1 1 0 1
1 2 2 1
2 1 2 1
Turns into this (note: y is the value being counted)
EDIT: To explain the transformation: x is what I'm grouping by; for each group, I want to count how many times 0, 1, and 2 appear in each of the other columns. For example, in the first row of the transformed dataframe we count how many times the rows with x = 1 contain the value 0 (y): once in column a, twice in column b, and once in column c.
x y a b c
1 0 1 2 1
1 1 1 0 2
1 2 1 1 0
2 1 1 0 1
2 2 0 1 0
An approach with a combination of the melt and dcast functions of data.table or reshape2:
library(data.table) # v1.9.5+
dt.new <- dcast(melt(setDT(df), id.vars="x"), x + value ~ variable)
this gives:
dt.new
# x value a b c
# 1: 1 0 1 2 1
# 2: 1 1 1 0 2
# 3: 1 2 1 1 0
# 4: 2 1 1 0 1
# 5: 2 2 0 1 0
In dcast you can specify which aggregation function to use, but in this case that is not necessary because the default aggregation function is length. When no aggregation function is given explicitly, you will get a warning about that:
Aggregation function missing: defaulting to length
Furthermore, if you do not explicitly convert the dataframe to a data table, data.table will redirect to reshape2 (see the explanation from #Arun in the comments). Consequently this method can be used with reshape2 as well:
library(reshape2)
df.new <- dcast(melt(df, id.vars="x"), x + value ~ variable)
Used data:
df <- read.table(text="x a b c
1 0 0 0
1 1 0 1
1 2 2 1
2 1 2 1", header=TRUE)
I'd use a combination of gather and spread from the tidyr package, and count from dplyr:
library(dplyr)
library(tidyr)
df = data.frame(x = c(1,1,1,2), a = c(0,1,2,1), b = c(0,0,2,2), c = c(0,1,1,1))
res = df %>%
  gather(variable, value, -x) %>%
  count(x, variable, value) %>%
  spread(variable, n, fill = 0)
# Source: local data frame [5 x 5]
#
# x value a b c
# 1 1 0 1 2 1
# 2 1 1 1 0 2
# 3 1 2 1 1 0
# 4 2 1 1 0 1
# 5 2 2 0 1 0
Essentially, you first change the format of the dataset to:
head(df %>%
       gather(variable, value, -x))
# x variable value
#1 1 a 0
#2 1 a 1
#3 1 a 2
#4 2 a 1
#5 1 b 0
#6 1 b 0
This allows you to use count to get the information on how often certain values occur in columns a to c. After that, you reformat the dataset into the required format using spread.
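For what it's worth, gather() and spread() have since been superseded by pivot_longer() and pivot_wider(); the same pipeline in the newer idiom would look like this sketch:

library(dplyr)
library(tidyr)

df %>%
  pivot_longer(-x, names_to = "variable", values_to = "value") %>%
  count(x, variable, value) %>%
  pivot_wider(names_from = variable, values_from = n, values_fill = 0)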