Dummy code categorical / ordinal variables in the tidyverse (R)

Let's say I have a tibble.
library(tidyverse)
tib <- as_tibble(list(record = c(1:10),
gender = as.factor(sample(c("M", "F"), 10, replace = TRUE)),
like_product = as.factor(sample(1:5, 10, replace = TRUE))))
tib
# A tibble: 10 x 3
record gender like_product
<int> <fctr> <fctr>
1 1 F 2
2 2 M 1
3 3 M 2
4 4 F 3
5 5 F 4
6 6 M 2
7 7 F 4
8 8 M 4
9 9 F 4
10 10 M 5
I would like to dummy code my data with 1's and 0's so that the data looks more/less like this.
# A tibble: 10 x 8
record gender_M gender_F like_product_1 like_product_2 like_product_3 like_product_4 like_product_5
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0 1 0 0 1 0 0
2 2 0 1 0 0 0 0 0
3 3 0 1 0 1 0 0 0
4 4 0 1 1 0 0 0 0
5 5 1 0 0 0 0 0 0
6 6 0 1 0 0 0 0 0
7 7 0 1 0 0 0 0 0
8 8 0 1 0 1 0 0 0
9 9 1 0 0 0 0 0 0
10 10 1 0 0 0 0 0 1
My workflow would require that I know a range of variables to dummy code (i.e. gender:like_product), but don't want to identify EVERY variable by hand (there could be hundreds of variables). Likewise, I don't want to have to identify every level/unique value of every variable to dummy code. I'm ultimately looking for a tidyverse solution.
I know of several ways of doing this, but none of them fit perfectly within the tidyverse. I know I could use mutate...
tib %>%
mutate(gender_M = ifelse(gender == "M", 1, 0),
gender_F = ifelse(gender == "F", 1, 0),
like_product_1 = ifelse(like_product == 1, 1, 0),
like_product_2 = ifelse(like_product == 2, 1, 0),
like_product_3 = ifelse(like_product == 3, 1, 0),
like_product_4 = ifelse(like_product == 4, 1, 0),
like_product_5 = ifelse(like_product == 5, 1, 0)) %>%
select(-gender, -like_product)
But this would break my workflow rules of needing to specify every dummy coded output.
I've done this in the past with model.matrix, from the stats package.
model.matrix(~ gender + like_product, tib)
Easy and straightforward, but I want a solution in the tidyverse. EDIT: Reason being, I still have to specify every variable, and being able to use select helpers to specify something like gender:like_product would be much preferred.
I think the solution is in purrr
library(purrr)
dummy_code <- function(x) {
lvls <- levels(x)
sapply(lvls, function(y) as.integer(x == y)) %>% as.tibble
}
tib %>%
map_at(c("gender", "like_product"), dummy_code)
$record
[1] 1 2 3 4 5 6 7 8 9 10
$gender
# A tibble: 10 x 2
F M
<int> <int>
1 1 0
2 0 1
3 0 1
4 1 0
5 1 0
6 0 1
7 1 0
8 0 1
9 1 0
10 0 1
$like_product
# A tibble: 10 x 5
`1` `2` `3` `4` `5`
<int> <int> <int> <int> <int>
1 0 1 0 0 0
2 1 0 0 0 0
3 0 1 0 0 0
4 0 0 1 0 0
5 0 0 0 1 0
6 0 1 0 0 0
7 0 0 0 1 0
8 0 0 0 1 0
9 0 0 0 1 0
10 0 0 0 0 1
This attempt produces a list of tibbles, with the exception of the excluded variable record, and I've been unsuccessful at combining them all back into a single tibble. Additionally, I still have to specify every column, and overall it seems clunky.
Any better ideas? Thanks!!

An alternative to model.matrix is the recipes package. It is still a work in progress and is not yet part of the tidyverse, although it may be included at some point.
I will leave it up to you to read up on recipes, but in step_dummy() you can use the selectors from the tidyselect package (installed with recipes), the same kind you use in dplyr, such as starts_with(). I created a little example to show the steps.
Example code below.
Whether this is handier I will leave up to you, as that has already been pointed out in the comments. The function bake() uses model.matrix to create the dummies. The difference is mostly in the column names and, of course, in the internal checks done in the underlying code of the separate steps.
library(recipes)
library(tibble)
tib <- as_tibble(list(record = c(1:10),
gender = as.factor(sample(c("M", "F"), 10, replace = TRUE)),
like_product = as.factor(sample(1:5, 10, replace = TRUE))))
dum <- tib %>%
recipe(~ .) %>%
step_dummy(gender, like_product) %>%
prep(training = tib) %>%
bake(new_data = tib) # older recipes releases named this argument `newdata`
dum
# A tibble: 10 x 6
record gender_M like_product_X2 like_product_X3 like_product_X4 like_product_X5
<int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1. 1. 0. 0. 0.
2 2 1. 1. 0. 0. 0.
3 3 1. 1. 0. 0. 0.
4 4 0. 0. 1. 0. 0.
5 5 0. 0. 0. 0. 0.
6 6 0. 1. 0. 0. 0.
7 7 0. 1. 0. 0. 0.
8 8 0. 0. 0. 1. 0.
9 9 0. 0. 0. 0. 1.
10 10 1. 0. 0. 0. 0.
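Since bake() relies on model.matrix for the dummies, the same "range of columns" idea can also be sketched without recipes. This is only a sketch under stated assumptions: `tib_small`, `vars`, `mm`, and `dum_mm` are illustrative names that do not appear in the answer, the data is a small hand-made tibble, and the default treatment contrasts are used, so the first level of each factor is dropped just as in the output above.

```r
library(dplyr)

# Illustrative data (fixed values instead of sample(), so the result is reproducible)
tib_small <- tibble(
  record = 1:6,
  gender = factor(c("M", "F", "F", "M", "F", "M")),
  like_product = factor(c(1, 3, 5, 3, 2, 4))
)

# Pick the columns to encode with a select range; no names typed out by hand
vars <- tib_small %>% select(gender:like_product) %>% names()

# reformulate() builds the formula ~ gender + like_product from those names
mm <- model.matrix(reformulate(vars), data = tib_small)

# Drop the intercept column and bind the dummies back to the untouched columns
dum_mm <- bind_cols(
  select(tib_small, -all_of(vars)),
  as_tibble(mm[, -1, drop = FALSE])
)
dum_mm
```

The column names come straight from model.matrix (genderM, like_product2, ...), which is the main visible difference from the recipes output.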

In case you don't want to load any additional packages, you could also use pivot_wider statements like this:
tib %>%
mutate(dummy = 1) %>%
pivot_wider(names_from = gender, values_from = dummy, values_fill = 0) %>%
mutate(dummy = 1) %>%
pivot_wider(names_from = like_product, values_from = dummy, values_fill = 0, names_glue = "like_product_{like_product}")
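The two pivot_wider calls above can be folded into a loop over a column selection, so that neither the columns nor their levels have to be typed out. The following is a sketch, not a definitive implementation: `dummy_one`, `tib_small`, `cols`, and the temporary `.dummy` column are hypothetical names introduced for illustration, and `names_sort` assumes a reasonably recent tidyr.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Illustrative data with fixed values
tib_small <- tibble(
  record = 1:6,
  gender = factor(c("M", "F", "F", "M", "F", "M")),
  like_product = factor(c(1, 3, 5, 3, 2, 4))
)

# Hypothetical helper: one-hot encode a single column,
# prefixing the new columns with the original column name
dummy_one <- function(data, col) {
  data %>%
    mutate(.dummy = 1L) %>%
    pivot_wider(
      names_from = all_of(col),
      values_from = .dummy,
      values_fill = 0L,
      names_prefix = paste0(col, "_"),
      names_sort = TRUE
    )
}

# Choose the columns by range, then apply the helper to each in turn
cols <- tib_small %>% select(gender:like_product) %>% names()
dummies <- reduce(cols, dummy_one, .init = tib_small)
dummies
```

This relies on `record` identifying each row uniquely; without such an id column, pivot_wider would collapse duplicate rows.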

Related

Continual summation of a column in R until condition is met

I am doing my best to learn R, and this is my first post on this forum.
I currently have a data frame with a populated vector "x" and an unpopulated vector "counter" as follows:
x <- c(NA,1,0,0,0,0,1,1,1,1,0,1)
df <- data.frame("x" = x, "counter" = 0)
x counter
1 NA 0
2 1 0
3 0 0
4 0 0
5 0 0
6 0 0
7 1 0
8 1 0
9 1 0
10 1 0
11 0 0
12 1 0
I am having a surprisingly difficult time trying to write code that will simply populate counter so that counter sums the cumulative, sequential 1s in x, but reverts back to zero when x is zero. Accordingly, I would like counter to calculate as follows per the above example:
x counter
1 NA NA
2 1 1
3 0 0
4 0 0
5 0 0
6 0 0
7 1 1
8 1 2
9 1 3
10 1 4
11 0 0
12 1 1
I have tried using lag() and ifelse(), both with and without for loops, but seem to be getting further and further away from a workable solution (while lag got me close, the figures were not calculating as expected....my ifelse and for loops eventually ended up with length 1 vectors of NA_real_, NA or 1). I have also considered cumsum - but not sure how to frame the range to just the 1s - and have searched and reviewed similar posts, for example How to add value to previous row if condition is met; however, I still cannot figure out what I would expect to be a very simple task.
Admittedly, I am at a low point in my early R learning curve and greatly appreciate any help and constructive feedback anyone from the community can provide. Thank you.
You can use:
library(dplyr)
df %>%
group_by(x1 = cumsum(replace(x, is.na(x), 0) == 0)) %>%
mutate(counter = (row_number() - 1) * x) %>%
ungroup %>%
select(-x1)
# x counter
# <dbl> <dbl>
# 1 NA NA
# 2 1 1
# 3 0 0
# 4 0 0
# 5 0 0
# 6 0 0
# 7 1 1
# 8 1 2
# 9 1 3
#10 1 4
#11 0 0
#12 1 1
Explaining the steps -
Create a new grouping column (x1): replace NA in x with 0, then use cumsum so that the group value increases by 1 whenever x = 0.
Within each group, subtract 1 from the row number and multiply the result by x. The multiplication keeps counter at 0 where x = 0 and at NA where x is NA.
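To make the two steps easier to follow, here is a sketch of the same logic in base R with the intermediate values kept as separate vectors. The names `x0` and `grp` are illustrative and not part of the answer; `ave()` stands in for the grouped `row_number()`.

```r
x <- c(NA, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1)

# Step 1: treat NA as 0, then start a new group at every 0
x0  <- replace(x, is.na(x), 0)
grp <- cumsum(x0 == 0)
grp
# 1 1 2 3 4 5 5 5 5 5 6 6

# Step 2: position within the group minus 1, multiplied by x
# (the multiplication restores 0 where x is 0 and NA where x is NA)
counter <- (ave(seq_along(x), grp, FUN = seq_along) - 1) * x
counter
# NA 1 0 0 0 0 1 2 3 4 0 1
```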
Welcome @cpanagakos.
In dplyr::lag it's not possible to use a column that doesn't exist yet.
(It can't refer to itself.)
https://www.reddit.com/r/rstats/comments/a34n6b/dplyr_use_previous_row_from_a_column_thats_being/
For example:
library(tidyverse)
df <- tibble("x" = c(NA, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1))
# error: lag cannot refer to a column that still doesn't exist
df %>%
mutate(counter = case_when(is.na(x) ~ coalesce(lag(counter), 0),
x == 0 ~ 0,
x == 1 ~ lag(counter) + 1))
#> Error: Problem with `mutate()` input `counter`.
#> x object 'counter' not found
#> i Input `counter` is `case_when(...)`.
So, if you have a criterion that "resets" the counter, you need to write a formula that changes the group whenever a reset is needed and then refer to row_number(), which restarts at 1 inside each group (as @Ronald Shah and others suggest):
Create sequential counter that restarts on a condition within panel data groups
df %>%
group_by(x1 = cumsum(!coalesce(x, 0))) %>%
mutate(counter = (row_number() - 1) * x) %>% # multiplying by x keeps NA where x is NA
ungroup()
#> # A tibble: 12 x 3
#> x x1 counter
#> <dbl> <int> <dbl>
#> 1 NA 1 NA
#> 2 1 1 1
#> 3 0 2 0
#> 4 0 3 0
#> 5 0 4 0
#> 6 0 5 0
#> 7 1 5 1
#> 8 1 5 2
#> 9 1 5 3
#> 10 1 5 4
#> 11 0 6 0
#> 12 1 6 1
This would be one of the few cases where using a for loop in R could be justified: because the alternatives are conceptually harder to understand.
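For comparison, a plain for loop along those lines might look like the sketch below. It is written in base R against the example vector from the question; the variable `prev` is an illustrative name, and treating an NA predecessor as 0 is an assumption chosen so that the result matches the output requested in the question.

```r
x <- c(NA, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1)
counter <- numeric(length(x))

for (i in seq_along(x)) {
  if (is.na(x[i])) {
    counter[i] <- NA                 # keep missing values missing
  } else if (x[i] == 0) {
    counter[i] <- 0                  # a zero resets the counter
  } else {
    # previous counter value, treating "no previous row" or NA as 0
    prev <- if (i > 1 && !is.na(counter[i - 1])) counter[i - 1] else 0
    counter[i] <- prev + 1
  }
}

counter
# NA 1 0 0 0 0 1 2 3 4 0 1
```

Each iteration reads the value written by the previous one, which is exactly what lag() inside mutate() cannot do.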

flag the 10 highest and 10 lowest values from a column in r

I'm looking for a command in R with which I can flag the 10 (or n) highest and 10 lowest values. I found this post, which does something very similar to what I'm trying to do. The post suggests that is.max could do what I want, but I couldn't find the command in the R documentation.
Has it been replaced by another command? Is there any other command in the tidyverse or dplyr I could try?
Thanks!
library(tidyverse)
# generate data:
set.seed(666)
rnorm(20) %>% as.data.frame() -> x
# choose breakpoint (e.g. top 10)
n <- 10
x %>% arrange(x) %>%
mutate(rnum = row_number()) %>%
mutate(bottom_n = ifelse(rnum %in% seq(1, n), 1, 0)) %>%
mutate(top_n = ifelse(rnum %in% seq( n()-n+1, n()), 1, 0)) %>%
select(-rnum)
Here, we first sort the values in ascending order and create a helper column of row numbers. bottom_n = 1 marks rows whose row number is between 1 and n; top_n = 1 marks rows whose row number is between n()-n+1 and n(), where n() is the number of rows in the data frame.
Output:
. bottom_n top_n
1 -2.21687445 1 0
2 -1.79224083 1 0
3 -1.77023084 1 0
4 -1.72015590 1 0
5 -1.30618526 1 0
6 -0.80251957 1 0
7 -0.58245269 1 0
8 -0.35513446 1 0
9 -0.07582656 1 0
10 -0.04203245 1 0
11 0.13412567 0 1
12 0.34490035 0 1
13 0.75331105 0 1
14 0.75839618 0 1
15 0.78617038 0 1
16 0.85830054 0 1
17 0.86465359 0 1
18 2.01435467 0 1
19 2.02816784 0 1
20 2.15004262 0 1
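If the original row order should be preserved, the same flags can be computed from ranks instead of sorting first. This is a sketch in base R under stated assumptions: the column name `value`, the data frame name `df`, and the choice of `ties.method = "first"` (which guarantees exactly n flagged rows even with ties) are illustrative and not taken from the answer.

```r
set.seed(666)
df <- data.frame(value = rnorm(20))
n <- 5

# rank 1 = smallest value; ranking the negated values makes rank 1 = largest value
df$bottom_n <- as.integer(rank(df$value,  ties.method = "first") <= n)
df$top_n    <- as.integer(rank(-df$value, ties.method = "first") <= n)

head(df)
```

With n = 5 out of 20 rows, most rows carry 0 in both columns, which makes the flags easier to tell apart than in the n = 10 example where every row is in one group or the other.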

Calculate Longest Consecutive Sequence Across Columns in a Data Frame - R

I have a data frame like the one below...
df <- data.frame(B1994 = c(1,0,0,0,1,0,0,1,1,0),
B1995 = c(1,1,1,0,0,1,1,1,0,0),
B1996 = c(0,0,0,0,0,0,1,1,1,0),
B1997 = c(1,0,1,0,0,1,0,1,1,1),
B1998 = c(1,0,0,0,1,0,1,0,0,1)
)
I am now trying to calculate the longest consecutive sequence of 0's across all of the columns (for each row) in this data frame, and populate a new column with these values, like this data frame below...
df2 <- data.frame(B1994 = c(1,0,0,0,1,0,0,1,1,0),
B1995 = c(1,1,1,0,0,1,1,1,0,0),
B1996 = c(0,0,0,0,0,0,1,1,1,0),
B1997 = c(1,0,1,0,0,1,0,1,1,1),
B1998 = c(1,0,0,0,1,0,1,0,0,1),
Longest_0_Interval = c(1,3,1,5,3,1,1,1,1,3)
)
Is there an easy solution for this in R?
You can use rle()
df <- data.frame(B1994 = c(1,0,0,0,1,0,0,1,1,0),
B1995 = c(1,1,1,0,0,1,1,1,0,0),
B1996 = c(0,0,0,0,0,0,1,1,1,0),
B1997 = c(1,0,1,0,0,1,0,1,1,1),
B1998 = c(1,0,0,0,1,0,1,0,0,1)
)
maxl0 <- function(x) {
r <- rle(x)
i0 <- which(r$values==0) ## or i0 <- r$values==0
max(r$lengths[i0])
}
df$Longest_0_Interval <- apply(df, 1, maxl0)
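To see what rle() contributes, it helps to look at a single row. The sketch below prints the run-length encoding for the second row of the example data and adds a guarded variant of the helper; `maxl0_safe` is a hypothetical name, and returning 0 for a row without any zeros is an assumption about the desired behavior (the unguarded version returns -Inf with a warning in that case).

```r
# Second row of the example data: runs are 0 (x1), 1 (x1), 0 (x3)
r <- rle(c(0, 1, 0, 0, 0))
r$lengths
# 1 1 3
r$values
# 0 1 0

# Guarded variant: a row with no zeros yields 0 instead of -Inf
maxl0_safe <- function(x) {
  r <- rle(x)
  zero_runs <- r$lengths[r$values == 0]
  if (length(zero_runs) == 0) 0L else max(zero_runs)
}

maxl0_safe(c(0, 1, 0, 0, 0))
# 3
maxl0_safe(c(1, 1, 1))
# 0
```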
One dplyr option could be:
df %>%
rowwise() %>%
mutate(Longest_0_Interval = with(rle(c_across(everything())), max(lengths[values == 0])))
B1994 B1995 B1996 B1997 B1998 Longest_0_Interval
<dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 1 1 0 1 1 1
2 0 1 0 0 0 3
3 0 1 0 1 0 1
4 0 0 0 0 0 5
5 1 0 0 0 1 3
6 0 1 0 1 0 1
7 0 1 1 0 1 1
8 1 1 1 1 0 1
9 1 0 1 1 0 1
10 0 0 0 1 1 3

R: Sparse? Transforming data for co-occurrence matrix

I'm a Bio major using R to generate some visualizations showing which human proteins (uniprots) are targeted by different bacterial strains.
# sample data
human.uniprots <- c("P15311", "P0CG48", "Q8WYH8", "P42224", "Q9NXR8",
"P40763", "P05067", "P60709", "Q9UDW1", "Q9H160",
"Q9UKL0", "P26038", "P61244", "O95817", "Q09472",
"P15311","P05067", "P60709", "Q9UDW1", "Q9H160")
strains <- rep(c("A", "B", "C", "C"), each = 5)
final <- cbind(human.uniprots, strains)
I'm trying to generate a co-occurrence matrix/heat map...something like
h.map <- data.frame(matrix(nrow = length(unique(human.uniprots)),
ncol = length(unique(strains)) + 1))
h.map.cols <- c("human_uniprots", "A", "B", "C")
colnames(h.map) <- h.map.cols
...where the columns have the strains, the rows have the proteins, and the data frame cells are populated with the counts of times that a protein interacts with a strain. So if strain A, B, and C all interact with a uniprot, they should all have a value of 3 in their cells for that uniprot row.
I've tried making a list of tuples of unique strain and human_uniprots, then searching for that tuple that matches the strain and human uniprot pair from the matrix I want to populate, and adding a "1" if there's a match...but I'm not sure how to work with tuples in R. Then I saw this: Populating a co-occurrence matrix
Which is what I want, but I'm not understanding the usage or syntax...is sparse() even a function in R?
Additionally...it would be nice to rank all the proteins by ones which interact with all strains. So all the proteins that interact with all the strains should be at the top, followed by ones that interact with 2 strains, and then 1 strain...
sparse() is a MATLAB function from the looks of it. You're describing a bipartite network represented by an incidence matrix.
human.uniprots <- c("P15311", "P0CG48", "Q8WYH8", "P42224", "Q9NXR8",
"P40763", "P05067", "P60709", "Q9UDW1", "Q9H160",
"Q9UKL0", "P26038", "P61244", "O95817", "Q09472",
"P15311","P05067", "P60709", "Q9UDW1", "Q9H160")
strains <- rep(c("A", "B", "C", "D"), each = 5)
final <- cbind(human.uniprots, strains)
final_df <- as.data.frame(final)
library(igraph) # install.packages("igraph")
g <- graph_from_data_frame(final_df, directed = FALSE)
V(g)$type <- ifelse(V(g)$name %in% strains, FALSE, TRUE)
as_incidence_matrix(g)
#> P15311 P0CG48 Q8WYH8 P42224 Q9NXR8 P40763 P05067 P60709 Q9UDW1 Q9H160
#> A 1 1 1 1 1 0 0 0 0 0
#> B 0 0 0 0 0 1 1 1 1 1
#> C 0 0 0 0 0 0 0 0 0 0
#> D 1 0 0 0 0 0 1 1 1 1
#> Q9UKL0 P26038 P61244 O95817 Q09472
#> A 0 0 0 0 0
#> B 0 0 0 0 0
#> C 1 1 1 1 1
#> D 0 0 0 0 0
or.....
V(g)$type <- ifelse(V(g)$name %in% strains, TRUE, FALSE)
# swap TRUE/FALSE
as_incidence_matrix(g)
#> A B C D
#> P15311 1 0 0 1
#> P0CG48 1 0 0 0
#> Q8WYH8 1 0 0 0
#> P42224 1 0 0 0
#> Q9NXR8 1 0 0 0
#> P40763 0 1 0 0
#> P05067 0 1 0 1
#> P60709 0 1 0 1
#> Q9UDW1 0 1 0 1
#> Q9H160 0 1 0 1
#> Q9UKL0 0 0 1 0
#> P26038 0 0 1 0
#> P61244 0 0 1 0
#> O95817 0 0 1 0
#> Q09472 0 0 1 0
Created on 2018-05-25 by the reprex package (v0.2.0).
Using dplyr, you can group_by, count, and spread to get a per-strain count. Then replace the per-strain count with the total count for that row, using rowSums():
library(dplyr)
library(tidyr) # provides spread()
as.data.frame(final) %>%
group_by(human.uniprots, strains) %>%
count() %>%
spread(strains, n, fill = 0) %>% # fill = 0 so missing pairs count as 0 rather than NA
ungroup() %>%
mutate(total_n = rowSums(.[2:ncol(.)])) %>%
mutate_if(is.numeric, funs(ifelse(. == 0, 0, total_n))) %>%
select(-total_n)
# A tibble: 15 x 5
human.uniprots A B C D
<fct> <dbl> <dbl> <dbl> <dbl>
1 O95817 0. 0. 1. 0.
2 P05067 0. 2. 0. 2.
3 P0CG48 1. 0. 0. 0.
4 P15311 2. 0. 0. 2.
5 P26038 0. 0. 1. 0.
6 P40763 0. 1. 0. 0.
7 P42224 1. 0. 0. 0.
8 P60709 0. 2. 0. 2.
9 P61244 0. 0. 1. 0.
10 Q09472 0. 0. 1. 0.
11 Q8WYH8 1. 0. 0. 0.
12 Q9H160 0. 2. 0. 2.
13 Q9NXR8 1. 0. 0. 0.
14 Q9UDW1 0. 2. 0. 2.
15 Q9UKL0 0. 0. 1. 0.
You can do this using table, or if you want it sparse you can use xtabs.
So for your example, you can use either
tab <- table(final[,"human.uniprots"], final[,"strains"])
tab* rowSums(tab)
Or sparse
tab <- xtabs(~human.uniprots + strains, final, sparse=TRUE)
tab <- tab*Matrix::rowSums(tab)
You can then plot it using
Matrix::image(tab, scales=list(y=list(at=1:nrow(tab), label=rownames(tab)),
x=list(at=1:ncol(tab), label=colnames(tab))),
ylab="uniprots",
xlab="strains")
You can also rank the rows by occurrence
r <- order(-Matrix::rowSums(tab))
# and then reorder the rows of the matrix and the row labels
Matrix::image(tab[r,],
scales=list(y=list(at=1:nrow(tab), label=rownames(tab)[r]),
x=list(at=1:ncol(tab), label=colnames(tab))),
ylab="uniprots",
xlab="strains")
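For the ranking part of the question (proteins that interact with the most strains first), counting distinct strains per protein is enough. The sketch below uses base R only and the four-strain version of the sample data from the earlier answer; `n_strains` and `ranked` are illustrative names, and breaking ties alphabetically by protein id is an assumption.

```r
human.uniprots <- c("P15311", "P0CG48", "Q8WYH8", "P42224", "Q9NXR8",
                    "P40763", "P05067", "P60709", "Q9UDW1", "Q9H160",
                    "Q9UKL0", "P26038", "P61244", "O95817", "Q09472",
                    "P15311", "P05067", "P60709", "Q9UDW1", "Q9H160")
strains <- rep(c("A", "B", "C", "D"), each = 5)

# Protein x strain contingency table of interaction counts
tab <- table(human.uniprots, strains)

# Number of distinct strains each protein interacts with
n_strains <- rowSums(tab > 0)

# Most widely targeted proteins first; ties broken by protein id
ranked <- tab[order(-n_strains, rownames(tab)), ]
head(ranked)
```

Using `tab > 0` rather than `tab` itself means a protein listed twice for the same strain is still counted once for that strain.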

propagate changes down a column

I would like to use dplyr to go through a dataframe row by row, and if A == 0, then set B to the value of B in the previous row, otherwise leave it unchanged. However, I want "the value of B in the previous row" to refer to the previous row during the computation, not before the computation began, because the value may have changed -- in other words, I'd like changes to propagate downwards. For example, with the following data:
dat <- data.frame(A=c(1,0,0,0,1),B=c(0,1,1,1,1))
A B
1 0
0 1
0 1
0 1
1 1
I would like the result of the computation to be:
result <- data.frame(A=c(1,0,0,0,1),B=c(0,0,0,0,1))
A B
1 0
0 0
0 0
0 0
1 1
If I use something like result <- dat %>% mutate(B = ifelse(A==0,lag(B),B)) then changes won't propagate downwards: result$B will be equal to c(0,0,1,1,1), not c(0,0,0,0,1).
More generally, how do you use dplyr::mutate to create a column that depends on itself (as it updates during the computation, not a copy of what it was before)?
Seems like you want a "last observation carried forward" approach. The most common R implementation is zoo::na.locf which fills in NA values with the last observation. All we need to do to use it in this case is to first set to NA all the B values that we want to fill in:
mutate(dat,
B = ifelse(A == 0, NA, B),
B = zoo::na.locf(B))
# A B
# 1 1 0
# 2 0 0
# 3 0 0
# 4 0 0
# 5 1 1
As to my comment, do note that the only thing mutate does is add the column to the data frame. We could do it just as well without mutate:
result = dat
result$B = with(result, ifelse(A == 0, NA, B))
result$B = zoo::na.locf(result$B)
Whether you use mutate or [ or $ or any other method to access/add the columns is tangential to the problem.
We could use fill from tidyr after changing the 'B' values to NA that corresponds to 0 in 'A'
library(dplyr)
library(tidyr)
dat %>%
mutate(B = NA^(!A)*B) %>%
fill(B)
# A B
#1 1 0
#2 0 0
#3 0 0
#4 0 0
#5 1 1
NOTE: By default, the .direction (argument in fill) is "down", but it can also take "up" i.e. fill(B, .direction="up")
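For the more general question of a column that depends on its own updated values, one option is an explicit accumulation, where each step receives the value produced by the previous step. The sketch below uses base R's Reduce() with accumulate = TRUE; the helper name `step` and the new column `B_new` are illustrative, and keeping the first row's B unchanged (it has no previous row) is an assumption.

```r
dat <- data.frame(A = c(1, 0, 0, 0, 1), B = c(0, 1, 1, 1, 1))

# prev_B is the already-updated value from the previous row
step <- function(prev_B, i) {
  if (dat$A[i] == 0) prev_B else dat$B[i]
}

# Start from the first row's B and walk over rows 2..n
dat$B_new <- Reduce(step, seq_len(nrow(dat))[-1], dat$B[1], accumulate = TRUE)
dat$B_new
# 0 0 0 0 1
```

For this particular rule the fill-based answers are simpler, but the accumulate pattern also covers update rules that cannot be phrased as "carry the last value forward".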
Here's a solution using grouping, and rleid (run-length encoding id) from data.table. I think it should be faster than the zoo solution, since zoo relies on doing multiple revs and a cumsum, and rleid is blazing fast.
Basically, we only want the last value of the previous group, so we create a grouping variable based on the diff vector of the rleid and add that to the rleid if A == 1. Then we group and take the first B-value of the group for every case where A == 0.
library(dplyr)
library(data.table)
dat <- data.frame(A=c(1,0,0,0,1),B=c(0,1,1,1,1))
dat <- dat %>%
mutate(grp = data.table::rleid(A),
grp = ifelse(A == 1, grp + c(diff(grp),0),grp)) %>%
group_by(grp) %>%
mutate(B = ifelse(A == 0, B[1],B)) # EDIT: Always carry forward B on A == 0
dat
Source: local data frame [5 x 3]
Groups: grp [2]
A B grp
<dbl> <dbl> <dbl>
1 1 0 2
2 0 0 2
3 0 0 2
4 0 0 2
5 1 1 3
EDIT: Here's an example with a longer dataset so we can really see the behavior. (Also, switched: it should be "if all A != 1", not "if not all A == 1".)
set.seed(30)
dat <- data.frame(A=sample(0:1,15,replace = TRUE),
B=sample(0:1,15,replace = TRUE))
> dat
A B
1 0 1
2 0 0
3 0 1
4 0 1
5 0 0
6 0 0
7 1 1
8 0 0
9 1 0
10 0 0
11 0 0
12 0 0
13 1 0
14 1 1
15 0 0
Result:
Source: local data frame [15 x 3]
Groups: grp [5]
A B grp
<int> <int> <dbl>
1 0 1 1
2 0 1 1
3 0 1 1
4 0 1 1
5 0 1 1
6 0 1 1
7 1 1 3
8 0 1 3
9 1 0 5
10 0 0 5
11 0 0 5
12 0 0 5
13 1 0 6
14 1 1 7
15 0 1 7
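The grouping trick can be traced step by step without data.table. The sketch below rebuilds the run ids with base R's rle() as a stand-in for rleid (an assumed equivalent for this purpose, not the data.table implementation) and applies the same shift on the short example; `run_id`, `grp`, and `B_new` are illustrative names.

```r
A <- c(1, 0, 0, 0, 1)
B <- c(0, 1, 1, 1, 1)

# Run ids: consecutive equal values of A share an id
run_id <- with(rle(A), rep(seq_along(lengths), lengths))
run_id
# 1 2 2 2 3

# Move each A == 1 row into the following run when the run changes right after it
grp <- ifelse(A == 1, run_id + c(diff(run_id), 0), run_id)
grp
# 2 2 2 2 3

# Within each group, rows with A == 0 take the group's first B value
first_B <- ave(B, grp, FUN = function(b) rep(b[1], length(b)))
B_new <- ifelse(A == 0, first_B, B)
B_new
# 0 0 0 0 1
```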
