Dynamically Create Variables Based on Binary Indicators in R

I have user-level data that looks like this:
ID V1 V2 V3 V4
001 1 0 1 0
002 0 1 0 1
003 0 0 0 0
004 1 1 1 0
In the above example, I would like an elegant solution (likely using tidyr) to dynamically refactor this to appear as:
ID Num_Vars Var1 Var2 Var3
001 2 V1 V3 NA
002 2 V2 V4 NA
003 0 NA NA NA
004 3 V1 V2 V3
Note that this example is simplified and there are actually many variables. The point is to have code that detects how many output columns (Var1 to VarX) should be created, based on the maximum number of 1s that any user has across the indicator columns.

This feels like some fairly standard reshaping: convert to long, manipulate by group, convert back to wide:
library(dplyr)
library(tidyr)
df %>%
  gather(key = var, value = value, -ID) %>%
  group_by(ID) %>%
  filter(value != 0) %>%
  mutate(Num_Vars = n(),
         Var_Label = paste0("Var", 1:n())) %>%
  spread(key = Var_Label, value = var) %>%
  select(-value) %>%
  full_join(distinct(df, ID))
# Source: local data frame [4 x 5]
# Groups: ID [?]
#
# ID Num_Vars Var1 Var2 Var3
# <int> <int> <chr> <chr> <chr>
# 1 1 2 V1 V3 <NA>
# 2 2 2 V2 V4 <NA>
# 3 4 3 V1 V2 V3
# 4 3 NA <NA> <NA> <NA>
Using this data reproducibly shared with dput():
df = structure(list(ID = 1:4, V1 = c(1L, 0L, 0L, 1L), V2 = c(0L, 1L,
0L, 1L), V3 = c(1L, 0L, 0L, 1L), V4 = c(0L, 1L, 0L, 0L)), .Names = c("ID",
"V1", "V2", "V3", "V4"), class = "data.frame", row.names = c(NA,
-4L))

We can use melt/dcast from data.table:
library(data.table)
dcast(melt(setDT(df), id.var = "ID")[, Num_vars := sum(value),
     ID][value != 0][df[, "ID", with = FALSE], on = "ID"],
     ID + Num_vars ~ paste0("Var", rowid(ID)), value.var = "variable")
# ID Num_vars Var1 Var2 Var3
#1: 1 2 V1 V3 NA
#2: 2 2 V2 V4 NA
#3: 3 NA NA NA NA
#4: 4 3 V1 V2 V3
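Both outputs above show NA rather than 0 in the count column for ID 3, and gather()/spread() have since been superseded in tidyr by pivot_longer()/pivot_wider(). The following is a minimal sketch of the same reshaping with the newer functions, assuming a current tidyr and dplyr; the names `flags`, `counts` and `result` are illustrative only. The count is computed from the original rows, so users without any 1s get 0.

```r
library(dplyr)
library(tidyr)

df <- data.frame(ID = 1:4,
                 V1 = c(1L, 0L, 0L, 1L), V2 = c(0L, 1L, 0L, 1L),
                 V3 = c(1L, 0L, 0L, 1L), V4 = c(0L, 1L, 0L, 0L))

# Long format, keep only the 1s, then number them within each ID
flags <- df %>%
  pivot_longer(-ID, names_to = "var", values_to = "value") %>%
  filter(value == 1) %>%
  group_by(ID) %>%
  mutate(Var_Label = paste0("Var", row_number())) %>%
  ungroup() %>%
  select(-value) %>%
  pivot_wider(names_from = Var_Label, values_from = var)

# Count the 1s per row on the original data so that all-zero rows get 0
counts <- data.frame(ID = df$ID,
                     Num_Vars = unname(rowSums(df[setdiff(names(df), "ID")])))

# The left join keeps every ID; IDs without any 1s get NA in Var1..VarX
result <- left_join(counts, flags, by = "ID")
result$Num_Vars
# 2 2 0 3
```

The number of Var columns is not specified anywhere: pivot_wider() creates one column per distinct label, so it follows the largest number of 1s found for any ID.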

Related

how to group_by one variable and count based on another variable?

Is it possible to use group_by to group one variable and count the target variable based on another variable?
For example,
x1 x2 x3
A 1 0
B 2 1
C 3 0
B 1 1
A 1 1
I want to count 0 and 1 of x3 with grouped x1
x1 x3=0 x3=1
A 1 1
B 0 2
C 1 0
Is it possible to use group_by and add something to summarize? I tried group_by both x1 and x3, but that gives x3 as the second column which is not what we are looking for.
If it's not possible to just use group_by, I was thinking we could group_by both x1 and x3, then split by x3 and cbind them, but the two dataframes after split have different lengths of rows, and there's no cbind_fill. What should I do to cbind them and fill the extra blanks?
Using the data.table package:
library(data.table)
dat <- as.data.table(dataset)
dat[, x3:= paste0("x3=", x3)]
result <- dcast(dat, x1~x3, value.var = "x3", fun.aggregate = length)
A tidyverse approach to achieve your desired result using dplyr::count + tidyr::pivot_wider:
library(dplyr)
library(tidyr)
df %>%
count(x1, x3) %>%
pivot_wider(names_from = "x3", values_from = "n", names_prefix = "x3=", values_fill = 0)
#> # A tibble: 3 × 3
#> x1 `x3=0` `x3=1`
#> <chr> <int> <int>
#> 1 A 1 1
#> 2 B 0 2
#> 3 C 1 0
DATA
df <- data.frame(
x1 = c("A", "B", "C", "B", "A"),
x2 = c(1L, 2L, 3L, 1L, 1L),
x3 = c(0L, 1L, 0L, 1L, 1L)
)
Yes, it is possible. Here is an example:
library(dplyr)
library(tidyr)
dat = read.table(text = "x1 x2 x3
A 1 0
B 2 1
C 3 0
B 1 1
A 1 1", header = TRUE)
dat %>%
  group_by(x1) %>%
  count(x3) %>%
  pivot_wider(names_from = x3,
              names_glue = "x3 = {x3}",
              values_from = n) %>%
  replace(is.na(.), 0)
# A tibble: 3 x 3
# Groups: x1 [3]
# x1 `x3 = 0` `x3 = 1`
# <chr> <int> <int>
#1 A 1 1
#2 B 0 2
#3 C 1 0
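For comparison, the same cross-tabulation can be sketched in base R with table(), which fills missing combinations with 0 on its own. This is only a sketch on the sample data; the name `wide` is illustrative.

```r
df <- data.frame(
  x1 = c("A", "B", "C", "B", "A"),
  x2 = c(1L, 2L, 3L, 1L, 1L),
  x3 = c(0L, 1L, 0L, 1L, 1L)
)

# Two-way contingency table: rows are x1, columns are the values of x3
counts <- table(x1 = df$x1, x3 = df$x3)

# Convert to a plain data frame with one column per x3 value
wide <- as.data.frame.matrix(counts)
names(wide) <- paste0("x3=", names(wide))
wide <- cbind(x1 = rownames(wide), wide)
rownames(wide) <- NULL
wide
```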

Divide one value in a data.frame by another in an alternate data.frame based on row and column metadata

I have one data frame with this structure:
Gene Transcript_ID V1 V2 V3 V4
1 ENSG00000000003.14 ENST00000612152.4 0 6 0 3
2 ENSG00000000003.14 ENST00000373020.8 4 0 5 0
3 ENSG00000000003.14 ENST00000614008.4 0 0 0 0
4 ENSG00000000003.14 ENST00000496771.5 0 3 0 0
And I have the aggregated totals by Gene in another data frame with this structure:
Category V1 V2 V3 V4
1 ENSG00000000003.14 4.00 9.00 5.00 3.00
2 ENSG00000000005.6 0.00 0.00 0.00 0.00
3 ENSG00000000419.12 61.00 94.00 103.00 71.00
4 ENSG00000000457.14 577.01 698.20 815.49 697.72
I want to divide the values in data frame 1 by the corresponding aggregate values in data frame 2, to give the relative proportion of each value within its gene.
Is there a nice simple bit of syntax someone can apply here please? Much appreciated!
We could use a join here
library(data.table)
nm1 <- paste0("V", 1:4)
setDT(df1)[, (nm1) := lapply(.SD, as.numeric), .SDcols = nm1]
df1[df2, (nm1) := Map(`/`, mget(nm1),
mget(paste0("i.", nm1))), on = .(Gene = Category)]
-output
df1
Gene Transcript_ID V1 V2 V3 V4
1: ENSG00000000003.14 ENST00000612152.4 0 0.6666667 0 1
2: ENSG00000000003.14 ENST00000373020.8 1 0.0000000 1 0
3: ENSG00000000003.14 ENST00000614008.4 0 0.0000000 0 0
4: ENSG00000000003.14 ENST00000496771.5 0 0.3333333 0 0
data
df1 <- structure(list(Gene = c("ENSG00000000003.14", "ENSG00000000003.14",
"ENSG00000000003.14", "ENSG00000000003.14"), Transcript_ID = c("ENST00000612152.4",
"ENST00000373020.8", "ENST00000614008.4", "ENST00000496771.5"
), V1 = c(0L, 4L, 0L, 0L), V2 = c(6L, 0L, 0L, 3L), V3 = c(0L,
5L, 0L, 0L), V4 = c(3L, 0L, 0L, 0L)), class = "data.frame", row.names = c("1",
"2", "3", "4"))
df2 <- structure(list(Category = c("ENSG00000000003.14", "ENSG00000000005.6",
"ENSG00000000419.12", "ENSG00000000457.14"), V1 = c(4, 0, 61,
577.01), V2 = c(9, 0, 94, 698.2), V3 = c(5, 0, 103, 815.49),
V4 = c(3, 0, 71, 697.72)), class = "data.frame", row.names = c("1",
"2", "3", "4"))
You could also do:
library(dplyr)
library(tidyr)
df1 %>%
  left_join(df2, c('Gene' = 'Category')) %>%
  pivot_longer(starts_with('V'),
               names_to = c('name', '.value'), names_sep = '[.]') %>%
  mutate(value = x/y, x = NULL, y = NULL) %>%
  pivot_wider()
# A tibble: 4 x 6
Gene Transcript_ID V1 V2 V3 V4
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 ENSG000000~ ENST000006121~ 0 0.667 0 1
2 ENSG000000~ ENST000003730~ 1 0 1 0
3 ENSG000000~ ENST000006140~ 0 0 0 0
4 ENSG000000~ ENST000004967~ 0 0.333 0 0
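A base R alternative is to look up each transcript's gene in the totals table with match() and divide the two blocks of columns element by element. This is a sketch on the sample data; `cols`, `idx` and `props` are illustrative names.

```r
df1 <- data.frame(
  Gene = rep("ENSG00000000003.14", 4),
  Transcript_ID = c("ENST00000612152.4", "ENST00000373020.8",
                    "ENST00000614008.4", "ENST00000496771.5"),
  V1 = c(0, 4, 0, 0), V2 = c(6, 0, 0, 3),
  V3 = c(0, 5, 0, 0), V4 = c(3, 0, 0, 0)
)
df2 <- data.frame(
  Category = c("ENSG00000000003.14", "ENSG00000000005.6",
               "ENSG00000000419.12", "ENSG00000000457.14"),
  V1 = c(4, 0, 61, 577.01), V2 = c(9, 0, 94, 698.2),
  V3 = c(5, 0, 103, 815.49), V4 = c(3, 0, 71, 697.72)
)

cols <- paste0("V", 1:4)

# Row of df2 that holds the totals for each row of df1
idx <- match(df1$Gene, df2$Category)

# Element-wise division of the value block by the matching totals block
props <- df1
props[cols] <- as.matrix(df1[cols]) / as.matrix(df2[idx, cols])
props
```

Two edge cases are worth keeping in mind: a gene whose total is 0 gives 0/0, which R reports as NaN, and a gene missing from the totals table gives NA.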

How to recursively replace an element with a value from 2 columns of the previous row

I have a data set of the following:
Id Val1 Val2
ID1 3 12
ID1 4 NA
ID1 -2 NA
ID1 4 33
ID2 4 NA
I want to replace the NA with Val1+Val2 from the previous row if the Id is the same. The following is the ideal output:
Id Val1 Val2
ID1 3 12
ID1 4 15
ID1 -2 19
ID1 4 33
ID2 4 NA
I have a very big dataset. I personally don't like for loops in R and am looking for an elegant vectorized solution.
Here is one option: group by 'Id' and by a run index 'grp' built from the cumulative sum of the logical vector !is.na(Val2), so that each run starts at a row where 'Val2' is observed. Within each run, add the first element of 'Val2' to the cumsum of 'Val1', shift the result down one row with lag (using the first 'Val2' as the default), then ungroup and remove the temporary 'grp' column.
library(dplyr)
df1 %>%
group_by(Id, grp = cumsum(!is.na(Val2))) %>%
mutate(Val2 = lag(first(Val2) + cumsum(Val1), default = first(Val2))) %>%
ungroup %>%
select(-grp)
# A tibble: 5 x 3
# Id Val1 Val2
# <fct> <dbl> <dbl>
#1 ID1 3 12
#2 ID1 4 15
#3 ID1 -2 19
#4 ID1 4 33
#5 ID2 4 NA
data
df1 <- structure(list(Id = structure(c(1L, 1L, 1L, 1L, 2L), .Label = c("ID1",
"ID2"), class = "factor"), Val1 = c(3, 4, -2, 4, 4), Val2 = c(12,
NA, NA, 33, NA)), class = "data.frame", row.names = c(NA, -5L
))
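To make the grouping trick easier to follow, here is a small base R sketch of the intermediate values for the sample data. It only illustrates the first run; in the dplyr code the run index is combined with 'Id', which this sketch does not do. The names `run`, `running` and `filled` are illustrative.

```r
Val1 <- c(3, 4, -2, 4, 4)
Val2 <- c(12, NA, NA, 33, NA)

# A new run starts at every row where Val2 is observed
grp <- cumsum(!is.na(Val2))
grp
# 1 1 1 2 2

# First run (rows 1 to 3): starting value plus the running sum of Val1
run <- grp == 1
running <- Val2[run][1] + cumsum(Val1[run])
running
# 15 19 17

# Shift down by one row so each row sees the total of the rows before it
filled <- c(Val2[run][1], head(running, -1))
filled
# 12 15 19
```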

Repeat a value within each ID

I have a dataset in R in long format. Each ID does not appear the same number of times (i.e. one ID might be one row, another might appear 79 rows).
e.g.
ID V1 V2
1 B 0
1 A 1
1 C 0
2 C 0
3 A 0
3 C 0
I want to create a variable V3 that equals 1 on every row of an ID if any row for that ID has V2 == 1, and 0 otherwise
e.g.
ID V1 V2 V3
1 B 0 1
1 A 1 1
1 C 0 1
2 C 0 0
3 A 0 0
3 C 0 0
In base R we can use any, with ave for the grouping.
DF$V3 <- with(DF, ave(V2, ID, FUN = function(x) any(x == 1)))
DF
# ID V1 V2 V3
#1 1 B 0 1
#2 1 A 1 1
#3 1 C 0 1
#4 2 C 0 0
#5 3 A 0 0
#6 3 C 0 0
data
DF <- structure(list(ID = c(1L, 1L, 1L, 2L, 3L, 3L), V1 = c("B", "A",
"C", "C", "A", "C"), V2 = c(0L, 1L, 0L, 0L, 0L, 0L)), .Names = c("ID",
"V1", "V2"), class = "data.frame", row.names = c(NA, -6L))
Here's a tidyverse solution.
If V2 can only be 0 or 1:
library(dplyr)
df %>%
group_by(ID) %>%
mutate(V3 = max(V2))
If you want to check that V2 is exactly 1.
df %>%
group_by(ID) %>%
mutate(V3 = as.numeric(any(V2 == 1)))
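Another base R option uses rowsum. The IDs are matched through the row names of the rowsum result, so this also works when the IDs are not 1, 2, 3, ...:
rs <- rowsum(df$V2, df$ID)
df$V3 <- +(df$ID %in% rownames(rs)[rs > 0])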
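For very long data, the same grouped flag can also be sketched with data.table, which adds the column by reference. This assumes the data.table package is installed; `DT` is an illustrative name for a copy of the sample data.

```r
library(data.table)

DT <- data.table(
  ID = c(1L, 1L, 1L, 2L, 3L, 3L),
  V1 = c("B", "A", "C", "C", "A", "C"),
  V2 = c(0L, 1L, 0L, 0L, 0L, 0L)
)

# any() returns one value per ID; := recycles it to every row of the group
DT[, V3 := as.integer(any(V2 == 1)), by = ID]
DT
```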

Fill missing values in a data frame

I need to fill in the missing values of a data frame. The logic is simple: if there is a value in M[i, j + 1], use M[i, j + 1]; otherwise use M[i, j - 1]. The tricky part is that, for each row, I need to fill every missing cell from the beginning of the row up to the column after the last non-NA value, not only the cells next to non-empty cells.
Here is the data
a1 <- c('a',9,8,rep(NA,5))
a2 <- c('b',NA,NA,NA,NA,3,NA,4)
a3 <- c('c',11,6,7,NA,NA,NA,6)
M <- rbind(a1,a2,a3)
ind <- !is.na(M[,-1])
t <- tapply(M[,-1][ind], row(M[,-1])[ind], head, 1)
M <- M %>%
as.data.frame(stringsAsFactors = FALSE) %>%
group_by(V1) %>%
do(mutate(., last_non_na_col = max(apply(.,1,function(x) max(which(!is.na(x)))))))
for (i in 1:nrow(M)) {
for (j in 3:(M$last_non_na_col[i]+1)) {
if (is.na(M[i,j])) {
M[i,j] = ifelse(!is.na(M[i,j+1]),M[i,j+1],(ifelse(!is.na(M[i,j-1]),M[i,j-1],t[i])))
} }
for (j in 2) { M[i,j] = ifelse(is.na(M[i,j]), M[i,j+1], M[i,j])}
}
The raw data is like this
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
a1 "a" "9" "8" NA NA NA NA NA
a2 "b" NA NA NA NA "3" NA "4"
a3 "c" "11" "6" "7" NA NA NA "6"
The output of my code is the following, which is correct. Please notice that for cell M[3,5], the filled value should be 7 (the number before it), not 6 (the nearest number after it).
V1 V2 V3 V4 V5 V6 V7 V8 last_non_na_col
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <int>
1 a 9 8 8 NA NA NA NA 3
2 b 3 3 3 3 3 4 4 8
3 c 11 6 7 7 7 6 6 8
I did this with a for loop. Can anyone help me do this in the tidyverse?
Thanks,
Cathy
As we have a tbl_df, we could use tidyverse methods
library(tidyverse)
gather(M, key, val, -V1) %>%
group_by(V1) %>%
fill(val, .direction = 'up') %>%
mutate(val = replace(val, which(is.na(val))[1],
val[tail(which(!is.na(val)), 1)])) %>%
spread(key, val)
# A tibble: 3 x 8
# Groups: V1 [3]
# V1 V2 V3 V4 V5 V6 V7 V8
# <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#1 a 9 8 8 NA NA NA NA
#2 b 3 3 3 3 3 4 4
#3 c 11 6 7 5 5 6 6
In the OP's for loop, we could use na.locf from the zoo package to fill the NA elements with the adjacent non-NA elements:
library(zoo)
last_non_na_col <- c(3, 8, 8)
for (i in seq_len(nrow(M))) {
M[i, -1] <- na.locf(unlist(M[i, -1]), fromLast = TRUE, na.rm = FALSE)
for (j in 3:(pmin(ncol(M), last_non_na_col[i]+1))) {
if (is.na(M[i,j])) {
M[i,j] = ifelse(!is.na(M[i,j+1]), M[i,j+1], M[i,j-1])
}
}
}
M
# A tibble: 3 x 8
# Groups: V1 [3]
# V1 V2 V3 V4 V5 V6 V7 V8
# <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#1 a 9 8 8 NA NA NA NA
#2 b 3 3 3 3 3 4 4
#3 c 11 6 7 5 5 6 6
NOTE: Here, last_non_na_col is created as a standalone vector rather than a column in the dataset, which makes indexing easier.
data
M <- structure(list(V1 = c("a", "b", "c"), V2 = c("9", NA, "11"),
V3 = c("8", NA, "6"), V4 = c(NA, NA, "7"), V5 = c(NA_character_,
NA_character_, NA_character_), V6 = c(NA, "3", "5"), V7 = c(NA_character_,
NA_character_, NA_character_), V8 = c(NA, "4", "6")), .Names = c("V1",
"V2", "V3", "V4", "V5", "V6", "V7", "V8"), row.names = c(NA,
-3L), class = c("grouped_df", "tbl_df", "tbl", "data.frame"),
vars = "V1", drop = TRUE, indices = list(
0L, 1L, 2L), group_sizes = c(1L, 1L, 1L), biggest_group_size = 1L,
labels = structure(list(
V1 = c("a", "b", "c")), row.names = c(NA, -3L),
class = "data.frame", vars = "V1", drop = TRUE, .Names = "V1"))
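The rule stated in the question (take the next cell if it has a value, otherwise the previous one, and stop one column after the last observed value) can also be sketched as a small base R function applied to each row. This is only one reading of that rule, checked against the question's own data and expected output; `fill_row` and `filled` are hypothetical names, not part of any package.

```r
# Data as built in the question: first column is a label, the rest are values
a1 <- c('a', 9, 8, rep(NA, 5))
a2 <- c('b', NA, NA, NA, NA, 3, NA, 4)
a3 <- c('c', 11, 6, 7, NA, NA, NA, 6)
M <- rbind(a1, a2, a3)

fill_row <- function(x) {
  obs <- which(!is.na(x))
  if (length(obs) == 0L) return(x)               # nothing to fill from
  first_val <- x[obs[1]]                         # fallback for leading NAs
  stop_at <- min(length(x), max(obs) + 1L)       # one past the last value
  for (j in seq_len(stop_at)) {
    if (!is.na(x[j])) next
    nxt <- if (j < length(x)) x[j + 1L] else NA  # value to the right
    prv <- if (j > 1L) x[j - 1L] else NA         # value to the left
    x[j] <- if (!is.na(nxt)) nxt else if (!is.na(prv)) prv else first_val
  }
  x
}

# apply() returns one column per input row, hence the transpose
filled <- t(apply(M[, -1], 1, fill_row))
filled
```

Because cells are filled from left to right, a value copied from the left can itself be reused by the next cell, which is why the third row ends in 7 7 6 6 rather than 6 6 6 6.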
