Can I use string split with dcast in data.table?

Split a string, build columns from the unique values, and fill the values according to the string.
Sample data.table:
library(data.table)
(dt <- data.table(id = as.numeric(1:5),
                  x = c(NA, "ab.cde", "co.hij.ab", "cox.cde.kl", NA)))
dcast Approach: close but not quite
dcast(dt, id ~ x, value.var = "id")
dt[dcast(dt, id ~ x, value.var = "id"), on=.(id = id)]
dcast builds some columns and fills some values, but it doesn't do what I want.
string split Approach: I can't transpose
dt[, unique(unlist(strsplit(dt$x, ".", fixed = TRUE))) :=
tstrsplit(dt$x, ".", fixed = TRUE)]
The error message says that my LHS has 7 columns while my RHS only has 3, so transposing doesn't work. Maybe I can build the columns first and fill the values later:
dt[, unique(unlist(strsplit(dt$x, ".", fixed = TRUE))) := character()]
Now I'm getting close but still not there. I need to fill those columns with 1s and 0s according to a match (or something) on dt$x:
id 1 should have a 1 on column: NA
id 2 should have a 1 on columns: ab, and cde
id 3 should have a 1 on columns: co, hij, and ab
id 4 should have a 1 on columns: cox, cde, and kl
id 5 should have a 1 on column: NA

We can use data.table methods, i.e. split 'x', replicate the other columns by the number of pieces, and dcast to wide format with length as the aggregation function:
library(data.table)
dcast(dt[, {x1 <- strsplit(x, "\\."); c(list(unlist(x1)),
.SD[rep(seq_len(.N), lengths(x1))])}], id + x ~ V1, length)
# id x NA ab cde co cox hij kl
#1: 1 <NA> 1 0 0 0 0 0 0
#2: 2 ab.cde 0 1 1 0 0 0 0
#3: 3 co.hij.ab 0 1 0 1 0 1 0
#4: 4 cox.cde.kl 0 0 1 0 1 0 1
#5: 5 <NA> 1 0 0 0 0 0 0
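If the one-liner is hard to follow, the same idea can be written in two steps (a sketch; it should give the same counts, though column order may differ):
# 1. split each x into its pieces, keeping id and x (an NA x stays as a single NA piece)
long <- dt[, .(piece = unlist(strsplit(x, ".", fixed = TRUE))), by = .(id, x)]
# 2. count each piece per id/x; the NA pieces become the NA column as above
dcast(long, id + x ~ piece, fun.aggregate = length, value.var = "piece")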

One option using dplyr and tidyr is to split the string on "." and put it into separate rows and then spread it into wide format.
library(dplyr)
library(tidyr)
dt %>%
  mutate(x1 = x) %>%
  separate_rows(x, sep = "\\.") %>%
  mutate(temp = 1) %>%
  spread(x, temp, fill = 0)
# id x1 ab cde co cox hij kl <NA>
#1 1 <NA> 0 0 0 0 0 0 1
#2 2 ab.cde 1 1 0 0 0 0 0
#3 3 co.hij.ab 1 0 1 0 1 0 0
#4 4 cox.cde.kl 0 1 0 1 0 1 0
#5 5 <NA> 0 0 0 0 0 0 1
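If you are on tidyr 1.0 or later, pivot_wider() supersedes spread(); a sketch of the same approach (column order may differ):
library(dplyr)
library(tidyr)
dt %>%
  mutate(x1 = x) %>%
  separate_rows(x, sep = "\\.") %>%
  mutate(temp = 1) %>%
  pivot_wider(names_from = x, values_from = temp, values_fill = 0)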

Related

Preserve column name when making function

I have a dataframe that looks like this:
ï..Employee_Name EmpID MarriedID MaritalStatusID GenderID EmpStatusID DeptID PerfScoreID FromDiversityJobFairID
1: Adinolfi, Wilson K 10026 0 0 1 1 5 4 0
2: Ait Sidi, Karthikeyan 10084 1 1 1 5 3 3 0
3: Akinkuolie, Sarah 10196 1 1 0 5 5 3 0
4: Alagbe,Trina 10088 1 1 0 1 5 3 0
5: Anderson, Carol 10069 0 2 0 5 5 3 0
6: Anderson, Linda 10002 0 0 0 1 5 4 0
I wrote a count function:
HRdata_factor_count <- function(df, var) {
  df %>%
    count(df[[var]], sort = T) %>%
    rename(Variable = `df[[var]]`) %>%
    mutate(Variable = factor(Variable)) %>%
    mutate(Variable = fct_reorder(Variable, n)) # fct_reorder() comes from the forcats package
}
It outputs "Variable" instead of the name of the variable given to the var argument:
Variable n
1: 0 187
2: 1 124
I would like to maintain the name of the variable that I tell the function to count without having to rename it inside the body of the function.
You can try this function:
library(dplyr)
HRdata_factor_count <- function(df, var) {
  df %>%
    count(.data[[var]], sort = T) %>%
    mutate(!!var := factor(.data[[var]]))
}
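A quick way to check the renaming behaviour on a built-in data set (mtcars is only a stand-in for the HR data here):
HRdata_factor_count(mtcars, "cyl")
#   cyl  n
# 1   8 14
# 2   4 11
# 3   6  7
The counted column now keeps the name passed in via var ("cyl") instead of being called Variable.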

extracting unique combinations from a long list of binary variables

I have a dataframe containing a long list of binary variables. Each row represents a participant, and columns represent whether a participant made a certain choice (1) or not (0). For the sake of simplicity, let's say there are only four binary variables and 6 participants.
df <- data.frame(a = c(0,1,0,1,0,1),
                 b = c(1,1,1,1,0,1),
                 c = c(0,0,0,1,1,1),
                 d = c(1,1,0,0,0,0))
>df
# a b c d
# 1 0 1 0 1
# 2 1 1 0 1
# 3 0 1 0 0
# 4 1 1 1 0
# 5 0 0 1 0
# 6 1 1 1 0
In the dataframe, I want to create a list of columns that reflect each unique combination of variables in df (i.e., abc, abd, bcd, cda). Then, for each row, I want to add value "1" if the row contains the particular combination corresponding to the column. So, if the participant scored 1 on "a", "b", and "c", and 0 on "d" he would have a score 1 in the newly created column "abc", but 0 in the other columns. Ideally, it would look something like this.
>df_updated
# a b c d abc abd bcd cda
# 1 0 1 0 1 0 0 0 0
# 2 1 1 0 1 0 1 0 0
# 3 0 1 0 0 0 0 0 0
# 4 1 1 1 0 1 0 0 0
# 5 0 0 1 0 0 0 0 0
# 6 1 1 1 0 0 0 0 0
The ultimate goal is to have an idea of the frequency of each of the combinations, so I can order them from the most frequently chosen to the least frequently chosen. I've been thinking about this issue for days now, but couldn't find an appropriate answer. I would very much appreciate the help.
Something like this?
funCombn <- function(data){
  f <- function(x, data){
    data <- data[x]
    list(
      name = paste(x, collapse = ""),
      vec = apply(data, 1, function(x) +all(as.logical(x)))
    )
  }
  # use the function argument `data`, not the global `df`
  res <- combn(names(data), 3, f, simplify = FALSE, data = data)
  out <- do.call(cbind.data.frame, lapply(res, '[[', 'vec'))
  names(out) <- sapply(res, '[[', 'name')
  cbind(data, out)
}
funCombn(df)
# a b c d abc abd acd bcd
#1 0 1 0 1 0 0 0 0
#2 1 1 0 1 0 1 0 0
#3 0 1 0 0 0 0 0 0
#4 1 1 1 0 1 0 0 0
#5 0 0 1 0 0 0 0 0
#6 1 1 1 0 1 0 0 0
Base R option using combn:
n <- 3
cbind(df, do.call(cbind, combn(names(df), n, function(x) {
  setNames(data.frame(as.integer(rowSums(df[x] == 1) == n)),
           paste0(x, collapse = ''))
}, simplify = FALSE))) -> result
result
# a b c d abc abd acd bcd
#1 0 1 0 1 0 0 0 0
#2 1 1 0 1 0 1 0 0
#3 0 1 0 0 0 0 0 0
#4 1 1 1 0 1 0 0 0
#5 0 0 1 0 0 0 0 0
#6 1 1 1 0 1 0 0 0
Using combn, create all combinations of column names taking n columns at a time. For each of those combinations, assign 1 to the rows where all n of those columns are 1, and 0 otherwise.
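If the end goal is the frequency ordering mentioned in the question, the new columns of result can be summed directly (a small add-on sketch using the objects above):
# how many rows match each combination, most frequent first
sort(colSums(result[setdiff(names(result), names(df))]), decreasing = TRUE)
# abc abd acd bcd
#   2   1   0   0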
If you are just looking for a frequency of the combinations (and they don't need to be back in the original data), then you could use something like this:
df <- data.frame(a = c(0,1,0,1,0,1),
b = c(1,1,1,1,0,1),
c = c(0,0,0,1,1,1),
d = c(1,1,0,0,0,0))
n <- names(df)
out <- sapply(n, function(x)ifelse(df[[x]] == 1, x, ""))
combs <- apply(out, 1, paste, collapse="")
sort(table(combs))
# combs
# abd b bd c abc
# 1 1 1 1 2
Ok, so let's use your data, including one row without any 1's:
df <- data.frame(
a = c(0,1,0,1,0,1,0),
b = c(1,1,1,1,0,1,0),
c = c(0,0,0,1,1,1,0),
d = c(1,1,0,0,0,0,0)
)
Now I want to paste all column names together if they have a 1, and then make that a wide table (so that every combination gets its own column), filling the resulting NAs with 0's.
df2 <- df %>%
  dplyr::mutate(
    combination = paste0(
      ifelse(a == 1, "a", ""), # There is possibly a way to automate this as well using across()
      ifelse(b == 1, "b", ""),
      ifelse(c == 1, "c", ""),
      ifelse(d == 1, "d", "")
    ),
    combination = ifelse(
      combination == "",
      "nothing",
      paste0("comb_", combination)
    ),
    value = ifelse(
      is.na(combination),
      0,
      1
    ),
    i = dplyr::row_number()
  ) %>%
  tidyr::pivot_wider(
    names_from = combination,
    values_from = value,
    names_repair = "unique"
  ) %>%
  replace(., is.na(.), 0) %>%
  dplyr::select(-i)
Since you want to order the original df by frequency, you can create a summary of all combinations (excluding the original columns and the "nothing" column). Then you make it a long table, arrange by frequency, and pull the combination names from it.
comb_in_order <- df2 %>%
  dplyr::select(
    -tidyselect::any_of(
      c(
        names(df),
        "nothing" # I think you want these last.
      )
    )
  ) %>%
  dplyr::summarise(
    dplyr::across(
      .cols = tidyselect::everything(),
      .fns = sum
    )
  ) %>%
  tidyr::pivot_longer(
    cols = tidyselect::everything(),
    names_to = "combination",
    values_to = "frequency"
  ) %>%
  dplyr::arrange(
    dplyr::desc(frequency)
  ) %>%
  dplyr::pull(combination)
The only thing left to do is arrange df2 by these combination columns and then select the original columns of df back out.
df2 %>%
  dplyr::arrange(
    across(
      tidyselect::any_of(comb_in_order),
      desc
    )
  ) %>%
  dplyr::select(
    tidyselect::any_of(names(df))
  )
This should work for all possible combinations.
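Regarding the across() hint in the comment inside the first mutate() call above, one way to build the combination string without writing an ifelse() per column is sketched below (illustrative only, not part of the original answer):
library(dplyr)
df %>%
  mutate(combination = apply(across(a:d), 1,
                             function(r) paste(names(r)[r == 1], collapse = "")))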

how to subset rows that have a value larger than other values for multiple columns in R

I have the following data.table
library(data.table)
dt <- data.table(V1 = c(1, 3, 1, 0, NA, 0),
                 V2 = c(1, 0, 1, 0, 1, 3),
                 Q1 = c(3, 5, 10, 14, 0, 3),
                 Q2 = c(0, 1, 8, NA, 0, NA))
and I want to add a new column that will have value 1:
if any of the columns V1,V2 has value larger than 2,
and
if any of the columns Q1,Q2 has value larger than 0
So in the end I want to end up with something like this:
> dt
V1 V2 Q1 Q2 new
1: 1 1 3 0 0
2: 3 0 5 1 1
3: 1 1 10 8 0
4: 0 0 14 NA 0
5: NA 1 0 0 0
6: 0 3 3 NA 1
EDIT
In principle I would like to have 2 vectors of column names, something like v_columms <- names(dt)[names(dt) %like% "V"] and q_columms <- names(dt)[names(dt) %like% "Q"], and use these.
We can use melt, specifying the column patterns in measure, to convert to 'long' format, and then apply the condition within each original row (rowid(variable)):
dt[, new := melt(dt, measure = patterns("V", "Q"))[,
+(any(value1 > 2) & any(value2 > 0)),rowid(variable)]$V1]
dt
# V1 V2 Q1 Q2 new
#1: 1 1 3 0 0
#2: 3 0 5 1 1
#3: 1 1 10 8 0
#4: 0 0 14 NA 0
#5: NA 1 0 0 0
#6: 0 3 3 NA 1
Or without melt, if there are only two groups of columns, then
vs <- grep("V", names(dt))
qs <- grep("Q", names(dt))
dt[, new := +(Reduce(`|`, lapply(.SD[, ..vs], `>`, 2)) &
Reduce(`|`, lapply(.SD[, ..qs], `>`, 0)))]
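To tie this to the name vectors from the EDIT, a rowSums() sketch (note that %like% is case sensitive, hence "Q"):
v_columms <- names(dt)[names(dt) %like% "V"]
q_columms <- names(dt)[names(dt) %like% "Q"]
v_ok <- rowSums(dt[, ..v_columms] > 2, na.rm = TRUE) > 0
q_ok <- rowSums(dt[, ..q_columms] > 0, na.rm = TRUE) > 0
dt[, new := +(v_ok & q_ok)]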
Using dplyr and either case_when or if_else:
dt %>%
  mutate(new = case_when((V1 > 2 | V2 > 2) & (Q1 > 0 | Q2 > 0) ~ 1,
                         TRUE ~ 0))
dt %>%
mutate(new = if_else((V1 > 2 | V2 > 2) & (Q1 > 0 | Q2 > 0), 1 , 0))
V1 V2 Q1 Q2 new
1 1 1 3 0 0
2 3 0 5 1 1
3 1 1 10 8 0
4 0 0 14 NA 0
5 NA 1 0 0 0
6 0 3 3 NA 1
Here's another approach with some helper functions:
foo <- function(.dt, cols, vals, na.rm = TRUE) {
rowSums(.dt[, cols, with=FALSE] > vals, na.rm = na.rm) > 0
}
bar <- function(.dt, cols_list, vals_list) {
as.integer(Reduce("&", Map(function(cols, vals) foo(.dt, cols, vals), cols_list, vals_list)))
}
dt[, new := bar(.SD, list(v_columms, q_columms), list(2, 0))]

How to one-hot-encode factor variables with data.table?

For those unfamiliar, one-hot encoding simply refers to converting a column of categories (i.e. a factor) into multiple columns of binary indicator variables where each new column corresponds to one of the classes of the original column. This example will explain it better:
dt <- data.table(
  ID = 1:5,
  Color = factor(c("green", "red", "red", "blue", "green"), levels = c("blue", "green", "red", "purple")),
  Shape = factor(c("square", "triangle", "square", "triangle", "cirlce"))
)
dt
ID Color Shape
1: 1 green square
2: 2 red triangle
3: 3 red square
4: 4 blue triangle
5: 5 green cirlce
# one hot encode the colors
color.binarized <- dcast(dt[, list(V1=1, ID, Color)], ID ~ Color, fun=sum, value.var="V1", drop=c(TRUE, FALSE))
# Prepend Color_ in front of each one-hot-encoded feature
setnames(color.binarized, setdiff(colnames(color.binarized), "ID"), paste0("Color_", setdiff(colnames(color.binarized), "ID")))
# one hot encode the shapes
shape.binarized <- dcast(dt[, list(V1=1, ID, Shape)], ID ~ Shape, fun=sum, value.var="V1", drop=c(TRUE, FALSE))
# Prepend Shape_ in front of each one-hot-encoded feature
setnames(shape.binarized, setdiff(colnames(shape.binarized), "ID"), paste0("Shape_", setdiff(colnames(shape.binarized), "ID")))
# Join one-hot tables with original dataset
dt <- dt[color.binarized, on="ID"]
dt <- dt[shape.binarized, on="ID"]
dt
ID Color Shape Color_blue Color_green Color_red Color_purple Shape_cirlce Shape_square Shape_triangle
1: 1 green square 0 1 0 0 0 1 0
2: 2 red triangle 0 0 1 0 0 0 1
3: 3 red square 0 0 1 0 0 1 0
4: 4 blue triangle 1 0 0 0 0 0 1
5: 5 green cirlce 0 1 0 0 1 0 0
This is something I do a lot, and as you can see it's pretty tedious (especially when my data has many factor columns). Is there an easier way to do this with data.table? In particular, I assumed dcast would allow me to one-hot-encode multiple columns at once, but when I try something like
dcast(dt[, list(V1=1, ID, Color, Shape)], ID ~ Color + Shape, fun=sum, value.var="V1", drop=c(TRUE, FALSE))
I get a column for every combination of levels instead:
ID blue_cirlce blue_square blue_triangle green_cirlce green_square green_triangle red_cirlce red_square red_triangle purple_cirlce purple_square purple_triangle
1: 1 0 0 0 0 1 0 0 0 0 0 0 0
2: 2 0 0 0 0 0 0 0 0 1 0 0 0
3: 3 0 0 0 0 0 0 0 1 0 0 0 0
4: 4 0 0 1 0 0 0 0 0 0 0 0 0
5: 5 0 0 0 1 0 0 0 0 0 0 0 0
Here you go:
dcast(melt(dt, id.vars='ID'), ID ~ variable + value, fun = length)
# ID Color_blue Color_green Color_red Shape_cirlce Shape_square Shape_triangle
#1: 1 0 1 0 0 1 0
#2: 2 0 0 1 0 0 1
#3: 3 0 0 1 0 1 0
#4: 4 1 0 0 0 0 1
#5: 5 0 1 0 1 0 0
To get the missing factor levels you can do the following:
res = dcast(melt(dt, id = 'ID', value.factor = T), ID ~ value, drop = F, fun = length)
setnames(res, c("ID", unlist(lapply(2:ncol(dt),
function(i) paste(names(dt)[i], levels(dt[[i]]), sep = "_")))))
res
# ID Color_blue Color_green Color_red Color_purple Shape_cirlce Shape_square Shape_triangle
#1: 1 0 1 0 0 0 1 0
#2: 2 0 0 1 0 0 0 1
#3: 3 0 0 1 0 0 1 0
#4: 4 1 0 0 0 0 0 1
#5: 5 0 1 0 0 1 0 0
Using model.matrix:
> cbind(dt[, .(ID)], model.matrix(~ Color + Shape, dt))
ID (Intercept) Colorgreen Colorred Colorpurple Shapesquare Shapetriangle
1: 1 1 1 0 0 1 0
2: 2 1 0 1 0 0 1
3: 3 1 0 1 0 1 0
4: 4 1 0 0 0 0 1
5: 5 1 1 0 0 0 0
This makes the most sense if you're doing modelling.
If you want to suppress the intercept (and restore the aliased column for the 1st variable):
> cbind(dt[, .(ID)], model.matrix(~ Color + Shape - 1, dt))
ID Colorblue Colorgreen Colorred Colorpurple Shapesquare Shapetriangle
1: 1 0 1 0 0 1 0
2: 2 0 0 1 0 0 1
3: 3 0 0 1 0 1 0
4: 4 1 0 0 0 0 1
5: 5 0 1 0 0 0 0
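If you also need the unused levels (like purple) for every factor while keeping the intercept, a known model.matrix trick is to switch off contrasts for each factor column; a sketch:
mm <- model.matrix(~ Color + Shape, data = dt,
                   contrasts.arg = lapply(dt[, .(Color, Shape)], contrasts, contrasts = FALSE))
colnames(mm) # now includes Colorpurple even though purple never occurs in the data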
Here's a more generalized version of eddi's solution:
one_hot <- function(dt, cols = "auto", dropCols = TRUE, dropUnusedLevels = FALSE){
  # One-Hot-Encode unordered factors in a data.table
  # If cols = "auto", each unordered factor column in dt will be encoded. (Or specify a vector of column names to encode)
  # If dropCols = TRUE, the original factor columns are dropped
  # If dropUnusedLevels = TRUE, unused factor levels are dropped

  # Automatically get the unordered factor columns
  if(cols[1] == "auto") cols <- colnames(dt)[which(sapply(dt, function(x) is.factor(x) & !is.ordered(x)))]

  # Build tempDT containing an ID column and the 'cols' columns
  tempDT <- dt[, cols, with = FALSE]
  tempDT[, ID := .I]
  setcolorder(tempDT, unique(c("ID", colnames(tempDT))))
  for(col in cols) set(tempDT, j = col, value = factor(paste(col, tempDT[[col]], sep = "_"),
                                                       levels = paste(col, levels(tempDT[[col]]), sep = "_")))

  # One-hot-encode
  if(dropUnusedLevels == TRUE){
    newCols <- dcast(melt(tempDT, id = 'ID', value.factor = T), ID ~ value, drop = T, fun = length)
  } else{
    newCols <- dcast(melt(tempDT, id = 'ID', value.factor = T), ID ~ value, drop = F, fun = length)
  }

  # Combine binarized columns with the original dataset
  result <- cbind(dt, newCols[, !"ID"])

  # If dropCols = TRUE, remove the original factor columns
  if(dropCols == TRUE){
    result <- result[, !cols, with = FALSE]
  }

  return(result)
}
Note that for large datasets it's probably better to use Matrix::sparse.model.matrix
Update (2017): This is now in the mltools package.
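For large data sets, the sparse variant mentioned above looks like this (using the recommended Matrix package):
library(Matrix)
sparse.model.matrix(~ Color + Shape - 1, data = dt)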
If no one posts a clean way to write this out by hand each time, you can always make a function/macro:
OHE <- function(dt, grp, encodeCols) {
grpSymb = as.symbol(grp)
for (col in encodeCols) {
colSymb = as.symbol(col)
eval(bquote(
dt[, .SD
][, V1 := 1
][, dcast(.SD, .(grpSymb) ~ .(colSymb), fun=sum, value.var='V1')
][, setnames(.SD, setdiff(colnames(.SD), grp), sprintf("%s_%s", col, setdiff(colnames(.SD), grp)))
][, dt <<- dt[.SD, on=grp]
]
))
}
dt
}
dtOHE = OHE(dt, 'ID', c('Color', 'Shape'))
dtOHE
ID Color Shape Color_blue Color_green Color_red Shape_cirlce Shape_square Shape_triangle
1: 1 green square 0 1 0 0 1 0
2: 2 red triangle 0 0 1 0 0 1
3: 3 red square 0 0 1 0 1 0
4: 4 blue triangle 1 0 0 0 0 1
5: 5 green cirlce 0 1 0 1 0 0
In a few lines you can solve this problem:
library(tidyverse)
dt2 <- spread(dt,Color,Shape)
dt3 <- spread(dt,Shape,Color)
df <- cbind(dt2,dt3)
df2 <- apply(df, 2, function(x){sapply(x, function(y){
ifelse(is.na(y),0,1)
})})
df2 <- as.data.frame(df2)
df <- cbind(dt,df2[,-1])

propagate changes down a column

I would like to use dplyr to go through a dataframe row by row, and if A == 0, then set B to the value of B in the previous row, otherwise leave it unchanged. However, I want "the value of B in the previous row" to refer to the previous row during the computation, not before the computation began, because the value may have changed -- in other words, I'd like changes to propagate downwards. For example, with the following data:
dat <- data.frame(A=c(1,0,0,0,1),B=c(0,1,1,1,1))
A B
1 0
0 1
0 1
0 1
1 1
I would like the result of the computation to be:
result <- data.frame(A=c(1,0,0,0,1),B=c(0,0,0,0,1))
A B
1 0
0 0
0 0
0 0
1 1
If I use something like result <- dat %>% mutate(B = ifelse(A == 0, lag(B), B)) then changes won't propagate downwards: result$B will be equal to c(0,0,1,1,1), not c(0,0,0,0,1).
More generally, how do you use dplyr::mutate to create a column that depends on itself (as it updates during the computation, not a copy of what it was before)?
Seems like you want a "last observation carried forward" approach. The most common R implementation is zoo::na.locf which fills in NA values with the last observation. All we need to do to use it in this case is to first set to NA all the B values that we want to fill in:
mutate(dat,
B = ifelse(A == 0, NA, B),
B = zoo::na.locf(B))
# A B
# 1 1 0
# 2 0 0
# 3 0 0
# 4 0 0
# 5 1 1
As to my comment, do note that the only thing mutate does is add the column to the data frame. We could do it just as well without mutate:
result = dat
result$B = with(result, ifelse(A == 0, NA, B))
result$B = zoo::na.locf(result$B)
Whether you use mutate or [ or $ or any other method to access/add the columns is tangential to the problem.
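A data.table alternative with the same last-observation-carried-forward idea is nafill() (available in data.table >= 1.12.4); a sketch:
library(data.table)
dat2 <- as.data.table(dat)
dat2[, B := nafill(fifelse(A == 0, NA_real_, B), type = "locf")]
dat2
#    A B
# 1: 1 0
# 2: 0 0
# 3: 0 0
# 4: 0 0
# 5: 1 1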
We could use fill from tidyr after changing the 'B' values to NA that corresponds to 0 in 'A'
library(dplyr)
library(tidyr)
dat %>%
mutate(B = NA^(!A)*B) %>%
fill(B)
# A B
#1 1 0
#2 0 0
#3 0 0
#4 0 0
#5 1 1
NOTE: By default, the .direction (argument in fill) is "down", but it can also take "up" i.e. fill(B, .direction="up")
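For the more general question of a column that depends on its own already-updated values (beyond this carry-forward case), a base R sketch with Reduce(accumulate = TRUE) makes the row-by-row dependency explicit:
dat <- data.frame(A = c(1, 0, 0, 0, 1), B = c(0, 1, 1, 1, 1))
# keep the previous running value when A == 0, otherwise take the current B
dat$B <- Reduce(function(prev, i) if (dat$A[i] == 0) prev else dat$B[i],
                x = seq_len(nrow(dat))[-1],
                init = dat$B[1],
                accumulate = TRUE)
dat$B
# [1] 0 0 0 0 1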
Here's a solution using grouping and rleid (run length encoding id) from data.table. I think it should be faster than the zoo solution, since zoo relies on doing multiple revs and a cumsum, and rleid is blazing fast.
Basically, we only want the last value of the previous group, so we create a grouping variable based on the diff vector of the rleid and add that to the rleid where A == 1. Then we group and take the first B value of the group for every case where A == 0:
library(dplyr)
library(data.table)
dat <- data.frame(A=c(1,0,0,0,1),B=c(0,1,1,1,1))
dat <- dat %>%
  mutate(grp = data.table::rleid(A),
         grp = ifelse(A == 1, grp + c(diff(grp), 0), grp)) %>%
  group_by(grp) %>%
  mutate(B = ifelse(A == 0, B[1], B)) # EDIT: Always carry forward B on A == 0
dat
Source: local data frame [5 x 3]
Groups: grp [2]
A B grp
<dbl> <dbl> <dbl>
1 1 0 2
2 0 0 2
3 0 0 2
4 0 0 2
5 1 1 3
EDIT: Here's an example with a longer dataset so we can really see the behavior. (Also, I switched the condition: it should be "if all A != 1", not "if not all A == 1".)
set.seed(30)
dat <- data.frame(A=sample(0:1,15,replace = TRUE),
B=sample(0:1,15,replace = TRUE))
> dat
A B
1 0 1
2 0 0
3 0 1
4 0 1
5 0 0
6 0 0
7 1 1
8 0 0
9 1 0
10 0 0
11 0 0
12 0 0
13 1 0
14 1 1
15 0 0
Result:
Source: local data frame [15 x 3]
Groups: grp [5]
A B grp
<int> <int> <dbl>
1 0 1 1
2 0 1 1
3 0 1 1
4 0 1 1
5 0 1 1
6 0 1 1
7 1 1 3
8 0 1 3
9 1 0 5
10 0 0 5
11 0 0 5
12 0 0 5
13 1 0 6
14 1 1 7
15 0 1 7
