Dummy code a collapsed column using data.table in R

It is quite easy to dummy code a collapsed column using the tidyverse. Here is a quick example of how I've done it in the past. First, I'll load the iris data and create a custom collapsed column of randomly sampled letters:
library(tidyverse)
# load practice data
data(iris)
iris <- as_tibble(iris)
# create column of collapsed values
lst <- list()
for(i in 1:150) {
  value <- as.list(paste0(sample(letters[1:2], 1), ", ", sample(letters[3:4], 1)))
  lst[i] <- value
}
# append custom columns to the iris dataset
iris$Samples <- unlist(lst)
iris$Subject <- c(1:150)
iris <- iris %>% select(Subject, everything())
# preview custom dataset
iris
# A tibble: 150 x 7
Subject Sepal.Length Sepal.Width Petal.Length Petal.Width Species Samples
<int> <dbl> <dbl> <dbl> <dbl> <fct> <chr>
1 1 5.1 3.5 1.4 0.2 setosa a, d
2 2 4.9 3 1.4 0.2 setosa a, c
3 3 4.7 3.2 1.3 0.2 setosa a, c
4 4 4.6 3.1 1.5 0.2 setosa b, c
5 5 5 3.6 1.4 0.2 setosa a, c
6 6 5.4 3.9 1.7 0.4 setosa a, d
7 7 4.6 3.4 1.4 0.3 setosa b, c
8 8 5 3.4 1.5 0.2 setosa b, c
9 9 4.4 2.9 1.4 0.2 setosa b, d
10 10 4.9 3.1 1.5 0.1 setosa a, c
# ... with 140 more rows
So, let's say that each letter represented a unique value of interest and I wanted to wrangle this data into a series of dummy coded variables for each letter. Here is how I would do this using tidyverse functions:
iris %>%
separate_rows(Samples, sep = ', ') %>%
mutate(Values = 1) %>%
pivot_wider(names_from = "Samples", values_from = "Values") %>%
mutate_if(is.double, ~replace_na(., 0))
# A tibble: 150 x 10
Subject Sepal.Length Sepal.Width Petal.Length Petal.Width Species a d c b
<int> <dbl> <dbl> <dbl> <dbl> <fct> <dbl> <dbl> <dbl> <dbl>
1 1 5.1 3.5 1.4 0.2 setosa 1 1 0 0
2 2 4.9 3 1.4 0.2 setosa 1 0 1 0
3 3 4.7 3.2 1.3 0.2 setosa 1 0 1 0
4 4 4.6 3.1 1.5 0.2 setosa 0 0 1 1
5 5 5 3.6 1.4 0.2 setosa 1 0 1 0
6 6 5.4 3.9 1.7 0.4 setosa 1 1 0 0
7 7 4.6 3.4 1.4 0.3 setosa 0 0 1 1
8 8 5 3.4 1.5 0.2 setosa 0 0 1 1
9 9 4.4 2.9 1.4 0.2 setosa 0 1 0 1
10 10 4.9 3.1 1.5 0.1 setosa 1 0 1 0
# ... with 140 more rows
This is fast and efficient for small datasets. But, I am quickly moving into datasets that have millions of rows. Enter data.table.
How would I accomplish the same process using data.table? Here is my attempt:
library(data.table)
# convert my tibble into a data.table
iris.dt <- as.data.table(iris)
# perform the separate_rows functionality on my data
result <- iris.dt[, list(Samples = unlist(strsplit(Samples, ", "))), by = Subject
][, Values := 1]
print(result)
Subject Samples Values
1: 1 a 1
2: 1 d 1
3: 2 a 1
4: 2 c 1
5: 3 a 1
---
296: 148 d 1
297: 149 a 1
298: 149 d 1
299: 150 b 1
300: 150 c 1
The problem is that I don't know how to (1) keep all other columns and (2) spread out this info in a way similar to dplyr::pivot_wider.
Any help would be much appreciated!

One way is to split the column with tstrsplit and then melt and dcast the result. It seems somewhat inefficient, but I am not sure of another way.
Example Data:
library(magrittr)
library(data.table)
set.seed(2020)
iris.dt <- as.data.table(iris)
iris.dt[, samples := paste0(sample(letters[1:2], .N, T), ', ', sample(letters[3:4], .N, T))]
Create the dummy columns:
new_cols <-
iris.dt[, tstrsplit(samples, ', ')][, I := .I] %>%
melt('I') %>%
dcast(I ~ value, fun.agg = length) %>%
.[, I := NULL]
iris.dt[, names(new_cols) := new_cols][]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species samples a b c d
# 1: 5.1 3.5 1.4 0.2 setosa b, c 0 1 1 0
# 2: 4.9 3.0 1.4 0.2 setosa a, d 1 0 0 1
# 3: 4.7 3.2 1.3 0.2 setosa b, c 0 1 1 0
# 4: 4.6 3.1 1.5 0.2 setosa a, d 1 0 0 1
# 5: 5.0 3.6 1.4 0.2 setosa a, c 1 0 1 0
# ---
# 146: 6.7 3.0 5.2 2.3 virginica b, d 0 1 0 1
# 147: 6.3 2.5 5.0 1.9 virginica a, d 1 0 0 1
# 148: 6.5 3.0 5.2 2.0 virginica b, c 0 1 1 0
# 149: 6.2 3.4 5.4 2.3 virginica a, c 1 0 1 0
# 150: 5.9 3.0 5.1 1.8 virginica a, d 1 0 0 1

Here is another option that fills a 0/1 matrix through a two-column numeric (row, column) index:
l <- strsplit(DT[["Samples"]], ",")
nl <- lengths(l)
ul <- unlist(l)
cols <- sort(unique(ul))
DT[, (cols) := {
m <- matrix(0L, nrow=.N, ncol=length(cols))
m[cbind(rep(1L:.N, nl), match(ul, cols))] <- 1L
as.data.table(m)
}]
output:
Subject Samples a b c d
1: 1 a,d 1 0 0 1
2: 2 a,c 1 0 1 0
3: 3 a,c 1 0 1 0
4: 4 b,c 0 1 1 0
5: 5 a,c 1 0 1 0
6: 6 a,d 1 0 0 1
7: 7 b,c 0 1 1 0
8: 8 b,c 0 1 1 0
9: 9 b,d 0 1 0 1
10: 10 a,c 1 0 1 0
data:
DT <- fread("Subject Samples
1 a,d
2 a,c
3 a,c
4 b,c
5 a,c
6 a,d
7 b,c
8 b,c
9 b,d
10 a,c", sep=" ")
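For the two requirements in the question (keeping the other columns and spreading the values the way pivot_wider does), one possible sketch is to build a long table keyed by Subject, dcast it into indicator columns, and merge the result back. The toy table and its Score column are made up for illustration; only Subject and Samples come from the question.

```r
library(data.table)

# Toy data (assumption): Score stands in for "all the other columns".
DT <- data.table(Subject = 1:4,
                 Score   = c(5.1, 4.9, 4.7, 4.6),
                 Samples = c("a, d", "a, c", "b, c", "b, d"))

# Long form: one row per (Subject, letter), like separate_rows().
long <- DT[, .(Sample = unlist(strsplit(Samples, ", ", fixed = TRUE))),
           by = Subject]

# Wide form: one indicator column per letter. length() counts occurrences,
# and combinations that never occur are filled with 0.
wide <- dcast(long, Subject ~ Sample,
              fun.aggregate = length, value.var = "Sample")

# Merge the indicators back so every original column is kept.
result <- merge(DT, wide, by = "Subject")
print(result)
```

Because the reshape only touches Subject and the split values, the other columns never pass through the long format, which may matter with millions of rows.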

Related

Condition in rows, modify all columns without a loop

I want to modify all selected columns of an R data.table according to a condition on their values: for each of the 4 columns named in the cols variable, if the value is greater than or equal to 1.5, I would like to set it to 1, otherwise to 0.
I tried something like this: iris[(cols) > 1.5 , (cols) := 1, .SDcols = cols]
Thanks
One data.table approach:
iris <- as.data.table(iris)
cols <- names(iris)[1:4]
cols
# [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
iris[, (cols) := lapply(.SD, function(z) fifelse(z > 1.5, 1, z)), .SDcols = cols]
iris
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# <num> <num> <num> <num> <fctr>
# 1: 1 1 1.4 0.2 setosa
# 2: 1 1 1.4 0.2 setosa
# 3: 1 1 1.3 0.2 setosa
# 4: 1 1 1.5 0.2 setosa
# 5: 1 1 1.4 0.2 setosa
# 6: 1 1 1.0 0.4 setosa
# 7: 1 1 1.4 0.3 setosa
# 8: 1 1 1.5 0.2 setosa
# 9: 1 1 1.4 0.2 setosa
# 10: 1 1 1.5 0.1 setosa
# ---
# 141: 1 1 1.0 1.0 virginica
# 142: 1 1 1.0 1.0 virginica
# 143: 1 1 1.0 1.0 virginica
# 144: 1 1 1.0 1.0 virginica
# 145: 1 1 1.0 1.0 virginica
# 146: 1 1 1.0 1.0 virginica
# 147: 1 1 1.0 1.0 virginica
# 148: 1 1 1.0 1.0 virginica
# 149: 1 1 1.0 1.0 virginica
# 150: 1 1 1.0 1.0 virginica
An alternative using set:
for (nm in cols) set(iris, which(iris[[nm]] > 1.5), nm, 1)
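The question as stated asks for a strict 0/1 recoding with a cutoff of "greater than or equal to 1.5", whereas the fifelse and set calls above leave values at or below 1.5 unchanged. A minimal sketch of the 0/1 variant, assuming the first four iris columns are the ones of interest and working on a copy named dt (a name chosen here for illustration):

```r
library(data.table)

dt   <- as.data.table(iris)   # copy, so the built-in iris stays untouched
cols <- names(dt)[1:4]        # the four numeric measurement columns

# The comparison yields TRUE/FALSE; as.integer() turns that into 1/0.
dt[, (cols) := lapply(.SD, function(z) as.integer(z >= 1.5)), .SDcols = cols]

head(dt)
```

Replacing whole columns with := allows the column type to change from double to integer, so no intermediate copy of the table is needed.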
Another solution:
library(dplyr)
library(data.table)
iris[,1:4] %>% data.table() %>% mutate_all(~ ifelse(.x>=1.5,1,0))
If you just need to target numeric columns, across can be a good fit; it also works with more specific selections such as positions and names:
library(tidyverse)
iris |>
as_tibble() |>
mutate(across(.cols = where(is.numeric),.fns = ~ if_else(.x > 1.5,1,.x)))
#> # A tibble: 150 x 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 1 1 1.4 0.2 setosa
#> 2 1 1 1.4 0.2 setosa
#> 3 1 1 1.3 0.2 setosa
#> 4 1 1 1.5 0.2 setosa
#> 5 1 1 1.4 0.2 setosa
#> 6 1 1 1 0.4 setosa
#> 7 1 1 1.4 0.3 setosa
#> 8 1 1 1.5 0.2 setosa
#> 9 1 1 1.4 0.2 setosa
#> 10 1 1 1.5 0.1 setosa
#> # ... with 140 more rows
Created on 2021-10-18 by the reprex package (v2.0.1)
Base R option -
data <- iris
cols <- 1:4
data[cols] <- +(data[cols] > 1.5)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#1 1 1 0 0 setosa
#2 1 1 0 0 setosa
#3 1 1 0 0 setosa
#4 1 1 0 0 setosa
#5 1 1 0 0 setosa
#6 1 1 1 0 setosa
#...
#...
The + at the beginning is used to change the logical values (TRUE/FALSE) to integers (1/0).
We may also combine across with the unary plus:
library(dplyr)
iris %>%
mutate(across(where(is.numeric), ~ +(. > 1.5)))

How to perform a percentage increase on a column when surpassing a specified value

How is it possible to introduce a percentage increase to a column after surpassing a given value in that column?
Here is a dataframe:
a <- data.frame(id = c(1,1,1,1,1,1,1,1,1), num = c(1,1.3,1.6,1.7,1.9,2.1,2.4,2.5,3.5))
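With a threshold of 2 for the num column and a percentage increase of 1%, the result would look like this (the first value above the threshold is kept, and each later value is the previous adjusted value increased by 1%):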
id num adjusted
1 1.0 1.0
1 1.3 1.3
1 1.6 1.6
1 1.7 1.7
1 1.9 1.9
1 2.1 2.1
1 2.4 2.121
1 2.5 2.14221
1 3.5 2.1636321
Any help would be much appreciated
Does this work:
library(dplyr)
library(purrr)
a %>% filter(num > 2) %>% mutate(adjusted = accumulate(num, ~ .x * 1.01)) %>%
right_join(a) %>% mutate(adjusted = coalesce(adjusted, num)) %>% arrange(num)
Joining, by = c("id", "num")
id num adjusted
1 1 1.0 1.000000
2 1 1.3 1.300000
3 1 1.6 1.600000
4 1 1.7 1.700000
5 1 1.9 1.900000
6 1 2.1 2.100000
7 1 2.4 2.121000
8 1 2.5 2.142210
9 1 3.5 2.163632
A base R alternative, using the closed form for compound growth:
thresh <- 2
pct_inc <- .01
# filter to numbers which will be changed
to_adj <- a$num[a$num > thresh]
# replace numbers above thresh with pct increase from first above thresh
a$adjusted <-
replace(a$num, a$num > thresh,
to_adj[1]*(1 + pct_inc)^(seq_along(to_adj) - 1))
a
# id num adjusted
# 1 1 1.0 1.000000
# 2 1 1.3 1.300000
# 3 1 1.6 1.600000
# 4 1 1.7 1.700000
# 5 1 1.9 1.900000
# 6 1 2.1 2.100000
# 7 1 2.4 2.121000
# 8 1 2.5 2.142210
# 9 1 3.5 2.163632
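The closed form used above, first value times (1 + pct_inc)^(k - 1), is the same compounding that cumprod expresses step by step. The following is a sketch under the same assumptions as the answers (a single id and num already sorted in increasing order); the helper names above, growth, and first are chosen here for illustration.

```r
a <- data.frame(id  = rep(1, 9),
                num = c(1, 1.3, 1.6, 1.7, 1.9, 2.1, 2.4, 2.5, 3.5))
thresh  <- 2
pct_inc <- 0.01

above <- a$num > thresh

# Running growth factor: stays at 1 below the threshold, then is multiplied
# by (1 + pct_inc) for every row above it.
growth <- cumprod(ifelse(above, 1 + pct_inc, 1))

# First value above the threshold; it is kept unchanged, so one factor of
# (1 + pct_inc) is divided back out.
first <- a$num[which(above)[1]]
a$adjusted <- ifelse(above, first * growth / (1 + pct_inc), a$num)

a
```

With more than one id, the same computation would have to be applied per group, which neither this sketch nor the answers above attempt.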

Pandas equivalent of dplyr everything()

In R I frequently use dplyr's select in combination with everything()
df %>% select(var4, var17, everything())
The above for example would reorder the columns of the dataframe, such that var4 is the first, var17 is the second and subsequently all remaining columns are listed. What is the most pandathonic way of doing this? Working with many columns makes explicitly spelling them out a pain as well as keeping track of their position.
The ideal solution is short, readable and can be used in pandas chaining.
Use Index.difference to get all columns not named in the list, then concatenate the two lists:
import pandas as pd

df = pd.DataFrame({
'G':list('abcdef'),
'var17':[4,5,4,5,5,4],
'A':[7,8,9,4,2,3],
'var4':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'F':list('aaabbb')
})
cols = ['var4','var17']
another = df.columns.difference(cols, sort=False).tolist()
df = df[cols + another]
print (df)
var4 var17 G A E F
0 1 4 a 7 5 a
1 3 5 b 8 3 a
2 5 4 c 9 6 a
3 7 5 d 4 9 b
4 1 5 e 2 2 b
5 0 4 f 3 4 b
EDIT: For chaining, it is possible to use DataFrame.pipe, which passes the DataFrame to the function:
def everything_after(df, cols):
    another = df.columns.difference(cols, sort=False).tolist()
    return df[cols + another]

df = df.pipe(everything_after, ['var4','var17'])
print (df)
var4 var17 G A E F
0 1 4 a 7 5 a
1 3 5 b 8 3 a
2 5 4 c 9 6 a
3 7 5 d 4 9 b
4 1 5 e 2 2 b
5 0 4 f 3 4 b
Here is how smoothly you can do it with datar:
>>> from datar import f
>>> from datar.datasets import iris
>>> from datar.dplyr import select, everything, slice_head
>>> iris >> slice_head(5)
Sepal_Length Sepal_Width Petal_Length Petal_Width Species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
>>> iris >> select(f.Species, everything()) >> slice_head(5)
Species Sepal_Length Sepal_Width Petal_Length Petal_Width
0 setosa 5.1 3.5 1.4 0.2
1 setosa 4.9 3.0 1.4 0.2
2 setosa 4.7 3.2 1.3 0.2
3 setosa 4.6 3.1 1.5 0.2
4 setosa 5.0 3.6 1.4 0.2
I am the author of the package. Feel free to submit issues if you have any questions.

dplyr suppress next n occurrences of a value in a group

I have recently looked for advice on how to suppress all but the first occurrences of a value within a group using dplyr (dplyr override all but the first occurrences of a value within a group).
The solution was a really clever one and now I am struggling with finding something equally efficient in case I need to suppress only n next values.
For example, in the code below I create a new "tag" column:
library('dplyr')
data(iris)
set.seed(1)
iris$tag <- sample(c(0,1), 150, replace=TRUE, prob = c(0.7, 0.3))
giris <- iris %>% group_by(Species)
# Source: local data frame [150 x 6]
# Groups: Species [3]
#
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species tag
# (dbl) (dbl) (dbl) (dbl) (fctr) (dbl)
# 1 5.1 3.5 1.4 0.2 setosa 0
# 2 4.9 3.0 1.4 0.2 setosa 0
# 3 4.7 3.2 1.3 0.2 setosa 0
# 4 4.6 3.1 1.5 0.2 setosa 1
# 5 5.0 3.6 1.4 0.2 setosa 0
# 6 5.4 3.9 1.7 0.4 setosa 1
# 7 4.6 3.4 1.4 0.3 setosa 1
# 8 5.0 3.4 1.5 0.2 setosa 0
# 9 4.4 2.9 1.4 0.2 setosa 0
# 10 4.9 3.1 1.5 0.1 setosa 0
# .. ... ... ... ... ... ...
In the setosa group rows: 4, 6, 7, ... are tagged as "1"s. I am trying to suppress "1"s (i.e. convert them to "0"s) in the next two rows after any occurrence of a "1". In other words, rows #5 and #6 should be set to "0" but #7 should remain unaffected. In this case, row #7 happens to be a "1", so rows #8 and #9 should be set to "0"s and so on...
Any hint on how to do this in dplyr? This package is really powerful, but for some reason it is a mental challenge for me to master all the subtleties...
Some more examples:
in case of: 0 0 1 1, the output should be 0 0 1 0
in case of: 0 0 1 1 1 1 1, the output should be 0 0 1 0 0 1 0
I can't think of any better way to do this than a loop:
flip_followers = function(tag, nf = 2L){
  w = which(tag==1L)
  keep = rep(TRUE, length(w))
  for (i in seq_along(w)) if (keep[i]) keep[match(w[i]+seq_len(nf), w)] = FALSE
  tag[w[!keep]] = 0L
  tag
}
giris %>% mutate(tag = flip_followers(tag))
Source: local data frame [150 x 6]
Groups: Species [3]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species tag
(dbl) (dbl) (dbl) (dbl) (fctr) (dbl)
1 5.1 3.5 1.4 0.2 setosa 0
2 4.9 3.0 1.4 0.2 setosa 0
3 4.7 3.2 1.3 0.2 setosa 0
4 4.6 3.1 1.5 0.2 setosa 1
5 5.0 3.6 1.4 0.2 setosa 0
6 5.4 3.9 1.7 0.4 setosa 0
7 4.6 3.4 1.4 0.3 setosa 1
8 5.0 3.4 1.5 0.2 setosa 0
9 4.4 2.9 1.4 0.2 setosa 0
10 4.9 3.1 1.5 0.1 setosa 0
.. ... ... ... ... ... ...
For a possible speedup, you could switch the loop to if (keep[i]) keep[i+seq_len(nf)][match(w[i]+seq_len(nf), w[i+seq_len(nf)])] = FALSE so that match only searches the next nf elements of w. I'm sure Rcpp would be faster still, if that's a serious concern.
To me this is semantically clearer if you use an accumulating reduce to keep track of the refractory period.
suppress <- function(x, w) {
  r <- Reduce(function(d,i) if(i&!d) w else max(0,d-1), x, init=0, acc=TRUE)[-1]
  x * (r==w)
}
Example
suppress(c(0,0,1,1,1,1,1), 2)
#> [1] 0 0 1 0 0 1 0
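The same refractory-period idea can be written as an explicit loop with a countdown, which may be easier to follow than the Reduce call. This is only a sketch: suppress_next and cooldown are names invented here, and base R's ave is used for the per-group step in place of a dplyr mutate.

```r
# Hypothetical helper: after each kept 1, force the next n entries to 0.
suppress_next <- function(x, n = 2L) {
  cooldown <- 0L                  # entries still to be suppressed
  for (i in seq_along(x)) {
    if (cooldown > 0L) {
      x[i] <- 0                   # inside the refractory period
      cooldown <- cooldown - 1L
    } else if (x[i] == 1) {
      cooldown <- n               # a kept 1 starts a new period
    }
  }
  x
}

suppress_next(c(0, 0, 1, 1, 1, 1, 1), 2)
#> [1] 0 0 1 0 0 1 0

# Per-group use: ave() applies the function within each level of grp.
tag <- c(0, 0, 1, 1, 1, 1, 1, 1)
grp <- c("a", "a", "a", "a", "b", "b", "b", "b")
ave(tag, grp, FUN = suppress_next)
#> [1] 0 0 1 0 1 0 0 1
```

Inside a grouped dplyr pipeline the helper could be called from mutate in the same way as flip_followers above.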
Kinda clumsy but it seems like you have to walk down the vector regardless
f <- function(x, repl = c(1,0,0)) {
  sx <- seq(x)
  for (ii in seq_along(x))
    if (x[ii] == repl[1L]) ## thanks to @Frank for catching
      x[ii:(ii + length(repl) - 1)] <- repl
  x[sx]
}
(x <- c(0,0,1,1,1,1,1)); f(x)
# [1] 0 0 1 1 1 1 1
# [1] 0 0 1 0 0 1 0
(x <- c(0,0,1,0,1,0,1,1)); f(x)
# [1] 0 0 1 0 1 0 1 1
# [1] 0 0 1 0 0 0 1 0
And your example
set.seed(1)
head(n = 10,
cbind(tag <- sample(c(0,1), 150, replace=TRUE, prob = c(0.7, 0.3)),
tag2 = f(tag)))
# [1,] 0 0
# [2,] 0 0
# [3,] 0 0
# [4,] 1 1
# [5,] 0 0
# [6,] 1 0
# [7,] 1 1
# [8,] 0 0
# [9,] 0 0
# [10,] 0 0
And you can replace with whatever you want
(x <- c(0,0,1,1,1,1,1)); f(x, c(1,0,0,0))
# [1] 0 0 1 1 1 1 1
# [1] 0 0 1 0 0 0 1
(x <- c(0,0,1,1,1,1,1)); f(x, 1:3)
# [1] 0 0 1 1 1 1 1
# [1] 0 0 1 2 3 1 2
## courtesy of @Frank this would also work
(x <- c(0,0,1,1,0,0,1)); f(x, 0:2)
# [1] 0 0 1 1 0 0 1
# [1] 0 1 2 1 0 1 2

Shifting (+ wrap around) a data frame in R

I have a data frame like this:
A B value
1 1 0.123
2 1 0.213
3 1 0.543
1 2 0.313
2 2 0.123
3 2 0.412
What I want to do is create a function that shifts this data frame by a given number of rows, wrapping around at the end. For example,
if the shift value is 1, the data frame becomes:
A B value
3 2 0.412
1 1 0.123
2 1 0.213
3 1 0.543
1 2 0.313
2 2 0.123
etc...
The function should have a signature like this:
shift<-function(dataframe,shiftvalue)
Is there a simple way to do this in R without resorting to a lot of loops?
You can do it many ways, but one way is to use head and tail:
df <- data.frame(a=1:10, b = 11:20)
shift <- function(d, k) rbind( tail(d,k), head(d,-k), deparse.level = 0 )
> shift(df,3)
a b
8 8 18
9 9 19
10 10 20
1 1 11
2 2 12
3 3 13
4 4 14
5 5 15
6 6 16
7 7 17
I prefer plain old modulo ;-)
shift<-function(df,offset) df[((1:nrow(df))-1-offset)%%nrow(df)+1,]
It is pretty straightforward; the only quirk is R's one-based indexing. It also works for offsets like 0, -7 or 7*nrow(df)...
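To see how the modulo index behaves, here is a small sketch on the data frame from the question. The name shift_rows is chosen here to avoid masking other shift functions, and drop = FALSE is an addition so that a one-column data frame stays a data frame.

```r
# Row i of the result is row ((i - 1 - offset) mod n) + 1 of the input.
shift_rows <- function(df, offset) {
  n <- nrow(df)
  df[(seq_len(n) - 1 - offset) %% n + 1, , drop = FALSE]
}

df <- data.frame(A     = c(1, 2, 3, 1, 2, 3),
                 B     = c(1, 1, 1, 2, 2, 2),
                 value = c(0.123, 0.213, 0.543, 0.313, 0.123, 0.412))

rownames(shift_rows(df, 1))    # last row moves to the front
#> [1] "6" "1" "2" "3" "4" "5"
rownames(shift_rows(df, -7))   # same as shifting by -1
#> [1] "2" "3" "4" "5" "6" "1"
rownames(shift_rows(df, 6))    # a full cycle leaves the order unchanged
#> [1] "1" "2" "3" "4" "5" "6"
```

R's %% returns a non-negative result for a positive modulus, which is why negative offsets and offsets larger than nrow(df) need no special handling.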
Here is my implementation:
> shift <- function(df, sv = 1) df[c((sv+1):nrow(df), 1:sv),]
> head(shift(iris, 3))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
> tail(shift(iris, 3))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
148 6.5 3.0 5.2 2.0 virginica
149 6.2 3.4 5.4 2.3 virginica
150 5.9 3.0 5.1 1.8 virginica
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
Updated to shift in the direction the question asks for (the last rows move to the front):
> shift <- function(df, sv = 1) df[c((nrow(df)-sv+1):nrow(df), 1:(nrow(df)-sv)),]
> head(shift(iris, 3))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
148 6.5 3.0 5.2 2.0 virginica
149 6.2 3.4 5.4 2.3 virginica
150 5.9 3.0 5.1 1.8 virginica
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
> tail(shift(iris, 3))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
142 6.9 3.1 5.1 2.3 virginica
143 5.8 2.7 5.1 1.9 virginica
144 6.8 3.2 5.9 2.3 virginica
145 6.7 3.3 5.7 2.5 virginica
146 6.7 3.0 5.2 2.3 virginica
147 6.3 2.5 5.0 1.9 virginica
There's a shift function in taRifx that works on vectors. Applying it results in coercion of all columns to character if any are character, so we'll use a trick from plyr. I'll likely write a data.frame method for it soon:
dd <- data.frame(b = seq(4),
x = c("A", "D", "A", "C"), y = c('a','b','c','d'),
z = c(1, 1, 1, 2),stringsAsFactors=FALSE)
> dd
b x y z
1 1 A a 1
2 2 D b 1
3 3 A c 1
4 4 C d 2
library(taRifx)
library(plyr)
shift.data.frame <- colwise(shift)
> shift.data.frame(dd)
b x y z
1 2 D b 1
2 3 A c 1
3 4 C d 2
4 1 A a 1
> shift(dd,n=-1)
b x y z
1 4 C d 2
2 1 A a 1
3 2 D b 1
4 3 A c 1
> shift(dd,n=-1,wrap=FALSE)
b x y z
1 1 A a 1
2 2 D b 1
3 3 A c 1
> shift(dd,n=-1,wrap=FALSE,pad=TRUE)
b x y z
1 NA <NA> <NA> NA
2 1 A a 1
3 2 D b 1
4 3 A c 1
The advantage of shift is that it takes a bunch of options:
n can be positive or negative to wrap from left/right
wrap can be turned on or off
If wrap is turned off, pad can be turned on to pad with NAs so vector remains the same length
dplyr's lag() is related, but it pads with NA rather than wrapping around (see https://dplyr.tidyverse.org/reference/lead-lag.html):
library(dplyr)
lag(1:5, n = 1)
#> [1] NA 1 2 3 4
lag(1:5, n = 2)
#> [1] NA NA 1 2 3
