Use a string as a function argument in R

For a given n, I would like to enumerate all 2^n - 1 possible non-empty subsets (that is, excluding the null set).
So for n = 3 items (A, B, C) I have 2^3 - 1 = 7 combinations: {A}, {B}, {C}, {A, B}, {A, C}, {B, C}, and {A, B, C}.
To enumerate these subsets I would like to define a binary grid. For n = 3:
> expand.grid(0:1, 0:1, 0:1)[ -1, ]
Var1 Var2 Var3
2 1 0 0
3 0 1 0
4 1 1 0
5 0 0 1
6 1 0 1
7 0 1 1
8 1 1 1
However, n is itself a random variable that changes from one simulation to the next.
It is easy to programmatically generate the string I need to pass to the function call. For instance, for n = 7, I can run:
> gsub(", $", "", paste(rep("0:1, ", 7), collapse = ""))
[1] "0:1, 0:1, 0:1, 0:1, 0:1, 0:1, 0:1"
But when I try to pass this string to expand.grid() I get an error. Surely there is a function that can coerce this string to a usable expression?

Running a string as code is not recommended and should generally be avoided.
In this case, you can use replicate to repeat the vector 0:1 n times and then pass the resulting list to expand.grid with do.call.
n <- 3
do.call(expand.grid, replicate(n, list(0:1)))
# Var1 Var2 Var3
#1 0 0 0
#2 1 0 0
#3 0 1 0
#4 1 1 0
#5 0 0 1
#6 1 0 1
#7 0 1 1
#8 1 1 1
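The question asks for the grid without the null set, so here is a minimal base R sketch that builds the same grid and then drops the all-zero row. The helper name binary_subsets is hypothetical (it is not part of the original answer), and rep(list(0:1), n) is used here as an equivalent way to build the list of n copies of 0:1.

```r
# Hypothetical helper: all non-empty subsets of n items as a 0/1 grid.
binary_subsets <- function(n) {
  # One 0:1 vector per item; do.call passes them as separate arguments.
  grid <- do.call(expand.grid, rep(list(0:1), n))
  # Drop the row of all zeros (the null set).
  grid[rowSums(grid) > 0, , drop = FALSE]
}

subsets <- binary_subsets(3)
nrow(subsets)  # 2^3 - 1 rows
```

Because n is an ordinary argument, the same call works unchanged when n varies from one simulation to the next.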

We can also use crossing from tidyr, which returns the combinations as a tibble:
library(dplyr)
library(tidyr)
library(stringr)
library(purrr)
n <- 3
replicate(n, list(0:1)) %>%
set_names(str_c('Var', seq_along(.))) %>%
invoke(crossing, .)
# A tibble: 8 x 3
# Var1 Var2 Var3
# <int> <int> <int>
#1 0 0 0
#2 0 0 1
#3 0 1 0
#4 0 1 1
#5 1 0 0
#6 1 0 1
#7 1 1 0
#8 1 1 1

Related

R- count values in data.frame

df <- data.frame(row.names = c('ID1','ID2','ID3','ID4'),var1 = c(0,1,2,3),var2 = c(0,0,0,0),var3 = c(1,2,3,0),var4 = c('1','1','2','2'))
> df
var1 var2 var3 var4
ID1 0 0 1 1
ID2 1 0 2 1
ID3 2 0 3 2
ID4 3 0 0 2
I want df to look like this
var1 var2 var3 var4
0 1 4 1 0
1 1 0 1 2
2 1 0 1 2
3 1 0 1 0
So I want to count how often each value occurs in each column of df. The problem is that not every value occurs in every column.
I tried lapply(df, table), but that returns a list of tables with different lengths, which I cannot convert into a data.frame.
I could do it manually with table(df$var1) for every variable and bind everything together afterwards, but that is tedious. Can you find a better way?
Thanks ;)
Call table on a factor whose levels cover every value present in the entire dataset, so that each column is counted against the same set of values.
sapply(df,function(x) table(factor(x, levels = 0:3)))
# var1 var2 var3 var4
#0 1 4 1 0
#1 1 0 1 2
#2 1 0 1 2
#3 1 0 1 0
If you don't know beforehand which values your data can take, you can derive the levels from the data itself.
vec <- unique(unlist(df))
sapply(df, function(x) table(factor(x, levels = vec)))
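As a small variation on the idea above, the sketch below converts every column to character before collecting the levels and sorts them, so the rows of the result come out in a predictable order even when columns have mixed types. The names lvls and counts are made up for this illustration.

```r
df <- data.frame(row.names = c("ID1", "ID2", "ID3", "ID4"),
                 var1 = c(0, 1, 2, 3),
                 var2 = c(0, 0, 0, 0),
                 var3 = c(1, 2, 3, 0),
                 var4 = c("1", "1", "2", "2"))

# Collect every value that occurs anywhere in df, as sorted character levels.
lvls <- sort(unique(unlist(lapply(df, as.character))))

# Count each column against the shared levels; sapply binds the tables
# into a matrix with one row per level and one column per variable.
counts <- sapply(df, function(x) table(factor(as.character(x), levels = lvls)))
counts
```

Wrapping the result in as.data.frame(counts) gives a data frame if a matrix is not what you need.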
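We could also do this without any loop. Note that the result is transposed relative to the desired output, with one row per column of df: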
table(c(col(df)), unlist(df))
# 0 1 2 3
# 1 1 1 1 1
# 2 4 0 0 0
# 3 1 1 1 1
# 4 0 2 2 0

Split strings into individual characters, count number of times each character occurs then combine into single df/matrix

I have a vector of strings and want to count how many times each character occurs in each of them, producing a single data frame or matrix as output.
I can do it for each individual string separately, but I am struggling to combine the results into a single output.
###Data
strings <- c("ba", "uv", "bg", "vuabv", "nabvnmnm")
### Split into single characters
single_chars <- str_split(strings, "")
Desired Output
strings |a|b|u|v|g|n|m
ba |1|1|0|0|0|0|0
uv |0|0|1|1|0|0|0
bg |0|1|0|0|1|0|0
vuabv |1|1|1|2|0|0|0
nabvnmnm |1|1|0|1|0|3|2
One dplyr, tibble, purrr and stringr option could be:
bind_cols(tibble(strings),
map_dfc(.x = unique(unlist(str_split(strings, boundary("character")))),
~ tibble(!!.x := str_count(strings, .x))))
strings b a u v g n m
<chr> <int> <int> <int> <int> <int> <int> <int>
1 ba 1 1 0 0 0 0 0
2 uv 0 0 1 1 0 0 0
3 bg 1 0 0 0 1 0 0
4 vuabv 1 1 1 2 0 0 0
5 nabvnmnm 1 1 0 1 0 3 2
Here is a base R option using factor + table
cbind(
  strings,
  as.data.frame(
    do.call(
      rbind,
      lapply(
        lst <- strsplit(strings, ""),
        function(x) table(factor(x, levels = unique(sort(unlist(lst)))))
      )
    )
  )
)
which gives
strings a b g m n u v
1 ba 1 1 0 0 0 0 0
2 uv 0 0 0 0 0 1 1
3 bg 0 1 1 0 0 0 0
4 vuabv 1 1 0 0 0 1 2
5 nabvnmnm 1 1 0 2 3 0 1
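Another way to look at the same problem, sketched here in base R, is as a two-way table of "which string" against "which character". The variable names chars, owner and counts are made up for this illustration, and the sketch assumes the strings are distinct (duplicated strings would need a separate ID).

```r
strings <- c("ba", "uv", "bg", "vuabv", "nabvnmnm")

# Split every string into single characters.
chars <- strsplit(strings, "")

# Label each character with the string it came from; the factor keeps
# the rows in the original order of `strings`.
owner <- factor(rep(strings, lengths(chars)), levels = strings)

# Cross-tabulate string against character.
counts <- table(owner, unlist(chars))

# Convert the table to a plain data frame with one column per character.
as.data.frame.matrix(counts)
```

The columns come out in alphabetical order because table sorts the character values.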

Frequency table but custom function instead of default count?

Suppose I have a data frame:
bla <- data.frame(
a = c(1,1,1,0,0,1,1,1,0,0),
b = c(0,0,0,1,1,0,0,1,1,0),
c = c(1,0,1,0,1,0,1,0,1,0),
d = c(2,3,4,7,8,6,5,2,1,0)
)
I can use table() to get the counts of each combination of 1/0 for each of a, b and c:
table(bla %>% select(a:c)) %>% as.data.frame()
a b c Freq
1 0 0 0 1
2 1 0 0 2
3 0 1 0 1
4 1 1 0 1
5 0 0 1 0
6 1 0 1 3
7 0 1 1 2
8 1 1 1 0
Here's my question: is there an approach that returns both the frequency AND the mean of column d for each combination of a, b and c?
I.e. it looks like table() auto groups by each distinct combination then returns count() (Freq field). Can I do the same but add mean()?
Here's a base R solution using aggregate:
aggregate(d ~ ., data = bla,
FUN = function(x) c('mean' = mean(x), 'count' = length(x)))
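One thing to be aware of with the aggregate call above: because FUN returns two values, the d column of the result is a two-column matrix rather than two ordinary columns. A minimal sketch of flattening it, using the names res and flat as placeholders:

```r
bla <- data.frame(
  a = c(1, 1, 1, 0, 0, 1, 1, 1, 0, 0),
  b = c(0, 0, 0, 1, 1, 0, 0, 1, 1, 0),
  c = c(1, 0, 1, 0, 1, 0, 1, 0, 1, 0),
  d = c(2, 3, 4, 7, 8, 6, 5, 2, 1, 0)
)

res <- aggregate(d ~ ., data = bla,
                 FUN = function(x) c(mean = mean(x), count = length(x)))

# res$d is a matrix with columns "mean" and "count";
# do.call(data.frame, ...) spreads it into separate columns.
flat <- do.call(data.frame, res)
flat
```

Only combinations of a, b and c that actually occur in bla appear in the result.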
And, the dplyr package could also be handy (this would be my preference):
library(dplyr)
bla %>%
group_by(a, b, c) %>% # or group_by_at(vars(-d))
summarise(count = n(),
mean_d = mean(d))
If you also want the combinations that do not occur in the data, with dplyr and tidyr you can do:
bla %>%
complete(a, b, c) %>%
group_by_at(1:3) %>%
summarise(count = sum(!is.na(d)),
mean = mean(d))
a b c count mean
<dbl> <dbl> <dbl> <dbl> <dbl>
1 0 0 0 1 0
2 0 0 1 0 NA
3 0 1 0 1 7
4 0 1 1 2 4.5
5 1 0 0 2 4.5
6 1 0 1 3 3.67
7 1 1 0 1 2
8 1 1 1 0 NA

how to create a new list of data.frames by systematically rearranging columns from an existing list of data.frames

I have a list of about 100 data frames. I would like to create a new list of data frames where the first data frame is made up of the first columns of all the existing data frames, and the second data frame is made up of the second column etc...
Please see code below for an example of what I want to do.
a <- c(0, 0, 1, 1, 1)
b <- c(0, 1, 0, 0, 1)
c <- c(1, 1, 0, 0, 1)
df1 <- data.frame(a, b, c)
df2 <- data.frame(c, b, a)
df3 <- data.frame(b, a, c)
my_lst <- list(df1, df2, df3)
new_df1 <- data.frame(df1[,1], df2[,1], df3[,1])
new_df2 <- data.frame(df1[,2], df2[,2], df3[,2])
new_df3 <- data.frame(df1[,3], df2[,3], df3[,3])
new_lst <- list(new_df1, new_df2, new_df3)
Is there a more compact way of doing this with large lists containing large data frames? Thanks in advance.
This is an option:
cols <- ncol(my_lst[[1]])
lapply(1:cols, function(x) do.call(cbind, lapply(my_lst, `[`, x)))
[[1]]
a c b
1 0 1 0
2 0 1 1
3 1 0 0
4 1 0 0
5 1 1 1
[[2]]
b b a
1 0 0 0
2 1 1 0
3 0 0 1
4 0 0 1
5 1 1 1
[[3]]
c a c
1 1 0 1
2 1 0 1
3 0 1 0
4 0 1 0
5 1 1 1
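In the output above the column names collide (for example b, b, a), since each column keeps the name it had in its source data frame. If that is a problem, the following sketch labels each column by the position of the data frame it came from; the df1, df2, ... labels are an arbitrary choice made for this illustration.

```r
a <- c(0, 0, 1, 1, 1)
b <- c(0, 1, 0, 0, 1)
c <- c(1, 1, 0, 0, 1)
my_lst <- list(data.frame(a, b, c), data.frame(c, b, a), data.frame(b, a, c))

new_lst <- lapply(seq_len(ncol(my_lst[[1]])), function(j) {
  # Pull column j out of every data frame in the list.
  cols <- lapply(my_lst, `[[`, j)
  # Name each column after the position of its source data frame.
  names(cols) <- paste0("df", seq_along(my_lst))
  as.data.frame(cols)
})

new_lst[[1]]
```

This assumes every data frame in the list has the same number of rows and columns, as in the question.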
A tidyverse option is to give every data frame the same column names, transpose the list, and bind the columns:
library(tidyverse)
my_lst %>%
map(setNames, letters[1:3]) %>%
purrr::transpose() %>%
map(bind_cols)
#$a
# A tibble: 5 x 3
# V1 V2 V3
# <dbl> <dbl> <dbl>
#1 0 1 0
#2 0 1 1
#3 1 0 0
#4 1 0 0
#5 1 1 1
#$b
# A tibble: 5 x 3
# V1 V2 V3
# <dbl> <dbl> <dbl>
#1 0 0 0
#2 1 1 0
#3 0 0 1
#4 0 0 1
#5 1 1 1
#$c
# A tibble: 5 x 3
# V1 V2 V3
# <dbl> <dbl> <dbl>
#1 1 0 1
#2 1 0 1
#3 0 1 0
#4 0 1 0
#5 1 1 1

Detect a pattern in a column with R

I am trying to calculate how many times a person moved from one job to another. A move is counted every time the Job column shows the pattern 1 -> 0 -> 1.
In this example, there was one rotation:
Person Job
A 1
A 0
A 1
A 1
In this other example, person B had one rotation as well.
Person Job
A 1
A 0
A 1
A 1
B 1
B 0
B 0
B 1
What would be a good approach to record this pattern in a new column 'Rotation', by person?
Person Job Rotation
A 1 0
A 0 0
A 1 1
A 1 1
B 1 0
B 0 0
B 0 0
B 1 1
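You can use a regular expression to count each run of zeros that sits between two 1s. The pattern "(?<=1)0+(?=1)" matches one or more zeros that are preceded by a 1 and followed by a 1, so both 101 and 1001 count as one rotation. Applying it to the accumulated Job history gives a running count per person: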
library(tidyverse)
df%>%
group_by(Person)%>%
mutate(Rotation=str_count(accumulate(Job,str_c,collapse=""),"(?<=1)0+(?=1)"))
# A tibble: 12 x 3
# Groups: Person [3]
Person Job Rotation
<fct> <int> <int>
1 A 1 0
2 A 0 0
3 A 1 1
4 A 1 1
5 B 1 0
6 B 0 0
7 B 0 0
8 B 1 1
9 C 0 0
10 C 1 0
11 C 0 0
12 C 1 1
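To see what the lookaround pattern does on its own, here is a small base R sketch that applies it to a few job histories written as strings. The helper name count_gaps is hypothetical, and gregexpr with perl = TRUE is used here in place of str_count so that no packages are needed.

```r
# Hypothetical helper: number of zero-runs enclosed by 1s in a 0/1 string.
count_gaps <- function(s) {
  # perl = TRUE is required for lookbehind / lookahead.
  m <- gregexpr("(?<=1)0+(?=1)", s, perl = TRUE)[[1]]
  # gregexpr returns -1 when there is no match, so count positive positions.
  sum(m > 0)
}

count_gaps("1011")   # one gap: 1 -> 0 -> 1
count_gaps("1001")   # a longer gap still counts once
count_gaps("0101")   # a leading 0 is not enclosed, so only one gap
count_gaps("10")     # no closing 1, so no gap
```

Because the lookarounds do not consume the surrounding 1s, back-to-back gaps such as "10101" are each counted.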
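One solution is to use lag with default = 0 and take the cumulative sum of the condition "value changes from 0 to 1". Each person's first job also satisfies that condition, so subtract 1 from the cumsum to get the rotation. Note that this assumes every person starts with Job == 1; for a person who starts at 0, the leading rows would come out as -1.
The solution using dplyr can be written as: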
library(dplyr)
df %>% group_by(Person) %>%
mutate(Rotation = cumsum(lag(Job, default = 0) == 0 & Job ==1) - 1) %>%
as.data.frame()
# Person Job Rotation
# 1 A 1 0
# 2 A 0 0
# 3 A 1 1
# 4 A 1 1
# 5 B 1 0
# 6 B 0 0
# 7 B 0 0
# 8 B 1 1
Data:
df <- read.table(text ="
Person Job
A 1
A 0
A 1
A 1
B 1
B 0
B 0
B 1",
header = TRUE, stringsAsFactors = FALSE)
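The same lag-and-cumsum idea can be sketched in base R, with one adjustment that is not part of the answer above: pmax clamps the count at 0 so that a person who starts unemployed does not get -1. The helper name rotation_count is made up for this illustration, and person C is added to the data to show the clamped case.

```r
df <- data.frame(Person = rep(c("A", "B", "C"), each = 4L),
                 Job = as.integer(c(1, 0, 1, 1,
                                    1, 0, 0, 1,
                                    0, 1, 0, 1)))

# Hypothetical helper: running number of 0 -> 1 transitions after the first job.
rotation_count <- function(job) {
  prev <- c(0L, head(job, -1))         # previous value, 0 before the first row
  starts <- prev == 0L & job == 1L     # TRUE whenever a job begins
  pmax(cumsum(starts) - 1L, 0L)        # first job does not count; never below 0
}

# ave applies the helper within each person and keeps the row order.
df$Rotation <- ave(df$Job, df$Person, FUN = rotation_count)
df
```

Whether a first job after initial unemployment should count as a rotation is a judgment call; this sketch assumes it should not.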
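Here is an option with data.table. Note that the literal pattern "101" only matches a single 0 between two 1s, so person B (1, 0, 0, 1) gets 0 in the output below.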
library(data.table)
setDT(df)[, Rotation := +(grepl("101", do.call(paste0,
shift(Job, 0:.N, fill = 0)))), Person]
df
# Person Job Rotation
# 1: A 1 0
# 2: A 0 0
# 3: A 1 1
# 4: A 1 1
# 5: B 1 0
# 6: B 0 0
# 7: B 0 0
# 8: B 1 0
# 9: C 0 0
#10: C 1 0
#11: C 0 0
#12: C 1 1
A base R option, which also matches only the literal pattern "101", would be
f1 <- function(x) Reduce(paste0, x, accumulate = TRUE)
df$Rotation <- with(df, +grepl("101", ave(Job, Person, FUN = f1)))
Data:
df <- data.frame(Person = rep(c("A", "B", "C"), each = 4L),
Job = as.integer(c(1,0,1,1,
1,0,0,1,
0,1,0,1)))
I'm assuming that if a person starts unemployed,
the first job they get doesn't count as rotation.
In that case:
library(dplyr)
rotation <- function(x) {
  # this will have 1 when a person got a new job
  dif <- c(0L, diff(x))
  dif[dif < 0L] <- 0L
  if (x[1L] == 0L) {
    # unemployed at the beginning,
    # first job doesn't count as change from one to another
    dif[which.max(dif)] <- 0L
  }
  # return
  cumsum(dif)
}
df <- data.frame(Person = rep(c("A", "B", "C"), each = 4L),
Job = as.integer(c(1,0,1,1,
1,0,0,1,
0,1,0,1)))
df %>%
group_by(Person) %>%
mutate(Rotation = rotation(Job))
# A tibble: 12 x 3
# Groups: Person [3]
Person Job Rotation
<fct> <int> <int>
1 A 1 0
2 A 0 0
3 A 1 1
4 A 1 1
5 B 1 0
6 B 0 0
7 B 0 0
8 B 1 1
9 C 0 0
10 C 1 0
11 C 0 0
12 C 1 1
