Frequency table but custom function instead of default count?

Suppose I have a data frame:
bla <- data.frame(
a = c(1,1,1,0,0,1,1,1,0,0),
b = c(0,0,0,1,1,0,0,1,1,0),
c = c(1,0,1,0,1,0,1,0,1,0),
d = c(2,3,4,7,8,6,5,2,1,0)
)
I can use table() to get the counts of each combination of 1/0 for each of a, b and c:
table(bla %>% select(a:c)) %>% as.data.frame()
a b c Freq
1 0 0 0 1
2 1 0 0 2
3 0 1 0 1
4 1 1 0 1
5 0 0 1 0
6 1 0 1 3
7 0 1 1 2
8 1 1 1 0
Here's my question: is there a way to get back both the frequency AND the mean of column d for each combination of a, b and c?
I.e. it looks like table() automatically groups by each distinct combination and then returns the count (the Freq field). Can I do the same but add mean()?

Here's a base R solution using aggregate:
aggregate(d ~ ., data = bla,
FUN = function(x) c('mean' = mean(x), 'count' = length(x)))
And, the dplyr package could also be handy (this would be my preference):
library(dplyr)
bla %>%
group_by(a, b, c) %>% # or group_by_at(-vars(d))
summarise(count = n(),
mean_d = mean(d))
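One thing worth knowing about the aggregate() version: when FUN returns more than one value, the results are stored in a single matrix column rather than in separate columns. A minimal sketch of flattening it, assuming the same bla data as in the question (the name flat is only illustrative):

```r
bla <- data.frame(
  a = c(1,1,1,0,0,1,1,1,0,0),
  b = c(0,0,0,1,1,0,0,1,1,0),
  c = c(1,0,1,0,1,0,1,0,1,0),
  d = c(2,3,4,7,8,6,5,2,1,0)
)

# FUN returns two named values per group, so res$d is a two-column matrix
res <- aggregate(d ~ a + b + c, data = bla,
                 FUN = function(x) c(mean = mean(x), count = length(x)))

# Rebuilding the data frame splits the matrix into d.mean and d.count
flat <- do.call(data.frame, res)
names(flat)
```

After this step the statistics can be used like ordinary columns, e.g. flat$d.mean.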

If you want also the non-present combinations, with dplyr and tidyr you can do:
bla %>%
complete(a, b, c) %>%
group_by_at(1:3) %>%
summarise(count = sum(!is.na(d)),
mean = mean(d))
a b c count mean
<dbl> <dbl> <dbl> <dbl> <dbl>
1 0 0 0 1 0
2 0 0 1 0 NA
3 0 1 0 1 7
4 0 1 1 2 4.5
5 1 0 0 2 4.5
6 1 0 1 3 3.67
7 1 1 0 1 2
8 1 1 1 0 NA
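If you would rather stay close to table(), a base R sketch along the following lines should also keep the empty combinations. It relies on table() and tapply() enumerating the cells in the same order when given the same grouping columns; the names keys and out are just placeholders.

```r
bla <- data.frame(
  a = c(1,1,1,0,0,1,1,1,0,0),
  b = c(0,0,0,1,1,0,0,1,1,0),
  c = c(1,0,1,0,1,0,1,0,1,0),
  d = c(2,3,4,7,8,6,5,2,1,0)
)

keys <- bla[c("a", "b", "c")]

# Counts for every combination, including those that never occur
out <- as.data.frame(table(keys))

# Group means laid out over the same a x b x c grid; empty cells become NA
out$mean_d <- as.vector(tapply(bla$d, keys, mean))
out
```

Note that a, b and c come back as factors in out, as they do with as.data.frame(table(...)).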

Related

Add a column that count number of rows until the first 1, by group in R

I have the following dataset:
test_df=data.frame(Group=c(1,1,1,1,2,2),var1=c(1,0,0,1,1,1),var2=c(0,0,1,1,0,0),var3=c(0,1,0,0,0,1))
Group var1 var2 var3
1 1 0 0
1 0 0 1
1 0 1 0
1 1 1 0
2 1 0 0
2 1 0 1
I want to add 3 columns (out1-3) for var1-3, which count number of rows until the first 1, by Group,
as shown below:
Group var1 var2 var3 out1 out2 out3
1 1 0 0 1 3 2
1 0 0 1 1 3 2
1 0 1 0 1 3 2
1 1 1 0 1 3 2
2 1 0 0 1 0 2
2 1 0 1 1 0 2
I used this R code and repeated it for each of my 3 variables (my actual dataset contains more than 3 columns), but it is not working:
test_var1<-select(test_df,Group,var1 )%>%
group_by(Group) %>%
mutate(out1 = row_number()) %>%
filter(var1 != 0) %>%
slice(1)
df <- data.frame(Group=c(1,1,1,1,2,2),
var1=c(1,0,0,1,1,1),
var2=c(0,0,1,1,0,0),
var3=c(0,1,0,0,0,1))
This works for any number of variables as long as the structure is the same as in the example (i.e. Group + many variables that are 0 or 1)
df %>%
mutate(rownr = row_number()) %>%
pivot_longer(-c(Group, rownr)) %>%
group_by(Group, name) %>%
mutate(out = cumsum(value != 1 & (cumsum(value) < 1)) + 1,
out = ifelse(max(out) > n(), 0, max(out))) %>%
pivot_wider(names_from = name, values_from = c(value, out)) %>%
select(-rownr)
Returns:
Group value_var1 value_var2 value_var3 out_var1 out_var2 out_var3
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 0 0 1 3 2
2 1 0 0 1 1 3 2
3 1 0 1 0 1 3 2
4 1 1 1 0 1 3 2
5 2 1 0 0 1 0 2
6 2 1 0 1 1 0 2
If you only have 3 "out" variables then you can create the three columns as follows
#1- Your dataset
df=data.frame(Group=rep(1,4),var1=c(1,0,0,1),var2=c(0,0,1,1),var3=c(0,1,0,0))
#2- Find the row number of the first "1" value
# which() already returns integer positions; rownames() are character, so min() on them would compare as text
df$out1=min(which(df$var1==1))
df$out2=min(which(df$var2==1))
df$out3=min(which(df$var3==1))
If you have more than 3 columns, then it may be better to use a loop, for example
for(i in 1:3){
df[[paste0("out",i)]]=min(which(df[[paste0("var",i)]]==1))
}
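The loop above works on a single group only. A possible grouped variant in base R, sketched under the assumption that the columns to process all start with "var" and that a group with no 1 should get 0 (as in the desired output of the question):

```r
test_df <- data.frame(Group = c(1,1,1,1,2,2),
                      var1 = c(1,0,0,1,1,1),
                      var2 = c(0,0,1,1,0,0),
                      var3 = c(0,1,0,0,0,1))

# Position of the first 1 within a vector, or 0 if there is none
first_one <- function(x) match(1, x, nomatch = 0)

vars <- grep("^var", names(test_df), value = TRUE)
for (v in vars) {
  out_name <- sub("^var", "out", v)
  # ave() applies first_one per Group and recycles the result over the group's rows
  test_df[[out_name]] <- ave(test_df[[v]], test_df$Group, FUN = first_one)
}
test_df
```

Because the column names are derived from the data frame, the same loop should carry over to more than three variables.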

Only Use The First Match For Every N Rows

I have a data.frame that looks like this.
Date Number
1 1
2 0
3 1
4 0
5 0
6 1
7 0
8 0
9 1
I would like to create a new column that puts a 1 in the column if it is the first 1 of every 3 rows. Otherwise put a 0. For example, this is how I would like the new data.frame to look
Date Number New
1 1 1
2 0 0
3 1 0
4 0 0
5 0 0
6 1 1
7 0 0
8 0 0
9 1 1
Every three rows we find the first 1 and populate the column otherwise we place a 0. Thank you.
Hmm, at first glance I thought akrun's answer provided me the solution. However, it is not exactly what I am looking for. Here is what @akrun's solution provides.
df1 = data.frame(Number = c(1,0,1,0,1,1,1,0,1,0,0,0))
head(df1,9)
Number
1 1
2 0
3 1
4 0
5 1
6 1
7 1
8 0
9 1
Attempt at solution:
df1 %>%
group_by(grp = as.integer(gl(n(), 3, n()))) %>%
mutate(New = +(Number == row_number()))
Number grp New
<dbl> <int> <int>
1 1 1 1
2 0 1 0
3 1 1 0
4 0 2 0
5 1 2 0 #should be a 1
6 1 2 0
7 1 3 1
8 0 3 0
9 1 3 0
As you can see, the code misses the 1 on row 5. I am looking for the first 1 in every chunk of three rows; everything else should be 0.
Sorry if I was unclear, akrun.
Edit: akrun's new answer is exactly what I am looking for. Thank you very much.
Here is an option to create a grouping column with gl and then do a == with the row_number on the index of matched 1. Here, match will return only the index of the first match.
library(dplyr)
df1 %>%
group_by(grp = as.integer(gl(n(), 3, n()))) %>%
mutate(New = +(row_number() == match(1, Number, nomatch = 0)))
# A tibble: 12 x 3
# Groups: grp [4]
# Number grp New
# <dbl> <int> <int>
# 1 1 1 1
# 2 0 1 0
# 3 1 1 0
# 4 0 2 0
# 5 1 2 1
# 6 1 2 0
# 7 1 3 1
# 8 0 3 0
# 9 1 3 0
#10 0 4 0
#11 0 4 0
#12 0 4 0
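For readers without dplyr, the same idea (chunk the rows in threes, then flag the position returned by match) can be sketched in base R. The helper name first_in_chunk is made up for this example.

```r
df1 <- data.frame(Number = c(1,0,1,0,1,1,1,0,1,0,0,0))

# Chunk id: rows 1-3 -> 1, rows 4-6 -> 2, and so on
chunk <- ceiling(seq_len(nrow(df1)) / 3)

# 1 at the position of the first 1 in the chunk, 0 everywhere else;
# nomatch = 0 means a chunk without any 1 yields all zeros
first_in_chunk <- function(x) as.numeric(seq_along(x) == match(1, x, nomatch = 0))

df1$New <- ave(df1$Number, chunk, FUN = first_in_chunk)
df1
```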
Looking at the logic, perhaps you want to check if Number == 1 and that the prior 2 values were both 0. If that is not correct please let me know.
library(dplyr)
df %>%
mutate(New = ifelse(Number == 1 & lag(Number, n = 1L, default = 0) == 0 & lag(Number, n = 2L, default = 0) == 0, 1, 0))
Output
Date Number New
1 1 1 1
2 2 0 0
3 3 1 0
4 4 0 0
5 5 0 0
6 6 1 1
7 7 0 0
8 8 0 0
9 9 1 1
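Note that this rule ("a 1 whose two previous values are 0") is not the same as "first 1 in every block of three rows", even though the two agree on the original data. A small base R sketch on the asker's second example (first nine rows) illustrates the difference; prev1 and prev2 stand in for the two lag() calls.

```r
x <- c(1, 0, 1, 0, 1, 1, 1, 0, 1)

# Base R stand-ins for lag(x, 1, default = 0) and lag(x, 2, default = 0)
prev1 <- c(0, head(x, -1))
prev2 <- c(0, 0, head(x, -2))

rolling <- as.numeric(x == 1 & prev1 == 0 & prev2 == 0)
rolling
```

Here only row 1 is flagged, whereas the block-of-three rule in the question expects rows 1, 5 and 7.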
You can replace Number value to 0 except for the 1st occurrence of 1 in each 3 rows.
library(dplyr)
df %>%
group_by(gr = ceiling(row_number()/3)) %>%
mutate(New = replace(Number, -which.max(Number), 0)) %>%
#Or to be safe and specific use
#mutate(New = replace(Number, -which(Number == 1)[1], 0)) %>%
ungroup() %>% select(-gr)
# A tibble: 9 x 3
# Date Number New
# <int> <int> <int>
#1 1 1 1
#2 2 0 0
#3 3 1 0
#4 4 0 0
#5 5 0 0
#6 6 1 1
#7 7 0 0
#8 8 0 0
#9 9 1 1

splitting all the columns in a dataframe based on their value and delimiter

I have a dataframe as follows:
df <- data.frame(s1=c("a","a/b","b","a","a/b"),s2=c("ab/bb","bb","ab","ab","bb"),s3=c("Doa","Doa","Dob/Doa","Dob/Doa","Dob"))
s1 s2 s3
1 a ab/bb Doa
2 a/b bb Doa
3 b ab Dob/Doa
4 a ab Dob/Doa
5 a/b bb Dob
Each column could take one of two values or both separated by a "/". I would like to break these down into binary sets of columns based on their values.
The desired data frame would be:
a b ab bb Doa Dob
1 1 0 1 1 1 0
2 1 1 0 1 1 0
3 0 1 1 0 1 1
4 1 0 1 0 1 1
5 1 1 0 1 0 1
I tried doing this with tidyr::separate and tapply, though it got fairly complicated as I had to specify column names for every pair. There were many columns.
First make sure your data is character and not factor. Then split into one data.frame for each row and for each of those rows, take the str_split on '/', set the names equal to the values, and make it a list. Now you can bind these results together, and set all non-na values to 1 at the end.
library(tidyverse) # dplyr, + stringr for str_split, + purrr for map
df %>%
mutate_all(as.character) %>%
split(seq(nrow(.))) %>%
map(~ str_split(., '/') %>% unlist %>% setNames(., .) %>% as.list) %>%
bind_rows %>%
mutate_all(~as.numeric(!is.na(.)))
# # A tibble: 5 x 6
# a ab bb Doa b Dob
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 1 1 1 0 0
# 2 1 0 1 1 1 0
# 3 0 1 0 1 1 1
# 4 1 1 0 1 0 1
# 5 1 0 1 0 1 1
Another similar option (same output)
df %>%
mutate_all(as.character) %>%
split(seq(nrow(.))) %>%
map(~ str_split(., '/') %>% unlist %>% table %>% as.list) %>%
bind_rows %>%
mutate_all(replace_na, 0)
Or you could convert to long first then back to wide, similar to akrun's answer
library(data.table)
setDT(df)
library(magrittr)
melt(df[, r := 1:.N], 'r') %>%
.[, .(value = strsplit(value, '/')[[1]]), .(r, variable)] %>%
dcast(r ~ value, fun.aggregate = length)
# r Doa Dob a ab b bb
# 1: 1 1 0 1 1 0 1
# 2: 2 1 0 1 0 1 1
# 3: 3 1 1 0 1 1 0
# 4: 4 1 1 1 1 0 0
# 5: 5 0 1 1 0 1 1
Another approach is to use pivot_longer to reshape into 'long' format, then use separate_rows to split the 'value' column, and reshape back into 'wide' format
library(dplyr)
library(tidyr)
df %>%
mutate(rn = row_number()) %>%
pivot_longer(cols = -rn) %>%
separate_rows(value) %>%
mutate(i1 = 1) %>%
select(-name) %>%
pivot_wider(names_from = value, values_from = i1, values_fill = list(i1 = 0)) %>%
select(-rn)
# A tibble: 5 x 6
# a ab bb Doa b Dob
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 1 1 1 0 0
#2 1 0 1 1 1 0
#3 0 1 0 1 1 1
#4 1 1 0 1 0 1
#5 1 0 1 0 1 1
Or using base R with table and strsplit
+(table(stack(setNames(strsplit(as.character(unlist(df)), "/",
fixed = TRUE), c(row(df))))[2:1]) > 0)
# values
#ind a ab b bb Doa Dob
# 1 1 1 0 1 1 0
# 2 1 0 1 1 1 0
# 3 0 1 1 0 1 1
# 4 1 1 0 0 1 1
# 5 1 0 1 1 0 1
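The base R one-liner above packs several steps together. Unrolled, it might read as follows; the intermediate names (cells, row_id, pieces, long, counts, binary) are only for illustration.

```r
df <- data.frame(s1 = c("a","a/b","b","a","a/b"),
                 s2 = c("ab/bb","bb","ab","ab","bb"),
                 s3 = c("Doa","Doa","Dob/Doa","Dob/Doa","Dob"),
                 stringsAsFactors = FALSE)

cells  <- as.character(unlist(df))   # all cells, column by column
row_id <- c(row(df))                 # row number of each cell, same order

# Split each cell on "/" and label the pieces with the row they came from
pieces <- strsplit(cells, "/", fixed = TRUE)
long   <- stack(setNames(pieces, row_id))   # columns: values, ind

# Cross-tabulate row id against value, then reduce counts to 0/1
counts <- table(long[2:1])
binary <- +(counts > 0)
binary
```

The column order of binary depends on how the locale sorts the values, so it is safer to index columns by name than by position.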

Detect a pattern in a column with R

I am trying to calculate how many times a person moved from one job to another. A move is counted every time the Job column shows the pattern 1 -> 0 -> 1.
In this example, there was one rotation:
Person Job
A 1
A 0
A 1
A 1
In this other example, person B had one rotation as well.
Person Job
A 1
A 0
A 1
A 1
B 1
B 0
B 0
B 1
What would be a good approach to measure this pattern in a new column 'Rotation', by person?
Person Job Rotation
A 1 0
A 0 0
A 1 1
A 1 1
B 1 0
B 0 0
B 0 0
B 1 1
You can use regular expressions to capture a group matching 101 and count it as 1. Use pattern = "(?<=1)0+(?=1)", which matches a run of zeros only when it is preceded by a 1 and also followed by a 1.
library(tidyverse)
df%>%
group_by(Person)%>%
mutate(Rotation=str_count(accumulate(Job,str_c,collapse=""),"(?<=1)0+(?=1)"))
# A tibble: 12 x 3
# Groups: Person [3]
Person Job Rotation
<fct> <int> <int>
1 A 1 0
2 A 0 0
3 A 1 1
4 A 1 1
5 B 1 0
6 B 0 0
7 B 0 0
8 B 1 1
9 C 0 0
10 C 1 0
11 C 0 0
12 C 1 1
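To see what the lookaround pattern does without the tidyverse, here is a rough base R sketch for a single person's Job vector. The function name count_rotations and the sample vector are assumptions made for this illustration.

```r
count_rotations <- function(job) {
  # Running history per row: "1", "10", "100", "1001", ...
  history <- Reduce(paste0, as.character(job), accumulate = TRUE)
  vapply(history, function(s) {
    # Runs of zeros that have a 1 directly before and directly after them
    hits <- gregexpr("(?<=1)0+(?=1)", s, perl = TRUE)[[1]]
    sum(hits > 0)  # gregexpr returns -1 when nothing matches
  }, integer(1), USE.NAMES = FALSE)
}

job <- c(1, 0, 0, 1, 0, 1)
rot <- count_rotations(job)
rot
```

A run of several zeros counts once, and a second 1 -> 0 -> 1 sequence increases the count to 2.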
One solution is to use lag with default = 0 and count cumulative sum of condition when value changes from 0 to 1. Just subtract 1 from the cumsum to get the rotation.
The solution using dplyr can be as:
library(dplyr)
df %>% group_by(Person) %>%
mutate(Rotation = cumsum(lag(Job, default = 0) == 0 & Job ==1) - 1) %>%
as.data.frame()
# Person Job Rotation
# 1 A 1 0
# 2 A 0 0
# 3 A 1 1
# 4 A 1 1
# 5 B 1 0
# 6 B 0 0
# 7 B 0 0
# 8 B 1 1
Data:
df <- read.table(text ="
Person Job
A 1
A 0
A 1
A 1
B 1
B 0
B 0
B 1",
header = TRUE, stringsAsFactors = FALSE)
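One caveat with the lag-and-cumsum idea: subtracting 1 assumes every person starts with Job == 1. For someone who starts at 0, the first row would come out as -1. A base R sketch of the same logic with a floor at 0, assuming that the first job after initial unemployment should not count as a rotation (person C is added here to show that case):

```r
df <- data.frame(Person = rep(c("A", "B", "C"), each = 4),
                 Job = c(1,0,1,1,
                         1,0,0,1,
                         0,1,0,1))

rotation <- function(job) {
  prev <- c(0, head(job, -1))              # same as lag(job, default = 0)
  starts <- cumsum(prev == 0 & job == 1)   # number of job starts so far
  pmax(starts - 1, 0)                      # first start is not a rotation; never below 0
}

df$Rotation <- ave(df$Job, df$Person, FUN = rotation)
df
```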
Here is an option with data.table
library(data.table)
setDT(df)[, Rotation := +(grepl("101", do.call(paste0,
shift(Job, 0:.N, fill = 0)))), Person]
df
# Person Job Rotation
# 1: A 1 0
# 2: A 0 0
# 3: A 1 1
# 4: A 1 1
# 5: B 1 0
# 6: B 0 0
# 7: B 0 0
# 8: B 1 0
# 9: C 0 0
#10: C 1 0
#11: C 0 0
#12: C 1 1
A base R option would be
f1 <- function(x) Reduce(paste0, x, accumulate = TRUE)
df$Rotation <- with(df, +grepl("101", ave(Job, Person, FUN = f1)))
data
df <- data.frame(Person = rep(c("A", "B", "C"), each = 4L),
Job = as.integer(c(1,0,1,1,
1,0,0,1,
0,1,0,1)))
I'm assuming that if a person starts unemployed,
the first job they get doesn't count as rotation.
In that case:
library(dplyr)
rotation <- function(x) {
# this will have 1 when a person got a new job
dif <- c(0L, diff(x))
dif[dif < 0L] <- 0L
if (x[1L] == 0L) {
# unemployed at the beginning,
# first job doesn't count as change from one to another
dif[which.max(dif)] <- 0L
}
# return
cumsum(dif)
}
df <- data.frame(Person = rep(c("A", "B", "C"), each = 4L),
Job = as.integer(c(1,0,1,1,
1,0,0,1,
0,1,0,1)))
df %>%
group_by(Person) %>%
mutate(Rotation = rotation(Job))
# A tibble: 12 x 3
# Groups: Person [3]
Person Job Rotation
<fct> <int> <int>
1 A 1 0
2 A 0 0
3 A 1 1
4 A 1 1
5 B 1 0
6 B 0 0
7 B 0 0
8 B 1 1
9 C 0 0
10 C 1 0
11 C 0 0
12 C 1 1

How to separate with unequal column (reverse toString) in dplyr

I'm working with survey data and trying to split multiple responses stored in a single column. The problem is that there may be 1-5 answers, separated with commas.
How do I turn this:
df <- data.frame(
splitThis = c("A,B,C","B,C","A,C","A","B","C")
)
> df
splitThis
1 A,B,C
2 B,C
3 A,C
4 A
5 B
6 C
Into this:
intoThis <- data.frame(
A = c(1,0,1,1,0,0),
B = c(1,1,0,0,1,0),
C = c(1,1,1,0,0,1)
)
> intoThis
A B C
1 1 1 1
2 0 1 1
3 1 0 1
4 1 0 0
5 0 1 0
6 0 0 1
Any wrangling help appreciated!
We can use mtabulate from qdapTools after splitting by ,
library(qdapTools)
mtabulate(strsplit(as.character(df$splitThis), ","))
# A B C
#1 1 1 1
#2 0 1 1
#3 1 0 1
#4 1 0 0
#5 0 1 0
#6 0 0 1
As the OP also mentioned dplyr/tidyr
library(dplyr)
library(tidyr)
library(tibble)
rownames_to_column(df, "rn") %>%
separate_rows(splitThis) %>%
table()
Or using tidyverse packages
rownames_to_column(df, "rn") %>%
separate_rows(splitThis) %>%
group_by(rn, splitThis) %>%
tally %>%
spread(splitThis, n, fill=0) %>%
ungroup() %>%
select(-rn)
# A tibble: 6 × 3
# A B C
#* <dbl> <dbl> <dbl>
#1 1 1 1
#2 0 1 1
#3 1 0 1
#4 1 0 0
#5 0 1 0
#6 0 0 1
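If you prefer to avoid extra packages, a base R sketch of the same result is to split each response and test membership against the full set of answers. The names parts, answers and out are placeholders.

```r
df <- data.frame(splitThis = c("A,B,C", "B,C", "A,C", "A", "B", "C"),
                 stringsAsFactors = FALSE)

# One character vector of answers per respondent
parts <- strsplit(df$splitThis, ",", fixed = TRUE)

# All distinct answers, used as the output columns
answers <- sort(unique(unlist(parts)))

# One row per respondent: 1 if the answer was given, 0 otherwise
ind <- t(vapply(parts, function(p) as.integer(answers %in% p),
                integer(length(answers))))
colnames(ind) <- answers
out <- as.data.frame(ind)
out
```

This handles any number of answers per row, since each respondent is compared against the same answers vector.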
