How to separate with unequal column (reverse toString) in dplyr - r

I'm working with survey data and trying to split multiple responses stored in a single column. The problem is that each cell may contain 1-5 answers, separated by commas.
How do I turn this:
df <- data.frame(
  splitThis = c("A,B,C", "B,C", "A,C", "A", "B", "C")
)
> df
splitThis
1 A,B,C
2 B,C
3 A,C
4 A
5 B
6 C
Into this:
intoThis <- data.frame(
  A = c(1,0,1,1,0,0),
  B = c(1,1,0,0,1,0),
  C = c(1,1,1,0,0,1)
)
> intoThis
  A B C
1 1 1 1
2 0 1 1
3 1 0 1
4 1 0 0
5 0 1 0
6 0 0 1
Any wrangling help appreciated!

We can use mtabulate from qdapTools after splitting the column on ",":
library(qdapTools)
mtabulate(strsplit(as.character(df$splitThis), ","))
# A B C
#1 1 1 1
#2 0 1 1
#3 1 0 1
#4 1 0 0
#5 0 1 0
#6 0 0 1
As the OP also mentioned dplyr/tidyr
library(dplyr)
library(tidyr)
library(tibble)
rownames_to_column(df, "rn") %>%
separate_rows(splitThis) %>%
table()
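If a plain data frame is preferred over the contingency table that table() returns, the result can be coerced directly (a small sketch, not part of the original answer):
rownames_to_column(df, "rn") %>%
  separate_rows(splitThis) %>%
  table() %>%
  as.data.frame.matrix()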
Or using tidyverse packages
rownames_to_column(df, "rn") %>%
separate_rows(splitThis) %>%
group_by(rn, splitThis) %>%
tally %>%
spread(splitThis, n, fill=0) %>%
ungroup() %>%
select(-rn)
# A tibble: 6 × 3
# A B C
#* <dbl> <dbl> <dbl>
#1 1 1 1
#2 0 1 1
#3 1 0 1
#4 1 0 0
#5 0 1 0
#6 0 0 1
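A plain base R version of the same split-and-tabulate idea (a hedged sketch, not one of the posted answers):
lst <- strsplit(as.character(df$splitThis), ",")
vals <- sort(unique(unlist(lst)))
# one indicator column per distinct value
out <- t(sapply(lst, function(x) as.integer(vals %in% x)))
colnames(out) <- vals
as.data.frame(out)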

Related

Make a new column for every variable and tally [duplicate]

I have the following dataframe:
sample name
1 a cobra, tiger, reptile
2 b tiger, spynx
3 c reptile, cobra
4 d sphynx, tiger
5 e cat, dog, tiger
6 f dog, spynx
and what I want to make from that is:
sample cobra tiger spynx reptile cat dog
1 a 1 1 0 1 0 0
2 b 0 1 1 0 0 0
3 c 1 0 0 1 0 0
4 d 0 1 1 0 0 0
5 e 0 1 0 0 1 1
6 f 0 0 1 0 1 1
So basically: make a new column for every distinct value that appears in the name column, and put a 1 if that value is present in df$name for a given row and a 0 if it is not.
all <- unique(unlist(strsplit(as.character(df$name), ", ")))
all <- all[!is.na(all)]
for (i in all) {
  df[i] <- 0
}
This gives me all of those values as columns of 0s; now I want to match them against the name column and turn a 0 into a 1 wherever the value is present in that row.
How would you approach this?
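For completeness, the loop idea sketched in the question can be finished in base R along these lines (a hedged sketch, not one of the posted answers; it assumes the ", " separator from the example data, while the answer below uses tidyr instead):
all <- unique(unlist(strsplit(as.character(df$name), ", ")))
for (i in all) {
  # 1 if this row's comma-separated name field contains the value, else 0
  df[[i]] <- as.integer(sapply(strsplit(as.character(df$name), ", "),
                               function(x) i %in% x))
}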
With tidyr and dplyr...
library(tidyr)
library(dplyr, warn = FALSE)
df1 |>
  separate_rows(name) |>
  group_by(sample, name) |>
  summarise(count = n(), .groups = "drop") |>
  pivot_wider(names_from = "name", values_from = "count", values_fill = 0)
#> # A tibble: 6 × 8
#> sample cobra reptile tiger spynx sphynx cat dog
#> <chr> <int> <int> <int> <int> <int> <int> <int>
#> 1 a 1 1 1 0 0 0 0
#> 2 b 0 0 1 1 0 0 0
#> 3 c 1 1 0 0 0 0 0
#> 4 d 0 0 1 0 1 0 0
#> 5 e 0 0 1 0 0 1 1
#> 6 f 0 0 0 1 0 0 1
Created on 2022-10-19 with reprex v2.0.2
data
df1 <- data.frame(sample = letters[1:6],
                  name = c("cobra, tiger, reptile",
                           "tiger, spynx",
                           "reptile, cobra",
                           "sphynx, tiger",
                           "cat, dog, tiger",
                           "dog, spynx"))

Retain records where condition is met across two rows and two columns

I have a dataset similar to this:
df <-
read.table(textConnection("ID Column1 Column2
A 0 1
A 1 0
A 1 0
A 1 0
A 0 1
A 1 0
A 0 1
A 0 0
A 1 0
A 1 0
B 0 1
B 1 0
C 0 1
C 0 0
C 1 0"), header=TRUE)
I am looking to do a group_by on ID in dplyr and keep records where Column2 == 1 and the record directly underneath it has Column1 == 1 (both rows of such a pair are retained). This may happen more than once per ID; all other records should be excluded. So the output from the above should be:
ID Column1 Column2
A        0       1
A        1       0
A        0       1
A        1       0
B        0       1
B        1       0
Any help will be very much appreciated, thanks!
You could use lag and lead (both return NA at the group edges, and filter() drops rows where the condition evaluates to NA, so those edge rows are excluded automatically):
library(dplyr)
df %>%
  group_by(ID) %>%
  filter((lead(Column1) == 1 & Column2 == 1) |
           (Column1 == 1 & lag(Column2) == 1)) %>%
  ungroup()
# # A tibble: 6 × 3
# ID Column1 Column2
# <chr> <int> <int>
# 1 A 0 1
# 2 A 1 0
# 3 A 0 1
# 4 A 1 0
# 5 B 0 1
# 6 B 1 0
Here is an alternative approach, which groups consecutive pairs of rows within each ID and keeps the pairs that contain a 1 in both columns:
library(dplyr)
df %>%
  group_by(ID, x = rep(row_number(), each = 2, length.out = n())) %>%
  filter(sum(Column1) >= 1 & sum(Column2) >= 1) %>%
  ungroup() %>%
  select(-x)
ID Column1 Column2
<chr> <int> <int>
1 A 0 1
2 A 1 0
3 A 0 1
4 A 1 0
5 B 0 1
6 B 1 0

Only Use The First Match For Every N Rows

I have a data.frame that looks like this.
Date Number
1 1
2 0
3 1
4 0
5 0
6 1
7 0
8 0
9 1
I would like to create a new column that contains a 1 if the row holds the first 1 within each chunk of 3 rows, and a 0 otherwise. For example, this is how I would like the new data.frame to look:
Date Number New
1 1 1
2 0 0
3 1 0
4 0 0
5 0 0
6 1 1
7 0 0
8 0 0
9 1 1
Within every three rows we find the first 1 and mark it in the new column; everything else gets a 0. Thank you.
Hmm, at first glance I thought akrun's answer provided the solution. However, it is not exactly what I am looking for. Here is what akrun's original solution gives on a different example:
df1 = data.frame(Number = c(1,0,1,0,1,1,1,0,1,0,0,0))
head(df1,9)
Number
1 1
2 0
3 1
4 0
5 1
6 1
7 1
8 0
9 1
Attempt at solution:
df1 %>%
  group_by(grp = as.integer(gl(n(), 3, n()))) %>%
  mutate(New = +(Number == row_number()))
Number grp New
<dbl> <int> <int>
1 1 1 1
2 0 1 0
3 1 1 0
4 0 2 0
5 1 2 0 #should be a 1
6 1 2 0
7 1 3 1
8 0 3 0
9 1 3 0
As you can see, the code misses the 1 on row 5. I am looking for the first 1 in every chunk; everything else should be 0.
Sorry if I was unclear, akrun.
Edit: akrun's new answer is exactly what I am looking for. Thank you very much.
Here is an option: create a grouping column with gl, then compare row_number() with the index of the matched 1. match() returns only the index of the first match, so only the first 1 in each group gets flagged.
library(dplyr)
df1 %>%
  group_by(grp = as.integer(gl(n(), 3, n()))) %>%
  mutate(New = +(row_number() == match(1, Number, nomatch = 0)))
# A tibble: 12 x 3
# Groups: grp [4]
# Number grp New
# <dbl> <int> <int>
# 1 1 1 1
# 2 0 1 0
# 3 1 1 0
# 4 0 2 0
# 5 1 2 1
# 6 1 2 0
# 7 1 3 1
# 8 0 3 0
# 9 1 3 0
#10 0 4 0
#11 0 4 0
#12 0 4 0
Looking at the logic, perhaps you want to check if Number == 1 and that the prior 2 values were both 0. If that is not correct please let me know.
library(dplyr)
df %>%
  mutate(New = ifelse(Number == 1 &
                        lag(Number, n = 1L, default = 0) == 0 &
                        lag(Number, n = 2L, default = 0) == 0, 1, 0))
Output
Date Number New
1 1 1 1
2 2 0 0
3 3 1 0
4 4 0 0
5 5 0 0
6 6 1 1
7 7 0 0
8 8 0 0
9 9 1 1
You can replace the Number values with 0 except for the first occurrence of 1 in each chunk of 3 rows.
library(dplyr)
df %>%
  group_by(gr = ceiling(row_number() / 3)) %>%
  mutate(New = replace(Number, -which.max(Number), 0)) %>%
  # Or, to be safe and specific, use
  # mutate(New = replace(Number, -which(Number == 1)[1], 0)) %>%
  ungroup() %>%
  select(-gr)
# A tibble: 9 x 3
# Date Number New
# <int> <int> <int>
#1 1 1 1
#2 2 0 0
#3 3 1 0
#4 4 0 0
#5 5 0 0
#6 6 1 1
#7 7 0 0
#8 8 0 0
#9 9 1 1

splitting all the columns in a dataframe based on their value and delimiter

I have a dataframe as follows:
df <- data.frame(s1 = c("a", "a/b", "b", "a", "a/b"),
                 s2 = c("ab/bb", "bb", "ab", "ab", "bb"),
                 s3 = c("Doa", "Doa", "Dob/Doa", "Dob/Doa", "Dob"))
s1 s2 s3
1 a ab/bb Doa
2 a/b bb Doa
3 b ab Dob/Doa
4 a ab Dob/Doa
5 a/b bb Dob
Each column can take one of two values, or both of them separated by a "/". I would like to break these down into binary indicator columns based on their values.
The desired data frame would be:
a b ab bb Doa Dob
1 1 0 1 1 1 0
2 1 1 0 1 1 0
3 0 1 1 0 1 1
4 1 0 1 0 1 1
5 1 1 0 1 0 1
I tried doing this with tidyr::separate and tapply, though it got fairly complicated as I had to specify column names for every pair. There were many columns.
First make sure your data is character and not factor. Then split the data into one data frame per row; for each of those rows, apply str_split on '/', set the names equal to the values, and convert to a list. Now you can bind these results together and set all non-NA values to 1 at the end.
library(tidyverse) # dplyr, + stringr for str_split, + purrr for map
df %>%
  mutate_all(as.character) %>%
  split(seq(nrow(.))) %>%
  map(~ str_split(., '/') %>% unlist %>% setNames(., .) %>% as.list) %>%
  bind_rows %>%
  mutate_all(~ as.numeric(!is.na(.)))
# # A tibble: 5 x 6
# a ab bb Doa b Dob
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 1 1 1 0 0
# 2 1 0 1 1 1 0
# 3 0 1 0 1 1 1
# 4 1 1 0 1 0 1
# 5 1 0 1 0 1 1
Another similar option (same output)
df %>%
  mutate_all(as.character) %>%
  split(seq(nrow(.))) %>%
  map(~ str_split(., '/') %>% unlist %>% table %>% as.list) %>%
  bind_rows %>%
  mutate_all(replace_na, 0)
Or you could convert to long format first and then back to wide, similar to akrun's answer:
library(data.table)
setDT(df)
library(magrittr)
melt(df[, r := 1:.N], 'r') %>%
  .[, .(value = strsplit(value, '/')[[1]]), .(r, variable)] %>%
  dcast(r ~ value, fun.aggregate = length)
# r Doa Dob a ab b bb
# 1: 1 1 0 1 1 0 1
# 2: 2 1 0 1 0 1 1
# 3: 3 1 1 0 1 1 0
# 4: 4 1 1 1 1 0 0
# 5: 5 0 1 1 0 1 1
Another approach is to use pivot_longer to reshape into 'long' format, then separate_rows to split the 'value' column, and finally pivot_wider to reshape back into 'wide' format:
library(dplyr)
library(tidyr)
df %>%
  mutate(rn = row_number()) %>%
  pivot_longer(cols = -rn) %>%
  separate_rows(value) %>%
  mutate(i1 = 1) %>%
  select(-name) %>%
  pivot_wider(names_from = value, values_from = i1, values_fill = list(i1 = 0)) %>%
  select(-rn)
# A tibble: 5 x 6
# a ab bb Doa b Dob
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 1 1 1 0 0
#2 1 0 1 1 1 0
#3 0 1 0 1 1 1
#4 1 1 0 1 0 1
#5 1 0 1 0 1 1
Or using base R with table and strsplit: split every cell on "/", stack the pieces against their row index, and tabulate:
+(table(stack(setNames(strsplit(as.character(unlist(df)), "/",
fixed = TRUE), c(row(df))))[2:1]) > 0)
# values
#ind a ab b bb Doa Dob
# 1 1 1 0 1 1 0
# 2 1 0 1 1 1 0
# 3 0 1 1 0 1 1
# 4 1 1 0 0 1 1
# 5 1 0 1 1 0 1

Frequency table but custom function instead of default count?

Suppose I have a data frame:
bla <- data.frame(
  a = c(1,1,1,0,0,1,1,1,0,0),
  b = c(0,0,0,1,1,0,0,1,1,0),
  c = c(1,0,1,0,1,0,1,0,1,0),
  d = c(2,3,4,7,8,6,5,2,1,0)
)
I can use table() to get the counts of each combination of 1/0 for each of a, b and c:
table(bla %>% select(a:c)) %>% as.data.frame()
a b c Freq
1 0 0 0 1
2 1 0 0 2
3 0 1 0 1
4 1 1 0 1
5 0 0 1 0
6 1 0 1 3
7 0 1 1 2
8 1 1 1 0
Here's my question: is there an approach to get back both the frequency AND the mean of column d for each combination of a, b and c?
I.e., it looks like table() automatically groups by each distinct combination and returns the count (the Freq field). Can I do the same but also add mean()?
Here's a base R solution using aggregate:
aggregate(d ~ ., data = bla,
          FUN = function(x) c('mean' = mean(x), 'count' = length(x)))
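Note that aggregate() returns the two statistics as a single matrix column named d; if ordinary columns are wanted, a common follow-up step (a small sketch, not part of the original answer) is:
res <- aggregate(d ~ ., data = bla,
                 FUN = function(x) c(mean = mean(x), count = length(x)))
# flatten the matrix column into d.mean and d.count
do.call(data.frame, res)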
And, the dplyr package could also be handy (this would be my preference):
library(dplyr)
bla %>%
  group_by(a, b, c) %>% # or group_by_at(vars(-d))
  summarise(count = n(),
            mean_d = mean(d))
If you also want the combinations that are not present in the data, with dplyr and tidyr you can complete() the grid first (d is NA for the added rows, which is why the count uses sum(!is.na(d))):
bla %>%
  complete(a, b, c) %>%
  group_by_at(1:3) %>%
  summarise(count = sum(!is.na(d)),
            mean = mean(d))
a b c count mean
<dbl> <dbl> <dbl> <dbl> <dbl>
1 0 0 0 1 0
2 0 0 1 0 NA
3 0 1 0 1 7
4 0 1 1 2 4.5
5 1 0 0 2 4.5
6 1 0 1 3 3.67
7 1 1 0 1 2
8 1 1 1 0 NA
