With the following data frame, I would like to create new columns based on the "Type" column values using 'mutate' and count the number of instances that appear. The data should be grouped by "Group" and "Choice".
Over time, the "Type" column will have new values added in that aren't already listed, so the code should be flexible in that respect.
Is this possible using the dplyr library?
library(dplyr)
df <- data.frame(Group = c("A","A","A","B","B","C","C","D","D","D","D","D"),
Choice = c("Yes","Yes","No","No","Yes","Yes","Yes","Yes","No","No","No","No"),
Type = c("Fruit","Construction","Fruit","Planes","Fruit","Trips","Construction","Cars","Trips","Fruit","Planes","Trips"))
The desired result should be the following:
result <- data.frame(Group = c("A","A","B","B","C","D","D"),
Choice = c("Yes","No","Yes","No","Yes","Yes","No"),
Fruit = c(1,1,0,1,0,0,1),
Construction = c(0,1,0,0,1,0,0),
Planes = c(0,0,1,0,0,0,1),
Trips = c(0,0,0,0,1,0,2),
Cars = c(0,0,0,0,0,1,0))
We can count each Group/Choice/Type combination and then spread the counts into one column per Type:
library(tidyverse)
df %>%
count(Group, Choice, Type) %>%
spread(Type, n, fill = 0)
# A tibble: 7 x 7
# Group Choice Cars Construction Fruit Planes Trips
# <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 A No 0 0 1 0 0
#2 A Yes 0 1 1 0 0
#3 B No 0 0 0 1 0
#4 B Yes 0 0 1 0 0
#5 C Yes 0 1 0 0 1
#6 D No 0 0 1 1 2
#7 D Yes 1 0 0 0 0
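In current tidyr releases `spread()` is superseded by `pivot_wider()`. The following is a minimal sketch of the same count-then-widen step written with it, assuming recent versions of dplyr and tidyr are installed; the object name `wide` is only for illustration. Because the column names are taken from the values in Type, a new Type value should appear as a new column without any change to the code.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  Group  = c("A","A","A","B","B","C","C","D","D","D","D","D"),
  Choice = c("Yes","Yes","No","No","Yes","Yes","Yes","Yes","No","No","No","No"),
  Type   = c("Fruit","Construction","Fruit","Planes","Fruit","Trips",
             "Construction","Cars","Trips","Fruit","Planes","Trips")
)

wide <- df %>%
  count(Group, Choice, Type) %>%            # one row per combination, count in `n`
  pivot_wider(names_from = Type,            # one column per distinct Type value
              values_from = n,
              values_fill = 0)              # combinations that never occur become 0
wide
```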
Split a column into multiple binary dummy columns [duplicate]
I made the stupid mistake of enabling people to select multiple categories in a survey question.
Now the data column for this question looks something along the lines of this.
respondent  answer_openq
1           a
2           a,c
3           b
4           a,d
Using the following line in R,
datanum <- data %>% mutate(dummy=1) %>%
spread(key=answer_openq,value=dummy, fill=0)
I get the following:
However, I want the dataset to transform into this:
respondent  a  b  c  d
1           1  0  0  0
2           1  0  1  0
3           0  1  0  0
4           1  0  0  1
Any help is appreciated (my thesis depends on it). Thanks :)
Try this:
library(dplyr)
library(tidyr)
df %>%
separate_rows(answer_openq, sep = ',') %>%
pivot_wider(names_from = answer_openq, values_from = answer_openq,
values_fn = function(x) 1, values_fill = 0)
# A tibble: 4 × 5
respondent a c b d
<int> <dbl> <dbl> <dbl> <dbl>
1 1 1 0 0 0
2 2 1 1 0 0
3 3 0 0 1 0
4 4 1 0 0 1
I've been trying to loop over left joins (using R). I need to create a table with columns representing samples from a larger table. Each column of the new table should represent each of these samples.
library(dplyr)
largetable <- data.frame(PlotCode=c(rep("Plot1",20),rep("Plot2",20)),
Category=c(rep("A",8),rep("B",8),rep("C",4),rep("A",12),rep("B",4),rep("C",4)))
a <- data.frame(PlotCode=c("Plot1","Plot1","Plot2","Plot2"),
Category=c("A","B","A","B"))
##example of code to loop over 100 left joins derived from samples of two elements from a large table. It fails to create the columns.
for (i in 1:100){
count <- largetable %>% group_by(PlotCode) %>% sample_n(2, replace = TRUE)%>%
count(PlotCode,Category)
colnames(count)[3] <- paste0("n",i)
b <- left_join(a, count, by = c("PlotCode","Category"))
}
##example of desired output table. Columns n1 to n100 should change depending on the samples.
b <- data.frame(PlotCode=c("Plot1","Plot1","Plot2","Plot2"),
Category=c("A","B","A","B"),
n1=c(2,1,0,1),
n2=c(1,1,1,1),
n3=c(2,0,1,2))
How can I loop over left joins so each column corresponds to a different sample?
Instead of a for loop, we can use rerun or replicate to repeat a process n times.
In each iteration we randomly select 2 rows from each PlotCode and count their Category values, which gives a list of n data frames. These are joined together with reduce, the count columns are renamed (here n1, n2, ...), and NA is replaced with 0.
library(dplyr)
library(purrr)
n <- 10
rerun(n, largetable %>%
group_by(PlotCode) %>%
slice_sample(n = 2, replace = TRUE) %>%
count(PlotCode,Category)) %>%
reduce(full_join, by = c('PlotCode', 'Category')) %>%
rename_with(~paste0('n', seq_along(.)), starts_with('n')) %>%
mutate(across(starts_with('n'), tidyr::replace_na, 0))
# PlotCode Category n1 n2 n3 n4 n5 n6 n7 n8 n9 n10
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 Plot1 A 1 0 2 2 0 1 0 1 2 2
#2 Plot1 B 1 0 0 0 1 1 2 1 0 0
#3 Plot2 B 1 0 0 0 1 0 0 0 0 0
#4 Plot2 C 1 2 0 0 0 0 1 1 0 0
#5 Plot1 C 0 2 0 0 1 0 0 0 0 0
#6 Plot2 A 0 0 2 2 1 2 1 1 2 2
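As far as I know, `rerun()` is deprecated as of purrr 1.0.0, so here is a base R sketch of the same idea built on `replicate()`. It is an alternative, not the code above: it keeps only the PlotCode/Category pairs listed in `a` (the left-join behaviour asked for in the question), and the helper name `one_sample` and the `set.seed()` call are assumptions added for illustration.

```r
set.seed(42)  # assumption: fixed seed only so the sketch is reproducible

largetable <- data.frame(
  PlotCode = c(rep("Plot1", 20), rep("Plot2", 20)),
  Category = c(rep("A", 8), rep("B", 8), rep("C", 4),
               rep("A", 12), rep("B", 4), rep("C", 4))
)
a <- data.frame(PlotCode = c("Plot1", "Plot1", "Plot2", "Plot2"),
                Category = c("A", "B", "A", "B"))

# One draw: sample 2 rows per plot (with replacement), then count how often
# each PlotCode/Category pair listed in `a` was drawn (0 if never drawn).
one_sample <- function() {
  picked <- do.call(rbind, lapply(split(largetable, largetable$PlotCode),
                                  function(g) g[sample(nrow(g), 2, replace = TRUE), ]))
  drawn  <- paste(picked$PlotCode, picked$Category)
  wanted <- paste(a$PlotCode, a$Category)
  vapply(wanted, function(k) sum(drawn == k), numeric(1), USE.NAMES = FALSE)
}

n <- 10
counts <- replicate(n, one_sample())        # matrix: one row per pair, one column per sample
colnames(counts) <- paste0("n", seq_len(n))
b <- cbind(a, counts)
b
```

Each column of `b` comes from an independent draw, so no join and no NA replacement is needed; pairs that were not drawn simply get 0.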
I have a dataset with two columns where I want to separate the second one (delimited by |) into many columns where each column has the name of the item and the observation has the count.
id column
1 a|b|a
2 a|b|c|d|e
3 a|c|c
I would like to have one column per item, named after the item and holding its count for each id. For example:
id a b c d e
1 2 1 0 0 0
2 1 1 1 1 1
3 2 0 1 0 0
How do I get to separate this data such that the values are distributed in columns as such?
A tidyverse approach, assuming data frame named mydata:
library(dplyr)
library(tidyr)
mydata %>%
separate_rows(column, sep = "\\|") %>%
count(id, column) %>%
spread(column, n) %>%
replace(., is.na(.), 0) # or just spread(column, n, fill = 0)
Result:
# A tibble: 3 x 6
id a b c d e
<int> <int> <dbl> <dbl> <dbl> <dbl>
1 1 2 1 0 0 0
2 2 1 1 1 1 1
3 3 1 0 2 0 0
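For comparison, the same split-then-count idea can be sketched in base R with `strsplit()` and `table()`. This is an alternative under the assumption that the data frame looks like the one in the question; the names `items`, `long`, `tab` and `wide` are made up for the example.

```r
mydata <- data.frame(id = 1:3,
                     column = c("a|b|a", "a|b|c|d|e", "a|c|c"))

# Split each string on the literal "|" (fixed = TRUE avoids regex escaping)
items <- strsplit(as.character(mydata$column), "|", fixed = TRUE)

# Long format: one row per (id, item) occurrence
long <- data.frame(id   = rep(mydata$id, lengths(items)),
                   item = unlist(items))

# Cross-tabulate id against item; items an id never has are counted as 0
tab  <- table(id = long$id, item = long$item)
wide <- data.frame(id = rownames(tab), as.data.frame.matrix(tab), row.names = NULL)
wide   # note: id comes back as text because it is taken from the row names
```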
library(dplyr)
id <- c(rep(1,4),rep(2,3),rep(3,4))
missing <- c(rep(0,4),rep(0,3),1,0,0,0)
wave <- c(seq(1:4),1,2,3,seq(1:4))
df <- as.data.frame(cbind(id,missing,wave))
df
id missing wave
1 1 0 1
2 1 0 2
3 1 0 3
4 1 0 4
5 2 0 1
6 2 0 2
7 2 0 3
8 3 1 1
9 3 0 2
10 3 0 3
11 3 0 4
I am trying to delete cases if they have missing=1 or if they are missing a wave (1:4). For example, ID=3 should be dropped because at wave=1 they have missing=1 and ID=2 should be dropped because they only have values of 1, 2, and 3 in Wave.
I tried to use dplyr's group_by and filter functions but this removes all cases. I want to only end up with cases for ID=1.
df <- df %>% group_by(id) %>% filter(missing==0, wave==1, wave==2, wave==3, wave==4)
df
Try this. We first group_by id, then create a list column holding the sorted unique values of wave for each id, and check that this list equals 1:4 (wave_list_check). We also create a missing_check variable, which is just the max of missing for each id. Finally, we filter on both missing_check and wave_list_check.
df %>%
group_by(id) %>%
mutate(wave_list = I(list(sort(unique(wave))))) %>%
mutate(wave_list_check = all(unlist(wave_list) == 1:4),
missing_check = max(missing)) %>%
filter(missing_check == 0, wave_list_check) %>%
select(id:wave)
id missing wave
<dbl> <dbl> <dbl>
1 1 0 1
2 1 0 2
3 1 0 3
4 1 0 4
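The two conditions can also be written directly inside a grouped `filter()`, without helper columns. This is an alternative sketch rather than the code above; it assumes dplyr is installed and that a complete case means waves 1 to 4 are all present. The name `complete_cases` is made up for illustration.

```r
library(dplyr)

df <- data.frame(
  id      = c(rep(1, 4), rep(2, 3), rep(3, 4)),
  missing = c(rep(0, 4), rep(0, 3), 1, 0, 0, 0),
  wave    = c(1:4, 1:3, 1:4)
)

complete_cases <- df %>%
  group_by(id) %>%
  # keep an id only if no row is flagged missing and every wave 1..4 is present
  filter(all(missing == 0), all(1:4 %in% wave)) %>%
  ungroup()
complete_cases
```

Using `%in%` instead of `==` avoids the length-mismatch warning that appears when an id has fewer than four waves.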
I'm new to R and I have data that looks something like this:
categories <- c("A","B","C","A","A","B","C","A","B","C","A","B","B","C","C")
animals <- c("cat","cat","cat","dog","mouse","mouse","rabbit","rat","shark","shark","tiger","tiger","whale","whale","worm")
dat <- cbind(categories,animals)
Some animals repeat according to the category. For example, "cat" appears in all three categories A, B, and C.
I would like my new data frame output to look something like this:
A B C count
1 1 1 1
1 1 0 2
1 0 1 0
0 1 1 2
1 0 0 2
0 1 0 0
0 0 1 2
0 0 0 0
The number 1 under A, B, and C means that the animal appears in that category, 0 means the animal does not appear in that category. For example, the first line has 1s in all three categories. The count is 1 for the first line because "cat" is the only animal that repeats itself in each category.
Is there a function in R that will help me achieve this? Thank you in advance.
We can use table to create a cross-tabulation of categories and animals, transpose, convert to data.frame, group_by all categories and count the frequency per combination:
library(dplyr)
library(tidyr)
as.data.frame.matrix(t(table(as.data.frame(dat)))) %>%  # as.data.frame(): table() needs columns, not a matrix
group_by_all() %>%
summarize(Count = n())
Result:
# A tibble: 5 x 4
# Groups: A, B [?]
A B C Count
<int> <int> <int> <int>
1 0 0 1 2
2 0 1 1 2
3 1 0 0 2
4 1 1 0 2
5 1 1 1 1
Edit (thanks to @C. Braun). Here is how to also include the zero A, B, C combinations:
as.data.frame.matrix(t(table(as.data.frame(dat)))) %>%  # as.data.frame(): table() needs columns, not a matrix
bind_rows(expand.grid(A = c(0,1), B = c(0,1), C = c(0,1))) %>%
group_by_all() %>%
summarize(Count = n()-1)
or with complete, as suggested by @Ryan:
as.data.frame.matrix(t(table(as.data.frame(dat)))) %>%  # as.data.frame(): table() needs columns, not a matrix
mutate(non_missing = 1) %>%
complete(A, B, C) %>%
group_by(A, B, C) %>%
summarize(Count = sum(ifelse(is.na(non_missing), 0, 1)))
Result:
# A tibble: 8 x 4
# Groups: A, B [?]
A B C Count
<dbl> <dbl> <dbl> <dbl>
1 0 0 0 0
2 0 0 1 2
3 0 1 0 0
4 0 1 1 2
5 1 0 0 2
6 1 0 1 0
7 1 1 0 2
8 1 1 1 1
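In current dplyr, `group_by_all()` is superseded, so here is a hedged sketch of the same result using `count()` followed by `complete()`. It assumes dplyr and tidyr are installed and that `dat` is built as a data frame rather than with `cbind()`; the name `combos` is made up for the example.

```r
library(dplyr)
library(tidyr)

categories <- c("A","B","C","A","A","B","C","A","B","C","A","B","B","C","C")
animals <- c("cat","cat","cat","dog","mouse","mouse","rabbit","rat","shark",
             "shark","tiger","tiger","whale","whale","worm")
dat <- data.frame(categories, animals)   # assumption: a data frame, not cbind()

combos <- as.data.frame.matrix(t(table(dat))) %>%   # one row per animal, 0/1 per category
  count(A, B, C, name = "Count") %>%                # animals per observed combination
  complete(A, B, C, fill = list(Count = 0))         # add unobserved combinations as 0
combos
```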
We have
xxtabs <- function(df, formula) {
xt <- xtabs(formula, df)
xxt <- xtabs( ~ . , as.data.frame.matrix(xt))
as.data.frame(xxt)
}
and
> xxtabs(dat, ~ animals + categories)
A B C Freq
1 0 0 0 0
2 1 0 0 2
3 0 1 0 0
4 1 1 0 2
5 0 0 1 2
6 1 0 1 0
7 0 1 1 2
8 1 1 1 1
(dat should really be constructed as data.frame(animals, categories)). This base approach uses xtabs() to form the first cross-tabulation
xt <- xtabs(~ animals + categories, dat)
then coerces using as.data.frame.matrix() to a second data.frame, and uses a second cross-tabulation of all columns of the computed data.frame
xxt <- xtabs(~ ., as.data.frame.matrix(xt))
coerced to the desired form
as.data.frame(xxt)
I originally said this approach was 'arcane', because it relies on knowing the difference between as.data.frame() and as.data.frame.matrix(); I think of xtabs() as a tool that users of base R should know. I see, though, that the other solutions also require this arcane knowledge, as well as knowledge of more obscure parts of the tidyverse (e.g., complete(), group_by_all(), funs()). Also, the other answers are not easily generalizable, or at least are not written in a way that makes generalizing easy: xxtabs() does not actually know anything about the structure of the incoming data.frame, whereas implicit knowledge of the incoming data is present throughout the other answers.
One 'lesson learned' from the tidy approach is to place the data argument first, allowing piping
dat %>% xxtabs(~ animals + categories)
If I understood you correctly, this should do the trick.
require(tidyverse)
as.data.frame(dat) %>%  # dat was built with cbind(), so it is a matrix; mutate() needs a data frame
mutate(value = 1) %>%
spread(categories, value) %>%
mutate_if(is.numeric, funs(replace(., is.na(.), 0))) %>%
mutate(count = rowSums(data.frame(A, B, C), na.rm = TRUE)) %>%
group_by(A, B, C) %>%
summarize(Count = n())
# A tibble: 5 x 4
# Groups: A, B [?]
A B C Count
<dbl> <dbl> <dbl> <int>
1 0. 0. 1. 2
2 0. 1. 1. 2
3 1. 0. 0. 2
4 1. 1. 0. 2
5 1. 1. 1. 1
Adding a data.table solution. First, pivot animals against categories using dat. Then create all combinations of A, B, C using CJ. Join those combinations with the pivoted table and count the number of occurrences of each combination.
library(data.table)
dcast(as.data.table(dat), animals ~ categories, length)[
CJ(A=0:1, B=0:1, C=0:1), .(count=.N), on=c("A","B","C"), by=.EACHI]