I am aware of the spread function in the tidyr package, but I have been unable to get it to do what I need.
I have a data.frame with 2 columns as defined below. I need to transpose the column Subject into binary columns with 1 and 0.
Below is the data frame:
studentInfo <- data.frame(StudentID = c(1,1,1,2,3,3),
Subject = c("Maths", "Science", "English", "Maths", "History", "History"))
> studentInfo
StudentID Subject
1 1 Maths
2 1 Science
3 1 English
4 2 Maths
5 3 History
6 3 History
And the output I am expecting is:
StudentID Maths Science English History
1 1 1 1 1 0
2 2 1 0 0 0
3 3 0 0 0 1
How can I do this with the spread() function or any other function?
Using reshape2 we can dcast from long to wide.
As you only want a binary outcome, we can de-duplicate the data with unique() first; otherwise student 3's repeated History row would be counted as 2.
library(reshape2)
si <- unique(studentInfo)
dcast(si, formula = StudentID ~ Subject, fun.aggregate = length)
# StudentID English History Maths Science
#1 1 1 0 1 1
#2 2 0 0 1 0
#3 3 0 1 0 0
Another approach using tidyr and dplyr is
library(tidyr)
library(dplyr)
studentInfo %>%
mutate(yesno = 1) %>%
distinct %>%
spread(Subject, yesno, fill = 0)
# StudentID English History Maths Science
#1 1 1 0 1 1
#2 2 0 0 1 0
#3 3 0 1 0 0
Although I'm not a fan (yet) of tidyr syntax...
We can use table from base R
+(table(studentInfo)!=0)
# Subject
#StudentID English History Maths Science
# 1 1 0 1 1
# 2 0 0 1 0
# 3 0 1 0 0
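The one-liner above can be unpacked step by step for readers who want a regular data frame rather than a table object. This is only a minimal base R sketch; the intermediate names `counts`, `flags` and `wide` are introduced here for illustration and are not part of the original answer.

```r
studentInfo <- data.frame(
  StudentID = c(1, 1, 1, 2, 3, 3),
  Subject = c("Maths", "Science", "English", "Maths", "History", "History")
)

# Cross-tabulate: one row per StudentID, one column per Subject.
# Student 3 lists History twice, so that cell holds a count of 2.
counts <- table(studentInfo)

# Turn counts into 0/1 flags: the comparison gives TRUE/FALSE,
# and the unary plus coerces the logicals to integers.
flags <- +(counts != 0)

# Convert to a regular data frame with StudentID as an ordinary column.
wide <- data.frame(
  StudentID = as.numeric(rownames(flags)),
  as.data.frame.matrix(flags),
  row.names = NULL
)
wide
```

Note that table() sorts the Subject columns alphabetically, so the column order differs from the order of first appearance in the data.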
Using tidyr:
library(tidyr)
studentInfo <- data.frame(
StudentID = c(1,1,1,2,3,3),
Subject = c("Maths", "Science", "English", "Maths", "History", "History"))
pivot_wider(studentInfo,
names_from = "Subject",
values_from = 'Subject',
values_fill = 0,
values_fn = function(x) 1)
#> # A tibble: 3 x 5
#> StudentID Maths Science English History
#> <dbl> <int> <int> <int> <int>
#> 1 1 1 1 1 0
#> 2 2 1 0 0 0
#> 3 3 0 0 0 1
Created on 2019-09-19 by the reprex package (v0.3.0)
Some types of survey software handle "choose all that apply" questions in the following inconvenient way. Suppose a question asked "What type of pet(s) do you own? Choose all that apply: dog, cat, ferret, snake." The resulting dataset looks like this:
pet_tab <- tibble(
owner = 1:5,
pet_1 = c("dog", "cat", "ferret", "dog", "snake"),
pet_2 = c("cat", "ferret", NA, "ferret", NA),
pet_3 = c("ferret", NA, NA, "snake", NA),
pet_4 = c("snake", NA, NA, NA, NA)
)
owner pet_1 pet_2 pet_3 pet_4
1 dog cat ferret snake
2 cat ferret NA NA
3 ferret NA NA NA
4 dog ferret snake NA
5 snake NA NA NA
This is hard to work with. A far better way to organize this data would be like this:
owner dog cat ferret snake
1 1 1 1 1
2 0 1 1 0
3 0 0 1 0
4 1 0 1 1
5 0 0 0 1
where each column indicates whether or not an owner has a given type of animal. How can I transform the first type of data into the second type? I realize there are a lot of ways to do this, but I'd like something elegant, concise, and preferably using tidyverse, though data.table will suffice as well.
We can reshape to 'long' format with pivot_longer on the 'pet' columns, then back to wide with pivot_wider, using a values_fn that returns 1 whenever an owner/pet pair occurs and values_fill = 0 for the pairs that do not occur.
library(dplyr)
library(tidyr)
pet_tab %>%
pivot_longer(cols = starts_with('pet'), values_drop_na = TRUE) %>%
pivot_wider(names_from = value, values_from = name,
values_fn = ~ +(length(.x) > 0), values_fill = 0)
-output
# A tibble: 5 × 5
owner dog cat ferret snake
<int> <int> <int> <int> <int>
1 1 1 1 1 1
2 2 0 1 1 0
3 3 0 0 1 0
4 4 1 0 1 1
5 5 0 0 0 1
Or in base R with table
+(table(pet_tab$owner[row(pet_tab[-1])], unlist(pet_tab[-1])) > 0)
cat dog ferret snake
1 1 1 1 1
2 1 0 1 0
3 0 0 1 0
4 0 1 1 1
5 0 0 0 1
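The base R one-liner relies on row() to pair every pet entry with its owner. A step-by-step sketch of the same idea is shown here, using a plain data.frame instead of a tibble so it needs no packages; the names `pets`, `owner_long`, `pet_long` and `flags` are illustrative assumptions.

```r
pet_tab <- data.frame(
  owner = 1:5,
  pet_1 = c("dog", "cat", "ferret", "dog", "snake"),
  pet_2 = c("cat", "ferret", NA, "ferret", NA),
  pet_3 = c("ferret", NA, NA, "snake", NA),
  pet_4 = c("snake", NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

pets <- pet_tab[-1]  # only the pet_* columns

# row(pets) is a matrix of row indices with the same shape as pets,
# so indexing owner by it repeats each owner once per pet column,
# in the same column-major order in which unlist() flattens pets.
owner_long <- pet_tab$owner[row(pets)]
pet_long <- unlist(pets, use.names = FALSE)

# table() drops NA entries by default; > 0 plus the unary plus gives 0/1 flags.
flags <- +(table(owner = owner_long, pet = pet_long) > 0)
flags
```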
I made the stupid mistake of enabling people to select multiple categories in a survey question.
Now the data column for this question looks something along the lines of this.
respondent answer_openq
1          a
2          a,c
3          b
4          a,d
Using the following line in R,
datanum <- data %>% mutate(dummy=1) %>%
spread(key=answer_openq,value=dummy, fill=0)
I get the following:
However, I want the dataset to transform into this:
respondent a b c d
1          1 0 0 0
2          1 0 1 0
3          0 1 0 0
4          1 0 0 1
Any help is appreciated (my thesis depends on it). Thanks :)
Try this:
library(dplyr)
library(tidyr)
df %>%
separate_rows(answer_openq, sep = ',') %>%
pivot_wider(names_from = answer_openq, values_from = answer_openq,
values_fn = function(x) 1, values_fill = 0)
# A tibble: 4 × 5
respondent a c b d
<int> <dbl> <dbl> <dbl> <dbl>
1 1 1 0 0 0
2 2 1 1 0 0
3 3 0 0 1 0
4 4 1 0 0 1
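For comparison, the same split-then-spread idea can be sketched in base R with strsplit() and table(). This assumes the answers are separated by a bare comma with no surrounding spaces, as in the example; the names `choices`, `long` and `dummies` are made up for this illustration.

```r
df <- data.frame(
  respondent = 1:4,
  answer_openq = c("a", "a,c", "b", "a,d"),
  stringsAsFactors = FALSE
)

# Split each answer on commas: a list with one character vector per respondent.
choices <- strsplit(df$answer_openq, ",", fixed = TRUE)

# Repeat each respondent id once per selected category, then cross-tabulate.
long <- data.frame(
  respondent = rep(df$respondent, lengths(choices)),
  category = unlist(choices)
)
dummies <- +(table(long) > 0)
dummies
```

If the real data contains entries such as "a, c", wrapping the split values in trimws() before tabulating would probably be needed.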
I want to create a contingency table that displays the frequency distribution of pairs of variables. Here is an example dataset:
mm <- matrix(0, 5, 6)
df <- data.frame(apply(mm, c(1,2), function(x) sample(c(0,1),1)))
colnames(df) <- c("Horror", "Thriller", "Comedy", "Romantic", "Sci.fi", "gender")
All variables are binary, with 1 indicating either the presence of a specific movie type or the male gender. In the end, I would like a table that counts the presence of each movie type per gender. Something like this:
male female
Horror 1 1
Thriller 1 3
Comedy 2 2
Romantic 0 0
Sci.fi 2 0
I know I can create two tables of movie types, one for male and one for female (see TarJae's answer to "Create count table under specific condition"), and cbind them later, but I would like to do it in one chunk of code. How can I achieve this efficiently?
You could do
sapply(split(df, df$gender), function(x) colSums(x[names(x)!="gender"]))
#> 0 1
#> Horror 1 1
#> Thriller 1 3
#> Comedy 0 0
#> Romantic 0 0
#> Sci.fi 1 3
Here is a solution using dplyr, tidyr and forcats:
library(dplyr)
library(tidyr)
library(forcats) # for fct_recode()
df %>% pivot_longer(cols = -gender, names_to = "type") %>%
# gender is coded 1 = male, 0 = female in the question
mutate(gender = fct_recode(as.character(gender), Male = "1", Female = "0")) %>%
group_by(gender, type) %>%
summarise(sum = sum(value)) %>%
pivot_wider(names_from = gender, values_from = sum)
Which gives
# A tibble: 5 x 3
type Female Male
<chr> <dbl> <dbl>
1 Comedy 0 1
2 Horror 1 3
3 Romantic 1 1
4 Sci.fi 1 1
5 Thriller 1 1
The mutate() line is optional; it only replaces the 0/1 gender codes with readable labels, which then become the column names.
Please find below a reprex with an alternative solution using data.table and magrittr (for the pipes), also in one chunk.
Reprex
Your data (I set a seed for reproducibility)
set.seed(452)
mm <- matrix(0, 5, 6)
df <- data.frame(apply(mm, c(1,2), function(x) sample(c(0,1),1)))
colnames(df) <- c("Horror", "Thriller", "Comedy", "Romantic", "Sci.fi", "gender")
df
#> Horror Thriller Comedy Romantic Sci.fi gender
#> 1 0 1 1 0 0 0
#> 2 0 0 0 0 1 0
#> 3 1 0 1 1 0 1
#> 4 0 1 0 0 0 1
#> 5 0 1 0 0 0 1
Code in one chunk
library(data.table)
library(magrittr) # for the pipes!
df %>%
transpose(., keep.names = "rn") %>%
setDT(.) %>%
{.[, .(rn = rn,
male = rowSums(.[,.SD, .SDcols = .[, .SD[.N]] == 1]),
female = rowSums(.[,.SD, .SDcols = .[, .SD[.N]] == 0]))][rn !="gender"]}
Output
#> rn male female
#> 1: Horror 1 0
#> 2: Thriller 2 1
#> 3: Comedy 1 1
#> 4: Romantic 1 0
#> 5: Sci.fi 0 1
Created on 2021-11-25 by the reprex package (v2.0.1)
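As a cross-check, the same counts can be sketched in base R with rowsum(). The data frame is typed in by hand from the printed reprex output above, so the example does not depend on the random number generator; the names `sex` and `counts` are assumptions made for this sketch.

```r
df <- data.frame(
  Horror   = c(0, 0, 1, 0, 0),
  Thriller = c(1, 0, 0, 1, 1),
  Comedy   = c(1, 0, 1, 0, 0),
  Romantic = c(0, 0, 1, 0, 0),
  Sci.fi   = c(0, 1, 0, 0, 0),
  gender   = c(0, 0, 1, 1, 1)
)

# Label gender explicitly; the question states that 1 means male.
sex <- ifelse(df$gender == 1, "male", "female")

# rowsum() adds up the genre columns within each gender group;
# transposing puts genres in rows and genders in columns.
counts <- t(rowsum(df[names(df) != "gender"], group = sex))

# Put the columns in the order used in the question.
counts <- counts[, c("male", "female")]
counts
```

On this data the result should agree with the male and female columns of the data.table output above.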
The original table looks like this:
id food
1  fish
2  egg
2  apple
Each food should get its own column holding 1 if the row has that food and 0 otherwise, so the table should look like this:
id food  fish egg apple
1  fish  1    0   0
2  egg   0    1   0
2  apple 0    0   1
A suggestion using the dcast() function of the reshape2 package:
df1 <- read.table(header = TRUE, text = "
id food
1 fish
2 egg
2 apple
")
###
df2 <- reshape2::dcast(data = df1,
formula = id+food ~ food,
fun.aggregate = length,
value.var = "food")
df2
#> id food apple egg fish
#> 1 1 fish 0 0 1
#> 2 2 apple 1 0 0
#> 3 2 egg 0 1 0
###
df3 <- reshape2::dcast(data = df1,
formula = id+factor(food, levels=unique(food)) ~
factor(food, levels=unique(food)),
fun.aggregate = length,
value.var = "food")
names(df3) <- c("id", "food", "fish", "egg", "apple")
df3
#> id food fish egg apple
#> 1 1 fish 1 0 0
#> 2 2 egg 0 1 0
#> 3 2 apple 0 0 1
# Created on 2021-01-29 by the reprex package (v0.3.0.9001)
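If installing reshape2 is not an option, the same one-hot layout can be sketched in base R by comparing the food column against each distinct food. This is an illustrative alternative, not the method of the answer above; `foods`, `dummies` and `df_onehot` are hypothetical names.

```r
df1 <- data.frame(
  id = c(1, 2, 2),
  food = c("fish", "egg", "apple"),
  stringsAsFactors = FALSE
)

# unique() keeps the order of first appearance: fish, egg, apple.
foods <- unique(df1$food)

# One indicator column per food: 1 where the row's food matches, else 0.
# sapply() names the resulting matrix columns after the food values.
dummies <- sapply(foods, function(f) as.integer(df1$food == f))

# Keep the original columns and append the indicator columns.
df_onehot <- cbind(df1, dummies)
df_onehot
```

Because the comparison is done row by row, the original row order and the food column are preserved without any renaming step.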
Is there a quick way to one-hot encode lists of vectors (with different lengths) in R, preferably using tidyverse?
For example:
vals <- list(a=c(1), b=c(2,3), c=c(1,2))
The wanted result is a wide dataframe:
1 2 3
a 1 0 0
b 0 1 1
c 1 1 0
Thanks!
We can enframe the list and convert them into separate rows, create a dummy column and convert the data into wide-format using pivot_wider.
library(tidyverse)
enframe(vals) %>%
unnest(value) %>%
mutate(temp = 1) %>%
pivot_wider(names_from = value, values_from = temp, values_fill = list(temp = 0))
# name `1` `2` `3`
# <chr> <dbl> <dbl> <dbl>
#1 a 1 0 0
#2 b 0 1 1
#3 c 1 1 0
One base R option could be:
t(table(stack(vals)))
values
ind 1 2 3
a 1 0 0
b 0 1 1
c 1 1 0
A base R approach,
do.call(rbind, lapply(vals, function(i) as.integer(!is.na(match(unique(unlist(vals)), i)))))
# [,1] [,2] [,3]
#a 1 0 0
#b 0 1 1
#c 1 1 0
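The match-based one-liner returns a matrix without column names. A slightly longer sketch that labels the columns is given here; it uses `%in%` instead of match() and sorts the distinct values, both of which are choices made for this illustration, as are the names `levels_all` and `onehot`.

```r
vals <- list(a = c(1), b = c(2, 3), c = c(1, 2))

# All distinct values across the list, sorted so the columns come out as 1, 2, 3.
levels_all <- sort(unique(unlist(vals)))

# For each list element, flag which of the distinct values it contains.
# vapply() checks that every element yields exactly length(levels_all) integers.
onehot <- t(vapply(
  vals,
  function(v) as.integer(levels_all %in% v),
  integer(length(levels_all))
))
colnames(onehot) <- as.character(levels_all)
onehot
```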