From comma-separated text to vector [duplicate] - r

I have a data frame with a column of comma-separated values:
dframe = data.frame(id=c(1,2,43,53), title=c("text1,color","color,text2","text2","text3,text2"))
I want to convert it into 0/1 indicator columns showing whether each value exists in each row, like this expected output:
dframe = data.frame(id=c(1,2,43,53), text1=c(1,0,0,0), color=c(1,1,0,0), text2=c(0,1,1,1), text3=c(0,0,0,1))

We can use separate_rows and spread from tidyverse:
library(tidyverse)
dframe %>%
  separate_rows(title, sep = ",") %>%
  mutate(id2 = 1) %>%
  spread(title, id2, fill = 0)
Output:
# A tibble: 4 x 5
# Groups: id [4]
id color text1 text2 text3
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 0 0
2 2 1 0 1 0
3 43 0 0 1 0
4 53 0 0 1 1
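In current tidyr, spread() is superseded by pivot_wider(), so the same idea can also be sketched with the newer function. This is a minimal sketch, not the answerer's code; the helper column name `present` and the result name `indicators` are made up for illustration.

```r
library(dplyr)
library(tidyr)

dframe <- data.frame(
  id = c(1, 2, 43, 53),
  title = c("text1,color", "color,text2", "text2", "text3,text2")
)

indicators <- dframe %>%
  separate_rows(title, sep = ",") %>%  # one row per id/value pair
  mutate(present = 1) %>%              # hypothetical helper column marking presence
  pivot_wider(names_from = title,      # each distinct value becomes a column
              values_from = present,
              values_fill = 0)         # absent combinations become 0

indicators
```

Columns appear in order of first occurrence here rather than alphabetically, so the column order may differ from the spread() output.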

Related

Convert categorical variable into binary columns in R [duplicate]

I made the stupid mistake of enabling people to select multiple categories in a survey question.
Now the data column for this question looks something along the lines of this.
respondent  answer_openq
1           a
2           a,c
3           b
4           a,d
Using the following line in R,
datanum <- data %>%
  mutate(dummy = 1) %>%
  spread(key = answer_openq, value = dummy, fill = 0)
I get the following:
However, I want the dataset to transform into this:
respondent  a  b  c  d
1           1  0  0  0
2           1  0  1  0
3           0  1  0  0
4           1  0  0  1
Any help is appreciated (my thesis depends on it). Thanks :)
Try this:
library(dplyr)
library(tidyr)
df %>%
  separate_rows(answer_openq, sep = ',') %>%
  pivot_wider(names_from = answer_openq, values_from = answer_openq,
              values_fn = function(x) 1, values_fill = 0)
# A tibble: 4 × 5
respondent a c b d
<int> <dbl> <dbl> <dbl> <dbl>
1 1 1 0 0 0
2 2 1 1 0 0
3 3 0 0 1 0
4 4 1 0 0 1
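For comparison, the same split-then-widen idea can be sketched in base R with strsplit() and table(). This is only an illustrative alternative under the assumption that the data look like the example in the question; the names `parts`, `long`, `counts`, and `dummies` are made up here.

```r
df <- data.frame(
  respondent = 1:4,
  answer_openq = c("a", "a,c", "b", "a,d")
)

# Split each answer on literal commas; as.character() guards against factor columns
parts <- strsplit(as.character(df$answer_openq), ",", fixed = TRUE)

# Long format: one row per respondent/category pair
long <- data.frame(
  respondent = rep(df$respondent, lengths(parts)),
  category = unlist(parts)
)

# Cross-tabulate, then turn counts into 0/1 indicators
counts <- table(long$respondent, long$category)
dummies <- as.data.frame.matrix((counts > 0) * 1)
dummies <- cbind(respondent = as.integer(rownames(dummies)), dummies)
rownames(dummies) <- NULL

dummies
```

Because table() sorts its levels, the category columns come out in alphabetical order (a, b, c, d), which matches the layout requested in the question.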

Separate a row of data on different columns with the count of each item

I have a dataset with two columns where I want to separate the second one (delimited by |) into many columns where each column has the name of the item and the observation has the count.
id column
1 a|b|a
2 a|b|c|d|e
3 a|c|c
I would like to have one column per item, named after the item and holding its count. For example, the output would be as follows:
id a b c d e
1 2 1 0 0 0
2 1 1 1 1 1
3 1 0 2 0 0
How do I get to separate this data such that the values are distributed in columns as such?
A tidyverse approach, assuming data frame named mydata:
library(dplyr)
library(tidyr)
mydata %>%
  separate_rows(column, sep = "\\|") %>%
  count(id, column) %>%
  spread(column, n) %>%
  replace(., is.na(.), 0) # or just spread(column, n, fill = 0)
Result:
# A tibble: 3 x 6
id a b c d e
<int> <int> <dbl> <dbl> <dbl> <dbl>
1 1 2 1 0 0 0
2 2 1 1 1 1 1
3 3 1 0 2 0 0
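The separator in the answer is written as "\\|" because separate_rows() treats sep as a regular expression, where a bare | means alternation. The base R sketch below illustrates the same point with strsplit(); the variable names are made up for illustration.

```r
x <- c("a|b|a", "a|b|c|d|e", "a|c|c")

# Escaped regex and literal (fixed) matching both split on the pipe character
escaped <- strsplit(x, "\\|")
literal <- strsplit(x, "|", fixed = TRUE)
stopifnot(identical(escaped, literal))

# Item counts for the third row ("a|c|c"): one a and two c
counts_id3 <- table(escaped[[3]])
counts_id3
```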

Using mutate to create columns from column values [duplicate]

With the following data frame, I would like to create new columns based on the "Type" column values using 'mutate' and count the number of instances that appear. The data should be grouped by "Group" and "Choice".
Over time, the "Type" column will have new values added in that aren't already listed, so the code should be flexible in that respect.
Is this possible using the dplyr library?
library(dplyr)
df <- data.frame(Group = c("A","A","A","B","B","C","C","D","D","D","D","D"),
                 Choice = c("Yes","Yes","No","No","Yes","Yes","Yes","Yes","No","No","No","No"),
                 Type = c("Fruit","Construction","Fruit","Planes","Fruit","Trips","Construction","Cars","Trips","Fruit","Planes","Trips"))
The desired result should be the following:
result <- data.frame(Group = c("A","A","B","B","C","D","D"),
                     Choice = c("Yes","No","Yes","No","Yes","Yes","No"),
                     Fruit = c(1,1,1,0,0,0,1),
                     Construction = c(1,0,0,0,1,0,0),
                     Planes = c(0,0,0,1,0,0,1),
                     Trips = c(0,0,0,0,1,0,2),
                     Cars = c(0,0,0,0,0,1,0))
We can do a count and then spread:
library(tidyverse)
df %>%
  count(Group, Choice, Type) %>%
  spread(Type, n, fill = 0)
# A tibble: 7 x 7
# Group Choice Cars Construction Fruit Planes Trips
# <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 A No 0 0 1 0 0
#2 A Yes 0 1 1 0 0
#3 B No 0 0 0 1 0
#4 B Yes 0 0 1 0 0
#5 C Yes 0 1 0 0 1
#6 D No 0 0 1 1 2
#7 D Yes 1 0 0 0 0
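The question asks for code that keeps working when new "Type" values appear. Counting and then widening does this automatically, because the columns are derived from the data rather than listed by hand. The sketch below uses pivot_wider() in place of spread(); the wrapper name `count_types` and the extra value "Boats" are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical wrapper around the count-then-widen step
count_types <- function(data) {
  data %>%
    count(Group, Choice, Type) %>%
    pivot_wider(names_from = Type, values_from = n, values_fill = 0)
}

df <- data.frame(
  Group = c("A","A","A","B","B","C","C","D","D","D","D","D"),
  Choice = c("Yes","Yes","No","No","Yes","Yes","Yes","Yes","No","No","No","No"),
  Type = c("Fruit","Construction","Fruit","Planes","Fruit","Trips",
           "Construction","Cars","Trips","Fruit","Planes","Trips")
)
before <- count_types(df)

# Append a row with a Type value that was not present before ("Boats" is invented)
df_new <- rbind(df, data.frame(Group = "A", Choice = "Yes", Type = "Boats"))
after <- count_types(df_new)

# The new value shows up as an extra column without any code changes
setdiff(names(after), names(before))
```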

Count the number of times two values appear in a column based on the unique values of another column [duplicate]

I have the dataframe below:
year<-c("2000","2000","2001","2002","2000")
gender<-c("M","F","M","F","M")
YG<-data.frame(year,gender)
In this data frame I want to count the number of "M" and "F" for every year and then create a new data frame like:
year M F
1 2000 2 1
2 2001 1 0
3 2002 0 1
I tried something like:
library(dplyr)
ns <- YG %>%
  group_by(year) %>%
  count(YG$gender == "M")
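A solution using reshape2:
library(reshape2)
# state the aggregation explicitly instead of relying on the default (length)
dcast(YG, year ~ gender, fun.aggregate = length)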
year F M
1 2000 1 2
2 2001 0 1
3 2002 1 0
Or a different tidyverse solution:
YG %>%
  group_by(year) %>%
  summarise(M = length(gender[gender == "M"]),
            F = length(gender[gender == "F"]))
year M F
<fct> <int> <int>
1 2000 2 1
2 2001 1 0
3 2002 0 1
Or as proposed by @zx8754:
YG %>%
  group_by(year) %>%
  summarise(M = sum(gender == "M"),
            F = sum(gender == "F"))
We can use count and spread to get the df format and use fill = 0 in spread to fill in the 0s:
library(tidyverse)
YG %>%
  group_by(year) %>%
  count(gender) %>%
  spread(gender, n, fill = 0)
Output:
# A tibble: 3 x 3
# Groups: year [3]
year F M
<fct> <dbl> <dbl>
1 2000 1 2
2 2001 0 1
3 2002 1 0

Creating columns by counting occurence number of a categorizing variable [duplicate]

I want to create several variables that count how many times each value of var occurs for each user.id. Here is an example:
user.id var
1 A
1 B
2 A
2 A
2 C
3 C
Expected result:
user.id var_A var_B var_C
1 1 1 0
2 2 0 1
3 0 0 1
We can do this with tidyverse:
library(tidyverse)
df1 %>%
  count(user.id, var) %>%
  spread(var, n, fill = 0)
# A tibble: 3 x 4
# user.id A B C
#* <int> <dbl> <dbl> <dbl>
#1 1 1 1 0
#2 2 2 0 1
#3 3 0 0 1
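Or a more efficient approach with data.table:
library(data.table)
# state the aggregation explicitly instead of relying on the default (length)
dcast(setDT(df1), user.id ~ var, fun.aggregate = length)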
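The expected result in the question uses column names with a var_ prefix (var_A, var_B, var_C), which the outputs above do not include. One way to get them, sketched here with pivot_wider() and its names_prefix argument, is shown below. The example data are rebuilt from the question, and the result name `prefixed` is made up for illustration.

```r
library(dplyr)
library(tidyr)

df1 <- data.frame(
  user.id = c(1, 1, 2, 2, 2, 3),
  var = c("A", "B", "A", "A", "C", "C")
)

prefixed <- df1 %>%
  count(user.id, var) %>%               # occurrences per user.id/var pair
  pivot_wider(names_from = var,
              values_from = n,
              names_prefix = "var_",    # prepend "var_" to each new column name
              values_fill = 0)          # missing pairs become 0

prefixed
```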
