Split survey text cell in multiple (unique and binary) columns [duplicate] - r

I have a user-level survey database. One of the fields holds the multiple-choice options each user selected, stored as a comma-separated string. Example:
col1 | col2
ID1 | a, b, c
ID2 | c, f
ID3 | g, k, z
I want to reshape the file as follows using R:
col1 | col2(a) | col3(b) | col4(c) | col5(f) | col6(g) | col7(k) | col8(z)
ID1 | 1 | 1 | 1 | 0 | 0 | 0 | 0
ID2 | 0 | 0 | 1 | 1 | 0 | 0 | 0
ID3 | 0 | 0 | 0 | 0 | 1 | 1 | 1
Please note: I don't know in advance how many distinct values exist in the original multiple-choice field.
Thanks

One option is mtabulate from qdapTools, after splitting 'col2' on ", ":
library(qdapTools)
cbind(df1[1], mtabulate(strsplit(df1$col2, ", ")))
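For a version without extra packages, the same split-then-tabulate idea can be sketched in base R. This is only an illustration: df1 is rebuilt here from the example in the question, and the names choices, all_levels, and indicators are made up for the sketch.

```r
# Example data rebuilt from the question (assumed layout)
df1 <- data.frame(col1 = c("ID1", "ID2", "ID3"),
                  col2 = c("a, b, c", "c, f", "g, k, z"),
                  stringsAsFactors = FALSE)

# Split each cell on commas, tolerating optional whitespace after the comma
choices <- strsplit(df1$col2, ",\\s*")

# Collect every distinct choice that occurs anywhere in the column
all_levels <- sort(unique(unlist(choices)))

# One row per respondent, one 0/1 column per distinct choice
indicators <- do.call(rbind, lapply(choices, function(x) as.integer(all_levels %in% x)))
colnames(indicators) <- all_levels

result <- cbind(df1["col1"], indicators)
result
```

Because all_levels is computed from the data, the number of distinct choices does not need to be known up front.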

Related

Creating a Ones and Zeros matrix in R from features [duplicate]

I have a dataset like so:
Name | Pet | isTrain |
---------------------------------
Ben | Dog | 1 |
Kim | Cat | 0 |
Kim | Rabbit | 0 |
How do I make this into a matrix in R where the Name is the row and the Pet is the column, and isTrain is the value?
We can use xtabs from base R
xtabs(isTrain ~ Name + Pet, df1)
# Pet
#Name Cat Dog Rabbit
# Ben 0 1 0
# Kim 0 0 0
data
df1 <- data.frame(Name = c('Ben', 'Kim', 'Kim'),
Pet = c('Dog', 'Cat', 'Rabbit'), isTrain = c(1, 0, 0))
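One caveat with xtabs is that a Name/Pet pair that never occurs also shows up as 0, the same as a pair whose isTrain is 0. If that distinction matters, one possible alternative (a sketch, not part of the answer above) is tapply, which leaves absent combinations as NA. The name m is made up for the example.

```r
df1 <- data.frame(Name = c("Ben", "Kim", "Kim"),
                  Pet = c("Dog", "Cat", "Rabbit"),
                  isTrain = c(1, 0, 0))

# Rows are Name, columns are Pet; cells with no matching row stay NA
m <- with(df1, tapply(isTrain, list(Name, Pet), sum))
m
```

Here the Kim/Cat cell is 0 because that row exists with isTrain = 0, while the Ben/Cat cell is NA because no such row exists.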

Subset common rows from multiple data frames

I have multiple data frames like the ones below, with a unique Id for each row. I am trying to find the common rows and build a new data frame from every row that appears in at least two of the data frames.
For example, the row with Id = 2 appears in all three data frames; similarly, the row with Id = 3 appears in df1 and df3.
I want a loop that finds the common rows and creates a new data frame from them.
df1 <- data.frame(Id=c(1,2,3,4),a=c(0,1,0,2),b=c(1,0,1,0),c=c(0,0,4,0))
df2 <- data.frame(Id=c(7,2,5,9),a=c(4,1,9,2),b=c(1,0,1,5),c=c(3,0,7,0))
df3 <- data.frame(Id=c(5,3,2,6),a=c(9,0,1,5),b=c(1,1,0,0),c=c(7,4,0,0))
> df1
Id | a | b | c
---|---|---|---
1 | 0 | 1 | 0
2 | 1 | 0 | 0
3 | 0 | 1 | 4
4 | 2 | 0 | 0
> df2
Id | a | b | c
---|---|---|---
7 | 4 | 1 | 3
2 | 1 | 0 | 0
5 | 9 | 1 | 7
9 | 2 | 5 | 0
> df3
Id | a | b | c
---|---|---|---
5 | 9 | 1 | 7
3 | 0 | 1 | 4
2 | 1 | 0 | 0
6 | 5 | 0 | 0
> expected_output
Id | a | b | c
---|---|---|---
5 | 9 | 1 | 7
3 | 0 | 1 | 4
2 | 1 | 0 | 0
Note: Id is unique within each data frame.
Also, I want to remove from the original data frames the duplicated rows that I use to create the new data frame.
Since no Id appears twice in the same table, we can tabulate the Ids and keep any found at least twice:
library(data.table)
DTs = lapply(list(df1,df2,df3), data.table)
Id_keep = rbindlist(lapply(DTs, `[`, j = "Id"))[, .N, by=Id][N >= 2L, Id]
DT_keep = Reduce(funion, DTs)[Id %in% Id_keep]
# Id a b c
# 1: 2 1 0 0
# 2: 3 0 1 4
# 3: 5 9 1 7
Your data should be in an object like DTs to begin with, not a bunch of separate named objects.
How it works
To get a sense of how it works, examine intermediate objects like
list(df1,df2,df3)
lapply(DTs, `[`, j = "Id")
Reduce(funion, DTs)
Also, read the help files, like ?lapply, ?rbindlist, ?funion.
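The same count-the-Ids idea can also be sketched in base R, in case data.table is not available. This is an illustration under the same assumption as above (no Id repeats within one data frame, and rows sharing an Id are identical); the names dfs, id_counts, ids_keep, and stacked are made up for the example.

```r
df1 <- data.frame(Id = c(1, 2, 3, 4), a = c(0, 1, 0, 2), b = c(1, 0, 1, 0), c = c(0, 0, 4, 0))
df2 <- data.frame(Id = c(7, 2, 5, 9), a = c(4, 1, 9, 2), b = c(1, 0, 1, 5), c = c(3, 0, 7, 0))
df3 <- data.frame(Id = c(5, 3, 2, 6), a = c(9, 0, 1, 5), b = c(1, 1, 0, 0), c = c(7, 4, 0, 0))

# Keep the data frames together in one list
dfs <- list(df1, df2, df3)

# Count how many data frames each Id appears in
id_counts <- table(unlist(lapply(dfs, function(d) d$Id)))
ids_keep <- as.numeric(names(id_counts)[id_counts >= 2])

# Stack everything, then keep the first occurrence of each wanted Id
stacked <- do.call(rbind, dfs)
common <- stacked[stacked$Id %in% ids_keep & !duplicated(stacked$Id), ]
common <- common[order(common$Id), ]
common
```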
Combine all of the data frames:
combined <- rbind(df1, df2, df3)
Extract the duplicates:
duplicate_rows <- unique(combined[duplicated(combined), ])
(duplicated(combined) returns a logical vector that is TRUE for every row that repeats an earlier row.)
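The question also asks to drop the common rows from the original data frames. A minimal sketch of that follow-up step, assuming rows with the same Id are identical across data frames (otherwise duplicated() on whole rows would miss them); the names dfs, common, and remaining are made up for the example.

```r
df1 <- data.frame(Id = c(1, 2, 3, 4), a = c(0, 1, 0, 2), b = c(1, 0, 1, 0), c = c(0, 0, 4, 0))
df2 <- data.frame(Id = c(7, 2, 5, 9), a = c(4, 1, 9, 2), b = c(1, 0, 1, 5), c = c(3, 0, 7, 0))
df3 <- data.frame(Id = c(5, 3, 2, 6), a = c(9, 0, 1, 5), b = c(1, 1, 0, 0), c = c(7, 4, 0, 0))

dfs <- list(df1 = df1, df2 = df2, df3 = df3)

# Rows that occur more than once across the stacked data
combined <- do.call(rbind, dfs)
common <- unique(combined[duplicated(combined), ])

# Drop those Ids from each original data frame
remaining <- lapply(dfs, function(d) d[!(d$Id %in% common$Id), ])
remaining
```

With the example data, remaining$df1 keeps Ids 1 and 4, remaining$df2 keeps 7 and 9, and remaining$df3 keeps 6.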

Impute character values in column [duplicate]

My data looks like this:
Item | Category
A | 1
A |
A |
A | 1
A |
A |
A | 1
B | 2
B |
B |
B | 2
B |
B |
B | 2
B |
B |
I want to impute values and fill the "Category" column with the values corresponding to each "Item", wherever it isn't blank. The end result should be like this:
Item | Category
A | 1
A | 1
A | 1
A | 1
A | 1
A | 1
A | 1
B | 2
B | 2
B | 2
B | 2
B | 2
B | 2
B | 2
B | 2
B | 2
How can I do this in R?
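We can use fill from tidyr (part of the tidyverse). It fills missing values downward by default, so the blank cells need to be NA:
library(tidyverse)
df1 %>%
  fill(Category)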
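If the fill should never carry a value over from one Item to the next, it can be done per group. Here is a base R sketch of that, assuming the blanks are already NA (if they are empty strings, convert them first). The helper fill_down is hypothetical, written just for this example, and df1 is rebuilt from the table in the question.

```r
df1 <- data.frame(
  Item = rep(c("A", "B"), times = c(7, 9)),
  Category = c(1, NA, NA, 1, NA, NA, 1,
               2, NA, NA, 2, NA, NA, 2, NA, NA)
)

# Hypothetical helper: carry the last non-NA value forward
fill_down <- function(x) {
  idx <- cumsum(!is.na(x))          # how many non-NA values seen so far
  c(NA, x[!is.na(x)])[idx + 1]      # leading NAs (idx == 0) stay NA
}

# Apply the helper separately within each Item
df1$Category <- ave(df1$Category, df1$Item, FUN = fill_down)
df1
```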

R - subsetting rows from a data frame for column values within a vector

I have a data.frame, df, as below:
id | name | value
1 | team1 | 3
1 | team2 | 1
2 | team1 | 1
2 | team2 | 4
3 | team1 | 0
3 | team2 | 6
4 | team1 | 1
4 | team2 | 2
5 | team1 | 3
5 | team2 | 0
How do we subset the data frame to get the rows for all values of id in 2:4?
We can filter with a condition like df[df$id >= 2 & df$id <= 4, ]. But is there a way to directly use a vector of integers such as ids <- c(2:4) to subset a data frame?
One way to do this is df[df$id >= min(ids) & df$id <= max(ids), ].
Is there a more elegant R way of doing this?
The most typical way is %in%, as in df[df$id %in% 2:4, ]. There are also variations using match:
with(df, df[match(id, 2:4, nomatch = 0L) > 0, ])
or, similarly, is.element:
with(df, df[is.element(id, 2:4), ])
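To make the comparison concrete, here is a small runnable sketch with the data from the question, using %in% directly with a vector of ids. The names subset_in and subset_match are made up for the example.

```r
df <- data.frame(id = rep(1:5, each = 2),
                 name = rep(c("team1", "team2"), times = 5),
                 value = c(3, 1, 1, 4, 0, 6, 1, 2, 3, 0))

ids <- 2:4

# Keep the rows whose id is one of the values in ids
subset_in <- df[df$id %in% ids, ]

# Same rows via match(), for comparison
subset_match <- df[match(df$id, ids, nomatch = 0L) > 0, ]

subset_in
```

Unlike the min/max comparison, %in% also works when ids is not a contiguous range, for example ids <- c(1, 3, 5).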

In R, how can I take a subset of columns of a data frame and then eliminate duplicate rows?

Imagine I have a data frame with data like this:
A | B | C
---+---+---
1 | 2 | a
1 | 2 | b
5 | 5 | a
5 | 5 | b
I want to take only columns A and B, and I want to remove any rows that have become duplicates as a result of eliminating all other columns (that is, column C). So my desired result for the table above would be:
A | B
---+---
1 | 2
5 | 5
What is the best way to do this?
If your data.frame is called df, then do this:
unique(df[, c("A", "B")])
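As a quick check, here is a runnable sketch with the table from the question. The name result is made up for the example, and the duplicated() line is just an equivalent formulation shown for comparison.

```r
df <- data.frame(A = c(1, 1, 5, 5),
                 B = c(2, 2, 5, 5),
                 C = c("a", "b", "a", "b"))

# Select the two columns, then drop repeated rows
result <- unique(df[, c("A", "B")])

# Equivalent: keep rows whose (A, B) pair has not been seen before
result_alt <- df[!duplicated(df[c("A", "B")]), c("A", "B")]

result
```

Note that the original row names (1 and 3 here) are kept; use rownames(result) <- NULL if fresh row numbers are wanted.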
