Concatenating partially duplicate responses from multiple choice in long data in R

I am rather new to R, and I have been trying to write code that will find and concatenate multiple choice question responses when the data is in long format. The data needs to be pivoted wide, but that is not possible without resolving the duplicate IDs that result from these multiple choice responses. I want to combine the extra multiple choice response with the distinct ID number, so that it would look like "affiliation 1, affiliation 2" for the individual respondent, in long format. I would prefer not to use row numbers, as the data is recollected on a monthly basis and row numbers may not stay constant. I need to identify the duplicate ID caused by the multiple choice question and attach its secondary answer to the other response.
I have tried various versions of aggregate, grouping and summarizing, filter, unique, and distinct, but haven't been able to solve the problem.
Here is an example of the data:
ID Question Response
1 question 1 affiliation x
1 question 2 course 1
2 question 1 affiliation y
2 question 2 course 1
3 question 1 affiliation x
3 question 1 affiliation z
4 question 1 affiliation y
I want the data to look like this:
ID Question Response Text
1 question 1 affiliation x
1 question 2 course 1
2 question 1 affiliation y
2 question 2 course 1
3 question 1 affiliation x, affiliation z
4 question 1 affiliation y
so that it is prepared for pivot_wider.
Some example code that I've tried:
library(tidyverse)
course1 <- all_surveys %>%
filter(`Survey Title`=="course 1") %>%
aggregate("ID" ~ "Response Text", by(`User ID`, Question), FUN=sum) %>%
pivot_wider(id_cols = c("ID", `Response Date`),
names_from = "Question",
values_from = "Response Text") %>%
select([questions to be retained from Question])
I have also tried
group_by(question_new, `User ID`) %>%
summarize(text = str_c("Response Text", collapse = ", "))
as well as
aggregate(c[("Response Text" ~ "question_new")],
by = list(`User ID` = `User ID`, `Response Date` = `Response Date`),
function(x) unique(na.omit(x)))
and a bunch of different iterations of the above.
Thank you very much, in advance!

We can try to pivot_wider using values_fn = toString:
df %>% pivot_wider(names_from = Question,
values_from = response,
values_fn = toString)
small minimal example
df<-tibble(ID = c(1,1,2,2), Question = c('question 1', 'question 2', 'question 1', 'question 1'), response = c('affiliation x', 'course 1', 'affiliation x', 'affiliation y'))
# A tibble: 4 × 3
ID Question response
<dbl> <chr> <chr>
1 1 question 1 affiliation x
2 1 question 2 course 1
3 2 question 1 affiliation x
4 2 question 1 affiliation y
output
# A tibble: 2 × 3
ID `question 1` `question 2`
<dbl> <chr> <chr>
1 1 affiliation x course 1
2 2 affiliation x, affiliation y NA
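If the data should stay in long format first, as the question asks, the same collapsing can be done before pivoting. Below is a minimal sketch, assuming the columns are named ID, Question and Response as in the question's example. The real data apparently uses names such as `User ID` and `Response Text`, which need backticks rather than quotes: str_c("Response Text", collapse = ", ") in the attempt above pastes the literal string, not the column.

```r
library(dplyr)

# Example data from the question (column names are assumptions based on that example)
df <- tibble(
  ID = c(1, 1, 2, 2, 3, 3, 4),
  Question = c("question 1", "question 2", "question 1", "question 2",
               "question 1", "question 1", "question 1"),
  Response = c("affiliation x", "course 1", "affiliation y", "course 1",
               "affiliation x", "affiliation z", "affiliation y")
)

# One row per ID/Question pair; multiple responses are joined with ", "
long_collapsed <- df %>%
  group_by(ID, Question) %>%
  summarise(Response = toString(unique(Response)), .groups = "drop")

long_collapsed
```

The result has one row per ID and Question, so a later pivot_wider() no longer runs into duplicates.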

Related

How many elements in common on multiple lists?

Hi, I'm working with a dataset that has a column named "genres" of string vectors containing all the genre tags of each film. I want to create a plot that shows the popularity of all genres.
structure(list(anime_id = c("10152", "11061", "11266", "11757",
"11771"), Name.x = c("Kimi ni Todoke 2nd Season: Kataomoi", "Hunter x Hunter (2011)",
"Ao no Exorcist: Kuro no Iede", "Sword Art Online", "Kuroko no Basket"
), genres = list("Romance", c("Action", " Adventure", " Fantasy"
), "Fantasy", c("Action", " Adventure", " Fantasy", " Romance"
), "Sports")), row.names = c(NA, 5L), class = "data.frame")
Initially the genres column is a string with the genres separated by commas, for example: ['action', 'drama', 'fantasy']. To work with it, I ran this code to edit the column:
AnimeList2022new$genres <- gsub("\\[|\\]|'" , "",
as.character(AnimeList2022new$genres))
AnimeList2022new$genres <- strsplit( AnimeList2022new$genres,
",")
I don't know how to compare all the vectors in order to count how many times each tag appears.
I'm trying with group_by and summarise
genresdata <-MyAnimeList %>%
group_by(genres) %>%
summarise( count = n() ) %>%
arrange( -count)
But obviously this code groups identical vectors, not identical strings contained in the vectors.
Your genres column is of class list, so it sounds like you want the length() of each row in it. Generally, we could do that like this:
MyAnimeList %>%
mutate(n_genres = sapply(genres, length))
But this is a special case where there is a nice convenience function lengths() (notice the s at the end) built-in to R that gives us the same result, so we can simply do
MyAnimeList %>%
mutate(n_genres = lengths(genres))
The above will give the number of genres for each row.
In the comments I see you say you want "for example how many times "Action" appears in the whole column". For that, we can unnest() the genre list column and then count:
library(tidyr)
MyAnimeList %>%
unnest(genres) %>%
count(genres)
# # A tibble: 7 × 2
# genres n
# <chr> <int>
# 1 " Adventure" 2
# 2 " Fantasy" 2
# 3 " Romance" 1
# 4 "Action" 2
# 5 "Fantasy" 1
# 6 "Romance" 1
# 7 "Sports" 1
Do notice that some of your genres have leading white space--it's probably best to solve this problem "upstream" whenever the genre column was created, but we could do it now using trimws to trim whitespace:
MyAnimeList %>%
unnest(genres) %>%
count(trimws(genres))
# # A tibble: 5 × 2
# `trimws(genres)` n
# <chr> <int>
# 1 Action 2
# 2 Adventure 2
# 3 Fantasy 3
# 4 Romance 2
# 5 Sports 1
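The same tally can also be sketched in base R, which may be convenient as input for a plot. This is only an illustration built on the genres list from the question's dput() output; the object name genre_counts is made up for the example.

```r
# The genres list column from the question's example data
genres <- list(
  "Romance",
  c("Action", " Adventure", " Fantasy"),
  "Fantasy",
  c("Action", " Adventure", " Fantasy", " Romance"),
  "Sports"
)

# Flatten the list, strip the leading whitespace, then tabulate each tag
genre_counts <- sort(table(trimws(unlist(genres))), decreasing = TRUE)

genre_counts
```

Calling barplot(genre_counts) on the result would then give a simple popularity plot.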

R Concatenate Across Rows Within Groups but Preserve Sequence

My data consists of text from many dyads that has been split into sentences, one per row. I'd like to concatenate the data by speaker within dyads, essentially converting the data to speaking turns. Here's an example data set:
dyad <- c(1,1,1,1,1,2,2,2,2)
speaker <- c("John", "John", "John", "Paul","John", "George", "Ringo", "Ringo", "George")
text <- c("Let's play",
"We're wasting time",
"Let's make a record!",
"Let's work it out first",
"Why?",
"It goes like this",
"Hold on",
"Have to tighten my snare",
"Ready?")
dat <- data.frame(dyad, speaker, text)
And this is what I'd like the data to look like:
dyad speaker text
1 1 John Let's play. We're wasting time. Let's make a record!
2 1 Paul Let's work it out first
3 1 John Why?
4 2 George It goes like this
5 2 Ringo Hold on. Have to tighten my snare
6 2 George Ready?
I've tried grouping by sender and pasting/collapsing from dplyr but the concatenation combines all of a sender's text without preserving speaking turn order. For example, John's last statement ("Why") winds up with his other text in the output rather than coming after Paul's comment. I also tried to check if the next speaker (using lead(sender)) is the same as current and then combining, but it only does adjacent rows, in which case it misses John's third comment in the example. Seems it should be simple but I can't make it happen. And it should be flexible to combine any series of continuous rows by a given speaker.
Thanks in advance
Create another group with rleid (from data.table) and paste the rows in summarise
library(dplyr)
library(data.table)
library(stringr)
dat %>%
group_by(dyad, grp = rleid(speaker), speaker) %>%
summarise(text = str_c(text, collapse = ' '), .groups = 'drop') %>%
select(-grp)
Output:
# A tibble: 6 × 3
dyad speaker text
<dbl> <chr> <chr>
1 1 John Let's play We're wasting time Let's make a record!
2 1 Paul Let's work it out first
3 1 John Why?
4 2 George It goes like this
5 2 Ringo Hold on Have to tighten my snare
6 2 George Ready?
Not as elegant as akrun's solution, but helper does the same job as rleid() here without the need for an additional package. Note that helper must come before speaker in group_by(): summarise() sorts the result by the grouping columns, so grouping by speaker first would break the speaking-turn order.
library(dplyr)
dat %>%
mutate(helper = (speaker != lag(speaker, 1, default = "xyz")),
helper = cumsum(helper)) %>%
group_by(dyad, helper, speaker) %>%
summarise(text = paste0(text, collapse = " "), .groups = 'drop') %>%
select(-helper)
# A tibble: 6 × 3
dyad speaker text
<dbl> <chr> <chr>
1 1 John Let's play We're wasting time Let's make a record!
2 1 Paul Let's work it out first
3 1 John Why?
4 2 George It goes like this
5 2 Ringo Hold on Have to tighten my snare
6 2 George Ready?
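For reference, recent dplyr versions ship their own run-length helper, consecutive_id(). The following is a sketch that assumes dplyr 1.1.0 or later is installed; the column name turn is made up for the example. Passing both dyad and speaker means a new turn also starts when the same person happens to end one dyad and open the next.

```r
library(dplyr)  # consecutive_id() requires dplyr >= 1.1.0

dat <- data.frame(
  dyad = c(1, 1, 1, 1, 1, 2, 2, 2, 2),
  speaker = c("John", "John", "John", "Paul", "John",
              "George", "Ringo", "Ringo", "George"),
  text = c("Let's play", "We're wasting time", "Let's make a record!",
           "Let's work it out first", "Why?", "It goes like this",
           "Hold on", "Have to tighten my snare", "Ready?")
)

turns <- dat %>%
  # turn increases whenever dyad or speaker changes from the previous row;
  # putting it first keeps the summarised rows in speaking order
  group_by(turn = consecutive_id(dyad, speaker), dyad, speaker) %>%
  summarise(text = paste(text, collapse = " "), .groups = "drop") %>%
  select(-turn)

turns
```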

R using melt() and dcast() with categorical and numerical variables at the same time

I am a newbie in programming with R, and this is my first question ever here on Stackoverflow.
Let's say that I have a data frame with 4 columns:
(1) Individual ID (numeric);
(2) Morality of the individual (factor);
(3) The city (factor);
(4) Numbers of books possessed (numeric).
Person_ID <- c(1,2,3,4,5,6,7,8,9,10)
Morality <- c("Bad guy","Bad guy","Bad guy","Bad guy","Bad guy",
"Good guy","Good guy","Good guy","Good guy","Good guy")
City <- c("NiceCity", "UglyCity", "NiceCity", "UglyCity", "NiceCity",
"UglyCity", "NiceCity", "UglyCity", "NiceCity", "UglyCity")
Books <- c(0,3,6,9,12,15,18,21,24,27)
mydf <- data.frame(Person_ID, City, Morality, Books)
I am using this code in order to get the counts by each category for the variable Morality in each city:
mycounts<-melt(mydf,
idvars = c("City"),
measure.vars = c("Morality"))%>%
dcast(City~variable+value,
value.var="value",fill=0,fun.aggregate=length)
After cleaning up the column names, the code gives this kind of table with the counts:
names(mycounts)<-gsub("Morality_","",names(mycounts))
mycounts
City Bad guy Good guy
1 NiceCity 3 2
2 UglyCity 2 3
I wonder if there is a similar way to use dcast() for numerical variables (inside the same script), e.g. to get the sum of the Books possessed by all individuals living in each city:
#> City Bad guy Good guy Books
#>1 NiceCity 3 2 [Total number of books in NiceCity]
#>2 UglyCity 2 3 [Total number of books in UglyCity]
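Do you mean something like this? Note that the melt() argument is id.vars (not idvars), and Books has to be kept as an id variable so that dcast() can sum it:
library(reshape2)
library(dplyr) # for %>%
mydf %>%
melt(
id.vars = c("City", "Books"),
measure.vars = c("Morality")
) %>%
dcast(
City ~ variable + value,
value.var = "Books",
fill = 0,
fun.aggregate = sum
)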
#> City Morality_Bad guy Morality_Good guy
#> 1 NiceCity 18 42
#> 2 UglyCity 12 63
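The table asked for in the question (counts per Morality plus one total Books column per city) can be assembled by casting the counts and merging in a separate per-city sum. This is a sketch rather than the answerer's method; the object names counts, books and result are made up for the example.

```r
library(reshape2)

# Same data as in the question, built compactly
mydf <- data.frame(
  Person_ID = 1:10,
  City = rep(c("NiceCity", "UglyCity"), times = 5),
  Morality = rep(c("Bad guy", "Good guy"), each = 5),
  Books = seq(0, 27, by = 3)
)

# Number of individuals per Morality category in each city
counts <- dcast(mydf, City ~ Morality, value.var = "Morality",
                fun.aggregate = length)

# Total number of books per city
books <- aggregate(Books ~ City, data = mydf, FUN = sum)

# Combine both summaries into one table keyed by City
result <- merge(counts, books, by = "City")

result
```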

How to convert specific rows into columns in r?

I have a df in R of only one column of food ratings from amazon.
head(food_ratings)
product.productId..B001E4KFG0
1 review/userId: A3SGXH7AUHU8GW
2 review/profileName: delmartian
3 review/helpfulness: 1/1
4 review/score: 5.0
5 review/time: 1303862400
6 review/summary: Good Quality Dog Food
The rows repeat themselves, so that rows 7 through 12 have the same information regarding another user(row 7). This pattern is repeated many times.
Therefore, I need to have every group of 6 rows distributed in one row with 6 columns, so that later I can subset, for instance, the review/summary according to their review/score.
I'm using RStudio 1.0.143
EDIT: I was asked to show the output of dput(head(food_ratings, 24)) but it was too big regardless of the number used.
Thanks a lot
I have taken your data and added 2 more fake users to it. Using tidyr and dplyr you can create new columns and collapse the data into a nice data.frame. You can use select from dplyr to drop the id column if you don't need it or to rearrange the order of the columns.
library(tidyr)
library(dplyr)
df %>%
separate(product.productId..B001E4KFG0, into = c("details", "data"), sep = ": ") %>%
mutate(details = sub("review/", "", details)) %>%
group_by(details) %>%
mutate(id = row_number()) %>%
spread(details, data)
# A tibble: 3 x 7
id helpfulness profileName score summary time userId
<int> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 1/1 delmartian 5.0 Good Quality Dog Food 1303862400 A3SGXH7AUHU8GW
2 2 1/1 martian2 1.0 Good Quality Snake Food 1303862400 123456
3 3 2/5 martian3 5.0 Good Quality Cat Food 1303862400 123654
data:
df <- structure(list(product.productId..B001E4KFG0 = c("review/userId: A3SGXH7AUHU8GW",
"review/profileName: delmartian", "review/helpfulness: 1/1",
"review/score: 5.0", "review/time: 1303862400", "review/summary: Good Quality Dog Food",
"review/userId: 123456", "review/profileName: martian2", "review/helpfulness: 1/1",
"review/score: 1.0", "review/time: 1303862400", "review/summary: Good Quality Snake Food",
"review/userId: 123654", "review/profileName: martian3", "review/helpfulness: 2/5",
"review/score: 5.0", "review/time: 1303862400", "review/summary: Good Quality Cat Food"
)), class = "data.frame", row.names = c(NA, -18L))
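A variant of the same idea with pivot_wider(), which supersedes spread() in current tidyr. It is a sketch under two assumptions: every review block starts with a review/userId line (used here to number the reviews), and only the first ": " separates the field name from its value. The data is a shortened copy of the example above.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  product.productId..B001E4KFG0 = c(
    "review/userId: A3SGXH7AUHU8GW", "review/profileName: delmartian",
    "review/helpfulness: 1/1", "review/score: 5.0",
    "review/time: 1303862400", "review/summary: Good Quality Dog Food",
    "review/userId: 123456", "review/profileName: martian2",
    "review/helpfulness: 1/1", "review/score: 1.0",
    "review/time: 1303862400", "review/summary: Good Quality Snake Food"
  )
)

wide <- df %>%
  # extra = "merge" keeps any later ": " inside the value instead of dropping text
  separate(product.productId..B001E4KFG0, into = c("details", "data"),
           sep = ": ", extra = "merge") %>%
  mutate(details = sub("^review/", "", details),
         # a new review starts at every userId row
         id = cumsum(details == "userId")) %>%
  pivot_wider(names_from = details, values_from = data)

wide
```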

How do we separate a column into multiple ones based on "|"? [duplicate]

I have a tibble.
library(tidyverse)
df <- tibble(
id = 1:4,
genres = c("Action|Adventure|Science Fiction|Thriller",
"Adventure|Science Fiction|Thriller",
"Action|Crime|Thriller",
"Family|Animation|Adventure|Comedy|Action")
)
df
I want to separate the genres by "|" and empty columns filled with NA.
This is what I did:
df %>%
separate(genres, into = c("genre1", "genre2", "genre3", "genre4", "genre5"), sep = "|")
However, it's being separated after each letter.
I think you haven't included into:
df <- tibble::tibble(
id = 1:4,
genres = c("Action|Adventure|Science Fiction|Thriller",
"Adventure|Science Fiction|Thriller",
"Action|Crime|Thriller",
"Family|Animation|Adventure|Comedy|Action")
)
df %>% tidyr::separate(genres, into = c("genre1", "genre2", "genre3",
"genre4", "genre5"))
Result:
# A tibble: 4 x 6
id genre1 genre2 genre3 genre4 genre5
* <int> <chr> <chr> <chr> <chr> <chr>
1 1 Action Adventure Science Fiction Thriller
2 2 Adventure Science Fiction Thriller <NA>
3 3 Action Crime Thriller <NA> <NA>
4 4 Family Animation Adventure Comedy Action
Edit: Or as RichScriven wrote in the comments, df %>% tidyr::separate(genres, into = paste0("genre", 1:5)). For separating on | exactly, use sep = "\\|".
Well, this is what helped, writing regex properly.
df %>%
separate(genres, into = paste0("genre", 1:5), sep = "\\|")
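The reason for the per-letter split is that sep is treated as a regular expression, where a bare | means "or" between two empty patterns and therefore matches at every position. A small base R illustration of the same behaviour with strsplit() (the variable names are made up for the example):

```r
x <- "Action|Crime"

# Unescaped "|" is regex alternation of two empty patterns: splits every character
per_char <- strsplit(x, "|")[[1]]

# Escaping the pipe makes it a literal character
escaped <- strsplit(x, "\\|")[[1]]

# fixed = TRUE switches regex interpretation off entirely
fixed <- strsplit(x, "|", fixed = TRUE)[[1]]

per_char
escaped
fixed
```

Only the last two calls return the two genres as separate elements.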
