Wrangling dataset by picking out Rotten Tomatoes movie ratings from a column - r

I have this sample dataset:
structure(list(Title = c("Isn't It Romantic", "Isn't It Romantic",
"Isn't It Romantic", "Isn't It Romantic", "Isn't It Romantic",
"Isn't It Romantic", "Gully Boy", "Gully Boy", "Gully Boy", "Gully Boy",
"Gully Boy", "Gully Boy", "The Wandering Earth", "The Wandering Earth",
"The Wandering Earth", "The Wandering Earth", "The Wandering Earth",
"The Wandering Earth", "How to Train Your Dragon: The Hidden World",
"How to Train Your Dragon: The Hidden World", "How to Train Your Dragon: The Hidden World",
"How to Train Your Dragon: The Hidden World", "How to Train Your Dragon: The Hidden World",
"How to Train Your Dragon: The Hidden World", "American Woman",
"American Woman", "Us", "Us", "Us", "Us", "Us", "Us", "The Wolf's Call",
"The Wolf's Call", "Avengers: Endgame", "Avengers: Endgame",
"Avengers: Endgame", "Avengers: Endgame", "Avengers: Endgame",
"Avengers: Endgame", "The Silence", "The Silence", "The Silence",
"The Silence", "The Silence", "The Silence", "My Little Pony: Equestria Girls: Spring Breakdown",
"My Little Pony: Equestria Girls: Spring Breakdown"), Ratings = c("Internet Movie Database",
"5.9/10", "Rotten Tomatoes", "68%", "Metacritic", "60/100", "Internet Movie Database",
"8.4/10", "Rotten Tomatoes", "100%", "Metacritic", "65/100",
"Internet Movie Database", "6.4/10", "Rotten Tomatoes", "74%",
"Metacritic", "62/100", "Internet Movie Database", "7.6/10",
"Rotten Tomatoes", "91%", "Metacritic", "71/100", "Rotten Tomatoes",
"57%", "Internet Movie Database", "7.1/10", "Rotten Tomatoes",
"94%", "Metacritic", "81/100", "Internet Movie Database", "7.6/10",
"Internet Movie Database", "8.7/10", "Rotten Tomatoes", "94%",
"Metacritic", "78/100", "Internet Movie Database", "5.2/10",
"Rotten Tomatoes", "23%", "Metacritic", "25/100", "Internet Movie Database",
"7.7/10")), row.names = c(NA, -48L), class = c("tbl_df", "tbl",
"data.frame"))
The Ratings column holds three different types of ratings (IMDb, Rotten Tomatoes and Metacritic) for each movie, spread over up to 6 rows per movie: a row with the provider's name followed by a row with its value.
I'd like to wrangle this dataset so that, for each movie, there is a new column called rottentomatoes_rating whose value is the Rotten Tomatoes rating. So, in my sample dataset, Isn't It Romantic would have 68% under rottentomatoes_rating, Gully Boy would have 100%, etc.
For movies that don't have a Rotten Tomatoes rating, I'd like rottentomatoes_rating to be NA.
I've thought about using spread in tidyr, but I can't quite figure out how to do so since in my case, the variable and values are all in the same column!

Assuming your dataset is called dt you can use this process to get a tidy version of your dataset:
library(tidyverse)
# specify indexes of Rating companies
ids = seq(1, nrow(dt), 2)
# get rows of Rating companies
dt %>% slice(ids) %>%
# combine with the rating values
cbind(dt %>% slice(-ids) %>% select(RatingsValue = Ratings)) %>%
# reshape dataset
spread(Ratings, RatingsValue)
# Title Year Rated Released Runtime Internet Movie Database Metacritic Rotten Tomatoes
# 1 Gully Boy 2019 Not Rated 2019-02-14 153 min 8.4/10 65/100 100%
# 2 Isn't It Romantic 2019 PG-13 2019-02-13 89 min 5.9/10 60/100 68%

If the data is formatted similarly throughout your dataset, the following code should work:
df %>% group_by(Title) %>%
slice(match("Rotten Tomatoes", Ratings) + 1) %>%
rename(rottentomatoes_rating = Ratings)
This gives:
# A tibble: 2 x 6
# Groups: Title [2]
Title Year Rated Released Runtime rottentomatoes_rating
<chr> <chr> <chr> <date> <chr> <chr>
1 Gully Boy 2019 Not Rated 2019-02-14 153 min 100%
2 Isn't It Romantic 2019 PG-13 2019-02-13 89 min 68%
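This relies on the Rotten Tomatoes score always being in the row directly after the "Rotten Tomatoes" label. Note that slice() ignores an NA index, so a movie without a Rotten Tomatoes entry is dropped from the result rather than kept with NA; join the result back to the distinct titles if you need those NA rows.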
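If explicit NA rows are needed, one possible variation is to index inside summarise() instead of using slice(). This is a minimal sketch, not the answerer's code: the small df below is a stand-in built from two movies of the question's sample, and it assumes each value sits directly after its provider label.

```r
library(dplyr)

# Stand-in data: one movie with a Rotten Tomatoes entry, one without
df <- tibble(
  Title   = c(rep("Gully Boy", 4), rep("The Wolf's Call", 2)),
  Ratings = c("Internet Movie Database", "8.4/10", "Rotten Tomatoes", "100%",
              "Internet Movie Database", "7.6/10")
)

rt <- df %>%
  group_by(Title) %>%
  summarise(
    # match() returns NA when the label is absent; indexing with NA yields NA
    rottentomatoes_rating = Ratings[match("Rotten Tomatoes", Ratings) + 1],
    .groups = "drop"
  )

rt
```

Because summarise() always returns one row per group, movies without a Rotten Tomatoes label stay in the result with NA.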

sumshyftw's answer is good.
But here is a data.table version if you simply want the Rotten Tomatoes percentages:
library(data.table)
dt <- dt[dt$Ratings %like% "%",]
dt <- setnames(dt, "Ratings", "rottentomatoes_rating")
Output :
# A tibble: 2 x 6
Title Year Rated Released Runtime rottentomatoes_rating
<chr> <chr> <chr> <date> <chr> <chr>
1 Isn't It Romantic 2019 PG-13 2019-02-13 89 min 68%
2 Gully Boy 2019 Not Rated 2019-02-14 153 min 100%
I used %like% "%" because I assume that the full data is just like your example.

New version that fills in NA when a rating is missing:
# using data.table
library(data.table)
dt <- as.data.table(df)
# Index will hold whether the row is a Provider eg Rotten Tomatoes, or a value
dt[, Index:=rep(c("Provider", "Value"), .N/2)]
# Need an index to bind these together
dt[, Provider.Id:=rep(1:(.N/2), each=2), by=Title]
dt[1:6,]
# segment out the Provider & Values in to columns
out <- dcast(dt, Title+Provider.Id~Index, value.var = "Ratings")
# drop the pairing index; Provider is still needed for the next dcast
out[, Provider.Id := NULL]
# now convert to full wide format
out_df <- as.data.frame(dcast(out, Title~Provider, value.var="Value", fill=NA))
out_df

To get all metrics with data.table
# using data.table
library(data.table)
dt <- as.data.table(df)
# groups the data set with by, and extracts the Ratings
# relies on the odd indices holding the name of the provider and
# the even ones holding the values. Only works if this holds.
# It can probably be optimised a bit. dcast converts from long to the required wide
# format
splitRatings <- function(Ratings){
# e.g. Ratings=dt$Ratings[1:6]
N <- length(Ratings)
split_dt <- data.table(DB=Ratings[1:N %% 2 == 1],
Values=Ratings[1-(1:N %% 2) == 1])
out <- dcast(split_dt, .~DB, value.var = "Values")
out[, ".":=NULL]
out
}
# applies the function based on the by clause, returning the table embedded
dt2 <- dt[, splitRatings(Ratings), by=.(Title, Year, Rated, Released, Runtime)]
# convert back
out <- as.data.frame(dt2)

Here is one version.
df %>%
mutate(Value = ifelse(str_detect(Ratings, "\\d"), Ratings, NA)) %>%
fill(Value, .direction = "up") %>%
filter(!str_detect(Ratings, "\\d")) %>%
spread(Ratings, Value)
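Since spread() is superseded in current tidyr, the same reshaping can be sketched with pivot_wider(). This is an illustration only: the small df is a stand-in for the question's data, the column names provider and value are made up for the example, and it assumes rows strictly alternate between provider label and value.

```r
library(dplyr)
library(tidyr)

# Stand-in data: label rows and value rows alternate
df <- tibble(
  Title   = c(rep("Gully Boy", 4), rep("The Wolf's Call", 2)),
  Ratings = c("Internet Movie Database", "8.4/10", "Rotten Tomatoes", "100%",
              "Internet Movie Database", "7.6/10")
)

wide <- tibble(
  Title    = df$Title[c(TRUE, FALSE)],    # odd rows carry the title of each pair
  provider = df$Ratings[c(TRUE, FALSE)],  # odd rows: provider names
  value    = df$Ratings[c(FALSE, TRUE)]   # even rows: rating values
) %>%
  pivot_wider(names_from = provider, values_from = value)

wide
```

Providers that are missing for a movie come out as NA in the wide table.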

Related

Using keywords to label a new column in R

I need to add a new column "Group" based on a set of keywords. I tried using %in% but did not get the result I expected.
I want to create an extra column named "Group" in my df data frame.
In this column, I want to label every row using some keywords
(from the keyword vectors, or maybe from another keywords data frame).
For example:
library(tibble)
df <- tibble(Title = c("Iran: How we are uncovering the protests and crackdowns",
"Deepak Nirula: The man who brought burgers and pizzas to India",
"Phil Foden: Manchester City midfielder signs new deal with club until 2027",
"The Danish tradition we all need now",
"Slovakia LGBT attack"),
Text = c("Iranian authorities have been disrupting the internet service in order to limit the flow of information and control the narrative, but Iranians are still sending BBC Persian videos of protests happening across the country via messaging apps. Videos are also being posted frequently on social media.
Before a video can be used in any reports, journalists need to establish where and when it was filmed.They can pinpoint the location by looking for landmarks and signs in the footage and checking them against satellite images, street-level photos and previous footage. Weather reports, the position of the sun and the angles of shadows it creates can be used to confirm the timing.",
"For anyone who grew up in capital Delhi during the 1970s and 1980s, Nirula's - run by the family of Deepak Nirula who died last week - is more than a restaurant. It's an emotion.
The restaurant transformed the eating-out culture in the city and introduced an entire generation to fast food, American style, before McDonald's and KFC came into the country. For many it was synonymous with its hot chocolate fudge.",
"Stockport-born Foden, who has scored two goals in 18 caps for England, has won 11 trophies with City, including four Premier League titles, four EFL Cups and the FA Cup.He has also won the Premier League Young Player of the Season and PFA Young Player of the Year awards in each of the last two seasons.
City boss Pep Guardiola handed him his debut as a 17-year-old and Foden credited the Spaniard for his impressive development over the last five years.",
"Norwegian playwright and poet Henrik Ibsen popularised the term /friluftsliv/ in the 1850s to describe the value of spending time in remote locations for spiritual and physical wellbeing. It literally translates to /open-air living/, and today, Scandinavians value connecting to nature in different ways – something we all need right now as we emerge from an era of lockdowns and inactivity.",
"The men were shot dead in the capital Bratislava on Wednesday, in a suspected hate crime.Organisers estimated that 20,000 people took part in the vigil, mourning the men's deaths and demanding action on LGBT rights.Slovak President Zuzana Caputova, who has raised the rainbow flag over her office, spoke at the event.")
)
keyword1 <- c("authorities", "Iranian", "Iraq", "control", "Riots")
keyword2 <- c("McDonald's","KFC", "McCafé", "fast food")
keyword3 <- c("caps", "trophies", "season", "seasons")
keyword4 <- c("travel", "landscape", "living", "spiritual")
keyword5 <- c("LGBT", "lesbian", "les", "rainbow", "Gay", "Bisexual","Transgender")
I need to add a new column "Group" based on those keywords:
if a row matches keyword1, label it "Politics",
if it matches keyword2, label it "Food",
if it matches keyword3, label it "Sport",
if it matches keyword4, label it "Travel",
if it matches keyword5, label it "LGBT".
Can the matching also ignore case?
Below is the expected output:
|Title|Text|Group|
|---|---|---|
|Iran: How..|Iranian...|Politics|
|Deepak Nir..|For any...|Food|
|Phil Foden..|Stockpo...|Sport|
|The Danish..|Norwegi...|Travel|
|Slovakia L..|The men...|LGBT|
Thanks to everyone for taking the time.
You could try this:
df %>%
rowwise %>%
mutate(
## add column with words found in title or text (splitting by non-word character):
words = list(strsplit(split = '\\W', paste(Title, Text)) %>% unlist),
group = {
categories <- list(keyword1, keyword2, keyword3, keyword4, keyword5)
## i indexes those items (=keyword vectors) of list 'categories'
## which share at least one word with column Title or Text (so that length > 0)
i <- categories %>% lapply(\(category) length(intersect(unlist(words), category))) %>% as.logical
## pick group name via index; join with ',' if more than one category applies
c('Politics', 'Food', 'Sport', 'Travel', 'LGBT')[i] %>% paste(collapse = ',')
}
)
output:
## # A tibble: 5 x 4
## # Rowwise:
## Title Text words group
## <chr> <chr> <lis> <chr>
## 1 Iran: How we are uncovering the protests and crackdowns "Ira~ <chr> Poli~
## 2 Deepak Nirula: The man who brought burgers and pizzas to In~ "For~ <chr> Food
## 3 Phil Foden: Manchester City midfielder signs new deal with ~ "Sto~ <chr> Sport
## 4 The Danish tradition we all need now "Nor~ <chr> Trav~
## 5 Slovakia LGBT attack "The~ <chr> LGBT
The basic idea is to turn each keyword* vector into a case-insensitive alternation pattern: (?i) makes the match case-insensitive, | joins the alternatives, and \\b before and after the group adds word boundaries, so that "caps" is matched but "capsize" is not. Nested ifelse statements then assign the Group labels. Note that the last ifelse falls through to "LGBT" for any row that matches none of the first four keyword sets, without testing keyword5:
library(tidyverse)
df %>%
mutate(
All = str_c(Title, Text, sep = " "),
Group = ifelse(str_detect(All, str_c("(?i)\\b(", str_c(keyword1, collapse = "|"), ")\\b")), "Politics",
ifelse(str_detect(All, str_c("(?i)\\b(", str_c(keyword2, collapse = "|"), ")\\b")), "Food",
ifelse(str_detect(All, str_c("(?i)\\b(", str_c(keyword3, collapse = "|"), ")\\b")), "Sport",
ifelse(str_detect(All, str_c("(?i)\\b(", str_c(keyword4, collapse = "|"), ")\\b")), "Travel", "LGBT"))))
) %>%
select(Group)
# A tibble: 5 × 1
Group
<chr>
1 Politics
2 Food
3 Sport
4 Travel
5 LGBT
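A variation on the nested ifelse approach is case_when(), which tests every keyword set explicitly and leaves unmatched rows as NA. The sketch below is illustrative only: the keywords list is shortened, make_pattern() is a hypothetical helper, and articles is made-up stand-in text rather than the question's data.

```r
library(dplyr)
library(stringr)

# Hypothetical named list: one keyword vector per label (shortened for the example)
keywords <- list(
  Politics = c("authorities", "Iranian"),
  Food     = c("KFC", "fast food")
)

# Hypothetical helper: case-insensitive, whole-word alternation pattern
make_pattern <- function(words) {
  regex(str_c("\\b(", str_c(words, collapse = "|"), ")\\b"), ignore_case = TRUE)
}

# Stand-in text; the case differs from the keywords on purpose
articles <- tibble(
  Text = c("iranian AUTHORITIES disrupted the internet",
           "before kfc came into the country",
           "nothing relevant here")
)

labelled <- articles %>%
  mutate(Group = case_when(
    str_detect(Text, make_pattern(keywords$Politics)) ~ "Politics",
    str_detect(Text, make_pattern(keywords$Food))     ~ "Food",
    TRUE                                              ~ NA_character_
  ))

labelled
```

As with nested ifelse, the first matching condition wins, so the order of the conditions decides the label when a row matches several keyword sets.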

Identify specific words and combinations of specific words in R

I have a requirement to identify specific words and combinations of specific words within a free text description column. My dataset contains two columns - a reference number and description. The data relates to repairs. I need to be able to determine which room the repair took place in for each reference number. This could include “kitchen”, “bathroom”, “dining room” amongst others.
The dataset looks like this
|reference|description            |
|---------|-----------------------|
|123456   |repair light in kitchen|
The output I require is something like this:
|reference|Room   |
|---------|-------|
|123456   |kitchen|
Any help very much appreciated.
This will pull the first match from room_vector in each description.
room_vector = c("kitchen", "bathroom", "dining room")
library(stringr)
your_data$room = str_extract(your_data$description, paste(room_vector, collapse = "|"))
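To make the "first match" behaviour concrete, here is a small sketch on made-up descriptions (the descriptions vector is an assumption for illustration, not the asker's data):

```r
library(stringr)

room_vector <- c("kitchen", "bathroom", "dining room")

# Made-up descriptions: one room, two rooms, and no room at all
descriptions <- c("repair light in kitchen",
                  "replace tap in bathroom and kitchen",
                  "paint hallway")

# str_extract() returns the leftmost match in each string, or NA if none
rooms <- str_extract(descriptions, paste(room_vector, collapse = "|"))
rooms
```

If a description can mention several rooms and all of them are needed, str_extract_all() returns every match as a list element instead.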
This version takes into account the combination with the word repair:
library(dplyr)
library(stringr)
my_vector <- c("kitchen", "bathroom", "dining room")
pattern <- paste(my_vector, collapse = "|")
df %>%
mutate(Room = case_when(
str_detect(description, "repair") &
str_detect(description, pattern) ~ str_extract(description, pattern)))
If you apply the code to this dataframe:
reference description
1 123456 live in light in kitchen
you will get:
reference description Room
1 123456 live in light in kitchen <NA>
This first version does not require the word "repair" to be present. It is similar to Gregor Thomas's solution:
library(dplyr)
library(stringr)
my_vector <- c("kitchen", "bathroom", "dining room")
pattern <- paste(my_vector, collapse = "|")
df %>%
mutate(Room = case_when(
str_detect(description, "repair") |
str_detect(description, pattern) ~ str_extract(description, pattern)))
reference description Room
1 123456 repair light in kitchen kitchen
Using Base R:
rooms <- c("kitchen", "bathroom", "dining room")
pat <- sprintf('.*repair.*(%s).*|.*', paste0(rooms, collapse = '|'))
transform(df, room = sub(pat, '\\1', reference))
reference room
1 repair bathroom bathroom
2 live bathroom
3 repair lights in kitchen kitchen
4 food in kitchen
5 tv in dining room
6 table repair dining room dining room
Data:
df <- structure(list(reference = c("repair bathroom", "live bathroom",
"repair lights in kitchen", "food in kitchen", "tv in dining room",
"table repair dining room ")), class = "data.frame", row.names = c(NA,
-6L))

A more elegant way to remove duplicated names (phrases) in the elements of a character string

I have a vector of organization names in a dataframe. Some of them are just fine, others have the name repeated twice in the same element. Also, when that name is repeated, there is no separating space so the name has a camelCase appearance.
For example (id column added for general dataframe referencing):
id org
1 Alpha Company
2 Bravo InstituteBravo Institute
3 Charlie Group
4 Delta IncorporatedDelta Incorporated
but it should look like:
id org
1 Alpha Company
2 Bravo Institute
3 Charlie Group
4 Delta Incorporated
I have a solution that gets the result I need--reproducible example code below. However, it seems a bit lengthy and not very elegant.
Does anyone have a better approach for the same results?
Bonus question: If organizations have 'types' included, such as Alpha Company, LLC, then my gsub() line to fix the camelCase does not work as well. Any suggestions on how to adjust the camelCase fix to account for the ", LLC" and still work with the rest of the solution?
Thanks in advance!
(Thanks to the OP & those who helped on the previous SO post about splitting camelCase strings in R)
# packages
library(stringr)
# toy data
df <- data.frame(id=1:4, org=c("Alpha Company", "Bravo InstituteBravo Institute", "Charlie Group", "Delta IncorporatedDelta Incorporated"))
# split up & clean camelCase words
df$org_fix <- gsub("([A-Z])", " \\1", df$org)
df$org_fix <- str_trim(str_squish(df$org_fix))
# temp vector with half the org names
df$org_half <- word(df$org_fix, start=1, end=(sapply(strsplit(df$org_fix, " "), length)/2)) # stringr::word
# double the temp vector
df$org_dbl <- paste(df$org_half, df$org_half)
# flag TRUE for orgs that contain duplicates in name
df$org_dup <- df$org_fix == df$org_dbl
# corrected the org names
df$org_fix <- ifelse(df$org_dup, df$org_half, df$org_fix)
# drop excess columns
df <- df[,c("id", "org_fix")]
# toy data for the bonus question
df2 <- data.frame(id=1:4, org=c("Alpha Company, LLC", "Bravo InstituteBravo Institute", "Charlie Group", "Delta IncorporatedDelta Incorporated"))
Another approach is to compare the first half of the string with the second half of the string. If equal, pick the first half. It also works if there are numbers, underscores or any other characters present in the company name.
org <- c("Alpha Company", "Bravo InstituteBravo Institute", "Charlie Group", "Delta IncorporatedDelta Incorporated", "WD40WD40", "3M3M")
ifelse(substring(org, 1, nchar(org) / 2) == substring(org, nchar(org) / 2 + 1, nchar(org)), substring(org, 1, nchar(org) / 2), org)
# [1] "Alpha Company" "Bravo Institute" "Charlie Group" "Delta Incorporated" "WD40" "3M"
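The same "first half equals second half" idea can also be sketched with a regex backreference. This is an alternative technique, not the answerer's code, and the org vector below is a shortened stand-in:

```r
# Stand-in names, including one with ", LLC" that must stay untouched
org <- c("Alpha Company", "Bravo InstituteBravo Institute",
         "Alpha Company, LLC", "3M3M")

# ^(.+)\1$ matches only when the whole string is some text repeated twice;
# the replacement keeps a single copy. Other strings are left unchanged.
fixed <- sub("^(.+)\\1$", "\\1", org, perl = TRUE)
fixed
```

One caveat: a name that legitimately consists of a repeated half (for example "ZZ") would also be shortened.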
You can use a regex, as in the line below (it keeps the first two capitalised words, so it assumes every name has exactly that shape):
my_df$org <- str_extract(string = my_df$org, pattern = "([A-Z][a-z]+ [A-Z][a-z]+){1}")
If all individual words start with a capital letter (not followed by another capital letter), then you can split on that. Keep only the unique elements, then paste and collapse. This will also work on the bonus LLC option:
org <- c("Alpha CompanyCompany , LLC", "Bravo InstituteBravo Institute", "Charlie Group", "Delta IncorporatedDelta Incorporated")
sapply(
lapply(
strsplit(gsub("[^A-Za-z0-9]", "", org),
"(?<=[^A-Z])(?=[A-Z])",
perl = TRUE),
unique),
paste0, collapse = " ")
[1] "Alpha Company LLC" "Bravo Institute" "Charlie Group" "Delta Incorporated"

Why is filter(str_detect() returning the wrong values using R?

I'm trying to match people that meet a certain job code, but there are many abbreviations (e.g., "dr." and "dir" are both director). For some reason, my code yields obviously wrong answers (e.g., it retains 'kvp coordinator' in the below example), and I can't figure out what's going on:
library(dplyr)
library(stringr)
test <- tibble(name = c("Corey", "Sibley", "Justin", "Kate", "Ruth", "Phil", "Sara"),
title = c("kvp coordinator", "manager", "director", "snr dr. of marketing", "drawing expert", "dir of finance", "direct to mail expert"))
test %>%
filter(str_detect(title, "chief|vp|president|director|dr\\.|dir\\ |dir\\."))
In the above example, only Justin, Kate, and Phil should be left, but somehow the filter doesn't drop Corey.
In addition to an answer, if you could explain why I'm getting this bizarre result, I'd really appreciate it.
The vp in the str_detect pattern also matches inside kvp, which is why Corey is kept in the output. Wrapping it in word boundaries (\\b) restricts the match to vp as a whole word:
test %>% filter(str_detect(title, "chief|\\bvp\\b|president|director|dr\\.|dir\\ |dir\\."))
# A tibble: 3 x 2
name title
<chr> <chr>
1 Justin director
2 Kate snr dr. of marketing
3 Phil dir of finance
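The effect of the word boundary can be checked in isolation. A minimal sketch with base R grepl() on made-up titles (the titles vector is an assumption for illustration):

```r
# Made-up job titles: "vp" inside another word, and "vp" as a word of its own
titles <- c("kvp coordinator", "vp of sales", "svp, finance")

# Plain substring match: every title contains the letters "vp"
loose <- grepl("vp", titles, perl = TRUE)

# \b requires a word boundary on both sides, so only the standalone "vp" matches
strict <- grepl("\\bvp\\b", titles, perl = TRUE)

data.frame(titles, loose, strict)
```

The same reasoning applies to the other short alternatives in the pattern, which may need boundaries too if they can occur inside longer words.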

How to extract matching values from a column in a dataframe when semicolons are present in R?

I have a large dataframe of published articles for which I would like to extract all articles relating to a few authors specified in a separate list. The authors in the dataframe are grouped together in one column separated by a ; . Not all authors need to match, I would like to extract any article which has one author matched to the list. An example is below.
Title<-c("A", "B", "C")
AU<-c("Mark; John; Paul", "Simone; Lily; Poppy", "Sarah; Luke")
df<-cbind(Title, AU)
authors<-as.character(c("Mark", "John", "Luke"))
df[sapply(strsplit((as.character(df$AU)), "; "), function(x) any(authors %in% x)),]
I would expect to return;
Title AU
A Mark; John
C Sarah; Luke
However with my large dataframe this command does not work to return all AU, it only returns rows which have a single AU not multiple ones.
Here is a dput from my larger dataframe of 5 rows
structure(list(AU = c("FOOKES PG;DEARMAN WR;FRANKLIN JA", "SIMS DG;DOWNHAM MAPS;MCQUILLIN J;GARDNER PS",
"TURNER BR", "BUTLER J;MARSH H;GOODARZI F", "OVERTON M"), TI = c("SOME ENGINEERING ASPECTS OF ROCK WEATHERING WITH FIELD EXAMPLES FROM DARTMOOR AND ELSEWHERE",
"RESPIRATORY SYNCYTIAL VIRUS INFECTION IN NORTH-EAST ENGLAND",
"TECTONIC AND CLIMATIC CONTROLS ON CONTINENTAL DEPOSITIONAL FACIES IN THE KAROO BASIN OF NORTHERN NATAL, SOUTH AFRICA",
"WORLD COALS: GENESIS OF THE WORLD'S MAJOR COALFIELDS IN RELATION TO PLATE TECTONICS",
"WEATHER AND AGRICULTURAL CHANGE IN ENGLAND, 1660-1739"), SO = c("QUARTERLY JOURNAL OF ENGINEERING GEOLOGY",
"BRITISH MEDICAL JOURNAL", "SEDIMENTARY GEOLOGY", "FUEL", "AGRICULTURAL HISTORY"
), JI = c("Q. J. ENG. GEOL.", "BRIT. MED. J.", "SEDIMENT. GEOL.",
"FUEL", "AGRICULTURAL HISTORY")
An option with str_extract_all:
library(dplyr)
library(stringr)
df %>%
mutate(Names = str_extract_all(Names, str_c(authors, collapse="|"))) %>%
filter(lengths(Names) > 0)
# Title Names
#1 A Mark, John
#2 C Luke
data
df <- data.frame(Title, Names)
In base R you can do it like so:
df[sapply(strsplit(as.character(df$Names), "; "), function(x) any(authors %in% x)), ]
Title Names
1 A Mark; John; Paul
3 C Sarah; Luke
This can be accomplished by subsetting on those Names that match the pattern specified in the first argument to the function grepl:
df[grepl(paste0(authors, collapse = "|"), df[,2]),]
Title Names
[1,] "A" "Mark; John; Paul"
[2,] "C" "Sarah; Luke"
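Judging only from the dput sample in the question, the real AU column separates authors with ";" and no following space, which would explain why splitting on "; " only matches single-author rows. A hedged sketch that tolerates both forms; the authors lookup values here are made up for illustration:

```r
# A few AU values copied from the question's dput sample
AU <- c("FOOKES PG;DEARMAN WR;FRANKLIN JA",
        "TURNER BR",
        "BUTLER J;MARSH H;GOODARZI F")

# Hypothetical lookup list, written the same way as the names in AU
authors <- c("DEARMAN WR", "MARSH H")

# Split on a semicolon followed by optional whitespace
author_list <- strsplit(AU, ";\\s*")

# Keep a row when at least one of its authors is in the lookup list
keep <- vapply(author_list, function(x) any(x %in% authors), logical(1))
AU[keep]
```

Unlike the grepl() approach, this compares whole author names, so a lookup entry cannot accidentally match part of a longer name.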

Resources