Extract words from string in R

Extract words from string in R - r

I am trying to extract pieces of the string and creating new variables from those matched patterns. I have tried numerous of functions from the "strings" package and can't seem to get the outcome. The example below is made up data. I want to take a character string and extract the pieces and store them into new columns of a new data frame.
example
ex <- c("The Accountant (2016)Crime (vodmovies112.blogspot.com.es)","Miss Peregrine's Home for Peculiar Children (2016)FantasySci-Fi (vodmovies112.blogspot.com.es),"Fantastic Beasts And Where To Find Them (2016) TSAdventure (openload.co)","Ben-Hur (2016) HDActionAdventure (vodmovies112.blogspot.com.es)","The Remains (2016) 1080p BlurayHorror (openload.co)" ,"Suicide Squad (2016) HDAction (openload.co)")
>ex
[1] "The Accountant (2016)Crime (vodmovies112.blogspot.com.es)"
[2] "Miss Peregrine's Home for Peculiar Children (2016)FantasySci-Fi (vodmovies112.blogspot.com.es)"
[3] "Fantastic Beasts And Where To Find Them (2016) TSAdventure (openload.co)"
[4] "Ben-Hur (2016) HDActionAdventure (vodmovies112.blogspot.com.es)"
[5] "The Remains (2016) 1080p BlurayHorror (openload.co)"
[6] "Suicide Squad (2016) HDAction (openload.co)"
genres <- c("Action","Adventure","Animation","Biography",
"Comedy","Crime","Documentary","Drama","Family",
"Fantasy","Film-Noir","History","Horror","Music",
"Musical","Mystery","Romance","Sci-Fi","Sport","Thriller",
"War","Western")
genres <- paste0("^",genres,"|")
genres[22] <- "^Western"
> genres
[1] "^Action|" "^Adventure|" "^Animation|" "^Biography|"
[5] "^Comedy|" "^Crime|" "^Documentary|" "^Drama|"
[9] "^Family|" "^Fantasy|" "^Film-Noir|" "^History|"
[13] "^Horror|" "^Music|" "^Musical|" "^Mystery|"
[17] "^Romance|" "^Sci-Fi|" "^Sport|" "^Thriller|"
[21] "^War|" "^Western"
trying to accomplish
> df
title year domain genre
1 The Accountant 2016 vodmovies112.blogspot.com.es Crime

Here is a possibility:
temp <- strsplit(ex, "\\(|\\)")
df <- setNames(as.data.frame(lapply(1:4,function(i) sapply(temp,"[",i)), stringsAsFactors = FALSE), c("title", "year", "genre", "domain"))
df <- df[ , c("title", "year", "domain", "genre")]
correct <- sapply(seq_along(df$genre), function(y) which(lengths(sapply(seq_along(genres), function(x) grep(genres[x], df$genre[y])))>0))
correct <- lapply(correct, function(x) paste0(genres[x], collapse = " "))
df$genre <- unlist(correct)
df
# title year domain genre
# 1 The Accountant 2016 vodmovies112.blogspot.com.es Crime
# 2 Miss Peregrine's Home for Peculiar Children 2016 vodmovies112.blogspot.com.es Fantasy Sci-Fi
# 3 Fantastic Beasts And Where To Find Them 2016 openload.co Adventure
# 4 Ben-Hur 2016 vodmovies112.blogspot.com.es Action Adventure
# 5 The Remains 2016 openload.co Horror
# 6 Suicide Squad 2016 openload.co Action
Basically, we split the vector ex in 4 parts, delimited by the parenthesis. We then create the data.frame df with the 4 columns.
The hardest part is to correctly extract the genre (as there might be more than one genre per movie). I use a combination of sapply, lapply and grep to do it. When that's done, we "correct" the column genre.
Here is your data:
ex <- c("The Accountant (2016)Crime (vodmovies112.blogspot.com.es)",
"Miss Peregrine's Home for Peculiar Children (2016)FantasySci-Fi (vodmovies112.blogspot.com.es)",
"Fantastic Beasts And Where To Find Them (2016) TSAdventure (openload.co)",
"Ben-Hur (2016) HDActionAdventure (vodmovies112.blogspot.com.es)",
"The Remains (2016) 1080p BlurayHorror (openload.co)", "Suicide Squad (2016) HDAction (openload.co)"
)
genres <- c("Action", "Adventure", "Animation", "Biography", "Comedy",
"Crime", "Documentary", "Drama", "Family", "Fantasy", "Film-Noir",
"History", "Horror", "Music", "Musical", "Mystery", "Romance",
"Sci-Fi", "Sport", "Thriller", "War", "Western")

Another possibility using tidyverse:
library(tidyverse)
data_frame(x = ex) %>%
extract(
x,
c("title", "year", "domain", "genre"),
"(^[^(]+)\\s+\\((\\d{4})\\)\\s*([^(]+)\\s+\\(([^)]+)"
)
## title year domain genre
## * <chr> <chr> <chr> <chr>
## 1 The Accountant 2016 Crime vodmovies112.blogspot.com.es
## 2 Miss Peregrine's Home for Peculiar Children 2016 FantasySci-Fi vodmovies112.blogspot.com.es
## 3 Fantastic Beasts And Where To Find Them 2016 TSAdventure openload.co
## 4 Ben-Hur 2016 HDActionAdventure vodmovies112.blogspot.com.es
## 5 The Remains 2016 1080p BlurayHorror openload.co
## 6 Suicide Squad 2016 HDAction openload.co

Related

using key word to label a new column in R

I need to mutate a new column "Group" by those keyword,
I tried to using %in% but not got data I expected.
I want to create an extra column names'group' in my df data frame.
In this column, I want lable every rows by using some keywords.
(from the keywords vector or may be another keywords dataframe)
For example:
library(tibble)
df <- tibble(Title = c("Iran: How we are uncovering the protests and crackdowns",
"Deepak Nirula: The man who brought burgers and pizzas to India",
"Phil Foden: Manchester City midfielder signs new deal with club until 2027",
"The Danish tradition we all need now",
"Slovakia LGBT attack"),
Text = c("Iranian authorities have been disrupting the internet service in order to limit the flow of information and control the narrative, but Iranians are still sending BBC Persian videos of protests happening across the country via messaging apps. Videos are also being posted frequently on social media.
Before a video can be used in any reports, journalists need to establish where and when it was filmed.They can pinpoint the location by looking for landmarks and signs in the footage and checking them against satellite images, street-level photos and previous footage. Weather reports, the position of the sun and the angles of shadows it creates can be used to confirm the timing.",
"For anyone who grew up in capital Delhi during the 1970s and 1980s, Nirula's - run by the family of Deepak Nirula who died last week - is more than a restaurant. It's an emotion.
The restaurant transformed the eating-out culture in the city and introduced an entire generation to fast food, American style, before McDonald's and KFC came into the country. For many it was synonymous with its hot chocolate fudge.",
"Stockport-born Foden, who has scored two goals in 18 caps for England, has won 11 trophies with City, including four Premier League titles, four EFL Cups and the FA Cup.He has also won the Premier League Young Player of the Season and PFA Young Player of the Year awards in each of the last two seasons.
City boss Pep Guardiola handed him his debut as a 17-year-old and Foden credited the Spaniard for his impressive development over the last five years.",
"Norwegian playwright and poet Henrik Ibsen popularised the term /friluftsliv/ in the 1850s to describe the value of spending time in remote locations for spiritual and physical wellbeing. It literally translates to /open-air living/, and today, Scandinavians value connecting to nature in different ways – something we all need right now as we emerge from an era of lockdowns and inactivity.",
"The men were shot dead in the capital Bratislava on Wednesday, in a suspected hate crime.Organisers estimated that 20,000 people took part in the vigil, mourning the men's deaths and demanding action on LGBT rights.Slovak President Zuzana Caputova, who has raised the rainbow flag over her office, spoke at the event.")
)
keyword1 <- c("authorities", "Iranian", "Iraq", "control", "Riots",)
keyword2 <- c("McDonald's","KFC", "McCafé", "fast food")
keyword3 <- c("caps", "trophies", "season", "seasons")
keyword4 <- c("travel", "landscape", "living", "spiritual")
keyword5 <- c("LGBT", "lesbian", "les", "rainbow", "Gay", "Bisexual","Transgender")
I need to mutate a new column "Group" by those keyword
if match keyword1 lable "Politics",
if match keyword2 lable "Food",
if match keyword3 lable "Sport",
if match keyword4 lable "Travel",
if match keyword5 lable "LGBT".
Can also ignore.case ?
Below is expected output
Title
Text
Group
Iran: How..
Iranian...
Politics
Deepak Nir..
For any...
Food
Phil Foden..
Stockpo...
Sport
The Danish..
Norwegi...
Travel
Slovakia L..
The men...
LGBT
Thanks to everyone who spending time.

you could try this:
df %>%
rowwise %>%
mutate(
## add column with words found in title or text (splitting by non-word character):
words = list(strsplit(split = '\\W', paste(Title, Text)) %>% unlist),
group = {
categories <- list(keyword1, keyword2, keyword3, keyword4, keyword5)
## i indexes those items (=keyword vectors) of list 'categories'
## which share at least one word with column Title or Text (so that length > 0)
i <- categories %>% lapply(\(category) length(intersect(unlist(words), category))) %>% as.logical
## pick group name via index; join with ',' if more than one category applies
c('Politics', 'Food', 'Sport', 'Travel', 'LGBD')[i] %>% paste(collapse = ',')
}
)
output:
## # A tibble: 5 x 4
## # Rowwise:
## Title Text words group
## <chr> <chr> <lis> <chr>
## 1 Iran: How we are uncovering the protests and crackdowns "Ira~ <chr> Poli~
## 2 Deepak Nirula: The man who brought burgers and pizzas to In~ "For~ <chr> Food
## 3 Phil Foden: Manchester City midfielder signs new deal with ~ "Sto~ <chr> Sport
## 4 The Danish tradition we all need now "Nor~ <chr> Trav~
## 5 Slovakia LGBT attack "The~ <chr> LGBD

Check this out - the basic idea is to define all keyword* case-insensitively (hence the (?i) in the patterns) as alternation patterns (hence the | for collapsing) with word boundaries (hence the \\b before and after the alternatives, to ensure that "caps" is matched but not for example "capsize") and use nested ifelse statements to assign the Group labels:
library(tidyverse)
df %>%
mutate(
All = str_c(Title, Text),
Group = ifelse(str_detect(All, str_c("(?i)\\b(", str_c(keyword1, collapse = "|"), ")\\b")), "Politics",
ifelse(str_detect(All, str_c("(?i)\\b(", str_c(keyword2, collapse = "|"), ")\\b")), "Food",
ifelse(str_detect(All, str_c("(?i)\\b(", str_c(keyword3, collapse = "|"), ")\\b")), "Sport",
ifelse(str_detect(All, str_c("(?i)\\b(", str_c(keyword4, collapse = "|"), ")\\b")), "Travel", "LGBT"))))
) %>%
select(Group)
# A tibble: 5 × 1
Group
<chr>
1 Politics
2 Food
3 Sport
4 Travel
5 LGBT

Structure character data into data frame

I used rvest package in R to scrape some web data but I am having a lot of trouble getting it into a usuable format.
My data currently looks like this:
test
[1] "v. Philadelphia"
[2] "TD GardenRegular Season"
[3] "PTS: 23. Jayson TatumREB: 10. M. MorrisAST: 7. Kyrie Irving"
[4] "PTS: 23. Joel EmbiidREB: 15. Ben SimmonsAST: 8. Ben Simmons"
[5] "100.7 - 83.4"
[6] "# Toronto"
[7] "Air Canada Centre Regular Season"
[8] "PTS: 21. Kyrie IrvingREB: 10. Al HorfordAST: 9. Al Horford"
[9] "PTS: 31. K. LeonardREB: 10. K. LeonardAST: 7. F. VanVleet"
[10] "115.6 - 103.3"
Can someone help me perform the correct operations in order to have it look like this (as a data frame) and provide the code, I would really appreciate it:
Opponent Venue
Philadelphia TD Garden
Toronto Air Canada Centre
I do not need any of the other information.

Let me know if there are any issues :)
# put your data in here
input <- c("v. Philadelphia", "TD GardenRegular Season",
"", "", "",
"# Toronto", "Air Canada Centre Regular Season",
"", "", "")
index <- 1:length(input)
# raw table format
out_raw <- data.frame(Opponent = input[index%%5==1],
Venue = input[index%%5==2])
# using stringi package
install.packages("stringi")
library(stringi)
# copy and clean up
out_clean <- out_raw
out_clean$Opponent <- stri_extract_last_regex(out_raw$Opponent, "(?<=\\s).*$")
out_clean$Venue <- trimws(gsub("Regular Season", "", out_raw$Venue))
out_clean

How to get global environment variables in a vector in R? [duplicate]

This question already has answers here:
How do I make a list of data frames?
(10 answers)
Closed 5 years ago.
I have a csv data file with 50000+ records stored in dataframe 'data'. I am creating data subsets based on 2 factors Segment & Market with below values:
customer_segments <- c('Consumer','Corporate','Home Office')
markets <- c('Africa','APAC','Canada','EMEA','EU','LATAM','US')
To get all subsets with 21 combinations for Market & Segement, I am using below nested for loops with assign & paste functions:
for(i in 1:length(markets)){
for(j in 1:length(customer_segments)){
assign(paste(markets[i],customer_segments[j],sep='_'),data[(data$Market == markets[i]) & (data$Segment == customer_segments[j]), ])
}
}
This creates 21 dataframes & assign them a name accordingly like Canada_Home Office etc.
Problem is I want to iterate over all these 21 dataframes to aggregate 3 attributes: Sales, Quantity & Profit on each but not sure how to address these dataframes in a loop? Maybe if I get all 21 dataframes in a vector I can iterate, but not sure if this is the best option.

Create combination of markets and customer_segments using expand.grid().
df <- expand.grid(markets, customer_segments)
head(df)
# Var1 Var2
# 1 Africa Consumer
# 2 APAC Consumer
# 3 Canada Consumer
# 4 EMEA Consumer
# 5 EU Consumer
# 6 LATAM Consumer
Vector of the combination of markets and customer_segments
df1 <- as.vector(paste(df$Var1,df$Var2, sep = " "))
df1
# [1] "Africa Consumer" "APAC Consumer" "Canada Consumer"
# [4] "EMEA Consumer" "EU Consumer" "LATAM Consumer"
# [7] "US Consumer" "Africa Corporate" "APAC Corporate"
# [10] "Canada Corporate" "EMEA Corporate" "EU Corporate"
# [13] "LATAM Corporate" "US Corporate" "Africa Home Office"
# [16] "APAC Home Office" "Canada Home Office" "EMEA Home Office"
# [19] "EU Home Office" "LATAM Home Office" "US Home Office"

Extract address components from coordiantes

I'm trying to reverse geocode with R. I first used ggmap but couldn't get it to work with my API key. Now I'm trying it with googleway.
newframe[,c("Front.lat","Front.long")]
Front.lat Front.long
1 -37.82681 144.9592
2 -37.82681 145.9592
newframe$address <- apply(newframe, 1, function(x){
google_reverse_geocode(location = as.numeric(c(x["Front.lat"],
x["Front.long"])),
key = "xxxx")
})
This extracts the variables as a list but I can't figure out the structure.
I'm struggling to figure out how to extract the address components listed below as variables in newframe
postal_code, administrative_area_level_1, administrative_area_level_2, locality, route, street_number
I would prefer each address component as a separate variable.

Google's API returns the response in JSON. Which, when translated into R naturally forms nested lists. Internally in googleway this is done through jsonlite::fromJSON()
In googleway I've given you the choice of returning the raw JSON or a list, through using the simplify argument.
I've deliberately returned ALL the data from Google's response and left it up to the user to extract the elements they're interested in through usual list-subsetting operations.
Having said all that, in the development version of googleway I've written a few functions to help accessing elements of various API calls. Here are three of them that may be useful to you
## Install the development version
# devtools::install_github("SymbolixAU/googleway")
res <- google_reverse_geocode(
location = c(df[1, 'Front.lat'], df[1, 'Front.long']),
key = apiKey
)
geocode_address(res)
# [1] "45 Clarke St, Southbank VIC 3006, Australia"
# [2] "Bank Apartments, 275-283 City Rd, Southbank VIC 3006, Australia"
# [3] "Southbank VIC 3006, Australia"
# [4] "Melbourne VIC, Australia"
# [5] "South Wharf VIC 3006, Australia"
# [6] "Melbourne, VIC, Australia"
# [7] "CBD & South Melbourne, VIC, Australia"
# [8] "Melbourne Metropolitan Area, VIC, Australia"
# [9] "Victoria, Australia"
# [10] "Australia"
geocode_address_components(res)
# long_name short_name types
# 1 45 45 street_number
# 2 Clarke Street Clarke St route
# 3 Southbank Southbank locality, political
# 4 Melbourne City Melbourne administrative_area_level_2, political
# 5 Victoria VIC administrative_area_level_1, political
# 6 Australia AU country, political
# 7 3006 3006 postal_code
geocode_type(res)
# [[1]]
# [1] "street_address"
#
# [[2]]
# [1] "establishment" "general_contractor" "point_of_interest"
#
# [[3]]
# [1] "locality" "political"
#
# [[4]]
# [1] "colloquial_area" "locality" "political"

After reverse geocoding into newframe$address the address components could be extracted further as follows:
# Make a boolean array of the valid ("OK" status) responses (other statuses may be "NO_RESULTS", "REQUEST_DENIED" etc).
sel <- sapply(c(1: nrow(newframe)), function(x){
newframe$address[[x]]$status == 'OK'
})
# Get the address_components of the first result (i.e. best match) returned per geocoded coordinate.
address.components <- sapply(c(1: nrow(newframe[sel,])), function(x){
newframe$address[[x]]$results[1,]$address_components
})
# Get all possible component types.
all.types <- unique(unlist(sapply(c(1: length(address.components)), function(x){
unlist(lapply(address.components[[x]]$types, function(l) l[[1]]))
})))
# Get "long_name" values of the address_components for each type present (the other option is "short_name").
all.values <- lapply(c(1: length(address.components)), function(x){
types <- unlist(lapply(address.components[[x]]$types, function(l) l[[1]]))
matches <- match(all.types, types)
values <- address.components[[x]]$long_name[matches]
})
# Bind results into a dataframe.
all.values <- do.call("rbind", all.values)
all.values <- as.data.frame(all.values)
names(all.values) <- all.types
# Add columns and update original data frame.
newframe[, all.types] <- NA
newframe[sel,][, all.types] <- all.values
Note that I've only kept the first type given per component, effectively skipping the "political" type as it appears in multiple components and is likely superfluous e.g. "administrative_area_level_1, political".

You can use ggmap:revgeocode easily; look below:
library(ggmap)
df <- cbind(df,do.call(rbind,
lapply(1:nrow(df),
function(i)
revgeocode(as.numeric(
df[i,2:1]), output = "more")
[c("administrative_area_level_1","locality","postal_code","address")])))
#output:
df
# Front.lat Front.long administrative_area_level_1 locality
# 1 -37.82681 144.9592 Victoria Southbank
# 2 -37.82681 145.9592 Victoria Noojee
# postal_code address
# 1 3006 45 Clarke St, Southbank VIC 3006, Australia
# 2 3833 Cec Dunns Track, Noojee VIC 3833, Australia
You can add "route" and "street_number" to the variables that you want to extract but as you can see the second address does not have street number and that will cause an error.
Note: You may also use sub and extract the information from the address.
Data:
df <- structure(list(Front.lat = c(-37.82681, -37.82681), Front.long =
c(144.9592, 145.9592)), .Names = c("Front.lat", "Front.long"), class = "data.frame",
row.names = c(NA, -2L))

function to retrieve data from one column based on data from another column in R

I have a farmers market data set and one of the columns is "MarketName" (column2) and one is "WIC" (column21). I wrote a function to retrieve the market name if the WIC column = Y. My output should be a list of 2,207 names however I am getting an output of 8,144 rows because for the rows where the WIC column = N, my output is showing NA. There are 45 columns and 8,144 rows but here is a fake data set with only two columns
MarketName <- c("Union Springs Famers Market","Union Square Farmers Market", "Union Square Greenmarket", "Union Street Farmers Market", "Unity Market Day Farmers", "University Farmers Market")
WIC <- c("Y","N","N","N","Y","Y")
data3 <- data.frame(MarketName, WIC)
data3$MarketName <- as.character(data3$MarketName)
data3$WIC <- as.character(data3$WIC)
This is my function (which could be the problem?)
marketacceptWIC <- function(mydf)
{
market <- 0
for(i in 1:length(data3$WIC))
{
if(data3[i,2] == "Y")
market[i] <- data3[i,1]
}
return(market)
}
This is a sample of the output that I am getting
[1] "Union Springs Famers Market" NA NA NA
[5] "Unity Market Day Farmers" "University Farmers Market"
What I want is just a list of the farmers markets that accept WIC
[1] "Union Springs Farmers' Market"
[2] "Unity Market Day Farmers"
[3] "University Farmers Market"

I don't think you need a for loop here. Try to subset on WIC column. If you only need the MarketName column then
data3[data3$WIC == "Y", ]$MarketName
[1] "Union Springs Famers Market" "Unity Market Day Farmers" "University Farmers Market"

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex