I'm trying to collect catalogue information based on text search. Search for a certain string in column Text, and put some description into a new column C_Organization.
Here is the sample data:
# load packages:
pacman::p_load("data.table",
"stringr")
# make sample data:
DE <- data.table(c("John", "Sussan", "Bill"),
c("Text contains MIT", "some text with Stanford University", "He graduated from Yale"))
colnames(DE) <- c("Name", "Text")
> DE
Name Text
1: John Text contains MIT
2: Sussan some text with Stanford University
3: Bill He graduated from Yale
search for a certain string and make a new data.table with new column:
mit <- DE[str_detect(DE$Text, "MIT"), .(Name, C_Organization = "MIT")]
yale <- DE[str_detect(DE$Text, "Yale"), .(Name, C_Organization = "Yale")]
stanford <- DE[str_detect(DE$Text, "Stanford"), .(Name, C_Organization = "Stanford")]
# bind them together:
combine_table <- rbind(mit, yale, stanford)
combine_table
Name C_Organization
1: John MIT
2: Bill Yale
3: Sussan Stanford
This pick-and-combine approach works fine but it seems a little bit tedious. Is it possible to do it in one step in data.table?
Edit
Due to my poor data analysis skill and the unclean data, I need to make the question clear:
The real data is a little complicated:
(1) There are cases where a person from more than two organizations, like Jack, UC Berkeley, Bell lab. and
(2) The same person of the same organization appears for different year, like Steven, MIT, 2011, Steven, MIT, 2014.
I want to figure out:
(1) How many people from each organization. If one person belongs to more than one organization, make the organization which appears most as his organization. (i.e. by popularity.) For example, John, MIT, AMS, Bell lab, if MIT appears 30 times, AMS 12 times, Bell lab 26 times. Then make MIT as his organization.
(2) count how many people for each year. This is not directly realted to my original question, but for later calculation, I don't want to throw away these records.
An alternative solution which takes into account for several matches in one text, operates rowwise and binds the matches together:
uni <- c("MIT","Yale","Stanford")
DE[,idx:=.I][, c_org := paste(uni[str_detect(Text, uni)], collapse=","), idx]
this gives:
> DE
Name Text idx c_org
1: John Text contains MIT 1 MIT
2: Sussan some text with Stanford University 2 Stanford
3: Bill He graduated from Yale, MIT, Stanford. 3 MIT,Yale,Stanford
4: Bill some text 4
The advantage of operating rowwise is evident when you have identical names in Name. When you do:
DE[, uni[str_detect(Text, uni)], Name]
you get not the correct result:
Name V1
1: John MIT
2: Sussan Stanford
3: Bill MIT
4: Bill Stanford
=> you don't know which Bill you have in the fourth row. Moreover, Yale isn't included for the 'first' Bill (i.e. row 3 of the original dataset).
Used data:
DE <- structure(list(Name = c("John", "Sussan", "Bill", "Bill"), Text = c("Text contains MIT", "some text with Stanford University", "He graduated from Yale, MIT, Stanford.", "some text")), .Names = c("Name", "Text"), row.names = c(NA, -4L), class = c("data.table", "data.frame"))
Related
I need to mutate a new column "Group" by those keyword,
I tried to using %in% but not got data I expected.
I want to create an extra column names'group' in my df data frame.
In this column, I want lable every rows by using some keywords.
(from the keywords vector or may be another keywords dataframe)
For example:
library(tibble)
df <- tibble(Title = c("Iran: How we are uncovering the protests and crackdowns",
"Deepak Nirula: The man who brought burgers and pizzas to India",
"Phil Foden: Manchester City midfielder signs new deal with club until 2027",
"The Danish tradition we all need now",
"Slovakia LGBT attack"),
Text = c("Iranian authorities have been disrupting the internet service in order to limit the flow of information and control the narrative, but Iranians are still sending BBC Persian videos of protests happening across the country via messaging apps. Videos are also being posted frequently on social media.
Before a video can be used in any reports, journalists need to establish where and when it was filmed.They can pinpoint the location by looking for landmarks and signs in the footage and checking them against satellite images, street-level photos and previous footage. Weather reports, the position of the sun and the angles of shadows it creates can be used to confirm the timing.",
"For anyone who grew up in capital Delhi during the 1970s and 1980s, Nirula's - run by the family of Deepak Nirula who died last week - is more than a restaurant. It's an emotion.
The restaurant transformed the eating-out culture in the city and introduced an entire generation to fast food, American style, before McDonald's and KFC came into the country. For many it was synonymous with its hot chocolate fudge.",
"Stockport-born Foden, who has scored two goals in 18 caps for England, has won 11 trophies with City, including four Premier League titles, four EFL Cups and the FA Cup.He has also won the Premier League Young Player of the Season and PFA Young Player of the Year awards in each of the last two seasons.
City boss Pep Guardiola handed him his debut as a 17-year-old and Foden credited the Spaniard for his impressive development over the last five years.",
"Norwegian playwright and poet Henrik Ibsen popularised the term /friluftsliv/ in the 1850s to describe the value of spending time in remote locations for spiritual and physical wellbeing. It literally translates to /open-air living/, and today, Scandinavians value connecting to nature in different ways – something we all need right now as we emerge from an era of lockdowns and inactivity.",
"The men were shot dead in the capital Bratislava on Wednesday, in a suspected hate crime.Organisers estimated that 20,000 people took part in the vigil, mourning the men's deaths and demanding action on LGBT rights.Slovak President Zuzana Caputova, who has raised the rainbow flag over her office, spoke at the event.")
)
keyword1 <- c("authorities", "Iranian", "Iraq", "control", "Riots",)
keyword2 <- c("McDonald's","KFC", "McCafé", "fast food")
keyword3 <- c("caps", "trophies", "season", "seasons")
keyword4 <- c("travel", "landscape", "living", "spiritual")
keyword5 <- c("LGBT", "lesbian", "les", "rainbow", "Gay", "Bisexual","Transgender")
I need to mutate a new column "Group" by those keyword
if match keyword1 lable "Politics",
if match keyword2 lable "Food",
if match keyword3 lable "Sport",
if match keyword4 lable "Travel",
if match keyword5 lable "LGBT".
Can also ignore.case ?
Below is expected output
Title
Text
Group
Iran: How..
Iranian...
Politics
Deepak Nir..
For any...
Food
Phil Foden..
Stockpo...
Sport
The Danish..
Norwegi...
Travel
Slovakia L..
The men...
LGBT
Thanks to everyone who spending time.
you could try this:
df %>%
rowwise %>%
mutate(
## add column with words found in title or text (splitting by non-word character):
words = list(strsplit(split = '\\W', paste(Title, Text)) %>% unlist),
group = {
categories <- list(keyword1, keyword2, keyword3, keyword4, keyword5)
## i indexes those items (=keyword vectors) of list 'categories'
## which share at least one word with column Title or Text (so that length > 0)
i <- categories %>% lapply(\(category) length(intersect(unlist(words), category))) %>% as.logical
## pick group name via index; join with ',' if more than one category applies
c('Politics', 'Food', 'Sport', 'Travel', 'LGBD')[i] %>% paste(collapse = ',')
}
)
output:
## # A tibble: 5 x 4
## # Rowwise:
## Title Text words group
## <chr> <chr> <lis> <chr>
## 1 Iran: How we are uncovering the protests and crackdowns "Ira~ <chr> Poli~
## 2 Deepak Nirula: The man who brought burgers and pizzas to In~ "For~ <chr> Food
## 3 Phil Foden: Manchester City midfielder signs new deal with ~ "Sto~ <chr> Sport
## 4 The Danish tradition we all need now "Nor~ <chr> Trav~
## 5 Slovakia LGBT attack "The~ <chr> LGBD
Check this out - the basic idea is to define all keyword* case-insensitively (hence the (?i) in the patterns) as alternation patterns (hence the | for collapsing) with word boundaries (hence the \\b before and after the alternatives, to ensure that "caps" is matched but not for example "capsize") and use nested ifelse statements to assign the Group labels:
library(tidyverse)
df %>%
mutate(
All = str_c(Title, Text),
Group = ifelse(str_detect(All, str_c("(?i)\\b(", str_c(keyword1, collapse = "|"), ")\\b")), "Politics",
ifelse(str_detect(All, str_c("(?i)\\b(", str_c(keyword2, collapse = "|"), ")\\b")), "Food",
ifelse(str_detect(All, str_c("(?i)\\b(", str_c(keyword3, collapse = "|"), ")\\b")), "Sport",
ifelse(str_detect(All, str_c("(?i)\\b(", str_c(keyword4, collapse = "|"), ")\\b")), "Travel", "LGBT"))))
) %>%
select(Group)
# A tibble: 5 × 1
Group
<chr>
1 Politics
2 Food
3 Sport
4 Travel
5 LGBT
I'm just having a really hard time figuring this out. Let's go straight to the data.
library(countrycode)
countries <- codelist$country.name.en #list of countries from the library
text <- "(France) Mr. Tom(CEO) from France is getting a new car. The car is a Toyota. His wife will get a Nissan. (Spain) No information available. (Chad) Mr. Smith (from N'Djamena) bought a new house. It's very nice."
I'd want to create a list of the parsed text (eg. from "(France)" to "Nissan.") for all three countries. The actual text is 30 pages long and each (countryName) is followed by several paragraphs of text.
All the countryNames are in parentheses but there might be other non-country parentheses in the text or countryNames without parentheses. But the general pattern is that each segment I want to parse starts with (countryName1) and ends with (countryName2)
Output:
[[1]]
[1] "(France) Mr. Tom(CEO) from France is getting a new car. The car is a Toyota. His wife will get a Nissan."
[[2]]
[1] "(Spain) No information available."
[[3]]
[1] "(Chad) Mr.Smith (from N'Djamena) bought a new house. It's very nice."
If all the countries in the 'text' matches with the reference vector, we may paste the reference vector into a single string to split the string just before the country match
as.list(strsplit(text, sprintf('(?<=\\s)(?=(%s))',
paste(paste0("\\(", countries), collapse = "|")), perl = TRUE)[[1]])
-output
[[1]]
[1] "(France) Mr. Tom(CEO) from France is getting a new car. The car is a Toyota. His wife will get a Nissan. "
[[2]]
[1] "(Spain) No information available. "
[[3]]
[1] "(Chad) Mr. Smith (from N'Djamena) bought a new house. It's very nice."
I have a large dataframe of published articles for which I would like to extract all articles relating to a few authors specified in a separate list. The authors in the dataframe are grouped together in one column separated by a ; . Not all authors need to match, I would like to extract any article which has one author matched to the list. An example is below.
Title<-c("A", "B", "C")
AU<-c("Mark; John; Paul", "Simone; Lily; Poppy", "Sarah; Luke")
df<-cbind(Title, AU)
authors<-as.character(c("Mark", "John", "Luke"))
df[sapply(strsplit((as.character(df$AU)), "; "), function(x) any(authors %in% x)),]
I would expect to return;
Title AU
A Mark; John
C Sarah; Luke
However with my large dataframe this command does not work to return all AU, it only returns rows which have a single AU not multiple ones.
Here is a dput from my larger dataframe of 5 rows
structure(list(AU = c("FOOKES PG;DEARMAN WR;FRANKLIN JA", "SIMS DG;DOWNHAM MAPS;MCQUILLIN J;GARDNER PS",
"TURNER BR", "BUTLER J;MARSH H;GOODARZI F", "OVERTON M"), TI = c("SOME ENGINEERING ASPECTS OF ROCK WEATHERING WITH FIELD EXAMPLES FROM DARTMOOR AND ELSEWHERE",
"RESPIRATORY SYNCYTIAL VIRUS INFECTION IN NORTH-EAST ENGLAND",
"TECTONIC AND CLIMATIC CONTROLS ON CONTINENTAL DEPOSITIONAL FACIES IN THE KAROO BASIN OF NORTHERN NATAL, SOUTH AFRICA",
"WORLD COALS: GENESIS OF THE WORLD'S MAJOR COALFIELDS IN RELATION TO PLATE TECTONICS",
"WEATHER AND AGRICULTURAL CHANGE IN ENGLAND, 1660-1739"), SO = c("QUARTERLY JOURNAL OF ENGINEERING GEOLOGY",
"BRITISH MEDICAL JOURNAL", "SEDIMENTARY GEOLOGY", "FUEL", "AGRICULTURAL HISTORY"
), JI = c("Q. J. ENG. GEOL.", "BRIT. MED. J.", "SEDIMENT. GEOL.",
"FUEL", "AGRICULTURAL HISTORY")
An option with str_extract
library(dplyr)
library(stringr)
df %>%
mutate(Names = str_extract_all(Names, str_c(authors, collapse="|"))) %>%
filter(lengths(Names) > 0)
# Title Names
#1 A Mark, John
#2 C Luke
data
df <- data.frame(Title, Names)
in Base-R you can access it like so
df[sapply(strsplit(as.character(df$Names, "; "), function(x) any(authors %in% x)),]
Title Names
1 A Mark; John; Paul
3 C Sarah; Luke
This can be accomplished by subsetting on those Names that match the pattern specified in the first argument to the function grepl:
df[grepl(paste0(authors, collapse = "|"), df[,2]),]
Title Names
[1,] "A" "Mark; John; Paul"
[2,] "C" "Sarah; Luke"
I have the task of searching through text, replacing peoples names and nicknames with a generic character string.
Here is the structure of my data frame of names and corresponding nicknames:
names <- c("Thomas","Thomas","Abigail","Abigail","Abigail")
nicknames <- c("Tom","Tommy","Abi","Abby","Abbey")
df_name_nick <- data.frame(names,nicknames)
Here is the structure of my data frame containing text
text_names <- c("Abigail","Thomas","Abigail","Thomas","Colin")
text_comment <- c("Tommy sits next to Abbey","As a footballer Tommy is very good","Abby is a mature young lady","Tom is a handsome man","Tom is friends with Colin and Abi")
df_name_comment <- data.frame(text_names,text_comment)
Giving these dataframes
df_name_nick:
names nicknames
1 Thomas Tom
2 Thomas Tommy
3 Abigail Abi
4 Abigail Abby
5 Abigail Abbey
df_name_comment:
text_names text_comment
1 Abigail Tommy sits next to Abbey
2 Thomas As a footballer Tommy is very good
3 Abigail Abby is a mature young lady
4 Thomas Tom is a handsome man
5 Colin Tom is friends with Colin and Abi
I am looking for a routine that will search through each row of df_name_comment and use the df_name_comment$text_names to look up the corresponding nickname from df_name_nick and replace it with XXX.
Note for each person's name there can be several nicknames.
Note that in each text comment, only the appropriate name for that row is replaced, so that we would get this as output:
Abigail "Tommy sits next to XXX"
Thomas "As a footballer, XXX is very good"
Abigail "XXX is a mature young lady"
Thomas "XXX is a handsome man"
Colin "Tom is friends with Colin and Abi"
I’m thinking this will require a cunning combination of gsubs, matches and apply functions (either mapply, sapply, etc)
I've searched on Stack Overflow for something similar to this request and can only find very specific regex solutions based on data frames with unique row elements, and not something that I think will work with generic text lookups and gsubs via multiple nicknames.
Can anyone please help me solve my predicament?
With thanks
Nevil
(newbie R programmer since Jan 2017)
Here is an idea via base R. We basically paste the nicknames together for each name, collapsed by | so as to pass it as regex in gsub and replace the matched words of each comment with XXX. We use mapply to do that after we merge our aggregated nicknames with df_name_comment.
d1 <- aggregate(nicknames ~ names, df_name_nick, paste, collapse = '|')
d2 <- merge(df_name_comment, d1, by.x = 'text_names', by.y = 'names', all = TRUE)
d2$nicknames[is.na(d2$nicknames)] <- 0
d2$text_comment <- mapply(function(x, y) gsub(x, 'XXX', y), d2$nicknames, d2$text_comment)
d2$nicknames <- NULL
d2
Which gives,
text_names text_comment
1 Abigail Tommy sits next to XXX
2 Abigail XXX is a mature young lady
3 Colin Tom is friends with Colin and Abi
4 Thomas As a footballer XXX is very good
5 Thomas XXX is a handsome man
Note1: Replacing NA in nicknames with 0 is due to the fact that NA (which is the default fill in merge for unmatched elements) would convert the comment string to NA as well when passed in gsub
Note2 The order is also changed due to merge, but you can sort as you wish as per usual.
Note3 Is better to have your variables as characters rather than factors. So you either read the data frames with stringsAsFactors = FALSE or convert via,
df_name_comment[] <- lapply(df_name_comment, as.character)
df_name_nick[] <- lapply(df_name_nick, as.character)
EDIT
Based on your comment, we can simply match the comments' names with our aggregated data set, save that in a vector and use mapply directly on the original data frame, without having to merge and then drop variables, i.e.
#d1 as created above
v1 <- d1$nicknames[match(df_name_comment$text_names, d1$names)]
v1[is.na(v1)] <- 0
df_name_comment$text_comment <- mapply(function(x, y) gsub(x, 'XXX', y),
v1, df_name_comment$text_comment)
Hope this helps!
l <- apply(df_name_comment, 1, function(x)
ifelse(length(df_name_nick[df_name_nick$names==x["text_names"], "nicknames"]) > 0,
gsub(paste(df_name_nick[df_name_nick$names==x["text_names"], "nicknames"], collapse="|"),'XXX', x["text_comment"]),
x["text_comment"]))
df_name_comment$text_comment <- as.list.data.frame(l)
Don't forget to let us know if it solved your problem :)
Data
df_name_nick <- data.frame(names,nicknames,stringsAsFactors = F)
df_name_comment <- data.frame(text_names,text_comment,stringsAsFactors = F)
Solution 2
EDIT: In this initial solution I manually checked with grepl if the nickname was present, and then gsubbed with one of the matching ID's. I knew the '|' operator worked with grepl, but not with gsub. So credits to Sotos for that idea.
df = df_name_comment
for(i in 1:nrow(df))
{
matching_nicknames = df_name_nick$nicknames[df_name_nick$names==df$text_names[i]]
if(length(matching_nicknames)>0)
{
df$text_comment[i] = mapply(sub, pattern=paste(paste0("\\b",matching_nicknames,"\\b"),collapse="|"), "XXX", df$text_comment[i])
}
}
Output
text_names text_comment
1 Abigail Tommy sits next to XXX
2 Thomas As a footballer XXX is very good
3 Abigail XXX is a mature young lady
4 Thomas XXX is a handsome man
5 Colin Tom is friends with Colin and Abi
Hope this helps!
I have a dataset in R that lists out a bunch of company names and want to remove words like "Inc", "Company", "LLC", etc. for part of a clean-up effort. I have the following sample data:
sampleData
Location Company
1 New York, NY XYZ Company
2 Chicago, IL Consulting Firm LLC
3 Miami, FL Smith & Co.
Words I do not want to include in my output:
stopwords = c("Inc","inc","co","Co","Inc.","Co.","LLC","Corporation","Corp","&")
I built the following function to break out each word, remove the stopwords, and then bring the words back together, but it is not iterating through each row of the dataset.
removeWords <- function(str, stopwords) {
x <- unlist(strsplit(str, " "))
paste(x[!x %in% stopwords], collapse = " ")
}
removeWords(sampleData$Company,stopwords)
The output for the above function looks like this:
[1] "XYZ Company Consulting Firm Smith"
T
he output should be:
Location Company
1 New York, NY XYZ Company
2 Chicago, IL Consulting Firm
3 Miami, FL Smith
Any help would be appreciated.
We can use 'tm' package
library(tm)
stopwords = readLines('stopwords.txt') #Your stop words file
x = df$company #Company column data
x = removeWords(x,stopwords) #Remove stopwords
df$company_new <- x #Add the list as new column and check
With a little check on the stopwords( having inserted "\" in Co. to avoid regex, spaces ): (But the previous answer should be preferred if you dont want to keep an eye on stopwords)
stopwords = c("Inc","inc","co ","Co ","Inc."," Co\\.","LLC","Corporation","Corp","&")
gsub(paste0(stopwords,collapse = "|"),"", df$Company)
[1] "XYZ Company" "Consulting Firm " "Smith "
df$Company <- gsub(paste0(stopwords,collapse = "|"),"", df$Company)
# df
# Location Company
#1 New York, NY XYZ Company
#2 Chicago, IL Consulting Firm
#3 Miami, FL Smith