I'm trying to create a key value store with the key being entities and the value being the average sentiment score of the entity in news articles.
I have a dataframe containing news articles and a list of entities called organizations1 indentified in those news articles by a classifier. The first rows of the organization1 list contains the entities identified in the article on the first row of the news_us dataframe. I'm trying to iterate through the organizations list and creating a key value store with the key being the entity in the organization1 list and the value being the sentiment score of the news description in which the entity was mentioned. The code I have doesn't change the scores in the sentiment list and I don't know why. My first guess was that I would have to use the $ operator on the sentiment list to add the value but that didn't change anything either. Here is the code I have so far:
library(syuzhet)
sentiment <- list()
organization1 <- list(NULL, "US", "Bath", "Animal Crossing", "World Health Organization",
NULL, c("Microsoft", "Facebook"))
news_us <- structure(list(title = c("Stocks making the biggest moves after hours: Bed Bath & Beyond, JC Penney, United Airlines and more - CNBC",
"Los Angeles mayor says 'very difficult to see' large gatherings like concerts and sporting events until 2021 - CNN",
"Bed Bath & Beyond shares rise as earnings top estimates, retailer plans to maintain some key investments - CNBC",
"6 weeks with Animal Crossing: New Horizons reveals many frustrations - VentureBeat",
"Timeline: How Trump And WHO Reacted At Key Moments During The Coronavirus Crisis : Goats and Soda - NPR",
"Michigan protesters turn out against Whitmer’s strict stay-at-home order - POLITICO"
), description = c("Check out the companies making headlines after the bell.",
"Los Angeles Mayor Eric Garcetti said Wednesday large gatherings like sporting events or concerts may not resume in the city before 2021 as the US grapples with mitigating the novel coronavirus pandemic.",
"Bed Bath & Beyond said that its results in 2020 \"will be unfavorably impacted\" by the crisis, and so it will not be offering a first-quarter nor full-year outlook.",
"Six weeks with Animal Crossing: New Horizons has helped to illuminate some of the game's shortcomings that weren't obvious in our first review.",
"How did the president respond to key moments during the pandemic? And how did representatives of the World Health Organization respond during the same period?",
"Many demonstrators, some waving Trump campaign flags, ignored organizers‘ pleas to stay in their cars and flooded the streets of Lansing, the state capital."
), name = c("CNBC", "CNN", "CNBC", "Venturebeat.com", "Npr.org",
"Politico")), na.action = structure(c(`35` = 35L, `95` = 95L,
`137` = 137L, `154` = 154L, `213` = 213L, `214` = 214L, `232` = 232L,
`276` = 276L, `321` = 321L), class = "omit"), row.names = c(NA,
6L), class = "data.frame")
i = as.integer(0)
for(index in organizations1){
i <- i+1
if(is.character(index)) { #if entity is not null/NA
val <- get_sentiment(news_us$description[i], method = "afinn")
#print(val)
print(sentiment[[index[1]]])
sentiment[[index[1]]] <- sentiment[[index[1]]]+val
}
}
Here is the sentiment list after running the above code chunk:
$US
integer(0)
$Bath
integer(0)
$`Animal Crossing`
integer(0)
$`World Health Organization`
integer(0)
$`Apple TV`
integer(0)
$`Pittsburgh Steelers`
integer(0)
Whereas I would like it to look something like:
$US
1.3
$Bath
0.3
$`Animal Crossing`
2.4
$`World Health Organization`
1.2
$`Apple TV`
-0.7
$`Pittsburgh Steelers`
0.3
The value column can have multiple values for multiple entities identified in the article.
I am not sure how organization1 and news_us$description are related but perhaps, you meant to use it something like this?
library(syuzhet)
setNames(lapply(news_us$description, get_sentiment), unlist(organization1))
#$US
#[1] 0
#$Bath
#[1] -0.4
#$`Animal Crossing`
#[1] -0.1
#$`World Health Organization`
#[1] 1.1
#$Microsoft
#[1] -0.6
#$Facebook
#[1] -1.9
Related
I need to mutate a new column "Group" by those keyword,
I tried to using %in% but not got data I expected.
I want to create an extra column names'group' in my df data frame.
In this column, I want lable every rows by using some keywords.
(from the keywords vector or may be another keywords dataframe)
For example:
library(tibble)
df <- tibble(Title = c("Iran: How we are uncovering the protests and crackdowns",
"Deepak Nirula: The man who brought burgers and pizzas to India",
"Phil Foden: Manchester City midfielder signs new deal with club until 2027",
"The Danish tradition we all need now",
"Slovakia LGBT attack"),
Text = c("Iranian authorities have been disrupting the internet service in order to limit the flow of information and control the narrative, but Iranians are still sending BBC Persian videos of protests happening across the country via messaging apps. Videos are also being posted frequently on social media.
Before a video can be used in any reports, journalists need to establish where and when it was filmed.They can pinpoint the location by looking for landmarks and signs in the footage and checking them against satellite images, street-level photos and previous footage. Weather reports, the position of the sun and the angles of shadows it creates can be used to confirm the timing.",
"For anyone who grew up in capital Delhi during the 1970s and 1980s, Nirula's - run by the family of Deepak Nirula who died last week - is more than a restaurant. It's an emotion.
The restaurant transformed the eating-out culture in the city and introduced an entire generation to fast food, American style, before McDonald's and KFC came into the country. For many it was synonymous with its hot chocolate fudge.",
"Stockport-born Foden, who has scored two goals in 18 caps for England, has won 11 trophies with City, including four Premier League titles, four EFL Cups and the FA Cup.He has also won the Premier League Young Player of the Season and PFA Young Player of the Year awards in each of the last two seasons.
City boss Pep Guardiola handed him his debut as a 17-year-old and Foden credited the Spaniard for his impressive development over the last five years.",
"Norwegian playwright and poet Henrik Ibsen popularised the term /friluftsliv/ in the 1850s to describe the value of spending time in remote locations for spiritual and physical wellbeing. It literally translates to /open-air living/, and today, Scandinavians value connecting to nature in different ways – something we all need right now as we emerge from an era of lockdowns and inactivity.",
"The men were shot dead in the capital Bratislava on Wednesday, in a suspected hate crime.Organisers estimated that 20,000 people took part in the vigil, mourning the men's deaths and demanding action on LGBT rights.Slovak President Zuzana Caputova, who has raised the rainbow flag over her office, spoke at the event.")
)
keyword1 <- c("authorities", "Iranian", "Iraq", "control", "Riots",)
keyword2 <- c("McDonald's","KFC", "McCafé", "fast food")
keyword3 <- c("caps", "trophies", "season", "seasons")
keyword4 <- c("travel", "landscape", "living", "spiritual")
keyword5 <- c("LGBT", "lesbian", "les", "rainbow", "Gay", "Bisexual","Transgender")
I need to mutate a new column "Group" by those keyword
if match keyword1 lable "Politics",
if match keyword2 lable "Food",
if match keyword3 lable "Sport",
if match keyword4 lable "Travel",
if match keyword5 lable "LGBT".
Can also ignore.case ?
Below is expected output
Title
Text
Group
Iran: How..
Iranian...
Politics
Deepak Nir..
For any...
Food
Phil Foden..
Stockpo...
Sport
The Danish..
Norwegi...
Travel
Slovakia L..
The men...
LGBT
Thanks to everyone who spending time.
you could try this:
df %>%
rowwise %>%
mutate(
## add column with words found in title or text (splitting by non-word character):
words = list(strsplit(split = '\\W', paste(Title, Text)) %>% unlist),
group = {
categories <- list(keyword1, keyword2, keyword3, keyword4, keyword5)
## i indexes those items (=keyword vectors) of list 'categories'
## which share at least one word with column Title or Text (so that length > 0)
i <- categories %>% lapply(\(category) length(intersect(unlist(words), category))) %>% as.logical
## pick group name via index; join with ',' if more than one category applies
c('Politics', 'Food', 'Sport', 'Travel', 'LGBD')[i] %>% paste(collapse = ',')
}
)
output:
## # A tibble: 5 x 4
## # Rowwise:
## Title Text words group
## <chr> <chr> <lis> <chr>
## 1 Iran: How we are uncovering the protests and crackdowns "Ira~ <chr> Poli~
## 2 Deepak Nirula: The man who brought burgers and pizzas to In~ "For~ <chr> Food
## 3 Phil Foden: Manchester City midfielder signs new deal with ~ "Sto~ <chr> Sport
## 4 The Danish tradition we all need now "Nor~ <chr> Trav~
## 5 Slovakia LGBT attack "The~ <chr> LGBD
Check this out - the basic idea is to define all keyword* case-insensitively (hence the (?i) in the patterns) as alternation patterns (hence the | for collapsing) with word boundaries (hence the \\b before and after the alternatives, to ensure that "caps" is matched but not for example "capsize") and use nested ifelse statements to assign the Group labels:
library(tidyverse)
df %>%
mutate(
All = str_c(Title, Text),
Group = ifelse(str_detect(All, str_c("(?i)\\b(", str_c(keyword1, collapse = "|"), ")\\b")), "Politics",
ifelse(str_detect(All, str_c("(?i)\\b(", str_c(keyword2, collapse = "|"), ")\\b")), "Food",
ifelse(str_detect(All, str_c("(?i)\\b(", str_c(keyword3, collapse = "|"), ")\\b")), "Sport",
ifelse(str_detect(All, str_c("(?i)\\b(", str_c(keyword4, collapse = "|"), ")\\b")), "Travel", "LGBT"))))
) %>%
select(Group)
# A tibble: 5 × 1
Group
<chr>
1 Politics
2 Food
3 Sport
4 Travel
5 LGBT
I have tried to resolve this problem all day but without any improvement.
I am trying to replace the following abbreviations into the following desired words in my dataset:
-Abbreviations: USA, H2O, Type 3, T3, bp
Desired words United States of America, Water, Type 3 Disease, Type 3 Disease, blood pressure
The input data is for example
[1] I have type 3, its considered the highest severe stage of the disease.
[2] Drinking more H2O will make your skin glow.
[3] Do I have T2 or T3? Please someone help.
[4] We don't have this on the USA but I've heard that will be available in the next 3 years.
[5] Having a high bp means that I will have to look after my diet?
The desired output is
[1] i have type 3 disease, its considered the highest severe stage
of the disease.
[2] drinking more water will make your skin glow.
[3] do I have type 3 disease? please someone help.
[4] we don't have this in the united states of america but i've heard that will be available in the next 3 years.
[5] having a high blood pressure means that I will have to look after my diet?
I have tried the following code but without success:
data= read.csv(C:"xxxxxxx, header= TRUE")
lowercase= tolower(data$MESSAGE)
dict=list("\\busa\\b"= "united states of america", "\\bh2o\\b"=
"water", "\\btype 3\\b|\\bt3\\"= "type 3 disease", "\\bbp\\b"=
"blood pressure")
for(i in 1:length(dict1)){
lowercasea= gsub(paste0("\\b", names(dict)[i], "\\b"),
dict[[i]], lowercase)}
I know that I am definitely doing something wrong. Could anyone guide me on this? Thank you in advance.
If you need to replace only whole words (e.g. bp in Some bp. and not in bpcatalogue) you will have to build a regular expression out of the abbreviations using word boundaries, and - since you have multiword abbreviations - also sort them by length in the descending order (or, e.g. type may trigger a replacement before type three).
An example code:
abbreviations <- c("USA", "H2O", "Type 3", "T3", "bp")
desired_words <- c("United States of America", "Water", "Type 3 Disease", "Type 3 Disease", "blood pressure")
df <- data.frame(abbreviations, desired_words, stringsAsFactors = FALSE)
x <- 'Abbreviations: USA, H2O, Type 3, T3, bp'
sort.by.length.desc <- function (v) v[order( -nchar(v)) ]
library(stringr)
str_replace_all(x,
paste0("\\b(",paste(sort.by.length.desc(abbreviations), collapse="|"), ")\\b"),
function(z) df$desired_words[df$abbreviations==z][[1]][1]
)
The paste0("\\b(",paste(sort.by.length.desc(abbreviations), collapse="|"), ")\\b") code creates a regex like \b(Type 3|USA|H2O|T3|bp)\b, it matches Type 3, or USA, etc. as whole word only as \b is a word boundary. If a match is found, stringr::str_replace_all replaces it with the corresponding desired_word.
See the R demo online.
I'm trying to create a key-value store with the key being entities and the value being the average sentiment score of the entity in news articles.
I have a dataframe containing news articles and a list of entities called organizations1 identified in those news articles by a classifier. The first row of the organization1 list contains the entities identified in the article on the first row of the news_us dataframe. I'm trying to iterate through the organization list and creating a key-value store with the key being the entity name in the organization1 list and the value being the sentiment score of the news description in which the entity was mentioned.
I can get the sentiment scores for the entity from an article but I wanted to add them together and average the sentiment score.
library(syuzhet)
sentiment <- list()
organization1 <- list(NULL, "US", "Bath", "Animal Crossing", "World Health Organization",
NULL, c("Microsoft", "Facebook"))
news_us <- structure(list(title = c("Stocks making the biggest moves after hours: Bed Bath & Beyond, JC Penney, United Airlines and more - CNBC",
"Los Angeles mayor says 'very difficult to see' large gatherings like concerts and sporting events until 2021 - CNN",
"Bed Bath & Beyond shares rise as earnings top estimates, retailer plans to maintain some key investments - CNBC",
"6 weeks with Animal Crossing: New Horizons reveals many frustrations - VentureBeat",
"Timeline: How Trump And WHO Reacted At Key Moments During The Coronavirus Crisis : Goats and Soda - NPR",
"Michigan protesters turn out against Whitmer’s strict stay-at-home order - POLITICO"
), description = c("Check out the companies making headlines after the bell.",
"Los Angeles Mayor Eric Garcetti said Wednesday large gatherings like sporting events or concerts may not resume in the city before 2021 as the US grapples with mitigating the novel coronavirus pandemic.",
"Bed Bath & Beyond said that its results in 2020 \"will be unfavorably impacted\" by the crisis, and so it will not be offering a first-quarter nor full-year outlook.",
"Six weeks with Animal Crossing: New Horizons has helped to illuminate some of the game's shortcomings that weren't obvious in our first review.",
"How did the president respond to key moments during the pandemic? And how did representatives of the World Health Organization respond during the same period?",
"Many demonstrators, some waving Trump campaign flags, ignored organizers‘ pleas to stay in their cars and flooded the streets of Lansing, the state capital."
), name = c("CNBC", "CNN", "CNBC", "Venturebeat.com", "Npr.org",
"Politico")), na.action = structure(c(`35` = 35L, `95` = 95L,
`137` = 137L, `154` = 154L, `213` = 213L, `214` = 214L, `232` = 232L,
`276` = 276L, `321` = 321L), class = "omit"), row.names = c(NA,
6L), class = "data.frame")
setNames(lapply(news_us$description, get_sentiment), unlist(organization1))
#$US
#[1] 0
#$Bath
#[1] -0.4
#$`Animal Crossing`
#[1] -0.1
#$`World Health Organization`
#[1] 1.1
#$Microsoft
#[1] -0.6
#$Facebook
#[1] -1.9
tapply(sapply(news_us$description, get_sentiment), unlist(organization1), mean) #this line throws the error
Your problem seems to arise from the use of 'unlist'. Avoid this, as it drops the NULL values and concatenates list entries with multiple values.
Your organization1 list has 7 entries (two of which are NULL and one is length = 2). You should have 6 entries if this is to match the news_us data.frame - so something is out of sync there.
Let's assume the first 6 entries in organization1 are correct; I would bind them to your data.frame to avoid further 'sync errors':
news_us$organization1 = organization1[1:6]
Then you need to do the sentiment analysis on each row of the data.frame and bind the results to the organization1 value/s. The code below might not be the most elegant way to achieve this, but I think it does what you are looking for:
results = do.call("rbind", apply(news_us, 1, function(item){
if(!is.null(item$organization1[[1]])) cbind(item$organization1, get_sentiment(item$description))
}))
This code drops any rows where there were no detected organization1 values. It should also duplicate sentiment scores in the case of more than one organization1 being detected. The results will look like this (which I believe is your goal):
[,1] [,2]
[1,] "US" "-0.4"
[2,] "Bath" "-0.1"
[3,] "Animal Crossing" "1.1"
[4,] "World Health Organization" "-0.6"
The mean scores for each organization can then be collapsed using by, aggregate or similar.
[Edit: Examples of by and aggregate]
by(as.numeric(results[, 2]), results$V1, mean)
aggregate(as.numeric(results[, 2]), list(results$V1), mean)
After this expression
good.rows<-ifelse(nchar(ufo$DateOccurred)!=10 | nchar(ufo$DateReported)!=10,
FALSE, TRUE)
I expected to get vectors of Booleans but I got
length(good.rows)
[1] 0
This is logical(empty) as I can see in R studio. What can I do to solve this?
dput(head(ufo))
"structure(list(DateOccured = structure(c(9412, 9413, 9131, 9260,
9292, 9428), class = "Date"), DateReported = structure(c(9412,
9414, 9133, 9260, 9295, 9427), class = "Date"), Location = c(" Iowa City, IA",
" Milwaukee, WI", " Shelton, WA", " Columbia, MO", " Seattle, WA",
" Brunswick County, ND"), ShortDescription = c(NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_, NA_character_
), Duration = c(NA, "2 min.", NA, "2 min.", NA, "30 min."), LongDescription = c("Man repts. witnessing "flash, followed by a classic UFO, w/ a tailfin at back." Red color on top half of tailfin. Became triangular.",
"Man on Hwy 43 SW of Milwaukee sees large, bright blue light streak by his car, descend, turn, cross road ahead, strobe. Bizarre!",
"Telephoned Report:CA woman visiting daughter witness discs and triangular ships over Squaxin Island in Puget Sound. Dramatic. Written report, with illustrations, submitted to NUFORC.",
"Man repts. son's bizarre sighting of small humanoid creature in back yard. Reptd. in Acteon Journal, St. Louis UFO newsletter.",
"Anonymous caller repts. sighting 4 ufo's in NNE sky, 45 deg. above horizon. (No other facts reptd. No return tel. #.)",
"Sheriff's office calls to rept. that deputy, 20 mi. SSE of Wilmington, is looking at peculiar, bright white, strobing light."
)), row.names = c(NA, 6L), class = "data.frame")"
There are a couple of reasons why this could be happening:
You're dataset is empty, check this using the dim() method.
The columns are not of type Character check this using the class()
method.
If both of these are correct try running the nchar(...) statements
separately.
Below I've create an example that works correctly, where I've gone through the above mentioned steps. In future please provide a reproducible example as part of your question.
# Create sample data
ufo <- data.frame(DateOccurred=c("a","bb","ccc"),
DateReported=c("a","bb","ccc"),
stringsAsFactors = FALSE)
print(ufo)
# Check size of data (make sure data has rows and columns are of type Character)
dim(ufo)
class(ufo$DateOccurred)
class(ufo$DateReported)
# Check nchar statements (Should run without error/warnings)
nchar(ufo$DateOccurred)
nchar(ufo$DateReported)
# Actual
good.rows <- ifelse(nchar(ufo$DateOccurred)!=3 | nchar(ufo$DateReported)!=3,
FALSE, TRUE)
print(good.rows)
length(good.rows)
I have the following name_total = matrix(nrow = 51, ncol=3, NA), where each row corresponds to a state (51 being District of Columbia). The first column is a string giving the name of the state (for example: name_total[1,1]= "Alabama").
The second and third are urls of CSV files from the Census, respectively linking counties with the state senate districts, and counties with state house districts.
For Alabama:
name_total[1,2] ="http://www2.census.gov/geo/relfiles/cdsld13/01/co_lu_delim_01.txt"
name_total[1,3] ="http://www2.census.gov/geo/relfiles/cdsld13/01/co_ll_delim_01.txt"
I wish to get as a final output a table which would basically be all 50 states + DC with their respective counties and linked Senate and House districts. I don't know if that's very clear so here is an example:
[,1] [,2] [,3] [,4]
[1,] "Alabama" "countyX1" "Senate District Y1" "House District Z1"
[2,] "Alabama" "countyX2" "Senate District Y2" "House District Z2"
[3,] "Alabama" "countyX3" "Senate District Y3" "House District Z3"
[4,] "Alaska" "countyX4" "Senate District Y4" "House District Z4"
[5,] "Alaska" "countyX5" "Senate District Y4" "House District Z5"
I use a forloop:
for (i in 1:51){
senate= name_total[i,2]
link_senate = url(senate)
house= name_total[i,3]
link_house = url(house)
state=name_total[i,1]
data_senate= read.csv2(link_senate,sep=",",header=TRUE, skip=1)
data_house= read.csv2(link_house,sep=",",header=TRUE, skip=1)
final=cbind(state, data_senate, data_house)
}
Of course each element has a different number of rows, for Alabama (i=1) State returns "Alabama" once, the others returning respectively 3 by 122 and 3 by 207 matrices. I get an error message about these variations in the number of rows.
I'm pretty sure one of the issues is the use of the cbind function, but I do not know what to use to get a better result.
In case others have similar issues, I found a way to get what I wanted separately for State Senates and State Houses. First of all some of the States only have of the two, and the link for Oregon was down. Personally I took them out of my original data.
Then I initialized for the first state outside of the loop:
senate = url(name_total[1,2])
data_senate= read.csv2(senate,sep=",",header=TRUE, skip=1)
assign(paste("Base_senate_",name_total[1,1],sep=""),data_senate)
A = assign(paste("Base_senate_",name_total[1,1],sep=""),data_senate)
house= url(name_total[1,3])
data_house= read.csv2(house,sep=",",header=TRUE, skip=1)
assign(paste("Base_house_",name_total[1,1],sep=""),data_house)
B = assign(paste("Base_house_",name_total[1,1],sep=""),data_house)
and then I used for loop:
for (i in 2:48){
senate = url(name_total[i,2])
house= url(name_total[i,3])
data_senate= read.csv2(senate,sep=",",header=TRUE, skip=1)
assign(paste("Base_senate_",name_total[i,1],sep=""),data_senate)
names(data_senate)[2] = "County"
A = rbind(A,assign(paste("Base_senate_",name_total[i,1],sep=""),data_senate))
data_house= read.csv2(house,sep=",",header=TRUE, skip=1)
assign(paste("Base_house_",name_total[i,1],sep=""),data_house)
names(data_house)[2] = "County"
B = rbind(B,assign(paste("Base_house_",name_total[i,1],sep=""),data_house))
}
A and B give you the expected tables (without the string name of the State, but the first variable identifies the state).
I had to use the names(data_senate)[2] = "County" because the second column had a different name for some states.
Hope it helps!