Using spacyr for named entity recognition - inconsistent results - r

I plan to use the spacyr R library to perform named entity recognition across several news articles (spacyr is an R wrapper for the Python spaCy package). My goal is to automatically identify partner organisations for network analysis. However, spacyr is not recognising common entities as expected. Here is sample code to illustrate my issue:
library(quanteda)
library(spacyr)
text <- data.frame(
  doc_id = 1:5,
  sentence = c(
    "Brightmark LLC, the global waste solutions provider, and Florida Keys National Marine Sanctuary (FKNMS), today announced a new plastic recycling partnership that will reduce landfill waste and amplify concerns about ocean plastics.",
    "Brightmark is launching a nationwide site search for U.S. locations suitable for its next set of advanced recycling facilities, which will convert hundreds of thousands of tons of post-consumer plastics into new products, including fuels, wax, and other products.",
    "Brightmark will be constructing the facility in partnership with the NSW government, as part of its commitment to drive economic growth and prosperity in regional NSW.",
    "Macon-Bibb County, the Macon-Bibb County Industrial Authority, and Brightmark have mutually agreed to end discussions around building a plastic recycling plant in Macon",
    "Global petrochemical company SK Global Chemical and waste solutions provider Brightmark have signed a memorandum of understanding to create a partnership that aims to take the lead in the circular economy of plastic by construction of a commercial scale plastics renewal plant in South Korea"))
corpus <- corpus(text, text_field = "sentence")
spacy_initialize(model = "en_core_web_sm")
parsed <- spacy_parse(corpus)
entity <- entity_extract(parsed)
I expect the company "Brightmark" to be recognised in all 5 sentences. However, this is what I get:
entity
doc_id sentence_id entity entity_type
1 1 1 Florida_Keys_National_Marine_Sanctuary ORG
2 1 1 FKNMS ORG
3 2 1 U.S. GPE
4 3 1 NSW ORG
5 4 1 Macon_-_Bibb_County ORG
6 4 1 Brightmark ORG
7 4 1 Macon GPE
8 5 1 SK_Global_Chemical ORG
9 5 1 South_Korea GPE
"Brightmark" only appears as an ORG entity type in the 4th sentence (doc_id refers to sentence number). It should show up in all the sentences. The "NSW Government" does not appear at all.
I am still figuring out spaCy and spacyr. Perhaps someone can advise me why this is happening and what steps I should take to remedy this issue. Thanks in advance.

I changed the model to the transformer-based en_core_web_trf pipeline and achieved better results:
spacy_initialize(model = "en_core_web_trf")
parsed <- spacy_parse(corpus)
entity <- entity_extract(parsed)
entity
doc_id sentence_id entity entity_type
1 1 1 Brightmark_LLC ORG
2 1 1 Florida_Keys GPE
3 1 1 FKNMS ORG
4 2 1 Brightmark ORG
5 2 1 U.S. GPE
6 3 1 Brightmark ORG
7 3 1 NSW GPE
8 3 1 NSW GPE
9 4 1 Macon_-_Bibb_County GPE
10 4 1 the_Macon_-_Bibb_County_Industrial_Authority ORG
11 4 1 Brightmark ORG
12 4 1 Macon GPE
13 5 1 SK_Global_Chemical ORG
14 5 1 Brightmark ORG
15 5 1 South_Korea GPE
The only downside is that NSW Government and Florida Keys National Marine Sanctuary are not resolved as single entities. I also get this warning: UserWarning: User provided device_type of 'cuda', but CUDA is not available. The warning just means the transformer falls back to running on the CPU, so it is harmless apart from speed.
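For anyone reproducing this: the transformer model has to be downloaded before it can be initialised, and if another pipeline is already loaded in the session it has to be shut down first. A minimal sketch using spacyr's helpers (en_core_web_trf also needs the spacy-transformers Python package installed on the spaCy side):
# Download the transformer pipeline once
spacy_download_langmodel("en_core_web_trf")
# Shut down the model already initialised in this session
spacy_finalize()
# Re-initialise with the transformer model and re-parse
spacy_initialize(model = "en_core_web_trf")
parsed <- spacy_parse(corpus)
entity <- entity_extract(parsed)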

Related

Couldn't get tq_exchange() or stockSymbols() to work

I am trying to get stock symbols with these functions, but both fail:
TTR::stockSymbols("AMEX")
Error in symbols[, sort.by] : incorrect number of dimensions
tidyquant::tq_exchange("AMEX")
Getting data...
Error: Can't rename columns that don't exist.
x Column Symbol doesn't exist.
Do these functions work for you? Do you know of any fixes? Thank you!
I get the same error. It seems there have been some changes to the website from which these packages get their information. There is an open issue about this.
In the same thread it is mentioned that you can get the information directly from the underlying JSON endpoint:
tmp <- jsonlite::fromJSON('https://api.nasdaq.com/api/screener/stocks?tableonly=true&limit=25&offset=0&exchange=AMEX&download=true')
head(tmp$data$rows)
# symbol name
#1 AAMC Altisource Asset Management Corp Com
#2 AAU Almaden Minerals Ltd. Common Shares
#3 ACU Acme United Corporation. Common Stock
#4 ACY AeroCentury Corp. Common Stock
#5 AE Adams Resources & Energy Inc. Common Stock
#6 AEF Aberdeen Emerging Markets Equity Income Fund Inc. Common Stock
# lastsale netchange pctchange volume marketCap country ipoyear
#1 $24.60 -0.3595 -1.44% 15183 40595215.00 United States
#2 $0.846 0.0359 4.432% 2272603 101984125.00 Canada 2015
#3 $33.82 0.61 1.837% 7869 112922038.00 United States 1988
#4 $11.76 2.01 20.615% 739133 18179596.00 United States
#5 $28.31 0.11 0.39% 6217 120099060.00 United States
#6 $9.10 0.09 0.999% 40775 461841180.00 United States
# industry sector
#1 Real Estate Finance
#2 Precious Metals Basic Industries
#3 Industrial Machinery/Components Capital Goods
#4 Diversified Commercial Services Technology
#5 Oil Refining/Marketing Energy
#6
# url
#1 /market-activity/stocks/aamc
#2 /market-activity/stocks/aau
#3 /market-activity/stocks/acu
#4 /market-activity/stocks/acy
#5 /market-activity/stocks/ae
#6 /market-activity/stocks/aef
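If you want something closer to the tibble that tq_exchange() used to return, you can tidy the JSON columns. A sketch, where the column names come from the output above and the "$" stripping and type conversions are my own additions:
library(dplyr)
amex <- tmp$data$rows %>%
  transmute(symbol,
            company = name,
            last_sale = as.numeric(sub("^\\$", "", lastsale)),  # drop the leading "$"
            market_cap = as.numeric(marketCap),
            ipo_year = as.integer(ipoyear),                     # blank years become NA
            country, sector, industry)
head(amex)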

Match and count total words from an external list with text strings (tweets) in r

I am attempting to conduct emotional sentiment analysis of a large corpus of Tweets (91k) with an external list of emotionally charged words (from the NRC Emotion Lexicon). To do this, I want to count and sum the total number of times any word from the words-of-joy list is contained within each Tweet. Ideally this would be a partial rather than an exact match (so, for example, "loved" should count as a match for "love"). I would like the total to show in a new column of the df.
The df and column name for the Tweets are Tweets_with_Emotions$full_text and the list is Words_of_joy$word.
Example 1
> head(Tweets_with_Emotions, n=10)
ID Date full_text
1 58150 2012-09-12 I love an excellent cookie
2 12357 2012-09-28 Oranges are delicious and excellent
3 50788 2012-10-04 Eager to visit Disneyland
4 66038 2012-10-11 I wish my boyfriend would propose already
5 18119 2012-10-11 Love Maggie Smith
6 48349 2012-10-14 The movie was excellent, loved it.
7 23328 2012-10-16 Pineapples are so delicious and excellent
8 66038 2012-10-26 Eager to see the Champions Cup next week
9 32717 2012-10-28 Hating this show
10 11345 2012-11-08 Eager for the food
Example 2
> head(words_of_joy, n=5)
word
1 eager
2 champion
3 delicious
4 excellent
5 love
Desired output
> head(New_df, n=10)
ID Date full_text joy_count
1 58150 2012-09-12 I love an excellent cookie 2
2 12357 2012-09-28 Oranges are delicious and excellent 2
3 50788 2012-10-04 Eager to visit Disneyland 1
4 66038 2012-10-11 I wish my boyfriend would propose already 0
5 18119 2012-10-11 Love Maggie Smith 1
6 48349 2012-10-14 The movie was excellent, loved it. 2
7 23328 2012-10-16 Pineapples are so delicious and excellent 2
8 66038 2012-10-26 Eager to see the Champions Cup next week 2
9 32717 2012-10-28 Hating this show 0
10 11345 2012-11-08 Eager for the food 1
I've effectively run the emotion list through the Tweets so that it returns a yes or no as to whether any words from the emotion list are contained within each Tweet (no = 0, yes = 1). However, I cannot figure out how to count the matches and return the totals in a new column:
new_df <- Tweets_with_Emotions[stringr::str_detect(Tweets_with_Emotions$full_text, paste(Words_of_negative$words,collapse = '|')),]
I'm extremely new to R (and Stack Overflow!) and have been struggling to figure this out for a few days, so any help would be incredibly appreciated!
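A sketch of one way to do the counting, building on the str_detect() attempt in the question: stringr::str_count() takes the same collapsed pattern but returns the number of matches per string instead of a yes/no. Object names are as given in the question; the tolower() is my addition so that "Love" and "Eager" still match the lowercase lexicon:
library(stringr)
# One regex alternating over all lexicon words, e.g. "eager|champion|delicious|..."
pattern <- paste(Words_of_joy$word, collapse = "|")
Tweets_with_Emotions$joy_count <- str_count(tolower(Tweets_with_Emotions$full_text), pattern)
Because the match is partial, "loved" and "Champions" count towards "love" and "champion" as in the desired output, but a word such as "lovely" would also be counted; prefix each lexicon word with \\b if matches should only start at a word boundary.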

Create list of elements which match a value

I have a table of values with the name, zipcode and opening date of recreational pot shops in WA state.
name zip opening
1 The Stash Box 98002 2014-11-21
3 Greenside 98198 2015-01-01
4 Bud Nation 98106 2015-06-29
5 West Seattle Cannabis Co. 98168 2015-02-28
6 Nimbin Farm 98168 2015-04-25
...
I'm analyzing this data to see if there are any correlations between drug usage and the location and opening dates of recreational stores. For one of the visualizations I'm doing, I am organizing the data by number of shops per zipcode using the group_by() and summarize() functions in dplyr.
zip count
(int) (int)
1 98002 1
2 98106 1
3 98168 2
4 98198 1
...
This data is then plotted onto a leaflet map, with the radius of each circle representing the relative number of shops in that zipcode.
I would like to reorganize the name variable into a third column so that this can popup in my visualization when scrolling over each circle. Ideally, the data would look something like this:
zip count name
(int) (int) (character)
1 98002 1 The Stash Box
2 98106 1 Bud Nation
3 98168 2 Nimbin Farm, West Seattle Cannabis Co.
4 98198 1 Greenside
...
Where all shops in the same zipcode appear together in the third column. I've tried various for loops and if statements, but I'm sure there is a better way to do this and my R skills are just not there yet. Any help would be appreciated.
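A sketch of one way to get there with the same dplyr verbs already in use. I've assumed the original table is called shops; paste() with a collapse argument builds the comma-separated list:
library(dplyr)
shops %>%
  group_by(zip) %>%
  summarize(count = n(),
            name = paste(sort(name), collapse = ", "))
sort() puts the names in alphabetical order within each zipcode, matching the desired output above.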

Conditional mathematical optimization in R

I have the following data frames:
Required <- data.table( Country=c("AT Iron", "AT Energy", "BE Iron", "BE Energy", "BG Iron", "BG Energy"),Prod1=c(5,10,0,5,0,5),Prod2=c(25,5,10,0,0,5))
Supplied <- data.table( Country=c("AT Iron", "AT Energy", "BE Iron", "BE Energy", "BG Iron", "BG Energy"),Prod1=c(10,5,5,10,5,10),Prod2=c(20,20,20,0,15,10))
> Required
Country Prod1 Prod2
1: AT Iron 5 25
2: AT Energy 10 5
3: BE Iron 0 10
4: BE Energy 5 0
5: BG Iron 0 0
6: BG Energy 5 5
> Supplied
Country Prod1 Prod2
1: AT Iron 10 20
2: AT Energy 5 20
3: BE Iron 5 20
4: BE Energy 10 0
5: BG Iron 5 15
6: BG Energy 10 10
"Required" shows the initial material and energy requirements to manufacture two products, and the materials and energy are supplied by three different countries. For example, product 1 would require, for Energy, 10 units from AT, 5 units from BE and 5 units from BG. "Supplied" shows the actual supply capacity of the countries. Following the example, AT cannot supply 10 units of energy but 5 units, so another country must supply the remaining units. I assume that the country with the most net supply capacity (that is, once discounted the initial requirement) will provide the remaining units. In this case, both BE and BG have 5 units of net supply capacity, so both will provide with equal units, 2.5.
I seek an optimization algorithm that creates a new "Required" table, "RequiredNew", taking into account the supply constraints and the above-mentioned assumption. The resulting table should look like:
> RequiredNew
Country Prod1 Prod2
1: AT Iron 5 20
2: AT Energy 10 5
3: BE Iron 0 10
4: BE Energy 7.5 0
5: BG Iron 0 5
6: BG Energy 7.5 5
In the link below I posted a similar question, which was solved by user digEmAll, so a similar approach would be suitable. However, I have rephrased the question so that it is clearer and resembles my actual data more closely.
Mathematical optimization in R
I apologise for the multiple posts. Thank you in advance.
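The tie-splitting rule from the worked example can at least be written down directly. A sketch of the bookkeeping for one resource and product column; redistribute() is my own helper, not a full optimizer, and it caps each country at its own supply before sharing out the shortfall:
redistribute <- function(req, sup) {
  new <- pmin(req, sup)                   # each country first covers what it can
  deficit <- sum(req - new)               # units left uncovered
  if (deficit > 0) {
    netcap <- pmax(sup - req, 0)          # spare capacity after own requirement
    top <- which(netcap == max(netcap))   # country/countries with the most spare capacity
    new[top] <- new[top] + deficit / length(top)  # ties are split equally
  }
  new
}
energy <- grepl("Energy", Required$Country)
redistribute(Required$Prod1[energy], Supplied$Prod1[energy])
# -> 5.0 7.5 7.5, i.e. BE and BG each pick up 2.5 units as in the example
Looping this over every resource and product column would build RequiredNew; chained shortfalls or more elaborate tie-breaking would call for a proper LP formulation, e.g. with the lpSolve package.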

How to SPLIT and COUNT with GROUP BY

My current query is like this:
SELECT Discipline, COUNT(*) Cnt FROM [xxx].[dbo].[ScanDoc]
WHERE Discipline <> ''
GROUP BY Discipline
The result is like this:
Discipline Cnt
Advanced Material Science 1
Advanced Material Science;#Chemical Science 2
Advanced Material Science;#Engineering Science 1
Agriculture Science 1
Business and Economics 3
Computer Sciences and ICT 1
Computer Sciences and ICT;#Business and Economics 1
Engineering Science 3
Health and Medical Science 3
Health and Medical Science;#Life Science 2
Humanities and Social Science 9
Life Science 1
So what I want is to split the multi-value rows before counting. Could someone please show me the way? I want a result like this:
Discipline Cnt
Advanced Material Science 4
Chemical Science 2
Engineering Science 1
Agriculture Science 1
Business and Economics 3
Computer Sciences and ICT 2
Business and Economics 1
Engineering Science 3
Health and Medical Science 5
Humanities and Social Science 9
Life Science 3
Do you see the difference between the results?
Unfortunately there is no built-in SPLIT function in SQL Server, so your best bet would be to create a SPLIT function and then call it from a union query, with the first branch taking the first part of Discipline and the second taking the latter part.
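On SQL Server 2016 or later you can skip the hand-rolled function and use the built-in STRING_SPLIT() with CROSS APPLY. A sketch against the table above; the REPLACE() strips the '#' left over because the data uses ';#' as its separator, which is my own handling of this particular format:
SELECT LTRIM(REPLACE(s.value, '#', '')) AS Discipline, COUNT(*) AS Cnt
FROM [xxx].[dbo].[ScanDoc] d
CROSS APPLY STRING_SPLIT(d.Discipline, ';') s
WHERE d.Discipline <> ''
GROUP BY LTRIM(REPLACE(s.value, '#', ''))
ORDER BY Discipline;
Note that this aggregates each discipline fully (Engineering Science would show 4 rather than separate rows of 1 and 3), which is presumably what the GROUP BY is for.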
