Create list of elements which match a value - r

I have a table of values with the name, zipcode and opening date of recreational pot shops in WA state.
name zip opening
1 The Stash Box 98002 2014-11-21
3 Greenside 98198 2015-01-01
4 Bud Nation 98106 2015-06-29
5 West Seattle Cannabis Co. 98168 2015-02-28
6 Nimbin Farm 98168 2015-04-25
...
I'm analyzing this data to see if there are any correlations between drug usage and location and opening of recreational stores. For one of the visualizations I'm doing, I am organizing the data by number of shops per zipcode using the group_by() and summarize() functions in dplyr.
zip count
(int) (int)
1 98002 1
2 98106 1
3 98168 2
4 98198 1
...
This data is then plotted onto a leaflet map, with the radius of each circle showing the relative number of shops in that zipcode.
I would like to collect the name variable into a third column so that the shop names can pop up in my visualization when hovering over each circle. Ideally, the data would look something like this:
zip count name
(int) (int) (character)
1 98002 1 The Stash Box
2 98106 1 Bud Nation
3 98168 2 Nimbin Farm, West Seattle Cannabis Co.
4 98198 1 Greenside
...
All shops in the same zipcode should appear together in the third column. I've tried various for loops and if statements, but I'm sure there is a better way to do this and my R skills are just not up there yet. Any help would be appreciated.
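One possible approach, sketched here on the five sample rows from the question: since the counts already come from group_by() and summarize(), the names can be collapsed in the same summarise() call with paste(..., collapse = ", "). The data frame name `shops` and the result name `by_zip` are assumptions made for illustration, not names from the original post.

```r
library(dplyr)

# Sample rows copied from the question; `shops` is a hypothetical name
shops <- data.frame(
  name = c("The Stash Box", "Greenside", "Bud Nation",
           "West Seattle Cannabis Co.", "Nimbin Farm"),
  zip = c(98002L, 98198L, 98106L, 98168L, 98168L),
  opening = as.Date(c("2014-11-21", "2015-01-01", "2015-06-29",
                      "2015-02-28", "2015-04-25")),
  stringsAsFactors = FALSE
)

# One row per zipcode: the number of shops and their names in one string
by_zip <- shops %>%
  group_by(zip) %>%
  summarise(
    count = n(),
    name = paste(sort(name), collapse = ", ")  # sorted alphabetically
  )

print(by_zip)
```

The `name` column of the result is a plain character vector, so it should be usable wherever a popup or label text is expected.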

Related

Using spacyr for named entity recognition - inconsistent results

I plan to use the spacyr R library to perform named entity recognition across several news articles (spacyr is an R wrapper for the Python spaCy package). My goal is to identify partners for network analysis automatically. However, spacyr is not recognising common entities as expected. Here is sample code to illustrate my issue:
library(quanteda)
library(spacyr)
text <- data.frame(doc_id = c(1:5),
sentence = c("Brightmark LLC, the global waste solutions provider, and Florida Keys National Marine Sanctuary (FKNMS), today announced a new plastic recycling partnership that will reduce landfill waste and amplify concerns about ocean plastics.",
"Brightmark is launching a nationwide site search for U.S. locations suitable for its next set of advanced recycling facilities, which will convert hundreds of thousands of tons of post-consumer plastics into new products, including fuels, wax, and other products.",
"Brightmark will be constructing the facility in partnership with the NSW government, as part of its commitment to drive economic growth and prosperity in regional NSW.",
"Macon-Bibb County, the Macon-Bibb County Industrial Authority, and Brightmark have mutually agreed to end discussions around building a plastic recycling plant in Macon",
"Global petrochemical company SK Global Chemical and waste solutions provider Brightmark have signed a memorandum of understanding to create a partnership that aims to take the lead in the circular economy of plastic by construction of a commercial scale plastics renewal plant in South Korea"))
corpus <- corpus(text, text_field = "sentence")
spacy_initialize(model = "en_core_web_sm")
parsed <- spacy_parse(corpus)
entity <- entity_extract(parsed)
I expect the company "Brightmark" to be recognised in all 5 sentences. However this is what I get:
entity
doc_id sentence_id entity entity_type
1 1 1 Florida_Keys_National_Marine_Sanctuary ORG
2 1 1 FKNMS ORG
3 2 1 U.S. GPE
4 3 1 NSW ORG
5 4 1 Macon_-_Bibb_County ORG
6 4 1 Brightmark ORG
7 4 1 Macon GPE
8 5 1 SK_Global_Chemical ORG
9 5 1 South_Korea GPE
"Brightmark" only appears as an ORG entity type in the 4th sentence (doc_id refers to sentence number). It should show up in all the sentences. The "NSW Government" does not appear at all.
I am still figuring out spaCy and spacyr. Perhaps someone can advise me why this is happening and what steps I should take to remedy this issue. Thanks in advance.
I changed the model and achieved better results:
spacy_initialize(model = "en_core_web_trf")
parsed <- spacy_parse(corpus)
entity <- entity_extract(parsed)
entity
doc_id sentence_id entity entity_type
1 1 1 Brightmark_LLC ORG
2 1 1 Florida_Keys GPE
3 1 1 FKNMS ORG
4 2 1 Brightmark ORG
5 2 1 U.S. GPE
6 3 1 Brightmark ORG
7 3 1 NSW GPE
8 3 1 NSW GPE
9 4 1 Macon_-_Bibb_County GPE
10 4 1 the_Macon_-_Bibb_County_Industrial_Authority ORG
11 4 1 Brightmark ORG
12 4 1 Macon GPE
13 5 1 SK_Global_Chemical ORG
14 5 1 Brightmark ORG
15 5 1 South_Korea GPE
The only downside is that NSW Government and Florida Keys National Marine Sanctuary are not resolved. I also get this warning: UserWarning: User provided device_type of 'cuda', but CUDA is not available.
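To compare models without eyeballing the tables, a small check like the following may help. It is only a sketch: the data frame is typed in by hand from the en_core_web_sm output above (so it runs without spaCy), doc_id is stored as numbers for simplicity, and `missing_org` is a hypothetical helper, not part of spacyr.

```r
# Rows copied by hand from the en_core_web_sm output shown above
entity_sm <- data.frame(
  doc_id = c(1, 1, 2, 3, 4, 4, 4, 5, 5),
  entity = c("Florida_Keys_National_Marine_Sanctuary", "FKNMS", "U.S.", "NSW",
             "Macon_-_Bibb_County", "Brightmark", "Macon",
             "SK_Global_Chemical", "South_Korea"),
  entity_type = c("ORG", "ORG", "GPE", "ORG", "ORG", "ORG", "GPE", "ORG", "GPE"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: doc_ids that have no ORG entity starting with `prefix`
missing_org <- function(entities, all_ids, prefix) {
  hit <- entities$entity_type == "ORG" & startsWith(entities$entity, prefix)
  setdiff(all_ids, unique(entities$doc_id[hit]))
}

missing_sm <- missing_org(entity_sm, 1:5, "Brightmark")
print(missing_sm)
```

Running the same helper on the output of a different model gives a quick per-document view of which sentences still miss the expected organisation.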

Subtracting subset from larger dataset in R

Hi all: I have two variables. The first is entitled WITHOUT_VERANDAS. It is a list of cities, aggregated by average rental prices of homes WITHOUT verandas (there are about 200 rows):
City Price
1 Appleton 5000
2 Ames 9000
3 Lodi 1020
4 Milwaukee 2010
5 Barstow 2000
6 Chicago 2320
7 Champaign 2000
The second variable is entitled WITH_VERANDAS. It's a list of cities, aggregated by average rental prices of homes WITH verandas (there are about 10 rows, this is a subset of the previous dataset, since not every city has rental properties with verandas):
City Price
1 Milwaukee 3000
2 Chicago 2050
3 Lodi 5000
For each city in WITH_VERANDAS, I want to subtract that city's WITHOUT_VERANDAS price from its WITH_VERANDAS price, so that I can see which cities have the highest or lowest differential. The result should include only the cities in WITH_VERANDAS.
I've tried this:
difference <- WITH_VERANDAS$Price-WITHOUT_VERANDAS$Price
View(difference)
However, this returns as many rows as the WITHOUT_VERANDAS dataset. I also get a warning:
longer object length is not a multiple of shorter object length
The result simply subtracts WITHOUT_VERANDAS's row 1 from WITH_VERANDAS's row 1, and so on down the rows. For example, row 1 of the output is Milwaukee - Appleton, row 2 is Chicago - Ames, and so forth:
1. -2000
2. -6950
If I could only filter WITHOUT_VERANDAS to include only the cities included in WITH_VERANDAS, I think it would work. Thanks!
R2evans, thank you ! this worked great. Now, I have:
City Price.x Price.y
1 Appleton NA 5000
2 Ames NA 9000
3 Lodi 5000 1020
4 Milwaukee 3000 2010
How would I go about filtering this list to remove any row where Price.x is NA, i.e. all rows that did not match? Thanks again!
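A minimal base R sketch of one way to do this, using the sample rows from the question. It assumes the merged table above came from merge() with all.y = TRUE (which would explain the NA values in Price.x); the names `merged` and `matched` are made up for illustration.

```r
WITHOUT_VERANDAS <- data.frame(
  City = c("Appleton", "Ames", "Lodi", "Milwaukee", "Barstow", "Chicago", "Champaign"),
  Price = c(5000, 9000, 1020, 2010, 2000, 2320, 2000)
)
WITH_VERANDAS <- data.frame(
  City = c("Milwaukee", "Chicago", "Lodi"),
  Price = c(3000, 2050, 5000)
)

# Price.x comes from WITH_VERANDAS, Price.y from WITHOUT_VERANDAS.
# all.y = TRUE keeps every WITHOUT_VERANDAS city, so unmatched rows get NA in Price.x.
merged <- merge(WITH_VERANDAS, WITHOUT_VERANDAS, by = "City", all.y = TRUE)

# Keep only the rows that matched, then compute and sort the differential
matched <- merged[!is.na(merged$Price.x), ]
matched$difference <- matched$Price.x - matched$Price.y
matched <- matched[order(-matched$difference), ]

print(matched)
```

Leaving out all.y = TRUE makes merge() keep only the cities present in both data frames, in which case no NA filtering is needed.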

How can I create a term matrix that sums numeric values associated to each document?

I'm a bit new to R and tm so struggling with this exercise!
I have one description column with messy unstructured data containing words about the name, city and country of a customer. And another column with the amount of sold items.
**Description Sold Items**
Mrs White London UK 10
Mr Wolf London UK 20
Tania Maier Berlin Germany 10
Thomas Germany 30
Nick Forest Leeds UK 20
Silvio Verdi Italy Torino 10
Tom Cardiff UK 10
Mary House London 5
Using the tm package and DocumentTermMatrix(), I'm able to break down each row into terms and get the frequency of each word (i.e. the number of customers with that word).
UK London Germany … Mary
Frequency 4 3 2 … 1
However, I would also like to sum the total amount of sold items.
The desired output should be:
UK London Germany … Mary
Frequency 4 3 2 … 1
Sum of Sold Items 60 35 40 … 5
How can I get to this result?
Assuming you can get to the stage where you have the Frequency table:
UK London Germany … Mary
Frequency 4 3 2 … 1
and that you can extract the words from it, you can use an apply function with grep. Here I create a vector that stands in for the dictionary you would extract from your frequency table:
S_data<-read.csv("data.csv",stringsAsFactors = F)
Words<-c("UK","London","Germany","Mary")
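Then use this in an apply as follows. This could be done more efficiently, but you will get the idea:
# simplify = FALSE keeps one vector of row indices per word, even when every word matches the same number of rows
string_rows<-sapply(Words, function(x) grep(x,S_data$Description), simplify = FALSE)
# assumes the sold-items column in data.csv is named Items
string_sum<-unlist(lapply(string_rows, function(x) sum(S_data$Items[x])))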
> string_sum
UK London Germany Mary
60 35 40 5
Then just bind this onto your frequency table as an extra row, for example with rbind().
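For reference, here is a self-contained sketch of the same idea with the sample data typed in instead of read from data.csv. The column name `Items` and the word-boundary pattern are assumptions: the `\\b` anchors are added so that a dictionary word only matches whole words in the description.

```r
S_data <- data.frame(
  Description = c("Mrs White London UK", "Mr Wolf London UK",
                  "Tania Maier Berlin Germany", "Thomas Germany",
                  "Nick Forest Leeds UK", "Silvio Verdi Italy Torino",
                  "Tom Cardiff UK", "Mary House London"),
  Items = c(10, 20, 10, 30, 20, 10, 10, 5),
  stringsAsFactors = FALSE
)
Words <- c("UK", "London", "Germany", "Mary")

# For each word, the row numbers whose description contains it as a whole word
rows_by_word <- sapply(
  Words,
  function(w) grep(paste0("\\b", w, "\\b"), S_data$Description),
  simplify = FALSE
)

# Two rows: how many descriptions contain the word, and the items sold in them
result <- rbind(
  Frequency = lengths(rows_by_word),
  `Sum of Sold Items` = sapply(rows_by_word, function(i) sum(S_data$Items[i]))
)

print(result)
```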

Leaflet R color map based on multiple variables?

From what I have seen, colored maps in leaflet usually depict only one variable (GDP, crime stats, temperature, etc.), like this one: [image not shown]
Is there a way to make maps that display the highest variable in a data frame in leaflet R? For example showing which alcoholic beverage is the most popular in a country, like this map?
(source: dailymail.co.uk)
Say that I had a data frame that looked like this and I wanted to do a similar map to the alcoholic beverage one...
Country Beer Wine Spirits Coffee Tea
Sweden 7 7 5 10 6
USA 9 6 6 7 5
Russia 5 3 9 5 8
Is there a way in leaflet R to pick out the alcoholic beverages, assign them a color and then display them on the map to show which type of alcoholic beverage is the most popular in the three different countries?
Step 0, make a test data frame:
> set.seed(1234)
> drinks = data.frame(Country=c("Sweden","USA","Russia"),
Beer=sample(10,3), Wine=sample(10,3), Spirits=sample(10,3),
Coffee=sample(10,3), Tea=sample(10,3))
Note I have country as a column - yours might have countries in the row names which means the following code needs changing. Anyway. We get:
> drinks
Country Beer Wine Spirits Coffee Tea
1 Sweden 2 7 1 6 3
2 USA 6 8 3 7 9
3 Russia 5 6 6 5 10
Now we combine apply to work along rows, which.max to get the highest element, and various subset operations to drop the country column and get the drink name from the column names:
> drinks$Favourite = names(drinks)[-1][apply(drinks[,-1],1,which.max)]
> drinks
Country Beer Wine Spirits Coffee Tea Favourite
1 Sweden 2 7 1 6 3 Wine
2 USA 6 8 3 7 9 Tea
3 Russia 5 6 6 5 10 Tea
If there's a tie then which.max returns the index of the first maximum, so the leftmost of the tied columns wins. If you want something else then you'll have to rewrite.
Now feed your new data frame to leaflet and map the Favourite column.
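Since the question asks only about the alcoholic beverages, the same which.max idea can be restricted to those columns. This is a sketch using the numbers from the question's table; the vector name `alcoholic` is made up for illustration.

```r
# Data frame copied from the question
drinks <- data.frame(
  Country = c("Sweden", "USA", "Russia"),
  Beer = c(7, 9, 5), Wine = c(7, 6, 3), Spirits = c(5, 6, 9),
  Coffee = c(10, 7, 5), Tea = c(6, 5, 8)
)

# Only these columns take part in the comparison
alcoholic <- c("Beer", "Wine", "Spirits")

# Row-wise index of the largest value, translated back to a column name.
# Sweden has a Beer/Wine tie; which.max takes the first of the tied columns.
drinks$Favourite <- alcoholic[apply(drinks[, alcoholic], 1, which.max)]

print(drinks)
```

The resulting Favourite column is categorical, so in leaflet it could be turned into fill colours with a factor palette such as colorFactor().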

How can I count the number of instances a value occurs within a subgroup in R?

I have a data frame that I'm working with in R, and am trying to check how many times a value occurs within its larger, associated group. Specifically, I'm trying to count the number of cities that are listed for each particular country.
My data look something like this:
City Country
=========================
New York US
San Francisco US
Los Angeles US
Paris France
Nantes France
Berlin Germany
It seems that table() is the way to go, but I can't quite figure it out — how can I find out how many cities are listed for each country? That is to say, how can I find out how many fields in one column are associated with a particular value in another column?
EDIT:
I'm hoping for something along the lines of
3 US
2 France
1 Germany
I guess you can try table.
table(df$Country)
# France Germany US
# 2 1 3
Or using data.table
library(data.table)
setDT(df)[, .N, by=Country]
# Country N
#1: US 3
#2: France 2
#3: Germany 1
Or
library(plyr)
count(df$Country)
# x freq
#1 France 2
#2 Germany 1
#3 US 3
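To get the counts in the order shown in the EDIT (largest first), a base R sketch along these lines should work. The sample data frame is typed in from the question, and the output column names `Country` and `n` are arbitrary choices.

```r
df <- data.frame(
  City = c("New York", "San Francisco", "Los Angeles", "Paris", "Nantes", "Berlin"),
  Country = c("US", "US", "US", "France", "France", "Germany"),
  stringsAsFactors = FALSE
)

# Count the cities per country, largest count first
counts <- sort(table(df$Country), decreasing = TRUE)

# Turn the table into an ordinary data frame with readable column names
counts_df <- as.data.frame(counts, stringsAsFactors = FALSE)
names(counts_df) <- c("Country", "n")

print(counts_df)
```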
