How to SPLIT and COUNT with GROUP BY

My current query looks like this:
SELECT Discipline, COUNT(*) Cnt FROM [xxx].[dbo].[ScanDoc]
WHERE Discipline <> ''
GROUP BY Discipline
The result looks like this:
Discipline Cnt
Advanced Material Science 1
Advanced Material Science;#Chemical Science 2
Advanced Material Science;#Engineering Science 1
Agriculture Science 1
Business and Economics 3
Computer Sciences and ICT 1
Computer Sciences and ICT;#Business and Economics 1
Engineering Science 3
Health and Medical Science 3
Health and Medical Science;#Life Science 2
Humanities and Social Science 9
Life Science 1
What I want is to split the rows that contain multiple values and count each discipline separately. Please show me the way.
I want a result like this:
Discipline Cnt
Advanced Material Science 4
Chemical Science 2
Engineering Science 1
Agriculture Science 1
Business and Economics 3
Computer Sciences and ICT 2
Business and Economics 1
Engineering Science 3
Health and Medical Science 5
Humanities and Social Science 9
Life Science 3
Do you see the difference between the results?

Unfortunately there is no built-in SPLIT function in SQL Server before 2016, so on older versions your best bet is to create a split function yourself and apply it to Discipline before aggregating. From SQL Server 2016 onwards you can use the built-in STRING_SPLIT function instead.
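On SQL Server 2016 or later, the split-and-count could be sketched like this (assuming ';#' is the only delimiter used in Discipline):

```sql
-- Sketch for SQL Server 2016+ only: STRING_SPLIT takes a single-character
-- separator, so the two-character ';#' delimiter is first collapsed to '#'.
SELECT s.value AS Discipline, COUNT(*) AS Cnt
FROM [xxx].[dbo].[ScanDoc] d
CROSS APPLY STRING_SPLIT(REPLACE(d.Discipline, ';#', '#'), '#') s
WHERE d.Discipline <> ''
GROUP BY s.value;
```

Each source row contributes one counted row per discipline it contains, which is exactly the difference between the two result sets shown above.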

Related

Using spacyr for named entity recognition - inconsistent results

I plan to use the spacyr R library to perform named entity recognition across several news articles (spacyr is an R wrapper for the Python spaCy package). My goal is to identify partners for network analysis automatically. However, spacyr is not recognising common entities as expected. Here is sample code to illustrate my issue:
library(quanteda)
library(spacyr)

text <- data.frame(
  doc_id = 1:5,
  sentence = c(
    "Brightmark LLC, the global waste solutions provider, and Florida Keys National Marine Sanctuary (FKNMS), today announced a new plastic recycling partnership that will reduce landfill waste and amplify concerns about ocean plastics.",
    "Brightmark is launching a nationwide site search for U.S. locations suitable for its next set of advanced recycling facilities, which will convert hundreds of thousands of tons of post-consumer plastics into new products, including fuels, wax, and other products.",
    "Brightmark will be constructing the facility in partnership with the NSW government, as part of its commitment to drive economic growth and prosperity in regional NSW.",
    "Macon-Bibb County, the Macon-Bibb County Industrial Authority, and Brightmark have mutually agreed to end discussions around building a plastic recycling plant in Macon",
    "Global petrochemical company SK Global Chemical and waste solutions provider Brightmark have signed a memorandum of understanding to create a partnership that aims to take the lead in the circular economy of plastic by construction of a commercial scale plastics renewal plant in South Korea"
  )
)
corpus <- corpus(text, text_field = "sentence")
spacy_initialize(model = "en_core_web_sm")
parsed <- spacy_parse(corpus)
entity <- entity_extract(parsed)
I expect the company "Brightmark" to be recognised in all 5 sentences. However, this is what I get:
entity
doc_id sentence_id entity entity_type
1 1 1 Florida_Keys_National_Marine_Sanctuary ORG
2 1 1 FKNMS ORG
3 2 1 U.S. GPE
4 3 1 NSW ORG
5 4 1 Macon_-_Bibb_County ORG
6 4 1 Brightmark ORG
7 4 1 Macon GPE
8 5 1 SK_Global_Chemical ORG
9 5 1 South_Korea GPE
"Brightmark" is only recognised as an ORG entity in the 4th sentence (doc_id refers to the sentence number). It should show up in all five sentences. "NSW Government" does not appear at all.
I am still figuring out spaCy and spacyr. Perhaps someone can advise me why this is happening and what steps I should take to remedy this issue. Thanks in advance.
I changed the model and achieved better results:
spacy_initialize(model = "en_core_web_trf")
parsed <- spacy_parse(corpus)
entity <- entity_extract(parsed)
entity
doc_id sentence_id entity entity_type
1 1 1 Brightmark_LLC ORG
2 1 1 Florida_Keys GPE
3 1 1 FKNMS ORG
4 2 1 Brightmark ORG
5 2 1 U.S. GPE
6 3 1 Brightmark ORG
7 3 1 NSW GPE
8 3 1 NSW GPE
9 4 1 Macon_-_Bibb_County GPE
10 4 1 the_Macon_-_Bibb_County_Industrial_Authority ORG
11 4 1 Brightmark ORG
12 4 1 Macon GPE
13 5 1 SK_Global_Chemical ORG
14 5 1 Brightmark ORG
15 5 1 South_Korea GPE
The only downside is that NSW Government and Florida Keys National Marine Sanctuary are not resolved. I also get this warning: UserWarning: User provided device_type of 'cuda', but CUDA is not available.
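For anyone else trying this: the transformer model is not bundled with spaCy and has to be downloaded once before it can be initialised. A sketch, assuming a working spacyr installation (the transformer models also need spacy-transformers on the Python side):

```r
library(spacyr)

# One-off download of the transformer model; after this,
# initialise it by name as usual.
spacy_download_langmodel("en_core_web_trf")
spacy_initialize(model = "en_core_web_trf")
```

The CUDA warning is harmless if you intend to run on CPU; it just means a GPU was requested but not found, so spaCy falls back to the CPU.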

Scraping Amazon Prices and matching against entries in spreadsheet

In my job I currently compare wholesale items (sent to us from suppliers in CSV format) to Amazon listings to find profitable items for the business.
I want to build a tool to help me with this very manual process. I know basic Python but what other languages or packages could help me with this?
I imagine I'd need to load the CSV, do column matching (as these will differ across suppliers), then somehow scan Amazon for the same products and scrape pricing information.
I'd also like whatever I build to look nice/be really user friendly so I can share it with my colleague.
I'm willing to put in the work to learn myself and know I might have to put in a fair amount of learning but a nudge in the right direction / a list of key skills I should research would be much appreciated.
Here is a function in R to get all matching products and prices from Amazon via search terms.
library(rvest)
library(xml2)
library(tidyverse)
price_xpath <- ".//parent::a/parent::h2/parent::div/following-sibling::div//a/span[not(@class='a-price a-text-price')]/span[@class='a-offscreen']"
description_xpath <- "//div/h2/a/span"
get_amazon_info <- function(item) {
  source_html <- read_html(str_c("https://www.amazon.com/s?k=", str_replace_all(item, " ", "+")))
  root_nodes <- source_html %>%
    html_elements(xpath = description_xpath)
  prices <- xml_find_all(root_nodes, xpath = price_xpath, flatten = FALSE)
  prices <- lapply(prices, function(x) html_text(x)[1])
  prices[lengths(prices) == 0] <- NA
  tibble(product = html_text(root_nodes),
         price = unlist(prices, use.names = FALSE)) %>%
    mutate(price = parse_number(str_remove(price, "\\$")))
}
get_amazon_info("dog food")
# # A tibble: 67 × 2
# product price
# <chr> <dbl>
# 1 Purina Pro Plan High Protein Dog Food with Probiotics for Dogs, Shredded Blend Turkey & Rice Formula - 17 lb. Bag 45.7
# 2 Purina Pro Plan High Protein Dog Food With Probiotics for Dogs, Shredded Blend Chicken & Rice Formula - 35 lb. Bag 64.0
# 3 Rachael Ray Nutrish Premium Natural Dry Dog Food, Real Chicken & Veggies Recipe, 28 Pounds (Packaging May Vary) 40.0
# 4 Blue Buffalo Life Protection Formula Natural Adult Dry Dog Food, Chicken and Brown Rice 34-lb 69.0
# 5 Blue Buffalo Life Protection Formula Natural Adult Dry Dog Food, Chicken and Brown Rice 30-lb 61.0
# 6 Purina ONE Natural Dry Dog Food, SmartBlend Lamb & Rice Formula - 8 lb. Bag 14.0
# 7 Purina ONE High Protein Senior Dry Dog Food, +Plus Vibrant Maturity Adult 7+ Formula - 31.1 lb. Bag 44.4
# 8 Purina Pro Plan Weight Management Dog Food, Shredded Blend Chicken & Rice Formula - 34 lb. Bag 66.0
# 9 NUTRO NATURAL CHOICE Large Breed Adult Dry Dog Food, Chicken & Brown Rice Recipe Dog Kibble, 30 lb. Bag 63.0
# 10 NUTRO NATURAL CHOICE Healthy Weight Adult Dry Dog Food, Chicken & Brown Rice Recipe Dog Kibble, 30 lb. Bag 63.0
# # … with 57 more rows
If you are new to R, you will have to install.packages("tidyverse") first.
After library(tidyverse), you can read_csv the products file and execute the following commands. Replace the sample data with the read_csv command.
products <- tribble(~item,
                    "best books",
                    "flour",
                    "dog food") %>%
  rowwise() %>%
  mutate(amazon = map(item, get_amazon_info)) %>%
  unnest(everything())
# # A tibble: 208 × 3
# item product price
# <chr> <chr> <dbl>
# 1 best books Takeaway Quotes for Coaching Champions for Life: The Process of Mentoring the Person, Athlete and Player 15.0
# 2 best books The Art of War (Deluxe Hardbound Edition) 15.3
# 3 best books Turbulent: A Post Apocalyptic EMP Survival Thriller (Days of Want Series Book 1) 13.0
# 4 best books Revenge at Sea (Quint Adler Thrillers Book 1) 0
# 5 best books The Family Across the Street: A totally unputdownable psychological thriller with a shocking twist 9.89
# 6 best books Where the Crawdads Sing 9.98
# 7 best books The Seven Husbands of Evelyn Hugo: A Novel 9.42
# 8 best books Addlestone: The Addlestone Chronicles Book 1 October 1St 1934 - July 20Th 1935 4.99
# 9 best books The Wife Before: A Spellbinding Psychological Thriller with a Shocking Twist 13.6
# 10 best books Wish You Were Here: A Novel 11.0
# # … with 198 more rows
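Once the Amazon prices are joined on, flagging candidate items is one more mutate(). Note that 'wholesale_cost' and the sample figures below are hypothetical placeholders, not columns from the original post; substitute whatever cost column your supplier CSV provides:

```r
library(dplyr)

# Hypothetical data: in practice this would come from joining the scraped
# prices onto the supplier spreadsheet.
scraped <- tibble(item           = c("dog food", "flour"),
                  price          = c(45.7, 3.99),
                  wholesale_cost = c(30.00, 2.50))

scraped %>%
  mutate(margin     = price - wholesale_cost,  # per-item spread
         profitable = margin > 0)              # threshold is up to you
```

From there, arrange(desc(margin)) gives a ranked shortlist to review manually.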
As you are new to StackOverflow, please remember to hit the green check mark if this helps. Please add some more tags to your post, and change the title to "Scraping Amazon Prices".

Fuzzy matching strings within a single column and documenting possible matches

I have a relatively large dataset of ~ 5k rows containing titles of journal/research papers. Here is a small sample of the dataset:
dt = structure(list(Title = c("Community reinforcement approach in the treatment of opiate addicts",
"Therapeutic justice: Life inside drug court", "Therapeutic justice: Life inside drug court",
"Tuberculosis screening in a novel substance abuse treatment center in Malaysia: Implications for a comprehensive approach for integrated care",
"An ecosystem for improving the quality of personal health records",
"Patterns of attachment and alcohol abuse in sexual and violent non-sexual offenders",
"A Model for the Assessment of Static and Dynamic Factors in Sexual Offenders",
"A model for the assessment of static and dynamic factors in sexual offenders",
"The problem of co-occurring disorders among jail detainees: Antisocial disorder, alcoholism, drug abuse, and depression",
"Co-occurring disorders among mentally ill jail detainees. Implications for public policy",
"Comorbidity and Continuity of Psychiatric Disorders in Youth After Detention: A Prospective Longitudinal Study",
"Behavioral Health and Adult Milestones in Young Adults With Perinatal HIV Infection or Exposure",
"Behavioral health and adult milestones in young adults with perinatal HIV infection or exposure",
"Revising the paradigm for jail diversion for people with mental and substance use disorders: Intercept 0",
"Diagnosis of active and latent tuberculosis: summary of NICE guidance",
"Towards tackling tuberculosis in vulnerable groups in the European Union: the E-DETECT TB consortium"
)), row.names = c(NA, -16L), class = c("tbl_df", "tbl", "data.frame"
))
You can see that there are some duplicates of titles in there, but with formatting/case differences. I want to identify titles that are duplicated and create a new variable that documents which rows are possibly matching. To do this, I have attempted to use the agrep function as suggested here:
dt$is.match <- sapply(dt$Title,agrep,dt$Title)
This identifies matches, but saves the results as a list in the new variable column. Is there a way to do this (preferably using base r or data.table) where the results of agrep are not saved as a list, but only identifying which rows are matches (e.g., 6:7)?
Thanks in advance - hope I have provided enough information.
Do you need something like this?
dt$is.match <- sapply(dt$Title,function(x) toString(agrep(x, dt$Title)), USE.NAMES = FALSE)
dt
# A tibble: 16 x 2
# Title is.match
# <chr> <chr>
# 1 Community reinforcement approach in the treatment of opiate addicts 1
# 2 Therapeutic justice: Life inside drug court 2, 3
# 3 Therapeutic justice: Life inside drug court 2, 3
# 4 Tuberculosis screening in a novel substance abuse treatment center in Malaysia: Implications for a comp… 4
# 5 An ecosystem for improving the quality of personal health records 5
# 6 Patterns of attachment and alcohol abuse in sexual and violent non-sexual offenders 6
# 7 A Model for the Assessment of Static and Dynamic Factors in Sexual Offenders 7, 8
# 8 A model for the assessment of static and dynamic factors in sexual offenders 7, 8
# 9 The problem of co-occurring disorders among jail detainees: Antisocial disorder, alcoholism, drug abuse… 9
#10 Co-occurring disorders among mentally ill jail detainees. Implications for public policy 10
#11 Comorbidity and Continuity of Psychiatric Disorders in Youth After Detention: A Prospective Longitudina… 11
#12 Behavioral Health and Adult Milestones in Young Adults With Perinatal HIV Infection or Exposure 12, 13
#13 Behavioral health and adult milestones in young adults with perinatal HIV infection or exposure 12, 13
#14 Revising the paradigm for jail diversion for people with mental and substance use disorders: Intercept 0 14
#15 Diagnosis of active and latent tuberculosis: summary of NICE guidance 15
#16 Towards tackling tuberculosis in vulnerable groups in the European Union: the E-DETECT TB consortium 16
This isn't base R or data.table, but here's one way using the tidyverse to detect duplicates:
library(janitor)
library(tidyverse)
dt %>%
  mutate(row = row_number()) %>%
  get_dupes(Title)
Output:
# A tibble: 2 x 3
Title dupe_count row
<chr> <int> <int>
1 Therapeutic justice: Life inside drug court 2 2
2 Therapeutic justice: Life inside drug court 2 3
If you wanted to pick out duplicates that aren't case-sensitive, try this:
dt %>%
  mutate(Title = str_to_lower(Title),
         row = row_number()) %>%
  get_dupes(Title)
Output:
# A tibble: 6 x 3
Title dupe_count row
<chr> <int> <int>
1 a model for the assessment of static and dynamic factors in sexual offend… 2 7
2 a model for the assessment of static and dynamic factors in sexual offend… 2 8
3 behavioral health and adult milestones in young adults with perinatal hiv… 2 12
4 behavioral health and adult milestones in young adults with perinatal hiv… 2 13
5 therapeutic justice: life inside drug court 2 2
6 therapeutic justice: life inside drug court 2 3
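If you want to stay in base R, note that agrep() accepts ignore.case = TRUE, so the case-insensitive matching can be folded straight into the earlier sapply() approach:

```r
# Same idea as the sapply()/toString() approach, but ignoring case,
# so "A Model..." and "A model..." land in the same match group.
dt$is.match <- sapply(dt$Title,
                      function(x) toString(agrep(x, dt$Title, ignore.case = TRUE)),
                      USE.NAMES = FALSE)
```

The result stays a plain character column (e.g. "7, 8"), so it works unchanged in a data.frame or data.table.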

Couldn't get tq_exchange() or stockSymbols() to work

I am trying to get stock symbols with these functions (both failed)
TTR::stockSymbols("AMEX")
Error in symbols[, sort.by] : incorrect number of dimensions
tidyquant::tq_exchange("AMEX")
Getting data...
Error: Can't rename columns that don't exist.
x Column Symbol doesn't exist.
Do these functions work for you? What fixes do you know to correct them? Thank you!
I get the same error. It seems there have been some changes to the website from which these packages get their information. There is an open issue about this.
In the same thread it is mentioned that you can get the information from the underlying JSON endpoint that returns this data.
tmp <- jsonlite::fromJSON('https://api.nasdaq.com/api/screener/stocks?tableonly=true&limit=25&offset=0&exchange=AMEX&download=true')
head(tmp$data$rows)
# symbol name
#1 AAMC Altisource Asset Management Corp Com
#2 AAU Almaden Minerals Ltd. Common Shares
#3 ACU Acme United Corporation. Common Stock
#4 ACY AeroCentury Corp. Common Stock
#5 AE Adams Resources & Energy Inc. Common Stock
#6 AEF Aberdeen Emerging Markets Equity Income Fund Inc. Common Stock
# lastsale netchange pctchange volume marketCap country ipoyear
#1 $24.60 -0.3595 -1.44% 15183 40595215.00 United States
#2 $0.846 0.0359 4.432% 2272603 101984125.00 Canada 2015
#3 $33.82 0.61 1.837% 7869 112922038.00 United States 1988
#4 $11.76 2.01 20.615% 739133 18179596.00 United States
#5 $28.31 0.11 0.39% 6217 120099060.00 United States
#6 $9.10 0.09 0.999% 40775 461841180.00 United States
# industry sector
#1 Real Estate Finance
#2 Precious Metals Basic Industries
#3 Industrial Machinery/Components Capital Goods
#4 Diversified Commercial Services Technology
#5 Oil Refining/Marketing Energy
#6
# url
#1 /market-activity/stocks/aamc
#2 /market-activity/stocks/aau
#3 /market-activity/stocks/acu
#4 /market-activity/stocks/acy
#5 /market-activity/stocks/ae
#6 /market-activity/stocks/aef
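To get the rows into a shape closer to tq_exchange()'s output, a small dplyr clean-up can follow. This is a sketch based on the column names in the API response shown above, which may change without notice:

```r
library(dplyr)
library(readr)

# Parse the dollar strings ("$24.60") into numerics and keep the
# columns tq_exchange() users typically want.
amex <- tmp$data$rows %>%
  transmute(symbol,
            company         = name,
            last.sale.price = parse_number(lastsale),
            market.cap      = parse_number(marketCap),
            country,
            ipo.year        = ipoyear,
            industry,
            sector)
```

Swap exchange=AMEX in the URL for NYSE or NASDAQ to pull the other exchanges the same way.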

Create list of elements which match a value

I have a table of values with the name, zipcode and opening date of recreational pot shops in WA state.
name zip opening
1 The Stash Box 98002 2014-11-21
3 Greenside 98198 2015-01-01
4 Bud Nation 98106 2015-06-29
5 West Seattle Cannabis Co. 98168 2015-02-28
6 Nimbin Farm 98168 2015-04-25
...
I'm analyzing this data to see if there are any correlations between drug usage and location and opening of recreational stores. For one of the visualizations I'm doing, I am organizing the data by number of shops per zipcode using the group_by() and summarize() functions in dplyr.
zip count
(int) (int)
1 98002 1
2 98106 1
3 98168 2
4 98198 1
...
This data is then plotted onto a leaflet map. Showing the relative number of shops in a zipcode using the radius of the circles to represent shops.
I would like to reorganize the name variable into a third column so that this can popup in my visualization when scrolling over each circle. Ideally, the data would look something like this:
zip count name
(int) (int) (character)
1 98002 1 The Stash Box
2 98106 1 Bud Nation
3 98168 2 Nimbin Farm, West Seattle Cannabis Co.
4 98198 1 Greenside
...
Where all shops in the same zipcode appear together in the third column together. I've tried various for loops and if statements but I'm sure there is a better way to do this and my R skills are just not up there yet. Any help would be appreciated.
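A minimal dplyr sketch of the aggregation described above (shop data abbreviated from the sample in the question):

```r
library(dplyr)

shops <- data.frame(
  name = c("The Stash Box", "Greenside", "Bud Nation",
           "West Seattle Cannabis Co.", "Nimbin Farm"),
  zip  = c("98002", "98198", "98106", "98168", "98168")
)

shops %>%
  group_by(zip) %>%
  summarize(count = n(),
            name  = toString(sort(name)))
# zip 98168 -> count 2, name "Nimbin Farm, West Seattle Cannabis Co."
```

toString() collapses the vector of names within each zip group into a single comma-separated string, which is the popup-friendly format wanted for the leaflet labels.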
