In my job I currently compare wholesale items (sent to us from suppliers in CSV format) to Amazon listings to find profitable items for the business.
I want to build a tool to help me with this very manual process. I know basic Python but what other languages or packages could help me with this?
I imagine I'd need to load the CSV, do column matching (as the columns will differ across suppliers), then somehow scan Amazon for the same products and scrape pricing information.
I'd also like whatever I build to look nice and be really user-friendly so I can share it with my colleague.
I'm willing to put in the work and I know there may be a fair amount of learning involved, but a nudge in the right direction / a list of key skills I should research would be much appreciated.
Here is a function in R that gets all matching products and prices from Amazon for a given search term.
library(rvest)
library(xml2)
library(tidyverse)

# XPath from each title node to the corresponding price node
# (skipping the struck-through "list price" spans)
price_xpath <- ".//parent::a/parent::h2/parent::div/following-sibling::div//a/span[not(@class='a-price a-text-price')]/span[@class='a-offscreen']"
# XPath to the product title nodes on the search results page
description_xpath <- "//div/h2/a/span"

get_amazon_info <- function(item) {
  # Build the search URL and download the results page
  source_html <- read_html(str_c("https://www.amazon.com/s?k=", str_replace_all(item, " ", "+")))
  # One node per product title
  root_nodes <- source_html %>%
    html_elements(xpath = description_xpath)
  # Find each title's price; flatten = FALSE keeps one node set per title
  prices <- xml_find_all(root_nodes, xpath = price_xpath, flatten = FALSE)
  prices <- lapply(prices, function(x) html_text(x)[1])
  prices[lengths(prices) == 0] <- NA  # products without a price become NA
  tibble(product = html_text(root_nodes),
         price = unlist(prices, use.names = FALSE)) %>%
    mutate(price = parse_number(str_remove(price, "\\$")))
}
get_amazon_info("dog food")
# # A tibble: 67 × 2
# product price
# <chr> <dbl>
# 1 Purina Pro Plan High Protein Dog Food with Probiotics for Dogs, Shredded Blend Turkey & Rice Formula - 17 lb. Bag 45.7
# 2 Purina Pro Plan High Protein Dog Food With Probiotics for Dogs, Shredded Blend Chicken & Rice Formula - 35 lb. Bag 64.0
# 3 Rachael Ray Nutrish Premium Natural Dry Dog Food, Real Chicken & Veggies Recipe, 28 Pounds (Packaging May Vary) 40.0
# 4 Blue Buffalo Life Protection Formula Natural Adult Dry Dog Food, Chicken and Brown Rice 34-lb 69.0
# 5 Blue Buffalo Life Protection Formula Natural Adult Dry Dog Food, Chicken and Brown Rice 30-lb 61.0
# 6 Purina ONE Natural Dry Dog Food, SmartBlend Lamb & Rice Formula - 8 lb. Bag 14.0
# 7 Purina ONE High Protein Senior Dry Dog Food, +Plus Vibrant Maturity Adult 7+ Formula - 31.1 lb. Bag 44.4
# 8 Purina Pro Plan Weight Management Dog Food, Shredded Blend Chicken & Rice Formula - 34 lb. Bag 66.0
# 9 NUTRO NATURAL CHOICE Large Breed Adult Dry Dog Food, Chicken & Brown Rice Recipe Dog Kibble, 30 lb. Bag 63.0
# 10 NUTRO NATURAL CHOICE Healthy Weight Adult Dry Dog Food, Chicken & Brown Rice Recipe Dog Kibble, 30 lb. Bag 63.0
# # … with 57 more rows
If you are new to R, you will have to run install.packages("tidyverse") first (this also installs rvest, which the code loads separately).
After library(tidyverse), you can read_csv() the products file and run the commands below; replace the sample data with your read_csv() call.
products <- tribble(~item,
                    "best books",
                    "flour",
                    "dog food") %>%
  rowwise() %>%
  mutate(amazon = map(item, get_amazon_info)) %>%
  unnest(everything())
# # A tibble: 208 × 3
# item product price
# <chr> <chr> <dbl>
# 1 best books Takeaway Quotes for Coaching Champions for Life: The Process of Mentoring the Person, Athlete and Player 15.0
# 2 best books The Art of War (Deluxe Hardbound Edition) 15.3
# 3 best books Turbulent: A Post Apocalyptic EMP Survival Thriller (Days of Want Series Book 1) 13.0
# 4 best books Revenge at Sea (Quint Adler Thrillers Book 1) 0
# 5 best books The Family Across the Street: A totally unputdownable psychological thriller with a shocking twist 9.89
# 6 best books Where the Crawdads Sing 9.98
# 7 best books The Seven Husbands of Evelyn Hugo: A Novel 9.42
# 8 best books Addlestone: The Addlestone Chronicles Book 1 October 1St 1934 - July 20Th 1935 4.99
# 9 best books The Wife Before: A Spellbinding Psychological Thriller with a Shocking Twist 13.6
# 10 best books Wish You Were Here: A Novel 11.0
# # … with 198 more rows
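To run this on your own supplier file instead of the sample data, a minimal sketch (the file name and the item column are assumptions; adjust both to match your CSV):

# Hypothetical file name; the CSV is assumed to have a search-term column named "item"
products <- read_csv("supplier_items.csv") %>%
  rowwise() %>%
  mutate(amazon = map(item, get_amazon_info)) %>%
  unnest(everything())

If your search-term column has a different name, rename it first, e.g. rename(item = product_name).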
I'm new to R and mostly work with data frames. A frequent task is to normalize counts for several parameters from several data frames. I have a demo dataset:
dataset

Season  Product  Quality  Sales
Winter  Apple    bad        345
Winter  Apple    good        13
Winter  Potato   bad         23
Winter  Potato   good        66
Winter  Beer     bad        345
Winter  Beer     good        34
Summer  Apple    bad         88
Summer  Apple    good        90
Summer  Potato   bad        123
Summer  Potato   good       457
Summer  Beer     bad         44
Summer  Beer     good       546
What I want to do is add a column "FC" (fold change) for "Sales". FC must be calculated for each "Season" and "Product" according to "Quality", with "bad" as the baseline.
Desired result:
Season  Product  Quality  Sales     FC
Winter  Apple    bad        345   1.00
Winter  Apple    good        13   0.04
Winter  Potato   bad         23   1.00
Winter  Potato   good        66   2.87
Winter  Beer     bad        345   1.00
Winter  Beer     good        34   0.10
Summer  Apple    bad         88   1.00
Summer  Apple    good        90   1.02
Summer  Potato   bad        123   1.00
Summer  Potato   good       457   3.72
Summer  Beer     bad         44   1.00
Summer  Beer     good       546  12.41
One way to do it is to filter first by "Season" and then by "Product" (e.g. creating a subset data frame subset_winter_apple) and then calculate FC similarly to this:
subset_winter_apple$FC = subset_winter_apple$Sales / subset_winter_apple$Sales[1]
Later on, I can combine all the subset data frames again, e.g. using rbind(), to reconstitute the original data frame with the FC column. However, this is highly inefficient. So I thought of splitting the data frame and creating a list:
split(
  dataset,
  list(dataset$Season, dataset$Product)
)
However, now I struggle with the normalisation (FC calculation), as I do not know how to reference the first cell value of "Sales" within each data frame in the list, so that each value in that column is normalized per list element. I did manage to calculate an FC value with lapply, but it is an exact copy of the first element's result in every listed data frame:
lapply(
  dataset,
  function(DF){ DF$FC = dataset[[1]]$Sales / dataset[[1]]$Sales[1]; DF }
)
Clearly, I do not know how to reference the first cell in a specific column to normalize the entire column for each listed data frame. Can somebody please help me?
Many thanks in advance for your suggestions.
dplyr solution
Using logical indexing within a grouped mutate():
library(dplyr)
dataset %>%
  group_by(Season, Product) %>%
  mutate(FC = Sales / Sales[Quality == "bad"]) %>%
  ungroup()
# A tibble: 12 × 5
Season Product Quality Sales FC
<chr> <chr> <chr> <int> <dbl>
1 Winter Apple bad 345 1
2 Winter Apple good 13 0.0377
3 Winter Potato bad 23 1
4 Winter Potato good 66 2.87
5 Winter Beer bad 345 1
6 Winter Beer good 34 0.0986
7 Summer Apple bad 88 1
8 Summer Apple good 90 1.02
9 Summer Potato bad 123 1
10 Summer Potato good 457 3.72
11 Summer Beer bad 44 1
12 Summer Beer good 546 12.4
Base R solution
Using by() to apply the same calculation within each Season and Product group, then reassembling the pieces in the original row order:
dataset <- by(
  dataset,
  list(dataset$Season, dataset$Product),
  \(x) transform(x, FC = Sales / Sales[Quality == "bad"])
)
dataset <- do.call(rbind, dataset)
dataset[order(as.numeric(rownames(dataset))), ]
Season Product Quality Sales FC
1 Winter Apple bad 345 1.00000000
2 Winter Apple good 13 0.03768116
3 Winter Potato bad 23 1.00000000
4 Winter Potato good 66 2.86956522
5 Winter Beer bad 345 1.00000000
6 Winter Beer good 34 0.09855072
7 Summer Apple bad 88 1.00000000
8 Summer Apple good 90 1.02272727
9 Summer Potato bad 123 1.00000000
10 Summer Potato good 457 3.71544715
11 Summer Beer bad 44 1.00000000
12 Summer Beer good 546 12.40909091
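For reference, the split() + lapply() approach from the question also works once the function indexes into each list element rather than back into the original dataset (a sketch using the same "bad is the baseline" rule):

split_list <- split(dataset, list(dataset$Season, dataset$Product))
result <- lapply(split_list, function(DF) {
  DF$FC <- DF$Sales / DF$Sales[DF$Quality == "bad"]  # baseline: the "bad" row of this element
  DF
})
do.call(rbind, result)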
I plan to use the spacyr R library to perform named entity recognition across several news articles (spacyr is an R wrapper for the Python spaCy package). My goal is to automatically identify partners for network analysis. However, spacyr is not recognising common entities as expected. Here is sample code to illustrate my issue:
library(quanteda)
library(spacyr)
text <- data.frame(
  doc_id = c(1:5),
  sentence = c(
    "Brightmark LLC, the global waste solutions provider, and Florida Keys National Marine Sanctuary (FKNMS), today announced a new plastic recycling partnership that will reduce landfill waste and amplify concerns about ocean plastics.",
    "Brightmark is launching a nationwide site search for U.S. locations suitable for its next set of advanced recycling facilities, which will convert hundreds of thousands of tons of post-consumer plastics into new products, including fuels, wax, and other products.",
    "Brightmark will be constructing the facility in partnership with the NSW government, as part of its commitment to drive economic growth and prosperity in regional NSW.",
    "Macon-Bibb County, the Macon-Bibb County Industrial Authority, and Brightmark have mutually agreed to end discussions around building a plastic recycling plant in Macon",
    "Global petrochemical company SK Global Chemical and waste solutions provider Brightmark have signed a memorandum of understanding to create a partnership that aims to take the lead in the circular economy of plastic by construction of a commercial scale plastics renewal plant in South Korea"
  )
)
corpus <- corpus(text, text_field = "sentence")
spacy_initialize(model = "en_core_web_sm")
parsed <- spacy_parse(corpus)
entity <- entity_extract(parsed)
I expect the company "Brightmark" to be recognised in all 5 sentences. However this is what I get:
entity
doc_id sentence_id entity entity_type
1 1 1 Florida_Keys_National_Marine_Sanctuary ORG
2 1 1 FKNMS ORG
3 2 1 U.S. GPE
4 3 1 NSW ORG
5 4 1 Macon_-_Bibb_County ORG
6 4 1 Brightmark ORG
7 4 1 Macon GPE
8 5 1 SK_Global_Chemical ORG
9 5 1 South_Korea GPE
"Brightmark" only appears as an ORG entity type in the 4th sentence (doc_id refers to sentence number). It should show up in all the sentences. The "NSW Government" does not appear at all.
I am still figuring out spaCy and spacyr. Perhaps someone can advise me why this is happening and what steps I should take to remedy this issue. Thanks in advance.
I changed the model and achieved better results:
spacy_initialize(model = "en_core_web_trf")
parsed <- spacy_parse(corpus)
entity <- entity_extract(parsed)
entity
doc_id sentence_id entity entity_type
1 1 1 Brightmark_LLC ORG
2 1 1 Florida_Keys GPE
3 1 1 FKNMS ORG
4 2 1 Brightmark ORG
5 2 1 U.S. GPE
6 3 1 Brightmark ORG
7 3 1 NSW GPE
8 3 1 NSW GPE
9 4 1 Macon_-_Bibb_County GPE
10 4 1 the_Macon_-_Bibb_County_Industrial_Authority ORG
11 4 1 Brightmark ORG
12 4 1 Macon GPE
13 5 1 SK_Global_Chemical ORG
14 5 1 Brightmark ORG
15 5 1 South_Korea GPE
The only downside is that NSW Government and Florida Keys National Marine Sanctuary are not resolved. I also get this warning: UserWarning: User provided device_type of 'cuda', but CUDA is not available.
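Two notes on this. The CUDA warning just means the transformer model is running on CPU instead of GPU, so it is slower but the results are unaffected. Also, if spaCy was already initialized with another model in the same R session, you may need to shut it down before switching; a minimal sketch (spacy_download_langmodel() is only needed the first time, if the transformer model is not yet installed):

spacy_finalize()                               # shut down the currently initialized model
# spacy_download_langmodel("en_core_web_trf")  # one-time install, if needed
spacy_initialize(model = "en_core_web_trf")    # transformer model: slower, more accurate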
I have a dataset with a column, CatSex, whose values look like "American.Indian.or.Alaska.Native.men". I want to split the characters after the last period into a new column, so I end up with two columns: one called Cat with only the demographic info in it, and one called Sex with the sex in it. The characters before the sex designation don't follow any clear pattern. I am not very good at R, but it seems to handle large data sets better than Tableau Prep. What I ultimately want is to pivot the data so that I have two distinct columns for the different categories here. I used this code to get part of the way there (the original data held around 119 columns with names like "Grand.total.men..C2005_A_RV..First.major..Area..ethnic..cultural..and.gender.studies...Degrees.total"), but I can't figure out how to handle the pattern I'm now left with in the column CatSex:
pivot_longer(
  cols = -c(UnitID, Institution.Name),
  names_to = c("CatSex", "Disc"),
  names_pattern = "(.*)..C2005_A_RV..First.major..(.*)",
  values_to = "Count",
  values_drop_na = TRUE
)
Here's a screenshot of the data structure I have now. I'm sorry for not putting in reproducible code; I don't know how to do that in this context!
EDIT: Here's a head(df) of the cleaned data so far:
# A tibble: 6 × 5
UnitID Institution.Name CatSex Disc Count
<int> <fct> <chr> <chr> <int>
1 177834 A T Still University of Health Sciences Grand.total.men Health.professions.and.related.clinical.sciences...Degrees.total. 212
2 177834 A T Still University of Health Sciences Grand.total.women Health.professions.and.related.clinical.sciences...Degrees.total. 359
3 177834 A T Still University of Health Sciences White.non.Hispanic.men Health.professions.and.related.clinical.sciences...Degrees.total. 181
4 177834 A T Still University of Health Sciences White.non.Hispanic.women Health.professions.and.related.clinical.sciences...Degrees.total. 317
5 177834 A T Still University of Health Sciences Black.non.Hispanic.men Health.professions.and.related.clinical.sciences...Degrees.total. 3
6 177834 A T Still University of Health Sciences Black.non.Hispanic.women Health.professions.and.related.clinical.sciences...Degrees.total. 5
Using extract() from the tidyr package (it is in the tidyverse):
Capture 2 groups with ().
Define the second group as one or more characters that are not ., anchored to the end with $.
library(dplyr)
library(tidyr)
df %>%
  extract(CatSex, c("Cat", "Sex"), "(.*)\\.([^.]+)$")
UnitID Institution.Name Cat Sex
1 222178 Abilene Christian University Hispanic men
2 222178 Abilene Christian University Hispanic women
3 222178 Abilene Christian University American.Indian.or.Alaska.Native men
4 222178 Abilene Christian University American.Indian.or.Alaska.Native women
5 222178 Abilene Christian University Asian.or.Pacific.Islander women
6 222178 Abilene Christian University Asian.or.Pacific.Islander men
7 222178 Abilene Christian University Grand.total men
8 222178 Abilene Christian University Grand.total women
9 222178 Abilene Christian University White.non.Hispanic men
10 222178 Abilene Christian University White.non.Hispanic women
11 222178 Abilene Christian University Black.non.Hispanic men
12 222178 Abilene Christian University Black.non.Hispanic women
13 222178 Abilene Christian University Hispanic men
14 222178 Abilene Christian University Hispanic women
15 222178 Abilene Christian University American.Indian.or.Alaska.Native men
Disc
1 Communication..journalism..and.related.programs
2 Communication..journalism..and.related.programs
3 Communication..journalism..and.related.programs
4 Communication..journalism..and.related.programs
5 Communication..journalism..and.related.programs
6 Communication..journalism..and.related.programs
7 Computer.and.information.sciences.and.support.services
8 Computer.and.information.sciences.and.support.services
9 Computer.and.information.sciences.and.support.services
10 Computer.and.information.sciences.and.support.services
11 Computer.and.information.sciences.and.support.services
12 Computer.and.information.sciences.and.support.services
13 Computer.and.information.sciences.and.support.services
14 Computer.and.information.sciences.and.support.services
15 Computer.and.information.sciences.and.support.services
pivot_longer() is not the right function in this context.
Here are a few options:
Using tidyr::separate
tidyr::separate(df, 'CatSex', c('Cat', 'Sex'), sep = '(\\.)(?!.*\\.)')
#                 Cat   Sex
#1 Grand.total men
#2 Grand.total women
#3 White.non.Hispanic men
#4 White.non.Hispanic women
#5 Black.non.Hispanic men
#6 Black.non.Hispanic women
Using stringr functions
library(dplyr)
library(stringr)
df %>%
  mutate(Sex = str_extract(CatSex, 'men|women'),
         Cat = str_remove(CatSex, '\\.(men|women)'))
In base R
transform(df, Sex = sub('.*\\.(men|women)', '\\1', CatSex),
          Cat = sub('\\.(men|women)', '', CatSex))
data
It is easier to help if you provide data in a reproducible format
df <- data.frame(CatSex = c("Grand.total.men", "Grand.total.women",
                            "White.non.Hispanic.men", "White.non.Hispanic.women",
                            "Black.non.Hispanic.men", "Black.non.Hispanic.women"))
I have a relatively large dataset of ~ 5k rows containing titles of journal/research papers. Here is a small sample of the dataset:
dt = structure(list(Title = c("Community reinforcement approach in the treatment of opiate addicts",
"Therapeutic justice: Life inside drug court", "Therapeutic justice: Life inside drug court",
"Tuberculosis screening in a novel substance abuse treatment center in Malaysia: Implications for a comprehensive approach for integrated care",
"An ecosystem for improving the quality of personal health records",
"Patterns of attachment and alcohol abuse in sexual and violent non-sexual offenders",
"A Model for the Assessment of Static and Dynamic Factors in Sexual Offenders",
"A model for the assessment of static and dynamic factors in sexual offenders",
"The problem of co-occurring disorders among jail detainees: Antisocial disorder, alcoholism, drug abuse, and depression",
"Co-occurring disorders among mentally ill jail detainees. Implications for public policy",
"Comorbidity and Continuity of Psychiatric Disorders in Youth After Detention: A Prospective Longitudinal Study",
"Behavioral Health and Adult Milestones in Young Adults With Perinatal HIV Infection or Exposure",
"Behavioral health and adult milestones in young adults with perinatal HIV infection or exposure",
"Revising the paradigm for jail diversion for people with mental and substance use disorders: Intercept 0",
"Diagnosis of active and latent tuberculosis: summary of NICE guidance",
"Towards tackling tuberculosis in vulnerable groups in the European Union: the E-DETECT TB consortium"
)), row.names = c(NA, -16L), class = c("tbl_df", "tbl", "data.frame"
))
You can see that there are some duplicates of titles in there, but with formatting/case differences. I want to identify titles that are duplicated and create a new variable that documents which rows possibly match. To do this, I have attempted to use the agrep function as suggested here:
dt$is.match <- sapply(dt$Title, agrep, dt$Title)
This identifies matches, but saves the results as a list in the new variable column. Is there a way to do this (preferably using base r or data.table) where the results of agrep are not saved as a list, but only identifying which rows are matches (e.g., 6:7)?
Thanks in advance - hope I have provided enough information.
Do you need something like this?
dt$is.match <- sapply(dt$Title,function(x) toString(agrep(x, dt$Title)), USE.NAMES = FALSE)
dt
# A tibble: 16 x 2
# Title is.match
# <chr> <chr>
# 1 Community reinforcement approach in the treatment of opiate addicts 1
# 2 Therapeutic justice: Life inside drug court 2, 3
# 3 Therapeutic justice: Life inside drug court 2, 3
# 4 Tuberculosis screening in a novel substance abuse treatment center in Malaysia: Implications for a comp… 4
# 5 An ecosystem for improving the quality of personal health records 5
# 6 Patterns of attachment and alcohol abuse in sexual and violent non-sexual offenders 6
# 7 A Model for the Assessment of Static and Dynamic Factors in Sexual Offenders 7, 8
# 8 A model for the assessment of static and dynamic factors in sexual offenders 7, 8
# 9 The problem of co-occurring disorders among jail detainees: Antisocial disorder, alcoholism, drug abuse… 9
#10 Co-occurring disorders among mentally ill jail detainees. Implications for public policy 10
#11 Comorbidity and Continuity of Psychiatric Disorders in Youth After Detention: A Prospective Longitudina… 11
#12 Behavioral Health and Adult Milestones in Young Adults With Perinatal HIV Infection or Exposure 12, 13
#13 Behavioral health and adult milestones in young adults with perinatal HIV infection or exposure 12, 13
#14 Revising the paradigm for jail diversion for people with mental and substance use disorders: Intercept 0 14
#15 Diagnosis of active and latent tuberculosis: summary of NICE guidance 15
#16 Towards tackling tuberculosis in vulnerable groups in the European Union: the E-DETECT TB consortium 16
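If you also want the fuzzy match to ignore case, and want to keep only the rows that actually have a duplicate, a small extension of the same idea (a sketch; agrep's ignore.case argument handles the case differences):

matches <- sapply(dt$Title, agrep, dt$Title, ignore.case = TRUE, USE.NAMES = FALSE)
dt$is.match <- sapply(matches, toString)  # e.g. "7, 8" for the two near-identical titles
dt[lengths(matches) > 1, ]                # rows with at least one other matching title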
This isn't base R or data.table, but here's one way using the tidyverse to detect duplicates:
library(janitor)
library(tidyverse)
dt %>%
  mutate(row = row_number()) %>%
  get_dupes(Title)
Output:
# A tibble: 2 x 3
Title dupe_count row
<chr> <int> <int>
1 Therapeutic justice: Life inside drug court 2 2
2 Therapeutic justice: Life inside drug court 2 3
If you wanted to pick out duplicates that aren't case-sensitive, try this:
dt %>%
  mutate(Title = str_to_lower(Title),
         row = row_number()) %>%
  get_dupes(Title)
Output:
# A tibble: 6 x 3
Title dupe_count row
<chr> <int> <int>
1 a model for the assessment of static and dynamic factors in sexual offend… 2 7
2 a model for the assessment of static and dynamic factors in sexual offend… 2 8
3 behavioral health and adult milestones in young adults with perinatal hiv… 2 12
4 behavioral health and adult milestones in young adults with perinatal hiv… 2 13
5 therapeutic justice: life inside drug court 2 2
6 therapeutic justice: life inside drug court 2 3
I am attempting to conduct emotional sentiment analysis of a large corpus of Tweets (91k) with an external list of emotionally-charged words (from the NRC Emotion Lexicon). To do this, I want to count and sum the total number of times any word from the words-of-joy list is contained within each Tweet. Ideally, this would allow partial matches of a word rather than requiring an exact match (so "loved" still counts for "love"). I would like the total to appear in a new column in the df.
The df and column name for the Tweets are Tweets_with_Emotions$full_text, and the list is Words_of_joy$word.
Example 1
> head(Tweets_with_Emotions, n=10)
ID Date full_text
1 58150 2012-09-12 I love an excellent cookie
2 12357 2012-09-28 Oranges are delicious and excellent
3 50788 2012-10-04 Eager to visit Disneyland
4 66038 2012-10-11 I wish my boyfriend would propose already
5 18119 2012-10-11 Love Maggie Smith
6 48349 2012-10-14 The movie was excellent, loved it.
7 23328 2012-10-16 Pineapples are so delicious and excellent
8 66038 2012-10-26 Eager to see the Champions Cup next week
9 32717 2012-10-28 Hating this show
10 11345 2012-11-08 Eager for the food
Example 2
> head(Words_of_joy, n=5)
word
1 eager
2 champion
3 delicious
4 excellent
5 love
Desired output
> head(New_df, n=10)
ID Date full_text joy_count
1 58150 2012-09-12 I love an excellent cookie 2
2 12357 2012-09-28 Oranges are delicious and excellent 2
3 50788 2012-10-04 Eager to visit Disneyland 1
4 66038 2012-10-11 I wish my boyfriend would propose already 0
5 18119 2012-10-11 Love Maggie Smith 1
6 48349 2012-10-14 The movie was excellent, loved it. 2
7 23328 2012-10-16 Pineapples are so delicious and excellent 2
8 66038 2012-10-26 Eager to see the Champions Cup next week 2
9 32717 2012-10-28 Hating this show 0
10 11345 2012-11-08 Eager for the food 1
I've effectively run the emotion list through the Tweets so that it returns a yes or no as to whether any word from the emotion list is contained within each Tweet (no = 0, yes = 1), however I cannot figure out how to count the matches and return the totals in a new column:
new_df <- Tweets_with_Emotions[stringr::str_detect(Tweets_with_Emotions$full_text, paste(Words_of_joy$word, collapse = '|')), ]
I'm extremely new to R (and stackoverflow!) and have been struggling to figure this out for a few days so any help would be incredibly appreciated!
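A minimal sketch of the counting step, assuming the object names shown above (str_count() counts every occurrence of the alternation, and because the lexicon words are matched as plain substrings, partial matches such as "loved" for "love" are counted):

library(stringr)
# One regex alternation built from the lexicon; lowercasing the text makes it
# case-insensitive. Assumes the lexicon words contain no regex metacharacters.
joy_pattern <- paste(Words_of_joy$word, collapse = "|")
Tweets_with_Emotions$joy_count <- str_count(
  str_to_lower(Tweets_with_Emotions$full_text),
  joy_pattern
)
head(Tweets_with_Emotions, n = 10)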