I'm new to web scraping and am trying to scrape the data from this interactive chart using R so that all the series are displayed in a single table: https://www.e61.in/spendtracker
I've used the developer tools in Chrome (Inspect > Network > Fetch/XHR) but cannot find the data points.
I would be very grateful if someone could take a quick look and let me know a) whether the data points are stored on the page somewhere, b) if possible, how they identified the right file, and c) whether it is reasonably straightforward to then generate a table.
Continuing from that iframe URL:
Before switching to R and rvest, you should check the actual page source and perhaps run it through a beautifier. You'll see a Plotly.newPlot() call; note that it receives the array of data series as its 2nd parameter. One option is to extract that piece of JavaScript with a regex, parse it as JSON and work from there.
Perhaps something like this:
library(rvest)
library(stringr)
library(jsonlite)
library(purrr)
library(tidyr)
library(dplyr)

url <- "https://www-e61-in.filesusr.com/html/84f6c1_839cefc8bcc59c1cc688a6be6b4a5656.html"
html <- read_html(url)
# extract the last <script> tag, which contains Plotly.newPlot() and the data series
plotly_js <- html %>%
  html_element("script:last-of-type") %>%
  html_text()
# extract the array from the js string, using \Q and \E to avoid escaping all special chars
p_dataseries <- str_extract(plotly_js, '\\Q[{"connectgaps"\\E.*?\\Q"type":"scatter"}]\\E')
# parse extracted string
ds_j <- fromJSON(p_dataseries, simplifyVector = FALSE)
# extract data, result will be in long format
df <- map_df(ds_j, `[`, c("name", "x", "y")) %>%
  unnest(c(x, y)) %>%
  mutate(date = as.POSIXct(x))
#> tibble [2,346 × 4] (S3: tbl_df/tbl/data.frame)
#> $ name: chr [1:2346] "Total" "Total" "Total" "Total" ...
#> $ x : chr [1:2346] "2020-01-12T00:00:00" "2020-01-19T00:00:00" "2020-01-26T00:00:00" "2020-02-02T00:00:00" ...
#> $ y : num [1:2346] 100 100.1 100.7 99.3 97.8 ...
#> $ date: POSIXct[1:2346], format: "2020-01-12" "2020-01-19" ...
#> # A tibble: 6 × 4
#> name x y date
#> <chr> <chr> <dbl> <dttm>
#> 1 Total 2020-01-12T00:00:00 100 2020-01-12 00:00:00
#> 2 Total 2020-01-19T00:00:00 100. 2020-01-19 00:00:00
#> 3 Total 2020-01-26T00:00:00 101. 2020-01-26 00:00:00
#> 4 Total 2020-02-02T00:00:00 99.3 2020-02-02 00:00:00
#> 5 Total 2020-02-09T00:00:00 97.8 2020-02-09 00:00:00
#> 6 Total 2020-02-16T00:00:00 100. 2020-02-16 00:00:00
library(ggplot2)

p <- df %>%
  ggplot(aes(x = date, y = y, color = name)) +
  geom_path()
Created on 2022-09-27 with reprex v2.0.2
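Since the question asks for all series in a single table, the long result could also be reshaped so that each series gets its own column. This is a minimal sketch, assuming tidyr is installed. It uses a small stand-in for df: the "Food" series name and all values are made up for illustration, and only the column names name, date and y come from the code above.

```r
library(tidyr)

# Stand-in for the long-format `df` built above (values are illustrative only)
df_long <- data.frame(
  name = c("Total", "Total", "Food", "Food"),
  date = as.Date(c("2020-01-12", "2020-01-19", "2020-01-12", "2020-01-19")),
  y    = c(100, 100.1, 100, 101.5)
)

# One row per date, one column per series
df_wide <- pivot_wider(df_long, id_cols = date, names_from = name, values_from = y)
df_wide
```

With the real data, the same call on df (after dropping the character column x) should give one column per chart series.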
You're trying to scrape the wrong URL: the one you've provided embeds the chart in an iframe. You should take a close look at the source code of the iframe's own page instead: https://www-e61-in.filesusr.com/html/84f6c1_839cefc8bcc59c1cc688a6be6b4a5656.html
Ultimately I want to use postcodes for all state-funded secondary schools in England, but for now I'm trying to figure out what code I will need to use, so using a selection of just 5.
I want to retrieve the coordinates (so latitude and longitude) and the lsoa value for each postcode.
pc_list <- list(postcodes = c("PE7 3BY", "ME15 9AZ", "BS21 6AH", "SG18 8JB", "M11 2NA"))
pclist1 <- bulk_postcode_lookup(pc_list)
This returns all the information about those 5 postcodes. Now I want it just to return information on those 3 variables (latitude, longitude and lsoa) that I'm interested in.
pclist2 <- subset(pclist1, select = c(longitude, latitude, lsoa))
This returns the following error.
Error in subset.default(pclist1, select = c(longitude, latitude, lsoa)) :
argument "subset" is missing, with no default
Once I am able to get this information, I want to add these 3 variables along with their relevant postcode into a new dataframe that I can perform subsequent analysis on. Is this what pclist2 will be?
Slightly modified example from https://docs.ropensci.org/PostcodesioR/articles/Introduction.html#multiple-postcodes . For whatever reason, I only received successful responses when I removed the spaces from the postcodes:
library(PostcodesioR)
library(purrr)

pc_list <- list(postcodes = c("PE73BY", "ME159AZ", "BS216AH", "SG188JB", "M112NA"))
pclist1 <- bulk_postcode_lookup(pc_list)
# extract 2nd list item from each response (the "result" list)
bulk_list <- lapply(pclist1, "[[", 2)
# extract list of items from results lists, return tibble / data frame
bulk_df <- map_dfr(bulk_list, `[`, c("postcode", "longitude", "latitude", "lsoa"))
Resulting tibble / data frame :
#> # A tibble: 5 × 4
#> postcode longitude latitude lsoa
#> <chr> <dbl> <dbl> <chr>
#> 1 PE7 3BY -0.226 52.5 Peterborough 019D
#> 2 ME15 9AZ 0.538 51.3 Maidstone 013C
#> 3 BS21 6AH -2.84 51.4 North Somerset 005A
#> 4 SG18 8JB -0.249 52.1 Central Bedfordshire 006C
#> 5 M11 2NA -2.18 53.5 Manchester 015E
Created on 2023-01-13 with reprex v2.0.2
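The subset() error in the question arises because bulk_postcode_lookup() returns a nested list rather than a data frame, so the call falls through to subset.default(). The sketch below mimics that shape with a hand-built list and flattens it with base R only. The structure is an assumption based on the "result" element extracted above, and the values are two of the rows as printed (rounded) in the output.

```r
# Hand-built stand-in for the API response: one element per postcode,
# each holding the query and a "result" list (structure assumed, not fetched)
mock_response <- list(
  list(query = "PE73BY",
       result = list(postcode = "PE7 3BY", longitude = -0.226,
                     latitude = 52.5, lsoa = "Peterborough 019D")),
  list(query = "M112NA",
       result = list(postcode = "M11 2NA", longitude = -2.18,
                     latitude = 53.5, lsoa = "Manchester 015E"))
)

wanted <- c("postcode", "longitude", "latitude", "lsoa")

# Keep only the wanted fields of each result and bind them row-wise
rows <- lapply(mock_response, function(x) as.data.frame(x$result[wanted]))
mock_df <- do.call(rbind, rows)
mock_df
```

A data frame like mock_df is what subset(..., select = ...) expects, so further column selection works on it.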
So I am trying to write an automated report in R with functions. One of the questions I am trying to answer is this: "During the first week of the month, what were the 10 most viewed products? Show the results in a table with the product's identifier, category, and count of the number of views." To do this I wrote the following function:
most_viewed_products_per_week <- function (month,first_seven_days, views){
month <- views....February.2020.2
first_seven_days <- function( month, date_1, date_2){
date_1 <-2020-02-01
date_2 <- 2020-02-07
return (first_seven_days)}
views <-function(views, desc){
return (views.head(10))}
However the output I get is this:
function (month,first_seven_days, views){
month <- views....February.2020.2
first_seven_days <- function( month, date_1, date_2){
date_1 <-2020-02-01
date_2 <- 2020-02-07
return (first_seven_days)}
views <-function(views, desc){
return (views.head(10))}
How do I fix that?
This report has more questions like this, so I am trying to get my function writing as correct as possible from the start.
Thanks in advance,
It is a good practice to code in functions. Still I recommend you get your code doing what you want and then think about what parts you want to wrap in a function (for future re-use). This is to get you going.
In general: to support your analysis, make sure that your data is in the right class. I.e. dates are formatted as dates, numbers as double or integers, etc. This will give you access to many helper functions and packages.
For the case at hand, read up on {tidyverse}, in particular {dplyr} which can help you with coding pipes.
simulate data
As mentioned, you will find many friends on Stack Overflow if you provide a reproducible example.
Your question suggests your data looks a bit like the following simulated data.
Adapt as appropriate (or provide an example).
library(tibble) # tibble are modern data frames
library(dplyr) # for crunching tibbles/data frames
library(lubridate) # tidyverse package for date (and time) handling
df <- tribble( # create row-tibble
~date, ~identifier, ~category, ~views
,"2020-02-01", 1, "TV", 27
,"2020-02-02", 2, "PC", 40
,"2020-02-03", 1, "TV", 12
,"2020-02-03", 2, "PC", 2
,"2020-02-08", 3, "UV", 200
) %>%
mutate(date = ymd(date)) # date is read in as character - lubridate::ymd() converts it to a date
This yields
> df
# A tibble: 5 x 4
date identifier category views
<date> <dbl> <chr> <dbl>
1 2020-02-01 1 TV 27
2 2020-02-02 2 PC 40
3 2020-02-03 1 TV 12
4 2020-02-03 2 PC 2
5 2020-02-08 3 UV 200
Notice: date-column is in date-format.
work your algorithm
From your attempt it follows you want to extract the first 7 days.
Since we have a "date"-column, we can use a date-function to help us here.
{lubridate}'s day() extracts the "day-number".
> df %>% filter(day(date) <= 7)
# A tibble: 4 x 4
date identifier category views
<date> <dbl> <chr> <dbl>
1 2020-02-01 1 TV 27
2 2020-02-02 2 PC 40
3 2020-02-03 1 TV 12
4 2020-02-03 2 PC 2
Anything outside the first 7 days is gone.
Next you want to summarise to get your product views total.
df %>%
## ---------- c.f. above ------------
filter(day(date) <= 7) %>%
## ---------- summarise in bins that you need := groups -------
group_by(identifier, category) %>%
summarise(total_views = sum(views)
, .groups = "drop" ) # if grouping is not needed "drop" it
This gives you:
# A tibble: 2 x 3
identifier category total_views
<dbl> <chr> <dbl>
1 1 TV 39
2 2 PC 42
Now pick the top-10 and sort the order:
df %>%
## ---------- c.f. above ------------
filter(day(date) <= 7) %>%
group_by(identifier, category) %>%
summarise(total_views = sum(views), .groups = "drop" ) %>%
## ---------- make use of another helper function of dplyr
top_n(n = 10, total_views) %>% # note top-10 makes here no "real" sense :), try top_n(1, total_views)
arrange(desc(total_views)) # arrange in descending order on total_views
This yields:
# A tibble: 2 x 3
identifier category total_views
<dbl> <chr> <dbl>
1 2 PC 42
2 1 TV 39
wrap in function
Now that the workflow is in place, think about breaking your code into the blocks you think are useful.
I leave this to you. You can assign interim results to new data frames, wrap the preparation of the data in one function and the top_n() %>% arrange() step in another, and so on.
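The workflow above could be split into two small functions, for instance. This is only a sketch: the function names views_first_days() and top_products() are made up for illustration, and slice_max() is swapped in for top_n(), which recent dplyr versions mark as superseded.

```r
library(tibble)
library(dplyr)
library(lubridate)

# Same simulated data as above
df <- tribble(
  ~date,        ~identifier, ~category, ~views,
  "2020-02-01", 1,           "TV",      27,
  "2020-02-02", 2,           "PC",      40,
  "2020-02-03", 1,           "TV",      12,
  "2020-02-03", 2,           "PC",      2,
  "2020-02-08", 3,           "UV",      200
) %>%
  mutate(date = ymd(date))

# Step 1 (hypothetical helper): total views per product during the
# first `n_days` days of the month
views_first_days <- function(data, n_days = 7) {
  data %>%
    filter(day(date) <= n_days) %>%
    group_by(identifier, category) %>%
    summarise(total_views = sum(views), .groups = "drop")
}

# Step 2 (hypothetical helper): keep the `n` most viewed products, highest first
top_products <- function(data, n = 10) {
  data %>%
    slice_max(total_views, n = n) %>%
    arrange(desc(total_views))
}

result <- df %>%
  views_first_days() %>%
  top_products(n = 10)
result
```

Keeping the data preparation and the ranking separate means the first function can be reused for other questions in the report that need per-product totals.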
I am stuck with converting strings to times. I am aware that there are many topics on Stack regarding converting strings-to-times, however I couldn't fix this problem with the solutions.
I have a file with times like this:
> dput(df$Time[1:50])
c("1744.3", "2327.54", "1718.51", "2312.3200000000002", "1414.16",
"2046.15", "1442.5", "1912.22", "2303.2199999999998", "2146.3200000000002",
"1459.02", "1930.15", "1856.23", "2319.15", "1451.05", "25.460000000000036",
"1453.25", "2309.02", "2342.48", "2322.5300000000002", "2101.5",
"2026.07", "1245.04", "1945.15", "5.4099999999998545", "1039.5",
"1731.37", "2058.41", "2030.36", "1814.31", "1338.18", "1858.33",
"1731.36", "2343.38", "1733.27", "2304.59", "1309.47", "1916.11",
"1958.3", "1929.54", "1756.4", "1744.23", "1731.26", "1844.47",
"1353.25", "1958.3", "1746.44", "1857.53", "2047.15", "2327.2199999999998", "1915")
In this example, the times should be like this:
"1744.3" = 17:44:30
"2327.54" = 23:27:54
"1718.51" = 17:18:51
"2312.3200000000002" = 23:12:32
"25.460000000000036" = 00:25:46 # as you can see, the first two 00 are missing.
"1915" = 19:15:00
However, I have tried multiple things (and now I am even stuck with str_replace()). Hopefully someone knows how I can transform this.
What have I tried?
format(df$Time, "%H%M.%S") # Yes I know...
# So therefore I thought, lets replace the strings to get them in a proper format
# like HH:MM:SS. First step was to replace the "." for a ":"
str_replace("." , ":", df$Time) # this was leading to "." (don't know why)
And that was the point that I was so frustrated that I posted it on Stack. Hope that you guys can help me.
Many thanks in advance!
Here is a way to do this, storing the output from dput in x.
library(magrittr) # provides the %>% pipe
#Remove all the dots
gsub('\\.', '', x) %>%
#Select only first 6 characters
substr(1, 6) %>%
#Pad 0's at the end
stringr::str_pad(6,pad = '0', side = 'right') %>%
#Add colon (:) separator
sub('(.{2})(.{2})', '\\1:\\2:', .)
# [1] "17:44:30" "23:27:54" "17:18:51" "23:12:32" "14:14:16" "20:46:15"
# [7] "14:42:50" "19:12:22" "23:03:21" "21:46:32" "14:59:02" "19:30:15"
#[13] "18:56:23" "23:19:15" "14:51:05" "25:46:00" "14:53:25" "23:09:02"
Note that this can also be done without pipes; I use them here for clarity. From here you can convert the time to POSIXct format if needed.
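The conversion to POSIXct mentioned above might look like the following base R sketch. The three input strings are taken from the output above; the choice of tz = "UTC" is an assumption made only to keep the result independent of the local time zone.

```r
# A few of the "HH:MM:SS" strings produced above
times_chr <- c("17:44:30", "23:27:54", "17:18:51")

# No date is supplied, so the current date is filled in;
# tz = "UTC" is an arbitrary choice for this sketch
times_ct <- as.POSIXct(times_chr, format = "%H:%M:%S", tz = "UTC")

# Only the time-of-day part is meaningful here
format(times_ct, "%H:%M:%S")
```

Strings with an hour outside 00-23, such as the "25:46:00" entry in the output above, come back as NA from this conversion, so they need to be handled before this step.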
The main problem is the time "25.460000000000036". But I think I found a clear though somewhat verbose solution:
df %>%
mutate(hours = formatC(as.numeric(Time), width = 4, format = "d", flag = "0"),
seconds = as.numeric(str_extract(Time, "[.].+")) * 100) %>%
mutate(Time_new = stringi::stri_datetime_parse(paste0(hours, seconds), format = "HHmm.ss"))
#> # A tibble: 51 x 4
#> Time hours seconds Time_new
#> <chr> <chr> <dbl> <dttm>
#> 1 25.460000000000036 0025 46. 2020-02-19 00:25:46 # I changed the order of the times so the weird format is on top
#> 2 1744.3 1744 30 2020-02-19 17:44:30
#> 3 2327.54 2327 54 2020-02-19 23:27:54
#> 4 1718.51 1718 51 2020-02-19 17:18:51
#> 5 2312.3200000000002 2312 32. 2020-02-19 23:12:32
#> 6 1414.16 1414 16 2020-02-19 14:14:16
#> 7 2046.15 2046 15 2020-02-19 20:46:15
#> 8 1442.5 1442 50 2020-02-19 14:42:50
#> 9 1912.22 1912 22 2020-02-19 19:12:22
#> 10 2303.2199999999998 2303 22.0 2020-02-19 23:03:21
#> # ... with 41 more rows
If you also have times without fractions (i.e., without the dot) you could use this approach:
normalize_time <- function(t) {
  formatC(as.numeric(t) * 100, width = 6, format = "d", flag = "0")
}

df %>%
  mutate(Time_new = as.POSIXct(normalize_time(Time), format = "%H%M%S"))
A roundabout way of doing it
a data.table way
First, convert the strings in your vector to numeric, multiply by 100 (to move the relevant HMS digits before the decimal separator) and convert to integer. Then use sprintf() to add leading zeros to get a 6-digit string. Finally, convert to time.
data.table::as.ITime( sprintf( "%06d",
as.integer( as.numeric(time) * 100 ) ),
format = "%H%M%S" )
# [1] "17:44:30" "23:27:54" "17:18:51" "23:12:32" "14:14:16" "20:46:15" "14:42:50" "19:12:22" "23:03:21" "21:46:32" "14:59:02" "19:30:15"
# [13] "18:56:23" "23:19:15" "14:51:05" "00:25:46" "14:53:25" "23:09:02" "23:42:48" "23:22:53" "21:01:50" "20:26:07" "12:45:04" "19:45:15"
# [25] "00:05:40" "10:39:50" "17:31:37" "20:58:41" "20:30:36" "18:14:31" "13:38:18" "18:58:33" "17:31:36" "23:43:38" "17:33:27" "23:04:59"
# [37] "13:09:47" "19:16:11" "19:58:30" "19:29:54" "17:56:40" "17:44:23" "17:31:26" "18:44:47" "13:53:25" "19:58:30" "17:46:44" "18:57:53"
# [49] "20:47:15" "23:27:21"
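Some inputs carry floating-point noise just below the intended value (for example "2303.2199999999998"), and as.integer() truncates rather than rounds, which is why such entries show 21 seconds in the output above. If 22 seconds is the intended reading, which is an assumption about the data, rounding before the integer conversion is one option. A base R sketch on four of the inputs from the question:

```r
x <- c("1744.3", "2303.2199999999998", "25.460000000000036", "1915")

# Scale to HHMMSS, round away the floating-point noise, then zero-pad to 6 digits
hhmmss <- sprintf("%06d", as.integer(round(as.numeric(x) * 100)))

# Insert the colon separators
times <- sub("(..)(..)(..)", "\\1:\\2:\\3", hhmmss)
times
```

The same round() call could be placed inside the as.ITime() pipeline above in place of the bare as.integer() conversion.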
I'm trying to scrape this webpage using R : http://zipnet.in/index.php?page=missing_mobile_phones_search&criteria=browse_all (All the pages)
I'm new to programming. And everywhere I've looked, tables are mostly identified with IDs or Divs or Class. On this page there's none. Data is stored in Table format. How should I scrape it?
This is what I did :
webpage <- read_html("http://zipnet.in/index.php
tbls <- html_nodes(webpage, "table")
tbls_ls <- webpage %>%
html_nodes("table") %>%
.[9:10] %>%
html_table(fill = TRUE)
colnames(tbls_ls[[1]]) <- c("Mobile Make", "State", "District",
"Police Station", "Status", "Mobile Type(GSM/CDMA)",
"FIR/DD/GD Dat")
You can scrape the table data by targeting the css id of each table. It looks like each page is composed of 3 different tables pasted one after another. Two of the tables have #AutoNumber15 css id while the third (in the middle) has the #AutoNumber16 css id.
I put together a simple code example that should get you started in the right direction.
library(rvest)
library(dplyr)
library(purrr)

# define function to scrape the table data from a page
get_page <- function(page_id = 1) {
  # default link
  link <- "http://zipnet.in/index.php?page=missing_mobile_phones_search&criteria=browse_all&Page_No="
  # build link
  link <- paste0(link, page_id)
  # get tables data
  wp <- read_html(link)
  wp %>%
    html_nodes("#AutoNumber16, #AutoNumber15") %>%
    html_table(fill = TRUE) %>%
    bind_rows() # stack the tables of one page into a single data frame
}
# get the data from the first three pages
iter_page <- 1:3
# this is just a progress bar
pb <- progress_estimated(length(iter_page))
# this code will iterate over pages 1 through 3 and apply the get_page()
# function defined earlier. The Sys.sleep() part is used to pause the code
# after each iteration so that the server is not overloaded with requests.
map_df(iter_page, ~ {
  pb$tick()$print()
  df <- get_page(.x)
  Sys.sleep(sample(10, 1) * 0.1)
  as_tibble(df)
})
#> # A tibble: 72 x 4
#> X1 X2 X3
#> <chr> <chr> <chr>
#> 1 FIR/DD/GD Number 000165 State
#> 2 FIR/DD/GD Date 17/08/2017 District
#> 3 Mobile Type(GSM/CDMA) GSM Police Station
#> 4 Mobile Make SAMSUNG J2 Mobile Number
#> 5 Missing/Stolen Date 23/04/2017 IMEI Number
#> 6 Complainant AKEEL KHAN Complainant Contact Number
#> 7 Status Stolen/Theft Report Date/Time on ZIPNET
#> 8 <NA> <NA> <NA>
#> 9 FIR/DD/GD Number FIR No 37/ State
#> 10 FIR/DD/GD Date 17/08/2017 District
#> # ... with 62 more rows, and 1 more variables: X4 <chr>
After recently taking Hadley Wickham's functional programming class I decided I'd try applying some of the lessons to my projects at work. Naturally, the first project I tried has proven to be more complicated than the examples demonstrated in the class. Does anyone have recommendations for a way to use the purrr package to make the task described below more efficient?
Project Background
I need to assign quintile groups to records in a spatial polygon dataframe. In addition to the record identifier there are several other variables and I need to calculate the quintile group for each.
Here's the crux of the problem: I have been asked to identify outliers in one particular variable and to omit those records from the entire analysis as long as it doesn't change the quintile composition of the first quintile group for any of the other variables.
I have put together a dplyr pipeline (see the example below) that performs this checking process for a single variable, but how might I rewrite this process so that I can efficiently check each variable?
EDIT: While it is certainly possible to change the shape of the data from wide to long as an intermediary step, in the end it needs to return to its wide format so that it matches up with the @polygons slot of the spatial polygons dataframe.
Reproducible Example
You can find the complete script here: https://gist.github.com/tiernanmartin/6cd3e2946a77b7c9daecb51aa11e0c94
Libraries and Settings
library(grDevices) # boxplot.stats()
library(operator.tools) # %!in% logical operator
library(tmap) # 'metro' data set
library(magrittr) # piping
library(dplyr) # exploratory data analysis verbs
library(purrr) # recursive mapping of functions
library(tibble) # improved version of a data.frame
library(ggplot2) # dot plot
library(ggrepel) # avoid label overlap
Load the example data and take a small sample of it
m_spdf <- metro
# Take a sample
m <-
  metro@data %>%
  as_tibble %>%
  select(-name_long, -iso_a3) %>%
  sample_n(50) # 50 rows, matching the tibble printed below
> m
# A tibble: 50 x 10
name pop1950 pop1960 pop1970 pop1980 pop1990
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Sydney 1689935 2134673 2892477 3252111 3631940
2 Havana 1141959 1435511 1779491 1913377 2108381
3 Campinas 151977 293174 540430 1108903 1693359
4 Kano 123073 229203 541992 1349646 2095384
5 Omsk 444326 608363 829860 1032150 1143813
6 Ouagadougou 33035 59126 115374 265200 537441
7 Marseille 755805 928768 1182048 1372495 1418279
8 Taiyuan 196510 349535 621625 1105695 1636599
9 La Paz 319247 437687 600016 809218 1061850
10 Baltimore 1167656 1422067 1554538 1748983 1848834
# ... with 40 more rows, and 4 more variables:
# pop2000 <dbl>, pop2010 <dbl>, pop2020 <dbl>,
# pop2030 <dbl>
Calculate quintile groups with and without outlier records
# Calculate the quintile groups for one variable (e.g., `pop1990`)
m_all <-
m %>%
mutate(qnt_1990_all = dplyr::ntile(pop1990,5))
# Find the outliers for a different variable (e.g., 'pop1950')
# and subset the df to exclude these outlier records
m_out <- boxplot.stats(m$pop1950) %>% .[["out"]]
m_trim <-
m %>%
filter(pop1950 %!in% m_out) %>%
mutate(qnt_1990_trim = dplyr::ntile(pop1990,5))
# Assess whether the outlier trimming impacted the first quintile group
m_comp <-
m_trim %>%
select(name,dplyr::contains("qnt")) %>%
left_join(m_all,.,"name") %>%
select(name,dplyr::contains("qnt"),everything()) %>%
mutate(qnt_1990_chng_lgl = !is.na(qnt_1990_trim) & qnt_1990_trim != qnt_1990_all,
qnt_1990_chng_dir = if_else(qnt_1990_chng_lgl,
paste0(qnt_1990_all," to ",qnt_1990_trim),
"No change"))
With a little help from ggplot2, I can see that in this example six outliers were identified and that their omission did not affect the first quintile group for pop1990.
Importantly, this information is tracked in two new variables: qnt_1990_chng_lgl and qnt_1990_chng_dir.
> m_comp %>% select(name,qnt_1990_chng_lgl,qnt_1990_chng_dir,everything())
# A tibble: 50 x 14
name qnt_1990_chng_lgl qnt_1990_chng_dir qnt_1990_all qnt_1990_trim
<chr> <lgl> <chr> <dbl> <dbl>
1 Sydney FALSE No change 5 NA
2 Havana TRUE 4 to 5 4 5
3 Campinas TRUE 3 to 4 3 4
4 Kano FALSE No change 4 4
5 Omsk FALSE No change 3 3
6 Ouagadougou FALSE No change 1 1
7 Marseille FALSE No change 3 3
8 Taiyuan TRUE 3 to 4 3 4
9 La Paz FALSE No change 2 2
10 Baltimore FALSE No change 4 4
# ... with 40 more rows, and 9 more variables: pop1950 <dbl>, pop1960 <dbl>,
# pop1970 <dbl>, pop1980 <dbl>, pop1990 <dbl>, pop2000 <dbl>, pop2010 <dbl>,
# pop2020 <dbl>, pop2030 <dbl>
I now need to find a way to repeat this process for every variable in the dataframe (i.e., pop1960 - pop2030). Ideally, two new variables would be created for each existing pop* variable and their names would be preceded by qnt_ and followed by either _chng_dir or _chng_lgl.
Is purrr the right tool to use for this? dplyr::mutate_? data.table?
It turns out this problem is solvable using tidyr::gather + dplyr::group_by + tidyr::spread functions. While @shayaa and @Gregor didn't provide the solution I was looking for, their advice helped me course-correct away from the functional programming methods I was researching.
I ended up using @shayaa's gather and group_by combination, followed by mutate to create the variable names (qnt_*_chng_lgl and qnt_*_chng_dir) and then using spread to make it wide again. An anonymous function passed to summarize_all removed all the extra NAs that the wide-long-wide transformations created.
m_comp <-
m %>%
mutate(qnt = dplyr::ntile(pop1950,5)) %>%
filter(pop1950 %!in% m_out) %>%
gather(year,pop,-name,-qnt) %>%
group_by(year) %>%
mutate(qntTrim = dplyr::ntile(pop,5),
qnt_chng_lgl = !is.na(qnt) & qnt != qntTrim,
qnt_chng_dir = ifelse(qnt_chng_lgl,
paste0(qnt," to ",qntTrim),
"No change"),
year_lgl = paste0("qnt_chng_",year,"_lgl"),
year_dir = paste0("qnt_chng_",year,"_dir")) %>%
spread(year_lgl,qnt_chng_lgl) %>%
spread(year_dir,qnt_chng_dir) %>%
spread(year,pop) %>%
select(-qnt,-qntTrim) %>%
group_by(name) %>%
summarize_all(function(.){subset(.,!is.na(.)) %>% first})
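For comparison, the purrr route asked about in the question could look roughly like the sketch below, which keeps the data wide throughout. It is only an illustration under stated assumptions: toy is a made-up stand-in for m, quintile_changed() is a hypothetical helper, and only the *_chng_lgl flags are built (the *_chng_dir columns could be added the same way).

```r
library(dplyr)
library(purrr)
library(tibble)

# Toy stand-in for `m` (all values are made up for illustration)
toy <- tibble(
  name    = paste0("city", 1:10),
  pop1950 = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 1000), # the last value is an outlier
  pop1960 = c(10, 9, 8, 7, 6, 5, 4, 3, 2, 1),
  pop1970 = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
)

# Rows that survive the outlier trimming on pop1950
keep <- !(toy$pop1950 %in% boxplot.stats(toy$pop1950)$out)

# For one column: TRUE where the quintile differs between full and trimmed data
quintile_changed <- function(x) {
  q_all  <- ntile(x, 5)
  q_trim <- rep(NA_integer_, length(x))
  q_trim[keep] <- ntile(x[keep], 5)
  !is.na(q_trim) & q_trim != q_all
}

pop_vars  <- setdiff(grep("^pop", names(toy), value = TRUE), "pop1950")
new_names <- paste0("qnt_", sub("^pop", "", pop_vars), "_chng_lgl")

# Apply the check to every remaining pop* column and append the flags
flags  <- map(set_names(pop_vars, new_names), ~ quintile_changed(toy[[.x]]))
result <- bind_cols(toy, as_tibble(flags))
result
```

Because nothing is reshaped, the result keeps one row per record and can be matched back to the spatial data by name.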
There is nothing wrong with your analysis, it seems to me.
After this part
m <- metro@data %>%
  as_tibble %>%
  select(-name_long, -iso_a3) %>%
  sample_n(50)
Just melt your data and continue your analysis, but with group_by(year):
library(reshape2) # melt()
library(stringr) # str_sub()

mm <- melt(m)
mm[,2] <- as.factor(str_sub(mm[,2],-4))
names(mm)[2:3] <- c("year", "population")
mm %>% group_by(year) %>%
  mutate(qnt_all = dplyr::ntile(population,5))