I'm trying to scrape data from https://3g.dxy.cn/newh5/view/pneumonia , using SelectorGadget and rvest
I successfully scrape some text in the page with the following code.
library(rvest)
url <- 'https://3g.dxy.cn/newh5/view/pneumonia'
webpage <- read_html(url)
TEXT_html <- html_nodes(webpage,'.descText___Ui3tV')
TEXT <- html_text(TEXT_html)
But when I try to select the table where the most important data (the number of people infected in the table) see selection using the following code
TABLE_html <- html_nodes(webpage,'.areaBlock1___3V3UU p')
TABLE <- html_text(TABLE_html)
the output is "character 0"
I guess it's because the data in the table can't be seen because they are refreshed via API, but I don't really know how to solve this
Anybody has an idea? thank you very much
On this page, the data isn't retrieved seperately from an API. It is actually present in the html page you downloaded, but it is in JSON format inside a script tag and the reason why rvest can't read it is that the data is only added to the DOM by Javascript after the page loads. To get at the data you need to extract and parse the JSON:
library(rvest)
library(tibble)
library(jsonlite)
data <- 'https://3g.dxy.cn/newh5/view/pneumonia' %>%
read_html() %>%
html_node('#getAreaStat') %>% # This is the tag containing the JSON
html_text() %>% # Get the javascript from the node
strsplit("(getAreaStat = )|(}catch)") %>% # Carve out the JSON
unlist() %>%
`[`(2) %>% # Unlist and extract the JSON
fromJSON() # Parse the JSON
Now data is a data frame containing all the information from the JSON. However, the last column of data is actually a list of city-level data frames. Since these all have the same column names they can be bound together with rbind. The final column can then be removed from data so you have a data frame of province-level data and another one of city-level data.
city_data <- as_tibble(do.call(rbind, data$cities))
province_data <- as_tibble(data[, -8])
So province_data looks like a bit like this (the Chinese symbols haven't copied over but appear in the R console)
province_data
#> # A tibble: 33 x 7
#> provinceName provinceShortNa~ confirmedCount suspectedCount curedCount deadCount
#> <chr> <chr> <int> <int> <int> <int>
#> 1 ??? ?? 2714 0 52 100
#> 2 ??? ?? 207 0 4 0
#> 3 ??? ?? 173 0 3 0
#> 4 ??? ?? 168 0 0 1
#> 5 ??? ?? 143 0 0 0
#> 6 ??? ?? 132 0 0 0
#> 7 ??? ?? 106 0 0 0
#> 8 ??? ?? 95 0 0 0
#> 9 ??? ?? 91 0 2 1
#> 10 ??? ?? 90 0 0 0
#> # ... with 23 more rows, and 1 more variable: comment <chr>
and city_data looks like this (again, with cityName printed properly in the console).
#> # A tibble: 329 x 5
#> cityName confirmedCount suspectedCount curedCount deadCount
#> <chr> <int> <int> <int> <int>
#> 1 ?? 1590 0 47 85
#> 2 ?? 213 0 2 4
#> 3 ?? 173 0 0 1
#> 4 ?? 114 0 0 3
#> 5 ?? 91 0 0 0
#> 6 ?? 71 0 1 2
#> 7 ?? 70 0 0 0
#> 8 ?? 70 0 0 0
#> 9 ?? 65 0 0 0
#> 10 ?? 57 0 0 0
#> # ... with 319 more rows
Related
The goal is to read the 1-5yr GIC rates for Guaranteed Investment Certificate - Long-Term and Compound Interest under the Non-Cashable GICs tab.
Selector Gadget tells me that the css identifier is #container-9565195e5e .cmp-chart__chart span. Using rvest:
page <- read_html('https://www.td.com/ca/en/personal-banking/products/saving-investing/gic-rates-canada/')
page %>%
html_nodes("#container-9565195e5e .cmp-chart__chart span")
# {xml_nodeset (5)}
# [1] <span data-source="tdct-gic" data-view="single" data-filter-item="productId:315|minimumDepositAmt:0.01|minimumTermYearCnt:1" data-value="postedRate"></span>
# [2] <span data-source="tdct-gic" data-view="single" data-filter-item="productId:315|minimumDepositAmt:0.01|minimumTermYearCnt:2" data-value="postedRate"></span>
# [3] <span data-source="tdct-gic" data-view="single" data-filter-item="productId:315|minimumDepositAmt:0.01|minimumTermYearCnt:3" data-value="postedRate"></span>
# [4] <span data-source="tdct-gic" data-view="single" data-filter-item="productId:315|minimumDepositAmt:0.01|minimumTermYearCnt:4" data-value="postedRate"></span>
# [5] <span data-source="tdct-gic" data-view="single" data-filter-item="productId:315|minimumDepositAmt:0.01|minimumTermYearCnt:5" data-value="postedRate"></span>}
rvest can't read the actual rates because of the use of JavaScript on the site.
Turning to RSelenium using the same css selector results in an error:
remDr$navigate("https://www.td.com/ca/en/personal-banking/products/saving-investing/gic-rates-canada/")
webElem <- remDr$findElement(using = "css", "#container-9565195e5e .cmp-chart__chart span")
# Selenium message:Unable to locate element: {"method":"css selector","selector":"#container-9565195e5e .cmp-chart__chart span"}
# For documentation on this error, please visit: http://seleniumhq.org/exceptions/no_such_element.html
# Build info: version: '2.53.1', revision: 'a36b8b1', time: '2016-06-30 17:37:03'
# System info: host: 'ef4080d2cb73', ip: '172.17.0.2', os.name: 'Linux', os.arch: 'amd64', os.version: '5.4.0-135-generic', java.version: '1.8.0_91'
# Driver info: driver.version: unknown
#
# Error: Summary: NoSuchElement
# Detail: An element could not be located on the page using the given search parameters.
# class: org.openqa.selenium.NoSuchElementException
# Further Details: run errorDetails method
So how do I use RSelenium to read the 1-5yr rates for Guaranteed Investment Certificate - Long-Term and Compound Interest for Non-registered and Registered (TFSA, RSP, RIF, RESP)
Replaced RSelenium with Chromote (which is on its way to rvest: r4ds, gh). The selector in question seems to refer to another table, Long-Term and Simple Interest. While values are currently the same, still switched to the one mentioned in question.
library(chromote)
library(rvest)
b <- ChromoteSession$new()
# Display the current session in the Chromote browser:
# b$view()
b$Page$navigate("https://www.td.com/ca/en/personal-banking/products/saving-investing/gic-rates-canada/")
b$Page$loadEventFired()
# Non-Cashable GICs >> Guaranteed Investment Certificate - Long-Term and Compound Interest
b$Runtime$evaluate("document.querySelector('#container-8a263227af table').outerHTML")$result$value %>%
minimal_html() %>%
html_element("table") %>%
html_table()
#> # A tibble: 5 × 2
#> Term `Non-registered and Registered (TFSA, RSP, RIF, RESP)`
#> <chr> <chr>
#> 1 1 year 4.65%
#> 2 2 years 4.35%
#> 3 3 years 3.75%
#> 4 4 years 4%
#> 5 5 years 4.05%
### Few alternatives
# evalute js in runtime:
sapply(1:5, \(x) b$Runtime$evaluate(paste0("document.querySelector('[data-filter-item=\"productId:703|minimumDepositAmt:0.01|minimumTermYearCnt:",x,"\"]').innerText"))$result$value)
#> [1] "4.65" "4.35" "3.75" "4" "4.05"
doc <- b$DOM$getDocument()
# elements where "data-filter-item" attribute starts with "productId:703|minimumDepositAmt:0.01|minimumTermYearCnt:"
nodeids <- b$DOM$querySelectorAll(doc$root$nodeId, '[data-filter-item^="productId:703|minimumDepositAmt:0.01|minimumTermYearCnt:"]')
sapply(nodeids$nodeIds, \(x) b$DOM$getOuterHTML(x) %>% minimal_html() %>% html_text())
#> [1] "4.65" "4.35" "3.75" "4" "4.05"
# close session
b$close()
#> [1] TRUE
Created on 2023-01-21 with reprex v2.0.2
The page does an initial POST request that gets all the data (let's call it master) for all the options. It then uses the various data-filter-item attribute values associated with a given table's cells e.g., data-filter-item="productId:703|minimumDepositAmt:0.01|minimumTermYearCnt:1", to filter the master data to the items needed to update that table.
You can replicate a simplified version of this POST request, create a DataFrame of all values in the value part of the response (think of it like a master reference table), then extract the required filters from a request to the original URI
> filters
[1] "productId:703|minimumDepositAmt:0.01|minimumTermYearCnt:1" "productId:703|minimumDepositAmt:0.01|minimumTermYearCnt:2"
[3] "productId:703|minimumDepositAmt:0.01|minimumTermYearCnt:3" "productId:703|minimumDepositAmt:0.01|minimumTermYearCnt:4"
[5] "productId:703|minimumDepositAmt:0.01|minimumTermYearCnt:5"
and turn those into a DataFrame for filtering.
You can then subset the master table using the smaller DataFrame. Column names will be matched upon if set in the master using the key values from the key:value response.
Finally, update the table, when extracted from request response for initial webpage, by updating the relevant column with the rate % from the filtered master DataFrame.
The html from the initial webpage is invalid so the target table selection was not as straightforward as I would like.
The CSS selector list I went with was designed with hopefully a longer shelf-life, in terms of remaining valid for longer than a more brittle path might.
One other thing to show might be the response from the POST request. The response has the following key:value format, where I use the key column to generate headers for my master DataFrame. The values get turned into the master DataFrame of all possible rates (and other dynamic page info.)
Credit:
I took the approach used by #akrun in their answer here, whereby read.dcf is used to map out a set of rows, with potentially repeated/new headers, into a single DataFrame with all headers present. NA is entered if a particular entry is not present in a given processed row.
This allowed me to turn this list of split filtering instructions:
> lapply(filters, str_split, "\\|") %>% unlist(recursive = F)
[[1]]
[1] "productId:703" "minimumDepositAmt:0.01" "minimumTermYearCnt:1"
[[2]]
[1] "productId:703" "minimumDepositAmt:0.01" "minimumTermYearCnt:2"
[[3]]
[1] "productId:703" "minimumDepositAmt:0.01" "minimumTermYearCnt:3"
[[4]]
[1] "productId:703" "minimumDepositAmt:0.01" "minimumTermYearCnt:4"
[[5]]
[1] "productId:703" "minimumDepositAmt:0.01" "minimumTermYearCnt:5"
into this:
> data_df
productId minimumDepositAmt minimumTermYearCnt
1 703 0.01 1
2 703 0.01 2
3 703 0.01 3
4 703 0.01 4
5 703 0.01 5
i.e. the set of filtering instructions for the master DataFrame as a DataFrame
The master DataFrame looking as follows:
> df %>% head()
productId minimumDepositAmt maximumDepositAmt minimumTermYearCnt maximumTermYearCnt minimumTermDayCnt maximumTermDayCnt postedRate
1 107 0.01 4999.99 0 0 90 119 4
2 107 5000 9999.99 0 0 90 119 4
3 107 10000 24999.99 0 0 90 119 4
4 107 25000 49999.99 0 0 90 119 4
5 107 50000 99999.99 0 0 90 119 4
6 107 100000 249999.99 0 0 90 119 4
minimumMarketGrowthRate maximumMarketGrowthRate stepperYear1Rate stepperYear2Rate stepperYear3Rate stepperYear4Rate stepperYear5Rate
1 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0
The subset master DataFrame:
> filtered_df
productId minimumDepositAmt minimumTermYearCnt maximumDepositAmt maximumTermYearCnt minimumTermDayCnt maximumTermDayCnt postedRate
1 703 0.01 1 4999.99 1 0 364 4.65
2 703 0.01 2 4999.99 2 0 364 4.35
3 703 0.01 3 4999.99 3 0 364 3.75
4 703 0.01 4 4999.99 4 0 364 4
5 703 0.01 5 4999.99 5 0 364 4.05
minimumMarketGrowthRate maximumMarketGrowthRate stepperYear1Rate stepperYear2Rate stepperYear3Rate stepperYear4Rate stepperYear5Rate
1 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0
The extracted table, from initial page, before update:
> table
# A tibble: 5 × 2
Term `Non-registered and Registered (TFSA, RSP, RIF, RESP)`
<chr> <chr>
1 1 year %
2 2 years %
3 3 years %
4 4 years %
5 5 years %
And the table after update using master (df - data from POST request to get rates info):
> print(table)
# A tibble: 5 × 2
Term `Non-registered and Registered (TFSA, RSP, RIF, RESP)`
<chr> <chr>
1 1 year 4.65%
2 2 years 4.35%
3 3 years 3.75%
4 4 years 4%
5 5 years 4.05%
r:
library(rvest)
library(tidyverse)
library(httr2)
page <- read_html("https://www.td.com/ca/en/personal-banking/personal-investing/products/gic/gic-rates-canada")
table_node <- page %>%
html_element('div.container:contains("Guaranteed Investment Certificate - Long-Term") .text:contains("Compound") ~ div table')
filters <- table_node %>%
html_elements("[data-filter-item]") %>%
html_attr("data-filter-item")
res <- request("https://www.td.com/ca/en/personal-banking/getRates/") %>%
req_headers(
"user-agent" = "Mozilla/5.0",
"content-type" = "application/json",
"x-kl-ajax-request" = "Ajax_Request"
) %>%
req_body_json(list("errorText" = "Unable to get the rate", "ratesType" = "gic")) %>%
req_perform() %>%
resp_body_string()
data <- jsonlite::parse_json(res, simplifyVector = T)
df <- set_names(data$value %>% as.data.frame(), data$key)
data_df <- map_dfr(lapply(filters, str_split, "\\|") %>%
unlist(recursive = F), ~ {
new <-
if (length(.x) > 0) {
as.data.frame(read.dcf(textConnection(.x)))
} else {
NULL
}
})
filtered_df <- inner_join(data_df, df)
table <- table_node %>% html_table()
table[2] <- str_c(filtered_df$postedRate, table[[2]])
print(table)
I have been working hours on that and I simply cannot find any solution to the problem. Hopefully someone here can help.
I'm trying to create a personal choice matrix for some data with the following structure:
# A tibble: 2,152 x 32
age choice canton lr_s dist_svp dist_fdp dist_bdp dist_cvp dist_glp dist_sp
<dbl> <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 39 sp GE 3 49 25 25 4 16 1
2 67 sp ZH 0 100 49 64 4 25 0
3 42 svp ZH 7 4 4 1 36 4 36
dist_gps pid_svp pid_fdp pid_bdp pid_cvp pid_glp pid_sp pid_gps french italian
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0 0 0 0 0 0 0 1 1 0
2 9 0 0 0 0 0 1 0 0 0
3 36 0 0 0 0 0 1 0 0 0
Now, I need to create a personal choice matrix with the 7 alternatives that are indicated by dist_* / pid_* in the columns.
This should, according to my understanding, work with the following code:
work.pc <- mlogit.data(work,
varying = c(5:11, 12:18),
choice = "choice",
shape = "wide",
sep = "_")
However, when I run this code, I get the following Error message and a few Warning messages:
Error: Assigned data `ids` must be compatible with existing data.
x Existing data has 15064 rows.
x Assigned data has 2152 rows.
ℹ Only vectors of size 1 are recycled.
Run `rlang::last_error()` to see where the error occurred.
In addition: Warning messages:
1: Setting row names on a tibble is deprecated.
2: Setting row names on a tibble is deprecated.
3: Setting row names on a tibble is deprecated.
4: Setting row names on a tibble is deprecated.
5: Setting row names on a tibble is deprecated.
6: Setting row names on a tibble is deprecated.
7: Setting row names on a tibble is deprecated.
What's the issue here? I'm grateful for any help! I've tried everything.
Problem solved: the tibble "work" has to be converted into dataframe.
After using
work <- as.data.frame(work)
the code functions properly i.e. the Error message is eliminated.
Blockquote
I got eye tracking gaze data in the form of x/y coordinates and timestamps.
Now I want to plot the saccades using the R package saccades. Unfortunately, it doesn't work. I guess it's a matter of having the data in the wrong format.
My data:
> View(EUFKDCDL_Q09AS_saccades_2)
> head(EUFKDCDL_Q09AS_saccades)
# A tibble: 6 x 4
time x y trial
<dbl> <dbl> <dbl> <dbl>
1 1550093577941 732 391 1
2 1550093577962 706 320 1
3 1550093577980 666 352 1
4 1550093578000 886 288 1
5 1550093578017 787 221 1
6 1550093578037 729 302 1
The code that didn't work:
> fixations <- detect.fixations(EUFKDCDL_Q09AS_saccades)
Error in detect.fixations(EUFKDCDL_Q09AS_saccades) :
No saccades were detected. Something went wrong.
The full code that shouldwork according github (it'swith the sample data):
> library(saccades)
> data(samples)
> head(samples)
time x y trial
1 0 53.18 375.73 1
2 4 53.20 375.79 1
3 8 53.35 376.14 1
4 12 53.92 376.39 1
5 16 54.14 376.52 1
6 20 54.46 376.74 1
> fixations <- detect.fixations(samples)
> head(fixations[c(1,4,5,10)])
trial x y dur
0 1 53.81296 377.40741 71
1 1 39.68156 379.58711 184
2 1 59.99267 379.92467 79
3 1 18.97898 56.94046 147
4 1 40.28365 39.03599 980
5 1 47.36547 35.39441 1310
> diagnostic.plot(samples, fixations)
So there must be a problem with how my data is structured I guess? What does the mean?
I hope that any of you can help me creating this saccade plot as in the sceenshot attached
I am an R newbie as well...please be patient with me. :D
I've got an R code that works and does what I want but It takes a huge time to run. Here is an explanation of what the code does and the code itself.
I've got a vector of 200000 line containing street adresses (String) : data.
Example :
> data[150000,]
address
"15 rue andre lalande residence marguerite yourcenar 91000 evry france"
And I have a matrix of 131x2 string elements which are 5grams (part of word) and the ids of the bags of NGrams (example of a 5Grams bag : ["stack", "tacko", "ackov", "ckover", ",overf", ... ] ) : list_ngrams
Example of list_ngrams :
idSac ngram
1 4 stree
2 4 tree_
3 4 _stre
4 4 treet
5 5 avenu
6 5 _aven
7 5 venue
8 5 enue_
I have also a 200000x31 numerical matrix initialized with 0 : idv_x_bags
In total I have 131 5-grams and 31 bags of 5-grams.
I want to loop the string addresses and check whether it contains one of the n-grams in my list or not. If it does, I put one in the corresponding column which represents the id of the bag that contains the 5-gram.
Example :
In this address : "15 rue andre lalande residence marguerite yourcenar 91000 evry france". The word "residence" exists in the bag ["resid","eside","dence",...] which the id is 5. So I'm gonna put 1 in the column called 5. Therefore the corresponding line "idv_x_bags" matrix will look like the following :
> idv_x_sacs[150000,]
4 5 6 8 10 12 13 15 17 18 22 26 29 34 35 36 42 43 45 46 47 48 52 55 81 82 108 114 119 122 123
0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Here is the code that does :
idv_x_sacs <- matrix(rep(0,nrow(data)*31),nrow=nrow(data),ncol=31)
colnames(idv_x_sacs) <- as.vector(sqldf("select distinct idSac from list_ngrams order by idSac"))$idSac
for(i in 1:nrow(idv_x_bags))
{
for(ngram in list_ngrams$ngram)
{
if(grepl(ngram,data[i,])==TRUE)
{
idSac <- sqldf(sprintf("select idSac from list_ngramswhere ngram='%s'",ngram))[[1]]
idv_x_bags[i,as.character(idSac)] <- 1
}
}
}
The code does perfectly what I aim to do, but it takes about 18 hours which is huge. I tried to recode it with c++ using Rcpp library but I encountered many problems. I'm tried to recode it using apply, but I couldn't do it.
Here is what I did :
apply(cbind(data,1:nrow(data),1,function(x){
apply(list_ngrams,1,function(y){
if(grepl(y[2],x[1])==TRUE){idv_x_bags[x[2],str_trim(as.character(y[1]))]<-1}
})
})
I need some help with coding my loop using apply or some other method that run faster that the current one. Thank you very much.
Check this one and run the simple example step by step to see how it works.
My N-Grams don't make much sense, but it will work with actual N_Grams as well.
library(dplyr)
library(reshape2)
# your example dataset
dt_sen = data.frame(sen = c("this is a good thing", "this is bad"), stringsAsFactors = F)
dt_ngr = data.frame(id_ngr = c(2,2,2,3,3,3),
ngr = c("th","go","tt","drf","ytu","bad"), stringsAsFactors = F)
# sentence dataset
dt_sen
sen
1 this is a good thing
2 this is bad
#ngrams dataset
dt_ngr
id_ngr ngr
1 2 th
2 2 go
3 2 tt
4 3 drf
5 3 ytu
6 3 bad
# create table of matches
expand.grid(unique(dt_sen$sen), unique(dt_ngr$id_ngr)) %>%
data.frame() %>%
rename(sen = Var1,
id_ngr = Var2) %>%
left_join(dt_ngr, by = "id_ngr") %>%
group_by(sen, id_ngr,ngr) %>%
do(data.frame(match = grepl(.$ngr,.$sen))) %>%
group_by(sen,id_ngr) %>%
summarise(sum_success = sum(match)) %>%
mutate(match = ifelse(sum_success > 0,1,0)) -> dt_full
dt_full
Source: local data frame [4 x 4]
Groups: sen
sen id_ngr sum_success match
1 this is a good thing 2 2 1
2 this is a good thing 3 0 0
3 this is bad 2 1 1
4 this is bad 3 1 1
# reshape table
dt_full %>% dcast(., sen~id_ngr, value.var = "match")
sen 2 3
1 this is a good thing 1 0
2 this is bad 1 1
I have the following data frame which I imported into R using read.table() (I incorporated read.table() within read_data() which is a function I created that also throw messages in case the file name is not written appropriately):
> raw_data <- read_data("n44.txt")
[1] #### Reading txt file ####
> head(raw_data)
subject block trial_num soa target_identity prime_type target_type congruency prime_exposure target_exposure button_pressed rt ac
1 99 1 1 200 82 9 1 9 0 36 1 1253 1
2 99 1 2 102 95 2 1 2 75 36 1 1895 1
3 99 1 3 68 257 2 2 1 75 36 2 1049 1
4 99 1 4 68 62 9 1 9 0 36 1 1732 1
5 99 1 5 34 482 9 3 9 0 36 3 765 1
6 99 1 6 68 63 9 1 9 0 36 1 2027 1
Then I'm using raw_data within the early_prep() function I created (I copied only the relevant part of the function):
early_prep <- function(file_name, keep_rows = NULL, id = NULL){
if (is.null(id)) {
# Stops running the function
stop("~~~~~~~~~~~ id is missing. Please provide name of id column ~~~~~~~~~~~")
}
# Call read_data() function
raw_data <- read_data(file_name)
if (!is.null(keep_rows)) {
raw_data <- raw_data[keep_rows, ]
# Print to console
print("#### Deleting unnecesarry rows in raw_data ####", quote = FALSE)
}
print(dim(raw_data))
print(head(raw_data))
return(raw_data)
}
}
My problem is with raw_data <- raw_data[keep_rows, ].
When I enter keep_rows = "raw_data$block > 1" this is what I get:
> x1 <- early_prep(file_name = "n44.txt", keep_rows = "raw_data$block > 1", id = "subject")
[1] #### Reading txt file ####
[1] #### Deleting unnecesarry rows in raw_data ####
[1] 1 13
subject block trial_num soa target_identity prime_type target_type congruency prime_exposure target_exposure button_pressed rt ac
NA NA NA NA NA NA NA NA NA NA NA NA NA NA
How can I solve this so it will only delete the rows I want?
Any help will be greatly appreciated
Best,
Ayala
The problem is that you pass the condition as a string and not as a real condition, so R can't evaluate it when you want it to.
if you still want to pass it as string you need to parse and eval it in the right place for example:
cond = eval(parse(text=keep_rows))
raw_data = raw_data[cond,]
This should work, I think