Which selector to write in rvest package in R?

I am trying to extract information from the source code of a specific website.
The source code contains these lines:
# [[4]]
# <script type="text/javascript">
# <![CDATA[
# <!-- // <![CDATA[
# var wp_dot_addparams = {
# "cid": "148938",
# "ctype": "article",
# "ctags": "dziejesiewkulturze,piraci z karaibów,Charlie Hebdo,Scorpions",
# "cauthor": "",
# "csource": "film.wp.pl",
# "cpageno": 1,
# "cpagemax": 1,
# "cdate": "2015-02-18"
# };
# // ]]]]><![CDATA[> -->
# ]]>
# </script>
From which I'd like to extract:
"ctags": "dziejesiewkulturze,piraci z karaibów,Charlie Hebdo,Scorpions",
Does anyone know how I should specify the selector in html_nodes function in rvest package in R?
html("http://film.wp.pl/id,148938,title,dziejesiewkulturze-Codzienna-dawka-informacji-kulturalnych-180215-WIDEO,wiadomosc.html") %>%
html_nodes("script")

Extract the JSON object from the element's text (tidy the selector up while you're at it)
Parse it as a list using jsonlite's fromJSON() function.
You can access it directly using "$ctags"
library(rvest) # read_html() replaces the deprecated html()
library(jsonlite)
json <- read_html("http://film.wp.pl/id,148938,title,dziejesiewkulturze-Codzienna-dawka-informacji-kulturalnych-180215-WIDEO,wiadomosc.html") %>%
html_nodes("script:contains('var wp_dot_addparams')") %>%
gsub(x=., pattern=".*var wp_dot_addparams = (\\{.*\\});.*",replacement="\\1") %>%
fromJSON()
json$ctags
[1] "dziejesiewkulturze,piraci z karaibów,Charlie Hebdo,Scorpions"
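The regex and parsing steps can be tried offline before touching the live page. The following is a minimal sketch, assuming a shortened stand-in for the script text (script_text is made up for illustration, with fewer fields than the real page):

```r
library(jsonlite)

# Hypothetical stand-in for the text of the matched <script> element
script_text <- '
// <![CDATA[
var wp_dot_addparams = {
  "cid": "148938",
  "ctype": "article",
  "ctags": "dziejesiewkulturze,Charlie Hebdo,Scorpions",
  "cpageno": 1
};
// ]]>
'

# Keep only the {...} literal assigned to wp_dot_addparams.
# With R's default regex engine, "." also matches newlines.
json_text <- sub(".*var wp_dot_addparams = (\\{.*\\});.*", "\\1", script_text)

# Parse the literal into a named list and pull out the tags
params <- fromJSON(json_text)
params$ctags

# The tags arrive as one comma-separated string; split it if a vector is needed
tags <- strsplit(params$ctags, ",", fixed = TRUE)[[1]]
tags
```

This only works because the object literal on the page happens to be valid JSON (quoted keys, no trailing commas); a looser JavaScript literal would need cleaning first.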

Related

Click button using R + httr

I'm trying to scrape randomly generated names from a website.
library(httr)
library(rvest)
url <- "https://letsmakeagame.net//tools/PlanetNameGenerator/"
mywebsite <- read_html(url) %>%
html_nodes(xpath="//div[contains(#id,'title')]")
However, that does not work. I'm assuming I have to «click» the «generate» button before extracting the content. Is there a simple way (without RSelenium) to achieve that?
Something similar to:
POST(url,
body = list("EntryPoint.generate()" = T),
encode = "form") -> res
res_t <- content(res, as="text")
Thanks!
rvest isn't much help here: the planet names are not requested from a remote service but generated locally with JavaScript, which is what the EntryPoint.generate() call does. A relatively simple way is to use chromote, though closing its session and process seems kind of messy at the moment:
library(chromote)
b <- ChromoteSession$new()
{
b$Page$navigate("https://letsmakeagame.net/tools/PlanetNameGenerator")
b$Page$loadEventFired()
}
# call EntryPoint.generate(), read result from <p id="title"></p> element,
# replicate 10x
replicate(10, b$Runtime$evaluate('EntryPoint.generate();document.getElementById("title").innerText')$result$value)
#> [1] "Torade" "Ukiri" "Giconerth" "Dunia" "Brihoria"
#> [6] "Tiulaliv" "Giahiri" "Zuthewei 4A" "Elov" "Brachomia"
b$close()
#> [1] TRUE
b$parent$close()
#> Error in self$send_command(msg, callback = callback_, error = error_, : Chromote object is closed.
b$parent$get_browser()$close()
#> [1] TRUE
Created on 2023-01-25 with reprex v2.0.2

Namespace without prefix in XML in R

In the XML package in R, it is possible to create a new xmlTree object with a namespace, e.g. using:
library(XML)
d = xmlTree("foo", namespaces = list(prefix = "url"))
d$doc()
# <?xml version="1.0"?>
# <foo xmlns:prefix="url"/>
How do I create a default namespace, without a prefix, such that it looks like the following?
# <?xml version="1.0"?>
# <foo xmlns="url"/>
The following does not produce what I expected.
library(XML)
d = xmlTree("foo", namespaces = list("url"))
d$doc()
# <?xml version="1.0"?>
# <url:foo xmlns:url="<dummy>"/>
There seems to be a difference between nameless lists and lists with an empty name in R.
1 - A nameless list:
list("url")
# [[1]]
# [1] "url"
names(list("url"))
# NULL
2 - A named list:
list(prefix = "url")
# $prefix
# [1] "url"
names(list(prefix = "url"))
# [1] "prefix"
3 - An incorrectly initialised empty-name list:
list("" = "url")
# Error: attempt to use zero-length variable name
4 - A hacky way to initialise an empty-name list:
setNames(list(prefix = "url"), "")
# [[1]]
# [1] "url"
names(setNames(list(prefix = "url"), ""))
# [1] ""
It would seem that 1. and 4. are identical; in the XML package, however, they produce different results. The first gives the incorrect XML mentioned in the question, whereas option 4. produces:
library(XML)
d = xmlTree("foo", namespaces = setNames(list(prefix = "url"), ""))
d$doc()
# <?xml version="1.0"?>
# <foo xmlns="url"/>
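The distinction between a nameless list and an empty-name list can be checked in base R without the XML package. A small sketch (the variable names are illustrative only):

```r
unnamed     <- list("url")
empty_named <- setNames(list("url"), "")

is.null(names(unnamed))          # TRUE: there is no names attribute at all
names(empty_named)               # a names attribute holding one empty string
identical(unnamed, empty_named)  # FALSE: the attributes differ

# Assigning names directly builds the same empty-name list as setNames()
also_empty <- list("url")
names(also_empty) <- ""
identical(also_empty, empty_named)  # TRUE
```

Both lists print the same way, which is why the difference is easy to miss; code that inspects names(), as xmlTree apparently does, can still tell them apart.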

Getting different results from the same function inside and outside a mutate function call

Can someone explain to me why I get a different result when I run the convertToDisplayTime function inside mutate than when I run it on its own? The correct result is the one I obtain when I run it on its own. Also, why do I get these warnings? It feels like I might be passing the whole timeInSeconds column as an argument when I call convertToDisplayTime in the mutate function, but I'm not sure that I really understand the mechanics in play here.
library('tidyverse')
#> Warning: package 'tibble' was built under R version 4.1.2
convertToDisplayTime <- function(timeInSeconds){
## Takes a time in seconds and converts it
## to a xx:xx:xx string format
if(timeInSeconds>86400){ #Not handling time over a day
stop(simpleError("Enter a time below 86400 seconds (1 day)"))
} else if(timeInSeconds>3600){
numberOfMinutes = 0
numberOfHours = timeInSeconds%/%3600
remainingSeconds = timeInSeconds%%3600
if(remainingSeconds>60){
numberOfMinutes = remainingSeconds%/%60
remainingSeconds = remainingSeconds%%60
}
if(numberOfMinutes<10){displayMinutes = paste0("0",numberOfMinutes)}
else{displayMinutes = numberOfMinutes}
remainingSeconds = round(remainingSeconds)
if(remainingSeconds<10){displaySeconds = paste0("0",remainingSeconds)}
else{displaySeconds = remainingSeconds}
return(paste0(numberOfHours,":",displayMinutes,":", displaySeconds))
} else if(timeInSeconds>60){
numberOfMinutes = timeInSeconds%/%60
remainingSeconds = timeInSeconds%%60
remainingSeconds = round(remainingSeconds)
if(remainingSeconds<10){displaySeconds = paste0("0",remainingSeconds)}
else{displaySeconds = remainingSeconds}
return(paste0(numberOfMinutes,":", displaySeconds))
} else{
return(paste0("0:",timeInSeconds))
}
}
(df <- tibble(timeInSeconds = c(2710.46, 2705.04, 2691.66, 2708.10)) %>% mutate(displayTime = convertToDisplayTime(timeInSeconds)))
#> Warning in if (timeInSeconds > 86400) {: the condition has length > 1 and only
#> the first element will be used
#> Warning in if (timeInSeconds > 3600) {: the condition has length > 1 and only
#> the first element will be used
#> Warning in if (timeInSeconds > 60) {: the condition has length > 1 and only the
#> first element will be used
#> Warning in if (remainingSeconds < 10) {: the condition has length > 1 and only
#> the first element will be used
#> # A tibble: 4 x 2
#> timeInSeconds displayTime
#> <dbl> <chr>
#> 1 2710. 45:10
#> 2 2705. 45:5
#> 3 2692. 44:52
#> 4 2708. 45:8
convertToDisplayTime(2710.46)
#> [1] "45:10"
convertToDisplayTime(2705.04)
#> [1] "45:05"
convertToDisplayTime(2691.66)
#> [1] "44:52"
convertToDisplayTime(2708.10)
#> [1] "45:08"
Created on 2022-01-06 by the reprex package (v2.0.1)
As mentioned in the comments, the problem here is that your function is not vectorized: it takes a single value as input and returns a single value. It does not work when the input is a vector of values, hence the "condition has length > 1" warning you get:
1: Problem with `mutate()` column `displayTime`.
ℹ `displayTime = convertToDisplayTime(timeInSeconds)`.
ℹ the condition has length > 1 and only the first element will be used
Here, when you use dplyr::mutate, you are passing the whole column to your function as a vector, which the function is not written to handle. Each if then looks only at the first element, so every row follows the branches chosen for 2710.46, which is why the zero padding is missing in rows 2 and 4.
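A minimal base-R sketch of the same failure, using made-up values rather than the data above: if() needs a single TRUE or FALSE, while ifelse() tests every element. Since R 4.2.0 a condition of length greater than one is an error rather than a warning, so the tryCatch() below handles both cases.

```r
secs <- c(10, 5, 52, 8)  # illustrative values, not taken from the question

# if() expects one TRUE/FALSE: older R warns and uses only secs[1],
# R >= 4.2.0 stops with an error. Catch either so the script keeps running.
outcome <- tryCatch(
  {
    if (secs < 10) "pad" else "no pad"
  },
  warning = function(w) "warning",
  error = function(e) "error"
)
outcome

# ifelse() is vectorised: the test is applied to each element in turn
padded <- ifelse(secs < 10, paste0("0", secs), as.character(secs))
padded
```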
Several options you may consider:
1. The "fast and ugly" way:
df <- data.frame(timeInSeconds = c(2710.46, 2705.04, 2691.66, 2708.10))
## This one does not work
df %>% mutate(displayTime = convertToDisplayTime(timeInSeconds))
## This one works
df %>%
rowwise() %>%
mutate(displayTime = convertToDisplayTime(timeInSeconds)) %>%
ungroup()
dplyr::rowwise() makes dplyr::mutate() work on each row independently rather than on whole columns. I assume this is the behavior you initially expected. dplyr::ungroup() then reverts rowwise(), i.e. it goes back to the default column-wise behavior.
I may be a little harsh on this one, but this is the kind of trick that I used back when I did not quite understand my way around dataframes and their manipulation...
2. Vectorize directly from your dplyr verbs:
df %>%
mutate(displayTime = base::mapply(convertToDisplayTime, timeInSeconds))
## or
df %>%
mutate(displayTime = purrr::map_chr(timeInSeconds, convertToDisplayTime))
Both options are similar.
3. Vectorize your function:
convertToDisplayTime_vec <- base::Vectorize(convertToDisplayTime)
# class(convertToDisplayTime_vec)
df %>% mutate(displayTime = convertToDisplayTime_vec(timeInSeconds))
## or
convertToDisplayTime_vec2 <- function(timeInSeconds_vec) {
mapply(FUN = convertToDisplayTime, timeInSeconds_vec)
}
# class(convertToDisplayTime_vec2)
df %>%
mutate(displayTime = convertToDisplayTime_vec2(timeInSeconds))
# Still works on single variables!
# convertToDisplayTime_vec2(6475)
This is my favourite option, as once it is implemented you can use it on single values, vectors or dataframes without worrying about it.
A little documentation to dig a little into the subject.
PS: As an aside, a little tip worth remembering: you may want to be careful when manipulating data.frame and tibble objects. Despite their similarity, they have slight differences, and some functions deal differently with one or the other, or actually convert one to the other without your noticing...
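A further possibility, not among the options above, is to rewrite the function so that it is vectorised natively, with arithmetic and sprintf() in place of if/else. The sketch below is an assumption-laden rewrite rather than a drop-in replacement: convert_to_display_time is a made-up name, and it rounds to whole seconds before splitting into hours, minutes and seconds, so values under a minute are rounded too (the original leaves them unrounded).

```r
# Hypothetical vectorised rewrite of convertToDisplayTime()
convert_to_display_time <- function(seconds) {
  # Same upper limit as the original: nothing over one day
  stopifnot(is.numeric(seconds), all(seconds >= 0 & seconds <= 86400))

  total <- as.integer(round(seconds))  # whole seconds, rounded once up front
  h <- total %/% 3600L
  m <- (total %% 3600L) %/% 60L
  s <- total %% 60L

  # sprintf() is vectorised; %02d zero-pads to two digits
  ifelse(h > 0,
         sprintf("%d:%02d:%02d", h, m, s),
         sprintf("%d:%02d", m, s))
}

times <- c(2710.46, 2705.04, 2691.66, 2708.10)
convert_to_display_time(times)
convert_to_display_time(6475)
```

Because every step operates on whole vectors, this version can be called directly inside mutate() without rowwise(), mapply() or Vectorize().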

How can I input a single additional parameter to disk.frame's inmapfn at readin?

According to the article https://diskframe.com/articles/ingesting-data.html, a good use case for inmapfn as part of csv_to_disk.frame(...) is date conversion. In my data I only know the name of the date column at runtime, and I would like to pass that name to a function that converts the dates at read-in time. One issue I am having is that it doesn't seem possible to pass any additional parameters to the inmapfn argument beyond the chunk itself. I can't hardcode the column name, as it isn't known until runtime.
To clarify, the issue is that inmapfn seems to run in its own environment to prevent data races and other parallelisation issues. I know the variable won't be changed, so I am hoping there is some way to override this, as I can make sure that it is safe.
I know the function I am calling works when called on an arbitrary dataframe.
I have provided a reproducible example below.
library(tidyverse)
library(disk.frame)
setup_disk.frame()
a <- tribble(~dates, ~val,
"09feb2021", 2,
"21feb2012", 2,
"09mar2013", 3,
"20apr2021", 4,
)
write_csv(a, "a.csv")
dates_col <- "dates"
tmp.df <- csv_to_disk.frame(
"a.csv",
outdir = file.path(tempdir(), "tmp.df"),
in_chunk_size = 1L,
inmapfn = function(chunk) {
chunk[, sdate := as.Date(do.call(`$`, list(chunk,dates_col)), "%d%b%Y")]
}
)
#> -----------------------------------------------------
#> Stage 1 of 2: splitting the file a.csv into smallers files:
#> Destination: C:\Users\joelk\AppData\Local\Temp\RtmpcFBBkr\file4a1876e87bf5
#> -----------------------------------------------------
#> Stage 1 of 2 took: 0.020s elapsed (0.000s cpu)
#> -----------------------------------------------------
#> Stage 2 of 2: Converting the smaller files into disk.frame
#> -----------------------------------------------------
#> csv_to_disk.frame: Reading multiple input files.
#> Please use `colClasses = ` to set column types to minimize the chance of a failed read
#> =================================================
#>
#> -----------------------------------------------------
#> -- Converting CSVs to disk.frame -- Stage 1 of 2:
#>
#> Converting 5 CSVs to 6 disk.frames each consisting of 6 chunks
#>
#> Error in do.call(`$`, list(chunk, dates_col)): object 'dates_col' not found
You can experiment with different backend and chunk_reader arguments. For example, if you set the backend to readr, the user-defined inmapfn function will have access to previously defined variables. Furthermore, readr guesses column types and will automatically parse a column as Date if it recognizes the string format as a date (it wouldn't recognize the format in your example data as a date, however).
If you don't want to use the readr backend for performance reasons, then I would ask if your example correctly represents your actual scenario? I'm not seeing the need to pass in the date column as a variable in the example you provided.
There is a working solution in the Just-in-time transformation section of the link you provided, and I'm not seeing any added complexities between that example and yours.
If you really need to use the default backend and chunk_reader plan AND you really need to send the inmapfn function a previously defined variable, you can wrap the csv_to_disk.frame call in a wrapper function:
library(tibble) # for tribble()
library(disk.frame)
setup_disk.frame()
df <- tribble(~dates, ~val,
"09feb2021", 2,
"21feb2012", 2,
"09mar2013", 3,
"20apr2021", 4,
)
write.csv(df, file.path(tempdir(), "df.csv"), row.names = FALSE)
wrap_csv_to_disk <- function(col) {
my_date_col <- col
csv_to_disk.frame(
file.path(tempdir(), "df.csv"),
in_chunk_size = 1L,
inmapfn = function(chunk, dates = my_date_col) {
chunk[, dates] <- lubridate::dmy(chunk[[dates]])
chunk
})
}
date_col <- "dates"
df_disk_frame <- wrap_csv_to_disk(date_col)
str(collect(df_disk_frame)$dates)
#> Date[1:4], format: "2021-02-09" "2012-02-21" "2013-03-09" "2021-04-20"
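The wrapper works because the inner function carries the column name along with it, here as a default argument. The same idea can be written as a function factory and tried on a plain data frame first. In this sketch make_date_parser() is a hypothetical helper, not part of disk.frame, and whether such a closure reaches disk.frame's workers intact with every backend is an assumption that has not been checked here. Parsing month abbreviations with %b also assumes an English locale.

```r
# Hypothetical factory: returns a one-argument function suitable as an inmapfn
make_date_parser <- function(date_col, format = "%d%b%Y") {
  # Evaluate the arguments now so the returned function owns fixed copies
  force(date_col)
  force(format)
  function(chunk) {
    chunk[[date_col]] <- as.Date(chunk[[date_col]], format = format)
    chunk
  }
}

# Column name only known at runtime
parse_dates <- make_date_parser("dates")

# Try it on an ordinary data frame standing in for one chunk
chunk <- data.frame(dates = c("09feb2021", "21feb2012"), val = c(2, 2))
out <- parse_dates(chunk)
str(out$dates)
```

If this behaves as hoped, parse_dates could be passed as inmapfn = parse_dates in place of the inline function.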
I see. For a work around would it be possible to do something like this?
date_var = known_at_runtime()
saveRDS(date_var, "some/path/date_var.rds")
a = csv_to_disk.frame(files, inmapfn = function(chunk) {
date_var = readRDS("some/path/date_var.rds")
# do the rest
})
I think letting inmapfn take other options is doable; see https://github.com/xiaodaigh/disk.frame/issues/377 for tracking.

Replacing the attribute value of an htmltools::tag

Say I have the following tag:
library(htmltools)
t = div(name = 'oldname')
I can overwrite the 'name' attribute of this tag using t$attribs$name = 'newname' but prefer using htmltools getters/setters, does the package have a function that facilitates this?
Looking through the package manual, the only function that allows for the manipulation of tag attributes is tagAppendAttributes, which only appends the new attribute value to the original:
t = tagAppendAttributes(t, name = 'newname')
t
#<div name="oldname newname"></div>
Does the absence of a helper function that overwrites the value of an attribute mean that tag attributes are not meant to be overwritten?
You're probably overthinking this. Look at the code for tagAppendAttributes:
tagAppendAttributes
#> function (tag, ...)
#> {
#> tag$attribs <- c(tag$attribs, list(...))
#> tag
#> }
All it does is take whatever you pass and write directly to tag$attribs. If you unclass your object you'll see it's just a list really:
unclass(t)
#> $name
#> [1] "div"
#>
#> $attribs
#> $attribs$name
#> [1] "oldname"
#>
#>
#> $children
#> list()
I can see why writing directly to an object's data member rather than using a setter might not feel right if you come from an object-oriented programming background, but this is clearly a "public" data member in an informal S3 class. Setting it directly is no more likely to break it than any other implementation.
If you really want to I suppose you could define a setter:
tagSetAttributes <- function(tag, ...) {tag$attribs <- list(...); tag}
tagSetAttributes(t, name = "new name")
#> <div name="new name"></div>
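Note that the setter above replaces the whole attribute list, so any other attributes on the tag are dropped. If only the named attributes should be overwritten, a variant along these lines may be closer to what is wanted. tagReplaceAttributes() is a hypothetical name, not an htmltools function:

```r
library(htmltools)

# Hypothetical helper: overwrite only the attributes passed in, keep the rest
tagReplaceAttributes <- function(tag, ...) {
  new <- list(...)
  # Assigning by name replaces matching entries and adds any missing ones
  tag$attribs[names(new)] <- new
  tag
}

t <- div(name = "oldname", class = "box")
t2 <- tagReplaceAttributes(t, name = "newname")

t2$attribs$name   # the overwritten value
t2$attribs$class  # untouched
```

One caveat of this sketch: if a tag carries the same attribute name more than once, assignment by name only touches the first occurrence.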
