Getting the text values using rvest

The page in question is this:
https://tolltariffen.toll.no/tolltariff/headings/03.02?language=en (Click on OPEN ALL LEVELS to get the complete data)
I'm using RSelenium to load the page, then getting the page source and using rvest to capture the required field. This is the data I'm trying to capture.
The code I've come up with so far splits some of the description data into multiple chunks, which is not useful for me.
x <- remdr$getPageSource()
xpg <- read_html(x[[1]])

# get the HS descriptions
treeView <- xpg %>%
  html_nodes(xpath = '//*/div[@class="MuiGrid-root MuiGrid-container MuiGrid-wrap-xs-nowrap"]') %>%
  html_nodes(xpath = '//*/p[contains(@class, "MuiTypography-body1")]') %>%
  html_nodes('span') %>%
  html_text(trim = TRUE)
I need all the descriptions in order as a list.
Update: the desired output format is two columns: the descriptions and the 8-digit code.

General thoughts:
RSelenium isn't strictly needed, so you can avoid the overhead of launching a browser. There is an API call, visible in the browser network tab, which supplies the content of interest, and it can be called without any additional request configuration (e.g. headers).
The question of how to extract the items you want from the API response, in the format you want, then becomes a fun challenge (at least to me), as we do not know: 1) how many levels of nesting there may be in this response (and possible future ones); 2) whether the level of nesting can vary across listings within a given response for the items of interest; 3) whether there will be a commodityCode at a given level (though the pattern appears to be that there is one at the deepest level for a given listing). We also need to consider how to generate columns/lists of equal length for output. These are just some starting considerations; below I discuss how I handled them.
The API call:
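As a rough sketch (using the same endpoint as the full code further down, and assuming no extra request configuration is needed, as noted above):
library(jsonlite)

# The endpoint visible in the browser network tab
resp <- read_json("https://tolltariffen.toll.no/api/search/headings/03.02")
names(resp)                  # inspect the top-level keys
data <- resp$headingItems    # the list of listings discussed below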
The API response:
This request returns nested JSON:
The content of interest is a list of named lists within the response, accessible via the parent key $headingItems.
Each of these named lists is nested as per the levels on the webpage.
The accessor key headingItems repeats at each level of nesting; the top-level one is the parent list stored in data in the code to follow.
Below that, indicated by the level key, are the expanded entries you are after, nested within the response JSON.
Finally, there is the description key, which contains HTML for the descriptive text you are after, with English and Norwegian versions of the text.
In addition to this, there is, where present, a commodityCode key within the nested headingItems.
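To see this nesting for yourself, a quick sketch is to print a truncated str() of one listing (the exact keys shown depend on the live response):
# Truncated view of the first listing: look for the repeated headingItems key,
# the level key, description$en / description$no and, at the deepest level, commodityCode
str(data[[1]], max.level = 3, list.len = 5)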
Approach and challenges:
Given that commodityCode can appear at different levels and may not be present (unless it is assumed to always be present at the greatest depth of a given listing), and that it is unknown how many levels of headingItems there can be, the approach I chose was to use regex on the names of each child named list to build boolean masks (for our purposes, logical vectors): one mask for the English descriptions and one for the commodity codes. I processed each child list separately, using purrr::map to apply a custom function that extracts the data as a data.table/data.frame.
In the example description/text mask, the TRUE values fall at the chained accessor names (the chaining depends on depth), and some accessor paths are repeated. Because of this, I do not use the mask to retrieve the names and extract the associated values directly. Instead, I keep the full TRUE/FALSE vectors, so that both vectors have equal length, and combine the two logical vectors as columns within a data.table alongside the entire set of values from the child list (see the sketch below):
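For illustration, the two masks built inside get_data (using the same regexes as the full code below) can be produced for a single listing like this:
y <- unlist(data[[1]], recursive = TRUE)

# TRUE where the flattened name corresponds to an English description
desc_mask <- grepl("(?:headingItems\\.)description\\.en$|^description.en$", names(y))

# TRUE where the flattened name ends in commodityCode
code_mask <- grepl("commodityCode$", names(y))

# Both masks have the same length as y, so they can sit alongside the values in a data.table
length(y); sum(desc_mask); sum(code_mask)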
This work is done within the custom function get_data, where I then do the following steps:
1. Filter for only the rows where there is a TRUE value, i.e. a value I wish to retrieve.
2. Apply a function that uses gsub() to remove non-breaking whitespace and read_html() to convert those descriptions which are actual HTML into text. N.B. Some entries are not actually HTML and are handled by the if statement; in those cases the input value is returned. At this point the codes and descriptions/text sit in a single column.
3. Use the booleans in commodity_code to set that column's value, where TRUE, to match the text column, wrapping in an if so that FALSE is replaced with NA.
4. Knowing that there is actually a one-row offset between a description and its associated code, where applicable, shift the commodity_code column values down one row to correctly align with the descriptions (an equivalent data.table::shift() version is sketched just below).
5. Keep only the rows where description_header_flag is TRUE.
6. Finally, remove the now-unneeded flag column.
This leaves me with a clean data.table to return from the function.
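As an aside, the one-row shift could equivalently be written with data.table::shift(), which some may find easier to read than the .I indexing used in the full code below (a sketch only; the posted code keeps the original approach):
# Inside get_data, equivalent to: t[, commodity_code := c(NA, commodity_code[.I - 1])]
t[, commodity_code := shift(commodity_code, n = 1, type = "lag")]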
Generating the final output:
As map() applying the custom function above to a list returns a list of data.tables, I then simply call rbindlist() to combine these into a single data.table:
df <- rbindlist(map(data, get_data))
This can then be written to csv for example.
fwrite(df, 'result.csv')
Example rows in df:
N.B. I return a data.table as you showed 2 columns in your desired output.
R:
library(jsonlite)
library(tidyverse)
library(rvest)
library(data.table)
get_data <- function(x) {
  # Flatten the nested listing so the chained accessor paths become the names
  y <- x %>% unlist(recursive = TRUE)
  # All values plus two logical masks: one flags English descriptions, one flags commodity codes
  t <- data.table(
    text = y,
    description_header_flag = grepl("(?:headingItems\\.)description\\.en$|^description.en$", names(y)),
    commodity_code = grepl("commodityCode$", names(y))
  )
  t <- t[description_header_flag | commodity_code, ]
  # Strip non-breaking spaces and, where an entry is actual html, convert it to text
  t$text <- map2(t$text, t$description_header_flag, ~ gsub(intToUtf8(160), " ", if (.y & str_detect(.x, pattern = "<div>|<p>")) {
    html_text(read_html(.x))
  } else {
    .x
  }))
  # Where the commodity_code flag is TRUE, copy the code over from the text column; otherwise NA
  t$commodity_code <- map2(t$commodity_code, t$text, ~ if (.x) {
    .y
  } else {
    NA
  })
  # Shift the codes down one row so each aligns with its description
  t[, commodity_code := c(NA, commodity_code[.I - 1])]
  # Keep only the description rows and drop the helper flag
  t <- t[description_header_flag == TRUE, ]
  t[, description_header_flag := NULL]
  return(t)
}
data <- jsonlite::read_json("https://tolltariffen.toll.no/api/search/headings/03.02") %>% .$headingItems
df <- rbindlist(map(data, get_data))
fwrite(df, "result.csv")
Credits:
gsub solution taken from: @shabbychef here
row shift solution adapted from: @Gary Weissman here

Related

How can I create a variable with different names depending on the number of an iteration in R?

Hello, I am trying to create a for loop where a variable is created depending on the value of a column that only has 10 possible values. Ideally the for loop filters the data using dplyr by that number and then rewrites the variable, sliced to only the first 15 observations.
I created the following but it doesn't work:
for (i in 1:10){
  mvendidos[[i]] <- filter(dff, grupo == i)
  mvendidos[[i]] <- slice(dff.1:15)
}
You need to cast the type of i. Add %>% as.character() and do not forget to create the "container" list:
mvendidos <- list()
for (i in 1:10){
  mvendidos[[i %>% as.character()]] <- filter(dff, grupo == i)
  # mvendidos[[i %>% as.character()]] <- slice(dff.1:15) # Commented out as the syntax is doubtful
}
By the way, I doubt that dff.1:15 is correct, but it is not part of your question and the structure of dff is not available.
Appended
I would also like to draw your attention to the %>% operator. You can read about dplyr pipes here.
In this case you can replace it with the base R (4.1+) operator |> as well.
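As a further aside, if the end goal is simply the first 15 rows per group, a loop-free sketch (assuming dff has a grupo column, as in the question) could be:
# Split by group, then keep the first 15 rows of each piece
mvendidos <- lapply(split(dff, dff$grupo), function(d) head(d, 15))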

R Apply a jsonlite::unbox command to a list

I want to communicate with an API that needs a certain format, see below:
library(jsonlite)
list(limits = list("Overall_Wave3/0" = unbox("14000"),
                   "Overall_Wave3/1" = unbox("14005")))
which gives (note the indexes of that list are [x]):
$limits
$limits$`Overall_Wave3/0`
[x] "14000"
$limits$`Overall_Wave3/1`
[x] "14005"
Now in my real-life use case I would need to create hundreds of such elements within a list, so I need to somehow automate things. My input will be a data frame (or tibble) that I need to put into that format. I got this working, however, only without successfully doing the unbox, i.e. here's how far I got:
library(tidyverse)
library(jsonlite)
dat <- data.frame(marker = c("Overall_Wave3/0", "Overall_Wave3/0"),
                  value = c(14000, 14005)) %>%
  mutate(value = as.character(value))
args <- as.list(dat$value)
names(args) <- dat$marker
list(limits = args)
which gives (note that the indexes are now [1] instead of [x]):
$limits
$limits$`Overall_Wave3/0`
[1] "14000"
$limits$`Overall_Wave3/0`
[1] "14005"
Now converting both lists to a JSON body with toJSON(...) gives different results:
First command gives: {"limits":{"Overall_Wave3/0":"14000","Overall_Wave3/0.1":"14005"}}
Second command gives: {"limits":{"Overall_Wave3/0":["14000"],"Overall_Wave3/0.1":["14005"]}}
The second command has unnecessary square brackets around the numbers that must not be there. I know I could probably do a hack with a string replace, but I would strongly prefer a solution that works right from the start (if it can be done within the tidyverse, I wouldn't be too sad about it).
Thanks.
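For what it's worth, here is a minimal sketch of two ways this could be done (not taken from an accepted answer; both assume the dat data frame defined above):
args <- as.list(dat$value)
names(args) <- dat$marker

# Option 1: unbox each element of the list before building the body
toJSON(list(limits = lapply(args, unbox)))

# Option 2: let toJSON collapse all length-1 vectors itself
toJSON(list(limits = args), auto_unbox = TRUE)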

Making list of factors in a function but return warning error

Let's say that I have these vectors:
time <- c(306,455,1010,210,883,1022,310,361,218,166)
status <- c(1,1,0,1,1,0,1,1,1,1)
gender <- c(1,1,1,1,1,1,2,2,1,1)
And I turn it into these data frame:
dataset <- data.frame(time, status, gender)
I want to list the factors in the third column using this function (P.S.: pardon the immaturity, I'm still learning):
getFactor <- function(dataset){
  result <- list()
  result["Factors"] <- unique(dataset[[3]])
  return(result)
}
And all I get is this:
getFactor(dataset)
$Factors
[1] 1
Warning message:
In result["Factors"] <- unique(dataset[[3]]) :
number of items to replace is not a multiple of replacement length
I tried using levels, but all I get is an empty list. My question is: (1) why does this happen? and (2) is there any other way that I can get the list of factors in a function?
The solution is simple: you just need double brackets around "Factors" :)
In the function
result[["Factors"]] <- unique(dataset[[3]])
That should be the line.
The double brackets return an element, single brackets return that selection as a list.
Sounds silly, but try this:
test <- list()
class(test["Factors"])
class(test[["Factors"]])
The first class will be of type 'list'. The second will be of type 'NULL'. This is because the single brackets returns a subset as a list, and the double brackets return the element itself. It's useful depending on the scenario. The element in this case is "NULL" because nothing has been assigned to it.
The error "number of items to replace is not a multiple of replacement length" is because you've asked it to put 3 things into a single element (that element is a list). When you use double brackets you actually put it inside a list, where you can have multiple elements, so it can work!
Hope that makes sense!
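Putting that together, a minimal sketch of the corrected function run against the example data above:
getFactor <- function(dataset){
  result <- list()
  # Double brackets: store the whole vector of unique values as one list element
  result[["Factors"]] <- unique(dataset[[3]])
  return(result)
}

getFactor(dataset)
# $Factors
# [1] 1 2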
Currently, when you create your data frame, dataset$gender is a double (numeric) vector (which R will create automatically if everything in it is a number). If you want it to be a factor, you can declare it that way at the beginning:
dataset <- data.frame(time, status, gender = as.factor(gender))
Or coerce it to be a factor later:
dataset$gender <- as.factor(gender)
Then getting a vector of the levels is simple, without writing a function:
level_vector <- levels(dataset$gender)
level_vector
A note on subsetting in your function: dataset[[3]] does extract the third column as a vector; dataset[, 3] is an equivalent, more explicit form for data frames, and the first element of a list is accessed with list[[1]].

running apply (or variant) like an embedded loop

I'd like to do something like an embedded loop, but using apply functions, the goal of which is to check various conditions prior to moving on to the next part of my program.
I have two objects, a list of product descriptions, which can be created as follows:
test_products <- list(c("dingdong","small","affordable","polished"),
                      c("wingding","medium","cheap","dull"),
                      c("doodad","big","expensive","shiny"))
And a data frame of combinations of features that are not allowed, where each row represents a disallowed combination of features. A sample data frame can be created as follows:
disallowed <- data.frame(trait1 = c("dingdong","wingding","doodad"),
                         trait2 = c("medium","big","big"),
                         stringsAsFactors = FALSE)
My goal is to check each product against each of the disallowed combinations as efficiently as possible. So far I can check one product against all prohibitions as follows (in this case, the third product):
apply(disallowed, 1, function(x) x %in% unlist(test_products[[3]]))
OR I can check all products against one of the disallowed combinations of traits (the third combination).
lapply(test_products, function(x) disallowed[3,] %in% x)
Is it possible to check all products against all rows of the data frame of disallowed feature combination, without using a loop?
My end result should look something like this:
Product 1: OK
Product 2: OK
Product 3: NOT OK
Since Product 3 runs afoul of the third disallowed row.
There are definitely more elegant ways, but I am going to share my thoughts on this.
First, here is the code I used to create the disallowed data frame:
# Create a data frame showing disallowed traits
disallowed <- data.frame(trait1 = c("dingdong","wingding","doodad"),
                         trait2 = c("medium","big","big"),
                         stringsAsFactors = FALSE)
I then created a function called violate, which has two arguments. The first argument, product, is a character vector. The second argument, check_df, is the data frame that contains the disallowed traits.
The output of violate is a logical vector: TRUE means that both traits from that row of check_df are present in the product.
# Create the violate function
violate <- function(product, check_df){
  temp_df <- as.data.frame(lapply(check_df, function(Col) Col %in% product))
  temp_vec <- apply(temp_df, 1, function(Row) sum(Row) == 2)
  return(temp_vec)
}
# Test the violate function
violate(test_products[[3]], check_df = disallowed)
# [1] FALSE FALSE TRUE
After that, I applied the violate function with sapply over the test_products list. The results from violate are evaluated to see whether all disallowed checks are FALSE.
# Apply the violate function and check if all results from violate are FALSE
sapply(test_products, function(product){
  sum(violate(product, check_df = disallowed)) == 0})
# [1] TRUE TRUE FALSE
As you can see, the third element of the results is FALSE, indicating that the third product is not OK, while product 1 and product 2 are OK because the final results from sapply are both TRUE.
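If you want the output in the "Product n: OK/NOT OK" form shown in the question, one possible final step is a sketch like this, built on the sapply result above:
ok <- sapply(test_products, function(product){
  sum(violate(product, check_df = disallowed)) == 0})

# One line per product
cat(sprintf("Product %d: %s", seq_along(ok), ifelse(ok, "OK", "NOT OK")), sep = "\n")
# Product 1: OK
# Product 2: OK
# Product 3: NOT OK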

R: Logical from 2 vectors on pattern match

Trying to clean up some dirty data (for work): my data frame has a column for customer information (for our example, let's say store and product) in a long weird string, as well as a column for store and a column for product. I can parse the store and the product from the string. Here is where I arrive at my problem.
Let's say (consider these vectors part of a larger data frame, prefixed with data$ if that helps; I was just working with them as vectors, thinking it might speed up the code by not having to pull the whole data frame):
WeirdString <- c("fname: john; lname:smith; store:Amazon Inc.; product:Echo",
                 "fname: cindy; lname:smith; store:BestBuy; product:Ps-4",
                 "fname: jon; lname:smith; store:WALMART; product:Pants")
so I parse this to be:
WS_Store <- c("Amazon Inc.", "BestBuy", "WALMART")
WS_Prod <- c("Echo", "Ps-4", "Pants")
What's in the tables (i.e. the non-parsed columns) is:
DB_Store <- c("Amazon", "BEST BUY", "Other")
DB_Prod <- c("ECHO", "PS4", "Jeans")
I am currently using a for loop, looping over i, to grepl the "true" string against the parsed string. This takes forever, and I know R was designed to use vectorized code. So my question is: how do I eliminate the loop and use something like lapply (which I tried, and failed at, because I'm not savvy enough with lapply), or some other vectorized approach?
My current code:
for(i in 1:nrow(data)){ # could be i in length(DB_Prod) or whatever, all vectors are the same length
  Diff_Store[i] <- !grepl(DB_Store[i], WS_Store[i], ignore.case = T)
  Diff_Prod[i]  <- !grepl(DB_Prod[i],  WS_Prod[i],  ignore.case = T)
}
I intend to append those columns back into the dataframe, as the true goal is to diagnose why the database has this problem.
If there's a better way than this, rather than trying to vectorize it, I'm open to it. The data in DB_Store is restricted to a specific number of "stores" (in the table it comes from), but in the string it seems to be open-ended, which is why I use the DB value as the pattern, not the x. Product is similar, but not as restricted, which is why some have dashes and some don't. I would love to match "close things" like Ps-4 vs. PS4, but I will probably just build a table of matches once I see how weird the string gets. In truth, the string may not match at all, which is represented by the Pants/Jeans pair. The dataset is 2.5 million records, and there are many different "stores" and "products", and I do want to make sure they match on the same line, not "is it in the database" (previous questions seem to ask whether a string is in a list of strings, rather than a 1:1 comparison, and the last such question ended in a loop, which takes minutes or hours to run).
Thanks!
Please check if this works for you:
check <- function(vec_a, vec_b){
  mat <- cbind(vec_a, vec_b)
  diff <- apply(mat, 1, function(x) !grepl(pattern = x[1], x = x[2], ignore.case = TRUE))
  diff
}
Use your different vectors for stores (or products) in the arguments vec_a and vec_b, respectively (example: diff_stores <- check(DB_Store, WS_Store) ). This function will return a logical vector with TRUE values referring to items that weren't a match in the two original vectors. Is this what you wanted?
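If you would rather skip building the matrix, mapply() walks the two vectors in parallel; a sketch using the example vectors from the question (grepl takes pattern first, then x, matching the loop above):
Diff_Store <- !mapply(grepl, DB_Store, WS_Store, MoreArgs = list(ignore.case = TRUE), USE.NAMES = FALSE)
Diff_Prod  <- !mapply(grepl, DB_Prod,  WS_Prod,  MoreArgs = list(ignore.case = TRUE), USE.NAMES = FALSE)

Diff_Store
# [1] FALSE  TRUE  TRUE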
