The page in question is this:
https://tolltariffen.toll.no/tolltariff/headings/03.02?language=en (Click on OPEN ALL LEVELS to get the complete data)
I'm using RSelenium to load the page and then getting the pagesource and using rvest to capture the required field. This is the data I'm trying to capture.
The code I've come up with so far splits some of the descriptions into multiple chunks, which is not useful for me.
x <- remdr$getPageSource()
xpg <- read_html(x[[1]])
# get the HS descriptions
treeView <- xpg %>%
html_nodes(xpath = '//*/div[@class="MuiGrid-root MuiGrid-container MuiGrid-wrap-xs-nowrap"]') %>%
html_nodes(xpath = '//*/p[contains(@class, "MuiTypography-body1")]') %>%
html_nodes('span') %>%
html_text(trim = TRUE)
I need all the descriptions in order as a list.
Update: This is the output format. Descriptions and the 8-digit code
General thoughts:
RSelenium isn't strictly needed, and you can avoid the overhead of launching a browser. There is an API call, you can see in the browser network tab, which supplies the content of interest, and this can be called with no requirement for additional configuration of the request e.g. headers.
The question of how to extract the items you want from the API response, in the format you want, then becomes a fun challenge (at least to me) as we do not know 1) how many levels of nesting there may be in this response (and possible future ones) 2) whether the level of nesting can vary across listings within a given response for the items of interest 3) whether there will be a commodityCode at a given level (though the pattern appears to be that there is one at the deepest level for a given listing); and we need to consider how we generate columns/lists of equal length for output. These are just some starting considerations that I go on to discuss how I handled below.
The API call:
The API response:
This request returns nested JSON:
The content of interest is a list of named lists, within the response, accessible via the parent "key" $headingItems:
Each of these named lists is nested as per the levels on the webpage:
You can see the repeated accessor key of headingItems (red boxed), with the first shown above as the parent list stored in data in code to follow.
Below that, indicated by level (orange boxed), are the expanded entries you are after; nested within the response JSON.
Finally, we have the descriptions (green boxed) which contains html for the descriptive text you are after, with English and Norwegian versions of the text:
In addition to this, there is, where present, a commodityCode key within the nested headingItems:
Approach and challenges:
Given that the commodityCode can be at different levels and may not be present (unless assumed to always be present at greatest depth of a given listing), and that it is unknown how many levels of headingItem there can be, the approach I chose was to use regex to identify the relevant child named list's names in a boolean mask (though for purposes here we could just say logical vector); one mask for English headers and one for the commodity codes. I processed each child list separately, using purrr::map and applying a custom function to extract data as a data.table/data.frame.
Example mask (descriptions|text):
The TRUE values are for the following chained accessors (chaining dependent on depth):
Notice how some accessor paths are repeated. This means therefore, that I do not use the mask to retrieve the names and extract the associated values. Instead, I keep the TRUE and FALSE values and thereby have equal lengths for both vectors. I combine the two logical vectors as columns within a data.table; along with the entire set of values within the child list:
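As a minimal sketch of the masking idea, assuming a toy nested list (the values below are made up and not taken from the API response), flattening with unlist() and testing the names with grepl() gives two logical vectors with the same length as the flattened values:

```r
# Toy stand-in for one child of $headingItems (hypothetical values).
item <- list(
  description = list(en = "Fish, fresh", no = "Fisk, fersk"),
  headingItems = list(
    list(description = list(en = "Salmon", no = "Laks"),
         commodityCode = "03021100")
  )
)

# Flatten everything; the names record the accessor path of each value.
flat <- unlist(item, recursive = TRUE)

# One mask per kind of value; both have one entry per flattened value.
description_flag <- grepl("(^|\\.)description\\.en$", names(flat))
code_flag        <- grepl("commodityCode$", names(flat))

# Keep the values and both masks side by side.
masks <- data.frame(text = unname(flat), description_flag, code_flag)
print(masks)
```

Because the masks are kept as columns rather than used to subset by name, repeated accessor paths do not cause any ambiguity.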
This work is done within the custom function get_data, where I also then do the following steps:
I filter for only rows where there is a TRUE value i.e. a value I wish to retrieve
Apply a function utilizing gsub(), to remove non-breaking whitespace, and read_html() to convert those descriptions which are actual html to text. N.B. Some entries are not actually html and are handled by the if statement. In those cases, the input value is returned:
At this point the codes and descriptions/text are in a single column:
I use the booleans in commodity_code to update that column's values: where TRUE, the value is replaced with the matching entry from the text column, and the if replaces FALSE with NA.
Knowing that there is actually a 1 row offset between description and associated code, where applicable, I then shift the commodity column values down one row to correctly align with descriptions:
I then keep only the rows where description_header_flag is TRUE:
Finally, I remove the now not needed flag column:
This leaves me with a clean data.table to return from the function.
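The steps above can be sketched in base R on a few made-up rows. This is only an illustration of the flag, shift and filter logic, not the data.table code used in get_data; the example values and the column layout are assumptions:

```r
# Hypothetical flattened values: each commodity code sits one row
# before the description it belongs to; "Fish" is a header without a code.
nbsp <- intToUtf8(160)  # non-breaking space
d <- data.frame(
  text = c("Fish", "03021100", paste0("Salmon", nbsp, "fresh"),
           "03021200", "Trout"),
  description_flag = c(TRUE, FALSE, TRUE, FALSE, TRUE),
  code_flag        = c(FALSE, TRUE, FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Replace non-breaking spaces with ordinary spaces.
d$text <- gsub(nbsp, " ", d$text, fixed = TRUE)

# Keep the text only where the row is a code, NA elsewhere.
d$commodity_code <- ifelse(d$code_flag, d$text, NA_character_)

# Shift the codes down one row so they line up with their description.
d$commodity_code <- c(NA_character_, head(d$commodity_code, -1))

# Keep the description rows and drop the flag columns.
out <- d[d$description_flag, c("text", "commodity_code")]
rownames(out) <- NULL
print(out)
```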
Generating the final output:
As map() applying the custom function above to a list returns a list of data.tables, I then simply call rbindlist() to combine these into a single data.table:
df <- rbindlist(map(data, get_data))
This can then be written to csv for example.
fwrite(df, 'result.csv')
Example rows in df:
N.B. I return a data.table as you showed 2 columns in your desired output.
R:
library(jsonlite)
library(tidyverse)
library(rvest)
library(data.table)
get_data <- function(x) {
y <- x %>% unlist(recursive = T)
t <- data.table(text = y, description_header_flag = grepl("(?:headingItems\\.)description\\.en$|^description\\.en$", names(y)), commodity_code = grepl("commodityCode$", names(y)))
t <- t[description_header_flag | commodity_code, ]
t$text <- map2(t$text, t$description_header_flag, ~ gsub(intToUtf8(160), " ", if (.y & str_detect(.x, pattern = "<div>|<p>")) {
html_text(read_html(.x))
} else {
.x
}))
t$commodity_code <- map2(t$commodity_code, t$text, ~ if (.x) {
.y
} else {
NA
})
t[, commodity_code := c(NA, commodity_code[.I - 1])]
t <- t[description_header_flag == T, ]
t[, description_header_flag := NULL]
return(t)
}
data <- jsonlite::read_json("https://tolltariffen.toll.no/api/search/headings/03.02") %>% .$headingItems
df <- rbindlist(map(data, get_data))
fwrite(df, "result.csv")
Sample output:
Credits:
gsub solution taken from: @shabbychef here
row shift solution adapted from: @Gary Weissman here
I'm trying to convert these lists like Python's list. I've used these codes
library(GenomicRanges)
library(data.table)
library(Repitools)
pcs_by_tile<-lapply(as.list(1:length(tiled_chr)) , function(x){
obj<-tileSplit[[as.character(x)]]
if(is.null(obj)){
return(0)
} else {
runs<-filtered_identical_seqs.gr[obj]
df <- annoGR2DF(runs)
score = split(df[,c("start","end")], 1:nrow(df[,c("start","end")]))
#print(score)
return(score)
}
})
dt_text <- unlist(lapply(tiled_chr$score, paste, collapse=","))
writeLines(tiled_chr, paste0("x.txt"))
The following line of code iterates through each row of the DataFrame (only 2 columns) and splits them into the list. However, its output is different from what I desired.
score = split(df[,c("start","end")], 1:nrow(df[,c("start","end")]))
But I wanted the following kinda output:
[20350, 20355], [20357, 20359], [20361, 20362], ........
If I understand your question correctly, using as.tuple from the package 'sets' might help. Here's what the code might look like
library(sets)
score = split(df[,c("start","end")], 1:nrow(df[,c("start","end")]))
....
df_text = unlist(lapply(score, as.tuple),recursive = F)
This will return a list of tuples (and zeroes) that looks more like what you are after. You can filter out the zeroes by checking the type of each element in the resulting list and dropping the ones that are plain doubles. Note that sapply is needed here, because ! cannot be applied to the list that lapply returns. For example, you could do something like this:
df_text_trimmed <- df_text[!sapply(df_text, is.double)]
to get rid of all your zeroes.
Edit: Now that I think about it, you probably don't even need to convert your dataframes to tuples if you don't want to. You just need to make sure to include the 'recursive = F' option when you unlist things to get a list of 0s and dataframes containing the numbers you want.
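A minimal sketch of that last point, assuming a made-up pcs_by_tile with two empty tiles (returned as 0) and one tile holding two runs. The bracketed text at the end is just one possible way to print the pairs:

```r
# Hypothetical result of the lapply() in the question.
pcs_by_tile <- list(
  0,
  split(data.frame(start = c(20350, 20357), end = c(20355, 20359)), 1:2),
  0
)

# Flatten one level only: zeroes and one-row data frames end up side by side.
flat <- unlist(pcs_by_tile, recursive = FALSE)

# Keep only the data frames, which drops the zeroes.
runs <- flat[sapply(flat, is.data.frame)]

# Turn each one-row data frame into a c(start, end) pair.
pairs <- lapply(runs, function(d) c(d$start, d$end))

# Format the pairs as "[start, end]" text.
txt <- paste(sapply(pairs, function(p) paste0("[", p[1], ", ", p[2], "]")),
             collapse = ", ")
print(txt)
```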
I've been struggling to figure this out on my own, so reaching out for some assistance. I am trying to build urls based on multiple variables (months and years) of different lengths so that I have a url for each combination of month and year from the lists I created.
I've done something similar in Python but need to translate it into R, and I'm running into issues with building the function and for loops. Here's the Python code ..
# set years and months
oasis_market_yr = ('2020','2019','2018','2017','2016','2015','2014','2013','2012','2011')
oasis_market_mn = ('01','02','03','04','05','06','07','08','09','10','11','12')
# format url string
URL_FORMAT_STRING = 'http://oasis.caiso.com/oasisapi/SingleZip?queryname=CRR_INVENTORY&market_name=AUC_MN_{year}_M{month}_TC&resultformat=6&market_term=ALL&time_of_use=ALL&startdatetime={year}{month}01T07:00-0000&enddatetime={year}{month}{last_day_of_month}T07:00-0000&version=1'
# create function to make urls
def make_url(year,month):
last_day_of_month = calendar.monthrange(int(year), int(month))[1]
return URL_FORMAT_STRING.format(year=year,month=month,last_day_of_month=last_day_of_month)
# build urls for download
for y in oasis_market_yr:
for m in oasis_market_mn:
url = make_url(y,m)
I've tried using sapply and mapply with str_glue and a few other methods but can't seem to replicate the outcome. I keep getting an error that reads: Error: Variables must be length 1 or 5. Or, for instance with mapply, it maps the first value in one list to the first in the other list and so on, then returns when the short list runs out of values. What I need is all the combinations from both lists.
Any assistance would be much appreciated.
Your syntax was a little too python and won't work like that in R.
In R, the same syntax would look like this:
# set years and months
oasis_market_yr = c('2020','2019','2018','2017','2016','2015','2014','2013','2012','2011')
oasis_market_mn = c('01','02','03','04','05','06','07','08','09','10','11','12')
# create function to make urls
make_url = function(year,month){
# format url string
URL_FORMAT_STRING = 'http://oasis.caiso.com/oasisapi/SingleZip?queryname=CRR_INVENTORY&market_name=AUC_MN_{year}_M{month}_TC&resultformat=6&market_term=ALL&time_of_use=ALL&startdatetime={year}{month}01T07:00-0000&enddatetime={year}{month}{last_day_of_month}T07:00-0000&version=1'
lastdays = c(31,28,31,30,31,30,31,31,30,31,30,31)
if((as.integer(year)%%4==0 & as.integer(year)%%100 !=0) | as.integer(year)%%400==0){lastdays[2]=29}
last_day_of_month = as.character(lastdays[as.integer(month)])
fs = gsub("{month}",month,URL_FORMAT_STRING, fixed=T)
fs = gsub("{year}",year,fs, fixed=T)
fs = gsub("{last_day_of_month}",last_day_of_month, fs, fixed=T)
return(fs)
}
# build urls for download
for(y in oasis_market_yr){
for(m in oasis_market_mn){
url = make_url(y,m)
print(url)
}
}
As I am not aware of a direct equivalent of Python's string formatting method in R, I changed it to replacements: a = gsub(pattern, replacement, a, fixed=T) corresponds to the Python command a = a.replace(pattern, replacement). It should work beautifully.
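A short sketch of the replacement idea on a shortened, made-up template. fixed = TRUE matters because "{" would otherwise be read as regular-expression syntax. Base R's sprintf() is shown as one alternative for positional formatting:

```r
# Shortened stand-in for the URL template (hypothetical).
template <- "AUC_MN_{year}_M{month}_TC"

# Literal replacement, one placeholder at a time.
filled <- gsub("{year}", "2020", template, fixed = TRUE)
filled <- gsub("{month}", "01", filled, fixed = TRUE)

# The same string built with positional %s placeholders.
filled2 <- sprintf("AUC_MN_%s_M%s_TC", "2020", "01")

print(filled)
```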
Also, you don't really need a calendar package to get the last dates. Just supply the month lengths as a vector and adjust February for leap years, and Bob's your uncle.
I don't know whether the URLs that are generated are really the ones you need. But you might be able to work from this translation to correct it, if something is wrong.
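If you would rather not maintain the month lengths by hand, here is a sketch using base R dates: take the first day of the following month and step back one day. The function name last_day_of_month is just chosen to mirror the Python variable:

```r
# Last day of a month, given year and month as strings like "2020" and "02".
last_day_of_month <- function(year, month) {
  first <- as.Date(sprintf("%s-%s-01", year, month))
  # First day of the following month.
  next_first <- seq(first, by = "month", length.out = 2)[2]
  # One day earlier is the last day of the requested month.
  as.integer(format(next_first - 1, "%d"))
}

print(last_day_of_month("2020", "02"))
```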
An option using glue and lubridate. Note I added _i to the {month} and {year} variables to avoid confusion with the month and year functions in lubridate.
library(glue)
library(lubridate)
URL_FORMAT_STRING <- 'http://oasis.caiso.com/oasisapi/SingleZip?queryname=CRR_INVENTORY&market_name=AUC_MN_{year_i}_M{month_i}_TC&resultformat=6&market_term=ALL&time_of_use=ALL&startdatetime={year_i}{month_i}01T07:00-0000&enddatetime={year_i}{month_i}{last_day_of_month}T07:00-0000&version=1'
make_url<- function(year_i, month_i){
last_day_of_month <- day(ceiling_date(my(paste(month_i, year_i)), 'month') - days(1))
glue(URL_FORMAT_STRING)
}
And then, rather than a nested for loop, you can use expand.grid to build all combinations of oasis_market_yr and oasis_market_mn and mapply to apply your function to each of them.
df_vars <- expand.grid(year_i = oasis_market_yr, month_i = oasis_market_mn)
mapply(make_url, df_vars$year_i, df_vars$month_i)
# [1] "http://oasis.caiso.com/oasisapi/SingleZip?queryname=CRR_INVENTORY&market_name=AUC_MN_2020_M01_TC&resultformat=6&market_term=ALL&time_of_use=ALL&startdatetime=20200101T07:00-0000&enddatetime=20200131T07:00-0000&version=1"
# [2] "http://oasis.caiso.com/oasisapi/SingleZip?queryname=CRR_INVENTORY&market_name=AUC_MN_2019_M01_TC&resultformat=6&market_term=ALL&time_of_use=ALL&startdatetime=20190101T07:00-0000&enddatetime=20190131T07:00-0000&version=1"
#....
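To see why expand.grid is the piece that produces every combination (mapply on its own only pairs elements position by position), here is a small sketch with two years, two months and a shortened made-up template. stringsAsFactors = FALSE keeps the columns as plain character vectors:

```r
years  <- c("2020", "2019")
months <- c("01", "02")

# One row per (year, month) combination; the first variable varies fastest.
grid <- expand.grid(year = years, month = months, stringsAsFactors = FALSE)

# sprintf is vectorised, so it fills the template once per row.
labels <- sprintf("AUC_MN_%s_M%s_TC", grid$year, grid$month)
print(labels)
```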
I have a tibble called 'Volume' in which I store some data (10 columns - the first 2 columns are characters, 30 rows).
Now I want to calculate the volume in every column relative to column 3 of my tibble.
My current solution looks like this:
rel.Volume_unmod = tibble(
"Volume_OD" = Volume[[3]] / Volume[[3]],
"Volume_Imp" = Volume[[4]] / Volume[[3]],
"Volume_OD_1" = Volume[[5]] / Volume[[3]],
"Volume_WS_1" = Volume[[6]] / Volume[[3]],
"Volume_OD_2" = Volume[[7]] / Volume[[3]],
"Volume_WS_2" = Volume[[8]] / Volume[[3]],
"Volume_OD_3" = Volume[[9]] / Volume[[3]],
"Volume_WS_3" = Volume[[10]] / Volume[[3]])
rel.Volume_unmod
I would like to keep the tibble structure and the labels. I am sure there is a better solution for this, but I am relatively new to R, so it's not obvious to me. What I tried is something like this, but I can't actually run it:
rel.Volume = NULL
for(i in Volume[,3:10]){
rel.Volume[i] = tibble(Volume = Volume[[i]] / Volume[[3]])
}
Mockup Data
Since you did not provide some data, I've followed the description you provided to create some mockup data. Here:
set.seed(1)
Volume <- data.frame(ID = sample(letters, 30, TRUE),
GR = sample(LETTERS, 30, TRUE))
Volume[3:10] <- rnorm(30*8)
Solution with Dplyr
library(dplyr)
# rename columns [brute force]
cols <- c("Volume_OD","Volume_Imp","Volume_OD_1","Volume_WS_1","Volume_OD_2","Volume_WS_2","Volume_OD_3","Volume_WS_3")
colnames(Volume)[3:10] <- cols
# divide by Volume_OD
# (a copy of the reference column is saved first, so the result does not
# depend on the order in which across() updates the columns)
ref <- Volume$Volume_OD
rel.Volume_unmod <- Volume %>%
mutate(across(all_of(cols), ~ . / ref))
# result
rel.Volume_unmod
Explanation
I don't know the names of your columns. Probably, the names correspond to the names of the columns you intended to create in rel.Volume_unmod. Anyhow, to avoid any problem I renamed the columns (kinda brutally). You can do it with dplyr::rename if you want to.
There are many ways to select the columns you want to mutate. mutate is a verb from dplyr that allows you to create new columns or perform operations or functions on columns.
across is an adverb from dplyr. Let's simplify by saying that it's a function that allows you to perform a function over multiple columns. In this case I want to perform a division by Volume_OD, whose values are stored in ref beforehand.
~ is a tidyverse way to create anonymous functions. ~ . / ref is equivalent to function(x) x / ref
all_of is necessary because in this specific case I'm providing across with a vector of characters. Without it, it will work anyway, but you will receive a warning because it's ambiguous and it may work incorrectly in some cases.
More info
Check out this book to learn more about data manipulation with tidyverse (which dplyr is part of).
Solution with Base-R
rel.Volume_unmod <- Volume
# rename columns
cols <- c("Volume_OD","Volume_Imp","Volume_OD_1","Volume_WS_1","Volume_OD_2","Volume_WS_2","Volume_OD_3","Volume_WS_3")
colnames(rel.Volume_unmod)[3:10] <- cols
# divide by column 3 (extracted with [[ ]] so that it is a plain vector)
rel.Volume_unmod[3:10] <- lapply(rel.Volume_unmod[3:10], `/`, rel.Volume_unmod[[3]])
rel.Volume_unmod
Explanation
lapply is a base R function that allows you to apply a function to every item of a list or a "listable" object.
In this case rel.Volume_unmod is a listable object: a dataframe is just a list of vectors with the same length. Therefore, lapply takes one column [= one item] at a time and applies a function.
The function is /. You usually see / used like this: A / B, but actually / is a Primitive function. You could write the same thing in this way:
`/`(A, B) # same as A / B
lapply can be provided with additional parameters that are passed directly to the function that is being applied over the list (in this case /). Therefore, we are passing rel.Volume_unmod[[3]] as the additional parameter. Single brackets, rel.Volume_unmod[3], would pass a one-column dataframe instead of a vector, and every result would then be a dataframe rather than a plain column.
lapply always returns a list. But, since we are assigning the result of lapply to a "fraction of a dataframe", we will just edit the columns of the dataframe and, as a result, we will have a dataframe instead of a list. Let me rephrase in a more technical way. When you are assigning rel.Volume_unmod[3:10] <- lapply(...), you are not simply assigning a list to rel.Volume_unmod[3:10]. You are technically using this assignment function: [<-. This is a function that allows you to edit the items in a list/vector/dataframe. Specifically, [<- allows you to assign new items without modifying the attributes of the list/vector/dataframe. As I said before, a dataframe is just a list with specific attributes. So when you use [<- you modify the columns, but you leave the attributes (the class data.frame in this case) untouched. That's why the magic works.
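A tiny sketch of that behaviour, using a made-up three-column data.frame: the value returned by lapply is a plain list, yet assigning it with [<- leaves the object a data.frame.

```r
# Made-up data: one id column and two numeric columns.
df <- data.frame(id = c("a", "b"), v1 = c(2, 4), v2 = c(1, 8))

# Divide columns 2 and 3 by column 2 (taken as a plain vector).
res <- lapply(df[2:3], `/`, df[[2]])
res_class <- class(res)   # "list"

# Assign the list back into the same columns.
df[2:3] <- res
print(class(df))          # still "data.frame"
print(df)
```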
Without a minimal working example it's hard to guess what the variable Volume actually refers to. Apart from that, there seems to be a problem with your for-loop:
for(i in Volume[,3:10]){
Assuming Volume refers to a data.frame or tibble, this causes the actual column-vectors with indices between 3 and 10 to be assigned to i successively. You can verify this by putting print(i) inside the loop. But inside the loop it seems like you actually want to use i as a variable containing just the index of the current column as a number (not the column itself):
rel.Volume[i] = tibble(Volume = Volume[[i]] / Volume[[3]])
Also, two brackets are usually used with lists, not data.frames or tibbles. (You can, however, do so, because data.frames are special cases of lists.)
Last but not least, initialising the variable rel.Volume with NULL does not give R a data.frame or tibble to fill in, so the assignments inside the loop cannot build the structure you want, since you haven't told R what rel.Volume should be.
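A small sketch of these points, assuming a made-up three-column data.frame: looping over the data.frame itself hands you the column vectors, while looping over the indices lets you address columns with [[ ]].

```r
df <- data.frame(id = c("a", "b"), v1 = c(2, 4), v2 = c(1, 8))

# Looping over a data.frame yields the columns themselves, not their indices.
seen <- list()
for (col in df[, 2:3]) {
  seen[[length(seen) + 1]] <- col
}

# Single brackets keep a data.frame, double brackets extract the vector.
single <- df[2]
double <- df[[2]]

# Looping over indices and filling a data.frame that already has its rows.
rel <- df["id"]
for (i in 2:3) {
  rel[[names(df)[i]]] <- df[[i]] / df[[2]]
}
print(rel)
```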
Try this, if you like (thanks @Edo for example data):
set.seed(1)
Volume <- data.frame(ID = sample(letters, 30, TRUE),
GR = sample(LETTERS, 30, TRUE),
Vol1 = rnorm(30),
Vol2 = rnorm(30),
Vol3 = rnorm(30))
rel.Volume <- Volume[1:2] # Assuming you want to keep the IDs.
# Your data.frame will need to have the correct number of rows here already.
for (i in 3:ncol(Volume)){ # ncol gives the total number of columns in data.frame
rel.Volume[i] = Volume[i]/Volume[3]
}
A more R-like approach would be to avoid using a for-loop altogether, since R's strength is implicit vectorization. These expressions will produce the same result without a loop:
# OK, this one messes up variable names...
rel.V.2 <- data.frame(sapply(X = Volume[3:5], FUN = function(x) x/Volume[3]))
rel.V.3 <- data.frame(Map(`/`, Volume[3:5], Volume[3]))
Since you said you were new to R, frankly I would recommend avoiding the Tidyverse-packages while you are still learning the basics. From my experience, in the long run you're better off learning base-R first and adding the "sugar" when you're more familiar with the core language. You can still learn to use Tidyverse-functions later (but then, why would anybody? ;-) ).
I'm having issues with a specific problem. I have a dataset of a ton of matrices that all have V1 as their column names, essentially NULL. I'm trying to write a loop to replace all of these with column names from a list, but I'm running into some issues.
To break this down to the most simple form, this code isn't functioning as I'd expect it to.
nameofmatrix <- paste('column_', i, sep = "")
colnames(eval(as.name(nameofmatrix))) <- c("test")
I would expect this to take the value of column_1 for example, and replace (in the 2nd line) with "test" as the column name.
I tried to break this down smaller, for example, if I run print(eval(as.name(nameofmatrix))) I get the object's column/rows printed as expected and if I run print(colnames(eval(as.name(nameofmatrix)))) I'm getting NULL as expected for the column header (since it was set as V1).
I've even tried to manually type in the column name, such as colnames(column_1) <- c("test") and this successfully works to rename the column. But once this variable is put in the text's place as shown above, it does not work the same. I'm having difficulties finding a solution on how to rename several matrix columns after they have been created with this method. Does anyone have any advice or suggestions?
Note, the error I'm receiving on trying to run this is
Error in eval(as.name(nameofmatrix)) <- `*vtmp*` : could not find function "eval<-"
We could return the values of the objects in a list with get (if there are multiple objects, use mget), then rename the columns of the objects in the list and update those objects in the global env with list2env
list2env(lapply(mget(nameofmatrix), function(x) {colnames(x) <- newnames
x}), .GlobalEnv)
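A self-contained sketch of the mget / list2env route with two made-up matrices. The object names m_a and m_b and the new column names are assumptions for illustration, and environment() is used here in place of .GlobalEnv so the snippet writes back to wherever it is run:

```r
# Two one-column matrices without column names (hypothetical objects).
m_a <- matrix(1:2, ncol = 1)
m_b <- matrix(3:4, ncol = 1)

# Object name -> new column name.
new_names <- c(m_a = "alpha", m_b = "beta")

# Fetch the objects by name into a named list.
objs <- mget(names(new_names), envir = environment())

# Rename the column of each matrix inside the list.
renamed <- Map(function(m, nm) { colnames(m) <- nm; m }, objs, new_names)

# Write the modified copies back over the original objects.
invisible(list2env(renamed, envir = environment()))

print(colnames(m_a))
print(colnames(m_b))
```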
It can also be done with assign
data(mtcars)
nameofobject <- 'mtcars'
assign(nameofobject, `colnames<-`(get(nameofobject),
c('mpg1', names(mtcars)[-1])))
Now, check the names of 'mtcars'
names(mtcars)[1]
#[1] "mpg1"
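Applied to the loop in the question, the same idea might look like the sketch below. The matrices column_1 and column_2 and the vector new_names are made up for illustration. The original attempt fails because colnames(...) <- needs a variable name as its target, not the result of a call to eval(), which is why R goes looking for a function named "eval<-".

```r
# Stand-ins for the matrices described in the question.
column_1 <- matrix(1:3, ncol = 1)
column_2 <- matrix(4:6, ncol = 1)

# Hypothetical replacement names, one per matrix.
new_names <- c("height", "weight")

for (i in seq_along(new_names)) {
  nameofmatrix <- paste0("column_", i)
  # get() fetches the matrix by name, `colnames<-` returns a renamed copy,
  # and assign() stores that copy back under the same name.
  assign(nameofmatrix, `colnames<-`(get(nameofmatrix), new_names[i]))
}

print(colnames(column_1))
print(colnames(column_2))
```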