Let's say we have a url in R like:
url <- 'http://google.com/maps'
The objective is to change the 'maps' part of it. I'd like to write a function that takes an input (e.g. 'maps' or 'images') and automatically changes the relevant part of the url to reflect what I typed in.
Is there a way to do this in R, where part of the url can be changed by typing something into a function?
Thanks!
You have to store the part you type into a variable and paste this to the base URL:
base_url <- "http://google.com/"
your_extension <- "maps"
paste0(base_url, your_extension)
[1] "http://google.com/maps"
If you have to start with a fixed URL, use sub to replace the last part:
sub("\\w+$", 'foo', url)
# "http://google.com/foo"
You can use dirname to remove the last part of the URL and paste the result together with a custom string.
change_url_part <- function(base_url, string) {
paste(dirname(base_url), string, sep = '/')
}
change_url_part('http://google.com/maps', 'images')
#[1] "http://google.com/images"
I am trying to make an interactive dashboard with analysis, based on a car site. I would like the user to be able to pick a car brand, for example BMW or Audi, and based on this choice be able to pick only that brand's models. My problem is that after selecting a brand, I am not able to scrape the models that belong to it. The page I am scraping from:
main page --> https://www.otomoto.pl/osobowe/
sub car brand page example --> https://www.otomoto.pl/osobowe/audi/
I have tried to scrape every option, so that later I can clean the data and keep only the models.
Code:
otomoto_models <- paste0("https://www.otomoto.pl/osobowe/", "audi/")
models <- read_html(otomoto_models) %>%
html_nodes("option") %>%
html_text()
But it only scrapes the brands along with the other options available on the page (engine type etc.), even though I can clearly see the model types when I inspect the element.
otomoto <- "https://www.otomoto.pl/osobowe/"
brands <- read_html(otomoto) %>%
html_nodes("option") %>%
html_text()
brands <- data.frame(brands)
for (i in 1:nrow(brands)){
no_marka_pojazdu <- i
if(brands[i,1] == "Marka pojazdu"){
break
}
}
no_marka_pojazdu <- no_marka_pojazdu + 1
for (i in 1:nrow(brands)){
zuk <- i
if(substr(brands[i,1],1,3) == "Żuk"){
break
}
}
Modele_pojazdow <- as.character(brands[no_marka_pojazdu:zuk,1])
Modele_pojazdow <- removeNumbers(Modele_pojazdow)
Modele_pojazdow <- substr(Modele_pojazdow,1,nchar(Modele_pojazdow)-2)
Modele_pojazdow <- data.frame(Modele_pojazdow)
The above code only picks the car brands supported on the webpage and stores them in a data frame. With that I am able to build the html link and direct everything to one selected brand.
I would like to have an object similar to "Modele_pojazdow", but with the models limited to the previously selected car brand.
The dropdown list with models appears as a white box with the text "Model pojazdu", to the right of the "Audi" box.
Some may frown on the solution language being Python, but the aim of this was to give some pointers (high level process). I haven't written R in a long time so Python was quicker.
EDIT: R script now added
General outline:
The first dropdown options can be grabbed from the value attribute of each node returned by using a css selector of #param571 option. This uses an id selector (#) to target the parent dropdown select element, and then option type selector in descendant combination, to specify the option tag elements within. The html to apply this selector combination to can be retrieved by an xhr request to the url you initially provided. You want a nodeList returned to iterate over; akin to applying selector with js document.querySelectorAll.
The page uses ajax POST requests to update the second dropdown based on your first dropdown choice. Your first dropdown choice determines the value of a parameter search[filter_enum_make], which is used in the POST request to the server. The subsequent response contains a list of the available options (it includes some case alternatives which can be trimmed out).
I captured the POST request by using fiddler. This showed me the request headers and params in the request body. Screenshot sample shown at end.
The simplest way to extract the options from the response text, IMO, is to regex the appropriate string out (I wouldn't normally recommend regex for working with html but in this case it serves us nicely). If you don't want to use regex, you can grab the relevant info from the data-facets attribute of the element with id body-container. For the non-regex version you need to handle unquoted nulls, and retrieve the inner dictionary whose key is filter_enum_model. I show a function re-write, at the end, to handle this.
The retrieved string is a string representation of a dictionary. This needs converting to an actual dictionary object which you can then extract the option values from. Edit: As R doesn't have a dictionary object a similar structure needs to be found. I will look at this when converting.
I create a user defined function, getOptions(), to return the options for each make. Each car make value comes from the list of possible items in the first dropdown. I loop over those possible values, use the function to return a list of options for that make, and add those lists as values to a dictionary, results, whose keys are the makes of car. Again, for R an object with similar functionality to a python dictionary needs to be found.
That dictionary of lists needs converting to a dataframe which includes a transpose operation to make a tidy output of headers, which are the car makes, and columns underneath each header, which contain the associated models.
The whole thing can be written to csv at the end.
So, hopefully that gives you an idea of one way to achieve what you want. Perhaps someone else can use this to help write you a solution.
Python demonstration of this below:
import requests
from bs4 import BeautifulSoup as bs
import re
import ast
import pandas as pd
headers = {
    'User-Agent' : 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36'
}
def getOptions(make): #function to return options based on make
    data = {
        'search[filter_enum_make]': make,
        'search[dist]' : '5',
        'search[category_id]' : '29'
    }
    r = requests.post('https://www.otomoto.pl/ajax/search/list/', data = data, headers = headers)
    try:
        # verify the regex here: https://regex101.com/r/emvqXs/1
        data = re.search(r'"filter_enum_model":(.*),"new_used"', r.text ,flags=re.DOTALL).group(1) #regex to extract the string containing the models associated with the car make filter
        aDict = ast.literal_eval(data) #convert string representation of dictionary to python dictionary
        d = len({k.lower(): v for k, v in aDict.items()}.keys()) #find length of unique keys when accounting for case
        dirtyList = list(aDict)[:d] #trim to unique values
        cleanedList = [item for item in dirtyList if item != 'other' ] #remove 'other' as doesn't appear in dropdown
    except:
        cleanedList = [] # sometimes there are no associated values in 2nd dropdown
    return cleanedList
r = requests.get('https://www.otomoto.pl/osobowe/')
soup = bs(r.content, 'lxml')
values = [item['value'] for item in soup.select('#param571 option') if item['value'] != '']
results = {}
# build a dictionary of lists to hold options for each make
for value in values:
    results[value] = getOptions(value) #function call to return options based on make
# turn into a dataframe and transpose so each column header is the make and the options are listed below
df = pd.DataFrame.from_dict(results,orient='index').transpose()
#write to csv
df.to_csv(r'C:\Users\User\Desktop\Data.csv', sep=',', encoding='utf-8-sig',index = False )
[Image: sample of csv output]
[Image: example as sample json for alfa-romeo]
Example of regex match for alfa-romeo:
{"145":1,"146":1,"147":218,"155":1,"156":118,"159":559,"164":2,"166":39,"33":1,"Alfasud":2,"Brera":34,"Crosswagon":2,"GT":89,"GTV":7,"Giulia":251,"Giulietta":378,"Mito":224,"Spider":24,"Sportwagon":2,"Stelvio":242,"alfasud":2,"brera":34,"crosswagon":2,"giulia":251,"giulietta":378,"gt":89,"gtv":7,"mito":224,"spider":24,"sportwagon":2,"stelvio":242}
Example of the filter option list returned from function call with make parameter value alfa-romeo:
['145', '146', '147', '155', '156', '159', '164', '166', '33', 'Alfasud', 'Brera', 'Crosswagon', 'GT', 'GTV', 'Giulia', 'Giulietta', 'Mito', 'Spider', 'Sportwagon', 'Stelvio']
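The trimming step can be checked offline against a sample like the one above. The following is only a sketch: the sample is shortened, the "other" entry is added purely for illustration, and unique_models is a hypothetical helper name. Unlike the slicing used in getOptions, it does not rely on the lowercase duplicates all appearing after the original keys.

```python
import json

# Shortened copy of the regex match shown above; the "other" entry is an
# illustrative assumption so the filter has something to remove.
sample = '{"145":1,"GT":89,"Giulia":251,"Mito":224,"gt":89,"giulia":251,"mito":224,"other":3}'

def unique_models(raw):
    """Return model names, dropping case-only duplicates and 'other'."""
    seen = set()
    models = []
    for name in json.loads(raw):  # dict iteration keeps the original key order
        key = name.lower()
        if key in seen or key == 'other':
            continue
        seen.add(key)
        models.append(name)
    return models

print(unique_models(sample))  # ['145', 'GT', 'Giulia', 'Mito']
```

Because the first spelling encountered wins, the capitalised names are kept as long as they precede their lowercase twins, which is the case in the sample.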
[Image: sample of fiddler request]
Sample of ajax response html containing options:
<section id="body-container" class="om-offers-list"
data-facets='{"offer_seek":{"offer":2198},"private_business":{"business":1326,"private":872,"all":2198},"categories":{"29":2198,"161":953,"163":953},"categoriesParent":[],"filter_enum_model":{"145":1,"146":1,"147":219,"155":1,"156":116,"159":561,"164":2,"166":37,"33":1,"Alfasud":2,"Brera":34,"Crosswagon":2,"GT":88,"GTV":7,"Giulia":251,"Giulietta":380,"Mito":226,"Spider":25,"Sportwagon":2,"Stelvio":242,"alfasud":2,"brera":34,"crosswagon":2,"giulia":251,"giulietta":380,"gt":88,"gtv":7,"mito":226,"spider":25,"sportwagon":2,"stelvio":242},"new_used":{"new":371,"used":1827,"all":2198},"sellout":null}'
data-showfacets=""
data-pagetitle="Alfa Romeo samochody osobowe - otomoto.pl"
data-ajaxurl="https://www.otomoto.pl/osobowe/alfa-romeo/?search%5Bbrand_program_id%5D%5B0%5D=&search%5Bcountry%5D="
data-searchid=""
data-keys=''
data-vars=""
Alternative version of function without regex:
from bs4 import BeautifulSoup as bs
def getOptions(make): #function to return options based on make
    data = {
        'search[filter_enum_make]': make,
        'search[dist]' : '5',
        'search[category_id]' : '29'
    }
    r = requests.post('https://www.otomoto.pl/ajax/search/list/', data = data, headers = headers)
    soup = bs(r.content, 'lxml')
    data = soup.select_one('#body-container')['data-facets'].replace('null','"null"')
    aDict = ast.literal_eval(data)['filter_enum_model'] #convert string representation of dictionary to python dictionary
    d = len({k.lower(): v for k, v in aDict.items()}.keys()) #find length of unique keys when accounting for case
    dirtyList = list(aDict)[:d] #trim to unique values
    cleanedList = [item for item in dirtyList if item != 'other' ] #remove 'other' as doesn't appear in dropdown
    return cleanedList
print(getOptions('alfa-romeo'))
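Since the data-facets value in the sample response looks like plain JSON, the null replacement could probably be avoided by parsing it with json.loads instead of ast.literal_eval. Below is a minimal sketch of that idea using only the standard library. The HTML is a shortened copy of the sample response, and FacetsParser is a hypothetical helper, not part of the original answer.

```python
import json
from html.parser import HTMLParser

# Shortened copy of the ajax response sample shown earlier.
sample_html = '''<section id="body-container" class="om-offers-list"
data-facets='{"offer_seek":{"offer":2198},"filter_enum_model":{"145":1,"Giulia":251,"giulia":251},"sellout":null}'
data-showfacets="">'''

class FacetsParser(HTMLParser):
    """Collect the data-facets attribute of the element with id body-container."""

    def __init__(self):
        super().__init__()
        self.facets = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if attrs.get('id') == 'body-container' and 'data-facets' in attrs:
            # json.loads maps JSON null to None, so no string replacement is needed
            self.facets = json.loads(attrs['data-facets'])

parser = FacetsParser()
parser.feed(sample_html)
models = parser.facets['filter_enum_model']
print(list(models))              # ['145', 'Giulia', 'giulia']
print(parser.facets['sellout'])  # None
```

The same json.loads call should work on the attribute string returned by BeautifulSoup in the function above, assuming the live response keeps the format of the sample.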
R conversion and improved python:
Whilst converting to R I found a better way of extracting the parameters from a js file on the server. If you open dev tools you can see the file listed in the sources tab.
R (To be improved):
library(httr)
library(jsonlite)
url <- 'https://www.otomoto.pl/ajax/jsdata/params/'
r <- GET(url)
contents <- content(r, "text")
data <- strsplit(contents, "var searchConditions = ")[[1]][2]
data <- strsplit(as.character(data), ";var searchCondition")[[1]][1]
source <- fromJSON(data)$values$'573'$'571'
makes <- names(source)
for(make in makes){
print(make)
print(source[make][[1]]$value)
#break
}
Python:
import requests
import json
import pandas as pd
r = requests.get('https://www.otomoto.pl/ajax/jsdata/params/')
data = r.text.split('var searchConditions = ')[1]
data = data.split(';var searchCondition')[0]
items = json.loads(data)
source = items['values']['573']['571']
makes = [item for item in source]
results = {}
for make in makes:
    df = pd.DataFrame(source[make]) ## build a dictionary of lists to hold options for each make
    results[make] = list(df['value'])
dfFinal = pd.DataFrame.from_dict(results,orient='index').transpose() # turn into a dataframe and transpose so each column header is the make and the options are listed below
mask = dfFinal.applymap(lambda x: x is None) #tidy up None values to empty strings https://stackoverflow.com/a/31295814/6241235
cols = dfFinal.columns[(mask).any()]
for col in dfFinal[cols]:
    dfFinal.loc[mask[col], col] = ''
print(dfFinal)
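The string-splitting step can be tried without hitting the server. This sketch runs it on a small made-up stand-in for the js file. The makes, models and the trailing variable name are assumptions; only the 'values' / '573' / '571' path and the 'value' key are taken from the code above. models_by_make is a hypothetical helper name.

```python
import json

# Hypothetical, heavily shortened stand-in for the js file contents; the real
# file is much larger and its exact layout is an assumption here.
js_text = ('var searchConditions = {"values":{"573":{"571":'
           '{"audi":[{"value":"a3"},{"value":"a4"}],"bmw":[{"value":"x5"}]}}}};'
           'var searchConditionsOther = {};')

def models_by_make(text):
    """Cut the JSON literal out of the js source and map each make to its models."""
    payload = text.split('var searchConditions = ')[1]
    payload = payload.split(';var searchCondition')[0]
    source = json.loads(payload)['values']['573']['571']
    return {make: [entry['value'] for entry in entries]
            for make, entries in source.items()}

print(models_by_make(js_text))  # {'audi': ['a3', 'a4'], 'bmw': ['x5']}
```

A plain dictionary of lists like this avoids the intermediate DataFrame per make, and can still be passed to pd.DataFrame.from_dict as in the script above.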
I'm very new to R and beginner level at programming in general, and trying to figure out how to get hovertext in plotly to display a Japanese string from my dataframe. After venturing through character encoding hell, I've got things mostly worked out but am getting stuck on a single point: Getting the Japanese string to display in the final plot.
plot_ly(df, x = ~cost, y = ~grossSales, type = "scatter", mode = "markers",
hoverinfo = "text",
text = ~paste0("Product name: ", productName,
"<br>Gross: ", grossSales, "<br> Cost: ", cost
)
)
The problem I encounter is that using 'productName' returns the Japanese string from the dataframe, which causes the plot to fail to render. DOM Inspector's console shows JSON encountering issues with the string (even though it's just encoded in UTF-8).
Using toJSON(productName), I am able to get it to render; however, the hover textbox then shows the full contents of the productName column (e.g., ["","Product1","Product2","Product3"...]). I only want the name of that specific product, just as 'grossSales' and 'cost' return only the data specific to that product at each point on the plot.
Is there a way I can execute toJSON() only on each specific instance of 'productName'? (i.e., output should be "Product1" with JSON friendly string format) Alternatively, is there a way I can have plotly read the list output and select only the correct productName?
Stepping away from the problem to continue studying other things, I found a partial solution in using a for-loop:
productNames <- NULL
for (i in 1:nrow(df))
{
productNames <- c(productNames, toJSON(df[i, "productName"]))
}
df$jsonProductNames <- productNames
Using the jsonProductNames variable within plotly, the graph renders and displays only the name for each product! The sole issue remaining is that it is displayed with the JSON [""] formatting around each product's name.
Update:
I've finally got this working fully how I want it. I imagine there are more elegant solutions, and I'd still be interested to learn how to achieve what I originally was looking at if possible (run a function on a variable within R for each time it is encountered in a loop), but here is how I have it working:
colToJSON <- function(df, colStr)
{
JSONCol <- NULL
for (i in 1:nrow(df))
{
JSONCol <- c(JSONCol, toJSON(df[i, colStr]))
}
JSONCol <- gsub("\\[\"", "", JSONCol)
JSONCol <- gsub("\"\\]", "", JSONCol)
return(JSONCol)
}
df$jsonProductNames <- colToJSON(df, "productName")
I am a new user of Julia and I want to work on graphs. I found the Graphs.jl library, but it is not very well documented. I tried to create a GenericGraph based on ExVertex and ExEdge but I need more information.
The code I'm using :
using Graphs
CompGraph = GenericGraph{ExVertex, ExEdge{ExVertex}}
temp = ExVertex(1, "VertexName")
temp.attributes["Att"] = "Test"
add_vertex!(CompGraph, temp)
Now I still need the ExVertex list and ExEdge list. Are there any predefined parameters? Or how can I create such lists?
The solution was very simple. A list is just a simple array and not a new type. Besides, there is a predefined function which creates graphs based on different types of edges and vertices.
I changed my code to :
using Graphs
CG_VertexList = ExVertex[]
CG_EdgeList = ExEdge{ExVertex}[]
CompGraph = graph(CG_VertexList, CG_EdgeList)
temp = ExVertex(1, "VertexName")
temp.attributes["Att"] = "Test"
add_vertex!(CompGraph, temp)
I will have to generate a gantt diagram on a daily basis. My idea is to use the mermaid api included in R's DiagrammeR package.
My data will always have the same structure and, therefore, I have created a quite primitive parser that is included in the reproducible example.
The problem I face is that after 4 sections the styling starts again from zero:
rect.section.section0
rect.section.section1
rect.section.section2
rect.section.section3
rect.section.section0
I can change rect.section.sectionx colour from the .css but I cannot add new ones.
Is there a way around to change/personalise the section's colour/styling?
My R reproducible example:
library(DiagrammeR)
library(htmltools)
fromdftogantt<-function(df,Title="Proba",filename="proba.html"){
txt<-paste("gantt","dateFormat YYYY-MM-DD",paste("title",Title),"",sep="\n")
for(i in unique(df$section)){
txt<-paste(txt,paste("section",i),sep="\n")
for(j in which(df$section==i)){
txt<-paste(txt,paste0(df$name[j],":",df$status[j],",",
df$fecini[j],",",
df$fecfin[j]),sep="\n")
}
txt<-paste0(txt,"\n")
}
m<-mermaid(txt)
m$x$config = list(ganttConfig = list(
axisFormatter = list(list(
"%m-%Y"
,htmlwidgets::JS(
'function(d){ return d.getDate() == 1 }'
)
))
))
save_html(as.tags(m),file=filename)
}
df<-data.frame(section=letters[1:6],name=paste("Name",1:6),
status=rep("active",6),
fecini=as.Date(c("2015-02-03","2015-03-05","2015-04-07",
"2015-02-03","2015-03-05","2015-04-07")),
fecfin=as.Date(c("2015-06-01","2015-04-30","2015-12-31",
"2015-06-01","2015-04-30","2015-12-31")),
stringsAsFactors = FALSE)
fromdftogantt(df,Title="Proba",filename="proba.html")
You don't need to change the .js file at all. mermaid supports a numberSectionStyles config parameter. Just add the following line to your R function before saving the HTML:
m$x$config$ganttConfig$numberSectionStyles = 6
You'll still need to adjust the .css file to add the additional sections following the same template as the existing ones.