RSelenium - Storing data in an array

I'm extracting event descriptions from a list of events on my website.
Each event is an href link that goes to another page where we can find the image and the description of the event. I'm trying to store the image URL and the description of all events in an array, so I used the code below at the end of my loop, but I only get the image and the description of the last event in the loop:
m <- c(images_of_events)
n <- c(description_of_events)
cc <- remDr$findElement(using = "css", "[class = '_24er']")
cc <- remDr$getPageSource()
page_events <- read_html(cc[[1]][1])
links_events_data <- html_nodes(page_events, '._24er > table > tbody > tr > td > div > div._4dmk > a')
events_urls <- html_attr(links_events_data, "href")
# the loop over each event
for (i in events_urls) {
  remDr$navigate(paste("localhost://www.mywebsite", i, sep = ""))
  # get image
  imagewebElem <- remDr$findElement(using = "class", "scaledImageFitWidth")
  images_of_events <- imagewebElem$getElementAttribute("src")
  descriptionwebElem <- remDr$findElement(using = "css", "[class = '_63ew']")
  descriptionwebElem <- remDr$getPageSource()
  page_event_description <- read_html(descriptionwebElem[[1]][1])
  events_desc <- html_nodes(page_event_description, '._63ew > span')
  description_of_events <- html_text(events_desc)
  m <- c(images_of_events)
  n <- c(description_of_events)
}

To save values in an array in R you have to either:
1) create the array/data.frame with dta <- data.frame(m = c(), n = c()) and then save into it with dta[i, 1] <- images_of_events and dta[i, 2] <- description_of_events, where i is a numeric iterator, or
2) create the array/data.frame and use rbind to add values, like dta <- rbind(dta, data.frame(m = images_of_events, n = description_of_events)) (see the sketch below).
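For instance, here is a minimal sketch of the second approach applied to the loop from the question (it reuses the question's selectors and assumes they match your site; the key point is that rbind appends one row per event instead of overwriting m and n):

events_data <- data.frame(m = character(), n = character(), stringsAsFactors = FALSE)
for (i in events_urls) {
  remDr$navigate(paste("localhost://www.mywebsite", i, sep = ""))
  # image url of the current event
  imagewebElem <- remDr$findElement(using = "class", "scaledImageFitWidth")
  image_url <- imagewebElem$getElementAttribute("src")[[1]]
  # parse the event page and pull the description text
  page_event <- read_html(remDr$getPageSource()[[1]])
  desc <- html_text(html_nodes(page_event, '._63ew > span'))
  # append one row per event instead of overwriting
  events_data <- rbind(events_data,
                       data.frame(m = image_url,
                                  n = paste(desc, collapse = " "),
                                  stringsAsFactors = FALSE))
}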

Related

Using R - My for loop only works on first iteration

I have a loop where each iteration does at least one and at most two API GET requests.
I have a list of movies I loop through: I do a GET request to get the movie's id from the API database, then I do a request to get the reviews using the movie's id.
I run the code and it works for the first movie, but all the other movies remain empty, even when the first movie has no reviews.
Here is the code:
for (i in 1:9964) {
  id_url <- paste(url, search_title, key, COPY$Film[i], sep = "")
  # GET request for movie
  api_call_id <- GET(id_url)
  # make readable
  read <- rawToChar(api_call_id$content)
  # turn json into object
  JSON <- fromJSON(read, flatten = TRUE)
  if (is.null(JSON$results$id[1])) {
    break
  } else {
    # get the movie id from the json
    id <- JSON$results$id[1]
  }
  # url to get movie ratings
  raiting_url <- paste(url, raiting, key, id, sep = "")
  # call
  api_call_raiting <- GET(raiting_url)
  # readable
  read <- rawToChar(api_call_raiting$content)
  # json
  JSON <- fromJSON(read, flatten = TRUE)
  if (is.null(JSON$rottenTomatoes)) {
    # set column value
    COPY[i, 'rTomatoes'] <- "No Review"
  } else {
    # set column value
    COPY[i, 'rTomatoes'] <- JSON$rottenTomatoes
  }
}
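One detail worth a close look in the first if block: break ends the whole loop as soon as one movie's search returns no id, which would leave every movie after it empty, matching the symptom described. A minimal sketch of that branch using next (skip just this movie) instead of break, assuming the rest of the loop stays the same:

if (is.null(JSON$results$id[1])) {
  # no id found: skip this movie but keep looping over the rest
  next
} else {
  # get the movie id from the json
  id <- JSON$results$id[1]
}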

I have a hard time webscraping values that are both inside <span> and outside <span>

I am using scrapy to scrape this website: https://www.coop.se/butiker-erbjudanden/coop-butiker/coop-lilla-edet/. I am trying to retrieve the values that are shown in the picture, but I believe I am webscraping them in a bad way; see my variable called "Info". Please give me some tips on how I should actually webscrape those values.
The code I use today is:
categories = response.css("body > main > div.js-childLayoutContainer.u-marginTmd > div > div.js-favoriteStoreView.js-settings > div.Main-container.Main-container--padding > div:nth-child(4) > div")
for category in categories:
    if category.css("div.Grid-cell.u-sizeFull.u-marginVxsm > h2::text").extract_first() == "Butikens bästa erbjudanden denna vecka":
        continue
    else:
        Category = category.css("div.Grid-cell.u-sizeFull.u-marginVxsm > h2::text").extract_first()
    items = category.css("div > article > div")
    for item in items:
        Product = item.css("div.ItemTeaser-info > h3::text").extract_first()
        if not Product:
            Product = None
        else:
            Product
        Info = item.css("p.ItemTeaser-description").extract_first()
        Info = Info.replace("<br>", "")
        Info = Info.replace('<div class="">', "")
        Info = Info.replace("</div>", "")
        Info = Info.replace("</p>", "")
        Info = Info.replace("<p>", "")
        Info = Info.replace("</span>", "")
        Info = Info.replace('<p class="ItemTeaser-description">', "")
        Info = Info.replace('<span class="ItemTeaser-brand">', "")
        Info = Info.strip()
        Info = " ".join(Info.split())
You could use an XPath like product.xpath('.//p[@class="ItemTeaser-description"]/text()'). The result will be a list of selectors with the text.
For example:
import scrapy

class ShopSpider(scrapy.Spider):
    name = "shopspider"
    start_urls = [
        'https://www.coop.se/butiker-erbjudanden/coop-butiker/coop-lilla-edet/']

    def parse(self, response):
        for product in response.css('.ItemTeaser-info'):
            yield {
                'title': product.xpath('.//h3/text()').get(),
                'description': "".join([
                    t.get().strip()
                    for t in product.xpath('.//p[@class="ItemTeaser-description"]/text()')
                ])
            }
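If you save this spider as shopspider.py, a quick way to test it (assuming a standard scrapy install) is scrapy runspider shopspider.py -o items.json, then check the yielded title/description pairs in items.json.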

R: hide cells in DT::datatable based on condition

I am trying to create a datatable with child rows: the user will be able to click on a name and see a list of links related to that name. However, the number of items to show is different for each name.
> data1 <- data.frame(name = c("John", "Maria", "Afonso"),
                      a = c("abc", "def", "rty"),
                      b = c("ghj", "lop", NA),
                      c = c("zxc", "cvb", NA),
                      d = c(NA, "mko", NA))
> data1
    name   a    b    c    d
1   John abc  ghj  zxc <NA>
2  Maria def  lop  cvb  mko
3 Afonso rty <NA> <NA> <NA>
I am using varsExplore::datatable2 to hide specific columns:
varsExplore::datatable2(x=data1, vars=c("a","b","c","d"))
and it produces the result below.
Is it possible to modify DT::datatable in order to only render cells that are not "null"? So, for example, if someone clicked on "Afonso", the table would only render "rty", thus hiding "null" values for the other columns (for this row), while still showing those columns if the user clicked "Maria" (that doesn't have any "null").
(Should I try a different approach in order to achieve this behavior?)
A look into the inner workings of varsExplore::datatable2
Following your request, I took a look at the varsExplore::datatable2 source code. I found out that varsExplore::datatable2 calls varsExplore:::.callback2 (the ::: means it is not an exported function) to create the JavaScript code. This function in turn calls varsExplore:::.child_row_table2, which returns a JavaScript function format(row_data) that formats the row data into the table you see.
A proposed solution
I simply used my js knowledge to change the output of varsExplore:::.child_row_table2 and I came up with the following :
.child_row_table2 <- function(x, pos = NULL) {
  names_x <- paste0(names(x), ":")
  text <- "
  var format = function(d) {
    text = '<div><table >' +
  "
  for (i in seq_along(pos)) {
    text <- paste(text, glue::glue(
      " ( d[{pos[i]}]!==null ? ( '<tr>' +
            '<td>' + '{names_x[pos[i]]}' + '</td>' +
            '<td>' + d[{pos[i]}] + '</td>' +
          '</tr>' ) : '' ) + "))
  }
  paste0(text,
    "'</table></div>'
    return text;};"
  )
}
The only change I made was adding d[{pos[i]}]!==null ? ....... : '' around each row, which will only show the column pos[i] when its value d[pos[i]] is not null.
Since loading the package and adding the function to the global environment won't do the trick, I forked the package on GitHub and committed the changes (the GitHub repo is a read-only CRAN mirror, so I can't submit a pull request). You can now install the fork by running:
devtools::install_github("moutikabdessabour/varsExplore")
EDIT
If you don't want to redownload the package, I found a solution: basically, you'll need to override the datatable2 function.
First copy the source code into your R file located at path/to/your/Rfile:
# the data.table way
data.table::fwrite(list(capture.output(varsExplore::datatable2)), quote = FALSE, sep = '\n', file = "path/to/your/Rfile", append = TRUE)
# the base R way
fileConn <- file("path/to/your/Rfile", open = 'a')
writeLines(capture.output(varsExplore::datatable2), fileConn)
close(fileConn)
Then you'll have to substitute the last line:
DT::datatable(
  x,
  ...,
  escape = -2,
  options = opts,
  callback = DT::JS(.callback2(x = x, pos = c(0, pos)))
)
with:
DT::datatable(
  x,
  ...,
  escape = -2,
  options = opts,
  callback = DT::JS(gsub("('<tr>.+?(d\\[\\d+\\]).+?</tr>')", "(\\2==null ? '' : \\1)", varsExplore:::.callback2(x = x, pos = c(0, pos))))
)
What this code is basically doing is adding the JS null check using a regular expression.
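To see the effect, here is a hypothetical one-liner (the string is a made-up fragment in the shape of the callback that .callback2 generates, not its real output):

js_line <- "'<tr>' + '<td>' + 'b:' + '</td>' + '<td>' + d[2] + '</td>' + '</tr>'"
gsub("('<tr>.+?(d\\[\\d+\\]).+?</tr>')", "(\\2==null ? '' : \\1)", js_line)
# [1] "(d[2]==null ? '' : '<tr>' + '<td>' + 'b:' + '</td>' + '<td>' + d[2] + '</td>' + '</tr>')"

Each row-building expression gets wrapped in a ternary that collapses to an empty string when the cell is null.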

The New York Times API with R

I'm trying to get article information using The New York Times API. The CSV file I get doesn't reflect my filter query. For example, I restricted the source to 'The New York Times', but the file I got also contains other sources.
I would like to ask why the filter query doesn't work.
Here's the code.
if (!require("jsonlite")) install.packages("jsonlite")
library(jsonlite)

api = "apikey"
nytime = function() {
  url = paste('http://api.nytimes.com/svc/search/v2/articlesearch.json?',
              '&fq=source:', ("The New York Times"), 'AND type_of_material:', ("News"),
              'AND persons:', ("Trump, Donald J"),
              '&begin_date=', '20160522&end_date=', '20161107&api-key=', api, sep = "")
  # get the total number of search results
  initialsearch = fromJSON(url, flatten = T)
  maxPages = round((initialsearch$response$meta$hits / 10) - 1)
  # try with the max page limit at 10
  maxPages = ifelse(maxPages >= 10, 10, maxPages)
  # create an empty data frame
  df = data.frame(id = as.numeric(), source = character(), type_of_material = character(),
                  web_url = character())
  # save search results into the data frame
  for (i in 0:maxPages) {
    # get the search results of each page
    nytSearch = fromJSON(paste0(url, "&page=", i), flatten = T)
    temp = data.frame(id = 1:nrow(nytSearch$response$docs),
                      source = nytSearch$response$docs$source,
                      type_of_material = nytSearch$response$docs$type_of_material,
                      web_url = nytSearch$response$docs$web_url)
    df = rbind(df, temp)
    Sys.sleep(5) # sleep for 5 seconds
  }
  return(df)
}
dt = nytime()
write.csv(dt, "trump.csv")
Here's the csv file I got.
It seems you need to put the () inside the quotes, not outside. Like this:
url = paste('http://api.nytimes.com/svc/search/v2/articlesearch.json?',
            '&fq=source:', "(The New York Times)", 'AND type_of_material:', "(News)",
            'AND persons:', "(Trump, Donald J)",
            '&begin_date=', '20160522&end_date=', '20161107&api-key=', api, sep = "")
https://developer.nytimes.com/docs/articlesearch-product/1/overview
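As a side note (my addition, not part of the fix above): the fq value contains spaces, so it can be safer to build it as one string and percent-encode it with base R's URLencode before pasting it into the URL:

# build the filter query separately; URLencode leaves the colons and
# parentheses alone but encodes the spaces and quotes
fq <- 'source:("The New York Times") AND type_of_material:("News") AND persons:("Trump, Donald J")'
url <- paste('http://api.nytimes.com/svc/search/v2/articlesearch.json?',
             '&fq=', URLencode(fq),
             '&begin_date=', '20160522&end_date=', '20161107&api-key=', api,
             sep = "")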

neography get actual node or node id from index

I am using the below to get nodes from an index:
neo.get_node_index('nodes_index', 'type', 'repo')
Which works fine. However, the data returned is a Hash object, as below:
> {"indexed"=>"http://localhost:7474/db/data/index/node/nodes_index/type/repo/12", "outgoing_relationships"=>"http://localhost:7474/db/data/node/12/relationships/out",
> "data"=>{"name"=>"irc-logs"},
> "traverse"=>"http://localhost:7474/db/data/node/12/traverse/{returnType}",
> "all_typed_relationships"=>"http://localhost:7474/db/data/node/12/relationships/all/{-list|&|types}",
> "property"=>"http://localhost:7474/db/data/node/12/properties/{key}",
> "self"=>"http://localhost:7474/db/data/node/12",
> "properties"=>"http://localhost:7474/db/data/node/12/properties",
> "outgoing_typed_relationships"=>"http://localhost:7474/db/data/node/12/relationships/out/{-list|&|types}",
> "incoming_relationships"=>"http://localhost:7474/db/data/node/12/relationships/in",
> "extensions"=>{},
> "create_relationship"=>"http://localhost:7474/db/data/node/12/relationships", "paged_traverse"=>"http://localhost:7474/db/data/node/12/paged/traverse/{returnType}{?pageSize,leaseTime}",
> "all_relationships"=>"http://localhost:7474/db/data/node/12/relationships/all",
> "incoming_typed_relationships"=>"http://localhost:7474/db/data/node/12/relationships/in/{-list|&|types}"}
I would like either the actual node object to be returned, or to be able to retrieve the id easily. By id, I am referring to the integer at the end of http://localhost:7474/db/data/node/12.
I could get it by regex, but this surely isn't the best way?
You could use the 'Phase 2' API to find it, as below:
n = Neography::Node.find('nodes_index', 'type', 'repo')
n.neo_id # 12
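If you'd rather keep working with the hash you already have, the id is also just the last path segment of the hash's "self" value, so plain Ruby like node["self"].split("/").last.to_i (with node being the returned hash) retrieves it without a regex.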
