Web Crawler using R

I want to build a web crawler in R for the website "https://www.latlong.net/convert-address-to-lat-long.html" that visits the site with an address as the parameter and then fetches the generated latitude and longitude. This would repeat for every row of my dataset.
Since I am new to web crawling, I would appreciate some guidance.
Thanks in advance.

In the past I have used an API called ipstack (ipstack.com).
Example: a data frame 'd' that contains a column of IP addresses called 'ipAddress':
for (i in seq_len(nrow(d))) {
  # get data from the API and save the text to variable 'str'
  lookupPath <- paste("http://api.ipstack.com/", d$ipAddress[i], "?access_key=INSERT YOUR API KEY HERE&format=1", sep = "")
  str <- readLines(lookupPath)
  # save all the data to a file named after the row number
  f <- file(paste(i, ".txt", sep = ""))
  writeLines(str, f)
  close(f)
  # save data to the main data frame 'd' as well
  d$ipCountry[i] <- str[7]
  print(paste("Successfully saved ip #:", i))
}
In this example, I was specifically after the country of each IP, which appears on line 7 of the data returned by the API (hence the str[7]).
This API lets you look up 10,000 addresses per month for free, which was enough for my purposes.
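A loop like the one above stops at the first failed request. Here is a minimal sketch of one way to make it more tolerant, assuming the same ipstack URL layout shown in the answer. The names build_lookup_url and safe_fetch are hypothetical helpers, not part of any package:

```r
# Hypothetical helper: build the lookup URL for one IP address,
# following the URL layout used in the loop above.
build_lookup_url <- function(ip, access_key) {
  paste0("http://api.ipstack.com/", ip,
         "?access_key=", access_key, "&format=1")
}

# Hypothetical wrapper: return NA instead of aborting when a request fails.
# 'fetch' defaults to readLines, as in the answer, but can be swapped out.
safe_fetch <- function(url, fetch = readLines) {
  tryCatch(fetch(url), error = function(e) NA_character_)
}

# Offline demonstration with a stand-in fetch function that always fails
failing_fetch <- function(url) stop("network unavailable")
result <- safe_fetch(build_lookup_url("8.8.8.8", "YOUR_KEY"),
                     fetch = failing_fetch)
```

Inside the loop, calling safe_fetch(lookupPath) in place of readLines(lookupPath) would let the remaining rows continue after a failed lookup.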

Related

Malformed JSON missing a comma separator, Insert comma in R

I'm new to R and have a json file, containing data I'm hoping to convert to an R dataframe, that's been scraped in the following format:
The picture indicates where the data was scraped incorrectly, as no commas were inserted to separate entries. I've tried reading the data in with scan and separating into a list (to then read into a df) with this code:
indices <- grep(":[{", x, fixed=TRUE)
n <- length(indices)
l <- vector("list", n)
for (i in 1:n) {
  ps <- substr(x, indices[[i]], indices[i+1]) ## where i is whatever your Ps is
  l[[i]] <- ps
}
But I am getting empty strings and NaN values. I've tried parsing with jsonlite, tidyjson, and rjson without any luck (which makes sense, since the JSON is malformed). This article seems to match my JSON's structure, but the solution isn't working because of the missing commas. How would I insert a comma before every instance of "{"entries":[" in R when the file is read in as one string?
UPDATE: first, second and third entries
{"entries":[{"url":"/leonardomso/playground","name":"playground","lang":"TypeScript","desc":"Playground using React, Emotion, Relay, GraphQL, MongoDB.","stars":5,"forks":"2","updated":"2021-03-24T09:35:44Z","info":["react","reactjs","graphql","typescript","hooks","apollo","boilerplate","!DOCTYPE html \"\""],"repo_url":"/leonardomso?tab=repositories"}
{"entries":[{"url":"/leonardomso/playground","name":"playground","lang":"TypeScript","desc":"Playground using React, Emotion, Relay, GraphQL, MongoDB.","stars":5,"forks":"2","updated":"2021-03-24T09:35:44Z","info":["react","reactjs","graphql","typescript","hooks","apollo","boilerplate","!DOCTYPE html \"\""],"repo_url":"/leonardomso?tab=repositories"}
{"entries":[{"url":"/shiffman/Presentation-Manager","name":"Presentation-Manager","lang":"JavaScript","desc":"Simple web app to manage student presentation schedule.","stars":17,"forks":"15","updated":"2021-01-19T15:28:55Z","info":[]},{"desc":"","stars":null,"forks":"","info":[]},{"url":"/shiffman/A2Z-F20","name":"A2Z-F20","lang":"JavaScript","desc":"ITP Course Programming from A to Z Fall 2020","stars":40,"forks":"31","updated":"2020-12-21T13:52:58Z","info":[]},{"desc":"","stars":null,"forks":"","info":[]},{"desc":"","stars":null,"forks":"","info":[]},{"url":"/shiffman/RunwayML-Object-Detection","name":"RunwayML-Object-Detection","lang":"JavaScript","desc":"Object detection model with RunwayML, node.js, and p5.js","stars":16,"forks":"2","updated":"2020-11-15T23:36:36Z","info":[]},{"url":"/shiffman/ShapeClassifierCNN","name":"ShapeClassifierCNN","lang":"JavaScript","desc":"test code for new tutorial","stars":11,"forks":"1","updated":"2020-11-06T15:02:26Z","info":[]},{"url":"/shiffman/Bot-Code-of-Conduct","name":"Bot-Code-of-Conduct","desc":"Code of Conduct to guide ethical bot making practices","stars":15,"forks":"1","updated":"2020-10-15T18:30:26Z","info":[]},{"url":"/shiffman/Twitter-Bot-A2Z","name":"Twitter-Bot-A2Z","lang":"JavaScript","desc":"New twitter bot examples","stars":26,"forks":"2","updated":"2020-10-13T16:17:45Z","info":["hacktoberfest","!DOCTYPE html \"\""],"repo_url":"/shiffman?tab=repositories"}

You can use
gsub('}{"entries":[', '},{"entries":[', x, fixed=TRUE)
So, this is a plain replacement of every }{"entries":[ with },{"entries":[.
Note the fixed=TRUE parameter, which disables the regex engine so the pattern is matched as a literal string.
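As a small illustration of the replacement, here is a sketch on a made-up three-object string (the sample string and variable names are assumptions for demonstration only). The second call is a regex variant that may help if whitespace or newlines sit between the objects, which the literal pattern would not match:

```r
# Made-up sample: three objects run together with no separating commas
x <- '{"entries":[1]}{"entries":[2]}{"entries":[3]}'

# Literal replacement, as in the answer above
with_commas <- gsub('}{"entries":[', '},{"entries":[', x, fixed = TRUE)

# Wrapping the result in [ ] turns the sequence into one JSON array
as_array <- paste0("[", with_commas, "]")

# Variant (an assumption about the input): objects separated by newlines.
# Here the braces and bracket are escaped and \\s* absorbs the whitespace.
y <- '{"entries":[1]}\n{"entries":[2]}'
y_fixed <- gsub('\\}\\s*\\{"entries":\\[', '},{"entries":[', y, perl = TRUE)
```

The wrapped string can then be handed to a JSON parser, provided each individual object is itself well formed.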

Python3: print specific values from json

Python 3:
I need to print the name, email, city, phone for all users in a json file.
I am just learning Python, so I don't know what code to use.
I can get the file, but don't know what to do to print the correct info.
#Imported functions
import requests
import json
#Using the following API endpoint:
#https://jsonplaceholder.typicode.com/users
#Use the GET method of the requests library to read and JSON encode your request.
r = requests.get('https://jsonplaceholder.typicode.com/users')
data = r.json()
print(r)
print()
print(data)
I want a nicely formatted list of the name, email, city, phone for all users.
Thanks for your help!

import requests
import json
r = requests.get('https://jsonplaceholder.typicode.com/users')
data = r.json()
for row in data:
    print("Name: {}\nEmail: {}\nCity: {}\nPhone: {}\n".format(row['name'], row['email'], row['address']['city'], row['phone']))
    # alternative to the line above
    # print("Name: {name}\nEmail: {email}\nCity: {address[city]}\nPhone: {phone}\n".format_map(row))
Short explanation: data contains a list of the entries in the JSON response you requested. In this case there are 10 entries, so data will have 10 items.
for row in data:
    print(...)
will loop through data (the list with 10 entries), assigning each entry to row in turn. Each row is printed in a fixed format: not the whole row, only certain fields, which you access by key, in this case ['name'] and so on.

Importing option chain data from Bloomberg

I would like to import from Bloomberg into R for a specified day the entire option chain for a particular stock, i.e. all expiries and strikes for the exchange traded options. I am able to import the option chain for a non-specified day (today):
bbgData <- bds(connection,sec,"OPT_CHAIN")
Where connection is a valid Bloomberg connection and sec is a Bloomberg security ticker such as "TLS AU Equity"
However, if I add extra fields it doesn't work, i.e.
bbgData <- bds(connection, sec,"OPT_CHAIN", testDate, "OPT_STRIKE_PX", "MATURITY", "PX_BID", "PX_ASK")
bbgData <- bds(connection, sec,"OPT_CHAIN", "OPT_STRIKE_PX", "MATURITY", "PX_BID", "PX_ASK")
Similarly, if I switch to using the historical data function it doesn't work
bbgData <- dateDataHist <- bdh(connection,sec,"OPT_CHAIN","20160201")
I just need the data for one day, but for a specified day, and including the additional fields
Hint: I think the issue is that every field following "OPT_CHAIN" is dependent on the result of "OPT_CHAIN", so for example it is the strike price given the code in "OPT_CHAIN", but I am unsure how to introduce this conditionality into the R Bloomberg query.

It's better to use the field CHAIN_TICKERS and related overrides when retrieving option data for a given underlying from Bloomberg. You can, for example, request points for a given moneyness by getting CHAIN_TICKERS with an override of CHAIN_STRIKE_PX_OVRD equal to 90%-110%.
In either case you need to use the tickers that are the result of your first request in a second request if you want to retrieve additional data. So:
option_tickers <- bds("TLS AU Equity", "CHAIN_TICKERS",
                      overrides = c(CHAIN_STRIKE_PX_OVRD = "90%-110%"))
option_prices <- bdp(sapply(option_tickers, paste, "equity"), c("PX_BID", "PX_ASK"))

extracting all .com, .in, .co.in from all elements

I have data in csv which contains following column
ARTICLE_URL
http://twitter.com/aviryadsh/statuses/528219883872337920
http://www.ibtimes.co.in/2014
I want to create another column next to this one that contains only the web address, such as twitter.com, team-bhp.com, ibtimes.co.in, or broadbandforum.co.
I have tried
text$ne=str_extract(Brand$ARTICLE_URL, '\\w+(.com)')
but this returns only the URLs ending in .com. How can I fetch all the others as well?

I'd recommend using string replacement as opposed to string extraction in this instance. It's possible to do with string extraction, but the regular expression is kind of messy and not as readable as a two-step string replacement method. Here's how I'd do it:
urls <- c("http://twitter.com/aviryadsh/statuses/528219883872337920", "http://www.ibtimes.co.in/2014", "https://www.ibtimes.co.in/2014")
tmp <- stringr::str_replace_all(urls, "^https?://(www\\.)?", "")
domains <- stringr::str_replace_all(tmp, "/.*", "")
And then looking at our output:
domains
# [1] "twitter.com" "ibtimes.co.in" "ibtimes.co.in"
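To apply the same two-step idea to a data frame column, as the question asks, a minimal base R sketch might look like this. The name extract_domain is a hypothetical helper, and the Brand data frame is rebuilt here from the two sample URLs in the question:

```r
# Hypothetical helper: strip the scheme and an optional leading "www.",
# then drop everything from the first "/", "?", "#" or ":" onwards.
extract_domain <- function(url) {
  no_scheme <- sub("^https?://(www\\.)?", "", url)
  sub("[/?#:].*$", "", no_scheme)
}

# Sample data rebuilt from the question
Brand <- data.frame(
  ARTICLE_URL = c("http://twitter.com/aviryadsh/statuses/528219883872337920",
                  "http://www.ibtimes.co.in/2014"),
  stringsAsFactors = FALSE
)

# New column next to the URL column
Brand$domain <- extract_domain(Brand$ARTICLE_URL)
```

Because sub() is vectorised, the helper works on the whole column at once without an explicit loop.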

Reconstitute PNG file stored as RAW in SQL Database

I am working toward writing a report from a SQL database (Windows SQL Server) that will require certain people to sign the report before submitting it to the client. We are hoping to have a system where these people can authorize their signature in the database, and then we can use an image of their signature stored in the database and place it on the report generated by LaTeX.
The signature images are created as PNGs, then stored in the database in a field of type varbinary. In order to use the signature in the report, I need to reconstitute the PNG as a file that I can include with \includegraphics in LaTeX.
Unfortunately, I can't seem to recreate the PNGs from the database. Since I can't post a signature, we'll use the image below as an example.
With this image on my computer, I'm able to read the file as raw, write it to a different file, and get the same image when I open the new file.
#* It works to read the image from a file and rewrite it elsewhere
pal <- readBin("C:/[filepath]/ColorPalette.png",
               what = "raw", n = 1e8)
writeBin(pal,
         "C:/[filepath]/colors.png",
         useBytes=TRUE)
Now, I've saved that same image to the database, and using RODBC, I can extract it like so:
#*** Capture the raw from the database
con <- odbcConnect("DATABASE")
Users <- sqlQuery(con, "SELECT * FROM dbo.[User]")
db_pal <- Users$Signature[Users$LastName == "MyName"]
#*** Write db_pal to a file, but the image won't render
#*** Window Photo Viewer can't open this picture because the file appears to be damaged, corrupted, or is too large (12KB)
writeBin(db_pal[[1]],
         "C:/[filename]/db_colors.png",
         useBytes=TRUE)
The objects pal and db_pal are defined here in this Gist (they are too long to fit in the allowable space here)
Note: db_pal is a list of one raw vector. Also, it's clearly different than the raw vector pal
> length(pal)
[1] 2471
> length(db_pal[[1]])
[1] 9951
Any thoughts on what I may need to do to get this image out of the database?

Well, we've figured out a solution. The raw vector being returned through RODBC did not match what was in the SQL database. Somewhere in the pipeline, the varbinary object from SQL was getting distorted. I'm not sure why or how. But this answer to a different problem inspired us to recast the variables. As soon as we recast them, we could see the correct representation.
The next problem was that all of our images are more than 8000 bytes, and RODBC only allows 8000 characters at a time. So I had to fumble my way around that. The code below does the following:
1. Determine the largest number of bytes in an image file.
2. Create a set of variables (ImagePart1, ..., ImagePart[n]) breaking the image into as many parts as necessary, each with max length 8000.
3. Query the database for all of the images.
4. Combine the image parts into a single object.
5. Write the images to a local file.
The actual code
library(RODBC)
lims <- odbcConnect("DATABASE")
#* 1. Determine the number of bytes in the largest image file
ImageLength <- sqlQuery(lims,
                        paste0("SELECT MaxLength = MAX(LEN(u.Image)) ",
                               "FROM dbo.[User] u"))
#* 2. Create a query string to make a set of variables breaking
#*    the images into as many parts as necessary, each with
#*    max length 8000
n_img_vars <- ImageLength$MaxLength %/% 8000 + 1
start <- 1 + 8000 * (0:(n_img_vars - 1))
#* SUBSTRING takes (expression, start, length), so each part requests 8000 bytes
img_parts <- paste0("ImagePart", 1:n_img_vars,
                    " = CAST(SUBSTRING(u.Image, ", start,
                    ", 8000) AS VARBINARY(8000))")
full_query <- paste0("SELECT u.OID, u.LastName, u.FirstName,\n",
                     paste0(img_parts, collapse = ",\n"), "\n",
                     "FROM dbo.[User] u \n",
                     "WHERE LEN(u.Image) > 0")
#* 3. Query the database for all the images
Images <- sqlQuery(lims, full_query)
#* 4. Combine the image parts into a single object
Images$full_image <-
  apply(Images[, grepl("ImagePart", names(Images))], 1,
        function(x) do.call("c", x))
#* 5. Write the images to a local file
for (i in seq_len(nrow(Images))) {
  DIR <- "[FILE_DIR]"
  FILENAME <- with(Images, paste0(OID[i], "-", LastName[i], ".png"))
  writeBin(unlist(Images$full_image[i]),
           file.path(DIR, FILENAME))
}
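The split-and-recombine arithmetic can be checked offline without a database. The following is a minimal sketch under stated assumptions: chunk_starts is a hypothetical helper, the byte vector is a stand-in rather than a real image, and the chunk count uses ceiling() instead of the %/% ... + 1 formula above so that an exact multiple of 8000 does not yield an empty extra part:

```r
# Hypothetical helper: 1-based start offsets for fixed-size chunks
chunk_starts <- function(total_bytes, chunk_size = 8000) {
  n_chunks <- ceiling(total_bytes / chunk_size)
  1 + chunk_size * (seq_len(n_chunks) - 1)
}

# Stand-in for an image: 9951 bytes, the length of db_pal[[1]] in the question
img <- as.raw(seq_len(9951) %% 256)
starts <- chunk_starts(length(img))

# Cut the vector into parts of at most 8000 bytes ...
parts <- lapply(starts, function(s) img[s:min(s + 7999, length(img))])

# ... and glue them back together, as step 4 does with do.call("c", ...)
rebuilt <- do.call("c", parts)
```

For a 9951-byte image this gives two parts, starting at bytes 1 and 8001, and concatenating them reproduces the original vector.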

I may be misinterpreting the question, but it is possible that the raster package should be of help to you.
library(raster)
your_image <- raster(nrows=587,ncols=496,values=db_pal[[1]])
plot(your_image)
But it doesn't make sense that the length of db_pal[[1]] isn't 291,152 (587*496), so something isn't adding up for me. Do you know where these 291,152 values would be stored?
