R: Open websites from a URL string in a data frame

I have an R data frame with a list of 500ish URLs. It looks a bit like this:
websites <- data.frame(rbind("www.nytimes.com", "www.google.com", "www.facebook.com"))
I want to go through these URLs and open them (maybe 10 at a time) in Google Chrome.
How would I go about this automatically with R?

I used this to get all 3 of them to open.
websites <- data.frame(rbind("www.nytimes.com", "www.google.com", "www.facebook.com"))
websites <- as.data.frame(t(websites))
websites[] <- lapply(websites, as.character)
webVec <- unname(unlist(websites[1,]))
for (i in seq_along(webVec)) {
  shell.exec(webVec[i])
}
This opens all of them at once, however, and I'm not sure how to open only a certain number at a time. I took a stab at it, though:
setTen <- 1
for (i in (10 * (setTen - 1) + 1):min(10 * setTen, length(webVec))) {
  shell.exec(webVec[i])
}
The setTen variable selects whether you want the first ten websites, the second ten, etc.
I couldn't test it properly though, since there are only 3 sites in this data frame.
If it doesn't work, let me know and I'll try to figure out a different method.
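Another rough way to do the same thing (an untested sketch) is to split webVec into groups of ten up front and open one group per call; note that shell.exec() is Windows-only, and browseURL() would be the cross-platform equivalent:
batches <- split(webVec, ceiling(seq_along(webVec) / 10))
openBatch <- function(setTen) {
  # Opens the setTen-th group of (up to) ten sites in the default browser
  for (url in batches[[setTen]]) {
    shell.exec(url)
  }
}
openBatch(1)  # first ten sites; openBatch(2) would open the next ten, and so on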

Related

How To Prevent Web Scraping Script From Being Blocked By Google (HTTP 429)

I have this script that takes each domain name in a DataFrame and performs an "inurl:domain automation testing" Google search for it. I then scrape the first search result and add it to my DataFrame.
import random
import time  # needed for time.sleep() below
from googlesearch import search  # assumed import for search(); not shown in the original snippet

# Convert the Domain column in the DataFrame into a list
working_backlink = backlink_df.iloc[23:len(backlink_df['Domain']), 1:22]
working_domain = working_backlink["Domain"]
domain_list = working_domain.values.tolist()

# Iterate through the list and perform a query search for each domain
for x in range(23, len(domain_list)):
    sleeptime = random.randint(1, 10)
    time.sleep(sleeptime)
    for i in domain_list:
        query = "inurl:{} automation testing".format(i)
        delay = random.randint(10, 30)
        for j in search(query, tld="com", num=1, stop=1, pause=delay):
            working_backlink.iat[x, 5] = j

# Show the DataFrame
working_backlink.head(n=40)
I tried using sleeptime and a random delay to prevent the HTTP 429 error, but it still doesn't work. Could you suggest any solution to this? Thanks a lot!

downloading data and saving data to a folder in batches

I have 200,000 links that I am trying to download. I have tried downloading them all in one go, but I ran into memory issues.
I am trying to create a function which will download 1000 links at a time and save them in a folder.
Packages:
library(dplyr)
library(purrr)
library(edgarWebR)
A small sample of the data is as follows:
Data 1:
urls_to_parse <- c("https://www.sec.gov/Archives/edgar/data/1750/000104746918004978/a2236183z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746917004528/a2232622z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746916014299/a2228768z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746915006136/a2225345z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746914006243/a2220733z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746913007797/a2216052z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746912007300/a2210166z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746911006302/a2204709z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746910006500/a2199382z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746909006783/a2193700z10-k.htm"
)
I then apply the following function to download these 10 links
parsed_files <- map(urls_to_parse, possibly(parse_filing, otherwise = NA))
This stores the results as a nice list. I can then apply names(parsed_files) <- urls_to_parse to name each list element after the link it was downloaded from, and use output <- plyr::ldply(parsed_files, data.frame) to collect everything in a data frame.
Using the data below, how could I download the data in batches of, say, 10?
What I have currently:
start <- 1
end <- 100
output <- list()
for (i in start:end) {
  output[[i]] <- map(urls_to_parse[[i]], possibly(parse_filing, otherwise = NA))
  names(output)[i] <- urls_to_parse[i]
  save(output, file = paste0("C:/Users/Downloads/data/", i, "output.RData"))
}
I am sure there is a better way using a function, since this code breaks for some of the results.
More data (100 links):
urls_to_parse <- c("https://www.sec.gov/Archives/edgar/data/1750/000104746918004978/a2236183z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746917004528/a2232622z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746916014299/a2228768z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746915006136/a2225345z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746914006243/a2220733z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746913007797/a2216052z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746912007300/a2210166z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746911006302/a2204709z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746910006500/a2199382z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746909006783/a2193700z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746908008126/a2186742z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000110465907055173/a07-18543_110k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000110465906047248/a06-15961_110k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000110465905033688/a05-12324_110k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746904023905/a2140220z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746903028005/a2116671z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000091205702033450/a2087919z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/61478/000095012310108231/c61492e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/61478/000095015208010514/n48172e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/61478/000095013707018659/c22309e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/61478/000095013707000193/c11187e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/61478/000095013406000594/c01109e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/61478/000120677405000032/d16006.htm",
"https://www.sec.gov/Archives/edgar/data/61478/000120677404000013/d13773.htm",
"https://www.sec.gov/Archives/edgar/data/61478/000104746903001075/a2097401z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/61478/000091205702001614/a2067550z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/319126/000115752308008030/a5800571.htm",
"https://www.sec.gov/Archives/edgar/data/319126/000115752307009801/a5515869.htm",
"https://www.sec.gov/Archives/edgar/data/319126/000115752306009238/a5227919.htm",
"https://www.sec.gov/Archives/edgar/data/730469/000073046908000102/alpharmainc_10k.htm",
"https://www.sec.gov/Archives/edgar/data/730469/000073046907000017/alo10k2006.htm",
"https://www.sec.gov/Archives/edgar/data/730469/000073046906000027/alo10k2005.htm",
"https://www.sec.gov/Archives/edgar/data/730469/000073046905000021/alo10k2004final.htm",
"https://www.sec.gov/Archives/edgar/data/730469/000073046904000058/alo10k2003master.htm",
"https://www.sec.gov/Archives/edgar/data/730469/000073046903000001/alo10k.htm",
"https://www.sec.gov/Archives/edgar/data/730469/000073046902000004/alo10k2001.htm",
"https://www.sec.gov/Archives/edgar/data/730469/000073046901500003/alo.htm",
"https://www.sec.gov/Archives/edgar/data/4515/000000620118000009/a10k123117.htm",
"https://www.sec.gov/Archives/edgar/data/4515/000119312517051216/d286458d10k.htm",
"https://www.sec.gov/Archives/edgar/data/4515/000119312516474605/d78287d10k.htm",
"https://www.sec.gov/Archives/edgar/data/4515/000119312515061145/d829913d10k.htm",
"https://www.sec.gov/Archives/edgar/data/4515/000000620114000004/aagaa10k-20131231.htm",
"https://www.sec.gov/Archives/edgar/data/6201/000000620113000023/amr-10kx20121231.htm",
"https://www.sec.gov/Archives/edgar/data/6201/000119312512063516/d259681d10k.htm",
"https://www.sec.gov/Archives/edgar/data/6201/000095012311014726/d78201e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/6201/000000620110000006/ar123109.htm",
"https://www.sec.gov/Archives/edgar/data/6201/000000620109000009/ar120810k.htm",
"https://www.sec.gov/Archives/edgar/data/6201/000000451508000014/ar022010k.htm",
"https://www.sec.gov/Archives/edgar/data/6201/000095013407003888/d43815e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/6201/000095013406003715/d33303e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/6201/000095013405003726/d22731e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/6201/000095013404002668/d12953e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/6201/000104746903013301/a2108197z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/65695/000095013407003823/h42902e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/65695/000095012906002343/h31028e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/65695/000095012905002955/h22337e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000156459018005085/cece-10k_20171231.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000156459017004264/cece-10k_20161231.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000156459016015157/cece-10k_20151231.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312515095828/d864880d10k.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312514098407/d661608d10k.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312513109153/d444138d10k.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312512119293/d293768d10k.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312511067373/d10k.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312510069639/d10k.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312509055504/d10k.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312508058939/d10k.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312507071909/d10k.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312506068031/d10k.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312505077739/d10k.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312504052176/d10k.htm",
"https://www.sec.gov/Archives/edgar/data/2601/000110465910047121/a10-16705_110k.htm",
"https://www.sec.gov/Archives/edgar/data/2601/000114420409046933/v159572_10k.htm",
"https://www.sec.gov/Archives/edgar/data/2601/000110465906060737/a06-19311_110k.htm",
"https://www.sec.gov/Archives/edgar/data/2601/000104746905022854/a2162888z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/2601/000104746904028585/a2143353z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/2601/000104746903031974/a2119476z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000143774918010388/avx20180331_10k.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916317000028/avx-20170331x10k.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916316000079/avx-20160331x10k.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916315000024/avx-20150331x10k.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916314000035/avx-20140331x10k.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916313000022/avx-20130331x10k.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916312000024/avxform10kfy12.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916311000013/avxform10kfy11.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916310000020/avxform10kfy10.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916309000117/form10kfy09.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916308000192/form10qq1fy09.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916308000101/form10kfy08.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916307000122/form10kfy07.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916306000102/avxfy06form10-k.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916305000094/fy0510k.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916304000091/fy0410k.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916303000020/fy0310k.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916302000007/r10k-0302.htm",
"https://www.sec.gov/Archives/edgar/data/7286/000076462218000018/pnw2017123110-k.htm",
"https://www.sec.gov/Archives/edgar/data/7286/000076462217000010/pnw2016123110-k.htm",
"https://www.sec.gov/Archives/edgar/data/7286/000076462216000087/pnw2015123110-k.htm",
"https://www.sec.gov/Archives/edgar/data/7286/000076462215000013/pnw12311410-k.htm",
"https://www.sec.gov/Archives/edgar/data/7286/000110465914012068/a13-25897_110k.htm"
)
Looping over batches the way you showed is a bad idea. If you have thousands of files to download, how do you recover from errors?
Performance does not depend solely on your computer's configuration; network performance is crucial too.
Here are a couple of suggestions.
Option 1
Partition all the URLs into batches that can be downloaded in parallel. The number of files downloaded at once could equal the number of cores on your computer. See this question: reading multiple files quickly in R
Store these batches in a queue object, for example using a package like https://cran.r-project.org/web/packages/dequer/dequer.pdf
Pop the queue and pass that batch of URLs to your parallel file-download function.
Use a retryable file-download function, as in: HTTP error 400 in R, error handling, How to retry instead of forcing to stop?
Once the queue is completed, move on to the next partition.
Wrap the whole operation in a retryable loop. For example: How to retry a statement on error?
Why use a queue? Because you can easily retry on error.
Pseudocode:
file_url_partitions <- partition_as_batches(all_urls, batch_size)
attempt <- 1
while (file_url_partitions is not empty && attempt <= 3) {
  batch <- file_url_partitions.pop()
  tryCatch({
    download_parallel(batch)
  }, some_exception = function(se) {
    file_url_partitions.push(batch)
    attempt <- attempt + 1
  })
}
Note: I don't have access to an R environment right now, so I have no way to try this.
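For what it's worth, here is a minimal concrete sketch of the batching part in R (untested, and sequential rather than parallel; it assumes parse_filing from edgarWebR, purrr::possibly, and the urls_to_parse vector and save path from the question):
library(purrr)
library(edgarWebR)

batch_size <- 10
batches <- split(urls_to_parse, ceiling(seq_along(urls_to_parse) / batch_size))

for (b in seq_along(batches)) {
  batch_urls <- batches[[b]]
  # Download one batch; failed downloads become NA instead of stopping the run
  output <- map(batch_urls, possibly(parse_filing, otherwise = NA))
  names(output) <- batch_urls
  save(output, file = paste0("C:/Users/Downloads/data/batch_", b, ".RData"))
}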
Option 2
Download the files separately using a download manager (or similar tool) and then work with the downloaded files.
Some useful resources:
https://www.r-bloggers.com/r-with-parallel-computing-from-user-perspectives/
http://adv-r.had.co.nz/beyond-exception-handling.html

Accessing Spotify API with Rspotify to obtain genre information for multiple artists

I am using RStudio 3.4.4 on a Windows 10 machine.
I have a vector of artist names and I am trying to get genre information for them all on Spotify. I have successfully set up the API, and the RSpotify package is working as expected.
I am trying to build up to creating a function, but I am failing pretty early on.
So far I have the following, but it is returning unexpected results:
len <- nrow(Artist_Nam)
artist_info <- character(len)
for (i in 1:len) {
  ifelse(nrow(searchArtist(Artist_Nam$ArtistName[i], token = keys)) >= 1,
         artist_info[i] <- searchArtist(Artist_Nam$ArtistName[i], token = keys)$genres[1],
         artist_info[i] <- "")
}
artist_info
I was expecting this to return a list of genres, with an empty entry "" for artists where there is no match on Spotify.
Instead, what is returned is a list whose entries are populated with genres; on inspection these genres are correct, and there are "" entries where there is no match. However, something odd happens from [73] onwards (I have over 3,000 artists): the list now only returns "",
even though when I look these artists up manually with searchArtist() there are matches.
I wonder if anyone has any suggestions or has experienced anything like this before?
There may be a rate limit on the number of requests you can make per minute, and you may just be hitting that limit. Add a small delay with Sys.sleep() inside your loop so that you don't hit their API hard enough to be throttled.
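For example, here is a minimal sketch of the same loop with a delay between requests (untested; it assumes the Artist_Nam data frame, the keys token, and searchArtist() exactly as used in the question, and the half-second pause is an arbitrary starting point):
len <- nrow(Artist_Nam)
artist_info <- character(len)
for (i in 1:len) {
  # One request per artist, reused for both the match check and the genre lookup
  res <- searchArtist(Artist_Nam$ArtistName[i], token = keys)
  if (nrow(res) >= 1) {
    artist_info[i] <- res$genres[1]
  } else {
    artist_info[i] <- ""
  }
  Sys.sleep(0.5)  # pause between requests to stay under the rate limit
}
artist_info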

taRifx.geo - Creating multiple georoutes

I'm not sure what I'm doing wrong. I'm trying to find the drive time between many zip codes. I am able to do so with ggmap, however the API limit is 2,500 requests. I need to run this for more than that and heard that Bing can help. I tried using taRifx.geo, which works for a single zip-to-zip drive, but I need each zip combination listed individually. Let's get to the examples:
Google:
require(ggmap)
from <- as.character(c("27205","48212"))
to <- as.character(c("54952","14450"))
driveTimes <- mapdist(from, to, mode='driving')
print(driveTimes)
from to m km miles seconds minutes hours
1 27205 54952 1533077 1533.077 952.654 52716 878.6000 14.643333
2 48212 14450 555906 555.906 345.440 19700 328.3333 5.472222
^^ Notice the two drive times
taRifx.geo
When I use taRifx.geo, it treats each zip code as a waypoint along a single trip.
require("taRifx.geo")
from <- as.character(c("27205","48212"))
to <- as.character(c("54952","14450"))
combined <- data.frame(from, to)
combined[] <- lapply(combined, as.character)
for (i in 1:nrow(combined){
driveTimes <- georoute( c(combined[i,1], combined[i,2]),
verbose=TRUE,
returntype="time",
service="bing" )
}
print(driveTimes)
time
1 21040
^^ Here I need it to print two rows, one for each zip-to-zip drive time.
Thanks in advance for your help! I've tried several methods defining the from/to, but I might be suffering from lack of sleep and can't see the problem right in front of me. If there is a better solution, please do tell. :)
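No answer is recorded here, but note that driveTimes is overwritten on every pass through the loop, so only the last route survives. A rough sketch that keeps one result per row instead (untested; it reuses the georoute() call exactly as written above):
require("taRifx.geo")

from <- as.character(c("27205", "48212"))
to <- as.character(c("54952", "14450"))
combined <- data.frame(from, to, stringsAsFactors = FALSE)

driveTimes <- vector("list", nrow(combined))
for (i in 1:nrow(combined)) {
  driveTimes[[i]] <- georoute(c(combined[i, 1], combined[i, 2]),
                              verbose = TRUE,
                              returntype = "time",
                              service = "bing")
}
driveTimes <- do.call(rbind, driveTimes)  # one row per zip-to-zip pair
print(driveTimes)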

R- Excluding random numbers that have already been generated

So I'm working on a webscraping script in R and because the particular website I'm scraping doesn't take too kindly to people who scrape their data in large volumes, I have broken down my loop to handle only 10 links at a time. I still want to go through all the links, however, just in a random and slow manner.
productLink  # A list of all the links that I'll be scraping
x <- length(productLink)
randomNum <- sample(1:x, 10)

library(rvest)
for (i in 1:10) {
  url <- productLink[randomNum[i]]
  specs <- url %>%
    html() %>%
    html_nodes("h5") %>%
    html_text()
  specs
  message <- "\n Temporarily unavailable\n "
  if (specs == message) {
    print("Item unavailable")
  } else {
    print("Item available")
  }
}
Now the next time I run this for-loop I want to exclude all the random numbered indices that have already been tried in the previous running of the loop. That way this for loop runs through 10 new links each time until all the links have been used. There is another aspect to this that I'd like some input on. Since I can raise alarm flags by brute force scraping the particular company's website, is there any way I can slow down this loop so that it only runs every couple of minutes? I'm thinking of a timeout function or such where the code runs the for-loop once, waits a few minutes then runs it again (with new links each time as mentioned above). Any ideas?
Use something like this. Loop over all the product indices in random order.
for (i in sample(1:x)) {
  <Your code here>
  # Sleep for 120 seconds
  Sys.sleep(120)
}
And if you want to do 10 at a time, sleep for 120 seconds after every 10 executions.
n <- 0
for (i in sample(1:x)) {
  <Your code here>
  n <- n + 1
  # Sleep for 120 seconds after every 10 runs
  if (n %% 10 == 0) Sys.sleep(120)
}
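To also exclude links that were already tried in earlier runs (the other part of the question), one rough approach is to shuffle the indices once, split them into chunks of ten, and work through one chunk per pass, pausing between chunks. A sketch, assuming productLink as above:
shuffled <- sample(seq_along(productLink))
chunks <- split(shuffled, ceiling(seq_along(shuffled) / 10))

for (chunk in chunks) {
  for (idx in chunk) {
    url <- productLink[idx]
    # ... scrape url exactly as in the question ...
  }
  Sys.sleep(120)  # pause a couple of minutes between chunks to avoid raising flags
}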