R Text data scraper loop through dates

I'm doing a little project where the goal is to retrieve data in text format from a website (http://regsho.finra.org/regsho-Index.html).
The website kindly provides the data online, but it is split over several days, each behind a different link.
I thought about looping through the dates and storing the data with the following code:
#Download the needed data
my_data <- c()
for (i in 01:13){
my_data <- read.delim(sprintf("http://regsho.finra.org/CNMSshvol202005%i.txt", i), header=TRUE, sep="|")
}
head(my_data)
There are two problems. First, in this line
for (i in 01:13){ # the dates on the website are 01, 02, 03, ... and the loop seems to omit the leading 0
I've used sprintf() so I can put a variable inside the string, but the leading zero gets dropped.
Second, in this line
my_data <- read.delim(sprintf("http://regsho.finra.org/CNMSshvol202005%i.txt", i), header=TRUE, sep="|")
my_data is overwritten on every iteration, so it only ever contains the last file downloaded.
Could somebody reassure me that I'm going in the right direction? I'm starting to doubt myself here.
Any help would be greatly appreciated!
Thanks in advance

This should give you a leading 0 without using an extra package:
sprintf("%02d", i)
i.e.
sprintf("http://regsho.finra.org/CNMSshvol202005%02d.txt", i)


R scraper - how to find entry with missing data

Problem:
While scraping a webpage (imdb.com, a page with film details) an error message is displayed. When I checked in detail, I noticed that some entries have no data available.
How can I figure out during scraping which line has no data, and how can I fill it with NA?
Manual investigation:
I checked the webpage manually; the problem is with rank number 1097, where only the film genre is available and there is no runtime.
Tried:
Adding an if that inserts a 0, but the 0 is appended to the last line, not to the title that is missing the value.
Code:
#install packages
install.packages("rvest")
install.packages("RSelenium")
library(rvest)
library(RSelenium)
#open browser (in my case Firefox)
rD <- rsDriver(browser=c("firefox"))
remDr <- rD[["client"]]
#set variable for the link
ile<-seq(from=1, by=250, length.out = 5)
#create empty frame
filmy_df=data.frame()
#empty values
rank_data<-NA;link<-NA;year<-NA;title_data<-NA;description_data<-NA;runtime_data<-NA;genre_data<-NA
#loop reading the data from each page
for (j in ile){
#set link for browser
newURL<-"https://www.imdb.com/search/title/?title_type=feature&release_date=,2018-12-31&count=250&start="
startNumberURL<-paste0(newURL,j)
#open link
remDr$navigate(startNumberURL)
#read webpage code
strona_int<-read_html(startNumberURL)
#empty values
rank_data<-NA;link<-NA;year<-NA;title_data<-NA;description_data<-NA;runtime_data<-NA;genre_data<-NA
#read rank
rank_data<-html_nodes(strona_int,'.text-primary')
#convert text
rank_data<-html_text(rank_data)
#remove the comma for thousands
rank_data<-gsub(",","",rank_data)
#convert numeric
rank_data<-as.numeric(rank_data)
#read link for each movie
link<-url_absolute(html_nodes(strona_int, '.lister-item-header a')%>%html_attr(.,'href'),"https://www.imdb.com")
#release year
year<-html_nodes(strona_int,'.lister-item-year')
#convert text
year<-html_text(year)
#remove non numeric
year<-gsub("\\D","",year)
#set factor
year<-as.factor(year)
#read title
title_data<-html_nodes(strona_int,'.lister-item-header a')
#convert text
title_data<-html_text(title_data)
#title_data<-as.character(title_data)
#read description
description_data<-html_nodes(strona_int,'.ratings-bar+ .text-muted')
#convert text
description_data<-html_text(description_data)
#remove '\n'
description_data<-gsub("\n","",description_data)
#remove space
description_data<-trimws(description_data,"l")
#read runtime
runtime_data <- html_nodes(strona_int,'.text-muted .runtime')
#convert text
runtime_data <- html_text(runtime_data)
#remove min
runtime_data<-gsub(" min","",runtime_data)
length_runtime_data<-length(runtime_data)
#if (length_runtime_data<250){ runtime_data<-append(runtime_data,list(0))}
runtime_data<-as.numeric(runtime_data)
#temp_df
filmy_df_temp<-data.frame(Rank=rank_data,Title=title_data,Release.Year=year,Link=link,Description=description_data,Runtime=runtime_data)
#add to df
filmy_df<-rbind(filmy_df,filmy_df_temp)
}
#close browser
remDr$close()
#stop RSelenium
rD[["server"]]$stop()
Error message displayed:
"Error in data.frame(Rank = rank_data, Title = title_data, Release.Year = year, :
arguments imply differing number of rows: 250, 249"
runtime_data contains only 249 entries instead of 250, so the runtime ends up missing for the last line instead of for the line where it is really missing.
Update
I have found something interesting which may help to solve the problem.
Please check the pictures (screenshots of the two consecutive entries):
Anima - source of error
Knocked Up - next entry
Comparing the two, we can see that Anima, which is causing the problem with runtime_data, does not have an HTML node containing the runtime at all.
So the question is: is there a way to check whether an html_node exists or not? If yes, how?
You wouldn't run into this problem if you had structured your program a little differently. In general, it is better to split your program into logically separate chunks that are more or less independent of each other, rather than doing everything at once. That makes debugging much easier.
First, scrape the data and store it in a list – use lapply or something similar for that.
newURL <- "https://www.imdb.com/search/title/?title_type=feature&release_date=,2018-12-31&count=250&start="
pages <- lapply(ile, function(j) { #set link for browser
startNumberURL<-paste0(newURL,j)
#open link
remDr$navigate(startNumberURL)
#read webpage code
read_html(startNumberURL)
})
Then you have your data scraped and you can take all the time you need to analyse and filter it, without having to start the reading process again. For example, define the function as follows:
parsuj_strone <- function(strona_int) {
#read rank
rank_data<-html_nodes(strona_int,'.text-primary')
#convert text
rank_data<-html_text(rank_data)
#remove the comma for thousands
rank_data<-gsub(",","",rank_data)
#convert numeric
rank_data<-as.numeric(rank_data)
#read link for each movie
link<-url_absolute(html_nodes(strona_int, '.lister-item-header a')%>%html_attr(.,'href'),"https://www.imdb.com")
#release year
year<-html_nodes(strona_int,'.lister-item-year')
#convert text
year<-html_text(year)
#remove non numeric
year<-gsub("\\D","",year)
#set factor
year<-as.factor(year)
#read title
title_data<-html_nodes(strona_int,'.lister-item-header a')
#convert text
title_data<-html_text(title_data)
#read description
description_data<-html_nodes(strona_int,'.ratings-bar+ .text-muted')
#convert text
description_data<-html_text(description_data)
#remove '\n'
description_data<-gsub("\n","",description_data)
#remove space
description_data<-trimws(description_data,"l")
#read runtime
runtime_data <- html_nodes(strona_int,'.text-muted .runtime')
#convert text
runtime_data <- html_text(runtime_data)
#remove min
runtime_data<-gsub(" min","",runtime_data)
length_runtime_data<-length(runtime_data)
runtime_data<-as.numeric(runtime_data)
#temp_df
filmy_df_temp<- data.frame(Rank=rank_data,Title=title_data,Release.Year=year,Link=link,Description=description_data,Runtime=runtime_data)
return(filmy_df_temp)
}
Now, apply the function to each scraped web site:
pages_parsed <- lapply(pages, parsuj_strone)
And finally put them together into one data frame:
pages_df <- Reduce(rbind, pages_parsed)
Reduce won't mind an occasional NULL. Good luck!
EDIT: OK, so the problem is in the parsuj_strone() function. First, replace the final line of that function by this:
filmy_df_temp<- list(Rank=rank_data,
Title=title_data,
Release.Year=year, Link=link,
Description=description_data,
Runtime=runtime_data)
return(filmy_df_temp)
Run
pages_parsed <- lapply(pages, parsuj_strone)
Then, identify which of the 5 web sites returned problematic entries:
sapply(pages_parsed, function(x) sapply(x, length))
This should give you a 6 x 5 matrix (one row per field, one column per page). Finally, pick an element which has only 249 entries; how does it look? Without knowing your parser well, this should at least give you a hint about where the problems may be.
On Stack Overflow you will find everything; you just need to know how to search for the answer.
Here is the link to the answer to my problem: Scraping with rvest: how to fill blank numbers in a row to transform in a data frame?
In short: instead of html_nodes, html_node (without the s) should be used.
#read runtime
runtime_data <- html_node(szczegoly_filmu,'.text-muted .runtime')
#convert to text
runtime_data <- html_text(runtime_data)
#remove " min"
runtime_data<-gsub(" min","",runtime_data)
runtime_data<-as.numeric(runtime_data)
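To connect that back to the original loop: html_node() has to be applied per film, so that a film without a runtime yields NA instead of simply being skipped. A minimal sketch (the '.lister-item-content' selector is an assumption about the IMDb list layout at the time):
library(rvest)
# One node per film; html_node() (no s) then returns a missing node for films
# without a runtime, so the vector keeps its full length with NA in the gaps.
szczegoly_filmu <- html_nodes(strona_int, '.lister-item-content')
runtime_data <- szczegoly_filmu %>%
  html_node('.text-muted .runtime') %>%
  html_text()
runtime_data <- as.numeric(gsub(" min", "", runtime_data))
length(runtime_data)  # same length as the other vectors (250), with NA where the runtime is missing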

A continuation of... Extracting data from an API using R

I'm super new at this and working in R for my thesis. The code in this answer (Extracting data from an API using R) finally worked for me, but I can't figure out how to add a loop to it. I keep getting the first page of the API when I need all 3360 pages.
Here's the code:
library(httr)
library(jsonlite)
r1 <- GET("http://data.riksdagen.se/dokumentlista/?
sok=&doktyp=mot&rm=&from=2000-01-01&tom=2017-12- 31&ts=&bet=&tempbet=&nr=&org=&iid=&webbtv=&talare=&exakt=&planering=&sort=rel&sortorder=desc&rapport=&utformat=json&a=s#soktraff")
r2 <- rawToChar(r1$content)
class(r2)
r3 <- fromJSON(r2)
r4 <- r3$dokumentlista$dokument
By the time I reach r4, it's already a data frame.
Please and thank you!
Edit: originally I couldn't get a URL that had the page number in it. Now I have it (below), but I still haven't been able to loop over it.
"http://data.riksdagen.se/dokumentlista/?sok=&doktyp=mot&rm=&from=2000-01-01&tom=2017-12-31&ts=&bet=&tempbet=&nr=&org=&iid=&webbtv=&talare=&exakt=&planering=&sort=rel&sortorder=desc&rapport=&utformat=json&a=s&p="
I think you can extract the URL of the next page from r3 as follows:
next_url <- r3$dokumentlista$`#nasta_sida`
# you need to re-check this, but sometimes I'm getting white spaces within the url;
# you may not face this problem, but in any case this line of code solved the issue
next_url <- gsub(' ', '', next_url)
GET(next_url)
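If you prefer to follow that field rather than counting pages, a minimal sketch could look like this (assuming `#nasta_sida` is empty or missing on the last page, which you would need to verify against the API):
library(httr)
library(jsonlite)
# Follow the "next page" URL and collect each page's documents in a list.
all_pages <- list()
url <- "http://data.riksdagen.se/dokumentlista/?sok=&doktyp=mot&rm=&from=2000-01-01&tom=2017-12-31&ts=&bet=&tempbet=&nr=&org=&iid=&webbtv=&talare=&exakt=&planering=&sort=rel&sortorder=desc&rapport=&utformat=json&a=s"
while (length(url) == 1 && nzchar(url)) {
  r3 <- fromJSON(rawToChar(GET(url)$content))
  all_pages[[length(all_pages) + 1]] <- r3$dokumentlista$dokument
  nxt <- r3$dokumentlista$`#nasta_sida`
  url <- if (is.null(nxt)) "" else gsub(" ", "", nxt)  # strip stray spaces, stop when absent
}
Since there are about 3360 pages, this will take a while.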
Update
I tried the URL with the page number for the first 10 pages and it worked:
my_dfs <- lapply(1:10, function(i){
my_url <- paste0("http://data.riksdagen.se/dokumentlista/?sok=&doktyp=mot&rm=&from=2000-01-01&tom=2017-12-31&ts=&bet=&tempbet=&nr=&org=&iid=&webbtv=&talare=&exakt=&planering=&sort=rel&sortorder=desc&rapport=&utformat=json&a=s&p=", i)
r1 <- GET(my_url)
r2 <- rawToChar(r1$content)
r3 <- fromJSON(r2)
r4 <- r3$dokumentlista$dokument
return(r4)
})
Update 2:
The extracted data frames are complex (e.g. some columns are lists of data frames), which is why a simple rbind will not work here. You'll have to do some pre-processing before stacking the data together; something like this would work (the %>% pipe requires magrittr or dplyr to be loaded):
my_dfs %>% lapply(function(df_0){
# Do some stuff here with the data, and choose the variables you need
# I chose the first 10 columns to check that I got 200 different observations
df_0[1:10]
}) %>% do.call(rbind, .)

Two different issues with dates and qplots

I am a student and have been given a project to study climate data from Giovanni (NASA). Our code is provided and we are left to 'find our way', so other answers don't seem to relate to the style of code I've been given. On top of that I am a beginner in R, so changing the code is very difficult.
Basically I'm trying to create a time-series plot from the following code:
## Function for loading Giovanni time series data
library(lubridate)   # for parse_date_time()
library(ggplot2)     # for qplot()
load_giovanni_time <- function(path){
file_data <- read.csv(path,
skip=6,
col.names = c("Date",
"Temperature",
"NA",
"Site",
"Bleached"))
file_data$Date <- parse_date_time(file_data$Date, orders="ymdHMS")
return(file_data)
}
## Create a list of files
file.list <- list.files("./Data/courseworktimeseries/")
file.list <- as.list(paste0("./Data/courseworktimeseries/", file.list))
# for(i in file.list){
# load_giovanni_time(i)
# }
#Load all the files
all_data <- lapply(X=file.list,
FUN=load_giovanni_time)
all_data <- as.data.frame(do.call(rbind, all_data))
## Inspect the data with a plot
p <- qplot(data=all_data,
x=Date,
y=Temperature,
colour=Site,
linetype=Bleached,
geom="line")
print(p)
Now the first problem is that when the data is merged into one dataset, all the dates change (the starting date range is 2002-2015 and it becomes 2002-2030), which obviously ruins the plot. I found that I can stop the dates changing by deleting this line:
file_data$Date <- parse_date_time(file_data$Date, orders="ymdHMS")
However, when it is deleted, I get the following error:
geom_path: Each group consists of only one observation. Do you need to adjust the group aesthetic?
Could anyone help me get around this without editing the code too much? I suspect the problem is the line that parses the date formatting it incorrectly, so I imagine it's only a small problem. I'm very much a beginner and have to implement the code within 1-2 days.
Thanks
For anyone who ever has this problem... I found the solution.
file_data$Date <- parse_date_time(file_data$Date, orders="ymdHMS")
This line reads the date information from my CSV in year-month-day order. In Excel my dates were the other way round, so if a cell said 30th December 2015 (30/12/2015), R would read it in as 2030-12-20, screwing up the data.
In Excel, select all dates, press CTRL+1 and change the format to match the order R is parsing.
All done =)
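An alternative that avoids touching the spreadsheet, if the CSV really stores dates as day/month/year, would be to tell lubridate to parse them in that order. A small sketch, not tested against the Giovanni files:
library(lubridate)
# Hypothetical alternative: if the CSV stores dates as day/month/year,
# parse them in that order instead of reformatting the cells in Excel.
file_data$Date <- parse_date_time(file_data$Date, orders = "dmyHMS")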

Darksky api in R loop not working

I am a huge R fan, but it never seems to work out for me. I am trying to use an API to get weather data, but I cannot write the loop. I have all the codes in the right format, but when I import the file into R the cells appear like
-33.86659241, 151.2081909, \"2014-10-01T02:00:00"\
and this is preventing me from running the code. So rather than using a loop I would need to use a mail merge to create 5000 lines of code. Any help would be really appreciated.
tmp <- get_forecast_for(-33.86659241, 151.2081909, "2014-10-01T02:00:00", add_headers=TRUE)
fdf <- as.data.frame(tmp)
fdf$ID <- "R_3nNli1Hj2mlvFVo"
fd <- rbind(fd,fdf)
Here is the code with the loop:
df <- read.csv("~/Machine Learning/Darksky.csv", header=T,sep=",", fill = TRUE)
for(i in 1:length(df$DarkSky)){
fdf <- get_forecast_for(df$LocationLatitude[i], df$LocationLongitude[i], df$DarkSky[i], add_headers=TRUE)
fdf <- as.data.frame(fdf)
fdf <- fdf[1:2,]
fd <- rbind(fd,fdf)
}
I also want to rbind the retrieved data onto a data frame, but it does not work. I also want to cbind an identifier, which would be the value in df$DarkSky[i], but that does not work either.
CSV -
LocationLatitude LocationLongitude DarkSky
-33.86659241 151.2081909 "2014-10-01T02:00:00"
The get_forecast_for function takes three parameters: the latitude, the longitude and the date, structured as above. I have the loop working for latitude and longitude, but the time/date is not working.
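There is no accepted fix in the thread, but the stray backslashes and quotes suggest the timestamp column is being imported with embedded quote characters. A sketch of one way to clean it before the loop (the column names and the get_forecast_for() call are taken from the question; the gsub() cleaning step is an assumption about what the imported cells contain):
library(darksky)  # requires a Dark Sky API key configured for the package
# Read the CSV as character, strip stray quotes/backslashes from the timestamp
# column, and accumulate results in a list instead of rbind-ing inside the loop.
df <- read.csv("~/Machine Learning/Darksky.csv", header = TRUE, stringsAsFactors = FALSE)
df$DarkSky <- gsub('["\\\\]', '', df$DarkSky)  # assumed clean-up of the garbled cells
results <- vector("list", nrow(df))
for (i in seq_len(nrow(df))) {
  fdf <- as.data.frame(get_forecast_for(df$LocationLatitude[i],
                                        df$LocationLongitude[i],
                                        df$DarkSky[i],
                                        add_headers = TRUE))
  fdf <- fdf[1:2, ]
  fdf$DarkSky <- df$DarkSky[i]   # keep the identifier alongside the rows
  results[[i]] <- fdf
}
fd <- do.call(rbind, results)
If get_forecast_for() still rejects the date, checking str(df$DarkSky[1]) will show whether hidden characters remain.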

Multiple text file processing using scan

I have this code that works for me (it's from Jockers' Text Analysis with R for Students of Literature). However, I need to automate it: I need to perform the "ProcessingSection" for up to thirty individual text files. How can I do this? Can I have a table or data frame that contains thirty occurrences of "text.v", one for each scan("*.txt")?
Any help is much appreciated!
# Chapter 5 Start up code
setwd("D:/work/cpd/R/Projects/5/")
text.v <- scan("pupil-14.txt", what="character", sep="\n")
length(text.v)
#ProcessingSection
text.lower.v <- tolower(text.v)
mars.words.l <- strsplit(text.lower.v, "\\W")
mars.word.v <- unlist(mars.words.l)
#remove blanks
not.blanks.v <- which(mars.word.v!="")
not.blanks.v
#create a new vector to store the individual words
mars.word.v <- mars.word.v[not.blanks.v]
mars.word.v
It's hard to help, as your example is not reproducible.
Assuming you're happy with the result of mars.word.v,
you can turn this portion of code into a function that accepts a single argument,
the result of scan.
processing_section <- function(x){
unlist(strsplit(tolower(x), "\\W"))
}
Then, if all .txt files are in the current working directory, you should be able to list them,
and apply this function with:
lf <- list.files(pattern=".txt")
lapply(lf, function(path) processing_section(scan(path, what="character", sep="\n")))
Is this what you want?
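If it helps, a small extension (just a sketch) is to name the resulting list after the files, so each word vector can be looked up by file name:
# Sketch: keep the word vectors in a named list, one element per text file.
lf <- list.files(pattern = "\\.txt$")
word_vectors <- lapply(lf, function(path) {
  processing_section(scan(path, what = "character", sep = "\n"))
})
names(word_vectors) <- lf
# e.g. word_vectors[["pupil-14.txt"]] is the word vector for that file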
