Problem:
While scraping a webpage (imdb.com, a page with film details) an error message is displayed. When I checked in detail, I noticed that some entries have no data available.
How can I figure out during scraping which line has no data, and how can I fill it with NA?
Manual investigation:
I checked the webpage manually, and the problem is with rank number 1097, where only the film genre is available and there is no runtime.
Tried:
adding an if that inserts a 0, but the 0 ends up on the last line, not on the title that is missing the value.
Code:
#install packages
install.packages("rvest")
install.packages("RSelenium")
library(rvest)
library(RSelenium)
#open browser (in my case Firefox)
rD <- rsDriver(browser=c("firefox"))
remDr <- rD[["client"]]
#start values for the paginated search link (5 pages of 250 results)
ile<-seq(from=1, by=250, length.out = 5)
#create empty frame
filmy_df=data.frame()
#empty values
rank_data<-NA;link<-NA;year<-NA;title_data<-NA;description_data<-NA;runtime_data<-NA;genre_data<-NA
#loop reading the data from each page
for (j in ile){
  #set link for browser
  newURL<-"https://www.imdb.com/search/title/?title_type=feature&release_date=,2018-12-31&count=250&start="
  startNumberURL<-paste0(newURL,j)
  #open link
  remDr$navigate(startNumberURL)
  #read webpage code
  strona_int<-read_html(startNumberURL)
  #empty values
  rank_data<-NA;link<-NA;year<-NA;title_data<-NA;description_data<-NA;runtime_data<-NA;genre_data<-NA
  #read rank
  rank_data<-html_nodes(strona_int,'.text-primary')
  #convert text
  rank_data<-html_text(rank_data)
  #remove the comma for thousands
  rank_data<-gsub(",","",rank_data)
  #convert numeric
  rank_data<-as.numeric(rank_data)
  #read link for each movie
  link<-url_absolute(html_nodes(strona_int, '.lister-item-header a')%>%html_attr(.,'href'),"https://www.imdb.com")
  #release year
  year<-html_nodes(strona_int,'.lister-item-year')
  #convert text
  year<-html_text(year)
  #remove non numeric
  year<-gsub("\\D","",year)
  #set factor
  year<-as.factor(year)
  #read title
  title_data<-html_nodes(strona_int,'.lister-item-header a')
  #convert text
  title_data<-html_text(title_data)
  #title_data<-as.character(title_data)
  #read description
  description_data<-html_nodes(strona_int,'.ratings-bar+ .text-muted')
  #convert text
  description_data<-html_text(description_data)
  #remove '\n'
  description_data<-gsub("\n","",description_data)
  #remove space
  description_data<-trimws(description_data,"l")
  #read runtime
  runtime_data <- html_nodes(strona_int,'.text-muted .runtime')
  #convert text
  runtime_data <- html_text(runtime_data)
  #remove min
  runtime_data<-gsub(" min","",runtime_data)
  length_runtime_data<-length(runtime_data)
  #if (length_runtime_data<250){ runtime_data<-append(runtime_data,list(0))}
  runtime_data<-as.numeric(runtime_data)
  #temp_df
  filmy_df_temp<-data.frame(Rank=rank_data,Title=title_data,Release.Year=year,Link=link,Description=description_data,Runtime=runtime_data)
  #add to df
  filmy_df<-rbind(filmy_df,filmy_df_temp)
}
#close browser
remDr$close()
#stop RSelenium
rD[["server"]]$stop()
Error message displayed:
"Error in data.frame(Rank = rank_data, Title = title_data, Release.Year = year, :
arguments imply differing number of rows: 250, 249"
runtime_data contains only 249 entries instead of 250, so the runtime ends up missing for the last row instead of for the row where it is actually missing.
Update
I have found something interesting that may help solve the problem.
Please check the pictures.
Anima - source of error
Knocked Up - next entry
When we compare the pictures, we can see that Anima, which is causing the problem with runtime_data, has no html node containing the runtime at all.
So the question is: is there a way to check whether an html node exists or not? If yes, how can I do this?
You wouldn't run into this problem if you had structured your program a little differently. In general, it is better to split your program into logically separate chunks that are more or less independent of each other instead of doing everything at once. That makes debugging much easier.
First, scrape the data and store it in a list – use lapply or something similar for that.
newURL <- "https://www.imdb.com/search/title/?title_type=feature&release_date=,2018-12-31&count=250&start="
pages <- lapply(ile, function(j) {
  #set link for browser
  startNumberURL<-paste0(newURL,j)
  #open link
  remDr$navigate(startNumberURL)
  #read webpage code
  read_html(startNumberURL)
})
Then you have your data scraped and you can take all the time you need to analyse and filter it, without having to start the reading process again. For example, define the function as follows:
parsuj_strone <- function(strona_int) {
  #read rank
  rank_data<-html_nodes(strona_int,'.text-primary')
  #convert text
  rank_data<-html_text(rank_data)
  #remove the comma for thousands
  rank_data<-gsub(",","",rank_data)
  #convert numeric
  rank_data<-as.numeric(rank_data)
  #read link for each movie
  link<-url_absolute(html_nodes(strona_int, '.lister-item-header a')%>%html_attr(.,'href'),"https://www.imdb.com")
  #release year
  year<-html_nodes(strona_int,'.lister-item-year')
  #convert text
  year<-html_text(year)
  #remove non numeric
  year<-gsub("\\D","",year)
  #set factor
  year<-as.factor(year)
  #read title
  title_data<-html_nodes(strona_int,'.lister-item-header a')
  #convert text
  title_data<-html_text(title_data)
  #read description
  description_data<-html_nodes(strona_int,'.ratings-bar+ .text-muted')
  #convert text
  description_data<-html_text(description_data)
  #remove '\n'
  description_data<-gsub("\n","",description_data)
  #remove space
  description_data<-trimws(description_data,"l")
  #read runtime
  runtime_data <- html_nodes(strona_int,'.text-muted .runtime')
  #convert text
  runtime_data <- html_text(runtime_data)
  #remove min
  runtime_data<-gsub(" min","",runtime_data)
  length_runtime_data<-length(runtime_data)
  runtime_data<-as.numeric(runtime_data)
  #temp_df
  filmy_df_temp<- data.frame(Rank=rank_data,Title=title_data,Release.Year=year,Link=link,Description=description_data,Runtime=runtime_data)
  return(filmy_df_temp)
}
Now, apply the function to each scraped web site:
pages_parsed <- lapply(pages, parsuj_strone)
And finally put them together in the data frame:
pages_df <- Reduce(rbind, pages_parsed)
Reduce won't mind an occasional NULL. Good luck!
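A minimal sketch of that claim with toy data frames (a and b here are made up for illustration): rbind() simply ignores a NULL, so a page that produced nothing does not break the final bind.
#rbind() drops NULL elements, so Reduce(rbind, ...) survives an empty page
a <- data.frame(x = 1:2)
b <- data.frame(x = 3:4)
Reduce(rbind, list(a, NULL, b))   #same result as rbind(a, b)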
EDIT: OK, so the problem is in the parsuj_strone() function. First, replace the final lines of that function with this:
filmy_df_temp <- list(Rank=rank_data,
                      Title=title_data,
                      Release.Year=year,
                      Link=link,
                      Description=description_data,
                      Runtime=runtime_data)
return(filmy_df_temp)
Run
pages_parsed <- lapply(pages, parsuj_strone)
Then, identify which of the 5 web sites returned problematic entries:
sapply(pages_parsed, function(x) sapply(x, length))
This should give you a 6 x 5 matrix (one row per field, one column per page). Finally, pick an element that has only 249 entries; how does it look? Without knowing your parser well, this should at least give you a hint where the problems may be.
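To narrow it down further, one possible follow-up (a sketch that assumes you store the length matrix from above in a variable, here called dl):
#store the length matrix, then find the page whose Runtime vector is shorter than its Rank vector
dl <- sapply(pages_parsed, function(x) sapply(x, length))
which(dl["Runtime", ] < dl["Rank", ])   #which of the 5 pages is affected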
On Stack Overflow you will find everything; you just need to know how to search for the answer.
Here is the link to the answer for my problem: Scraping with rvest: how to fill blank numbers in a row to transform in a data frame?
In short: instead of using html_nodes, html_node (without s) should be used.
#read runtime
runtime_data <- html_node(szczegoly_filmu,'.text-muted .runtime')
#convert to text
runtime_data <- html_text(runtime_data)
#remove " min"
runtime_data<-gsub(" min","",runtime_data)
runtime_data<-as.numeric(runtime_data)
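For completeness, here is the same idea sketched at page level, under the assumption that '.lister-item-content' selects one container per film on these search pages: html_node() returns a missing node where a container has no runtime, so html_text() yields NA and the vector keeps its full length of 250.
#one container node per film (the selector is an assumption about the page layout)
filmy <- html_nodes(strona_int, '.lister-item-content')
#html_node (no s) gives NA for films without a runtime, so the length stays at 250
runtime_data <- html_text(html_node(filmy, '.runtime'))
runtime_data <- as.numeric(gsub(" min", "", runtime_data))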
Related
I want to import multiple PDF files into R, but each page has 4 columns, a header/footer line, and a table of contents.
For text-mining purposes I want to remove them from my file or character vector.
Right now I am using two functions to read in the files. The first one is pdf_text, because it keeps the pages but can't deal with the 4 columns. The second one is extract_text; on its own it doesn't keep the pages, but it can deal with the column structure (and copes decently with tables that occur).
But neither of them is able to remove the table of contents (as far as I have tried).
My data set is not exactly minimal, but otherwise I had some problems with the data structures. Here is a working piece of code:
################ relevant code ##############
library(pdftools)
library(tidyverse)
library(tabulizer)
files_name <- "Nachhaltigkeit 2021.pdf"
file_url <- c("https://www.allianz.com/content/dam/onemarketing/azcom/Allianz_com/sustainability/documents/Allianz_Group_Sustainability_Report_2021-web.pdf", "https://www.allianz.com/content/dam/onemarketing/azcom/Allianz_com/investor-relations/en/results-reports/annual-report/ar-2021/en-Allianz-Group-Annual-Report-2021.pdf")
reports_list <- lapply(file_url, pdf_text)
createTibble <- function(){
  tibble_together <- NULL
  #for all files
  for(i in 1:length(files_name)){
    page_nr <- length(reports_list[[i]])
    tib <- tibble(report = rep(files_name[i], page_nr),
                  page = 1:page_nr,
                  text = gsub("\r\n", " ", extract_text(files_name[[i]], pages = 1:page_nr)))
    tibble_together <- rbind(tibble_together, tib)
  }
  return(tibble_together)
}
reports_df <- createTibble()
############ code for problem visualization ###############
library(tidytext)
reports_df <- reports_df %>% unnest_tokens(output = word, input = text, token = "words")
#e.g. this part contains the table of contents, which is not wanted
(reports_df %>% filter(page == 34, report == "Nachhaltigkeit 2021.pdf"))$word[832:885]
Thanks for your help in advance
PS: it's my first question, so if you need anything else, let me know.
And I know that the createTibble function probably isn't optimal, but that's not my primary concern.
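Not a full answer, but for the table-of-contents part one heuristic sketch: contents lines in many reports end in a dot leader followed by a page number, so they can be filtered out of the pdf_text() output (which keeps line breaks) before tokenising. The pattern below is an assumption about how these particular PDFs are laid out, so check it against a few pages first.
#hypothetical heuristic: drop lines that look like "Some heading ......... 12"
drop_toc_lines <- function(page_text) {
  lines <- unlist(strsplit(page_text, "\n"))
  keep <- !grepl("\\.{3,}\\s*\\d+\\s*$", lines)
  paste(lines[keep], collapse = "\n")
}
#applied per page of the first report
reports_clean <- vapply(reports_list[[1]], drop_toc_lines, character(1))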
I'm doing a little project where the goal is to retrieve data in text format from a website (http://regsho.finra.org/regsho-Index.html).
The website is nice enough to provide the data online, but it is split over several days under different links.
I thought about looping through the dates and storing the data with the following code:
#Download the needed data
my_data <- c()
for (i in 01:13){
  my_data <- read.delim(sprintf("http://regsho.finra.org/CNMSshvol202005%i.txt", i), header=TRUE, sep="|")
}
head(my_data)
The problem here is twofold. First, in this line the leading zero gets dropped (the dates on the website are 01, 02, 03, ...):
for (i in 01:13){
I've used the sprintf() method so I can have a variable in the string, but %i omits the leading 0.
Second, in this line the variable my_data always seems to be overwritten by the last data downloaded:
my_data <- read.delim(sprintf("http://regsho.finra.org/CNMSshvol202005%i.txt", i), header=TRUE, sep="|")
Could somebody reassure me that I'm going in the right direction? I'm starting to doubt myself here.
Any help would be greatly appreciated!
Thanks in advance
This should give you a leading 0 without using an extra package:
sprintf("%02d", i)
i.e.
sprintf("http://regsho.finra.org/CNMSshvol202005%02d.txt", i)
I have the following R script for scraping some textual data from a website.
library('rvest')
term_data_final <- c()
defn_data_final <- c()
for (term in 1:10) {
  url_base <- 'http://www.nplg.gov.ge/gwdict/index.php?a=term&d=9&t='
  url <- paste(url_base, term, sep="")
  webpage <- read_html(url)
  term_data_html <- html_nodes(webpage, '.term')
  term_data <- html_text(term_data_html)
  if (!grepl("\\?", term_data)) {
    term_data_final <- c(term_data_final, term_data)
    defn_data_html <- html_nodes(webpage, '.defnblock')
    defn_data <- html_text(defn_data_html)
    defn_data_final <- c(defn_data_final, defn_data)
  }
}
RusGeoDict <- data.frame(term_data_final, defn_data_final)
write.csv(RusGeoDict, file = 'RusGeoDict.csv', fileEncoding="UTF-8")
The script combines the scraped data into a data frame and then writes that data frame to a CSV file. The scraped text is in Russian and Georgian characters, and when it is saved to a data frame and a CSV I get hexadecimal Unicode escapes instead of the text. When I output the vectors that are created before being combined into a data frame, such as term_data_final, I get the original text, but once I save to a data frame and write to a CSV file I get the Unicode escapes. Is there any way to save the original Georgian and Russian text to a CSV without the Unicode output? Thanks!
OK, I don't know any Russian, but I think you can just set the locale before you build RusGeoDict, right?
Sys.setlocale("LC_CTYPE", "russian")
RusGeoDict <- data.frame(term_data_final, defn_data_final)
I just tried it and I think it's working, though I can't really say for sure. Try it and report back with your findings.
Finally, see the link below for additional ideas.
https://www.r-bloggers.com/r-and-foreign-characters/
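An alternative worth trying if changing the locale is not an option (my suggestion, not part of the answer above): readr::write_csv() always writes UTF-8 output regardless of the system locale.
#readr writes UTF-8 regardless of locale settings
library(readr)
write_csv(RusGeoDict, "RusGeoDict.csv")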
I have a large, unorganized XML file that I need to search to determine whether certain ID numbers are in the file. I would like to use R to do so, but because of the format I am having trouble converting it to a data frame or even a list that I could export to a CSV. I figured I could search easily if it were in CSV format. So I need help understanding how to convert and extract it properly, or how to search the document for values directly in R. Below is the code I have used to try to convert the document, but several errors occur with my various attempts.
## Method 1. I tried to convert to a data frame, but each column is not the same length.
require(XML)
require(plyr)
file<-"EJ.XML"
doc <- xmlParse(file,useInternalNodes = TRUE)
xL <- xmlToList(doc)
data <- ldply(xL, data.frame)
datanew <- read.table(data, header = FALSE, fill = TRUE)
## Method 2. I tried to convert it to a list; the file extracts, but only 2 words of the file are listed.
data<- xmlParse("EJ.XML")
print(data)
head(data)
xml_data<- xmlToList(data)
class(data)
topxml <- xmlRoot(data)
topxml <- xmlSApply(topxml,function(x) xmlSApply(x, xmlValue))
xml_df <- data.frame(t(topxml), row.names=NULL)
write.csv(xml_df, file = "MyData.csv",row.names=FALSE)
I am going to do some research on how to search within R as well, but I assume the file needs to be in a data frame or list either way. Any help is appreciated! Attached is a screenshot of the data. I am interested in matching entity ID numbers against a list I have in an Excel document.
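Without seeing the file structure it is hard to be specific, but for the searching part a sketch with XPath via the XML package might help; the attribute name entityId below is purely hypothetical, so substitute whatever name the screenshot actually shows.
require(XML)
doc <- xmlParse("EJ.XML", useInternalNodes = TRUE)
#collect every value of a (hypothetical) entityId attribute anywhere in the file
ids_in_file <- xpathSApply(doc, "//*[@entityId]", xmlGetAttr, "entityId")
#check which IDs from the Excel list appear in the XML
my_ids <- c("12345", "67890")   #replace with the IDs read from the Excel sheet
my_ids[my_ids %in% ids_in_file]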
I have this code that works for me (it's from Jockers' Text Analysis with R for Students of Literature). However, what I need to be able to do is to automate this: I need to perform the "ProcessingSection" for up to thirty individual text files. How can I do this? Can I have a table or data frame that contains thirty occurrences of "text.v" for each scan("*.txt")?
Any help is much appreciated!
# Chapter 5 Start up code
setwd("D:/work/cpd/R/Projects/5/")
text.v <- scan("pupil-14.txt", what="character", sep="\n")
length(text.v)
#ProcessingSection
text.lower.v <- tolower(text.v)
mars.words.l <- strsplit(text.lower.v, "\\W")
mars.word.v <- unlist(mars.words.l)
#remove blanks
not.blanks.v <- which(mars.word.v!="")
not.blanks.v
#create a new vector to store the individual words
mars.word.v <- mars.word.v[not.blanks.v]
mars.word.v
It's hard to help, as your example is not reproducible. Assuming you're happy with the result of mars.word.v, you can turn this portion of the code into a function that accepts a single argument: the result of scan.
processing_section <- function(x){
  unlist(strsplit(tolower(x), "\\W"))
}
Then, if all .txt files are in the current working directory, you should be able to list them and apply this function with:
lf <- list.files(pattern=".txt")
lapply(lf, function(path) processing_section(scan(path, what="character", sep="\n")))
Is this what you want?
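If you also want to know which file each result came from (the question mentions wanting a table with thirty instances of text.v), here is a small follow-up sketch that names the list elements after their files:
#name each element after its file so the results stay identifiable
words_by_file <- setNames(
  lapply(lf, function(path) processing_section(scan(path, what = "character", sep = "\n"))),
  lf
)
str(words_by_file, max.level = 1)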