I'm trying to save HTML code chunks as rows of a CSV file, taken from two different pages.
Take two links.
Use a loop to visit the links and select the HTML code chunks using rvest.
Print them using sapply.
Write the output as rows in a CSV file (I need help with this).
I can see the HTML chunks in the console but can't save them to a CSV. I want to save the HTML code itself rather than the extracted values. I used IMDB just for reproducibility.
library(rvest)

movielinks <- c("http://www.imdb.com/movies-coming-soon/?ref_=inth_cs",
                "http://www.imdb.com/movies-in-theaters/?ref_=nv_tp_inth_1")
moviesheet <- NULL
for (mov in 1:length(movielinks)) {
  # print(mov)
  pageurl <- paste0(movielinks[mov])
  # print(pageurl)
  movieurl <- read_html(pageurl)
  movie_name <- movieurl %>%
    html_nodes("h4 a")  # find all links
  strings <- paste(sapply(movie_name, function(x) { print(x) }))
  moviesheet <- rbind(moviesheet, strings)
}
write.csv(moviesheet, "moviesheet.csv")
The final outcome should be something like this:
Product  Price  HtmlCode
Soap     20     <a href="/title/tt3691740/?ref_=cs_ov_tt" title="The BFG (2016)" itemprop="url"> The BFG (2016)</a>
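A minimal sketch of one way to get plain HTML strings into the CSV (same pages and selector as above): convert each node to text with as.character() before binding, so write.csv() receives character data rather than node objects.

library(rvest)

movielinks <- c("http://www.imdb.com/movies-coming-soon/?ref_=inth_cs",
                "http://www.imdb.com/movies-in-theaters/?ref_=nv_tp_inth_1")
moviesheet <- NULL
for (mov in seq_along(movielinks)) {
  movieurl   <- read_html(movielinks[mov])
  movie_name <- html_nodes(movieurl, "h4 a")        # find all links
  strings    <- sapply(movie_name, as.character)    # raw HTML as text
  moviesheet <- rbind(moviesheet,
                      data.frame(HtmlCode = strings, stringsAsFactors = FALSE))
}
write.csv(moviesheet, "moviesheet.csv", row.names = FALSE)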
I need help extracting information from a PDF file in R
(for example https://arxiv.org/pdf/1701.07008.pdf).
I'm using pdftools, but sometimes pdf_info() doesn't work, and in that case I can't manage to do it automatically with pdf_text().
NB: tabulizer didn't work on my PC.
Here is the processing I'm doing (sorry, you need to save the PDF and use your own path):
info <- pdf_info(paste0(path_folder, "/", pdf_path))
title <- c(title, info$keys$Title)
key <- c(key, info$keys$Keywords)
auth <- c(auth, info$keys$Author)
dom <- c(dom, info$keys$Subject)
metadata <- c(metadata, info$metadata)
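As a sketch of how the failing case might be handled (same path variables as above; the fallback to the first line of pdf_text() is only a guess at a title), pdf_info() can be wrapped in tryCatch() so a problematic PDF does not stop the run:

library(pdftools)

# wrap pdf_info() so an error returns NULL instead of stopping the loop
safe_info <- function(path) {
  tryCatch(pdf_info(path), error = function(e) NULL)
}

info <- safe_info(paste0(path_folder, "/", pdf_path))
if (!is.null(info)) {
  title <- c(title, info$keys$Title)
} else {
  # crude fallback: take the first non-empty line of the first page's text
  first_page <- pdf_text(paste0(path_folder, "/", pdf_path))[1]
  lines <- strsplit(first_page, "\n")[[1]]
  title <- c(title, trimws(lines[nzchar(trimws(lines))][1]))
}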
I would like to get the title and abstract most of the time.
We will need to make some assumptions about the structure of the pdf we wish to scrape. The code below makes the following assumptions:
Title and abstract are on page 1 (fair assumption?)
Title is of height 15
The abstract is between the first occurrence of the word "Abstract" and first occurrence of the word "Introduction"
library(tidyverse)
library(pdftools)

data <- pdf_data("~/Desktop/scrape.pdf")

# Get first page
page_1 <- data[[1]]

# Get title; here we assume it has a height of 15
title <- page_1 %>%
  filter(height == 15) %>%
  .$text %>%
  paste0(collapse = " ")

# Get abstract: the words between "Abstract." and the "Introduction" heading
abstract_start <- which(page_1$text == "Abstract.")[1]
introduction_start <- which(page_1$text == "Introduction")[1]
abstract <- page_1$text[abstract_start:(introduction_start - 2)] %>%
  paste0(collapse = " ")
You can, of course, work off of this and impose stricter constraints for your scraper.
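If it helps, a sketch of how the same steps might be wrapped into a helper and mapped over several files (the pdf_paths vector is hypothetical):

library(tidyverse)
library(pdftools)

scrape_pdf <- function(path) {
  page_1 <- pdf_data(path)[[1]]
  title <- page_1 %>%
    filter(height == 15) %>%
    pull(text) %>%
    paste0(collapse = " ")
  abstract_start <- which(page_1$text == "Abstract.")[1]
  introduction_start <- which(page_1$text == "Introduction")[1]
  abstract <- paste0(page_1$text[abstract_start:(introduction_start - 2)],
                     collapse = " ")
  tibble(title = title, abstract = abstract)
}

# pdf_paths <- c("paper1.pdf", "paper2.pdf")   # hypothetical paths
# map_dfr(pdf_paths, scrape_pdf)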
I have a data table with a column of .txt links. I am looking for a way for R to search within each linked file to see whether it contains either of the strings "discount rate" or "discounted cash flow". I then want R to create two columns next to each link (one for "discount rate" and one for "discounted cash flow") containing a 1 if the string is present and a 0 if not.
Here's a small list of sample links that I would like to sift through:
http://www.sec.gov/Archives/edgar/data/1015328/0000913849-04-000510.txt
http://www.sec.gov/Archives/edgar/data/1460306/0001460306-09-000001.txt
http://www.sec.gov/Archives/edgar/data/1063761/0001047469-04-028294.txt
http://www.sec.gov/Archives/edgar/data/1230588/0001178913-09-000260.txt
http://www.sec.gov/Archives/edgar/data/1288246/0001193125-04-155851.txt
http://www.sec.gov/Archives/edgar/data/1436866/0001172661-09-000349.txt
http://www.sec.gov/Archives/edgar/data/1089044/0001047469-04-026535.txt
http://www.sec.gov/Archives/edgar/data/1274057/0001047469-04-023386.txt
http://www.sec.gov/Archives/edgar/data/1300379/0001047469-04-026642.txt
http://www.sec.gov/Archives/edgar/data/1402440/0001225208-09-007496.txt
http://www.sec.gov/Archives/edgar/data/35527/0001193125-04-161618.txt
Maybe something like this...
checktext <- function(file, text) {
  filecontents <- readLines(file)
  return(as.numeric(any(grepl(text, filecontents, ignore.case = TRUE))))
}

df$DR  <- sapply(df$file_name, checktext, "discount rate")
df$DCF <- sapply(df$file_name, checktext, "discounted cash flow")
A much faster version, thanks to Gregor's comment below, would be
checktext <- function(file, text) {
  filecontents <- readLines(file)
  sapply(text, function(x) as.numeric(any(grepl(x, filecontents,
                                                ignore.case = TRUE))))
}

df[, c("DR", "DCF")] <- t(sapply(df$file_name, checktext,
                                 c("discount rate", "discounted cash flow")))
Or if you are doing it from URLs rather than local files, replace df$file_name with df$websiteURL in the above. It worked for me on the short list you provided.
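For example, a short usage sketch with two of the URLs above, since readLines() accepts URLs directly:

df <- data.frame(
  websiteURL = c("http://www.sec.gov/Archives/edgar/data/1015328/0000913849-04-000510.txt",
                 "http://www.sec.gov/Archives/edgar/data/1460306/0001460306-09-000001.txt"),
  stringsAsFactors = FALSE
)
df[, c("DR", "DCF")] <- t(sapply(df$websiteURL, checktext,
                                 c("discount rate", "discounted cash flow")))
df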
I'm trying to take a Word doc that has data not stored in a table and turn that data into a table. There are hundreds of identically structured Word docs, and I would like to write a script that does this for all of them.
My first idea is to convert it all into one column, and then somehow pull the column headers out and organize the data underneath them.
Word file: https://github.com/cstaulbee/Operation-WordDoc/blob/master/Sanitized_sampe.docx
library(docxtractr)

filenames <- list.files(".", pattern="*.docx", full.names=TRUE)
docx.files <- lapply(filenames, function(file) read_docx(file))

idx <- 1
docx.tables <- lapply(docx.files, function(file) {
  # recreate a clean "Contents" folder for each document
  if (dir.exists("Contents")) {
    unlink("Contents", recursive=TRUE, force=TRUE)
  }
  dir.create("Contents")
  filename <- filenames[idx]
  idx <<- idx + 1  # update the counter in the enclosing environment
  tbl <- docx_extract_tbl(file, 1)
  # a .docx is a zip archive: copy it, unzip it and read the document XML
  file.copy(filename, "Contents\\word.zip", overwrite=TRUE)
  unzip("Contents\\word.zip", exdir="Contents")
  x <- xml2::read_xml("Contents\\word\\document.xml")
  nodes <- xml2::xml_find_all(x, "w:body/w:p/w:r/w:t")
  data.date <- paste(xml2::xml_text(nodes, trim=TRUE), collapse="::")
  return(
    list(
      date=data.date
    )
  )
})
# collapse runs of ":" and split each document's text into its pieces
word_df <- lapply(docx.tables, function(x) strsplit(gsub("[:]{1,}", ":", x$date), ":")[[1]])
This copies the Word doc as a zip file, then reads it as XML. It pulls out the info that isn't in tables and puts it all into a list that can then be manipulated.
I wanted to know if anyone knows a way to take this column and split it into several columns based on the data. For example, Date, Time in, Pilot, and Assistants each appear about three times in the column, but I want each of those to become its own column, with the data between one header and the next header forming the rows.
So basically it looks like this:
df_col
Date
2/
2/16
Pilot
John, Mark
Assistants
Alfred, James
But I want it to look like this
Date_col Pilot_col Assistants_col
2/22/16 John, Mark Alfred, James
Unless someone has an idea of a better way of doing this.
You can use officer to scrape your docx document:
library(officer)
doc <- read_docx(path = "Sanitized_sampe.docx")
docx_summary(doc)
The last step would be to apply regular expressions to the text column where content_type == "paragraph".
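A rough sketch of that last step (the field names Date, Pilot and Assistants, and the "Header: value" layout of the paragraphs, are assumptions about the document):

library(officer)
library(dplyr)
library(tidyr)

doc <- read_docx(path = "Sanitized_sampe.docx")
content <- docx_summary(doc)

# keep paragraph text of the form "Header: value" and split it into pairs
fields <- content %>%
  filter(content_type == "paragraph", grepl(":", text)) %>%
  separate(text, into = c("field", "value"), sep = ":", extra = "merge") %>%
  mutate(field = trimws(field), value = trimws(value)) %>%
  filter(field %in% c("Date", "Pilot", "Assistants"))

# one column per field; repeated fields are collapsed into a single cell
wide <- fields %>%
  group_by(field) %>%
  summarise(value = paste(value, collapse = "; "), .groups = "drop") %>%
  pivot_wider(names_from = field, values_from = value)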
I would like to loop a parse query.
The thing that stops me is that I need to insert a number into the URL that R then reads and parses. The URL has to be between quotes; does anyone know how to insert the i from the for loop so that it gets substituted and R can still retrieve the page?
This is the code (I would like a list with all the artists in the charts for the 52 weeks):
library(rvest)
weeknummer = 1:52
l <- c()
b <- c()
for (i in weeknummer){
  htmlpage <- read_html("http://www.top40.nl/top40/2015/week-"[i]"")
  Top40html <- html_nodes(htmlpage, ".credit")
  top40week1 <- html_text(Top40html)
  b <- top40week1
  l <- c(l, b)
}
You need to turn the URL into one string.
pageurl <- paste0("http://www.top40.nl/top40/2015/week-",i)
htmlpage <- read_html(pageurl)
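Putting that into the loop from the question (a sketch, using the same selector):

library(rvest)

weeknummer <- 1:52
l <- c()
for (i in weeknummer) {
  pageurl <- paste0("http://www.top40.nl/top40/2015/week-", i)
  htmlpage <- read_html(pageurl)
  Top40html <- html_nodes(htmlpage, ".credit")
  l <- c(l, html_text(Top40html))
}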
This code will be used to count the number of links in my tweet collection. The collection was gathered from 10 accounts. The question is: how can I loop through the ten accounts in one piece of code and put the output in a table or graph? "Unames" represents the account names. Thanks in advance.
mydata <- read.csv("tweets.csv",sep=",", header=TRUE)
head(mydata)
dim(mydata)
colnames(mydata)
# tweets for each university
table(mydata$University)
Unames<- unique(mydata$University)
mystring <- function(Uname, string){
  mydata_temp <- subset(mydata, University==Uname)
  mymatch <- rep(NA, dim(mydata_temp)[1])
  for(i in 1:dim(mydata_temp)[1]){
    mymatch[i] <- length(grep(string, mydata_temp[i,2]))
  }
  return(mymatch)
}
# web link, e.g. (Here I would like to see the total links for all universities in a table or graph. The code below only gives me the output one account at a time!)
mylink <- mystring(Unames[1],"http://")
So my suspicions were wrong and you do have a body of data for which this command produces the desired results (and you expect the same for the rest of Unames):
mylink <- mystring(Unames[1],"http://")
In that case, you should just do this:
links_list <- lapply(Unames, mystring, "http://")
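From there, a sketch of how the per-tweet matches might be collapsed into one total per account and shown as a table or bar chart (a sketch against the objects defined above, not the only way to present it):

# total number of link-containing tweets per account
link_totals <- sapply(links_list, sum)
names(link_totals) <- Unames

link_totals                                   # printed as a named table
barplot(link_totals, las = 2,
        main = "Tweets containing links, by university")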