creating corpus from multiple html text files - r

I have a list of HTML files: I took some texts from the web and read them in with read_html().
My file names are like:
a1 <- read_html(link of the text)
a2 <- read_html(link of the text)
.
.
. ## until:
a100 <- read_html(link of the text)
I am trying to create a corpus from these.
Any ideas how I can do it?
Thanks.

You could preallocate a list and fill it element by element (a plain atomic vector cannot hold the parsed documents that read_html() returns):
text <- vector("list", 100)
text[[1]] <- read_html(link1)
...
text[[100]] <- read_html(link100)
Even better, organize your links as a vector. Then you can use lapply, as suggested in the comments:
text <- lapply(links, read_html)
(here links is a character vector of the URLs).
It would be rather bad coding style to use assign:
# not a good idea
for (i in 1:100) assign(paste0("text", i), read_html(get(paste0("link", i))))
since this scatters the results across 100 separate objects, which is slow and hard to process further.
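If the end goal is a corpus object, here is a minimal sketch of the remaining steps (it assumes the tm package and a character vector links of URLs; html_text2() is from rvest, and none of this is from the original question):
library(rvest)
library(tm)
pages <- lapply(links, read_html)        # parse every page
texts <- vapply(pages, html_text2, "")   # extract the plain text of each page
corp  <- VCorpus(VectorSource(texts))    # one corpus document per link
From here corp can be cleaned with the usual tm_map() steps (lowercasing, stopword removal, and so on).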

I would suggest using purrr for this solution:
library(tidyverse)
library(purrr)
library(rvest)
files <- list.files("path/to/html_links", full.names = TRUE)

all_html <- tibble(file_path = files) %>%
  mutate(filenames = basename(files)) %>%
  mutate(text = map(file_path, read_html))
This is a nice way to keep track of which piece of text belongs to which file. It also makes things like sentiment analysis, or any other document-level analysis, easy.
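If you then want an actual corpus object rather than a tibble of parsed pages, a hedged continuation (it assumes the quanteda package; html_text2() comes from rvest, which is already loaded above):
library(quanteda)
corp <- all_html %>%
  mutate(text = map_chr(text, html_text2)) %>%  # turn each parsed page into plain text
  select(doc_id = filenames, text) %>%          # quanteda's corpus() expects doc_id and text columns
  corpus()                                      # one corpus document per file
Keeping the file names as document IDs means any later document-level results can be traced straight back to the source files.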

Related

read a pdf-file into R without header/contents

I want to import multiple PDF files into R, but each page has 4 columns, a header/footer line and a table of contents.
For text-mining purposes I want to remove these from my file or character vector.
Right now I am using two functions to read in the files. The first is pdf_text, because it keeps the pages but can't deal with the 4 columns. The second is extract_text, which on its own doesn't keep the pages but can deal with the column structure (and handles the occurring tables decently well).
But neither of them is able to remove the table of contents (as far as I have tried).
My data set is not exactly minimal, but otherwise I had some problems with the data structures. Here is a working example:
################ relevant code ##############
library(pdftools)
library(tidyverse)
library(tabulizer)
library(tidytext)  # needed for unnest_tokens() below
files_name <- "Nachhaltigkeit 2021.pdf"
file_url <- c("https://www.allianz.com/content/dam/onemarketing/azcom/Allianz_com/sustainability/documents/Allianz_Group_Sustainability_Report_2021-web.pdf", "https://www.allianz.com/content/dam/onemarketing/azcom/Allianz_com/investor-relations/en/results-reports/annual-report/ar-2021/en-Allianz-Group-Annual-Report-2021.pdf")
reports_list <- lapply(file_url, pdf_text)
createTibble <- function(){
  tibble_together <- NULL
  # for all files
  for(i in 1:length(files_name)){
    page_nr <- length(reports_list[[i]])
    tib <- tibble(report = rep(files_name[i], page_nr),
                  page = 1:page_nr,
                  text = gsub("\r\n", " ",
                              extract_text(files_name[[i]], pages = 1:page_nr)))
    tibble_together <- rbind(tibble_together, tib)
  }
  return(tibble_together)
}
reports_df <- createTibble()
############ code for problem visualization ###############
reports_df <- reports_df %>% unnest_tokens(output = word, input = text, token = "words")
#e.g this part contains the table of contents which is not intended
(reports_df %>% filter(page == 34, report == "Nachhaltigkeit 2021.pdf"))$word[832:885]
Thanks in advance for your help.
PS: it's my first question, so if you need anything else, let me know.
And I know that the createTibble function probably isn't optimal, but that's not my primary concern.

How to select for certain data in a .txt file

I have a .txt file imported from a weather station using some pretty advanced code, and I need to sort it based on one area of content within each line. Here are a few lines:
13:30:00.587: <- $GPGGA,183000.30,4415.6243,N,08823.9769,W,1,7,1.7,225.5,M,-33.4,M,,*68
13:30:00.683: <- $GPGLL,4415.6243,N,08823.9769,W,183000.40,A,A*72
13:30:00.779: <- $GPVTG,159.6,T,163.2,M,0.1,N,0.1,K,A*2E
I basically need to be able to group together all lines with a $GPGGA, and do the same for $GPGLL, $GPVTG, and I believe 6 other types of entries that repeat. group_by() doesn't work, nor do select() or sort(), for obvious reasons. The formatting here is clearly not in any organized table format, which makes this very difficult for me. How do I do this?
Here's the code I used to import the original file (I replaced my actual username with "my username"):
filefolder <- "C:\\Users\\my username\\Downloads\\"  # actual username replaced with 'my username'
Weather_data <- paste(filefolder, "Jul_13_2021_Weatherstation_Test_File.txt", sep = "")
Weather_data <- read.delim(Weather_data)
And here's what I have so far in my attempt:
Screenshot of what I have so far: https://i.stack.imgur.com/FSlzf.png
As you say there is no organisation in the table. I would suggest doing something with regular expressions:
df <- data.frame(text = c("13:30:00.587: <- $GPGGA,183000.30,4415.6243,N,08823.9769,W,1,7,1.7,225.5,M,-33.4,M,,*68",
                          "13:30:00.683: <- $GPGLL,4415.6243,N,08823.9769,W,183000.40,A,A*72",
                          "13:30:00.779: <- $GPVTG,159.6,T,163.2,M,0.1,N,0.1,K,A*2E"))

library(dplyr)

df %>%
  mutate(Entry = gsub(".*\\$([A-Z]+),.*", "\\1", text)) %>%
  group_by(Entry)
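If you actually need one data frame per sentence type rather than a single grouped tibble, a possible follow-up (a sketch using dplyr's group_split(); the object names are only illustrative):
sentence_groups <- df %>%
  mutate(Entry = gsub(".*\\$([A-Z]+),.*", "\\1", text)) %>%
  group_split(Entry)   # a list with one data frame per $GP... sentence type

# or keep just one type, e.g. the $GPGGA lines
gpgga <- df %>%
  mutate(Entry = gsub(".*\\$([A-Z]+),.*", "\\1", text)) %>%
  filter(Entry == "GPGGA")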

R find and replace text in .xlsx files

So my task is very simple, and I would like to use R to solve it. I have hundreds of Excel files (.xlsx) in a folder and I want to replace a specific piece of text without altering the formatting of the worksheet, while preserving the rest of the text in the cell. For example:
Text to look for:
F13 A
Replace for:
F20
Text in a current cell:
F13 A Year 2019
Desired result:
F20 Year 2019
I have googled a lot and haven't found anything appropriate, even though it seems to be a common task. I have a solution using PowerShell, but it is very slow, and I can't believe that there is no simple way using R. I'm sure someone has had the same problem before; I'll take any suggestions.
You can try:
text_to_look <- 'F13 A'
text_to_replace <- 'F20'

all_files <- list.files('/path/to/files', pattern = '\\.xlsx$', full.names = TRUE)

lapply(all_files, function(x) {
  df <- openxlsx::read.xlsx(x)
  # Or use the readxl package:
  # df <- readxl::read_excel(x)
  # Replace the text inside each matching cell, keeping the rest of the cell's contents
  df[] <- lapply(df, function(col) {
    idx <- grep(text_to_look, col, fixed = TRUE)
    col[idx] <- gsub(text_to_look, text_to_replace, col[idx], fixed = TRUE)
    col
  })
  openxlsx::write.xlsx(df, basename(x))
})

Creating a dataframe from paragraph text scraped from website in R

I'm trying to scrape a website that has numerous different pieces of information I want, spread across paragraphs. I got this to work perfectly... However, I don't understand how to break the text up and create a dataframe.
Website: https://www.state.nj.us/treasury/administration/statewide-support/motor-fuel-locations.shtml
Code:
library(rvest)
url <- "https://www.state.nj.us/treasury/administration/statewide-support/motor-fuel-locations.shtml"
#Reading the HTML code from the website
webpage <- read_html(url)
p_nodes <- webpage %>%
  html_nodes(xpath = '//p') %>%
  html_text()

# replace multiple whitespaces with a single space
p_nodes <- gsub('\\s+', ' ', p_nodes)
# trim spaces from ends of elements
p_nodes <- trimws(p_nodes)
# drop blank elements
p_nodes <- p_nodes[p_nodes != '']
How I want the dataframe to look:
I'm not sure if this is even possible. I tried to extract each piece of information separately and then make the dataframe like that but it doesn't work since most of the info is stored in the p tag. I would appreciate any guidance. Thanks!
Proof-of-concept (based on what I wrote in the comment):
Code
lapply(c('data.table', 'httr', 'rvest'), library, character.only = TRUE)

tags <- 'tr:nth-child(6) td , tr~ tr+ tr p , td+ p'
burl <- 'https://www.state.nj.us/treasury/administration/statewide-support/motor-fuel-locations.shtml'

url_text <- read_html(burl)
chunks <- url_text %>% html_nodes(tags) %>% html_text()

coordFunc <- function(chunk){
  # pull the longitude out of a chunk of text
  pattern_long <- 'Longitude:.*(-[[:digit:]]{1,2}.[[:digit:]]{0,15})'
  ret <- regmatches(x = chunk, m = regexec(pattern = pattern_long, text = chunk))
  return(ret[[1]][2])
}
longitudes <- as.numeric(unlist(lapply(chunks, coordFunc)))
Output
# using 'cat' to make the output easier to read
> cat(chunks[14])
Mt. Laurel DOT
Rt. 38, East
1/4 mile East of Rt. 295
Mt. Laurel Open 24 Hrs
Unleaded / Diesel
856-235-3096Latitude: 39.96744662Longitude: -74.88930386
> longitudes[14]
[1] -74.8893
If you do not coerce longitudes to be numeric, you get:
longitudes <- (unlist(lapply(chunks, coordFunc)))
> longitudes[14]
[1] "-74.88930386"
I chose the longitude as a proof of concept, but you can modify your function to extract all the relevant bits in a single call. For getting the right tag you can use the SelectorGadget extension (it works well in Chrome for me). Alternatively, most browsers let you 'inspect element' to get the html tag. The function could return the extracted values in a data.table, which can then be combined into one using rbindlist.
You could even advance pages programmatically to scrape the entire website - just be sure to check the usage policy (scraping websites is generally frowned upon or restricted).
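A sketch of that last idea, assuming each chunk contains at most one latitude/longitude pair (the function and column names here are made up for illustration):
coordDT <- function(chunk){
  lat <- regmatches(chunk, regexec('Latitude:\\s*(-?[[:digit:].]+)', chunk))[[1]][2]
  lon <- regmatches(chunk, regexec('Longitude:\\s*(-?[[:digit:].]+)', chunk))[[1]][2]
  data.table(latitude = as.numeric(lat), longitude = as.numeric(lon))
}
all_coords <- rbindlist(lapply(chunks, coordDT))  # one row per chunk, NA where nothing matched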
Edit
The text is not structured the same way throughout the webpage, so you'll need to spend more time examining what exceptions can take place.
Here's a new function to resolve each chunk into separate lines and then you can try to use additional regular expressions to get what you want.
newfunc <- function(chunk){
  # Each chunk is a couple of lines. First, we split at '\r\n' using strsplit.
  # The output is a list, so we use 'unlist' to get a vector,
  # then use 'trimws' to remove the whitespace around it - try out each of these functions
  # separately to understand what is going on. The final output here is a vector.
  txt <- trimws(unlist(strsplit(chunk, '\r\n')))
  return(txt)
}
This returns the 'text' contained in each chunk as a vector of separate lines. Taking a look at the number of lines in the first 20 chunks, you can see it is not the same:
> unlist(lapply(chunks[1:20], function(z) length(newfunc(z))))
[1] 5 6 5 7 5 5 5 5 5 4 1 6 6 6 5 1 1 1 5 6
A good way to resolve this would be to put in a conditional statement based on the number of lines of text in each chunk, e.g. in newfunc you could add:
if(length(txt) == 1){
  return(NULL)
}
This handles the entries that don't have any text in them. Since this is a proof of concept I haven't checked all entries, but there is some simple logic:
- The first line is typically the name.
- The coordinates are in the last line.
- The fuel can be either unleaded or diesel. You can grep on these two strings to see what each depot offers, e.g. grepl('diesel', newfunc(chunks[12])).
Another approach would be to use a different set of html tags, e.g. all the coordinates and opening hours are in boldface and have the strong tag. You can extract those separately and then use regular expressions to get what you want.
You could search for 24(Hrs|Hours) to first extract all sites that are open 24 hours and then use selective regex on the remainder to get their operating times.
There is no simple, easy answer with most web scraping: you have to find patterns and then apply some logic based on them. Only on the most structured websites will you find something that works for the entire page/range.
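A minimal sketch of the strong-tag idea mentioned above, reusing url_text from the earlier code (the claim that the coordinates and opening hours sit inside strong elements comes from the remark above, not from testing):
bold_text <- url_text %>% html_nodes("strong") %>% html_text()
head(bold_text)  # should show the coordinates and opening hours, ready for further regex work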
You can use the tidyverse packages (stringr, tibble, purrr):
library(rvest)
library(tidyverse)
url <- "https://www.state.nj.us/treasury/administration/statewide-support/motor-fuel-locations.shtml"
#Reading the HTML code from the website
webpage <- read_html(url)
p_nodes <- webpage %>%
  html_nodes(xpath = '//p') %>%
  html_text()
# Split on new line
l = p_nodes %>% stringr::str_split(pattern = "\r\n")
var1 = sapply(l, `[`, 1) # replace var by the name you want
var2 = sapply(l, `[`, 2)
var3 = sapply(l, `[`, 3)
var4 = sapply(l, `[`, 4)
var5 = sapply(l, `[`, 5)
t = tibble(var1,var2,var3,var4,var5) # make tibble
t = t %>% filter(!is.na(var2)) # delete useless lines
purrr::map_dfr(t,trimws) # delete blanks

Multiple text file processing using scan

I have this code that works for me (it's from Jockers' Text Analysis with R for Students of Literature). However, what I need to be able to do is to automate this: I need to perform the "ProcessingSection" for up to thirty individual text files. How can I do this? Can I have a table or data frame that contains thirty occurrences of "text.v" for each scan("*.txt")?
Any help is much appreciated!
# Chapter 5 Start up code
setwd("D:/work/cpd/R/Projects/5/")
text.v <- scan("pupil-14.txt", what="character", sep="\n")
length(text.v)
#ProcessingSection
text.lower.v <- tolower(text.v)
mars.words.l <- strsplit(text.lower.v, "\\W")
mars.word.v <- unlist(mars.words.l)
#remove blanks
not.blanks.v <- which(mars.word.v!="")
not.blanks.v
#create a new vector to store the individual words
mars.word.v <- mars.word.v[not.blanks.v]
mars.word.v
It's hard to help as your example is not reproducible.
Assuming you're happy with the result of mars.word.v, you can turn this portion of the code into a function that accepts a single argument: the result of scan.
processing_section <- function(x){
  unlist(strsplit(tolower(x), "\\W"))
}
Then, if all .txt files are in the current working directory, you should be able to list them,
and apply this function with:
lf <- list.files(pattern = "\\.txt$")
lapply(lf, function(path) processing_section(scan(path, what = "character", sep = "\n")))
Is this what you want?
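If you would rather end up with a single data frame (one row per word, tagged with its source file) than an unnamed list, a possible extension of the same idea - a sketch, not tested against your thirty files:
word_lists <- lapply(lf, function(path) processing_section(scan(path, what = "character", sep = "\n")))
names(word_lists) <- lf

# one row per word, with the file it came from
word_df <- do.call(rbind,
                   lapply(names(word_lists),
                          function(f) data.frame(file = f, word = word_lists[[f]])))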
