I'm trying to scrape the Fangraphs page that lists players by alphabetical index, to get a single-column data frame with one letter reference per row.
The code below works for me on R 3.4.1 on Windows, but I cannot get it to work on Linux at all, and I can't figure out what exactly is going wrong or different.
library(XML)
# Scrape to get the webpage
url <- "http://www.fangraphs.com/players.aspx?"
table <- readHTMLTable(url, stringsAsFactors = FALSE)
letterz <- table[[2]]
letterz <- as.character(letterz)
letterz <- strsplit(letterz, split=", ")
letterz <- as.data.frame(letterz)
names(letterz) <- c("letters")
letterz$letters <- as.character(letterz$letters)
# Below this point is where I notice the code does not behave the same as
# on my Windows machine: none of the gsub() calls seem to change the
# strings at all.
# Stripping the trailing whitespace
letterz$letters <- gsub("[[:space:]]+$", "", letterz$letters)
# Replacing patterns like "AzB Ba" to instead have "Az,Ba"
letterz$letters <- gsub("[[:upper:]]+?[[:space:]]+?[[:space:]]+?[[:space:]]+", ",", letterz$letters)
# Final cleaning up
letterz <- as.character(letterz)
letterz <- strsplit(letterz, split=",")
letterz <- as.data.frame(letterz)
names(letterz) <- c("letters")
letterz$letters <- as.character(letterz$letters)
letterz$letters <- gsub('c\\("|"\\)|"', "", letterz$letters)
letterz$letters <- gsub('^$', NA, letterz$letters)
letterz$letters <- gsub("^[[:space:]]+","", letterz$letters)
letterz$letters <- gsub("[[:space:]]+$","", letterz$letters)
letterz$letters <- gsub("'", "%27", letterz$letters)
letterz <- na.omit(letterz)
From what I could find, the only real difference between regular expressions on Windows and Linux should be the line-break convention, so I went back and checked whether that was making the difference... but still got no change.
I also tried substituting the POSIX-style "[[:space:]]" and "[[:upper:]]" classes with the more common "\s" shorthand to see if that would fix anything.
As for fixes, I know there are a handful of other packages I could look into to simply get the result I'm looking for, but more generally: are there differences in how Windows and Linux handle regular expressions that I'm unaware of? And if so, how would I account for them in gsub to get the same result I get on Windows?
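For reference, one diagnostic I considered (just a sketch, and it assumes the culprit is the character data itself, e.g. non-breaking spaces coming back from the scraped HTML, rather than the regex engine):
# Inspect the raw code points of one entry; a 160 would indicate a
# non-breaking space (U+00A0), which "[[:space:]]" may not match here.
utf8ToInt(letterz$letters[1])
# If non-breaking spaces are present, a PCRE class that includes them
# should behave the same on both platforms:
letterz$letters <- gsub("[\\s\u00A0]+$", "", letterz$letters, perl = TRUE)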
Thanks.
I am trying to automatically spell-check a string column of a data.table/data.frame.
Looking around, I found several approaches that all give an "out of bounds" error when hunspell_suggest returns no suggestions (that is, an empty list, e.g. for "pippasnjfjsfiadjg"); see the approaches here (the accepted answer there yields NA, so it does work in principle) and here.
It seems we need unlist to identify these empty suggestions and then exclude them from the part of the code that picks the first suggestion, but I cannot figure out how.
library(dplyr)
library(stringi)
library(hunspell)
df1 <- data.frame("Index" = 1:7, "Text" = c("pippasnjfjsfiadjg came to dinner with us tonigh.",
"Wuld you like to trave with me?",
"There is so muh to undestand.",
"Sentences cone in many shaes and sizes.",
"Learnin R is fun",
"yesterday was Friday",
"bing search engine"),
stringsAsFactors = FALSE)
# Get bad words.
badwords <- hunspell(df1$Text) %>% unlist
# Extract the first suggestion for each bad word.
# This is where the "out of bounds" error occurs when a word has no
# suggestions (an empty list from hunspell_suggest).
suggestions <- sapply(hunspell_suggest(badwords), "[[", 1)
mutate(df1, Text = stri_replace_all_fixed(str = Text,
pattern = badwords,
replacement = suggestions,
vectorize_all = FALSE)) -> out
You'll want to filter the lists of bad words and suggestions to get rid of those without suggestions:
badwords <- hunspell(df1$Text) %>% unlist()
# note use of '[' rather than '[[': words with no suggestions yield NA
# instead of an out-of-bounds error
suggestions <- sapply(hunspell_suggest(badwords), '[', 1)
# drop the words for which there is no suggestion
badwords <- badwords[!is.na(suggestions)]
suggestions <- suggestions[!is.na(suggestions)]
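With the filtered vectors, the replacement step from the question should then run without the error; for example (the same call as above, just re-using the filtered badwords and suggestions):
mutate(df1, Text = stri_replace_all_fixed(str = Text,
                                          pattern = badwords,
                                          replacement = suggestions,
                                          vectorize_all = FALSE)) -> out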
I'm attempting to automate scraping the practice words from this site https://www.livechatinc.com/typing-speed-test/#/ but get a result of character(0).
I read the URL with read_html, pass that to html_nodes() along with the CSS selector for the practice words, and then extract the text with html_text, but I get character(0) every time.
No clue what I'm doing wrong; here is the code:
library('rvest')
url <- read_html("https://www.livechatinc.com/typing-speed-test/#/")
wbpg_html <- html_nodes(url,".test-prompt")
wbpg_txt <- html_text(wbpg_html)
> wbpg_txt
character(0)
I'd just like to get the practice words into R; I can work out how to automate it later.
Thanks for any help.
The word list comes from this js file: https://cdn.livechatinc.com/gtt/app.3.8.min.js
You can try to extract them with a regex in R using:
e\\.exports=\\{words:\\[(.*?)\\]
I ran a quick test with python:
import requests, re
r = requests.get('https://cdn.livechatinc.com/gtt/app.3.8.min.js')
p = re.compile(r'e\.exports={words:\[(.*?)\]')
words = p.findall(r.text)
print(words)
With R:
library(rvest)
library(stringr)
library(readr)
library(dplyr)
# Read the minified js file, collapse it to one string, and pull out the
# contents of the words array with the same regex.
urlmatrix <- paste(readLines('https://cdn.livechatinc.com/gtt/app.3.8.min.js', warn = FALSE),
                   collapse = " ") %>%
  str_match(., 'e\\.exports=\\{words:\\[(.*?)\\]')
# Split the captured group on the '","' separators, then strip the leading
# quote from the first word and any remaining quotes from the last one.
words <- strsplit(as.character(as.list(urlmatrix[,2])[[1]]), '","')
words[[1]][1] <- substring(words[[1]][1], 2, nchar(words[[1]][1]))
words[[1]][length(words[[1]])] <- gsub('\\"', "", words[[1]][length(words[[1]])])
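A quick sanity check on the result (not part of the original answer, just a suggestion):
length(words[[1]])   # how many practice words were extracted
head(words[[1]])     # first few words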
I am trying to scrape a website using the following:
industryurl <- "https://finance.yahoo.com/industries"
library(rvest)
read <- read_html(industryurl) %>%
html_table()
library(plyr)
industries <- ldply(read, data.frame)
industries = industries[-1,]
read <- read_html(industryurl)
industryurls <- html_attr(html_nodes(read, "a"), "href")
links <- industryurls[grep("/industry/", industryurls)]
industryurl <- "https://finance.yahoo.com"
links <- paste0(industryurl, links)
links
##############################################################################################
store <- NULL
tbl <- NULL
for(i in links){
store[[i]] = read_html(i)
tbl[[i]] = html_table(store[[i]])
}
#################################################################################################
I am mostly interested in the code between the ########## markers. I want to apply a function instead of a for loop, since I am running into timeout issues with Yahoo, and I want to make the extraction look more human-like (it is not too much data).
My question is: how can I take links, apply a function to it, and set some sort of delay timer so the contents of the for loop are read in more slowly?
I can paste my own version of the for loop, which does not work.
This is the function I came up with:
##First argument is the link you need
##The second argument is the total time for Sys.sleep
extract_function <- function(define_link, define_time){
  print(paste0("The system will stop for: ", define_time, " seconds"))
  Sys.sleep(define_time)
  first <- read_html(define_link)
  print(paste0("It will now return the table for link: ", define_link))
  return(html_table(first))
}
##I added the following tryCatch function
link_try_catch <- function(define_link, define_time){
  out <- tryCatch(extract_function(define_link, define_time),
                  error = function(e) NA)
  return(out)
}
##You can now retrieve the data using the links vector in two ways
##Picking the first ten, so it should not crash on link 5
p <- lapply(1:10, function(i)link_try_catch(links[i],1))
##OR (I subset the vector just for demo purposes)
p2 <- lapply(links[1:10], function(i)extract_function(i,1))
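If you want the delay to look less regular (the question mentions making it more human-like), one option is to randomize it; a minimal sketch, where the 2-5 second bounds are an arbitrary assumption:
p3 <- lapply(links[1:10], function(i) link_try_catch(i, runif(1, min = 2, max = 5)))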
Hope it helps.
I used the code below on one website and it returned a perfect result. It looks for the keyword "Emaar", pasted at the end of the query:
library(httr)
library(jsonlite)
query<-"https://www.googleapis.com/customsearch/v1?key=AIzaSyA0KdZHRkAjmoxKL14eEXp2vnI4Yg_po38&cx=006431301429107149113:as7yqcm2qc8&q=Emaar"
result11 <- content(GET(query))
print(result11)
result11_JSON <- toJSON(result11)
result11_JSON <- fromJSON(result11_JSON)
result11_df <- as.data.frame(result11_JSON)
Now I want to apply the same call over a data.frame containing keywords, so I made the test .csv file below:
Company Name
[1] ADES International Holding Ltd
[2] Emirates REIT (CEIC) Limited
[3] POLARCUS LIMITED
I called it Testing Website Extraction.csv.
Code used:
test_companies <- read.csv("... \\Testing Website Extraction.csv")
#removing spaces and adding a "+" sign, then pasting the query before it (the query already has my unique Google key and search engine ID)
test_companies$plus <- gsub(" ", "+", test_companies$Company.Name)
query <- "https://www.googleapis.com/customsearch/v1?key=AIzaSyCmD6FRaonSmZWrjwX6JJgYMfDSwlR1z0Y&cx=006431301429107149113:as7yqcm2qc8&q="
test_companies$plus <- paste0(query, test_companies$plus)
a <- test_companies$plus
length(a)
function_webs_search <- function(web_search) {content(GET(web_search))}
result <- lapply(as.character(a), function_webs_search)
The result here is a list of length 3 (one element per search term), and within each element there are sublists: url (list of 2), queries (list of 2), ..., items (list of 10). These are the same for each search term (same lengths separately). My issue is applying the remainder of the code.
When I run:
result_JSON <- toJSON(result)
result_JSON <- as.list(fromJSON(result_JSON))
I get a list of 6 lists that themselves contain sublists, and putting this into a tidy data frame where the results are listed under each other (not side by side) is proving to be difficult.
Also note that I tried taking each of the 3 lists inside "result" separately, but that is a lot of manual labour if I have a longer list of keywords.
The expected end result should contain 30 observations of 37 variables (10 observations of 37 variables for each search term, all underneath each other).
Things I have tried unsuccessfully:
These work to flatten the list:
#do.call(c , result)
#all.equal(listofvectors, res, check.attributes = FALSE)
#unlist(result, recursive = FALSE)
# for (i in 1:length(result)) {listofvectors <- c(listofvectors, result[[i]])}
#rbind()
#rbind.fill()
Even after flattening, I don't know how to organize the results into a tidy final output that a non-R user can interact with.
Any help here would be greatly appreciated,
I am here in case anything is not clear about my question,
Always happy to learn more about R so please bear with me as I am just starting to catch up.
All the best and thanks in advance!
Basically, what I did is extract only the columns I need from the list of data frames; below is the final code:
library(httr)
library(jsonlite)
library(tidyr)
library(stringr)
library(purrr)
library(plyr)
test_companies <- read.csv("c:\\users\\... Companies Without Websites List.csv")
test_companies$plus <- gsub(" ", "+", test_companies$Company.Name)
query <- "https://www.googleapis.com/customsearch/v1?key=AIzaSyCmD6FRaonSmZWrjwX6JJgYMfDSwlR1z0Y&cx=006431301429107149113:as7yqcm2qc8&q="
test_companies$plus <- paste0(query, test_companies$plus)
a <- test_companies$plus
length(a)
function_webs_search <- function(web_search) {content(GET(web_search))}
result <- lapply(as.character(a), function_webs_search)
# Convert each result to JSON and back so the nested lists become data frames
function_toJSONall <- function(all) {toJSON(all)}
a <- lapply(result, function_toJSONall)
function_fromJSONall <- function(all) {fromJSON(all)}
b <- lapply(a, function_fromJSONall)
function_dataframe <- function(all) {as.data.frame(all)}
c <- lapply(b, function_dataframe)
# Keep only the columns of interest
function_column <- function(all) {all[ ,15:30]}
result_final <- lapply(c, function_column)
# Bind the per-keyword frames into one data frame, filling missing columns with NA
results_df <- rbind.fill(result_final)
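Since the goal was a tidy output that a non-R user can interact with, the combined data frame can be written straight to a file (the file name here is just an example):
# Export the combined results so they can be opened in Excel or similar
write.csv(results_df, "search_results.csv", row.names = FALSE)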
I have a file 'check_text.txt' that contains "said say says make made". I'd like to perform stemming on it to get "say say say make make". I tried using stemDocument in the tm package, as follows, but only get "said say say make made". Is there a way to perform stemming on past-tense words? Is it necessary to do so in real-world natural language processing? Thanks!
library(tm)
filename = 'check_text.txt'
con <- file(filename, "rb")
text_data <- readLines(con, skipNul = TRUE)
close(con)
text_VS <- VectorSource(text_data)
text_corpus <- VCorpus(text_VS)
text_corpus <- tm_map(text_corpus, stemDocument, language = "english")
as.data.frame(text_corpus)$text
EDIT: I also tried wordStem in SnowballC package
> library(SnowballC)
> wordStem(c("said", "say", "says", "make", "made"))
[1] "said" "sai" "sai" "make" "made"
If there were a data set of irregular English verbs in a package, this task would be easy. I just do not know of any package with such data, so I chose to create my own database by scraping. I am not sure whether this website covers all irregular verbs; if necessary, you may want to search for better sources to build your own database. Once you have the database, you can tackle the task.
First, I used stemDocument() and cleaned up the present-tense forms ending in -s. Then I collected the past forms that appear in words (i.e., past) and the infinitive forms of those past forms (i.e., inf1), and identified the order of the past forms in temp. I further identified the positions of the past forms in temp, and finally replaced the past forms with their infinitive forms. I repeated the same procedure for past participles.
library(tm)
library(rvest)
library(dplyr)
library(splitstackshape)
### Create a database
x <- read_html("http://www.englishpage.com/irregularverbs/irregularverbs.html")
x %>%
html_table(header = TRUE) %>%
bind_rows %>%
rename(Past = `Simple Past`, PP = `Past Participle`) %>%
filter(!Infinitive %in% LETTERS) %>%
cSplit(splitCols = c("Past", "PP"),
sep = " / ", direction = "long") %>%
filter(complete.cases(.)) %>%
mutate_each(funs(gsub(pattern = "\\s\\(.*\\)$|\\s\\[\\?\\]",
replacement = "",
x = .))) -> mydic
### Work on the task
words <- c("said", "drawn", "say", "says", "make", "made", "done")
### says to say
temp <- stemDocument(words)
### past forms become present form
### Collect past forms
past <- mydic$Past[which(mydic$Past %in% temp)]
### Collect infinitive forms of past forms
inf1 <- mydic$Infinitive[which(mydic$Past %in% temp)]
### Identify the order of past forms in temp
ind <- match(temp, past)
ind <- ind[is.na(ind) == FALSE]
### Where are the past forms in temp?
position <- which(temp %in% past)
temp[position] <- inf1[ind]
### Check
temp
#[1] "say" "drawn" "say" "say" "make" "make" "done"
### PP forms to infinitive forms (same as past forms)
pp <- mydic$PP[which(mydic$PP %in% temp)]
inf2 <- mydic$Infinitive[which(mydic$PP %in% temp)]
ind <- match(temp, pp)
ind <- ind[is.na(ind) == FALSE]
position <- which(temp %in% pp)
temp[position] <- inf2[ind]
### Check
temp
#[1] "say" "draw" "say" "say" "make" "make" "do"