Some troubles with web scraping using R

I have some trouble scraping the text information from this webpage: http://www.iplant.cn/info/Acer%20stachyophyllum?t=foc
What I need is the text information in the center of this webpage: "Trees to 15 m tall, dioecious. ..."
I tried to use the read_html() function from the rvest package, but got nothing. Could anyone help me with that? Thanks so much.

This part of the page is generated from an XHR call. You can get the specific piece of text you are looking for from any species page by doing:
get_description <- function(species_name)
{
  url <- "http://www.iplant.cn/ashx/getfoc.ashx"
  query <- paste0("?key=", gsub(" ", "+", species_name),
                  "&key_no=&m=", runif(1), 9)
  jsonlite::fromJSON(paste0(url, query))$Description
}
So for example:
get_description("Actaea asiatica")
#> [1] "<p>Rhizome black-brown, with numerous slender fibrous roots.
#> Stems 30--80 cm tall, terete, 4--6(--9) mm in diam., unbranched,
#> basally glabrous, apically white pubescent. Leaves 2 or 3, proximal
#> cauline leaves 3 × ternately pinnate ...<truncated>
get_description("Acer stachyophyllum")
#> [1] "<p>Trees to 15 m tall, dioecious. Bark yellowish brown, smooth.
#> Branchlets glabrous. Leaves deciduous; petiole 2.5-8 cm, slightly
#> pubescent near apex; leaf blade ovate or oblong, 5-11 × 2.5-6 cm,
#> undivided or 3-lobed, papery, abaxially densely pale or white pubescent,
#> becoming less so when mature or nearly glabrous, adaxially glabrous,
#> 3-5-veined at base abaxially, rarely with rudimentary...<truncated>
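If you would rather have plain text than the raw HTML snippet, a small extra step works too. This is only a sketch, not part of the original answer; it assumes the get_description() helper above and uses rvest to strip the markup:
library(rvest)

get_description_text <- function(species_name) {
  html <- get_description(species_name)   # HTML string such as "<p>Trees to 15 m tall, ..."
  read_html(html) %>% html_text2()         # parse the fragment and keep only the visible text
}

get_description_text("Acer stachyophyllum")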

Related

Scraping movie scripts failing on small subset

I'm working on scraping the Lord of the Rings movie scripts from this website here. Each script is broken up across multiple pages that look like this.
I can get the info I need for a single page with this code:
library(dplyr)
library(rvest)
url_success <- "http://www.ageofthering.com/atthemovies/scripts/fellowshipofthering1to4.php"
success <- read_html(url_success) %>%
  html_elements("#AutoNumber1") %>%
  html_table()
summary(success)
     Length Class  Mode
[1,] 2      tbl_df list
This works for all Fellowship of the Ring pages and all Return of the King pages. It also works for the Two Towers pages covering scenes 57 to 66. However, any other Two Towers page (scenes 1-56) does not return the same result:
url_fail <- "http://www.ageofthering.com/atthemovies/scripts/thetwotowers1to4.php"
fail <- read_html(url_fail) %>%
  html_elements("#AutoNumber1") %>%
  html_table()
summary(fail)
Length Class Mode
     0  list list
I've inspected the pages in Chrome, and the failing pages appear to have the same structure as the succeeding ones, including the 'AutoNumber1' table. Can anyone help with this?
It works with XPath. Perhaps the HTML is ill-formed (the page doesn't seem too spec-compliant):
library(rvest)
url_fail <- "http://www.ageofthering.com/atthemovies/scripts/thetwotowers1to4.php"
fail <- read_html(url_fail) %>%
  html_elements(xpath = '//*[@id="AutoNumber1"]') %>%
  html_table()
fail
#> [[1]]
#> # A tibble: 139 × 2
#> X1 X2
#> <chr> <chr>
#> 1 "Scene 1 ~ The Foundations of Stone\r\n\r\n\r\nThe movie opens as the … "Sce…
#> 2 "GANDALF VOICE OVER:" "You…
#> 3 "FRODO VOICE OVER:" "Gan…
#> 4 "GANDALF VOICE OVER:" "I a…
#> 5 "The scene changes to \r\n inside Moria.  Gandalf is on the Bridge … "The…
#> 6 "GANDALF:" "You…
#> 7 "Gandalf slams down his staff onto the Bridge, \r\ncausing it to crack… "Gan…
#> 8 "BOROMIR :" "(ho…
#> 9 "FRODO:" "Gan…
#> 10 "GANDALF:" "Fly…
#> # … with 129 more rows
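If you need more than one page, a simple extension is to loop over the URLs and keep the XPath selector, which works on both the well-formed and the ill-formed pages. This is only a sketch; the exact file names of the remaining pages are an assumption and would need to be checked against the site:
library(rvest)

urls <- c(
  "http://www.ageofthering.com/atthemovies/scripts/fellowshipofthering1to4.php",
  "http://www.ageofthering.com/atthemovies/scripts/thetwotowers1to4.php"
)

scripts <- lapply(urls, function(u) {
  read_html(u) %>%
    html_elements(xpath = '//*[@id="AutoNumber1"]') %>%  # XPath also matches on the ill-formed pages
    html_table() %>%
    .[[1]]                                               # each page holds a single script table
})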

Clean spaces in an extracted list of R package authors

I face a "strange" issue with a code that I have written in R to extract all the authors of the R packages installed on my computer. Indeed, I try to remove undesirable spaces before and after the commas ( , ) but I can't get the expected clean result using R text cleaning common techniques.
Here is the script for reproduction so that you can see the issue in the final result on your own screen:
library("tools")
pdb<-CRAN_package_db()
subset<-pdb[,c(1,17)]
ipck<-as.vector(installed.packages()[,1])
pdbCleaned <- subset[subset$Package %in% ipck, ]
pdbCleaned$Author
Authors <-gsub("[\r\n]", "", pdbCleaned$Author)
Authors <-gsub("\\[.*?\\]", "", Authors)
Authors <-gsub("\\(.*?\\)", "", Authors)
Authors <-gsub("<.*>", "", Authors)
Authors <-gsub("))", "", Authors)
Authors <-gsub("(?<=[\\s])\\s*|^\\s+|\\s+$", " ", Authors)
Authors
Here is an attempt at a solution with stringr. Note that you don't have to match your installed packages against the entire CRAN db; you can just pull the Author field from your installed packages.
I use just two regexes: one to remove anything wrapped in [], () or <>, which is often things like [aut] or <email@domain>, and one to remove the spaces surrounding , or and. Note that depending on the packages you have installed this will work varyingly well. You will have to tweak it for the packages you have; for example, I might want to remove double commas ,, because of the ada package. Other packages just have a lot of random text in their Author field, which makes it hard to manage automatically, such as the akima package. But as a first pass this should do the trick.
library(tidyverse)
authors <- installed.packages(fields = "Author") %>%
  as_tibble() %>%
  select(package = Package, author = Author)
authors %>%
  mutate(
    author = str_replace_all(author, "(\\[|\\(|<).*(\\]|\\)|>)", ""),
    author = str_replace_all(author, "[:space:]*(,|and)[:space:]*", ","),
    author = str_trim(author)
  )
#> # A tibble: 620 x 2
#> package author
#> <chr> <chr>
#> 1 abind Tony Plate,Richard Heiberger
#> 2 actuar Vincent Goulet,Sébastien Auclair,Christophe Dutang,Xavier Mi~
#> 3 ada Mark Culp,Kjell Johnson,,George Michailidis
#> 4 AER Christian Kleiber,Achim Zeileis
#> 5 AGD Stef van Buuren
#> 6 agricolae Felipe de Mendiburu
#> 7 akima "Hiroshi Akima,Albrecht Gebhardt,bicubic*\n functions),Th~
#> 8 alr3 Sanford Weisberg
#> 9 alr4 Sanford Weisberg
#> 10 amap Antoine Lucas
#> # ... with 610 more rows
Created on 2018-03-14 by the reprex package (v0.2.0).
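A possible follow-up (a sketch, not part of the original answer): once each author string is comma-separated, tidyr::separate_rows() gives you one author per row, which is usually easier to summarise:
library(tidyverse)

authors %>%
  mutate(
    author = str_replace_all(author, "(\\[|\\(|<).*(\\]|\\)|>)", ""),
    author = str_replace_all(author, "[:space:]*(,|and)[:space:]*", ","),
    author = str_trim(author)
  ) %>%
  separate_rows(author, sep = ",") %>%  # one author per row
  filter(author != "")                  # drop empty entries left by double commas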

Web scraping with R and selector gadget

I am trying to scrape data from a website using R. I am using rvest in an attempt to mimic an example scraping the IMDB page for the Lego Movie. The example advocates use of a tool called Selector Gadget to help easily identify the html_node associated with the data you are seeking to pull.
I am ultimately interested in building a data frame that has the following schema/columns:
rank, blog_name, facebook_fans, twitter_followers, alexa_rank.
I was able to use Selector Gadget to correctly identify the HTML tag used in the Lego example. However, following the same process and the same code structure as the Lego example, I get NAs (warnings such as ...using first and NAs introduced by coercion, with the result just [1] NA). My code is below:
data2_html = read_html("http://blog.feedspot.com/video_game_news/")
data2_html %>%
  html_node(".stats") %>%
  html_text() %>%
  as.numeric()
I have also experimented with html_node(".stats , .stats span"), which seems to work for the "Facebook fans" column since it reports 714 matches; however, only one number is returned:
714 matches for .//*[@class and contains(concat(' ', normalize-space(@class), ' '), ' stats ')] | .//*[@class and contains(concat(' ', normalize-space(@class), ' '), ' stats ')]/descendant-or-self::*/span: using first{xml_node}
<td>
[1] <span>997,669</span>
This may help you:
library(rvest)
d1 <- read_html("http://blog.feedspot.com/video_game_news/")
stats <- d1 %>%
  html_nodes(".stats") %>%
  html_text()
blogname <- d1 %>%
  html_nodes(".tlink") %>%
  html_text()
Note that it is html_nodes (plural)
Result:
> head(blogname)
[1] "Kotaku - The Gamer's Guide" "IGN | Video Games" "Xbox Wire" "Official PlayStation Blog"
[5] "Nintendo Life " "Game Informer"
> head(stats,12)
[1] "997,669" "1,209,029" "873" "4,070,476" "4,493,805" "399" "23,141,452" "10,210,993" "879"
[10] "38,019,811" "12,059,607" "500"
blogname returns the list of blog names, which is easy to manage. On the other hand, the stats info comes out mixed, because the stats class is the same for the Facebook, Twitter and Alexa figures, so they are indistinguishable from one another. The output vector repeats the information every three numbers, i.e. stats = c(fb, tw, alx, fb, tw, alx, ...). You should separate each vector from this one, e.g. for the Facebook fans:
FBstats = stats[seq(1,length(stats),3)]
> head(stats[seq(1,length(stats),3)])
[1] "997,669" "4,070,476" "23,141,452" "38,019,811" "35,977" "603,681"
You can use html_table to extract the whole table with minimal work:
library(rvest)
library(tidyverse)
# scrape html
h <- 'http://blog.feedspot.com/video_game_news/' %>% read_html()
game_blogs <- h %>%
  html_node('table') %>%                                   # select enclosing table node
  html_table() %>%                                         # turn table into data.frame
  set_names(make.names) %>%                                # make names syntactic
  mutate(Blog.Name = sub('\\s?\\+.*', '', Blog.Name)) %>%  # extract title from name info
  mutate_at(3:5, parse_number) %>%                         # make numbers actually numbers
  tbl_df()                                                 # for printing
game_blogs
#> # A tibble: 119 x 5
#> Rank Blog.Name Facebook.Fans Twitter.Followers Alexa.Rank
#> <int> <chr> <dbl> <dbl> <dbl>
#> 1 1 Kotaku - The Gamer's Guide 997669 1209029 873
#> 2 2 IGN | Video Games 4070476 4493805 399
#> 3 3 Xbox Wire 23141452 10210993 879
#> 4 4 Official PlayStation Blog 38019811 12059607 500
#> 5 5 Nintendo Life 35977 95044 17727
#> 6 6 Game Informer 603681 1770812 10057
#> 7 7 Reddit | Gamers 1003705 430017 25
#> 8 8 Polygon 623808 485827 1594
#> 9 9 Xbox Live's Major Nelson 65905 993481 23114
#> 10 10 VG247 397798 202084 3960
#> # ... with 109 more rows
It's worth checking that everything is parsed like you want, but it should be usable at this point.
This uses html_nodes (plural) and str_replace_all to remove the commas in the numbers. Not sure if these are all the stats you need.
library(rvest)
library(stringr)
data2_html = read_html("http://blog.feedspot.com/video_game_news/")
data2_html %>%
  html_nodes(".stats") %>%
  html_text() %>%
  str_replace_all(',', '') %>%
  as.numeric()

How to perform Lemmatization in R?

This question is a possible duplicate of Lemmatizer in R or python (am, are, is -> be?), but I'm adding it again since the previous one was closed saying it was too broad, and the only answer it has is not efficient (it accesses an external website, which is too slow because I have a very large corpus to find lemmas for). So part of this question will be similar to the above-mentioned one.
According to Wikipedia, lemmatization is defined as:
Lemmatisation (or lemmatization) in linguistics, is the process of grouping together the different inflected forms of a word so they can be analysed as a single item.
A simple Google search for lemmatization in R only points to the R package wordnet. When I tried this package, expecting that a character vector c("run", "ran", "running") input to the lemmatization function would result in c("run", "run", "run"), I saw that this package only provides functionality similar to the grepl function, through various filter names and a dictionary.
Example code from the wordnet package, which gives a maximum of 5 words starting with "car", as the filter name suggests:
filter <- getTermFilter("StartsWithFilter", "car", TRUE)
terms <- getIndexTerms("NOUN", 5, filter)
sapply(terms, getLemma)
The above is NOT the lemmatization that I'm looking for. What I'm looking for is to find the true roots of words using R (e.g. from c("run", "ran", "running") to c("run", "run", "run")).
Hello, you can try the koRpus package, which allows you to use TreeTagger:
tagged.results <- treetag(c("run", "ran", "running"), treetagger = "manual", format = "obj",
                          TT.tknz = FALSE, lang = "en",
                          TT.options = list(path = "./TreeTagger", preset = "en"))
tagged.results@TT.res
## token tag lemma lttr wclass desc stop stem
## 1 run NN run 3 noun Noun, singular or mass NA NA
## 2 ran VVD run 3 verb Verb, past tense NA NA
## 3 running VVG run 7 verb Verb, gerund or present participle NA NA
See the lemma column for the result you're asking for.
As a previous post mentioned, the function lemmatize_words() from the R package textstem can perform this and give you what I understand as your desired results:
library(textstem)
vector <- c("run", "ran", "running")
lemmatize_words(vector)
## [1] "run" "run" "run"
@Andy and @Arunkumar are correct when they say the textstem library can be used to perform stemming and/or lemmatization. However, lemmatize_words() only works on a vector of words. But in a corpus, we do not have a vector of words; we have strings, with each string being a document's content. Hence, to perform lemmatization on a corpus, you can pass the function lemmatize_strings() as an argument to tm_map() from the tm package.
> corpus[[1]]
[1] " earnest roughshod document serves workable primer regions recent history make
terrific th-grade learning tool samuel beckett applied iranian voting process bard
black comedy willie loved another trumpet blast may new mexican cinema -bornin "
> corpus <- tm_map(corpus, lemmatize_strings)
> corpus[[1]]
[1] "earnest roughshod document serve workable primer region recent history make
terrific th - grade learn tool samuel beckett apply iranian vote process bard black
comedy willie love another trumpet blast may new mexican cinema - bornin"
Do not forget to run the following line of code after you have done lemmatization:
> corpus <- tm_map(corpus, PlainTextDocument)
This is because, in order to create a document-term matrix, you need to have a 'PlainTextDocument'-type object, which gets changed after you use lemmatize_strings() (to be more specific, the corpus object no longer contains the content and metadata of each document - it is now just a structure containing the documents' content; this is not the type of object that DocumentTermMatrix() takes as an argument).
Hope this helps!
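For reference, here is a minimal end-to-end sketch of that workflow with a made-up document. It uses content_transformer() as an alternative to the PlainTextDocument step above, which keeps the corpus metadata intact:
library(tm)
library(textstem)

docs   <- c("The cats were running while the geese were flying")  # made-up example text
corpus <- VCorpus(VectorSource(docs))
corpus <- tm_map(corpus, content_transformer(lemmatize_strings))  # lemmatize each document's content
dtm    <- DocumentTermMatrix(corpus)
inspect(dtm)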
Maybe stemming is enough for you? Typical natural language processing tasks make do with stemmed texts. You can find several packages in the CRAN Task View on natural language processing: http://cran.r-project.org/web/views/NaturalLanguageProcessing.html
If you really do require something more complex, then there are specialized solutions based on mapping sentences to neural nets. As far as I know, these require a massive amount of training data. There is lots of open software created and made available by the Stanford NLP Group.
If you really want to dig into the topic, then you can dig through the event archives linked in the same Stanford NLP Group publications section. There are some books on the topic as well.
I think the answers here are a bit outdated. You should be using the R package udpipe now - available at https://CRAN.R-project.org/package=udpipe - see https://github.com/bnosac/udpipe or the docs at https://bnosac.github.io/udpipe/en
Notice the difference between the word meeting (NOUN) and the word meet (VERB) in the following example when doing lemmatisation and when doing stemming, and the annoying screwing up of the word 'someone' to 'someon' when doing stemming.
library(udpipe)
x <- c(doc_a = "In our last meeting, someone said that we are meeting again tomorrow",
       doc_b = "It's better to be good at being the best")
anno <- udpipe(x, "english")
anno[, c("doc_id", "sentence_id", "token", "lemma", "upos")]
#> doc_id sentence_id token lemma upos
#> 1 doc_a 1 In in ADP
#> 2 doc_a 1 our we PRON
#> 3 doc_a 1 last last ADJ
#> 4 doc_a 1 meeting meeting NOUN
#> 5 doc_a 1 , , PUNCT
#> 6 doc_a 1 someone someone PRON
#> 7 doc_a 1 said say VERB
#> 8 doc_a 1 that that SCONJ
#> 9 doc_a 1 we we PRON
#> 10 doc_a 1 are be AUX
#> 11 doc_a 1 meeting meet VERB
#> 12 doc_a 1 again again ADV
#> 13 doc_a 1 tomorrow tomorrow NOUN
#> 14 doc_b 1 It it PRON
#> 15 doc_b 1 's be AUX
#> 16 doc_b 1 better better ADJ
#> 17 doc_b 1 to to PART
#> 18 doc_b 1 be be AUX
#> 19 doc_b 1 good good ADJ
#> 20 doc_b 1 at at SCONJ
#> 21 doc_b 1 being be AUX
#> 22 doc_b 1 the the DET
#> 23 doc_b 1 best best ADJ
lemmatisation <- paste.data.frame(anno, term = "lemma",
                                  group = c("doc_id", "sentence_id"))
lemmatisation
#> doc_id sentence_id
#> 1 doc_a 1
#> 2 doc_b 1
#> lemma
#> 1 in we last meeting , someone say that we be meet again tomorrow
#> 2 it be better to be good at be the best
library(SnowballC)
tokens <- strsplit(x, split = "[[:space:][:punct:]]+")
stemming <- lapply(tokens, FUN = function(x) wordStem(x, language = "en"))
stemming
#> $doc_a
#> [1] "In" "our" "last" "meet" "someon" "said"
#> [7] "that" "we" "are" "meet" "again" "tomorrow"
#>
#> $doc_b
#> [1] "It" "s" "better" "to" "be" "good" "at" "be"
#> [9] "the" "best"
Lemmatization can be done easily in R with the textstem package.
The steps are:
1) Install textstem
2) Load the package with library(textstem)
3) stem_word = lemmatize_words(word, dictionary = lexicon::hash_lemmas)
where stem_word is the result of lemmatization and word is the input word.

Using write.graph in igraph to output a .net file

I think I am missing something rather simple here, but what is the syntax for adding arguments to the write.graph function in R's igraph package? I am trying to output a network to a Pajek-formatted file (.net) with weighted edges and IDs. I've tried the following commands, but keep getting errors ("Unknown arguments to write.graph (Pajek format)."):
write.graph(weightedg,file="musGiant2012.net", format="pajek",'weight')
write.graph(weightedg,file="musGiant2012.net", format="pajek", id=TRUE)
write.graph(weightedg,file="musGiant2012.net", format="pajek", ("id"))
Plus many others. I am pretty sure that I am committing a simple syntax error, but cannot find any guidance on how to correct it.
From the docs at http://igraph.org/r/doc/write.graph.html:
The Pajek format is a text file, see read.graph for details. Appropriate vertex and edge attributes are also written to the file. This format has no additional arguments.
And http://igraph.org/r/doc/read.graph.html shows that edge weights are supported, and vertex ids are supported as well. So if you have your vertex ids as an attribute called id, and your edge weights as an attribute called weight, then you do not need any extra argument. E.g.
library(igraph)
g <- graph.ring(5)
V(g)$id <- letters[1:5]
E(g)$weight <- runif(ecount(g))
tmp <- tempfile()
write.graph(g, file = tmp, format = "pajek")
cat(readLines(tmp), sep = "\n")
#> *Vertices 5
#> 1 "a"
#> 2 "b"
#> 3 "c"
#> 4 "d"
#> 5 "e"
#> *Edges
#> 1 2 0.054399197222665
#> 2 3 0.503386947326362
#> 3 4 0.373047293629497
#> 4 5 0.84542120853439
#> 1 5 0.610330935101956
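As a quick sanity check (a sketch added here, not part of the original answer), you can read the file back in and confirm that both attributes survived the round trip; igraph's Pajek reader stores the vertex labels in the id attribute:
g2 <- read.graph(tmp, format = "pajek")
V(g2)$id      # the vertex labels written above: "a" "b" "c" "d" "e"
E(g2)$weight  # the edge weights round-trip as well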
