I'm trying to scrape information from this webpage, https://www.ncleg.gov/Laws/GeneralStatuteSections/Chapter14, (the info under the "Chapter 14" tab)
and put it into a data frame with two columns in R, but these skills are out of my wheelhouse and I need some help. More specifically, I want one column with the G.S. numbers ("G.S. 14-1", "G.S. 14-1.1", etc.) and one column with the names corresponding to these G.S. numbers ("14-1. Felonies and Misdemeanors Defined", "14-1.1: Repealed by Session Laws 1993, c. 538, s. 2.", etc.), as text and not as links.
I've tried using the selector gadget, but this tool is pretty new to me and I don't really understand how to apply what I do with it in R.
Any advice or tips on how to do this?
Yes, this is fairly tricky. I would probably approach it with a combination of xpath and regular expressions:
library(rvest)
#> Loading required package: xml2
page <- read_html("https://www.ncleg.gov/Laws/GeneralStatuteSections/Chapter14")
x1 <- "//div[@class = 'col-12 col-md-3 col-lg-2 d-flex mobile-font-size-large']"
x2 <- "//div[@class = 'col-12 col-md-9 col-lg-10']"
description <- html_nodes(page, xpath = x2) %>% html_text() %>% trimws()
col2 <- gsub("^.*.\\d[A-Z]?(\\.|:) +", "", description)
col1 <- gsub("^(.*.\\d[A-Z]?[\\.|:]) +.*$", "\\1", description)
col1 <- gsub("\u00a7", "GS", col1)
df <- data.frame(section = col1, description = col2)
For ease of printing, I'll show the resulting data frame as a tibble:
tibble::as_tibble(df)
#> # A tibble: 1,059 x 2
#> section description
#> <chr> <chr>
#> 1 GS 14-1. Felonies and misdemeanors defined.
#> 2 GS 14-1.1: Repealed by Session Laws 1993, c. 538, s. 2.
#> 3 GS 14-2: Repealed by Session Laws 1993, c. 538, s. 2.1.
#> 4 GS 14-2.1: Repealed by Session Laws 1993, c. 538, s. 3.
#> 5 GS 14-2.2: Repealed by Session Laws 2003-0378, s. 1, effective August 1, 200~
#> 6 GS 14-2.3. Forfeiture of gain acquired through criminal activity.
#> 7 GS 14-2.4. Punishment for conspiracy to commit a felony.
#> 8 GS 14-2.5. Punishment for attempt to commit a felony or misdemeanor.
#> 9 GS 14-2.6. Punishment for solicitation to commit a felony or misdemeanor.
#> 10 GS 14-3. Punishment of misdemeanors, infamous offenses, offenses committed~
#> # ... with 1,049 more rows
Created on 2020-09-30 by the reprex package (v0.3.0)
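As a side note, x1 above is defined but never used. If that first XPath selects the left-hand cells holding the G.S. numbers (an assumption about the page layout worth checking in your browser), a sketch that avoids the regular expressions could look like this:
# hedged sketch: assumes the x1 nodes are the G.S.-number cells and that they
# line up one-to-one with the x2 description cells
section_raw <- html_nodes(page, xpath = x1) %>% html_text() %>% trimws()
title_raw   <- html_nodes(page, xpath = x2) %>% html_text() %>% trimws()
df_alt <- data.frame(section = section_raw, description = title_raw)
You would still want the gsub() calls above if you need to strip the leading section number from the description column.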
I am working with a trade dataset, and I need to subset out all rows which represent goods going to China, Korea, the Dominican Republic, and several others. I can programmatically create this list, and I know how to subset the trade dataset for any one of these countries, but not for all of them at once. What I have tried is using the which() function.
DesiredSubset = TotalTradeData[which(TotalTradeData$Destination.Code == c(List of desired country codes)), ]
This runs but produces this:
Warning message: In DesiredSubset = TotalTradeData[which(TotalTradeData$Destination.Code == : longer object length is not a multiple of shorter object length
It subsets out some rows, but nowhere close to all of the ones that I need.
I'm pretty sure that this would work if I just typed all the codes with | in between as an or operator, but I have to do this for dozens of codes dozens of times so that isn't practical.
How can I subset out all of the rows containing any one of the country codes in my list?
Welcome to Stack Overflow. It is always a good idea here to share a minimal reproducible example of your data and, if necessary, an example of your desired output.
In your case you are using the wrong operator: use %in% (value matching) instead of == (element-wise comparison).
# The data
TotalTradeData <- data.frame(
Destination.Code = c('COL', 'DOM', 'KOR', 'CHINA', 'USA', 'BRA'),
variable1 = letters[1:6]
)
TotalTradeData
#> Destination.Code variable1
#> 1 COL a
#> 2 DOM b
#> 3 KOR c
#> 4 CHINA d
#> 5 USA e
#> 6 BRA f
TotalTradeData[TotalTradeData$Destination.Code %in% c('DOM', 'KOR', 'CHINA'), ]
#> Destination.Code variable1
#> 2 DOM b
#> 3 KOR c
#> 4 CHINA d
Created on 2022-03-27 by the reprex package (v2.0.1)
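The warning you got comes from vector recycling: == compares the two vectors element by element and recycles the shorter one, so a code only "matches" when it happens to land in the same position as a row. %in% tests membership instead, and it works the same way with a vector of codes you build programmatically:
desired_codes <- c('DOM', 'KOR', 'CHINA')  # build this vector however you like
DesiredSubset <- TotalTradeData[TotalTradeData$Destination.Code %in% desired_codes, ]
DesiredSubset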
One question I can't find the answer to in R is: how can I find the dominant topic of an NLP topic model for each sentence?
Imagine I have data frame like this:
comment <- c("outstanding renovation all improvements are topoftheline and done with energy efficiency in mind low monthly utilities even the interior",
"solidly constructed lovingly maintained sf crest built",
"one year since built new this well designed storey home",
"beautiful street large bdm in the heart of lynn valley over sqft bathrooms",
"rare to find legal beautiful upgr in port moody centre with a mountain view all bedroom units were nicely renovated",
"fantastic opportunity to get value for the money excellent family home in desirable blueridge with legal selfcontained bachelor suite on the main floor great location close to swimming ice skating community",
"original owner tired but rock solid perfect location half a block to norquay elementary school and short quiet blocks to slocan park and sky train station")
id <- c(1,2,3,4,5,6,7)
data <- data.frame(id, comment)
I do the preprocessing shown below:
library(dplyr)
library(tidytext)   # provides unnest_tokens() and the stop_words data
library(SnowballC)  # provides wordStem()
text_cleaning_tokens <- data %>%
tidytext::unnest_tokens(word, comment)
text_cleaning_tokens$word <- gsub('[[:digit:]]+', '', text_cleaning_tokens$word)
text_cleaning_tokens$word <- gsub('[[:punct:]]+', '', text_cleaning_tokens$word)
text_cleaning_tokens <- text_cleaning_tokens %>% filter(!(nchar(word) == 1))%>%
anti_join(stop_words)
stemmed_token <- text_cleaning_tokens %>% mutate(word=wordStem(word))
tokens <- stemmed_token %>% filter(!(word==""))
tokens <- tokens %>% mutate(ind = row_number())
tokens <- tokens %>% group_by(id) %>% mutate(ind = row_number()) %>%
tidyr::spread(key = ind, value = word)
tokens [is.na(tokens)] <- ""
tokens <- tidyr::unite(tokens, clean_remark,-id,sep =" " )
tokens$clean_remark <- trimws(tokens$clean_remark)
Then I ran the FitLdaModel() function from textmineR on this data and, finally, got the top terms for the 2 topics:
t_1 t_2
1 beauti built
2 block home
3 renov legal
4 bathroom locat
5 bdm bachelor
6 bdm_heart bachelor_suit
7 beauti_street block_norquai
8 beauti_upgr blueridg
9 bedroom blueridg_legal
10 bedroom_unit built_design
Now, based on this result, I want to find the most dominant topic for each sentence. For example, for comment 1 ("outstanding renovation all improvements are topoftheline and done with energy efficiency in mind low monthly utilities even the interior"), which topic (topic 1 or topic 2) is the most dominant?
Can anyone help me with this question? Is there a package that can do this?
It is pretty easy to work with quanteda and topicmodels. The former is for data management and quantitative analysis of textual data, the latter is for topic modeling inference.
Here I take your comment object and transform it to a corpus and then to a dfm. I then convert it to be understandable by topicmodels.
The function LDA() gives you all you need to easily extract the information. In particular, get_topics() returns the most probable topic for each document. If you instead want to see the document-topic weights, you can do so with the @gamma slot of the fitted model (myLDA@gamma below). You will see that get_topics() does exactly what you asked for.
Please see if this works for you.
library(quanteda)
#> Package version: 2.1.2
#> Parallel computing: 2 of 16 threads used.
#> See https://quanteda.io for tutorials and examples.
#>
#> Attaching package: 'quanteda'
#> The following object is masked from 'package:utils':
#>
#> View
library(topicmodels)
comment <- c("outstanding renovation all improvements are topoftheline and done with energy efficiency in mind low monthly utilities even the interior",
"solidly constructed lovingly maintained sf crest built",
"one year since built new this well designed storey home",
"beautiful street large bdm in the heart of lynn valley over sqft bathrooms",
"rare to find legal beautiful upgr in port moody centre with a mountain view all bedroom units were nicely renovated",
"fantastic opportunity to get value for the money excellent family home in desirable blueridge with legal selfcontained bachelor suite on the main floor great location close to swimming ice skating community",
"original owner tired but rock solid perfect location half a block to norquay elementary school and short quiet blocks to slocan park and sky train station")
mycorp <- corpus(comment)
docvars(mycorp, "id") <- 1L:7L
mydfm <- dfm(mycorp)
# convert the dfm to a DocumentTermMatrix for topicmodels
forTM <- convert(mydfm, to = "topicmodels")
myLDA <- LDA(forTM, k = 2)
dominant_topics <- get_topics(myLDA)
dominant_topics
#> text1 text2 text3 text4 text5 text6 text7
#> 2 2 2 2 1 1 1
dtw <- myLDA@gamma
dtw
#> [,1] [,2]
#> [1,] 0.4870600 0.5129400
#> [2,] 0.4994974 0.5005026
#> [3,] 0.4980144 0.5019856
#> [4,] 0.4938985 0.5061015
#> [5,] 0.5037667 0.4962333
#> [6,] 0.5000727 0.4999273
#> [7,] 0.5176960 0.4823040
Created on 2021-03-18 by the reprex package (v1.0.0)
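If you want the dominant topic lined up with the original comments, a short follow-up sketch (reusing the objects created above) is to bind the get_topics() output to the id document variable:
# one row per comment, with its most probable topic
data.frame(id = docvars(mydfm, "id"),
           comment = comment,
           dominant_topic = unname(dominant_topics))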
I agree with the other answer that quanteda and topicmodels are a better choice. Maybe also look into seededlda which is an LDA implementation from one of the quanteda authors (with extra features you don't have to use).
However, if you want to stick with your choice of tidytext and textmineR, this is how you would do it.
First, I simplified your preprocessing a bit, since you did some steps that seemed unnecessary to me:
library(tidyverse)
library(tidytext)
text_cleaning_tokens <- data %>%
unnest_tokens(word, comment) %>%
mutate(word = str_remove(word, "[[:digit:]]|[[:punct:]]")) %>%
filter(!(nchar(word) <= 1))%>%
anti_join(stop_words, by = "word") %>%
mutate(word = SnowballC::wordStem(word))
Then I run LDA according to the textmineR example:
lda <- text_cleaning_tokens %>%
cast_sparse(id, word) %>%
textmineR::FitLdaModel(k = 2,
iterations = 200,
burnin = 175,
optimize_alpha = TRUE,
calc_likelihood = TRUE,
calc_r2 = TRUE)
Now all implementations of LDA deliver two important results:
phi (φ) which shows for each word in the corpus how it scored on each topic. The higher the phi-value, the more prevalent the word in this specific topic.
theta (θ) which shows for each document in the corpus how it scored on each topic. The higher the theta-value, the more prevalent the topic is in the document. (topicmodels calls it gamma for some reason.)
In other words, all you have to do to find the most dominant topic in a text is:
lda$theta %>%
as_tibble() %>%
rowwise() %>%
mutate(top = which.max(c_across(everything()))) %>% # find highest value per row dplyr style
bind_cols(data, .) %>% # bind to original data
as_tibble() # just for nicer printing
#> # A tibble: 7 x 5
#> id comment t_1 t_2 top
#> <int> <chr> <dbl> <dbl> <int>
#> 1 1 outstanding renovation all improvements are t… 0.892 0.108 1
#> 2 2 solidly constructed lovingly maintained sf crest … 0.0161 0.984 2
#> 3 3 one year since built new this well designed store… 0.0238 0.976 2
#> 4 4 beautiful street large bdm in the heart of lynn v… 0.986 0.0139 1
#> 5 5 rare to find legal beautiful upgr in port moody c… 0.992 0.00820 1
#> 6 6 fantastic opportunity to get value for the money … 0.266 0.734 2
#> 7 7 original owner tired but rock solid perfect locat… 0.00549 0.995 2
Created on 2021-03-18 by the reprex package (v1.0.0)
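By the way, if you also want to reproduce a top-terms table like the t_1 / t_2 table in your question from this model, textmineR ships a helper for that; a small sketch reusing the lda object fitted above:
# 10 highest-scoring terms per topic (one column per topic)
textmineR::GetTopTerms(phi = lda$phi, M = 10)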
I also recommend you read Julia Silge's stuff on the matter. For example, this and this.
I am looking to scrape article data from inquirer.net.
This is a follow-up question to Scrape Data through RVest
Here is the code that works based on the answer:
library(rvest)
#> Loading required package: xml2
library(tibble)
year <- 2020
month <- 06
day <- 13
url <- paste0('http://www.inquirer.net/article-index?d=', year, '-', month, '-', day)
div <- read_html(url) %>% html_node(xpath = '//*[@id = "index-wrap"]')
links <- html_nodes(div, xpath = '//a[@rel = "bookmark"]')
post_date <- html_nodes(div, xpath = '//span[@class = "index-postdate"]') %>%
html_text()
test <- tibble(date = post_date,
text = html_text(links),
link = html_attr(links, "href"))
test
#> # A tibble: 261 x 3
#> date text link
#> <chr> <chr> <chr>
#> 1 1 day a~ ‘We can never let our guard down~ https://newsinfo.inquirer.net/129~
#> 2 1 day a~ PNP spox says mañanita remark di~ https://newsinfo.inquirer.net/129~
#> 3 1 day a~ After stranded mom’s death, Pasa~ https://newsinfo.inquirer.net/129~
#> 4 1 day a~ Putting up lining for bike lanes~ https://newsinfo.inquirer.net/129~
#> 5 1 day a~ PH Army provides accommodation f~ https://newsinfo.inquirer.net/129~
#> 6 1 day a~ DA: Local poultry production suf~ https://newsinfo.inquirer.net/129~
#> 7 1 day a~ IATF assessing proposed design t~ https://newsinfo.inquirer.net/129~
#> 8 1 day a~ PCSO lost ‘most likely’ P13B dur~ https://newsinfo.inquirer.net/129~
#> 9 2 days ~ DOH: No IATF recommendations yet~ https://newsinfo.inquirer.net/129~
#> 10 2 days ~ PH coronavirus cases exceed 25,0~ https://newsinfo.inquirer.net/129~
#> # ... with 251 more rows
I now want to add a new column to this output which has the full article for each row. Before doing the for-loop, I was investigating the html code for the first article: https://newsinfo.inquirer.net/1291178/pnp-spox-says-he-did-not-intend-to-put-sinas-in-bad-light
Digging into the HTML, I'm noticing it is not that clean. From my findings so far, the main article text falls under the selector #article_content , p. So my output right now is split across multiple rows, and a lot of non-article data appears as well. Here is what I have currently:
article_data <- data.frame(test)
article_url <- read_html(article_data[2, 3])
article <- article_url %>%
  html_nodes("#article_content , p") %>%
  html_text()
View(article)
I'm ok with this being multiple rows because I can just union the final result. But since there are other non-article items then it will mess up what I am trying to do (sentiment analysis).
Can someone please assist on how to clean this data so that the full article is next to each article link?
I could simply union the results, excluding the first row and the last two rows, but I'm looking for a cleaner way because I want to do this for all of the articles and not just this one.
After a short look at the structure of the article page, I suggest using the CSS selector ".article_align div p":
library(rvest)
library(dplyr)
url <- "https://newsinfo.inquirer.net/1291178/pnp-spox-says-he-did-not-intend-to-put-sinas-in-bad-light"
read_html(url) %>%
html_nodes(".article_align div p") %>%
html_text()
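To get the full article next to each row of test, a hedged sketch (assuming every article page shares the layout of the example above) is to map over the link column and collapse each article's paragraphs into one string:
library(rvest)
library(dplyr)
library(purrr)
get_article <- function(link) {
  read_html(link) %>%
    html_nodes(".article_align div p") %>%
    html_text() %>%
    paste(collapse = " ")
}
test_with_text <- test %>%
  mutate(article = map_chr(link, get_article))  # one HTTP request per row
This fires a request per article, so consider adding Sys.sleep() between calls and checking the site's terms before scraping at scale.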
Firstly "complex lag inheritance" may not be the clearest title, so suggestions welcome. I have a large dataset of ordered segmented strings that I need to group by stem matching of segments. This looping example demonstrates the required logic:
require(tidyverse)
x = data_frame(name = c('smith', 'smith.james', 'smith.jill',
'taylor', 'taylor.ian', 'walker', 'walker.john', 'walker.john.sid',
'reed.snow', 'reed.snow.harry', 'reed.snow.helen.jane'),
family_name = NA_character_)
x$family_name[1] = x$name[1]
for(i in 2:nrow(x)){
# if current record matches previous record's family assignment..
family_match = str_detect(string = x$name[i], pattern = paste0('^', x$family_name[i-1], '[.]'))
x$family_name[i] = ifelse(family_match, x$family_name[i-1], x$name[i])
}
print(x)
#> # A tibble: 11 x 2
#> name family_name
#> <chr> <chr>
#> 1 smith smith
#> 2 smith.james smith
#> 3 smith.jill smith
#> 4 taylor taylor
#> 5 taylor.ian taylor
#> 6 walker walker
#> 7 walker.john walker
#> 8 walker.john.sid walker
#> 9 reed.snow reed.snow
#> 10 reed.snow.harry reed.snow
#> 11 reed.snow.helen.jane reed.snow
I have tried using this looping approach and it does not seem feasible given the size of the data, so the alternative is a vectorised dplyr approach or Python.
The heart of the problem is that each family_name assignment is based on match of either the current record's name (when inferring new family names), or the previous record's family_name. I don't see how to reconcile this logic with an approach using e.g. pmap_chr, but if I'm wrong I'd love to know how.
This uses no regular expressions or explicit loops although internally Reduce would be using a loop. No packages are used.
Names <- paste0(x$name, ".")
iter <- function(x, y) if (startsWith(y, x)) x else y
Reduce(iter, Names, accumulate = TRUE)
giving:
[1] "smith." "smith." "smith." "taylor." "taylor."
[6] "walker." "walker." "walker." "reed.snow." "reed.snow."
[11] "reed.snow."
I am trying to score the sentiment of a dataset of Tweets using the AFINN dictionary (get_sentiments("afinn")). A sample of the dataset is provided below:
A tibble: 10 x 2
Date TweetText
<dttm> <chr>
1 2018-02-10 21:58:19 "RT @RealSirTomJones: Still got the moves! That was a lo~
2 2018-02-10 21:58:19 "Yass Tom \U0001f600 #snakehips still got it #TheVoiceUK"
3 2018-02-10 21:58:19 Yasss tom he’s some chanter #TheVoiceUK #ItsNotUnusual
4 2018-02-10 21:58:20 #TheVoiceUK SIR TOM JONES...HE'S STILL HOT... AMAZING VO~
5 2018-02-10 21:58:21 I wonder how many hips Tom Jones has been through? #TheV~
6 2018-02-10 21:58:21 Tom Jones has still got it!!! #TheVoiceUK
7 2018-02-10 21:58:21 Good grief Tom Jones is amazing #TheVoiceuk
8 2018-02-10 21:58:21 RT @tonysheps: Sir Thomas Jones you’re a bloody legend #~
9 2018-02-10 21:58:22 #ITV Tom Jones what a legend!!! ❤️ #StillGotIt #TheVoice~
10 2018-02-10 21:58:22 "RT @RealSirTomJones: Still got the moves! That was a lo~
What I want to do is:
1. Split up the Tweets into individual words.
2. Score those words using the AFINN lexicon.
3. Sum the score of all the words of each Tweet
4. Return this sum into a new third column, so I can see the score per Tweet.
For a similar lexicon I found the following code:
# Initiate the scoreTopic
scoreTopic <- 0
# Start a loop over the documents
for (i in 1:length (myCorpus)) {
# Store separate words in character vector
terms <- unlist(strsplit(myCorpus[[i]]$content, " "))
# Determine the number of positive matches
pos_matches <- sum(terms %in% positive_words)
# Determine the number of negative matches
neg_matches <- sum(terms %in% negative_words)
# Store the difference in the results vector
scoreTopic [i] <- pos_matches - neg_matches
} # End of the for loop
dsMyTweets$score <- scoreTopic
I am, however, not able to adjust this code to get it working with the AFINN dictionary.
This would be a great use case for tidy data principles. Let's set up some example data (these are real tweets of mine).
library(tidytext)
library(tidyverse)
tweets <- tribble(
~tweetID, ~TweetText,
1, "Was Julie helping me because I don't know anything about Python package management? Yes, yes, she was.",
2, "#darinself OMG, this is my favorite.",
3, "#treycausey #ftrain THIS IS AMAZING.",
4, "#nest No, no, not in error. Just the turkey!",
5, "The #nest people should write a blog post about how many smoke alarms went off yesterday. (I know ours did.)")
Now we have some example data. In the code below, unnest_tokens() tokenizes the text, i.e. breaks it up into individual words (the tidytext package allows you to use a special tokenizer for tweets) and the inner_join() implements the sentiment analysis.
tweet_sentiment <- tweets %>%
unnest_tokens(word, TweetText, token = "tweets") %>%
inner_join(get_sentiments("afinn"))
#> Joining, by = "word"
Now we can find the scores for each tweet. Take the original data set of tweets and left_join() on to it the sum() of the scores for each tweet. The handy function replace_na() from tidyr lets you replace the resulting NA values with zero.
tweets %>%
left_join(tweet_sentiment %>%
group_by(tweetID) %>%
summarise(score = sum(score))) %>%
replace_na(list(score = 0))
#> Joining, by = "tweetID"
#> # A tibble: 5 x 3
#> tweetID TweetText score
#> <dbl> <chr> <dbl>
#> 1 1. Was Julie helping me because I don't know anything about … 4.
#> 2 2. @darinself OMG, this is my favorite. 2.
#> 3 3. @treycausey @ftrain THIS IS AMAZING. 4.
#> 4 4. @nest No, no, not in error. Just the turkey! -4.
#> 5 5. The @nest people should write a blog post about how many … 0.
Created on 2018-05-09 by the reprex package (v0.2.0).
If you are interested in sentiment analysis and text mining, I invite you to check out the extensive documentation and tutorials we have for tidytext.
For future reference:
Score_word <- function(x) {
  # note: newer tidytext/textdata releases name the AFINN column "value"
  # rather than "score", so adjust the column name if needed
  word_bool_vec <- get_sentiments("afinn")$word == x
  score <- get_sentiments("afinn")$score[word_bool_vec]
  return(score)
}
Score_tweet <- function(sentence) {
  words <- unlist(strsplit(sentence, " "))
  words <- as.vector(words)
  scores <- sapply(words, Score_word)
  scores <- unlist(scores)
  Score_tweet <- sum(scores)
  return(Score_tweet)
}
# apply the scoring function to the tweet text column
dsMyTweets$score <- sapply(dsMyTweets$TweetText, Score_tweet)
This executes what I initially wanted! :)
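For anyone reading later: a vectorised sketch of the same idea, using the tidy approach from the accepted answer applied directly to a data frame like dsMyTweets (an assumed object with Date and TweetText columns; note again that newer tidytext/textdata releases name the AFINN column value rather than score):
library(dplyr)
library(tidytext)
afinn <- get_sentiments("afinn")
dsMyTweets %>%
  mutate(tweet_id = row_number()) %>%
  unnest_tokens(word, TweetText, drop = FALSE) %>%   # keep the original tweet text
  left_join(afinn, by = "word") %>%                  # non-lexicon words get NA
  group_by(tweet_id, Date, TweetText) %>%
  summarise(score = sum(value, na.rm = TRUE), .groups = "drop")
Tweets with no AFINN words end up with a score of 0, matching the replace_na() behaviour in the answer above.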