How to apply stopwords accurately in French using R - r

I'm trying to pull a book using the Gutenberg library and then remove French stopwords. I've been able to do this accurately in English by doing this:
twistEN <- gutenberg_download(730)
twistEN <- twistEN[118:nrow(twistEN),]
twistEN <- twistEN %>%
unnest_tokens(word, text)
data(stop_words)
twistEN <- twistEN %>%
anti_join(stop_words)
countsEN <- twistEN %>%
count(word, sort=TRUE)
top.en <- countsEN[1:20,]
I can see here that the top 20 words (by frequency) in the English version of Oliver Twist are these:
word n
<chr> <int>
1 oliver 746
2 replied 464
3 bumble 364
4 sikes 344
5 time 329
6 gentleman 309
7 jew 294
8 boy 291
9 fagin 291
10 dear 277
11 door 238
12 head 226
13 girl 223
14 night 218
15 sir 210
16 lady 209
17 hand 205
18 eyes 204
19 rose 201
20 cried 182
I'm trying to accomplish the same thing with the French version of the same novel:
twistFR <- gutenberg_download(16023)
twistFR <- twistFR[123:nrow(twistFR),]
twistFR <- twistFR %>%
unnest_tokens(word, text)
stop_french <- data.frame(word = stopwords::stopwords("fr"), stringsAsFactors = FALSE)
stop_french <- get_stopwords("fr","snowball")
as.data.frame(stop_french)
twistFR <- twistFR %>%
anti_join(stop_words, by = c('word')) %>%
anti_join(stop_french, by = c("word"))
countsFR <- twistFR %>%
count(word, sort=TRUE)
top.fr <- countsFR[1:20,]
I did alter the code for the French stopwords based on info I found online, and it is removing some stopwords. But this is the list I'm getting:
word n
<chr> <int>
1 dit 1375
2 r 1311
3 tait 1069
4 re 898
5 e 860
6 qu'il 810
7 plus 780
8 a 735
9 olivier 689
10 si 673
11 bien 656
12 tout 635
13 tre 544
14 d'un 533
15 comme 519
16 c'est 494
17 pr 481
18 pondit 472
19 juif 450
20 monsieur 424
At least half of these words should be getting captured by a stopwords list and they're not. Is there something I'm doing wrong in my code? I'm new to tidy text, so I'm sure there are better ways to get at this.

I used a few different packages to get what you want. I took the stopwords from tidystopwords, as these are based on the universal dependency models. But you could use the stopwords from snowball, from the stopwords package, or from the proustr package. You might even combine the stopwords from multiple packages, depending on your requirements and what you consider to be stopwords; all stopword lists are slightly different.
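For instance, a minimal sketch of combining two of those French lists into a single de-duplicated data frame (using the stopwords package; the object name combined_stops is just illustrative):
combined_stops <- data.frame(
  word = unique(c(stopwords::stopwords("fr", source = "snowball"),
                  stopwords::stopwords("fr", source = "stopwords-iso"))),
  stringsAsFactors = FALSE)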
I use the udpipe package to split the text into its separate tokens. This takes longer than unnest_tokens from tidytext (but I use the default options, which include POS tagging and lemmatisation). I find that unnest_tokens doesn't work well with non-English languages.
library(gutenbergr)
library(tidystopwords)
library(udpipe)
library(dplyr)
# get twist in French
twistFR <- gutenberg_download(16023)
# Convert all lines to utf8 (needed on my system)
twistFR$text <- iconv(twistFR$text, to = "UTF-8")
# get french stopwords based on ud language model
my_french_stopwords <- generate_stoplist(lang_name = "French")
my_french_stopwords <- data.frame(word = my_french_stopwords, stringsAsFactors = FALSE)
# download udpipe model for french language
ud_model <- udpipe_download_model(language = "french")
ud_model_fr <- udpipe_load_model(ud_model)
# set parallel.cores. Udpipe annotate can take a while as it does a lot more than just tokenizing.
ud_twistFR <- udpipe_annotate(ud_model_fr, twistFR$text[123:nrow(twistFR)], parallel.cores = 3)
# transform to data.frame
ud_twistFR_df <- data.frame(ud_twistFR, stringsAsFactors = FALSE)
# put tokens in lowercase, remove stopwords and punctuation
ud_twistFR_df <- ud_twistFR_df %>%
mutate(token = tolower(token)) %>%
anti_join(my_french_stopwords, by = c("token" = "word")) %>%
filter(upos != "PUNCT") # remove punctuation
# count tokens
ud_countsFR <- ud_twistFR_df %>%
count(token, sort=TRUE)
ud_countsFR[1:20,]
# A tibble: 20 x 2
token n
<chr> <int>
1 pas 1558
2 dit 1366
3 m. 915
4 olivier 843
5 plus 775
6 bien 652
7 répondit 469
8 juif 435
9 monsieur 412
10 bumble 367
11 enfant 355
12 sikes 341
13 jeune 336
14 air 290
15 porte 281
16 tête 279
17 encore 278
18 homme 267
19 même 261
20 demanda 257

It turns out that my main problem was actually not the stop words. It was that accented characters were coming through as codes instead of as the accents. I applied this:
twistFR$text <- iconv(twistFR$text, "latin1", "UTF-8")
And the situation pretty much resolved itself. I also applied the larger stopwords-iso list. Thanks for both of your comments!
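For reference, a minimal sketch of the corrected pipeline with both changes applied (the re-encoding plus the larger stopwords-iso list); the object names here are just illustrative:
library(gutenbergr)
library(tidytext)
library(dplyr)
twistFR <- gutenberg_download(16023)
# re-encode so accented characters survive tokenization
twistFR$text <- iconv(twistFR$text, "latin1", "UTF-8")
stop_fr_iso <- data.frame(word = stopwords::stopwords("fr", source = "stopwords-iso"),
                          stringsAsFactors = FALSE)
countsFR <- twistFR[123:nrow(twistFR), ] %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_fr_iso, by = "word") %>%
  count(word, sort = TRUE)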

Related

Can I scrape a single list that spans across multiple pages when webscraping with R?

Okay so I'm trying to scrape the table with dog temperaments from this website: https://atts.org/breed-statistics/statistics-page1/
However, the table spans 8 pages in total (and therefore 8 unique URLs).
Currently, for page 1 of the table, I have written the following code:
url <- "https://atts.org/breed-statistics/statistics-page1/"
webpage <- read_html(url)
bn_data_html <- html_nodes(webpage, "td:nth-child(1)")
bn_data <- html_text(bn_data_html)
nt_data_html <- html_nodes(webpage, "td:nth-child(2)")
nt_data <- html_text(nt_data_html)
passed_data_html <- html_nodes(webpage, "td:nth-child(3)")
passed_data <- html_text(passed_data_html)
failed_data_html <- html_nodes(webpage, "td:nth-child(4)")
failed_data <- html_text(failed_data_html)
percent_data_html <- html_nodes(webpage, "td:nth-child(5)")
percent_data <- html_text(percent_data_html)
breeds <- data.frame(Breed = bn_data, Number_tested = nt_data, Passed = passed_data, Failed = failed_data, Percent = percent_data)
This works wonderfully to scrape the data from the first page. However, to scrape the entire table, the only way I can think of would be to replace the original URL and rerun the chunk of code eight times, once for each page of the table. Is there a way to do this without having to rerun it eight times? Say the table spanned 100 pages and rerunning the code that many times just wasn't feasible?
This is how you get them dogs into a data frame, scraping pages 1:8. Note the use of html_table().
library(tidyverse)
library(rvest)
get_dogs <- function(page) {
str_c("https://atts.org/breed-statistics/statistics-page", page) %>%
read_html() %>%
html_table() %>%
getElement(1) %>%
janitor::row_to_names(1) %>%
janitor::clean_names()
}
dogs_df <- map_dfr(1:8, get_dogs)
# A tibble: 250 x 5
breed_name tested passed failed percent
<chr> <chr> <chr> <chr> <chr>
1 AFGHAN HOUND 165 120 45 72.7%
2 AIREDALE TERRIER 110 86 24 78.2%
3 AKBASH DOG 16 14 2 87.5%
4 AKITA 598 465 133 77.8%
5 ALAPAHA BLUE BLOOD BULLDOG 12 9 3 75.0%
6 ALASKAN KLEE KAI 2 1 1 50.0%
7 ALASKAN MALAMUTE 244 207 37 84.8%
8 AMERICAN BANDAGGE 1 1 0 100.0%
9 AMERICAN BULLDOG 214 186 28 86.9%
10 AMERICAN ESKIMO 86 71 15 82.6%
# ... with 240 more rows
# ℹ Use `print(n = ...)` to see more rows
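Since html_table() returns every column as character here, a hedged follow-up (assuming the column names shown above and a recent dplyr for across()) to convert the counts and the percentage to numbers:
dogs_df <- dogs_df %>%
  mutate(across(c(tested, passed, failed), as.integer),
         percent = readr::parse_number(percent) / 100)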

Web scraping with R (rvest)

I'm new to R and am having some trouble creating a good web scraper with R... It has been only 5 days since I started to study this language, so I'll appreciate any help!
Idea
I'm trying to scrape the classification tables of the "Campeonato Brasileiro" from 2003 to 2021 on Wikipedia, to group the teams later and analyze some stuff.
Explanation and problem
I'm scraping the page of the 2002 championship. I read the HTML page to extract the HTML nodes that I selected with the "SelectorGadget" extension in Google Chrome. There are some considerations:
The page that I'm trying to access is from the 2002 championship. I did that because it was easier to extract the links to the tables, which are listed on a board at the end of that page, using just one selector for all of them (tr:nth-child(9) div a) and reading their links from the "href" HTML attribute;
The selected CSS selector was taken from the 2003 championship page.
So, in my twisted mind I thought: "Hey! I'm going to create a function to extract the tables from those pages and I'll save them in a data frame!". However, it went wrong and I'm not understanding why... When I tried to run the "tabela_geral" line, the following error returned: "Error in UseMethod("xml_find_all") : no applicable method for 'xml_find_all' applied to an object of class "character"". I think that it is reading a string instead of an XML document. What am I misunderstanding here? Where is my error? The "sapply" call? Thanks in advance!
The code
library("dplyr")
library("rvest")
link_wikipedia <- "https://pt.wikipedia.org/wiki/Campeonato_Brasileiro_de_Futebol_de_2002"
pagina_wikipedia <- read_html(link_wikipedia)
links_temporadas <- pagina_wikipedia %>%
html_nodes("tr:nth-child(9) div a") %>%
html_attr("href") %>%
paste("https://pt.wikipedia.org", ., sep = "")
tabela <- function(link){
pagina_tabela <- read_html(link)
tabela_wiki = link %>%
html_nodes("table.wikitable") %>%
html_table() %>%
paste(collapse = "|")
}
tabela_geral <- sapply(links_temporadas, FUN = tabela, USE.NAMES = FALSE)
tabela_final <- data.frame(tabela_geral)
You can use :contains to target the appropriate table by class plus a substring that the table contains. Furthermore, you can use html_table() to extract the matched node in tabular format. You can then subset on a vector of desired columns. I don't know the correct football terms, so I have guessed the columns to subset on; you can adjust the columns vector.
If you wrap the years and the constructed URLs inside a map2_dfr() call, you can return a single data frame for all desired years.
library(tidyverse)
library(rvest)
years <- 2003:2021
urls <- paste("https://pt.wikipedia.org/wiki/Campeonato_Brasileiro_de_Futebol_de_", years, sep = "")
columns <- c("Pos.", "Equipes", "GP", "GC", "SG")
df <- purrr::map2_dfr(urls, years, ~
read_html(.x, encoding = "utf-8") %>%
html_element('.wikitable:contains("ou rebaixamento")') %>%
html_table() %>%
.[columns] %>%
mutate(year = .y, SG = as.character(SG)))
You can get all the tables from those links by doing this:
tabela <- function(link){
read_html(link) %>% html_nodes("table.wikitable") %>% html_table()
}
all_tables = lapply(links_temporadas, tabela)
names(all_tables)<-2003:2022
This gives you a list of length 20, named 2003 to 2022 (i.e. one element for each of those years). Each element is itself a list of tables (i.e. the tables that are available at that link of links_temporadas). Note that the number of tables available at each link varies.
lengths(all_tables)
2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022
6 5 10 9 10 12 11 10 12 11 13 14 17 16 16 16 16 15 17 7
You will need to determine which table(s) you are interested in from each of these years.
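For example, if what you want from each year is the classification table, a hedged sketch that keeps, per year, the first table containing a "Pos." column (the same heuristic the next answer uses):
class_tables <- lapply(all_tables, function(tabs) {
  i <- which(sapply(tabs, function(x) "Pos." %in% names(x)))[1]
  if (is.na(i)) NULL else tabs[[i]]
})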
Here is a way. It's more complicated than your function because those pages have more than one table, so the function returns only the table whose column names include "Pos.".
Then, before rbinding the tables, keep only the common columns since the older tables have one less column, column "M".
suppressPackageStartupMessages({
library("dplyr")
library("rvest")
})
link_wikipedia <- "https://pt.wikipedia.org/wiki/Campeonato_Brasileiro_de_Futebol_de_2002"
pagina_wikipedia <- read_html(link_wikipedia)
links_temporadas <- pagina_wikipedia %>%
html_nodes("tr:nth-child(9) div a") %>%
html_attr("href") %>%
paste("https://pt.wikipedia.org", ., sep = "")
tabela <- function(link){
pagina_tabela <- read_html(link)
lista_wiki <- pagina_tabela %>%
html_elements("table.wikitable") %>%
html_table()
i <- sapply(lista_wiki, \(x) "Pos." %in% names(x))
i <- which(i)[1]
lista_wiki[[i]]
}
tabela_geral <- sapply(links_temporadas, FUN = tabela, USE.NAMES = FALSE)
sapply(tabela_geral, ncol)
#> [1] 12 12 12 12 12 12 13 13 13 13 13 13 13 13 13 13 13 13 13 13
#sapply(tabela_geral, names)
common_names <- Reduce(intersect, lapply(tabela_geral, names))
tabela_reduzida <- lapply(tabela_geral, `[`, common_names)
tabela_final <- do.call(rbind, tabela_reduzida)
head(tabela_final)
#> # A tibble: 6 x 12
#> Pos. Equipes P J V E D GP GC SG `%`
#> <int> <chr> <chr> <int> <int> <int> <int> <int> <int> <chr> <int>
#> 1 1 Cruzeiro 100 46 31 7 8 102 47 +55 72
#> 2 2 Santos 87 46 25 12 9 93 60 +33 63
#> 3 3 São Paulo 78 46 22 12 12 81 67 +14 56
#> 4 4 São Caetano 742 46 19 14 13 53 37 +16 53
#> 5 5 Coritiba 73 46 21 10 15 67 58 +9 52
#> 6 6 Internacional 721 46 20 10 16 59 57 +2 52
#> # ... with 1 more variable: `Classificação ou rebaixamento` <chr>
Created on 2022-04-03 by the reprex package (v2.0.1)
To have all columns, including the "M" columns:
data.table::rbindlist(tabela_geral, fill = TRUE)

How to read a CSV file into R which uses two types of separators in the file?

I am trying to read a CSV file into R which makes use of two different separators: the "," and the ";". Below is a short example of the CSV format:
"car_brand; car_model","total"
"Toyota; 9289","29781"
"Seat; 20981","1610"
"Volkswagen; 11140","904"
"Suzuki; 11640","658"
"Renault; 13075","647"
"Ford; 15855","553"
The CSV file should contain 3 columns, car_brand, car_model, and total. However, car_brand and car_model are separated by a ";" rather than a ",". Any guidance on how to import such a file would be really appreciated.
A double-tap:
x1 <- read.csv("quux.csv", check.names = FALSE)
x2 <- read.csv2(text = x1[[1]], header = FALSE)
names(x2) <- unlist(read.csv2(text = names(x1)[1], header = FALSE))
cbind(x2, x1[,-1,drop=FALSE])
# car_brand car_model total
# 1 Toyota 9289 29781
# 2 Seat 20981 1610
# 3 Volkswagen 11140 904
# 4 Suzuki 11640 658
# 5 Renault 13075 647
# 6 Ford 15855 553
The use of check.names=FALSE is required because otherwise names(x1)[1] looks like "car_brand..car_model". While it can be parsed like this, I thought it better to parse the original text.
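For completeness, a hedged sketch of the "parse the mangled name" route mentioned above, relying on the default check.names = TRUE turning "car_brand; car_model" into "car_brand..car_model":
x1 <- read.csv("quux.csv")   # header becomes "car_brand..car_model"
x2 <- read.csv2(text = x1[[1]], header = FALSE)
names(x2) <- strsplit(names(x1)[1], "..", fixed = TRUE)[[1]]
cbind(x2, x1[, -1, drop = FALSE])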
A tidyverse solution:
library(tidyverse)
read.csv('file.csv',header = T) %>%
separate(col='car_brand..car_model',into = c('car_brand','car_model'),sep = ';') %>%
mutate(car_model=as.numeric(car_model))
Output:
car_brand car_model total
<chr> <dbl> <int>
1 Toyota 9289 29781
2 Seat 20981 1610
3 Volkswagen 11140 904
4 Suzuki 11640 658
5 Renault 13075 647
6 Ford 15855 553
One option would be to use a combination of fread and gsub:
library(data.table)
fread(gsub(";", "", '"car_brand; car_model","total"
"Toyota; 9289","29781"
"Seat; 20981","1610"
"Volkswagen; 11140","904"
"Suzuki; 11640","658"
"Renault; 13075","647"
"Ford; 15855","553"
'))
car_brand car_model total
1: Toyota 9289 29781
2: Seat 20981 1610
3: Volkswagen 11140 904
4: Suzuki 11640 658
5: Renault 13075 647
6: Ford 15855 553
If you write the CSV importer yourself, you simply have to switch the separator dynamically (depending on the field index) as you loop over the lines.
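A rough base-R sketch of that idea for the sample above (the file name file.csv is an assumption):
# read the raw lines, drop the quotes, then split each record on "," or "; "
lines <- readLines("file.csv")
parts <- strsplit(gsub('"', "", lines), ",|;\\s*")
df <- setNames(as.data.frame(do.call(rbind, parts[-1]), stringsAsFactors = FALSE),
               parts[[1]])
df[-1] <- lapply(df[-1], as.numeric)
df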

Looking for an R function to detect a duplicate or the same value in a tibble

I have a tibble:
athletes <-athletes %>%
clean_names() %>%
rename(atheletes_id = id)
athletes_with_mutli_country <- country %>%
select(athlete_id, NOC) %>%
unique()
with these data:
(a table of athlete_id / NOC rows, shown in the original post as an image)
I would like to detect when an athlete_id appears more than once.
What would be the simplest way to do that?
Best,
Stephen
It depends on what you mean by simple. David's answer covers how to do it in base R. If you want simple as in fewer lines of code, you can use the janitor package.
if(!require("janitor")){
install.packages("janitor")
}
# returns only the rows whose atheletes_id is duplicated, plus a dupe_count column showing how many entries share that id:
athletes %>%
janitor::get_dupes(atheletes_id)
This could help
library(tidyverse)
tibble(x = c(1:5, 5:10),
y = LETTERS[1:11]) %>%
group_by(x) %>%
mutate(count = n())
It would help to clarify the question if you could show your expected output. I'm assuming that you want to remove duplicate athlete_ids from your data frame.
In base R there is a function, duplicated(), that can find duplicate values in a vector. You can use it together with the subsetting operators to remove the duplicate athletes from your data frame. Here is an example:
# Create sample data
athletes <- read.table(header = TRUE, text = '
athlete_id NOC
183 UAR
184 EGY
185 EGY
186 UAR
187 EGY
188 EGY
189 SUD
184 EGY
191 EGY
192 EGY
193 UAR
183 UAR
195 EGY')
# Detect duplicates
dupes <- duplicated(athletes$athlete_id)
# Remove duplicates
athletes[!dupes, ]
# athlete_id NOC
# 1 183 UAR
# 2 184 EGY
# 3 185 EGY
# 4 186 UAR
# 5 187 EGY
# 6 188 EGY
# 7 189 SUD
# 9 191 EGY
# 10 192 EGY
# 11 193 UAR
# 13 195 EGY
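If you only want to flag the repeated ids rather than drop them, a small hedged follow-up on the same sample data:
# ids that occur more than once
unique(athletes$athlete_id[duplicated(athletes$athlete_id)])
# or every row belonging to one of those ids
athletes[athletes$athlete_id %in% athletes$athlete_id[duplicated(athletes$athlete_id)], ]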

How use docvars in Quanteda to create document variables from raw text

I am using Quanteda to analyze party platforms, which are txt files. I have combined two sets of party platforms into one corpus:
corp20dr <- corp20d + corp20r
summary(corp20dr)
Corpus consisting of 28 documents:
Text Types Tokens Sentences
akdem20.txt 1895 7624 332
azdem20.txt 908 2921 94
cadem20.txt 3255 19881 150
medem20.txt 355 863 39
.....................................
wvgop20.txt 1419 5013 106
wygop20.txt 428 1085 45
I would like to compare the Democratic (corp20d) and Republican (corp20r) platforms. But I seem to need to use docvars to make comparisons between the different groups (15 Dem, 13 GOP). When I use textplot_keyness, I expect to get a comparison between the groups, but the result draws the first text against all the other texts in the corpus.
corp20dr_dfm <- dfm(corpus(corp20dr),
remove = stopwords("english"), stem = TRUE, remove_numbers = TRUE,
remove_punct = TRUE)
corp20dr_dfm
Document-feature matrix of: 28 documents, 6,595 features (85.0% sparse).
result_keyness <- textstat_keyness(corp20dr_dfm)
textplot_keyness(result_keyness,
color = c('blue', 'red'))
The result is a comparison of the Alaska platform to the "reference", which seems to be the other 27 documents. I was hoping to compare differences in word usage between the two groups of platforms (the 15 Democratic platforms versus the 13 Republican platforms), but I seem to have to identify each group using docvars, and I am not sure how to do this. Any help would be appreciated.
By default the keyness function compares one document (the target) to all the others (the reference), so you should group the documents by their original corpus before calling textstat_keyness(). You can do this by using dfm_group() on a new docvar that identifies the corpus. See below for a reproducible example.
library("quanteda")
## Package version: 2.1.2
corp_a <- corpus(data_corpus_inaugural[1:5])
corp_b <- corpus(data_corpus_inaugural[6:10])
# this is the key: identifying the original corpus
# will be used to group the dfm later into just two combined "documents"
corp_a$source <- "a"
corp_b$source <- "b"
corp <- corp_a + corp_b
summary(corp)
## Corpus consisting of 10 documents, showing 10 documents:
##
## Text Types Tokens Sentences Year President FirstName
## 1789-Washington 625 1537 23 1789 Washington George
## 1793-Washington 96 147 4 1793 Washington George
## 1797-Adams 826 2577 37 1797 Adams John
## 1801-Jefferson 717 1923 41 1801 Jefferson Thomas
## 1805-Jefferson 804 2380 45 1805 Jefferson Thomas
## 1809-Madison 535 1261 21 1809 Madison James
## 1813-Madison 541 1302 33 1813 Madison James
## 1817-Monroe 1040 3677 121 1817 Monroe James
## 1821-Monroe 1259 4886 131 1821 Monroe James
## 1825-Adams 1003 3147 74 1825 Adams John Quincy
## Party source
## none a
## none a
## Federalist a
## Democratic-Republican a
## Democratic-Republican a
## Democratic-Republican b
## Democratic-Republican b
## Democratic-Republican b
## Democratic-Republican b
## Democratic-Republican b
Now we can go through the steps of forming the dfm, grouping, and getting the keyness statistics. (Here, I've removed stopwords and punctuation as well.)
# using the separate package since we are moving textstat_*() functions
# to this module package with quanteda v3 release planned in 2021
library("quanteda.textstats")
corp %>%
tokens(remove_punct = TRUE) %>%
tokens_remove(stopwords("en")) %>%
dfm() %>%
dfm_group(groups = "source") %>%
textstat_keyness() %>%
head()
## feature chi2 p n_target n_reference
## 1 love 11.236174 0.0008021834 10 1
## 2 mind 10.108762 0.0014756604 11 3
## 3 good 9.971163 0.0015901101 17 8
## 4 may 9.190508 0.0024327341 38 31
## 5 can 8.887529 0.0028712512 27 19
## 6 shall 7.728615 0.0054352433 23 16
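To get the plot the question was aiming for, the same grouped result can be passed to textplot_keyness(); with the quanteda v3 split that function lives in quanteda.textplots (in v2 it is still in quanteda itself). A hedged sketch:
library("quanteda.textplots")
tstat <- corp %>%
  tokens(remove_punct = TRUE) %>%
  tokens_remove(stopwords("en")) %>%
  dfm() %>%
  dfm_group(groups = "source") %>%
  textstat_keyness()
textplot_keyness(tstat, color = c("blue", "red"))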
