I am currently running an STM (structural topic model) on a series of articles from the French newspaper Le Monde. The model itself works just fine, but I have a problem with the pre-processing of the text.
I'm currently using the quanteda package and the tm package for steps like removing words, removing numbers, and so on.
There's only one thing, though, that doesn't seem to work.
As some of you might know, in French the masculine definite article -le- contracts to -l'- before a vowel. I've tried to remove -l'- (and similar forms such as -d'-) as words with removeWords:
lmt67 <- removeWords(lmt67, c("l'", "d'", "qu'il", "n'", "a", "dans"))
but it only works on words that stand alone as separate tokens, not on articles attached to a word, as in -l'arbre- (the tree).
Frustrated, I tried a simple gsub:
lmt67 <- gsub("l'","",lmt67)
but that doesn't seem to work either.
Now, what is a better way to do this, ideally through a c(...) vector so that I can supply a series of expressions all at once?
For context, lmt67 is a "large character" vector with 30,000 elements/articles, obtained by applying the texts() function to data imported from txt files.
Thanks to anyone who is willing to help me.
I'll outline two ways to do this using quanteda and quanteda-related tools. First, let's define a slightly longer text with more prefix cases for French. Notice that it includes both the ’ (right single quotation mark) and the ASCII 39 straight apostrophe.
txt <- c(doc1 = "M. Trump, lors d’une réunion convoquée d’urgence à la Maison Blanche,
n’en a pas dit mot devant la presse. En réalité, il s’agit d’une
mesure essentiellement commerciale de ce pays qui l'importe.",
doc2 = "Réfugié à Bruxelles, l’indépendantiste catalan a désigné comme
successeur Jordi Sanchez, partisan de l’indépendance catalane,
actuellement en prison pour sédition.")
The first method uses pattern matches for the simple ASCII 39 apostrophe plus its Unicode variants, matched through the Unicode category "Pf" ("Punctuation, Final quote"). Note that quanteda does its best to normalize the quotes at the tokenization stage anyway; see, for instance, "l'indépendance" in the second document below.
The second method uses a French part-of-speech tagger integrated with quanteda, which allows a similar selection after the prefixes have been recognized and separated, and then removes determiners (among other parts of speech).
1. quanteda tokens
toks <- tokens(txt, remove_punct = TRUE)
# remove stopwords
toks <- tokens_remove(toks, stopwords("french"))
toks
# tokens from 2 documents.
# doc1 :
# [1] "M" "Trump" "lors" "d'une" "réunion"
# [6] "convoquée" "d'urgence" "à" "la" "Maison"
# [11] "Blanche" "n'en" "a" "pas" "dit"
# [16] "mot" "devant" "la" "presse" "En"
# [21] "réalité" "il" "s'agit" "d'une" "mesure"
# [26] "essentiellement" "commerciale" "de" "ce" "pays"
# [31] "qui" "l'importe"
#
# doc2 :
# [1] "Réfugié" "à" "Bruxelles" "l'indépendantiste"
# [5] "catalan" "a" "désigné" "comme"
# [9] "successeur" "Jordi" "Sanchez" "partisan"
# [13] "de" "l'indépendance" "catalane" "actuellement"
# [17] "en" "prison" "pour" "sédition"
Then we apply a regular expression replacement on the types (the unique tokens) to strip l', d', or s':
toks <- tokens_replace(
    toks,
    types(toks),
    stringi::stri_replace_all_regex(types(toks), "[lsd]['\\p{Pf}]", "")
)
toks
# tokens from 2 documents.
# doc1 :
# [1] "M" "Trump" "lors" "une" "réunion"
# [6] "convoquée" "urgence" "à" "la" "Maison"
# [11] "Blanche" "n'en" "a" "pas" "dit"
# [16] "mot" "devant" "la" "presse" "En"
# [21] "réalité" "il" "agit" "une" "mesure"
# [26] "essentiellement" "commerciale" "de" "ce" "pays"
# [31] "qui" "importe"
#
# doc2 :
# [1] "Réfugié" "à" "Bruxelles" "indépendantiste" "catalan"
# [6] "a" "désigné" "comme" "successeur" "Jordi"
# [11] "Sanchez" "partisan" "de" "indépendance" "catalane"
# [16] "actuellement" "En" "prison" "pour" "sédition"
From the resulting toks object you can form a dfm and then proceed to fit the STM.
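As a minimal sketch of that last step (assuming the stm package is installed; K = 20 and the object names are only illustrative):
dfmat <- dfm(toks)
out <- quanteda::convert(dfmat, to = "stm")                      # list with documents, vocab, meta
mod <- stm::stm(out$documents, out$vocab, K = 20, data = out$meta)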
2. using spacyr
This will involve more sophisticated part-of-speech tagging and then converting the tagged object into quanteda tokens. It requires that you first install Python, spaCy, and the French language model. (See https://spacy.io/usage/models.)
library(spacyr)
spacy_initialize(model = "fr", python_executable = "/anaconda/bin/python")
# successfully initialized (spaCy Version: 2.0.1, language model: fr)
toks <- spacy_parse(txt, lemma = FALSE) %>%
as.tokens(include_pos = "pos")
toks
# tokens from 2 documents.
# doc1 :
# [1] "M./NOUN" "Trump/PROPN" ",/PUNCT"
# [4] "lors/ADV" "d’/PUNCT" "une/DET"
# [7] "réunion/NOUN" "convoquée/VERB" "d’/ADP"
# [10] "urgence/NOUN" "à/ADP" "la/DET"
# [13] "Maison/PROPN" "Blanche/PROPN" ",/PUNCT"
# [16] "\n /SPACE" "n’/VERB" "en/PRON"
# [19] "a/AUX" "pas/ADV" "dit/VERB"
# [22] "mot/ADV" "devant/ADP" "la/DET"
# [25] "presse/NOUN" "./PUNCT" "En/ADP"
# [28] "réalité/NOUN" ",/PUNCT" "il/PRON"
# [31] "s’/AUX" "agit/VERB" "d’/ADP"
# [34] "une/DET" "\n /SPACE" "mesure/NOUN"
# [37] "essentiellement/ADV" "commerciale/ADJ" "de/ADP"
# [40] "ce/DET" "pays/NOUN" "qui/PRON"
# [43] "l'/DET" "importe/NOUN" "./PUNCT"
#
# doc2 :
# [1] "Réfugié/VERB" "à/ADP" "Bruxelles/PROPN"
# [4] ",/PUNCT" "l’/PRON" "indépendantiste/ADJ"
# [7] "catalan/VERB" "a/AUX" "désigné/VERB"
# [10] "comme/ADP" "\n /SPACE" "successeur/NOUN"
# [13] "Jordi/PROPN" "Sanchez/PROPN" ",/PUNCT"
# [16] "partisan/VERB" "de/ADP" "l’/DET"
# [19] "indépendance/ADJ" "catalane/ADJ" ",/PUNCT"
# [22] "\n /SPACE" "actuellement/ADV" "en/ADP"
# [25] "prison/NOUN" "pour/ADP" "sédition/NOUN"
# [28] "./PUNCT"
Then we can use the default glob-matching to remove the parts of speech in which we are probably not interested, including the newline:
toks <- tokens_remove(toks, c("*/DET", "*/PUNCT", "\n*", "*/ADP", "*/AUX", "*/PRON"))
toks
# doc1 :
# [1] "M./NOUN" "Trump/PROPN" "lors/ADV" "réunion/NOUN" "convoquée/VERB"
# [6] "urgence/NOUN" "Maison/PROPN" "Blanche/PROPN" "n’/VERB" "pas/ADV"
# [11] "dit/VERB" "mot/ADV" "presse/NOUN" "réalité/NOUN" "agit/VERB"
# [16] "mesure/NOUN" "essentiellement/ADV" "commerciale/ADJ" "pays/NOUN" "importe/NOUN"
#
# doc2 :
# [1] "Réfugié/VERB" "Bruxelles/PROPN" "indépendantiste/ADJ" "catalan/VERB" "désigné/VERB"
# [6] "successeur/NOUN" "Jordi/PROPN" "Sanchez/PROPN" "partisan/VERB" "indépendance/ADJ"
# [11] "catalane/ADJ" "actuellement/ADV" "prison/NOUN" "sédition/NOUN"
Then we can remove the tags, which you probably don't want in your STM - but you could leave them if you prefer.
## remove the tags
toks <- tokens_replace(toks, types(toks),
stringi::stri_replace_all_regex(types(toks), "/[A-Z]+$", ""))
toks
# tokens from 2 documents.
# doc1 :
# [1] "M." "Trump" "lors" "réunion" "convoquée"
# [6] "urgence" "Maison" "Blanche" "n’" "pas"
# [11] "dit" "mot" "presse" "réalité" "agit"
# [16] "mesure" "essentiellement" "commerciale" "pays" "importe"
#
# doc2 :
# [1] "Réfugié" "Bruxelles" "indépendantiste" "catalan" "désigné"
# [6] "successeur" "Jordi" "Sanchez" "partisan" "indépendance"
# [11] "catalane" "actuellement" "prison" "sédition"
From there, you can use the toks object to form your dfm and fit the model, exactly as sketched at the end of method 1.
Here's a snippet scraped from the current page of Le Monde's website. Notice that the apostrophe it uses is not the same character as the straight single quote "'":
text <- "Réfugié à Bruxelles, l’indépendantiste catalan a désigné comme successeur Jordi Sanchez, partisan de l’indépendance catalane, actuellement en prison pour sédition."
It leans at a slight angle rather than pointing "straight down" when I view it. You need to copy that exact character into your gsub call (the sub() below replaces only the first match, just to illustrate):
sub("l’", "", text)
[1] "Réfugié à Bruxelles, indépendantiste catalan a désigné comme successeur Jordi Sanchez, partisan de l’indépendance catalane, actuellement en prison pour sédition."
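If you want to strip several elided prefixes at once from the raw text, one hedged sketch is to build a single alternation pattern that covers both apostrophe characters (the set of prefixes below is just an example, not an exhaustive list):
prefixes <- c("l", "d", "s", "n", "qu")                          # illustrative set of elided prefixes
pat <- paste0("\\b(", paste(prefixes, collapse = "|"), ")['’]")  # matches both apostrophe characters
lmt67 <- gsub(pat, "", lmt67)                                    # applied to the whole character vector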
Background:
I am scraping this website to obtain a list of all the people named under each section of the editorial board.
In total there are 6 sections, each beginning with a <b>...</b> part. (It should really be 5, but the code is a bit messy.)
My goal:
I want to get a list of all people per section (a list of 6 elements called people).
My approach:
I try to fetch all the text, i.e. text(), that follows each <b>...</b> tag.
However, with the following R code and XPath, I fail to get the correct list:
journal_url <- "https://aepi.biomedcentral.com/about/editorial-board"
webpage <- xml2::read_html(url(journal_url))
# get a list of 6 sections
all_sections <- rvest::html_nodes(webpage, css = '#editorialboard p')
# the following does not work properly
people <- lapply(all_sections, function(x) rvest::html_nodes(x, xpath = '//b/following-sibling::text()'))
The mistaken outcome:
Instead of giving me a list of 6 elements with the people per section, it gives me a list of 6 elements, each of which contains all the people from every section.
The expected outcome:
The expected output would start with:
people
[[1]]
[1] Shichuo Li
[[2]]
[1] Zhen Hong
[2] Hermann Stefan
[3] Dong Zhou
[[3]]
[1] Jie Mu
# etc etc
The double-forward-slash XPath // selects matching nodes anywhere in the whole document, even when the expression is evaluated against a single node. Use the current-node selector . instead:
people <- lapply(all_sections, function(x) {
rvest::html_nodes(x, xpath = './b/following-sibling::text()')
})
Output:
[[1]]
{xml_nodeset (1)}
[1] Shichuo Li,
[[2]]
{xml_nodeset (3)}
[1] Zhen Hong,
[2] Hermann Stefan,
[3] Dong Zhou,
[[3]]
{xml_nodeset (0)}
[[4]]
{xml_nodeset (1)}
[1] Jie Mu,
[[5]]
{xml_nodeset (2)}
[1] Bing Liang,
[2] Weijia Jiang,
[[6]]
{xml_nodeset (35)}
[1] Aye Mye Min Aye,
[2] Sándor Beniczky,
[3] Ingmar Blümcke,
[4] Martin J. Brodie,
[5] Eric Chan,
[6] Yanchun Deng,
[7] Ding Ding,
[8] Yuwu Jiang,
[9] Hennric Jokeit,
[10] Heung Dong Kim,
[11] Patrick Kwan,
[12] Byung In Lee,
[13] Weiping Liao,
[14] Xiaoyan Liu,
[15] Guoming Luan,
[16] Imad M. Najm,
[17] Terence O'Brien,
[18] Jiong Qin,
[19] Markus Reuber,
[20] Ley J.W. Sander,
...
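If you would rather end up with plain character vectors than xml_nodesets, a small follow-up sketch (trimming whitespace and the trailing commas that appear in the page text):
people_names <- lapply(people, function(x) {
  trimws(sub(",\\s*$", "", rvest::html_text(x)))   # drop the trailing comma from each name
})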
For some reason, stopword removal is not working for my corpus, which is entirely in French. I've been trying repeatedly over the past few days, but many words that should have been filtered out simply are not. I am not sure if anyone else has had a similar issue? I read somewhere that it could be because of the accents. I tried stringi::stri_trans_general(x, "Latin-ASCII") but I am not certain I did it correctly. Also, I notice that the French stopwords are sometimes referred to as "french" and sometimes as "fr".
This is one example of the code I tried; I would be extremely grateful for any advice.
I also installed quanteda manually because I had difficulties downloading it, so the problem could be linked to that.
text_corp <- quanteda::corpus(data, text_field = "text")
head(stopwords("french"))
summary(text_corp)
my_dfm <- dfm(text_corp)
myStemMat <- dfm(text_corp, remove = stopwords("french"), stem = TRUE,
                 remove_punct = TRUE, remove_numbers = TRUE, remove_separators = TRUE)
myStemMat[, 1:5]
topfeatures(myStemMat, 20)
In this last step, the results still contain words like "etre" (to be), "plus" (more), "comme" (like), "avant" (before), and "avoir" (to have).
I also tried to filter stop words in a different way, through token creation:
tokens <- tokens(
  text_corp,
  remove_numbers = TRUE,
  remove_punct = TRUE,
  remove_symbols = TRUE,
  remove_url = TRUE,
  split_hyphens = TRUE,
  include_docvars = TRUE
)
mydfm <- dfm(tokens,
tolower = TRUE,
stem = TRUE,
remove = stopwords("french")
)
topfeatures(mydfm, 20)
The stopword removal is working just fine; the default Snowball list of French stopwords simply does not include the words you wish to remove.
You can see that by inspecting the vector of stopwords returned by stopwords("fr"):
library("quanteda")
## Package version: 2.1.2
c("comme", "avoir", "plus", "avant", "être") %in%
stopwords("fr")
## [1] FALSE FALSE FALSE FALSE FALSE
This is the full list of words:
sort(stopwords("fr"))
## [1] "à" "ai" "aie" "aient" "aies" "ait"
## [7] "as" "au" "aura" "aurai" "auraient" "aurais"
## [13] "aurait" "auras" "aurez" "auriez" "aurions" "aurons"
## [19] "auront" "aux" "avaient" "avais" "avait" "avec"
## [25] "avez" "aviez" "avions" "avons" "ayant" "ayez"
## [31] "ayons" "c" "ce" "ceci" "cela" "celà"
## [37] "ces" "cet" "cette" "d" "dans" "de"
## [43] "des" "du" "elle" "en" "es" "est"
## [49] "et" "étaient" "étais" "était" "étant" "été"
## [55] "étée" "étées" "étés" "êtes" "étiez" "étions"
## [61] "eu" "eue" "eues" "eûmes" "eurent" "eus"
## [67] "eusse" "eussent" "eusses" "eussiez" "eussions" "eut"
## [73] "eût" "eûtes" "eux" "fûmes" "furent" "fus"
## [79] "fusse" "fussent" "fusses" "fussiez" "fussions" "fut"
## [85] "fût" "fûtes" "ici" "il" "ils" "j"
## [91] "je" "l" "la" "le" "les" "leur"
## [97] "leurs" "lui" "m" "ma" "mais" "me"
## [103] "même" "mes" "moi" "mon" "n" "ne"
## [109] "nos" "notre" "nous" "on" "ont" "ou"
## [115] "par" "pas" "pour" "qu" "que" "quel"
## [121] "quelle" "quelles" "quels" "qui" "s" "sa"
## [127] "sans" "se" "sera" "serai" "seraient" "serais"
## [133] "serait" "seras" "serez" "seriez" "serions" "serons"
## [139] "seront" "ses" "soi" "soient" "sois" "soit"
## [145] "sommes" "son" "sont" "soyez" "soyons" "suis"
## [151] "sur" "t" "ta" "te" "tes" "toi"
## [157] "ton" "tu" "un" "une" "vos" "votre"
## [163] "vous" "y"
That's why they are not removed. We can see this with an example I created, using many of your words:
toks <- tokens("Je veux avoir une glace et être heureux, comme un enfant avant le dîner.",
remove_punct = TRUE
)
tokens_remove(toks, stopwords("fr"))
## Tokens consisting of 1 document.
## text1 :
## [1] "veux" "avoir" "glace" "être" "heureux" "comme" "enfant"
## [8] "avant" "dîner"
How to remove them? Either use a more complete list of stopwords, or customize the Snowball list by appending the stopwords you want to the existing ones.
mystopwords <- c(stopwords("fr"), "comme", "avoir", "plus", "avant", "être")
tokens_remove(toks, mystopwords)
## Tokens consisting of 1 document.
## text1 :
## [1] "veux" "glace" "heureux" "enfant" "dîner"
You could also use one of the other stopword sources, such as the "stopwords-iso", which does contain all of the words you wish to remove:
c("comme", "avoir", "plus", "avant", "être") %in%
stopwords("fr", source = "stopwords-iso")
## [1] TRUE TRUE TRUE TRUE TRUE
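If you go that route, you can plug the larger list straight into tokens_remove(); for example, with the toks object from above:
tokens_remove(toks, stopwords("fr", source = "stopwords-iso"))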
With regard to the language question, see the help for ?stopwords::stopwords, which states:
The language codes for each stopword list use the two-letter ISO code from https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes. For backwards compatibility, the full English names of the stopwords from the quanteda package may also be used, although these are deprecated.
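So "fr" and "french" should point at the same Snowball list; a quick way to reassure yourself (I would expect TRUE here):
identical(stopwords("fr"), stopwords("french"))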
With regard to what you tried with stringi::stri_trans_general(x, "Latin-ASCII"): this would only help if you wanted to remove "etre" while your stopword list contained only "être". In the example below, the stopword vector containing the accented form is concatenated with a copy of itself in which the accents have been removed.
sw <- "être"
tokens("etre être heureux") %>%
tokens_remove(sw)
## Tokens consisting of 1 document.
## text1 :
## [1] "etre" "heureux"
tokens("etre être heureux") %>%
tokens_remove(c(sw, stringi::stri_trans_general(sw, "Latin-ASCII")))
## Tokens consisting of 1 document.
## text1 :
## [1] "heureux"
c(sw, stringi::stri_trans_general(sw, "Latin-ASCII"))
## [1] "être" "etre"
I have the following XML page that I need to parse using xml2.
However, with this code, I cannot get the nodes under the subcellularLocation XPath:
library(xml2)
xmlfile <- "https://www.uniprot.org/uniprot/P09429.xml"
doc <- xmlfile %>%
xml2::read_xml()
xml_name(doc)
xml_children(doc)
x <- xml_find_all(doc, "//subcellularLocation")
xml_path(x)
# character(0)
What is the right way to do it?
Update
The desired output is a vector:
[1] "Nucleus"
[2] "Chromosome"
[3] "Cytoplasm"
[4] "Secreted"
[5] "Cell membrane"
[6] "Peripheral membrane protein"
[7] "Extracellular side"
[8] "Endosome"
[9] "Endoplasmic reticulum-Golgi intermediate compartment"
Use x <- xml_find_all(doc, "//d1:subcellularLocation")
Whenever you hit a troublesome problem, checking the documentation is the first thing to do. Run ?xml_find_all and you will see this (at the end of the help page):
# Namespaces ---------------------------------------------------------------
# If the document uses namespaces, you'll need use xml_ns to form
# a unique mapping between full namespace url and a short prefix
x <- read_xml('
<root xmlns:f = "http://foo.com" xmlns:g = "http://bar.com">
<f:doc><g:baz /></f:doc>
<f:doc><g:baz /></f:doc>
</root>
')
xml_find_all(x, ".//f:doc")
xml_find_all(x, ".//f:doc", xml_ns(x))
So you then go to check xml_ns(doc) and find
d1 <-> http://uniprot.org/uniprot
xsi <-> http://www.w3.org/2001/XMLSchema-instance
Update
xml_find_all(doc, "//d1:subcellularLocation") %>%
  xml_children() %>%
  xml_text()
## [1] "Nucleus"
## [2] "Chromosome"
## [3] "Cytoplasm"
## [4] "Secreted"
## [5] "Cell membrane"
## [6] "Peripheral membrane protein"
## [7] "Extracellular side"
## [8] "Endosome"
## [9] "Endoplasmic reticulum-Golgi intermediate compartment"ent"
If you don't mind, you can use the rvest package:
library(rvest)
a <- read_html(xmlfile) %>%
  html_nodes("subcellularlocation")
a %>% html_children() %>% html_text()
[1] "Nucleus" "Chromosome"
[3] "Cytoplasm" "Secreted"
[5] "Cell membrane" "Peripheral membrane protein"
[7] "Extracellular side" "Endosome"
[9] "Endoplasmic reticulum-Golgi intermediate compartment"
I'm working with the following dataset, which contains average temperatures in each of the 32 states of Mexico.
library(data.table)
# Read data from website
col.names <- c('ENTIDAD', 'ANYO', 'ENERO', 'FEBRERO', 'MARZO', 'ABRIL', 'MAYO', 'JUNIO',
'JULIO', 'AGOSTO', 'SEPTIEMBRE', 'OCTUBRE', 'NOVIEMBRE', 'DICIEMBRE', 'UNIDAD')
temperature <- fread('http://201.116.60.46/DatosAbiertos/Temperatura_promedio.csv',
col.names = col.names)
The column ENTIDAD has the 32 names of the states. However, all the names appear in capital letters, and strange escape codes appear in place of the letters that should carry accents:
unique(temperature$ENTIDAD)
[1] "AGUASCALIENTES" "BAJA CALIFORNIA"
[3] "BAJA CALIFORNIA SUR" "CAMPECHE"
[5] "COAHUILA DE ZARAGOZA" "COLIMA"
[7] "CHIAPAS" "CHIHUAHUA"
[9] "DISTRITO FEDERAL" "DURANGO"
[11] "GUANAJUATO" "GUERRERO"
[13] "HIDALGO" "JALISCO"
[15] "M\311XICO" "MICHOAC\301N DE OCAMPO"
[17] "MORELOS" "NAYARIT"
[19] "NUEVO LE\323N" "OAXACA"
[21] "PUEBLA" "QUER\311TARO"
[23] "QUINTANA ROO" "SAN LUIS POTOS\315"
[25] "SINALOA" "SONORA"
[27] "TABASCO" "TAMAULIPAS"
[29] "TLAXCALA" "VERACRUZ DE IGNACIO DE LA LLAVE"
[31] "YUCAT\301N" "ZACATECAS"
Is there a simple way to replace each of these with the following strings?
states <- c('Aguascalientes',
'Baja California',
'Baja California Sur',
'Campeche',
'Chiapas',
'Chihuahua',
'Coahuila',
'Colima',
'DF',
'Durango',
'Guanajuato',
'Guerrero',
'Hidalgo',
'Jalisco',
'Michoacan',
'Morelos',
'Mexico',
'Nayarit',
'Nuevo Leon',
'Oaxaca',
'Puebla',
'Queretaro',
'Quintana Roo',
'San Luis Potosi',
'Sinaloa',
'Sonora',
'Tabasco',
'Tamaulipas',
'Tlaxcala',
'Veracruz',
'Yucatan',
'Zacatecas')
It appears that you already have the replacement names you want to map the values in unique(temperature$ENTIDAD) to.
If so, you can use mapvalues() from the plyr package to change the names:
library(plyr)
temperature$ENTIDAD <- mapvalues(temperature$ENTIDAD, from = unique(temperature$ENTIDAD), to = states)
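Because mapvalues() pairs from and to element by element, it is worth eyeballing the pairing before overwriting the column; as shown in the question, the two vectors do not appear to be in the same order (for instance "COAHUILA DE ZARAGOZA" is fifth in unique(temperature$ENTIDAD) while "Coahuila" is seventh in states). A quick check:
stopifnot(length(unique(temperature$ENTIDAD)) == length(states))
data.frame(from = unique(temperature$ENTIDAD), to = states)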
I think this will solve your problem:
temperature <- fread('http://201.116.60.46/DatosAbiertos/Temperatura_promedio.csv',
col.names = col.names, encoding = "Latin-1")
You can set the encoding (probably more conveniently via fread) and use tolower() for lower case:
x <- temperature$ENTIDAD
Encoding(x) <- "latin1"
# might also want to convert to utf8
# x <- iconv(x, "latin1", "UTF-8")
cbind(x, tolower(x))
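If you also want accent-free, mixed-case names rather than the raw upper-case ones, one hedged sketch (the transliteration step and tools::toTitleCase are my choices here; this keeps the long official names, e.g. it will not shorten "Distrito Federal" to "DF"):
x <- temperature$ENTIDAD
Encoding(x) <- "latin1"
x <- stringi::stri_trans_general(x, "Latin-ASCII")   # e.g. "M\311XICO" becomes "MEXICO"
temperature$ENTIDAD <- tools::toTitleCase(tolower(x))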
When I extract content from the following URL using XPath 1.0, the cities that are returned contain duplicates, starting again with Birmingham. (The complete set of values returned is more than 140, so I have truncated it.) Is there a way, within the XPath expression, to avoid the duplicates?
require(XML)
doc <- htmlTreeParse("http://www.littler.com/locations", useInternal = TRUE)
xpathSApply(doc, "//div[@class = 'mm-location-usa']//a[position() < 12]", xmlValue, trim = TRUE)
[1] "Birmingham" "Mobile" "Anchorage" "Phoenix" "Fayetteville" "Fresno"
[7] "Irvine" "L.A. - Century City" "L.A. - Downtown" "Sacramento" "San Diego" "Birmingham"
[13] "Mobile" "Anchorage" "Phoenix" "Fayetteville" "Fresno" "Irvine"
[19] "L.A. - Century City" "L.A. - Downtown" "Sacramento" "San Diego"
Is there an XPath expression or work around along the lines of [not-duplicate()]?
Also, various [position() < X] permutations don't produce only the cities, let alone only one instance of each. In fact, it's hard to work out how positions are counted.
I would appreciate any guidance or finding out that the best I can do is limit the number of duplicates returned.
BTW, "XPath result with duplicates" is not the same problem, nor are the questions that pertain to duplicate nodes, e.g., "How do I identify duplicate nodes in XPath 1.0 using an XPathNavigator to evaluate?"
There is a function for this, distinct-values(), but unfortunately it is only available in XPath 2.0; in R you are limited to XPath 1.0.
What you can do is
//div[@class = 'mm-location-usa']//a[position() < 12 and not(normalize-space(.) = normalize-space(following::a))]
What it does, in plain English:
Look for div elements, but only if their class attribute equals "mm-location-usa". Then look for descendant a elements of those divs, but only if the a element's position is less than 12 and its normalized text content does not equal the normalized text content of any a element that follows it.
But it is a computationally intensive approach and not the most elegant one. I recommend you go with jlhoward's solution.
Can't you just do it this way??
require(XML)
doc <- htmlTreeParse("http://www.littler.com/locations", useInternal = TRUE)
xPath <- "//div[#class = 'mm-location-usa']//a[position() < 12]"
unique(xpathSApply(doc, xPath, xmlValue, trim = TRUE))
# [1] "Birmingham" "Mobile" "Anchorage"
# [4] "Phoenix" "Fayetteville" "Fresno"
# [7] "Irvine" "L.A. - Century City" "L.A. - Downtown"
# [10] "Sacramento" "San Diego"
Or, you can just create an XPath to process the li tags in the first div (since they are duplicate divs):
xpathSApply(doc, "//div[#id='lmblocks-mega-menu---locations'][1]/
div[#class='mm-location-usa']/
ul/
li[#class='mm-list-item']", xmlValue, trim = TRUE)
## [1] "Birmingham" "Mobile" "Anchorage"
## [4] "Phoenix" "Fayetteville" "Fresno"
## [7] "Irvine" "L.A. - Century City" "L.A. - Downtown"
## [10] "Sacramento" "San Diego" "San Francisco"
## [13] "San Jose" "Santa Maria" "Walnut Creek"
## [16] "Denver" "New Haven" "Washington, DC"
## [19] "Miami" "Orlando" "Atlanta"
## [22] "Chicago" "Indianapolis" "Overland Park"
## [25] "Lexington" "Boston" "Detroit"
## [28] "Minneapolis" "Kansas City" "St. Louis"
## [31] "Las Vegas" "Reno" "Newark"
## [34] "Albuquerque" "Long Island" "New York"
## [37] "Rochester" "Charlotte" "Cleveland"
## [40] "Columbus" "Portland" "Philadelphia"
## [43] "Pittsburgh" "San Juan" "Providence"
## [46] "Columbia" "Memphis" "Nashville"
## [49] "Dallas" "Houston" "Tysons Corner"
## [52] "Seattle" "Morgantown" "Milwaukee"
I made an assumption here that you're going after US locations.