Using strsplit results in terms with quotation marks in R

I have a large data set that I imported from Excel, and I want to build a term frequency table for it. But when I use strsplit, the result includes quotation marks and other punctuation, which gives wrong results.
There must be some small error in the way I am using strsplit, but I am not able to figure it out myself.
df = read_excel("C:/Users/B M Consulting/Documents/Book2.xlsx", col_types=c("text","numeric"), range=cell_cols("A:B"))
vect <- c(df[1])
vectsplit <- strsplit(tolower(vect), "\\s+")
vectlev <- unique(unlist(vectsplit))
vecttermf <- sapply(vectsplit, function(x) table(factor(x, levels=vectlev)))
The output vect is something like this:
[1] "3 inch c clamp" "baby vice" "baby vice bench" "baby vise"
[5] "bench" "bench vice" "bench vice clamp" "bench vise"
[9] "bench voice" "bench wise" "bench wise heavy" "bench wise table"
[13] "box for tools" "c clamp" "c clamp set" "c clamps"
[17] "carpenter tools" "carpenter tools low price" "cast iron pipe" "clamp"
[21] "clamp set" "clamps woodworking" "g clamp" "g clamp set 3 inch"
I need to get each word out, but when I use strsplit, the result includes all the punctuation marks.
Below is a small section of the vectsplit that I get. It includes the inverted commas, backslashes and commas, which I don't want.
[1] "c(\"3" "inch" "c" "clamp\"," "\"baby" "vice\"," "\"baby" "vice"
[9] "bench\"," "\"baby" "vise\"," "\"bench\"," "\"bench" "vice\"," "\"bench" "vice"
[17] "clamp\"," "\"bench" "vise\"," "\"bench" "voice\"," "\"bench" "wise\"," "\"bench"
[25] "wise" "heavy\"," "\"bench" "wise" "table\"," "\"box" "for" "tools\","
[33] "\"c" "clamp\"," "\"c" "clamp" "set\"," "\"c" "clamps\"," "\"carpenter"
[41] "tools\"," "\"carpenter" "tools" "low" "price\"," "\"cast" "iron" "pipe\","

If you check the class of vect, you'll notice that it's not a character vector, but a list.
vect <- c(df[1])
class(vect)
# [1] "list"
If you define vect as below, the issue disappears:
vect <- df[[1]]
class(vect)
# [1] "character"
If you define vect as such and then use strsplit, it should work just fine. Keep in mind that different kinds of subsetting ([1] vs. [[1]]) will produce different classes of outputs.
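For completeness, a minimal sketch of the corrected pipeline (note that the whitespace pattern must be written "\\s+" in R source code, since "\s" is an invalid escape):

vect <- df[[1]]                                # [[ ]] extracts the column as a character vector
vectsplit <- strsplit(tolower(vect), "\\s+")   # split each phrase on runs of whitespace
vectlev <- unique(unlist(vectsplit))           # the unique terms
# term frequency across the whole column
vecttermf <- table(factor(unlist(vectsplit), levels = vectlev))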


Levels of a dataframe after filtering

I've been doing an assignment for self-study in R programming, and I have a question about what happens to factors in a data frame once you filter it. I have a data frame with the columns (movie) Studio and Genre.
For the assignment I need to filter it. I succeeded in this, but when I check the levels of the newly filtered columns, all the original factor levels are still present, not only the ones that survive the filter.
Why is this? Am I doing something wrong?
StudioTarget <- c("Buena Vista Studios","Fox","Paramount Pictures","Sony","Universal","WB")
GenreTarget <- c("action","adventure","animation","comedy","drama")
dftest <- df[df$Studio %in% StudioTarget & df$Genre %in% GenreTarget,]
> levels(dftest$Studio)
[1] "Art House Studios" "Buena Vista Studios" "Colombia Pictures"
[4] "Dimension Films" "Disney" "DreamWorks"
[7] "Fox" "Fox Searchlight Pictures" "Gramercy Pictures"
[10] "IFC" "Lionsgate" "Lionsgate Films"
[13] "Lionsgate/Summit" "MGM" "MiraMax"
[16] "New Line Cinema" "New Market Films" "Orion"
[19] "Pacific Data/DreamWorks" "Paramount Pictures" "Path_ Distribution"
[22] "Relativity Media" "Revolution Studios" "Screen Gems"
[25] "Sony" "Sony Picture Classics" "StudioCanal"
[28] "Summit Entertainment" "TriStar" "UA Entertainment"
[31] "Universal" "USA" "Vestron Pictures"
[34] "WB" "WB/New Line" "Weinstein Company"
You can do droplevels(dftest$Studio) to remove the unused levels.
No, you're not doing anything wrong. A factor defines a fixed number of levels. These levels remain the same even if one or more of them are not present in the data. You've asked for the levels of your factor, not the values present after filtering.
Consider:
library(tidyverse)
mtcars %>%
  mutate(cyl = as.factor(cyl)) %>%
  filter(cyl == 4) %>%
  distinct(cyl) %>%
  pull(cyl)
[1] 4
Levels: 4 6 8
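Building on the droplevels() suggestion in the comment above, a minimal sketch of the same pipeline with the unused levels dropped:

library(tidyverse)
mtcars %>%
  mutate(cyl = as.factor(cyl)) %>%
  filter(cyl == 4) %>%
  mutate(cyl = droplevels(cyl)) %>%   # drop the levels removed by the filter
  distinct(cyl) %>%
  pull(cyl)
# [1] 4
# Levels: 4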
Welcome to SO. Next time, please try to provide a minimum working example. This post will help you construct one.

gsub doesn't substitute names in column [duplicate]

This question already has answers here:
How do I deal with special characters like \^$.?*|+()[{ in my regex?
(2 answers)
Closed 4 years ago.
Using R, my data frame capstone3 with column Certificate...HQA has the following levels:
levels(capstone3$Certificate...HQA)
[1] "CUM LAUDE" "DIPLOM"
[3] "DOCTORATE" "GRADUATE DIPLOMA"
[5] "HIGHEST HONS" "HONOURS (DISTINCTION)"
[7] "HONOURS (HIGHEST DISTINCTION)" "HONS"
[9] "HONS I" "HONS II"
[11] "HONS II LOWER" "HONS II UPPER"
[13] "HONS III" "HONS UNCLASSIFIED"
[15] "HONS WITH MERIT" "MAGNA CUM LAUDE"
[17] "MASTER'S DEGREE" "OTHER HONS"
[19] "PASS DEGREE" "PASS WITH CREDIT"
[21] "PASS WITH DISTINCTION" "PASS WITH HIGH MERIT"
[23] "PASS WITH MERIT" "SUMMA CUM LAUDE"
I wrote code to reduce the number of levels by substituting level [7] with level [9], level [6] with level [12], etc.:
capstone3$Certificate...HQA <- as.factor(capstone3$Certificate...HQA)
capstone3$Certificate...HQA <- gsub("HONOURS (HIGHEST DISTINCTION)","HONS I", capstone3$Certificate...HQA)
capstone3$Certificate...HQA <- gsub("HONOURS (DISTINCTION)","HONS II UPPER", capstone3$Certificate...HQA)
capstone3$Certificate...HQA <- gsub("HONS WITH MERIT","HONS II LOWER", capstone3$Certificate...HQA)
But the above gsub code did not replace the names in the column. Could someone kindly point out the problem with my code?
Parentheses () are special characters in regular expressions, used to create groups. If you want to match literal parentheses, you need to escape them with \\:
gsub("HONOURS \\(HIGHEST DISTINCTION\\)","HONS I", capstone3$Certificate...HQA)
Or, as @ManuelBickel pointed out: with fixed = TRUE, the pattern is treated as a literal string and matched as is.
gsub("HONOURS (HIGHEST DISTINCTION)","HONS I", capstone3$Certificate...HQA, fixed = TRUE)

str_split on first and second occurrence of delimiter at different locations in character vector

I have a character vector of weather variable names, each followed by "mean_#" where # is a number between 5 and 10. I want to reduce the vector to just the weather variable names themselves. The mean weather variables look like this:
> mean_vars
[1] "dew_mean_10" "dew_mean_5" "dew_mean_6" "dew_mean_7"
[5] "dew_mean_8" "dew_mean_9" "humid_mean_10" "humid_mean_5"
[9] "humid_mean_6" "humid_mean_7" "humid_mean_8" "humid_mean_9"
[13] "rain_mean_10" "rain_mean_5" "rain_mean_6" "rain_mean_7"
[17] "rain_mean_8" "rain_mean_9" "soil_moist_mean_10" "soil_moist_mean_5"
[21] "soil_moist_mean_6" "soil_moist_mean_7" "soil_moist_mean_8" "soil_moist_mean_9"
[25] "soil_temp_mean_10" "soil_temp_mean_5" "soil_temp_mean_6" "soil_temp_mean_7"
[29] "soil_temp_mean_8" "soil_temp_mean_9" "solar_mean_10" "solar_mean_5"
[33] "solar_mean_6" "solar_mean_7" "solar_mean_8" "solar_mean_9"
[37] "temp_mean_10" "temp_mean_5" "temp_mean_6" "temp_mean_7"
[41] "temp_mean_8" "temp_mean_9" "wind_dir_mean_10" "wind_dir_mean_5"
[45] "wind_dir_mean_6" "wind_dir_mean_7" "wind_dir_mean_8" "wind_dir_mean_9"
[49] "wind_gust_mean_10" "wind_gust_mean_5" "wind_gust_mean_6" "wind_gust_mean_7"
[53] "wind_gust_mean_8" "wind_gust_mean_9" "wind_spd_mean_10" "wind_spd_mean_5"
[57] "wind_spd_mean_6" "wind_spd_mean_7" "wind_spd_mean_8" "wind_spd_mean_9"
And this is all I want at the end:
> var_names
"dew" "humid" "rain" "solar" "temp" "soil_moist" "soil_temp" "wind_dir" "wind_gust" "wind_spd"
Now, I figured out how to do it, but I feel my method is convoluted due to my lack of ability with regular expressions. I will also have to repeat the process 20 times, substituting "mean" with other words.
var_names <- unique(str_split_fixed(mean_vars, "_", n = 3)[c(1:18,31:42),1])
var_names <- unlist(c(var_names, unique(unite(as_tibble(str_split_fixed(mean_vars, "_", n = 3)[c(19:30,43:60), 1:2])))))
I've been trying to stay within the realm of the tidyverse packages as much as possible so I was using stringr::str_split_fixed.
If you have a solution using this same function that would be ideal as I could continue the same programming style, but I'm open to all suggestions.
Thanks.
Use sub and unique. This is shorter and has no package dependencies (or use unique(str_replace(mean_vars, "_mean.*", "")) with stringr):
unique(sub("_mean.*", "", mean_vars))
giving:
[1] "dew" "humid" "rain" "soil_moist" "soil_temp"
[6] "solar" "temp" "wind_dir" "wind_gust" "wind_spd"
If for some reason you really want to use str_split then:
rmMean <- function(x) paste(head(x, -2), collapse = "_")
unique(sapply(str_split(mean_vars, "_"), rmMean))
Note
mean_vars <- c("dew_mean_10", "dew_mean_5", "dew_mean_6", "dew_mean_7", "dew_mean_8",
"dew_mean_9", "humid_mean_10", "humid_mean_5", "humid_mean_6",
"humid_mean_7", "humid_mean_8", "humid_mean_9", "rain_mean_10",
"rain_mean_5", "rain_mean_6", "rain_mean_7", "rain_mean_8", "rain_mean_9",
"soil_moist_mean_10", "soil_moist_mean_5", "soil_moist_mean_6",
"soil_moist_mean_7", "soil_moist_mean_8", "soil_moist_mean_9",
"soil_temp_mean_10", "soil_temp_mean_5", "soil_temp_mean_6",
"soil_temp_mean_7", "soil_temp_mean_8", "soil_temp_mean_9", "solar_mean_10",
"solar_mean_5", "solar_mean_6", "solar_mean_7", "solar_mean_8",
"solar_mean_9", "temp_mean_10", "temp_mean_5", "temp_mean_6",
"temp_mean_7", "temp_mean_8", "temp_mean_9", "wind_dir_mean_10",
"wind_dir_mean_5", "wind_dir_mean_6", "wind_dir_mean_7", "wind_dir_mean_8",
"wind_dir_mean_9", "wind_gust_mean_10", "wind_gust_mean_5", "wind_gust_mean_6",
"wind_gust_mean_7", "wind_gust_mean_8", "wind_gust_mean_9", "wind_spd_mean_10",
"wind_spd_mean_5", "wind_spd_mean_6", "wind_spd_mean_7", "wind_spd_mean_8",
"wind_spd_mean_9")

removing special apostrophes from French article contractions when tokenizing

I am currently running an STM (structural topic model) on a series of articles from the French newspaper Le Monde. The model is working just great, but I have a problem with the pre-processing of the text.
I'm currently using the quanteda and tm packages for things like removing words and removing numbers.
There's only one thing, though, that doesn't seem to work.
As some of you might know, in French the masculine definite article -le- contracts to -l'- before vowels. I've tried to remove -l'- (and similar contractions like -d'-) as words with removeWords:
lmt67 <- removeWords(lmt67, c( "l'","d'","qu'il", "n'", "a", "dans"))
but it only works on words that stand alone in the text, not on articles attached to a word, as in -l'arbre- (the tree).
Frustrated, I've tried to give it a simple gsub
lmt67 <- gsub("l'","",lmt67)
but that doesn't seem to be working either.
Now, what's a better way to do this, and possibly through a c(...) vector so that I can give it a series of expressions all together?
Just as context, lmt67 is a "large character" vector with 30,000 elements/articles, obtained by using the texts() function on data imported from txt files.
Thanks to anyone that will want to help me.
I'll outline two ways to do this using quanteda and quanteda-related tools. First, let's define a slightly longer text, with more prefix cases for French. Notice the inclusion of the ’ apostrophe as well as the ASCII 39 simple apostrophe.
txt <- c(doc1 = "M. Trump, lors d’une réunion convoquée d’urgence à la Maison Blanche,
n’en a pas dit mot devant la presse. En réalité, il s’agit d’une
mesure essentiellement commerciale de ce pays qui l'importe.",
doc2 = "Réfugié à Bruxelles, l’indépendantiste catalan a désigné comme
successeur Jordi Sanchez, partisan de l’indépendance catalane,
actuellement en prison pour sédition.")
The first method uses pattern matches for the simple ASCII 39 apostrophe plus a set of Unicode variants, matched through the Unicode category "Pf" ("Punctuation: Final Quote"). Note that quanteda does its best to normalize the quotes at the tokenization stage; see, for instance, "l'indépendance" in the second document.
The second method uses a French part-of-speech tagger integrated with quanteda, which allows a similar selection after recognizing and separating the prefixes, and then removes determinants (among other parts of speech).
1. quanteda tokens
toks <- tokens(txt, remove_punct = TRUE)
# remove stopwords
toks <- tokens_remove(toks, stopwords("french"))
toks
# tokens from 2 documents.
# doc1 :
# [1] "M" "Trump" "lors" "d'une" "réunion"
# [6] "convoquée" "d'urgence" "à" "la" "Maison"
# [11] "Blanche" "n'en" "a" "pas" "dit"
# [16] "mot" "devant" "la" "presse" "En"
# [21] "réalité" "il" "s'agit" "d'une" "mesure"
# [26] "essentiellement" "commerciale" "de" "ce" "pays"
# [31] "qui" "l'importe"
#
# doc2 :
# [1] "Réfugié" "à" "Bruxelles" "l'indépendantiste"
# [5] "catalan" "a" "désigné" "comme"
# [9] "successeur" "Jordi" "Sanchez" "partisan"
# [13] "de" "l'indépendance" "catalane" "actuellement"
# [17] "en" "prison" "pour" "sédition"
Then, we apply the pattern to match l', d', or s', using a regular expression replacement on the types (the unique tokens):
toks <- tokens_replace(
  toks,
  types(toks),
  stringi::stri_replace_all_regex(types(toks), "[lsd]['\\p{Pf}]", "")
)
# tokens from 2 documents.
# doc1 :
# [1] "M" "Trump" "lors" "une" "réunion"
# [6] "convoquée" "urgence" "à" "la" "Maison"
# [11] "Blanche" "n'en" "a" "pas" "dit"
# [16] "mot" "devant" "la" "presse" "En"
# [21] "réalité" "il" "agit" "une" "mesure"
# [26] "essentiellement" "commerciale" "de" "ce" "pays"
# [31] "qui" "importe"
#
# doc2 :
# [1] "Réfugié" "à" "Bruxelles" "indépendantiste" "catalan"
# [6] "a" "désigné" "comme" "successeur" "Jordi"
# [11] "Sanchez" "partisan" "de" "indépendance" "catalane"
# [16] "actuellement" "En" "prison" "pour" "sédition"
From the resulting toks object you can form a dfm and then proceed to fit the STM.
2. using spacyr
This will involve more sophisticated part-of-speech tagging and then converting the tagged object into quanteda tokens. This requires first that you install Python, spacy, and the French language model. (See https://spacy.io/usage/models.)
library(spacyr)
spacy_initialize(model = "fr", python_executable = "/anaconda/bin/python")
# successfully initialized (spaCy Version: 2.0.1, language model: fr)
toks <- spacy_parse(txt, lemma = FALSE) %>%
  as.tokens(include_pos = "pos")
toks
# tokens from 2 documents.
# doc1 :
# [1] "M./NOUN" "Trump/PROPN" ",/PUNCT"
# [4] "lors/ADV" "d’/PUNCT" "une/DET"
# [7] "réunion/NOUN" "convoquée/VERB" "d’/ADP"
# [10] "urgence/NOUN" "à/ADP" "la/DET"
# [13] "Maison/PROPN" "Blanche/PROPN" ",/PUNCT"
# [16] "\n /SPACE" "n’/VERB" "en/PRON"
# [19] "a/AUX" "pas/ADV" "dit/VERB"
# [22] "mot/ADV" "devant/ADP" "la/DET"
# [25] "presse/NOUN" "./PUNCT" "En/ADP"
# [28] "réalité/NOUN" ",/PUNCT" "il/PRON"
# [31] "s’/AUX" "agit/VERB" "d’/ADP"
# [34] "une/DET" "\n /SPACE" "mesure/NOUN"
# [37] "essentiellement/ADV" "commerciale/ADJ" "de/ADP"
# [40] "ce/DET" "pays/NOUN" "qui/PRON"
# [43] "l'/DET" "importe/NOUN" "./PUNCT"
#
# doc2 :
# [1] "Réfugié/VERB" "à/ADP" "Bruxelles/PROPN"
# [4] ",/PUNCT" "l’/PRON" "indépendantiste/ADJ"
# [7] "catalan/VERB" "a/AUX" "désigné/VERB"
# [10] "comme/ADP" "\n /SPACE" "successeur/NOUN"
# [13] "Jordi/PROPN" "Sanchez/PROPN" ",/PUNCT"
# [16] "partisan/VERB" "de/ADP" "l’/DET"
# [19] "indépendance/ADJ" "catalane/ADJ" ",/PUNCT"
# [22] "\n /SPACE" "actuellement/ADV" "en/ADP"
# [25] "prison/NOUN" "pour/ADP" "sédition/NOUN"
# [28] "./PUNCT"
Then we can use the default glob-matching to remove the parts of speech in which we are probably not interested, including the newline:
toks <- tokens_remove(toks, c("*/DET", "*/PUNCT", "\n*", "*/ADP", "*/AUX", "*/PRON"))
toks
# doc1 :
# [1] "M./NOUN" "Trump/PROPN" "lors/ADV" "réunion/NOUN" "convoquée/VERB"
# [6] "urgence/NOUN" "Maison/PROPN" "Blanche/PROPN" "n’/VERB" "pas/ADV"
# [11] "dit/VERB" "mot/ADV" "presse/NOUN" "réalité/NOUN" "agit/VERB"
# [16] "mesure/NOUN" "essentiellement/ADV" "commerciale/ADJ" "pays/NOUN" "importe/NOUN"
#
# doc2 :
# [1] "Réfugié/VERB" "Bruxelles/PROPN" "indépendantiste/ADJ" "catalan/VERB" "désigné/VERB"
# [6] "successeur/NOUN" "Jordi/PROPN" "Sanchez/PROPN" "partisan/VERB" "indépendance/ADJ"
# [11] "catalane/ADJ" "actuellement/ADV" "prison/NOUN" "sédition/NOUN"
Then we can remove the tags, which you probably don't want in your STM - but you could leave them if you prefer.
## remove the tags
toks <- tokens_replace(toks, types(toks),
                       stringi::stri_replace_all_regex(types(toks), "/[A-Z]+$", ""))
toks
# tokens from 2 documents.
# doc1 :
# [1] "M." "Trump" "lors" "réunion" "convoquée"
# [6] "urgence" "Maison" "Blanche" "n’" "pas"
# [11] "dit" "mot" "presse" "réalité" "agit"
# [16] "mesure" "essentiellement" "commerciale" "pays" "importe"
#
# doc2 :
# [1] "Réfugié" "Bruxelles" "indépendantiste" "catalan" "désigné"
# [6] "successeur" "Jordi" "Sanchez" "partisan" "indépendance"
# [11] "catalane" "actuellement" "prison" "sédition"
From there, you can use the toks object to form your dfm and fit the model.
Here's a scrape from the current page at Le Monde's website. Notice that the apostrophe they use is not the same character as the single-quote here "'":
text <- "Réfugié à Bruxelles, l’indépendantiste catalan a désigné comme successeur Jordi Sanchez, partisan de l’indépendance catalane, actuellement en prison pour sédition."
It has a little angle and is not actually "straight down" when I view it. You need to copy that character into your gsub command:
sub("l’", "", text)
[1] "Réfugié à Bruxelles, indépendantiste catalan a désigné comme successeur Jordi Sanchez, partisan de l’indépendance catalane, actuellement en prison pour sédition."
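To handle both apostrophe characters and a set of contractions in one call, as the c(...) part of the question suggests, one option is a single regex with a character class. A sketch (the "\u2019" escape denotes the curly apostrophe; the prefix list is an assumption based on the question):

# match l', d', s', qu' or n' followed by either a straight or a curly
# apostrophe, and strip the prefix from the attached word
gsub("\\b(l|d|s|qu|n)['\u2019]", "", text)
# [1] "Réfugié à Bruxelles, indépendantiste catalan a désigné comme successeur Jordi Sanchez, partisan de indépendance catalane, actuellement en prison pour sédition."

Unlike the sub() call above, gsub() replaces every match, so both l’ prefixes are stripped.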

With R and XML, can an XPath 1.0 expression eliminate duplicates in the content returned?

When I extract content from the following URL, using XPath 1.0, the cities that are returned contain duplicates, starting with Birmingham. (The complete set of values returned is more than 140, so I have truncated it.) Is there a way with the XPath expression to avoid the duplicates?
require(XML)
doc <- htmlTreeParse("http://www.littler.com/locations", useInternal = TRUE)
xpathSApply(doc, "//div[#class = 'mm-location-usa']//a[position() < 12]", xmlValue, trim = TRUE)
[1] "Birmingham" "Mobile" "Anchorage" "Phoenix" "Fayetteville" "Fresno"
[7] "Irvine" "L.A. - Century City" "L.A. - Downtown" "Sacramento" "San Diego" "Birmingham"
[13] "Mobile" "Anchorage" "Phoenix" "Fayetteville" "Fresno" "Irvine"
[19] "L.A. - Century City" "L.A. - Downtown" "Sacramento" "San Diego"
Is there an XPath expression or work around along the lines of [not-duplicate()]?
Also, various [position() < X] permutations don't produce only the cities and only one instance of each. In fact, it's hard to figure out how positions are counted.
I would appreciate any guidance or finding out that the best I can do is limit the number of duplicates returned.
BTW, XPath result with duplicates is not the same problem, nor are the questions that pertain to duplicate nodes, e.g., How do I identify duplicate nodes in XPath 1.0 using an XPathNavigator to evaluate?
There is a function for this, it is called distinct-values(), but unfortunately, it is only available in XPath 2.0. In R, you are limited to XPath 1.0.
What you can do is
//div[@class = 'mm-location-usa']//a[position() < 12 and not(normalize-space(.) = normalize-space(following::a))]
What it does, in plain English:
Look for div elements, but only if their class attribute value equals "mm-location-usa". Then look for descendant a elements of those div elements, but only if the a element's position is less than 12 and its normalized text content is not equal to the normalized text content of any a element that follows.
But it is a computationally intensive approach and not the most elegant one. I recommend you take jlhoward's solution instead.
Can't you just do it this way??
require(XML)
doc <- htmlTreeParse("http://www.littler.com/locations", useInternal = TRUE)
xPath <- "//div[#class = 'mm-location-usa']//a[position() < 12]"
unique(xpathSApply(doc, xPath, xmlValue, trim = TRUE))
# [1] "Birmingham" "Mobile" "Anchorage"
# [4] "Phoenix" "Fayetteville" "Fresno"
# [7] "Irvine" "L.A. - Century City" "L.A. - Downtown"
# [10] "Sacramento" "San Diego"
Or, you can just create an XPath to process the li tags in the first div (since they are duplicate divs):
xpathSApply(doc, "//div[#id='lmblocks-mega-menu---locations'][1]/
div[#class='mm-location-usa']/
ul/
li[#class='mm-list-item']", xmlValue, trim = TRUE)
## [1] "Birmingham" "Mobile" "Anchorage"
## [4] "Phoenix" "Fayetteville" "Fresno"
## [7] "Irvine" "L.A. - Century City" "L.A. - Downtown"
## [10] "Sacramento" "San Diego" "San Francisco"
## [13] "San Jose" "Santa Maria" "Walnut Creek"
## [16] "Denver" "New Haven" "Washington, DC"
## [19] "Miami" "Orlando" "Atlanta"
## [22] "Chicago" "Indianapolis" "Overland Park"
## [25] "Lexington" "Boston" "Detroit"
## [28] "Minneapolis" "Kansas City" "St. Louis"
## [31] "Las Vegas" "Reno" "Newark"
## [34] "Albuquerque" "Long Island" "New York"
## [37] "Rochester" "Charlotte" "Cleveland"
## [40] "Columbus" "Portland" "Philadelphia"
## [43] "Pittsburgh" "San Juan" "Providence"
## [46] "Columbia" "Memphis" "Nashville"
## [49] "Dallas" "Houston" "Tysons Corner"
## [52] "Seattle" "Morgantown" "Milwaukee"
I made an assumption here that you're going after US locations.
