I am trying to rename all of these atrocious column names in a data frame I received from a government agency.
> colnames(thedata)
[1] "Region" "Resource Assessment Site ID"
[3] "Site Name/Facility" "Design Head (feet)"
[5] "Design Flow (cfs)" "Installed Capacity (kW)"
[7] "Annual Production (MWh)" "Plant Factor"
[9] "Total Construction Cost (1,000 $)" "Annual O&M Cost (1,000 $)"
[11] "Cost per Installed Capacity ($/kW)" "Benefit Cost Ratio with Green Incentives"
[13] "IRR with Green Incentives" "Benefit Cost Ratio without Green Incentives"
[15] "IRR without Green Incentives"
The column headers contain spaces and special non-alphanumeric characters, which makes referring to them impractical, so I have to rename them. I would like to replace all non-alphanumeric characters with a period. But I tried:
old.col.names <- colnames(thedata)
new.col.names <- gsub("^a-z0-9", ".", old.col.names)
The ^ is a "not" operator, so I thought it would replace everything that is not alphanumeric in old.col.names with a period.
Can anyone help?
Your pattern is missing the square brackets that define a character class: ^ only means "not" inside [...], so the pattern should be "[^a-z0-9]". Here are three options to consider:
make.names(x)
gsub("[^A-Za-z0-9]", ".", x)
names(janitor::clean_names(setNames(data.frame(matrix(NA, ncol = length(x))), x)))
Here's what each looks like:
make.names(x)
## [1] "Region" "Resource.Assessment.Site.ID"
## [3] "Site.Name.Facility" "Design.Head..feet."
## [5] "Design.Flow..cfs." "Installed.Capacity..kW."
## [7] "Annual.Production..MWh." "Plant.Factor"
## [9] "Total.Construction.Cost..1.000..." "Annual.O.M.Cost..1.000..."
## [11] "Cost.per.Installed.Capacity....kW." "Benefit.Cost.Ratio.with.Green.Incentives"
## [13] "IRR.with.Green.Incentives" "Benefit.Cost.Ratio.without.Green.Incentives"
## [15] "IRR.without.Green.Incentives"
gsub("[^A-Za-z0-9]", ".", x)
## [1] "Region" "Resource.Assessment.Site.ID"
## [3] "Site.Name.Facility" "Design.Head..feet."
## [5] "Design.Flow..cfs." "Installed.Capacity..kW."
## [7] "Annual.Production..MWh." "Plant.Factor"
## [9] "Total.Construction.Cost..1.000..." "Annual.O.M.Cost..1.000..."
## [11] "Cost.per.Installed.Capacity....kW." "Benefit.Cost.Ratio.with.Green.Incentives"
## [13] "IRR.with.Green.Incentives" "Benefit.Cost.Ratio.without.Green.Incentives"
## [15] "IRR.without.Green.Incentives"
library(janitor)
names(clean_names(setNames(data.frame(matrix(NA, ncol = length(x))), x)))
## [1] "region" "resource_assessment_site_id"
## [3] "site_name_facility" "design_head_feet"
## [5] "design_flow_cfs" "installed_capacity_kw"
## [7] "annual_production_mwh" "plant_factor"
## [9] "total_construction_cost_1_000" "annual_o_m_cost_1_000"
## [11] "cost_per_installed_capacity_kw" "benefit_cost_ratio_with_green_incentives"
## [13] "irr_with_green_incentives" "benefit_cost_ratio_without_green_incentives"
## [15] "irr_without_green_incentives"
Sample data:
x <- c("Region", "Resource Assessment Site ID", "Site Name/Facility",
"Design Head (feet)", "Design Flow (cfs)", "Installed Capacity (kW)",
"Annual Production (MWh)", "Plant Factor", "Total Construction Cost (1,000 $)",
"Annual O&M Cost (1,000 $)", "Cost per Installed Capacity ($/kW)",
"Benefit Cost Ratio with Green Incentives", "IRR with Green Incentives",
"Benefit Cost Ratio without Green Incentives", "IRR without Green Incentives")
I'm trying to scrape some data from the following Wikipedia table:
Link: https://en.wikipedia.org/wiki/Aire-la-Ville
I am using this code to scrape Area, Elevation, and Density using CSS selectors. I am storing the data in canton_table, but I only get the elevation data, not the other variables.
My code:
# Get labels and data
library(rvest)
labels <- current_html %>% html_elements(css = ".infobox-label") %>% html_text()
data <- current_html %>% html_elements(css = ".infobox-data") %>% html_text()
Output for labels and data variables:
> labels
[1] "Country" "Canton" "District"
[4] " • Mayor" " • Total" "Elevation"
[7] " • Total" " • Density" "Time zone"
[10] " • Summer (DST)" "Postal code(s)" "SFOS number"
[13] "Surrounded by" "Website"
> data
[1] "Switzerland"
[2] "Geneva"
[3] "n.a."
[4] "MaireRaymond Gavillet"
[5] "6.50 km2 (2.51 sq mi)"
[6] "428 m (1,404 ft)"
[7] "11,609"
[8] "1,800/km2 (4,600/sq mi)"
[9] "UTC+01:00 (Central European Time)"
[10] "UTC+02:00 (Central European Summer Time)"
[11] "1234,1255"
[12] "6645"
[13] "Bossey (FR-74), Carouge, Chêne-Bougeries, Étrembières (FR-74), Gaillard (FR-74), Geneva (Genève), Plan-les-Ouates, Thônex, Troinex"
[14] "www.veyrier.ch SFSO statistics"
I am only able to populate the table with the elevation data, not area and density. Please help. Thanks!
# Clean text and store in data frame
canton_table[canton_table$name == current_name, "area"] <- helper_function(" • Total", labels, data)
canton_table[canton_table$name == current_name, "elevation"] <- helper_function("Elevation", labels, data)
canton_table[canton_table$name == current_name, "density"] <- helper_function(" • Density", labels, data)
My output table: (screenshot not shown)
You have to change the label names you pass in, from " • Total" to "Total", etc.
Labels like " • Total" (with the leading bullet and spaces) are probably causing the matching problems.
Then build the table:
canton_table[canton_table$name == current_name, "area"] <- helper_function("Total", labels, data)
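For reference, here is a minimal sketch of what a partial-matching helper_function could look like (hypothetical, since the original implementation was not shown); matching with grepl() instead of exact equality sidesteps the bullet and non-breaking-space characters:
helper_function <- function(label, labels, data) {
  idx <- which(grepl(label, labels, fixed = TRUE))
  if (length(idx) == 0) return(NA_character_)
  trimws(data[idx[1]])  # first match; adjust the index if the label repeats (e.g. "Total")
}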
I would like to scrape the keywords inside the drop-down table of this webpage: https://www.aeaweb.org/jel/guide/jel.php
The problem is that the drop-down menu of each item prevents me from scraping the table directly, because the call below only picks up the headings and not the inner content of each item.
rvest::read_html("https://www.aeaweb.org/jel/guide/jel.php") %>%
rvest::html_table()
I thought of scraping each line that starts with Keywords:, but I don't see how to do that. It seems the HTML does not expose the items inside the table.
An RSelenium solution:
# Start the server
library(RSelenium)
driver <- rsDriver(browser = "firefox")
remDr <- driver[["client"]]
# Navigate to the url
remDr$navigate("https://www.aeaweb.org/jel/guide/jel.php")
# xpath of the table
out <- remDr$findElement(using = "xpath", '/html/body/main/div/section/div[4]')
# Get text from the table
out <- out$getElementText()
out <- out[[1]]
Split it using the stringr package:
library(stringr)
str_split(out, "\n", n = Inf, simplify = FALSE)
[[1]]
[1] "A General Economics and Teaching"
[2] "B History of Economic Thought, Methodology, and Heterodox Approaches"
[3] "C Mathematical and Quantitative Methods"
[4] "D Microeconomics"
[5] "E Macroeconomics and Monetary Economics"
[6] "F International Economics"
[7] "G Financial Economics"
[8] "H Public Economics"
[9] "I Health, Education, and Welfare"
[10] "J Labor and Demographic Economics"
[11] "K Law and Economics"
[12] "L Industrial Organization"
[13] "M Business Administration and Business Economics; Marketing; Accounting; Personnel Economics"
[14] "N Economic History"
[15] "O Economic Development, Innovation, Technological Change, and Growth"
[16] "P Economic Systems"
[17] "Q Agricultural and Natural Resource Economics; Environmental and Ecological Economics"
[18] "R Urban, Rural, Regional, Real Estate, and Transportation Economics"
[19] "Y Miscellaneous Categories"
[20] "Z Other Special Topics"
To get the Keywords for History of Economic Thought, Methodology, and Heterodox Approaches:
out1 <- remDr$findElement(using = 'xpath', value = '//*[@id="cl_B"]')
out1$clickElement()
out1 <- remDr$findElement(using = 'xpath', value = '/html/body/main/div/section/div[4]/div[2]/div[2]/div/div/div/div[2]')
out1$getElementText()
[[1]]
[1] "Keywords: History of Economic Thought"
I want to perform the Apriori method on the Cuisines column of my dataset.
Cuisines column sample:
[4] Japanese, Sushi
[5] Japanese, Korean
[6] Chinese
[7] Asian, European
[8] Seafood, Filipino, Asian, European
[9] European, Asian, Indian
[10] Filipino
[11] Filipino, Mexican
My code:
install.packages("arules")
library("arules")
itemsets <- apriori(dataSet$Cuisines, parameter=list(support=0.02, minlen=1, maxlen=1, target="frequent itemsets"))
However I keep getting:
no method or default for coercing “factor” to “transactions”
What is going wrong here?
Is it illogical to use the Apriori method on this column of my dataset?
If so, what type of column should I use the apriori method on?
You have to convert your data to the transactions type:
dats <- strsplit(as.character(dats$Cuisines), ',', fixed = TRUE) # split by comma
trans <- as(dats, "transactions")
inspect(trans)
items
[1] { Sushi,Japanese}
[2] { Korean,Japanese}
[3] {Chinese}
[4] { European,Asian}
[5] { Asian, European, Filipino,Seafood}
[6] { Asian, Indian,European}
[7] {Filipino}
[8] { Mexican,Filipino}
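Note the leading spaces in some item names above (" Sushi", " Asian"): splitting on a bare comma keeps the space after it, so "Asian" and " Asian" would count as different items. Trimming first avoids that:
dats <- strsplit(as.character(dats$Cuisines), ',', fixed = TRUE)
dats <- lapply(dats, trimws)  # strip whitespace around each item
trans <- as(dats, "transactions")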
Then you can run apriori:
itemsets <- apriori(trans, parameter=list(support=0.3))
inspect(itemsets)
With data:
dats <- read.table(text =" Cuisines
[1] 'Japanese, Sushi'
[2] 'Japanese, Korean'
[3] 'Chinese'
[4] 'Asian, European'
[5] 'Seafood, Filipino, Asian, European'
[6] 'European, Asian, Indian'
[7] 'Filipino'
[8] 'Filipino, Mexican' ", header = T)
Using R, my data frame capstone3 with column Certificate...HQA has the following levels:
levels(capstone3$Certificate...HQA)
[1] "CUM LAUDE" "DIPLOM"
[3] "DOCTORATE" "GRADUATE DIPLOMA"
[5] "HIGHEST HONS" "HONOURS (DISTINCTION)"
[7] "HONOURS (HIGHEST DISTINCTION)" "HONS"
[9] "HONS I" "HONS II"
[11] "HONS II LOWER" "HONS II UPPER"
[13] "HONS III" "HONS UNCLASSIFIED"
[15] "HONS WITH MERIT" "MAGNA CUM LAUDE"
[17] "MASTER'S DEGREE" "OTHER HONS"
[19] "PASS DEGREE" "PASS WITH CREDIT"
[21] "PASS WITH DISTINCTION" "PASS WITH HIGH MERIT"
[23] "PASS WITH MERIT" "SUMMA CUM LAUDE"
I wrote code to reduce the number of levels by substituting level [7] with level [9], level [6] with level [12], and so on:
capstone3$Certificate...HQA <- as.factor(capstone3$Certificate...HQA)
capstone3$Certificate...HQA <- gsub("HONOURS (HIGHEST DISTINCTION)","HONS I", capstone3$Certificate...HQA)
capstone3$Certificate...HQA <- gsub("HONOURS (DISTINCTION)","HONS II UPPER", capstone3$Certificate...HQA)
capstone3$Certificate...HQA <- gsub("HONS WITH MERIT","HONS II LOWER", capstone3$Certificate...HQA)
But the above gsub code did not replace the names in the column. Could someone kindly point out the problem with my code?
Parentheses () are special characters used in regular expressions to create groups. If you have literal parentheses, you need to escape them with \\:
gsub("HONOURS \\(HIGHEST DISTINCTION\\)","HONS I", capstone3$Certificate...HQA)
Or, as @ManuelBickel points out: with fixed = TRUE the pattern is treated as a literal string and matched as is.
gsub("HONOURS (HIGHEST DISTINCTION)","HONS I", capstone3$Certificate...HQA, fixed = TRUE)
I am currently running an STM (structural topic model) on a series of articles from the French newspaper Le Monde. The model is working just great, but I have a problem with the pre-processing of the text.
I'm currently using the quanteda package and the tm package for things like removing words and removing numbers.
There's only one thing, though, that doesn't seem to work.
As some of you might know, in French the masculine definite article le contracts to l' before vowels. I've tried to remove l' (and similar contractions like d') as words with removeWords:
lmt67 <- removeWords(lmt67, c( "l'","d'","qu'il", "n'", "a", "dans"))
but it only works on words that stand alone in the text, not on articles attached to a word, as in l'arbre (the tree).
Frustrated, I tried a simple gsub:
lmt67 <- gsub("l'","",lmt67)
but that doesn't seem to be working either.
Now, what's a better way to do this, possibly through a c(...) vector so that I can pass it a series of expressions all at once?
Just for context, lmt67 is a "large character" vector with 30,000 elements/articles, obtained by applying the texts() function to data imported from txt files.
Thanks to anyone willing to help.
I'll outline two ways to do this using quanteda and quanteda-related tools. First, let's define a slightly longer text, with more prefix cases for French. Notice the inclusion of the ’ apostrophe as well as the ASCII 39 simple apostrophe.
txt <- c(doc1 = "M. Trump, lors d’une réunion convoquée d’urgence à la Maison Blanche,
n’en a pas dit mot devant la presse. En réalité, il s’agit d’une
mesure essentiellement commerciale de ce pays qui l'importe.",
doc2 = "Réfugié à Bruxelles, l’indépendantiste catalan a désigné comme
successeur Jordi Sanchez, partisan de l’indépendance catalane,
actuellement en prison pour sédition.")
The first method will use pattern matches for the simple ASCII 39 apostrophe plus a set of Unicode variants, matched through the category "Pf" ("Punctuation: Final Quote").
However, quanteda does its best to normalize the quotes at the tokenization stage - see the
"l'indépendance" in the second document for instance.
The second way below uses a French part-of-speech tagger integrated with quanteda that allows similar
selection after recognizing and separating the prefixes, and then removing determinants (among other POS).
1. quanteda tokens
toks <- tokens(txt, remove_punct = TRUE)
# remove stopwords
toks <- tokens_remove(toks, stopwords("french"))
toks
# tokens from 2 documents.
# doc1 :
# [1] "M" "Trump" "lors" "d'une" "réunion"
# [6] "convoquée" "d'urgence" "à" "la" "Maison"
# [11] "Blanche" "n'en" "a" "pas" "dit"
# [16] "mot" "devant" "la" "presse" "En"
# [21] "réalité" "il" "s'agit" "d'une" "mesure"
# [26] "essentiellement" "commerciale" "de" "ce" "pays"
# [31] "qui" "l'importe"
#
# doc2 :
# [1] "Réfugié" "à" "Bruxelles" "l'indépendantiste"
# [5] "catalan" "a" "désigné" "comme"
# [9] "successeur" "Jordi" "Sanchez" "partisan"
# [13] "de" "l'indépendance" "catalane" "actuellement"
# [17] "en" "prison" "pour" "sédition"
Then we apply a pattern to match l', s', or d', using a regular expression replacement on the types (the unique tokens):
toks <- tokens_replace(
toks,
types(toks),
stringi::stri_replace_all_regex(types(toks), "[lsd]['\\p{Pf}]", "")
)
# tokens from 2 documents.
# doc1 :
# [1] "M" "Trump" "lors" "une" "réunion"
# [6] "convoquée" "urgence" "à" "la" "Maison"
# [11] "Blanche" "n'en" "a" "pas" "dit"
# [16] "mot" "devant" "la" "presse" "En"
# [21] "réalité" "il" "agit" "une" "mesure"
# [26] "essentiellement" "commerciale" "de" "ce" "pays"
# [31] "qui" "importe"
#
# doc2 :
# [1] "Réfugié" "à" "Bruxelles" "indépendantiste" "catalan"
# [6] "a" "désigné" "comme" "successeur" "Jordi"
# [11] "Sanchez" "partisan" "de" "indépendance" "catalane"
# [16] "actuellement" "En" "prison" "pour" "sédition"
From the resulting toks object you can form a dfm and then proceed to fit the STM.
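As a sketch of that last step (K = 20 is an arbitrary choice here; quanteda's convert() produces the input format the stm package expects):
library(quanteda)
library(stm)
dfmat <- dfm(toks)                    # document-feature matrix
stm_in <- convert(dfmat, to = "stm")  # reshape for the stm package
fit <- stm(documents = stm_in$documents, vocab = stm_in$vocab, K = 20)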
2. using spacyr
This will involve more sophisticated part-of-speech tagging and then converting the tagged object into quanteda tokens. This requires first that you install Python, spacy, and the French language model. (See https://spacy.io/usage/models.)
library(spacyr)
spacy_initialize(model = "fr", python_executable = "/anaconda/bin/python")
# successfully initialized (spaCy Version: 2.0.1, language model: fr)
toks <- spacy_parse(txt, lemma = FALSE) %>%
as.tokens(include_pos = "pos")
toks
# tokens from 2 documents.
# doc1 :
# [1] "M./NOUN" "Trump/PROPN" ",/PUNCT"
# [4] "lors/ADV" "d’/PUNCT" "une/DET"
# [7] "réunion/NOUN" "convoquée/VERB" "d’/ADP"
# [10] "urgence/NOUN" "à/ADP" "la/DET"
# [13] "Maison/PROPN" "Blanche/PROPN" ",/PUNCT"
# [16] "\n /SPACE" "n’/VERB" "en/PRON"
# [19] "a/AUX" "pas/ADV" "dit/VERB"
# [22] "mot/ADV" "devant/ADP" "la/DET"
# [25] "presse/NOUN" "./PUNCT" "En/ADP"
# [28] "réalité/NOUN" ",/PUNCT" "il/PRON"
# [31] "s’/AUX" "agit/VERB" "d’/ADP"
# [34] "une/DET" "\n /SPACE" "mesure/NOUN"
# [37] "essentiellement/ADV" "commerciale/ADJ" "de/ADP"
# [40] "ce/DET" "pays/NOUN" "qui/PRON"
# [43] "l'/DET" "importe/NOUN" "./PUNCT"
#
# doc2 :
# [1] "Réfugié/VERB" "à/ADP" "Bruxelles/PROPN"
# [4] ",/PUNCT" "l’/PRON" "indépendantiste/ADJ"
# [7] "catalan/VERB" "a/AUX" "désigné/VERB"
# [10] "comme/ADP" "\n /SPACE" "successeur/NOUN"
# [13] "Jordi/PROPN" "Sanchez/PROPN" ",/PUNCT"
# [16] "partisan/VERB" "de/ADP" "l’/DET"
# [19] "indépendance/ADJ" "catalane/ADJ" ",/PUNCT"
# [22] "\n /SPACE" "actuellement/ADV" "en/ADP"
# [25] "prison/NOUN" "pour/ADP" "sédition/NOUN"
# [28] "./PUNCT"
Then we can use the default glob-matching to remove the parts of speech in which we are probably not interested, including the newline:
toks <- tokens_remove(toks, c("*/DET", "*/PUNCT", "\n*", "*/ADP", "*/AUX", "*/PRON"))
toks
# doc1 :
# [1] "M./NOUN" "Trump/PROPN" "lors/ADV" "réunion/NOUN" "convoquée/VERB"
# [6] "urgence/NOUN" "Maison/PROPN" "Blanche/PROPN" "n’/VERB" "pas/ADV"
# [11] "dit/VERB" "mot/ADV" "presse/NOUN" "réalité/NOUN" "agit/VERB"
# [16] "mesure/NOUN" "essentiellement/ADV" "commerciale/ADJ" "pays/NOUN" "importe/NOUN"
#
# doc2 :
# [1] "Réfugié/VERB" "Bruxelles/PROPN" "indépendantiste/ADJ" "catalan/VERB" "désigné/VERB"
# [6] "successeur/NOUN" "Jordi/PROPN" "Sanchez/PROPN" "partisan/VERB" "indépendance/ADJ"
# [11] "catalane/ADJ" "actuellement/ADV" "prison/NOUN" "sédition/NOUN"
Then we can remove the tags, which you probably don't want in your STM - but you could leave them if you prefer.
## remove the tags
toks <- tokens_replace(toks, types(toks),
stringi::stri_replace_all_regex(types(toks), "/[A-Z]+$", ""))
toks
# tokens from 2 documents.
# doc1 :
# [1] "M." "Trump" "lors" "réunion" "convoquée"
# [6] "urgence" "Maison" "Blanche" "n’" "pas"
# [11] "dit" "mot" "presse" "réalité" "agit"
# [16] "mesure" "essentiellement" "commerciale" "pays" "importe"
#
# doc2 :
# [1] "Réfugié" "Bruxelles" "indépendantiste" "catalan" "désigné"
# [6] "successeur" "Jordi" "Sanchez" "partisan" "indépendance"
# [11] "catalane" "actuellement" "prison" "sédition"
From there, you can use the toks object to form your dfm and fit the model.
Here's a scrape from the current page at Le Monde's website. Notice that the apostrophe they use (’) is not the same character as the ASCII single quote ('):
text <- "Réfugié à Bruxelles, l’indépendantiste catalan a désigné comme successeur Jordi Sanchez, partisan de l’indépendance catalane, actuellement en prison pour sédition."
It has a little angle and is not actually "straight down" when I view it. You need to copy that character into your gsub command:
sub("l’", "", text)
[#1] "Réfugié à Bruxelles, indépendantiste catalan a désigné comme successeur Jordi Sanchez, partisan de l’indépendance catalane, actuellement en prison pour sédition."