gsub doesn't substitute names in column [duplicate] - r

This question already has answers here:
How do I deal with special characters like \^$.?*|+()[{ in my regex?
(2 answers)
Closed 4 years ago.
Using R, my data frame capstone3 with column Certificate...HQA has the following levels:
levels(capstone3$Certificate...HQA)
[1] "CUM LAUDE" "DIPLOM"
[3] "DOCTORATE" "GRADUATE DIPLOMA"
[5] "HIGHEST HONS" "HONOURS (DISTINCTION)"
[7] "HONOURS (HIGHEST DISTINCTION)" "HONS"
[9] "HONS I" "HONS II"
[11] "HONS II LOWER" "HONS II UPPER"
[13] "HONS III" "HONS UNCLASSIFIED"
[15] "HONS WITH MERIT" "MAGNA CUM LAUDE"
[17] "MASTER'S DEGREE" "OTHER HONS"
[19] "PASS DEGREE" "PASS WITH CREDIT"
[21] "PASS WITH DISTINCTION" "PASS WITH HIGH MERIT"
[23] "PASS WITH MERIT" "SUMMA CUM LAUDE"
I wrote code to reduce the number of levels by substituting level [7] with level [9], level [6] with level [12], and so on:
capstone3$Certificate...HQA <- as.factor(capstone3$Certificate...HQA)
capstone3$Certificate...HQA <- gsub("HONOURS (HIGHEST DISTINCTION)","HONS I", capstone3$Certificate...HQA)
capstone3$Certificate...HQA <- gsub("HONOURS (DISTINCTION)","HONS II UPPER", capstone3$Certificate...HQA)
capstone3$Certificate...HQA <- gsub("HONS WITH MERIT","HONS II LOWER", capstone3$Certificate...HQA)
But the gsub calls above did not replace the names in the column. Could someone kindly point out the problem with my code?

Parentheses () are special characters in regular expressions, used to create groups. To match literal parentheses you need to escape them with \\:
gsub("HONOURS \\(HIGHEST DISTINCTION\\)","HONS I", capstone3$Certificate...HQA)
Or, as @ManuelBickel pointed out: with fixed = TRUE the pattern is treated as a literal string and matched as is.
gsub("HONOURS (HIGHEST DISTINCTION)","HONS I", capstone3$Certificate...HQA, fixed = TRUE)
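A minimal sketch (with invented sample values) showing that both forms give the same result. Note that gsub() on a factor coerces it to character, so re-wrap with as.factor() afterwards if you still need a factor:

```r
x <- c("HONOURS (HIGHEST DISTINCTION)", "HONOURS (DISTINCTION)", "HONS")

# Escape the literal parentheses in the regex...
a <- gsub("HONOURS \\(HIGHEST DISTINCTION\\)", "HONS I", x)

# ...or skip regex interpretation entirely with fixed = TRUE:
b <- gsub("HONOURS (HIGHEST DISTINCTION)", "HONS I", x, fixed = TRUE)

identical(a, b)  # TRUE; only the first element is replaced
```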

Related

Scraping keywords on PHP page

I would like to scrape the keywords inside the dropdown table of this webpage https://www.aeaweb.org/jel/guide/jel.php
The problem is that the drop-down menu of each item prevents me from scraping the table directly because it only takes the heading and not the inner content of each item.
rvest::read_html("https://www.aeaweb.org/jel/guide/jel.php") %>%
rvest::html_table()
I thought of scraping each line that starts with Keywords:, but I do not see how I can do that. It seems the HTML does not expose the items inside the table.
An RSelenium solution:
# Start the server
library(RSelenium)
driver <- rsDriver(browser = "firefox")
remDr <- driver[["client"]]
# Navigate to the URL
remDr$navigate("https://www.aeaweb.org/jel/guide/jel.php")
# XPath of the table
out <- remDr$findElement(using = "xpath", '/html/body/main/div/section/div[4]')
# Get text from the table
out <- out$getElementText()
out <- out[[1]]
Split it using the stringr package:
library(stringr)
str_split(out, "\n", n = Inf, simplify = FALSE)
[[1]]
[1] "A General Economics and Teaching"
[2] "B History of Economic Thought, Methodology, and Heterodox Approaches"
[3] "C Mathematical and Quantitative Methods"
[4] "D Microeconomics"
[5] "E Macroeconomics and Monetary Economics"
[6] "F International Economics"
[7] "G Financial Economics"
[8] "H Public Economics"
[9] "I Health, Education, and Welfare"
[10] "J Labor and Demographic Economics"
[11] "K Law and Economics"
[12] "L Industrial Organization"
[13] "M Business Administration and Business Economics; Marketing; Accounting; Personnel Economics"
[14] "N Economic History"
[15] "O Economic Development, Innovation, Technological Change, and Growth"
[16] "P Economic Systems"
[17] "Q Agricultural and Natural Resource Economics; Environmental and Ecological Economics"
[18] "R Urban, Rural, Regional, Real Estate, and Transportation Economics"
[19] "Y Miscellaneous Categories"
[20] "Z Other Special Topics"
To get the Keywords for History of Economic Thought, Methodology, and Heterodox Approaches
out1 <- remDr$findElement(using = 'xpath', value = '//*[@id="cl_B"]')
out1$clickElement()
out1 <- remDr$findElement(using = 'xpath', value = '/html/body/main/div/section/div[4]/div[2]/div[2]/div/div/div/div[2]')
out1$getElementText()
[[1]]
[1] "Keywords: History of Economic Thought"
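Once the panel text is in hand, the lines beginning with Keywords: can be pulled out with base R. The sample string below is invented to stand in for the scraped output:

```r
# Sketch: split the panel text on newlines, then keep only the lines
# beginning with "Keywords:". The text here is invented sample data.
out <- paste("B History of Economic Thought, Methodology, and Heterodox Approaches",
             "Keywords: History of Economic Thought",
             "C Mathematical and Quantitative Methods",
             "Keywords: Econometrics",
             sep = "\n")

lines <- strsplit(out, "\n")[[1]]
keywords <- grep("^Keywords:", lines, value = TRUE)
keywords
# "Keywords: History of Economic Thought" "Keywords: Econometrics"
```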

Levels of a dataframe after filtering

I've been doing an assignment for a self-study in R programming. I have a question about what happens to factors in a data frame once you filter it. I have a data frame with the columns (movie) Studio and Genre.
For the assignment I need to filter it. I succeeded in this, but when I check the levels of the newly filtered columns, all the original factor levels are still present, not only the filtered ones.
Why is this? Am I doing something wrong?
StudioTarget <- c("Buena Vista Studios","Fox","Paramount Pictures","Sony","Universal","WB")
GenreTarget <- c("action","adventure","animation","comedy","drama")
dftest <- df[df$Studio %in% StudioTarget & df$Genre %in% GenreTarget,]
> levels(dftest$Studio)
[1] "Art House Studios" "Buena Vista Studios" "Colombia Pictures"
[4] "Dimension Films" "Disney" "DreamWorks"
[7] "Fox" "Fox Searchlight Pictures" "Gramercy Pictures"
[10] "IFC" "Lionsgate" "Lionsgate Films"
[13] "Lionsgate/Summit" "MGM" "MiraMax"
[16] "New Line Cinema" "New Market Films" "Orion"
[19] "Pacific Data/DreamWorks" "Paramount Pictures" "Path_ Distribution"
[22] "Relativity Media" "Revolution Studios" "Screen Gems"
[25] "Sony" "Sony Picture Classics" "StudioCanal"
[28] "Summit Entertainment" "TriStar" "UA Entertainment"
[31] "Universal" "USA" "Vestron Pictures"
[34] "WB" "WB/New Line" "Weinstein Company"
You can do droplevels(dftest$Studio) to remove the unused levels.
No, you're not doing anything wrong. A factor defines a fixed number of levels. These levels remain the same even if one or more of them are not present in the data. You've asked for the levels of your factor, not the values present after filtering.
Consider:
library(tidyverse)
mtcars %>%
mutate(cyl= as.factor(cyl)) %>%
filter(cyl == 4) %>%
distinct(cyl) %>%
pull(cyl)
[1] 4
Levels: 4 6 8
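The same point in a base-R sketch (with invented sample data): filtering a data frame keeps the original factor levels, and droplevels() discards the unused ones.

```r
# A one-column data frame with three studio levels (invented data).
df <- data.frame(Studio = factor(c("Fox", "MGM", "Sony")))

# Filter out "MGM"; drop = FALSE keeps the result a data frame.
sub <- df[df$Studio %in% c("Fox", "Sony"), , drop = FALSE]

levels(sub$Studio)             # still "Fox" "MGM" "Sony"
levels(droplevels(sub$Studio)) # "Fox" "Sony"
```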
Welcome to SO. Next time, please try to provide a minimal working example. This post will help you construct one.

Using strsplit results in terms with quotation marks in r

I have a large set of data, which I have imported from Excel, and I wish to get a term-frequency table for it. But when I use strsplit, the result includes quotation marks and other punctuation, which gives wrong results.
There is a small error in the way I am using strsplit, and I need help with it as I am not able to figure it out myself.
df = read_excel("C:/Users/B M Consulting/Documents/Book2.xlsx", col_types=c("text","numeric"), range=cell_cols("A:B"))
vect <- c(df[1])
vectsplit <- strsplit(tolower(vect), "\\s+")
vectlev <- unique(unlist(vectsplit))
vecttermf <- sapply(vectsplit, function(x) table(factor(x, levels=vectlev)))
The output vect is something like this:
[1] "3 inch c clamp" "baby vice" "baby vice bench" "baby vise"
[5] "bench" "bench vice" "bench vice clamp" "bench vise"
[9] "bench voice" "bench wise" "bench wise heavy" "bench wise table"
[13] "box for tools" "c clamp" "c clamp set" "c clamps"
[17] "carpenter tools" "carpenter tools low price" "cast iron pipe" "clamp"
[21] "clamp set" "clamps woodworking" "g clamp" "g clamp set 3 inch"
I need to get each word out. When I use strsplit, it includes all the punctuation marks.
Below is a small section of the vectsplit that I get. It includes all the inverted commas, backslashes and commas, which I don't want.
[1] "c(\"3" "inch" "c" "clamp\"," "\"baby" "vice\"," "\"baby" "vice"
[9] "bench\"," "\"baby" "vise\"," "\"bench\"," "\"bench" "vice\"," "\"bench" "vice"
[17] "clamp\"," "\"bench" "vise\"," "\"bench" "voice\"," "\"bench" "wise\"," "\"bench"
[25] "wise" "heavy\"," "\"bench" "wise" "table\"," "\"box" "for" "tools\","
[33] "\"c" "clamp\"," "\"c" "clamp" "set\"," "\"c" "clamps\"," "\"carpenter"
[41] "tools\"," "\"carpenter" "tools" "low" "price\"," "\"cast" "iron" "pipe\","
If you check the class of vect, you'll notice that it's not a character vector, but a list.
vect<-c(df[1])
class(vect)
> "list"
If you define vect as below, the issue disappears:
vect<-df[[1]]
class(vect)
> "character"
If you define vect as such and then use strsplit, it should work just fine. Keep in mind that different kinds of subsetting ([1] vs. [[1]]) will produce different classes of outputs.
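The difference can be seen in a short sketch (sample data invented here): c(df[1]) keeps the column wrapped in a list, while df[[1]] extracts it as a plain character vector.

```r
df <- data.frame(term = c("baby vice", "bench vice clamp"),
                 stringsAsFactors = FALSE)

class(c(df[1]))  # "list" -- tolower() deparses this, producing the \" noise
class(df[[1]])   # "character"

# strsplit() behaves as intended on the character vector:
strsplit(tolower(df[[1]]), "\\s+")
# [[1]] "baby" "vice"   [[2]] "bench" "vice" "clamp"
```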

Replace all non-alphanumeric with a period

I am trying to rename all of these atrocious column names in a data frame I received from a government agency.
> colnames(thedata)
[1] "Region" "Resource Assessment Site ID"
[3] "Site Name/Facility" "Design Head (feet)"
[5] "Design Flow (cfs)" "Installed Capacity (kW)"
[7] "Annual Production (MWh)" "Plant Factor"
[9] "Total Construction Cost (1,000 $)" "Annual O&M Cost (1,000 $)"
[11] "Cost per Installed Capacity ($/kW)" "Benefit Cost Ratio with Green Incentives"
[13] "IRR with Green Incentives" "Benefit Cost Ratio without Green Incentives"
[15] "IRR without Green Incentives"
The column headers have special non-alphanumeric characters and spaces, which makes referring to them impossible, so I have to rename them. I would like to replace every non-alphanumeric character with a period, so I tried:
old.col.names <- colnames(thedata)
new.col.names <- gsub("^a-z0-9", ".", old.col.names)
I thought the ^ meant "not", so the call would replace everything that is not alphanumeric with a period in old.col.names.
Can anyone help?
Here are three options to consider:
make.names(x)
gsub("[^A-Za-z0-9]", ".", x)
names(janitor::clean_names(setNames(data.frame(matrix(NA, ncol = length(x))), x)))
Here's what each looks like:
make.names(x)
## [1] "Region" "Resource.Assessment.Site.ID"
## [3] "Site.Name.Facility" "Design.Head..feet."
## [5] "Design.Flow..cfs." "Installed.Capacity..kW."
## [7] "Annual.Production..MWh." "Plant.Factor"
## [9] "Total.Construction.Cost..1.000..." "Annual.O.M.Cost..1.000..."
## [11] "Cost.per.Installed.Capacity....kW." "Benefit.Cost.Ratio.with.Green.Incentives"
## [13] "IRR.with.Green.Incentives" "Benefit.Cost.Ratio.without.Green.Incentives"
## [15] "IRR.without.Green.Incentives"
gsub("[^A-Za-z0-9]", ".", x)
## [1] "Region" "Resource.Assessment.Site.ID"
## [3] "Site.Name.Facility" "Design.Head..feet."
## [5] "Design.Flow..cfs." "Installed.Capacity..kW."
## [7] "Annual.Production..MWh." "Plant.Factor"
## [9] "Total.Construction.Cost..1.000..." "Annual.O.M.Cost..1.000..."
## [11] "Cost.per.Installed.Capacity....kW." "Benefit.Cost.Ratio.with.Green.Incentives"
## [13] "IRR.with.Green.Incentives" "Benefit.Cost.Ratio.without.Green.Incentives"
## [15] "IRR.without.Green.Incentives"
library(janitor)
names(clean_names(setNames(data.frame(matrix(NA, ncol = length(x))), x)))
## [1] "region" "resource_assessment_site_id"
## [3] "site_name_facility" "design_head_feet"
## [5] "design_flow_cfs" "installed_capacity_kw"
## [7] "annual_production_mwh" "plant_factor"
## [9] "total_construction_cost_1_000" "annual_o_m_cost_1_000"
## [11] "cost_per_installed_capacity_kw" "benefit_cost_ratio_with_green_incentives"
## [13] "irr_with_green_incentives" "benefit_cost_ratio_without_green_incentives"
## [15] "irr_without_green_incentives"
Sample data:
x <- c("Region", "Resource Assessment Site ID", "Site Name/Facility",
"Design Head (feet)", "Design Flow (cfs)", "Installed Capacity (kW)",
"Annual Production (MWh)", "Plant Factor", "Total Construction Cost (1,000 $)",
"Annual O&M Cost (1,000 $)", "Cost per Installed Capacity ($/kW)",
"Benefit Cost Ratio with Green Incentives", "IRR with Green Incentives",
"Benefit Cost Ratio without Green Incentives", "IRR without Green Incentives")
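If the runs of periods that make.names() leaves behind (e.g. Design.Head..feet.) are unwanted, a follow-up gsub() can tidy them up. A small sketch using two of the sample names:

```r
x <- c("Design Head (feet)", "Annual O&M Cost (1,000 $)")

nm <- make.names(x)
nm <- gsub("\\.+", ".", nm)  # collapse runs of "." to a single "."
nm <- gsub("\\.$", "", nm)   # drop a trailing "."
nm
# "Design.Head.feet"  "Annual.O.M.Cost.1.000"
```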

With R and XML, can an XPath 1.0 expression eliminate duplicates in the content returned?

When I extract content from the following URL, using XPath 1.0, the cities that are returned contain duplicates, starting with Birmingham. (The complete set of values returned is more than 140, so I have truncated it.) Is there a way with the XPath expression to avoid the duplicates?
require(XML)
doc <- htmlTreeParse("http://www.littler.com/locations", useInternal = TRUE)
xpathSApply(doc, "//div[@class = 'mm-location-usa']//a[position() < 12]", xmlValue, trim = TRUE)
[1] "Birmingham" "Mobile" "Anchorage" "Phoenix" "Fayetteville" "Fresno"
[7] "Irvine" "L.A. - Century City" "L.A. - Downtown" "Sacramento" "San Diego" "Birmingham"
[13] "Mobile" "Anchorage" "Phoenix" "Fayetteville" "Fresno" "Irvine"
[19] "L.A. - Century City" "L.A. - Downtown" "Sacramento" "San Diego"
Is there an XPath expression or work around along the lines of [not-duplicate()]?
Also, various [position() < X] permutations don't produce only the cities and only one instance of each. In fact, it's hard to figure out how positions are counted.
I would appreciate any guidance or finding out that the best I can do is limit the number of duplicates returned.
BTW, XPath result with duplicates is not the same problem, nor are the questions about duplicate nodes, e.g., How do I identify duplicate nodes in XPath 1.0 using an XPathNavigator to evaluate?
There is a function for this, it is called distinct-values(), but unfortunately, it is only available in XPath 2.0. In R, you are limited to XPath 1.0.
What you can do is
//div[@class = 'mm-location-usa']//a[position() < 12 and not(normalize-space(.) = normalize-space(following::a))]
What it does, in plain English:
Look for div elements, but only those whose class attribute equals "mm-location-usa". Then look for descendant a elements of those divs, but only those in a position less than 12 whose normalized text content does not equal the normalized text content of an a element that follows.
But it is a computationally intensive approach and not the most elegant one. I recommend you take jlhoward's solution.
Can't you just do it this way??
require(XML)
doc <- htmlTreeParse("http://www.littler.com/locations", useInternal = TRUE)
xPath <- "//div[@class = 'mm-location-usa']//a[position() < 12]"
unique(xpathSApply(doc, xPath, xmlValue, trim = TRUE))
# [1] "Birmingham" "Mobile" "Anchorage"
# [4] "Phoenix" "Fayetteville" "Fresno"
# [7] "Irvine" "L.A. - Century City" "L.A. - Downtown"
# [10] "Sacramento" "San Diego"
Or, you can just create an XPath to process the li tags in the first div (since they are duplicate divs):
xpathSApply(doc, "//div[@id='lmblocks-mega-menu---locations'][1]/
            div[@class='mm-location-usa']/
ul/
li[#class='mm-list-item']", xmlValue, trim = TRUE)
## [1] "Birmingham" "Mobile" "Anchorage"
## [4] "Phoenix" "Fayetteville" "Fresno"
## [7] "Irvine" "L.A. - Century City" "L.A. - Downtown"
## [10] "Sacramento" "San Diego" "San Francisco"
## [13] "San Jose" "Santa Maria" "Walnut Creek"
## [16] "Denver" "New Haven" "Washington, DC"
## [19] "Miami" "Orlando" "Atlanta"
## [22] "Chicago" "Indianapolis" "Overland Park"
## [25] "Lexington" "Boston" "Detroit"
## [28] "Minneapolis" "Kansas City" "St. Louis"
## [31] "Las Vegas" "Reno" "Newark"
## [34] "Albuquerque" "Long Island" "New York"
## [37] "Rochester" "Charlotte" "Cleveland"
## [40] "Columbus" "Portland" "Philadelphia"
## [43] "Pittsburgh" "San Juan" "Providence"
## [46] "Columbia" "Memphis" "Nashville"
## [49] "Dallas" "Houston" "Tysons Corner"
## [52] "Seattle" "Morgantown" "Milwaukee"
I made an assumption here that you're going after US locations.