I have recently started learning R and am trying to explore it further by automating this process. Below is some sample data; I am trying to create a new column by finding and replacing particular text within the label column (colname: Designations).
Since I will be doing this with loads of new data, I would like to automate it with R rather than with Excel formulas.
Dataset:
strings<-c("Zonal Manager","Department Manager","Network Manager","Head of Sales","Account Manager","Alliance Manager","Additional Manager","Senior Vice President","General manager","Senior Analyst", "Solution Architect","AGM")
R code I used:
t<-data.frame(strings,stringsAsFactors = FALSE)
colnames(t)[1]<-"Designations"
y <- sub(".*Manager*", "Manager", strings, ignore.case = TRUE)  # note: the trailing * quantifies only the "r", so this matches "Manage" too
Challenge:
With this, every string containing "Manager" was collapsed to "Manager", but I also need to replace the other designations with their main themes.
I tried ifelse, grep, grepl, str_sub, sub, etc., but I didn't get what I was looking for.
I can't use the first/second/last word (as a delimiter), since the main themes appear in varying positions, e.g. Chief Information Officer, Commercial Finance Manager, or AGM.
Excel Work:
I have already coded 300 main themes as...
Manager (for GM, Asst. Manager, Sales Manager, etc.)
Architect (Solution Arch, Sr. Arch, etc.)
Director (Senior Director, Director, Asst. Director, etc.)
Senior Analyst
Analyst
Head (for Head of Sales)
What I'm looking for:
I need to create a new column in which the text is replaced with the relevant main theme, as I did in Excel, but using R.
I'm also happy to take the main themes I have already coded in Excel and match against them in R (like VLOOKUP in Excel).
Expected result:
(screenshot omitted; it showed a theme column alongside Designations, e.g. "Zonal Manager" → "Manager" and "Head of Sales" → "Head")
Thanks in advance for your help!
Yes, that's exactly what I'm expecting, thanks! But when I tried the same methodology on a new dataset (uploaded from an Excel file) with
df %>%
  mutate(theme = gsub(".*(Manager|Lead|Director|Head|Administrator|Executive|VP|President|Consultant|CFO|CTO|CEO|CMO|CDO|CIO|COO|Chief Executive Officer|Chief Technological Officer|Chief Digital Officer|Chief Financial Officer|Chief Marketing Officer|Chief Information Officer|Chief Operations Officer).*",
                "\\1", Designations, ignore.case = TRUE))
it didn't work. Should I correct something else?
data:
strings<-c("Zonal Manager","Department Manager","Network Manager","Head of Sales","Account Manager",
"Alliance Manager","Additional Manager","Senior Vice President","General manager","Senior Analyst", "Solution Architect","AGM")
You need to prepare a good lookup table (complete it and refine it as needed):
lu_table <- data.frame(new = c("Manager", "Architect", "Director"),
                       old = c("Manager|GM", "Architect|Arch", "Director"),
                       stringsAsFactors = FALSE)
Then you can let mapply do the job:
mapply(function(new, old) {
  ans <- strings
  ans[grepl(old, ans)] <- new  # overwrite every entry matching `old` with the theme `new`
  strings <<- ans
  NULL
}, new = lu_table$new, old = lu_table$old)
Now look at strings:
> strings
[1] "Manager" "Manager" "Manager" "Head of Sales" "Manager" "Manager"
[7] "Manager" "Senior Vice President" "General manager" "Senior Analyst" "Architect" "Manager"
Please note: this solution uses <<- (it assigns into the global environment from inside the function), so it might not be the best possible solution, but it works in this case.
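If you would rather avoid <<-, a plain loop over the lookup table gives the same result. A minimal sketch, reusing lu_table and strings from above:
# Same replacement logic without <<-: overwrite matches row by row.
themed <- strings
for (i in seq_len(nrow(lu_table))) {
  themed[grepl(lu_table$old[i], themed)] <- lu_table$new[i]
}
themed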
Do you mean something like this?
library(dplyr)
strings <-
c(
"Zonal Manager",
"Department Manager",
"Network Manager",
"Head of Sales",
"Account Manager",
"Alliance Manager",
"Additional Manager",
"Senior Vice President",
"General manager",
"Senior Analyst",
"Solution Architect",
"AGM"
)
df = data.frame(Designations = strings)
df %>%
mutate(
theme = gsub(
".*(manager|head|analyst|architect|agm|director|president).*",
"\\1",
Designations,
ignore.case = TRUE
)
)
#> Designations theme
#> 1 Zonal Manager Manager
#> 2 Department Manager Manager
#> 3 Network Manager Manager
#> 4 Head of Sales Head
#> 5 Account Manager Manager
#> 6 Alliance Manager Manager
#> 7 Additional Manager Manager
#> 8 Senior Vice President President
#> 9 General manager manager
#> 10 Senior Analyst Analyst
#> 11 Solution Architect Architect
#> 12 AGM AGM
Created on 2018-10-04 by the reprex package (v0.2.1)
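Since the question mentions themes already coded in Excel, here is a rough sketch of the VLOOKUP-style idea (the file name themes.xlsx and the column name theme are assumptions, not from the question):
library(readxl)
library(stringr)

themes <- read_excel("themes.xlsx")  # assumed: one column `theme` holding the ~300 main themes
# Build one big alternation, with word boundaries so e.g. "GM" doesn't match inside other words:
pattern <- str_c("\\b(", str_c(themes$theme, collapse = "|"), ")\\b")

df %>%
  mutate(theme = str_extract(Designations, regex(pattern, ignore_case = TRUE)))
str_extract() returns the first matching theme and NA where nothing matches, so unmatched designations are easy to spot and add to the Excel list.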
I'm trying to match people who hold a certain job title, but there are many abbreviations (e.g., "dr." and "dir" both mean director). For some reason, my code yields obviously wrong answers (e.g., it retains "kvp coordinator" in the example below), and I can't figure out what's going on:
library(dplyr)
library(stringr)
test <- tibble(name = c("Corey", "Sibley", "Justin", "Kate", "Ruth", "Phil", "Sara"),
title = c("kvp coordinator", "manager", "director", "snr dr. of marketing", "drawing expert", "dir of finance", "direct to mail expert"))
test %>%
filter(str_detect(title, "chief|vp|president|director|dr\\.|dir\\ |dir\\."))
In the above example, only Justin, Kate, and Phil should be left, but somehow the filter doesn't drop Corey.
In addition to an answer, if you could explain why I'm getting this bizarre result, I'd really appreciate it.
The vp in the str_detect pattern matches inside kvp; that's why Corey stays in the output. Wrapping it in word boundaries (\\b) fixes it:
test %>% filter(str_detect(title, "chief|\\bvp\\b|president|director|dr\\.|dir\\ |dir\\."))
# A tibble: 3 x 2
name title
<chr> <chr>
1 Justin director
2 Kate snr dr. of marketing
3 Phil dir of finance
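For a slightly more robust variant (a sketch, on the assumption that you only want whole-word matches for the short tokens): \\bdir\\b also catches "dir" at the end of a string, which "dir " and "dir." would miss. Note that \\b cannot follow "dr\\.", because a period is not a word character, so that alternative keeps its literal dot instead.
test %>%
  filter(str_detect(title, "chief|\\bvp\\b|president|director|dr\\.|\\bdir\\b"))
This still returns only Justin, Kate, and Phil: "direct to mail expert" fails \\bdir\\b because "dir" is followed by a word character.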
I am very new to R and web scraping. For practice I am scraping book titles from a website and working out some basic stats using the titles. So far I have managed to scrape the book titles, add them to a table, and find the mean length of the titles.
I now want to find the most commonly used word in the book titles. It is probably "the", but I want to prove this using R. At the moment my program only looks at the full book titles; I need to split them into individual words so I can count how often each distinct word occurs. However, I am not sure how to do this.
Code:
library(rvest)

url <- 'http://books.toscrape.com/index.html'
bookNames <- read_html(url) %>%
  html_nodes(xpath = '//*[contains(concat(" ", @class, " "), concat(" ", "product_pod", " "))]//a') %>%
  html_text()
View(bookNames)
values <- lapply(bookNames, nchar)
mean(unlist(values))
bookNames <- tolower(bookNames)
sort(table(bookNames), decreasing = TRUE)[1:2]
I think splitting every word into a new list would solve my problem, yet I am not sure how to do this.
Thanks in advance.
You can get all the book titles with:
library(rvest)
url <- 'http://books.toscrape.com/index.html'
url %>%
read_html() %>%
html_nodes('h3 a') %>%
html_attr('title') -> titles
titles
# [1] "A Light in the Attic"
# [2] "Tipping the Velvet"
# [3] "Soumission"
# [4] "Sharp Objects"
# [5] "Sapiens: A Brief History of Humankind"
# [6] "The Requiem Red"
# [7] "The Dirty Little Secrets of Getting Your Dream Job"
#....
To get the most common words in the titles, you can split the strings on whitespace and use table to count the frequencies.
head(sort(table(tolower(unlist(strsplit(titles, '\\s+')))), decreasing = TRUE))
# the a of #1) and for
# 14 3 3 2 2 2
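If you also want to strip punctuation (so that, say, "sapiens:" and "sapiens" count as the same word), one option is to split on anything that is not a letter instead of on whitespace. A small sketch, reusing titles from above:
words <- unlist(strsplit(tolower(titles), "[^a-z]+"))
words <- words[nchar(words) > 0]  # drop empty strings produced by leading punctuation/digits
head(sort(table(words), decreasing = TRUE))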
I have a dataframe with a column with some text in it. I want to do three data pre-processing steps:
1) remove words that occur only once
2) remove words with a low inverse document frequency (IDF)
3) remove words that occur most frequently
This is an example of the data:
head(stormfront_data$stormfront_self_content)
Output:
[1] " , , stormfront! thread members post introduction, \".\" stumbled white networking site, reading & decided register account, largest networking site white brothers, sisters! read : : guidelines posting - stormfront introduction stormfront - stormfront main board consists forums, -forums : newslinks & articles - stormfront ideology philosophy - stormfront activism - stormfront network local level: local regional - stormfront international - stormfront , . addition main board supply social groups utilized networking. final note: steps sustaining member, core member site online, affords additional online features. sf: shopping cart stormfront!"
[2] "bonjour warm brother ! forward speaking !"
[3] " check time time forums. frequently moved columbia distinctly numbered. groups gatherings "
[4] " ! site pretty nice. amount news articles. main concern moment islamification."
[5] " , discovered site weeks ago. finally decided join found article wanted share . proud race long time idea site people shared views existed."
[6] " white brothers, names jay member years, bit info ? stormfront meet ups ? stay strong guys jay, uk"
Any help would be greatly appreciated, as I am not too familiar with R.
Here's a solution to 1) (removing words that occur only once), in several steps:
Step 1: clean the data by removing anything that is not a word character (\\W) and collapse the documents into a single string:
data2 <- trimws(paste0(gsub("\\W+", " ", data), collapse = ""))
Step 2: Make a sorted frequency list of the words:
fw <- as.data.frame(sort(table(strsplit(data2, "\\s{1,}")), decreasing = T))
Step 3: define a pattern to match (namely, all the words that occur only once); make sure you wrap them in word-boundary markers (\\b) so that only exact matches get matched (e.g., network but not networking):
pattern <- paste0("\\b(", paste0(fw$Var1[fw$Freq==1], collapse = "|"), ")\\b")
Step 4: remove matched words:
data3 <- gsub(pattern, "", data2)
Step 5: clean up by removing superfluous spaces:
data4 <- trimws(gsub("\\s{1,}", " ", data3))
Result:
[1] "stormfront introduction white networking site decided networking site white brothers stormfront introduction stormfront stormfront main board forums forums articles stormfront stormfront stormfront local local stormfront stormfront main board groups networking member member site online online stormfront time time forums groups site articles main site decided time site white brothers jay member stormfront jay"
Here is an approach with tidytext
library(tidytext)
library(dplyr)
word_count <- tibble(document = seq(1,nrow(data)), text = data) %>%
unnest_tokens(word, text) %>%
count(document, word, sort = TRUE)
total_count <- tibble(document = seq(1,nrow(data)), text = data) %>%
unnest_tokens(word, text) %>%
group_by(word) %>%
summarize(total = n())
words <- left_join(word_count,total_count)
words %>%
bind_tf_idf(word, document, n)
# A tibble: 111 x 7
document word n total tf idf tf_idf
<int> <chr> <int> <int> <dbl> <dbl> <dbl>
1 1 stormfront 10 11 0.139 1.10 0.153
2 1 networking 3 3 0.0417 1.79 0.0747
3 1 site 3 6 0.0417 0.693 0.0289
4 1 board 2 2 0.0278 1.79 0.0498
5 1 forums 2 3 0.0278 1.10 0.0305
6 1 introduction 2 2 0.0278 1.79 0.0498
7 1 local 2 2 0.0278 1.79 0.0498
8 1 main 2 3 0.0278 1.10 0.0305
9 1 member 2 3 0.0278 1.10 0.0305
10 1 online 2 2 0.0278 1.79 0.0498
# … with 101 more rows
From here it is trivial to filter with dplyr::filter, but since you don't define any specific criteria other than "only once", I'll leave that to you.
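For example (the thresholds below are assumptions, since "low" IDF and "most frequently" aren't defined in the question):
words %>%
  bind_tf_idf(word, document, n) %>%
  filter(total > 1,                  # 1) drop words that occur only once
         idf > quantile(idf, 0.25),  # 2) drop words with low idf (bottom quartile, as an example cutoff)
         total < max(total))         # 3) drop the most frequent word(s)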
Data
data <- structure(c(" , , stormfront! thread members post introduction, \".\" stumbled white networking site, reading & decided register account, largest networking site white brothers, sisters! read : : guidelines posting - stormfront introduction stormfront - stormfront main board consists forums, -forums : newslinks & articles - stormfront ideology philosophy - stormfront activism - stormfront network local level: local regional - stormfront international - stormfront , . addition main board supply social groups utilized networking. final note: steps sustaining member, core member site online, affords additional online features. sf: shopping cart stormfront!",
"bonjour warm brother ! forward speaking !", " check time time forums. frequently moved columbia distinctly numbered. groups gatherings ",
" ! site pretty nice. amount news articles. main concern moment islamification.",
" , discovered site weeks ago. finally decided join found article wanted share . proud race long time idea site people shared views existed.",
" white brothers, names jay member years, bit info ? stormfront meet ups ? stay strong guys jay, uk"
), .Dim = c(6L, 1L))
Base R solution:
# Remove double spacing and punctuation at the start of strings:
# cleaned_str => character vector
cstr <- trimws(gsub("\\s*[[:punct:]]+", "", trimws(gsub('\\s+|^\\s*[[:punct:]]+|"',
                    ' ', data), "both")), "both")
# Calculate the document frequency: document_freq => data.frame
document_freq <- data.frame(table(unlist(sapply(cstr, function(x){
unique(unlist(strsplit(x, "[^a-z]+")))}))))
# Store the inverse document frequency as a vector: idf => double vector:
document_freq$idf <- log(length(cstr)/document_freq$Freq)
# For each record remove terms that occur only once, occur the maximum number
# of times a word occurs in the dataset, or words with a "low" idf:
# pp_records => data.frame
pp_records <- do.call("rbind", lapply(cstr, function(x){
# Store the term and corresponding term frequency as a data.frame: tf_dataf => data.frame
tf_dataf <- data.frame(table(as.character(na.omit(gsub("^$", NA_character_,
unlist(strsplit(x, "[^a-z]+")))))),
stringsAsFactors = FALSE)
# Store a vector containing each term's idf: idf => double vector
tf_dataf$idf <- document_freq$idf[match(tf_dataf$Var1, document_freq$Var1)]
# Explicitly return a one-row data.frame with the cleaned and preprocessed record:
return(
data.frame(
cleaned_record = x,
pp_records =
paste0(unique(unlist(
strsplit(gsub("\\s+", " ",
trimws(
gsub(paste0(tf_dataf$Var1[tf_dataf$Freq == 1 |
tf_dataf$idf < (quantile(tf_dataf$idf, .25) - (1.5 * IQR(tf_dataf$idf))) |
tf_dataf$Freq == max(tf_dataf$Freq)],
collapse = "|"), "", x), "both"
)), "\\s")
)), collapse = " "),
row.names = NULL,
stringsAsFactors = FALSE
)
)
}
))
# Column bind cleaned strings with the original records: ppd_cleaned_df => data.frame
ppd_cleaned_df <- cbind(orig_record = data, pp_records)
# Output to console: ppd_cleaned_df => stdout (console)
ppd_cleaned_df
I have a db with some repeated entries that report (inconsistently) additional information. I would like to get rid of the extra information and keep the simplest version of each entry.
db <- data.frame(company=c("ENTRY_X","ENTRY_X COUNTY_1","COUNTY_2 ENTRY_X","ENTRY_Y"))
db_desiderata <- data.frame(company=c(rep("ENTRY_X",3),"ENTRY_Y"))
Entries are possibly lengthy strings (some with spaces). Some examples are: "General Motors Company" and "General Motors".
I manage to isolate all the entries that need to be substituted with their substring (in db$included).
I plan to run it recursively.
Attempted code (it all works; I'm stuck on how to proceed):
db$included <- lapply(db$company, function(x) c(grep(x, db$company, value = TRUE)))
db$length <- lapply(db$included, function(x) length(unlist(x)))
db$included <- ifelse(db$length == 1, NA, db$included)
The following should work if the data strictly conforms to these patterns:
The desired name must be the first in the sequence of alternative names
The desired name must be the shortest in the sequence of alternative names, and can't be followed by a company name that is a shorter subset of the preceding company name.
I'll use a variation of Chuck P's data to illustrate how this works and the problems if the patterns aren't followed.
db <- data.frame(company = c("General Foods","More General Foods","General Foods Cereal Division","General Auto",
"General Motors Company", "General Motors", "European General Motors Company",
"General", "Asia General Toys") )
companies <- Reduce( f = function(y,x) {if(grepl(pattern = y, x=x)) y else x},
x=db$company, accumulate = TRUE)
which gives
companies
[1] General Foods General Foods General Foods General Auto General Motors Company
[6] General Motors General Motors General General
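One small refinement worth considering: grepl() treats the accumulated name as a regular expression, so a company name containing characters such as "." or "(" could misbehave; fixed = TRUE makes the match literal. A sketch that also attaches the result back to db:
db$company_clean <- Reduce(function(y, x) if (grepl(y, x, fixed = TRUE)) y else x,
                           x = as.character(db$company), accumulate = TRUE)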
I think I understand your situation a little better after your comment, but I would still be very wary of a fully automated solution. One slip, or one term that is too general (pun intended), and you're hosed...
I've taken your early work and done a little renaming. Think of your original length as more a measure of potential: I'd look at the potential column with a human eye and pick and choose the places to replace. I'd approach this with stringr::str_replace_all. If you use a named vector as shown below, you should be able to handle a wide array of cases with cut and paste; "^.*General Motors.*$" just means "if you find it anywhere in the string, front or back". You can work iteratively and keep adding to the named vector until the data is clean.
library(dplyr)
library(stringr)
db <- data.frame(company = c("General Foods","More General Foods","General Foods Cereal Division","General", "General Auto", "General Motors Company", "General Motors", "European General Motors Company"))
db$similar_company <- sapply(db$company, function(x) c(grep(x, db$company, value=T)), simplify = TRUE)
db$potential <- sapply(db$similar_company, function(x) length(unlist(x)), simplify = TRUE)
glimpse(db)
#> Rows: 8
#> Columns: 3
#> $ company <chr> "General Foods", "More General Foods", "General Foods…
#> $ similar_company <named list> [<"General Foods", "More General Foods", "Gene…
#> $ potential <int> 3, 1, 1, 8, 1, 2, 3, 1
db %>% arrange(desc(potential)) %>% select(-similar_company)
#> company potential
#> 1 General 8
#> 2 General Foods 3
#> 3 General Motors 3
#> 4 General Motors Company 2
#> 5 More General Foods 1
#> 6 General Foods Cereal Division 1
#> 7 General Auto 1
#> 8 European General Motors Company 1
db$newcompany <-
  str_replace_all(db$company, c("^.*General Foods.*$" = "General Foods",
                                "^.*General Motors.*$" = "General Motors"))
db %>% select(company, newcompany)
#> company newcompany
#> 1 General Foods General Foods
#> 2 More General Foods General Foods
#> 3 General Foods Cereal Division General Foods
#> 4 General General
#> 5 General Auto General Auto
#> 6 General Motors Company General Motors
#> 7 General Motors General Motors
#> 8 European General Motors Company General Motors
Created on 2020-05-08 by the reprex package (v0.3.0)
I am trying to webscrape the following website:
http://www.healthgrades.com/hospital-directory/california-ca-san-mateo/affiliated-physicians-HGSTED418D46050070
I am using R to scrape the website. In particular, I am trying to copy all of the doctors' names and specialties. However, the main issue is that the URL does not change when I press the arrow/next button, so I cannot use any basic techniques to scrape this page. How can I solve this problem? It would be nice to have all of the data I am collecting in one data matrix/spreadsheet.
dum <- "http://www.healthgrades.com/hospital-directory/california-ca-san-mateo/affiliated-physicians-HGSTED418D46050070"
library(XML)
ddum <- htmlParse(dum)
noofpages <- xpathSApply(ddum,'//*/span[@class="paginationItem active"]/following-sibling::*[1]',xmlValue)[1]
noofpages <- (as.numeric(gsub(' of ','',noofpages))-1)%/%5+1
doctors <- c(); dspec <- c()
for(i in 1:noofpages){
if(i>1){
ddum <- htmlParse(paste0(dum,"?pagenumber=",i,'#'))
}
doctors <- c(doctors, xpathSApply(ddum,'//*/a[@class="providerSearchResultSelectAction"]',xmlValue))
dspec <- c(dspec, xpathSApply(ddum,'//*/div[@class="listingHeaderLeftColumn"]/p',xmlValue))
}
paste(doctors,dspec,sep=',')
# [1] "Dr. Julia Adamian, MD,Internal Medicine"
# [2] "Dr. Eric R. Adler, MD,Internal Medicine"
# [3] "Dr. Ramzi S. Alami, MD,General Surgery"
# [4] "Dr. Jason L. Anderson, MD,Internal Medicine"
# [5] "Dr. Karl A. Anderson, MD,Urology"
# [6] "Dr. Christine E. Angeles, MD,Geriatric Medicine, Pulmonology"
It looks like they're using the query parameter
?pagenumber=x
You can probably iterate over x to get your data.
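A minimal sketch of that idea, reusing dum and noofpages from the answer above (page 1 probably also accepts ?pagenumber=1):
pages <- paste0(dum, "?pagenumber=", seq_len(noofpages))
# then read and parse each URL, e.g. docs <- lapply(pages, htmlParse)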
On a side note: I'm not sure which browser you are using, but Chrome has a handy feature where you can right-click on a button and select "Inspect element".