Text Mining Scraped Data (R)

I wrote the code below to look for the word "nationality" in a job postings dataset, where I am essentially trying to see how many employers specify that a given candidate must be of a particular visa type or nationality.
I know that in the raw data itself (in Excel), there are several cases where the word "nationality" is mentioned in the job description.
nationality_finder = function(string){
nationality = c(" ")
split_string = strsplit(string, split = NULL)
split_string = split_string[[1]]
flag = 0
for(letter in split_string){
if(flag > 0){nationality = append(nationality, letter)}
if(letter == "nationality "){flag = 1}
if(letter == " "){flag = flag-0.5}
}
nationality = paste(nationality, collapse = '')
return(nationality)
}
for(n in 1:length(df2$description)){
df2$nationality[n] <- nationality_finder(df2$description[n])
}
df2%>%
view()
The code runs without errors, but it is not producing what I am looking for. I am essentially looking to create another variable where 1 indicates that the word "nationality" is mentioned, and 0 otherwise. Specifically, I am looking for words such as "citizen" and "nationality" in the job description variable. The text under each job description is extremely long, so here I just give a summarized version for brevity.
Text example for a job description in the dataset
Title: Event Planner
Nationality: Saudi National
Location: Riyadh, Saudi Arabia
Salary: Open
Salary depends on the candidates skills, experience, and other attributes.
Another job description:
- Have recently graduated or looking for a career change and be looking for
an entry level role (we will offer full training)
- Priority will be taken for applications by U.S. nationality holders

You can try something like this. I'm assuming you have a data.frame called dats, and you want to add a new column.
dats$check <- as.numeric(grepl("nationality",dats$description,ignore.case=TRUE))
dats$check
[1] 1 1 0 1
grepl() detects the string nationality in the column dats$description, ignoring case (ignore.case = TRUE), and as.numeric() converts TRUE/FALSE into 1/0.
With fake data:
dats <- structure(list(description = c("Title: Event Planner\n \n Nationality: Saudi National\n \n Location: Riyadh, Saudi Arabia\n \n Salary: Open\n \n Salary depends on the candidates skills, experience, and other attributes.",
"- Have recently graduated or looking for a career change and be looking for\n an entry level role (we will offer full training) \n \n - Priority will be taken for applications by U.S. nationality holders ",
"do not have that word here", "aaaaNationalitybb"), check = c(1,
1, 0, 1)), row.names = c(NA, -4L), class = "data.frame")
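Since the question also mentions words such as "citizen", the same idea extends to several terms by joining them with | into one regular expression. This is only a sketch: the terms vector and the small dats frame below are made-up illustrations, not the asker's data.

```r
# Hypothetical sample data standing in for the real job postings
dats <- data.frame(
  description = c("Nationality: Saudi National",
                  "Priority will be taken for applications by U.S. nationality holders",
                  "do not have that word here",
                  "Must be a US citizen"),
  stringsAsFactors = FALSE
)

# Terms to look for (assumed list; extend as needed)
terms <- c("nationality", "citizen")

# "nationality|citizen" matches a description containing either term
pattern <- paste(terms, collapse = "|")

# 1 if any term is present (ignoring case), 0 otherwise
dats$check <- as.integer(grepl(pattern, dats$description, ignore.case = TRUE))
dats$check
```

Note that this is substring matching, so "citizen" also flags "citizenship"; whether that is desirable depends on the data.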

Related

How do I turn this code into a working function?

This is my data
[1] "the rooms were clean very comfortable and the staff was amazing they went over and beyond to help make our stay enjoyable i highly recommend this hotel for anyone visiting downtown "
[2] "excellent property and very convenient to activities front desk staff is extremely efficient, pleasant and helpful property is clean and has a fantastic old time charm "
[3] "the rooftop cafeteria of hotel was great. wen i say food was great "
I want to create a function that returns the count of positive sentiments per row. This is my code so far.
x=data$sentences[1:3]
library(tidytext)
library(tm)
bing <- get_sentiments("bing")
positive = bing %>% filter(sentiment %in% "positive")
positive = subset(positive, select = -c(sentiment))
positive = as.vector(positive$word)
positive = paste0(positive," ")
positive_reviews <- function(x) {
data = as.vector(x)
data = Corpus(VectorSource(data))
data = tm_map(data, removePunctuation)
data = as.character(data)
positive_count = sapply(positive, function(x) str_count(data,x))
return(sum(positive_count))
}
print(positive_reviews(x))
I am running into an error that I do not know how to fix.
Error: no function to return from, jumping to top level
How would I write my code to make this work?

Update Purrr loop to input data row by row in R

This question kinda builds on questions I asked earlier, but it's finally coming together and I think I know what the problem is; I just need help kicking it over the goal line. TL;DR at the bottom.
The overall goal as simply put as possible:
I have a dataframe that is from an API pull of a redcap database. It
has a few columns of information about various studies.
I'd like to go through that dataframe line by line, and push it into a different website called Oncore, through an API.
In the first of those earlier questions, I took a much simpler dataframe, took one column from it (the number), and used it to do an API pull from Oncore: it would download a record from Oncore, copy one variable it downloaded over to a different spot, and push it back in. It would do this over and over, once per row. Then it would return a simple dataframe of the row number and the API status code returned.
Now I want to get a bit more complicated: instead of just pulling a number from one column, I want to swap over a bunch of variables from my original dataframe and upload them.
The idea is for sample studies input into Redcap to be pushed into Oncore.
What I've tried:
I have this dataframe from the redcap api pull:
testprotocols<-structure(list(protocol_no = c("LS-P-Joe's API", "JoeTest3"),
nct_number = c(654321, 543210), library = structure(c(2L,
2L), levels = c("General Research", "Oncology"), class = "factor"),
organizational_unit = structure(c(1L, 1L), levels = c("Lifespan Cancer Institute",
"General Research"), class = "factor"), title = c("Testing to see if basic stuff came through",
"Testing Oncology Projects for API"), department = structure(c(2L,
2L), levels = c("Diagnostic Imaging", "Lifespan Cancer Institute"
), class = "factor"), protocol_type = structure(2:1, levels = c("Basic Science",
"Other"), class = "factor"), protocolid = 1:2), row.names = c(NA,
-2L), class = c("tbl_df", "tbl", "data.frame"))
I have used this code to try and push the data into Oncore:
##This chunk gets a random one we're going to change later
base <- "https://website.forteresearchapps.com"
endpoint <- "/website/rest/protocols/"
protocol <- "2501"
## 'results' will get changed later to plug back in
## store
protocolid <- protocolnb <- library_names <- get_codes <- put_codes <- list()
UpdateAccountNumbers <- function(protocol){
call2<-paste(base,endpoint, protocol, sep="")
httpResponse <- GET(call2, add_headers(authorization = token))
results = fromJSON(content(httpResponse, "text"))
results$protocolId<- "8887" ## doesn't seem to matter
results$protocolNo<- testprotocols$protocol_no
results$library<- as.character(testprotocols$library)
results$title<- testprotocols$title
results$nctNo<-testprotocols$nct_number
results$objectives<-"To see if the API works, specifically if you can write over a previous number"
results$shortTitle<- "Short joseph Title"
results$nctNo<-testprotocols$nct_number
results$department <- as.character(testprotocols$department)
results$organizationalUnit<- as.charater(testprotocols$organizational_unit)
results$protocolType<- as.character(testprotocols$protocol_type)
call2 <- paste(base,endpoint, protocol, sep="")
httpResponse_put <- PUT(
call2,
add_headers(authorization = token),
body=results, encode = "json",
verbose()
)
# save stats
protocolid <- append(protocolid, protocol)
protocolnb <- append(protocolnb, testprotocols$PROTOCOL_NO[match(protocol, testprotocols$PROTOCOL_ID)])
library_names <- append(library_names, testprotocols$LIBRARY[match(protocol, testprotocols$PROTOCOL_ID)])
get_codes <- append(get_codes, status_code(httpResponse_get))
put_codes <- append(put_codes, status_code(httpResponse_put))
}
## Oncology will have to change to whatever the df name is, above and below this
purrr::walk(testprotocols$protocol_no, UpdateAccountNumbers)
allresults <- tibble('protocolNo'=unlist(protocol_no),'protocolnb'=unlist(protocolnb),'library_names'=unlist(library_names), 'get_codes'=unlist(get_codes), 'put_codes'=unlist(put_codes) )
When I get to the line:
purrr::walk(testprotocols$protocol_no, UpdateAccountNumbers)
I get this error:
When I do traceback() I get this:
When I step through the loop line by line I realized that in this chunk of code:
call2<-paste(base,endpoint, protocol, sep="")
httpResponse <- GET(call2, add_headers(authorization = token))
results = fromJSON(content(httpResponse, "text"))
results$protocolId<- "8887" ## doesn't seem to matter
results$protocolNo<- testprotocols$protocol_no
results$library<- as.character(testprotocols$library)
results$title<- testprotocols$title
results$nctNo<-testprotocols$nct_number
results$objectives<-"To see if the API works, specifically if you can write over a previous number"
results$shortTitle<- "Short joseph Title"
results$nctNo<-testprotocols$nct_number
results$department <- as.character(testprotocols$department)
results$organizationalUnit<- as.charater(testprotocols$organizational_unit)
results$protocolType<- as.character(testprotocols$protocol_type)
Where I had envisioned it downloading ONE sample study and replacing aspects of it with variables from ONE row of my beginning dataframe, it's actually trying to paste everything in the column in there. I.e. results$nctNo is "654321 543210" instead of just "654321" from the first row.
TL;DR version:
I need my purrr loop to take one row at a time instead of my entire column, and I think if I do that, it'll all magically work.
Within UpdateAccountNumbers(), you are referring to entire columns of the testprotocols frame when you do things like results$nctNo<-testprotocols$nct_number.
Instead, perhaps at the top of the UpdateAccountNumbers() function, you can do something like tp = testprotocols[testprotocols$protocol_no == protocol,], and then, when you assign values to results, refer to tp instead of testprotocols.
Note that your purrr::walk() command is already passing just one value of protocol at a time to the UpdateAccountNumbers() function.
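A minimal sketch of that row-subsetting idea, with the HTTP calls left out so it runs on its own. build_payload() is a hypothetical stand-in for the body of UpdateAccountNumbers(), the data frame is a cut-down copy of testprotocols, and base lapply() is used here only to avoid a package dependency (purrr::walk() hands over the values one at a time in the same way).

```r
# Cut-down copy of the question's data (only three columns kept)
testprotocols <- data.frame(
  protocol_no = c("LS-P-Joe's API", "JoeTest3"),
  nct_number  = c(654321, 543210),
  title       = c("Testing to see if basic stuff came through",
                  "Testing Oncology Projects for API"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: builds the fields for ONE protocol only
build_payload <- function(protocol) {
  # Keep just the row whose protocol_no matches the value passed in
  tp <- testprotocols[testprotocols$protocol_no == protocol, ]
  list(
    protocolNo = tp$protocol_no,
    nctNo      = tp$nct_number,
    title      = tp$title
  )
}

# One call per protocol number, each seeing a single row
payloads <- lapply(testprotocols$protocol_no, build_payload)
payloads[[1]]$nctNo
```

This assumes protocol_no is unique per row; if it is not, tp would contain several rows and the original problem would reappear.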

searching for words in text paragraph and then flagging them in R

I have a text data set, and want to search for various words in it, then flag those when I find them. Here is sample data:
df <- data.table("id" = c(1:3), "report" = c("Travel opens our eyes to art, history, and culture – but it also introduces us to culinary adventures we may have never imagined otherwise."
, "We quickly observed that no one in Sicily cooks with recipes (just with the heart), so we now do the same."
, "We quickly observed that no one in Sicily cooks with recipes so we now do the same."), "summary" = c("On our first trip to Sicily to discover our family roots,"
, "If you’re not a gardener, an Internet search for where to find zucchini flowers results."
, "add some fresh cream to make the mixture a bit more liquid,"))
So far I have been using SQL to process this, but it gets challenging when you have a long list of words to look for.
dfOne <- sqldf("select id
, case when lower(report) like '%opens%' then 1 else 0 end as opens
, case when lower(report) like '%cooks%' then 1 else 0 end as cooks
, case when lower(report) like '%internet%' then 1 else 0 end as internet
, case when lower(report) like '%zucchini%' then 1 else 0 end as zucchini
, case when lower(report) like '%fresh%' then 1 else 0 end as fresh
from df
")
I'm looking for ideas to do this in a more efficient way. With a long list of target terms, this code gets unnecessarily long.
Thanks,
SM.
1) sqldf
Define the vector of words and then convert it to SQL. Note that case when is not needed since like already produces a 0/1 result. Prefacing sqldf with fn$ enables $like to substitute the R like character string into the SQL statement. Use the verbose=TRUE argument to sqldf to view the SQL statement generated. This is only two lines of code no matter how long words is.
words <- c("opens", "cooks", "internet", "zucchini", "fresh", "test me")
like <- toString(sprintf("\nlower(report) like '%%%s%%' as '%s'", words, words))
fn$sqldf("select id, $like from df", verbose = TRUE)
giving:
  id opens cooks internet zucchini fresh test me
1  1     1     0        0        0     0       0
2  2     0     1        0        0     0       0
3  3     0     1        0        0     0       0
2) outer
Using words from above we can use outer as follows. Note that the function (third argument) in outer must be vectorized, and we can make grepl vectorized as shown. Omit check.names = FALSE if you don't mind the column names associated with words having spaces or punctuation munged into syntactic R variable names. This produces the same output as (1).
with(df, data.frame(
id,
+t(outer(setNames(words, words), report, Vectorize(grepl))),
check.names = FALSE
))
3) sapply
Using sapply we can get a slightly shorter solution along the same lines as (2). The output is the same as in (1) and (2).
with(df, data.frame(id, +sapply(words, grepl, report), check.names = FALSE))
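One difference worth noting: the SQL version lowercases report before matching, whereas grepl is case-sensitive by default. The sketch below adds ignore.case = TRUE to the sapply approach; the shortened df and words here are illustrative stand-ins, not the original data.

```r
# Shortened, made-up stand-in for the question's data
df <- data.frame(
  id = 1:3,
  report = c("Travel opens our eyes to art",
             "No one in Sicily cooks with recipes",
             "An Internet search for zucchini flowers"),
  stringsAsFactors = FALSE
)
words <- c("opens", "cooks", "internet", "zucchini")

# One column per word; the unary + turns TRUE/FALSE into 1/0.
# ignore.case = TRUE lets "internet" match "Internet".
flags <- +sapply(words, grepl, df$report, ignore.case = TRUE)

result <- data.frame(id = df$id, flags, check.names = FALSE)
result
```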
Here is a tidyverse way. It assumes that you want to search two separate columns.
library(tidyverse)
df <- tibble(id = c(1:3), report = c("Travel opens our eyes to art, history, and culture – but it also introduces us to culinary adventures we may have never imagined otherwise."
, "We quickly observed that no one in Sicily cooks with recipes (just with the heart), so we now do the same."
, "We quickly observed that no one in Sicily cooks with recipes so we now do the same."),
summary = c("On our first trip to Sicily to discover our family roots,"
, "If you’re not a gardener, an Internet search for where to find zucchini flowers results."
, "add some fresh cream to make the mixture a bit more liquid,"))
# Vector of words
vec <- c('eyes','art','gardener','mixture','trip')
df %>%
mutate(reportFlag = case_when(
str_detect(report,paste(vec,collapse = '|')) ~ T,
T ~ F)
) %>%
mutate(summaryFlag = case_when(
str_detect(summary,paste(vec,collapse = '|')) ~ T,
T ~ F))

Matching two strings and extracting the matched character in R

Consider I have the below mentioned input character;
text_input <- c("ADOPT", "A", "FAIL", "FAST")
test <- c("TEST", "INPUT", "FAIL", "FAST")
I would like to match both inputs and extract the words from text_input that also occur in test. I would like to do something similar to str_extract.
I understand that str_extract uses a matching pattern or word to do this, but my test data consists of around 500,000 words. Any input would be really helpful.
Expected Outcome:
"FAIL", "FAST"
EDIT
Just adding one more question here... What happens when the Input is a pure string, like, the one provided below;
text_input <- c("‘Data Scientist’ has been named the sexiest job of the 21st century by Harvard Business Review. The same article tells us that “demand has raced ahead of supply” and that the lack of data scientists “is becoming a serious constraint in some sectors.” A 2011 study by McKinsey Global Institute found that “there will be a shortage of talent necessary for organizations to take advantage of big data” – a shortage to the tune of 140,000 to 190,000 in the United States alone by 2018.")
test <- c("Data Scientist", "McKinsey", "ORGANIZATIONS", "FAST")
Is it possible to perform a string match in this case as well?
Note: Changed the input and testing string.
If we need the characters to be extracted
library(stringr)
str_extract(text_input, paste0("[", test, "]+"))
If we are looking for full string match
library(data.table)
fintersect(data.table(col1 = text_input), data.table(col1 = test))
For the easy example you may use intersect() as already was stated in the comments.
text_input1 <- c("ADOPT", "A", "FAIL", "FAST")
test1 <- c("TEST", "INPUT", "FAIL", "FAST")
intersect(text_input1, test1)
# [1] "FAIL" "FAST"
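As a small illustration of the design choice, intersect() returns each common word once, while subsetting with %in% keeps the original order and any repeats in text_input. Both are exact, whole-string comparisons and are vectorized, so they should cope with large vectors. The extra "FAIL" below is added only to show the difference.

```r
# Same vectors as above, with one repeated word added for illustration
text_input <- c("ADOPT", "A", "FAIL", "FAST", "FAIL")
test <- c("TEST", "INPUT", "FAIL", "FAST")

# Unique common words
common_unique <- intersect(text_input, test)

# Every element of text_input that appears in test, repeats included
common_all <- text_input[text_input %in% test]

common_unique
common_all
```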
The long example is a little more complicated.
text_input2 <- c("‘Data Scientist’ has been named the sexiest job of the 21st century by Harvard Business Review. The same article tells us that “demand has raced ahead of supply” and that the lack of data scientists “is becoming a serious constraint in some sectors.” A 2011 study by McKinsey Global Institute found that “there will be a shortage of talent necessary for organizations to take advantage of big data” – a shortage to the tune of 140,000 to 190,000 in the United States alone by 2018.")
phrases <- c("Data Scientist", "McKinsey", "ORGANIZATIONS", "FAST")
The test string vector you've defined (I'll call it phrases) contains compound terms of two (or possibly more) words, i.e. terms containing spaces. Therefore, we need a regular expression rx1 that can handle them. It is not clear whether you want case-sensitive matches or not; for case-insensitive matching you need to apply tolower() to both the phrases and the text. Next we test whether there's a match. If so, we extend the regex to rx2 so that it works with gsub()'s replacement functionality. We Vectorize() our function so that it can handle vectors of phrases.
matchPhrase <- Vectorize(function(phr, txt, tol=FALSE) {
rx1 <- gsub(" ", "\\\\s", phr) # handle spaces
if (tol) { # optional tolower
rx1 <- tolower(rx1)
txt <- tolower(txt)
}
if (regexpr(rx1, txt) > 0) { # test for matches
rx2 <- paste0(".*(", rx1, ").*")
return(gsub(rx2, "\\1", txt)) # gsub extraction
} else {
return(NA) # we want NA for no matches
}
})
The default (tol=FALSE) is case-sensitive.
matchPhrase(phrases, text_input2, tol=FALSE)
# Data Scientist McKinsey ORGANIZATIONS FAST
# "Data Scientist" "McKinsey" NA NA
Case-insensitive matching also finds "organizations".
matchPhrase(phrases, text_input2, tol=TRUE)
# Data Scientist McKinsey ORGANIZATIONS FAST
# "data scientist" "mckinsey" "organizations" NA
For a clean output just do:
as.character(na.omit(matchPhrase(phrases, text_input2, tol=TRUE)))
# [1] "data scientist" "mckinsey" "organizations"
Note: Probably you need to adapt the function several times for your specific needs/desired outputs. Actually the quanteda package is quite sophisticated in doing this kind of stuff.
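If only the question "which phrases occur in the text?" matters, a simpler alternative is plain substring matching with grepl(fixed = TRUE), which avoids regex escaping altogether. This is a sketch of a different technique from matchPhrase() above; find_phrases() and the shortened txt are hypothetical names and data introduced for illustration.

```r
# Shortened, made-up text standing in for text_input2
txt <- paste("A 2011 study by McKinsey Global Institute found a shortage",
             "of talent necessary for organizations;",
             "the Data Scientist role is in demand.")
phrases <- c("Data Scientist", "McKinsey", "ORGANIZATIONS", "FAST")

# Hypothetical helper: returns the phrases found as literal substrings of txt
find_phrases <- function(phrases, txt, ignore_case = FALSE) {
  needles <- if (ignore_case) tolower(phrases) else phrases
  hay     <- if (ignore_case) tolower(txt) else txt
  # fixed = TRUE treats each phrase as a literal string, not a regex
  found <- vapply(needles, function(p) grepl(p, hay, fixed = TRUE), logical(1))
  phrases[found]
}

find_phrases(phrases, txt)
find_phrases(phrases, txt, ignore_case = TRUE)
```

Unlike matchPhrase(), this returns the phrases as written in phrases rather than the matched text, and it expects a single text string.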
This can also be achieved using the package fuzzyjoin, which contains a way to join df's based on regex.
text_input <- c("ADOPT", "A", "FAIL", "FAST")
regex <- c("TEST", "INPUT", "FAIL", "FAST")
library(fuzzyjoin)
library(dplyr)
df <- tibble( text = text_input )
df.regex <- tibble( regex_name = regex )
# now we can regex match them
df %>%
regex_left_join( df.regex, by = c( text = "regex_name" ) )
# # A tibble: 4 x 2
# text regex_name
# <chr> <chr>
# 1 ADOPT NA
# 2 A NA
# 3 FAIL FAIL
# 4 FAST FAST
#or only regex 'hits'
df %>%
regex_inner_join( df.regex, by = c( text = "regex_name" ) )
# # A tibble: 2 x 2
# text regex_name
# <chr> <chr>
# 1 FAIL FAIL
# 2 FAST FAST

In R, How do you extract multiple matched terms as string and match if TRUE with Regex or Grep?

I'm still a beginner in R. I need help with some code that searches a vector for terms in a list and returns TRUE. If TRUE, return a string of matched terms.
I have it set to tell me if terms match and return the first matched term but I'm not sure how to get the rest of the matched terms.
In the attached code, I have my Desired_Output and the imperfect Final_Output.
#create dataset of 2 columns/vectors. 1st column is "Job Title", 2nd column is "Work Experience"
'Work Experience' <- c("cooked food; cleaned house; made beds", "analyzed data; identified gaps; used sql, python, and r", "used tableau to make dashboards for clients; applied advanced macro excel functions", "financial planning and strategy; consulted with leaders and clients")
'Job Title' <- c("dad", "research analyst", "business intelligence consultant", "finance consultant")
Job_Hist <- data.frame(`Job Title`, `Work Experience`)
#create list of terms to search for in Job_Hist
Term_List <- c("python", " r", "sql", "tableau", "excel")
#use grepl to search the Work Experience vector for terms in Term_List THEN return TRUE or FALSE
Term_TF<- grepl(paste(Term_List, collapse = '|'),Job_Hist$Work.Experience)
#add a new column to our final output dataframe that shows if the job experience matched our terms
Final_Output<-Job_Hist
Final_Output$Term_Test <- Term_TF
#Let's see what terms caused the TRUE Flag in the Final_Output
m<-regexpr(paste(Term_List, collapse = '|'),
Job_Hist$Work.Experience, perl=TRUE)
T_Match <- regmatches(Job_Hist$Work.Experience,m)
#Compare Final_Output to my Desired_Output and please help me :)
Desired_T_Match <- c("NA", "sql, python, r", "tableau, excel", "NA")
Desired_Output <- data.frame(`Job Title`, `Work Experience`, Term_TF, Desired_T_Match)
#I need 2 things.
#1) a way to tie T_Match back to Final_Output... something like if, TRUE then match
#2) a way to return every term matched in a comma delimited string. Example: research analyst analyzed data... TRUE sql, python
You can use stringr::str_extract_all to get a list of matches from each row:
library(stringr)
library(tidyverse)
Job_Hist$matches <- str_extract_all(Job_Hist$Work.Experience,
paste(Term_List, collapse = '|'), simplify = TRUE)
Work.Experience Term matches.1 matches.2
1 cooked food; cleaned house; made beds FALSE
2 analyzed data; identified gaps; used sql, python, and r TRUE sql python
3 used tableau to make dashboards for clients; applied advanced macro excel functions TRUE tableau excel
4 financial planning and strategy; consulted with leaders and clients FALSE
matches.3
1
2 r
3
4
Edit: if you'd rather have matches in one column as a comma separated string, you can use:
str_extract_all(Job_Hist$Work.Experience, paste(Term_List, collapse = '|')) %>%
sapply(., paste, collapse = ", ")
matches
1
2 sql, python, r
3 tableau, excel
4
Note that if you use the default argument simplify = FALSE in str_extract_all, your column matches will look correct when printed, like the result we get with sapply above. However, if you inspect it with str() you'll see that the column is actually a list with one character vector per row, which will cause problems for some types of analysis.
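For comparison, here is a base R sketch of the same extraction without stringr, using gregexpr() with regmatches() to collect every match per row. It returns NA where nothing matched, as in the Desired_Output. The lowercase variable names are illustrative; trimws() is used because the term " r" carries a leading space.

```r
work_experience <- c(
  "cooked food; cleaned house; made beds",
  "analyzed data; identified gaps; used sql, python, and r",
  "used tableau to make dashboards for clients; applied advanced macro excel functions",
  "financial planning and strategy; consulted with leaders and clients"
)
term_list <- c("python", " r", "sql", "tableau", "excel")
pattern <- paste(term_list, collapse = "|")

# gregexpr() finds ALL matches per string (regexpr() only finds the first)
hits <- regmatches(work_experience, gregexpr(pattern, work_experience))

# Collapse each row's matches into one comma separated string, NA if none
matched <- vapply(hits, function(h) {
  if (length(h) == 0) NA_character_ else paste(trimws(h), collapse = ", ")
}, character(1))

term_tf <- !is.na(matched)
data.frame(term_tf, matched)
```

Be aware that the pattern " r" matches a space followed by any word starting with r, so on other data it may flag more than the language R.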
