Replace string in one column by another column - r

I have a dataframe with two columns, and I want to replace part of the string in the first column with the value from the second column. In the dataframe below, I want the 'em' tag to be replaced by an 'a' tag whose href is the URL in the urls column.
df1 = data.frame(text = c("I like <em'>Rstudio</em> very much",
"<em'> Anaconda</em> is an amazing data science tool"),
urls = c('https://www.rstudio.com/', 'https://anaconda.org/'))
I am looking for a vector like below.
text = c("I like <a href = 'https://www.rstudio.com/'>Rstudio</a> very much",
"<a href = 'https://anaconda.org/'> Anaconda</a> is an amazing data science tool")

One option uses gsub with mapply. Note that this replaces the whole tag, including its contents, with the bare URL:
mapply(function(x,y)gsub("<em'>.*</em>",x,y),df1$urls, df1$text)
# [1] "I like https://www.rstudio.com/ very much"
# [2] "https://anaconda.org/ is an amazing data science tool"
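To get the 'a' tag that the question asks for, here is a minimal sketch in base R. It assumes each text contains exactly one opening and one closing tag; link_text is a hypothetical helper name used only for illustration.

```r
df1 <- data.frame(
  text = c("I like <em'>Rstudio</em> very much",
           "<em'> Anaconda</em> is an amazing data science tool"),
  urls = c("https://www.rstudio.com/", "https://anaconda.org/"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: swap the opening tag for an <a> tag carrying the URL,
# then swap the closing tag. fixed = TRUE treats the patterns as literal text.
link_text <- function(text, url) {
  opened <- sub("<em'>", paste0("<a href = '", url, "'>"), text, fixed = TRUE)
  sub("</em>", "</a>", opened, fixed = TRUE)
}

# mapply pairs each text with the URL from the same row;
# unname drops the names that mapply attaches to the result.
result <- unname(mapply(link_text, df1$text, df1$urls))
result
```

sub() is vectorized over the text but not over the replacement, which is why mapply is still needed to pair each row with its own URL.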

Related

Text Mining Scraped Data (R)

I wrote the code below to look for the word "nationality" in a job postings dataset. I am essentially trying to see how many employers specify that a candidate must be of a particular visa type or nationality.
I know that in the raw data itself (in Excel), there are several job descriptions where the word "nationality" is mentioned.
nationality_finder = function(string){
nationality = c(" ")
split_string = strsplit(string, split = NULL)
split_string = split_string[[1]]
flag = 0
for(letter in split_string){
if(flag > 0){nationality = append(nationality, letter)}
if(letter == "nationality "){flag = 1}
if(letter == " "){flag = flag-0.5}
}
nationality = paste(nationality, collapse = '')
return(nationality)
}
for(n in 1:length(df2$description)){
df2$nationality[n] <- nationality_finder(df2$description[n])
}
df2%>%
view()
The code runs without errors, but it does not produce what I am looking for. I want to create another variable where 1 indicates that the word "nationality" is mentioned, and 0 otherwise. Specifically, I am looking for words such as "citizen" and "nationality" in the job description variable. The text of each job description is extremely long; I give a summarized version here for brevity.
Text example for a job description in the dataset
Title: Event Planner
Nationality: Saudi National
Location: Riyadh, Saudi Arabia
Salary: Open
Salary depends on the candidates skills, experience, and other attributes.
Another job description:
- Have recently graduated or looking for a career change and be looking for
an entry level role (we will offer full training)
- Priority will be taken for applications by U.S. nationality holders
You can try something like this. I'm assuming you have a data.frame called dats and want to add a new column.
dats$check <- as.numeric(grepl("nationality",dats$description,ignore.case=TRUE))
dats$check
[1] 1 1 0 1
grepl() detects the string nationality in the column dats$description, ignoring case (ignore.case = TRUE), and as.numeric() converts TRUE/FALSE into 1/0.
With fake data:
dats <- structure(list(description = c("Title: Event Planner\n \n Nationality: Saudi National\n \n Location: Riyadh, Saudi Arabia\n \n Salary: Open\n \n Salary depends on the candidates skills, experience, and other attributes.",
"- Have recently graduated or looking for a career change and be looking for\n an entry level role (we will offer full training) \n \n - Priority will be taken for applications by U.S. nationality holders ",
"do not have that word here", "aaaaNationalitybb"), check = c(1,
1, 0, 1)), row.names = c(NA, -4L), class = "data.frame")
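Since the question also mentions "citizen", the same idea extends to several words by joining them with | into one pattern. A small sketch; the descriptions vector is made-up illustration data (the third entry is not from the question), and terms, pattern and flag are names chosen here:

```r
# Illustrative descriptions; only the wording of entries 1, 2 and 4
# is taken from the examples above.
descriptions <- c(
  "Nationality: Saudi National",
  "Priority will be taken for applications by U.S. nationality holders",
  "Must be a U.S. citizen",
  "do not have that word here"
)

# Join the search terms into a single alternation: "nationality|citizen"
terms <- c("nationality", "citizen")
pattern <- paste(terms, collapse = "|")

# 1 if any of the terms occurs (ignoring case), 0 otherwise
flag <- as.integer(grepl(pattern, descriptions, ignore.case = TRUE))
flag
```

Adding more terms only requires extending the terms vector.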

Stprl from a dataframe to another

Given a dataframe of words like this:
df_words <- data.frame(words = c("4 Google", "5Amazon", "4sec"))
how can I replace those words wherever they appear in the rows of a dataframe like this:
df <- data.frame(id = c(1,2,4), text = c("Increase for 4 Google", "There is a slight decrease for 5Amazon", "I will need 4sec more"), stringsAsFactors = FALSE)
Each word listed in df_words should be replaced with a specific word, like this:
"4 Google|5Amazon" -> "stock"
"4sec" -> "time"
Example of expected output
data.frame(id = c(1,2,4), text = c("Increase for stock", "There is a slight decrease for stock", "I will need time more"), stringsAsFactors = FALSE)
I recommend the stringi library. Example:
library(stringi)
strings = c("Increase for 4 Google", "There is a slight decrease for 5Amazon", "I will need 4sec more")
patterns = c("4 Google", "5Amazon", "4sec")
replacements = c("stock", "stock", "time")
strings = stri_replace_all_fixed(strings,patterns,replacements)
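Note that the call above pairs strings[i] with patterns[i] element by element (stringi's default), so it works here only because each pattern happens to line up with the string that contains it. Here is a minimal base R sketch, assuming you want every pattern applied to every string; replace_all_fixed is a hypothetical helper name, not a stringi function:

```r
strings <- c("Increase for 4 Google",
             "There is a slight decrease for 5Amazon",
             "I will need 4sec more")
patterns <- c("4 Google", "5Amazon", "4sec")
replacements <- c("stock", "stock", "time")

# Hypothetical helper: apply each literal pattern/replacement pair
# in turn to the whole vector of strings.
replace_all_fixed <- function(x, patterns, replacements) {
  for (i in seq_along(patterns)) {
    x <- gsub(patterns[i], replacements[i], x, fixed = TRUE)
  }
  x
}

# Reversing the strings shows that the result no longer depends on
# the strings and patterns being in matching order.
out <- replace_all_fixed(rev(strings), patterns, replacements)
out
```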
However, you probably want to handle many stocks and many times, so you might be better off doing something like this:
stocks = c("4 Google", "5Amazon")
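# vectorize_all = FALSE applies every pattern in stocks to every string
strings = stri_replace_all_fixed(strings,stocks,'stock',vectorize_all=FALSE)
# backslashes are doubled inside an R string; the replacement is the quoted string 'time'
strings = stri_replace_all_regex(strings,'\\b[0-9]+sec\\b','time')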
\b[0-9]+sec\b is a regular expression meaning:
word boundary
one or more number characters
"sec"
word boundary
This will include strings such as "2sec" but exclude those such as "1sector"
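The word-boundary behaviour can be checked with a short base R sketch. Inside an R string literal the backslash itself must be escaped, so the boundary is written "\\b". The vector x is made-up test data:

```r
# Made-up test strings: two contain a standalone "<digits>sec" token,
# the third contains "1sector", which should be left alone.
x <- c("I will need 4sec more", "wait 2sec", "1sector is closed")

# \\b on both sides means the digits+sec run must be a whole word
out <- gsub("\\b[0-9]+sec\\b", "time", x, perl = TRUE)
out
```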

add column of listed keywords(strings) based on text column

If I have a dataframe with the following column:
df$text <- c("This string is not that long", "This string is a bit longer but still not that long", "This one just helps with the example")
and strings like so:
keywords <- c("not that long", "This string", "example", "helps")
I am trying to add a column to my dataframe with a list of the keywords that exist in the text for each row:
df$keywords:
1 c("This string","not that long")
2 c("This string","not that long")
3 c("helps","example")
However, I'm unsure 1) how to extract the matching words from the text column and 2) how to list those matching words for each row in the new column.
We can extract with str_extract_all from stringr
library(stringr)
df$keywords <- str_extract_all(df$text, paste(keywords, collapse = "|"))
df
# text keywords
#1 This string is not that long This string, not that long
#2 This string is a bit longer but still not that long This string, not that long
#3 This one just helps with the example helps, example
Or in a chain
library(dplyr)
df %>%
mutate(keywords = str_extract_all(text, paste(keywords, collapse = "|")))
Maybe like this:
df = data.frame(text=c("This string is not that long", "This string is a bit longer but still not that long", "This one just helps with the example"))
keywords <- c("not that long", "This string", "example", "helps")
df$keywords = lapply(df$text, function(x) {keywords[sapply(keywords,grepl,x)]})
Output:
text keywords
1 This string is not that long not that long, This string
2 This string is a bit longer but still not that long not that long, This string
3 This one just helps with the example example, helps
The outer lapply loops over df$text, and the inner sapply checks, for each element of keywords, whether it occurs in that element of df$text. A slightly longer but perhaps easier-to-read equivalent would be:
df$keywords = lapply(df$text, function(x) {keywords[sapply(keywords, function(y){grepl(y,x)})]})
Hope this helps!
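One caveat with the grepl approach: each keyword is treated as a regular expression, so a keyword containing characters such as . or ( would not be matched literally. A minimal sketch that matches keywords as plain text with fixed = TRUE; find_keywords is a hypothetical helper name:

```r
df <- data.frame(
  text = c("This string is not that long",
           "This string is a bit longer but still not that long",
           "This one just helps with the example"),
  stringsAsFactors = FALSE
)
keywords <- c("not that long", "This string", "example", "helps")

# Hypothetical helper: return the keywords that occur literally in x.
# vapply guarantees one TRUE/FALSE per keyword.
find_keywords <- function(x, keywords) {
  hits <- vapply(keywords, function(k) grepl(k, x, fixed = TRUE), logical(1))
  keywords[hits]
}

# One character vector of matching keywords per row, stored as a list column
df$keywords <- lapply(df$text, find_keywords, keywords = keywords)
df$keywords
```

The matches come back in the order of the keywords vector, not in the order they appear in the text.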

How to use a look up table to create a new row in a data frame?

I have two data frames. One contains the data I am trying to clean/modify(df_x) and one is a lookup table(df_y). There is a column df_x$TEXT that contains a string like "Some text - with more" and the lookup table df_y looks like this:
SORT           ABB
-------------- ----
Some Text      ST
I want to see if a value in df_y$SORT is in the df_x$TEXT for every row of df_x. If there is a match then take the df_y$ABB value at that matched row and add it to a new column in df_x like df_x$TEXT_ABB.
For the information above the algorithm would see that "Some Text" is in "Some text - with more" (ignoring case) so it would add the value "ST" to the column df_x$TEXT_ABB.
I know I can use match and/or a combination of sapply and grep to check whether a match exists, but I cannot figure out how to do this AND grab the abbreviation so that I can map it back to a new column in the original dataframe.
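You can try this:
df_x <- data.frame(TEXT=c("Some Text 001", "other text", "Some Text 002"))
df_y <- read.table(header=TRUE, text=
'SORT ABB
"Some Text" ST
"Other Text" OT')
# lapply always returns a list, even when every pattern matches the same number of rows
L <- lapply(as.character(df_y$SORT), grep, x=df_x$TEXT, ignore.case=TRUE)
df_x$abb <- NA
# for each lookup entry, write its abbreviation into the rows it matched
for (l in seq_along(L)) if (length(L[[l]])!=0) df_x$abb[L[[l]]] <- as.character(df_y$ABB[l])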
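An alternative is to look up each row of df_x in turn. This is only a sketch, assuming the first matching entry of the lookup table should win and that rows without a match get NA; lookup_abb is a hypothetical helper name, and the "no match" row is added purely for illustration:

```r
df_x <- data.frame(TEXT = c("Some Text 001", "other text", "Some Text 002", "no match"),
                   stringsAsFactors = FALSE)
df_y <- data.frame(SORT = c("Some Text", "Other Text"),
                   ABB = c("ST", "OT"),
                   stringsAsFactors = FALSE)

# Hypothetical helper: return the abbreviation of the first key found
# in text (case-insensitive, literal match), or NA if none is found.
lookup_abb <- function(text, keys, values) {
  hits <- vapply(keys,
                 function(k) grepl(tolower(k), tolower(text), fixed = TRUE),
                 logical(1))
  if (any(hits)) values[which(hits)[1]] else NA_character_
}

df_x$TEXT_ABB <- vapply(df_x$TEXT, lookup_abb, character(1),
                        keys = df_y$SORT, values = df_y$ABB,
                        USE.NAMES = FALSE)
df_x
```

Lower-casing both sides stands in for ignore.case here, because ignore.case is not used together with fixed = TRUE.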

How can I extract the title from a name in a column?

I have a column of names of the form "Hobs, Mr. jack", i.e. "lastname, title. firstname". The title has 4 types: "Mr.", "Mrs.", "Miss.", "Master.". How can I search each item in the column and return the title, so that I can store it in another column?
Name <- c("Hobs, Mr. jack","Hobs, Master. John","Hobs, Mrs. Nicole",........)
desired output - a column "title" with values - ("Mr","Master", "Mrs",.....)
I have tried something like this:
f <- function(d) {
if (grep("Mr", d$title)) {
gsub("$Mr$", "Mr", d$title, ignore.case = T)
}
}
no success >.<
Maybe something like this:
library(stringr)
Name <- c("Hobs, Mr. jack","Hobs, Master. John","Hobs, Mrs. Nicole")
str_extract(string = Name,pattern = "(Mr|Master|Mrs)\\.")
[1] "Mr." "Master." "Mrs."
A fancier regex might exclude the period up front, or you could remove them in a second step.
Assuming the dataset is named df and the column is Name, the new column will be Title.
df$Title <- gsub('(.*, )|(\\..*)', '', df$Name)
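If you would rather capture the title than delete everything around it, here is a base R sketch using a capture group. It assumes every name follows the "lastname, title. firstname" form; the "Miss." entry is added here only for illustration:

```r
Name <- c("Hobs, Mr. jack", "Hobs, Master. John",
          "Hobs, Mrs. Nicole", "Hobs, Miss. Ann")

# Capture the run of letters between ", " and the following period,
# and replace the whole string with just that captured group.
title <- sub("^.*, *([A-Za-z]+)\\..*$", "\\1", Name)
title
```

Because the pattern does not list the titles explicitly, it also picks up titles beyond the four mentioned in the question.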
