Checking if phrase contains a certain word

I have a column containing random names. I would like to write code that creates another column (using the mutate function) that checks whether the name contains the word "Mr." and, if so, sets the new column to "Male".

Using dplyr and stringr:
library(stringr)
library(dplyr)
df <- data.frame(name = c("Mr. Robinson", "Mrs. robinson", "Gandalf","asdMr.dfa"))
# inside mutate(), refer to the column as name rather than df$name
df <- df %>% mutate(male = ifelse(str_detect(name, fixed("Mr.")), TRUE, FALSE))
Output:
> df
name male
1 Mr. Robinson TRUE
2 Mrs. robinson FALSE
3 Gandalf FALSE
4 asdMr.dfa TRUE
Be aware that this matches the phrase "Mr." anywhere in the string, not just at the beginning. If you don't want that, use a regular expression anchored to the start of the string:
df <- df %>% mutate(male = ifelse(str_detect(name, "^Mr\\."), TRUE, FALSE))
> df
name male
1 Mr. Robinson TRUE
2 Mrs. robinson FALSE
3 Gandalf FALSE
4 asdMr.dfa FALSE
This could also be achieved without the stringr package (inspired by @akrun):
df <- df %>% mutate(male = ifelse(grepl("^Mr\\.", name), TRUE, FALSE))
EDIT:
@docendo discimus pointed out that the ifelse() isn't necessary, since we're creating a logical column and that's exactly what grepl returns. So:
df <- df %>% mutate(male = grepl("^Mr\\.", name))
Without dplyr:
df <- transform(df, male = grepl("^Mr\\.", name))
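The question asked for the label "Male" rather than a logical column. A minimal sketch of that variant in base R, assuming rows without a leading "Mr." should be left as NA (the question does not say what they should contain) and using a hypothetical column name gender:

```r
df <- data.frame(name = c("Mr. Robinson", "Mrs. robinson", "Gandalf", "asdMr.dfa"),
                 stringsAsFactors = FALSE)

# "^Mr\\." anchors the match at the start of the string and escapes the dot
# NA for non-matching rows is an assumption, not something the question specifies
df$gender <- ifelse(grepl("^Mr\\.", df$name), "Male", NA_character_)
df
```

Only the first row is labelled, because "Mrs." has an "s" where the pattern expects a literal dot and "asdMr.dfa" does not start with "Mr.".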

Related

How do I access multiple elements (list) in a dataframe

I have an example dataframe as below.
pr_id    product                   name
id_234   onion,bean                chris
id_34d   apple                     tom
id_87t   plantain, potato, apple   tex
I want to access the product column and create a new column that is 1 if apple is in the list and 0 if not.
So I expect a result like this:
pr_id    product                   name    result
id_234   onion,bean                chris   0
id_34d   apple                     tom     1
id_87t   plantain, potato, apple   tex     1
I thought of something like this:
my_df$result <- ifelse(my_df$product == 'apple', 1,0)
but this only works for rows 1 and 2; it does not work for the last row, which has multiple elements.
How do I go about this?

With dplyr, dataframe kindly taken from p. Paccioretti
Thanks to AnilGoyal for stringr::str_detect
library(dplyr)
# construct the dataframe
pr_id = c("id_234", "id_34d", "id_87t")
product = c("onion,bean",
            "apple", "plantain, potato, apple")
name = c("chris", "tom","tex")
my_df <- data.frame(pr_id, product, name)
# check with case_when and str_detect if apple is in product
my_df <- my_df %>%
  mutate(result = case_when(stringr::str_detect(product, "apple") ~ 1,
                            TRUE ~ 0)
  )

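You can use agrepl, which searches for approximate (fuzzy) matches within a string, so it also tolerates small misspellings; use grepl instead if you only want exact substring matches. If you use ==, you are testing whether the whole string is exactly equal to 'apple'.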
my_df <-
structure(
list(
pr_id = c("id_234", "id_34d", "id_87t"),
product = c("onion,bean",
"apple", "plantain, potato, apple"),
name = c("chris", "tom",
"tex")
),
class = "data.frame",
row.names = c(NA, -3L)
)
my_df$result <- ifelse(agrepl('apple', my_df$product), 1,0)
Or a tidyverse approach
library(dplyr)
my_df <-
my_df %>%
mutate(result = as.numeric(agrepl('apple', product)))
my_df
#> pr_id product name result
#> 1 id_234 onion,bean chris 0
#> 2 id_34d apple tom 1
#> 3 id_87t plantain, potato, apple tex 1

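Using str_count (note that it returns the number of occurrences, so a product that mentions apple twice would give 2 rather than 1)
library(dplyr)
library(stringr)
my_df %>%
  mutate(result = str_count(product, 'apple'))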

I would use the str_detect option in stringr (tidyverse option).
my_df <- my_df %>%
mutate(result = ifelse(str_detect(product, "apple"), 1, 0))
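Substring matching would also flag a product such as "pineapple". If that matters, a rough sketch in base R is to split each string into its elements and test for an exact element instead. It assumes the separator is always a comma with optional surrounding spaces, as in the example data:

```r
my_df <- data.frame(
  pr_id = c("id_234", "id_34d", "id_87t"),
  product = c("onion,bean", "apple", "plantain, potato, apple"),
  name = c("chris", "tom", "tex"),
  stringsAsFactors = FALSE
)

# split on commas, then trim whitespace so " apple" becomes "apple"
items <- lapply(strsplit(my_df$product, ",", fixed = TRUE), trimws)

# 1 if "apple" is one of the elements in that row, 0 otherwise
my_df$result <- as.integer(vapply(items, function(x) "apple" %in% x, logical(1)))
my_df
```

The vapply call returns one logical per row, which as.integer turns into the 0/1 column the question asks for.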

Use of a like operator in dplyr

I have a basketball data with a bunch of different types of shots, and I want to reduce the number of different names. For example, I have 'stepback jumpshot' and 'pull up jumpshot'.
I want to add a new variable that does something like:
df %>% mutate(NewVar== case when(Var1 like jumpshot then Jumpshot))
so all my different jumpshots are renamed as Jumpshot.

To elaborate on @r2evans' comment, what you are looking for is grepl(). This function tells you whether a pattern occurs in a string and returns TRUE or FALSE. You don't actually need mutate or case_when, and could do it with base R:
Var1 <- c("Free Throw", "stepback jumpshot", "pull up jumpshot", "hail mary")
df <- data.frame(Var1, stringsAsFactors = FALSE)
df$Var2 <- ifelse(grepl("jumpshot", df$Var1, fixed = TRUE), "Jumpshot", df$Var1)
df
# Var1 Var2
# 1 Free Throw Free Throw
# 2 stepback jumpshot Jumpshot
# 3 pull up jumpshot Jumpshot
# 4 hail mary hail mary
But if you really want to use dplyr functions, the case_when statement @r2evans gave will work:
library(dplyr)
Var1 <- c("Free Throw", "stepback jumpshot", "pull up jumpshot", "hail mary")
df <- data.frame(Var1, stringsAsFactors = FALSE)
df2 <- df %>%
  mutate(Var2 = case_when(grepl("jumpshot", Var1) ~ "Jumpshot",
                          grepl("block", Var1) ~ "Block",
                          TRUE ~ Var1))
df2
# Var1 Var2
# 1 Free Throw Free Throw
# 2 stepback jumpshot Jumpshot
# 3 pull up jumpshot Jumpshot
# 4 hail mary hail mary

Don't forget str_detect from stringr...
library(dplyr)
library(stringr)
Var1 <- c("Free Throw", "stepback jumpshot", "pull up jumpshot", "hail mary")
df <- data.frame(Var1, stringsAsFactors = FALSE)
df2 <- df %>%
  mutate(Var2 = case_when(str_detect(Var1, "jumpshot") ~ "Jumpshot",
                          str_detect(Var1, "block") ~ "Block",
                          TRUE ~ Var1))
It's a little faster than grepl (see "What's the difference between the str_detect function in stringer and grepl and grep?")
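A SQL-style "like" is often case-insensitive, whereas grepl and str_detect are case-sensitive by default. A small sketch in base R, using a made-up vector shots with mixed capitalisation, shows how ignore.case = TRUE closes that gap:

```r
# hypothetical shot descriptions with inconsistent capitalisation
shots <- c("Free Throw", "Stepback Jumpshot", "pull up jumpshot", "hail mary")

# ignore.case = TRUE makes the pattern match "Jumpshot" as well as "jumpshot"
relabelled <- ifelse(grepl("jumpshot", shots, ignore.case = TRUE), "Jumpshot", shots)
relabelled
```

Note that ignore.case cannot be combined with fixed = TRUE, so the pattern here is treated as a regular expression.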

Lookup word in data frame, return column name?

I wish to search a dataframe (really, a categorized word list), and if the word is found, it returns the column name; if it is not found, it simply reproduces the word. The basic idea is below but I can't get it to work as expected:
#data frame to be searched
words <- data.frame(people=c("Mike", "Tom", "Molly", "Susan"),
dogs=c("Rex", "Fido", "King", "Roy"))
#data frame to work with
d <- data.frame(name=c("Roy","Tom", "Pat"))
d %>% mutate(
returned = ifelse(name %in% d, colnames(), name)
)
This returns:
name returned
1 Roy 2
2 Tom 3
3 Pat 1
However, it should return
name returned
1 Roy dogs
2 Tom people
3 Pat Pat
I feel like my script is close, but not sure what to do to fix it.
Any help is appreciated!

The numbers in the 'returned' column appear because 'name' is a factor, and ifelse() returns its underlying integer codes. This can be avoided by creating character columns with stringsAsFactors = FALSE when building the data.frame (the default since R 4.0.0), or by using as.character(name).
d <- data.frame(name=c("Roy","Tom", "Pat"), stringsAsFactors = FALSE)
words <- data.frame(people=c("Mike", "Tom", "Molly", "Susan"),
dogs=c("Rex", "Fido", "King", "Roy"), stringsAsFactors = FALSE)
In addition to the factor issue, the ifelse() call in the OP's code never uses the key/value dataset 'words': name %in% d compares the 'name' column against the data.frame 'd' itself, which is FALSE for every row. The second argument, colnames() with no argument, would normally raise an error, but because the test is FALSE everywhere, ifelse() only evaluates the 'no' value, i.e. 'name'.
d %>%
mutate(i1 = name %in% d)
# name i1
#1 Roy FALSE
#2 Tom FALSE
#3 Pat FALSE
Because 'name' is a factor, its values are coerced to integer codes, and that is what is shown in the output.
We can use pivot_longer to convert to 'long' format and then do a right_join
library(dplyr)
library(tidyr)
words %>%
pivot_longer(everything()) %>%
right_join(d, by = c('value' = 'name')) %>%
mutate(name = ifelse(is.na(name), value, name)) %>%
select(returned = name, name = value)
# returned name
#1 dogs Roy
#2 people Tom
#3 Pat Pat
Or we can use case_when without any reshaping
d %>%
mutate(returned = case_when(name %in% words$people ~ 'people',
name %in% words$dogs ~ 'dogs',
TRUE ~ as.character(name)))
# name returned
#1 Roy dogs
#2 Tom people
#3 Pat Pat
Or using only base R
d$returned <- with(stack(words), as.character(ind[match(d$name, values)]))
d$returned[is.na(d$returned)] <- d$name[is.na(d$returned)]
d
# name returned
#1 Roy dogs
#2 Tom people
#3 Pat Pat

We can reshape the words data into long format and then do a left_join. Returned values that have no match can be replaced with the name value.
library(dplyr)
d %>%
left_join(tidyr::pivot_longer(words, cols = names(words), names_to = 'returned'),
by = c('name' = 'value')) %>%
mutate(returned = coalesce(returned, name))
# name returned
#1 Roy dogs
#2 Tom people
#3 Pat Pat
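The same lookup can also be sketched in base R with a named vector that maps each word to its column name. This is only an illustration: lookup and hit are names introduced here, and it assumes every word appears in at most one column.

```r
words <- data.frame(people = c("Mike", "Tom", "Molly", "Susan"),
                    dogs = c("Rex", "Fido", "King", "Roy"),
                    stringsAsFactors = FALSE)
d <- data.frame(name = c("Roy", "Tom", "Pat"), stringsAsFactors = FALSE)

# named vector: names are the words, values are the column each word came from
lookup <- setNames(rep(names(words), each = nrow(words)),
                   unlist(words, use.names = FALSE))

# indexing by name gives NA for words that are not in the list
hit <- unname(lookup[d$name])
d$returned <- ifelse(is.na(hit), d$name, hit)
d
```

Indexing a named vector by a character vector is a compact way to do a key/value lookup when the table is small and the keys are unique.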

R: Collapse "wide" data to a single concatenated column based on binary "Yes/No"

How can I collapse data in a wide format (see example below), into a concatenated column showing only the TRUE values? I want to end up with a data table in the format Employee Name | "string of applicable column headers" as illustrated in demoOUT.
library(data.table)
demoIN <- data.table(
Name=c("Mike Jones","Bobby Fisher"),
A=c(1,0),
B=c(1,1),
C=c(0,0),
D=c(1,1))
Name A B C D
1: Mike Jones 1 1 0 1
2: Bobby Fisher 0 1 0 1
demoOUT <- data.table(
Name=c("Mike Jones","Bobby Fisher"),
Cases =c("A,B,D","B,D"))
Name Cases
1: Mike Jones A,B,D
2: Bobby Fisher B,D

A solution uses functions from dplyr and tidyr. demoIN2 is the final output.
library(dplyr)
library(tidyr)
demoIN2 <- demoIN %>%
gather(Cases, Value, -Name) %>%
filter(Value == 1) %>%
group_by(Name) %>%
summarise(Cases = paste(Cases, collapse = ","))

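Here is a solution using apply, if you are interested (the column selection demoIN[, -c("Name")] relies on data.table syntax, which works here because demoIN is a data.table).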
demoIN$Cases <- apply(demoIN[, -c("Name")], 1, function(x) paste(na.omit(ifelse(x == 1, names(x), NA)), collapse = ","))
demoIN <- demoIN[,c("Name","Cases")]

Here is an option using data.table (as the initial object is a data.table)
library(data.table)
melt(demoIN, id.var = 'Name')[value==1, .(Cases = paste(variable, collapse=',')), Name]
# Name Cases
#1: Mike Jones A,B,D
#2: Bobby Fisher B,D
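For a plain data.frame with no extra packages, the same collapse can be sketched as below. demo, flags and demo_out are illustrative names, and the code assumes that every column except the first is a 0/1 indicator:

```r
demo <- data.frame(Name = c("Mike Jones", "Bobby Fisher"),
                   A = c(1, 0), B = c(1, 1), C = c(0, 0), D = c(1, 1),
                   stringsAsFactors = FALSE)

# logical matrix: TRUE where the indicator column equals 1
flags <- as.matrix(demo[, -1]) == 1

# for each row, paste together the names of the columns that are TRUE
cases <- apply(flags, 1, function(row) paste(colnames(flags)[row], collapse = ","))

demo_out <- data.frame(Name = demo$Name, Cases = unname(cases),
                       stringsAsFactors = FALSE)
demo_out
```

Unlike the reshape-and-filter approaches, this keeps a row (with an empty string) for anyone who has no indicator set to 1.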

Change select columns from character to integers

I have a dataframe as such:
sample = data.frame(
beer_brewerId = c("8481", "8481", "8481"),
rev_app = c("4/5","1/5", "2/5"),
beer_name = c("John Harvards Simcoe IPA", "John Harvards Simcoe IPA", "John Harvards American Brown Ale"),
review_taste =c("6/10", "7/10", "6/10"), stringsAsFactors = FALSE
)
str(sample)
I would like to convert only columns 2 and 4 from character to integer for analysis purposes. Normally this would not be difficult: if I wanted to convert every character column to numeric, the following code would do it. But it does not work here, because I want to keep column 3 as chr:
sample %>%
select(2,4) %>%
mutate_if(is.character, as.numeric)
You can easily accomplish this with base R:
#base approach
cols <- c(2, 4)
sample[cols] <- lapply(sample[cols], as.numeric)
Is there an easy way to do this using dplyr, ideally within a pipe sequence? If you select only certain columns using select(), you cannot save the results back into the original dataframe.
Something like this would work, but as my dataset has 15+ columns, it seems like very cumbersome code:
cleandf <- sample %>%
#Use transform or mutate to convert each column manually
transform(rev_app = as.integer(rev_app)) %>%
transform(review_taste = as.integer(review_taste))
Is mutate_at, or mutate_each meant to perform this task? Any help would be appreciated. Thanks.
#Maybe something like this:
cols <- c("2","4")
data %>%
mutate_each(is.character[cols], as.numeric)

The easiest way to accomplish this is to use mutate_at with the column indexes:
library(dplyr)
library(stringr)
sample <- sample %>%
  #Do normal mutations on the data
  mutate(rev_app = str_replace_all(rev_app, "/5", "")) %>%
  mutate(review_taste = str_replace_all(review_taste, "/10", "")) %>%
  #Now add this one-liner onto your chain
  mutate_at(c(2,4), as.numeric) %>%
  glimpse()

You can achieve this with the mutate_at function.
sample = data.frame(beer_brewerId = c("8481", "8481", "8481"),
rev_app = c("4/5","1/5", "2/5"),
beer_name = c("John Harvards Simcoe IPA", "John Harvards Simcoe IPA", "John Harvards American Brown Ale"),
review_taste =c("6/10", "7/10", "6/10"), stringsAsFactors = FALSE)
library(dplyr)
# evaluate each "a/b" string as an arithmetic expression, e.g. "4/5" becomes 0.8
clean <- function(foo) {
  sapply(foo, function(x) eval(parse(text = x)))
}
# you can replace c(2,4) by whatever columns you need
clean_sample <- sample %>%
  mutate_at(c(2,4), clean)
Columns 2 and 4 are now numeric.
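If the goal is the integer numerator rather than the evaluated fraction, a base R sketch that avoids eval(parse()) is shown below. sample_df and score_cols are illustrative names, and the code assumes every score has the form "numerator/denominator":

```r
sample_df <- data.frame(
  beer_brewerId = c("8481", "8481", "8481"),
  rev_app = c("4/5", "1/5", "2/5"),
  beer_name = c("John Harvards Simcoe IPA", "John Harvards Simcoe IPA",
                "John Harvards American Brown Ale"),
  review_taste = c("6/10", "7/10", "6/10"),
  stringsAsFactors = FALSE
)

# convert only the named score columns; every other column is left untouched
score_cols <- c("rev_app", "review_taste")
sample_df[score_cols] <- lapply(sample_df[score_cols], function(x) {
  # drop the slash and everything after it, then convert the numerator
  as.integer(sub("/.*$", "", x))
})
str(sample_df)
```

Selecting columns by name rather than by position keeps the conversion correct even if the column order changes later.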