I am doing data cleaning with dplyr.
One of the things I want to do is to capitalize values in certain columns.
data$surname
john
Mary
John
mary
...
I suppose I have to use the mutate function of dplyr with something like this:
titleCase <- function(x) {
  s <- strsplit(as.character(x), " ")[[1]]
  paste(toupper(substring(s, 1, 1)), substring(s, 2),
        sep = "", collapse = " ")
}
But how do I combine the two? I get all kinds of errors or truncated data frames.
Thanks
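One likely reason for the truncated results is that the helper above indexes the strsplit() result with [[1]], so it only ever processes the first element of the column. Here is a minimal base-R sketch of a vectorized variant. The name title_case and the sample vector are made up for illustration, and NA values are not handled.

```r
# Hypothetical vectorized variant of the helper above (illustration only).
title_case <- function(x) {
  pieces <- strsplit(as.character(x), " ", fixed = TRUE)
  vapply(pieces, function(s) {
    # Upper-case the first letter of every word, then glue the words back together.
    paste0(toupper(substring(s, 1, 1)), substring(s, 2), collapse = " ")
  }, character(1), USE.NAMES = FALSE)
}

surname <- c("john", "Mary", "john smith")
title_case(surname)
```

Because it returns one string per input element, it could be used as mutate(surname = title_case(surname)).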
We can use sub
sub("(.)", "\\U\\1", data$surname, perl=TRUE)
#[1] "John" "Mary" "John" "Mary"
Implementing in the dplyr workflow
library(dplyr)
data %>%
mutate(surname = sub("(.)", "\\U\\1", surname, perl=TRUE))
If we need to do this on multiple columns
data %>%
mutate_each(funs(sub("(.)", "\\U\\1", ., perl=TRUE)))
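In more recent dplyr versions, mutate_each() and funs() are deprecated in favour of across(). Here is a sketch of the same idea, assuming dplyr 1.0 or later and the two character columns from the sample data; cap_first is a hypothetical helper name.

```r
library(dplyr)

data <- data.frame(surname = c("john", "Mary", "John", "mary"),
                   firstname = c("abe", "Jacob", "george", "jen"),
                   stringsAsFactors = FALSE)

# Hypothetical helper: upper-case only the first character of each string.
cap_first <- function(x) sub("(.)", "\\U\\1", x, perl = TRUE)

res <- data %>%
  # Apply the helper to the listed columns only.
  mutate(across(c(surname, firstname), cap_first))
res
```

Listing the columns explicitly keeps the transformation away from any non-character columns.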
Just to check
res <- data1 %>%
mutate(surname = sub("(.)", "\\U\\1", surname, perl=TRUE))
sum(grepl("[A-Z]", substr(res$surname, 1,1)))
#[1] 500000
data
data <- data.frame(surname=c("john", "Mary", "John", "mary"),
firstname = c("abe", "Jacob", "george", "jen"), stringsAsFactors=FALSE)
data1 <- data.frame(surname = sample(c("john", "Mary", "John", "mary"),
500000, replace=TRUE), stringsAsFactors=FALSE)
A little late to the party but you can use stringr package
library(stringr)
library(dplyr)
example1 <- tibble(names = c("john" ,"Mary", "John", "mary"))
example1 %>%
mutate(names = str_to_title(names))
## names
## <chr>
## 1 John
## 2 Mary
## 3 John
## 4 Mary
This will still work if you want all terms capitalized
example2 <- tibble(names = c("john james" ,"Mary carey", "John Jack", "mary Harry"))
example2 %>%
mutate(names = str_to_title(names))
## names
## <chr>
## 1 John James
## 2 Mary Carey
## 3 John Jack
## 4 Mary Harry
If you only want the first term capitalized, str_to_sentence() will work
example2 %>%
mutate(names = str_to_sentence(names))
## names
## <chr>
## 1 John james
## 2 Mary carey
## 3 John jack
## 4 Mary harry
There is a dedicated function for this that you can try:
R.utils::capitalize(data$surname)
If this needs to be implemented into a dplyr procedure, one could try the following:
library(dplyr)
library(R.utils)
data %>% mutate(surname = capitalize(surname))
Related
I am trying to perform the following search on a database of text.
Here is the sample database, df
df <- data.frame(
id = c(1, 2, 3, 4, 5, 6),
name = c("john doe", "carol jones", "jimmy smith",
"jenny ruiz", "joey jones", "tim brown"),
place = c("reno nevada", "poland maine", "warsaw poland",
"trenton new jersey", "brooklyn new york", "atlanta georgia")
)
I have a vector of strings which contains terms I am trying to find.
new_search <- c("poland", "jones")
I pass the vector to str_detect to find ANY of the strings in new_search in ANY of the columns in df and then return rows which match...
df %>%
filter_all(any_vars(str_detect(., paste(new_search, collapse = "|"))))
Question... how can I extract the results of str_detect into a new column?
For each row which is returned... I would like to generate a list of the terms which were successfully matched and put them in a list or character vector (matched_terms)...something like this...
  id        name              place         matched_terms
1  2 carol jones       poland maine  c("jones", "poland")
2  3 jimmy smith      warsaw poland           c("poland")
3  5  joey jones  brooklyn new york            c("jones")
This is my naive solution:
new_search <- c("poland", "jones") %>% paste(collapse = "|")
df %>%
mutate(new_var = str_extract_all(paste(name, place), new_search))
You can extract all the patterns from multiple columns using str_extract_all and combine them into one column with unite. Because unite collapses the columns into a single string, empty matches turn into the text "character(0)", which we remove using str_remove_all; finally we keep only the rows that have at least one matched term.
library(tidyverse)
pat <- str_c(new_search, collapse = "|")
df %>%
mutate(across(-id, ~str_extract_all(., pat), .names = '{col}_new')) %>%
unite(matched_terms, ends_with('new'), sep = ',') %>%
mutate(matched_terms = str_remove_all(matched_terms,
'character\\(0\\),?|,character\\(0\\)')) %>%
filter(matched_terms != '')
# id name place matched_terms
#1 2 carol jones poland maine jones,poland
#2 3 jimmy smith warsaw poland poland
#3 5 joey jones brooklyn new york jones
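If a list column is preferred, as in the question's target output, one possible sketch is to extract from the pasted columns and keep the rows with at least one hit. It assumes dplyr and stringr are installed and rebuilds df and new_search so that it runs on its own.

```r
library(dplyr)
library(stringr)

df <- data.frame(
  id = c(1, 2, 3, 4, 5, 6),
  name = c("john doe", "carol jones", "jimmy smith",
           "jenny ruiz", "joey jones", "tim brown"),
  place = c("reno nevada", "poland maine", "warsaw poland",
            "trenton new jersey", "brooklyn new york", "atlanta georgia")
)
new_search <- c("poland", "jones")
pat <- str_c(new_search, collapse = "|")

res <- df %>%
  # One character vector of matched terms per row, stored as a list column.
  mutate(matched_terms = str_extract_all(paste(name, place), pat)) %>%
  # lengths() counts the matches per row; drop rows without any.
  filter(lengths(matched_terms) > 0)
res
```

The terms appear in the order they occur in the text, and repeated hits are kept, so wrapping the extraction in unique() per row may be needed for other data.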
Consider this data frame, containing multiple entries for a person named Steve/Stephan Jones and a person named Steve/Steven Smith (as well as Jane Jones and Matt/Matthew Smith)
df <- data.frame(First = c("Steve", "Stephan", "Steve", "Jane", "Steve", "Steven", "Matt"),
Last = c(rep("Jones", 4), rep("Smith", 3)))
What I'd like is to match values of First to the appropriate value of Name in this data frame.
nicknames <- data.frame(Name = c("Stephan", "Steven", "Stephen", "Matthew"),
N1 = c(rep("Steve", 3), "Matt"))
To yield this target
target <- data.frame(First = c("Stephan", "Stephan", "Stephan", "Jane", "Steven", "Steven", "Matthew"),
Last = c(rep("Jones", 4), rep("Smith", 3)))
The issue is that there are multiple values of Name corresponding to an N1 (or First) value of "Steve", so I need to check within each group based on df$Last to see which version of Steven/Stephan/Stephen is correct.
Using something like this
library(dplyr)
library(stringr)
df %>%
group_by(Last) %>%
mutate(First = First[which.max(str_length(First))])
won't work because the value of "Jane" in row 4 will be converted to "Stephan"
I'm not sure whether this solves your problem and is consistent with your desired output:
library(dplyr)
df %>%
mutate(id = row_number()) %>%
left_join(nicknames, by=c("First" = "N1")) %>%
mutate(real_name = coalesce(Name, First)) %>%
group_by(Last, real_name) %>%
mutate(id = n()) %>%
group_by(Last, First) %>%
filter(id==max(id)) %>%
select(-Name, -id)
returns
# A tibble: 7 x 3
# Groups: Last, First [6]
First Last real_name
<chr> <chr> <chr>
1 Steve Jones Stephan
2 Stephan Jones Stephan
3 Steve Jones Stephan
4 Jane Jones Jane
5 Steve Smith Steven
6 Steven Smith Steven
7 Matt Smith Matthew
I have an example dataframe as below.
pr_id   product                  name
id_234  onion,bean               chris
id_34d  apple                    tom
id_87t  plantain, potato, apple  tex
I want to access the product column and create a new column and assign 1 if apple is in the list and 0 if not.
So I expect a result like this:
pr_id   product                  name   result
id_234  onion,bean               chris  0
id_34d  apple                    tom    1
id_87t  plantain, potato, apple  tex    1
I thought of something like this:
my_df$result <- ifelse(my_df$product == 'apple', 1,0)
but this only works for rows 1 and 2, not for the last row, which has multiple elements.
How should I go about this?
With dplyr, dataframe kindly taken from p. Paccioretti
Thanks to AnilGoyal for stringr::str_detect
library(dplyr)
# construct the dataframe
pr_id = c("id_234", "id_34d", "id_87t")
product = c("onion,bean",
"apple", "plantain, potato, apple")
name = c("chris", "tom","tex")
my_df <- data.frame(pr_id, product, name)
# check with case_when and str_detect if apple is in product
my_df <- my_df %>%
mutate(result = case_when(stringr::str_detect(product, "apple") ~ 1,
TRUE ~ 0)
)
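You can use agrepl, which searches for approximate matches within a string; comparing with == only finds values that are exactly equal to 'apple'. Because agrepl is fuzzy, it may also accept near-misses of the pattern, so grepl is the stricter choice if only the literal text should count.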
my_df <-
structure(
list(
pr_id = c("id_234", "id_34d", "id_87t"),
product = c("onion,bean",
"apple", "plantain, potato, apple"),
name = c("chris", "tom",
"tex")
),
class = "data.frame",
row.names = c(NA, -3L)
)
my_df$result <- ifelse(agrepl('apple', my_df$product), 1,0)
Or a tidyverse approach
library(dplyr)
my_df <-
my_df %>%
mutate(result = as.numeric(agrepl('apple', product)))
my_df
#> pr_id product name result
#> 1 id_234 onion,bean chris 0
#> 2 id_34d apple tom 1
#> 3 id_87t plantain, potato, apple tex 1
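To see how the approximate and the literal test can differ, here is a small base-R sketch. The fourth value, "apply now", is a made-up row added for illustration. What agrepl returns depends on its max.distance setting, so no expected output is claimed for it here.

```r
product <- c("onion,bean", "apple", "plantain, potato, apple", "apply now")

# Literal substring test: TRUE only where the exact text "apple" occurs.
exact <- as.numeric(grepl("apple", product, fixed = TRUE))
exact

# Approximate test: results depend on max.distance, so inspect them on your data.
fuzzy <- as.numeric(agrepl("apple", product))
fuzzy
```

If the two vectors disagree on your data, the rows where they differ are the fuzzy matches.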
Using str_count (this returns the number of occurrences of 'apple', which is 0 or 1 for this data)
library(dplyr)
library(stringr)
my_df %>%
mutate(result = str_count(product, 'apple'))
I would use the str_detect option in stringr (tidyverse option).
my_df <- my_df %>%
mutate(result = ifelse(str_detect(product, "apple"), 1, 0))
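The pattern-based checks above also return 1 for products that merely contain the letters, such as a hypothetical "pineapple". If the column really is a comma-separated list, a base-R sketch that compares whole items could look like this; has_item is a made-up helper name and the fourth value is invented for illustration.

```r
# Hypothetical helper: TRUE where `item` is one of the comma-separated entries.
has_item <- function(x, item) {
  items <- strsplit(x, ",", fixed = TRUE)
  # trimws() drops the spaces that follow the commas before comparing.
  vapply(items, function(p) item %in% trimws(p), logical(1), USE.NAMES = FALSE)
}

product <- c("onion,bean", "apple", "plantain, potato, apple", "pineapple")
result <- as.numeric(has_item(product, "apple"))
result
```

Inside a dplyr pipeline the same helper could be called as mutate(result = as.numeric(has_item(product, "apple"))).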
I want to iterate over the column names of a data frame and then, using dplyr, split each field on a delimiter (->) found in the rows. This is what the dataset looks like:
dput(df)
structure(list(v1 = c("Silva->Mark", "Brandon->Livo", "Mango->Apple"),
v2 = c("Austin", "NA ", "Orange"),
v3 = c("James -> Jacy","NA->Jane", "apple -> Orange")),
class = "data.frame", row.names = c(NA, -3L))
I wrote code that picks out the columns whose rows contain the delimiter (->), which are columns v1 and v3. Here is the code:
rows_true <- apply(df,2,function(x) any(sapply(x,function(y)grepl("->",y))))
ss<-df[,rows_true]
Then I tried to loop through those column names so that I can split each one on the delimiter using this code, but it isn't working:
cols<- names(df)
if (names %in% df){
splitcols <- ss %>%
tidyr::separate(cols, into = c(paste0(names,+ "old"), "paste0(names,+ "New")"), sep = "->")
}
The reason I am using paste0 is that I want each column split in two on the delimiter, with the newly formed columns named after the original column plus the suffix Old for the first piece and New for the second.
End result after looping through column names and recursively separating them should look like this
dput(df)
structure(list(v1_Old = c("Silva", "Brandon", "Mango"),
v1_New = c("Mark", "Livo", "Apple"),
v3_Old = c("James","NA", "apple"),
v3_New = c("Jacy","Jane", "Orange")),
class = "data.frame", row.names = c(NA, -3L))
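For reference, the loop described in the question can be sketched with tidyr::separate roughly as follows. This is only one possible reading of the intent. The names has_arrow, out and nm are introduced for illustration, and the regular expression also swallows the spaces around "->".

```r
library(tidyr)

df <- data.frame(v1 = c("Silva->Mark", "Brandon->Livo", "Mango->Apple"),
                 v2 = c("Austin", "NA ", "Orange"),
                 v3 = c("James -> Jacy", "NA->Jane", "apple -> Orange"),
                 stringsAsFactors = FALSE)

# Columns in which at least one value contains the delimiter.
has_arrow <- vapply(df, function(x) any(grepl("->", x, fixed = TRUE)), logical(1))
cols <- names(df)[has_arrow]

out <- df[cols]
for (nm in cols) {
  # !!nm passes the column name stored in nm; the pieces become <col>_Old and <col>_New.
  out <- separate(out, col = !!nm,
                  into = paste0(nm, c("_Old", "_New")),
                  sep = "\\s*->\\s*")
}
out
```

Note that the text "NA" in v3 stays a plain string here; it is not converted to a missing value.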
For the sake of completeness, here is also a solution which uses data.table().
There are some differences from the other answers posted so far:
It is not required to identify the columns to be split beforehand. Instead, columns without "->" are dropped from the result on the fly.
The regular expression used for splitting, " *-> *", includes surrounding white space (if any). This avoids calling trimws() on the resulting pieces afterwards or removing white space beforehand.
library(data.table)
library(magrittr) # piping used to improve readability
setDT(df)
lapply(names(df), function(x) {
mDT <- df[, tstrsplit(get(x), " *-> *")]
if (ncol(mDT) == 2L) setnames(mDT, paste0(x, c("_Old", "_New")))
}) %>% as.data.table()
v1_Old v1_New v3_Old v3_New
1: Silva Mark James Jacy
2: Brandon Livo NA Jane
3: Mango Apple apple Orange
One possibility involving dplyr and tidyr (plus tibble for rowid_to_column) could be:
library(dplyr)
library(tidyr)
library(tibble)
df %>%
select(v1, v3) %>%
rowid_to_column() %>%
gather(var, val, -rowid) %>%
separate_rows(val, sep = "->", convert = TRUE) %>%
group_by(rowid) %>%
mutate(val = trimws(val),
var = make.unique(var)) %>%
ungroup() %>%
spread(var, val) %>%
select(-rowid)
v1 v1.1 v3 v3.1
<chr> <chr> <chr> <chr>
1 Silva Mark James Jacy
2 Brandon Livo <NA> Jane
3 Mango Apple apple Orange
Or to further match the expected output:
df %>%
select(v1, v3) %>%
rowid_to_column() %>%
gather(var, val, -rowid) %>%
separate_rows(val, sep = "->", convert = TRUE) %>%
group_by(rowid, var) %>%
mutate(val = trimws(val),
var2 = if_else(row_number() == 1, paste0(var, "_old"), paste0(var, "_new"))) %>%
ungroup() %>%
select(-var) %>%
spread(var2, val) %>%
select(-rowid)
v1_new v1_old v3_new v3_old
<chr> <chr> <chr> <chr>
1 Mark Silva Jacy James
2 Livo Brandon Jane <NA>
3 Apple Mango Orange apple
A different approach with dplyr, purrr, and stringr is the following.
library(dplyr)
library(purrr)
library(stringr)
# Detect the columns with at least one "->"
my_df_cols <- map_lgl(df, ~ any(str_detect(., "->")))
df %>%
# Select only the columns with at least "->"
select(which(my_df_cols)) %>%
# Mutate these columns and only keep the mutated columns with new names
transmute_all(list(old = ~ str_split(., "->", simplify = TRUE)[, 1],
new = ~ str_split(., "->", simplify = TRUE)[, 2]))
# v1_old v3_old v1_new v3_new
# 1 Silva James Mark Jacy
# 2 Brandon NA Livo Jane
# 3 Mango apple Apple Orange
We can also use cSplit from splitstackshape
#Detect columns with "->"
cols <- names(df)[colSums(sapply(df, grepl, pattern = "->")) > 0]
#Remove unwanted whitespaces before and after "->"
df[cols] <- lapply(df[cols], function(x) gsub("\\s+", "", x))
#Split into new columns specifying sep as "->"
splitstackshape::cSplit(df[cols], cols, sep = "->")
# v1_1 v1_2 v3_1 v3_2
#1: Silva Mark James Jacy
#2: Brandon Livo <NA> Jane
#3: Mango Apple apple Orange
I have a column containing random names. I would like to write code that creates another column (using the mutate function) that checks whether the name contains "Mr." and, if so, sets the new column to "Male".
Using dplyr and stringr:
library(stringr)
library(dplyr)
df <- data.frame(name = c("Mr. Robinson", "Mrs. robinson", "Gandalf","asdMr.dfa"))
df <- df %>% mutate(male = ifelse(str_detect(name, fixed("Mr.")), TRUE, FALSE))
Output:
> df
name male
1 Mr. Robinson TRUE
2 Mrs. robinson FALSE
3 Gandalf FALSE
4 asdMr.dfa TRUE
Be aware that this matches the string "Mr." anywhere in the name, not just at the beginning. If you don't want that, I'd use a regular expression anchored to the start:
df <- df %>% mutate(male = ifelse(str_detect(name, "^Mr\\."), TRUE, FALSE))
> df
name male
1 Mr. Robinson TRUE
2 Mrs. robinson FALSE
3 Gandalf FALSE
4 asdMr.dfa FALSE
This could also be achieved without the stringr package (inspired by @akrun):
df <- df %>% mutate(male = ifelse(grepl("^Mr\\.", name), TRUE, FALSE))
EDIT:
@docendo discimus pointed out that the ifelse() isn't necessary, since we're creating a logical column and that's exactly what grepl returns. So:
df <- df %>% mutate(male = grepl("^Mr\\.", name))
Without dplyr:
df <- transform(df, male = grepl("^Mr\\.", name))
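The answers above produce a logical column, while the question asked for the label "Male". Here is a base-R sketch of that variant. The column name gender and the choice of NA for non-matching names are assumptions made for illustration.

```r
df <- data.frame(name = c("Mr. Robinson", "Mrs. robinson", "Gandalf", "asdMr.dfa"),
                 stringsAsFactors = FALSE)

# "Male" where the name starts with "Mr."; NA elsewhere (an assumption for this sketch).
df$gender <- ifelse(grepl("^Mr\\.", df$name), "Male", NA_character_)
df
```

The same ifelse() expression could be placed inside mutate() if the rest of the workflow uses dplyr.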