I'm relatively new to R, taken a few classes, etc. However, I'm trying to do something that has really stumped me, and I just can't seem to find an answer despite it being a task that seems trivial and has likely been done countless times. I have a dataset, and in that dataset is a field "Name" that contains, go figure, names. However, that field includes the person's title, e.g. Mr., Mrs., Miss, etc. What I'm looking to do is create a new column "Title" (I've made this column already) and have that column contain the titles in numeric form, such as Mr. = 1, Mrs. = 2, Miss = 3, etc. The solutions I've found propose new subsets of the data, but I don't really want that; I want to add a new column to the current dataset with this information. I realize this probably sounds like a trivial task to those experienced with R, but it's driving me bonkers. Thank you for any help that can be provided.
Expected output:
Name Title
Jones, Mr. Frank 1
Jennings, Mrs. Joan 2
Hinker, Miss. Lisa 3
Brant, Mrs. Jane 2
Allin, Mr. Hank 1
Minks, Mr. Jeff 1
Naps, Mr. Tim 1
We can use gsub to extract the Mr/Mrs/Miss substring from the 'Name' column, convert it to a factor with the levels specified as the unique elements of the vector, and finally convert to numeric class.
With gsub, we match one of two alternatives and replace each match with '' (the second argument): either, anchored at the beginning of the string (^), one or more characters that are not a comma ([^,]+) followed by a comma and zero or more whitespace characters (\\s*); or (|) a literal dot (\\.) followed by one or more non-dot characters ([^.]+) up to the end of the string ($).
v1 <- gsub('^[^,]+,\\s*|\\.[^.]+$', '', df1$Name)
df1$Title <- as.numeric(factor(v1, levels=unique(v1)))
NOTE: We can also specify the order of the levels explicitly, i.e. factor(v1, levels = c('Mr', 'Mrs', 'Miss')). In the example provided, unique() happens to give the same order as the expected output.
Or we can match the vector against its unique elements.
df1$Title <- match(v1, unique(v1))
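If the numeric codes should stay fixed regardless of the order in which titles appear in the data, a named lookup vector is another option. A minimal sketch (title_map is just a name made up here):

```r
df1 <- data.frame(Name = c("Jones, Mr. Frank", "Jennings, Mrs. Joan",
                           "Hinker, Miss. Lisa"))

# Extract the titles exactly as before
v1 <- gsub('^[^,]+,\\s*|\\.[^.]+$', '', df1$Name)

# title_map is a made-up name: a fixed title -> code mapping,
# so the codes no longer depend on the order titles appear in the data
title_map <- c(Mr = 1, Mrs = 2, Miss = 3)
df1$Title <- unname(title_map[v1])
```

This also guards against a dataset where, say, Mrs happens to appear before Mr.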
df1
# Name Title
#1 Jones, Mr. Frank 1
#2 Jennings, Mrs. Joan 2
#3 Hinker, Miss. Lisa 3
#4 Brant, Mrs. Jane 2
#5 Allin, Mr. Hank 1
#6 Minks, Mr. Jeff 1
#7 Naps, Mr. Tim 1
data
df1 <- structure(list(Name = c("Jones, Mr. Frank", "Jennings, Mrs. Joan",
"Hinker, Miss. Lisa", "Brant, Mrs. Jane", "Allin, Mr. Hank",
"Minks, Mr. Jeff", "Naps, Mr. Tim")), .Names = "Name", row.names = c(NA,
-7L), class = "data.frame")
Related
I have a dataset similar to the following (but larger):
dataset <- data.frame(First = c("John","John","Andy","John"), Last = c("Lewis","Brown","Alphie","Johnson"))
I would like to create a new column that contains each unique last name corresponding to the given first name. Thus, each observation of "John" would have c("Lewis", "Brown", "Johnson") in the third column.
I'm a bit perplexed because my attempts at vectorization seem impossible given I can't reference the particular observation I'm looking at. Specifically, what I want to write is:
dataset$allLastNames <- unique(data$Last[data$First == "the current index???"])
I think this can work in a loop (since I reference the observation with 'i'), but it is taking too long given the size of my data:
for(i in 1:nrow(dataset)){
dataset$allLastNames[i] <- unique(dataset$Last[dataset$First == dataset$First[i]])
}
Any suggestions for how I could make this work (using Base R)?
Thanks!
You can do this with the dplyr library in a few lines. First, group by first name and list all unique last-name occurrences.
library(dplyr)
list_names = dataset %>%
group_by(First) %>%
summarise(allLastNames = list(unique(Last)))
Then, add the summary table to your dataset matching the First names:
dataset %>% left_join(list_names,by='First')
First Last allLastNames
1 John Lewis Lewis, Brown, Johnson
2 John Brown Lewis, Brown, Johnson
3 Andy Alphie Alphie
4 John Johnson Lewis, Brown, Johnson
Also, R is a language in which it's best to avoid for-loops; there are several vectorized methods for working with data frames and vectors that avoid them.
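For instance, a vectorized base R sketch of the same idea using ave(), collapsing the unique last names into a single string (a character column is often easier to handle than a list-column):

```r
dataset <- data.frame(First = c("John", "John", "Andy", "John"),
                      Last  = c("Lewis", "Brown", "Alphie", "Johnson"))

# For each row, collapse the unique last names seen for that first name;
# ave() applies the function per group and recycles the result across rows
dataset$allLastNames <- ave(dataset$Last, dataset$First,
                            FUN = function(x) paste(unique(x), collapse = ", "))
```

This avoids both the explicit loop and the intermediate summary table.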
Base R option:
allLastNames <- aggregate(.~First, dataset, paste, collapse = ",")
dataset <- merge(dataset, allLastNames, by = "First")
names(dataset) <- c("First", "Last", "allLastNames")
Output:
First Last allLastNames
1 Andy Alphie Alphie
2 John Lewis Lewis,Brown,Johnson
3 John Brown Lewis,Brown,Johnson
4 John Johnson Lewis,Brown,Johnson
library(dplyr)
library(stringr)
dataset %>%
group_by(First) %>%
mutate(Lastnames = str_flatten(Last, ', '))
# Groups: First [2]
First Last Lastnames
<chr> <chr> <chr>
1 John Lewis Lewis, Brown, Johnson
2 John Brown Lewis, Brown, Johnson
3 Andy Alphie Alphie
4 John Johnson Lewis, Brown, Johnson
I'm looking to join two dataframes based on a condition, in this case, that one string is inside another. Say I have two dataframes,
df1 <- data.frame(fullnames=c("Jane Doe", "Mr. John Smith", "Nate Cox, Esq.", "Bill Lee III", "Ms. Kate Smith"),
ages = c(30, 51, 45, 38, 20))
fullnames ages
1 Jane Doe 30
2 Mr. John Smith 51
3 Nate Cox, Esq. 45
4 Bill Lee III 38
5 Ms. Kate Smith 20
df2 <- data.frame(lastnames=c("Doe", "Cox", "Smith", "Jung", "Smith", "Lee"),
ages=c(30, 45, 20, 28, 51, 38),
homestate=c("NJ", "CT", "MA", "RI", "MA", "NY"))
lastnames ages homestate
1 Doe 30 NJ
2 Cox 45 CT
3 Smith 20 MA
4 Jung 28 RI
5 Smith 51 MA
6 Lee 38 NY
I want to do a left join on these two dataframes on ages and the row in which df2$lastnames is contained within df1$fullnames. I thought fuzzy_join might do it, but I don't think it liked my grepl:
joined_dfs <- fuzzy_join(df1, df2, by = c("ages", "fullnames" = "lastnames"),
                         match_fun = c("=", "grepl()"),
                         mode = "left")
Error in which(m) : argument to 'which' is not logical
Desired result: a dataframe identical to the first but with a "homestate" column appended. Any ideas?
TLDR
You just need to fix match_fun:
# ...
match_fun = list(`==`, stringr::str_detect),
# ...
Background
You had the right idea, but you went wrong in your interpretation of the match_fun parameter in fuzzyjoin::fuzzy_join(). Per the documentation, match_fun should be a
Vectorized function given two columns, returning TRUE or FALSE as to whether they are a match. Can be a list of functions one for each pair of columns specified in by (if a named list, it uses the names in x). If only one function is given it is used on all column pairs.
Solution
A simple correction will do the trick, with further formatting by dplyr. For conceptual clarity, I've typographically aligned the by columns with the functions used to match them:
library(dplyr)
# ...
# Existing code
# ...
joined_dfs <- fuzzy_join(
df1, df2,
by = c("ages", "fullnames" = "lastnames"),
# |----| |-----------------------|
match_fun = list(`==` , stringr::str_detect ),
# |--| |-----------------|
# Match by equality ^ ^ Match by detection of `lastnames` in `fullnames`
mode = "left"
) %>%
# Format resulting dataset as you requested.
select(fullnames, ages = ages.x, homestate)
Result
Given your sample data reproduced here
df1 <- data.frame(
fullnames = c("Jane Doe", "Mr. John Smith", "Nate Cox, Esq.", "Bill Lee III", "Ms. Kate Smith"),
ages = c(30, 51, 45, 38, 20)
)
df2 <- data.frame(
lastnames = c("Doe", "Cox", "Smith", "Jung", "Smith", "Lee"),
ages = c(30, 45, 20, 28, 51, 38),
homestate = c("NJ", "CT", "MA", "RI", "MA", "NY")
)
this solution should produce the following data.frame for joined_dfs, formatted as requested:
fullnames ages homestate
1 Jane Doe 30 NJ
2 Mr. John Smith 51 MA
3 Nate Cox, Esq. 45 CT
4 Bill Lee III 38 NY
5 Ms. Kate Smith 20 MA
Note
Because each ages value is coincidentally a unique key, the following join on only the *names columns
fuzzy_join(
df1, df2,
by = c("fullnames" = "lastnames"),
match_fun = stringr::str_detect,
mode = "left"
)
will better illustrate the behavior of matching on substrings:
fullnames ages.x lastnames ages.y homestate
1 Jane Doe 30 Doe 30 NJ
2 Mr. John Smith 51 Smith 20 MA
3 Mr. John Smith 51 Smith 51 MA
4 Nate Cox, Esq. 45 Cox 45 CT
5 Bill Lee III 38 Lee 38 NY
6 Ms. Kate Smith 20 Smith 20 MA
7 Ms. Kate Smith 20 Smith 51 MA
Where You Went Wrong
Error in Type
The value passed to match_fun should be either (the symbol for) a function
fuzzyjoin::fuzzy_join(
# ...
match_fun = grepl
# ...
)
or a list of such (symbols for) functions:
fuzzyjoin::fuzzy_join(
# ...
match_fun = list(`=`, grepl)
# ...
)
Instead of providing a list of symbols
match_fun = list(=, grepl)
you incorrectly provided a vector of character strings:
match_fun = c("=", "grepl()")
Error in Syntax
The user should name the functions
`=`
grepl
yet you incorrectly attempted to call them:
=
grepl()
Naming them will pass the functions themselves to match_fun, as intended, whereas calling them will pass their return values*. In R, an operator like = is named using backticks: `=`.
* Assuming the calls didn't fail with errors. Here, they would fail.
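A quick illustration of the difference, in plain base R:

```r
# Naming an operator with backticks yields the function object itself
eq <- `==`
eq(2, 2)
#-> TRUE

# By contrast, a bare `=` is a syntax error, and grepl() called with no
# arguments errors out, so neither could be placed in the match_fun list.
```

Passing eq (or `==`) hands the function itself to the caller, which is exactly what match_fun expects.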
Inappropriate Functions
To compare two values for equality, here the character vectors df1$fullnames and df2$lastnames, you should use the relational operator ==; yet you incorrectly supplied the assignment operator =.
Furthermore, grepl() is not vectorized in quite the way match_fun requires. While its second argument (x) is indeed a vector
a character vector where matches are sought, or an object which can be coerced by as.character to a character vector. Long vectors are supported.
its first argument (pattern) is (treated as) a single character string:
character string containing a regular expression (or character string for fixed = TRUE) to be matched in the given character vector. Coerced by as.character to a character string if possible. If a character vector of length 2 or more is supplied, the first element is used with a warning. Missing values are allowed except for regexpr, gregexpr and regexec.
Thus, grepl() is not a
Vectorized function given two columns...
but rather a function given one string (scalar) and one column (vector) of strings.
The answer to your prayers is not grepl() but rather something like stringr::str_detect(), which is
Vectorised over string and pattern. Equivalent to grepl(pattern, x).
and which wraps stringi::stri_detect().
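A small sketch of the difference in vectorization (assuming stringr is installed):

```r
library(stringr)

# grepl() uses only the first element of a multi-element pattern (with a warning)
suppressWarnings(grepl(c("Doe", "Cox"), c("Jane Doe", "Nate Cox, Esq.")))
#-> TRUE FALSE  (both strings were tested against "Doe" only)

# str_detect() pairs string[i] with pattern[i], as match_fun requires
str_detect(c("Jane Doe", "Nate Cox, Esq."), c("Doe", "Cox"))
#-> TRUE TRUE
```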
Note
Since you're simply trying to detect whether a literal string in df1$fullnames contains a literal string in df2$lastnames, you don't want to accidentally treat the strings in df2$lastnames as regular expression patterns. In practice, a column of last names is unlikely to contain special regex characters, with the lone exception of -, which is interpreted literally outside of [], and square brackets are very unlikely to be found in a name.
If you're still worried about accidental regex, you might want to consider alternative search methods with stringi::stri_detect_fixed() or stringi::stri_detect_coll(). These perform literal matching, respectively by either byte or "canonical equivalence"; the latter adjusts for locale and special characters, in keeping with natural language processing.
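A minimal sketch of the literal-matching variant, assuming the fuzzyjoin and stringi packages are available (the data here is a trimmed version of the sample):

```r
library(fuzzyjoin)
library(stringi)

df1 <- data.frame(fullnames = c("Jane Doe", "Mr. John Smith"),
                  ages = c(30, 51))
df2 <- data.frame(lastnames = c("Doe", "Smith"),
                  ages = c(30, 51),
                  homestate = c("NJ", "MA"))

# stri_detect_fixed() treats each lastname as a literal string, not a regex
joined <- fuzzy_join(
  df1, df2,
  by = c("ages", "fullnames" = "lastnames"),
  match_fun = list(`==`, stri_detect_fixed),
  mode = "left"
)
```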
This seems to work given your two dataframes:
Edited as per comment by @Greg:
The code is adapted to the data as posted; if your actual data contains more variants, especially of last names (e.g. not only III but also IV), feel free to adapt the code accordingly:
library(dplyr)
df1 %>%
mutate(
# create new column that gets rid of strings after last name:
lastnames = sub("\\sI{1,3}$|,.+$", "", fullnames),
# grab last names:
lastnames = sub(".*?(\\w+)$", "\\1", lastnames)) %>%
# join the two dataframes:
left_join(., df2, by = c("lastnames", "ages"))
fullnames ages lastnames homestate
1 Jane Doe 30 Doe NJ
2 Mr. John Smith 51 Smith MA
3 Nate Cox, Esq. 45 Cox CT
4 Bill Lee III 38 Lee NY
5 Ms. Kate Smith 20 Smith MA
If you want lastnames removed, just append this after a %>%:
select(-lastnames)
EDIT #2:
If you don't trust the above solution given massive variation in how last names are actually noted, then of course fuzzy_join is an option too. BUT the fuzzy_join solution as it stands is not enough; it needs one critical data transformation. This is because str_detect detects whether a string is contained within another string. That is, it will return TRUE when comparing, for example, Smith to Smithsonian or to Hammer-Smith: in each case the string Smith is indeed contained in the longer name. If, as is likely in a large dataset, Smith and Smithsonian happen to have the same age, the match will be perfect and fuzzy_join will incorrectly join the two. The same problem arises when you have, e.g., Smith and Smith-Klein of the same age: there too fuzzy_join will join them.
The first set of problematic cases can be resolved by including word boundary anchors \\b around the names in df2. These assert that, for example, Smith must be bounded by word boundaries on both sides, which is not the case in Smithsonian: there is a boundary to the left of Smith, but the next boundary on the right only comes after the final n. The second set of problematic cases can be addressed by including a negative lookahead after \\b, namely \\b(?!-), which asserts that the word boundary must not be followed by a hyphen.
The solution is easily implemented with mutate and paste0 like so:
fuzzy_join(
df1, df2 %>%
mutate(lastnames = paste0("\\b", lastnames, "\\b(?!-)")),
by = c("ages", "fullnames" = "lastnames"),
match_fun = list(`==`, str_detect),
mode = "left"
) %>%
select(fullnames, ages = ages.x, homestate)
I have multi-party conversations in strings like this:
convers <- "Peter: Hiya Mary: Hi. How w'z your weekend. Peter: a::hh still got a headache. An' you (.) party a lot? Mary: nuh, you know my kid's sick 'n stuff Peter: yeah i know that's=erm al hamshi: hey guys how's it goin'? Peter: Great! Mary: where've you BEn last week al hamshi: ah y' know, camping with my girl friend."
I also have a vector with the speakers' names:
speakers <- c("Peter", "Mary", "al hamshi")
I'd like to create a dataframe with the utterances of each individual speaker in a separate column. So far I can only do this task piecemeal, addressing each speaker individually via the indices in speakers and then combining the separate results in a list. What I'd really like is a dataframe with a separate column for each speaker:
Peter <- str_extract_all(convers, paste0("(?<=", speakers[1],":\\s).*?(?=\\s*(?:", paste(speakers, collapse="|"),"):|\\z)"))
Mary <- str_extract_all(convers, paste0("(?<=", speakers[2],":\\s).*?(?=\\s*(?:", paste(speakers, collapse="|"),"):|\\z)"))
al_hamshi <- str_extract_all(convers, paste0("(?<=", speakers[3],":\\s).*?(?=\\s*(?:", paste(speakers, collapse="|"),"):|\\z)"))
df <- list(
Peter = Peter, Mary = Mary , al_hamshi = al_hamshi
)
df
$Peter
$Peter[[1]]
[1] "Hiya" "a::hh still got a headache. An' you (.) party a lot?"
[3] "yeah i know that's=erm" "Great!"
$Mary
$Mary[[1]]
[1] "Hi. How w'z your weekend." "nuh, you know my kid's sick 'n stuff" "where've you BEn last week"
$al_hamshi
$al_hamshi[[1]]
[1] "hey guys how's it goin'?" "ah y' know, camping with my girl friend."
How can I extract the same-speaker utterances not one by one but in one go, and how can the results be assigned to a dataframe rather than a list?
With a bit of pre-processing, and assuming the names exactly match the speakers in the conversation text, you can do:
# Pattern to use to insert new lines in string
pattern <- paste0("(", paste0(speakers, ":", collapse = "|"), ")")
# Split string by newlines
split_conv <- strsplit(gsub(pattern, "\n\\1", convers), "\n")[[1]][-1]
# Capture speaker and text into data frame
dat <- strcapture("(.*?):(.*)", split_conv, data.frame(speaker = character(), text = character()))
Which gives:
speaker text
1 Peter Hiya
2 Mary Hi. How w'z your weekend.
3 Peter a::hh still got a headache. An' you (.) party a lot?
4 Mary nuh, you know my kid's sick 'n stuff
5 Peter yeah i know that's=erm
6 al hamshi hey guys how's it goin'?
7 Peter Great!
8 Mary where've you BEn last week
9 al hamshi ah y' know, camping with my girl friend.
To get each speaker into their own column:
# Count lines by speaker
dat$cnt <- with(dat, ave(speaker, speaker, FUN = seq_along))
# Reshape and rename
dat <- reshape(dat, idvar = "cnt", timevar = "speaker", direction = "wide")
names(dat) <- sub("text\\.", "", names(dat))
cnt Peter Mary al hamshi
1 1 Hiya Hi. How w'z your weekend. hey guys how's it goin'?
3 2 a::hh still got a headache. An' you (.) party a lot? nuh, you know my kid's sick 'n stuff ah y' know, camping with my girl friend.
5 3 yeah i know that's=erm where've you BEn last week <NA>
7 4 Great! <NA> <NA>
If newline characters already exist in your text, choose another character that doesn't appear in the string to use for splitting.
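For example, a control character is a safe split marker for most natural-language text; a minimal sketch on a shortened version of the sample data:

```r
convers  <- "Peter: Hiya Mary: Hi. How w'z your weekend."
speakers <- c("Peter", "Mary")
pattern  <- paste0("(", paste0(speakers, ":", collapse = "|"), ")")

sentinel <- "\x01"  # a control character, vanishingly unlikely in real text

# Insert the sentinel before each speaker marker, then split on it
split_conv <- strsplit(gsub(pattern, paste0(sentinel, "\\1"), convers),
                       sentinel, fixed = TRUE)[[1]][-1]
dat <- strcapture("(.*?):(.*)", split_conv,
                  data.frame(speaker = character(), text = character()))
```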
You can append :\\s to each element of speakers, as you are already doing, then use gregexpr to find the positions where a speaker's name starts. Extract these with regmatches and strip the appended :\\s to get the speakers. Call regmatches again, but with invert = TRUE, to get the utterances. With split the utterances are then grouped by speaker. To bring this into the desired data.frame you have to pad with NA so that all speakers have the same number of utterances, done here with [ inside lapply:
x <- gregexpr(paste0(speakers, ":\\s", collapse="|"), convers)
y <- sub(":\\s$", "", regmatches(convers, x)[[1]])
z <- trimws(regmatches(convers, x, TRUE)[[1]][-1])
tt <- split(z, y)
do.call(data.frame, lapply(tt, "[", seq_len(max(lengths(tt)))))
# al.hamshi Mary Peter
#1 hey guys how's it goin'? Hi. How w'z your weekend. Hiya
#2 ah y' know, camping with my girl friend. nuh, you know my kid's sick 'n stuff a::hh still got a headache. An' you (.) party a lot?
#3 <NA> where've you BEn last week yeah i know that's=erm
#4 <NA> <NA> Great!
I have the following datasets
df <- data.frame(id = c(1,2,3), names = c( "Adam Jones, John David, Maddy Kones",
"Adam Smith, Maddy Kones, John David", "Maddy Kones, John Peterson, Adam Smith"))
I wish to see the rows where "John" comes immediately after "Adam".
So my output will be
id names
1 Adam Jones, John David, Maddy Kones
I do not know how to write a regular expression for that. This is what I tried so far:
output <- df [grep("Adam" [^,]* "John", df$names),]
One base R approach here is to use grepl with an appropriate pattern:
Adam\b[^,]*,\s*John.*
This says to match Adam, followed by a word boundary and any characters up to the first comma, immediately followed by John as the next name. There are no ugly edge cases here: since John has to follow Adam, there will always be a comma separating the two names.
Code:
df[grepl("Adam\\b[^,]*,\\s*John.*", df$names), ]
Demo
Update
The original solution does not give the expected answer when there is an absence of "Adam" or "John". For example, for this dataframe
df
# id names
#1 1 Adam Jones, John David, Maddy Kones
#2 2 Adam Smith, Maddy Kones, John David
#3 3 Maddy Kones, John Peterson, Adam Smith
#4 4 Adam Smith, Ronak Shah
Using the original solution we would get output as
# id names
#1 1 Adam Jones, John David, Maddy Kones
#NA NA <NA>
To correct the issue we wrap the comparison in isTRUE, which returns FALSE for such zero-length results (grep returns integer(0) when a name is absent) and so keeps only the rows where the comparison is actually TRUE:
df[sapply(strsplit(df$names, ","), function(x)
isTRUE(grep("John", x) - grep("Adam", x) == 1)), ]
# id names
#1 1 Adam Jones, John David, Maddy Kones
Original Answer
Another option is by splitting all the names on , and using grep to check the position at which "John" and "Adam" occurs and select only if the difference between them is 1 (as "John" follows "Adam").
df[sapply(strsplit(df$names, ","), function(x)
grep("John", x) - grep("Adam", x)) == 1, ]
#id names
#1 1 Adam Jones, John David, Maddy Kones
I have two datasets. In one dataset, the first name, second name and last name are stored in separate variables.
For instance:
ID firstname second name last name
12 john arnold doe
14 jerry k wildlife
In the second one they are written in one variable:
ID name
12 john arnold doe
14 jerry k wildlife
Now I want to be able to find the people from dataset two (full names) in dataset one (separate names).
A couple of problems that I have:
not all names occur in both datasets,
not all names have a middle initial,
not all names have IDs, so I cannot search on that alone either.
So the question is: could someone suggest a command to split the names into first/second/last name? Secondly, would someone know how to search for these names with a simple command, something like:
df <- df.old[grepl("firstname", df.old$firstname, ignore.case = TRUE) & grepl("secondname", df.old$secondname, ignore.case = TRUE) & grepl("lastname", df.old$lastname, ignore.case = TRUE), ]
Any suggestions?
Dirk
You can use separate from tidyr package.
separate(df2, name, into=c("firstname", "secondname", "last name"), " ")
# ID firstname secondname last name
#1 12 john arnold doe
#2 14 jerry k wildlife
For missing middle names, if the last name landing in the middle-name column is acceptable:
df2 <- data.frame(ID=c(12, 14), name=c("john arnold doe", "jerry wildlife"))
library(splitstackshape)
cSplit(df2, 2, sep = " ")# this reads "split 2nd column by white space"
# ID name_1 name_2 name_3
#1: 12 john arnold doe
#2: 14 jerry wildlife NA
name_1 corresponds to first name, name_2 to middle name
Try this:
Sample Data
df2 <- data.frame(ID=c(12, 14), name=c("john arnold doe", "jerry k wildlife"))
Split names by space
df2 <- cbind(df2$ID, data.frame(do.call(rbind, strsplit(as.character(df2$name), " "))))
names(df2) <- c("ID", "firstname", "second name", "last name")
df2
Then join the two data frames by first and last name, or by ID.
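A hedged sketch of that flow in base R, assuming the split columns are named firstname/secondname/lastname (names chosen here for illustration) and that every name in this toy data has exactly three parts; matching is case-insensitive on first and last name:

```r
# Dataset one: names already split (hypothetical column names)
df.old <- data.frame(ID = c(12, 14),
                     firstname = c("john", "jerry"),
                     secondname = c("arnold", "k"),
                     lastname = c("doe", "wildlife"))

# Dataset two: one name column; split it by whitespace as shown above
df2 <- data.frame(ID = c(12, 14),
                  name = c("John Arnold Doe", "Jerry K Wildlife"))
parts <- do.call(rbind, strsplit(as.character(df2$name), " "))
df2$firstname <- tolower(parts[, 1])
df2$lastname  <- tolower(parts[, ncol(parts)])

# Case-insensitive join on first + last name (df.old is already lowercase here)
found <- merge(df.old, df2[, c("firstname", "lastname", "name")],
               by = c("firstname", "lastname"))
```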