Unlist dataframes but also keep the original - r

I have the following data which i wish to unlist to make a new dataframe, probably easier here if i show show what im looking for; so i currently have names and codes like this;
name code
joe blogs/john williams 100000/100001
what i want:
name code
joe blogs 1000000
john williams 1000001
joe blogs/john williams 100000/100001
so im unlisting the original but also keeping it whilst making a new df

Something like this may work for you
rbind(data.frame(sapply(df, strsplit, "/")), df)
name code
1 joe blogs 100000
2 john williams 100001
3 joe blogs/john williams 100000/100001
Data
df <- structure(list(name = "joe blogs/john williams", code = "100000/100001"), class = "data.frame", row.names = c(NA,
-1L))

You can use seperate_rows() for that:
library(dplyr)
library(tidyr)
df <- data.frame(name = "joe blogs/john williams",
code = "100000/100001")
df |>
separate_rows(everything(), sep = "/") |>
bind_rows(df)
# A tibble: 3 × 2
name code
<chr> <chr>
1 joe blogs 100000
2 john williams 100001
3 joe blogs/john williams 100000/100001

Using reframe
library(dplyr)
df %>%
reframe(across(everything(), ~ c(unlist(strsplit(.x, "/")), .x)))
-output
name code
1 joe blogs 100000
2 john williams 100001
3 joe blogs/john williams 100000/100001

Related

Joining Dataframes in R, Matching Patterns in Strings

Two big real life tables to join up, but here's a little reprex:
I've got a table of small strings and I want to left join on a second table, with the join being based on whether or not these small strings can be found inside the bigger strings on the second table.
df_1 <- data.frame(index = 1:5,
keyword = c("john", "ella", "mil", "nin", "billi"))
df_2 <- data.frame(index_2 = 1001:1008,
name = c("John Coltrane", "Ella Fitzgerald", "Miles Davis", "Billie Holliday",
"Nina Simone", "Bob Smith", "John Brown", "Tony Montana"))
df_results_i_want <- data.frame(index = c(1, 1:5),
keyword = c("john", "john", "ella", "mil", "nin", "billi"),
index_2 = c(1001, 1007, 1002, 1003, 1005, 1004),
name = c("John Coltrane", "John Brown", "Ella Fitzgerald",
"Miles Davis", "Nina Simone", "Billie Holliday"))
Seems like a str_detect() call and a left_join() call might be part of the solution - ie I'm hoping for something like:
library(tidyverse)
df_results <- df_1 |> left_join(df_2, join_by(blah blah str_detect() blah blah))
I'm using dplyr 1.1 so I can use join_by(), but I'm not sure of the correct way to get what I need - can anyone help please?
I suppose I could do a simple cross join using tidyr::crossing() and then do the str_detect() stuff afterwards (and filter out things that don't match)
df_results <- df_1 |>
crossing(df_2) |>
mutate(match = str_detect(name, fixed(keyword, ignore_case = TRUE))) |>
filter(match) |>
select(-match)
but in my real life example, the cross join would produce an absolutely enormous table that would overwhelm my PC.
Thank you.
You can try fuzzy_join::regex_join():
library(fuzzyjoin)
regex_join(df_2, df_1, by=c("name"="keyword"), ignore_case=T)
Output:
index.x name index.y keyword
1 1001 John Coltrane 1 john
2 1002 Ella Fitzgerald 2 ella
3 1003 Miles Davis 3 mil
4 1004 Billie Holliday 5 billi
5 1005 Nina Simone 4 nin
6 1007 John Brown 1 john
join_by does not support inexact join (but unequal), but you can use fuzzyjoin:
library(dplyr)
library(fuzzyjoin)
df_2 %>%
mutate(name = tolower(name)) %>%
fuzzy_left_join(df_1, ., by = c(keyword = "name"),
match_fun = \(x, y) str_detect(y, x))
index keyword index_2 name
1 1 john 1001 john coltrane
2 1 john 1007 john brown
3 2 ella 1002 ella fitzgerald
4 3 mil 1003 miles davis
5 4 nin 1005 nina simone
6 5 billi 1004 billie holliday
We can use SQL to do that.
library(sqldf)
sqldf("select * from [df_1] A
left join [df_2] B on B.name like '%' || A.keyword || '%'")
giving:
index keyword index_2 name
1 1 john 1001 John Coltrane
2 1 john 1007 John Brown
3 2 ella 1002 Ella Fitzgerald
4 3 mil 1003 Miles Davis
5 4 nin 1005 Nina Simone
6 5 billi 1004 Billie Holliday
It can be placed in a pipeline like this:
library(magrittr)
library(sqldf)
df_1 %>%
{ sqldf("select * from [.] A
left join [df_2] B on B.name like '%' || A.keyword || '%'")
}

Expand data.table so one row per pattern match of each ID

I have a lot of text data in a data.table. I have several text patterns that I'm interested in. I have managed to subset the table so it shows text that matches at least two of the patterns (relevant question here).
I now want to be able to have one row per match, with an additional column that identifies the match - so rows where there are multiple matches will be duplicates apart from that column.
It feels like this shouldn't be too hard but I'm struggling! My vague thoughts are around maybe counting the number of pattern matches, then duplicating the rows that many times...but then I'm not entirely sure how to get the label for each different pattern...(and also not sure that is very efficient anyway).
Thanks for your help!
Example data
library(data.table)
library(stringr)
text_table <- data.table(ID = (1:5),
text = c("lucy, sarah and paul live on the same street",
"lucy has only moved here recently",
"lucy and sarah are cousins",
"john is also new to the area",
"paul and john have known each other a long time"))
text_patterns <- as.character(c("lucy", "sarah", "paul|john"))
# Filtering the table to just the IDs with at least two pattern matches
text_table_multiples <- text_table[, Reduce(`+`, lapply(text_patterns,
function(x) str_detect(text, x))) >1]
Ideal output
required_table <- data.table(ID = c(1, 1, 1, 2, 3, 3, 4, 5),
text = c("lucy, sarah and paul live on the same street",
"lucy, sarah and paul live on the same street",
"lucy, sarah and paul live on the same street",
"lucy has only moved here recently",
"lucy and sarah are cousins",
"lucy and sarah are cousins",
"john is also new to the area",
"paul and john have known each other a long time"),
person = c("lucy", "sarah", "paul or john", "lucy", "lucy", "sarah", "paul or john", "paul or john"))
A way to do that is to create a variable for each indicator and melt:
library(stringi)
text_table[, lucy := stri_detect_regex(text, 'lucy')][ ,
sarah := stri_detect_regex(text, 'sarah')
][ ,`paul or john` := stri_detect_regex(text, 'paul|john')
]
melt(text_table, id.vars = c("ID", "text"))[value == T][, -"value"]
## ID text variable
## 1: 1 lucy, sarah and paul live on the same street lucy
## 2: 2 lucy has only moved here recently lucy
## 3: 3 lucy and sarah are cousins lucy
## 4: 1 lucy, sarah and paul live on the same street sarah
## 5: 3 lucy and sarah are cousins sarah
## 6: 1 lucy, sarah and paul live on the same street paul or john
## 7: 4 john is also new to the area paul or john
## 8: 5 paul and john have known each other a long time paul or john
A tidy way of doing the same procedure is:
library(tidyverse)
text_table %>%
mutate(lucy = stri_detect_regex(text, 'lucy')) %>%
mutate(sarah = stri_detect_regex(text, 'sarah')) %>%
mutate(`paul or john` = stri_detect_regex(text, 'paul|john')) %>%
gather(value = value, key = person, - c(ID, text)) %>%
filter(value) %>%
select(-value)
DISCLAIMER: this is not an idiomatic data.table solution
I would build a helper function like the following, that take a single row and an input and returns a new dt with Nrows:
library(data.table)
library(tidyverse)
new_rows <- function(dtRow, patterns = text_patterns){
res <- map(text_patterns, function(word) {
textField <- grep(x = dtRow[1, text], pattern = word, value = TRUE) %>%
ifelse(is.character(.), ., NA)
personField <- str_extract(string = dtRow[1, text], pattern = word) %>%
ifelse( . == "paul" | . == "john", "paul or john", .)
idField <- ifelse(is.na(textField), NA, dtRow[1, ID])
data.table(ID = idField, text = textField, person = personField)
}) %>%
rbindlist()
res[!is.na(text), ]
}
And I will execute it:
split(text_table, f = text_table[['ID']]) %>%
map_df(function(r) new_rows(dtRow = r))
The answer is:
ID text person
1: 1 lucy, sarah and paul live on the same street lucy
2: 1 lucy, sarah and paul live on the same street sarah
3: 1 lucy, sarah and paul live on the same street paul or john
4: 2 lucy has only moved here recently lucy
5: 3 lucy and sarah are cousins lucy
6: 3 lucy and sarah are cousins sarah
7: 4 john is also new to the area paul or john
8: 5 paul and john have known each other a long time paul or john
which looks like your required_table (duplicated IDs included)
ID text person
1: 1 lucy, sarah and paul live on the same street lucy
2: 1 lucy, sarah and paul live on the same street sarah
3: 1 lucy, sarah and paul live on the same street paul or john
4: 2 lucy has only moved here recently lucy
5: 3 lucy and sarah are cousins lucy
6: 3 lucy and sarah are cousins sarah
7: 4 john is also new to the area paul or john
8: 5 paul and john have known each other a long time paul or john

Summing Rows Next to a Name in R

I'm working on a banking project where I'm trying to find a yearly sum of money spent, while the dataset has these listed as monthly transactions.
Month Name Money Spent
2 John Smith 10
3 John Smith 25
4 John Smith 20
2 Joe Nais 10
3 Joe Nais 25
4 Joe Nais 20
Right now, this is the code I have:
OTData <- OTData %>%
mutate(
OTData,
Full Year = [CODE NEEDED TO SUM UP]
)
Thanks!
As #Pawel said, there's no question here. I assume you want:
df <- data.frame(Month = c(2,3,4,2,3,4),
Name = c("John Smith", "John Smith", "John Smith",
"Joe Nais", "Joe Nais", "Joe Nais"),
Money_Spent = c(10,25,20,10,25,20))
df %>%
group_by(Name) %>%
summarize(Full_year = sum(Money_Spent))
Name Full_year
<fct> <dbl>
1 Joe Nais 55
2 John Smith 55
NOTE: You're going to run into trouble if you include spaces in your variable names. You really should replace them with ., _, or camelCase as in the above example.

separate different combinations of names to first and last using dplyr, tidyr, and regex

Sample data frame:
name <- c("Smith John Michael","Smith, John Michael","Smith John, Michael","Smith-John Michael","Smith-John, Michael")
df <- data.frame(name)
df
name
1 Smith John Michael
2 Smith, John Michael
3 Smith John, Michael
4 Smith-John Michael
5 Smith-John, Michael
I need to achieve the following desired output:
name first.name last.name
1 Smith John Michael John Smith
2 Smith, John Michael John Smith
3 Smith John, Michael Michael Smith John
4 Smith-John Michael Michael Smith-John
5 Smith-John, Michael Michael Smith-John
The rules are: if there is a comma in the string, then anything before is the last name. the first word following the comma is first name. If no comma in string, first word is last name, second word is last name. hyphenated words are one word. I would rather acheive this with dplyr and regex but I'll take any solution. Thanks for the help
You can achieve your desired result using strsplit switching between splitting by "," or " " based on whether there is a comma or not in name. Here, we define two functions to make the presentation clearer. You can just as well inline the code within the functions.
get.last.name <- function(name) {
lapply(ifelse(grepl(",",name),strsplit(name,","),strsplit(name," ")),`[[`,1)
}
The result of strsplit is a list. The lapply(...,'[[',1) loops through this list and extracts the first element from each list element, which is the last name.
get.first.name <- function(name) {
d <- lapply(ifelse(grepl(",",name),strsplit(name,","),strsplit(name," ")),`[[`,2)
lapply(strsplit(gsub("^ ","",d), " "),`[[`,1)
}
This function is similar except we extract the second element from each list element returned by strsplit, which contains the first name. We then remove any starting spaces using gsub, and we split again with " " to extract the first element from each list element returned by that strsplit as the first name.
Putting it all together with dplyr:
library(dplyr)
res <- df %>% mutate(first.name=get.first.name(name),
last.name=get.last.name(name))
The result is as expected:
print(res)
## name first.name last.name
## 1 Smith John Michael John Smith
## 2 Smith, John Michael John Smith
## 3 Smith John, Michael Michael Smith John
## 4 Smith-John Michael Michael Smith-John
## 5 Smith-John, Michael Michael Smith-John
Data:
df <- structure(list(name = c("Smith John Michael", "Smith, John Michael",
"Smith John, Michael", "Smith-John Michael", "Smith-John, Michael"
)), .Names = "name", row.names = c(NA, -5L), class = "data.frame")
## name
##1 Smith John Michael
##2 Smith, John Michael
##3 Smith John, Michael
##4 Smith-John Michael
##5 Smith-John, Michael
I am not sure if this is any better than aichao's answer but I gave it a shot anyway. I gives the right output.
df1 <- df %>%
filter(grepl(",",name)) %>%
separate(name, c("last.name","first.middle.name"), sep = "\\,", remove=F) %>%
mutate(first.middle.name = trimws(first.middle.name)) %>%
separate(first.middle.name, c("first.name","middle.name"), sep="\\ ",remove=T) %>%
select(-middle.name)
df2 <- df %>%
filter(!grepl(",",name)) %>%
separate(name, c("last.name","first.name"), sep = "\\ ", remove=F)
df<-rbind(df1,df2)

Merge data frames with partial id

Say I have these two data frames:
> df1 <- data.frame(name = c('John Doe',
'Jane F. Doe',
'Mark Smith Simpson',
'Sam Lee'))
> df1
name
1 John Doe
2 Jane F. Doe
3 Mark Smith Simpson
4 Sam Lee
> df2 <- data.frame(family = c('Doe', 'Smith'), size = c(2, 6))
> df2
family size
1 Doe 2
2 Smith 6
I want to merge both data frames in order to get this:
name family size
1 John Doe Doe 2
2 Jane F. Doe Doe 2
3 Mark Smith Simpson Smith 6
4 Sam Lee <NA> NA
But I can't wrap my head around a way to do this apart from the following very convoluted solution, which is becoming very messy with my real data, which has over 100 "family names":
> df3 <- within(df1, {
family <- ifelse(test = grepl('Doe', name),
yes = 'Doe',
no = ifelse(test = grepl('Smith', name),
yes = 'Smith',
no = NA))
})
> merge(df3, df2, all.x = TRUE)
family name size
1 Doe John Doe 2
2 Doe Jane F. Doe 2
3 Smith Mark Smith Simpson 6
4 <NA> Sam Lee NA
I've tried taking a look into pmatch as well as the solutions provided at R partial match in data frame, but still haven't found what I'm looking for.
Rather than attempting to use regular expressions and partial matches, you could split the names up into a lookup-table format, where each component of a person's name is kept in a row, and matched to their full name:
df1 <- data.frame(name = c('John Doe',
'Jane F. Doe',
'Mark Smith Simpson',
'Sam Lee'),
stringsAsFactors = FALSE)
df2 <- data.frame(family = c('Doe', 'Smith'), size = c(2, 6),
stringsAsFactors = FALSE)
library(tidyr)
library(dplyr)
str_df <- function(x) {
ss <- strsplit(unlist(x)," ")
data.frame(family = unlist(ss),stringsAsFactors = FALSE)
}
splitnames <- df1 %>%
group_by(name) %>%
do(str_df(.))
splitnames
name family
1 Jane F. Doe Jane
2 Jane F. Doe F.
3 Jane F. Doe Doe
4 John Doe John
5 John Doe Doe
6 Mark Smith Simpson Mark
7 Mark Smith Simpson Smith
8 Mark Smith Simpson Simpson
9 Sam Lee Sam
10 Sam Lee Lee
Now you can just merge or join this with df2 to get your answer:
left_join(df2,splitnames)
Joining by: "family"
family size name
1 Doe 2 Jane F. Doe
2 Doe 2 John Doe
3 Smith 6 Mark Smith Simpson
Potential problem: if one person's first name is the same as somebody else's last name, you'll get some incorrect matches!
Here is one strategy, you could use lapply with grep match over all the family names. This will find them at any position. First let me define a helper function
transindex<-function(start=1) {
function(x) {
start<<-start+1
ifelse(x, start-1, NA)
}
}
and I will also be using the function coalesce.R to make things a bit simpler. Here the code i'd run to match up df2 to df1
idx<-do.call(coalesce, lapply(lapply(as.character(df2$family),
function(x) grepl(paste0("\\b", x, "\\b"), as.character(df1$name))),
transindex()))
Starting on the inside and working out, i loop over all the family names in df2 and grep for those values (adding "\b" to the pattern so i match entire words). grepl will return a logical vector (TRUE/FALSE). I then apply the above helper function transindex() to change those vector to be either the index of the row in df2 that matched, or NA. Since it's possible that a row may match more than one family, I simply choose the first using the coalesce helper function.
Not that I can match up the rows in df1 to df2, I can bring them together with
cbind(df1, size=df2[idx,])
name family size
# 1 John Doe Doe 2
# 1.1 Jane F. Doe Doe 2
# 2 Mark Smith Simpson Smith 6
# NA Sam Lee <NA> NA
Another apporoach that looks valid, at least with the sample data:
df1name = as.character(df1$name)
df1name
#[1] "John Doe" "Jane F. Doe" "Mark Smith Simpson" "Sam Lee"
regmatches(df1name, regexpr(paste(df2$family, collapse = "|"), df1name), invert = T) <- ""
df1name
#[1] "Doe" "Doe" "Smith" ""
cbind(df1, df2[match(df1name, df2$family), ])
# name family size
#1 John Doe Doe 2
#1.1 Jane F. Doe Doe 2
#2 Mark Smith Simpson Smith 6
#NA Sam Lee <NA> NA

Resources