Summing Rows Next to a Name in R - r

I'm working on a banking project where I'm trying to find a yearly sum of money spent, while the dataset has these listed as monthly transactions.
Month Name Money Spent
2 John Smith 10
3 John Smith 25
4 John Smith 20
2 Joe Nais 10
3 Joe Nais 25
4 Joe Nais 20
Right now, this is the code I have:
OTData <- OTData %>%
mutate(
OTData,
Full Year = [CODE NEEDED TO SUM UP]
)
Thanks!

As @Pawel said, there's no question here. I assume you want:
library(dplyr)
df <- data.frame(Month = c(2,3,4,2,3,4),
                 Name = c("John Smith", "John Smith", "John Smith",
                          "Joe Nais", "Joe Nais", "Joe Nais"),
                 Money_Spent = c(10,25,20,10,25,20))
df %>%
  group_by(Name) %>%
  summarize(Full_year = sum(Money_Spent))
Name Full_year
<fct> <dbl>
1 Joe Nais 55
2 John Smith 55
NOTE: You're going to run into trouble if you include spaces in your variable names. You really should replace them with ., _, or camelCase as in the above example.
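If the yearly total should sit next to every monthly row (which is what the mutate() attempt in the question suggests), a grouped mutate() may be closer to what was intended. This is only a sketch on the sample data above; the column name Full_Year is an assumption, not something from the original dataset.

```r
library(dplyr)

df <- data.frame(Month = c(2, 3, 4, 2, 3, 4),
                 Name = rep(c("John Smith", "Joe Nais"), each = 3),
                 Money_Spent = c(10, 25, 20, 10, 25, 20))

# mutate() after group_by() keeps all six rows and repeats
# each person's total on every one of their monthly rows
df_with_total <- df %>%
  group_by(Name) %>%
  mutate(Full_Year = sum(Money_Spent)) %>%
  ungroup()
```

summarize() collapses to one row per name, while mutate() keeps the monthly detail, so the choice depends on which shape is needed downstream.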

Related

Unlist dataframes but also keep the original

I have the following data, which I wish to unlist to make a new data frame. It's probably easier if I show what I'm looking for. I currently have names and codes like this:
name code
joe blogs/john williams 100000/100001
what i want:
name code
joe blogs 100000
john williams 100001
joe blogs/john williams 100000/100001
So I'm unlisting the original but also keeping it while making a new df.
Something like this may work for you
rbind(data.frame(sapply(df, strsplit, "/")), df)
name code
1 joe blogs 100000
2 john williams 100001
3 joe blogs/john williams 100000/100001
Data
df <- structure(list(name = "joe blogs/john williams", code = "100000/100001"), class = "data.frame", row.names = c(NA,
-1L))
You can use separate_rows() for that:
library(dplyr)
library(tidyr)
df <- data.frame(name = "joe blogs/john williams",
code = "100000/100001")
df |>
separate_rows(everything(), sep = "/") |>
bind_rows(df)
# A tibble: 3 × 2
name code
<chr> <chr>
1 joe blogs 100000
2 john williams 100001
3 joe blogs/john williams 100000/100001
Using reframe
library(dplyr)
df %>%
reframe(across(everything(), ~ c(unlist(strsplit(.x, "/")), .x)))
-output
name code
1 joe blogs 100000
2 john williams 100001
3 joe blogs/john williams 100000/100001

R function for recognizing name variants?

I'm working with a big dataset of names and need to be able to group by the individual. It's possible that in the dataset there are names that appear different but are the same person, such as John Doe or John A. Doe, or Michael Smith and Mike Smith. Is there a way for R to find instances like these and recognize them as the same person?
df <- data.frame(
name = c("John Doe", "John A. Doe", "Jane Smith", "Jane Anderson", "Jane Anderson Lowell",
"Jane B. Smith", "John Doe", "Jane Smith", "Michael Smith",
"Mike Smith", "A.K. Ross", "Ana Kristina Ross"),
rating = c(1,2,1,1,2,3,1,4,2,1,3,2)
)
Here, there are multiple repeated individuals, whether the variant be a middle initial, a shortened name, a lengthened name, or someone whose last name changed. I've been trying to find a function that could give a similarity percentage of characters in name matches, and from there I could manually examine cases of high percentage to evaluate if they are indeed the same person. My end goal is to find the average rating by person, where I would need to sort by the individual.
The actual solution depends on your data and the variations that it can have. How can you identify one unique individual? Maybe extracting first and last name from each name would help?
library(dplyr)
library(stringr)
df %>%
group_by(firstname = word(name, 1),
lastname = word(name, -1)) %>%
summarise(rating = mean(rating)) %>%
ungroup
# firstname lastname rating
# <chr> <chr> <dbl>
#1 Jane Anderson 1
#2 Jane Smith 1.33
#3 John Doe 2
We could do this with extract from tidyr
library(dplyr)
library(tidyr)
df %>%
tidyr::extract(name, into = c('firstname', 'lastname'),
"^(\\w+).*\\s(\\w+)$") %>%
group_by(firstname, lastname) %>%
summarise(rating = mean(rating, na.rm = TRUE), .groups = 'drop')
# A tibble: 3 x 3
firstname lastname rating
<chr> <chr> <dbl>
1 Jane Anderson 1
2 Jane Smith 1.33
3 John Doe 2
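Neither answer above produces the similarity score the question asks about. As a rough sketch in base R, one could compute the edit distance between every pair of names with adist() and scale it by the longer name's length. The 0.7 threshold and the variable names here are assumptions chosen for illustration, and the resulting pairs still need manual review.

```r
name <- unique(c("John Doe", "John A. Doe", "Jane Smith",
                 "Michael Smith", "Mike Smith"))

# Levenshtein (edit) distance between every pair of names
d <- adist(name)

# Rough 0-1 similarity: 1 minus distance divided by the longer name's length
len <- nchar(name)
sim <- 1 - d / outer(len, len, pmax)
dimnames(sim) <- list(name, name)

# Keep each pair once (upper triangle) if it clears an arbitrary threshold
thr <- 0.7
idx <- which(sim >= thr & upper.tri(sim), arr.ind = TRUE)
candidates <- data.frame(a = name[idx[, 1]],
                         b = name[idx[, 2]],
                         similarity = round(sim[idx], 2),
                         stringsAsFactors = FALSE)
```

A score like this will also flag different people who share a surname, and it can miss nicknames, so it is best treated as a way to shortlist pairs rather than to decide matches.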

Replace multiple strings/values based on separate list

I have a data frame that looks similar to this:
EVENT ID GROUP YEAR X.1 X.2 X.3 Y.1 Y.2 Y.3
1 1 John Smith GROUP1 2015 1 John Smith 5 Adam Smith 12 Mike Smith 20 Sam Smith 7 Luke Smith 3 George Smith
Each row repeats for new logs, but the values in X.1 : Y.3 change often.
The ID's and the ID's present in X.1 : Y.3 have a numeric value and then the name ID, i.e., "1 John Smith" or "20 Sam Smith" will be the string.
I have an issue where in certain instances, the ID will remain as "1 John Smith" but in X.1 : Y.3 the number may change preceding "John Smith", so for example it might be "14 John Smith". The names will always be correct, it's just the number that sometimes gets mixed up.
I have a list of 200+ ID's that are impacted by this mismatch - what is the most efficient way to replace the values in X.1 : Y.3 so that they match the correct ID in column ID?
I won't know which column "14 John Smith" shows up in, it could be X.1, or Y.2, or Y.3 depending on the row.
I can use a replace function in a dplyr line of code, or gsub for each of the 200+ IDs and for each column affected, but it seems very inefficient. Is there a quicker way than repeating something like the below x times?
df%>%mutate(X.1=replace(X.1, grepl('John Smith', X.1), "1 John Smith"))%>%as.data.frame()
Sometimes it helps to temporarily reshape the data. That way we can operate on all the X and Y values without iterating over them.
library(stringr)
library(tidyr)
## some data to work with
exd <- read.csv(text = "EVENT,ID,GROUP,YEAR,X.1,X.2,X.3,Y.1,Y.2,Y.3
1,1 John Smith,GROUP1,2015,19 John Smith,11 Adam Smith,9 Sam Smith,5 George Smith,13 Mike Smith,12 Luke Smith
2,2 John Smith,GROUP1,2015,1 George Smith,9 Luke Smith,19 Adam Smith,7 Sam Smith,17 Mike Smith,11 John Smith
3,3 John Smith,GROUP1,2015,5 George Smith,18 John Smith,12 Sam Smith,6 Luke Smith,2 Mike Smith,4 Adam Smith",
stringsAsFactors = FALSE)
## re-arrange to put X and Y columns into a single column
exd <- gather(exd, key = "var", value = "value", X.1, X.2, X.3, Y.1, Y.2, Y.3)
## find the X and Y values that contain the ID name
matches <- str_detect(exd$value, str_replace_all(exd$ID, "^\\d+ *", ""))
## replace X and Y values with the matching ID
exd[matches, "value"] <- exd$ID[matches]
## put it back in the original shape
exd <- spread(exd, key = "var", value = value)
exd
## EVENT ID GROUP YEAR X.1 X.2 X.3 Y.1 Y.2 Y.3
## 1 1 1 John Smith GROUP1 2015 1 John Smith 11 Adam Smith 9 Sam Smith 5 George Smith 13 Mike Smith 12 Luke Smith
## 2 2 2 John Smith GROUP1 2015 1 George Smith 9 Luke Smith 19 Adam Smith 7 Sam Smith 17 Mike Smith 2 John Smith
## 3 3 3 John Smith GROUP1 2015 5 George Smith 3 John Smith 12 Sam Smith 6 Luke Smith 2 Mike Smith 4 Adam Smith
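gather() and spread() still work but have been superseded in tidyr by pivot_longer() and pivot_wider(). The same reshape-and-replace idea could be sketched as below, assuming the same example data; the names id_name and fixed_exd are made up for illustration.

```r
library(dplyr)
library(tidyr)
library(stringr)

exd <- read.csv(text = "EVENT,ID,GROUP,YEAR,X.1,X.2,X.3,Y.1,Y.2,Y.3
1,1 John Smith,GROUP1,2015,19 John Smith,11 Adam Smith,9 Sam Smith,5 George Smith,13 Mike Smith,12 Luke Smith
2,2 John Smith,GROUP1,2015,1 George Smith,9 Luke Smith,19 Adam Smith,7 Sam Smith,17 Mike Smith,11 John Smith
3,3 John Smith,GROUP1,2015,5 George Smith,18 John Smith,12 Sam Smith,6 Luke Smith,2 Mike Smith,4 Adam Smith",
                stringsAsFactors = FALSE)

fixed_exd <- exd %>%
  # stack the X and Y columns into one column
  pivot_longer(X.1:Y.3, names_to = "var", values_to = "value") %>%
  # strip the leading number from ID, then overwrite values containing that name
  mutate(id_name = str_remove(ID, "^\\d+ *"),
         value = if_else(str_detect(value, fixed(id_name)), ID, value)) %>%
  select(-id_name) %>%
  # restore the original wide layout
  pivot_wider(names_from = var, values_from = value)
```

As in the answer above, this matches on the name taken from each row's own ID column, so it assumes that column is the source of truth.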
Not sure if you're set on dplyr and piping, but I think this is a plyr solution that does what you need. Given this example dataset:
> df
EVENT ID GROUP YEAR X.1 X.2 X.3 Y.1 Y.2 Y.3
1 1 1 John Smith GROUP1 2015 19 John Smith 11 Adam Smith 9 Sam Smith 5 George Smith 13 Mike Smith 12 Luke Smith
2 2 2 John Smith GROUP1 2015 1 George Smith 9 Luke Smith 19 Adam Smith 7 Sam Smith 17 Mike Smith 11 John Smith
3 3 3 John Smith GROUP1 2015 5 George Smith 18 John Smith 12 Sam Smith 6 Luke Smith 2 Mike Smith 4 Adam Smith
This adply function goes row by row and replaces any matching X:Y column values with the one from the ID column:
library(plyr)
adply(df, .margins = 1, function(x) {
idcol <- as.character(x$ID)
searchname <- trimws(gsub('[[:digit:]]+', "", idcol))
sapply(x[5:10], function(y) {
ifelse(grepl(searchname, y), idcol, as.character(y))
})
})
Output:
EVENT ID GROUP YEAR X.1 X.2 X.3 Y.1 Y.2 Y.3
1 1 1 John Smith GROUP1 2015 1 John Smith 11 Adam Smith 9 Sam Smith 5 George Smith 13 Mike Smith 12 Luke Smith
2 2 2 John Smith GROUP1 2015 1 George Smith 9 Luke Smith 19 Adam Smith 7 Sam Smith 17 Mike Smith 2 John Smith
3 3 3 John Smith GROUP1 2015 5 George Smith 3 John Smith 12 Sam Smith 6 Luke Smith 2 Mike Smith 4 Adam Smith
Data:
names <- c("EVENT","ID",'GROUP','YEAR', paste(rep(c("X.", "Y."), each = 3), 1:3, sep = ""))
first <- c("John", "Sam", "Adam", "Mike", "Luke", "George")
set.seed(2017)
randvals <- t(sapply(1:3, function(x) paste(sample(1:20, size = 6),
paste(sample(first, replace = FALSE, size = 6), "Smith"))))
df <- cbind(data.frame(1:3, paste(1:3, "John Smith"), "GROUP1", 2015), randvals)
names(df) <- names
I think that the most efficient way to accomplish this is by building a loop. The reason is that you will have to repeat the replacement for every name in your ID list. With a loop, you can automate this.
I will make some assumptions first:
1) The ID list can be read as a character vector.
2) You don't have any typos in the ID list or in your data.frame, including differences between lowercase and uppercase letters in the names.
3) Your ID list does not contain the numbers. If it does contain numbers, you have to use gsub to erase them.
4) The example works with a data.frame (DF) with the same structure that you put in your question.
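ID <- c("John Smith", "Adam Smith", "George Smith")
cols <- 5:10
# grep() on a data frame tests whole deparsed columns, so work on a character matrix instead
vals <- as.matrix(DF[, cols])
for (i in seq_along(ID)) {
  vals[grepl(ID[i], vals, fixed = TRUE)] <- ID[i]
}
DF[, cols] <- vals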
With each round this loop will:
Identify the positions in the columns X.1:Y.3 (columns 5 to 10 in your question) where the name "i" appears.
Then, it will change all those values to the one in the "i" position of the ID vector.
So, the first iteration will do: 1) Search for every position where the name "John Smith" appears in the data frame. 2) Replace all those "# John Smith" with "John Smith".
Note: If you simply want to delete the numbers, you can use gsub to replace them. Take into account that you probably want to erase the first space between the number and the name too. One way to do this is using gsub and a regular expression:
DF[, 5:10] <- lapply(DF[, 5:10], function(col) gsub("^[0-9]+ ", "", col))

separate different combinations of names to first and last using dplyr, tidyr, and regex

Sample data frame:
name <- c("Smith John Michael","Smith, John Michael","Smith John, Michael","Smith-John Michael","Smith-John, Michael")
df <- data.frame(name)
df
name
1 Smith John Michael
2 Smith, John Michael
3 Smith John, Michael
4 Smith-John Michael
5 Smith-John, Michael
I need to achieve the following desired output:
name first.name last.name
1 Smith John Michael John Smith
2 Smith, John Michael John Smith
3 Smith John, Michael Michael Smith John
4 Smith-John Michael Michael Smith-John
5 Smith-John, Michael Michael Smith-John
The rules are: if there is a comma in the string, then anything before it is the last name, and the first word following the comma is the first name. If there is no comma in the string, the first word is the last name and the second word is the first name. Hyphenated words count as one word. I would rather achieve this with dplyr and regex, but I'll take any solution. Thanks for the help.
You can achieve your desired result using strsplit switching between splitting by "," or " " based on whether there is a comma or not in name. Here, we define two functions to make the presentation clearer. You can just as well inline the code within the functions.
get.last.name <- function(name) {
lapply(ifelse(grepl(",",name),strsplit(name,","),strsplit(name," ")),`[[`,1)
}
The result of strsplit is a list. The lapply(...,'[[',1) loops through this list and extracts the first element from each list element, which is the last name.
get.first.name <- function(name) {
d <- lapply(ifelse(grepl(",",name),strsplit(name,","),strsplit(name," ")),`[[`,2)
lapply(strsplit(gsub("^ ","",d), " "),`[[`,1)
}
This function is similar except we extract the second element from each list element returned by strsplit, which contains the first name. We then remove any starting spaces using gsub, and we split again with " " to extract the first element from each list element returned by that strsplit as the first name.
Putting it all together with dplyr:
library(dplyr)
res <- df %>% mutate(first.name=get.first.name(name),
last.name=get.last.name(name))
The result is as expected:
print(res)
## name first.name last.name
## 1 Smith John Michael John Smith
## 2 Smith, John Michael John Smith
## 3 Smith John, Michael Michael Smith John
## 4 Smith-John Michael Michael Smith-John
## 5 Smith-John, Michael Michael Smith-John
Data:
df <- structure(list(name = c("Smith John Michael", "Smith, John Michael",
"Smith John, Michael", "Smith-John Michael", "Smith-John, Michael"
)), .Names = "name", row.names = c(NA, -5L), class = "data.frame")
## name
##1 Smith John Michael
##2 Smith, John Michael
##3 Smith John, Michael
##4 Smith-John Michael
##5 Smith-John, Michael
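One thing to be aware of: because the two helper functions above return lists, mutate() stores first.name and last.name as list-columns. If plain character columns are preferred, the same rules could be sketched with sub() in base R. The function name split_name is hypothetical, and this assumes names follow exactly the patterns in the sample data.

```r
split_name <- function(name) {
  has_comma <- grepl(",", name, fixed = TRUE)
  # last name: everything before the comma, else the first word
  last <- ifelse(has_comma, sub(",.*$", "", name), sub(" .*$", "", name))
  # remainder: everything after the comma, else after the first word
  rest <- trimws(ifelse(has_comma,
                        sub("^[^,]*,", "", name),
                        sub("^[^ ]* ", "", name)))
  # first name: first word of the remainder
  first <- sub(" .*$", "", rest)
  data.frame(name, first.name = first, last.name = last,
             stringsAsFactors = FALSE)
}

name <- c("Smith John Michael", "Smith, John Michael", "Smith John, Michael",
          "Smith-John Michael", "Smith-John, Michael")
res2 <- split_name(name)
```

Hyphenated names need no special handling here because the patterns only split on commas and spaces.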
I am not sure if this is any better than aichao's answer, but I gave it a shot anyway. It gives the right output.
df1 <- df %>%
filter(grepl(",",name)) %>%
separate(name, c("last.name","first.middle.name"), sep = "\\,", remove=F) %>%
mutate(first.middle.name = trimws(first.middle.name)) %>%
separate(first.middle.name, c("first.name","middle.name"), sep="\\ ",remove=T) %>%
select(-middle.name)
df2 <- df %>%
filter(!grepl(",",name)) %>%
separate(name, c("last.name","first.name"), sep = "\\ ", remove=F)
df<-rbind(df1,df2)

Merge data frames with partial id

Say I have these two data frames:
> df1 <- data.frame(name = c('John Doe',
'Jane F. Doe',
'Mark Smith Simpson',
'Sam Lee'))
> df1
name
1 John Doe
2 Jane F. Doe
3 Mark Smith Simpson
4 Sam Lee
> df2 <- data.frame(family = c('Doe', 'Smith'), size = c(2, 6))
> df2
family size
1 Doe 2
2 Smith 6
I want to merge both data frames in order to get this:
name family size
1 John Doe Doe 2
2 Jane F. Doe Doe 2
3 Mark Smith Simpson Smith 6
4 Sam Lee <NA> NA
But I can't wrap my head around a way to do this apart from the following very convoluted solution, which is becoming very messy with my real data, which has over 100 "family names":
> df3 <- within(df1, {
family <- ifelse(test = grepl('Doe', name),
yes = 'Doe',
no = ifelse(test = grepl('Smith', name),
yes = 'Smith',
no = NA))
})
> merge(df3, df2, all.x = TRUE)
family name size
1 Doe John Doe 2
2 Doe Jane F. Doe 2
3 Smith Mark Smith Simpson 6
4 <NA> Sam Lee NA
I've tried taking a look into pmatch as well as the solutions provided at R partial match in data frame, but still haven't found what I'm looking for.
Rather than attempting to use regular expressions and partial matches, you could split the names up into a lookup-table format, where each component of a person's name is kept in a row, and matched to their full name:
df1 <- data.frame(name = c('John Doe',
'Jane F. Doe',
'Mark Smith Simpson',
'Sam Lee'),
stringsAsFactors = FALSE)
df2 <- data.frame(family = c('Doe', 'Smith'), size = c(2, 6),
stringsAsFactors = FALSE)
library(tidyr)
library(dplyr)
str_df <- function(x) {
ss <- strsplit(unlist(x)," ")
data.frame(family = unlist(ss),stringsAsFactors = FALSE)
}
splitnames <- df1 %>%
group_by(name) %>%
do(str_df(.))
splitnames
name family
1 Jane F. Doe Jane
2 Jane F. Doe F.
3 Jane F. Doe Doe
4 John Doe John
5 John Doe Doe
6 Mark Smith Simpson Mark
7 Mark Smith Simpson Smith
8 Mark Smith Simpson Simpson
9 Sam Lee Sam
10 Sam Lee Lee
Now you can just merge or join this with df2 to get your answer:
left_join(df2,splitnames)
Joining by: "family"
family size name
1 Doe 2 Jane F. Doe
2 Doe 2 John Doe
3 Smith 6 Mark Smith Simpson
Potential problem: if one person's first name is the same as somebody else's last name, you'll get some incorrect matches!
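Note that the left_join above starts from df2, so Sam Lee (who has no matching family) drops out, whereas the desired output in the question keeps that row with NA. A base R sketch of the same lookup-table idea that keeps every name might look like this; the object names parts, lookup, matched and result are assumptions for illustration.

```r
df1 <- data.frame(name = c("John Doe", "Jane F. Doe",
                           "Mark Smith Simpson", "Sam Lee"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(family = c("Doe", "Smith"), size = c(2, 6),
                  stringsAsFactors = FALSE)

# one row per word of each name, paired with the full name
parts <- strsplit(df1$name, " ", fixed = TRUE)
lookup <- data.frame(name = rep(df1$name, lengths(parts)),
                     family = unlist(parts),
                     stringsAsFactors = FALSE)

# keep only the words that are known family names
matched <- merge(lookup, df2, by = "family")

# join back to df1, keeping names with no match (all.x = TRUE)
result <- merge(df1, matched, by = "name", all.x = TRUE)
```

The caveat above still applies: a first name that equals someone else's family name would produce a spurious match, and a name containing two known family names would appear twice.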
Here is one strategy: you could use lapply with grepl to match over all the family names. This will find them at any position. First let me define a helper function
transindex<-function(start=1) {
function(x) {
start<<-start+1
ifelse(x, start-1, NA)
}
}
and I will also be using the function coalesce.R to make things a bit simpler. Here is the code I'd run to match up df2 to df1
idx<-do.call(coalesce, lapply(lapply(as.character(df2$family),
function(x) grepl(paste0("\\b", x, "\\b"), as.character(df1$name))),
transindex()))
Starting on the inside and working out, I loop over all the family names in df2 and grepl for those values (adding "\b" to the pattern so I match entire words). grepl will return a logical vector (TRUE/FALSE). I then apply the above helper function transindex() to change those vectors to be either the index of the row in df2 that matched, or NA. Since it's possible that a row may match more than one family, I simply choose the first using the coalesce helper function.
Now that I can match up the rows in df1 to df2, I can bring them together with
cbind(df1, df2[idx,])
name family size
# 1 John Doe Doe 2
# 1.1 Jane F. Doe Doe 2
# 2 Mark Smith Simpson Smith 6
# NA Sam Lee <NA> NA
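The idx vector in this answer relies on an external coalesce helper that is not shown here. As a hedged alternative, roughly the same index could be built in base R from a logical match matrix; this is a sketch of the idea, not the answer's original code, and hits and idx2 are illustrative names.

```r
df1 <- data.frame(name = c("John Doe", "Jane F. Doe",
                           "Mark Smith Simpson", "Sam Lee"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(family = c("Doe", "Smith"), size = c(2, 6),
                  stringsAsFactors = FALSE)

# rows = names in df1, columns = families in df2, TRUE where the family
# appears as a whole word in the name
hits <- sapply(df2$family, function(f) {
  grepl(paste0("\\b", f, "\\b"), df1$name)
})

# for each name, the index of the first matching family, or NA if none
idx2 <- apply(hits, 1, function(r) unname(which(r)[1]))

combined <- cbind(df1, df2[idx2, ])
```

Taking the first TRUE in each row mirrors the "choose the first match" behaviour described above.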
Another approach that looks valid, at least with the sample data:
df1name = as.character(df1$name)
df1name
#[1] "John Doe" "Jane F. Doe" "Mark Smith Simpson" "Sam Lee"
regmatches(df1name, regexpr(paste(df2$family, collapse = "|"), df1name), invert = T) <- ""
df1name
#[1] "Doe" "Doe" "Smith" ""
cbind(df1, df2[match(df1name, df2$family), ])
# name family size
#1 John Doe Doe 2
#1.1 Jane F. Doe Doe 2
#2 Mark Smith Simpson Smith 6
#NA Sam Lee <NA> NA
