Update R dataframe column based on conditions

I am trying to update a dataframe based on a certain condition. Here is the sample dataframe.
fname mname lname
1 RONALD D VALE
2 RONALD VALE
3 JACK A SMITH
4 JACK B SMITH
5 JACK SMITH
I would like to update the middle names column if the first and last names match. In this example, I would expect the following output.
fname mname lname
1 RONALD D VALE
2 RONALD D VALE
3 JACK A SMITH
4 JACK B SMITH
5 JACK SMITH
I also do not want to update the table if there are two different middle initials. There are some missing values in the data. So the main aim is to identify and merge multiple entries which are possibly similar. At the same time, we do not want to introduce erroneous data into the table.

A tidyverse solution (assuming empty middle names are stored as NA, not ""):
library(dplyr)
df %>%
  group_by(fname, lname) %>%
  mutate(mname_count = n_distinct(mname, na.rm = TRUE)) %>%
  mutate(mname = ifelse(mname_count == 1, unique(na.omit(mname)), mname)) %>%
  ungroup() %>%
  select(-mname_count)

An ugly base R solution (assuming you changed your "" to NA):
unic <- unique(lolz[, c("fname", "lname")])
for (i in seq_len(nrow(unic))) {
  idx <- lolz$fname == unic[i, 1] & lolz$lname == unic[i, 2]
  lelz <- lolz$mname[idx]
  known <- unique(lelz[!is.na(lelz)])
  if (length(known) == 1) {
    lelz[is.na(lelz)] <- known  # fill with the group's only middle name
    lolz$mname[idx] <- lelz
  }
}
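The same rule (fill only when a first/last name pair has exactly one distinct non-empty middle name) can also be sketched without an explicit loop by using ave(). This is a minimal sketch, assuming missing middle names are stored as "" as in the sample data; fill_if_unambiguous is a helper name introduced here for illustration.

```r
df1 <- data.frame(
  fname = c("RONALD", "RONALD", "JACK", "JACK", "JACK"),
  mname = c("D", "", "A", "B", ""),
  lname = c("VALE", "VALE", "SMITH", "SMITH", "SMITH"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: fill a group only if it has exactly one distinct
# non-empty middle name; otherwise return the group unchanged.
fill_if_unambiguous <- function(m) {
  known <- unique(m[nzchar(m)])
  if (length(known) == 1) rep(known, length(m)) else m
}

# ave() applies the helper within each (fname, lname) group
df1$mname <- ave(df1$mname, df1$fname, df1$lname, FUN = fill_if_unambiguous)
df1
```

Groups with two different initials (the JACK SMITH rows) are left untouched, so no guessed data is introduced.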

We can use data.table
library(data.table)
setDT(df1)[, mname := if (uniqueN(mname[nzchar(mname)]) == 1)
  mname[nzchar(mname)][1L] else mname, .(fname, lname)]  # [1L] keeps this valid when the same initial appears more than once in a group
df1
# fname mname lname
#1: RONALD D VALE
#2: RONALD D VALE
#3: JACK A SMITH
#4: JACK B SMITH
#5: JACK SMITH
data
df1 <- structure(list(fname = c("RONALD", "RONALD", "JACK", "JACK",
"JACK"), mname = c("D", "", "A", "B", ""), lname = c("VALE",
"VALE", "SMITH", "SMITH", "SMITH")), .Names = c("fname", "mname",
"lname"), class = "data.frame", row.names = c("1", "2", "3",
"4", "5"))

Related

Compare two dataframes based on first and last name separated by row

I have two dataframes organised like this.
df1 <- data.frame(lastname = c("Miller", "Smith", "Grey"),
firstname = c("John", "Jane", "Hans")
)
df2 <- data.frame(lastname =c("Smith", "Grey"),
firstname = c("Jane", "Hans")
)
df2 is not necessarily a subset of df1. Duplicated entries are also possible.
My goal is to keep a copy of df1 containing only the entries that occur in both dataframes. Alternatively, I would like to end up with a subset of df1 with a new variable indicating whether the name is also an element of df2.
Can someone suggest a way to do this? A {dplyr} attempt is totally fine.
Desired output for this particular simple case:
res <- data.frame(lastname = c("Smith", "Grey"),
firstname = c("Jane", "Hans")
)
To cover the "alternatively" part of the question, here is an approach with left_join, adding a grouping variable grp to distinguish the two sets.
library(dplyr)
left_join(cbind(df1, grp = "A"), cbind(df2, grp = "B"),
c("lastname", "firstname"), suffix=c("_A", "_B"))
lastname firstname grp_A grp_B
1 Miller John A <NA>
2 Smith Jane A B
3 Grey Hans A B
or with base R merge (all.x = TRUE keeps every row of df1, like left_join)
merge(cbind(df1, grp = "A"), cbind(df2, grp = "B"),
c("lastname", "firstname"), suffixes=c("_A", "_B"), all.x=TRUE)
lastname firstname grp_A grp_B
1 Grey Hans A B
2 Miller John A <NA>
3 Smith Jane A B
To remove NA and compact the grps
na.omit(left_join(cbind(df1, grp = "A"), cbind(df2, grp = "B"),
c("lastname", "firstname"), suffix=c("_A", "_B"))) %>%
summarize(lastname, firstname,
grp = list(across(starts_with("grp"), ~ unique(.x))))
lastname firstname grp
1 Smith Jane A, B
2 Grey Hans A, B
The other part is simply
merge(df1, df2)
lastname firstname
1 Grey Hans
2 Smith Jane
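Since the question mentions that duplicated entries are possible, a join can multiply rows. A minimal base R sketch of the "new variable" variant that avoids this builds a combined key and tests membership with %in%. The column name in_df2 and the "|" separator are assumptions made here; the separator must not occur inside the names themselves.

```r
df1 <- data.frame(lastname = c("Miller", "Smith", "Grey"),
                  firstname = c("John", "Jane", "Hans"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(lastname = c("Smith", "Grey"),
                  firstname = c("Jane", "Hans"),
                  stringsAsFactors = FALSE)

# Build one key per row from both name columns
key1 <- paste(df1$lastname, df1$firstname, sep = "|")
key2 <- paste(df2$lastname, df2$firstname, sep = "|")

# Flag rows of df1 whose name also appears in df2 (row count of df1 is unchanged)
df1$in_df2 <- key1 %in% key2

# Subset of df1 that is represented in both data frames
res <- df1[df1$in_df2, c("lastname", "firstname")]
res
```

Because %in% only tests membership, each row of df1 appears at most once in res, however often the name is repeated in df2.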

Sort two column values by date in R

I have a dataframe where I have two columns with names
df1 <- structure(list(Col1 = c("Luis", "Pedro", "John", "Ingrid"),
Col2 = c("Raul", "Maria", "Chris", "Lia")),
class = "data.frame", row.names = c(NA, -4L))
and I have another one with dates corresponding to each name:
df2 <- structure(list(Name = c("Luis", "Pedro", "John", "Ingrid","Raul", "Maria", "Chris", "Lia"),
Date = c("10/05/22","04/05/22", "03/05/22", "07/05/22","01/05/22","06/05/22", "05/05/22","02/05/22")),
class = "data.frame", row.names = c(NA, -8L))
it looks like this:
df1:
Col1 Col2
1 Luis Raul
2 Pedro Maria
3 John Chris
4 Ingrid Lia
df2:
Name Date
1 Luis 10/05/22
2 Pedro 04/05/22
3 John 03/05/22
4 Ingrid 07/05/22
5 Raul 01/05/22
6 Maria 06/05/22
7 Chris 05/05/22
8 Lia 02/05/22
What I want is that, in each row, the name with the earlier date appears in the first column and the name with the later date appears in the second. Here is an example of the result I expect:
Col1 Col2
1 Raul Luis
2 Pedro Maria
3 John Chris
4 Lia Ingrid
We convert the 'Date' to Date class and do the ordering
df2$Date <- as.Date(df2$Date, "%d/%m/%y")
df2new <- df2[order(df2$Date),]
df1[] <- t(apply(df1, 1, function(x) x[order(match(x, df2new$Name))]))
-output
> df1
Col1 Col2
1 Raul Luis
2 Pedro Maria
3 John Chris
4 Lia Ingrid
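For exactly two name columns, the same result can be sketched without apply() by looking up each name's date with match() and swapping the pair where the second date is earlier. This is a minimal sketch under the assumption that every name in df1 occurs once in df2; d1, d2, swap and res are names introduced here.

```r
df1 <- data.frame(Col1 = c("Luis", "Pedro", "John", "Ingrid"),
                  Col2 = c("Raul", "Maria", "Chris", "Lia"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(
  Name = c("Luis", "Pedro", "John", "Ingrid", "Raul", "Maria", "Chris", "Lia"),
  Date = c("10/05/22", "04/05/22", "03/05/22", "07/05/22",
           "01/05/22", "06/05/22", "05/05/22", "02/05/22"),
  stringsAsFactors = FALSE
)

dates <- as.Date(df2$Date, "%d/%m/%y")

# Date belonging to the name in each column
d1 <- dates[match(df1$Col1, df2$Name)]
d2 <- dates[match(df1$Col2, df2$Name)]

# Swap the two names wherever the second one has the earlier date
swap <- d2 < d1
res <- df1
res$Col1 <- ifelse(swap, df1$Col2, df1$Col1)
res$Col2 <- ifelse(swap, df1$Col1, df1$Col2)
res
```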

Modify multiple columns at same time in R

I don't know how to say it clearly, which is maybe why I did not find the answer, but I want to edit the values of two different columns at the same time, where those two columns are also the identifying columns.
For example this is the data :
> data = data.frame(name1 = c("John","Jake","John","Paul"),
name2 = c("Paul", "Paul","John","John"),
value1 = c(0,0,1,0),
value2 = c(1,0,1,0))
> data
name1 name2 value1 value2
1 John Paul 0 1
2 Jake Paul 0 0
3 John John 1 1
4 Paul John 0 0
I would like to edit the first row so that it becomes Jake & John instead of John & Paul, so I would like to combine these two lines of code and apply them at the same time:
data$name1[(data$name1 == "John" & data$name2 == "Paul")] <- "Jake"
data$name2[(data$name1 == "John" & data$name2 == "Paul")] <- "John"
It should be a simple trick, but I don't have it!
Also, I need to do this on larger datasets, where each modification can apply to multiple rows and I can't know in advance which rows will be modified.
How about this? (The example uses a two-column data frame with columns col1 and col2.)
data[data$col1 == "A" & data$col2 == "B", ] <- list("B", "D")
data
# col1 col2
#1 B D
#2 A C
#3 B A
#4 B B
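With the question's own columns, the two separate assignments fail because the second condition is evaluated after name1 has already changed. A minimal base R sketch of the fix is to evaluate the condition once, store it, and reuse it for both columns; hit is a name introduced here for illustration.

```r
data <- data.frame(name1 = c("John", "Jake", "John", "Paul"),
                   name2 = c("Paul", "Paul", "John", "John"),
                   value1 = c(0, 0, 1, 0),
                   value2 = c(1, 0, 1, 0),
                   stringsAsFactors = FALSE)

# Evaluate the condition once, before any column is modified
hit <- data$name1 == "John" & data$name2 == "Paul"

# Both updates use the stored logical vector, so order no longer matters
data$name1[hit] <- "Jake"
data$name2[hit] <- "John"
data
```

This works for any number of matching rows, including none, because indexing with an all-FALSE vector simply assigns nothing.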
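library(dplyr)
data %>%
  mutate(
    # evaluate the condition once, before either column changes;
    # otherwise the name2 rule would see the already-updated name1
    hit = name1 == "John" & name2 == "Paul",
    name1 = case_when(hit ~ "Jake", TRUE ~ name1),
    name2 = case_when(hit ~ "John", TRUE ~ name2)
  ) %>%
  select(-hit)  # drop the helper column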

Cumulative count of names from two separate columns

I have a data set in chronological order which I have imported to R using:
mydata <- read.csv(file="test.csv",stringsAsFactors=FALSE)
Two of the columns in the data set are 'winner' and 'loser'. Each row in the data is a tennis match.
What I am looking to do is to add two columns which give me a cumulative count of the total matches the player in the 'winner' column has played up to and including the match on that row. And the same count for the 'loser' in that row.
So for example it would look like this:
winner loser winner_matches loser_matches
tom andy 1 1
andy greg 2 1
greg tom 2 2
I hope that makes sense.
I have tried using the following code but can't get it to work across both columns:
ave(mydata$winner_name==mydata$winner_name, mydata$winner_name, FUN=cumsum)
So the data below is the first 10 rows of around 20,000.
1) base Define a function which counts matches up to the ith row for the indicated player and then apply it for the winner and loser matches separately. No packages are used:
count_matches <- function(i, player) {
with(DF[1:i, ], sum(winner == player | loser == player))
}
n <- nrow(DF)
transform(DF, winner_matches = mapply(count_matches, 1:n, winner),
loser_matches = mapply(count_matches, 1:n, loser))
giving:
winner loser winner_matches loser_matches
1 tom andy 1 1
2 andy greg 2 1
3 greg tom 2 2
2) sqldf A different solution can be obtained using sqldf upon realizing that this problem can be solved with a self-join on a complex condition like this:
library(sqldf)
sqldf("select a.winner,
a.loser,
sum(a.winner = b.winner or a.winner = b.loser) winner_matches,
sum(a.loser = b.winner or a.loser = b.loser) loser_matches
from DF a join DF b on a.rowid >= b.rowid
group by a.rowid")
giving:
winner loser winner_matches loser_matches
1 tom andy 1 1
2 andy greg 2 1
3 greg tom 2 2
Note: The input used, in reproducible form, is:
Lines <- "winner loser
tom andy
andy greg
greg tom"
DF <- read.table(text = Lines, header = TRUE, as.is = TRUE)
We can get the number of times that each player has won or lost so far with the data.table package:
library(data.table)
setDT(dat)[, winner_matches_won := seq_len(.N), by=(winner)]
setDT(dat)[, loser_matches_lost := seq_len(.N), by=(loser)]
dat
# winner loser winner_matches_won loser_matches_lost
# 1: tom andy 1 1
# 2: andy greg 1 1
# 3: greg tom 1 1
# 4: greg tom 2 2
# 5: tom greg 2 2
Data:
dat <- structure(list(winner = structure(c(3L, 1L, 2L, 2L, 3L), .Label = c("andy",
"greg", "tom"), class = "factor"), loser = structure(c(1L, 2L,
3L, 3L, 2L), .Label = c("andy", "greg", "tom"), class = "factor")), .Names = c("winner",
"loser"), class = "data.frame", row.names = c(NA, -5L))
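You're really close to getting ave to work. The cumsum function doesn't know how to handle text, so I created a dummy column that's equal to 1 for each row. That gives cumsum something to count. Note that this counts each column on its own (wins for the winner column, losses for the loser column) rather than total matches played across both columns.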
Here's a sample dataframe.
mydata <-
data.frame(
winner = c("tom", "andy", "greg", "tom", "gary"),
loser = c("andy", "greg", "tom", "gary", "tom"),
stringsAsFactors = FALSE
)
And here's the code to add the two new columns.
library(tidyverse)
mydata <- mutate(mydata, one = 1) # Add dummy column
# Use ave() to calculate both the wins and losses
mydata$winner_matches <- ave(x = mydata$one, mydata$winner, FUN = cumsum)
mydata$loser_matches <- ave(x = mydata$one, mydata$loser, FUN = cumsum)
mydata <- select(mydata, -one) # Remove dummy column
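To count matches played across both columns with ave(), as the question asks, one option is to stack the two columns into a single chronological vector first. The following is a minimal sketch on the question's three-row example; players, counts and m are names introduced here, and it assumes a player never appears as both winner and loser of the same match.

```r
mydata <- data.frame(
  winner = c("tom", "andy", "greg"),
  loser  = c("andy", "greg", "tom"),
  stringsAsFactors = FALSE
)

# Interleave the columns row by row: winner1, loser1, winner2, loser2, ...
players <- as.vector(t(as.matrix(mydata[, c("winner", "loser")])))

# Running appearance count per player over the interleaved vector
counts <- ave(seq_along(players), players, FUN = seq_along)

# Fold the counts back into one row per match
m <- matrix(counts, ncol = 2, byrow = TRUE)
mydata$winner_matches <- m[, 1]
mydata$loser_matches <- m[, 2]
mydata
```

Unlike a per-row scan of all earlier rows, this touches each appearance once, which may matter for around 20,000 matches.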

splitting a column delimiter R

I have a dataframe as below. I want to split the last column into two, splitting only at the first ":"; any later colons should stay in the second part.
The new dataframe will have 4 columns: the 3rd column will be (a, b, d) and the 4th column will be (1, 2:3, 3:4:4).
Any suggestions? The 4th line of my code doesn't work. I am okay with a completely new solution or with corrections to line 4.
employee <- c('John Doe','Peter Gynn','Jolie Hope')
salary <- c(3, 2, 1)
df <- data.frame(employee, salary, originalColumn = c("a :1", "b :2:3", "d: 3:4:4"))
as.data.frame(do.call(rbind, strsplit(df,":")))
--------------------update1
The solutions below work well, but I need a modified solution: I just realized that some cells in column 3 won't have a ":". In that case I want the text in that cell to appear only in the first column after splitting.
employee <- c('John Doe','Peter Gynn','Jolie Hope')
salary <- c(3, 2, 1)
df <- data.frame(employee, salary, originalColumn = c("a :1", "b", "d: 3:4:4"))
You could use cSplit. On your updated data frame,
library(splitstackshape)
cSplit(df, "originalColumn", sep = ":{1}")
# employee salary originalColumn_1 originalColumn_2
# 1: John Doe 3 a 1
# 2: Peter Gynn 2 b NA
# 3: Jolie Hope 1 d 3:4:4
And on your original data frame,
df1 <- data.frame(employee, salary,
originalColumn = c("a :1", "b :2:3", "d: 3:4:4"))
cSplit(df1, "originalColumn", sep = ":{1}")
# employee salary originalColumn_1 originalColumn_2
# 1: John Doe 3 a 1
# 2: Peter Gynn 2 b 2:3
# 3: Jolie Hope 1 d 3:4:4
Note: I'm using splitstackshape version 1.4.2. I believe the sep argument has been changed from version 1.4.0
You could use extract from tidyr to split the originalColumn into two columns. In the code below, I create 3 columns and remove the unwanted one from the result.
library(tidyr)
pat <- "([^ :])( ?:|: ?|)(.*)"
extract(df, originalColumn, c("Col1", "ColN", "Col2"), pat)[,-4]
# employee salary Col1 Col2
#1 John Doe 3 a 1
#2 Peter Gynn 2 b 2:3
#3 Jolie Hope 1 d 3:4:4
Using the updated df, (for better identification - df1)
extract(df1, originalColumn, c("Col1", "ColN", "Col2"), pat)[,-4]
# employee salary Col1 Col2
#1 John Doe 3 a 1
#2 Peter Gynn 2 b
#3 Jolie Hope 1 d 3:4:4
Or without creating a new column in df
extract(df, originalColumn, c("Col1", "Col2"), "(.)[ :](.*)") %>%
mutate(Col2= gsub("^\\:", "", Col2))
# employee salary Col1 Col2
#1 John Doe 3 a 1
#2 Peter Gynn 2 b 2:3
#3 Jolie Hope 1 d 3:4:4
Based on the pattern in df, the code below also works. The first group, (.), captures the single character at the beginning of the string for Col1. Then .{2} discards the two characters that follow it, and the remaining group, (.*), captures the rest of the string as Col2.
extract(df, originalColumn, c("Col1", "Col2"), "(.).{2}(.*)")
# employee salary Col1 Col2
#1 John Doe 3 a 1
#2 Peter Gynn 2 b 2:3
#3 Jolie Hope 1 d 3:4:4
or using strsplit
as.data.frame(do.call(rbind, strsplit(as.character(df$originalColumn), " :|: ")))
# V1 V2
#1 a 1
#2 b 2:3
#3 d 3:4:4
For df1, here is a solution using strsplit
lst <- strsplit(as.character(df1$originalColumn), " :|: ")
as.data.frame(do.call(rbind,lapply(lst,
`length<-`, max(sapply(lst, length)))) )
# V1 V2
#1 a 1
#2 b <NA>
#3 d 3:4:4
You were close, here's a solution:
library(stringr)
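df[, c('Col1','Col2')] <- str_split_fixed(df$originalColumn, "\\s*:\\s*", n = 2)  # already a 2-column matrix, so no do.call(rbind, ...); the pattern also drops spaces around the first ':'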
df$originalColumn <- NULL
employee salary Col1 Col2
1 John Doe 3 a 1
2 Peter Gynn 2 b 2:3
3 Jolie Hope 1 d 3:4:4
Notes:
stringr's splitting functions are more convenient here than base::strsplit() because you don't have to call as.character() first, and the n = 2 argument limits the split to the first ':'.
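If adding a package is not an option, splitting at the first ":" can also be sketched in base R with sub(). This is a minimal sketch on the updated data, where a cell without ":" yields NA in the second column; has_colon, col1 and col2 are names introduced here.

```r
x <- c("a :1", "b", "d: 3:4:4")

# Which cells contain a colon at all?
has_colon <- grepl(":", x, fixed = TRUE)

# Everything before the first colon (the whole string if there is none)
col1 <- trimws(sub(":.*$", "", x))

# Everything after the first colon, or NA when there is no colon
col2 <- ifelse(has_colon, trimws(sub("^[^:]*:", "", x)), NA_character_)

data.frame(Col1 = col1, Col2 = col2, stringsAsFactors = FALSE)
```

sub() replaces only the first match, so later colons stay inside the second column.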
