Joining rows with columns to create vertical table - r

I am trying to figure out how to join 2 data frames to create a vertical table of the data. Here is some sample data:
people <- data.frame(person = c("John","David","Peter"), company = c("A", "B", "C"))
grades <- data.frame(person1=c(10, 40, 50, 60), person2=c(60,70,80, 100), person3=c(33,44,55, 75))
NOTE: The order of the columns in grades is the same as the order of the person column in the people data frame.
I would like to get a data frame like the following but can't think of how to get there. Would prefer a solution using base R (am using an older version of R so some packages don't work for me):
person | company | grade
-------------------------
John | A | 10
John | A | 40
John | A | 50
John | A | 60
David | B | 60
David | B | 70
David | B | 80
David | B | 100
Peter | C | 33
Peter | C | 44
Peter | C | 55
Peter | C | 75

We set the column names of 'grades' to the 'person' column from 'people', gather into 'long' format, and then do a left_join with 'people'
library(tidyverse)
setNames(grades, people$person) %>%
gather(person, grade) %>%
left_join(people)
# person grade company
#1 John 10 A
#2 John 40 A
#3 John 50 A
#4 John 60 A
#5 David 60 B
#6 David 70 B
#7 David 80 B
#8 David 100 B
#9 Peter 33 C
#10 Peter 44 C
#11 Peter 55 C
#12 Peter 75 C
Or using base R with merge
merge(stack(setNames(grades, people$person)),
people, all.x = TRUE, by.x = 'ind', by.y = 'person')
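Note that merge() sorts the result by the key column by default, so the rows may not come back in the order of the original data. A minimal base R sketch that keeps the original order is to look up the company with match() instead. It assumes, as the question states, that the columns of grades are in the same order as people$person; stringsAsFactors = FALSE is set only so the sketch behaves the same on older R versions.

```r
people <- data.frame(person = c("John", "David", "Peter"),
                     company = c("A", "B", "C"),
                     stringsAsFactors = FALSE)
grades <- data.frame(person1 = c(10, 40, 50, 60),
                     person2 = c(60, 70, 80, 100),
                     person3 = c(33, 44, 55, 75))

# Long format: one row per grade, labelled with the person's name
long <- stack(setNames(grades, people$person))
res <- data.frame(person = as.character(long$ind),
                  grade = long$values,
                  stringsAsFactors = FALSE)

# Look up each person's company; rows stay in their original order
res$company <- people$company[match(res$person, people$person)]
res <- res[, c("person", "company", "grade")]
res
```

Because match() returns one position per row of res, the result always has exactly as many rows as there are grades.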

A base R option using cbind would be
idx <- rep(seq_along(people$person), each = dim(grades)[1])
cbind(people[idx,], stack(unlist(grades))["values"])
Result
# person company values
#1 John A 10
#1.1 John A 40
#1.2 John A 50
#1.3 John A 60
#2 David B 60
#2.1 David B 70
#2.2 David B 80
#2.3 David B 100
#3 Peter C 33
#3.1 Peter C 44
#3.2 Peter C 55
#3.3 Peter C 75
Use unlist and stack on grades to get
stack(unlist(grades))
values ind
1 10 person11
2 40 person12
3 50 person13
4 60 person14
5 60 person21
6 70 person22
7 80 person23
8 100 person24
9 33 person31
10 44 person32
11 55 person33
12 75 person34
Since "The order of the columns in grades is the same as the order of the person column in the people data frame.", we can use cbind next, after expanding people to the correct number of rows.
(idx <- rep(seq_along(people$person), each = dim(grades)[1]))
# [1] 1 1 1 1 2 2 2 2 3 3 3 3
Another option, probably a little faster, would be
cbind(people[idx,], data.frame(grade = unlist(grades, use.names = FALSE)))
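The positional approach relies entirely on the column order of grades matching the row order of people, so it may be worth guarding that assumption. The following is a small sketch along those lines, not a definitive implementation; it also resets the row names, which otherwise come out as 1, 1.1, 1.2, and so on.

```r
people <- data.frame(person = c("John", "David", "Peter"),
                     company = c("A", "B", "C"),
                     stringsAsFactors = FALSE)
grades <- data.frame(person1 = c(10, 40, 50, 60),
                     person2 = c(60, 70, 80, 100),
                     person3 = c(33, 44, 55, 75))

# There must be exactly one grades column per person for this to be valid
stopifnot(ncol(grades) == nrow(people))

# Repeat each row of people once per grade
idx <- rep(seq_len(nrow(people)), each = nrow(grades))
res <- cbind(people[idx, ], grade = unlist(grades, use.names = FALSE))

# Replace the row names 1, 1.1, 1.2, ... with plain 1..12
rownames(res) <- NULL
res
```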

Related

Remove inconsistent duplicate entries from data frame with Base R

I want to remove duplicate entries from a data frame that are inconsistent, the following gives a simplified example:
df <- data.frame(name = c("Andy", "Bert", "Cindy", "Cindy", "David", "Edgar", "Edgar", "Frank", "George", "George", "George", "Herbert", "Iris", "Iris", "Iris"), amount = c(100, 50, 30, 30, 200, 65, 55, 90, 120, 120, 120, 300, 15, 25, 25))
df
## name amount
## 1 Andy 100
## 2 Bert 50
## 3 Cindy 30
## 4 Cindy 30
## 5 David 200
## 6 Edgar 65
## 7 Edgar 55
## 8 Frank 90
## 9 George 120
## 10 George 120
## 11 George 120
## 12 Herbert 300
## 13 Iris 15
## 14 Iris 25
## 15 Iris 25
Version A)
Edgar and Iris each appear more than once, yet the amounts given for them are inconsistent, so I want to remove their entries:
#remove inconsistent duplicate entries
df2
## name amount
## 1 Andy 100
## 2 Bert 50
## 3 Cindy 30
## 4 Cindy 30
## 5 David 200
## 6 Frank 90
## 7 George 120
## 8 George 120
## 9 George 120
## 10 Herbert 300
Version B)
Another possibility would be to keep only one instance of the consistent entries:
#keep only one instance of consistent entries
df3
## name amount
## 1 Andy 100
## 2 Bert 50
## 3 Cindy 30
## 4 David 200
## 5 Frank 90
## 6 George 120
## 7 Herbert 300
I am interested in (elegant?) ways to solve both versions in Base R. Efficiency should not be a problem because the dataset I have is not that huge.
A base solution that solves both at once. This has the side effect of requiring row name changes.
A Remove "inconsistent" values
new_df<-do.call("rbind",
Filter(function(x) all(x$amount == x$amount[1]),
split(df,df$name)))
name amount
Andy Andy 100
Bert Bert 50
Cindy.3 Cindy 30
Cindy.4 Cindy 30
David David 200
Frank Frank 90
George.9 George 120
George.10 George 120
George.11 George 120
Herbert Herbert 300
The above needs further cleaning of row names (an unwanted side effect perhaps but we deal with that below)
B Remove duplicates
new_df<-new_df[!duplicated(new_df$name),]
row.names(new_df) <- 1:nrow(new_df)
Combined result
new_df
name amount
1 Andy 100
2 Bert 50
3 Cindy 30
4 David 200
5 Frank 90
6 George 120
7 Herbert 300
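If the changed row names are a nuisance, one alternative is to count the distinct amounts per name with ave() and index the original data frame directly. This is only a sketch of that idea; the names n_distinct_amounts, version_a and version_b are made up for illustration.

```r
df <- data.frame(
  name = c("Andy", "Bert", "Cindy", "Cindy", "David", "Edgar", "Edgar", "Frank",
           "George", "George", "George", "Herbert", "Iris", "Iris", "Iris"),
  amount = c(100, 50, 30, 30, 200, 65, 55, 90, 120, 120, 120, 300, 15, 25, 25),
  stringsAsFactors = FALSE
)

# Number of distinct amounts within each name, repeated for every row
n_distinct_amounts <- ave(df$amount, df$name,
                          FUN = function(x) length(unique(x)))

# Version A: keep only names whose amounts all agree
version_a <- df[n_distinct_amounts == 1, ]

# Version B: additionally collapse identical rows
version_b <- unique(version_a)
```

Since the rows are selected by a logical index, the original row names (1, 2, 3, 4, 5, 8, ...) are kept and no rbind clean-up is needed.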
The question specifically requests a base solution. If for whatever reason someone from the future wants to use dplyr, I will leave this solution here.
Using dplyr, we can check within each name whether all values are equal to the first value of amount, and keep only the groups where that holds. Proceed with removing duplicates for what remains.
A Remove inconsistent ones
library(dplyr)
# mutate() cannot overwrite the grouping column, so filter whole groups instead
(df %>%
group_by(name) %>%
filter(all(amount == first(amount))) -> new_df)
# A tibble: 10 x 2
# Groups: name [7]
name amount
<chr> <dbl>
1 Andy 100
2 Bert 50
3 Cindy 30
4 Cindy 30
5 David 200
6 Frank 90
7 George 120
8 George 120
9 George 120
10 Herbert 300
Remove duplicates
new_df %>%
filter(!duplicated(name)) %>%
ungroup()
# A tibble: 7 x 2
name amount
<chr> <dbl>
1 Andy 100
2 Bert 50
3 Cindy 30
4 David 200
5 Frank 90
6 George 120
7 Herbert 300
A) First aggregate to apply the conditions, then filter the data and finally stack the result.
t <- aggregate( amount ~ name, df, function(x) c(unique(x),length(x)) )
t_m <- t[!sapply( t$amount, function(x) (length(x)>2) ),]
setNames( stack( setNames(lapply( t_m$amount, function(x)
rep(x[1],x[2]) ), t_m$name) )[,c("ind", "values")], colnames(df) )
name amount
1 Andy 100
2 Bert 50
3 Cindy 30
4 Cindy 30
5 David 200
6 Frank 90
7 George 120
8 George 120
9 George 120
10 Herbert 300
B) Is a bit more straightforward. Just aggregate and filter.
t <- aggregate( amount ~ name, df, unique )
t[lengths(t$amount) == 1,]
name amount
1 Andy 100
2 Bert 50
3 Cindy 30
4 David 200
6 Frank 90
7 George 120
8 Herbert 300
You can use duplicated, but you need to flag every copy of a duplicated name, not just the later ones (this gives your version B).
The result can then be used to filter the original data frame for all matching rows (your version A).
df <- data.frame(name = c("Andy", "Bert", "Cindy", "Cindy", "David", "Edgar", "Edgar", "Frank", "George", "George", "George", "Herbert", "Iris", "Iris", "Iris"), amount = c(100, 50, 30, 30, 200, 65, 55, 90, 120, 120, 120, 300, 15, 25, 25))
df_unq <- unique(df)
df3 <- df_unq[!(duplicated(df_unq$name)|duplicated(df_unq$name, fromLast = TRUE)), ]
df3
#> name amount
#> 1 Andy 100
#> 2 Bert 50
#> 3 Cindy 30
#> 5 David 200
#> 8 Frank 90
#> 9 George 120
#> 12 Herbert 300
df[df$name %in% df3$name, ]
#> name amount
#> 1 Andy 100
#> 2 Bert 50
#> 3 Cindy 30
#> 4 Cindy 30
#> 5 David 200
#> 8 Frank 90
#> 9 George 120
#> 10 George 120
#> 11 George 120
#> 12 Herbert 300
Created on 2021-12-12 by the reprex package (v2.0.1)
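The combination of duplicated() with and without fromLast = TRUE is what flags every copy of a repeated value. A tiny illustration on a made-up vector x:

```r
x <- c("a", "b", "b", "c", "b")

# Flags only the second and later occurrences
later_copies <- duplicated(x)

# Scanning from the end flags the first and middle occurrences instead
earlier_copies <- duplicated(x, fromLast = TRUE)

# Together they flag every element whose value occurs more than once
all_copies <- later_copies | earlier_copies

x[!all_copies]  # values that occur exactly once: "a" "c"
```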
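For the first requirement, where you need to get rid of duplicate entries, there's a built-in function in R called duplicated.
Here's the code. The first line drops rows that are exact duplicates; the second keeps only the first row for each name:
df[!duplicated(df), ]
df[!duplicated(df$name),]
The output of the second line looks like this. Note that it still contains the first entry for Edgar and for Iris, whose amounts are inconsistent: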
name amount
1 Andy 100
2 Bert 50
3 Cindy 30
5 David 200
6 Edgar 65
8 Frank 90
9 George 120
12 Herbert 300
13 Iris 15
And for the second requirement, you'll need to do something like this:
df <- unique(df)
df <- split(df, df$name)
df <- df[sapply(df, nrow) == 1]
df <- do.call(rbind, df)
rownames(df) <- 1:nrow(df)
The output looks like this:
name amount
1 Andy 100
2 Bert 50
3 Cindy 30
4 David 200
5 Frank 90
6 George 120
7 Herbert 300
Both versions use base R. You can do the same with the dplyr package.
Problem B is a sub-problem of problem A. To solve A we can use var() to find inconsistent values, utilizing the behavior of Filter() which always takes NAs as FALSE. To solve B we just need to get rid of duplicated rows in A applying unique().
Case A
with(df, df[!name %in% names(Filter(var, split(amount, name))), ])
# name amount
# 1 Andy 100
# 2 Bert 50
# 3 Cindy 30
# 4 Cindy 30
# 5 David 200
# 8 Frank 90
# 9 George 120
# 10 George 120
# 11 George 120
# 12 Herbert 300
Case B
with(df, df[!name %in% names(Filter(var, split(amount, name))), ]) |>
unique()
# name amount
# 1 Andy 100
# 2 Bert 50
# 3 Cindy 30
# 5 David 200
# 8 Frank 90
# 9 George 120
# 12 Herbert 300
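To see why Filter(var, ...) picks out exactly the inconsistent names, it may help to look at the three cases separately. The list groups below is made up for illustration:

```r
groups <- list(constant = c(30, 30),   # variance 0  -> treated as FALSE
               varying  = c(65, 55),   # variance 50 -> treated as TRUE
               single   = 100)         # variance NA -> dropped by Filter()

variances <- sapply(groups, var)

# Only groups with a non-zero, non-missing variance survive
inconsistent <- names(Filter(var, groups))
inconsistent  # "varying"
```

One caveat of this trick is that it applies only to numeric columns, since var() is not defined for character data.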

Match and replace value using 2 Data Frames (R)

I have 2 data frames. I need to match details$Name with info$Name and replace the corresponding values in details$Salary with info$income. The data frame details should retain all of its rows and there should be no NAs (if a match is found, replace the value; if not, leave it as it is).
details<- data.frame(Name = c("Aks","Bob","Caty","David","Enya","Fredrick","Gaby","Hema","Isac","Jaby","Katy"),
Age = c(12,22,33,43,24,67,41,19,25,24,32),
Gender = c("f","m","m","f","m","f","m","f","m","m","m"),
Salary = c(1500,2000,3.6,8500,1.2,1400,2300,2.5,5.2,2000,1265))
info <- data.frame(Name = c("caty","Enya","Dadi","Enta","Billu","Viku","situ","Hema","Ignu","Isac"),
income = c(2500,5600,3200,1522,2421,3121,4122,5211,1000,3500))
Expected Result :
Name Age Gender Salary
Aks 12 f 1500
Bob 22 m 2000
Caty 33 m 2500
David 43 f 8500
Enya 24 m 5600
Fredrick 67 f 1400
Gaby 41 m 2300
Hema 19 f 5211
Isac 25 m 3500
Jaby 24 m 2000
Katy 32 m 1265
None of the following gives the expected result:
dplyr::left_join(details,info,by = "Name")
dplyr::right_join(details,info,by = "Name")
dplyr::inner_join(details,info, by ="Name") # for other match-and-replace tasks this works fine, but not here
dplyr::full_join(details,info,by ="Name")
All of the results contain NAs. I also tried the match function, but it does not give the desired result either. Any help would be highly appreciated.
Name is written in different cases in the two data frames, so we first need to bring the names to the same case, then do a left_join and use coalesce to select the first non-NA value between income and Salary.
library(dplyr)
details %>% mutate(Name = stringr::str_to_title(Name)) %>%
left_join(info %>% mutate(Name = stringr::str_to_title(Name)), by = "Name") %>%
mutate(Salary = coalesce(income, Salary)) %>%
select(names(details))
# Name Age Gender Salary
#1 Aks 12 f 1500
#2 Bob 22 m 2000
#3 Caty 33 m 2500
#4 David 43 f 8500
#5 Enya 24 m 5600
#6 Fredrick 67 f 1400
#7 Gaby 41 m 2300
#8 Hema 19 f 5211
#9 Isac 25 m 3500
#10 Jaby 24 m 2000
#11 Katy 32 m 1265
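The same join-then-coalesce idea can be sketched without dplyr, using merge() on a lower-cased key and ifelse() in place of coalesce. The helper column key and the data frame lookup are names introduced here for illustration only.

```r
details <- data.frame(
  Name = c("Aks", "Bob", "Caty", "David", "Enya", "Fredrick",
           "Gaby", "Hema", "Isac", "Jaby", "Katy"),
  Age = c(12, 22, 33, 43, 24, 67, 41, 19, 25, 24, 32),
  Gender = c("f", "m", "m", "f", "m", "f", "m", "f", "m", "m", "m"),
  Salary = c(1500, 2000, 3.6, 8500, 1.2, 1400, 2300, 2.5, 5.2, 2000, 1265),
  stringsAsFactors = FALSE
)
info <- data.frame(
  Name = c("caty", "Enya", "Dadi", "Enta", "Billu",
           "Viku", "situ", "Hema", "Ignu", "Isac"),
  income = c(2500, 5600, 3200, 1522, 2421, 3121, 4122, 5211, 1000, 3500),
  stringsAsFactors = FALSE
)

# Case-insensitive join key; info keeps only the key and the income
details$key <- tolower(details$Name)
lookup <- data.frame(key = tolower(info$Name), income = info$income,
                     stringsAsFactors = FALSE)

# all.x = TRUE keeps every row of details; unmatched rows get income = NA
merged <- merge(details, lookup, by = "key", all.x = TRUE)

# Base R stand-in for coalesce(income, Salary)
merged$Salary <- ifelse(is.na(merged$income), merged$Salary, merged$income)
result <- merged[, c("Name", "Age", "Gender", "Salary")]
```

merge() sorts the result by key, so the row order may differ from details when the names are not already in alphabetical order.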
A base R solution:
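# Position of each details$Name in info$Name, ignoring case (NA if absent)
matches <- match(tolower(details$Name), tolower(info$Name))
found <- !is.na(matches)
# Overwrite Salary only where a match was found
details$Salary[found] <- info$income[matches[found]]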
#Result
Name Age Gender Salary
1 Aks 12 f 1500
2 Bob 22 m 2000
3 Caty 33 m 2500
4 David 43 f 8500
5 Enya 24 m 5600
6 Fredrick 67 f 1400
7 Gaby 41 m 2300
8 Hema 19 f 5211
9 Isac 25 m 3500
10 Jaby 24 m 2000
11 Katy 32 m 1265

Best method to Merge two Datasets (Maybe if function?)

I have two data sets I am working with, TestA and TestB (below is how to make them in R):
Instructor <- c('Mr.A','Mr.A','Mr.B', 'Mr.C', 'Mr.D')
Class <- c('French','French','English', 'Math', 'Geometry')
Section <- c('1','2','3','5','5')
Time <- c('9:00-10:00','10:00-11:00','9:00-10:00','9:00-10:00','10:00-11:00')
Date <- c('MWF','MWF','TR','TR','MWF')
Enrollment <- c('30','40','24','29','40')
TestA <- data.frame(Instructor,Class,Section,Time,Date,Enrollment)
rm(Instructor,Class,Section,Time,Date,Enrollment)
Student <- c("Frances","Cass","Fern","Pat","Peter","Kory","Cole")
ID <- c('123','121','101','151','456','789','314')
Instructor <- c('','','','','','','')
Time <- c('','','','','','','')
Date <- c('','','','','','','')
Enrollment <- c('','','','','','','')
Class <- c('French','French','French','French','English', 'Math', 'Geometry')
Section <- c('1','1','2','2','3','5','5')
TestB <- data.frame(Student, ID, Instructor, Class, Section, Time, Date, Enrollment)
rm(Instructor,Class,Section,Time,Date,Enrollment,ID,Student)
I would like to merge both datasets (if possible, without using merge()) so that all the columns of TestA are filled with the information provided by TestB, matched on Class and Section.
I tried using merge(TestA, TestB, by=c('Class','Section'), all.x=TRUE), but it adds observations to the original TestA. This is just a test; the datasets I am actually using have hundreds of observations. It worked with these smaller frames, but something goes wrong with the bigger set. That's why I'd like to know if there is an alternative to merge.
Any ideas on how to do this?
The output should look like this
Class Section Instructor Time Date Enrollment Student ID
English 3 Mr.B 9:00-10:00 TR 24 Peter 456
French 1 Mr.A 9:00-10:00 MWF 30 Frances 123
French 1 Mr.A 9:00-10:00 MWF 30 Cass 121
French 2 Mr.A 10:00-11:00 MWF 40 Fern 101
French 2 Mr.A 10:00-11:00 MWF 40 Pat 151
Geometry 5 Mr.D 10:00-11:00 MWF 40 Cole 314
Math 5 Mr.C 9:00-10:00 TR 29 Kory 789
I was once a big fan of merge() until I learned about dplyr's join functions.
Try this instead:
library(dplyr)
TestA %>%
left_join(TestB, by = c("Class", "Section")) %>% #Here, you're joining by just the "Class" and "Section" columns of TestA and TestB
select(Class,
Section,
Instructor = Instructor.x,
Time = Time.x,
Date = Date.x,
Enrollment = Enrollment.x,
Student,
ID) %>%
arrange(Class, Section) #Added to match your output.
The select statement is keeping only those columns that are specifically named and, in some cases, renaming them.
Output:
Class Section Instructor Time Date Enrollment Student ID
1 English 3 Mr.B 9:00-10:00 TR 24 Peter 456
2 French 1 Mr.A 9:00-10:00 MWF 30 Frances 123
3 French 1 Mr.A 9:00-10:00 MWF 30 Cass 121
4 French 2 Mr.A 10:00-11:00 MWF 40 Fern 101
5 French 2 Mr.A 10:00-11:00 MWF 40 Pat 151
6 Geometry 5 Mr.D 10:00-11:00 MWF 40 Cole 314
7 Math 5 Mr.C 9:00-10:00 TR 29 Kory 789
The key is to drop the empty but duplicate columns from TestB before merging / joining as shown by SymbolixAU.
Here is an implementation in data.table syntax:
library(data.table)
setDT(TestB)[, .(Student, ID, Class, Section)][setDT(TestA), on = .(Class, Section)]
Student ID Class Section Instructor Time Date Enrollment
1: Frances 123 French 1 Mr.A 9:00-10:00 MWF 30
2: Cass 121 French 1 Mr.A 9:00-10:00 MWF 30
3: Fern 101 French 2 Mr.A 10:00-11:00 MWF 40
4: Pat 151 French 2 Mr.A 10:00-11:00 MWF 40
5: Peter 456 English 3 Mr.B 9:00-10:00 TR 24
6: Kory 789 Math 5 Mr.C 9:00-10:00 TR 29
7: Cole 314 Geometry 5 Mr.D 10:00-11:00 MWF 40
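Since the question asks for an alternative to merge(), here is a base R sketch that fills the empty TestB columns by position using match() on a combined Class/Section key. The names key_a, key_b, pos and fill_cols are made up for illustration. Extra rows after a merge often come from duplicate keys in one of the tables, which is an assumption about the larger data set that the sketch checks explicitly.

```r
TestA <- data.frame(
  Instructor = c("Mr.A", "Mr.A", "Mr.B", "Mr.C", "Mr.D"),
  Class = c("French", "French", "English", "Math", "Geometry"),
  Section = c("1", "2", "3", "5", "5"),
  Time = c("9:00-10:00", "10:00-11:00", "9:00-10:00", "9:00-10:00", "10:00-11:00"),
  Date = c("MWF", "MWF", "TR", "TR", "MWF"),
  Enrollment = c("30", "40", "24", "29", "40"),
  stringsAsFactors = FALSE
)
TestB <- data.frame(
  Student = c("Frances", "Cass", "Fern", "Pat", "Peter", "Kory", "Cole"),
  ID = c("123", "121", "101", "151", "456", "789", "314"),
  Instructor = "", 
  Class = c("French", "French", "French", "French", "English", "Math", "Geometry"),
  Section = c("1", "1", "2", "2", "3", "5", "5"),
  Time = "", Date = "", Enrollment = "",
  stringsAsFactors = FALSE
)

# One key per row, built from the two matching columns
key_a <- paste(TestA$Class, TestA$Section, sep = "|")
key_b <- paste(TestB$Class, TestB$Section, sep = "|")

# Each Class/Section pair must appear only once in TestA,
# otherwise a join would multiply rows
stopifnot(!anyDuplicated(key_a))

# Row of TestA that belongs to each row of TestB (NA if there is none)
pos <- match(key_b, key_a)

# Copy the course details across; TestB keeps its row count and order
fill_cols <- c("Instructor", "Time", "Date", "Enrollment")
TestB[fill_cols] <- TestA[pos, fill_cols]
TestB
```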

R - Merging two data files based on partial matching of inconsistent full name formats

I'm looking for a way to merge two data files based on partial matching of participants' full names that are sometimes entered in different formats and sometimes misspelled. I know there are some different function options for partial matches (eg agrep and pmatch) and for merging data files but I need help with a) combining the two; b) doing partial matching that can ignore middle names; c) in the merged data file store both original name formats and d) retain unique values even if they don't have a match.
For example, I have the following two data files:
File name: Employee Data
Full Name | Date Started | Orders
---------------------------------
ANGELA MUIR | 6/15/14 | 25
EILEEN COWIE | 6/15/14 | 44
LAURA CUMMING | 10/6/14 | 43
ELENA POPA | 1/21/15 | 37
KAREN MACEWAN | 3/15/99 | 39
File name: Assessment data
Candidate | Leading Factor | SI-D | SI-I
----------------------------------------
Angie muir | I | -3 | 12
Caroline Burn | S | -5 | -3
Eileen Mary Cowie | S | -5 | 5
Elena Pope | C | -4 | 7
Henry LeFeuvre | C | -5 | -1
Jennifer Ford | S | -3 | -2
Karen McEwan | I | -4 | 10
Laura Cumming | S | 0 | 6
Mandip Johal | C | -2 | 2
Mubarak Hussain | D | 6 | -1
I want to merge them based on names (Full Name in df1 and Candidate in df2), ignoring middle names (e.g. Eileen Cowie = Eileen Mary Cowie), extra spaces (Laura Cumming = Laura Cumming), misspellings (e.g. Elena Popa = Elena Pope), etc.
The ideal output would look like this:
Name | Full Name | Candidate | Date Started | Orders | Leading Factor | SI-D | SI-I
-----------------------------------------------------------------------------------
ANGELA MUIR | ANGELA MUIR | Angie muir | 6/15/14 | 25 | I | -3 | 12
Caroline Burn | N/A | Caroline Burn | N/A | N/A | S | -5 | -3
EILEEN COWIE | EILEEN COWIE | Eileen Mary Cowie | 6/15/14 | 44 | S | -5 | 5
ELENA POPA | ELENA POPA | Elena Pope | 1/21/15 | 37 | C | -4 | 7
Henry LeFeuvre | N/A | Henry LeFeuvre | N/A | N/A | C | -5 | -1
Jennifer Ford | N/A | Jennifer Ford | N/A | N/A | S | -3 | -2
KAREN MACEWAN | KAREN MACEWAN | Karen McEwan | 3/15/99 | 39 | I | -4 | 10
LAURA CUMMING | LAURA CUMMING | Laura Cumming | 10/6/14 | 43 | S | 0 | 6
Mandip Johal | N/A | Mandip Johal | N/A | N/A | C | -2 | 2
Mubarak Hussain | N/A | Mubarak Hussain | N/A | N/A | D | 6 | -1
Any suggestions would be greatly appreciated!
Here's a process that may help. You will have to inspect the results and make adjustments as needed.
df1
# v1 v2
#1 ANGELA MUIR 6/15/14
#2 EILEEN COWIE 6/15/14
#3 AnGela Smith 5/3/14
df2
# u1 u2
#1 Eileen Mary Cowie I-3
#2 Angie muir -5 5
index <- sapply(df1$v1, function(x) {
agrep(x, df2$u1, ignore.case=TRUE, max.distance = .5)
}
)
index <- unlist(index)
df2$u1[index] <- names(index)
merge(df1, df2, by.x='v1', by.y='u1')
# v1 v2 u2
#1 ANGELA MUIR 6/15/14 -5 5
#2 EILEEN COWIE 6/15/14 I-3
I had to adjust the argument max.distance in the index function. It may not work for your data, but adjust and test if it works. If this doesn't help, there is a package called stringdist that may have a more robust matching function in amatch.
Data
v1 <- c('ANGELA MUIR', 'EILEEN COWIE', 'AnGela Smith')
v2 <- c('6/15/14', '6/15/14', '5/3/14')
u1 <- c('Eileen Mary Cowie', 'Angie muir')
u2 <- c('I-3', '-5 5')
df1 <- data.frame(v1, v2, stringsAsFactors=F)
df2 <- data.frame(u1, u2, stringsAsFactors = F)
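As a rough alternative to agrep, the names from the question can be normalized first (lower case, single spaces, first and last word only) and then compared with adist() from base R's utils package, which computes edit distances. This is only a sketch under stated assumptions: the helper normalize_name and the threshold max_dist = 3 are made up here, and the threshold would need tuning and manual inspection on real data.

```r
employees <- c("ANGELA MUIR", "EILEEN COWIE", "LAURA CUMMING",
               "ELENA POPA", "KAREN MACEWAN")
candidates <- c("Angie muir", "Caroline Burn", "Eileen Mary Cowie",
                "Elena Pope", "Henry LeFeuvre", "Jennifer Ford",
                "Karen McEwan", "Laura  Cumming", "Mandip Johal",
                "Mubarak Hussain")

# Hypothetical helper: lower case, collapse whitespace, drop middle names
normalize_name <- function(x) {
  x <- tolower(trimws(gsub("\\s+", " ", x)))
  parts <- strsplit(x, " ", fixed = TRUE)
  vapply(parts, function(p) paste(p[1], p[length(p)]), character(1))
}

# Edit distance between every candidate (rows) and every employee (columns)
d <- adist(normalize_name(candidates), normalize_name(employees))

# Closest employee for each candidate and the distance to it
best <- apply(d, 1, which.min)
best_dist <- d[cbind(seq_along(best), best)]

# Accept the closest name only if it is near enough (assumed threshold)
max_dist <- 3
matched <- ifelse(best_dist <= max_dist, employees[best], NA)

data.frame(Candidate = candidates, FullName = matched,
           stringsAsFactors = FALSE)
```

The resulting FullName column can then serve as the key for an ordinary merge with all = TRUE, which keeps unmatched candidates and both original spellings.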

Deduplicate dataframe based on criteria in R?

I've got this dataframe:
Name Country Gender Age
1 John GB M 25
2 Mark US M 35
3 Jane 0 0 0
4 Jane US F 30
5 Jane US F 0
6 Kate GB F 18
As you can see the value "Jane" appears 3 times. What I want to do is to deduplicate the list based on the variable "Name" but because the rest of the columns are important to me, I want to keep the rows that have the most information in them. For example if I was to deduplicate the above file in excel, it would keep the first value of "Jane" and delete all the other ones. But the first value of "Jane" (row no3) has got missing information in the other columns.
So in other words I want to deduplicate the list by "Name" but add a criterion to keep the rows that have a value other than "0" in the column "Age". The result I would then get is this:
Name Country Gender Age
1 John GB M 25
2 Mark US M 35
3 Jane US F 30
4 Kate GB F 18
I have tried this
file3 <- file1[!duplicated(file1$Name),]
But like excel it keeps the value of "Jane" that has no usable information in the other columns.
How do I sort the rows based on column "Age" in a Z-A order so that anything that has "0" will be on the bottom and will be removed when I deduplicate the list?
Cheers
David
Try this trick
ind <- with(DF,
Country !=0 &
Gender %in% c('F', 'M') &
Age !=0)
DF[ind, ]
Name Country Gender Age
1 John GB M 25
2 Mark US M 35
4 Jane US F 30
6 Kate GB F 18
So far it works well and produces your desired output
EDIT
library(doBy)
orderBy(~ -Age+Name, DF) # Sort by decreasing Age, then by Name
Name Country Gender Age
2 Mark US M 35
4 Jane US F 30
1 John GB M 25
6 Kate GB F 18
3 Jane 0 0 0
5 Jane US F 0
Or simply using Base functions:
DF[order(DF$Age, DF$Name, decreasing = TRUE), ]
Name Country Gender Age
2 Mark US M 35
4 Jane US F 30
1 John GB M 25
6 Kate GB F 18
3 Jane 0 0 0
5 Jane US F 0
Now you can select the rows that meet your conditions by indexing. I really think the first approach is better than these last two.
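The sort-then-deduplicate step that the question asks about can be sketched in base R as follows. The data frame is rebuilt from the question's table; sorted and dedup are illustrative names.

```r
file1 <- data.frame(
  Name = c("John", "Mark", "Jane", "Jane", "Jane", "Kate"),
  Country = c("GB", "US", "0", "US", "US", "GB"),
  Gender = c("M", "M", "0", "F", "F", "F"),
  Age = c(25, 35, 0, 30, 0, 18),
  stringsAsFactors = FALSE
)

# Put the rows with the largest Age first, so rows with Age 0 come last
sorted <- file1[order(file1$Age, decreasing = TRUE), ]

# duplicated() keeps the first occurrence, which is now the best row per Name
dedup <- sorted[!duplicated(sorted$Name), ]

# Optional: restore the original row order
dedup <- dedup[order(as.integer(rownames(dedup))), ]
dedup
```

This keeps the row with the highest Age for each name, which coincides with "the row that has a non-zero Age" only when at most one row per name has usable data.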
If all duplicated rows have the value zero in column Age, it will work with subset:
# the data
file1 <- read.table(text="Name Country Gender Age
1 John GB M 25
2 Mark US M 35
3 Jane 0 0 0
4 Jane US F 30
5 Jane US F 0
6 Kate GB F 18", header = TRUE, stringsAsFactors = FALSE)
# create a subset of the data
subset(file1, Age > 0)
# Name Country Gender Age
# 1 John GB M 25
# 2 Mark US M 35
# 4 Jane US F 30
# 6 Kate GB F 18
