Merge not merging data frames correctly - r

I have three data frames that need to be merged. There are a few small differences between the competitor names in each data frame. For instance, one data frame might have no space between a middle and last name, while another displays the person's name correctly (example: Sarah JaneDoe vs. Sarah Jane Doe). So, I tried the two methods below.
The first method uses fuzzy matching to merge the first two data frames, but when I run the code, it just keeps running.
In the second attempt, I created a function that inserts a single space wherever a lower case letter is immediately followed by a capital letter, and then merged all three data frames together.
When I open the merged data set, the competitors who have NAs for their rank and team have their names spelled correctly in all three data sets. I'm not sure where the issue lies.
A few notes:
The 'comp01_n' column originally from the temp1 data frame is the same as the 'rank_1' column from the stats data frame. I kept them both to verify the data frames merged correctly at the end
I deleted rows in the 'fight' column with NA's because that was data for competitors not in the temp1 data frame. My actual data set is much larger and more complex.
Can you spot where I made an error and how to fix it?
library(fuzzyjoin)
library(tidyverse)
temp1 = read.csv('https://raw.githubusercontent.com/bandcar/bjj/main/temp1.csv')
stats=read.csv('https://raw.githubusercontent.com/bandcar/bjj/main/stats.csv')
winners = read.csv('https://raw.githubusercontent.com/bandcar/bjj/main/winners.csv')
#============================================
# Attempt 1
#============================================
#perform fuzzy matching full join
star = stringdist_join(temp1, stats,
by='Name', #match based on Name
mode='full', #use full join
method = "jw", #use jw distance metric
max_dist=99,
distance_col='dist') %>%
group_by(Name.x)
#============================================
# Attempt 2
#============================================
# Function to insert a space between a lower case letter and the capital letter that immediately follows it
format_name <- function(x) {
gsub("([a-z])([A-Z])", "\\1 \\2", x)
}
# Apply the function to the Name column
temp1$Name <- sapply(temp1$Name, format_name)
# create a list of all three data frames
df_list <- list(temp1, stats, winners)
# create a function to merge all data frames in the list on their shared columns
merge_dfs <- function(df_list) {
# Initialize the first data frame as the merged data frame
merged_df <- df_list[[1]]
# Loop through the rest of the data frames in the list
for (i in 2:length(df_list)) {
current_df <- df_list[[i]]
merged_df <- merge(merged_df, current_df, by=intersect(colnames(merged_df), colnames(current_df)), all=TRUE)
}
return(merged_df)
}
# apply function
t = merge_dfs(df_list)
# delete rows with NA in 'fight' column
t <- t[complete.cases(t[ , 'fight']), ]
# add suffix to indicate it's data for competitor 1
colnames(t)[c(5,16:32)]<-paste(colnames(t[,c(5,16:32)]),"1",sep="_")
# verify rank and comp01_n have the same values
result = ifelse(t$comp01_n == t$rank_1, 1, 0)
sum(result == 1, na.rm = TRUE)
sum(result == 0, na.rm = TRUE)
sum(is.na(result))
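One thing that may be worth checking, offered as a sketch rather than a confirmed diagnosis: merge() with by=intersect(...) pairs rows only when every shared column agrees, and format_name() is applied to temp1 only. Separately, Jaro-Winkler distances lie between 0 and 1, so max_dist=99 in attempt 1 accepts every pair of rows, which may be why that join appears never to finish. The example below uses small hypothetical tables (temp1_toy, stats_toy) and a hypothetical helper name_key() to join on a normalized name key instead.

```r
# Hypothetical toy tables standing in for temp1 and stats.
temp1_toy <- data.frame(Name = c("Sarah JaneDoe", "Ana Silva"),
                        fight = c(1, 2),
                        stringsAsFactors = FALSE)
stats_toy <- data.frame(Name = c("Sarah Jane Doe", "Ana  Silva"),
                        rank = c(3, 1),
                        stringsAsFactors = FALSE)

# Build a join key that ignores spacing and letter case (assumed helper).
name_key <- function(x) tolower(gsub("\\s+", "", x))

temp1_toy$key <- name_key(temp1_toy$Name)
stats_toy$key <- name_key(stats_toy$Name)

# Join on the key only, so differences in other shared columns
# (here, the differently spaced Name) cannot block a match.
joined <- merge(temp1_toy, stats_toy, by = "key", all = TRUE,
                suffixes = c("_temp1", "_stats"))
joined
```

Joining on an explicit key also makes it easy to list unmatched rows afterwards, for example with joined[is.na(joined$rank), ].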


Compare the column names of multiple CSV files in R

I'm combining 12 CSV files into one dataframe in R. Before doing this I want to ensure all the column names are an exact match with each other. I've made a dataframe where each column is the column names of the 12 CSV files.
jul21_cols <- data.frame(colnames(jul21))
aug21_cols <- data.frame(colnames(aug21))
sep21_cols <- data.frame(colnames(sep21))
oct21_cols <- data.frame(colnames(oct21))
nov21_cols <- data.frame(colnames(nov21))
dec21_cols <- data.frame(colnames(dec21))
jan22_cols <- data.frame(colnames(jan22))
feb22_cols <- data.frame(colnames(feb22))
mar22_cols <- data.frame(colnames(mar22))
apr22_cols <- data.frame(colnames(apr22))
may22_cols <- data.frame(colnames(may22))
jun22_cols <- data.frame(colnames(jun22))
col_df <- cbind(jul21_cols,aug21_cols,sep21_cols,oct21_cols,nov21_cols,dec21_cols,
jan22_cols,feb22_cols,mar22_cols,apr22_cols,may22_cols,jun22_cols)
I've tried using the identical function to compare 2 columns at a time.
identical(col_df[['jul21']], col_df[['aug21']])
identical(col_df[['aug21']], col_df[['sep21']])
identical(col_df[['sep21']], col_df[['oct21']])
identical(col_df[['oct21']], col_df[['nov21']])
identical(col_df[['nov21']], col_df[['dec21']])
identical(col_df[['dec21']], col_df[['jan22']])
identical(col_df[['jan22']], col_df[['feb22']])
identical(col_df[['feb22']], col_df[['mar22']])
identical(col_df[['mar22']], col_df[['apr22']])
identical(col_df[['apr22']], col_df[['may22']])
identical(col_df[['may22']], col_df[['jun22']])
All of the identical lines return the value of TRUE
I'm just trying to verify that this code is telling me all my column names are identical in each CSV files before I move on. I'd also like to know if there is a more efficient way to solve this problem.
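A caution about the check above, assuming col_df was built exactly as shown: data.frame(colnames(jul21)) names its single column after the expression, not "jul21", so col_df[['jul21']] looks up a column that does not exist and returns NULL. Comparing NULL with NULL is always TRUE, whatever the real column names are. The toy frames below are hypothetical and deliberately differ.

```r
# Toy frames with deliberately different column names.
jul21 <- data.frame(a = 1, b = 2)
aug21 <- data.frame(a = 1, c = 2)

jul21_cols <- data.frame(colnames(jul21))
aug21_cols <- data.frame(colnames(aug21))
col_df <- cbind(jul21_cols, aug21_cols)

# The column names are derived from the expressions, not "jul21"/"aug21".
names(col_df)

# Both lookups return NULL, so this is TRUE even though the names differ.
identical(col_df[["jul21"]], col_df[["aug21"]])

# Comparing the name vectors directly gives the intended answer.
identical(colnames(jul21), colnames(aug21))
```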
First, identical() will only return TRUE if the two dataframes have all the same column names in the same order. If you don’t care about order, just that all the same names are in both dataframes, you can sort() the names before comparing as shown below.
Second, you can often use the base::lapply() or purrr::map() families of functions for operations requiring iteration.
For your case, let’s put your dataframes in a list (which they probably should be to begin with), then use sapply() to compare the column names of the first df in the list to the column names of all other dfs.
jul21 <- data.frame(x = 1, y = 2)
aug21 <- data.frame(x = 3, y = 4)
sep21 <- data.frame(y = 6, x = 5)
dfs <- list(jul21,aug21,sep21)
all(sapply(
dfs[-1],
\(x) identical(sort(colnames(x)), sort(colnames(dfs[[1]])))
))
# TRUE
And as another test case, we’ll add a df with a non-matching column.
oct22 <- data.frame(x = 1, y = 2, z = 3)
dfs[[4]] <- oct22
all(sapply(
dfs[-1],
\(x) identical(sort(colnames(x)), sort(colnames(dfs[[1]])))
))
# FALSE
We assume that what is needed is to determine whether the column names are the same and in the same order and, if not, to determine which differ.
First get a character vector, Names, containing the names of the data frames, and from that make a named list L containing the data frames themselves.
Then get a character vector nms whose elements are strings of column names, one for each data frame.
Finally, group the names of the data frames using tapply with nms as the grouping so we can see which data frames contain which columns. In the example below aug21 and jul21 have one set of columns, i.e. Time and demand, and sep21 has a different set, i.e. Time and DEMAND. If the result had only one row, then all data frames would have the same column names in the same order.
Names <- c("jul21", "aug21", "sep21") # using example in Note
L <- mget(Names)[Names]
nms <- sapply(names(L), function(x) toString(names(L[[x]])))
tab <- stack(tapply(names(nms), nms, toString))
names(tab) <- c("data.frames", "column.names")
nrow(tab)
## [1] 2
tab
## data.frames column.names
## 1 jul21, aug21 Time, demand
## 2 sep21 Time, DEMAND
graph
Another approach, which could be used instead of or in conjunction with the one above, is to create a graph such that each vertex is a data frame and each edge means that the two vertices on either end of the edge have the same column names in the same order. Each connected component represents a distinct set or order of column names. From the example below we see that jul21 and aug21 form one connected component and sep21 forms a second connected component.
To investigate how data frame column names differ note that setdiff(names(jul21), names(sep21)) will show names that are in jul21 but not in sep21 and the reverse can be used for the other direction. If the setdiff in both directions are zero length vectors and names vectors are not the same then they differ by order.
library(igraph)
set.seed(123)
isSame <- function(x, y) +identical(names(x), names(y))
A <- outer(L, L, Vectorize(isSame))
diag(A) <- 0
g <- graph_from_adjacency_matrix(A, "undirected")
plot(g, vertex.color = "white", vertex.size = 30)
Note
Test data. BOD comes with R.
jul21 <- aug21 <- sep21 <- BOD
names(sep21) <- c("Time", "DEMAND")

Sweep a for loop output object for duplicates at/near the same time as the output is created

I have a large data frame(df2) consisting of various species with a lot of different types of data associated with each (~4000 rows/species). I have a for loop that already sweeps through and subsets df2 into individual smaller data frames by species by indexing df2 against a discrete table of the species(df1). The code for that is:
for(i in 1:length(df1$x1))
{
assign(paste0("", df1$x1[i]),
df2[df2$x1 == df1$x1[i],])
}
What I want to do is add a component to/modify the existing for loop above that will sweep each of these smaller output data frames (df*) for duplicate values via a specific column. (If that makes sense...)
Here is an example data frame to work on
Date <- (sample(seq.Date(from=(as.Date("2000/01/01")), to=(as.Date("2001/01/01")), by="day", format="%m/%d/%Y"), 25, replace = TRUE))
Alphabet <- (sample(LETTERS[1:26],25, replace = FALSE))
Species <- sample(paste0("Species_", c(1:10)), 25, replace = TRUE)
Numbers <- (sample(1:50,25, replace = TRUE))
df2 <- data.frame("x1" = (Species),
"x2" = (Alphabet),
"x3" = (Numbers),
"x4" = (Date))
Here is the example species table needed for the for loop as well:
df1 <- data.frame(unique(Species))
names(df1)[names(df1) == "unique.Species."] <- "x1"
Thank you for the help, and sorry about any redundant or inefficient code (still pretty new to R)
Addendum:
For the purposes of this example, let's say we are looking for duplicates in the "Date" column (x4) of the sample data frame above.
Effectively, the desired output would be one data frame holding all the data associated with each species (which the for loop already produces) and a sister object that is a vector of any duplicate values found in that output data frame. If no duplicates are found, the vector would be NULL or 0. Ex: for species 1, there would be an output for all data associated with it from the original data, and if there are 3 duplicate dates in species 1, a vector titled "Species_1_duplicates" would be made containing those 3 duplicate values.
# Example desired output
Species_1 # The main/original output from the for loop
Species_1_duplicates <- NULL # New output if no duplicates were found
Species_1_duplicates <- c("01/01/2000", "02/13/2000", "08/08/2000",...) # if these three dates have duplicates in "Species_1"
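One way the desired output could be produced is sketched below, under the assumption that a named list is an acceptable substitute for separately assigned objects. It uses hypothetical toy data with the question's column layout (x1 for species, x4 for date), split() to make one data frame per species, and duplicated() to pick out repeated dates.

```r
# Hypothetical toy data using the question's column layout.
df2 <- data.frame(
  x1 = c("Species_1", "Species_1", "Species_1", "Species_2", "Species_2"),
  x3 = c(10, 20, 30, 40, 50),
  x4 = as.Date(c("2000-01-01", "2000-01-01", "2000-02-13",
                 "2000-03-05", "2000-04-09")),
  stringsAsFactors = FALSE
)

# One data frame per species, held in a named list instead of assign().
by_species <- split(df2, df2$x1)

# For each species, the distinct dates that occur more than once.
# A zero-length Date vector means no duplicates were found.
dup_dates <- lapply(by_species, function(d) unique(d$x4[duplicated(d$x4)]))
names(dup_dates) <- paste0(names(dup_dates), "_duplicates")

dup_dates$Species_1_duplicates
dup_dates$Species_2_duplicates
```

If separate objects are really needed, list2env(by_species, envir = globalenv()) and list2env(dup_dates, envir = globalenv()) would create them from the two lists.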

R: Match values in two data frames like vlookup but for multiple criteria without Key [large data]

I have two large data frames (500k rows) from two separate sources without a key. Instead of being able to merge using a key, I want to merge the two data frames by matching other columns. Such as age and amount. It is not a perfect match between the two data frames so some values will not match, and I will later simply remove these ones.
The data could look something like this.
So, in the example above I want to be able to create a table matching Key 1 and Key 2. In the picture above we see that XXX1 and YYY3 is a match. So from here I would like to create a data frame like:
[Key 1] [Key 2]
XXX1 YYY3
XXX2 N/A
XXX3 N/A
I know how to do this in Excel, but due to the large amount of data it simply crashes. I want to focus on R, but for what it is worth, this is how I built it in Excel (the idea is that we first do a VLOOKUP, and then use INDEX as a VLOOKUP to get the second match if the first one does not match both criteria):
=IF(P2=0;IFNA(VLOOKUP(L2;B:C;2;FALSE);VLOOKUP(L2;G:H;2;FALSE));IF(O2=Q2;INDEX($A$2:$A$378300;SMALL(IF($L2=$B$2:$B378300;ROW($B$2:$B$378300)-ROW($B$2)+1);2));0))
And this is the approach made in R:
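# Start with no match; a key is filled in only when a match is found
df_1$Key1 <- NA
for (i in seq_len(nrow(df_1))) {
  for (j in seq_len(nrow(df_2))) {
    if (df_1$pc_age[i] == df_2$pp_age[j] && (df_1$amount[i] %in% c(df_2$amount1[j], df_2$amount2[j], df_2$amount3[j]))) {
      df_1$Key1[i] = df_2$Key2[j]
      break # keep the first match so later rows cannot overwrite it
    }
  }
}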
The problem is that this takes way, way too long. Is there a more efficient way to map this data as well as possible?
Thanks!
Create dummy key columns in both data frames, for example (shown here for df1):
# one key per row; assigning inside a loop would overwrite the whole column each time
df1$key1 <- paste0("X_", seq_len(nrow(df1)))
Similarly for df2 from Y1....Yn and then join both data frames using "merge" on columns age and amount.
Concatenate Key1 and key2 in a new column in the merged data frame. You will directly get your desired data frame.
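A minimal sketch of the merge step described above, assuming column names along the lines of the question (age, amount, and three amount columns in the second data frame). The toy values and the names key1, key2 and df2_long are hypothetical. Because df2 has three amount columns, they are stacked into one column first so that a single merge() covers all three.

```r
# Hypothetical toy data; column names follow the question's description.
df1 <- data.frame(key1 = c("XXX1", "XXX2", "XXX3"),
                  age = c(30, 41, 52),
                  amount = c(100, 250, 75),
                  stringsAsFactors = FALSE)
df2 <- data.frame(key2 = c("YYY1", "YYY2", "YYY3"),
                  age = c(41, 60, 30),
                  amount1 = c(10, 20, 30),
                  amount2 = c(11, 21, 100),
                  amount3 = c(12, 22, 31),
                  stringsAsFactors = FALSE)

# Stack the three amount columns so each (key2, age, amount) is one row.
df2_long <- data.frame(
  key2 = rep(df2$key2, times = 3),
  age = rep(df2$age, times = 3),
  amount = c(df2$amount1, df2$amount2, df2$amount3),
  stringsAsFactors = FALSE
)

# Left join: every df1 row is kept; rows without a match get NA for key2.
matched <- merge(df1, df2_long, by = c("age", "amount"), all.x = TRUE)
matched <- matched[order(matched$key1), c("key1", "key2")]
matched
```

Note that if several df2 rows match the same df1 row, merge() returns one output row per match, so duplicates may need to be resolved afterwards.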
Could the following code work for you?
# create random data
set.seed(123)
df1 <- data.frame(
key_1=as.factor(paste("xxx",1:100,sep="_")),
age = sample(1:100,100,replace=TRUE),
amount = sample(1:200,100))
df2 <- data.frame(
key_1=paste("yyy",1:500,sep="_"),
age = sample(1:100,500,replace=TRUE),
amount_1 = sample(1:200,500,replace=TRUE),
amount_2 = sample(1:200,500,replace=TRUE),
amount_3 = sample(1:200,500,replace=TRUE))
# ensure at least three matching rows
df2[10,2:3] <- df1[1,2:3]
df2[20,c(2,4)] <- df1[2,2:3]
df2[30,c(2,5)] <- df1[3,2:3]
# define comparison with df2
comp2df2 <- function(x){
ageComp <- df2$age == as.numeric(x[2])
if(!any(ageComp)){
return(NaN)
}
amountComp <- apply(df2,1,function(a) as.numeric(x[3]) %in% as.numeric(a[3:5]))
if(!any(amountComp)){
return(NaN)
}
matchIdx <- ageComp & amountComp
if(sum(matchIdx) > 1){
warning("multiple matches detected, first match is taken\n")
}
return(which(matchIdx)[1])
}
# run match
matchIdx <- apply(df1,1,comp2df2)
# merge
df_new <- cbind(df1[!is.na(matchIdx),],df2[matchIdx[!is.na(matchIdx)],])
I didn't have time to test it on really big data, but I would expect this to be faster than your two for loops.
To speed things up further, you could delete the
if(sum(matchIdx) > 1){
warning("multiple matches detected, first match is taken\n")
}
lines if you are not worried about one row matching several others.

Recoding comma separated entries in R

I've got a data frame (df2) with two variables, Mood and PartOfTown, where Mood is a multi-select question (i.e. any combination of options is allowed) rating a person's happiness, and PartOfTown describes the geographical location.
The problem is that the centres code moods differently, with the centre in the northern part of town using NorthCode and the centre in the southern part using SouthCode (df1).
I'd like all the entries in the data set (df2) to be recoded to SouthCode, so that I end up with a data set like df3. I'd like a general solution, because there might be new entries with new combinations currently not featuring in the data set. Any thoughts on it would be much appreciated.
Centre codes and definitions for moods:
df1 <- data.frame(NorthCode=c(4,5,6,7,99),NorthDef=c("happy","sad","tired","energetic","other"),SouthCode=c(7,8,9,5,99),SouthDef=c("happy","sad","tired","energetic","other"))
Starting point:
df2 <- data.frame(Mood=c("4","5","6","7","4,5","5,6,99","99","7","8","9","5","7,8","8,5,99","99"),Region=c("north","north","north","north","north","north","north","south","south","south","south","south","south","south"))
Desired outcome:
df3 <- data.frame(Mood=c("7","8","9","5","7,8","8,9,99","99","7","8","9","5","7,8","8,5,99","99"),PartofTown=c("north","north","north","north","north","north","north","south","south","south","south","south","south","south"))
Current attempt: I tried to start off by splitting the entries but couldn't get it to work.
unlist(strsplit(df2$Mood, ","))
You were on the right path with strsplit, but you need to add stringsAsFactors = F to data.frame() to make sure that Mood is a character vector, not a factor. (From R 4.0.0 onwards this is the default, so the argument is only needed on older versions.)
After that you can keep the separated elements as a list and match the old codes with the new ones with lapply().
df1 <-
data.frame(NorthCode=c(4,5,6,7,99),
NorthDef=c("happy","sad","tired","energetic","other"),
SouthCode=c(7,8,9,5,99),
SouthDef=c("happy","sad","tired","energetic","other"),
stringsAsFactors = F)
df2 <-
data.frame(Mood=c("4","5","6","7","4,5","5,6,99","99","7","8","9","5","7,8","8,5,99","99"),
Region=c("north","north","north","north","north","north","north","south","south","south","south" ,"south","south","south"),
stringsAsFactors = F)
df3 <-
data.frame(Mood=c("7","8","9","5","7,8","8,9,99","99","7","8","9","5","7,8","8,5,99","99"),
PartofTown=c("north","north","north","north","north","north","north","south","south","south","south" ,"south","south","south"),
stringsAsFactors = F)
# Split the Moods into separate values
splitCodes <- strsplit(df2$Mood, ",")
# Add the Region as the name of each element in the new list
names(splitCodes) <- df2$Region
# Recode the values by matching the north values to the south values
recoded <-
lapply(
seq_along(splitCodes),
function(x){
ifelse(rep(names(splitCodes[x]) == "north", length(splitCodes[[x]])),
df1$SouthCode[match(splitCodes[[x]], df1$NorthCode)],
splitCodes[[x]])
}
)
# Add the recoded values back to df2
df2$recoded <-
sapply(recoded,
paste,
collapse = ",")
# Check if the recoded values match your desired values
identical(df2$recoded, df3$Mood)

Splitting large data frame by column into smaller data frames (not lists) using loops

I have many large data frames. Using one of the smaller ones as an example:
dim(ch29)
476 4283
I need to split it into smaller pieces (i.e. subset into 241 columns at the most). My problems come afterwards when I want to analyze these smaller subsets.
I do not know how to subset the large data frame into smaller data frames and not simply a list.
I also want to do all of this in a loop and give the newly created smaller data frames unique names in the loop.
chunk=241
df<-ch29
n<-ceiling(ncol(df)/chunk)
for (i in 1:n) {
xname <- paste("ch29", i, sep="_")
cat("_", xname)
assign(xname, split(df, rep(1:n, each=chunk, length.out=ncol(df))))
}
I'm not exactly sure what you're trying to do or how you want to choose the columns that go in each data frame, but here's an example of one option:
# Fake data
set.seed(100)
ch29 = as.data.frame(replicate(4283, rnorm(476)))
# Number of columns we want in each split data frame
ncols = floor(ncol(ch29)/20)
# Start column for each split data frame
start = seq(1,ncol(ch29),ncols)
# Split ch29 into a bunch of separate data frames
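# pmin() caps the end column in the name so the last name matches the actual range
df.list = lapply(setNames(start, paste0("ch29_", start, "_", pmin(start+ncols-1, ncol(ch29)))),
function(i) ch29[ , i:min(i+ncols-1,ncol(ch29)), drop = FALSE])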
You now have a list, df.list, where each list element is a data frame with ncols columns from ch29, except for the last element of the list, which will have between 1 and ncols columns. Also, the name of each list element is the name of the parent data frame (ch29) and the column range from which the subset data frame is drawn.
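Try
for (i in seq_len(n)) { # n = ceiling(ncol(df) / chunk), as in the question
  xname = paste("ch29", i, sep = "_")
  col.min = (i - 1) * chunk + 1
  col.max = min(i * chunk, ncol(df))
  # drop = FALSE keeps a data frame even if the last chunk has a single column
  assign(xname, df[, col.min:col.max, drop = FALSE])
}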
In other words, use the notation df[,a:b], where a < b, to get the subset of the dataframe df consisting only of columns a to b.
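As a compact alternative sketch, the same column-wise split can be written with split.default(), which treats a data frame as a list of columns. The toy data frame and the names toy, grp and pieces are hypothetical stand-ins for ch29; list2env() is an optional last step for anyone who wants separate objects rather than a list.

```r
# Toy stand-in for ch29 (hypothetical data): 5 rows, 7 columns named V1..V7.
toy <- as.data.frame(matrix(seq_len(35), nrow = 5))
chunk <- 3  # maximum number of columns per piece

# Group label for every column: 1, 1, 1, 2, 2, 2, 3
grp <- ceiling(seq_along(toy) / chunk)

# split.default() splits by column, so each piece is itself a data frame.
pieces <- split.default(toy, grp)
names(pieces) <- paste0("toy_", names(pieces))

# Optional: copy the pieces into the global environment as toy_1, toy_2, toy_3.
invisible(list2env(pieces, envir = globalenv()))

sapply(pieces, ncol)
```

Keeping the pieces in a list usually makes later analysis easier, since lapply(pieces, some_function) can process every chunk without constructing object names by hand.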
