Merge data with partial match in R

I have two datasets
datf1 <- data.frame (name = c("regular", "kklmin", "notSo", "Jijoh",
"Kish", "Lissp", "Kcn", "CCCa"),
number1 = c(1, 8, 9, 2, 18, 25, 33, 8))
#-----------
name number1
1 regular 1
2 kklmin 8
3 notSo 9
4 Jijoh 2
5 Kish 18
6 Lissp 25
7 Kcn 33
8 CCCa 8
datf2 <- data.frame (name = c("reGulr", "ntSo", "Jijoh", "sean", "LiSsp",
"KcN", "CaPN"),
number2 = c(2, 8, 12, 13, 20, 18, 13))
#-------------
name number2
1 reGulr 2
2 ntSo 8
3 Jijoh 12
4 sean 13
5 LiSsp 20
6 KcN 18
7 CaPN 13
I want to merge them by the name column, but with partial matches allowed (so that spelling errors in a large data set do not block the merge, and so that such errors can be detected). For example:
(1) A match on four consecutive letters at any position is fine (or on all letters, if a name has fewer than four):
ABBCD = BBCDK = aBBCD = ramABBBCD = ABB
(2) Matching is case-insensitive, e.g. ABBCD = aBbCd
(3) The new dataset should preserve both names (from datf1 and datf2), so that we can later tell whether the match is perfect (maybe with a separate column giving how many letters match).
Is such a merge possible?
Edits:
datf1 <- data.frame (name = c("xxregular", "kklmin", "notSo", "Jijoh",
"Kish", "Lissp", "Kcn", "CCCa"),
number1 = c(1, 8, 9, 2, 18, 25, 33, 8))
datf2 <- data.frame (name = c("reGulr", "ntSo", "Jijoh", "sean",
"LiSsp", "KcN", "CaPN"),
number2 = c(2, 8, 12, 13, 20, 18, 13))
uglyMerge(datf1, datf2)
name1 name2 number1 number2 matches
1 xxregular <NA> 1 NA 0
2 kklmin <NA> 8 NA 0
3 notSo <NA> 9 NA 0
4 Jijoh Jijoh 2 12 5
5 Kish <NA> 18 NA 0
6 Lissp LiSsp 25 20 5
7 Kcn KcN 33 18 3
8 CCCa <NA> 8 NA 0
9 <NA> reGulr NA 2 0
10 <NA> ntSo NA 8 0
11 <NA> sean NA 13 0
12 <NA> CaPN NA 13 0

Maybe there is a simple solution, but I can't find one.
IMHO you have to implement this kind of merging yourself.
Please find an ugly example below (there is a lot of room for improvement):
uglyMerge <- function(df1, df2) {
## lower all strings to allow case-insensitive comparison
lowerNames1 <- tolower(df1[, 1]);
lowerNames2 <- tolower(df2[, 1]);
## split strings into single characters
names1 <- strsplit(lowerNames1, "");
names2 <- strsplit(lowerNames2, "");
## create the final dataframe
mergedDf <- data.frame(name1=as.character(df1[,1]), name2=NA,
number1=df1[,2], number2=NA, matches=0,
stringsAsFactors=FALSE);
## store names of dataframe2 (to remember which strings have no match)
toMerge <- df2[, 1];
for (i in seq(along=names1)) {
for (j in seq(along=names2)) {
## set minimal match to 4 or to string length
minMatch <- min(4, length(names2[[j]]));
## find single matches
matches <- names1[[i]] %in% names2[[j]];
## look for consecutive matches
r <- rle(matches);
## any matches found?
if (any(r$values)) {
## find max consecutive match
possibleMatch <- r$values == TRUE;
## pick the longest TRUE run (not just the first one)
maxPos <- which(possibleMatch)[which.max(r$lengths[possibleMatch])];
## store max consecutive match length
maxMatch <- r$lengths[maxPos];
## to remove FALSE-POSITIVES (e.g. CCC and kcn) find
## largest substring
start <- sum(r$length[0:(maxPos-1)]) + 1;
stop <- start + r$length[maxPos] - 1;
maxSubStr <- substr(lowerNames1[i], start, stop);
## all matching criteria fulfilled
isConsecutiveMatch <- maxMatch >= minMatch &&
grepl(pattern=maxSubStr, x=lowerNames2[j], fixed=TRUE) &&
nchar(maxSubStr) > 0;
if (isConsecutiveMatch) {
## merging
mergedDf[i, "matches"] <- maxMatch
mergedDf[i, "name2"] <- as.character(df2[j, 1]);
mergedDf[i, "number2"] <- df2[j, 2];
## don't append this row to mergedDf because already merged
toMerge[j] <- NA;
## stop inner for loop here to avoid possible second match
break;
}
}
}
}
## append not matched rows to mergedDf
toMerge <- which(df2[, 1] == toMerge);
df2 <- data.frame(name1=NA, name2=as.character(df2[toMerge, 1]),
number1=NA, number2=df2[toMerge, 2], matches=0,
stringsAsFactors=FALSE);
mergedDf <- rbind(mergedDf, df2);
return (mergedDf);
}
Output:
> uglyMerge(datf1, datf2)
name1 name2 number1 number2 matches
1 xxregular reGulr 1 2 5
2 kklmin <NA> 8 NA 0
3 notSo <NA> 9 NA 0
4 Jijoh Jijoh 2 12 5
5 Kish <NA> 18 NA 0
6 Lissp LiSsp 25 20 5
7 Kcn KcN 33 18 3
8 CCCa <NA> 8 NA 0
9 <NA> ntSo NA 8 0
10 <NA> sean NA 13 0
11 <NA> CaPN NA 13 0
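The matching rule from the question (at least four consecutive letters in common, case-insensitive, or all letters for shorter names) can also be checked with a small helper that computes the length of the longest common run of characters. This is only a sketch: the names longest_common_run and is_partial_match are made up for illustration, and reading the threshold as min(4, length of the shorter name) is my assumption about the rule.

```r
## Length of the longest run of consecutive characters shared by a and b,
## ignoring case (classic dynamic-programming "longest common substring").
longest_common_run <- function(a, b) {
  a <- strsplit(tolower(a), "")[[1]]
  b <- strsplit(tolower(b), "")[[1]]
  best <- 0
  ## prev[j + 1] = length of the common run ending at a[i - 1] and b[j]
  prev <- integer(length(b) + 1)
  for (i in seq_along(a)) {
    cur <- integer(length(b) + 1)
    for (j in seq_along(b)) {
      if (a[i] == b[j]) {
        cur[j + 1] <- prev[j] + 1L
        best <- max(best, cur[j + 1])
      }
    }
    prev <- cur
  }
  best
}

## Assumed rule: the run must cover 4 letters, or the whole shorter name
is_partial_match <- function(a, b) {
  longest_common_run(a, b) >= min(4, nchar(a), nchar(b))
}

longest_common_run("xxregular", "reGulr")  # 5 ("regul")
is_partial_match("Kcn", "KcN")             # TRUE
is_partial_match("notSo", "ntSo")          # FALSE (longest run is "tso")
```

The value returned by longest_common_run could also fill the "matches" column of the merged data frame.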

agrep will get you started.
Something like:
lapply(tolower(datf1$name), function(x) agrep(x, tolower(datf2$name)))
Then you can adjust the max.distance parameter until you get the appropriate amount of matching, and merge however you like.
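To make the agrep idea a bit more concrete, here is a sketch that collects the candidate matches per name. The value max.distance = 0.2 is just a starting guess to be tuned, not a recommendation, and the variable name hits is made up.

```r
datf1 <- data.frame(name = c("xxregular", "kklmin", "notSo", "Jijoh",
                             "Kish", "Lissp", "Kcn", "CCCa"),
                    number1 = c(1, 8, 9, 2, 18, 25, 33, 8),
                    stringsAsFactors = FALSE)
datf2 <- data.frame(name = c("reGulr", "ntSo", "Jijoh", "sean",
                             "LiSsp", "KcN", "CaPN"),
                    number2 = c(2, 8, 12, 13, 20, 18, 13),
                    stringsAsFactors = FALSE)

## for every name in datf1, the row indices of approximate matches in datf2
hits <- lapply(tolower(datf1$name), function(x) {
  agrep(x, tolower(datf2$name), max.distance = 0.2)
})
names(hits) <- datf1$name

## inspect the candidates; names with no candidate get integer(0)
hits
```

Each element of hits holds row numbers of datf2, so the candidates can be reviewed before deciding which pairs to merge.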

Related

Add empty rows at specific positions of dataframe

I want to add empty rows at specific positions of a dataframe. Let's say we have this dataframe:
df <- data.frame(var1 = c(1,2,3,4,5,6,7,8,9),
var2 = c(9,8,7,6,5,4,3,2,1))
In which I want to add an empty row after rows 1, 3 and 5 (I know that this is not best practice in most cases, ultimately I want to create a table using flextable here). These row numbers are saved in a vector:
rows <- c(1,3,5)
Now I want to use a for loop that loops through the rows vector to add an empty row after each row using add_row():
for (i in rows) {
df <- add_row(df, .after = i)
}
The problem is that while the first iteration works flawlessly, the other empty rows get misplaced, since the data frame gets longer with each insertion. To fix this I tried adding 1 to the vector after each iteration:
for (i in rows) {
df <- add_row(df, .after = i)
rows <- rows+1
}
This does not work either. I assume the rows vector only gets evaluated once, when the loop starts. Does anyone have any ideas?
Do it all at once, no need for looping. Make a sequence of row numbers, add the new rows in, sort, then replace the duplicated row numbers with NA:
s <- sort(c(seq_len(nrow(df)), rows))
out <- df[s,]
out[duplicated(s),] <- NA
# var1 var2
#1 1 9
#1.1 NA NA
#2 2 8
#3 3 7
#3.1 NA NA
#4 4 6
#5 5 5
#5.1 NA NA
#6 6 4
#7 7 3
#8 8 2
#9 9 1
This will be much more efficient than looping or loop-like code, for even moderately sized data:
df <- df[rep(1:9,1e4),]
rows <- seq(1,9e4,100)
system.time({
s <- sort(c(seq_len(nrow(df)), rows))
out <- df[s,]
out[duplicated(s),] <- NA
})
# user system elapsed
# 0.01 0.00 0.02
df <- df[rep(1:9,1e4),]
rows <- seq(1,9e4,100)
system.time({
Reduce(function(x, y) tibble::add_row(x, .after = y), rev(rows), init = df)
})
# user system elapsed
# 26.03 0.00 26.03
df <- df[rep(1:9,1e4),]
rows <- seq(1,9e4,100)
system.time({
for (i in rev(rows)) {
df <- tibble::add_row(df, .after = i)
}
})
# user system elapsed
# 25.05 0.00 25.04
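For reuse, the sort-and-duplicate approach can be wrapped in a small base R function. This is a sketch; the name insert_empty_rows and the decision to reset the row names are my own choices, not part of the answer above.

```r
## Insert an all-NA row after each position listed in `after`
insert_empty_rows <- function(df, after) {
  ## every row index once, plus a second copy of each index in `after`
  s <- sort(c(seq_len(nrow(df)), after))
  out <- df[s, , drop = FALSE]
  ## the second copy of an index becomes the empty row
  out[duplicated(s), ] <- NA
  rownames(out) <- NULL
  out
}

df <- data.frame(var1 = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
                 var2 = c(9, 8, 7, 6, 5, 4, 3, 2, 1))
res <- insert_empty_rows(df, c(1, 3, 5))
which(is.na(res$var1))  # 2 5 8
```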
You could achieve your result by looping in the reverse direction:
df <- data.frame(
var1 = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
var2 = c(9, 8, 7, 6, 5, 4, 3, 2, 1)
)
rows <- c(1, 3, 5)
for (i in rev(rows)) {
df <- tibble::add_row(df, .after = i)
}
df
#> var1 var2
#> 1 1 9
#> 2 NA NA
#> 3 2 8
#> 4 3 7
#> 5 NA NA
#> 6 4 6
#> 7 5 5
#> 8 NA NA
#> 9 6 4
#> 10 7 3
#> 11 8 2
#> 12 9 1
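Why the reverse order works can be seen on a plain vector with base R's append(), which is used here only as a stand-in for add_row(): inserting from the back never shifts the positions that still have to be processed. A forward loop also works if each position is shifted by the number of insertions already made. The variable names below are made up for this sketch.

```r
x <- 1:9
rows <- c(1, 3, 5)

## naive forward loop: later positions are off because x has grown
fwd <- x
for (i in rows) fwd <- append(fwd, NA, after = i)
which(is.na(fwd))      # 2 4 6 (misplaced)

## reverse loop: earlier positions are unaffected by later insertions
bwd <- x
for (i in rev(rows)) bwd <- append(bwd, NA, after = i)
which(is.na(bwd))      # 2 5 8

## forward loop with an offset for the rows already inserted
shifted <- x
for (k in seq_along(rows)) {
  shifted <- append(shifted, NA, after = rows[k] + k - 1)
}
which(is.na(shifted))  # 2 5 8
```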

How to get the sum of rows using a vector and then put the result in a column

I have a data frame and I want to calculate, for every row, the sum of the variables listed in a vector, and store that sum in a new variable. I also want the name of the new variable to be built from the names of the variables in the vector.
For example:
data
Name A_12 B_12 C_12 D_12 E_12
r1 1 5 12 21 15
r2 2 4 7 10 9
r3 5 15 16 9 6
r4 7 8 0 7 18
Let's say I have two vectors:
vector_1 <- c("A_12","B_12","C_12")
vector_2 <- c("B_12","C_12","D_12","E_12")
The result I want is:
New_data >
Name A_12 B_12 C_12 ABC_12 D_12 E_12 BCDE_12
r1 1 5 12 18 21 15 54
r2 2 4 7 13 10 9 32
r3 5 15 16 36 9 6 45
r4 7 8 0 15 7 18 40
I wrote a for loop to get the row sums for a vector, but I didn't get the correct result.
Please tell me if you need any more information or clarification.
Thank you
You can use rowSums and simple column-subsetting:
dat$ABC_12 <- rowSums(dat[,vector_1])
dat$BCDE_12 <- rowSums(dat[,vector_2])
dat
# Name A_12 B_12 C_12 D_12 E_12 ABC_12 BCDE_12
# 1 r1 1 5 12 21 15 18 53
# 2 r2 2 4 7 10 9 13 30
# 3 r3 5 15 16 9 6 36 46
# 4 r4 7 8 0 7 18 15 33
Note that if your frames inherit from data.table, then you'll need to use either subset(dat, select=vector_1) or dat[,..vector_1] instead of simply dat[,vector_1]; if you aren't already using data.table, then you can safely ignore this paragraph.
Like this (using dplyr/tidyverse)
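library(dplyr)
df %>%
rowwise() %>%
mutate(
# all_of() makes it explicit that vector_1 and vector_2 hold column names
ABC_12 = sum(c_across(all_of(vector_1))),
BCDE_12 = sum(c_across(all_of(vector_2)))
)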
Though I'm not sure the sums are correct in your example
-=-=-=EDIT-=-=-=-
Here's a function to help with the naming.
ex_fun <- function(vec, n_len){
paste0(paste(substr(vec,1,n_len), collapse = ""), substr(vec[1],n_len+1,nchar(vec[1])))
}
Which can then be implemented like so.
df %>%
rowwise() %>%
mutate(
!!ex_fun(vector_1, 1) := sum(c_across(all_of(vector_1))),
!!ex_fun(vector_2, 1) := sum(c_across(all_of(vector_2)))
)
-=-= Extra note -=--=
If you put your vectors in a list, you could combine this with r2evans's answer and use a loop if you prefer.
vectors = list(vector_1, vector_2)
for (v in vectors){
df[ex_fun(v, 1)] <- rowSums(df[,v])
}
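As a self-contained check of the naming helper together with the rowSums loop, here is a base R sketch on the sample data from the question. The data frame is rebuilt by hand and new_names is a made-up variable name.

```r
ex_fun <- function(vec, n_len) {
  ## first n_len characters of every name, then the shared suffix of the first
  paste0(paste(substr(vec, 1, n_len), collapse = ""),
         substr(vec[1], n_len + 1, nchar(vec[1])))
}

df <- data.frame(Name = c("r1", "r2", "r3", "r4"),
                 A_12 = c(1, 2, 5, 7), B_12 = c(5, 4, 15, 8),
                 C_12 = c(12, 7, 16, 0), D_12 = c(21, 10, 9, 7),
                 E_12 = c(15, 9, 6, 18), stringsAsFactors = FALSE)
vector_1 <- c("A_12", "B_12", "C_12")
vector_2 <- c("B_12", "C_12", "D_12", "E_12")
vectors <- list(vector_1, vector_2)

new_names <- vapply(vectors, ex_fun, character(1), n_len = 1)
new_names  # "ABC_12" "BCDE_12"

for (v in vectors) {
  df[[ex_fun(v, 1)]] <- rowSums(df[, v])
}
df$ABC_12   # 18 13 36 15
df$BCDE_12  # 53 30 46 33
```

Note that the helper assumes all names in a vector share the suffix of the first one.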
I believe this might work, so long as only the starting characters of the names are different:
library("tidyverse")
#Input dataframe.
data <- data.frame(Name =c("r1", "r2", "r3", "r4"), A_12 = c(1, 2, 5, 7), B_12 = c(5, 4, 15, 8),
C_12 = c(12, 7, 16, 0), D_12 = c(21, 10, 9, 7), E_12 = c(15, 9, 6, 18))
#add all vectors to the "vectors" list. I have added vector_1 and vector_2, but
#there can be as many vectors as needed, they just need to be put in the list.
vector_1 <- c("A_12","B_12","C_12")
vector_2 <- c("B_12","C_12","D_12","E_12")
vector_list<-list(vector_1, vector_2)
vector_sum <- function(data, vector_list){
output <- data |>
dplyr::select(1, all_of(vector_list[[1]]))
for (i in vector_list) {
name1 <- substring(as.character(i), 1,1) |> paste(collapse = '')
name2 <- substring(as.character(i[1]), 2)
input_temp <- dplyr::select(data, all_of(i))
input_temp <- mutate(input_temp, temp=rowSums(input_temp))
names(input_temp)[names(input_temp) == "temp"] <- paste0(name1, name2)
output = cbind(output, input_temp)
}
output[, !duplicated(colnames(output))]
}
vector_sum(data, vector_list)

Returning values after last NA in a vector

I can remove all NA values from a vector
v1 <- c(1,2,3,NA,5,6,NA,7,8,9,10,11,12)
v2 <- na.omit(v1)
v2
but how do I return a vector with values only after the last NA
c( 7,8,9,10,11,12)
Thank you for your help.
You could detect the last NA with which and add 1 to get the index past the last NA and index until the length(v1):
v1[(max(which(is.na(v1)))+1):length(v1)]
[1] 7 8 9 10 11 12
Here’s an alternative solution that does not use indices and uses only vectorised operations:
after_last_na = as.logical(rev(cumprod(rev(! is.na(v1)))))
v1[after_last_na]
The idea is to use cumprod on the reversed vector to flag every position from just after the last NA to the end. It’s not a terribly useful solution in its own right (I urge you to use the more obvious, index range based solution from other answers) but it shows some interesting techniques.
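Broken into steps, the cumprod trick looks like this. The intermediate names not_na and from_end are made up for this sketch.

```r
v1 <- c(1, 2, 3, NA, 5, 6, NA, 7, 8, 9, 10, 11, 12)

## TRUE where a value is present
not_na <- !is.na(v1)

## walk from the end: the running product stays 1 until the first NA
## (seen from the back) is hit, and is 0 from then on
from_end <- cumprod(rev(not_na))

## flip back to the original order and use it as a logical mask
after_last_na <- as.logical(rev(from_end))
result <- v1[after_last_na]
result  # 7 8 9 10 11 12
```

If the vector has no NA, the mask is all TRUE and the whole vector is returned; if the last element is NA, the mask is all FALSE and the result is empty.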
You could detect the last NA with which
v1[(tail(which(is.na(v1)), 1) + 1):length(v1)]
# [1] 7 8 9 10 11 12
However, the most general solution - as @MrFlick pointed out - seems to be this:
tail(v1, -tail(which(is.na(v1)), 1))
# [1] 7 8 9 10 11 12
which also handles the following case correctly:
v1[13] <- NA
tail(v1, -tail(which(is.na(v1)), 1))
# numeric(0)
To handle the case where there is no NA at all, too,
v1 <- 1:13
we can do
if (any(is.na(v1))) tail(v1, -tail(which(is.na(v1)), 1)) else v1
# [1] 1 2 3 4 5 6 7 8 9 10 11 12 13
Data
v1 <- c(1, 2, 3, NA, 5, 6, NA, 7, 8, 9, 10, 11, 12)
v1 <- c(1,2,3,NA,5,6,NA,7,8,9,10,11,12)
v1[seq_along(v1) > max(0, tail(which(is.na(v1)), 1))]
#[1] 7 8 9 10 11 12
v1 = 1:5
v1[seq_along(v1) > max(0, tail(which(is.na(v1)), 1))]
#[1] 1 2 3 4 5
v1 = c(1:5, NA)
v1[seq_along(v1) > max(0, tail(which(is.na(v1)), 1))]
#integer(0)
The following will do what you want.
i <- which(is.na(v1))
if(length(i) == 0){
# no NA at all: return the whole vector
v1
}else if(i[length(i)] < length(v1)){
v1[(i[length(i)] + 1):length(v1)]
}else{
NULL
}
#[1] 7 8 9 10 11 12

How to squeeze in missing values into a vector

Let me try to make this question as general as possible.
Let's say I have two variables a and b.
a <- as.integer(runif(20, min = 0, max = 10))
a <- as.data.frame(a)
b <- as.data.frame(a[c(-7, -11, -15),])
So b has 17 observations and is a subset of a which has 20 observations.
My question is the following: how would I use these two variables to generate a third variable c which, like a, has 20 observations, but in which observations 7, 11 and 15 are missing and the other observations are identical to b, in the order of a?
Or to put it somewhat differently: how could I squeeze these missing observations into variable b at locations 7, 11 and 15?
It seems pretty straightforward (and it probably is), but I have been failing to get this to work for a bit too long now.
1) loop Try this loop:
# test data
set.seed(123) # for reproducibility
a <- as.integer(runif(20, min = 0, max = 10))
a <- as.data.frame(a)
b <- as.data.frame(a[c(-7, -11, -15),])
# lets work with vectors
A <- a[[1]]
B <- b[[1]]
j <- 1
C <- A
for(i in seq_along(A)) if (A[i] == B[j]) j <- j+1 else C[i] <- NA
which gives:
> C
[1] 2 7 4 8 9 0 NA 8 5 4 NA 4 6 5 NA 8 2 0 3 9
2) Reduce Here is a loop-free version:
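f <- function(j, a) j + (a == B[j])
# start the index into B at 1 explicitly; without an init value Reduce
# would use A[1] as the starting index
r <- Reduce(f, A, 1, accumulate = TRUE)
ifelse(duplicated(r)[-1], NA, A)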
giving:
[1] 2 7 4 8 9 0 NA 8 5 4 NA 4 6 5 NA 8 2 0 3 9
3) dtw. Using dtw in the package of the same name we can get a compact loop-free one-liner:
library(dtw)
ifelse(duplicated(dtw(A, B)$index2), NA, A)
giving:
[1] 2 7 4 8 9 0 NA 8 5 4 NA 4 6 5 NA 8 2 0 3 9
REVISED Added additional solutions.
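The loop in 1) can also be wrapped in a function. The sketch below adds a bounds check on the index into B, so that missing observations at the very end of A do not make the comparison fail; the function name squeeze_greedy and the small test vectors are made up for illustration.

```r
## Greedy alignment: walk through A and keep a value only if it equals
## the next not-yet-used value of B; otherwise mark it as missing.
squeeze_greedy <- function(A, B) {
  C <- A
  j <- 1
  for (i in seq_along(A)) {
    if (j <= length(B) && A[i] == B[j]) {
      j <- j + 1
    } else {
      C[i] <- NA
    }
  }
  C
}

A <- c(3, 1, 4, 1, 5, 9, 2, 6)
B <- A[-c(3, 8)]       # drop observations 3 and 8
squeeze_greedy(A, B)   # 3 1 NA 1 5 9 2 NA
```

Like the original loop, this is greedy: when a dropped value equals the value that follows it, the NA ends up one position later.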
Here's a more complicated way of doing it, using the Levenshtein distance algorithm, that does a better job on more complicated examples (it also seemed faster in a couple of larger tests I tried):
# using same data as G. Grothendieck:
set.seed(123) # for reproducibility
a <- as.integer(runif(20, min = 0, max = 10))
a <- as.data.frame(a)
b <- as.data.frame(a[c(-7, -11, -15),])
A = a[[1]]
B = b[[1]]
# compute the transformation between the two, assigning infinite weight to
# insertion and substitution
# using +1 here because the integers fed to intToUtf8 have to be larger than 0
# could also adjust the range more dynamically based on A and B
transf = attr(adist(intToUtf8(A+1), intToUtf8(B+1),
costs = c(Inf,1,Inf), counts = TRUE), 'trafos')
C = A
C[substring(transf, 1:nchar(transf), 1:nchar(transf)) == "D"] <- NA
#[1] 2 7 4 8 9 0 NA 8 5 4 NA 4 6 5 NA 8 2 0 3 9
More complex matching example (where the greedy algorithm would perform poorly):
A = c(1,1,2,2,1,1,1,2,2,2)
B = c(1,1,1,2,2,2)
transf = attr(adist(intToUtf8(A), intToUtf8(B),
costs = c(Inf,1,Inf), counts = TRUE), 'trafos')
C = A
C[substring(transf, 1:nchar(transf), 1:nchar(transf)) == "D"] <- NA
#[1] NA NA NA NA 1 1 1 2 2 2
# the greedy algorithm would return this instead:
#[1] 1 1 NA NA 1 NA NA 2 2 2
Here is the data frame version, which isn't terribly different from G.'s above.
(Assumes a,b setup as above).
j <- 1
c <- a
for (i in (seq_along(a[,1]))) {
if (a[i,1]==b[j,1]) {
j <- j+1
} else
{
c[i,1] <- NA
}
}

Data on one row by ID

I have a data frame with one id column and several other columns grouped in pairs, and I'm trying to put all the data for the same id on one row. The ids do not all appear the same number of times.
My data looks like this :
df <- data.frame(id=sample(1:4, 12, T), vpcc1=1:12, hpcc1=rnorm(12), vpcc2=1:12, hpcc2=rnorm(12), vpcc3=1:12, hpcc3=rnorm(12))
df
## id vpcc1 hpcc1 vpcc2 hpcc2 vpcc3 hpcc3
## 1 1 1 0.04632267 1 -0.37404379 1 0.90711353
## 2 4 2 0.50383152 2 0.06075954 2 0.30690284
## 3 1 3 1.52450117 3 -1.21539925 3 -1.12411614
## 4 1 4 -0.50624871 4 -0.75988364 4 -0.47970608
## 5 3 5 1.64610863 5 0.03445275 5 -0.18895338
## 6 1 6 0.22019099 6 -0.32101883 6 1.29375822
## 7 2 7 -0.10041807 7 -0.17351799 7 -0.03767921
## 8 2 8 0.81683565 8 0.62449158 8 0.50474787
## 9 2 9 -0.46891269 9 1.07743469 9 -0.55539149
## 10 1 10 0.69736549 10 -0.08573679 10 0.28025325
## 11 3 11 0.73354215 11 0.80676315 11 -1.12561358
## 12 2 12 -0.40903143 12 1.94155313 12 0.64231119
For the moment I came up with this:
align2 <- function(df) {
result <- lapply(1:nrow(df), function(j) lapply(1:3, function(i) {x <- df[j, paste0(c("vpcc", "hpcc"), i)]
names(x) <- paste0(c("vpcc", "hpcc"), (i + (j-1)*4))
return(x)}))
result2 <- lapply(result, function(x) do.call(cbind, x))
result3 <- do.call(cbind, result2)
return(result3)
}
testX <- lapply(1:4, function(k) align2(as.data.frame(split(df, f=df$id)[[k]])))
library(plyr)
testX2 <- do.call(rbind.fill, testX)
testX2
## vpcc1 hpcc1 vpcc2 hpcc2 vpcc3 hpcc3 vpcc4 hpcc4 vpcc5 hpcc5 vpcc6 hpcc6 vpcc7 hpcc7 vpcc8 hpcc8 ...
## 1 1 0.04632267 1 -0.37404379 1 0.90711353 3 1.5245012 3 -1.2153992 3 -1.1241161 4 -0.5062487 4 -0.7598836 ...
## 2 7 -0.10041807 7 -0.17351799 7 -0.03767921 8 0.8168356 8 0.6244916 8 0.5047479 9 -0.4689127 9 1.0774347 ...
## 3 5 1.64610863 5 0.03445275 5 -0.18895338 11 0.7335422 11 0.8067632 11 -1.1256136 NA NA NA NA ...
## 4 2 0.50383152 2 0.06075954 2 0.30690284 NA NA NA NA NA NA NA NA NA NA ...
It's a partial solution since it doesn't keep the id.
But I can't imagine there isn't an easier way...
Thank you for any suggestions
PS: maybe there's already a solution on SO but I didn't find it...
In your example the variables vpcc1 vpcc2 etc. are redundant, since they have all the same value. So you can transform the dataset into a more economical structure:
df <- data.frame(id=sample(1:4, 12, T), vpcc=1:12, hpcc1=rnorm(12),
hpcc2=rnorm(12),hpcc3=rnorm(12))
Then use reshape() and you'll have all the values for each id in a single row, with the columns corresponding to the vpcc value, so that "hpcc3.5" means hpcc3 when vpcc is 5.
reshape(df, idvar = "id", direction = "wide", timevar = "vpcc")
EDIT:
If vpccX varies, then maybe this will give you what you need?
df <- data.frame(id=sample(1:4, 12, T), vpcc1=1:12, hpcc1=rnorm(12), vpcc2=1:12,
hpcc2=rnorm(12), vpcc3=1:12, hpcc3=rnorm(12))
df$time = ave(df$id, df$id, FUN = function(x) 1:length(x))
reshape(df, idvar = "id", direction = "wide", timevar = "time")
Of course, you can rename your variables if needed.
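Here is a small deterministic sketch of the second variant, with a within-id counter used as the time variable. The toy data (one vpcc/hpcc pair and five rows) is made up so that the result is easy to check by hand.

```r
df <- data.frame(id    = c(1, 2, 1, 2, 1),
                 vpcc1 = 1:5,
                 hpcc1 = c(0.1, 0.2, 0.3, 0.4, 0.5))

## running counter within each id: 1, 1, 2, 2, 3
df$time <- ave(df$id, df$id, FUN = seq_along)

## one row per id; columns get a ".<time>" suffix
wide <- reshape(df, idvar = "id", direction = "wide", timevar = "time")
wide
```

Here id 1 occurs three times and id 2 only twice, so the ".3" columns of id 2 are filled with NA.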
When you say "same row", is it necessary that the output is like it is in your attempt or would you be happy with something like:
x <- aggregate(df[2:ncol(df)],list(df$id),list)
which allows you to view output on one row as:
x
# Group.1 vpcc1 hpcc1 vpcc2 hpcc2 vpcc3
#1 1 9, 10 1.4651392, 0.8581344 9, 10 -1.621135, 1.391945 9, 10
#2 2 1, 3, 7 2.784998, 1.667367, -1.329005 1, 3, 7 0.2115051, 0.7871399, -0.4835389 1, 3, 7
#3 3 5, 6 -0.5024987, 0.2822224 5, 6 0.155844, 1.336449 5, 6
#4 4 2, 4, 8, 11, 12 -0.48563550, -0.92684024, -0.04016263, -0.41861021, 0.02309864 2, 4, 8, 11, 12 -0.17304058, 0.25428404, -0.49897995, 0.03101927, -0.13529866 2, 4, 8, 11, 12
# hpcc3
#1 -0.05182822, 0.28365514
#2 -0.06189895, -0.83640652, 0.19425789
#3 -0.006440312, 1.378218706
#4 0.09412386, 0.16733125, -1.15198965, -1.00839015, -0.16114475
and reference different values of vpcc and hpcc using list notation:
x$vpcc1
#$`0`
#[1] 9 10
#$`1`
#[1] 1 3 7
#$`2`
#[1] 5 6
#$`3`
#[1] 2 4 8 11 12
x$vpcc1[[1]]
#[1] 9 10
?
