Collapse and intersect data frames - r

I have two data.frames which have a 3 columns:
1. id - a unique key
target - semicolon separated unique values
source - similar for each of the data frames but different for the two data.frames.
Here's simulated data:
set.seed(1)
df.1 <- data.frame(id=LETTERS[sample(length(LETTERS),10,replace=F)],
target=sapply(1:10,function(x) paste(LETTERS[sample(length(LETTERS),5,replace=F)],collapse=";")),
source="A",stringsAsFactors=F)
df.2 <- data.frame(id=LETTERS[sample(length(LETTERS),5,replace=F)],
target=sapply(1:5,function(x) paste(LETTERS[sample(length(LETTERS),5,replace=F)],collapse=";")),
source="B",stringsAsFactors=F)
I'm looking for a function that will collapse the two data.frames together and will create 3 columns:
1.intersected.targets - semicolon separated unique values that are intersected between the two data.frames
2.source1.targets - targets that are unique to the first data.frame
3.source2.targets - targets that are unique to the second data.frame
So for the example above the resulting data.frame will be:
> res.df
id intersected.targets sourceA.targets sourceB.targets
1 G NA F;E;Q;I;X <NA>
2 J NA M;R;X;I;Y <NA>
3 N NA Y;F;P;C;Z <NA>
4 U NA K;A;J;U;H <NA>
5 E NA M;O;L;E;S <NA>
6 S NA R;T;C;Q;J <NA>
7 W NA V;Q;S;M;L <NA>
8 M NA U;A;L;Q;P <NA>
9 B NA C;H;M;P;I <NA>
10 X NA <NA> G;L;S;B;T
11 H NA <NA> I;U;Z;H;K
12 Y NA <NA> L;R;J;H;Q
13 O NA <NA> F;R;C;Z;D
14 L V M;K;F;B X;J;R;Y

This is a continuation of DavidArenberg's deleted answer that taught me the notion of creating a list column in a data.table. I didn't know how to properly implement my idea of using setdiff row by row but eventually after multiple searches found an answer by Frank that does it. Here is David's (partial) answer:
=====
Here's a possible solution on a different seed that have more than one intersections and more than one letter in a single intersection
#Generating Data
set.seed(123)
df.1 <- data.frame(id=LETTERS[sample(length(LETTERS),10,replace=F)],
target=sapply(1:10,function(x) paste(LETTERS[sample(length(LETTERS),5,
replace=F)],collapse=";")),
source="A",stringsAsFactors=F)
df.2 <- data.frame(id=LETTERS[sample(length(LETTERS),5, replace=F)],
target=sapply(1:5,function(x) paste(LETTERS[sample(length(LETTERS),5,
replace=F)],collapse=";")),
source="B",stringsAsFactors=F)
#Solution
library(data.table)
library(stringi)
res <- dcast(rbind(setDT(df.1), setDT(df.2)), id ~ source, value.var = "target")
res[!is.na(A) & !is.na(B), intersected.targets :=
stri_extract_all(A, regex = gsub(";", "|", B, fixed = TRUE))]
res
==========================
So I used his listifying code to make an A2and B2 column that are the list-version of A and B
res[ , A2 := stri_extract_all(A, regex = "[[:alpha:]]") ]
res[ , B2 := stri_extract_all(B, regex = "[[:alpha:]]") ]
Then used Map() to do a row by row setdiff:
res[, SourceA := Map( setdiff, A2, intersected.targets)]
res[, SourceB := Map( setdiff, B, intersected.targets)]
res
#-------------------------------
id A B intersected.targets A2 B2 SourceA SourceB
1: A M;S;F;H;X NA NULL M,S,F,H,X NA M,S,F,H,X NA
2: C NA T;P;R;A;K NULL NA T,P,R,A,K NA T,P,R,A,K
3: G NA G;Q;K;S;C NULL NA G,Q,K,S,C NA G,Q,K,S,C
4: H Y;L;Q;N;C NA NULL Y,L,Q,N,C NA Y,L,Q,N,C NA
5: J X;R;P;W;O F;J;O;I;C O X,R,P,W,O F,J,O,I,C X,R,P,W F,J,I,C
6: K D;K;J;I;Z NA NULL D,K,J,I,Z NA D,K,J,I,Z NA
7: Q D;F;L;G;S NA NULL D,F,L,G,S NA D,F,L,G,S NA
8: R NA L;U;T;S;J NULL NA L,U,T,S,J NA L,U,T,S,J
9: T X;G;B;H;U NA NULL X,G,B,H,U NA X,G,B,H,U NA
10: U S;N;O;G;D NA NULL S,N,O,G,D NA S,N,O,G,D NA
11: W Z;W;Q;S;A NA NULL Z,W,Q,S,A NA Z,W,Q,S,A NA
12: X B;L;T;C;M NA NULL B,L,T,C,M NA B,L,T,C,M NA
13: Z F;D;S;U;I L;Y;V;U;D D,U F,D,S,U,I L,Y,V,U,D F,S,I L,Y,V
I'm leaving the clean-up as a student exercise.

The pain in the butt in this type of data cleaning, as #42- mentions, is unlisting data frames of lists.
library(dplyr)
library(stringr)
df <- full_join(df.1, df.2) %>%
spread(source, target) %>%
mutate(intersect_targets = str_c(A,B,sep = ";"))
df[,4][!is.na(df[,4])] <- names(do.call("c",lapply(df$intersect_targets, function(x)
which(table(str_split(x, ";"))>1))))
a <- sapply(seq(nrow(df)), function(x) {
str_split(df[x,2:3],";")
})
sa <- do.call("c",lapply(mapply(setdiff,a[1,], a[2,]),paste0, collapse = ","))
sb <- do.call("c",lapply(mapply(setdiff,a[2,], a[1,]), paste0, collapse = ","))
df[,2:3] <-cbind(sa,sb)
head(df)
id A B intersect_targets
1 B C,H,M,P,I NA <NA>
2 E M,O,L,E,S NA <NA>
3 G F,E,Q,I,X NA <NA>
4 H NA I,U,Z,H,K <NA>
5 J M,R,X,I,Y NA <NA>
6 L M,K,F,B X,J,R,Y V

Related

Comparing two columns in R based on matching values in a common column

So, I have two datasets, such that all the columns in one dataset is present in the other, along with some extra columns. What I want to do is to create a new dataset of the differences between the entries in common columns, on the basis of matching a common identifier column, present in both the datasets. How can I do that in R?
If it were a single column, I could have used the sqldf function as
sqldf("select a.v1 - b.v1 from ds1 a left join ds2 b on a.identifier=b.identifier")
But there are 900 common columns between both the datasets.
You can do this by simply joining the frames on identifier, and then subtracting the one frame from the other.
Here is an example of the approach using data.table
# load the library
library(data.table)
# do a left join on identifier
merged = setDT(ds2)[setDT(ds1), on="identifier"]
# get common column names, and remove "identifier" from that vector
cols = intersect(names(ds1), names(ds2))
cols = cols[cols!="identifier"]
# column-bind the identifier column with a substraction of the two sub-frames
cbind(
merged[,.(identifier)],
setnames(merged[,.SD, .SDcols = paste0("i.",cols)] - merged[,.SD, .SDcols = cols],paste0("diff_",cols))
)
Here is the same approach using dplyr:
library(dplyr)
merged = left_join(ds1,ds2, by="identifier")
cols = intersect(names(ds1), names(ds2))
cols = cols[cols!="identifier"]
bind_cols(
select(merged,identifier),
(select(merged, all_of(paste0(cols, ".x"))) - select(merged, all_of(paste0(cols, ".y")))) %>%
rename_with(~paste0("diff_",cols), everything())
)
Output (same under either approach):
identifier diff_v1 diff_v2 diff_v3
<char> <num> <num> <num>
1: O 0.5028498 0.7573174 -1.00630610
2: S -2.5631238 -0.7041228 1.33877932
3: N NA NA NA
4: C NA NA NA
5: J NA NA NA
6: R NA NA NA
7: K NA NA NA
8: E NA NA NA
9: X -0.1830764 0.2924459 -0.01860763
10: Y NA NA NA
11: W NA NA NA
12: T -0.4912840 -2.8126285 -1.33661044
13: I NA NA NA
14: L NA NA NA
15: U NA NA NA
16: M -0.3130889 1.1590316 -0.44551660
17: P NA NA NA
18: H NA NA NA
19: B NA NA NA
20: G -2.2817049 2.4156583 -0.34393988
21: Z NA NA NA
22: A -0.1654816 -0.8807393 -1.08534789
23: F NA NA NA
24: V NA NA NA
25: D 1.4653655 0.2604109 -0.17733840
26: Q NA NA NA
identifier diff_v1 diff_v2 diff_v3
Input:
set.seed(123)
ds1 = data.frame(identifier = sample(LETTERS),v1 = rnorm(26),v2 = rnorm(26),v3 = rnorm(26))
ds2 = data.frame(identifier = sample(LETTERS,8),v1 = rnorm(8),v2 = rnorm(8),v3 = rnorm(8))
Using S1 and S2 as an example (see Note at end) we find the common column names using intersect. In the example data in the Note at the end there is only one numeric common column that is not ID but the same code should work even if there are many.
Then using Filter extract the names of the numeric common columns. We have assumed that if a common column in S1 is numeric then it is also numeric in S2 so we only have to check S1. The Filter line could be omitted if we knew that all common columns were numeric.
Next ensure that the ID column is excluded using setdiff. If ID is non numeric the Filter line would have already removed it in which case we could omit the setdiff line.
Now construct the select clause. sprintf creates a character vector of the elements of the select clause and toString collapses it to a comma separated string giving the final select string. Finally run the SQL statement. Note that fn$sqldf turns on string interpolation in the SQL statement and $sel inserts the contents of the sel variable into the SQL string.
library (sqldf)
nms <- intersect(names(S1), names(S2))
nms <- names(Filter(is.numeric, S1[nms]))
nms <- setdiff(nms, "ID")
sel <- toString(sprintf("a.[%s] - b.[%s] as [%s]", nms, nms, nms))
fn$sqldf("select ID, $sel  
from S1 a  
left join S2 b using(ID)")
## ID extra
## 1 1 0
## 2 2 0
## 3 3 0
## 4 4 0
## 5 5 0
## 6 6 0
Pipe
The nms<- lines above could alternately be written in terms of pipes:
nms <- names(S1) |>
intersect(names(S2)) |>
subset(S1, select = _) |> 
Filter(f = is.numeric) |>
names() |>
setdiff("ID")
Note
The data frame sleep comes with R. S1 and S2 are used as an example.
S1 <- head(sleep)
S2 <- S1[-2]
S1
## extra group ID
## 1 0.7 1 1
## 2 -1.6 1 2
## 3 -0.2 1 3
## 4 -1.2 1 4
## 5 -0.1 1 5
## 6 3.4 1 6
S2
## extra ID
## 1 0.7 1
## 2 -1.6 2
## 3 -0.2 3
## 4 -1.2 4
## 5 -0.1 5
## 6 3.4 6

How to select n random values from each rows of a dataframe in R?

I have a dataframe
df= data.frame(a=c(56,23,15,10),
b=c(43,NA,90.7,30.5),
c=c(12,7,10,2),
d=c(1,2,3,4),
e=c(NA,45,2,NA))
I want to select two random non-NA row values from each row and convert the rest to NA
Required Output- will differ because of randomness
df= data.frame(
a=c(56,NA,15,NA),
b=c(43,NA,NA,NA),
c=c(NA,7,NA,2),
d=c(NA,NA,3,4),
e=c(NA,45,NA,NA))
Code Used
I know to select random non-NA value from specific rows
set.seed(2)
sample(which(!is.na(df[1,])),2)
But no idea how to apply it all dataframe and get the required output
You may write a function to keep n random values in a row.
keep_n_value <- function(x, n) {
x1 <- which(!is.na(x))
x[-sample(x1, n)] <- NA
x
}
Apply the function by row using base R -
set.seed(123)
df[] <- t(apply(df, 1, keep_n_value, 2))
df
# a b c d e
#1 NA NA 12 1 NA
#2 NA NA 7 2 NA
#3 NA 90.7 10 NA NA
#4 NA 30.5 NA 4 NA
Or if you prefer tidyverse -
purrr::pmap_df(df, ~keep_n_value(c(...), 2))
Base R:
You could try column wise apply (sapply) and randomly replace two non-NA values to be NA, like:
as.data.frame(sapply(df, function(x) replace(x, sample(which(!is.na(x)), 2), NA)))
Example Output:
a b c d e
1 56 NA 12 NA NA
2 23 NA NA 2 NA
3 NA NA 10 3 NA
4 NA 30.5 NA NA NA
One option using dplyr and purrr could be:
df %>%
mutate(pmap_dfr(across(everything()), ~ `[<-`(c(...), !seq_along(c(...)) %in% sample(which(!is.na(c(...))), 2), NA)))
a b c d e
1 56 43.0 NA NA NA
2 23 NA 7 NA NA
3 15 NA NA NA 2
4 NA 30.5 2 NA NA

Create a one line data frame with NAs for a list of column names

I have a list with names and I would like to create a data frame with these names as column names and one NA value. I will replace some of the NAs during a loop.
n <- c('a','b')
d <- data.frame(a=NA, b=NA)
So basically I have a vector like n and I would like to automatically create a NA-data frame like d. Is there a handy way of doing so?
There are a few different ways you do this. Here are two -
setNames(do.call(data.frame, rep(list(NA), length(n))), n)
# a b
# 1 NA NA
which is basically generalized for any n
N <- letters[1:6]
setNames(do.call(data.frame, rep(list(NA), length(N))), N)
# a b c d e f
# 1 NA NA NA NA NA NA
A second method uses as.data.frame()
as.data.frame(setNames(rep(list(NA), length(N)), N))
# a b c d e f
# 1 NA NA NA NA NA NA
Or, since your just using NA values, NA[seq_along(N)] can replace rep()
setNames(data.frame(as.list(NA[seq_along(N)])), N)
# a b c d e f
# 1 NA NA NA NA NA NA
Note that all these will produce logical classed columns. For other classes, you can use NA_integer_, NA_character_, etc.

Insert an empty column between every column of a dataframe in R

Say you have a dataframe of four columns:
dat <- data.frame(A = rnorm(5), B = rnorm(5), C = rnorm(5), D = rnorm(5))
And you want to insert an empty column between each of the columns in the dataframe, so that the output is:
A A1 B B1 C C1 D D1
1 1.15660588 NA 0.78350197 NA -0.2098506 NA 2.07495662 NA
2 0.60107853 NA 0.03517539 NA -0.4119263 NA -0.08155673 NA
3 0.99680981 NA -0.83796981 NA 1.2742644 NA 0.67469277 NA
4 0.09940946 NA -0.89804952 NA 0.3419173 NA -0.95347049 NA
5 0.28270734 NA -0.57175554 NA -0.4889045 NA -0.11473839 NA
How would you do this?
The dataframe I would like to do this operation to has hundreds of columns and so obviously I don't want to type out each column and add them naively like this:
dat$A1 <- NA
dat$B1 <- NA
dat$C1 <- NA
dat$D1 <- NA
dat <- dat[, c("A", "A1", "B", "B1", "C", "C1", "D", "D1")]
Thanks for you help in advance!
You can try
res <- data.frame(dat, dat*NA)[order(rep(names(dat),2))]
res
# A A.1 B B.1 C C.1 D D.1
#1 1.15660588 NA 0.78350197 NA -0.2098506 NA 2.07495662 NA
#2 0.60107853 NA 0.03517539 NA -0.4119263 NA -0.08155673 NA
#3 0.99680981 NA -0.83796981 NA 1.2742644 NA 0.67469277 NA
#4 0.09940946 NA -0.89804952 NA 0.3419173 NA -0.95347049 NA
#5 0.28270734 NA -0.57175554 NA -0.4889045 NA -0.11473839 NA
NOTE: I am leaving the . in the column names as it is a trivial task to remove it.
Or another option is
dat[paste0(names(dat),1)] <- NA
dat[order(names(dat))]
you can try this
df <- cbind(dat, dat)
df <- df[, sort(names(df))]
df[, seq(2, 8,by=2)] <- NA
names(df) <- sub("\\.", "", names(df))
# create new data frame with twice the number of columns
bigdat <- data.frame(matrix(ncol = dim(dat)[2]*2, nrow = dim(dat)[1]))
# set sequence of target column indices
inds <- seq(1,dim(bigdat)[2],by=2)
# insert values
bigdat[,inds] <- dat
# set column names
colnames(bigdat)[inds] <- colnames(dat)

merge multiple data.frame by row in R

I would like to merge multiple data.frame in R using row.names, doing a full outer join. For this I was hoping to do the following:
x = as.data.frame(t(data.frame(a=10, b=13, c=14)))
y = as.data.frame(t(data.frame(a=1, b=2)))
z = as.data.frame(t(data.frame(a=3, b=4, c=3, d=11)))
res = Reduce(function(a,b) merge(a,b,by="row.names",all=T), list(x,y,z))
Warning message:
In merge.data.frame(a, b, by = "row.names", all = T) :
column name ‘Row.names’ is duplicated in the result
> res
Row.names Row.names V1.x V1.y V1
1 1 a 10 1 NA
2 2 b 13 2 NA
3 3 c 14 NA NA
4 a <NA> NA NA 3
5 b <NA> NA NA 4
6 c <NA> NA NA 3
7 d <NA> NA NA 11
What I was hoping to get would be:
V1 V2 V3
a 10 1 3
b 13 2 4
c 14 NA 3
d NA NA 11
The following works (up to some final column renaming):
res <- Reduce(function(a,b){
ans <- merge(a,b,by="row.names",all=T)
row.names(ans) <- ans[,"Row.names"]
ans[,!names(ans) %in% "Row.names"]
}, list(x,y,z))
Indeed:
> res
V1.x V1.y V1
a 10 1 3
b 13 2 4
c 14 NA 3
d NA NA 11
What happens with a row join is that a column with the original rownames is added in the answer, which in turn does not contain row names:
> merge(x,y,by="row.names",all=T)
Row.names V1.x V1.y
1 a 10 1
2 b 13 2
3 c 14 NA
This behavior is documented in ?merge (under Value)
If the matching involved row names, an extra character column called
Row.names is added at the left, and in all cases the result has
‘automatic’ row names.
When Reduce tries to merge again, it doesn't find any match unless the names are cleaned up manually.
For continuity, this is not a clean solution but a workaround, I transform the list argument of 'Reduce' using sapply.
Reduce(function(a,b) merge(a,b,by=0,all=T),
sapply(list(x,y,z),rbind))[,-c(1,2)]
x y.x y.y
1 10 1 3
2 13 2 4
3 14 NA 3
4 NA NA 11
Warning message:
In merge.data.frame(a, b, by = 0, all = T) :
column name ‘Row.names’ is duplicated in the result
For some reason I did not have much success with Reduce. given a list of data.frames (df.lst) and a list of suffixes (suff.lst) to change the names of identical columns, this is my solution (it's loop, I know it's ugly for R standards, but it works):
df.merg <- as.data.frame(df.lst[1])
colnames(df.merg)[-1] <- paste(colnames(df.merg)[-1],suff.lst[[1]],sep="")
for (i in 2:length(df.lst)) {
df.i <- as.data.frame(df.lst[i])
colnames(df.i)[-1] <- paste(colnames(df.i)[-1],suff.lst[[i]],sep="")
df.merg <- merge(df.merg, df.i, by.x="",by.y="", all=T)
}

Resources