connecting groups of duplicates - r

I have some data with lots of duplication. For example, this data frame shows IDs in the data set that are known to be identical (e.g. row 1 indicates a=b; taken together, the rows indicate that a=b=c and d=e=f):
a <- c('a','a','b','b','c','c','d','d','e','e','f','f')
b <- c('b','c','a','c','a','b','e','f','d','f','d','e')
duplicates <- cbind(a,b)
Is there an easy way to split these into groups of true IDs (e.g. here a, b & c are all the same and d, e & f are also all the same)? So for my sample data:
a <- c('a','b','c','d','e','f')
b <- c('c1','c1','c1','c2','c2','c2')
new_id <- cbind(a,b)
The actual data has thousands of rows and is not fully connected (i.e. in a cluster of duplicates this could occur: a=b, a=c, b≠c), due to some errors in duplicate detection.

Sounds like you are looking at network analysis. There are a few packages that deal with this (network, tidygraph, igraph, DiagrammeR), so you might want to use the one you are most familiar with. I use igraph, because I know it a bit better than the others.
Steps:
First create a graph from the data using the dup data.frame. Next use the clusters() function (or one of the other clustering functions) to find the connected clusters. The last step is to transform the clusters into a data.frame. Additionally you could plot the graph (depending on how much data you have).
library(igraph)
g <- graph_from_data_frame(dup, directed = FALSE)
clust <- clusters(g)
clusters <- data.frame(name = names(clust$membership),
                       cluster = clust$membership,
                       row.names = NULL,
                       stringsAsFactors = FALSE)
clusters
name cluster
1 a 1
2 b 1
3 c 1
4 d 2
5 e 2
6 f 2
# plot graph if needed
plot(g)
data:
a <- c('a','a','b','b','c','c','d','d','e','e','f','f')
b <- c('b','c','a','c','a','b','e','f','d','f','d','e')
dup <- data.frame(a,b, stringsAsFactors = FALSE)
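If you want the exact c1/c2-style labels from the question, a small follow-up on the clusters data frame above (just prefixing the cluster number with "c") would do:
clusters$cluster <- paste0("c", clusters$cluster)
clusters
#   name cluster
# 1    a      c1
# 2    b      c1
# 3    c      c1
# 4    d      c2
# 5    e      c2
# 6    f      c2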

You could work with factors. Since a is a factor here, as.numeric(a) gives the underlying level index (1 for "a" through 6 for "f"), so the first three levels map to "c1" and the rest to "c2":
df.1$id <- with(df.1, ifelse(as.numeric(a) %in% 1:3, "c1", "c2"))
new_id <- unique(df.1[, -2])
rownames(new_id) <- NULL # just in case
Yielding
> new_id
a id
1 a c1
2 b c1
3 c c1
4 d c2
5 e c2
6 f c2
Data
a <- c('a','a','b','b','c','c','d','d','e','e','f','f')
b <- c('b','c','a','c','a','b','e','f','d','f','d','e')
df.1 <- data.frame(a, b)

Related

Sequence of patterns in R sequence and events issues

I am trying to work with frequent sequences in R (SPADE). I have the following data set:
d1 <- c(1:10)
d2 <- c("nut", "bolt", "screw")
data <- data.frame(expand.grid(d1,d2))
data$status <- sample(c("a","b","c"), size = nrow(data), replace = TRUE)
colnames(data) <- c("day", "widget", "status")
day widget status
1 1 nut c
2 2 nut b
3 3 nut b
4 4 nut b
5 5 nut a
6 6 nut a
7 7 nut b
8 8 nut c
9 9 nut c
10 10 nut b
11 1 bolt a
12 2 bolt b
...
I have not been able to get the data into a format that seems to work with the various packages available. I think the basic issue is that most packages want sequences that are tied to an identity and an event. In my case that doesn't exist.
I want to answer the question of:
If on any day the status of widget[bolt] is an "a" and widget[screw] is a "c" and on the next day widget[screw] is "b" then on the 3rd day widget[nut] is likely to be "a".
So there is no identity or transaction/event to use. Am I overcomplicating this issue? Or is there a package that is well suited for this? So far I have tried arulesSequences and TraMineR.
Thank you
Not sure what you want to do. If you would like to use TraMineR, here is how you could input your data assuming the widgets are your sequence ids:
library(TraMineR)
## Transforming into the STS form expected by seqdef()
sts.data <- seqformat(data, from="SPELL", to="STS", id="widget",
                      begin="day", end="day", status="status",
                      limit=10)
## Setting position names and sequence names
names(sts.data) <- paste0("d",rep(1:10))
rownames(sts.data) <- d2
sts.data
# d1 d2 d3 d4 d5 d6 d7 d8 d9 d10
# nut b a b b b a c a a a
# bolt c b a b a c b a c c
# screw a b a a c c b b b c
## Creating the state sequence object
sseq <- seqdef(sts.data)
## Plotting the sequences
seqiplot(sseq, ytlab="id", ncol=3)
The key here is to reshape your dataset based on your objective. You have to make sure each row has all the input information (your criteria/conditions) and the target variable (what you want to find out).
Based on the problem you described:
The input info is "widget[bolt] value on a given day, widget[screw] value on the same day, and widget[screw] value the day after", so you need to make sure each row of your new dataset has this info.
The target info is "3rd day widget[nut] value".
# for reproducibility reasons
set.seed(16)
# example dataset
d1 <- c(1:100)
d2 <- c("nut", "bolt", "screw")
data <- data.frame(expand.grid(d1,d2))
data$status <- sample(c("a","b","c"), size = nrow(data), replace = TRUE)
colnames(data) <- c("day", "widget", "status")
library(tidyverse)
data %>%
  spread(widget, status) %>%                 # reshape data
  mutate(screw_next_1 = lead(screw),         # add screw next day
         nut_next_2 = lead(nut, 2)) %>%      # add nut 2 days after (target variable)
  filter(bolt == "a" & screw == "c" & screw_next_1 == "b")  # get rows that satisfy your criteria
# day nut bolt screw screw_next_1 nut_next_2
# 1 8 c a c b a
# 2 19 c a c b c
# 3 62 c a c b c
# 4 97 c a c b b
With a simple calculation you can say that, based on the data you have, the probability of nut = "a" on the 3rd day, given your criteria, is 1/4.
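For example, storing the filtered result above in an object (res is just a name introduced here for illustration), that proportion can be computed directly:
res <- data %>%
  spread(widget, status) %>%
  mutate(screw_next_1 = lead(screw),
         nut_next_2 = lead(nut, 2)) %>%
  filter(bolt == "a", screw == "c", screw_next_1 == "b")

mean(res$nut_next_2 == "a", na.rm = TRUE)   # 0.25 here, i.e. 1/4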
I think you'll find this type of question is most easily addressed by reshaping your data from long to wide, and then implementing a logical test. For example:
# reshape from long to wide
data2 <- reshape2::dcast(data, day ~ widget)
# get the next row's value for "nut"
data2$next_nut <- dplyr::lead(data2$nut)
# implement your test
data2$bolt == "a" & data2$screw == "c" & data2$next_nut == "a"

randomly ordering across groups (not within group) in data.table

Let's say I want to order the iris dataset (as a data.table) by Species, keeping observations grouped by species and randomly ordering across species.
How do I do that?
I am not talking about generating a random order within groups (species).
My intuition was to write the code below, but it actually creates the within-species random variable. Well, at least it makes the question reproducible:
library(data.table)
library(magrittr)
d <- iris %>% data.table
set.seed(12345)
d[, g := runif(.N), Species]
You may do a binary search in i. A smaller example:
d <- data.table(Species = rep(letters[1:4], each = 2), ri = 1:8)
set.seed(1)
d[.(sample(unique(Species))), on = "Species"]
# Species ri
# 1: b 3
# 2: b 4
# 3: d 7
# 4: d 8
# 5: c 5
# 6: c 6
# 7: a 1
# 8: a 2
Alternatively you could do:
e <- d[, .N, Species]
e[, g2 := runif(.N)]
d <- e[, .(Species, g2)][d, on = 'Species']
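Applied to the iris data.table from the question, the same join idea might look like this (a sketch; d_iris is just a name used here):
library(data.table)
d_iris <- as.data.table(iris)
set.seed(12345)
# join on a randomly permuted vector of the species values: rows stay grouped
# by Species, but the species blocks come out in random order
d_iris <- d_iris[.(sample(unique(d_iris$Species))), on = "Species"]
d_iris[, as.character(unique(Species))]   # the new random block order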
We can randomly sample from a series 1...N where N is the # of levels of the factor (Species) in question.
We then map the new order to a column and sort by it. Broken apart into steps for illustration it looks like this:
library(dplyr)  # for sample_n()
tmp <- sample_n(as.data.frame(seq(1, length(unique(d$Species)))), 3)[, 1]
d$index <- tmp[as.numeric(d$Species)]
d <- d[order(d$index),]
You could compact this into 1 line/step:
d <- d[order(sample_n(as.data.frame(seq(1,length(unique(d$Species)))),3)[,1][as.numeric(d$Species)]),]

Combine Membership and csize of clusters in R

My Initial Data:
library(igraph)
From <- c(1,2,3,4,5,6,7,8)
To <- c(NA,1,2,3,2,NA,6,7)
Value<- c(1,0,0.5,0.5,0,-1,-1,-0.5)
Data <- data.frame(From,To, Value)
Network <- graph.data.frame(Data[,c("From","To")])
Network<- Network - "NA"
plot(Network)
I want to know the size of the cluster each node belongs to. I want to combine the two outputs clusters()$membership and clusters()$csize, but I have no idea how. I want the size of the corresponding cluster on each row.
Clusterx<-clusters(Network)$membership
ClusterSize<-clusters(Network)$csize
Example of possible final Data:
From <- c(1,2,3,4,5,6,7,8)
To <- c(NA,1,2,3,2,NA,6,7)
Value<- c(1,0,0.5,0.5,0,-1,-1,-0.5)
Csize<- c(5,5,5,5,5,3,3,3)
Data <- data.frame(From,To, Value,Csize)
This is a simple indexing operation.
clu <- clusters(Network)
clu$csize[ clu$membership ]
# [1] 5 5 5 5 5 3 3 3
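To attach these sizes to the original Data as in the desired output, one option (a sketch; it assumes the vertex names are the From values as characters, which is how graph.data.frame names them here) is:
clu <- clusters(Network)
Data$Csize <- clu$csize[clu$membership[as.character(Data$From)]]
# Data$Csize is then 5 5 5 5 5 3 3 3, matching the example above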

Transferring factor properties between two data frames

I've built a predictive model that uses a large number (30 or so) of independent factor variables. As the dataset I'm using is much larger than the RAM of my machine, I have sampled it for both my training and test sets.
I am now looking to use the model to make predictions over the entire dataset. I'm pulling in the dataset 1 million rows at a time, and each time, I find new levels for some of my factor variables that were not in my training and test set, therefore preventing the model from making predictions.
As there are so many independent factor variables (and so many overall observations), correcting each case by hand is becoming a real pain.
One additional wrinkle to be aware of: there is no guarantee that the order of variables in the overall dataframe and the training/test sets are the same, as I do pre-processing on the data that changes their order.
As such, I'd like to write a function that:
1. Selects and sorts the columns of the new data based on the configuration of my sampled dataframe.
2. Loops through the sampled and new dataframe and designates all factor levels in the new dataframe that do not exist in their corresponding column in the sample dataframe as Other.
3. If a factor level exists in my sample but not the new dataframe, create the level (with no observations assigned to it) in its corresponding column in the new dataframe.
I've got #1 together, but don't know the best way to do #2 and #3. If it were any other language, I'd use for loops, but I know that's frowned upon in R.
Here's a reproducible example:
sampleData <- data.frame(abacus=factor(c("a","b","a","a","a")), montreal=factor(c("f","f","f","f","a")), boston=factor(c("z","y","z","z","q")))
dataset <- data.frame(florida=factor(c("e","q","z","d","b", "a")), montreal=factor(c("f","f","f","f","a", "a")), boston=factor(c("m","y","z","z","r", "f")), abacus=factor(c("a","b","z","a","a", "g")))
sampleData
abacus montreal boston
1 a f z
2 b f y
3 a f z
4 a f z
5 a a q
dataset
florida montreal boston abacus
1 e f m a
2 q f y b
3 z f z z
4 d f z a
5 b a r a
6 a a f g
sampleData <- sampleData[, order(names(sampleData))]
dataset <- dataset[, order(names(dataset))]
dataset <- dataset[, colnames(sampleData)]
Below is what I would want dataset to look like once this function is complete (I don't really care about the final ordering of the columns in dataset; I'm just thinking it's necessary for the loop, or whatever you guys deem best, to work). Notice that the column dataset$florida is omitted:
dataset
montreal boston abacus
1 f Other a
2 f y b
3 f z Other
4 f z a
5 a Other a
6 a Other Other
Also note that in dataset, the 'q' level for boston does not appear, although it does appear in sampleData. Therefore, the levels will differ if we omit 'q' from the factor in dataset, meaning that in 'dataset', we need boston to include the level q, but to have no actual observations assigned to it.
Last, note that as I'm doing this on 30 variables at a time, I need a programmatic solution and not one that reassigns factors by using explicit column names.
This seems like it might work.
From this function, the new levels returned for the boston column are Other y z q, even though there are no values for the level q. Regarding your comment in the original question, the only way I've found to effectively apply new factor levels is also with a for loop like yours, and it's worked well for me so far.
A function, findOthers() :
findOthers <- function(newData)  ## might want a second argument for sampleData
{
  ## take only those columns that are in 'sampleData'
  dset <- newData[, names(sampleData)]
  ## change the 'dset' columns to character
  dsetvals <- sapply(dset, as.character)
  ## change the 'sampleData' levels to character
  samplevs <- sapply(sampleData, function(y) as.character(levels(y)))
  ## find the unmatched elements
  others <- sapply(seq(ncol(dset)), function(i){
    !(dsetvals[, i] %in% samplevs[[i]])
  })
  ## change the unmatched elements to 'Other'
  dsetvals[others] <- "Other"
  ## create new data frame
  newDset <- data.frame(dsetvals)
  ## get the new levels for each column
  newLevs <- lapply(seq(newDset), function(i){
    Get <- c(as.character(newDset[[i]]), as.character(samplevs[[i]]))
    ul <- unique(unlist(Get))
  })
  ## set the new levels for each column
  for(i in seq(newDset)) newDset[, i] <- factor(newDset[, i], newLevs[[i]])
  ## result
  newDset
}
Your sample data :
sampleData <- data.frame(abacus=factor(c("a","b","a","a","a")),
                         montreal=factor(c("f","f","f","f","a")),
                         boston=factor(c("z","y","z","z","q")))
dataset <- data.frame(florida=factor(c("e","q","z","d","b", "a")),
                      montreal=factor(c("f","f","f","f","a", "a")),
                      boston=factor(c("m","y","z","z","r", "f")),
                      abacus=factor(c("a","b","z","a","a", "g")))
Call findOthers() and view the result with the new factor levels :
(new <- findOthers(newData = dataset))
# abacus montreal boston
# 1 a f Other
# 2 b f y
# 3 Other f z
# 4 a f z
# 5 a a Other
# 6 Other a Other
as.list(new)
# $abacus
# [1] a b Other a a Other
# Levels: a b Other
#
# $montreal
# [1] f f f f a a
# Levels: f a
#
# $boston
# [1] Other y z z Other Other
# Levels: Other y z q ## note the new level 'q', with no value in the column
To answer just the question you ask (rather than suggest what you might do instead): here we have to make each column character, do the replacement, then re-factorise.
sampleData[] <- lapply(sampleData, as.character)
sampleData[] <- lapply(sampleData, function(x) gsub("q", "other", x))
sampleData[] <- lapply(sampleData, as.factor)
This depends on "q" only inhabiting one column. Otherwise you just have to edit each column separately to get only the changes you want:
sampleData[] <- lapply(sampleData, as.character)
sampleData$boston <- gsub("q", "other", sampleData$boston)
sampleData[] <- lapply(sampleData, as.factor)
However, I think you should just filter these rows out of the train and test data, as they are so few they will make absolutely no difference to your model. Otherwise you're making it difficult for yourself.
summary(dataset)
dataset <- dataset[dataset$abacus!="z", ]
If the dataset is very, very large, you may want to do this with something like the dplyr package and its filter() function.
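A minimal sketch of that, dropping the same level as in the base R line above:
library(dplyr)
dataset <- dataset %>% filter(abacus != "z")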
Does this accomplish what you want?
# Select and sort the columns of dataset as in sampleData
sampleData <- sampleData[, order(names(sampleData))]
dataset <- dataset[, colnames(sampleData)]
f <- function(dataset, sampleData, col) {
  # For a given column col, assign "Other" to all factor levels
  # in dataset[col] that do not exist in sampleData[col].
  # If a factor level exists in sampleData[col] but not in dataset[col],
  # preserve it as a factor level.
  v <- factor(dataset[, col], levels = c(levels(sampleData[, col]), "Other"))
  v[is.na(v)] <- "Other"
  v
}
# Apply f to all columns of dataset
l <- lapply(colnames(dataset), function(x) f(dataset, sampleData, x))
res <- data.frame(l) # Format into a data frame
colnames(res) <- colnames(dataset) # Assign the names of dataset
dataset <- res # Assign the result to dataset
You can test as follows
> dataset[, "boston"]
[1] Other y z z Other Other
Levels: q y z Other
> dataset[, "montreal"]
[1] f f f f a a
Levels: a f Other
> dataset[, "abacus"]
[1] a b Other a a Other
Levels: a b Other

Improving performance of updating contents of large data frame using contents of similar data frame

I'm looking for a general solution for updating one large data frame with the contents of a second similar data frame. I have dozens of datasets, each with thousands of rows and upwards of 10,000 columns. An "update" dataset will overlap its corresponding "base" dataset by anywhere from a few percent to perhaps 50 percent, rowwise. The datasets have a "key" column and there will be only one row per each unique key value in any given dataset.
The basic rule is: if a non-NA value exists in the update dataset for a given cell, replace the same cell in the base dataset with that value. (The "same cell" means same value of the "key" column and colname.)
Note the update dataset will likely contain new rows ("inserts") which I can handle with an rbind.
So given the base data frame "df1", where column "K" is the unique key column, and "P1" .. "P3" represent the 10,000 columns, whose names will vary from one pair of datasets to the next:
K P1 P2 P3
1 A 1 1 1
2 B 1 1 1
3 C 1 1 1
...and the update data frame "df2":
K P1 P2 P3
1 B 2 NA 2
2 C NA 2 2
3 D 2 2 2
The result I need is as follows, where the 1's for "B" and "C" were overwritten by the 2's but not overwritten by the NA's:
K P1 P2 P3
1 A 1 1 1
2 B 2 1 2
3 C 1 2 2
4 D 2 2 2
This doesn't seem to be a merge candidate as merge gives me either duplicate rows (with respect to the "key" column) or duplicate columns (e.g. P1.x, P1.y), which I have to iterate over to collapse somehow.
I have tried pre-allocating a matrix with the dimensions of the final rows/columns, and populating it with the contents of df1, then iterating over the overlapping rows of df2, but I cannot get better than 20 cells per second performance, requiring hours to complete (compared to minutes for the equivalent DATA step UPDATE functionality in SAS).
I'm sure I'm missing something, but can't find a comparable example.
I see ddply usage that looks close, but not a general solution. The data.table package didn't seem to help as it's not obvious to me that this is a join problem, at least not generally over so many columns.
Also a solution that focuses only on the intersecting rows is adequate as I can identify the others and rbind them in.
Here is some code to fabricate the data frames above:
cat("K,P1,P2,P3", "A,1,1,1", "B,1,1,1", "C,1,1,1", file="f1.dat", sep="\n");
cat("K,P1,P2,P3", "B,2,,2", "C,,2,2", "D,2,2,2", file="f2.dat", sep="\n");
df1 <- read.table("f1.dat", sep=",", header=TRUE, stringsAsFactors=FALSE);
df2 <- read.table("f2.dat", sep=",", header=TRUE, stringsAsFactors=FALSE);
Thanks
This loops by column, setting dt1 by reference and (hopefully) should be quick.
library(data.table)
dt1 = as.data.table(df1)
dt2 = as.data.table(df2)
if (!identical(names(dt1), names(dt2)))
  stop("Assumed for now. Can relax later if needed.")
w = chmatch(dt2$K, dt1$K)
for (i in 2:ncol(dt2)) {
  nna = !is.na(dt2[[i]])
  set(dt1, w[nna], i, dt2[[i]][nna])
}
dt1 = rbind(dt1, dt2[is.na(w)])
dt1
K P1 P2 P3
[1,] A 1 1 1
[2,] B 2 1 2
[3,] C 1 2 2
[4,] D 2 2 2
This is likely not the fastest solution but is done entirely in base.
(updated answer per Tommy's comments)
#READING IN YOUR DATA FRAMES
df1 <- read.table(text=" K P1 P2 P3
1 A 1 1 1
2 B 1 1 1
3 C 1 1 1", header=TRUE)
df2 <- read.table(text=" K P1 P2 P3
1 B 2 NA 2
2 C NA 2 2
3 D 2 2 2", header=TRUE)
all <- c(levels(df1$K), levels(df2$K)) #all cells of key column
dups <- all[duplicated(all)] #the overlapping key cells
ndups <- all[!all %in% dups] #unique key cells
df3 <- rbind(df1[df1$K%in%ndups, ], df2[df2$K%in%ndups, ]) #bind the unique rows
decider <- function(x, y) ifelse(is.na(x), y, x) #function replaces NAs if existing
df4 <- data.frame(mapply(df2[df2$K%in%dups, ], df1[df1$K%in%dups, ],
                         FUN = decider)) #replace all NAs of df2 with df1 values if they exist
df5 <- rbind(df3, df4) #bind unique rows of df1 and df2 with NA replaced df4
df5 <- df5[order(df5$K), ] #reorder based on key column
rownames(df5) <- 1:nrow(df5) #give proper non duplicated rownames
df5
This yields:
K P1 P2 P3
1 A 1 1 1
2 B 2 1 2
3 C 1 2 2
4 D 2 2 2
Upon closer reading, not all columns have the same name, but I am assuming the same order. This may be a more helpful approach:
all <- c(levels(df1$K), levels(df2$K))
dups <- all[duplicated(all)]
ndups <- all[!all %in% dups]
LS <- list(df1, df2)
LS2 <- lapply(seq_along(LS), function(i) {
  colnames(LS[[i]]) <- colnames(LS[[2]])
  return(LS[[i]])
})
LS3 <- lapply(seq_along(LS2), function(i) LS2[[i]][LS2[[i]]$K%in%ndups, ])
LS4 <- lapply(seq_along(LS2), function(i) LS2[[i]][LS2[[i]]$K%in%dups, ])
decider <- function(x, y) ifelse(is.na(x), y, x)
DF <- data.frame(mapply(LS4[[2]], LS4[[1]], FUN = decider))
DF$K <- LS4[[1]]$K
LS3[[3]] <- DF
df5 <- do.call("rbind", LS3)
df5 <- df5[order(df5$K), ]
rownames(df5) <- 1:nrow(df5)
df5
EDIT : Please ignore this answer. Bad idea to loop by row. It works but is very slow. Left for posterity! See my 2nd attempt as separate answer.
require(data.table)
dt1 = as.data.table(df1)
dt2 = as.data.table(df2)
K = dt2[[1]]
for (i in 1:nrow(dt2)) {
  k = K[i]
  p = unlist(dt2[i, -1, with=FALSE])
  p = p[!is.na(p)]
  dt1[J(k), names(p) := as.list(p), with=FALSE]
}
Or, can you use a matrix instead of a data.frame? If so it could be a single line using A[B] syntax, where B is a 2-column matrix containing the row and column numbers to update.
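For what that could look like, here is a rough sketch (not quite a single line, and assuming df1/df2 from the question; new keys in df2 would still need to be appended separately):
m1 <- as.matrix(df1[-1]); rownames(m1) <- df1$K   # base data as a matrix keyed by K
m2 <- as.matrix(df2[-1]); rownames(m2) <- df2$K   # update data as a matrix keyed by K
common <- intersect(rownames(m2), rownames(m1))   # keys present in both
idx <- which(!is.na(m2[common, , drop = FALSE]), arr.ind = TRUE)              # positions of non-NA updates
B <- cbind(match(common[idx[, "row"]], rownames(m1)), idx[, "col"])           # the same cells in m1
m1[B] <- m2[common, , drop = FALSE][idx]          # single assignment via A[B] indexing
m1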
The following gives the correct answer for the small example data, tries to minimize the number of "copies" of tables, and uses the new fread and (new?) rbindlist. Does it work with your larger actual data set? I didn't quite follow all the comments in the original post about the memory issues you had when trying to flatten/normalize/stack, so apologies if you've already tried this route.
library(data.table)
library(reshape2)
cat("K,P1,P2,P3", "A,1,1,1", "B,1,1,1", "C,1,1,1", file="f1.dat", sep="\n")
cat("K,P1,P2,P3", "B,2,,2", "C,,2,2", "D,2,2,2", file="f2.dat", sep="\n")
# read each file, melt to long/stacked format (removing NAs from the update), and key by K + variable
dt1s <- data.table(melt(fread("f1.dat"), id.vars="K"), key=c("K","variable"))
dt2s <- data.table(melt(fread("f2.dat"), id.vars="K", na.rm=TRUE), key=c("K","variable"))
setnames(dt2s, "value", "value.new")
# update existing cells with the new values
dt1s[dt2s, value := value.new]
# use rbindlist to insert the new records, then reshape back to wide
dtout <- reshape(rbindlist(list(dt1s, dt1s[dt2s][is.na(value), list(K, variable, value=value.new)])),
                 direction="wide", idvar="K", timevar="variable")
setkey(dtout,K)
setnames(dtout,colnames(dtout),sub("value.", "", colnames(dtout))) # Clean up the column names
