Finding rows with a unique combination of values (R)

This is a bit more complicated than the title lets on, and I'm sure that if I could describe it better, I could Google it better.
I have data that looks like this:
SET ID
100301006 1287025
100301006 1287026
100301010 1287027
100301013 1287030
100301011 1287027
and I would like to identify and select those rows where both values in a row are unique within their respective columns. In the example above, I want to grab only the row:
100301013 1287030
I don't want SET 100301006, since it matches 2 different records in the ID field (1287025 and 1287026). Similarly, I don't want SET 100301010, since the ID record it matches (1287027) can also match another SET (100301011).
In some cases there could be more than 2 matches.
I could do this in loops, but that seems like a hack. I'd love a base R or data.table solution, but I'm not so interested in dplyr (trying to minimize dependencies).

We can use duplicated on each column independently to create a list of logical vectors, Reduce it to a single vector with &, and use that to subset the rows of the dataset:
df1[Reduce(`&`, lapply(df1, function(x)
!(duplicated(x)|duplicated(x, fromLast = TRUE)))),]
# SET ID
#4 100301013 1287030
Or, as @chinsoon12 suggested:
m1 <- sapply(df1, function(x) !(duplicated(x)| duplicated(x, fromLast = TRUE)))
df1[rowSums(m1) == ncol(m1),, drop = FALSE]
data
df1 <- structure(list(SET = c(100301006L, 100301006L, 100301010L, 100301013L,
100301011L), ID = c(1287025L, 1287026L, 1287027L, 1287030L, 1287027L
)), class = "data.frame", row.names = c(NA, -5L))

Here's a quick base-R hack:
df <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
SET ID
100301006 1287025
100301006 1287026
100301010 1287027
100301013 1287030
100301011 1287027")
counts <- sapply(df, function(x) { tb <- table(x); tb[ match(x, names(tb)) ]; })
counts
# SET ID
# 100301006 2 1
# 100301006 2 1
# 100301010 1 2
# 100301013 1 1
# 100301011 1 2
At this point, we have the number of times each element is found in its column ... so we want rows where all counts are 1.
df[ rowSums(counts == 1) == ncol(df), ]
# SET ID
# 4 100301013 1287030

You could use data.table to select only groups with 1 row, grouping by ID first, then by SET. This is similar to @r2evans' method of checking that the counts for ID and SET are both 1.
library(data.table)
setDT(df)
df[, if(.N == 1) .SD, ID][, if(.N == 1) .SD, SET]
# SET ID
# 1: 100301013 1287030
Or for more than 2 columns
Reduce(function(x, y) x[, if(.N == 1) .SD, y], names(df), init = df)
# ID SET
# 1: 1287030 100301013

With base R, you can use ave():
r <- df[which(with(df, ave(seq(nrow(df)), SET, FUN = length) *
                       ave(seq(nrow(df)), ID, FUN = length)) == 1), ]
> r
SET ID
4 100301013 1287030
DATA
df <- read.table(text="SET ID
100301006 1287025
100301006 1287026
100301010 1287027
100301013 1287030
100301011 1287027",header = T)

If we have a data frame df and want the unique combinations of the columns column1, column2, and column3:
library(dplyr)
df <- df %>% group_by(column1,column2,column3) %>% summarise()
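A shorter equivalent, if dplyr is already in use, is distinct(); a minimal sketch with the same assumed column names:
library(dplyr)
# one row per unique (column1, column2, column3) combination
df %>% distinct(column1, column2, column3)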

Related

In R: How to subset a large dataframe by top 5 longest runs of frequent values in 1 column?

I have a dataframe with 1 column. The values in this column can ONLY be "good" or "bad". I would like to find the top 5 largest runs of "bad".
I am able to use the rle(df) function to get the running length of all the "good" and "bad".
How do I find the 5 largest runs consisting ONLY of "bad"?
How do I get the starting and ending indices of the top 5 largest runs for ONLY "bad"?
Your assistance is much appreciated!
One option would be rleid. Convert the data.frame to a data.table (setDT(df1)) and create a grouping column with rleid (which generates a unique id for each run of adjacent matching elements). Then add the number of elements per group (n) and the row number (rn) as columns, subset the rows where goodbad is "bad", order by n in decreasing order and, grouped by grp, summarise the first and last row numbers as well as the goodbad entry.
library(data.table)
setDT(df1)[, grp := rleid(goodbad)][, n := .N, grp][, rn := .I][
  goodbad == 'bad'][order(-n),
    .(goodbad = first(goodbad), n = n, start = rn[1], last = rn[.N]),
    .(grp)][n %in% head(unique(n), 5)][, grp := NULL][]
Or we can use rle and other base R methods
rl <- rle(df1$goodbad)
grp <- with(rl, rep(seq_along(values), lengths))
df2 <- transform(df1, grp = grp, n = rep(rl$lengths, rl$lengths),
rn = seq_len(nrow(df1)))
df3 <- subset(df2, goodbad == 'bad')
do.call(data.frame, aggregate(rn ~ grp, subset(df3[order(-df3$n),],
n %in% head(unique(n), 5)), range))
data
set.seed(24)
df1 <- data.frame(goodbad = sample(c("good", "bad"), 100,
replace = TRUE), stringsAsFactors = FALSE)
The sort(...) function arranges values in increasing or decreasing order. The default is increasing, but you can set decreasing = TRUE. See ?sort for more info.
The which(...) function returns the INDEX of values that meet a logical criterion. The code below sorts the times column (assuming your data has one) of rows where the goodbad value is "good":
sort(your.df$times[which(your.df$goodbad == "good")])
If you wanted to get the top 5 you could do this:
top5_good <- sort(your.df$times[which(your.df$goodbad == "good")], decreasing = TRUE)[1:5]
top5_bad <- sort(your.df$times[which(your.df$goodbad == "bad")], decreasing = TRUE)[1:5]

Matching columns in 2 data frames when numbers don't exactly match

How do I match two different data frames when the values I am comparing are not exactly the same?
I was thinking of using merge() but I am not sure.
Table1:
ID Value.1
10001 x
18273-9 y
12824/5/6/7 z
10283/5/9 d
Table2:
ID Value.2
10001 a
18274 b
12826 c
10289 u
How do I merge Table 1 and 2 based on ID?
Which specific function of fuzzyjoin package would I use, especially with the "/" & "-" cases? How do I expand the "-" case from 18273-9 so that R will register 18273 / 18274 / 18275 / ...?
You can write a function to extract the corresponding sequences from the strings containing "/" or "-" and recombine them into a new data.frame as follows:
df1 <- data.frame(ID=c("10001","18273-9","15273-8", "15170-4", "12824/5/6/7","10283/5/9"),
value=c("a","c","c", "d","k", "l"), stringsAsFactors = F)
df2 <- data.frame(ID=c("10001","18274","12826","10289"),
value=c("o","p","q","r"), stringsAsFactors = F)
doIt <- function(df){
listAsDF <- function(l) {
x <- stack(setNames(l, temp$value))
names(x) <- c("ID", "value")
return(x)
}
Base <- df[!grepl("\\/", df$ID) & !grepl("\\-", df$ID), ]
#1 cases when - present
temp <- df[grep("\\-", df$ID),]
# expand e.g. "18273-9" to 18273:18279 (seq() needs numeric endpoints)
temp <- listAsDF(lapply(strsplit(temp$ID, "-"), function(e)
  seq(as.numeric(e[1]), as.numeric(paste0(strtrim(e[1], nchar(e[1]) - 1), e[2])), 1)))
Base <- rbind(Base, temp)
#2 cases when / present
temp <- df[grep("\\/", df$ID),]
temp <- listAsDF(lapply(strsplit(temp$ID, "/"), function(a) c(a[1], paste0(strtrim(a[1], nchar(a[1])-1), a[-1]))))
Base <- rbind(Base, temp)
return(Base)
}
Then you can merge df1 with df2:
merge(doIt(df1), df2, by = "ID", all.x = T)
Hope this helps!
You could use the fuzzy string matching function "agrep" from base R.
df1 <- data.frame(ID=c("10001","18273-9","12824/5/6/7","10283/5/9"),
value=c("a","c","d","k"))
df2 <- data.frame(ID=c("10001","18274","12826","10289"),
value=c("o","p","q","r"))
apply(df1, 1, function(x) agrep(x["ID"], df2$ID, max = 3.5))
As you see it struggles to find the match for row 4. So it might make sense to clean your ID variable (e.g., take out the "/") before running agrep.
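A minimal sketch of such cleaning (the rule of keeping only the digits before the first "-" or "/", and the ID_root name, are assumptions):
# reduce e.g. "18273-9" and "12824/5/6/7" to their leading digits
df1$ID_root <- sub("[-/].*", "", df1$ID)
# fuzzy-match the cleaned roots against df2$ID, allowing edit distance 1
sapply(df1$ID_root, function(p) agrep(p, df2$ID, max.distance = 1))
With the cleaned roots, each row of df1 finds exactly one candidate in df2 for this example.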
One option consists of extracting the part of the ID you want to keep, and then doing your merge.
You can format your ID column as follows:
library(stringr)
library(dplyr)
If you want only the digits before any symbols
Table1 %>% mutate(ID = str_extract(ID, "[0-9]*"))
If you want to keep the first sequence of 5 digits
Table1 %>% mutate(ID = str_extract(ID, "[0-9]{5}"))
This answers your second question, but does not use the fuzzyjoin package.
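If you do want fuzzyjoin, a string-distance join can work once the IDs are reduced to a comparable root. A minimal sketch (the ID_root column and the max_dist/method settings are assumptions, not from the answers above):
library(dplyr)
library(fuzzyjoin)
# reduce "18273-9" / "12824/5/6/7" style IDs to their leading digits
Table1 <- Table1 %>% mutate(ID_root = sub("[-/].*", "", ID))
# join rows whose Levenshtein distance between ID_root and Table2$ID is at most 1
stringdist_left_join(Table1, Table2, by = c(ID_root = "ID"),
                     max_dist = 1, method = "lv")
A distance join on the raw IDs is brittle here (e.g. "18273-9" is 3 edits away from "18274"), so the cleaning step does most of the work.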

Sample from specific rows in a dataframe column [duplicate]

I'm looking for an efficient way to select rows from a data table such that I have one representative row for each unique value in a particular column.
Let me propose a simple example:
require(data.table)
y = c('a','b','c','d','e','f','g','h')
x = sample(2:10,8,replace = TRUE)
z = rep(y,x)
dt = as.data.table( z )
My objective is to subset data table dt by sampling one row for each letter a-h in column z.
The OP provided only a single column in the example. Assuming that there are multiple columns in the original dataset, we group by 'z', sample 1 row from the sequence of rows per group, get the row index (.I), extract that index column ($V1), and use it to subset the rows of 'dt'.
dt[dt[ , .I[sample(.N,1)] , by = z]$V1]
You can use dplyr:
library(dplyr)
dt %>%
  group_by(z) %>%
  sample_n(1)
I think that shuffling the data.table row-wise and then applying unique(...,by) could also work. Groups are formed with by and the previous shuffling trickles down inside each group:
# shuffle the data.table row-wise
dt <- dt[sample(dim(dt)[1])]
# uniqueness by given column(s)
unique(dt, by = "z")
Below is an example on a bigger data.table with grouping by 3 columns. Comparing with @akrun's solution, it seems to give the same grouping:
set.seed(2017)
dt <- data.table(c1 = sample(52*10^6),
c2 = sample(LETTERS, replace = TRUE),
c3 = sample(10^5, replace = TRUE),
c4 = sample(10^3, replace = TRUE))
# the shuffling & uniqueness
system.time( test1 <- unique(dt[sample(dim(dt)[1])], by = c("c2","c3","c4")) )
# user system elapsed
# 13.87 0.49 14.33
# @akrun's solution
system.time( test2 <- dt[dt[ , .I[sample(.N,1)] , by = c("c2","c3","c4")]$V1] )
# user system elapsed
# 11.89 0.10 12.01
# Grouping is identical (so, all groups are being sampled in both cases)
identical(x=test1[,.(c2,c3)][order(c2,c3)],
y=test2[,.(c2,c3)][order(c2,c3)])
# [1] TRUE
For sampling more than one row per group, the same idea extends; see the sketch below.
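A minimal sketch extending the row-index approach above (the group size of 2 is an arbitrary assumption):
# sample up to 2 rows per group; min(.N, 2L) guards against groups with fewer rows
dt[dt[, .I[sample(.N, min(.N, 2L))], by = z]$V1]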
Here is an updated workflow for dplyr. I added a second column, v, so there is a value column that can be grouped by z.
require(data.table)
y = c('a','b','c','d','e','f','g','h')
x = sample(2:10,8,replace = TRUE)
z = rep(y,x)
v <- 1:length(z)
dt = data.table(z,v)
library(dplyr)
dt %>%
group_by(z) %>%
slice_sample(n = 1)

How to explicitly name the count column generated by the .N function?

I want to group-by a data table by an id column and then count how many times each id occurs. This can be done as follows:
dt <- data.table(id = c(1, 1, 2))
dt_by_id <- dt[, .N, by = id]
dt_by_id
id N
1: 1 2
2: 2 1
That's fine, but I want the N column to have a different name (e.g. count). The help says:
.N is an integer, length 1, containing the number of rows in the group. This may be useful when the column names are not known in advance and for convenience generally. When grouping by i, .N is the number of rows in x matched to, for each row of i, regardless of whether nomatch is NA or 0. It is renamed to N (no dot) in the result (otherwise a column called ".N" could conflict with the .N variable, see FAQ 4.6 for more details and example), unless it is explicitly named; ...
How do I "explicitly name" the N column when creating the dt_by_id data table? (I know how to rename it afterwards.) I tried
dt_by_id <- dt[, count = .N, by = id]
but this led to
Error in `[.data.table`(dt, , count = .N, by = id) :
unused argument (count = .N)
You have to wrap the output of your calculation in list() if you want to give it your own name:
dt[, .(count=.N), by = id]
This is identical to dt[, list(count=.N), by = id], if you prefer; . is an alias for list here.
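For reference, with the dt from the question this yields:
dt[, .(count = .N), by = id]
#    id count
# 1:  1     2
# 2:  2     1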
If the column has already been created with the default name N, use setnames:
setnames(dt_by_id, "N", 'count')
or using rename
library(dplyr)
dt_by_id %>%
rename(count = N)
# id count
#1: 1 2
#2: 2 1
Using dplyr::count(x, name = "new_column") will replace the default column name n with a new name.
dt <- data.frame(id = c(1, 1, 2))
dt %>%
  dplyr::count(id, name = 'ID')

Select row by level of a factor

I have a data frame, df2, containing observations grouped by an ID factor that I would like to subset. I have used another function to identify which row within each factor group I want to select. This is shown below in df:
df <- data.frame(ID = c("A","B","C"),
pos = c(1,3,2))
df2 <- data.frame(ID = c(rep("A",5), rep("B",5), rep("C",5)),
obs = c(1:15))
In df, pos corresponds to the index of the row that I want to select within the factor level named in ID, not in the whole data frame df2. I'm looking for a way to select the rows for each ID according to the right index (i.e. their row number within the corresponding factor level of df2).
So, in this example, I want to select the first value in df2 with ID == 'A', the third value in df2 with ID == 'B' and the second value in df2 with ID == 'C'.
This would then give me:
df3 <- data.frame(ID = c("A", "B", "C"),
obs = c(1, 8, 12))
dplyr
library(dplyr)
merge(df,df2) %>%
group_by(ID) %>%
filter(row_number() == pos) %>%
select(-pos)
# ID obs
# 1 A 1
# 2 B 8
# 3 C 12
base R
df2m <- merge(df,df2)
do.call(rbind,
by(df2m, df2m$ID, function(SD) SD[SD$pos[1], setdiff(names(SD),"pos")])
)
by splits the merged data frame df2m by df2m$ID and operates on each part; it returns results in a list, so they must be rbinded together at the end. Each subset of the data (associated with each value of ID) is filtered by pos and deselects the "pos" column using normal data.frame syntax.
data.table, suggested by @DavidArenburg in a comment
library(data.table)
setkey(setDT(df2),"ID")[df][,
.SD[pos[1L], !"pos", with=FALSE]
, by = ID]
The first part -- setkey(setDT(df2),"ID")[df] -- is the merge. After that, the resulting table is split by = ID, and each Subset of Data, .SD is operated on. pos[1L] is subsetting in the normal way, while !"pos", with=FALSE corresponds to dropping the pos column.
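As a side note, current data.table versions can drop a column with !"pos" directly, without with = FALSE; a minimal equivalent sketch:
library(data.table)
setkey(setDT(df2), "ID")[df][, .SD[pos[1L], !"pos"], by = ID]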
See @eddi's answer for a better data.table approach.
Here's the base R solution:
df2$pos <- ave(df2$obs, df2$ID, FUN=seq_along)
merge(df, df2)
ID pos obs
1 A 1 1
2 B 3 8
3 C 2 12
If df2 is sorted by ID, you can just do df2$pos <- sequence(table(df2$ID)) for the first line.
Using data.table version 1.9.5+:
setDT(df2)[df, .SD[pos], by = .EACHI, on = 'ID']
which merges on ID column, then selects the pos row for each of the rows of df.
