Subsetting rows/columns in a dataframe based on similarity in R

I have a dataframe "df" as below:
V1 V2
1 b a
2 b a
3 a b
4 b a
5 a b
6 a b
7 a b
8 b a
9 a b
10 a b
11 a b
12 b a
Is there a way that I can automate the following 3 steps in R?
Step 1:
R identifies that, among the 12 rows of the dataframe "df", the pattern "a b" occurs the majority of the time.
Step 2:
Based on the majority pattern from Step 1, R subsets the dataframe to only those rows that contain it.
Step 3:
R outputs the new subsetted dataframe from Step 2.
Is there a package to do this, or a function that I can build? Any guidance will be very valuable. Thank you!

Do you need duplicates of the most common combination? If not, there's a really simple way to do it with data.table:
library("data.table")
# Create sample data; set the seed to reproduce the output
set.seed(1)
df <- data.table(V1 = sample(c("a", "b", "c"), 10, replace = TRUE),
                 V2 = sample(c("a", "b", "c"), 10, replace = TRUE),
                 V3 = sample(c("a", "b", "c"), 10, replace = TRUE))
#Subset
cols <- names(df)
df[, .N, by = cols][order(-N)][1,]
Output (N is the number of occurrences):
V1 V2 V3 N
1: b c b 2

With your updated Q, you may try this with data.table:
library(data.table)
setDT(df)
cols <- c("V1", "V2")
df[, .N, by = cols][N == max(N)][, N := NULL][df, on = cols, nomatch = 0]
# V1 V2 id
#1: a b 3
#2: a b 5
#3: a b 6
#4: a b 7
#5: a b 9
#6: a b 10
#7: a b 11
Explanation
setDT(df) coerces the data.frame to data.table without copying.
The relevant columns are defined to save typing.
The number of occurrences of each combination in the relevant columns is counted.
Only the combination with the highest number of occurrences is kept. This completes Step 1 of the Q, which asks to find the majority pattern.
Note that in case of a tie, i.e., when two or more combinations have the same maximum number of occurrences, all of those combinations are returned. The OP hasn't specified how they want to handle this case.
The counts are removed as they are no longer needed.
Finally, the original df is joined so that all rows of df which match the majority pattern are selected. nomatch = 0 specifies an inner join. This completes Step 2 and perhaps also Step 3 (but this is unclear to me, see comment).
Note that the row numbers in the id column are kept in the result without any additional effort. This would be the case for any other additional column in df as well.
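For readers who prefer intermediate steps, the same chain can be unrolled into a sketch like this (equivalent to the one-liner above):
counts <- df[, .N, by = cols]                   # Step 1a: count each V1/V2 combination
majority <- counts[N == max(N)]                 # Step 1b: keep the most frequent combination(s)
majority[, N := NULL]                           # drop the count, no longer needed
result <- majority[df, on = cols, nomatch = 0]  # Step 2: inner join back to df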
Data
df <- fread("id V1 V2
1 b a
2 b a
3 a b
4 b a
5 a b
6 a b
7 a b
8 b a
9 a b
10 a b
11 a b
12 b a")
# EDIT: By request of the OP, the row number (id) should be kept
# df[, id := NULL]

What exactly is "similarity" in your query - just finding the most common row and taking only that? If yes, you just need to group by all variables and sort by occurrence.
If you're talking about similar but not identical text in your columns, you need to look into edit distances (the stringdist package is good for this task: https://cran.r-project.org/web/packages/stringdist/stringdist.pdf)
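As a quick illustration of that package (a hedged sketch; stringdist() computes edit distances and amatch() does approximate matching):
library(stringdist)
# Levenshtein edit distance between two strings
stringdist("color", "colour", method = "lv")
# [1] 1
# Index of the closest entry within at most 2 edits ("color", element 1)
amatch("colr", c("color", "colour", "shade"), maxDist = 2)
# [1] 1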

Related

How to keep columns in the output when grouping a data.table by several columns

I am trying to "tidy" a large dataset where multiple different types of data are merged in columns, and some data is stored in the column names. This is a common scenario in biological datasets.
My data table has replicate measurements which I want to collapse into a mean. After converting the data into tidy format, these replicate values become additional rows. If I try to aggregate/group by several columns and calculate the mean of the replicates:
collapsed.data <- tidy.dt[, mean(expression, na.rm = TRUE),
                          by = list(Sequence.window, Gene.names, ratio, enrichment.type, condition)]
I get a resultant table that has only the columns used in the by statement, followed by the mean(expression) as column V1. Is it possible to keep all the other (unchanged) columns as well?
A minimalist example showing what I am trying to achieve is as follows:
library(data.table)
dt <- data.table(a = c("a", "a", "b", "b", "c", "a", "c", "a"), b = rnorm(8),
                 c = c(1, 1, 1, 1, 1, 2, 1, 2), d = rep("x", 8), e = rep("test", 8))
dt[, mean(b), by = list(a, c)]
# a c V1
#1: a 1 -0.7597186
#2: b 1 -0.3001626
#3: c 1 -0.6893773
#4: a 2 -0.1589146
As you can see, the columns d and e are dropped.
One possibility is to include d and e in the grouping:
res <- dt[, mean(b), by = list(a, c, d, e)]
res
# a c d e V1
#1: a 1 x test 0.9271986
#2: b 1 x test -0.3161799
#3: c 1 x test 1.3709635
#4: a 2 x test 0.1543337
If you want to keep all columns except the one you want to aggregate, you can do this in a more programmatic way:
cols_to_group_by <- setdiff(colnames(dt), "b")
res <- dt[, mean(b), by = cols_to_group_by]
The result is the same as above.
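As a small aside, if you'd rather not get the default V1 column name, the aggregate can be named inside .() (same result, nicer name):
res <- dt[, .(mean_b = mean(b)), by = cols_to_group_by]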
This reduces the number of rows. If you want to keep all rows instead, you can add the mean as an additional column:
dt[, mean_b := mean(b), by = list(a, c)]
dt
# a b c d e mean_b
#1: a 1.1127632 1 x test 0.9271986
#2: a 0.7416341 1 x test 0.9271986
#3: b 0.9040880 1 x test -0.3161799
#4: b -1.5364479 1 x test -0.3161799
#5: c 1.9846982 1 x test 1.3709635
#6: a 0.2615139 2 x test 0.1543337
#7: c 0.7572287 1 x test 1.3709635
#8: a 0.0471535 2 x test 0.1543337
Here, dt is modified by reference, i.e., without copying all of dt, which can save time on large data.
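If you need the original dt to stay unchanged, you can take an explicit copy before adding the column (a sketch using data.table's copy()):
# copy() makes a deep copy, so := below leaves the original dt untouched
dt2 <- copy(dt)
dt2[, mean_b := mean(b), by = list(a, c)]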

Finding factors that correspond to more than one value

Suppose, that one has the following dataframe:
x <- data.frame(c(1, 1, 2, 2, 2, 3), c("A", "A", "B", "B", "B", "B"))
names(x) <- c("v1", "v2")
x
v1 v2
1 1 A
2 1 A
3 2 B
4 2 B
5 2 B
6 3 B
In this dataframe I want each value in v1 to correspond to a label in v2. However, as one can see in this example, B has more than one corresponding value.
Is there any elegant and fast way to find which labels in v2 correspond to more than one value in v1?
Ideally, the result should show the values (which in our example should be c(2,3)) as well as the row numbers (which in our example should be r=c(5,6)).
Assuming that we want the indices of the unique elements in 'v1', grouped by 'v2', restricted to groups that have more than one unique element, we create a logical index with ave and use it to subset the rows of 'x'.
i1 <- with(x, ave(v1, v2, FUN = function(x)
  length(unique(x)) > 1 & !duplicated(x, fromLast = TRUE))) != 0
x[i1,]
# v1 v2
#5 2 B
#6 3 B
Or a faster option is data.table
library(data.table)
i1 <- setDT(x)[, .I[uniqueN(v1)>1 & !duplicated(v1, fromLast=TRUE)], v2]$V1
x[i1, 'v1', with = FALSE][, rn := i1][]
# v1 rn
#1: 2 5
#2: 3 6
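For comparison, a dplyr sketch of the same logic (row_number() records the original row indices before grouping):
library(dplyr)
x %>%
  mutate(rn = row_number()) %>%                 # remember original row numbers
  group_by(v2) %>%
  filter(n_distinct(v1) > 1,                    # labels with more than one v1 value
         !duplicated(v1, fromLast = TRUE)) %>%  # last occurrence of each v1
  ungroup() %>%
  select(v1, rn)
# v1 rn
# 2  5
# 3  6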

Group a data.table using a column which is a list

My data is really big and looping through the data.table to do what I want is too slow, so I am trying to avoid looping. Let's assume I have a data.table as follows:
a <- data.table(i = c(1,2,3), j = c(2,2,6), k = list(c("a","b"),c("a","c"),c("b")))
> a
i j k
1: 1 2 a,b
2: 2 2 a,c
3: 3 6 b
And I want to group based on the values in k. So something like this:
a[, sum(j), by = k]
Right now I am getting the following error:
Error in `[.data.table`(a, , sum(i), by = k) :
The items in the 'by' or 'keyby' list are length (2,2,1). Each must be same length as rows in x or number of rows returned by i (3).
The answer I am looking for is to first group all the rows having "a" in column k and calculate sum(j), then all rows having "b", and so on. So the desired answer would be:
k V1
a 4
b 8
c 2
Any hint on how to do this efficiently? I can't melt column k by repeating the rows, since the resulting data.table would be too big for my case.
I think this might work:
a[, .(k = unlist(k)), by = .(i, j)][, sum(j), by = k]
k V1
1: a 4
2: b 8
3: c 2
If we are using tidyr, a compact option would be
library(tidyr)
unnest(a, k)[, sum(j), k]
# k V1
#1: a 4
#2: b 8
#3: c 2
Or using the dplyr/tidyr pipes
unnest(a, k) %>%
  group_by(k) %>%
  summarise(V1 = sum(j))
# k V1
# <chr> <dbl>
#1 a 4
#2 b 8
#3 c 2
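Note that in tidyr 1.0.0 and later, unnest() takes the columns to unnest via the cols argument, so the calls above would be written as:
# tidyr >= 1.0.0 syntax
unnest(a, cols = k)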
Since by-group operations can be slow, I'd consider...
dat = a[rep(1:.N, lengths(k)), c(.SD, .(k = unlist(a$k))), .SDcols = setdiff(names(a), "k")]
i j k
1: 1 2 a
2: 1 2 b
3: 2 2 a
4: 2 2 c
5: 3 6 b
We're repeating the rows of cols i:j to match the unlisted k. The data should probably be kept in this long format instead of using a list column. From there, as in #MikeyMike's answer, we can run dat[, sum(j), by = k], as shown below.
In data.table 1.9.7+, we can similarly do
dat = a[, c(.SD[rep(.I, lengths(k))], .(k = unlist(k))), .SDcols=i:j]
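Either way, the final grouped sum on the expanded dat reproduces the desired result:
dat[, sum(j), by = k]
#    k V1
# 1: a  4
# 2: b  8
# 3: c  2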

How do I subset a data table in R to get the rows unique to it

I know this may be a simple question, but I can't seem to get it right.
I have two data tables, old_dt and new_dt. Both data tables have the same two columns. My goal is to get the rows from new_dt that are not in old_dt.
Here is an example. Old_dt:
v1 v2
1 a
2 b
3 c
4 d
Here is new_dt:
v1 v2
3 c
4 d
5 e
What I want is to get just the 5 e row.
Using setdiff didn't work because my real data is more than 3 million rows. Using subset like this
sub.cti <- subset(new_dt, old_dt$v1 != new_dt$v1 & old_dt$v2 != new_dt$v2)
only resulted in new_dt itself or in nothing.
Using
sub.cti <- new_dt[, .(!old_dt$v1, !old_dt$v2)]
resulted in multiple rows of FALSEs.
Can somebody help me?
Thank you in advance
We can do a join (data from #giraffehere's post)
df2[!df1, on = "a"]
# a b
#1: 6 14
#2: 7 15
To get rows in 'df1' that are not in 'df2' based on the 'a' column
df1[!df2, on = "a"]
# a b
#1: 4 9
#2: 5 10
In the OP's example we need to join on both columns
new_dt[!old_dt, on = c("v1", "v2")]
# v1 v2
#1: 5 e
NOTE: Here I assumed that 'new_dt' and 'old_dt' are data.tables.
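If new_dt and old_dt are still plain data.frames, a quick conversion in place (a sketch) would be:
library(data.table)
# setDT() converts by reference, so the 3-million-row data is not copied
setDT(new_dt)
setDT(old_dt)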
Of course, dplyr is a good package. For dealing with this problem, a shorter anti_join can be used
library(dplyr)
anti_join(new_dt, old_dt)
# v1 v2
# (int) (chr)
#1 5 e
or the setdiff from dplyr, which works on data.frame, data.table, tbl_df, etc.:
setdiff(new_dt, old_dt)
# v1 v2
#1: 5 e
However, the question is tagged as data.table.
dplyr helps a lot when you deal with tabular data in R; I would recommend you learn more about dplyr here.
library(dplyr)
library(magrittr) # this is just for shorter code with %<>%
# Create a sequence key that combines v1 & v2
Old_dt %<>%
  mutate(sequence = paste0(v1, v2))
new_dt %<>%
  mutate(sequence = paste0(v1, v2))
# Keep only the rows of new_dt whose sequence does not exist in Old_dt
result <- new_dt %>%
  filter(!(sequence %in% Old_dt$sequence)) %>%
  select(v1:v2)
v1 v2
5 e
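One caveat with the paste0(v1, v2) key: concatenating without a separator can make distinct rows collide, e.g. c("1", "23") and c("12", "3") both become "123". A sketch of the same key with an explicit separator avoids this:
# Safer key: the separator prevents accidental collisions
Old_dt %<>% mutate(sequence = paste(v1, v2, sep = "_"))
new_dt %<>% mutate(sequence = paste(v1, v2, sep = "_"))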
EDIT: I noticed the OP wanted to match on both columns, not just one. I'll keep the data initialization part of the solution here, as it is referenced above by #akrun. However, use the top solution #akrun posted; it is more the "data.table way".
df1 <- data.table(a = 1:5, b = 6:10)
df2 <- data.table(a = c(1, 2, 3, 6, 7), b = 11:15)
head(df1)
a b
1: 1 6
2: 2 7
3: 3 8
4: 4 9
5: 5 10
head(df2)
a b
1: 1 11
2: 2 12
3: 3 13
4: 6 14
5: 7 15
If column a has repeats, you could try this base R hack:
id.var1 <- paste(df1$a, df1$b, sep = "_")
id.var2 <- paste(df2$a, df2$b, sep = "_")
dfKeep <- df2[!(id.var2 %in% id.var1), ]

Select random row for each unique value in one specific column of data frame

I have quite a simple request that I cannot, however, manage in one line of code.
All I want is to subset an input data frame so that the output data frame contains only one randomly selected row for each unique value (factor level) of one particular column.
E.g., I have (v2 is the particular column):
v1 v2
1 A 1
2 B 1
3 C 2
4 A 1
5 B 2
6 B 1
7 B 1
8 C 2
9 D 1
10 E 1
And I want to have as the output data frame:
v1 v2
1 B 1
2 C 2
Thank you for any suggestions in advance!
This is way more than what you asked for, but I wrote a function called stratified that lets you take random samples from a data.frame by one or more group variables.
You can load it and use it like this:
library(devtools)
source_gist("https://gist.github.com/mrdwab/6424112")
# [1] "https://raw.github.com/gist/6424112"
# SHA-1 hash of file is 0006d8548785ec8a5651c3dd599648cc88d153a4
## One row
stratified(mydf, "v2", 1)
# v1 v2
# 10 E 1
# 8 C 2
## Two rows
stratified(mydf, "v2", 2)
# v1 v2
# 2 B 1
# 6 B 1
# 3 C 2
# 5 B 2
I'll add official documentation to the function at some point, but here's a summary to help you get the best use out of it:
The arguments to stratified are:
df: The input data.frame
group: A character vector of the column or columns that make up the "strata".
size: The desired sample size.
If size is a value less than 1, a proportionate sample is taken from each stratum.
If size is a single integer of 1 or more, that number of samples is taken from each stratum.
If size is a vector of integers, the specified number of samples is taken for each stratum. It is recommended that you use a named vector. For example, if you have two strata, "A" and "B", and you wanted 5 samples from "A" and 10 from "B", you would enter size = c(A = 5, B = 10) (see the sketch after this list).
select: This allows you to subset the groups in the sampling process. This is a list. For instance, if your group variable was "Group", and it contained three strata, "A", "B", and "C", but you only wanted to sample from "A" and "C", you can use select = list(Group = c("A", "C")).
replace: For sampling with replacement.
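For instance, following the size = c(A = 5, B = 10) convention just described, per-stratum sizes for the example data might look like this (an untested sketch; the strata of v2 here are the values 1 and 2):
## Two rows from stratum "1" of v2, one row from stratum "2"
stratified(mydf, "v2", size = c("1" = 2, "2" = 1))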
You can iterate over the unique values in your column, find the row indices for each value, and select one row index at random using sample, like this:
# Set seed for reproducible results
set.seed(1)
# Generate one random index per unique value of v2
# (caveat: if a value occurred only once, sample(which(...), 1) would
# sample from 1:n rather than return the single index n)
ind <- sapply(unique(df$v2), function(x) sample(which(df$v2 == x), 1))
# Subset data.frame
df[ind, ]
# v1 v2
#2 B 1
#5 B 2
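For comparison, a dplyr sketch of the same idea (sample_n() draws rows within each group):
library(dplyr)
# One randomly selected row per unique value of v2
df %>%
  group_by(v2) %>%
  sample_n(1) %>%
  ungroup()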
