Remove duplicating docs of docs with high similarity

When downloading lexisnexis newspaper articles, there's often a lot of duplicating articles in the corpus. I want to remove them and I was thinking of doing so by using cosine similarity statistics, but I'm not sure how to automate this. Any ideas?

Your question is fairly thin on details - such as a reproducible example - but it's an interesting question and challenge. So here goes.
Let's say we have a corpus consisting of two sets of similar documents, { (a1, a2, a3), (b1, b2) } where the letters indicate similarity. We want to keep just one document when the others are "duplicates", defined as similarity exceeding a threshold, say 0.80.
We can use textstat_simil() to generate a similarity matrix, and then form pairwise sets directly from the returned dist object, and then keep just one of the similar sets.
library("quanteda")
# Loading required package: quanteda
# Package version: 1.3.14
mydocs <- c(a1 = "a a a a a b b c d w g j t",
b1 = "l y y h x x x x x y y y y",
a2 = "a a a a a b c s k w i r f",
b2 = "p q w e d x x x x y y y y",
a3 = "a a a a a b b x k w i r f")
mydfm <- dfm(mydocs)
(sim <- textstat_simil(mydfm))
# a1 b1 a2 b2
# b1 -0.22203788
# a2 0.80492203 -0.23090513
# b2 -0.23427416 0.90082239 -0.28140219
# a3 0.81167608 -0.09065452 0.92242890 -0.12530944
# create a data.frame of the unique pairs and their similarities
sim_pair_names <- t(combn(docnames(mydfm), 2))
sim_pairs <- data.frame(sim_pair_names,
sim = as.numeric(sim),
stringsAsFactors = FALSE)
sim_pairs
# X1 X2 sim
# 1 a1 b1 -0.22203788
# 2 a1 a2 0.80492203
# 3 a1 b2 -0.23427416
# 4 a1 a3 0.81167608
# 5 b1 a2 -0.23090513
# 6 b1 b2 0.90082239
# 7 b1 a3 -0.09065452
# 8 a2 b2 -0.28140219
# 9 a2 a3 0.92242890
# 10 b2 a3 -0.12530944
Subsetting this on our threshold condition, we can extract the names of the unlucky documents to be dropped, and feed this to a logical condition in dfm_subset().
# set the threshold for similarity
threshold <- 0.80
# discard one of the pair if similarity > threshold
todrop <- subset(sim_pairs, select = X1, subset = sim > threshold, drop = TRUE)
todrop
# [1] "a1" "a1" "b1" "a2"
# then subset the dfm, keeping only the "keepers"
dfm_subset(mydfm, !docnames(mydfm) %in% todrop)
# Document-feature matrix of: 2 documents, 20 features (62.5% sparse).
# 2 x 20 sparse Matrix of class "dfm"
# features
# docs a b c d w g j t l y h x s k i r f p q e
# b2 0 0 0 1 1 0 0 0 0 4 0 4 0 0 0 0 0 1 1 1
# a3 5 2 0 0 1 0 0 0 0 0 0 1 0 1 1 1 1 0 0 0
Other solutions to this problem of similar documents would be to form them into clusters, or to reduce the document matrix using principal components analysis, along the lines of latent semantic analysis.
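As a rough sketch of the clustering idea (not part of the answer above): the block below re-types the similarity values printed earlier into a plain matrix so that it runs without quanteda, converts them to distances with 1 - similarity, and cuts a complete-linkage tree at 1 - threshold. The names sim_mat, hc, groups and keep are made up for this illustration, and complete linkage is an assumed choice, not something the answer prescribes.

```r
# Similarities copied by hand from the textstat_simil() output above
docs <- c("a1", "b1", "a2", "b2", "a3")
sim_mat <- matrix(1, 5, 5, dimnames = list(docs, docs))
sim_mat[lower.tri(sim_mat)] <- c(-0.22203788,  0.80492203, -0.23427416,  0.81167608,
                                 -0.23090513,  0.90082239, -0.09065452,
                                 -0.28140219,  0.92242890,
                                 -0.12530944)
# Mirror the lower triangle into the upper triangle
sim_mat[upper.tri(sim_mat)] <- t(sim_mat)[upper.tri(sim_mat)]

threshold <- 0.80

# Cluster on distance = 1 - similarity, then cut the tree at 1 - threshold
hc <- hclust(as.dist(1 - sim_mat), method = "complete")
groups <- cutree(hc, h = 1 - threshold)

# Keep the first document of every group
keep <- names(groups)[!duplicated(groups)]
keep
```

With complete linkage, every pair of documents inside a group has a similarity of at least the threshold, so keeping one member per group is a fairly conservative rule. For the toy values this keeps a1 and b1.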

You already received some excellent answers. But if you prefer a more automated approach targeted at your specific use case, you can use the package LexisNexisTools (which I wrote). It comes with a function called lnt_similarity(), which does exactly what you were looking for. I wrote a quick tutorial with mock data here.
The main difference between the solutions here and in lnt_similarity() is that I also take into account word order, which can make a big difference in some rare cases (see this blog post).
I also suggest you think carefully about thresholds as you might otherwise remove some articles wrongfully. I included a function to visualize the difference between two articles so you can get a better grip of the data you are removing called lnt_diff().

If you have thousands of documents, it takes a lot of space in your RAM to save all the similarity scores, but you can set a minimum threshold in textstat_proxy(), the underlying function of textstat_simil().
In this example, cosine similarity scores smaller than 0.9 are all ignored.
library("quanteda")
mydocs <- c(a1 = "a a a a a b b c d w g j t",
b1 = "l y y h x x x x x y y y y",
a2 = "a a a a a b c s k w i r f",
b2 = "p q w e d x x x x y y y y",
a3 = "a a a a a b b x k w i r f")
mydfm <- dfm(mydocs)
(sim <- textstat_proxy(mydfm, method = "cosine", min_proxy = 0.9))
# 5 x 5 sparse Matrix of class "dsTMatrix"
# a1 b1 a2 b2 a3
# a1 1 . . . .
# b1 . 1.0000000 . 0.9113423 .
# a2 . . 1.0000000 . 0.9415838
# b2 . 0.9113423 . 1.0000000 .
# a3 . . 0.9415838 . 1.0000000
matrix2list <- function(x) {
names(x@x) <- rownames(x)[x@i + 1]
split(x@x, factor(x@j + 1, levels = seq(ncol(x)), labels = colnames(x)))
}
matrix2list(sim)
# $a1
# a1
# 1
#
# $b1
# b1
# 1
#
# $a2
# a2
# 1
#
# $b2
# b1 b2
# 0.9113423 1.0000000
#
# $a3
# a2 a3
# 0.9415838 1.0000000
See https://koheiw.net/?p=839 for the performance differences.
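To turn that list into an actual de-duplication step, something along these lines might work. It is only a sketch: the list near is a hand-typed copy of the matrix2list(sim) output above (so the block runs without quanteda), the names near, is_dup and keep_docs are hypothetical, and it assumes each similar pair is stored only once, under the later document, as in the output shown.

```r
# Hand-typed copy of the matrix2list(sim) output shown above
near <- list(
  a1 = c(a1 = 1),
  b1 = c(b1 = 1),
  a2 = c(a2 = 1),
  b2 = c(b1 = 0.9113423, b2 = 1),
  a3 = c(a2 = 0.9415838, a3 = 1)
)

# Flag a document when its entry lists any document other than itself
is_dup <- vapply(names(near),
                 function(d) any(names(near[[d]]) != d),
                 logical(1))

# Documents to keep: those with no earlier near-duplicate
keep_docs <- names(near)[!is_dup]
keep_docs
```

The resulting names could then be passed to dfm_subset(mydfm, docnames(mydfm) %in% keep_docs), in the same way as in the first answer.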

Related

Comparing two columns of two dataframes (logical operators)

I would like to compare two columns simultaneously. My data looks like this:
a <- data.frame("a1" = c(1,1,1,3,4), "a2" = c(2,1,2,1,2))
b <- data.frame("b1" = c(1,1,3,1,3), "b2" = c(2,2,1,2,1))
cbind(a, b)
# a1 a2 b1 b2
# 1 1 2 1 2
# 2 1 1 1 2
# 3 1 2 3 1
# 4 3 1 1 2
# 5 4 2 3 1
I would like to identify all rows of a where a1 does not occur in b1, or where a1 does occur in b1 but the a2 value of that row does not occur among the b2 values of the rows of b whose b1 equals that a1. So the second question is: when a1 is in b1, is the a2 of that row also among the b2 values of the matching b1 rows?
Example for line 2: I am checking, if a1 = 1 is anywhere in b1 = c(1,1,3,1,3). It is, so I want to check if a2 = 1 in line 2 (where a1 = 1) is anywhere in b2 where b1 = a1 = 1, so here b2 = c(2, 2, 2). For line 2 a2 = 1 is not in b2 = c(2, 2, 2), so the result should show me this line.
The first question is easy to answer with the following code:
a[which(!(a$a1 %in% b$b1)), ]
# a1 a2
# 5 4 2
But I can't fix the second problem. Maybe I am working in a wrong way with the logical operators. My result should look like this:
a1 a2
2 1 1
5 4 2

Following the explanation in your edit, you want the rows where either the specific a1 from a is not in b1 from b or where the specific a1 from a is equal to b1 of the same row in b and a2 from a is not among the values of b2 from b of the rows for which b1 equals the value of the specific a1.
In R, you can write it like this:
cond <- sapply(seq(nrow(a)), # check each row, one by one
function (i){
!(a$a1[i] %in% b$b1) | # a1 of the specific row is not in b1 or
!(a$a2[i] %in% b$b2[b$b1==a$a1[i]]) # a2 of the specific row is not in the values of b2 for which b1 equals a1 of the specific row
})
a[cond, ]
# a1 a2
#2 1 1
#5 4 2
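The same condition can also be read as "the pair (a1, a2) does not occur anywhere in b as a pair (b1, b2)", which allows a vectorised check without sapply. This is a sketch of that reading, not the answerer's code; key_a, key_b and res are names invented here, and pasting the two columns into one key assumes the separator "|" never appears in the values.

```r
a <- data.frame(a1 = c(1, 1, 1, 3, 4), a2 = c(2, 1, 2, 1, 2))
b <- data.frame(b1 = c(1, 1, 3, 1, 3), b2 = c(2, 2, 1, 2, 1))

# Build one key per row from both columns
key_a <- paste(a$a1, a$a2, sep = "|")
key_b <- paste(b$b1, b$b2, sep = "|")

# Rows of a whose (a1, a2) pair never appears as a (b1, b2) pair
res <- a[!key_a %in% key_b, ]
res
```

Because only the pairs are compared, it does not matter that a and b have different numbers of rows or that matching values sit in different positions.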

Obviously not a nice solution, but it works with my data (the two datasets have different numbers of rows, and the values are not in the same positions in the variables). Here it is with new example data, because my first example was badly chosen.
a <- data.frame("a1" = c(1,1,1,3,4), "a2" = c(2,1,2,1,2))
b <- data.frame("b1" = c(1,3,1,1), "b2" = c(2,1,2,2))
test <- function (data1, data2) {
for (i in unique(data1[data1$a1 %in% data2$b1, "a1"])) {
temp_data1 <- data1[data1$a1 == i, c("a1", "a2")]
temp_data2 <- data2[data2$b1 == i, c("b1", "b2")]
for (j in unique(temp_data1$a2)) {
test <- j %in% unique(temp_data2$b2)
if (test == FALSE) {
print(unique(temp_data1[temp_data1$a1 == i & temp_data1$a2 == j, ]))
}
}
}
for (k in unique(data1[which(!(data1$a1 %in% data2$b1)), "a1"])) {
print(unique(data1[data1$a1 == k, c("a1", "a2")]))
}
}
test(a, b)
a1 a2
2 1 1
a1 a2
5 4 2

Based on your answer I improved the function test(). This version returns a dataframe:
a <- data.frame(a1=c(1,1,1,3,4), a2=c(2,1,2,1,2))
b <- data.frame(b1=c(1,1,3,1,3), b2=c(2,2,1,2,1))
test <- function (a, b) {
R <- subset(a,!a1 %in% b$b1)
I <- unique(a$a1[a$a1 %in% b$b1])
for (i in I) {
ai <- subset(a, a1 == i)
bi <- subset(b, b1 == i)
J <- unique(bi$b2)
for (j in unique(ai$a2)) if (! j %in% J) R <- rbind(subset(ai, a2==j), R)
}
R
}
test(a, b)

Phylogenetic Tree - how to create a branch by species matrix?

Working with a phylogenetic tree in R, I would like to create a matrix which indicates if each branch of the tree (B1 to B8) is associated with each species (A to E), where 1s indicate that the branch is associated. (Shown below)
The R function which.edge() is useful for identifying the terminal branch for a species, but it doesn't identify ALL the branches associated with each species. What function could I use to identify all the branches in the tree that go from the root to the tip for each species?
Example Tree
library(ape)
ex.tree <- read.tree(text="(A:4,((B:1,C:1):2,(D:2,E:2):1):1);")
plot(ex.tree)
edgelabels() #shows branches 1-8
This is the matrix I would like to create (Species A-E as columns, Branches B1-B8 as rows), but with an easy function rather than by hand.
B1 <- c(1,0,0,0,0)
B2 <- c(0,1,1,1,1)
B3 <- c(0,1,1,0,0)
B4 <- c(0,1,0,0,0)
B5 <- c(0,0,1,0,0)
B6 <- c(0,0,0,1,1)
B7 <- c(0,0,0,1,0)
B8 <- c(0,0,0,0,1)
Mat <- rbind(B1,B2,B3,B4,B5,B6,B7,B8)
colnames(Mat) <- c("A","B","C","D","E")
Mat
For example, Branch B2 goes to species B-E, but not to species A. For Species E, branches B2, B6, B8 are present.
Which R function(s) would be best? Thanks in advance!

I am unaware of any built-in function that does this. I wrote a helper function that can calculate this from the edge data stored in the tree object.
branchNodeAdjacency <- function(x) {
m <- matrix(0, ncol=length(x$tip.label), nrow=nrow(x$edge)) # one row per branch, one column per tip
from <- x$edge[,1]
to <- x$edge[,2]
g <- seq_along(x$tip.label)
while (any(!is.na(g))) {
i <- match(g, to)
m[cbind(i, seq_along(i))] <- 1
g <- from[i]
}
rownames(m) <- paste0("B", seq.int(nrow(m)))
colnames(m) <- x$tip.label
m
}
branchNodeAdjacency(ex.tree)
# A B C D E
# B1 1 0 0 0 0
# B2 0 1 1 1 1
# B3 0 1 1 0 0
# B4 0 1 0 0 0
# B5 0 0 1 0 0
# B6 0 0 0 1 1
# B7 0 0 0 1 0
# B8 0 0 0 0 1
The idea is we keep track of which leaf node values are represented by each internal node.
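To make the walk from tip to root easier to follow, here is the same idea written as a plain loop over tips. It is a sketch that runs without ape: the edge matrix below is typed by hand and is assumed to match what ex.tree$edge holds for the example tree (tips numbered 1 to 5, root 6), and the names edge, tips and m are illustrative only.

```r
# Assumed copy of ex.tree$edge: column 1 = parent node, column 2 = child node
edge <- rbind(c(6, 1), c(6, 7), c(7, 8), c(8, 2),
              c(8, 3), c(7, 9), c(9, 4), c(9, 5))
tips <- c("A", "B", "C", "D", "E")

m <- matrix(0, nrow = nrow(edge), ncol = length(tips),
            dimnames = list(paste0("B", seq_len(nrow(edge))), tips))

for (tip in seq_along(tips)) {
  node <- tip
  repeat {
    e <- match(node, edge[, 2])  # the branch that ends at this node
    if (is.na(e)) break          # the root has no incoming branch
    m[e, tip] <- 1               # mark the branch for this species
    node <- edge[e, 1]           # move up to the parent node
  }
}
m
```

The helper function above does the same walk for all tips at once, which is why it uses match() on a whole vector instead of one node at a time.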

Generating a 96 or 384 well plate layout in R

I am trying to write some code which will take a .csv file which contains some sample names as input and will output a data.frame containing the sample names and either a 96 well plate or 384 well plate format (A1, B1, C1...). For those who do not know, a 96 well plate has eight alphabetically labeled rows (A, B, C, D, E, F, G, H) and 12 numerically labeled columns (1:12) and a 384 well plate has 16 alphabetically labeled rows (A:P) and 24 numerically labeled columns (1:24). I am trying to write some code that will generate either of these formats (there CAN be two different functions to do this) allowing for the samples to be labeled either DOWN (A1, B1, C1, D1, E1, F1, G1, H1, A2...) or ACROSS (A1, A2, A3, A4, A5 ...).
So far, I have figured out how to get the row names fairly easily
rowLetter <- rep(LETTERS[1:8], length.out = variable)
#variable will be based on how many samples I have
I just cannot figure out how to get the numeric column names to apply correctly... I have tried:
colNumber <- rep(1:12, times = variable)
but it isn't that simple. All 8 rows must be filled before the col number increases by 1 if you're going 'DOWN' or all 12 columns must be filled before the row letter increases by 1 if you're going 'ACROSS'.
EDIT:
Here is a clunky version. It takes the number of samples that you have, a 'plate format' which IS NOT functional yet, and a direction and will return a data.frame with the wells and plate numbers. Next, I am going to a) fix the plate format so that it will work correctly and b) give this function the ability to take a list of samples names or ID's or whatever and return the sample names, well positions, and plate numbers!
plateLayout <- function(numOfSamples, plateFormat = 96, direction = "DOWN"){
#This assumes that each well will be filled in order. I may need to change this, but lets get it working first.
#Calculate the number of plates required
platesRequired <- ceiling(numOfSamples/plateFormat)
rowLetter <- character(0)
colNumber <- numeric(0)
plateNumber <- numeric(0)
#The following will work if the samples are going DOWN
if(direction == "DOWN"){
for(k in 1:platesRequired){
rowLetter <- c(rowLetter, rep(LETTERS[1:8], length.out = 96))
for(i in 1:12){
colNumber <- c(colNumber, rep(i, times = 8))
}
plateNumber <- c(plateNumber, rep(k, times = 96))
}
plateLayout <- paste0(rowLetter, colNumber)
plateLayout <- data.frame(plateLayout, plateNumber)
plateLayout <- plateLayout[1:numOfSamples,]
return(plateLayout)
}
#The following will work if the samples are going ACROSS
if(direction == "ACROSS"){
for(k in 1:platesRequired){
colNumber <- c(colNumber, rep(1:12, times = 8))
for(i in 1:8){
rowLetter <- c(rowLetter, rep(LETTERS[i], times = 12))
}
plateNumber <- c(plateNumber, rep(k, times = 96))
}
plateLayout <- paste0(rowLetter, colNumber)
plateLayout <- data.frame(plateLayout, plateNumber)
plateLayout <- plateLayout[1:numOfSamples,]
return(plateLayout)
}
}
Does anybody have any thoughts on what else might make this cool? I'm going to use this function to generate .csv or .txt files to use as sample name imports for different instruments so I will be kind of constrained in terms of 'cool features', but I think it would be cool to use ggplot to make a graphic which shows the plates and sample names?

You don't need for loops. Here is a start:
#some sample ids
ids <- c(LETTERS, letters)
#plate size:
n <- 96
nrow <- 8
samples <- character(n)
samples[seq_along(ids)] <- ids
samples <- matrix(samples, nrow=nrow)
colnames(samples) <- seq_len(n/nrow)
rownames(samples) <- LETTERS[seq_len(nrow)]
# 1 2 3 4 5 6 7 8 9 10 11 12
# A "A" "I" "Q" "Y" "g" "o" "w" "" "" "" "" ""
# B "B" "J" "R" "Z" "h" "p" "x" "" "" "" "" ""
# C "C" "K" "S" "a" "i" "q" "y" "" "" "" "" ""
# D "D" "L" "T" "b" "j" "r" "z" "" "" "" "" ""
# E "E" "M" "U" "c" "k" "s" "" "" "" "" "" ""
# F "F" "N" "V" "d" "l" "t" "" "" "" "" "" ""
# G "G" "O" "W" "e" "m" "u" "" "" "" "" "" ""
# H "H" "P" "X" "f" "n" "v" "" "" "" "" "" ""
library(reshape2)
samples <- melt(samples)
samples$position <- paste0(samples$Var1, samples$Var2)
# Var1 Var2 value position
# 1 A 1 A A1
# 2 B 1 B B1
# 3 C 1 C C1
# 4 D 1 D D1
# 5 E 1 E E1
# 6 F 1 F F1
# 7 G 1 G G1
# 8 H 1 H H1
# 9 A 2 I A2
# 10 B 2 J B2
# 11 C 2 K C2
# 12 D 2 L D2
# 13 E 2 M E2
# 14 F 2 N F2
# 15 G 2 O G2
# 16 H 2 P H2
# 17 A 3 Q A3
# 18 B 3 R B3
# 19 C 3 S C3
# 20 D 3 T D3
# 21 E 3 U E3
# 22 F 3 V F3
# 23 G 3 W G3
# 24 H 3 X H3
# 25 A 4 Y A4
# 26 B 4 Z B4
# 27 C 4 a C4
# 28 D 4 b D4
# 29 E 4 c E4
# 30 F 4 d F4
# 31 G 4 e G4
# 32 H 4 f H4
# 33 A 5 g A5
# 34 B 5 h B5
# 35 C 5 i C5
# 36 D 5 j D5
# 37 E 5 k E5
# 38 F 5 l F5
# 39 G 5 m G5
# 40 H 5 n H5
# 41 A 6 o A6
# 42 B 6 p B6
# 43 C 6 q C6
# 44 D 6 r D6
# 45 E 6 s E6
# 46 F 6 t F6
# 47 G 6 u G6
# 48 H 6 v H6
# 49 A 7 w A7
# 50 B 7 x B7
# 51 C 7 y C7
# 52 D 7 z D7
# 53 E 7 E7
# 54 F 7 F7
# 55 G 7 G7
# 56 H 7 H7
# 57 A 8 A8
# 58 B 8 B8
# 59 C 8 C8
# 60 D 8 D8
# 61 E 8 E8
# 62 F 8 F8
# 63 G 8 G8
# 64 H 8 H8
# 65 A 9 A9
# 66 B 9 B9
# 67 C 9 C9
# 68 D 9 D9
# 69 E 9 E9
# 70 F 9 F9
# 71 G 9 G9
# 72 H 9 H9
# 73 A 10 A10
# 74 B 10 B10
# 75 C 10 C10
# 76 D 10 D10
# 77 E 10 E10
# 78 F 10 F10
# 79 G 10 G10
# 80 H 10 H10
# 81 A 11 A11
# 82 B 11 B11
# 83 C 11 C11
# 84 D 11 D11
# 85 E 11 E11
# 86 F 11 F11
# 87 G 11 G11
# 88 H 11 H11
# 89 A 12 A12
# 90 B 12 B12
# 91 C 12 C12
# 92 D 12 D12
# 93 E 12 E12
# 94 F 12 F12
# 95 G 12 G12
# 96 H 12 H12
Use the byrow argument in the matrix() call above to fill the matrix in the other direction:
samples <- matrix(samples, nrow=nrow, byrow=TRUE)
To fill more than one plate, you can use basically the same idea, but use an array instead of a matrix.
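A possible sketch of the array idea, assuming 100 made-up sample ids and 96-well plates filled column by column. The names ids, n_row, n_col, n_plates, wells and plates are invented for this example.

```r
# 100 hypothetical sample ids, which need two 96-well plates
ids <- paste0("s", 1:100)
n_row <- 8
n_col <- 12
n_plates <- ceiling(length(ids) / (n_row * n_col))

# Pad with empty strings so every well on every plate has a value
wells <- character(n_row * n_col * n_plates)
wells[seq_along(ids)] <- ids

# Third dimension = plate number; filling goes down each column first
plates <- array(wells,
                dim = c(n_row, n_col, n_plates),
                dimnames = list(LETTERS[seq_len(n_row)],
                                as.character(seq_len(n_col)),
                                paste0("plate", seq_len(n_plates))))

plates[, , "plate2"][1:5, 1:2]
```

Each slice plates[, , k] is an ordinary 8 x 12 matrix, so it can be melted or printed exactly like the single-plate matrix.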

I've never written this code in R before but it should be the same as Perl, Python or Java
For Row major order (going across) the pseudocode algorithm is simply:
for each( i : 0..totalNumWells - 1){
column = (i % numColumns)
row = ((i % totalNumWells) / numColumns)
}
Where numColumns is 12 for a 96 well plate or 24 for a 384 well plate, and totalNumWells is 96 or 384 respectively. This will give you a column and row index in 0-based coordinates which is perfect for accessing arrays.
wellName = ABCs[row], column + 1
Where ABCs is an array of all the valid letters in your plate (or A-Z). +1 is to convert 0-based into 1-based, otherwise the first well will be A0 instead of A1.
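In R the pseudocode could look roughly like this. It is only a sketch: well_name is a hypothetical helper, i is kept 0-based to match the pseudocode, and R's %% and %/% operators stand in for the modulo and the integer division.

```r
# Row-major ("across") well name for a 0-based well index i
well_name <- function(i, n_col = 12, n_wells = 96) {
  column <- i %% n_col                 # 0-based column index
  row <- (i %% n_wells) %/% n_col      # 0-based row index, wraps per plate
  paste0(LETTERS[row + 1], column + 1) # +1 converts to 1-based labels
}

well_name(0:13)
well_name(383, n_col = 24, n_wells = 384)
```

Since the function is vectorised, calling it on 0:(n - 1) gives the well names for n samples in one go.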
I also want to point out that often 384 wells aren't in row major order. I've seen most often sequencing centers preferring a "checker board" pattern A01, A03, A05... then A02, A04, A06..., B01, B03... etc. to be able to combine 4 96-well plates into a single 384 well without changing the layout and simplifying the picking robot's work. Computing the ith well for that pattern is a much harder algorithm.

The following code does what I set out to do. You can use it to make as many plates as you need, with the assumptions that whatever your import list is will be in order. It can make as many plates as you need and will add a column for "plateNumber" which will indicate which batch it's on. It can only handle 96 or 384 well plates, but that is all I deal in so that is fine.
plateLayout <- function(numOfSamples, plateFormat = 96, direction = "DOWN"){
#This assumes that each well will be filled in order.
#Calculate the number of plates required
platesRequired <- ceiling(numOfSamples/plateFormat)
rowLetter <- character(0)
colNumber <- numeric(0)
plateNumber <- numeric(0)
#define the number of columns and number of rows based on plate format (96 or 384 well plate)
switch(as.character(plateFormat),
"96" = {numberOfColumns = 12; numberOfRows = 8},
"384" = {numberOfColumns = 24; numberOfRows = 16})
#The following will work if the samples are going DOWN
if(direction == "DOWN"){
for(k in 1:platesRequired){
rowLetter <- c(rowLetter, rep(LETTERS[1:numberOfRows], length.out = plateFormat))
for(i in 1:numberOfColumns){
colNumber <- c(colNumber, rep(i, times = numberOfRows))
}
plateNumber <- c(plateNumber, rep(k, times = plateFormat))
}
plateLayout <- paste0(rowLetter, colNumber)
plateLayout <- data.frame(plateNumber,plateLayout)
plateLayout <- plateLayout[1:numOfSamples,]
return(plateLayout)
}
#The following will work if the samples are going ACROSS
if(direction == "ACROSS"){
for(k in 1:platesRequired){
colNumber <- c(colNumber, rep(1:numberOfColumns, times = numberOfRows))
for(i in 1:numberOfRows){
rowLetter <- c(rowLetter, rep(LETTERS[i], times = numberOfColumns))
}
plateNumber <- c(plateNumber, rep(k, times = plateFormat))
}
plateLayout <- paste0(rowLetter, colNumber)
plateLayout <- data.frame(plateNumber, plateLayout)
plateLayout <- plateLayout[1:numOfSamples,]
return(plateLayout)
}
}
An example of how to use this would be as follows
#load whatever data you're going to use to get a plate layout on (sample ID's or names or whatever)
thisData <- read.csv("data.csv")
#make a data.frame containing your sample names and the function's output
#alternatively you can use length() if you have a list
plateLayoutDataFrame <- data.frame(thisData$sampleNames, plateLayout(nrow(thisData), plateFormat = 96, direction = "DOWN"))
#It will return something similar to the following, depending on your selections
#data plateNumber plateLayout
#sample1 1 A1
#sample2 1 B1
#sample3 1 C1
#sample4 1 D1
#sample5 1 E1
#sample6 1 F1
#sample7 1 G1
#sample8 1 H1
#sample9 1 A2
#sample10 1 B2
#sample11 1 C2
#sample12 1 D2
#sample13 1 E2
#sample14 1 F2
#sample15 1 G2
That sums up this function for now. Roland offered a good method of doing this which is less verbose, but I wanted to avoid the use of external packages if possible. I'm working on a shiny app now which actually uses this! I want it to be able to automatically subset based on the 'plateNumber' and write each plate as its own file... for more on this, go to: Automatic multi-file download in R-Shiny

Here's how I'd do it.
put_samples_in_plates = function(sample_list, nwells=96, direction="across")
{
if(!nwells %in% c(96, 384)){
stop("Invalid plate size")
}
nsamples = nrow(sample_list)
nplates = ceiling(nsamples/nwells);
if(nwells==96){
rows = LETTERS[1:8]
cols = 1:12
}else if(nwells==384){
rows = LETTERS[1:16]
cols = 1:24
}else{
stop("Unrecognized nwells")
}
nrows = length(rows)
ncols = length(cols)
if(tolower(direction)=="down"){
single_plate_df = data.frame(row = rep(rows, times=ncols),
col = rep(cols, each=nrows))
}else if(tolower(direction)=="across"){
single_plate_df = data.frame(row = rep(rows, each=ncols),
col = rep(cols, times=nrows))
}else{
stop("Unrecognized direction")
}
single_plate_df = transform(single_plate_df,
well = sprintf("%s%02d", row, col))
toobig_plate_df = cbind(data.frame(plate=rep(1:nplates, each=nwells)),
do.call("rbind", replicate(nplates,
single_plate_df,
simplify=FALSE)))
res = cbind(sample_list, toobig_plate_df[1:nsamples,])
return(res)}
# Quick test
a_sample_list = data.frame(x=1:386, y=rnorm(386))
r.096.across = put_samples_in_plates(sample_list = a_sample_list,
nwells= 96,
direction="across")
r.096.down = put_samples_in_plates(sample_list = a_sample_list,
nwells= 96,
direction="down")
r.384.across = put_samples_in_plates(sample_list = a_sample_list,
nwells=384,
direction="across")
r.384.down = put_samples_in_plates(sample_list = a_sample_list,
nwells=384,
direction="down")
Two points worth noting in the function above:
the use of the times and each parameters within the rep function to differentiate "across" and "down" directions, and
the use of replicate to repeat the individual plate as many times as needed along with the use of a call to rbind from do.call.
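A tiny illustration of those two points, using a made-up 2 x 3 "plate" so the output is easy to check by eye. The names rows, cols, down, across, plate and stacked are only for this example.

```r
rows <- LETTERS[1:2]
cols <- 1:3

# "down": the row letters cycle fastest, each column number is repeated
down <- paste0(rep(rows, times = length(cols)), rep(cols, each = length(rows)))

# "across": the column numbers cycle fastest, each row letter is repeated
across <- paste0(rep(rows, each = length(cols)), rep(cols, times = length(rows)))

down
across

# replicate() makes a list of identical plates, do.call("rbind", ...) stacks them
plate <- data.frame(well = down)
stacked <- do.call("rbind", replicate(2, plate, simplify = FALSE))
nrow(stacked)
```

Swapping times and each between the two vectors is all that distinguishes the two fill directions.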

Combining objects across a list

I have a simple question. I have a list of objects. Each object holds a few lists. Before this gets too complicated, let me illustrate:
x = a list
x[[1]] = some object
x[[2]] = another object
...
x[[n]] = another object
And as I said, each object holds some more lists. But I'm interested in a specific list, let's call it "a".
x[[1]][[a]] = ('A': 1, 'B': 2, 'C': 3, ..., Z: 26)
Sorry for the python-like syntax! I am really just learning R. Anyway, what I want to do is combine the lists held in these objects, then take their median. To make this more clear, I want to group all 'A' elements, then take their median:
x[[1]][[a]][['A']], x[[2]][[a]][['A']], x[[3]][[a]][['A']], ..., x[[n]][[a]][['A']]
Similarly I want to group all 'B', 'C', ..., 'Z' elements and take their median...
x[[1]][[a]][['Z']], x[[2]][[a]][['Z']], x[[3]][[a]][['Z']], ..., x[[n]][[a]][['Z']]
So the question is what's the best way to do this? I've spent hours trying to figure this out! It would be great if someone could help me.
And if you would like to know what I'm actually doing, basically I have a list (x) of random forest objects. So x[[1]] is the first random forest, x[[100]] is the 100th random forest. Each random forest has a list of predicted values, which are stored in, e.g. x[[1]][['predicted']]. Each prediction list has a label associated with its predicted value. What I'm actually trying to do is calculate each label's median predicted value across all 100 random forests. And I want to do it efficiently. In Python, this is easy, but in R I'm not so sure. Anyway, thanks for the help!!! I really appreciate it.

Here's one way you could do it. It's a bit tough because you can't use rapply to subset by the names of list elements (which is frustrating). But you can unlist and then subset on names and take the median that way...
# Make some reproducible data
set.seed(1)
l <- list( a = sample(10,3) , b = sample(10,3) , c = sample(10,3) )
ll <- list( l , l , l )
# Unlist - we get a named vector but all a's have unique names - e.g. a1 , a2... an
unl <- unlist(ll)
# a1 a2 a3 b1 b2 b3 c1 c2 c3 a1 a2 a3 b1 b2 b3 c1 c2 c3 a1 a2 a3 b1 b2 b3 c1 c2 c3
# 3 4 5 10 2 8 10 6 9 3 4 5 10 2 8 10 6 9 3 4 5 10 2 8 10 6 9
# Subset by those elements that contain 'a' in their name
a.unl <- unl[ grepl("a",names(unl)) ]
# a1 a2 a3 a1 a2 a3 a1 a2 a3
# 3 4 5 3 4 5 3 4 5
# Take median
median( a.unl )
# [1] 4
To loop over multiple names try this...
sapply( c( "a" , "b" , "c" ) , function(x) median( unl[ grepl(x,names(unl) ) ] ) )
# a b c
# 4 8 9

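You could do this with a simple loop for each of A, B, C, ... (collecting into a separate vector so that the list x is not overwritten):
vals <- c()
for( i in seq_along(x) ) vals <- c( vals, x[[i]][["a"]][["A"]] )
median(vals)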

Sample data for creating your top-level list x:
x <- replicate(3, list(a = as.list(setNames(sample(1:100, 26), LETTERS)),
b = runif(10)),
simplify = FALSE)
First, extract a from each list:
a.only <- lapply(x, `[[`, "a")
Then, to compute all A through Z medians in one shot, do:
do.call(mapply, c(a.only, FUN = function(...) median(unlist(list(...)))))
# A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
# 55 59 41 21 93 72 65 74 51 42 87 25 60 40 13 77 35 31 92 51 57 37 87 67 29 46
If the sublists contain more items than you need, say you only want to compute medians on A, C, Z, do:
a.slices <- lapply(a.only, `[`, c("A", "C", "Z"))
do.call(mapply, c(a.slices, FUN = function(...) median(unlist(list(...)))))
# A C Z
# 55 41 46
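Another way to look at the same computation, offered as a sketch and not as the answerer's method, is to put all the "a" sublists side by side in a matrix and take row medians. The seed and the names vals and med are assumptions made for this example.

```r
set.seed(42)  # arbitrary seed, only so that the example is reproducible
x <- replicate(3, list(a = as.list(setNames(sample(1:100, 26), LETTERS)),
                       b = runif(10)),
               simplify = FALSE)

# One column per object in x, one row per label A..Z
vals <- sapply(x, function(obj) unlist(obj[["a"]]))

# Median of every label across all objects
med <- apply(vals, 1, median)
head(med)
```

This relies on every "a" sublist having the same labels in the same order, which holds for the sample data here.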

accessing matrix in R

I have a matrix in R as follows:
YITEMREVENUE XCARTADD XCARTUNIQADD XCARTADDTOTALRS
YITEMREVENUE 1.0000000000 -0.02630016 -0.01811156 0.0008988723
XCARTADD -0.0263001551 1.00000000 0.02955307 -0.0438881639
XCARTUNIQADD -0.0181115638 0.02955307 1.00000000 0.0917359285
XCARTADDTOTALRS 0.0008988723 -0.04388816 0.09173593 1.0000000000
I want to list, for each row, the names of the columns with negative values only. My output should look like:
YITEMREVENUE - XCARTADD XCARTUNIQADD
XCARTADD - YITEMREVENUE XCARTADDTOTALRS
XCARTUNIQADD - YITEMREVENUE
XCARTADDTOTALRS - XCARTADD
Is it possible in R?

I would first cast the matrix to a data.frame, in code this would be:
# Some example data
dat = matrix(runif(9) - 0.5, 3, 3)
dimnames(dat) = list(LETTERS[1:3], LETTERS[1:3])
> dat
A B C
A 0.1216529 0.3501861 0.47473598
B -0.4720577 0.4887181 -0.41118597
C 0.4406510 -0.2516563 0.02344829
# Cast to data.frame
library(reshape)
df = melt(dat)
df
X1 X2 value
1 A A 0.12165293
2 B A -0.47205771
3 C A 0.44065104
4 A B 0.35018605
5 B B 0.48871810
6 C B -0.25165634
7 A C 0.47473598
8 B C -0.41118597
9 C C 0.02344829
# And find the combinations of row-columns which have < 0
df[df$value < 0, c("X1","X2")]
X1 X2
2 B A
6 C B
8 B C

If your data are in a data frame called m, you can use the following :
lapply(m, function(v) {rownames(m)[v<0]})
If your data are in a matrix called m, you can use :
apply(m, 2,function(v) {rownames(m)[v<0]})
In both cases, you will get a list like this :
$YITEMREVENUE
[1] "XCARTADD" "XCARTUNIQADD"
$XCARTADD
[1] "YITEMREVENUE" "XCARTADDTOTALRS"
$XCARTUNIQADD
[1] "YITEMREVENUE"
$XCARTADDTOTALRS
[1] "XCARTADD"
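If the exact "ROW - COL COL" layout from the question is wanted, a variant based on which(..., arr.ind = TRUE) may help. This is a sketch that re-types the matrix from the question; the names vars, neg, res and lines are invented here. It also sidesteps the fact that apply() can return a matrix instead of a list when every column happens to have the same number of negative values.

```r
vars <- c("YITEMREVENUE", "XCARTADD", "XCARTUNIQADD", "XCARTADDTOTALRS")
m <- matrix(c( 1.0000000000, -0.02630016, -0.01811156,  0.0008988723,
              -0.0263001551,  1.00000000,  0.02955307, -0.0438881639,
              -0.0181115638,  0.02955307,  1.00000000,  0.0917359285,
               0.0008988723, -0.04388816,  0.09173593,  1.0000000000),
            nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

# Row and column positions of all negative cells
neg <- which(m < 0, arr.ind = TRUE)

# Column names grouped by row name, keeping the original row order
res <- split(colnames(m)[neg[, "col"]],
             factor(rownames(m)[neg[, "row"]], levels = rownames(m)))

# One line of text per row, in the format asked for
lines <- paste(names(res), "-", vapply(res, paste, character(1), collapse = " "))
cat(lines, sep = "\n")
```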
