I am stuck on this problem and would appreciate some advice. I have the following data.frame:
c1 <- factor(c("a","a","a","a"))
c2 <- factor(c("b","b","y","b"))
c3 <- factor(c("c","y","z","c"))
c4 <- factor(c("y","z","","y"))
c5 <- factor(c("z","","","z"))
x <- data.frame(c1,c2,c3,c4,c5)
So this data looks like this:
  c1 c2 c3 c4 c5
1  a  b  c  y  z
2  a  b  y  z
3  a  y  z
4  a  b  c  y  z
So in each row there is a sequence of varying length of a, b, c values that concludes with values for y and z. What I need to do is move the values y and z each into a separate column that I can work with, so the data looks like this:
  c6 c7 c8 c9 c10
1  a  b  c  y   z
2  a  b     y   z
3  a        y   z
4  a  b  c  y   z
I have worked out how to identify the length of each sequence per row (on my real data frame, paths) and added that as a column, so I know which columns y and z are located in:
x$not.na <- apply(paths, 1, function(x) length(which(!x=="")))
But I am stuck on how to loop(?) over each row to perform the necessary cut and paste of z and y.
Something like this:
lastTwoToEnd <- function(x){
  # positions of the last two non-blank entries in the row
  i <- sum(x != "") - 1:0
  # move those two entries to the end, keeping the rest in order
  x[c(setdiff(seq_along(x), i), i)]
}
data.frame(t(apply(x, 1, lastTwoToEnd)))
##   X1 X2 X3 X4 X5
## 1  a  b  c  y  z
## 2  a  b     y  z
## 3  a        y  z
## 4  a  b  c  y  z
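Note that apply() converts the data frame to a character matrix, so the columns of the result are character rather than factors. If you want factors back, one option (a sketch, assuming x is still the original five-column example frame) is:
res <- data.frame(t(apply(x, 1, lastTwoToEnd)), stringsAsFactors = TRUE)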
When downloading lexisnexis newspaper articles, there's often a lot of duplicating articles in the corpus. I want to remove them and I was thinking of doing so by using cosine similarity statistics, but I'm not sure how to automate this. Any ideas?
Your question is fairly thin on details - such as a reproducible example - but it's an interesting question and challenge. So here goes.
Let's say we have a corpus consisting of two sets of similar documents, { (a1, a2, a3), (b1, b2) } where the letters indicate similarity. We want to keep just one document when the others are "duplicates", defined as similarity exceeding a threshold, say 0.80.
We can use textstat_simil() to generate a similarity matrix, form the pairwise sets directly from the returned dist object, and then keep just one document from each similar set.
library("quanteda")
# Loading required package: quanteda
# Package version: 1.3.14
mydocs <- c(a1 = "a a a a a b b c d w g j t",
            b1 = "l y y h x x x x x y y y y",
            a2 = "a a a a a b c s k w i r f",
            b2 = "p q w e d x x x x y y y y",
            a3 = "a a a a a b b x k w i r f")
mydfm <- dfm(mydocs)
(sim <- textstat_simil(mydfm))
# a1 b1 a2 b2
# b1 -0.22203788
# a2 0.80492203 -0.23090513
# b2 -0.23427416 0.90082239 -0.28140219
# a3 0.81167608 -0.09065452 0.92242890 -0.12530944
# create a data.frame of the unique pairs and their similarities
sim_pair_names <- t(combn(docnames(mydfm), 2))
sim_pairs <- data.frame(sim_pair_names,
                        sim = as.numeric(sim),
                        stringsAsFactors = FALSE)
sim_pairs
# X1 X2 sim
# 1 a1 b1 -0.22203788
# 2 a1 a2 0.80492203
# 3 a1 b2 -0.23427416
# 4 a1 a3 0.81167608
# 5 b1 a2 -0.23090513
# 6 b1 b2 0.90082239
# 7 b1 a3 -0.09065452
# 8 a2 b2 -0.28140219
# 9 a2 a3 0.92242890
# 10 b2 a3 -0.12530944
Subsetting this on our threshold condition, we can extract the names of the unlucky documents to be dropped, and feed this to a logical condition in dfm_subset().
# set the threshold for similarity
threshold <- 0.80
# discard one of the pair if similarity > threshold
todrop <- subset(sim_pairs, select = X1, subset = sim > threshold, drop = TRUE)
todrop
# [1] "a1" "a1" "b1" "a2"
# then subset the dfm, keeping only the "keepers"
dfm_subset(mydfm, !docnames(mydfm) %in% todrop)
# Document-feature matrix of: 2 documents, 20 features (62.5% sparse).
# 2 x 20 sparse Matrix of class "dfm"
# features
# docs a b c d w g j t l y h x s k i r f p q e
# b2 0 0 0 1 1 0 0 0 0 4 0 4 0 0 0 0 0 1 1 1
# a3 5 2 0 0 1 0 0 0 0 0 0 1 0 1 1 1 1 0 0 0
Other solutions to this problem of similar documents would be to form them into clusters, or to reduce the document matrix using principal components analysis, along the lines of latent semantic analysis.
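For example, a clustering variant could reuse the similarity matrix computed above (a sketch only, not part of the original answer; the cut height of 0.20 corresponds to the 0.80 similarity threshold):
# convert similarities to distances and cluster the documents
d <- as.dist(1 - as.matrix(sim))
cl <- hclust(d, method = "average")
# documents with similarity above 0.80 end up in the same group
groups <- cutree(cl, h = 0.20)
# keep only the first document of each group
dfm_subset(mydfm, !duplicated(groups))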
You already received some excellent answers. But if you prefer a more automated approach targeted at your specific use case, you can use the package LexisNexisTools (which I wrote). It comes with a function called lnt_similarity(), which does exactly what you were looking for. I wrote a quick tutorial with mock data here.
The main difference between the solutions here and in lnt_similarity() is that I also take into account word order, which can make a big difference in some rare cases (see this blog post).
I also suggest you think carefully about thresholds, as you might otherwise remove some articles wrongfully. I included a function, lnt_diff(), to visualize the difference between two articles so you can get a better grip on the data you are removing.
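A rough sketch of what that workflow might look like, based on my recollection of the package README (the file name is made up, and the slot and argument names are assumptions that may differ across package versions):
library(LexisNexisTools)
# read a LexisNexis export into an LNToutput object (hypothetical file name)
LNToutput <- lnt_read("LexisNexis_export.TXT")
# similarity between articles, taking word order into account
dupes <- lnt_similarity(texts = LNToutput@articles$Article,
                        dates = LNToutput@meta$Date,
                        IDs   = LNToutput@articles$ID)
# inspect borderline pairs before settling on a threshold
lnt_diff(dupes, min = 0.18, max = 0.30)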
If you have thousands of documents, it takes a lot of space in your RAM to save all the similarity scores, but you can set a minimum threshold in textstat_proxy(), the underlying function of textstat_simil().
In this example, cosine similarity scores smaller than 0.9 are all ignored.
library("quanteda")
mydocs <- c(a1 = "a a a a a b b c d w g j t",
            b1 = "l y y h x x x x x y y y y",
            a2 = "a a a a a b c s k w i r f",
            b2 = "p q w e d x x x x y y y y",
            a3 = "a a a a a b b x k w i r f")
mydfm <- dfm(mydocs)
(sim <- textstat_proxy(mydfm, method = "cosine", min_proxy = 0.9))
# 5 x 5 sparse Matrix of class "dsTMatrix"
# a1 b1 a2 b2 a3
# a1 1 . . . .
# b1 . 1.0000000 . 0.9113423 .
# a2 . . 1.0000000 . 0.9415838
# b2 . 0.9113423 . 1.0000000 .
# a3 . . 0.9415838 . 1.0000000
matrix2list <- function(x) {
  # x is a sparse similarity matrix; @x holds the stored values,
  # @i and @j their zero-based row and column indices
  names(x@x) <- rownames(x)[x@i + 1]
  split(x@x, factor(x@j + 1, levels = seq(ncol(x)), labels = colnames(x)))
}
matrix2list(sim)
# $a1
# a1
# 1
#
# $b1
# b1
# 1
#
# $a2
# a2
# 1
#
# $b2
# b1 b2
# 0.9113423 1.0000000
#
# $a3
# a2 a3
# 0.9415838 1.0000000
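To turn that list into a de-duplicated dfm, one option (a sketch reusing sim and mydfm from above, not part of the original answer) is to drop every document whose entry contains a similar document besides itself:
sim_list <- matrix2list(sim)
# documents whose entry lists another similar document besides themselves
todrop <- names(Filter(function(v) length(v) > 1, sim_list))
todrop
# [1] "b2" "a3"
dfm_subset(mydfm, !docnames(mydfm) %in% todrop)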
See https://koheiw.net/?p=839 for the performance differences.
I have some sequencing data for some biological samples. The first 7 columns of the file contain characters, as they hold gene names, codes, etc. From the 8th column onward are my samples, which contain count data, i.e. a number assigned to each gene depending on how much of that gene is present in a given sample.
The problem is that the CSV file I have contains non-integer values and so I need to convert them into integers (as.integer).
This works absolutely fine if I delete the columns that contain the gene information and keep a matrix with only the values! However, I need the gene information and therefore the columns that contain it, but if I carry out as.integer on the entire data frame, any characters get returned as NA and I lose all of this information!
I am guessing I should keep the first 7 columns as characters, or apply the as.integer function from the 8th column up to the last, but I am struggling to think of the code to do this!
Try using lapply() to apply as.integer() to all except the first 7 columns?
df[, -seq(1, 7)] <- lapply(df[, -seq(1, 7)], as.integer)
#result
> df
c1 c2 c3 c4 c5 c6 c7 c8 c9
1 G F Y M V M X 104 13
2 J E F O Q V H 67 11
3 N Q P L S K L 107 -13
4 U I C E M F Y 102 -14
5 E X Z S L B O 129 7
6 S K I Y Y C F 125 15
7 W O A P A G J 55 -2
8 M S H C J J V 30 17
9 L G X N N L B 129 7
10 B N V G Z T S 99 -12
Sample data:
set.seed(1)
df <- data.frame(
  c1 = sample(LETTERS, 10),
  c2 = sample(LETTERS, 10),
  c3 = sample(LETTERS, 10),
  c4 = sample(LETTERS, 10),
  c5 = sample(LETTERS, 10),
  c6 = sample(LETTERS, 10),
  c7 = sample(LETTERS, 10),
  c8 = rexp(10, rate = 0.01),
  c9 = rnorm(10, sd = 20)
)
> df
c1 c2 c3 c4 c5 c6 c7 c8 c9
1 G F Y M V M X 104.94389 13.939268
2 J E F O Q V H 67.88807 11.133264
3 N Q P L S K L 107.98811 -13.775114
4 U I C E M F Y 102.82469 -14.149903
5 E X Z S L B O 129.22616 7.291639
6 S K I Y Y C F 125.31054 15.370658
7 W O A P A G J 55.46414 -2.246924
8 M S H C J J V 30.12830 17.622155
9 L G X N N L B 129.31247 7.962118
10 B N V G Z T S 99.45558 -12.240528
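A tidyverse alternative, in case you prefer dplyr (a sketch, assuming the count columns are the only numeric ones, as in the sample data above):
library(dplyr)
# convert the numeric sample columns to integer, leaving the character
# gene columns untouched; as.integer() truncates toward zero
df <- df %>% mutate(across(where(is.numeric), as.integer))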
I have a large database from which I have extracted a data value (x) using the aggregate function:
library(plotrix)
aggregate(mydataNC[, 52], by = list(patientNC, siteNC, supNC), max)
OUTPUT:
Each (x) value has a corresponding distance value located in a column titled (dist) in this database.
What is the easiest way to extract the dist value and add it to the table?
I'd probably start with merge() first. Here's a small reproducible example you can use to see what's going on and modify it to use your data:
# generate bogus data and view it
x1 <- rep(c("A", "B", "C"), each = 4)
x2 <- rep(c("E", "E", "F", "F"), times = 3)
y1 <- rnorm(12)
y2 <- rnorm(12)
md <- data.frame(x1, x2, y1, y2)
> head(md)
x1 x2 y1 y2
1 A E -1.4603164 -0.9662473
2 A E -0.5247227 1.7970341
3 A F 0.8990502 1.7596285
4 A F -0.6791145 2.2900357
5 B E 1.2894863 0.1152571
6 B E -0.1981511 0.6388998
# aggregate by taking maximum of each unique (x1, x2) combination
md.agg <- with(md, aggregate(y1, by = list(x1, x2), FUN = max))
names(md.agg) <- c("x1", "x2", "y1")
> md.agg
x1 x2 y1
1 A E -0.5247227
2 B E 1.2894863
3 C E 0.9982510
4 A F 0.8990502
5 B F 2.5125956
6 C F -0.5916491
# merge y2 into the aggregated data
md.final <- merge(md, md.agg)
> md.final
x1 x2 y1 y2
1 A E -0.5247227 1.7970341
2 A F 0.8990502 1.7596285
3 B E 1.2894863 0.1152571
4 B F 2.5125956 -0.2217510
5 C E 0.9982510 0.6813261
6 C F -0.5916491 1.0348518
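The reason this works is that merge() joins on all of the columns the two data frames have in common (here x1, x2 and y1), so only the rows of md whose y1 equals the group maximum survive, and their y2 values come along. If you would rather do it in one step without aggregate(), here is a base R sketch of the same idea (not part of the original answer):
# sort so the maximum y1 of each (x1, x2) group comes first, then keep that row
md.ord <- md[order(md$x1, md$x2, -md$y1), ]
md.final2 <- md.ord[!duplicated(md.ord[c("x1", "x2")]), ]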
I have two data frames x and y:
> x <- data.frame(name = c("foo","bar"), c1 = c(0.1,0.2), c2=c("y","w"))
> x
name c1 c2
1 foo 0.1 y
2 bar 0.2 w
> y <- data.frame(name = c("foo","bar","qux"), c1 = c(0.3,0.2,0.8), c2=c("k","w","z"))
> y
name c1 c2
1 foo 0.3 k
2 bar 0.2 w
3 qux 0.8 z
In reality there can be more columns than just c2.
What I want to do is merge them so that the result is this:
name c1 c2
foo 0.1 y
bar 0.2 w
qux 0.8 z
Note that when merging, if there are two rows with the same name but different c1 values,
we pick the one with the lowest c1, regardless of the values in c2, c3, c4, .... How can I achieve that?
I tried the command merge(x, y, by = 'name') but it didn't work as I expected.
unique.data.table has a by argument that you can use for this.
Combine it with order(c1) so that the "first" element for each name is also the one with the minimum c1:
library(data.table)
x <- data.table(x, key = "name")
y <- data.table(y, key = "name")
xy <- merge(x, y, all=TRUE)
unique(xy[order(c1)], by="name")
# name c1 c2
# 1: foo 0.1 y
# 2: bar 0.2 w
# 3: qux 0.8 z
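For comparison, a dplyr sketch of the same idea (an assumption on my part, not from the answer above; it needs dplyr >= 1.0 and works on the original data frames, where bind_rows() will coerce c2 to character with a warning if the factor levels differ):
library(dplyr)
bind_rows(x, y) %>%
  group_by(name) %>%
  slice_min(c1, n = 1, with_ties = FALSE) %>%  # keep the row with the lowest c1 per name
  ungroup()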
I have a data set from which I would like to remove the rows that have duplicate information across several columns (y1-y6 below).
foo <- data.frame(g1 = c("1","0","0","1","1"),
                  v1 = c("7","5","4","4","3"),
                  v2 = c("a","b","x","x","e"),
                  y1 = c("y","c","f","f","w"),
                  y2 = c("y","y","y","f","c"),
                  y3 = c("y","c","c","f","w"),
                  y4 = c("y","y","f","f","c"),
                  y5 = c("y","w","f","f","w"),
                  y6 = c("y","c","f","f","w"))
foo then looks like:
g1 v1 v2 y1 y2 y3 y4 y5 y6
1 1 7 a y y y y y y
2 0 5 b c y c y w c
3 0 4 x f y c f f f
4 1 4 x f f f f f f
5 1 3 e w c w c w w
Now, I want to remove any row that has duplicated data based on the y1-y6 columns. So only rows 1 and 4 would be removed if done properly, based on all y variables in the row being exactly the same. It's a multiple-column condition.
I believe I am close, but it's just not working correctly.
I have tried: new = foo[!(duplicated(foo[,1:6]))]
thinking that the duplicated command would search and only find those that matched exactly.
I thought about using a conditional statement with &, but can't figure out how to do that either.
new = foo[foo$y1==foo$y2|foo$y3|foo$y4|foo$y5|foo$y6]
I thought about which() but I'm now overwhelmed and lost. I would expect foo to look like:
g1 v1 v2 y1 y2 y3 y4 y5 y6
2 0 5 b c y c y w c
3 0 4 x f y c f f f
5 1 3 e w c w c w w
> foo[apply(foo[ , paste("y", 1:6, sep = "")], 1,
FUN = function(x) length(unique(x)) > 1 ), ]
g1 v1 v2 y1 y2 y3 y4 y5 y6
2 0 5 b c y c y w c
3 0 4 x f y c f f f
5 1 3 e w c w c w w
The following alternatives assume foo has been reduced to just the y columns (e.g. foo <- foo[, paste0("y", 1:6)]), which is why their output shows only y1 to y6:
foo[apply(foo, 1, function(x) any(x != x[1])), ]
> foo[ !rowSums( apply( foo[2:6], 2, "!=", foo[1] ) )==0, ]
y1 y2 y3 y4 y5 y6
2 c y c y w c
3 f y c f f f
5 w c w c w w
> foo[ ! colSums( apply( foo, 1, duplicated, foo[1] ) ) == 5, ]
y1 y2 y3 y4 y5 y6
2 c y c y w c
3 f y c f f f
5 w c w c w w
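A dplyr version of the same filter, working on the original nine-column foo (a sketch, assuming character columns and dplyr >= 1.0.4 for if_any()):
library(dplyr)
# keep rows where at least one of y2-y6 differs from y1
foo %>% filter(if_any(y2:y6, ~ .x != y1))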