Alternate way to remove outliers in R

I'm looking to remove the outlier data points from the clusters after k-means clustering, and this is how I currently do it in R:
1.) Plot the graph:
plot(sort(df[[1]]$var))
plot(sort(df[[2]]$var))
2.) From the graph, identify the outlier (in my case, extreme) data points.
rownames(df[[1]])<-1:nrow(df[[1]])
rownames(df[[2]])<-1:nrow(df[[2]])
3.) Open View(df[[1]]) and View(df[[2]]), sort var in descending order, note down the row index numbers of the outlier data points, and remove those rows from df[[1]] and df[[2]]:
df[[1]]<-df[[1]][-c(200,320,216),]
df[[2]]<-df[[2]][-c(7000,1200,2320),]
df is a list with 3 elements; df[[1]] accesses the first element/cluster.
Is there any other easy and efficient way to achieve the same?

You need to include a short, reproducible example showing what you want and what you have tried. That said, the following may give you some hints if I'm guessing what you want correctly. Note that you can get min/max cut values from CIs or other means.
a <- 1:40
b <- a[a %in% 4:35] # Keep 4 through 35; values < 4 or > 35 are treated as outliers
b
length(b) # Note there are no NAs using this approach
Basically cut off the outliers at the relevant outlier values and graph the remaining elements.
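For instance, the same cut-off idea applied to the question's list of clusters might look like the sketch below; the list df and the column var come from the question, while the 1%/99% quantile limits are placeholder assumptions to be replaced by your own cut values:
trim_outliers <- function(cluster, lower = 0.01, upper = 0.99) {
  lims <- quantile(cluster$var, probs = c(lower, upper), na.rm = TRUE)
  # keep only the rows whose var lies inside the chosen limits
  cluster[cluster$var >= lims[1] & cluster$var <= lims[2], ]
}
df[1:2] <- lapply(df[1:2], trim_outliers)   # trim the first two clusters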

Related

changing class and getting numbers

I am working with the golub dataset in R (separated into the AML and ALL groups) and I am attempting to do a hypothesis test involving two genes. For the AML patient group, I want to find the proportion of patients who have a higher expression of gene 900 than of gene 1000, and then test whether that proportion is less than half. I have a general idea for the second part, and I had something like the following for the first part, but since the comparison gives TRUE/FALSE values I tried converting it to numeric, which gave 0s and 1s; I want the actual counts rather than the logical form.
gol.fac <- factor(golub.cl, levels = 0:1, labels = c("ALL","AML"))
x <- golub[900,gol.fac=="AML"]
y <- golub[1000,gol.fac=="AML"]
z <-golub[900,gol.fac=="AML"] > golub[1000,gol.fac=="AML"]
k <- as.numeric(z)
Use max
max(golub[900,gol.fac=="AML"], golub[1000,gol.fac=="AML"])
Or, if you want the element-wise (parallel) maximum across the two vectors, use pmax
pmax(golub[900,gol.fac=="AML"], golub[1000,gol.fac=="AML"])
Or, instead of slicing the rows twice, get the max by subsetting once:
max(golub[c(900, 1000), gol.fac == "AML"])
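Coming back to the proportion and the "less than half" test from the question, a minimal hedged sketch, assuming golub and golub.cl come from the multtest package and that a one-sided binomial test is an acceptable choice:
library(multtest)
data(golub)
gol.fac <- factor(golub.cl, levels = 0:1, labels = c("ALL", "AML"))
higher <- golub[900, gol.fac == "AML"] > golub[1000, gol.fac == "AML"]
sum(higher)    # number of AML patients with higher expression of gene 900
mean(higher)   # proportion of such patients
# one-sided test that the true proportion is below one half
# (an assumption about the intended test, not stated in the question)
binom.test(sum(higher), length(higher), p = 0.5, alternative = "less")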

Find n closest non-NA values to position t in vector

This is probably a simple question for those experienced in R, but it is something that I (a novice) am struggling with...
I have two examples of vectors that are common to the problem I am trying to solve, A and B:
A <- c(1,3,NA,3,NA,4,NA,1,7,NA,2,NA,9,9,10)
B <- c(1,3,NA,NA,NA,NA,NA,NA,NA,NA,2,NA,9)
#and three scalars
R <- 4
t <- 5
N <- 3
There is a fourth scalar, n, where 0 <= n <= N. In general, N <= R.
I want to find the n closest non-NA values to t such that they fall within a radius R centered on t. That is, the search radius R comprises R+1 values. For example, in A, the search-radius sequence is (3,NA,3,NA,4,NA,1), where the value at t (NA) is the middle value of the sequence.
The expected answer can be one of two results for A:
answerA1 <- c(3,4,1)
OR
answerA2 <- c(3,4,3)
The expected answer for B:
answerB <- c(1,3)
How would I accomplish this task in the most time- and space-efficient manner? One liners, loops, etc. are welcome. If I have to choose a preference, it is for speed!
Thanks in advance!
Note:
For this case, I understand that picking the third closest non-NA value may require a preference for whether it falls to the right or the left of t (as shown by the two possible answers above). I do not have a preference for which side it falls on, but if there is a way to leave it to random chance, that would be ideal (again, not a requirement).
A relatively short solution is:
orderedA <- A[order(abs(seq_len(length(A)) - t))][seq_len(R*2)]
n_obj <- min(N, length(na.omit(orderedA)))  # how many values can actually be returned
res <- na.omit(orderedA)[seq_len(n_obj)]
res
#[1] 3 4 3
Breaking this down a little more the steps are:
Order A, by the absolute distance from the position of interest, t.
Code is: A[order(abs(seq_len(length(A)) - t))]
Subset to the first R*2 elements (so this will get the elements on either side of t within R).
Code is: [seq_len(R*2)]
Get the first min(N, number of non-NA values) elements
Code is: min(N, length(na.omit(orderedA)))
Drop NA
Code is: na.omit()
Take first elements determined in step 3 (whichever is smaller)
Code is: [seq_len(n_obj)]
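As a quick check, the same steps applied to the question's B (with the same t, R and N) return the only two non-NA values inside the window, matching answerB up to ordering:
orderedB <- B[order(abs(seq_len(length(B)) - t))][seq_len(R*2)]
na.omit(orderedB)[seq_len(min(N, length(na.omit(orderedB))))]
#[1] 3 1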
Something like this?
thingfinder <- function(A, R, t, n) {
  left  <- A[t:(t-R-1)]                   # values from t going left
  right <- A[t:(t+R+1)]                   # values from t going right
  leftrightmat <- cbind(left, right)
  raw_ans <- as.vector(t(leftrightmat))   # interleave left/right, nearest values first
  ans <- raw_ans[!is.na(raw_ans)]         # drop NAs
  return(ans[1:n])
}
thingfinder(A=c(1,3,NA,3,NA,4,NA,1,7,NA,2,NA,9,9,10), R=3, t=5, n=3)
## [1] 3 4 3
This would give priority to the left side, of course.
In case it is helpful to others, @Mike H. also provided me with a solution to return the index positions associated with the desired vector elements res:
A <- setNames(A, seq_len(length(A)))
orderedA <- A[order(abs(seq_len(length(A)) - t))][seq_len(R*2)]
n_obj <- min(N, length(na.omit(orderedA)))
res <- na.omit(orderedA)[seq_len(n_obj)]
positions <- as.numeric(names(res))
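On the question's side note about leaving the left/right choice to random chance, one hedged option (not part of either answer) is to jitter the integer distances by less than 1 before ordering, so that only ties get reshuffled:
set.seed(1)   # only for reproducibility; any seed (or none) will do
d <- abs(seq_along(A) - t)
orderedA <- A[order(d + runif(length(A), 0, 0.5))][seq_len(R*2)]
na.omit(orderedA)[seq_len(min(N, length(na.omit(orderedA))))]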

R: Finding duplicates in a data frame and recording them in vectors

I am trying to create some lines on a graph based on a third coordinate (x,y, temp). I would like to get a vector of indexes so I can split them into x and y vectors for each duplicate temperature. To make this more clear, I will include my actual data set:
[screenshot of the data frame omitted]
I am trying to make multiple lines that have the same temp value. For example, I would like to have the following coordinates on the same line [0,14] [0,22] [0,26] [0,28]. They all have the temp value of 5.8. Once I find the duplicates, I will record the indexes in a vector which will allow me to retrieve the x and y coordinates. One other aspect is that I will not always know how many entries are going to be in the data.frame.
My question is how can I find the duplicates and store their indices in a vector? Once I have the indices for the duplicate temps, I can be sure to grab their x y coordinates and use that to create lines.
If you can answer my question or have any advice on how I can do this better, all help is appreciated.
Consider the following:
df <- data.frame(temp = sample.int(n=3, size=5, replace=T))
df
  temp
1    3
2    3
3    1
4    3
5    1
duplicated(df$temp)
[1] FALSE TRUE FALSE TRUE TRUE
which(duplicated(df$temp))
[1] 2 4 5
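Building on that, a hedged sketch of how the duplicate indices could be turned into one line per temperature; the column names x, y and temp (and the made-up second temperature group) are assumptions about the questioner's data:
dat <- data.frame(x    = c(0, 0, 0, 0, 1, 1),
                  y    = c(14, 22, 26, 28, 14, 22),
                  temp = c(5.8, 5.8, 5.8, 5.8, 6.2, 6.2))
dup_idx <- which(dat$temp %in% dat$temp[duplicated(dat$temp)])  # rows sharing a temp
plot(dat$x, dat$y, type = "n")                      # empty canvas
for (g in split(dat[dup_idx, ], dat$temp[dup_idx])) {
  lines(g$x, g$y)                                   # one line per duplicated temp value
}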
You've stated in the comments that you're looking to make an isopleth graph. The procedure you have described will not generate anything resembling an isopleth graph. Since it looks like your data is arranged in a regular grid, you should do something like the solutions presented in this question and answer, which use functions specifically designed for extracting contours from a grid of values. Another option is the contourLines function in the grDevices package. If you want higher-resolution, less jagged contours, you might look into using either the interp.surface or Krig functions from the fields package to interpolate your data to the resolution you require.
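For illustration, a minimal sketch of that contourLines approach on a built-in regular grid (the volcano matrix stands in for the questioner's gridded temperature values):
z  <- volcano                                     # built-in matrix of grid values
cl <- contourLines(x = seq_len(nrow(z)), y = seq_len(ncol(z)), z = z)
plot(NA, xlim = c(1, nrow(z)), ylim = c(1, ncol(z)), xlab = "x", ylab = "y")
for (ln in cl) lines(ln$x, ln$y)                  # one line per contour level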

In R: sort the maximum dissimilarity between rows in a matrix

I have a matrix which includes 100 rows and 10 columns. I want to compare the dissimilarity between the rows, sort them, and then select the 10 most dissimilar rows. Which method can I use?
set.seed(123)
mat <- matrix(runif(100 * 10), nrow = 100, ncol = 10)
My initial method is to calculate the similarity between two rows (e.g. the Tanimoto coefficient or others: http://en.wikipedia.org/wiki/Jaccard_index ), take dissimilarity = 1 - similarity, and then compare the dissimilarity values. Finally, I would sort all the dissimilarity values and select the 10 largest. But it seems that the result would be a 100 * 100 matrix, so a more efficient method may be needed when there are a large number of rows. However, this is just my thought, maybe not right, so I need help.
[update]
After looking through some literature, I found the following definition of the maximum dissimilarity method:
Maximum dissimilarity method: It begins by randomly choosing a data record as the first cluster center. The record maximally distant from the first point is selected as the next cluster center. The record maximally distant from both current points is selected after that. The process repeats itself until there is a sufficient number of cluster centers.
Here in my question, the sufficient number should be 10.
Thanks.
First of all, the Jaccard index is not right for you. From the Wikipedia page:
The Jaccard coefficient measures similarity between finite sample sets...
Your matrix has samples of floats, so you have a different problem (note that the Index in question is defined in terms of intersections; that should be a red flag right there :-).
So, you have to decide what you mean by dissimilarity. One natural interpretation would be to say row A is more dissimilar from the data set than row B if it has a greater Euclidean distance to the center of mass of the data set. You can think of the center of mass of the data set as the vector you get by taking the mean of each of the columns and putting them together (apply(mat, 2, mean)).
With this, you can take the distance of each row to that central vector, and then get an ordering on those distances. From that you can work back to the rows you desire from the original matrix.
All together:
center <- apply(mat, 2, mean)
# not quite the distances, actually, but their squares. That will work fine for us though, since the order
# will still be the same
dists <- apply(mat, 1, function(row) sum((row - center) ** 2))
# this gives us the row indices in order of least to greatest dissimilarity
dist.order <- order(dists)
# Now we just grab the 10 most dissimilar of those
most.dissimilar.ids <- dist.order[91:100]
# and use them to get the corresponding rows of the matrix
most.dissimilar <- mat[most.dissimilar.ids,]
If I was actually writing this, I probably would have compressed the last three lines as most.dissimilar <- mat[order(dists)[91:100],], but hopefully having it broken up like this makes it a little easier to see what's going on.
Of course, if distance from the center of mass doesn't make sense as the best way of thinking of "dissimilarity" in your context, then you'll have to amend with something that does.
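Alternatively, if the greedy selection described in the question's update is what is wanted, here is a hedged sketch using plain Euclidean distances via dist() (the function name and the distance choice are my assumptions, not part of the answer above):
max_dissim_select <- function(mat, k = 10) {
  d <- as.matrix(dist(mat))              # pairwise Euclidean distances between rows
  chosen <- sample(nrow(mat), 1)         # start from a randomly chosen row
  while (length(chosen) < k) {
    remaining <- setdiff(seq_len(nrow(mat)), chosen)
    # pick the row whose minimum distance to the already chosen rows is largest
    min_d <- apply(d[remaining, chosen, drop = FALSE], 1, min)
    chosen <- c(chosen, remaining[which.max(min_d)])
  }
  chosen
}
rows <- max_dissim_select(mat, k = 10)   # using mat from the question
mat[rows, ]                              # the 10 mutually most dissimilar rows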

Seqfplot: percentage vs. number of most frequent sequences?

I'm using the R packages TraMineR to compute and analyze state sequences.
I would like to obtain a sequence frequency plot using the seqfplot command. However, instead of setting the number of the most frequent sequences to be plotted using
seqfplot(mydata.seq, tlim=1:20)
it would be useful to set the percentage of the most frequent sequences needed to reach - for example - 50% of the sample. I tried
seqfplot(mydata.seq, trep = 0.5)
but - unlike seqrep.grp and seqrep - the seqfplot command does not support the trep option. Should I create a new function to do that?
Thank you.
You are right: trep is an argument of the TraMineR seqrep function, which looks for representative sequences covering at least a proportion trep of all sequences.
If you specifically want the most frequent sequence patterns such that their cumulated percent frequency is, say, 50%, then you have to compute the selection filter yourself. Here is how you can do that using the biofam data.
library(TraMineR)
data(biofam)
bf.seq <- seqdef(biofam[,10:25])
## first retrieve the "Percent" column of the frequency table provided
## as the "freq" attribute of the object returned by the seqtab function.
bf.freq <- seqtab(bf.seq, tlim=nrow(bf.seq))
bf.tab <- attr(bf.freq,"freq")
bf.perct <- bf.tab[,"Percent"]
## Compute the cumulated percentages
bf.cumsum <- cumsum(bf.perct)
## Now we can use the cumulated percentage to select
## the wanted patterns
bf.freq50 <- bf.freq[bf.cumsum <= 50,]
## And to plot the frequent patterns
(nfreq <- length(bf.cumsum[bf.cumsum <= 50]))
seqfplot(bf.seq, tlim=1:nfreq)
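If this selection is needed repeatedly, it could be wrapped in a small helper; the function name seqfplot_cov is my own, and the body simply reuses the calls shown above:
seqfplot_cov <- function(seqdata, coverage = 50, ...) {
  freq  <- seqtab(seqdata, tlim = nrow(seqdata))   # full frequency table of patterns
  perct <- attr(freq, "freq")[, "Percent"]
  nfreq <- sum(cumsum(perct) <= coverage)          # patterns needed to reach coverage %
  seqfplot(seqdata, tlim = seq_len(nfreq), ...)
}
seqfplot_cov(bf.seq, coverage = 50)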
Hope this helps.
