R: Finding duplicates in a data frame and recording them in vectors

I am trying to create some lines on a graph based on a third coordinate (x, y, temp). I would like to get a vector of indices so I can split them into x and y vectors for each duplicate temperature. To make this clearer, I will include my actual data set:
[screenshot of the data frame]
I am trying to make multiple lines that share the same temp value. For example, I would like to have the following coordinates on the same line: [0,14], [0,22], [0,26], [0,28]; they all have the temp value 5.8. Once I find the duplicates, I will record the indices in a vector, which will allow me to retrieve the x and y coordinates. One other consideration is that I will not always know how many entries are going to be in the data.frame.
My question is: how can I find the duplicates and store their indices in a vector? Once I have the indices for the duplicate temps, I can grab their x/y coordinates and use them to create lines.
If you can answer my question or have any advice on how I can do this better, all help is appreciated.

Consider the following (sample.int draws at random, so your values will differ unless you set a seed first):
df <- data.frame(temp = sample.int(n = 3, size = 5, replace = TRUE))
df
  temp
1    3
2    3
3    1
4    3
5    1
duplicated(df$temp)          # TRUE for every value already seen in an earlier row
[1] FALSE  TRUE FALSE  TRUE  TRUE
which(duplicated(df$temp))   # row indices of the duplicates
[1] 2 4 5
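To then draw one line per temperature, a minimal sketch (the column names x, y, and temp are assumptions, since only the temp column is shown above):
df <- data.frame(x = c(0, 1, 2, 0, 1, 2),
                 y = c(14, 15, 16, 22, 23, 24),
                 temp = c(5.8, 5.8, 5.8, 6.0, 6.0, 6.0))  # toy stand-in data
plot(df$x, df$y, type = "n")      # set up an empty plot region
for (t in unique(df$temp)) {
  idx <- which(df$temp == t)      # indices of every row sharing this temp
  lines(df$x[idx], df$y[idx])     # one line per temperature value
}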

You've stated in the comments that you're looking to make an isopleth graph. The procedure you have described will not generate anything resembling an isopleth graph. Since it looks like your data is arranged on a regular grid, you should do something like the solutions presented in this question and answer, which use functions specifically designed for extracting contours from a grid of values. Another option is the contourLines function in the grDevices package. If you want higher-resolution, less jagged contours, you might look into the interp.surface or Krig functions from the fields package to interpolate your data to the resolution you require.
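For instance, a minimal contour sketch using the built-in volcano grid (your own grid axes and matrix of temps would take the place of x, y, and z):
z <- volcano                                  # built-in elevation grid, standing in for temps
x <- seq_len(nrow(z)); y <- seq_len(ncol(z))  # grid axes
cl <- contourLines(x, y, z, levels = c(120, 140, 160))
plot(range(x), range(y), type = "n", xlab = "x", ylab = "y")
for (ct in cl) lines(ct$x, ct$y)              # one polyline per contour segment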

Related

Alternate way to remove outliers in R

I'm looking to remove the outlier data points from the clusters after k-means clustering, and currently I do it this way in R:
1.) Plot the graph:
plot(sort(df[[1]]$var))
plot(sort(df[[2]]$var))
2.) From the graph, identify the outlier (in my case, extreme) data points:
rownames(df[[1]]) <- 1:nrow(df[[1]])
rownames(df[[2]]) <- 1:nrow(df[[2]])
3.) Open View(df[[1]]) and View(df[[2]]), sort var in descending order, note down the row index numbers of the outlier data points, and remove those rows from df[[1]] and df[[2]]:
df[[1]] <- df[[1]][-c(200,320,216), ]
df[[2]] <- df[[2]][-c(7000,1200,2320), ]
df is a list with 3 elements; df[[1]] accesses the first element/cluster.
Is there any other easy and efficient way to achieve the same?
You need to include a short, reproducible example showing what you want and what you have tried. That said, the following may give you some hints, if I'm guessing correctly what you want. Note that you can get the min/max cut values from confidence intervals or by other means.
a <- 1:40
b <- a[a %in% 4:35]  # keep 4..35; values outside that range are treated as outliers
b
length(b)  # note there are no NAs using this approach
Basically, cut off the outliers at the relevant cutoff values and graph the remaining elements.
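If you would rather derive the cutoffs from the data than hard-code them, a hedged sketch using quantiles (assuming df[[1]]$var is numeric):
v <- df[[1]]$var                                   # the clustered variable
cuts <- quantile(v, probs = c(0.01, 0.99), na.rm = TRUE)
df[[1]] <- df[[1]][v >= cuts[1] & v <= cuts[2], ]  # drop the extreme 2%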

R Plot Multiple Lines According to Choice of User in function()

I want to plot the data using a function(). My data consists of 4 vectors, say a, b, c, and d, and I need to plot whichever of them are chosen at the time.
For example, if I want to plot vectors a and c, the graph must have 2 lines; if I want to plot all 4 vectors, there must be 4 lines in the graph.
So far I have tried switch(), but I don't think it is suitable for this task.
Is it even possible to write such code in an anonymous function?
If yes, what is the right way; and if not, is there any workaround?
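One possible approach is an ordinary named function rather than an anonymous one; a minimal sketch, assuming the chosen vectors are numeric and share a length:
plot_chosen <- function(...) {
  vecs <- list(...)                  # whichever vectors the caller passes
  plot(vecs[[1]], type = "l", col = 1,
       ylim = range(unlist(vecs)), xlab = "index", ylab = "value")
  if (length(vecs) > 1)
    for (i in 2:length(vecs))
      lines(vecs[[i]], col = i)      # one extra line per chosen vector
}
a <- rnorm(10); b <- rnorm(10); c <- rnorm(10); d <- rnorm(10)  # toy data
plot_chosen(a, c)        # 2 lines
plot_chosen(a, b, c, d)  # 4 lines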

How to use pointDistance with a very large vector

I've got a big problem.
I've got a large raster (rows = 180, columns = 480, number of cells = 86,400).
First I binarized it (so that there are only 1's and 0's), and then I labelled the clusters (cells that are 1 and connected to each other got the same label).
Now I need to calculate all the distances between the cells that are NOT 0.
There are quite a lot of them, and that's my big problem.
I did this to get the coordinates of the cells I'm interested in (i.e. the cell numbers of the cells that are not 0):
V <- getValues(label)         # values of the labelled raster
Vu <- 1:max(V)                # all cluster labels (every non-zero value)
pos <- which(V %in% Vu)       # cell numbers of the non-zero cells
XY <- xyFromCell(label, pos)  # their x/y coordinates
This works very well, so XY is a matrix containing all the coordinates of the cells that are not 0. But now I'm struggling: I need to calculate the distances between ALL of these coordinates, and then put each distance into one of 43 bins. It's kind of like this (just an example):
0 < x < 0.2: bin 1
0.2 < x < 0.4: bin 2
When I use this:
pD=pointDistance(XY,lonlat=FALSE)
R says it's not possible to allocate a vector of this size; it's getting too large.
Then I thought I could do this (create an empty data frame df, or something like that, and let pointDistance run over every single row of XY):
for (i in 1:nrow(XY)) {
  pD <- pointDistance(XY, XY[i, ], lonlat = FALSE)
  pDbin <- as.matrix(table(cut(pD, breaks = seq(0, 8.6, by = 0.2), labels = 1:43)))
  df <- cbind(df, pDbin)
  df <- apply(df, 1, FUN = function(x) sum(x))
}
This works when I try it with, e.g., the first 50 rows of XY, but when I use it on the whole XY matrix it takes far too much time. (Sometimes the XY matrix contains 10,000 xy-coordinates.)
Does anyone have an idea how to do it faster?
I don't know whether this will be fast or not, but I recommend you try the following.
Let's say you have a matrix (here called dataframe) with the value 0 or 1 in each cell. To find the coordinates of the 1-cells, all you have to do is:
cord_matrix <- which(dataframe == 1, arr.ind = TRUE)
Now you have a coordinate matrix with the row index and column index of each 1-cell.
To find the Euclidean distances, use the dist() function:
dist_vector <- dist(cord_matrix)
It returns the lower triangle of the distance matrix as a 'dist' object, which can be converted to a vector or a symmetric matrix. Now all you have to do is compute the bins according to your requirements.
Let me know whether this works within your memory limits.
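A hedged end-to-end sketch on a small toy grid (the 0/1 matrix m and the bin count are stand-ins for the real raster and the 43 bins from the question):
set.seed(1)
m <- matrix(rbinom(30 * 40, 1, 0.05), nrow = 30)  # toy binary grid
cord_matrix <- which(m == 1, arr.ind = TRUE)      # row/col of every 1-cell
d <- as.vector(dist(cord_matrix))                 # all pairwise Euclidean distances
bins <- table(cut(d, breaks = 43))                # counts per distance bin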

Most efficient format for array data for R import?

I'm in the enviable position of being able to set up the format for my data collection ahead of time, rather than being handed some crazy format and having to struggle with it. I'd like to make sure I'm setting it up in a way that minimizes headaches down the road, but I'm not very familiar with importing into multidimensional arrays so I'd like input. It also seems like a thought exercise that others might get some use from.
I am compiling a large number of data summaries (500+) with 23 single data values for each experiment and two additional vectors that vary between 100 and 1500 data values (these two vectors happen to always match in length for each sample, but their length is different for each sample). I'm having to store all of these in an Excel sheet which I'm currently building. I want to set it up in a way that efficiently stores this data for import into an R array.
I'm assuming that the longer dimensions, which vary in length, should be stored at the max length (1500) with NA's filling the end, rather than trying to keep track of ragged data in Excel.
My current plan would be to store these in long form in Excel, with data labels in the first column (dim1, dim2,...), and the data summaries in each subsequent column (a, b, c...), since this saves the most space. Using a smaller number of dimensions as an example (7 single values, 2 vectors of length 1500), the data would look like this in Excel:
     a b c ...
dim1 2 5 7 ...
dim2 3 6 8 ...
dim3 6 8 2 ...
dim4 5 6 1 ...
dim5 6 2 1 ...
dim6 0 3 8 ...
dim7 8 5 4 ...
dim8 1 1 1 ...
dim8 2 2 2 ...
... (continued x1500)
dim9 4 4 4 ...
dim9 5 5 5 ...
... (continued x1500)
Can I easily import this, using the leftmost column to identify the dimensions of the array in long form? I don't see an easy way to do this using reshape2, but perhaps I'm missing something. Or do I need to have the data in paired columns?
It isn't clear to me whether this format is the most efficient way to organize this data for import into a multidimensional array, or if there is a better way. Eventually there will be a large number of samples so I'd like to think through this now rather than struggle later.
What is the most painless way to import this...or, is there a more efficient way of setting it up for easier import?
Hmm, I can't think of a case where you would have to use melt. If you keep the current format and add a heading to the 'dim' column, you should be able to work with that data fairly easily.
If you did transpose the data on 'dim', I think it would make things a lot more difficult.
It might be good to know what variable types a, b, c, etc. are, in order to make a better assessment.
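As a hedged sketch of that suggestion (the file name experiments.csv and the 'dim' column heading are hypothetical):
dat <- read.csv("experiments.csv", stringsAsFactors = FALSE)
by_dim <- split(dat[, -1], dat$dim)  # one data frame of summaries per dimension label
sapply(by_dim, nrow)                 # dim1..dim7 appear once; dim8/dim9 are the long vectors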

R: Sample into bins of predefined sizes (partition sample vector)

I'm working on a dataset that consists of ~10^6 values which are clustered into a variable number of bins. In the course of my analysis, I am trying to randomize my clustering while keeping the bin sizes constant. As a toy example (in pseudocode), this would look something like this:
data <- list(c(1,5,6,3), c(2,4,7,8), c(9), c(10,11,15), c(12,13,14))
sizes <- lapply(data, length)
for (rand in 1:no.of.randomizations) {
  rand.data <- partition.sample(seq(1, 15), partitions = sizes, replace = FALSE)
}
So, I am looking for a function like "partition.sample" that will take a vector (like seq(1,15)) and randomly sample from it, returning a list with the data partitioned into the right bin sizes given already by "sizes".
I've been trying to write such a function myself, since the task does not seem that hard. However, partitioning a vector into given bin sizes looks like it would be a lot faster and more efficient if done "under the hood", meaning probably not in native R. So I wonder whether I have just missed the name of the appropriate function, or whether someone could point me to a smart solution that is already around :-)
Your help & time are very much appreciated! :-)
Best,
Lymond
UPDATE:
By "no.of.randomizations" I mean the actual number of times I run through the whole "randomization loop". This will, later on, obviously include more steps than just the actual sampling.
Moreover, I would also be interested in a trick to do the above for sampling without replacement.
Thanks in advance, your help is very much appreciated!
Revised: this should be fairly efficient; its complexity should lie primarily in the permutation step:
# A single step (unlist(data) has 15 elements, so the last bin is x[13:15]):
x <- sample(unlist(data))
list(one = x[1:4], two = x[5:8], three = x[9], four = x[10:12], five = x[13:15])
As mentioned above, "no.of.randomizations" may be the number of repeated applications of this process, in which case you may want to wrap replicate around it:
replic <- replicate(n = 4, {
  x <- sample(unlist(data))
  list(x[1:4], x[5:8], x[9], x[10:12], x[13:15])
})
After some more thinking and googling, I have come up with a feasible solution. However, I am still not convinced that it is the fastest and most efficient way to go.
In principle, I can generate one long vector holding a unique permutation of "data" and then split it into a list of vectors of lengths "sizes" via a factor argument supplied to split(). For this, I need an additional ID scheme for my different groups of "data", which I happen to have in my case.
It becomes clearer when viewed as code:
data <- list(c(1,5,6,3), c(2,4,7,8), c(9), c(10,11,15), c(12,13,14));
sizes <- lengths(data)
So far everything is as above, except that sizes is now a plain integer vector; rep() below needs a vector, not the list that lapply() returns.
names <- c("set1", "set2", "set3", "set4", "set5");
In my case, I am lucky enough to have "names" already provided from the data. Otherwise, I would have to obtain them as (e.g.)
names <- seq(1, length(data));
This "names" vector can then be expanded by "sizes" using rep:
cut.by <- rep(names, times = sizes);
[1] 1 1 1 1 2 2 2 2 3 4 4 4 5
[14] 5 5
This new vector "cut.by" can then be provided as an argument to split():
rand.data <- split(sample(1:15, 15), cut.by)
$`1`
[1] 8 9 14 4
$`2`
[1] 10 2 15 13
$`3`
[1] 12
$`4`
[1] 11 3 5
$`5`
[1] 7 6 1
This does the job I was looking for: it samples from the background "1:15" and splits the result into vectors of lengths "sizes" via the vector "cut.by".
However, I am still not happy about having to go via an additional (possibly long) vector to indicate the split positions, such as "cut.by" in the code above. It definitely works, but for very long data vectors it could become quite slow, I guess.
Thank you anyway for the answers and pointers provided! Your help is very much appreciated :-)
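For reference, wrapping the above into the hypothetical partition.sample() the question asks for might look like this sketch:
partition.sample <- function(x, partitions, replace = FALSE) {
  sizes <- unlist(partitions)                         # accept a list or a vector of sizes
  drawn <- sample(x, size = sum(sizes), replace = replace)
  split(drawn, rep(seq_along(sizes), times = sizes))  # cut into the requested bins
}
data <- list(c(1,5,6,3), c(2,4,7,8), c(9), c(10,11,15), c(12,13,14))
partition.sample(seq(1, 15), partitions = lapply(data, length))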
