Filtering grouped data in R

I was wondering if anyone can help with grouping the data below. I'm trying to use the subset function to filter out volumes below a certain threshold, but because the data represent groups of objects, this removes certain items that should be kept.
In column F (and I) you can see Blue, Red, and Yellow objects. Each represents one of three separate colored probes on one DNA strand. Odd-numbered or non-numbered Blue, Red, and Yellow objects are paired with a homologous strand represented by an even-numbered Blue, Red, and Yellow, i.e. the data in rows 2, 3, and 4 are one "group" and pair with the "group" shown in rows 5, 6, and 7. This then repeats, so rows 8, 9, and 10 are a new group, and that group pairs with the one in rows 11, 12, and 13.
What I would like to do is subset the groups so that only those below a certain distance to midpoint (column M) are kept. The midpoint here is the midpoint of the line that connects the blue of one group with the blue of its partner, so the subset should only apply to the blue distance to midpoint, and that is where I'm having a problem. For instance, if I ask to keep blue distances to midpoint that are less than 3, then the objects in rows 3 and 4 should be kept because they are part of the group whose blue distance is below 3. Right now, though, when I filter with the subset function I lose the Red and Yellow selections. I'm confident there is a straightforward solution to this in R, but I'd also be open to some type of filtering in Excel if anyone has suggestions via that route instead.
EDIT
I managed to work something out in Excel last night after posting the question. The solution isn't pretty, but it works well enough. I just added a new column next to "distance to midpoint" that gives all the objects in one group the same distance, so that when I filter the data I won't lose any objects that I shouldn't. In case it helps anyone in the future, the formula I used in Excel was:
=SQRT((INDEX($B$2:$B$945,1+QUOTIENT(ROWS(B$2:B2)-1,3)*3)-INDEX($O$2:$O$945,1+QUOTIENT(ROWS(O$2:O2)-1,3)*3))^2 + (INDEX($C$2:$C$945,1+QUOTIENT(ROWS(C$2:C2)-1,3)*3)-INDEX($P$2:$P$945,1+QUOTIENT(ROWS(P$2:P2)-1,3)*3))^2 + (INDEX($D$2:$D$945,1+QUOTIENT(ROWS(D$2:D2)-1,3)*3)-INDEX($Q$2:$Q$945,1+QUOTIENT(ROWS(Q$2:Q2)-1,3)*3))^2)
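(For completeness, here is a rough R sketch of the same idea, untested; it assumes the rows come in fixed blocks of three with the blue object first, and that the table is a data frame called mydata with a distance_to_midpoint column; both names are just placeholders.)
# give every row in a block of 3 the blue row's distance, then subset whole blocks
grp <- rep(seq_len(nrow(mydata) / 3), each = 3)        # block id for every row
blue_dist <- ave(mydata$distance_to_midpoint, grp,
                 FUN = function(x) x[1])               # copy the blue row's value to its block
kept <- subset(mydata, blue_dist < 3)                  # keep whole blocks below the threshold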

Would be easier with a reproducible example, but here's a (hacky) dplyr solution:
library(dplyr)
filterframe <- function(df, threshold) {
  # label each consecutive block of 6 rows (a group of 3 plus its partner group)
  df$grouper <- rep(seq_len(nrow(df) / 6), each = 6)
  df %>%
    group_by(grouper) %>%
    filter(first(distance_to_midpoint) < threshold) %>%  # blue row comes first in each block
    ungroup()
}
filterframe(mydata, threshold = 3)

A base R solution is provided below. The idea is that once your data are in R, you (edit) keep! rows iff they meet two criteria. First, the Surpass column has to contain the word "blue" in it, which is checked with the grepl function. Second, the distance must be below a certain threshold (set arbitrarily by thresh).
fakeData <- data.frame(Surpass = c('blue', 'red', 'green', 'blue'),
                       distance = c(1, 2, 5, 3), num = c(90, 10, 9, 4))
# thresh is your distance threshold
thresh <- 2
# keep rows that mention "blue" AND fall below the threshold
fakeDataKeep <- fakeData[which(grepl('blue', fakeData$Surpass)
                               & fakeData$distance < thresh), ]
There's probably also a quick dplyr solution using filter, but I haven't fully explored the functionality there. Also, I may be a bit confused about whether you also want to keep the other colors. If so, that's the same as saying you want to remove the blue ones exceeding a certain distance threshold, in which case you would just use a -which command and turn the < operator into a > operator.
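For what it's worth, a minimal dplyr sketch of that filter idea (assuming dplyr is installed; same fakeData and thresh as above):
library(dplyr)
# keep only the blue rows below the threshold (same result as the base R line above)
fakeDataKeep <- fakeData %>%
  filter(grepl('blue', Surpass), distance < thresh)
# the inverse: drop blue rows at or above the threshold, keep everything else
fakeDataOthers <- fakeData %>%
  filter(!(grepl('blue', Surpass) & distance >= thresh))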

Related

Give different color distribution for different columns in a data.frame

I tried to build a heatmap for the cluster result of my data.frame. My data.frame has 5 columns with corresponding row names. I want to know whether I could set the color distribution separately for the different columns, since the ranges of my 5 variables are so different that, if I don't scale them, the result from the "pheatmap" function in R is a heatmap with only one or two colors. I really don't want to scale the data, since I need the sign (positive or negative) of each data point to remain what it should be. Here's the head of my data.frame, with the row names omitted.
r.Square_gamma_logLink cof_glm.gamma_logLink int_glm.gamma_logLink estimated_shape_logLink estimated_dispersion_logLink
             0.2524970           0.002357581              8.685446                3.558583                    0.2810107
             0.5932941           0.002651972              9.486916                8.085618                    0.1236764
             0.3615135          -0.001646538             10.071672                6.195176                    0.1614159
             0.4131553          -0.002218262             10.563557                8.671028                    0.1153266
             0.3529775          -0.002336544             10.984005                4.569396                    0.2188473
             0.4169932           0.002213259              9.602592                5.216084                    0.1917147
I did try to use the pheatmap and heatmap functions, which were not quite useful; the result looks pretty much like this.

Assign integral value to list of relative values

I have an assortment of syrups, each of which has a value: the amount of sugar per volume. As people blend these syrups, I track which ones are used, and I created a table to get a relative weight of each blend. I understand Data > Sort > Options > Custom Sort Order.
However, I really don't wish to sort each table, and am looking for a way to parse a column of this list as entered and return a column with an integral relative value for each row, as compared to the weights of the syrups in the other rows of the table.
Unique Name   weight (not unique)   Relative Value
blueberry     .250                  2
raspberry     .333                  3
orange        .425                  4
tangerine     .333                  3
blackberry    .225                  1
I am attempting to find a "relative sort": a nested function which can assign an integral value to each Unique Name by comparing the weights of the syrups. A "Lookup" only works if there is an absolute equality, right?
What if someone doesn't use "blackberry syrup"? Then "blueberry" is the lightest and should be labeled as 1.
Is this too complicated for LibreOffice Calc?
Is it a recursive greater-than/less-than/equal-to comparison?
If the problem is calculating the right-hand column below from entries that may be sorted ascending by value, as on the left:
then an answer is, in C2 and copied down to suit (provided C1 is blank or 0):
=IF(B1<>B2,C1+1,C1)
Without sorting, the RANK function might be simpler and adequate (though in the example it returns 5 rather than 4).
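(For comparison only, the same "dense rank" idea is easy to state in R, which may make the intended comparison clearer; this is a sketch using the example weights above, not a Calc solution.)
weight <- c(.250, .333, .425, .333, .225)   # the example weights above
# dense rank: equal weights share a value, and ranks step by 1 per distinct weight
match(weight, sort(unique(weight)))
# [1] 2 3 4 3 1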

Randomly pairing elements of a vector in R to count unique arrangements

Background:
On this combinatorics question, the issue is how to determine the sample space: the number of ways 8 different soccer teams can be paired up for the next round of competition. Two different answers have been advanced for that part of the problem: 28 (see the comments on the OP) and 105 (see the edit within the OP and the answer).
I'd like to do this manually to try to home in on the mistake in whichever answer is incorrect.
What I have tried:
teams = 1:8
names(teams) = c("RM", "BCN", "SEV", "JUV", "ROM", "MC", "LIV", "BYN")
split(sample(teams), rep(1:(length(teams)/2), each=2))
Unfortunately, the output is a list, and I wanted a vector to be able to run something like:
unique(...,MARGIN=2)
Is there a way of doing this in an elegant manner?
After a now erased answer (thank you), I would go with
a <- replicate(1e5, unlist(split(sample(teams), rep(1:(length(teams)/2), each=2))))
to simulate 100,000 random samples, and later run
unique(a, MARGIN = 2).
But how can I account for the fact that the order of the 4 pairings of opponents doesn't matter, and that LIV-BYN and BYN-LIV, for example, is the same pairing (field advantage notwithstanding)?
> u = ncol(unique(replicate(1e6, unlist(split(sample(teams), rep(1:(length(teams)/2), each=2)))), MARGIN = 2))
> u / (factorial(4) * 2^4)
[1] 105
The idea of unlist is from #Song Zhengyi, and if his answer is un-deleted, I'll accept it. The complete answer is in the lines above.
u needs to be divided by 4! because
BCN-RM, BYN-SEV, JUV-ROM, LIV-MC
is exactly the same as
LIV-MC, BCN-RM, BYN-SEV, JUV-ROM
or
BCN-RM, LIV-MC, BYN-SEV, JUV-ROM
etc.
The term 2^4 is to avoid over-counting since for every possible unique draw, each one of the pairings can be flipped without loss (discarding field advantage): BCN-RM is the same as RM-BCN, and there are 4 pairs in each draw.
If field advantage is a consideration (real life)...
> u/factorial(4)
[1] 1680
we end up with 1,680 possible draws.
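As a sanity check on both numbers, the closed-form counts can be computed directly (a quick sketch, no simulation needed):
# 8 teams into 4 unordered, unlabelled pairs: 8! / (4! * 2^4)
factorial(8) / (factorial(4) * 2^4)   # 105
# with field advantage (ordered pairs), drop the 2^4 term
factorial(8) / factorial(4)           # 1680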

Vectorizing raster brick objects with r-raster so that I can count them

I have an image of columns of red and blue bordered circles like so:
The columns alternate red and blue (in this example the first column is red).
I have been able to create a raster brick and plot the image in RGB layers, but I want to count these circles into a vector like the one below (from the above example), with values 1 (red) and 2 (blue):
1,1,1,1,2,2,2,1,1,2,1,1,1 ...
Is it possible to clear out areas of the brick I don't need for counting and collapse the brick down into values I could then convert into the numbers or labels I want? Or is there a much simpler way that I'm unable to locate? Also long term I want to be able to point the program at several images without opening them myself.
Edit: To clear some things up, I want to count the circles top to bottom, left to right. For example, once the first column is counted, I want to start over at the top of the next column on the right. Also, I'm not sure if I'm headed in the right direction, but I was able to remove all background cells from the image, leaving me with a plot of only values where the circles are.
Edit 2:
The current code I have for the image above.
library(raster)

color.image <- brick("image")
color.image <- dropLayer(color.image, 4)  # gets rid of a channel
plot(color.image)

e <- extent(-10, 240, 45, 84.8)  # xmin, xmax, ymin, ymax
ccolor.image <- crop(color.image, e)
plot(ccolor.image)

# thresholding to simplify what I was dealing with
ccolor.image[ccolor.image > 97] <- NA
ccolor.image[ccolor.image < 15] <- NA
ccolor.image[ccolor.image > 20] <- 80
plot(ccolor.image)

mcolor <- as.matrix(ccolor.image)
colSums(mcolor, na.rm = TRUE)
rowSums(mcolor, na.rm = TRUE)
Edit 3:
I figured it out! Or at least found a roundabout way to do it; I will post the code later once I clean it up some. I still, however, would like input on creating a vector based on the matrix of values I have for my simplified raster brick matrix. Code coming soon!
The fastest way to count values in a raster is freq(x, merge=TRUE). This gives you the value in one column and its frequency in as many additional columns as you have layers. You then need to pick the value of interest and sum across all the other columns (the counts). Hope that helps!
freq_vals <- freq(rasterbrick, merge = TRUE)
sum(freq_vals[which(freq_vals$value == 1), 2:ncol(freq_vals)])
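A tiny toy illustration of the idea (a sketch only; it assumes the raster package is loaded and builds a two-layer brick with values 1 and 2, as in the question):
library(raster)
r1 <- raster(matrix(c(1, 1, 2, 2), nrow = 2))
r2 <- raster(matrix(c(1, 2, 2, 2), nrow = 2))
b  <- brick(r1, r2)
freq_vals <- freq(b, merge = TRUE)
# total number of cells equal to 1, summed across both layers
sum(freq_vals[which(freq_vals$value == 1), 2:ncol(freq_vals)])   # 3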

Cumulative sum of a georeferenced variable in R

I have a number of fishing boat tracks, and I'm trying to detect a certain pattern in their movement using R. In doing so I have reached a point where I have discarded all points of the track where the desired pattern is not occurring within a given time window, and I'm left with the remaining georeferenced points. These points have a score value associated, which measures the 'intensity' of the desired pattern.
track_1[1:10,]:
LAT LON SCORE
1 32.34855 -35.49264 80.67
2 31.54764 -35.58691 18.14
3 31.38293 -35.25243 46.70
4 31.21447 -35.25830 22.65
5 30.76365 -35.38881 11.93
6 30.75872 -35.54733 22.97
7 30.60261 -35.95472 35.98
8 30.62818 -36.27024 31.09
9 31.35912 -35.73573 14.97
10 31.15218 -36.38027 37.60
The code below reproduces the same data:
track_1 <- data.frame(
  LAT   = c(32.34855, 31.54764, 31.38293, 31.21447, 30.76365, 30.75872, 30.60261, 30.62818, 31.35912, 31.15218),
  LON   = c(-35.49264, -35.58691, -35.25243, -35.25830, -35.38881, -35.54733, -35.95472, -36.27024, -35.73573, -36.38027),
  SCORE = c(80.67, 18.14, 46.70, 22.65, 11.93, 22.97, 35.98, 31.09, 14.97, 37.60))
Because some of these points occur geographically close to each other I need to 'pool' their scores together. Hence, I now need a way to throw this data into some kind of a spatial grid and cumulatively sum the scores of all points that fall in the same cell of the grid. This would allow me to find in what areas a given fishing boat exhibits the pattern I'm after the most (and this is not just about time spent in one place). Ultimately, the preferred output would contain lat and lon for every grid cell (center), and the sum of all scores on each cell. In addition, I would also like to be able to adjust the sizing of the grid cells.
I've looked around and all I can find either does not preserve the georeferenced information, is very inefficient, or performs binning of data. There may already be some answers out there, but it might be the case that I'm not able to recognize them since I'm a bit out of my league on this stuff. Can someone please point me to some direction (package, function, etc.)? Any guidance will be greatly appreciated.
Take your lat/lon coordinates, and multiply them by the inverse of your desired grid cell edge lengths, measured in degrees. The result will be a pair of floating point numbers whose integer part identifies the grid cell in question. Take the floor of these and you have two numbers describing the cell, which you could paste to form a single string. You may add that as a new factor column of your data frame. Then you can perform operations based on that factor, like summarizing values.
Example:
latScale <- 2  # one cell for every 0.5 degrees
lonScale <- 2  # likewise
track_1$cell <- factor(with(track_1,
  paste(floor(LAT * latScale), floor(LON * lonScale), sep = '.')))

library(plyr)
ddply(track_1, .(cell), summarize,
      LAT = mean(LAT), LON = mean(LON), SCORE = sum(SCORE))
If you want to, you can use weighted.mean instead of mean. If you don't like these factors, you can put more effort in making them nice (e.g. by using compass directions instead of signs), or drop them altogether and use a pair of integer columns instead.
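If you go the integer-columns route, a minimal sketch of that variant (same track_1, latScale, and lonScale as above) could look like this:
# bin into integer cell indices and sum the scores per cell
track_1$cellLat <- floor(track_1$LAT * latScale)
track_1$cellLon <- floor(track_1$LON * lonScale)
aggregate(SCORE ~ cellLat + cellLon, data = track_1, FUN = sum)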
