Sliding window in a data frame in R

I'm trying to obtain the proportion of individuals that share certain DNA sequences between two given points, and I want to use a specific sliding window. To show the problem I created this example. First I create a data frame with four columns.
x<-c(rep("sc256",times=2000),rep("sc784",times=2000))
pos1<-round(runif(2000,100,5000),digits=0)
pos2<-round(runif(2000,100,5000),digits=0)
y3<-rep(c(2,1),times=2000)
M1<-data.frame(x,pos1,pos2,y3)
colnames(M1)=c("iid","pos1","pos2","chr")
I also create a function to obtain the proportion of individuals that have sequences in a particular interval.
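roh_island<-function(pop,chr,p1,p2){
# keep only the rows on the requested chromosome
a<-pop[pop$chr==chr,]
# segments that lie entirely inside [p1, p2]
island<-subset(a,pos1>=p1 & pos2<=p2)
# proportion relative to all rows of pop (the original divided by length(M1$iid))
n<-nrow(island)/nrow(pop)
return(n)
}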
roh_island(M1,1,345,700)
Now I want to turn this interval into a sliding window of size 10 that moves between 0 and 7000, so the window takes the positions [0,10), [10,20), …, [6990,7000]. I also need the new function to store every window and the proportion of individuals in it in a data frame, so that I can plot it afterwards. I tried some sliding-window solutions that I found, but I could not make them work. Thanks

This code will slide p1 from 0 to 6990 in steps of 10 while p2 slides from 10 to 7000 in steps of 10:
# each row of the data frame is one window: x[1] is p1, x[2] is p2
output = apply(data.frame(seq(0,6990,10), seq(10,7000,10)), MARGIN=1,
function(x) roh_island(M1, 1, x[1], x[2]))
plot(output, col="blue")
grid(5, 5)
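Since the question also asks for the windows and their proportions to be stored in a data frame, here is a minimal sketch of one way to do that. The helper name `sliding_roh` and its arguments `from`, `to` and `width` are hypothetical and not part of the original code; `roh_island` is restated so the block runs on its own, dividing by `nrow(pop)`, which matches `length(M1$iid)` when `pop` is `M1`. The example data uses a fixed seed and draws one position per row, which is an assumption made for reproducibility.

```r
# Example data shaped like the question's M1 (seed added for reproducibility)
set.seed(42)
n <- 4000
M1 <- data.frame(
  iid  = rep(c("sc256", "sc784"), each = n / 2),
  pos1 = round(runif(n, 100, 5000)),
  pos2 = round(runif(n, 100, 5000)),
  chr  = rep(c(2, 1), times = n / 2)
)

# Proportion of rows on chromosome `chr` lying entirely inside [p1, p2]
roh_island <- function(pop, chr, p1, p2) {
  a <- pop[pop$chr == chr, ]
  sum(a$pos1 >= p1 & a$pos2 <= p2) / nrow(pop)
}

# Hypothetical helper: one row per window, with its start, end and proportion
sliding_roh <- function(pop, chr, from = 0, to = 7000, width = 10) {
  start <- seq(from, to - width, by = width)
  end   <- start + width
  prop  <- mapply(function(p1, p2) roh_island(pop, chr, p1, p2), start, end)
  data.frame(start = start, end = end, prop = prop)
}

windows <- sliding_roh(M1, chr = 1)
head(windows)
# plot(windows$start, windows$prop, type = "h", col = "blue")
```

Because the result is an ordinary data frame, changing `width` or the range only requires different arguments, and the commented `plot` call shows one way to draw it.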

Related

Bourdet Derivative in R with Smoothing Window

I am calculating pressure derivatives using algorithms from this PDF:
Derivative Algorithms
I have been able to implement the "two-points" and "three-consecutive-points" methods relatively easily using dplyr's lag/lead functions to offset the original columns forward and back one row.
The issue with those two methods is that there can be a ton of noise in the high resolution data we use. This is why there is the third method, "three-smoothed-points" which is significantly more difficult to implement. There is a user-defined "window width",W, that is typically between 0 and 0.5. The algorithm chooses point_L and point_R as being the first ones such that ln(deltaP/deltaP_L) > W and ln(deltaP/deltaP_R) > W. Here is what I have so far:
#If necessary install DPLYR
#install.packages("dplyr")
library(dplyr)
#Create initial Data Frame
elapsedTime <- c(0.09583, 0.10833, 0.12083, 0.13333, 0.14583, 0.1680,
0.18383, 0.25583)
deltaP <- c(71.95, 80.68, 88.39, 97.12, 104.24, 108.34, 110.67, 122.29)
df <- data.frame(elapsedTime,deltaP)
#Shift the elapsedTime and deltaP columns forward and back one row
df$lagTime <- lag(df$elapsedTime,1)
df$leadTime <- lead(df$elapsedTime,1)
df$lagP <- lag(df$deltaP,1)
df$leadP <- lead(df$deltaP,1)
#Calculate the 2 and 3 point derivatives using nearest neighbors
df$TwoPtDer <- (df$leadP - df$lagP) / log(df$leadTime/df$lagTime)
df$ThreeConsDer <- ((df$deltaP-df$lagP)/(log(df$elapsedTime/df$lagTime)))*
((log(df$leadTime/df$elapsedTime))/(log(df$leadTime/df$lagTime))) +
((df$leadP-df$deltaP)/(log(df$leadTime/df$elapsedTime)))*
((log(df$elapsedTime/df$lagTime))/(log(df$leadTime/df$lagTime)))
#Calculate the window value for the current 1 row shift
df$lnDeltaT_left <- abs(log(df$elapsedTime/df$lagTime))
df$lnDeltaT_right <- abs(log(df$elapsedTime/df$leadTime))
Resulting Data Table
If you look at the picture linked above, you will see that, based on a W of 0.1, only row 2 meets the criterion for both the left and the right point. Just FYI, this data set is an extension of the data used in example 2.5 in the referenced PDF.
So, my ultimate question is this:
How can I choose the correct point_L and point_R such that they meet the above criteria? My initial thoughts are some kind of while loop, but being an inexperienced programmer, I am having trouble writing a loop that gets anywhere close to what I am shooting for.
Thank you for any suggestions you may have!
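One possible shape for such a loop is sketched below. It assumes the window is measured in log elapsed time, as the `lnDeltaT_left` and `lnDeltaT_right` columns in the code above do, rather than in deltaP. The helper names `find_left` and `find_right` are hypothetical. For each row the loop walks outward one row at a time until the log ratio exceeds W, and returns NA when no such row exists.

```r
elapsedTime <- c(0.09583, 0.10833, 0.12083, 0.13333, 0.14583, 0.1680,
                 0.18383, 0.25583)
W <- 0.1

# Hypothetical helper: index of the nearest earlier row with ln(t_i / t_j) > W
find_left <- function(i, t, W) {
  j <- i - 1L
  while (j >= 1L && log(t[i] / t[j]) <= W) j <- j - 1L
  if (j >= 1L) j else NA_integer_
}

# Hypothetical helper: index of the nearest later row with ln(t_j / t_i) > W
find_right <- function(i, t, W) {
  j <- i + 1L
  while (j <= length(t) && log(t[j] / t[i]) <= W) j <- j + 1L
  if (j <= length(t)) j else NA_integer_
}

idx   <- seq_along(elapsedTime)
left  <- vapply(idx, find_left,  integer(1), t = elapsedTime, W = W)
right <- vapply(idx, find_right, integer(1), t = elapsedTime, W = W)

data.frame(row = idx, left = left, right = right)
```

The `left` and `right` indices can then be used in place of the one-row `lag` and `lead` shifts when looking up the time and pressure of point_L and point_R.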

Divide column values within a vector

I'm not sure if my title is properly expressing what I'm asking. Once I'm done writing, it'll make sense. Firstly, I just started learning R, so I am a newbie. I've been reading through tutorial series and PDF's I've found online.
I'm working on a data set and I created a data frame of just the year 2001 and the DAM value Bon. Here's a picture.
What I want to do now is create a matrix with 3 columns: Coho Adults, Coho Jacks, and a third column with the ratio of Coho Jacks to Adults. The ratio is the part I'm having trouble with.
If I do a line of code like this I get a normal output.
(cohoPassage <- matrix(fishPassage1995BON[c(5,6, 7)], ncol = 3))
The values are 259756, 6780, and 114934.
I figure that in order to get the ratio, I should divide column 5's values by column 6's. So basically 259756/6780 = 38.31
I've tried many things like:
(cohoPassage <- matrix(fishPassage1995BON[c(5,6, 5/6)], ncol = 3))
This just outputs the value of the fifth column instead of dividing for some reason
I've tried this:
matrix(fishPassage1995BON[c(5,6)],fishPassage1995BON[,5]/fishPassage1995BON[,6], ncol = 3)
Which gives me an incorrect output
I decided to break down the problem and divide the fifth and sixth columns separately and it gave the correct ratio.
If I create a matrix like this
matrix(fishPassage1995BON[,5]/fishPassage1995BON[,6])
It outputs the correct ratio of 38.31209. But when I try to combine everything, I just keep getting errors.
What can I do? Any help would be appreciated. Thank you.
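A minimal sketch of one way to combine the pieces is to compute the ratio as its own vector and bind the three columns with `cbind`. The one-row data frame below is a hypothetical stand-in for `fishPassage1995BON`, with only columns 5 to 7 filled in from the values quoted above, and the column names are placeholders.

```r
# Hypothetical stand-in for the real data: columns 1-4 are unknown here
fishPassage1995BON <- as.data.frame(
  matrix(c(NA, NA, NA, NA, 259756, 6780, 114934), nrow = 1)
)

col5  <- fishPassage1995BON[, 5]
col6  <- fishPassage1995BON[, 6]
ratio <- col5 / col6            # same division that gave 38.31209 above

# cbind() lines the three vectors up as the columns of a matrix
cohoPassage <- cbind(col5, col6, ratio)
colnames(cohoPassage) <- c("Column5", "Column6", "Ratio")
cohoPassage
```

Inside `[ ]`, an expression such as `5/6` is treated as an index, not as an instruction to divide two columns, so the division has to be done on the column values themselves.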

Show KMeans cluster results with clusters as columns

My data has 40+ variables and I am creating a 3 cluster model on it.
I have built a kmeans model:
teen_clusters <- kmeans(interests_z, 3)
It works fine. The issue is getting output that I can read.
When I screen print the model, it places the variables on the top (40 across) and the clusters as rows (3 deep). Very hard to read.
I want it the other way around. 3 cluster columns and 40 rows.
I have tried the below, but get the same thing. This does way too much screen wrap.
aggregate(interests_z,by=list(teen_clusters$cluster),FUN=mean)

Since we don't have your data, let's use mtcars ...
ret <- kmeans(mtcars,3)
ret$centers # the default format
t(ret$centers) # transposed as you want
To see the components of ret use str(ret)

Cumulative sum of a georeferenced variable in R

I have a number of fishing boat tracks, and I'm trying to detect a certain pattern in their movement using R. In doing so I have reached a point where I have discarded all points of the track where the desired pattern is not occurring within a given time window, and I'm left with the remaining georeferenced points. These points have a score value associated, which measures the 'intensity' of the desired pattern.
track_1[1:10,]:
LAT LON SCORE
1 32.34855 -35.49264 80.67
2 31.54764 -35.58691 18.14
3 31.38293 -35.25243 46.70
4 31.21447 -35.25830 22.65
5 30.76365 -35.38881 11.93
6 30.75872 -35.54733 22.97
7 30.60261 -35.95472 35.98
8 30.62818 -36.27024 31.09
9 31.35912 -35.73573 14.97
10 31.15218 -36.38027 37.60
The code below provides the same data:
track_1 <- data.frame(
LAT=c(32.34855,31.54764,31.38293,31.21447,30.76365,30.75872,30.60261,30.62818,31.35912,31.15218),
LON=c(-35.49264,-35.58691,-35.25243,-35.25830,-35.38881,-35.54733,-35.95472,-36.27024,-35.73573,-36.38027),
SCORE=c(80.67,18.14,46.70,22.65,11.93,22.97,35.98,31.09,14.97,37.60))
Because some of these points occur geographically close to each other I need to 'pool' their scores together. Hence, I now need a way to throw this data into some kind of a spatial grid and cumulatively sum the scores of all points that fall in the same cell of the grid. This would allow me to find in what areas a given fishing boat exhibits the pattern I'm after the most (and this is not just about time spent in one place). Ultimately, the preferred output would contain lat and lon for every grid cell (center), and the sum of all scores on each cell. In addition, I would also like to be able to adjust the sizing of the grid cells.
I've looked around and all I can find either does not preserve the georeferenced information, is very inefficient, or performs binning of data. There may already be some answers out there, but it might be the case that I'm not able to recognize them since I'm a bit out of my league on this stuff. Can someone please point me to some direction (package, function, etc.)? Any guidance will be greatly appreciated.

Take your lat/lon coordinates, and multiply them by the inverse of your desired grid cell edge lengths, measured in degrees. The result will be a pair of floating point numbers whose integer part identifies the grid cell in question. Take the floor of these and you have two numbers describing the cell, which you could paste to form a single string. You may add that as a new factor column of your data frame. Then you can perform operations based on that factor, like summarizing values.
Example:
latScale <- 2 # one cell for every 0.5 degrees
lonScale <- 2 # likewise
track_1$cell <- factor(with(track_1,
paste(floor(LAT*latScale), floor(LON*lonScale), sep='.')))
library(plyr)
ddply(track_1, .(cell), summarize,
LAT=mean(LAT), LON=mean(LON), SCORE=sum(SCORE))
If you want to, you can use weighted.mean instead of mean. If you don't like these factors, you can put more effort in making them nice (e.g. by using compass directions instead of signs), or drop them altogether and use a pair of integer columns instead.
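If the output should carry the centre of each grid cell instead of the mean position of its points, the same floor-based idea can be sketched in base R as follows. The variable `cell_size` (cell edge length in degrees) and the column names `lat_c` and `lon_c` are hypothetical names introduced for this example.

```r
track_1 <- data.frame(
  LAT   = c(32.34855, 31.54764, 31.38293, 31.21447, 30.76365,
            30.75872, 30.60261, 30.62818, 31.35912, 31.15218),
  LON   = c(-35.49264, -35.58691, -35.25243, -35.25830, -35.38881,
            -35.54733, -35.95472, -36.27024, -35.73573, -36.38027),
  SCORE = c(80.67, 18.14, 46.70, 22.65, 11.93, 22.97, 35.98, 31.09,
            14.97, 37.60)
)

cell_size <- 0.5  # assumed cell edge length in degrees; adjust as needed

# floor() picks the cell, adding 0.5 moves from its lower edge to its centre
track_1$lat_c <- (floor(track_1$LAT / cell_size) + 0.5) * cell_size
track_1$lon_c <- (floor(track_1$LON / cell_size) + 0.5) * cell_size

# Sum the scores of all points that share a cell
grid_sum <- aggregate(SCORE ~ lat_c + lon_c, data = track_1, FUN = sum)
grid_sum
```

With a 0.5 degree cell, points 3 and 4 of the sample fall in the same cell, as do points 6 and 7, so their scores are pooled.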

Drawing a Square Line Chart using quantmod

Is there a way to get quantmod to draw a square line chart?
I've tried modifying my time series so that each data point is replicated one second before the next data point (hoping this would approximate a square line), but quantmod seems to place the data on the x axis sequentially and evenly spaced, without regard to the actual values of x (i.e. the horizontal space between one point and the next is the same whether the delta-T is 1 second or 1 minute).
I suppose I could convert my timeseries from a sparse to a dense one (one entry per second instead of one entry per change in value), but this seems very kludgy and should be unnecessary.
I'm constructing my time series thus:
library(quantmod)
myNumericVector <- c(3,7,2,9,4)
myDateTimeStrings <- paste("2011-10-31", c("5:26:00", "5:26:10", "5:26:40", "5:26:50", "5:27:00"))
myXts <- xts(myNumericVector, order.by=as.POSIXct(myDateTimeStrings))
And drawing the chart like so:
chartSeries(myXts, type="line", show.grid="true", theme=chartTheme("black"))
To illustrate what I have vs. what I want, the result looks like the blue line below but I'd like something more like the green:
Also, for the curious, here is the code that replicates points in the time series so that the gap between one value and the next is as small as possible:
mySquareDateTimes <- rep(as.POSIXct(myDateTimeStrings),2)[-1]
mySquareDateTimes[seq(2,8,by=2)] <- mySquareDateTimes[seq(2,8,by=2)] - 1
mySquareXts <- xts(rep(myNumericVector,each=2)[-10], order.by=mySquareDateTimes)
chartSeries(mySquareXts, type="line", show.grid="true", theme=chartTheme("black"))
The results are less than ideal.

You want a line.type of "step":
chartSeries(myXts, line.type="s")
See ?plot, specifically "type" under ... in the Arguments section (you may want "S" instead of "s").
