Segmenting a data frame by row based on previous rows' values - r

I have a data frame in R with coordinate columns x and y, plus a seconds column. The data frame represents a journey, with each row giving the position at the next point in time.
x y seconds
1 0.0 0.0 0
2 -5.8 -8.5 1
3 -11.6 -18.2 2
4 -16.9 -30.1 3
5 -22.8 -40.8 4
6 -29.0 -51.6 5
I need to break the journey up into segments where each segment starts once the distance from the start of the previous segment crosses a certain threshold (e.g. 200).
I have recently switched from using SAS to R, and this is the first time I've come across anything I can do easily in SAS but can't even think of the way to approach the problem in R.
I've posted the SAS code I would use below to do the same job. It creates a new column called segment.
%let cutoff=200;
data segments;
  set journey;
  retain segment distance x_start y_start;
  if _n_=1 then do;
    x_start=x;
    y_start=y;
    segment=1;
    distance=0;
  end;
  distance + sqrt((x-x_start)**2+(y-y_start)**2);
  if distance>&cutoff then do;
    x_start=x;
    y_start=y;
    segment+1;
    distance=0;
  end;
  keep x y seconds segment;
run;
Edit: Example output
If the cutoff were 200 then an example of required output would look something like...
x y seconds segment
1 0.0 0.0 0 1
2 40.0 30.0 1 1
3 80.0 60.0 2 1
4 120.0 90.0 3 1
5 160.0 120.0 4 2
6 120.0 150.0 5 2
7 80.0 180.0 6 2
8 40.0 210.0 7 2
9 0.0 240.0 8 3

If your data set is dd, something like
cutoff <- 200
origin <- dd[1,c("x","y")]
cur.seg <- 1
dd$segment <- NA
for (i in 1:nrow(dd)) {
  dist <- sqrt(sum((dd[i,c("x","y")]-origin)^2))
  if (dist>cutoff) {
    cur.seg <- cur.seg+1
    origin <- dd[i,c("x","y")]
  }
  dd$segment[i] <- cur.seg
}
should work. There are some refinements (it might be more efficient to compute distances of the current origin to all rows, then use which(dist>cutoff)[1] to jump to the first row that goes beyond the cutoff), and it would be interesting to try to come up with a completely vectorized solution, but this should be OK. How big is your data set?
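The refinement mentioned above (compute the distances from the current origin to all remaining rows, then jump with which(dist > cutoff)[1]) might look roughly like the sketch below. The function name segment_journey and the toy data are assumptions made for illustration; the sketch assumes columns named x and y and a non-negative cutoff.

```r
# Sketch: jump from origin to origin instead of testing every row one by one.
# segment_journey is a hypothetical helper name, not part of any package.
segment_journey <- function(dd, cutoff = 200) {
  stopifnot(cutoff >= 0)
  n <- nrow(dd)
  segment <- integer(n)
  start <- 1L     # row index of the current segment's origin
  cur_seg <- 1L
  while (start <= n) {
    idx <- start:n
    # distances from the current origin to every remaining row
    d <- sqrt((dd$x[idx] - dd$x[start])^2 + (dd$y[idx] - dd$y[start])^2)
    hit <- which(d > cutoff)[1]   # NA when no remaining row exceeds the cutoff
    if (is.na(hit)) {
      segment[idx] <- cur_seg
      break
    }
    nxt <- start + hit - 1L       # first row beyond the cutoff starts a new segment
    segment[start:(nxt - 1L)] <- cur_seg
    cur_seg <- cur_seg + 1L
    start <- nxt
  }
  segment
}

# Toy journey (illustrative values only)
dd <- data.frame(x = c(0, 40, 80, 120, 160, 120, 80, 40, 0),
                 y = c(0, 30, 60, 90, 120, 150, 180, 210, 240))
dd$segment <- segment_journey(dd, cutoff = 175)
dd$segment
# 1 1 1 1 2 2 2 2 3
```

Each pass of the while loop handles a whole segment with vectorized arithmetic, so the number of R-level iterations equals the number of segments rather than the number of rows.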

Is there way to calculate multiple new rows of a data frame based on previous rows' values?

I am creating a data frame (hoops) with three columns (t, x, y) and 700 rows. See code at bottom. In the first row, I have set column t to equal 0. In the second row, I want column t to be calculated by taking the previous row's t value and adding a constant (hoops_var). I want this formula to continue to row 700.
hoops<-data.frame(t=double(),x=double(),y=double())
hoops_var<- 1.5
hoops[1,1]<- 0
hoops[1,2]<- (hoops$t+23)
hoops[1,3]<- (hoops$t/2)
# What I want for row 2
hoops[2,1]<- hoops[[1,1]]+hoops_var #this formula for rows 2 to 700
hoops[2,2]<- (hoops$t+23) #same as row 1
hoops[2,3]<- (hoops$t/2) #same as row 1
# What I want for row 3 to 700 (same as row 2)
hoops[3:700,1]<- hoops[[2,2]]+hoops_var #same as row 2
hoops[3:700,2]<- (hoops$t+23) #same as rows 1 & 2
hoops[3:700,3]<- (hoops$t/2) #same as row 1 & 2
The first four rows of the table should look like this
The only applicable solution I found (linked at bottom) did not work for me.
I am fairly new to R, so apologies if this is a dumb question. Thanks in advance for any help.
R: Creating a new row based on previous rows
You should use vectorized operations:
# first create all columns as vectors
hoops_t <- hoops_var*(0:699) #0:699 gives a vector of 700 consecutive integers
hoops_x <- hoops_t+23
hoops_y <- hoops_t/2
# now we are ready to put all vectors in a dataframe
hoops <- data.frame(t=hoops_t,x=hoops_x,y=hoops_y)
Now, if you want to change the t column, you can use lag from dplyr to shift all values, for example:
library(dplyr)
hoops$t[2:nrow(hoops)] <- lag(hoops$x*hoops$y)[2:nrow(hoops)]
I select only [2:nrow(hoops)] (all rows except the first one) because you don't want the first row to be modified.
You could use the following:
n <- 10 #Number of rows in the data.frame
t <- seq(0, by = 1.5, length.out = n)
x <- 23 + t
y <- t/2
hoops <- data.frame(t, x, y)
hoops #Sample for 10 rows.
# t x y
#1 0.0 23.0 0.00
#2 1.5 24.5 0.75
#3 3.0 26.0 1.50
#4 4.5 27.5 2.25
#5 6.0 29.0 3.00
#6 7.5 30.5 3.75
#7 9.0 32.0 4.50
#8 10.5 33.5 5.25
#9 12.0 35.0 6.00
#10 13.5 36.5 6.75
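When a column really does depend on the previous row and no closed form is obvious, base R can still avoid an explicit row-by-row loop. The sketch below is one possible way to do it, assuming the same hoops_var of 1.5 and 700 rows as in the question; the names t_cumsum and t_reduce are made up for illustration.

```r
hoops_var <- 1.5
n <- 700

# Running sum: each t is the previous t plus hoops_var, starting from 0.
t_cumsum <- c(0, cumsum(rep(hoops_var, n - 1)))

# General recurrence: Reduce() passes each result on to the next step,
# and accumulate = TRUE keeps every intermediate value.
t_reduce <- Reduce(function(prev, step) prev + step,
                   rep(hoops_var, n - 1),
                   init = 0, accumulate = TRUE)

hoops <- data.frame(t = t_reduce, x = t_reduce + 23, y = t_reduce / 2)
head(hoops, 4)
```

For a constant increment, seq() or a multiplication as in the answers above is simpler; Reduce() only pays off when the update rule is something other than plain addition.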

Imputation for longitudinal data using observation before and after missing data

I’m in the process of cleaning some longitudinal data and I have several missing cases. I am trying to use an imputation that incorporates observations before and after the missing case. I’m wondering how I can go about addressing the issues detailed below.
I’ve been trying to break the problem apart into smaller, more manageable operations and objects. However, the solutions I keep coming to force me to use conditional logic based on the rows immediately above and below a missing value and, quite frankly, I’m at a bit of a loss as to how to do this. I would love a little guidance if you know of a good technique I can use or experiment with, or of any good search terms I can use when looking up a solution.
The details are below:
#Fake dataset creation
id <- c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,3,4,4,4,4,4,4,4)
time <-c(0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1,2,3,4,5,6)
ss <- c(1,3,2,3,NA,0,0,2,4,0,NA,0,0,0,4,1,2,4,2,3,NA,2,1,0,NA,NA,0,0)
mydat <- data.frame(id, time, ss)
*Bold characters represent changes from the dataset above
The goal here is to find a way to get the mean of the value before (3) and after (0) the NA value for ID #1 (variable ss) so that the data look like this: 1,3,2,3,1.5,0,0,
ID# 2 (variable ss) should look like this: 2,4,0,0,0,0,0
ID #3 (variable ss) should use a last observation carried forward approach, so it would need to look like this: 4,1,2,4,2,3,3
ID #4 (variable ss) has two consecutive NA values and should not be changed. It will be flagged for a different analysis later in my project. So, it should look like this: 2,1,0,NA,NA,0,0 (no change).
I use the smwrBase package. The syntax for filling in only a single missing value is below, but it doesn't account for id.
smwrBase::fillMissing(ss, max.fill=1)
The zoo package might be more standard, same issue though.
zoo::na.approx(ss, maxgap=1)
Below is an approach that accounts for the variable id. Current interpolation approaches don't fill in the last value, so I added a manual if statement for that. It's a bit brute force; there might be a tapply approach out there.
> id <- c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,3,4,4,4,4,4,4,4)
> time <-c(0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1,2,3,4,5,6)
> ss <- c(1,3,2,3,NA,0,0,2,4,0,NA,0,0,0,4,1,2,4,2,3,NA,2,1,0,NA,NA,0,0)
> mydat <- data.frame(id, time, ss, ss2=NA_real_)
> for (i in unique(id)) {
+ # interpolate for gaps
+ mydat$ss2[mydat$id==i] <- zoo::na.approx(ss[mydat$id==i], maxgap=1, na.rm=FALSE)
+ # extension for gap as last value
+ if(is.na(mydat$ss2[mydat$id==i][length(mydat$ss2[mydat$id==i])])) {
+ mydat$ss2[mydat$id==i][length(mydat$ss2[mydat$id==i])] <-
+ mydat$ss2[mydat$id==i][length(mydat$ss2[mydat$id==i])-1]
+ }
+ }
> mydat
id time ss ss2
1 1 0 1 1.0
2 1 1 3 3.0
3 1 2 2 2.0
4 1 3 3 3.0
5 1 4 NA 1.5
6 1 5 0 0.0
7 1 6 0 0.0
8 2 0 2 2.0
9 2 1 4 4.0
10 2 2 0 0.0
11 2 3 NA 0.0
12 2 4 0 0.0
13 2 5 0 0.0
14 2 6 0 0.0
15 3 0 4 4.0
16 3 1 1 1.0
17 3 2 2 2.0
18 3 3 4 4.0
19 3 4 2 2.0
20 3 5 3 3.0
21 3 6 NA 3.0
22 4 0 2 2.0
23 4 1 1 1.0
24 4 2 0 0.0
25 4 3 NA NA
26 4 4 NA NA
27 4 5 0 0.0
28 4 6 0 0.0
The interpolated value in id=1 is 1.5 (avg of 3 and 0), in id=2 it is 0 (avg of 0 and 0), and in id=3 it is 3 (the preceding value, since there is no following value).
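A grouped alternative without the explicit loop over ids could use ave() from base R, which applies a function within each id and returns the results in the original row order. The helper fill_single_gap below is a hypothetical name written for this sketch, not a function from zoo or smwrBase. It fills an isolated NA with the mean of its two neighbours, carries the last value forward for an isolated trailing NA, and leaves runs of two or more NAs untouched.

```r
id <- rep(1:4, each = 7)
time <- rep(0:6, times = 4)
ss <- c(1,3,2,3,NA,0,0, 2,4,0,NA,0,0,0, 4,1,2,4,2,3,NA, 2,1,0,NA,NA,0,0)
mydat <- data.frame(id, time, ss)

# Hypothetical helper: handles one id's values at a time.
fill_single_gap <- function(v) {
  n <- length(v)
  out <- v
  na <- is.na(v)
  for (k in which(na)) {
    prev_ok <- k > 1 && !na[k - 1]
    next_ok <- k < n && !na[k + 1]
    if (prev_ok && next_ok) {
      out[k] <- (v[k - 1] + v[k + 1]) / 2   # interior single gap: mean of neighbours
    } else if (prev_ok && k == n) {
      out[k] <- v[k - 1]                    # trailing single gap: carry last value forward
    }
  }
  out
}

# ave() splits ss by id, applies the helper, and reassembles in row order.
mydat$ss2 <- ave(mydat$ss, mydat$id, FUN = fill_single_gap)
mydat
```

Because the helper only ever looks at the immediate neighbours within one id, a value from one id can never leak into the next.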

Bootstrapping multiple columns with R

I'm relatively new at R and I'm trying to build a function which will loop through columns in an imported table and produce an output which consists of the means and 95% confidence intervals. Ideally it should be possible to bootstrap columns with different sample sizes, but first I would like to get the iteration working. I have something that sort-of works, but I can't get it all the way there. This is what the code looks like, with the sample data and output included:
#cdata<-read.csv(file.choose(),header=T)#read data from selected file, works, commented out because data is provided below
#cdata #check imported data
#Sample Data
# WALL NRPK CISC WHSC LKWH YLPR
#1 21 8 1 2 2 5
#2 57 9 3 1 0 1
#3 45 6 9 1 2 0
#4 17 10 2 0 3 0
#5 33 2 4 0 0 0
#6 41 4 13 1 0 0
#7 21 4 7 1 0 0
#8 32 7 1 7 6 0
#9 9 7 0 5 1 0
#10 9 4 1 0 0 0
x<-cdata[,c("WALL","NRPK","LKWH","YLPR")] #only select relevant species
i<-nrow(x) #count number of rows for bootstrapping
g<-ncol(x) #count number of columns for iteration
#build bootstrapping function, this works for the first column but doesn't iterate
bootfun <- function(bootdata, reps) {
  boot <- function(bootdata){
    s1=sample(bootdata, size=i, replace=TRUE)
    ms1=mean(s1)
    return(ms1)
  } # a single bootstrap
  bootrep <- replicate(n=reps, boot(bootdata))
  return(bootrep)
} #replicates bootstrap of "bootdata" "reps" number of times and outputs vector of results
cvr1 <- bootfun(x$YLPR,50000) #have unsuccessfully tried iterating the location various ways (i.e. x[i])
cvrquantile<-quantile(cvr1,c(0.025,0.975))
cvrmean<-mean(cvr1)
vec<-c(cvrmean,cvrquantile) #puts results into a suitable form for output
vecr<-sapply(vec,round,1) #rounds results
vecr
2.5% 97.5%
28.5 19.4 38.1
#apply(x[1:g],2,bootfun) ##doesn't work in this case
#desired output:
#Species Mean LowerCI UpperCI
#WALL 28.5 19.4 38.1
#NRPK 6.1 4.6 7.6
#YLPR 0.6 0.0 1.6
I've also tried this using the boot package, and it works beautifully to iterate through the means but I can't get it to do the same with the confidence intervals. The "ordinary" code above also has the advantage that you can easily retrieve the bootstrapping results, which might be used for other calculations. For the sake of completeness here is the boot code:
#Bootstrapping using boot package
library(boot)
#data<-read.csv(file.choose(),header=TRUE) #read data from selected file
#x<-data[,c("WALL","NRPK","LKWH","YLPR")] #only select relevant columns
#x #check data
#Sample Data
# WALL NRPK LKWH YLPR
#1 21 8 2 5
#2 57 9 0 1
#3 45 6 2 0
#4 17 10 3 0
#5 33 2 0 0
#6 41 4 0 0
#7 21 4 0 0
#8 32 7 6 0
#9 9 7 1 0
#10 9 4 0 0
i<-nrow(x) #count number of rows for resampling
g<-ncol(x) #count number of columns to step through with bootstrapping
boot.mean<-function(x,i){boot.mean<-mean(x[i])} #bootstrapping function to get the mean
z<-boot(x, boot.mean,R=50000) #bootstrapping function, uses mean and number of reps
boot.ci(z,type="perc") #derive 95% confidence intervals
apply(x[1:g],2, boot.mean) #bootstrap all columns
#output:
#WALL NRPK LKWH YLPR
#28.5 6.1 1.4 0.6
I've gone through all of the resources I can find and can't seem to get things working. What I would like for output would be the bootstrapped means with the associated confidence intervals for each column. Thanks!
Note: the line apply(x[1:g],2, boot.mean) does not do any bootstrapping. It simply calculates the mean of each column.
For bootstrap mean and confidence interval, try this:
apply(x,2,function(y){
  b<-boot(y,boot.mean,R=50000);
  c(mean(b$t),boot.ci(b,type="perc", conf=0.95)$percent[4:5])
})
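If you would rather stay with the "ordinary" approach from the question (no boot package), the same per-column iteration can be sketched in base R as follows. The function name boot_summary and the reduced reps = 5000 are assumptions chosen for illustration. Dropping NAs inside the function is one possible way to let columns have different sample sizes.

```r
# Sample data from the question (selected species only)
cdata <- data.frame(
  WALL = c(21, 57, 45, 17, 33, 41, 21, 32, 9, 9),
  NRPK = c(8, 9, 6, 10, 2, 4, 4, 7, 7, 4),
  LKWH = c(2, 0, 2, 3, 0, 0, 0, 6, 1, 0),
  YLPR = c(5, 1, 0, 0, 0, 0, 0, 0, 0, 0)
)

# Hypothetical helper: bootstrap the mean of one column and return the
# mean of the bootstrap means plus a 95% percentile interval.
boot_summary <- function(v, reps = 5000) {
  v <- v[!is.na(v)]   # each column uses its own sample size
  means <- replicate(reps, mean(sample(v, size = length(v), replace = TRUE)))
  c(mean(means), quantile(means, c(0.025, 0.975)))
}

set.seed(1)
result <- t(sapply(cdata, boot_summary))   # one row per species
colnames(result) <- c("Mean", "LowerCI", "UpperCI")
round(result, 1)
```

sapply() walks over the columns of the data frame, so no explicit counter for the number of columns is needed, and the raw bootstrap means could be returned instead if they are wanted for other calculations.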

R: group XY point cluster, convert to density & save

Goal: change a set of point clusters into a density distribution.
Specifics: point clusters are well separated, and I'm interested in the density values of each sampling site (by count). I've been converting the counts by hand, and an algorithm to allocate points into densities would be invaluable.
I'm not sure how to go about doing this and am very open to creative input!
Here's what the entire dataset looks like:
> head(markers)
x y
1 -494.5768 300.6698
2 -494.4280 300.7582
3 -494.5812 300.8424
4 -494.4000 300.9146
5 -494.8554 300.9102
6 -494.8038 300.9974
https://www.dropbox.com/s/ewcggnp3p29vhjh/datapoints.csv
I'd like to get an output in this format
x y density
1 6 1 0.0
2 7 1 17.6
3 8 1 11.2
4 12 1 14.4
5 13 1 0.0
6 14 1 8.0
7 14 2 0.0
etc
the x y points would be much larger, like -494.5768
I think it'd have to do something along the lines of:
1. calculate distances between all point combinations
2. group the rows that have distances under a set threshold
3. subset/split clusters with plyr
4. find the average XY coordinates of the cluster
5. assign length(cluster) to the XY point
6. recombine all the rows
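The steps listed above can be sketched with base R alone, under the assumption that the clusters really are well separated. Single-linkage clustering cut at a distance threshold groups points that are chained together by gaps smaller than that threshold. The last three points, the threshold of 2, and the column name count are made up for illustration; converting a count into a density is left out because the conversion factor is not given in the question.

```r
# First five points rounded from head(markers); the last three are invented
# to form a second, well-separated cluster.
markers <- data.frame(
  x = c(-494.58, -494.43, -494.58, -494.40, -494.86, -480.10, -480.05, -479.95),
  y = c(300.67, 300.76, 300.84, 300.91, 300.91, 310.00, 310.10, 310.05)
)

threshold <- 2   # assumed: larger than within-cluster gaps, smaller than between-cluster gaps

# Steps 1-2: pairwise distances, then group points linked by gaps under the threshold
tree <- hclust(dist(markers[, c("x", "y")]), method = "single")
markers$cluster <- cutree(tree, h = threshold)

# Steps 3-6: average coordinates per cluster plus the number of points in it
centres <- aggregate(cbind(x, y) ~ cluster, data = markers, FUN = mean)
centres$count <- as.vector(table(markers$cluster)[as.character(centres$cluster)])
centres
```

dist() stores every pairwise distance, so memory grows with the square of the number of points; for a very large marker file the data would probably need to be processed in spatial chunks.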

Altering a large distance matrix to be just three columns

I have a large data frame/.csv that is a matrix with 42 columns and 110,357,407 rows. It was derived from the x and y coordinates of two datasets of points, one with 41 points and another with 110,357,407, and the values in the rows represent the distances between these two sets of points (the distance from each point on list 1 to every single point on list 2). The first column is a list of point IDs (from 1 to 110,357,407). An excerpt from the matrix is below.
V1 V2 V3 V4 V5 V6 V7
1 38517.05 38717.8 38840.16 38961.37 39281.06 88551.03 88422.62
2 38514.05 38714.79 38837.15 38958.34 39278 88545.48 88417.09
3 38511.05 38711.79 38834.14 38955.3 39274.94 88539.92 88411.56
4 38508.05 38708.78 38831.13 38952.27 39271.88 88534.37 88406.03
5 38505.06 38705.78 38828.12 38949.24 39268.83 88528.82 88400.5
6 38502.07 38702.78 38825.12 38946.21 39265.78 88523.27 88394.97
7 38499.08 38699.78 38822.12 38943.18 39262.73 88517.72 88389.44
8 38496.09 38696.79 38819.12 38940.15 39259.68 88512.17 88383.91
9 38493.1 38693.8 38816.12 38937.13 39256.63 88506.62 88378.38
10 38490.12 38690.8 38813.12 38934.11 39253.58 88501.07 88372.85
11 38487.14 38687.81 38810.13 38931.09 39250.54 88495.52 88367.33
12 38484.16 38684.83 38807.14 38928.07 39247.5 88489.98 88361.8
13 38481.18 38681.84 38804.15 38925.06 39244.46 88484.43 88356.28
14 38478.21 38678.86 38801.16 38922.04 39241.43 88478.88 88350.75
15 38475.23 38675.88 38798.17 38919.03 39238.39 88473.34 88345.23
16 38472.26 38672.9 38795.19 38916.03 39235.36 88467.8 88339.71
My issue is that I would like to reshape this matrix into just 3 columns. The first column would hold the 110,357,407 point IDs (like the first column of the matrix), the second would identify which of the 41 points each row refers to, and the third would be the distance between those two points. So it would look something like this:
Back Pres Dist
1 1 3486
2 1 3456
3 1 3483
4 1 3456
5 1 3429
6 1 3438
7 1 3422
8 1 3427
9 1 3428
(After the distances between the back and all of the first value of pres are complete, pres will change to 2 and will eventually work its way up to 41)
I realize that this will output a hugely ridiculous number of rows, but this is the format that I need to run some processes that are outside of R.
I tried using this code
cols.Output <- data.frame(col = rep(colnames(output3), each = nrow(output3)),
row = rep(rownames(output3), ncol(output3)),
value = as.vector(output3))
But there won’t be the same number of rows for each column, so I received an error (and I don’t think it would have really worked with my pres column needs). I tried experimenting with some of the rbind.fill and cbind.fill functions (the one in plyr and ones that others have come up with in the forum). I also looked into some of the melting and reshaping but I was very confused about the functions and couldn’t figure out how to implement them appropriately (or if they even are appropriate for what I need). I would really appreciate any help on this as I’ve been struggling with it for a long time.
Edit: Just to be a little more clear about what I need. Take these two smaller data sets
back <- 1 dataset with 5 sets of x, y points
pres <- 1 dataset with 3 sets of x, y points
Calculating distances between these two data frames generates the initial matrix:
Back 1 2 3
1 3427 3444 3451
2 3432 3486 3476
3 3486 3479 3486
4 3449 3438 3484
5 3483 3486 3486
And my desired output would look like this:
Back Pres Dist
1 1 3427
2 1 3432
3 1 3486
4 1 3449
5 1 3483
1 2 3444
2 2 3486
3 2 3479
4 2 3438
5 2 3486
1 3 3451
2 3 3476
3 3 3486
4 3 3484
5 3 3486
Yes, it looks like this is the kind of problem generally solved with some combination of melt and cast in the reshape2 package. That said, with 100+ million rows, I'm not sure that's the most efficient way to go in this case.
You could do it all manually as follows. I'll assume your data frame is called df, and the distances are in columns 2 to 42. See if this works.
d <- unlist(df[-1]) # put all the distances into a vector
newdf <- cbind(expand.grid(back=seq_len(nrow(df)), pres=seq_len(ncol(df) - 1)), d)
This will probably die unless you have tons of memory. The same holds for any simple solution though, since you have > 4.2 billion elements in the vector of distances. You can work on subsets of the full dataset at a time to get around this problem.
Here's how to use melt on a small example:
require(reshape2)
a <- matrix(rnorm(9), nrow = 3)
a[, 1] <- 1:3 ## Pretending these are one set of points
rownames(a) <- a[, 1] ## We'll put them as rownames instead of a column
melt(a[, -1]) ## And omit that column when melting
If you have memory issues, you could write a for loop and do it in pieces, writing each to a file when they're completed.
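A rough sketch of that piecewise approach in base R is shown below, using the small 5 x 3 example from the question. It assumes the first column of df holds the Back IDs and the remaining columns hold one Pres point each; the temporary output file is purely illustrative.

```r
# Small example from the question: 5 "back" points, 3 "pres" points
df <- data.frame(
  V1 = 1:5,
  V2 = c(3427, 3432, 3486, 3449, 3483),
  V3 = c(3444, 3486, 3479, 3438, 3486),
  V4 = c(3451, 3476, 3486, 3484, 3486)
)

out_file <- tempfile(fileext = ".csv")   # illustrative output location

# Handle one distance column (one Pres point) at a time and append it,
# so only a single long piece is held in memory at once.
for (j in seq_len(ncol(df) - 1)) {
  piece <- data.frame(Back = df[[1]], Pres = j, Dist = df[[j + 1]])
  write.table(piece, out_file, sep = ",", row.names = FALSE,
              col.names = (j == 1),      # header only for the first piece
              append = (j > 1))
}

long <- read.csv(out_file)   # read back only to check the small example
head(long)
```

For the full dataset the final read.csv() would be skipped, since the point of writing in pieces is to avoid holding the whole long table in memory.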
