I have a data frame that has a xyz and another variable, A.
data.frame(xx,yy,zz,Amp)
xx yy zz Amp
1 63021.71 403205.0 1.181028516 1170
2 63021.71 403105.0 0.977028516 1381
3 63021.71 403105.0 0.861028516 807
4 63021.71 403105.0 0.784028516 668
5 53021.71 403105.0 0.620028516 19919
6 53021.71 403305.0 0.455028516 32500
7 53021.71 403105.0 0.446028516 32500
8 43021.71 403105.0 0.436028516 32500
9 43021.71 404105.0 0.426028516 32500
10 43021.71 403105.0 0.281028516 17464
First I want to create regular grid for xyz.
Next I want to fill this grid with Amp values.
I would like to do this by creating by using arrays.
Any help would be much appreciated.
i would like the final result to look like this:
dim(Amp)
10 10 10
You do not have enough data in your MWE to create a 10x10x10 array without interpolation. Currently you have 3 unique xx values, 4 unique yy values, and 10 unique zz values. So you could create a 3x4x10 array, but you don't have enough values in Amp to assign to each point in a 3x4x10 3D regular grid. You only have 10 Amp values, describing 10 unique points in 3D space. A 3x4x10 regular grid array would have 120 Amp values, one for each point in the grid. Furthermore, values in a regular grid are equally spaced in each dimension and your yy and zz values are not equally spaced.
Check the spacing in each dimension:
> diff(sort(unique(xx)))
[1] 10000 10000
> diff(sort(unique(yy)))
[1] 100 100 800
> diff(sort(unique(zz)))
[1] 0.145 0.010 0.010 0.009 0.165 0.164 0.077 0.116 0.204
The current MWE looks like this in 3D:
library(rgl)
plot3d(xx,yy,zz, col="red")
To form a 10x10x10 regular grid, you need to convert your dataset into one that has 1000 coordinate points and Amp values. I'm not exactly sure how you'd like to do this given your MWE, but here's an example given the current data:
# MWE data
xx = c(63021.71,63021.71,63021.71,63021.71,53021.71,53021.71,53021.71,43021.71,43021.71,43021.71)
yy = c(403205,403105,403105,403105,403105,403305,403105,403105,404105,403105)
zz = c(1.181028516,0.977028516,0.861028516,0.784028516,0.620028516,0.455028516,0.446028516,0.436028516,0.426028516,0.281028516)
Amp = c(1170,1381,807,668,19919,32500,32500,32500,32500,17464)
# create equally-spaced vectors of 10 values in each dimension
xx <- seq(min(xx), max(xx), length.out = 10)
yy <- seq(min(yy), max(yy), length.out = 10)
zz <- seq(min(zz), max(zz), length.out = 10)
# fake up some Amp data points
set.seed(123)
Amp <- runif(1000, min = min(Amp), max=max(Amp))
# directly create a 10x10x10 regular grid of Amp values as an array
dfa <- array(data = Amp,
dim = c(10,10,10),
dimnames = list(xx,yy,zz)
)
> dim(dfa)
[1] 10 10 10
# Alternatively, make a data.frame first
df <- data.frame(expand.grid(xx,yy,zz))
names(df) <- c("xx","yy","zz")
df$Amp <- Amp
dfa <- array(data = df$Amp,
dim=c(length(unique(df$xx)),
length(unique(df$yy)),
length(unique(df$zz))),
dimnames=list(unique(df$xx), unique(df$yy), unique(df$zz))
)
# you'll want to verify that the Amp values were assigned to the correct xyz coordinates.
# Here's a little function to help:
get_arr_loc = function(x, y, z) {
x + (y-1)*10 + (z-1)*100
}
# and some arbitrary coordinates checked. This could be done in a more systematic way...
> df[get_arr_loc(1,1,1), "Amp"] == dfa[1,1,1]
[1] TRUE
> df[get_arr_loc(10,2,1), "Amp"] == dfa[10,2,1]
[1] TRUE
> df[get_arr_loc(3,6,9), "Amp"] == dfa[3,6,9]
[1] TRUE
> df[get_arr_loc(10,10,10), "Amp"] == dfa[10,10,10]
[1] TRUE
Related
I have a raster and a shapefile. The raster contains NA and I am filling the NAs using the focal function
library(terra)
v <- vect(system.file("ex/lux.shp", package="terra"))
r <- rast(system.file("ex/elev.tif", package="terra"))
r[45:60, 45:60] <- NA
r_fill <- terra::focal(r, 5, mean, na.policy="only", na.rm=TRUE)
However, there are some NA still left. So I do this:
na_count <- terra::freq(r_fill, value = NA)
while(na_count$count != 0){
r_fill <- terra::focal(r_fill, 5, mean, na.policy="only", na.rm=TRUE)
na_count <- terra::freq(r_fill, value = NA)
}
Once all NA's are filled, I clip the raster again using the shapefile
r_fill <- terra::crop(r_fill, v, mask = T, touches = T)
This is what my before and after looks like:
I wondered if the while loop is an efficient way to fill the NAs or basically determine how many times I have to run focal to fill all the NAs in the raster.
Perhaps we can, or want to, dispense with the while( altogether by making a better estimate of focal('s w= arg in a world where r, as ground truth, isn't available. Were it available, we could readily derive direct value of w
r <- rast(system.file("ex/elev.tif", package="terra"))
# and it's variants
r2 <- r
r2[45:60, 45:60] <- NA
freq(r2, value=NA) - freq(r, value=NA)
layer value count
1 0 NA 256
sqrt((freq(r2, value=NA) - freq(r, value=NA))$count)
[1] 16
which might be a good value for w=, and introducing another variant
r3 <- r
r3[40:47, 40:47] <- NA
r3[60:67, 60:67] <- NA
r3[30:37, 30:37] <- NA
r3[70:77, 40:47] <- NA
rm(r)
We no longer have our ground truth. How might we estimate an edge of w=? Turning to boundaries( default values (inner)
r2_bi <- boundaries(r2)
r3_bi <- boundaries(r3)
# examining some properties of r2_bi, r3_bi
freq(r2_bi, value=1)$count
[1] 503
freq(r3_bi, value=1)$count
[1] 579
freq(r2_bi, value=1)$count/freq(r2_bi, value = 0)$count
[1] 0.1306833
freq(r3_bi, value=1)$count/freq(r3_bi, value = 0)$count
[1] 0.1534588
sum(freq(r2_bi, value=1)$count,freq(r2_bi, value = 0)$count)
[1] 4352
sum(freq(r3_bi, value=1)$count,freq(r3_bi, value = 0)$count)
[1] 4352
Taken in reverse order, sum[s] and freq[s] suggest that while the total area of (let's call them holes) are the same, they differ in number and r2 is generally larger than r3. This is also clear from the first pair of freq[s].
Now we drift into some voodoo, hocus pocus in pursuit of a better edge estimate
sum(freq(r2)$count) - sum(freq(r2, value = NA)$count)
[1] 154
sum(freq(r3)$count) - sum(freq(r3, value = NA)$count)
[1] 154
(sum(freq(r3)$count) - sum(freq(r3, value = NA)$count))
[1] 12.40967
freq(r2_bi, value=1)$count/freq(r2_bi, value = 0)$count
[1] 0.1306833
freq(r2_bi, value=0)$count/freq(r2_bi, value = 1)$count
[1] 7.652087
freq(r3_bi, value=1)$count/freq(r3_bi, value = 0)$count
[1] 0.1534588
taking the larger, i.e. freq(r2_bi 7.052087
7.652087/0.1306833
[1] 58.55444
154+58
[1] 212
sqrt(212)
[1] 14.56022
round(sqrt(212)+1)
[1] 16
Well, except for that +1 part, maybe still a decent estimate for w=, to be used on both r2 and r3 if called upon to find a better w, and perhaps obviate the need for while(.
Another approach to looking for squares and their edges:
wtf3 <- values(r3_bi$elevation)
wtf2 <- values(r2_bi$elevation)
wtf2_tbl_df2 <- as.data.frame(table(rle(as.vector(is.na(wtf2)))$lengths))
wtf3_tbl_df2 <- as.data.frame(table(rle(as.vector(is.na(wtf3)))$lengths))
names(wtf2_tbl_df2)
[1] "Var1" "Freq"
wtf2_tbl_df2[which(wtf2_tbl_df2$Var1 == wtf2_tbl_df2$Freq), ]
Var1 Freq
14 16 16
wtf3_tbl_df2[which(wtf3_tbl_df2$Freq == max(wtf3_tbl_df2$Freq)), ]
Var1 Freq
7 8 35
35/8
[1] 4.375 # 4 squares of 8 with 3 8 length vectors
bringing in v finally and filling
v <- vect(system.file("ex/lux.shp", package="terra"))
r2_fill_17 <- focal(r2, 16 + 1 , mean, na.policy='only', na.rm = TRUE)
r3_fill_9 <- focal(r3, 8 + 1 , mean, na.policy='only', na.rm = TRUE)
r2_fill_17_cropv <- crop(r2_fill_17, v, mask = TRUE, touches = TRUE)
r3_fill_9_cropv <- crop(r3_fill_9, v, mask = TRUE, touches = TRUE)
And I now appreciate your while( approach as your r2 looks better, more naturally transitioned, though the r3 looks fine. In my few, brief experiments with smaller than 'hole', i.e. focal(r2, 9, I got the sense it would take 2 passes to fill, that suggests focal(r2, 5 would take 4.
I guess further determining the proportion of fill:hole:rast for when to deploy a while would be worthwhile.
I have a grid of 10x10m coordinates that I extracted from a raster. I have a set of 'starting points'. For each starting point, I want to find the location (coordinates) of cells within a 10-50m radius around it.
I am aware of functions to do this with a raster starting point, but additional analyses that I have not included here require that I perform the search from a grid of coordinates in the format shown below.
The code below achieves my aim, however the outer function produces vectors that are far too large (> 10 Gb) on my actual dataset (which is a grid of 9 million 10x10m cells, with 3000 starting points).
I am looking for alternatives that achieve the same result as the following (simplified) code, but do not require large vector storage or looping over each starting point separately.
library(raster)
library(tidyverse)
#Set up the mock raster
orig=raster(nrows=100, ncols=100)
res(orig)=10
vals <- rep(c(1, 2, 3, 1, 2, 3, 1, 3, 2), times = c(72, 72, 72, 72, 72, 72, 72, 72, 72))
setValues(orig, vals)
values(orig) <- vals
xygrid <- as.data.frame(orig, xy = TRUE) %>% .[,1:2]
head(xygrid)
x y
1 -175 85
2 -165 85
3 -155 85
4 -145 85
5 -135 85
6 -125 85
#the initial starting points
init_locs <- c(5, 10, 15, 20)
#calculate the distance to every surrounding cell from starting point
Rx <- outer(xygrid[init_locs, 1], xygrid[, 1], "-")
Ry <- outer(xygrid[init_locs, 2], xygrid[, 2], "-")
R <- sqrt(Rx^2+Ry^2) #overall distance
for (i in 1:length(R[,1])) {
expr2 <- (R[i,] > 10 & R[i,] <= 50) #extract the location of cells within 10-50m
inv <- xygrid[expr2,] #extract the coordinates of these cells
}
head(inv)
x y
15 -35 85
16 -25 85
17 -15 85
18 -5 85
22 35 85
23 45 85
(Raster and spatial data are not my specialty, but this made me think of a naive approach that might work acceptably. I don't know anything about the methods #Robert Hijmans mentioned, those are likely much more performant. I just thought this sounded like an interesting question to explore with basic methods.)</caveat>
Approach
The main challenge here is you have 9 million cells, but only around 80 of those will be with 50m of any given point. If you calculate all those cells' distances to 3,000 starting points and then filter for those under 50m, that's 9M x 3k = 27 billion calculations, and a gigantic data structure, almost all of which is unnecessary.
We can quickly get ~1,000x more efficient by splitting this into two problems -- first, what general region of potentially-within-50m-points should we look at, and second, what is the actual distance to the points in those regions?
We can precalculate a modestly sized <2MB hash table for step 1. Then, by joining it to our locations (a very fast operation), we can focus our calculations on the 1/1000th of points that have a chance of being within 50m. I arbitrarily split the original cells into 100 x 100 = 10k sectors, each sector holding 30x30 cells.
1. Creating hash table
For the hash table, I'll assign each point to a sector, somewhat arbitrarily as 30x30 cells, so we have 100x100 = 10k sectors. This could be tuned based on speed vs. memory tradeoffs.
max_dist = 30 # sector width, in cells
xygrid2 <- expand_grid(
x = seq(0, 2999, by = 1), # 3000x3000 location grid
y = seq(0, 2999, by = 1))
xygrid2$sector_x = xygrid2$x %/% max_dist # 100 x 100 sectors
xygrid2$sector_y = xygrid2$y %/% max_dist
y_range = max(xygrid2$sector_y) + 1
xygrid2$sector_num = xygrid2$sector_x*y_range + xygrid2$sector_y
We now have 10,000 sectors assigned. Now which sectors are adjacent to which others? In every case, the adjacent sectors follow the same pattern. In this case, I have 100 sectors across x, so the sectors adjacent to sector S will have sector numbers that vary from S by -101 -100 -99 -1 0 1 99 100 101. We can use this pattern to assign all the adjacencies instantaneously. For simplicity, I leave in sectors outside our range; they will be ignored later anyway.
sector_num_deltas <- rep(-1:1, by = 3) + rep(-1:1, each = 3) * y_range
distinct(xygrid2, sector_num) %>%
uncount(9) %>% # copy each row 9 times, one for each adjacency
mutate(sector_num_adj = sector_num + sector_num_deltas) -> adjacencies
2. Join and calculate
Now that we have that, the rest goes much faster, since we can do the calculations only on the 1/1000th of sectors that are nearby. With that, we can now identify the 240,000 points that are within 50m of the 3,000 starting positions in under 4 seconds:
# Here are 3,000 random starting locations
set.seed(42)
sample_starts <- xygrid2 %>%
slice_sample(n = 3000) %>%
mutate(sample_num = row_number())
# Join each location to all the adjacent sectors, and then add all the
# locations within those sectors, and then calculate distances.
sample_starts %>% # 3,000 starting points...
# join each position to the nine adjacent sectors = ~27,000 rows
left_join(adjacencies, by = "sector_num") %>%
# join each sector to the (30x30 = 900) cells in those sectors --> 24 million rows
# That's a lot, but it's only 1/1000th of the starting problem with
# 3k x 9M = 27 billion comparisons!
left_join(xygrid2, by = c("sector_num_adj" = "sector_num")) %>%
select(-contains("sector")) %>%
mutate(dist = sqrt((x.x-x.y)^2 + (y.x-y.y)^2)) %>%
filter(dist <= 5) -> result
The result tells us that our 3,000 sample starting points are within 5 decimeters (50m) of 242,575 cells, about 80 for each starting point.
result
# A tibble: 242,575 x 6
x.x y.x sample_num x.y y.y dist
<dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 1069 140 1 1064 140 5
2 1069 140 1 1065 137 5
3 1069 140 1 1065 138 4.47
4 1069 140 1 1065 139 4.12
5 1069 140 1 1065 140 4
6 1069 140 1 1065 141 4.12
7 1069 140 1 1065 142 4.47
8 1069 140 1 1065 143 5
9 1069 140 1 1066 136 5
10 1069 140 1 1066 137 4.24
# … with 242,565 more rows
Here's a sample to see how that's working in a small corner of our data:
ggplot(a %>% mutate(sample_grp = sector_num_adj %% 8 %>% as.factor),
aes(x.y, y.y, color = sample_grp)) +
geom_point(data = adjacencies %>% filter(sector_num_adj == 5864) %>%
left_join(xygrid2) %>% distinct(x, y, sector_num),
color = "gray80", shape = 21,
aes(x, y)) +
geom_point(data = adjacencies %>% filter(sector_num == 5864) %>%
left_join(xygrid2) %>% distinct(x, y, sector_num),
color = "gray70", shape = 21,
aes(x, y)) +
annotate("text", alpha = 0.5,
x = c(1725, 1750),
y = c(1960, 1940),
label = c("Lookup area", "sector of\nstarting location")) +
geom_point(size = 1) +
scale_color_discrete(guide = FALSE) +
coord_equal() -> my_plot
library(gganimate)
animate(
my_plot +
gganimate::view_zoom_manual(pan_zoom = -1, ease = "quadratic-in-out",
xmin = c(0, 1700),
xmax = c(3000, 1800),
ymin = c(0, 1880),
ymax = c(3000, 1980)),
duration = 3, fps = 20, width = 300)
Example data --- you were using a lon/lat example, but based on your code, I am assuming that you are using planar data.
library(raster)
r <- raster(nrows=100, ncols=100, xmn=0, xmx=100, ymn=0, ymx=100, crs="+proj=utm +zone=1 +datum=WGS84")
values(r) <- 1:ncell(r) # for display only
xygrid <- as.data.frame(r, xy = TRUE)[,1:2]
locs <- c(8025, 1550, 5075)
dn <- 2.5 # min dist
dx <- 5.5 # max dist
The simplest approach would be to use pointDistance
p <- xyFromCell(r, locs)
d <- pointDistance(xygrid, p, lonlat=FALSE)
u <- unique(which(d>dn & d<dx) %% nrow(d))
pts <- xygrid[u,]
plot(r)
points(pts)
But you will probably run out of memory with that, and it is inefficient to compute all distance. Instead, you may intersect the points with a buffer around the points of interest
b1 <- buffer(SpatialPoints(p, proj4string=crs(r)), dx)
b2 <- buffer(SpatialPoints(p, proj4string=crs(r)), dn)
b <- erase(b1, b2)
x <- intersect(SpatialPoints(xygrid, proj4string=crs(r)), b)
plot(r)
points(x, cex=.5)
points(xyFromCell(r, locs), col="red", pch="x")
With terra it goes like this -- and works well for large datasets in version 1.1-11 that should be on CRAN this week
library(terra)
rr <- rast(r)
pp <- xyFromCell(rr, locs)
bb1 <- buffer(vect(pp), dx)
bb2 <- buffer(vect(pp), dn)
bb <- erase(bb1, bb2)
xx <- intersect(vect(as.matrix(xygrid)), bb)
You can do similar things with sf.
Given that you have so many data points, you might want to start with removing all points that are clearly not of interest
xySel <- lapply(locs, function(i) {
xy <- xygrid[i,]
s <- xygrid[,1] > xy[,1]-dx & xygrid[,1] < xy[,1]+dx & xygrid[,2] > xy[,2]-dx & xygrid[,2] < xy[,2]+dx
xygrid[s,]
})
xySel = do.call(rbind, xySel)
dim(xySel)
# [1] 363 2
dim(xygrid)
#[1] 10000 2
And now you could run pointDistance as above on all data (or else inside the lapply function)
You say that you need to use points, and not a raster. I have seen that idea many times, and 9 out of 10 times that is wrong. Maybe it is true in your case. For others who stumble upon this question, here are are two raster based approaches.
With the raster package you could use extract( ... ,cellnumbers=TRUE) or ajacent. With adjacent, you would first make a weights matrix using one of the buffers made above
buf <- disaggregate(b)[2,]
rb <- crop(r, buf)
w <- as.matrix(rasterize(buf, rb, background=NA) )
w[6,6]=0
And then use the weight matrix like this
a <- adjacent(r, locs, w, pairs=FALSE)
pts <- xyFromCell(r, a)
plot(r)
points(pts)
With terra you could use the cells method
d <- cells(rr, bb)
xy <- xyFromCell(rr, d[,2])
plot(rr)
points(xy, cex=.5)
lines(bb, col="red", lwd=2)
I have a dataset of species and their rough locations in a 100 x 200 meter area. The location part of the data frame is not in a format that I find to be usable. In this 100 x 200 meter rectangle, there are two hundred 10 x 10 meter squares named A through CV. Within each 10 x 10 square there are four 5 x 5 meter squares named 1, 2, 3, and 4, respectively (1 is south of 2 and west of 3. 4 is east of 2 and north of 3). I want to let R know that A is the square with corners at (0 ,0), (10,0), (0,0), and (0,10), that B is just north of A and has corners (0,10), (0,20), (10,10), and (10,20), and K is just east of A and has corners at (10,0), (10,10), (20,0), and (20,10), and so on for all the 10 x 10 meter squares. Additionally, I want to let R know where each 5 x 5 meter square is in the 100 x 200 meter plot.
So, my data frame looks something like this
10x10 5x5 Tree Diameter
A 1 tree1 4
B 1 tree2 4
C 4 tree3 6
D 3 tree4 2
E 3 tree5 3
F 2 tree6 7
G 1 tree7 12
H 2 tree8 1
I 2 tree9 2
J 3 tree10 8
K 4 tree11 3
L 1 tree12 7
M 2 tree13 5
Eventually, I want to be able to plot the 100 x 200 meter area and have each 10 x 10 meter square show up with the number of trees, or number of species, or total biomass
What is the best way to turn the data I have into spatial data that R can use for graphing and perhaps analysis?
Here's a start.
## set up a vector of all 10x10 position tags
tags10 <- c(LETTERS,
paste0("A",LETTERS),
paste0("B",LETTERS),
paste0("C",LETTERS[1:22]))
A function to convert (e.g.) {"J",3} to the center of the corresponding sub-square.
convpos <- function(pos10,pos5) {
## convert letters to major (x,y) positions
p1 <- as.numeric(factor(pos10,levels=tags10)) ## or use match()
p1.x <- ((p1-1) %% 10) *10+5 ## %% is modulo operator
p1.y <- ((p1-1) %/% 10)*10+5 ## %/% is integer division
## sort out sub-positions
p2.x <- ifelse(pos5 <=2,2.5,7.5) ## {1,2} vs {3,4} values
p2.y <- ifelse(pos5 %%2 ==1 ,2.5,7.5) ## odd {1,3} vs even {2,4} values
c(p1.x+p2.x,p1.y+p2.y)
}
usage:
convpos("J",2)
convpos(mydata$tenbytenpos,mydata$fivebyfivepos)
Important notes:
this is a proof of concept, I can pretty much guarantee I haven't got the correspondence of x and y coordinates quite right. But you should be able to trace through this line-by-line and see what it's doing ...
it should work correctly on vectors (see second usage example above): I switched from switch to ifelse for that reason
your column names (10x10) are likely to get mangled into something like X10.10 when reading data into R: see ?data.frame and ?check.names
Similar to what #Ben Bolker has done, here's a lookup function (though you may need to transpose something to make the labels match what you describe).
tenbyten <- c(LETTERS[1:26],
paste0("A",LETTERS[1:26]),
paste0("B",LETTERS[1:26]),
paste0("C",LETTERS[1:22]))
tenbyten <- matrix(rep(tenbyten, each = 2), ncol = 10)
tenbyten <- t(apply(tenbyten, 1, function(x){rep(x, each = 2)}))
# the 1234 squares
squares <- matrix(c(rep(c(1,2),10),rep(c(4,3),10)), nrow = 20, ncol = 20)
# stick together into a reference grid
my.grid <- matrix(paste(tenbyten, squares, sep = "-"), nrow = 20, ncol = 20)
# a lookup function for the site grid
coordLookup <- function(tbt, fbf, .my.grid = my.grid){
x <- col(.my.grid) * 5 - 2.5
y <- row(.my.grid) * 5 - 2.5
marker <- .my.grid == paste(tbt, fbf, sep = "-")
list(x = x[marker], y = y[marker])
}
coordLookup("BB",2)
$x
[1] 52.5
$y
[1] 37.5
If this isn't what you're looking for, then maybe you'd prefer a SpatialPolygonsDataFrame, which has proper polygon IDs, and you attach data to, etc. In that case just Google around for how to make one from scratch, and manipulate the row() and col() functions to get your polygon corners, similar to what's given in this lookup function, which only returns centroids.
Edit: getting SPDF started:
This is modified from the function example and can hopefully be a good start:
library(sp)
# really you have a 20x20 grid, counting the small ones.
# c(2.5,2.5) specifies the distance in any direction from the cell center
grd <- GridTopology(c(1,1), c(2.5,2.5), c(20,20)))
grd <- as.SpatialPolygons.GridTopology(grd)
# get centroids
coords <- coordinates(polys)
# make SPDF, with an extra column for your grid codes, taken from the above.
# you can add further columns to this data.frame(), using polys#data
polys <- SpatialPolygonsDataFrame(grd,
data=data.frame(x=coords[,1], y=coords[,2], my.ID = as.vector(my.grid),
row.names=getSpPPolygonsIDSlots(grd)))
The question hast 2 parts.
Which is the data structure in R that allows to store the paired data:
0:0
0.5:10
1:20
(Python dictionary {[0]:0, [0.5]:10, [1]:20})
and how to initiate it with one liner? i.e. to couple seq(0,1,by=0.5)
with seq(0,10,by=5) in this data structure
Assume I added 0.25 to the list, then I want the weighted average of the neighbor nodes to appear (automatically) in the data set, i.e. the element 0.25:5 and the paired set would be
0:0
0.25:5
0.5:10
1:20
If I add the element 0.3, then it must be paired with 5+(10-5)*(0.3-0.25)/(0.5-0.25)=6 and element 0.3:6 to be added.
How I can create the class with S4 or Reference Class class model where I could put this functionality?
Not really sure what you are getting at but maybe the package hash may have what you want
library(hash)
h<-hash(keys=seq(0,1,by=0.5),values=seq(0,10,by=5))
h[['0.25']]<-2.5
Probably deals with the first part of your question. http://cran.r-project.org/web/packages/hash/hash.pdf may allude to help on the second.
a similar construct with lists
lst<-list()
lst<-seq(0,10,5)
names(lst)<-seq(0,1,0.5)
> lst['0.5']
0.5
5
lst['0.25']<-2.5
for your second part you could construct a simple function to update you hash/list with a new value.
A two-column data.frame seems appropriate:
xy <- data.frame(x = seq(0, 1, by = 0.5), y = seq(0, 20, by = 10))
xy
# x y
# 1 0.0 0
# 2 0.5 10
# 3 1.0 20
Then, what you are trying to do is a linear-interpolation, which you can achieve using the approx function. For example:
approx(xy$x, xy$y, xout = 0.3)
# $x
# [1] 0.3
#
# $y
# [1] 6
If you want to add that result to the data.frame, you can do something like:
xy <- as.data.frame(approx(xy$x, xy$y, xout = sort(c(xy$x, 0.3))))
xy
# x y
# 1 0.0 0
# 2 0.3 6
# 3 0.5 10
# 4 1.0 20
which is a bit expensive, especially if you plan to add points one at a time. You could instead add all your points at once since the result is independent of the order in which you add them:
add.points <- c(0.25, 0.3)
xy <- as.data.frame(approx(xy$x, xy$y, xout = sort(c(xy$x, add.points))))
xy
# x y
# 1 0.00 0
# 2 0.25 5
# 3 0.30 6
# 4 0.50 10
# 5 1.00 20
I'm using the cut function to split my data in equal bins, it does the job but I'm not happy with the way it returns the values. What I need is the center of the bin not the upper and lower ends.
I've also tried to use cut2{Hmisc}, this gives me the center of each bins, but it divides the range of data in bins that contains the same numbers of observations, rather than being of the same length.
Does anyone have a solution to this?
It's not too hard to make the breaks and labels yourself, with something like this. Here since the midpoint is a single number, I don't actually return a factor with labels but instead a numeric vector.
cut2 <- function(x, breaks) {
r <- range(x)
b <- seq(r[1], r[2], length=2*breaks+1)
brk <- b[0:breaks*2+1]
mid <- b[1:breaks*2]
brk[1] <- brk[1]-0.01
k <- cut(x, breaks=brk, labels=FALSE)
mid[k]
}
There's probably a better way to get the bin breaks and midpoints; I didn't think about it very hard.
Note that this answer is different than Joshua's; his gives the median of the data in each bins while this gives the center of each bin.
> head(cut2(x,3))
[1] 16.666667 3.333333 16.666667 3.333333 16.666667 16.666667
> head(ave(x, cut(x,3), FUN=median))
[1] 18 2 18 2 18 18
Use ave like so:
set.seed(21)
x <- sample(0:20, 100, replace=TRUE)
xCenter <- ave(x, cut(x,3), FUN=median)
We can use smart_cut from package cutr:
devtools::install_github("moodymudskipper/cutr")
library(cutr)
Using #Joshua's sample data:
median by interval (same output as #Joshua except it's an ordered factor) :
smart_cut(x,3, "n_intervals", labels= ~ median(.))
# [1] 18 2 18 2 18 18 ...
# Levels: 2 < 11 < 18
center of each interval (same output as #Aaron except it's an ordered factor) :
smart_cut(x,3, "n_intervals", labels= ~ mean(.y))
# [1] 16.67 3.333 16.67 3.333 16.67 16.67 ...
# Levels: 3.333 < 10 < 16.67
mean of values by interval :
smart_cut(x,3, "n_intervals", labels= ~ mean(.))
# [1] 17.48 2.571 17.48 2.571 17.48 17.48 ...
# Levels: 2.571 < 11.06 < 17.48
labels can be a character vector just like in base::cut.default, but it can also be, as it is here, a function of 2 parameters, the first being the values contained in the bin, and the second the cut points of the bin.
more on cutr and smart_cut