I have a raster and a shapefile. The raster contains NA and I am filling the NAs using the focal function
library(terra)
v <- vect(system.file("ex/lux.shp", package="terra"))
r <- rast(system.file("ex/elev.tif", package="terra"))
r[45:60, 45:60] <- NA
r_fill <- terra::focal(r, 5, mean, na.policy="only", na.rm=TRUE)
However, there are some NA still left. So I do this:
na_count <- terra::freq(r_fill, value = NA)
while(na_count$count != 0){
  r_fill <- terra::focal(r_fill, 5, mean, na.policy="only", na.rm=TRUE)
  na_count <- terra::freq(r_fill, value = NA)
}
Once all NAs are filled, I clip the raster again using the shapefile:
r_fill <- terra::crop(r_fill, v, mask = TRUE, touches = TRUE)
This is what my before and after looks like:
I wonder whether the while loop is an efficient way to fill the NAs, or, put differently, whether I can determine in advance how many times I have to run focal to fill all the NAs in the raster.
Perhaps we can, or want to, dispense with the while() altogether by making a better estimate of focal()'s w= argument in a world where r, as ground truth, isn't available. Were it available, we could readily derive a direct value for w:
r <- rast(system.file("ex/elev.tif", package="terra"))
# and its variants
r2 <- r
r2[45:60, 45:60] <- NA
freq(r2, value=NA) - freq(r, value=NA)
layer value count
1 0 NA 256
sqrt((freq(r2, value=NA) - freq(r, value=NA))$count)
[1] 16
which might be a good value for w=. Introducing another variant:
r3 <- r
r3[40:47, 40:47] <- NA
r3[60:67, 60:67] <- NA
r3[30:37, 30:37] <- NA
r3[70:77, 40:47] <- NA
rm(r)
We no longer have our ground truth. How might we estimate a sensible w=? Turning to boundaries() with its default values (inner):
r2_bi <- boundaries(r2)
r3_bi <- boundaries(r3)
# examining some properties of r2_bi, r3_bi
freq(r2_bi, value=1)$count
[1] 503
freq(r3_bi, value=1)$count
[1] 579
freq(r2_bi, value=1)$count/freq(r2_bi, value = 0)$count
[1] 0.1306833
freq(r3_bi, value=1)$count/freq(r3_bi, value = 0)$count
[1] 0.1534588
sum(freq(r2_bi, value=1)$count,freq(r2_bi, value = 0)$count)
[1] 4352
sum(freq(r3_bi, value=1)$count,freq(r3_bi, value = 0)$count)
[1] 4352
Taken in reverse order, the sum[s] and freq[s] suggest that while the total area of the NA regions (let's call them holes) is the same, the holes differ in number, and r2's are generally larger than r3's. This is also clear from the first pair of freq[s].
Now we drift into some voodoo, hocus pocus, in pursuit of a better edge estimate:
sum(freq(r2)$count) - sum(freq(r2, value = NA)$count)
[1] 154
sum(freq(r3)$count) - sum(freq(r3, value = NA)$count)
[1] 154
sqrt(sum(freq(r3)$count) - sum(freq(r3, value = NA)$count))
[1] 12.40967
freq(r2_bi, value=1)$count/freq(r2_bi, value = 0)$count
[1] 0.1306833
freq(r2_bi, value=0)$count/freq(r2_bi, value = 1)$count
[1] 7.652087
freq(r3_bi, value=1)$count/freq(r3_bi, value = 0)$count
[1] 0.1534588
Taking the larger, i.e. the r2_bi value of 7.652087:
7.652087/0.1306833
[1] 58.55444
154+58
[1] 212
sqrt(212)
[1] 14.56022
round(sqrt(212)+1)
[1] 16
Well, except for that +1 part, this may still be a decent estimate for w=, to be used on both r2 and r3 if called upon to find a better w, and it perhaps obviates the need for while().
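To mechanize that arithmetic, here is a minimal sketch; estimate_w is a hypothetical helper, the (n0/n1)^2 term is my reading of the larger-ratio division above, and nothing here is validated beyond the r2 example:
estimate_w <- function(x) {
  bi <- boundaries(x)
  n1 <- freq(bi, value = 1)$count                  # edge cells
  n0 <- freq(bi, value = 0)$count                  # non-edge cells
  base <- sum(freq(x)$count) - sum(freq(x, value = NA)$count)
  w <- round(sqrt(base + (n0 / n1)^2) + 1)
  if (w %% 2 == 0) w + 1 else w                    # focal() needs an odd width
}
estimate_w(r2)  # ~16 from the arithmetic above, bumped to the next odd width, 17, matching 16 + 1 below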
Another approach to looking for squares and their edges:
wtf3 <- values(r3_bi$elevation)
wtf2 <- values(r2_bi$elevation)
wtf2_tbl_df2 <- as.data.frame(table(rle(as.vector(is.na(wtf2)))$lengths))
wtf3_tbl_df2 <- as.data.frame(table(rle(as.vector(is.na(wtf3)))$lengths))
names(wtf2_tbl_df2)
[1] "Var1" "Freq"
wtf2_tbl_df2[which(wtf2_tbl_df2$Var1 == wtf2_tbl_df2$Freq), ]
Var1 Freq
14 16 16
wtf3_tbl_df2[which(wtf3_tbl_df2$Freq == max(wtf3_tbl_df2$Freq)), ]
Var1 Freq
7 8 35
35/8
[1] 4.375 # 4 squares of side 8, showing up as 35 runs of length 8
Bringing in v finally and filling:
v <- vect(system.file("ex/lux.shp", package="terra"))
r2_fill_17 <- focal(r2, 16 + 1, mean, na.policy='only', na.rm = TRUE)
r3_fill_9 <- focal(r3, 8 + 1, mean, na.policy='only', na.rm = TRUE)
r2_fill_17_cropv <- crop(r2_fill_17, v, mask = TRUE, touches = TRUE)
r3_fill_9_cropv <- crop(r3_fill_9, v, mask = TRUE, touches = TRUE)
And I now appreciate your while() approach, as your r2 looks better, more naturally transitioned, though the r3 looks fine. In my few, brief experiments with windows smaller than the hole, e.g. focal(r2, 9), I got the sense it would take 2 passes to fill, which suggests focal(r2, 5) would take 4.
I guess it would be worthwhile to further determine the fill:hole:raster proportions at which a while() loop becomes the better choice.
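To put that pass-count intuition on firmer footing, here is a minimal sketch (count_passes is a hypothetical helper, assuming the same na.policy="only" fill as in the question) that simply counts focal() passes until no NA remains:
count_passes <- function(r, w) {
  passes <- 0
  while (freq(r, value = NA)$count != 0) {
    r <- focal(r, w, mean, na.policy = "only", na.rm = TRUE)
    passes <- passes + 1
  }
  passes
}
# e.g. compare count_passes(r2, 9) and count_passes(r2, 5) against the 2- and 4-pass guesses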
I essentially have two columns (vectors) with speed and accel in a data.frame as such:
speed acceleration
1 3.2694444 2.6539535522
2 3.3388889 2.5096979141
3 3.3888889 2.2722134590
4 3.4388889 1.9815256596
5 3.5000000 1.6777544022
6 3.5555556 1.3933215141
7 3.6055556 1.1439051628
8 3.6527778 0.9334115982
9 3.6722222 0.7561602592
I need to find, for each value of speed on the x axis, the top 10% max values of the y axis (acceleration). This also needs to be done over specific intervals, for example speed 3.2-3.4, 3.4-3.6, and so on. Can you please show me how a for loop would look in this situation?
As @alistaire already pointed out, you have provided a very limited amount of data. So we first have to simulate a bit more data on which we can test our code.
set.seed(1)
# your data
speed <- c(3.2694444, 3.3388889, 3.3888889, 3.4388889, 3.5,
3.5555556, 3.6055556, 3.6527778, 3.6722222)
acceleration <- c(2.6539535522, 2.5096979141, 2.2722134590,
1.9815256596, 1.6777544022, 1.3933215141,
1.1439051628, 0.9334115982, 0.7561602592)
df <- data.frame(speed, acceleration)
# expand data.frame and add a little bit of noise to all values
# to make them 'unique'
df <- as.data.frame(do.call(
rbind,
replicate(15L, apply(df, 2, \(x) (x + runif(length(x), -1e-1, 1e-1) )),
simplify = FALSE)
))
The function create_intervals, as the name suggests, creates user-defined intervals. The rest of the code does the 'heavy lifting' and stores the desired result in out.
If you would like to have intervals of speed with equal widths, simply specify the number of groups (n_groups) you would like to have and leave the rest of the arguments (i.e. lwr, upr, and interval_span) unspecified.
# Cut speed into user-defined intervals
create_intervals <- \(n_groups = NULL, lwr = NULL, upr = NULL, interval_span = NULL) {
if (!is.null(lwr) & !is.null(upr) & !is.null(interval_span) & is.null(n_groups)) {
speed_low <- subset(df, speed < lwr, select = speed)
first_interval <- with(speed_low, c(min(speed), lwr))
middle_intervals <- seq(lwr + interval_span, upr - interval_span, interval_span)
speed_upp <- subset(df, speed > upr, select = speed)
last_interval <- with(speed_upp, c(upr, max(speed)))
intervals <- c(first_interval, middle_intervals, last_interval)
} else {
step <- with(df, c(max(speed) - min(speed))/n_groups)
intervals <- array(0L, dim = n_groups)
for(i in seq_len(n_groups)) {
intervals[i] <- min(df$speed) + i * step
}
}
return(intervals)
}
# three intervals with equal width
my_intervals <- create_intervals(n_groups = 3L)
# Compute values of speed when acceleration is greater than
# or equal to the 90th percentile
out <- lapply(1:(length(my_intervals)-1L), \(i) {
x <- subset(df, speed >= my_intervals[i] & speed <= my_intervals[i+1L])
x[x$acceleration >= quantile(x$acceleration, 0.9), ]
})
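Since the question explicitly asked what a for loop would look like, here is an equivalent, if less idiomatic, for-loop version of the lapply() step above:
out <- vector("list", length(my_intervals) - 1L)
for (i in seq_len(length(my_intervals) - 1L)) {
  x <- subset(df, speed >= my_intervals[i] & speed <= my_intervals[i + 1L])
  out[[i]] <- x[x$acceleration >= quantile(x$acceleration, 0.9), ]
}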
# function to round values to two decimal places
r <- \(x) format(round(x, 2), nsmall = 2L)
# assign names to each element of out
for(i in seq_along(out)) {
names(out)[i] <- paste0(r(my_intervals[i]), '-', r(my_intervals[i+1L]))
}
Output 1
> out
$`3.38-3.57`
speed acceleration
11 3.394378 2.583636
21 3.383631 2.267659
57 3.434123 2.300234
83 3.394886 2.580924
101 3.395459 2.460971
$`3.57-3.76`
speed acceleration
6 3.635234 1.447290
41 3.572868 1.618293
51 3.615017 1.420020
95 3.575412 1.763215
We could also compute the desired values of speed based on intervals that make more 'sense' than just equally spaced speed intervals, e.g. [min(speed), 3.3), [3.3, 3.45), [3.45, 3.6), and [3.6, max(speed)).
This can be accomplished by leaving n_groups unspecified and instead specifying lwr, upr, and an interval_span that makes sense. For instance, it makes sense to have an interval span of 0.15 when the lower limit is 3.3 and the upper limit is 3.6.
# custom boundaries based on a lower limit and upper limit
my_intervals <- create_intervals(lwr = 3.3, upr = 3.6, interval_span = 0.15)
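out (and its names) are then recomputed exactly as above; only my_intervals has changed:
out <- lapply(1:(length(my_intervals)-1L), \(i) {
  x <- subset(df, speed >= my_intervals[i] & speed <= my_intervals[i+1L])
  x[x$acceleration >= quantile(x$acceleration, 0.9), ]
})
for(i in seq_along(out)) {
  names(out)[i] <- paste0(r(my_intervals[i]), '-', r(my_intervals[i+1L]))
}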
Output 2
> out
$`3.18-3.30`
speed acceleration
37 3.238781 2.696456
82 3.258691 2.722076
$`3.30-3.45`
speed acceleration
11 3.394378 2.583636
19 3.328292 2.711825
73 3.315306 2.644580
83 3.394886 2.580924
$`3.45-3.60`
speed acceleration
4 3.520530 2.018930
40 3.517329 2.032943
58 3.485247 2.079893
67 3.458031 2.078545
$`3.60-3.76`
speed acceleration
6 3.635234 1.447290
34 3.688131 1.218969
51 3.615017 1.420020
78 3.628465 1.348873
Note: use function(x) instead of \(x) if you use a version of R <4.1.0
I have a dataframe of many rows with only one column, the column having strings of variable lengths, ranging from 30000 to 200000 characters (DNA sequences). Below is a sample of 150 characters:
TTCCCCAAACAGCAACTTTAAGGAGCAGCTTCCTTTATGATCCCTGATTGCCTCCCCTTTGTTCCCATAACAAGTAGTTTAAATTTTCTGTTAAAGTCCAAACCACATATTTACAATACCTCGCACC
Here is the full dataset: https://drive.google.com/open?id=1f9prtKW5NnS-BLI5lqsl4FEi4PvRfxGR
I have a code in R, which divides each row into 20 bins depending on its length, and counts the occurrence of G's and C's for each bin, and gives me back a matrix of 20 columns. Here is the code:
library(data.table)
data <- fread("string.fa", header = F)
library(stringr)
loopchar <- function(data){
  bins <- sapply(seq(1, nchar(data), nchar(data)/20), function(x) substr(data, x, x + nchar(data)/20 - 1))
  (str_count(bins, "G")/nchar(bins) + str_count(bins, "C")/nchar(bins))*100
}
result <- data.frame(t(apply(data,1,loopchar)))
However, now I want to do something different. Instead of nchar(data)/20, I want the substring segments (20) to vary from a list I have. So now for my data frame, the first row should be divided into 22 bins/segments, and the code would be nchar(data)/22.
The second row should be divided into 21 bins, and the code would be nchar(data)/21, and so on. I want the function to keep changing the number of bins for the data. My dataframe of strings and my vector of bin counts are of the same length.
What is the best way to do this?
It's more natural to use some of Bioconductor's libraries for such tasks. In my case I use Biostrings, but maybe you could find another way.
Data
Your file is too big, so I have created a temporary text file which contains random DNA on each line:
# set seed to create reproducible example
set.seed(53101614)
# create an example text file in memory
temp <- tempfile()
writeLines(
  sapply(1:100, function(i){
    paste(sample(c("A", "T", "C", "G"), sample(100:6000, 1),
                 replace = TRUE), collapse = "")
  }),
  con = temp
)
# read lines from tmp file
dna <- readLines(temp)
# unlink file
unlink(temp)
Data preprocessing
Creating Biostrings::DNAStringSet object
Using the Biostrings::DNAStringSet() function, we can convert the character vector into a DNAStringSet object. Note that I assume that all the records are in the standard DNA alphabet, i.e. each string contains only A, T, C, G symbols. If that does not hold in your case, refer to the Biostrings documentation.
library(Biostrings)
dna <- DNAStringSet(dna, use.names = FALSE)
# inspect the output
dna
A DNAStringSet instance of length 100
width seq
[1] 2235 GGGCTTCCGGTGGTTGTAGGCCCATAAGGTGGGAAATATACA...GAAACGTCGACAAGATACAAACGAGTGGTCAACAGGCCAGCC
[2] 1507 ATGCGGTCTATCTACTTGTTCGGCCGAACCTTGAGGGCAGCC...AACGCTTTGTACCTGTCCCAGAGTCAGAAGTAACAGTTTAGC
[3] 1462 CATTGGAGTACATAGGGTATTCCCTCTCGTTGTATAACTCCA...TCCTACTTGCGAAGGCAGTCGCACACAAGGGTCTATTTCGTC
[4] 1440 ATGCTACGTTGGTAGGGTAACGCAGACTAGAACCACACGGGA...ATAAAGCCGTCACAAGGAATGTTAGCACTCAATGGCTCGCTA
[5] 3976 AAGCGGAAGTACACGTACCCGCGTAGATTACGTATAGTCGCC...TTACGCGTTGCTCAAATCGTTCGGTGCAGTTTTATAGTGATG
... ... ...
[96] 4924 AGTAAGCAGATCCAGAGTACTGTGAAAGACGTCAGATCCCGA...TATAATGGGTTGCGTGTTTGATTCTGCCATGAATCCTATGTT
[97] 5702 CCTGAAGAGGACGTTTCCCCCTACATCCAGTAGTATTGGTGT...TCTGCTTTGCGCGGCGGGGCCGGACTGTCCATGGCTCACTTG
[98] 5603 GCGGCTGATTATTGCCCGTCTGCCTGCATGATCGAGCAGAAC...CTCTTTACATGCTCATAGGAATCGGCAACGAAGGAGAGAGTC
[99] 3775 GGCAAGACGGTCAGATGTTTTGATGTCCGGGCGGATATCCTT...CGCTGCCCGTGACAATAGTTATCATAAGGAGACCTGGATGGT
[100] 407 TGTCGCAACCTCTCTTGCACGTCCAATTCCCCGACGGTTCTA...GCGACATTCCGGAGTCTGCGCAGCCTATGTATACCCTACAGA
Create a vector k of random bin counts
set.seed(53101614)
k <- sample(100, 100, replace = T)
# inspect the output
head(k)
[1] 37 32 63 76 19 41
Create a Views object where each DNA sequence is represented by N = k[i] chunks
It is much easier to solve your problem using the IRanges::Views container. This thing is furiously fast and beautiful.
First of all, we divide each DNA sequence into k[i] ranges:
seqviews <- lapply(seq_along(dna), function(i){
seq = dna[[i]]
seq_length = length(seq)
starts = seq(1, seq_length - seq_length %% k[i], seq_length %/% k[i])
Views(seq, start = starts, end = c(starts[-1] - 1, seq_length))
}
)
# inspect the output for k[2] and seqviews[[2]]
k[2]
[1] 32
seqviews[[2]]
Views on a 1507-letter DNAString subject
subject: ATGCGGTCTATCTACTTG...GTCAGAAGTAACAGTTTAG
views:
start end width
[1] 1 47 47 [ATGCGGTCTATCTACTTGTTCGGCCGAACCTTGAGGGCAGCCAGCTA]
[2] 48 94 47 [ACCGCCGGAGACCTGAGTCCACCACACCCATTCGATCTCCATGGTTG]
[3] 95 141 47 [GCGCTCTCCGAGGTGCCACGTCAAGTTGTACTACTCTCTCAGACCTC]
[4] 142 188 47 [TTGTTAGAAGTCCCGAGGTATATGCGCAATACCTCAACCGAAGCGCC]
[5] 189 235 47 [TGATGAGCAAACGTTTCTTATAGTCGCGACCTTGTCCCGAGGACTTG]
... ... ... ... ...
[28] 1270 1316 47 [AGGCGAGGGCAGGGCACATGTTTCTACAGTGAGGCGTGATCCGCTCC]
[29] 1317 1363 47 [GAGGCAAGCTCGTGAACTGTCGTGGCAAGTTACTTATGAGGATGTCA]
[30] 1364 1410 47 [TGGGCAGATGCAACAGACTGCTATTGGCGGGAGAGAGGCATCGACAT]
[31] 1411 1457 47 [ACCGTCTCAAGTACCACAGCTGAGAGGCTCTCGTGGAGATGCGCACA]
[32] 1458 1507 50 [TGAGTCGTAACGCTTTGTACCTGTCCCAGAGTCAGAAGTAACAGTTTAGC]
After that, we check whether all sequences have been divided into the desired number of chunks:
all(sapply(seq_along(k), function(i) k[i] == length(seqviews[[i]])))
[1] TRUE
Important observation
Before we proceed, there is one important observation about your function.
Your function produces N chunks of variable length, because the indices it produces are floats rather than integers, and substr(), when called, truncates the provided indices to integers.
As an example, extracting the 1st record from the dna set and splitting this sequence into 37 bins using your code will produce the following results:
dna_1 <- as.character(dna[[1]])
sprintf("DNA#1: %d bp long, 37 chunks", nchar(dna_1))
[1] "DNA#1: 2235 bp long, 37 chunks"
bins <- sapply(seq(1, nchar(dna_1), nchar(dna_1)/37),
function(x){
substr(dna_1, x, x + nchar(dna_1)/37 - 1)
}
)
bins_length <- sapply(bins, nchar)
barplot(table(bins_length),
xlab = "Bin's length",
ylab = "Count",
main = "Bin's length variability"
)
The approach I use in my code, when length(dna[[i]]) %% k[i] != 0 (i.e. there is a remainder), produces k[i] - 1 bins of equal length; only the last bin has its length equal to length(dna[[i]]) %/% k[i] + length(dna[[i]]) %% k[i]:
bins_length <- unlist(lapply(seqviews, width))  # widths of every bin across all sequences
barplot(table(bins_length),
xlab = "Bin's length",
ylab = "Count",
main = "Bin's length variability"
)
GC content calculation
As mentioned above, Biostrings::letterFrequency() applied to IRanges::Views allows you to calculate the GC content easily:
Find the GC frequency for each bin in every DNA sequence
GC <- lapply(seqviews, letterFrequency, letters = "GC", as.prob = TRUE)
Convert to percents
GC <- lapply(GC, "*", 100)
Inspect the output
head(GC[[1]])
G|C
[1,] 53.33333
[2,] 46.66667
[3,] 50.00000
[4,] 55.00000
[5,] 60.00000
[6,] 45.00000
Plot GC content for DNAs 1:9
par(mfrow = c(3, 3))
invisible(
lapply(1:9, function(i){
plot(GC[[i]],
type = "l",
main = sprintf("DNA #%d, %d bp, %d bins", i, length(dna[[i]]), k[i]),
xlab = "N bins",
ylab = "GC content, %",
ylim = c(0, 100)
)
abline(h = 50, lty = 2, col = "red")
}
)
)
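If, like your original loopchar() output, you would rather have a single rectangular matrix than a list, one possible conversion is the following sketch (gc_mat is a hypothetical name; rows are padded with NA because the bin counts differ between sequences):
gc_mat <- t(sapply(GC, function(g) {
  v <- as.vector(g)
  length(v) <- max(k)  # `length<-` pads the shorter rows with NA
  v
}))
dim(gc_mat)  # 100 sequences x max(k) bins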
I have a data frame that has xyz coordinates and another variable, Amp.
data.frame(xx,yy,zz,Amp)
xx yy zz Amp
1 63021.71 403205.0 1.181028516 1170
2 63021.71 403105.0 0.977028516 1381
3 63021.71 403105.0 0.861028516 807
4 63021.71 403105.0 0.784028516 668
5 53021.71 403105.0 0.620028516 19919
6 53021.71 403305.0 0.455028516 32500
7 53021.71 403105.0 0.446028516 32500
8 43021.71 403105.0 0.436028516 32500
9 43021.71 404105.0 0.426028516 32500
10 43021.71 403105.0 0.281028516 17464
First I want to create a regular grid for xyz.
Next I want to fill this grid with the Amp values.
I would like to do this using arrays.
Any help would be much appreciated.
I would like the final result to look like this:
dim(Amp)
10 10 10
You do not have enough data in your MWE to create a 10x10x10 array without interpolation. Currently you have 3 unique xx values, 4 unique yy values, and 10 unique zz values. So you could create a 3x4x10 array, but you don't have enough values in Amp to assign to each point in a 3x4x10 3D regular grid. You only have 10 Amp values, describing 10 unique points in 3D space. A 3x4x10 regular grid array would have 120 Amp values, one for each point in the grid. Furthermore, values in a regular grid are equally spaced in each dimension and your yy and zz values are not equally spaced.
Check the spacing in each dimension:
> diff(sort(unique(xx)))
[1] 10000 10000
> diff(sort(unique(yy)))
[1] 100 100 800
> diff(sort(unique(zz)))
[1] 0.145 0.010 0.010 0.009 0.165 0.164 0.077 0.116 0.204
The current MWE looks like this in 3D:
library(rgl)
plot3d(xx,yy,zz, col="red")
To form a 10x10x10 regular grid, you need to convert your dataset into one that has 1000 coordinate points and Amp values. I'm not exactly sure how you'd like to do this, but here's an example based on the current data:
# MWE data
xx = c(63021.71,63021.71,63021.71,63021.71,53021.71,53021.71,53021.71,43021.71,43021.71,43021.71)
yy = c(403205,403105,403105,403105,403105,403305,403105,403105,404105,403105)
zz = c(1.181028516,0.977028516,0.861028516,0.784028516,0.620028516,0.455028516,0.446028516,0.436028516,0.426028516,0.281028516)
Amp = c(1170,1381,807,668,19919,32500,32500,32500,32500,17464)
# create equally-spaced vectors of 10 values in each dimension
xx <- seq(min(xx), max(xx), length.out = 10)
yy <- seq(min(yy), max(yy), length.out = 10)
zz <- seq(min(zz), max(zz), length.out = 10)
# fake up some Amp data points
set.seed(123)
Amp <- runif(1000, min = min(Amp), max=max(Amp))
# directly create a 10x10x10 regular grid of Amp values as an array
dfa <- array(data = Amp,
dim = c(10,10,10),
dimnames = list(xx,yy,zz)
)
> dim(dfa)
[1] 10 10 10
# Alternatively, make a data.frame first
df <- data.frame(expand.grid(xx,yy,zz))
names(df) <- c("xx","yy","zz")
df$Amp <- Amp
dfa <- array(data = df$Amp,
dim=c(length(unique(df$xx)),
length(unique(df$yy)),
length(unique(df$zz))),
dimnames=list(unique(df$xx), unique(df$yy), unique(df$zz))
)
# you'll want to verify that the Amp values were assigned to the correct xyz coordinates.
# Here's a little function to help:
get_arr_loc = function(x, y, z) {
x + (y-1)*10 + (z-1)*100
}
# and some arbitrary coordinates checked. This could be done in a more systematic way...
> df[get_arr_loc(1,1,1), "Amp"] == dfa[1,1,1]
[1] TRUE
> df[get_arr_loc(10,2,1), "Amp"] == dfa[10,2,1]
[1] TRUE
> df[get_arr_loc(3,6,9), "Amp"] == dfa[3,6,9]
[1] TRUE
> df[get_arr_loc(10,10,10), "Amp"] == dfa[10,10,10]
[1] TRUE
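Since array() fills dfa in the same column-major order in which expand.grid() generated df, a more systematic check is a single comparison across all 1000 values:
# should be TRUE: every Amp value landed on its intended coordinate
all(df$Amp == as.vector(dfa))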
I have a dataset of species and their rough locations in a 100 x 200 meter area. The location part of the data frame is not in a format that I find to be usable. In this 100 x 200 meter rectangle, there are two hundred 10 x 10 meter squares named A through CV. Within each 10 x 10 square there are four 5 x 5 meter squares named 1, 2, 3, and 4, respectively (1 is south of 2 and west of 3; 4 is east of 2 and north of 3). I want to let R know that A is the square with corners at (0,0), (10,0), (10,10), and (0,10), that B is just north of A and has corners (0,10), (0,20), (10,10), and (10,20), and K is just east of A and has corners at (10,0), (10,10), (20,0), and (20,10), and so on for all the 10 x 10 meter squares. Additionally, I want to let R know where each 5 x 5 meter square is in the 100 x 200 meter plot.
So, my data frame looks something like this
10x10 5x5 Tree Diameter
A 1 tree1 4
B 1 tree2 4
C 4 tree3 6
D 3 tree4 2
E 3 tree5 3
F 2 tree6 7
G 1 tree7 12
H 2 tree8 1
I 2 tree9 2
J 3 tree10 8
K 4 tree11 3
L 1 tree12 7
M 2 tree13 5
Eventually, I want to be able to plot the 100 x 200 meter area and have each 10 x 10 meter square show up with the number of trees, or number of species, or total biomass.
What is the best way to turn the data I have into spatial data that R can use for graphing and perhaps analysis?
Here's a start.
## set up a vector of all 10x10 position tags
tags10 <- c(LETTERS,
paste0("A",LETTERS),
paste0("B",LETTERS),
paste0("C",LETTERS[1:22]))
A function to convert (e.g.) {"J",3} to the center of the corresponding sub-square.
convpos <- function(pos10,pos5) {
  ## convert letters to the corner coordinates of the major 10x10 squares
  p1 <- as.numeric(factor(pos10,levels=tags10)) ## or use match()
  p1.x <- ((p1-1) %% 10)*10  ## %% is modulo operator
  p1.y <- ((p1-1) %/% 10)*10 ## %/% is integer division
  ## sort out sub-positions (centers of the 5x5 squares within a major square)
  p2.x <- ifelse(pos5 <= 2, 2.5, 7.5) ## {1,2} vs {3,4} values
  p2.y <- ifelse(pos5 %% 2 == 1, 2.5, 7.5) ## odd {1,3} vs even {2,4} values
  cbind(x = p1.x + p2.x, y = p1.y + p2.y)
}
usage:
convpos("J",2)
convpos(mydata$tenbytenpos,mydata$fivebyfivepos)
Important notes:
this is a proof of concept, I can pretty much guarantee I haven't got the correspondence of x and y coordinates quite right. But you should be able to trace through this line-by-line and see what it's doing ...
it should work correctly on vectors (see second usage example above): I switched from switch to ifelse for that reason
your column names (10x10) are likely to get mangled into something like X10.10 when reading data into R: see ?data.frame and ?check.names, and the short example below
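For example (read.csv and the file name are just placeholders to illustrate the option):
## keep "10x10" as a literal column name instead of the mangled "X10.10"
mydata <- read.csv("trees.csv", check.names = FALSE)
mydata$`10x10`  ## non-syntactic names need backticks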
Similar to what @Ben Bolker has done, here's a lookup function (though you may need to transpose something to make the labels match what you describe).
tenbyten <- c(LETTERS[1:26],
paste0("A",LETTERS[1:26]),
paste0("B",LETTERS[1:26]),
paste0("C",LETTERS[1:22]))
tenbyten <- matrix(rep(tenbyten, each = 2), ncol = 10)
tenbyten <- t(apply(tenbyten, 1, function(x){rep(x, each = 2)}))
# the 1234 squares
squares <- matrix(c(rep(c(1,2),10),rep(c(4,3),10)), nrow = 20, ncol = 20)
# stick together into a reference grid
my.grid <- matrix(paste(tenbyten, squares, sep = "-"), nrow = 20, ncol = 20)
# a lookup function for the site grid
coordLookup <- function(tbt, fbf, .my.grid = my.grid){
x <- col(.my.grid) * 5 - 2.5
y <- row(.my.grid) * 5 - 2.5
marker <- .my.grid == paste(tbt, fbf, sep = "-")
list(x = x[marker], y = y[marker])
}
coordLookup("BB",2)
$x
[1] 52.5
$y
[1] 37.5
If this isn't what you're looking for, then maybe you'd prefer a SpatialPolygonsDataFrame, which has proper polygon IDs and which you can attach data to, etc. In that case just Google around for how to make one from scratch, and manipulate the row() and col() functions to get your polygon corners, similar to what's given in this lookup function, which only returns centroids.
Edit: getting SPDF started:
This is modified from the function example and can hopefully be a good start:
library(sp)
# really you have a 20x20 grid, counting the small ones.
# GridTopology takes the centre of the first (lower-left) cell,
# the cell size, and the number of cells in each dimension
grd <- GridTopology(c(2.5,2.5), c(5,5), c(20,20))
polys <- as.SpatialPolygons.GridTopology(grd)
# get centroids
coords <- coordinates(polys)
# make SPDF, with an extra column for your grid codes, taken from the above.
# you can add further columns to this data.frame(), using polys@data
polys <- SpatialPolygonsDataFrame(polys,
    data=data.frame(x=coords[,1], y=coords[,2], my.ID = as.vector(my.grid),
                    row.names=row.names(polys)))
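From there, one hedged sketch of the "trees per square" plot the question asks about (tenbytenpos and fivebyfivepos are placeholder column names, and this assumes the my.ID ordering really does line up with the polygons, which is worth verifying):
## count trees per 5x5 cell and display the counts on the grid
mydata$cell <- paste(mydata$tenbytenpos, mydata$fivebyfivepos, sep = "-")
polys$n_trees <- as.vector(table(factor(mydata$cell, levels = polys$my.ID)))
spplot(polys, "n_trees")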