Apply a function to multiple vectors/lists in R

I made a function g to calculate the maximum likelihood of a multivariate normal distribution in order to classify the rows of a data frame test into 34 classes. So I need to apply this function over a data frame with several different sets of parameters. My code works, but it's too slow. I want to make it faster, maybe by removing the for loop and using some function from the apply family (with which I have no experience).
g = function(x,p,mu,Sigma){
log(p) - log(det(Sigma)) - as.matrix(x-mu)%*%solve(Sigma)%*%t(as.matrix(x-mu))
}
mu = summaryBy(.~class,train,FUN=mean)
Sigma = by(train[,1:8],train$class,cov)
p = as.data.frame(table(train$class))
p$Freq = p$Freq/n
k = length(levels(train$class))
logver = NULL
for(j in 1:k){
logver = cbind(logver,apply(test,1,g,p=p$Freq[j],mu=mu[j,-1],Sigma=Sigma[[j]]))
}
preds = apply(logver,1,which.max)
The output logver must be a data frame with one row for each row of test and one column for each j, so in this case 340000 rows and 34 columns.
p$Freq is a numeric vector of length 34. mu is a 34x9 data frame (the first column is a factor generated by summaryBy). Sigma is a list of 34 elements, each an 8x8 covariance matrix. test is a 340000x8 data frame.
head(test)
band1 band2 band3 band4 band5 band6 band7 band9
1 2.0592 4.3630 6.6506 10.5952 18.4566 37.3683 36.9154 33.9467
2 2.5772 4.0766 6.0116 10.1476 18.8585 36.7654 36.2221 33.3717
3 2.8240 4.0766 6.4183 9.6813 18.5148 37.3113 35.7318 33.8367
4 2.8999 4.4317 7.4529 10.2842 18.4566 37.2513 37.3219 33.8367
5 2.8684 3.5324 7.5845 10.9021 19.2262 37.4758 36.0219 33.3472
6 2.4069 4.3256 6.0241 10.6668 20.0381 36.7203 36.4816 33.3472
head(train)
band1 band2 band3 band4 band5 band6 band7 band9 class
1 5.1224 8.1723 11.6837 15.6408 22.5884 33.9782 32.2985 32.1805 Green
2 5.4430 6.1158 9.8344 14.7719 23.2234 34.1247 32.0722 32.4367 Dry
3 4.5048 7.7364 11.9494 15.7740 22.6291 33.7642 32.4599 32.5217 Conifer
4 5.3120 6.9558 9.8344 14.3223 22.8088 34.0513 32.2985 32.4045 Snow
5 5.2907 6.6837 10.5367 16.0684 22.6291 33.7156 32.1650 32.4900 Ice
6 5.3120 6.8131 10.9727 15.7114 22.8088 34.6136 32.4772 32.4367 Soil
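As a side note on removing the for loop: the row-wise quadratic form inside g is what stats::mahalanobis() computes, so each class j could in principle be handled with one vectorized call instead of apply() over the rows of test. A sketch only, untested, using the objects described above:
# sketch (untested): mahalanobis() returns (x-mu) %*% solve(Sigma) %*% t(x-mu) for every row of test at once
logver = sapply(1:k, function(j) {
log(p$Freq[j]) - log(det(Sigma[[j]])) -
mahalanobis(test, center = unlist(mu[j, -1]), cov = Sigma[[j]])
})
preds = apply(logver, 1, which.max)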
Thank you all


Prediction with lm

I have the following data frame:
lm mean resids sd resids resid 1 resid 2 resid 3 intercept beta
1 0.000000e+00 6.2806844 -3.6261548 7.2523096 -3.6261548 103.62615 24.989340
2 -2.960595e-16 8.7515899 -5.0527328 10.1054656 -5.0527328 141.96786 -1.047323
3 -2.960595e-16 5.9138984 -3.4143908 6.8287817 -3.4143908 206.29046 -26.448694
4 3.700743e-17 0.5110845 0.2950748 -0.5901495 0.2950748 240.89801 -35.806642
5 7.401487e-16 6.6260504 3.8255520 -7.6511040 3.8255520 187.03479 -23.444762
6 5.921189e-16 8.7217431 5.0355007 -10.0710014 5.0355007 41.43239 3.138396
7 0.000000e+00 5.5269434 3.1909823 -6.3819645 3.1909823 -119.90628 27.817845
8 -1.480297e-16 1.0204260 -0.5891432 1.1782864 -0.5891432 -180.33773 35.623363
9 -5.921189e-16 6.9488186 -4.0119023 8.0238046 -4.0119023 -64.72245 21.820226
10 -8.881784e-16 8.6621512 -5.0010953 10.0021906 -5.0010953 191.65339 -5.218767
Each row represents an estimated linear model with window length 3. I used rollapply on a separate data frame with the function lm(y~t) to extract the coefficients and intercepts into a new data frame, which I have combined with the residuals from the same model and their corresponding means and standard deviations.
Since the window length is 3, it implies that there are 3 residuals as shown, per model, in resid 1, resid 2 and resid 3. The mean and sd of these are included accordingly.
I am seeking to predict the next observation, in essence, k+1, where k is the window length, using the intercept and beta.
Recall that lm1 takes observations 1,2,3 to estimate the intercept and the beta, and lm2 takes 2,3,4, lm3 takes 3,4,5, etc. The function for the prediction should be:
predict_lm1 = intercept_lm1 + beta_lm1*(k+1)
Where k+1 = 4. For lm2:
predict_lm2 = intercept_lm2 + beta_lm2*(k+1)
Where k+1 = 5.
Clearly, k increases by 1 every time I move down one row in the dataset. This is because the explanatory variable is time, t, which is a sequence increasing by one per observation.
Should I use a for loop, or an apply function here?
How can I make a function that iterates down the rows and calculates the predictions accordingly with the information found in that row?
Thanks.
EDIT:
I managed to find a possible solution by writing the following:
n=nrow(dataset)
for(i in n){
predictions = dataset$Intercept + dataset$beta*(k+1)
}
However, k does not increase by 1 per iteration. Thus, k+1 is always = 4.
How can I make sure k increases by 1 accordingly?
EDIT 2
I managed to add 1 to k by writing the following:
n=nrow(dataset)
for(i in n){
x = 0
x[i] = k + 1
preds = dataset$`(Intercept)` + dataset$t*(x[i])
}
However, the first prediction is overestimated. It should be 203, whereas it is estimated as 228, implying that it sets the explanatory variable as 1 too high.
Yet, the second prediction is correct. I am not sure what I am doing wrong. Any advice?
EDIT 3
I managed to find a solution as follows:
n=nrow(dataset)
for(i in n){
x = k + 1
preds = dataset$`(Intercept)` + dataset$t*(x)
x = x + 1
}
Your loop is not iterating:
dataset <- read.table(text="lm meanresids sdresids resid1 resid2 resid3 intercept beta
1 0.000000e+00 6.2806844 -3.6261548 7.2523096 -3.6261548 103.62615 24.989340
2 -2.960595e-16 8.7515899 -5.0527328 10.1054656 -5.0527328 141.96786 -1.047323
3 -2.960595e-16 5.9138984 -3.4143908 6.8287817 -3.4143908 206.29046 -26.448694
4 3.700743e-17 0.5110845 0.2950748 -0.5901495 0.2950748 240.89801 -35.806642
5 7.401487e-16 6.6260504 3.8255520 -7.6511040 3.8255520 187.03479 -23.444762
6 5.921189e-16 8.7217431 5.0355007 -10.0710014 5.0355007 41.43239 3.138396
7 0.000000e+00 5.5269434 3.1909823 -6.3819645 3.1909823 -119.90628 27.817845
8 -1.480297e-16 1.0204260 -0.5891432 1.1782864 -0.5891432 -180.33773 35.623363
9 -5.921189e-16 6.9488186 -4.0119023 8.0238046 -4.0119023 -64.72245 21.820226
10 -8.881784e-16 8.6621512 -5.0010953 10.0021906 -5.0010953 191.65339 -5.218767", header=T)
n <- nrow(dataset)
predictions <- data.frame()
for(i in 1:n){
k <- i ##not sure where k is coming from but put it here
predictions <- rbind(predictions, dataset$intercept[i] + dataset$beta[i]*(k+1))
}
predictions
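For reference, if the intent from the question is that model i (fit on observations i, i+1, i+2) predicts the observation at time i + 3, the loop could also be replaced by one vectorized line. A sketch, untested, assuming the same dataset as above:
# sketch: window length 3, so model i predicts at time i + 3
predictions <- dataset$intercept + dataset$beta * (seq_len(nrow(dataset)) + 3)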

Variable length formula construction

I am trying to apply Simpson's Diversity Index across a number of different datasets with a variable number of species ('nuse') captured. As such, I am trying to construct code which can cope with this automatically, without needing to construct a formula manually each time. An example dataset for a manual formula is below:
diverse <- data.frame(nuse1=c(0,20,40,20), nuse2=c(5,5,3,20), nuse3=c(0,2,8,20), nuse4=c(5,8,2,20), total=c(10,35,53,80))
simp <- function(x) {
total <- x[,"total"]
nuse1 <- x[,"nuse1"]
nuse2 <- x[,"nuse2"]
nuse3 <- x[,"nuse3"]
nuse4 <- x[,"nuse4"]
div <- round(((1-(((nuse1*(nuse1 - 1)) + (nuse2*(nuse2 - 1)) + (nuse3*(nuse3 - 1)) + (nuse4*(nuse4 - 1)))/(total*(total - 1))))),digits=4)
return(div)
}
diverse$Simpson <- simp(diverse)
diverse
As you can see, this works fine. However, how would I be able to create a function which could automatically adjust to, for example, 9 species (so up to nuse9)?
I have experimented with the paste function + as.formula as indicated here: Formula with dynamic number of variables; however, it is the expanded form of (nuse1 * (nuse1 - 1)) that I'm struggling with. Does anyone have any suggestions? Thanks.
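For what it's worth, the expanded expression itself can be built as a string with paste0; a sketch, where nspec is a placeholder for the number of species columns:
nspec <- 4 # placeholder: number of nuse columns
terms <- paste0("(nuse", seq_len(nspec), " * (nuse", seq_len(nspec), " - 1))")
expr_string <- paste(terms, collapse = " + ")
expr_string
# "(nuse1 * (nuse1 - 1)) + (nuse2 * (nuse2 - 1)) + (nuse3 * (nuse3 - 1)) + (nuse4 * (nuse4 - 1))"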
How about something like:
diverse <- data.frame(nuse1=c(0,20,40,20), nuse2=c(5,5,3,20), nuse3=c(0,2,8,20), nuse4=c(5,8,2,20), total=c(10,35,53,80))
simp <- function(x, species) {
spcs <- grep(species, colnames(x)) # which column names have "nuse"
total <- rowSums(x[,spcs]) # sum by row
div <- round(1 - rowSums(apply(x[,spcs], 2, function(s) s*(s-1))) / (total*(total - 1)), digits = 4)
return(div)
}
diverse$Simpson2 <- simp(diverse, species = "nuse")
diverse
# nuse1 nuse2 nuse3 nuse4 total Simpson2
# 1 0 5 0 5 10 0.5556
# 2 20 5 2 8 35 0.6151
# 3 40 3 8 2 53 0.4107
# 4 20 20 20 20 80 0.7595
All it does is find the columns whose names match the species pattern ("nuse", or whatever other species name you pass in). It constructs the "total" value within the function, so it does not require a total column in the dataset.
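Note that grep() matches the pattern anywhere in the column name; to restrict the match to names that literally begin with the species prefix, the pattern can be anchored. A small variation, untested:
spcs <- grep(paste0("^", species), colnames(x)) # anchor the match to the start of the name
# or, without a regular expression (base R >= 3.3)
spcs <- which(startsWith(colnames(x), species))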

How can I iteratively do clustering for different cluster (k) values

I have the following PCA data on which I am doing Kmeans clustering:
head(pcdffinal)
PC1 PC2 PC3 PC4 PC5 PC6
1 -9.204228 -2.73517110 2.7975063 0.6794614 -0.84627095 0.4455297
2 2.927245 0.05666389 0.5085896 0.1472800 0.18193152 0.1041490
3 -4.667932 -1.98176361 2.2751862 0.5347725 -0.43314927 0.3222719
4 -1.366505 -0.40858595 0.5005192 0.4507366 -0.54996933 0.5533013
5 -4.689454 -2.77185636 2.4323856 0.7387788 0.49237229 -0.4817083
6 -3.477046 -1.84904214 1.5539558 0.5463861 -0.03231143 0.2814843
opt.cluster<-3
set.seed(115)
pccomp.km <- kmeans(pcdffinal,opt.cluster,nstart=25)
head(pccomp.km$cluster)
[1] 2 1 2 2 2 2
barplot(table(pccomp.km$cluster), col="steelblue")
pccomp.km$tot.withinss #For total within cluster sum of squares.
[1] 13172.59
We can also use a plot to illustrate the groups that the data have been arranged into.
par(mfrow=c(1,1))
plot(pcdffinal[,1:2],col=(pccomp.km$cluster+1),main=paste('K-Means Clustering result with k = ', opt.cluster,sep=" "),pch=20,cex=2)
points(pccomp.km$centers, pch=15,cex=2)#plotting the centres of the cluster as black squares
library("factoextra")
fviz_cluster(pccomp.km, data = pcdffinal, frame.type = "convex")+ theme_minimal()
df.num_kmeans<-df.num
df.num_kmeans$cluster.kmeans <- pccomp.km$cluster # vector of cluster assignments from kmeans(), added as a column to the original dataset
# save this dataset & kmeans model for further use
saveRDS(pccomp.km, "kmeans_model.RDS")
write.csv(df.num_kmeans,"dfnum_kmeans.cluster.csv")
library(cluster)
clusplot(df.num_kmeans,pccomp.km$cluster,color = TRUE,shade=TRUE,labels = 2,lines = 0)
library(ggfortify)
autoplot(pccomp.km, data=pcdffinal, frame=TRUE,frame.type='norm')
I would like to run Kmeans iteratively for a range of k values, say k = 2:6, each time making the plots for the respective k and saving both the model and the data as a csv, each done separately for the different k's.
I need help converting the above code into a loop with the counter i going from 2 to 6.
original data:
head(df.num_kmeans)
datausage mou revenue calldrop handset2g handset3g smartphone
1 896804.7 2854801 40830.404 27515 7930 19040 20810
2 155932.1 419109 5512.498 5247 2325 2856 3257
3 674983.3 2021183 25252.265 21068 6497 13056 14273
4 522787.2 1303221 14547.380 8865 4693 9439 10746
5 523465.7 1714641 24177.095 25441 8668 12605 14766
6 527062.3 1651303 20153.482 18219 6822 11067 12994
rechargecount rechargesum arpu subscribers
1 4461 235430 197704.10 105822
2 843 39820 34799.21 18210
3 2944 157099 133842.38 71351
4 2278 121697 104681.58 44975
5 2802 144262 133190.55 75860
6 2875 143333 119389.91 63740
Using random forest for accuracy comparison
dfnum.kmeans <- read.csv("dfnum_kmeans.cluster.csv")
table(dfnum.kmeans$cluster.kmeans) # size of each cluster
# convert cluster var into a factor
dfnum.kmeans$cluster.kmeans <- as.factor(dfnum.kmeans$cluster.kmeans)
is.factor(dfnum.kmeans$cluster.kmeans)
# create training and test sets (75:25 split) using the 'caret' package
set.seed(128) # for reproducibility
inTrain_kmeans <- caret::createDataPartition(y = dfnum.kmeans$cluster.kmeans, p = 0.75, list = FALSE)
training_kmeans <- dfnum.kmeans[inTrain_kmeans, ]
testing_kmeans <- dfnum.kmeans[-inTrain_kmeans, ]
set.seed(122)
control <- trainControl(method = "repeatedcv", number = 10,allowParallel = TRUE)
modFit.rfcaret_kmeans <- caret::train(cluster.kmeans~ ., method = "rf",data = training_kmeans, trControl = control, number = 25)
modFit.rfcaret_kmeans$finalModel
pred.test_kmeans = predict(modFit.rfcaret_kmeans, testing_kmeans); confusionMatrix(pred.test_kmeans, testing_kmeans$cluster.kmeans )
confusionMatrix(pred.test_kmeans, testing_kmeans$cluster.kmeans )$overall[1]
Assuming that your original dataframe is df.num, the following could save all the files (for different k values) in your working directory:
for (k in 2:6) {
set.seed(115)
pccomp.km <- kmeans(pcdffinal,k,nstart=25)
head(pccomp.km$cluster)
print(paste(k, pccomp.km$tot.withinss)) #For total within cluster sum of squares.
png(paste0('kmeans_proj_',k, '.png'))
par(mfrow=c(1,1))
plot(pcdffinal[,1:2],col=(pccomp.km$cluster+1),main=paste('K-Means Clustering result with k = ', k,sep=" "),pch=20,cex=2)
points(pccomp.km$centers, pch=15,cex=2)#plotting the centres of the cluster as black squares
dev.off()
png(paste0('kmeans_fviz_',k, '.png'))
print(fviz_cluster(pccomp.km, data = pcdffinal, frame.type = "convex")+ theme_minimal())
dev.off()
df.num_kmeans<-df.num
df.num_kmeans$cluster.kmeans <- pccomp.km$cluster # vector of cluster assignments from kmeans(), added as a column to the original dataset
saveRDS(pccomp.km, paste0("kmeans_model_", k, ".RDS"))
write.csv(df.num_kmeans,paste0("dfnum_kmeans_", k, ".cluster.csv"))
png(paste0('clusplot_',k, '.png'))
clusplot(df.num_kmeans,pccomp.km$cluster,color = TRUE,shade=TRUE,labels = 2,lines = 0)
dev.off()
png(paste0('autoplot_',k, '.png'))
print(autoplot(pccomp.km, data=pcdffinal, frame=TRUE,frame.type='norm'))
dev.off()
}
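If you also want to compare the k values afterwards, the total within-cluster sum of squares printed inside the loop can instead be collected into a vector, for example for an elbow plot. A sketch, untested, assuming pcdffinal as above:
# sketch: collect tot.withinss over the same range of k for an elbow plot
wss <- sapply(2:6, function(k) {
set.seed(115)
kmeans(pcdffinal, k, nstart = 25)$tot.withinss
})
plot(2:6, wss, type = "b", xlab = "k", ylab = "total within-cluster sum of squares")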

Randomize data between two columns in R

I have searched for an answer or a solution to this task with no success as of yet, so I do apologize if this is redundant.
I want to randomize the data between two columns. This is to simulate species misidentification in vegetation field data, so I want to assign some sort of probability of misidentification between the two columns as well. I would imagine that there is some way to do this using sample or the "permute" package.
I will select some readily available data for an example.
library (vegan)
data (dune)
If you type head(dune), you can see that this is a data frame with sites as rows and species as columns. For convenience's sake, we can presume some field tech has the potential to misidentify Poa pratensis and Poa trivialis.
poa = data.frame(Poaprat=dune$Poaprat,Poatriv=dune$Poatriv)
head(poa)
Poaprat Poatriv
1 4 2
2 4 7
3 5 6
4 4 5
5 2 6
6 3 4
What would be the best way to randomize the values between these two columns (transferring values between each other and/or adding to one when both are present)? The resulting data may look like:
Poaprat Poatriv
1 6 0
2 4 7
3 5 6
4 5 4
5 0 7
6 4 3
P.S.
For the cringing ecologist out there: please realize, I have made this example in the interest of time and that I know relative cover values are not additive. I apologize for needing to do that.
*** Edit: For more clarity, the type of data being randomized would be percent cover estimates (so values between 0% and 100%). The data in this quick example are relative cover estimates, not counts.
You'll still need to replace the actual columns with the new ones, and there may be a more elegant way to do this (it's late in EDT land). You'll also have to decide what else besides the normal distribution you want to use (i.e. how you'll replace sample()), but you get your swaps and adds with:
library(vegan)
library(purrr)
data(dune)
poa <- data.frame(
Poaprat=dune$Poaprat,
Poatriv=dune$Poatriv
)
map2_df(poa$Poaprat, poa$Poatriv, function(x, y) {
for (i in 1:length(x)) {
what <- sample(c("left", "right", "swap"), 1)
switch(
what,
left={
x[i] <- x[i] + y[i]
y[i] <- 0
},
right={
y[i] <- x[i] + y[i]
x[i] <- 0
},
swap={
tmp <- y[i]
y[i] <- x[i]
x[i] <- tmp
}
)
}
data.frame(Poaprat=x, Poatriv=y)
})
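One possible way to capture that result and drop it back into poa, as mentioned above (a sketch, untested; it expresses each branch of the switch as a one-row data frame):
# sketch, untested: store the shuffled columns and overwrite the originals
shuffled <- map2_df(poa$Poaprat, poa$Poatriv, function(x, y) {
what <- sample(c("left", "right", "swap"), 1)
switch(
what,
left = data.frame(Poaprat = x + y, Poatriv = 0),
right = data.frame(Poaprat = 0, Poatriv = x + y),
swap = data.frame(Poaprat = y, Poatriv = x)
)
})
poa[c("Poaprat", "Poatriv")] <- shuffled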
Here is my approach:
Let's define a function that takes a number of specimens (x) and a probability (p) that each one could be labeled incorrectly. The function samples a 1 with probability p and a 0 with probability 1-p for each specimen; the sum of this random sample gives how many of the x specimens were mislabeled.
mislabel = function(x, p){
N_mis = sample(c(1,0), x, replace = T, prob = c(p, 1-p))
sum(N_mis)
}
Once the function is defined, apply it to each column and store the results in two new columns:
p_miss = 0.3
poa$Poaprat_mislabeled = sapply(poa$Poaprat, mislabel, p_miss)
poa$Poatriv_mislabeled = sapply(poa$Poatriv, mislabel, p_miss)
The final number of specimens tagged for each species can be calculated by subtracting the incorrect ones from the same species and adding the incorrect ones from the other species.
poa$Poaprat_final = poa$Poaprat - poa$Poaprat_mislabeled + poa$Poatriv_mislabeled
poa$Poatriv_final = poa$Poatriv - poa$Poatriv_mislabeled + poa$Poaprat_mislabeled
Result:
> head(poa)
Poaprat Poatriv Poaprat_mislabeled Poatriv_mislabeled Poaprat_final Poatriv_final
1 4 2 0 0 4 2
2 4 7 1 2 5 6
3 5 6 0 3 8 3
4 4 5 1 2 5 4
5 2 6 0 3 5 3
6 3 4 1 2 4 3
Complete procedure:
mislabel = function(x, p){
N_mis = sample(c(1,0), x, replace = T, prob = c(p, 1-p))
sum(N_mis)
}
p_miss = 0.3
poa$Poaprat_mislabeled = sapply(poa$Poaprat, mislabel, p_miss)
poa$Poatriv_mislabeled = sapply(poa$Poatriv, mislabel, p_miss)
poa$Poaprat_final = poa$Poaprat - poa$Poaprat_mislabeled + poa$Poatriv_mislabeled
poa$Poatriv_final = poa$Poatriv - poa$Poatriv_mislabeled + poa$Poaprat_mislabeled
The p_miss variable is the probability of labeling either species incorrectly. You could also use a different value for each species to simulate a non-symmetric chance, i.e. that it may be easier to mislabel one of them than the other.
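For that asymmetric case, a sketch (untested) with a different, assumed probability per species:
# sketch, untested: separate mislabeling probabilities for the two species (values are assumptions)
p_prat = 0.3 # chance a Poa pratensis individual is mislabeled
p_triv = 0.1 # chance a Poa trivialis individual is mislabeled
poa$Poaprat_mislabeled = sapply(poa$Poaprat, mislabel, p_prat)
poa$Poatriv_mislabeled = sapply(poa$Poatriv, mislabel, p_triv)
poa$Poaprat_final = poa$Poaprat - poa$Poaprat_mislabeled + poa$Poatriv_mislabeled
poa$Poatriv_final = poa$Poatriv - poa$Poatriv_mislabeled + poa$Poaprat_mislabeled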
I just wanted to check in since accepting the answer from hrbrmstr. Given a little bit of time today, I went ahead and made a function that does this task with some degree of flexibility. It allows for the inclusion of multiple species pairs, different probabilities between different species pairs (asymmetry in either direction), and explicitly includes the probability of the value staying the same.
misID = function(X, species,probs = c(0.1,0.1,0,0.8)){
library(purrr)
X2 = X
if (!is.matrix(species)){
species = as.matrix(species) # assign the coercion; otherwise it is discarded
}
if (!is.matrix(probs) == T){
probs=matrix(probs,ncol=4,byrow=T)
}
if (nrow(probs) == 1){
probs = matrix(rep(probs[1,],nrow(species)),ncol=4,byrow=T)
}
for (i in 1:nrow(species)){
Spp = data.frame(X[species[i,1]],X[species[i,2]])
mis = map2_df(Spp[1],Spp[2],function(x,y) {
for(n in 1:length(x)) {
what = sample(c('left', 'right', 'swap','same'), size=1,prob=probs[i,])
switch(
what,
left = {
x[n] = x[n] + y[n]
y[n] = 0
},
right = {
y[n] = x[n] + y[n]
x[n] = 0
},
swap = {
tmp = y[n]
y[n] = x[n]
x[n] = tmp
},
same = {
x[n] = x[n]
y[n] = y[n]
}
)
}
misSpp = data.frame(x,y)
colnames(misSpp) =c(names(Spp[1]),names(Spp[2]))
return(misSpp)
})
X2[names(mis[1])] = mis[1]
X2[names(mis[2])] = mis[2]
}
return(X2)
}
There are probably a number of minor inefficiencies in here, but by and large it does what I need it to do. Sorry that there are no comments, but I did figure out how to handle getting the shuffled data into the data frame easily.
Thanks for pointing out the "purrr" package for me and also the switch function.
Example:
library(vegan)
library(labdsv)
data(dune)
#First convert relative abundances to my best guess at the % values in Van der Maarel (1979)
code = c(1,2,3,4,5,6,7,8,9)
value = c(0.1,1,2.5,4.25,5.5,20,40,60.5,90)
veg = vegtrans(dune,code,value)
specpairs = matrix(c("Poaprat","Poatriv","Trifprat","Trifrepe"),ncol=2,byrow=T) #create matrix of species pairs
probmat = matrix(c(0.3,0,0,0.7,0,0.5,0,0.5),ncol=4,byrow=T) #create matrix of misclassification probabilities
veg2 = misID(veg,specpairs,probs = probmat)
print(veg2)

Create Spatial Data in R

I have a dataset of species and their rough locations in a 100 x 200 meter area. The location part of the data frame is not in a format that I find to be usable. In this 100 x 200 meter rectangle, there are two hundred 10 x 10 meter squares named A through CV. Within each 10 x 10 square there are four 5 x 5 meter squares named 1, 2, 3, and 4, respectively (1 is south of 2 and west of 3; 4 is east of 2 and north of 3). I want to let R know that A is the square with corners at (0,0), (10,0), (0,10), and (10,10), that B is just north of A and has corners (0,10), (0,20), (10,10), and (10,20), and that K is just east of A and has corners at (10,0), (10,10), (20,0), and (20,10), and so on for all the 10 x 10 meter squares. Additionally, I want to let R know where each 5 x 5 meter square is in the 100 x 200 meter plot.
So, my data frame looks something like this
10x10 5x5 Tree Diameter
A 1 tree1 4
B 1 tree2 4
C 4 tree3 6
D 3 tree4 2
E 3 tree5 3
F 2 tree6 7
G 1 tree7 12
H 2 tree8 1
I 2 tree9 2
J 3 tree10 8
K 4 tree11 3
L 1 tree12 7
M 2 tree13 5
Eventually, I want to be able to plot the 100 x 200 meter area and have each 10 x 10 meter square show up with the number of trees, or number of species, or total biomass.
What is the best way to turn the data I have into spatial data that R can use for graphing and perhaps analysis?
Here's a start.
## set up a vector of all 10x10 position tags
tags10 <- c(LETTERS,
paste0("A",LETTERS),
paste0("B",LETTERS),
paste0("C",LETTERS[1:22]))
A function to convert (e.g.) {"J",3} to the center of the corresponding sub-square.
convpos <- function(pos10,pos5) {
## convert letters to major (x,y) positions
p1 <- as.numeric(factor(pos10,levels=tags10)) ## or use match()
p1.x <- ((p1-1) %% 10) *10+5 ## %% is modulo operator
p1.y <- ((p1-1) %/% 10)*10+5 ## %/% is integer division
## sort out sub-positions
p2.x <- ifelse(pos5 <=2,2.5,7.5) ## {1,2} vs {3,4} values
p2.y <- ifelse(pos5 %%2 ==1 ,2.5,7.5) ## odd {1,3} vs even {2,4} values
c(p1.x+p2.x,p1.y+p2.y)
}
usage:
convpos("J",2)
convpos(mydata$tenbytenpos,mydata$fivebyfivepos)
Important notes:
this is a proof of concept; I can pretty much guarantee I haven't got the correspondence of x and y coordinates quite right, but you should be able to trace through this line-by-line and see what it's doing ...
it should work correctly on vectors (see second usage example above): I switched from switch to ifelse for that reason
your column names (10x10) are likely to get mangled into something like X10.10 when reading data into R: see ?data.frame and ?check.names
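To attach the computed centers to the data as coordinate columns, one row at a time, something like the following could work (a sketch, untested; mydata and its column names are the hypothetical ones from the usage example above):
# sketch, untested: apply convpos row-wise and bind the results back on
xy <- t(mapply(convpos, mydata$tenbytenpos, mydata$fivebyfivepos))
mydata$x <- xy[, 1]
mydata$y <- xy[, 2]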
Similar to what @Ben Bolker has done, here's a lookup function (though you may need to transpose something to make the labels match what you describe).
tenbyten <- c(LETTERS[1:26],
paste0("A",LETTERS[1:26]),
paste0("B",LETTERS[1:26]),
paste0("C",LETTERS[1:22]))
tenbyten <- matrix(rep(tenbyten, each = 2), ncol = 10)
tenbyten <- t(apply(tenbyten, 1, function(x){rep(x, each = 2)}))
# the 1234 squares
squares <- matrix(c(rep(c(1,2),10),rep(c(4,3),10)), nrow = 20, ncol = 20)
# stick together into a reference grid
my.grid <- matrix(paste(tenbyten, squares, sep = "-"), nrow = 20, ncol = 20)
# a lookup function for the site grid
coordLookup <- function(tbt, fbf, .my.grid = my.grid){
x <- col(.my.grid) * 5 - 2.5
y <- row(.my.grid) * 5 - 2.5
marker <- .my.grid == paste(tbt, fbf, sep = "-")
list(x = x[marker], y = y[marker])
}
coordLookup("BB",2)
$x
[1] 52.5
$y
[1] 37.5
If this isn't what you're looking for, then maybe you'd prefer a SpatialPolygonsDataFrame, which has proper polygon IDs that you can attach data to, etc. In that case just Google around for how to make one from scratch, and manipulate the row() and col() functions to get your polygon corners, similar to what's given in this lookup function, which only returns centroids.
Edit: getting SPDF started:
This is modified from the function example and can hopefully be a good start:
library(sp)
# really you have a 20x20 grid, counting the small ones.
# c(2.5,2.5) specifies the distance in any direction from the cell center
grd <- GridTopology(c(1,1), c(2.5,2.5), c(20,20))
grd <- as.SpatialPolygons.GridTopology(grd)
# get centroids
coords <- coordinates(grd)
# make SPDF, with an extra column for your grid codes, taken from the above.
# you can add further columns to this data.frame(), using polys@data
polys <- SpatialPolygonsDataFrame(grd,
data=data.frame(x=coords[,1], y=coords[,2], my.ID = as.vector(my.grid),
row.names=getSpPPolygonsIDSlots(grd)))
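A quick way to eyeball the result, using base sp plotting with the objects defined above (a sketch, untested):
# sketch, untested: draw the grid and label each cell with its ID
plot(polys)
text(coords, labels = polys$my.ID, cex = 0.5)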
