How to find the nearest distance between two different data frames - r

I am trying to find the nearest distance of locations in dataset 1 to dataset 2. Both data sets are different sizes. I've looked into using the Haversine function but I'm unsure what to do after that.

Since you have not provided a sample of your data, I am going to use the oregon.tract data set from the UScensus2000tract library as a reproducible example.
Here is a solution based on the fast data.table package, adapted from another answer.
# load libraries
library(data.table)
library(geosphere)
library(UScensus2000tract)
library(rgeos)
Now let's create a new data.table with all possible pair combinations of origins (census centroids) and destinations (facilities).
# get all combinations of origin and destination pairs
# Note that I'm considering here that the distance from A -> B is equal
# to the distance from B -> A.
odmatrix <- CJ(Datatwo$Code_A, Dataone$Code_B)
names(odmatrix) <- c('Code_A', 'Code_B') # update names of columns
# add coordinates of Datatwo centroids (origin)
odmatrix[Datatwo, c('lat_orig', 'long_orig') := list(i.Latitude, i.Longitude), on = "Code_A"]
# add coordinates of facilities (destination)
odmatrix[Dataone, c('lat_dest', 'long_dest') := list(i.Latitude, i.Longitude), on = "Code_B"]
Now you just need to:
# calculate distances
odmatrix[, dist := distHaversine(matrix(c(long_orig, lat_orig), ncol = 2),
                                 matrix(c(long_dest, lat_dest), ncol = 2))]
# and get the nearest destinations for each origin
odmatrix[, .(Code_B = Code_B[which.min(dist)],
             dist = min(dist)),
         by = Code_A]
### Prepare data for this reproducible example
# load data
data("oregon.tract")
# get centroids as a data.frame
centroids <- as.data.frame(gCentroid(oregon.tract,byid=TRUE))
# Convert row names into first column
setDT(centroids, keep.rownames = TRUE)[]
# get two data.frames equivalent to your census and facility data frames
Datatwo<- copy(centroids)
Dataone <- copy(centroids)
names(Datatwo) <- c('Code_A', 'Longitude', 'Latitude')
names(Dataone) <- c('Code_B', 'Longitude', 'Latitude')
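For smaller datasets, a more compact alternative is geosphere's distm(), which computes the full origin-to-destination distance matrix in one call; which.min() then picks the nearest facility per origin. This is only a sketch, reusing the Dataone/Datatwo objects prepared above:
# a sketch assuming Dataone/Datatwo as built above; distm() expects
# longitude first, then latitude
library(geosphere)
dmat <- distm(as.matrix(Datatwo[, c("Longitude", "Latitude")]),
              as.matrix(Dataone[, c("Longitude", "Latitude")]),
              fun = distHaversine)
# nearest destination and its distance for each origin
nearest <- data.frame(Code_A = Datatwo$Code_A,
                      Code_B = Dataone$Code_B[apply(dmat, 1, which.min)],
                      dist   = apply(dmat, 1, min))
Note that this materializes the whole distance matrix in memory, so the data.table approach above scales better for large inputs.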

Related

Random sampling R

I am new to R and trying to accomplish a fairly simple task. I have a dataset composed of 20 observations of 19 variables, and I want to generate three non-overlapping groups of 5 observations each. I am using the slice_sample function from the dplyr package, but how do I reiterate, excluding the observations already picked in the first round?
library( "dplyr")
set.seed(123)
NF_1 <- slice_sample(NF, n = 5)
You can use the sample function from base R.
All you have to do is sample the rows with replace = FALSE, which guarantees no overlap. You can also define the number of samples.
n_groups <- 3
observations_per_group <- 5
size <- n_groups * observations_per_group
selected_samples <- sample(seq_len(nrow(NF)), size = size, replace = FALSE)
# Now index those selected rows
NF_1 <- NF[selected_samples, ]
Now, according to your comment, if you want to generate N dataframes, each with a number of samples and also label them accordingly, you can use lapply (which is a function that "applies" a function to a set of values). The "l" in "lapply" means that it returns a list. There are other types of apply functions. You can read more about that (and I highly recommend that you do!) here.
This code should solve your problem, or at least give you a good idea of where to go.
n_groups <- 3
observations_per_group <- 5
size <- observations_per_group * n_groups
# First we'll get the row samples.
selected_samples <- sample(
  seq_len(nrow(NF)),
  size = size,
  replace = FALSE
)
# Now we split them between the number of groups
split_samples <- split(
  selected_samples,
  rep(1:n_groups, observations_per_group)
)
# For each group (1 to n_groups) we'll define a dataframe with samples
# and store them sequentially in a list.
my_dataframes <- lapply(1:n_groups, function(x) {
  # our subset df will be the original df with the list of samples
  # for the group at position "x" (1, 2, 3, ..., n_groups)
  subset_df <- NF[split_samples[[x]], ]
  return(subset_df)
})
# now, if you need to access the results, you can simply do:
first_df <- my_dataframes[[1]] # use double brackets to access list elements
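As a side note, split() alone can replace the lapply() step: label each sampled row with its group and split the subsetted data frame into a named list. A minimal sketch, assuming the same NF, selected_samples, n_groups, and observations_per_group as above:
# assign consecutive blocks of sampled rows to groups, then split
groups <- rep(seq_len(n_groups), each = observations_per_group)
my_dataframes <- split(NF[selected_samples, ], groups)
my_dataframes[["1"]]  # first group's data frame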

Extract p values and r values for all pairwise variables

I have multiple variables for multiple countries over multiple years. I would like to generate a dataframe containing both an R^2 value and a P value for each pair of variables. I'm somewhat close, have a minimum working example and an idea of what the end product should look like, but am having some difficulties actually implementing it. If anyone could help, that would be most appreciated.
Please note, I would like to do this more manually rather than using packages like Hmisc, as that has created a number of other issues. I've had a look around for similar solutions as well, but haven't had much luck.
# Code to generate minimum working example (country year pairs).
library(tidyindexR)
library(tidyverse)
library(dplyr)
library(reshape2)
# Function to generate minimum working example data
simulateCountryData = function(N = 200, NEACH = 20, SEED = 100){
  variableOne <- rnorm(N, sample(1:100, NEACH), 0.5)
  variableOne[variableOne < 0] <- 0
  variableTwo <- rnorm(N, sample(1:100, NEACH), 0.5)
  variableTwo[variableTwo < 0] <- 0
  variableThree <- rnorm(N, sample(1:100, NEACH), 0.5)
  variableThree[variableThree < 0] <- 0
  geocodeNum <- factor(rep(seq(1, N/NEACH), each = NEACH))
  year <- rep(seq(2000, 2000 + NEACH - 1, 1), N/NEACH)
  # Putting it all together
  AllData <- data.frame(geocodeNum,
                        year,
                        variableOne,
                        variableTwo,
                        variableThree)
  return(AllData)
}
# This runs the function and generates the data
mySimData = simulateCountryData()
I have a reasonable idea of how to get correlations (both p values and r values) between 2 manually selected variables, but am having some trouble implementing it on the entire dataset and on a country level (rather than all at once).
# Example pvalue
corrP = cor.test(spreadMySimData$variableOne,spreadMySimData$variableTwo)$p.value
# Example r value
corrEst = cor(spreadMySimData$variableOne,spreadMySimData$variableTwo)
Finally, the end result should look something like this :
myVariables = colnames(spreadMySimData[3:ncol(spreadMySimData)])
myMatrix = expand.grid(myVariables,myVariables)
# I'm having trouble actually trying to get the r values and p values in the dataframe
myMatrix = as.data.frame(myMatrix)
myMatrix$Pval = runif(9,0.01,1)
myMatrix$Rval = runif(9,0.2,1)
myMatrix
Thanks again :)
This will compute r and p for all the unique pairs.
# matrix of unique pairs coded as numeric
mx_combos <- combn(1:length(myVariables), 2)
# list of unique pairs coded as numeric
ls_combos <- split(mx_combos, rep(1:ncol(mx_combos), each = nrow(mx_combos)))
# for each pair in the list, create a 1 x 4 dataframe
ls_rows <- lapply(ls_combos, function(p) {
  # lookup names of variables
  v1 <- myVariables[p[1]]
  v2 <- myVariables[p[2]]
  # perform the cor.test()
  htest <- cor.test(mySimData[[v1]], mySimData[[v2]])
  # record pertinent info in a dataframe
  data.frame(Var1 = v1,
             Var2 = v2,
             Pval = htest$p.value,
             Rval = unname(htest$estimate))
})
# row bind the list of dataframes
dplyr::bind_rows(ls_rows)
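Since you also mentioned wanting this on a country level rather than all at once, one sketch is to split the data by geocodeNum and rerun the same pairwise loop on each subset. This assumes mySimData and ls_combos from above, and that myVariables holds your variable column names (as in your expected-output code):
by_country <- lapply(split(mySimData, mySimData$geocodeNum), function(d) {
  rows <- lapply(ls_combos, function(p) {
    v1 <- myVariables[p[1]]
    v2 <- myVariables[p[2]]
    htest <- cor.test(d[[v1]], d[[v2]])
    data.frame(Var1 = v1, Var2 = v2,
               Pval = htest$p.value, Rval = unname(htest$estimate))
  })
  dplyr::bind_rows(rows)
})
The result is a list with one pairwise table per country.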

How to add strata information to a genind

I have created a genind object from a table containing SNPs information.
I need to insert population information into this genind.
I know which individuals (which are identified by numbers) should go into each population.
How do I pick the correct individuals and place them into separate populations?
It's always helpful to make a reproducible example when asking a question.
First, loading the necessary library (pretty sure it's adegenet)
library(adegenet)
Making some fake data by first getting a vector of alleles
alleles <- paste0("0",1:4)
Setting number of loci, individuals per population, and the number of populations
nloci <- 10
nind <- 10
npops <- 2
Using a for loop to make the fake dataset
i <- NULL
out <- NULL
for(i in 1:npops){
  # there are nind*nloci genotypes in each population
  # make a vector of genotypes (two pasted alleles each)
  gts <- replicate(n = nind*nloci,
                   expr = paste0(sample(x = alleles, size = 1, replace = T),
                                 sample(x = alleles, size = 1, replace = T)))
  gts <- as.data.frame(matrix(data = gts,
                              nrow = nind, ncol = nloci, byrow = T))
  # making generic locus colnames()
  colnames(gts) <- paste0("locus_", 1:nloci)
  out <- rbind(out, gts)
} # end of for loop
head(out)
Now converting that data.frame into a genind
obj <- df2genind(out, ploidy=2, ncode=2)
obj
Note that the row.names() are considered individual IDs.
Now for setting the populations; note it's empty right now:
obj@pop
You just need a vector that represents the populations corresponding to each individual.
Option 1
If your individual IDs are clustered by population (e.g. 1-10 are from pop1 and 11-20 are from pop2), then something like this should work
pops<- paste0("pop",1:npops)
Set the populations using that vector, make sure it's a factor
obj@pop <- as.factor(rep(pops, each = nind))
obj@pop
Option 2
If the original data.frame (table) that contained your SNP information also contained population information, you could use that as your vector
e.g. If out looked like this
out$pops <- sample(x = pops,size = nrow(out),replace = T)
head(out)
Then you could use that column as your vector
obj@pop <- as.factor(out$pops)
obj@pop
Option 3
Alternatively, if you have another table that lets you identify which individuals correspond to which population, you could use that information. This assumes the second table (data.frame) has the same number of rows as out.
Here is an example second table
df <- data.frame(pops = rep(pops, each = nind),
                 id = sample(x = 1:nrow(out), size = nrow(out), replace = F))
head(df)
Note that the IDs are not in order, but they were in out (and therefore are in obj), so df needs to be ordered by df$id
df <- df[order(df$id),]
head(df)
After they are in the correct order
obj@pop <- as.factor(df$pops)
obj@pop
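Since the title asks about strata specifically: recent adegenet versions (2.0+) also provide a strata() accessor, so the same population vector can be stored as a stratum and then promoted to the active population with setPop(). A sketch reusing df from Option 3 (check your adegenet version supports these accessors):
strata(obj) <- data.frame(pops = df$pops)  # store population info as a stratum
setPop(obj) <- ~pops                       # make it the active population
obj@pop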

Extract data using a matching matrix pair of data in R

I have two data sets with latitude, longitude, and temperature data. One data set corresponds to a geographic region of interest with the corresponding lat/long pairs that form the boundary and contents of the region (Matrix Dimension = 4518x2)
The other data set contains lat/long and temperature data for a larger region that envelopes the region of interest (Matrix Dimension = 10875x3).
My question is: How do you extract the appropriate row data (lat, long, temperature) from the 2nd data set that matches the first data set's lat/long data?
I've tried a variety of "for loops," "subset," and "unique" commands but I can't obtain the matching temperature data.
Thanks in advance!
10/31 Edit: I forgot to mention that I'm using "R" to process this data.
The lat/long data for the region of interest was provided as a list of 4,518 files containing the lat/long coordinates in the name of each file:
x<- dir()
lenx<- length(x)
g <- strsplit(x, "_")
coord1 <- matrix(NA,nrow=lenx, ncol=1)
coord2 <- matrix(NA,nrow=lenx, ncol=1)
for(i in 1:lenx) {
  coord1[i,1] <- unlist(g)[2+3*(i-1)]
  coord2[i,1] <- unlist(g)[3+3*(i-1)]
}
coord1<-as.numeric(coord1)
coord2<-as.numeric(coord2)
coord<- cbind(coord1, coord2)
The lat/long and temperature data was obtained from an NCDF file with temperature data for 10,875 lat/long pairs:
long<- tempcd$var[["Temp"]]$size[1]
lat<- tempcd$var[["Temp"]]$size[2]
time<- tempcd$var[["Temp"]]$size[3]
proj<- tempcd$var[["Temp"]]$size[4]
temp<- matrix(NA, nrow=lat*long, ncol = time)
lat_c<- matrix(NA, nrow=lat*long, ncol=1)
long_c<- matrix(NA, nrow=lat*long, ncol =1)
counter<- 1
for(i in 1:lat){
  for(j in 1:long){
    temp[counter,] <- get.var.ncdf(tempcd, varid = "Temp", count = c(1,1,time,1), start = c(j,i,1,1))
    counter <- counter + 1
  }
}
temp_gcm <- cbind(lat_c, long_c, temp)
So now the question is how do you remove values from "temp_gcm" that correspond to lat/long data pairs from "coord?"
Noe,
I can think of a number of ways you could do this. The simplest, albeit not the most efficient, would be to use R's which() function, which takes a logical condition, while iterating over the data frame you want to match against. Of course, this assumes there is at most a single match in the larger data set. Based on your data sets, I would do something like this:
attach(temp_gcm) # adds the temp_gcm column names to the global namespace
attach(coord) # adds the coord column names to the global namespace
matched.temp = vector(length = nrow(coord)) # To store matching results
for (i in seq_len(nrow(coord))) {
matched.temp[i] = temp[which(lat_c == coord1[i] & long_c == coord2[i])]
}
# Now add the results column to the coord data frame (indexes match)
coord$temperature = matched.temp
The function which(lat_c == coord1[i] & long_c == coord2[i]) returns a vector of all rows in the dataframe temp_gcm which satisfy lat_c and long_c matching coord1 and coord2 respectively from row i in the iteration (NOTE: I'm assuming this vector will only have length 1, i.e. there is only 1 possible match). matched.temp[i] will then be assigned the value from the column temp in the dataframe temp_gcm which satisfied the logical condition. Note that the goal in doing this is that we create a vector which has matched values that correspond by index to the rows of the dataframe coord.
I hope this helps. Note that this is a rudimentary approach, and I would advise looking up the function merge() as well as apply() to do this in a more succinct manner.
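For reference, the merge() route mentioned above could look like the following sketch. It assumes coord1/coord2 and temp_gcm as built in the question, with the first two columns of temp_gcm holding latitude and longitude (note that in the question's code lat_c and long_c are initialized to NA but never filled, so they would need to be populated first):
# inner join on the lat/long pair keeps only matching rows
coord_df <- data.frame(lat = coord1, long = coord2)
temp_df  <- as.data.frame(temp_gcm)
names(temp_df)[1:2] <- c("lat", "long")
matched <- merge(coord_df, temp_df, by = c("lat", "long"))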
I added an additional column of zeros to use as an indicator for an IF statement. "x" is the number of rows in temp_gcm, "y" is the number of columns (representing time steps), and "temp_s" is the standardized temperature data.
indicator<- matrix(0, nrow = x, ncol = 1)
precip_s<- cbind(precip_s, indicator)
temp_s<- cbind(temp_s, indicator)
for(aa in 1:x){
  current_lat <- latitudes[aa,1]   # Latitudes corresponding to larger area
  current_long <- longitudes[aa,1] # Longitudes corresponding to larger area
  for(ab in 1:lenx){ # lenx corresponds to nrow(coord)
    if(current_lat == coord[ab,1] & current_long == coord[ab,2]) {
      precip_s[aa,(y/12+1)] <- 1 # y/12+1 corresponds to the "indicator" column
      temp_s[aa,(y/12+1)] <- 1
    }
  }
}
precip_s <- precip_s[precip_s[,(y/12+1)]>0,] # Removes rows with "0"s remaining in "indicator" column
temp_s <- temp_s[temp_s[,(y/12+1)]>0,]
precip_s <- precip_s[,-(y/12+1)] # Removes "indicator" column
temp_s <- temp_s[,-(y/12+1)]
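The indicator-column dance above can also be replaced with vectorized key matching, which avoids the nested loops entirely. A sketch, assuming the latitudes, longitudes, and coord objects used in the code above:
# build "lat long" keys for both sets; keep rows whose key appears in coord
key_large <- paste(latitudes[, 1], longitudes[, 1])
key_small <- paste(coord[, 1], coord[, 2])
keep <- key_large %in% key_small
temp_s   <- temp_s[keep, ]
precip_s <- precip_s[keep, ]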

Calculate statistics (e.g. average) across cells of identical data-frames

I have a list of identically sorted dataframes. More specifically, these are the imputed dataframes I get after doing multiple imputation with the Amelia II package. Now I want to create a new dataframe that is identical in structure but contains the mean values of the cells calculated across the dataframes.
The way I achieve this at the moment is the following:
## do the Amelia run ------------------------------------------------------------
a.out <- amelia(merged, m=5, ts="Year", cs ="GEO",polytime=1)
## Calculate the output statistics ----------------------------------------------
left.side <- a.out$imputations[[1]][,1:2]
a.out.ncol <- ncol(a.out$imputations[[1]])
a <- a.out$imputations[[1]][,3:a.out.ncol]
b <- a.out$imputations[[2]][,3:a.out.ncol]
c <- a.out$imputations[[3]][,3:a.out.ncol]
d <- a.out$imputations[[4]][,3:a.out.ncol]
e <- a.out$imputations[[5]][,3:a.out.ncol]
# Calculate the Mean of the matrices
mean.right <- apply(abind(a,b,c,d,e,along=3),c(1,2),mean)
# recombine factors with values
mean <- cbind(left.side,mean.right)
I suppose there is a much better way of doing this using apply, plyr or the like, but as an R newbie I am really a bit lost here. Do you have any suggestions how to go about this?
Here's an alternate approach using Reduce and plyr::llply
dfr1 <- data.frame(a = c(1,2.5,3), b = c(9.0,9,9), c = letters[1:3])
dfr2 <- data.frame(a = c(5,2,5), b = c(6,5,4), c = letters[1:3])
tst = list(dfr1, dfr2)
require(plyr)
tst2 = llply(tst, function(df) df[,sapply(df, is.numeric)]) # strip out non-numeric cols
ans = Reduce("+", tst2)/length(tst2)
EDIT. You can simplify your code considerably and accomplish what you want in 5 lines of R code. Here is an example using the Amelia package.
library(Amelia)
data(africa)
# carry out imputations
a.out = amelia(x = africa, cs = "country", ts = "year", logs = "gdp_pc")
# extract numeric columns from each element of a.out$imputations
tst2 = llply(a.out$imputations, function(df) df[,sapply(df, is.numeric)])
# sum them up and divide by length to get mean
mean.right = Reduce("+", tst2)/length(tst2)
# compute fixed columns and cbind with mean.right
left.side = a.out$imputations[[1]][1:2]
mean0 = cbind(left.side,mean.right)
If I understand your question correctly, then this should get you a long way:
#set up some data:
dfr1<-data.frame(a=c(1,2.5,3), b=c(9.0,9,9))
dfr2<-data.frame(a=c(5,2,5), b=c(6,5,4))
tst<-list(dfr1, dfr2)
#since all variables are numerical, use a three-dimensional array
tst2<-array(do.call(c, lapply(tst, unlist)), dim=c(nrow(tst[[1]]), ncol(tst[[1]]), length(tst)))
#To see where you're at:
tst2
#rowMeans for a three-dimensional array with dims=2 takes the mean over the last dimension
result<-data.frame(rowMeans(tst2, dims=2))
rownames(result)<-rownames(tst[[1]])
colnames(result)<-colnames(tst[[1]])
#display the full result
result
HTH.
After many attempts, I've found a reasonably fast way to calculate cells' means across multiple data frames.
# First create an empty data frame for storing the average imputed values. This
# data frame will have the same dimensions as the original one
imp.df <- df
# Then create an array with the first two dimensions of the original data frame and
# the third dimension given by the number of imputations
a <- array(NA, dim = c(nrow(imp.df), ncol(imp.df), length(a.out$imputations)))
# Then copy each imputation into each "slice" of the array
for (z in 1:length(a.out$imputations)) {
  a[,,z] <- as.matrix(a.out$imputations[[z]])
}
# Finally, for each cell, replace the actual value with the mean across all
# "slices" in the array
for (i in 1:dim(a)[1]) {
  for (j in 1:dim(a)[2]) {
    imp.df[i, j] <- mean(as.numeric(a[i, j, ]))
  }
}
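The final double loop can be collapsed into a single apply() call over the first two dimensions of the array; a sketch, assuming a and imp.df as built above:
# mean over the third dimension (the imputations) for every cell at once
imp.df[] <- apply(a, c(1, 2), function(v) mean(as.numeric(v)))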
