I have a data frame with geographical locations (two columns) and presence/absence data (0/1). What I need to find out is how many of the other locations lie within each location's vicinity and have the value 1. That number should then be appended as a fourth column.
xcoords <- c(4487754, 4488111, 4487598, 4487417, 4487949, 4487802, 4488011, 4487638, 4487455, 4487478, 4487141, 4487550, 4487547, 4488010, 4487271, 4487170)
ycoords <- c(5294654, 5294706, 5295490, 5293859, 5295313, 5294910, 5294668, 5295399, 5294537, 5294408, 5294927, 5294617, 5294727, 5294953, 5294168, 5295142)
yesno <- round(runif(16, 0, 1),0)
df <- data.frame(xcoords, ycoords, yesno)
I set the vicinity to 500 meters (the geodata is in a projected coordinate system, so the units correspond), and will do the calculation using Pythagoras' theorem.
buffer <- 500
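For a single pair of points, the test is just the Euclidean distance compared against the buffer; a minimal illustration, assuming the vectors defined above:
# is point 2 within 500 m of point 1?
sqrt((xcoords[1] - xcoords[2])^2 + (ycoords[1] - ycoords[2])^2) < buffer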
What I do know is how to build nested for-loops. But I don't want my function to look like this:
count.in.buffer <- function(df, buffer){
  lon <- df$xcoords
  lat <- df$ycoords
  yes <- df$yesno
  count <- rep(0, length(lon))
  for(i in 1:length(lon)){              # for every row...
    nolon <- lon[-i]
    nolat <- lat[-i]
    noyes <- yes[-i]
    for(j in 1:(length(lon) - 1)){      # ... check all other rows
      dist <- sqrt((lon[i] - nolon[j])^2 + (lat[i] - nolat[j])^2)
      if(dist < buffer && noyes[j] == 1) count[i] <- count[i] + 1
    }
  }
  result <- cbind(df, count)
  return(result)
}
That would take ages to compute, because my data frame actually has 67,000 rows.
Instead I would want something more efficient, but I don't understand how to convert the inner for-loop into a function I could use in an apply function row-wise on the whole data frame. I do suspect that an apply function would be the way to go, right?
P.S.: I solved the problem already with a proper GIS, by creating buffer polygons around each data point and linking the other data points to every buffer polygon's attribute table, if they lie within that buffer. But I think it should be much quicker to do in R.
You could try moving it into an apply statement and subsetting your data to the rows that have a 1.
First, make a new data frame that only has the rows with 1 in yesno (there is no point calculating distances to points with 0, since they contribute nothing to the sum):
df1 <- df[df[ ,'yesno'] == 1,]
Then we compare each row of df to all of df1 in a vectorised way and sum the counts (subtracting the row's own yesno value so a point doesn't count itself):
apply(df, 1, function(x){
  sum(sqrt((x[1] - df1[, 1])^2 +
           (x[2] - df1[, 2])^2) < buffer) - x[3]
})
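To get the requested fourth column, the result of the apply call can be assigned straight back to df (a small usage sketch, assuming df is the data frame built above):
df$count <- apply(df, 1, function(x){
  sum(sqrt((x[1] - df1[, 1])^2 +
           (x[2] - df1[, 2])^2) < buffer) - x[3]
})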
Related
I have a table of ranges (start, stop), which looks something like this:
ID   start    stop
x1   351525   352525
x2   136790   136990
x3   74539    74739
x4   478181   478381
...  ...      ...
I also have a vector of positions.
The data can be simulated with:
s=round(runif(50,0,500000),0)
# ranges:
# (+200 is random; the difference may be more or less than that, but stop is always higher than start)
ranges = data.frame(ID = paste0("x", 1:50), start = s, stop = s + 200)
# positions
pos=round(runif(5000,0,500000),0)
I want to select all IDs which have at least one position within their range.
I could loop through ranges and pos:
library(dplyr)
selected.IDs <- c()
for(r in 1:nrow(ranges)){
  for(p in 1:length(pos)){
    if(between(pos[p], left = ranges[r,2], right = ranges[r,3])){
      selected.IDs <- append(selected.IDs, ranges[r,1])
      break
    } else{
      next
    }
  }
}
That works fine (I think).
However, the 'ranges' object has 83,000 rows and there are 180,000 positions. It takes a long time to loop through all of them.
Does anyone have an idea how to do that without a loop?
Thanks
I usually do this using overlap joins with data.table::foverlaps.
library(data.table)
s <- round(runif(50,0,500000),0)
# ranges:
# (+200 is random; the difference may be more or less than that, but stop is always higher than start)
ranges <- data.table(ID=paste0("x",1:50), start=s, stop=s+200)
# positions
pos <- round(runif(5000,0,500000),0)
pos <- data.table(start = pos, stop = pos + 1)
setkey(pos, start, stop)
setkey(ranges, start, stop)
res <- foverlaps(ranges, pos, nomatch = 0)
selected.IDs <- unique(res$ID)  # unique, since a range may overlap several positions
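A base-R alternative (a sketch I have not benchmarked against foverlaps) is to sort the positions once and use findInterval to count, for each range, how many positions lie at or below stop and how many lie strictly below start; a range contains at least one position whenever those counts differ:
pos_sorted <- sort(pos$start)  # recover the raw positions from the data.table built above
n_upto_stop   <- findInterval(ranges$stop,  pos_sorted)                    # positions <= stop
n_below_start <- findInterval(ranges$start, pos_sorted, left.open = TRUE)  # positions < start
selected.IDs  <- ranges$ID[n_upto_stop > n_below_start]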
I want to calculate the smallest geographical distance between each row of one data frame and all rows of another.
DF1 has a number of institutions, and DF2 has a number of events, like so:
#DF1 (institutions)
DF1 <- data.frame(latitude=c(41.49532, 36.26906, 40.06599),
longitude=c(-98.77298, -101.40585, -80.72291))
DF1$institution <- letters[seq( from = 1, to = nrow(DF1))]
#DF2 (events)
DF2 <- data.frame(latitude=c(32.05, 32.62, 30.23),
                  longitude=c(-86.82, -87.67, -88.02))
DF2$ID <- seq_len(nrow(DF2))
I want to return the event with the smallest distance to each institution in DF1 and add both the distance and the ID from DF2 to DF1. While I know how to calculate the pairwise distance, I am incapable of calculating all the distances from DF1[1, ] to DF2, returning the smallest value, and so forth.
This is what I tried (and failed).
library(geosphere)
#Define a function
distanceCALC <- function(x, y) {
  distm(x = x, y = y, fun = distHaversine)
}
#Define vector of events
DF2_vec <- DF2[, c('longitude', 'latitude')]
#Define df to hold distances
shrtdist <- data.frame()
Now, my attempt was to feed distanceCALC with row1 of DF1 and the vectorized events.
# Loop through every row in DF1 (institutions a, b, c) and calculate all the distances to the events. Append the smallest distance + DF2$ID to DF1.
#This only gives me the pairwise distance
for (i in 1:nrow(DF1)){
  result <- distanceCALC(DF1[i, c('longitude', 'latitude')], DF2_vec)
}
#Somehow take shortest distance for each row*column distance matrix
shrtdist <- rbind(shrtdist, min(result[,], na.rm = T))
My guess is that the solution entails reshaping the data and lapply. Also, the loop is very bad practice and much too slow given the number of observations.
Any help is greatly appreciated.
Here's a simple way to approach this using the outer function
squared_distance <- function(x, y ) (x - y)^2
lat <- outer(DF1$latitude, DF2$latitude, squared_distance)
long <- outer(DF1$longitude, DF2$longitude, squared_distance)
pairwise_dist <- sqrt(lat + long)
rownames(pairwise_dist) <- DF1$institution
colnames(pairwise_dist) <- DF2$ID
pairwise_dist
This gives you a matrix of the distances between each institution (rows) and event (columns). To get the nearest distance and event ID into DF1, we can do
DF1$min_dist <- apply(pairwise_dist, 1, min)
DF1$min_inst <- apply(pairwise_dist, 1, which.min)
Note that the second line works in this case only because the events are labelled with consecutive numbers. If your real data doesn't have that handy feature, we need to do
DF1$min_inst <- colnames(pairwise_dist)[apply(pairwise_dist, 1, which.min)]
Update using alternative distance function
I haven't tested this, but I think this should work. Again, the output will be a matrix.
gcd.hf <- function(DF1, DF2) {
  # convert degrees to radians first
  lat1 <- DF1$latitude * pi/180; lon1 <- DF1$longitude * pi/180
  lat2 <- DF2$latitude * pi/180; lon2 <- DF2$longitude * pi/180
  sin2.long <- sin(outer(lon1, lon2, "-") / 2)^2
  sin2.lat  <- sin(outer(lat1, lat2, "-") / 2)^2
  cos.lat   <- outer(cos(lat1), cos(lat2), "*")
  a <- sin2.lat + cos.lat * sin2.long    # haversine term, computed cell-wise
  cir <- 2 * asin(pmin(1, sqrt(a)))      # central angle; not called "c", since that's concatenate
  cir * 6371                             # Earth's mean radius in km
}
pairwise_dist <- gcd.hf(DF1, DF2)
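For what it's worth, geosphere::distm already accepts matrices of points (longitude first, then latitude) and returns the full pairwise distance matrix in metres, so the distanceCALC from the question can be applied to the whole data frames at once. A sketch, assuming DF1 and DF2 as defined above (the column names min_dist_m and min_ID are my own):
library(geosphere)
pairwise_m <- distm(DF1[, c('longitude', 'latitude')],
                    DF2[, c('longitude', 'latitude')],
                    fun = distHaversine)                     # metres
DF1$min_dist_m <- apply(pairwise_m, 1, min)
DF1$min_ID     <- DF2$ID[apply(pairwise_m, 1, which.min)]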
I have a gridded climate dataset, such as:
# generate time vector
time1 <- seq(14847.5,14974.5, by = 1)
time2 <- seq(14947.5,14974.5, by = 1)
time <- c(time1,time2)
time <- as.POSIXct(time*86400,origin='1970-01-01 00:00')
# generate lat and lon coordinates
lat <- seq(80,90, by = 1)
lon <- seq(20,30, by = 1)
# generate 3dimensional array
dat <- array(runif(length(lat)*length(lon)*length(time)),
dim = c(length(lon),length(lat),length(time)))
such that
> dim(dat)
[1] 11 11 156
The dimensions of the data describe the variable at different longitudes (dim = 1), latitudes (dim = 2), and times (dim = 3).
The issue I have at the moment is that some of the times are repeated, something to do with overlapping sensors measuring the data. Therefore, I was wondering if it is possible to keep only the unique times for dat, but average the data within the grid for the duplicated times, i.e. if a day appears twice, take the average value in each latitude/longitude grid cell for that time.
I can find the unique times as:
# only select unique times
new_time <- unique(time)
unique_time <- unique(time)
The following code then aims to loop through each grid (lat/lon) and average all of the duplicated days.
# loop through lat/lon coordinates to generate new data
new_dat <- array(dim = c(length(lon), length(lat), length(new_time)))
for(i in 1:length(lon)){
  for(ii in 1:length(lat)){
    dat2 <- dat[i,ii,]
    dat2b <- NA
    for(k in 1:length(unique_time)){
      idx <- time == unique_time[k]
      dat2b[k] <- mean(dat2[idx], na.rm = TRUE)
    }
    new_dat[i,ii,] <- dat2b
  }
}
I'm convinced that this provides the correct answer, but I'm certain there is a much cleaner method to achieve this.
I should also note that my data is quite large (i.e. k = 7000), so this last loop is not very efficient, to say the least.
My original answer:
This is a bit more concise and efficient by use of aggregate:
for(i in 1:length(lon)){
  for(ii in 1:length(lat)){
    new_dat[i,ii,] <- as.numeric(aggregate(dat[i,ii,], by = list(time), mean)$x)
  }
}
It still has 2 of the 3 loops, but it manages to bypass creating dat2, dat2b, and unique_time.
My improved answer:
f <- function(i, ii) as.numeric(aggregate(dat[i,ii,], by = list(time), mean)$x)

grid <- expand.grid(1:length(lon), 1:length(lat))
for(i in 1:nrow(grid)){
  new_dat[grid[i,1], grid[i,2], ] <- f(grid[i,1], grid[i,2])
}
Got it down to just 1 loop. We could probably bypass that loop too with an apply.
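Something along these lines should do it (a sketch I have not run here): apply returns the aggregated time series in the first dimension of the result, so it has to be permuted back with aperm:
tmp <- apply(dat, c(1, 2), function(x) aggregate(x, by = list(time), mean)$x)
# apply() puts the function output first: tmp has dims (time, lon, lat)
new_dat <- aperm(tmp, c(2, 3, 1))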
I am trying to create a function to inject outliers into an existing data frame.
I started by creating a new data frame outs using the max and min values of the original data frame. This outs data frame will contain a certain amount of outlier data.
Later I want to inject the outlier values from the outs data frame into the original data frame.
What I want to get is a function that injects a certain amount of outliers into an original data frame.
I have two problems: first, I do not know if I am using runif correctly to create a data frame of outliers, and second, I do not know how to inject the outliers into temp.
The code I've tried until now is:
addOutlier <- function (data, amount){
  maxi <- apply(data, 2, function(x) (mean(x) + (3 * (sd(x)))))
  mini <- apply(data, 2, function(x) (mean(x) - (3 * (sd(x)))))
  temp <- data
  amount2 <- ifelse(amount < 1, (prod(dim(data)) * amount), amount)
  outs <- runif(amount2, 2, min = mini, max = maxi) # outliers
  if (amount2 >= prod(dim(data))) stop("exceeded data size")
  for (i in 1:length(outs))
    temp[sample.int(nrow(temp), 1), sample.int(ncol(temp), 1)] <- outs
  return (temp)
}
Any help to make this work will be deeply appreciated.
My understanding is that what you're trying to achieve is adding a set number of outliers to each column in your data frame. Alternatively, you seem to also be looking into adding a percentage of outliers to each column. I wrote down a solution only for the former case, but the latter should be pretty easy to implement if you really need it (see the sketch after the example below). Note how I broke things down into two functions, to (hopefully) help clarify what is going on. Hope this helps!
add.outlier.to.vector <- function(vector, amount) {
  cells.to.modify <- sample(1:length(vector), amount, replace = FALSE)
  mean.val <- mean(vector)
  sd.val <- sd(vector)
  min.val <- mean.val - 3 * sd.val
  max.val <- mean.val + 3 * sd.val
  vector[cells.to.modify] <- runif(amount, min = min.val, max = max.val)
  return(vector)
}

add.outlier.to.data.frame <- function (temp, amount){
  for (i in 1:ncol(temp)) {
    temp[,i] <- add.outlier.to.vector(temp[,i], amount)
  }
  return (temp)
}
data <- data.frame(
a=c(1,2,3,4),
b=c(7,8,9,10)
)
add.outlier.to.data.frame(data, 2)
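For the percentage-based variant mentioned above, one possible sketch (the function name and the rounding rule are my own, not part of the original answer) is to translate the fraction into a per-column count and reuse add.outlier.to.vector:
add.outlier.fraction.to.data.frame <- function(temp, fraction) {
  n <- max(1, round(nrow(temp) * fraction))  # cells to modify per column
  for (i in 1:ncol(temp)) {
    temp[, i] <- add.outlier.to.vector(temp[, i], n)
  }
  return(temp)
}
add.outlier.fraction.to.data.frame(data, 0.5)  # replace roughly 50% of each column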
I currently have the following code that produces the desired results I want (Data_Index and Data_Percentages)
Input_Data <- read.csv("http://dl.dropbox.com/u/881843/RPubsData/gd/2010_pop_estimates.csv", row.names=1, stringsAsFactors = FALSE)
Input_Data <- data.frame(head(Input_Data))
Rows <-nrow(Input_Data)
Vars <-ncol(Input_Data) - 1
#Total population column
TotalCount <- Input_Data[1]
#Total population sum
TotalCountSum <- sum(TotalCount)
Input_Data[1] <- NULL
VarNames <- colnames(Input_Data)
Data_Per_Row <- c()
Data_Index_Row <- c()
for (i in 1:Rows) {
  # Proportion of all areas' population found in this row
  OAPer <- TotalCount[i, ] / TotalCountSum * 100
  Data_Per_Col <- c()
  Data_Index_Col <- c()
  for(u in 1:Vars) {
    # For every column value in the selected row,
    # the percentage of that value relative to the
    # total population (TotalCount) for that row is calculated
    VarPer <- Input_Data[i, u] / TotalCount[i, ] * 100
    # Once the percentage is calculated, the index
    # score is calculated by dividing this percentage
    # by the proportion of the total population in that
    # area compared to all areas
    VarIndex <- VarPer / OAPer * 100
    # Binds results for all columns in the row
    Data_Per_Col <- cbind(Data_Per_Col, VarPer)
    Data_Index_Col <- cbind(Data_Index_Col, VarIndex)
  }
  # Binds results for the completed row with previously completed rows
  Data_Per_Row <- rbind(Data_Per_Row, Data_Per_Col)
  Data_Index_Row <- rbind(Data_Index_Row, Data_Index_Col)
}
colnames(Data_Per_Row) <- VarNames
colnames(Data_Index_Row) <- VarNames
# Changes the index scores to range from -1 to 1
OldRange <- (max(Data_Index_Row) - min(Data_Index_Row))
NewRange <- (1 - -1)
Data_Index <- (((Data_Index_Row - min(Data_Index_Row)) * NewRange) / OldRange) + -1
Data_Percentages <- Data_Per_Row
# Final outputs
Data_Index
Data_Percentages
The problem I have is that the code is very slow. I want to be able to use it on a dataset that has 200,000 rows and 200 columns (which, using the code as it stands, would take around 4 days). I am sure there must be a way of speeding this process up, but I am not sure how exactly.
What the code is doing is taking (in this example) a population counts table, divided into age bands and different areas, and turning it into percentages and index scores. Currently there are 2 loops, so every value in every row and column is selected individually and has calculations performed on it. I assume it is these loops that make it run slowly; are there any alternatives that produce the same results, but quicker? Thanks for any help you can offer.
Here is your entire code, rewritten. The for-loops are not necessary, and neither is apply: the divisions can be done on whole matrices directly.
df <- Input_Data
total_count <- df[, 1]
total_sum <- sum(total_count)
df <- df[, -1]
# equivalent of your for-loop
oa_per <- total_count/total_sum * 100
Data_Per_Row <- df / matrix(rep(total_count, each = ncol(df)), ncol = ncol(df), byrow = TRUE) * 100
Data_Index_Row <- Data_Per_Row/oa_per * 100
names(Data_Per_Row) <- names(Data_Index_Row) <- names(df)
# rest of your code: identical
OldRange = max(Data_Index_Row) - min(Data_Index_Row)
NewRange = (1 - -1)
Data_Index = (((Data_Index_Row - min(Data_Index_Row)) * NewRange) / OldRange) + -1
Data_Percentages <- Data_Per_Row
Get rid of the "i" loop and use apply to calculate OAPer:
OAPer <- apply(TotalCount, 1,
               function(x, tcs) x / tcs * 100,
               tcs = TotalCountSum)
Likewise, you can vectorise the work inside the "u" loop as well (see the sketch below). I would also appreciate some comments in your code.
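A sketch of what vectorising the "u" loop could look like (assuming the objects defined in the question, after the total-population column has been removed from Input_Data; not benchmarked):
Data_Per_Row   <- sweep(as.matrix(Input_Data), 1, TotalCount[, 1], "/") * 100  # VarPer for all cells
Data_Index_Row <- sweep(Data_Per_Row, 1, OAPer, "/") * 100                     # VarIndex for all cells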