I have a data frame of 3 points in space represented by their longitude and latitude:
myData <- structure(list(lng = c(-37.06852042, -37.07473406, -37.07683313
), lat = c(-11.01471746, -11.02468103, -11.02806217)), .Names = c("lng",
"lat"), row.names = c(NA, 3L), class = "data.frame")
Next, I use the geosphere package to get a distance matrix (in meters, which I convert to km) for the points:
> m <- round(distm(myData)/1000,2)
> rownames(m) <- c("A", "B", "C")
> colnames(m) <- c("A", "B", "C")
> m
A B C
A 0.00 1.30 1.74
B 1.30 0.00 0.44
C 1.74 0.44 0.00
Given that this is a distance matrix and there are 6 ways of visiting A, B and C (A -> B -> C, C -> A -> B, and so on), I would like to extract some information from it, such as the minimum, the median, and the maximum route distance.
To illustrate it, I calculated all the possible ways of my example manually:
ways <- c(abc = 1.3 + 0.44,
acb = 1.74 + 0.44,
bac = 1.3 + 1.74,
bca = 0.44 + 1.74,
cab = 1.74 + 1.3,
cba = 0.44 + 1.3)
> min(ways)
[1] 1.74
> median(ways)
[1] 2.18
> max(ways)
[1] 3.04
How do I automate this task, given that I'll be working with more than 10 locations and this problem has factorial complexity?
I wrote a package called trotter that maps integers to different arrangement types (permutations, combinations and others). For this problem, it seems that you are interested in the permutations of locations. One of the objects in the package is the permutation pseudo-vector that is created using the function ppv.
First install "trotter":
install.packages("trotter")
Then an automated version of your task might look something like:
library(geosphere)
myData <- data.frame(
lng = c(-37.06852042, -37.07473406, -37.07683313),
lat = c(-11.01471746, -11.02468103, -11.02806217)
)
m <- round(distm(myData) / 1000, 2)
locations <- c("A", "B", "C")
rownames(m) <- colnames(m) <- locations
library(trotter)
perms <- ppv(k = length(locations), items = locations)
ways <- c()
for (i in 1:length(perms)) {
perm <- perms[i]
route <- paste(perm, collapse = "")
ways[[route]] <- sum(
sapply(
1:(length(perm) - 1),
function(i) m[perm[i], perm[i + 1]]
)
)
}
Back in the R console:
> ways
ABC ACB CAB CBA BCA BAC
1.74 2.18 3.04 1.74 2.18 3.04
> # What is the minimum route length?
> min(ways)
[1] 1.74
> # Which route (index) is this?
> which.min(ways)
ABC
1
Just remember, like you said, you're dealing with factorial complexity and you might end up waiting a while running this brute force search with more than a few locations...
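If you would rather not install an extra package, the same brute-force idea can be sketched in base R. This is only an illustration under a few assumptions: the distance matrix is typed in from the question, and all_perms and route_length are hypothetical helper names, not functions from any package. It also returns the median and maximum asked for in the question.

```r
# Distance matrix from the question (km), entered by hand
locations <- c("A", "B", "C")
m <- matrix(c(0.00, 1.30, 1.74,
              1.30, 0.00, 0.44,
              1.74, 0.44, 0.00),
            nrow = 3, dimnames = list(locations, locations))

# Hypothetical helper: every ordering of v, one ordering per row
all_perms <- function(v) {
  if (length(v) <= 1) return(matrix(v, nrow = 1))
  do.call(rbind, lapply(seq_along(v), function(i) {
    cbind(v[i], all_perms(v[-i]))
  }))
}

# Hypothetical helper: sum the consecutive legs of one route
route_length <- function(p) {
  sum(m[cbind(p[-length(p)], p[-1])])
}

perms <- all_perms(locations)
ways <- apply(perms, 1, route_length)
names(ways) <- apply(perms, 1, paste, collapse = "")

c(min = min(ways), median = median(ways), max = max(ways))
```

The recursion builds all n! orderings in memory, so it hits the same factorial wall as the loop above once the number of locations grows.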
I'm trying to make this function work, but am failing.
What I need is a function that reads the column names of a data frame and uses them to perform a Wilcoxon test on each of those columns. "result" would be the main final product: a table with one row per genus, holding the genus name and its p-value. I've also added a plotting step to visualise the values between groups for each column, saving each plot under the name of the corresponding genus.
library("dplyr")
library("ggpubr")
library(PairedData)
library(tidyr)
process <- function(data, genus){
group_by(data,group) %>%summarise(
count = n(),
median = median(genus, na.rm = TRUE),
IQR = IQR(genus, na.rm = TRUE)
)
# Subset data before and after treatment
T0 <- subset(data, group == "T0", genus,drop = TRUE)
T2 <- subset(data, group == "T2", genus,drop = TRUE)
#Wilcoxon test for paired data, I want a table of names and corresponding p-values
res <- wilcox.test(T0, T2, paired = TRUE)
res$p.value
result <- spread(genus,res$p.value)
# Plot paired data, with title depending on the data and its p-value (this last one could be optional)
pd <- paired(T0, T2)
tiff(genus".tiff", width = 600, height = 400)
plot(pd, type = "profile") + labs(title=print(data[,genus]", paired p-value="res[,p.value]) +theme_bw()
dev.off()
}
l <- length(my_data)
glist <- list(colnames(my_data[3:l])) #bacteria start at col 3
wilcoxon <- process(data = my_data, genus = glist)
A reproducible dataset could be
my_data
Patient group Subdoligranulum Agathobacter
pt_10T0 T0 0.02 0.00
pt_10T2 T2 10.71 19.89
pt_15T0 T0 29.97 0.28
pt_15T2 T2 16.10 7.70
pt_20T0 T0 2.39 0.44
pt_20T2 T2 20.48 3.35
pt_32T0 T0 12.23 0.17
pt_32T2 T2 37.11 1.87
pt_36T0 T0 0.64 0.03
pt_36T2 T2 0.02 0.08
pt_39T0 T0 0.04 0.01
pt_39T2 T2 0.36 0.05
pt_3t0 T0 13.23 1.34
pt_3T2 T2 19.22 1.51
pt_9T0 T0 11.69 0.57
pt_9T2 T2 34.56 3.52
I'm not very familiar with functions, and haven't found yet a good tutorial on how to make them from a dataframe... so this is my best attempt, I hope some of you can make it work.
Thank you for the help!
Simply return the needed value at the end of the function. The version below leaves the plot step commented out and untested (it depends on packages I don't have), but is otherwise adjusted to valid R syntax:
proc_wilcox <- function(data, genus){
# Subset data before and after treatment
T0 <- data[[genus]][data$group == "T0"]
T2 <- data[[genus]][data$group == "T2"]
# Wilcoxon test for paired data
res <- wilcox.test(T0, T2, paired = TRUE)
# Plot paired data, with title depending on the data and its p-value
# pd <- paired(T0, T2)
# tiff(paste0(genus, ".tiff"), width = 600, height = 400)
# plot(pd, type = "profile") +
# labs(title=paste0(genus, " paired p-value= ", res$p.value)) +
# theme_bw()
# dev.off()
return(res$p.value)
}
Then call the function with an apply-family function such as sapply, or the slightly faster vapply, both of which iterate over a vector and return a result of the same length.
# VECTOR OF RESULTS (USING sapply)
wilcoxon_results <- sapply(
names(my_data)[3:ncol(my_data)],
function(col) proc_wilcox(my_data, col)
)
# VECTOR OF RESULTS (USING vapply)
wilcoxon_results <- vapply(
names(my_data)[3:ncol(my_data)],
function(col) proc_wilcox(my_data, col),
numeric(1)
)
wilcoxon_results
# Subdoligranulum Agathobacter
# 0.1484375 0.0078125
wilcoxon_df <- data.frame(wilcoxon_results)
wilcoxon_df
# wilcoxon_results
# Subdoligranulum 0.1484375
# Agathobacter 0.0078125
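If you want the genus name as a proper column rather than as row names, something like the following should work. It is a self-contained sketch: my_data is rebuilt by hand from the table in the question, and the column names genus and p_value are my own choice. Note that the paired test relies on the T0 and T2 rows appearing in the same patient order.

```r
# Data typed in from the question (patients in the same order within each group)
my_data <- data.frame(
  Patient = paste0("pt_", rep(c(10, 15, 20, 32, 36, 39, 3, 9), each = 2),
                   rep(c("T0", "T2"), times = 8)),
  group = rep(c("T0", "T2"), times = 8),
  Subdoligranulum = c(0.02, 10.71, 29.97, 16.10, 2.39, 20.48, 12.23, 37.11,
                      0.64, 0.02, 0.04, 0.36, 13.23, 19.22, 11.69, 34.56),
  Agathobacter = c(0.00, 19.89, 0.28, 7.70, 0.44, 3.35, 0.17, 1.87,
                   0.03, 0.08, 0.01, 0.05, 1.34, 1.51, 0.57, 3.52)
)

genera <- names(my_data)[3:ncol(my_data)]  # bacteria start at column 3

# One paired Wilcoxon p-value per genus
p_values <- vapply(genera, function(col) {
  wilcox.test(my_data[[col]][my_data$group == "T0"],
              my_data[[col]][my_data$group == "T2"],
              paired = TRUE)$p.value
}, numeric(1))

result <- data.frame(genus = genera, p_value = unname(p_values))
result
```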
My current mission: pick some "good" columns from an incomplete matrix, removing NAs while keeping as much real data as possible.
My idea: I can calculate every column's share of missing data (NA%). For a given threshold t, all columns with NA% > t will be removed. The removed columns also contain some real data, so for these columns the ratio present/missing is the "price" of deleting them. For each dataset, I want to find the threshold with the lowest "price" that still deletes as many NAs as possible.
I already wrote my function till the last 2 steps:
myfunc1 <- function(x){
return(sum(is.na(x)))
}
myfunc2 <- function(x){
return (round(myfunc1(x) / length(x),4))
}
myfunc3 <- function(t, set){
m <- which(apply(set, MARGIN = 2, myfunc2) > t)
missed <- sum(is.na(set[m]))
present <- sum(!is.na(set[m]))
return(present/ missed)
}
myfunc3(0.5, setA) # worked
threshold <- seq(from = 0, to = 0.95, by = 0.05)
apply(X = threshold, MARGIN = 1, FUN = myfunc3, set = setA) # not worked. stuck here.
I have 10 datasets from setA to setJ, I want to test all thresholds from 0 to 0.95. I want a matrix as a return with 10 datasets as column and 20 rows threshold with every 0.05 interval.
Did I do this correctly? Are there better ideas, or already existing libraries that I could use?
----------edit: example-----------
setA <- data.frame(cbind(c(1,2,3,4,NA,6,7,NA), c(1,2,NA,4,5,NA,NA,8),c(1,2,3,4,5,6,NA,8), c(1,2,3,4,5,6,7,8),c(NA,NA,NA,4,NA,6,NA,NA)))
colnames(setA) <- sprintf("col%s", seq(1:5))
rownames(setA) <- sprintf("sample%s", seq(1:8))
View(setA)
myfunc1 <- function(x){
return(sum(is.na(x)))
}
myfunc2 <- function(x){
return (round(myfunc1(x) / length(x),4))
}
myfunc3 <- function(t, set){
m <- which(apply(set, MARGIN = 2, myfunc2) > t)
missed <- sum(is.na(set[m]))
present <- sum(!is.na(set[m]))
return(present/ missed)
}
In setA, there are 8 samples. Each sample has 5 attributes to describe the sample. Unfortunately, some data are missing. I need to delete some column with too many NAs. First, let me calculate every column's NA% .
> apply(setA, MARGIN = 2, myfunc2)
col1 col2 col3 col4 col5
0.250 0.375 0.125 0.000 0.750
If I set the threshold t = 0.3, that means col2, col5 are considered too many NAs and need to be deleted, others are acceptable. If I delete the 2 columns, I also delete some real data. (I deleted 7 real data and 9 NAs, 7/9 = 0.78. This means I sacrifice 0.78 real data when I delete 1 NA)
> myfunc3(0.3, setA)
[1] 0.7777778
I want to try every threshold's result and then decide.
threshold <- seq(from = 0, to = 0.9, by = 0.1)
apply(X= threshold, MARGIN = 1, FUN = myfunc3, set = setA) # not work
I manually calculated the setA part:
threshold: 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
price: 1.667 1.667 1.182 0.778 0.333 0.333 0.333 0.333 NaN NaN
In the end I want a table like:
threshold: 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
setA: 1.667 1.667 1.182 0.778 0.333 0.333 0.333 0.333 NaN NaN
setB:
setC:
...
setJ:
Is this the right approach to the problem?
-----------Edit---------------
I already solved the problem and please close the thread.
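For anyone landing here with the same error: apply needs an object with dimensions (a matrix or data frame), and threshold is a plain vector, so sapply is the usual replacement. A minimal sketch, assuming the data sets are collected in a named list; the names na_price and sets are hypothetical, and only setA from the example is filled in:

```r
setA <- data.frame(
  col1 = c(1, 2, 3, 4, NA, 6, 7, NA),
  col2 = c(1, 2, NA, 4, 5, NA, NA, 8),
  col3 = c(1, 2, 3, 4, 5, 6, NA, 8),
  col4 = c(1, 2, 3, 4, 5, 6, 7, 8),
  col5 = c(NA, NA, NA, 4, NA, 6, NA, NA)
)

# Ratio of real values to NAs among the columns whose NA share exceeds thr
na_price <- function(thr, set) {
  drop <- colMeans(is.na(set)) > thr
  sum(!is.na(set[drop])) / sum(is.na(set[drop]))
}

threshold <- seq(from = 0, to = 0.9, by = 0.1)
sets <- list(setA = setA)  # hypothetical list; add setB ... setJ here

# One row per data set, one column per threshold
price <- t(sapply(sets, function(s) sapply(threshold, na_price, set = s)))
colnames(price) <- threshold
round(price, 3)
```

When no column exceeds the threshold, the ratio is 0/0 and the cell is NaN, which matches the manual table for thresholds 0.8 and 0.9.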
I have this script:
library(plyr)
library(gstat)
library(sp)
library(dplyr)
library(ggplot2)
library(scales)
a<-c(10,20,30,40,50,60,70,80,90,100)
b<-c(15,25,35,45,55,65,75,85,95,105)
x<-rep(a,3)
y<-rep(b,3)
E<-sample(30)
freq<-rep(c(100,200,300),10)
data<-data.frame(x,y,freq,E)
data<-arrange(data,x,y,freq)
df <- ddply(data,"freq", function (h){
dim_h<-length(h$x)
perc_max <- 0.9
perc_min <- 0.8
u<-round((seq(perc_max,perc_min,by=-0.1))*dim_h)
dim_u<-length(u)
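perc_punti<- percent(seq(perc_max,perc_min,by=-0.1))
# pre-allocate the result vectors so the loops do not rely on objects left in the workspace
time <- numeric(dim_u)
sqmm <- numeric(dim_u)
sqm <- numeric(2)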
for (i in 1:dim_u)
{ t<-u[i]
time[i]<-system.time(
for (j in 1:2)
{
df_tass <- sample_n(h, t)
df_residuo <- slice(h,-as.numeric(rownames(df_tass)))
coordinates(df_tass)= ~x + y
x.range <- range(h$x)
y.range <- range(h$y)
grid <- expand.grid(x = seq(from = x.range[1], to = x.range[2], by = 1), y = seq(from = y.range[1],
to = y.range[2], by = 1))
coordinates(grid) <- ~x + y
gridded(grid) <- TRUE
nearest = krige(E ~ 1, df_tass, grid, nmax = 1)
nearest_df<-as.data.frame(nearest)
names(nearest_df) <- c("x", "y", "E")
#Error of prediction
df_pred <- inner_join(nearest_df[1:3],select(df_residuo,x,y,E),by=c("x","y"))
names(df_pred) <- c("x", "y", "E_pred","E")
sqm[j] <- mean((df_pred[,4]-df_pred[,3])^2)
})[3]
sqmm[i]<-mean(sqm)
}
df_finale<-data.frame(sqmm,time,perc_punti)
})
df
I measured the electromagnetic field (E value) at several points with coordinates (x, y) and at different frequencies (freq value). For each frequency, I use first 90% and then 80% of the points (the outer for loop over i) to interpolate E on a grid with nearest-neighbour interpolation (the krige function), and I repeat this 2 times (the inner loop over j). The remaining points are then used to calculate the prediction error. I hope it's clear.
The script above is a simplified case. In my real case it takes too long because of the two nested for loops.
I want to ask if it's possible to simplify the code in some way, for instance by using the apply function family. Thanks.
Reply to @clemlaflemme: OK, it works, thanks! Now I have a small problem with the final data frame, which looks like this:
freq 1 2
1 100 121.00 338.00
2 100 0.47 0.85
3 200 81.00 462.50
4 200 0.74 0.73
5 300 36.00 234.00
6 300 0.82 0.76
but I want something like this:
freq sqmm time
1 100 121.0 0.47
2 100 338.0 0.85
3 200 81.0 0.74
4 200 462.5 0.73
5 300 36.0 0.82
6 300 234.0 0.76
How can I do that?
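One way to get from the first layout to the second, sketched in base R. It assumes that within each freq the first row holds the sqmm values and the second row the times, and that the columns named 1 and 2 correspond to the two percentages; the data frame wide is typed in by hand from the printout above.

```r
wide <- data.frame(
  freq = c(100, 100, 200, 200, 300, 300),
  `1` = c(121, 0.47, 81, 0.74, 36, 0.82),
  `2` = c(338, 0.85, 462.5, 0.73, 234, 0.76),
  check.names = FALSE
)

# For each frequency, turn the sqmm row and the time row into two columns
long <- do.call(rbind, lapply(split(wide, wide$freq), function(block) {
  data.frame(
    freq = block$freq[1],
    sqmm = unlist(block[1, -1], use.names = FALSE),
    time = unlist(block[2, -1], use.names = FALSE)
  )
}))
rownames(long) <- NULL
long
```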
I have a series of geographical positions at sea for which I am trying to get geological sediment type information. I am using an export of the national British geological sediment database (df1), which is a large data set of coordinates and sediment information.
Currently I round the coordinates in the BGS export file (df1) and average/recalculate the sediment type for each of the resulting coordinate squares. I then round my own coordinates (df2) and match them to these squares to get a sediment classification.
The BGS export looks like this (df1);
NUM X Y GRAV SAND MUD
1 228 1.93656 52.31307 1.07 98.83 0.10
2 142 1.84667 52.45333 0.00 52.60 47.40
3 182 1.91950 52.17750 9.48 90.38 0.14
4 124 1.88333 52.70833 0.00 98.80 1.20
5 2807 1.91050 51.45000 2.05 97.91 0.05
6 2787 1.74683 51.99382 41.32 52.08 6.60
7 2776 1.66117 51.63550 9.83 87.36 2.81
8 2763 1.82467 51.71767 43.92 47.25 8.83
9 2753 1.76867 51.96349 57.66 39.18 3.15
10 68 2.86967 52.96333 0.30 98.90 0.80
11 2912 1.70083 51.77783 26.90 64.87 8.22
12 2914 1.59750 51.88882 32.00 65.02 2.97
13 2886 1.98833 51.34267 1.05 98.91 0.04
14 2891 1.87817 51.31549 68.57 31.34 0.08
15 2898 1.37433 51.41249 35.93 61.48 2.59
16 45 2.06667 51.82500 9.70 88.10 2.20
17 2904 1.63617 51.45999 16.28 66.67 17.05
My positions at sea look like this (df2);
haul DecStartLat DecStartLong
1993H_2 55.23983 -5.512830
2794H_1 55.26670 -5.516700
1993H_1 55.27183 -5.521330
0709A_71 55.26569 -5.519730
0396H_2 55.44120 -5.917800
0299H_2 55.44015 -5.917310
0514A_26 55.46897 -5.912167
0411A_64 55.47289 -5.911820
0410A_65 55.46869 -5.911930
0514A_24 55.63585 -5.783500
0295H_4 55.57250 -5.754300
0410A_62 55.63656 -6.041870
0413A_53 55.73280 -6.020600
0396H_13 55.66470 -6.002300
2794H_8 55.83330 -5.883300
0612A_15 55.84025 -5.912130
0410A_74 55.84311 -5.910180
0299H_16 55.90568 -5.732490
0200H_18 55.88600 -5.742900
0612A_18 55.90450 -5.835880
This is my script...
get.Sed.type <- function(x,y) {
x$Y2 <- round(x$Y, digits=1)
x$X2 <- round(x$X, digits=1)
x$BGSQ <- paste(x$Y2,x$X2,sep="_")
x$RATIO <- x$SAND/x$MUD
x <- aggregate(cbind(GRAV,RATIO)~BGSQ,data=x,FUN=mean)
FOLK <- (x$GRAV)
FOLK[(FOLK)<1] <- 0
FOLK[(FOLK)>=1&(FOLK)<5] <- 1
FOLK[(FOLK)>=5&(FOLK)<30] <- 5
FOLK[(FOLK)>=30&(FOLK)<80] <- 30
FOLK[(FOLK)>=80] <- 80
R_CLASS <- (x$RATIO)
R_CLASS[(R_CLASS)<1/9] <- 0
R_CLASS[(R_CLASS)>=1/9&(R_CLASS)<1] <- 0.1
R_CLASS[(R_CLASS)>=1&(R_CLASS)<9] <- 1
R_CLASS[(R_CLASS)>=9] <- 9
x$FOLK_CLASS <- NA_character_
x$FOLK_CLASS[(R_CLASS)==0&(FOLK)==0] <- "M"
x$FOLK_CLASS[(R_CLASS)%in%c(0,0.1)&(FOLK)==5] <- "gM"
x$FOLK_CLASS[(R_CLASS)==0.1&(FOLK)==0] <- "sM"
x$FOLK_CLASS[(R_CLASS)==0&(FOLK)==1] <- "(g)M"
x$FOLK_CLASS[(R_CLASS)==0.1&(FOLK)==1] <- "(g)sM"
x$FOLK_CLASS[(R_CLASS)==9&(FOLK)==0] <- "S"
x$FOLK_CLASS[(R_CLASS)==1&(FOLK)==0] <- "mS"
x$FOLK_CLASS[(R_CLASS)==9&(FOLK)==1] <- "(g)S"
x$FOLK_CLASS[(R_CLASS)==1&(FOLK)==1] <- "(g)mS"
x$FOLK_CLASS[(R_CLASS)==1&(FOLK)==5] <- "gmS"
x$FOLK_CLASS[(R_CLASS)==9&(FOLK)==5] <- "gS"
x$FOLK_CLASS[(FOLK)==80] <- "G"
x$FOLK_CLASS[(R_CLASS)%in%c(0,0.1)&(FOLK)==30] <- "mG"
x$FOLK_CLASS[(R_CLASS)==1&(FOLK)==30] <- "msG"
x$FOLK_CLASS[(R_CLASS)==9&(FOLK)==30] <- "sG"
y$Lat <- round(y$DecStartLat, digits=1)
y$Long <- round(y$DecStartLong, digits=1)
y$LATLONG100_sq <- paste(y$Lat,y$Long,sep="_")
y <- merge(y, x[,c(1,4)],all.x=TRUE,by.x="LATLONG100_sq",by.y="BGSQ")
#Delete unwanted columns
y <- y[, !(colnames(y) %in% c("Lat","Long","LATLONG100_sq"))]
#Name column something logical
colnames(y)[colnames(y) == 'FOLK_CLASS'] <- 'BGS_class'
return(y)
}
However, there are a dozen or so positions in df2 for which there are no corresponding values in the BGS export (df1). I want to know how I can either repeat the averaging for the squares surrounding the respective square (i.e. round to a coarser grid and repeat the process) or find the coordinate in the BGS export file that is closest in proximity and take its existing value.
Going for the second option stated in the question, I suggest to frame the question as follows:
Say that you have a set of m coordinates from db1 and n coordinates from db2, m <=n, and that currently the intersection of these sets is empty.
You'd like to match each point from db1 with a point from db2 such that the "error" of the matching, e.g. sum of distances, will be minimized.
A simple greedy approach for solving this might be to generate an m x n matrix with the distances between each pair of coordinates, and sequentially select the closest match for each point.
Of course, If there are many points to match, or if you're after an optimal solution, you may want to consider more elaborate matching algorithms (e.g. the Hungarian algorithm).
Code:
#generate some data (this data will generate sub-optimal matching with greedy matching)
db1 <- data.frame(id=c("a1","a2","a3","a4"), x=c(1,5,10,20), y=c(1,5,10,20))
db2 <- data.frame(id=c("b1","b2","b3","b4"),x=c(1.1,2.1,8.1,14.1), y=c(1.1,1.1,8.1,14.1))
#create cartesian product
product <- merge(db1, db2, by=NULL)
#calculate Euclidean distances for each possible matching
product$d <- sqrt((product$x.x - product$x.y)^2 + (product$y.x - product$y.y)^2)
#(naively & greedily) find the best match for each point
sorted <- product[ order(product[,"d"]), ]
found <- vector()
res <- vector() #this vector will hold the result
for (i in 1:nrow(db1)) {
for (j in 1:nrow(sorted)) {
db2_val <- as.character(sorted[j,"id.y"])
if (sorted[j,"id.x"] == db1[i, "id"] && !(db2_val %in% found)) {
#print(paste("matching ", db1[i, "id"], " with ", db2_val))
res[i] <- db2_val
found <- c(found, db2_val)
break
}
}
}
Note that I'm sure the code can be improved and made more elegant by using methods other than loops.
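If you do want the optimal matching mentioned above rather than the greedy one, the clue package offers solve_LSAP, which solves the linear sum assignment problem. The sketch below assumes clue is installed and reuses the toy db1 and db2 from the code above; cost, assignment and matched are names I made up for the example.

```r
library(clue)  # assumed to be installed; provides solve_LSAP()

db1 <- data.frame(id = c("a1", "a2", "a3", "a4"),
                  x = c(1, 5, 10, 20), y = c(1, 5, 10, 20))
db2 <- data.frame(id = c("b1", "b2", "b3", "b4"),
                  x = c(1.1, 2.1, 8.1, 14.1), y = c(1.1, 1.1, 8.1, 14.1))

# Euclidean distance matrix: rows are db1 points, columns are db2 points
cost <- sqrt(outer(db1$x, db2$x, "-")^2 + outer(db1$y, db2$y, "-")^2)

# Assignment that minimises the summed distance
assignment <- as.integer(solve_LSAP(cost))

matched <- data.frame(
  db1 = db1$id,
  db2 = db2$id[assignment],
  d = cost[cbind(seq_len(nrow(cost)), assignment)]
)
matched
sum(matched$d)
```

Comparing sum(matched$d) with the total distance of the greedy result shows how much the greedy pass gives up on this data. solve_LSAP needs at least as many columns as rows, which fits the m <= n framing above.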
Hopefully I do not misunderstand, but as far as I get from the title, you need to match based on minimum distance. If this distance is allowed to be Euclidean distance, then one can use the fast RANN package, if not, then one needs to compute the great circle distance.
Some of the provided data
BGS_df <-
read.table(text =
" NUM X Y GRAV SAND MUD
1 228 1.93656 52.31307 1.07 98.83 0.10
2 142 1.84667 52.45333 0.00 52.60 47.40
3 182 1.91950 52.17750 9.48 90.38 0.14
4 124 1.88333 52.70833 0.00 98.80 1.20
5 2807 1.91050 51.45000 2.05 97.91 0.05",
header = TRUE)
my_positions <-
read.table(text =
"haul DecStartLat DecStartLong
1993H_2 55.23983 -5.512830
2794H_1 55.26670 -5.516700
1993H_1 55.27183 -5.521330",
header = TRUE)
Euclidean distance (using RANN package)
library(RANN)
# For each point in my_positions, find the nearest neighbor from BGS_df:
# Give X and then Y (longitude and then latitude)
# Note that argument k sets the number of nearest neighbours, here 1 (the closest)
closest_RANN <- RANN::nn2(data = BGS_df[, c("X", "Y")],
query = my_positions[, c("DecStartLong", "DecStartLat")],
k = 1)
results_RANN <- cbind(my_positions[, c("haul", "DecStartLong", "DecStartLat")],
BGS_df[closest_RANN$nn.idx, ])
results_RANN
# haul DecStartLong DecStartLat NUM X Y GRAV SAND MUD
# 4 1993H_2 -5.51283 55.23983 124 1.88333 52.70833 0 98.8 1.2
# 4.1 2794H_1 -5.51670 55.26670 124 1.88333 52.70833 0 98.8 1.2
# 4.2 1993H_1 -5.52133 55.27183 124 1.88333 52.70833 0 98.8 1.2
Great circle distance (using geosphere package)
library(geosphere)
# Compute matrix of great circle distances
dist_mat <- geosphere::distm(x = BGS_df[, c("X", "Y")],
y = my_positions[, c("DecStartLong", "DecStartLat")],
fun = distHaversine) # can try other distances
# For each column (point in my_positions) get the index of row of min dist
# (corresponds to row index in BGS_df)
BGS_idx <- apply(dist_mat, 2, which.min)
results_geo <- cbind(my_positions[, c("haul", "DecStartLong", "DecStartLat")],
BGS_df[BGS_idx, ])
identical(results_geo, results_RANN) # here TRUE, but not always expected
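To see why the two approaches can disagree, it may help to look at what a great-circle distance actually computes. Below is a rough base R sketch of the haversine formula; haversine_m is a hypothetical helper written for this illustration, and the default radius of 6378137 m is, as far as I know, the same default distHaversine uses.

```r
# Great-circle distance in metres between lon/lat points given in degrees
haversine_m <- function(lon1, lat1, lon2, lat2, r = 6378137) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

# One degree of longitude at the equator versus at 55 degrees north
haversine_m(0, 0, 1, 0)
haversine_m(0, 55, 1, 55)
```

A degree of longitude covers much less ground at 55 degrees north than at the equator, whereas a Euclidean distance on raw degrees treats both the same. That is why the nearest neighbour found by RANN on lon/lat columns is not guaranteed to be the nearest point on the ground.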
Here's what I can use to list the weights for all terminal nodes. How can I add some code to get the response prediction as well as the weight for each terminal node ID?
Say I want my output to look like this:
--
Here below is what I have so far to get the weight
nodes(airct, unique(where(airct)))
Thank you
The BinaryTree is a big S4 object, so it is sometimes difficult to extract the data.
However, the plot method for BinaryTree objects has an optional panel function of the form function(node) for plotting the terminal nodes, so you can collect node information while plotting.
Here I use the plot function to extract the information, and I use the gridExtra package to draw each terminal node as a table.
library(party)
library(gridExtra)
set.seed(100)
lls <- data.frame(N = gl(3, 50, labels = c("A", "B", "C")),
a = rnorm(150) + rep(c(1, 0,150)),
b = runif(150))
pond= sample(1:5,150,replace=TRUE)
tt <- ctree(formula=N~a+b, data=lls,weights = pond)
output.df <- data.frame()
innerWeights <- function(node){
dat <- data.frame (x=node$nodeID,
y=sum(node$weights),
z=paste(round(node$prediction,2),collapse=' '))
grid.table(dat,
cols = c('ID','Weights','Prediction'),
h.even.alpha=1,
h.odd.alpha=1,
v.even.alpha=0.5,
v.odd.alpha=1)
output.df <<- rbind(output.df,dat) # note the use of <<-
}
plot(tt, type='simple', terminal_panel = innerWeights)
data
ID Weights Prediction
1 4 24 0.42 0.5 0.08
2 5 17 0.06 0.24 0.71
3 6 24 0.08 0 0.92
4 7 388 0.37 0.37 0.26
Here's what I found; it works fine, with a bit of extra information. I'm posting it here in case someone needs it in the future.
y <- do.call(rbind, nodes(tt, unique(where(tt))))
write.table(y, 'clipboard', sep='\t')
@agstudy, let me know what you think.
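Building on the nodes() call, the three fields used in the panel function (nodeID, weights, prediction) can also be pulled out without plotting. This is only a sketch: it assumes the party package is installed, uses the airquality tree from the question as I understand it (airct fitted on rows with non-missing Ozone), and term_nodes and node_table are names chosen for the example.

```r
library(party)  # assumed to be installed

# Tree fitted on the airquality rows with a known response
airq <- subset(airquality, !is.na(Ozone))
airct <- ctree(Ozone ~ ., data = airq)

# Terminal node objects, one per distinct terminal node ID
term_ids <- sort(unique(where(airct)))
term_nodes <- nodes(airct, term_ids)

# One row per terminal node: ID, summed case weights, prediction
node_table <- data.frame(
  ID = sapply(term_nodes, function(n) n$nodeID),
  Weights = sapply(term_nodes, function(n) sum(n$weights)),
  Prediction = sapply(term_nodes, function(n) {
    paste(round(n$prediction, 2), collapse = " ")
  })
)
node_table
```

For a classification tree such as the one in the answer above, prediction holds one probability per class, which is why the values are pasted into a single string.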