How to associate variable values from one data frame to another in R

I have a data frame with three columns: x and y are coordinates and z is the value of the dependent variable:
x.range <- c(1,10)
y.range <- c(20,50)
grid <- expand.grid(x = seq(x.range[1], x.range[2], by=0.5),
                    y = seq(y.range[1], y.range[2], by=0.5))
grid$z <- runif(nrow(grid),10, 70)
Now I have another data frame like this, with only x and y values:
x1 <- c(3.7,5.4,9.2)
y1 <- c(41.1,30.3,22.9)
df <- data.frame(x=x1,y=y1)
Now I want to attach to each point of data frame df the z value of the nearest point of data frame grid (the one at the shortest distance). Thanks.

This isn't the prettiest, but it works:
apply(df, 1,
      function(x){
        pythag <- sqrt((x[1] - grid$x)^2 +
                       (x[2] - grid$y)^2)
        grid[which.min(pythag), "z"]
      })
It simply returns the z value of the nearest grid point, using Pythagoras for the distances.
Edit
Recoding to adhere to coding standards:
pythag <- function(x, y, g){
  which.min(((x - g$x)^2 + (y - g$y)^2)^0.5)
}
idx <- mapply(FUN = pythag,
              x = df[["x"]],
              y = df[["y"]],
              MoreArgs = list(g = grid))
grid[idx,]
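For large grids, a k-d tree nearest-neighbour lookup scales much better than computing every pairwise distance; here is a minimal sketch assuming the FNN package is installed:
library(FNN)
# find the single nearest grid point for each row of df via a k-d tree
nn <- get.knnx(grid[, c("x", "y")], df[, c("x", "y")], k = 1)
df$z <- grid$z[nn$nn.index[, 1]]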

Related

Plot graph with values of vectors

I want to visualize the elements of my vectors in a graph. I want to generate a plot with suitable x and y axes and then put the values of my vectors into it as points, with a different color for the values of each vector. How do I do that?
For example: I have 10 elements in vector A and want to put those elements into the graph. The first element of vector A has y-value A[1] and x-value 1, the second element has y-value A[2] and x-value 2, and so on. The same goes for vector B.
vec1 = 1:10
vec2 = 1:10
for(idx in 1:10){
  vec1[idx] = runif(1, min=0, max=100)
  vec2[idx] = runif(1, min=0, max=100)
}
plot(vec1 and vec2) # How do I do this?
dput output for vec1: c(81.9624423747882, 45.583715592511, 56.2400584807619, 8.25600677635521, 82.0227505406365, 45.6240070518106, 68.7916911672801, 94.491201499477, 22.0095717580989, 4.29550902917981)
dput output for vec2: c(29.5684755546972, 68.0154771078378, 52.2058120695874, 2.48502977192402, 91.9532125117257, 24.7736480785534, 66.5003522532061, 79.014728218317, 47.9641782585531, 20.5593338003382)
Try this:
plot(1:20, c(vec1, vec2), col = rep(1:2, each = 10))  # just points; each = 10 gives one colour per vector
lines(1:20, c(vec1, vec2))                            # add lines
# if you wanted the same x's for both sequences the first argument could be
# rep(1:10, 2) instead of 1:20
Note: Your setup code could have been just two lines (no loop):
vec1 = runif(10, min=0, max=100)
vec2 = runif(10, min=0, max=100)
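If you keep the two vectors separate, base R's matplot() will also draw each column of a matrix as its own series in one call; a minimal sketch:
# type = "b" draws both points and connecting lines, one colour per column
matplot(cbind(vec1, vec2), type = "b", pch = 1, col = 1:2,
        xlab = "index", ylab = "value")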
I think the easiest approach is to create a data frame, which is usually what most functions in R expect:
library(tidyverse)
vec1 = 1:10
vec2 = 1:10
for(idx in 1:10){
  vec1[idx] = runif(1, min=0, max=100)
  vec2[idx] = runif(1, min=0, max=100)
}
df <- data.frame(order = 1:10, vec1, vec2) %>%
  pivot_longer(!order, names_to = "color", values_to = "value")
plot(df$order, df$value, col = c("red","blue")[df$color %>% as.factor()])
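Since the data are now in long format and the tidyverse is already attached, the same plot also falls out of ggplot2 directly; a minimal sketch:
# one colour per original vector, with points and connecting lines
ggplot(df, aes(order, value, colour = color)) +
  geom_point() +
  geom_line()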
I'm wondering, or guessing, whether you are aiming for the facility provided by the base-plotting function arrows. This is the example from the ?arrows page:
x <- stats::runif(12); y <- stats::rnorm(12)
i <- order(x, y); x <- x[i]; y <- y[i]
plot(x,y, main = "arrows(.)" )
## draw arrows from point to point :
s <- seq(length(x)-1) # one shorter than data
arrows(x[s], y[s], x[s+1], y[s+1], col = 1:3)
If you wanted instead to plot with each vector (represented by "arrows") starting from the origin it would be:
x <- stats::runif(12); y <- stats::rnorm(12)
# ordering not needed this time
plot(x, y, main = "arrows(.)", xlim = c(0, max(x)))  # to let origin be seen
## draw arrows from origin to point :
s <- length(x)  # one arrow per point this time
arrows(rep(0, s), rep(0, s), x, y, col = 1:3)

R-caret-plyr: how to modify the downSample function to create sampled data of different proportions

Below is the downSample function of caret that I found here.
downSample <- function(x, y, list = FALSE, yname = "Class")
{
  xc <- class(x)
  if(!is.data.frame(x)) x <- as.data.frame(x)
  if(!is.factor(y))
  {
    warning("Down-sampling requires a factor variable as the response. The original data was returned.")
    return(list(x = x, y = y))
  }
  minClass <- min(table(y))
  x$.outcome <- y
  x <- ddply(x, .(y),
             function(dat, n) dat[sample(seq(along = dat$.outcome), n),, drop = FALSE],
             n = minClass)
  y <- x$.outcome
  x <- x[, !(colnames(x) %in% c("y", ".outcome")), drop = FALSE]
  if(list)
  {
    if(xc[1] == "matrix") x <- as.matrix(x)
    out <- list(x = x, y = y)
  } else {
    out <- cbind(x, y)
    colnames(out)[ncol(out)] <- yname
  }
  out
}
Suppose that my data set is iris:
data(iris)
x <- iris[, -5]
y <- iris[, 5]
To make the response variable a hugely unbalanced binary one:
y[-c(130, 146)] <- "setosa"
There are now two instances of "virginica" and 148 instances of "setosa". I would like to modify downSample so that, instead of returning a subsampled data set with 50% of the minor class, it returns one with, for instance, 30% (k) minor class and 70% major class. As written, downSample takes the n instances of the minClass and selects n instances of the other class, producing a fully balanced data set; in my case that loses a lot of data, so I want to balance the classes only partially, not fully.
Let's suppose that k = 20%, i.e. in the end I want 20% minClass and 80% of the other class. I have already tried to modify this part of the function:
x <- ddply(x, .(y), function(dat, n)
  dat[sample(seq(along = dat$.outcome), n),, drop = FALSE], n = minClass)
by changing n to 4*n, but I did not achieve it. It throws this error:
Error in size <= n/2 :
comparison (4) is possible only for atomic and list types
Your help would be appreciated.
A simple way to do this is to change the n = minClass part of the ddply call.
downSample_custom <- function(x, y, list = FALSE, yname = "Class", frac = 1){ # add argument frac, in the 0 - 1 range
  xc <- class(x)
  if(!is.data.frame(x)) x <- as.data.frame(x)
  if(!is.factor(y))
  {
    warning("Down-sampling requires a factor variable as the response. The original data was returned.")
    return(list(x = x, y = y))
  }
  minClass <- min(table(y))
  x$.outcome <- y
  x <- ddply(x, .(y),
             function(dat, n) dat[sample(seq(along = dat$.outcome), n),, drop = FALSE],
             n = minClass * frac) # change the n to this
  y <- x$.outcome
  x <- x[, !(colnames(x) %in% c("y", ".outcome")), drop = FALSE]
  if(list)
  {
    if(xc[1] == "matrix") x <- as.matrix(x)
    out <- list(x = x, y = y)
  } else {
    out <- cbind(x, y)
    colnames(out)[ncol(out)] <- yname
  }
  out
}
Does it work?
library(plyr)
An imbalanced y:
set.seed(1)
y <- as.factor(sample(c("M", "F"),
                      prob = c(0.1, 0.9),
                      size = 10000,
                      replace = TRUE))
x <- rnorm(10000)
table(downSample_custom(x, y)[,2])
output:
F M
1044 1044
table(downSample_custom(x, y, frac = 0.5)[,2])
output:
F M
522 522
table(downSample_custom(x, y, frac = 0.2)[,2])
output:
F M
208 208
Using frac > 1 returns an error:
downSample_custom(x, y, frac = 2)
output:
Error in sample.int(length(x), size, replace, prob) :
cannot take a sample larger than the population when 'replace = FALSE'
EDIT: answer to the updated question.
This can be achieved, for instance, by sampling the indexes of each class separately. Here is an example that works only for two-class problems:
downSample_custom <- function(x, y, yname = "Class", frac = 1){
  lev <- levels(y)
  minClass <- min(table(y))
  lev_min <- levels(y)[which.min(table(y))]
  # sample the indexes of the more abundant class according to minClass * frac
  inds_down <- sample(which(y == lev[lev != lev_min]), size = minClass * frac)
  # take all the indexes of the less abundant class
  inds_minClass <- which(y == lev[lev == lev_min])
  out <- data.frame(x, y)
  out <- out[sort(c(inds_down, inds_minClass)),]
  colnames(out)[ncol(out)] <- yname
  return(out)
}
How it looks in practice:
table(downSample_custom(x, y)[,2])
output:
F M
1044 1044
table(downSample_custom(x, y, frac = 5)[,2])
output:
F M
5220 1044
head(downSample_custom(x, y, frac = 5))
output:
           x Class
1 -1.5163733     F
2  0.6291412     F
4  1.1797811     M
5  1.1176545     F
6 -1.2377359     F
7 -1.2301645     M
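The two-class restriction is easy to lift by sampling the indexes of each class separately; below is an untested sketch that keeps the smallest class whole and samples every other class down to at most floor(minClass * frac) rows:
downSample_frac <- function(x, y, yname = "Class", frac = 1){
  minClass <- min(table(y))
  keep <- unlist(lapply(levels(y), function(lev){
    idx <- which(y == lev)
    if(length(idx) == minClass) idx # keep the smallest class whole
    else sample(idx, size = min(length(idx), floor(minClass * frac)))
  }))
  out <- data.frame(x, y)[sort(keep), ]
  colnames(out)[ncol(out)] <- yname
  out
}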

Plot a generic surface and contour in R

I have the following code:
var.asym <- function(alpha1, alpha2, xi, beta, n){
  term11 <- alpha1*(1-alpha1)^(2*xi-1)
  term12 <- alpha1*(1-alpha1)^(xi-1)*(1-alpha2)^xi
  term22 <- alpha2*(1-alpha2)^(2*xi-1)
  Sigma <- matrix(c(term11, term12, term12, term22), nrow=2, byrow=TRUE)
  Sigma*beta^2/n
}
# qgpd() is the generalised Pareto quantile function, used here with the
# argument order qgpd(p, xi, mu, beta) as in the evir package
mop.jacob.inv <- function(alpha1, alpha2, xi, beta){
  term11 <- -qgpd(alpha1, xi, 0, beta)/xi - beta*(1-alpha1)^xi*log(1-alpha1)/xi
  term12 <- qgpd(alpha1, xi, 0, beta)/beta
  term21 <- -qgpd(alpha2, xi, 0, beta)/xi - beta*(1-alpha2)^xi*log(1-alpha2)/xi
  term22 <- qgpd(alpha2, xi, 0, beta)/beta
  jacob <- matrix(c(term11, term12, term21, term22), nrow=2, byrow=TRUE)
  jacob.inv <- solve(jacob)
  jacob.inv
}
var.asym2 <- function(alpha1, alpha2) var.asym(alpha1, alpha2, 0.2, 1, 1000)
mop.jacob.inv2 <- function(alpha1, alpha2) mop.jacob.inv(alpha1, alpha2, 0.2, 1)
object <- function(alpha1, alpha2){
  term1 <- mop.jacob.inv2(alpha1, alpha2)%*%var.asym2(alpha1, alpha2)%*%t(mop.jacob.inv2(alpha1, alpha2))
  sum(diag(term1))
}
x <- seq(0.01, 0.98, by=0.01)
y <- seq(x[1]+0.01, 0.99, by=0.01)
xy <- cbind(rep(x[1], length(x)), y)
for(i in 2:length(x)){
  y <- seq(x[i]+0.01, 0.99, by=0.01)
  xy <- rbind(xy, cbind(rep(x[i], length(x)-i+1), y))
}
object.xy <- numeric(nrow(xy))  # nrow(xy) is 4851 here
for(i in seq_len(nrow(xy))){
  object.xy[i] <- object(xy[i, 1], xy[i, 2])
}
Now I want to plot a surface of (xy[, 1], xy[, 2], object.xy). Is there a way to do so in R? I tried the persp and contour functions, but they did not seem appropriate here since both require x and y to be increasing sequences. I guess a more general question is: how do you make a contour plot when given a sequence of triplets (x, y, z)?
library(dplyr)
library(tidyr)
library(magrittr)
long_data <-
  data.frame(
    x = xy[,1] %>% round(2),
    y = xy[,2] %>% round(2),
    z = object.xy)
wide_data <-
  long_data %>%
  spread(x, z)
y <- wide_data$y
wide_data %<>% select(-y)
x <- names(wide_data) %>% as.numeric
z <- wide_data %>% as.matrix
persp(x, y, z)
contour(x, y, z)
I'm not sure why the round helps, but it does; presumably it makes x values that should be equal compare exactly equal despite floating-point error, so spread can align them into columns. The reshape was necessary to build a matrix from the x, y, z data. Note that the contour lines coalesce into a black dot because of the huge narrow peak in the data.
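For the general question of contouring a sequence of (x, y, z) triplets, another route is to interpolate the scattered points onto a regular grid; a minimal sketch assuming the akima package is installed:
library(akima)
# linear interpolation of scattered points onto a regular grid
surf <- interp(xy[, 1], xy[, 2], object.xy)
persp(surf$x, surf$y, surf$z)
contour(surf$x, surf$y, surf$z)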

R: gWidgets: gText: add a search/find function

I'd like to ask if there is a way of adding a search/find function to the gtext gwidget.
x <- c(1, 2)
y <- c(3, 4)
z <- c(5, 6)
df <- data.frame(x, y, z)
df.co <- capture.output(df) # get df as text
str.split <- strsplit(df.co, "\\s+") # split every line into its components
w1 <- gwindow()
gt1 <- gtext(container=w1)
insert(gt1, df.co)

Can GGPLOT make 2D summaries of data?

I wish to plot the mean (or some other function) of reaction time as a function of the location of the target in the x-y plane.
As test data:
library(ggplot2)
xs <- runif(100,-1,1)
ys <- runif(100,-1,1)
rts <- rnorm(100)
testDF <- data.frame("x"=xs,"y"=ys,"rt"=rts)
I know I can do this:
p <- ggplot(data = testDF,aes(x=x,y=y))+geom_bin2d(bins=10)
What I would like to be able to do is the same thing, but plotting a function of the data in each bin rather than counts. Can I do this?
Or do I need to generate the conditional means first in R (e.g. drt <- tapply(testDF$rt,list(cut(testDF$x,10),cut(testDF$y,10)),mean)) and then plot that?
Thank you.
Update: with the release of ggplot2 0.9.0, much of this functionality is covered by the new additions stat_summary2d and stat_summary_bin.
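For example, with a current ggplot2, where the function is spelled stat_summary_2d, the per-bin mean is direct; a minimal sketch:
# mean reaction time within each of 10x10 bins
ggplot(testDF, aes(x = x, y = y, z = rt)) +
  stat_summary_2d(bins = 10, fun = mean)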
Here is a gist for this answer: https://gist.github.com/1341218
Here is a slight modification of stat_bin2d so as to accept an arbitrary function:
StatAggr2d <- proto(Stat, {
  objname <- "aggr2d"
  default_aes <- function(.) aes(fill = ..value..)
  required_aes <- c("x", "y", "z")
  default_geom <- function(.) GeomRect

  calculate <- function(., data, scales, binwidth = NULL, bins = 30, breaks = NULL, origin = NULL, drop = TRUE, fun = mean, ...) {
    range <- list(
      x = scales$x$output_set(),
      y = scales$y$output_set()
    )

    # Determine binwidth, if omitted
    if (is.null(binwidth)) {
      binwidth <- c(NA, NA)
      if (is.integer(data$x)) {
        binwidth[1] <- 1
      } else {
        binwidth[1] <- diff(range$x) / bins
      }
      if (is.integer(data$y)) {
        binwidth[2] <- 1
      } else {
        binwidth[2] <- diff(range$y) / bins
      }
    }
    stopifnot(is.numeric(binwidth))
    stopifnot(length(binwidth) == 2)

    # Determine breaks, if omitted
    if (is.null(breaks)) {
      if (is.null(origin)) {
        breaks <- list(
          fullseq(range$x, binwidth[1]),
          fullseq(range$y, binwidth[2])
        )
      } else {
        breaks <- list(
          seq(origin[1], max(range$x) + binwidth[1], binwidth[1]),
          seq(origin[2], max(range$y) + binwidth[2], binwidth[2])
        )
      }
    }
    stopifnot(is.list(breaks))
    stopifnot(length(breaks) == 2)
    stopifnot(all(sapply(breaks, is.numeric)))
    names(breaks) <- c("x", "y")

    xbin <- cut(data$x, sort(breaks$x), include.lowest = TRUE)
    ybin <- cut(data$y, sort(breaks$y), include.lowest = TRUE)

    if (is.null(data$weight)) data$weight <- 1

    ans <- ddply(data.frame(data, xbin, ybin), .(xbin, ybin),
                 function(d) data.frame(value = fun(d$z)))
    within(ans, {
      xint <- as.numeric(xbin)
      xmin <- breaks$x[xint]
      xmax <- breaks$x[xint + 1]
      yint <- as.numeric(ybin)
      ymin <- breaks$y[yint]
      ymax <- breaks$y[yint + 1]
    })
  }
})
stat_aggr2d <- StatAggr2d$build_accessor()
and usage:
ggplot(data = testDF, aes(x = x, y = y, z = rt)) + stat_aggr2d(bins = 3)
ggplot(data = testDF, aes(x = x, y = y, z = rt)) +
  stat_aggr2d(bins = 3, fun = function(x) sum(x^2))
As well, here is a slight modification of stat_binhex:
StatAggrhex <- proto(Stat, {
  objname <- "aggrhex"
  default_aes <- function(.) aes(fill = ..value..)
  required_aes <- c("x", "y", "z")
  default_geom <- function(.) GeomHex

  calculate <- function(., data, scales, binwidth = NULL, bins = 30, na.rm = FALSE, fun = mean, ...) {
    try_require("hexbin")
    data <- remove_missing(data, na.rm, c("x", "y"), name = "stat_hexbin")

    if (is.null(binwidth)) {
      binwidth <- c(
        diff(scales$x$input_set()) / bins,
        diff(scales$y$input_set()) / bins
      )
    }

    x <- data$x
    y <- data$y

    # Convert binwidths into bounds + nbins
    xbnds <- c(
      round_any(min(x), binwidth[1], floor) - 1e-6,
      round_any(max(x), binwidth[1], ceiling) + 1e-6
    )
    xbins <- diff(xbnds) / binwidth[1]

    ybnds <- c(
      round_any(min(y), binwidth[2], floor) - 1e-6,
      round_any(max(y), binwidth[2], ceiling) + 1e-6
    )
    ybins <- diff(ybnds) / binwidth[2]

    # Call hexbin
    hb <- hexbin(
      x, xbnds = xbnds, xbins = xbins,
      y, ybnds = ybnds, shape = ybins / xbins,
      IDs = TRUE
    )

    value <- tapply(data$z, hb@cID, fun)

    # Convert to data frame
    data.frame(hcell2xy(hb), value)
  }
})
stat_aggrhex <- StatAggrhex$build_accessor()
and usage:
ggplot(data = testDF, aes(x = x, y = y, z = rt)) + stat_aggrhex(bins = 3)
ggplot(data = testDF, aes(x = x, y = y, z = rt)) +
  stat_aggrhex(bins = 3, fun = function(x) sum(x^2))
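With a current ggplot2 the hexagonal version is likewise built in as stat_summary_hex (it requires the hexbin package); a minimal sketch:
ggplot(testDF, aes(x = x, y = y, z = rt)) +
  stat_summary_hex(bins = 3, fun = function(x) sum(x^2))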
This turned out to be harder than I expected.
You can almost trick ggplot into doing this by providing a weights aesthetic, but that only gives you the sum of the weights in each bin, not the mean (and you have to specify drop=FALSE to retain negative bin values). You can also retrieve either counts or density within a bin, but neither of those really solves the problem.
Here's what I ended up with:
## breaks vector (slightly coarser than the 10x10 spec above;
## even 64 bins is a lot for binning only 100 points)
bvec <- seq(-1, 1, by = 0.25)
## helper function; melt() comes from the reshape package
## (with reshape2, the melted columns are named Var1/Var2 rather than X1/X2)
tmpf <- function(x, y, z, FUN = mean, breaks) {
  midfun <- function(x) (head(x, -1) + tail(x, -1)) / 2
  mids <- list(x = midfun(breaks$x), y = midfun(breaks$y))
  tt <- tapply(z, list(cut(x, breaks$x), cut(y, breaks$y)), FUN)
  mt <- melt(tt)
  ## factor order gets scrambled (argh), reset it
  mt$X1 <- factor(mt$X1, levels = rownames(tt))
  mt$X2 <- factor(mt$X2, levels = colnames(tt))
  transform(mt,
            x = mids$x[mt$X1],
            y = mids$y[mt$X2])
}
ggplot(data = with(testDF, tmpf(x, y, rt, breaks = list(x = bvec, y = bvec))),
       aes(x = x, y = y, fill = value)) +
  geom_tile() +
  scale_x_continuous(expand = c(0, 0)) +  ## expand to fill plot region
  scale_y_continuous(expand = c(0, 0))
This assumes equal bin widths, etc., and could be extended ... it really is too bad that (as far as I can tell) stat_bin2d doesn't accept a user-specified function.
