How to make calculations across two different lists of dataframes? - r

I have two lists of data frames, such that data is a list of 47 data frames, where each data frame has columns [coords, x, y, liklihood, x.1, x.2, liklihood.1, etc.] and dataA is a list of 47 data frames each of the same length as those in data, but with fewer columns [coords, x, y] that represent different coordinates.
I want to create a third list, or add a column to each data frame in one of the lists, that will contain the distance calculation from pointDistance(p1, p2) where p1 is the x and y columns of each data frame in list data, and p2 is the x and y columns of each data frame in list dataA.
I am trying to keep the dataframes in lists rather than having 47*2 individual data frames in my global environment.
Minimal Reproducible Example:
coords <- rnorm(10)
x <- rnorm(10)
y <- rnorm(10)
liklihood <- rnorm(10)
x.1 <- rnorm(10)
y.1 <- rnorm(10)
day1 <- data.frame(coords,x,y,liklihood,x.1,y.1)
coords <- rnorm(10)
x <- rnorm(10)
y <- rnorm(10)
liklihood <- rnorm(10)
x.1 <- rnorm(10)
y.1 <- rnorm(10)
day2 <- data.frame(coords,x,y,liklihood,x.1,y.1)
data <- list(day1,day2)
coords <- rnorm(10)
x <- rnorm(10)
y <- rnorm(10)
liklihood <- rnorm(10)
day1 <- data.frame(coords,x,y,liklihood)
coords <- rnorm(10)
x <- rnorm(10)
y <- rnorm(10)
liklihood <- rnorm(10)
day2 <- data.frame(coords,x,y,liklihood)
dataA <- list(day1,day2)

You can use mapply in base R to do this.
First, write a function that returns the correct data frame for a single pair of data frames, one from each list, such as data[[1]] and dataA[[1]]:
library(raster)
append_distances <- function(df1, df2)
{
df1$distance <- pointDistance(cbind(df1$x, df1$y), cbind(df2$x, df2$y), lonlat = FALSE)
return(df1)
}
Now we just pass this function and your two lists to mapply:
data <- mapply(append_distances, data, dataA, SIMPLIFY = FALSE)
and now each data frame in data has a distance column added:
data
#> [[1]]
#> coords x y liklihood x.1 y.1 distance
#> 1 0.4761741 0.7913819 0.11597299 -0.6159504 -0.17626836 -0.8649915 2.1378779
#> 2 0.2608518 0.4389639 -1.44510285 -0.5452702 -2.31927588 -0.5114613 3.0321765
#> 3 2.1098629 0.3457442 1.59630572 -0.3205454 0.25760236 1.6791924 0.4150714
#> 4 0.5937334 -0.2043505 0.23667944 -0.2480409 -0.52856599 -0.4263619 1.6662791
#> 5 0.2819461 -1.9768319 0.68344331 -0.4975349 -0.08315893 0.9271072 2.3841079
#> 6 0.5779044 -0.5706433 0.89377684 -1.0084165 -0.83697268 0.9928353 0.6818632
#> 7 0.1410554 -0.6133513 0.25957971 -0.1781339 -0.77489990 -0.7191718 0.8303696
#> 8 -1.1769578 0.9203776 -0.06258728 -0.8991639 -0.38907408 -0.8388408 0.5028145
#> 9 -0.1388739 -0.8279408 1.15568431 -0.3312423 1.17269754 -1.4530041 1.6042288
#> 10 -0.3755364 0.6285803 0.52453490 0.7323463 -0.49051839 -0.1949171 0.6205714
#>
#> [[2]]
#> coords x y liklihood x.1 y.1 distance
#> 1 2.2158425 0.16430566 -0.5721804 -0.7523029 0.2866881 -2.027529031 0.4418775
#> 2 1.5753250 -0.67190607 -0.1140359 -0.3125333 -0.5361148 0.153228235 1.7182954
#> 3 0.8558108 1.19404509 -1.5834463 0.3858246 0.4475970 0.460910344 1.6229581
#> 4 0.8027824 0.76579023 -0.5938679 0.5592208 0.5883806 0.231569460 3.3608275
#> 5 -1.1487244 0.01013471 0.6855049 0.7148735 -2.2822053 1.918921619 2.3790501
#> 6 0.1014336 0.73941541 -0.4487482 0.1758588 0.8579709 0.029777437 1.8923570
#> 7 -0.8238857 0.67911991 -0.9140873 -0.6887611 -1.0709704 -0.009789701 1.4694983
#> 8 -0.1553338 0.78560221 -0.8218460 -0.5537232 0.7295692 0.744225760 2.4279377
#> 9 -0.6297834 0.09747354 0.2048211 -1.0849396 -0.2201589 0.173386536 0.8638957
#> 10 -0.4616377 -0.51116686 0.3204535 -0.5285903 1.0053890 -0.534173400 1.0715881
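If installing raster only for this step is undesirable, the same pairing idea can be sketched in base R. With lonlat = FALSE, pointDistance returns the planar (Euclidean) distance, so the sketch below computes that directly. The helper make_df and the toy data are hypothetical stand-ins for the two lists, and Map() is used as shorthand for mapply(..., SIMPLIFY = FALSE).

```r
# Toy stand-ins for the two lists (hypothetical data, same shape as the example)
set.seed(1)
make_df <- function(n) data.frame(coords = rnorm(n), x = rnorm(n), y = rnorm(n))
data  <- list(make_df(10), make_df(10))
dataA <- list(make_df(10), make_df(10))

# Planar (Euclidean) distance between matching rows of two data frames
append_distances <- function(df1, df2) {
  stopifnot(nrow(df1) == nrow(df2))  # rows are paired by position
  df1$distance <- sqrt((df1$x - df2$x)^2 + (df1$y - df2$y)^2)
  df1
}

# Map() walks both lists in parallel and always returns a list
data <- Map(append_distances, data, dataA)
head(data[[1]])
```

This assumes the coordinates are planar; for longitude/latitude data a great-circle distance would be needed instead.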

Related

Wrong value occur when converting points from UTM to WGS84 in R

I am using Stanislav's method from an earlier question on this forum about "converting latitude and longitude points to UTM". I reversed the function so that it converts points from UTM to WGS84:
library(sp); library(rgdal)
#Function
UTMToLongLat<-function(x,y,zone){
xy <- data.frame(ID = 1:length(x), X = x, Y = y)
coordinates(xy) <- c("X", "Y")
proj4string(xy) <- CRS(paste("+proj=utm +zone=",zone," +ellps=WGS84",sep=''))
res <- spTransform(xy, CRS("+proj=longlat +datum=WGS84"))
return(as.data.frame(res))
}
I tried it on the example from that earlier question:
x2 <- c(-48636.65, 1109577); y2 <- c(213372.05, 5546301)
The expected result is (118, 10) and (119, 50) in WGS84. Colin's example is in UTM zone 51,
so I called:
done2 <- UTMToLongLat(x2,y2,51)
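However, it returned (118.0729, 1.92326) and (131.4686, 49.75866).
What is wrong? Also, how can I control the number of decimal digits in the output?
First, you mixed up the coordinates: you passed each point as an (x, y) pair, but the function expects all x values (eastings) in x and all y values (northings) in y. It should be: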
x <- c(-48636.65, 213372.05)
y <- c(1109577, 5546301)
In the function, it will be transformed and stored as:
> data.frame(ID = 1:length(x), X = x, Y = y)
# ID X Y
# 1 1 -48636.65 1109577
# 2 2 213372.05 5546301
And execute your function again:
> UTMToLongLat(x, y, 51)
# ID X Y
# 1 1 118 9.999997
# 2 2 119 50.000001
To control the decimal digits:
> round(UTMToLongLat(x, y, 51))
# ID X Y
# 1 1 118 10
# 2 2 119 50
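Note that round() defaults to digits = 0 and, applied to the whole data frame, rounds every numeric column, ID included. A minimal sketch of rounding only the coordinate columns to a chosen precision follows; the data frame res is a hypothetical stand-in for the output of UTMToLongLat(), with made-up trailing digits.

```r
# Hypothetical stand-in for the data frame returned by UTMToLongLat()
res <- data.frame(ID = 1:2,
                  X = c(118.0000004, 119.0000012),
                  Y = c(9.9999971, 50.0000013))

# Round only the coordinate columns, here to 4 decimal places
coord_cols <- c("X", "Y")
res[coord_cols] <- round(res[coord_cols], digits = 4)
res
```

Rounding changes the stored values; to change only how they are displayed, print(res, digits = ...) or format() may be the better fit.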

Aggregate a table by applying a function of multiple columns

Considering the following table df, with categorical variables noted x1 and x2 and numerical measurements noted y1, y2 and y3:
df <- data.frame(x1=sample(letters[1:3], 20, replace=TRUE),
x2=sample(letters[4:6], 20, replace=TRUE),
y1=rnorm(20), y2=rnorm(20), y3=rnorm(20))
I'd like to apply on it a function of the 3 numerical measurements y with respect to the categorical variables x. For example the following function, where the input y is a table of 3 columns, which should output one new column:
f <- function(y){ sum((y[,1] - y[,2]) / y[,3]) }
I tried aggregate, dplyr, summarizeBy and others without success: every method seems to rule out a function that mixes several input columns. Any idea how to do this with such functions (i.e. taking advantage of aggregation)? One of my attempts:
aggregate(data = df, y1 + y2 + y3 ~ x1 + x2, FUN = f)
To clarify, the expected result can be obtained with something like:
groups <- unique(df[,c("x1", "x2")]) # co-occurrences of explanatory variables
res <- c()
for (i in 1:nrow(groups)){ # get the subtables
temp <- df[df$x1 == groups[i,1] & df$x2 == groups[i,2], c("y1", "y2", "y3")]
res <- c(res, f(temp)) # apply function on subtables
}
groups$res <- res # aggregate results
This is manageable for this simple toy example but very impractical with more complex data.
The problem is on the input side of your function. The way you specified it, it expects a data frame.
A possible solution is to feed the function a list of columns. With a small change to your function:
f <- function(y) sum((y[[1]] - y[[2]]) / y[[3]])
You can now use it in a dplyr chain:
library(dplyr)
df %>%
group_by(x1, x2) %>%
summarise(sum_y = f(list(y1, y2, y3)))
which gives:
# A tibble: 9 x 3
# Groups: x1 [?]
x1 x2 sum_y
<fct> <fct> <dbl>
1 a d 1.20
2 a e 0.457
3 a f -9.46
4 b d -1.11
5 b e -0.176
6 b f -1.34
7 c d -0.994
8 c e 3.38
9 c f -2.63
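For readers who prefer to stay in base R, the same grouping can be sketched with split() and sapply(). This is an alternative to the dplyr chain above, not part of the original answer; the seed is an arbitrary choice, so the numbers will differ from the output shown.

```r
set.seed(42)  # arbitrary seed, only to make the sketch reproducible
df <- data.frame(x1 = sample(letters[1:3], 20, replace = TRUE),
                 x2 = sample(letters[4:6], 20, replace = TRUE),
                 y1 = rnorm(20), y2 = rnorm(20), y3 = rnorm(20))

# Works on a data frame or a list of columns
f <- function(y) sum((y[[1]] - y[[2]]) / y[[3]])

# One sub-data-frame per observed (x1, x2) combination
pieces <- split(df[c("y1", "y2", "y3")], list(df$x1, df$x2), drop = TRUE)

# Apply f to each piece; result names look like "a.d", "b.e", ...
res <- sapply(pieces, f)
res
```

Because each piece is a data frame with all three columns, f can combine the columns freely, which is exactly what the formula interface of aggregate does not allow.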

Cumulative pnorm in R

I'm looking to calculate a cumulative pnorm through a series.
set.seed(10)
df = data.frame(sample = rnorm(10))
# head(df)
# sample
# 1 0.01874617
# 2 -0.18425254
# 3 -1.37133055
# 4 -0.59916772
# 5 0.29454513
# 6 0.38979430
I would like the result to be
# na
# 0.2397501 # last value of pnorm(df$sample[1:2],mean(df$sample[1:2]),sd(df$sample[1:2]))
# 0.1262907 # last value of pnorm(df$sample[1:3],mean(df$sample[1:3]),sd(df$sample[1:3]))
# 0.4577793 # last value of pnorm(df$sample[1:4],mean(df$sample[1:4]),sd(df$sample[1:4]))
# .
# .
# .
If this can be done in data.table, that would be preferable.
You can do:
set.seed(10)
df = data.frame(sample = rnorm(10))
foo <- function(n, x) {
if (n==1) return(NA)
xn <- x[1:n]
tail(pnorm(xn, mean(xn), sd(xn)), 1)
}
sapply(seq(nrow(df)), foo, x=df$sample)
The calculation is similar to the one in "Calculating cumulative standard deviation by group using R".
result:
#> sapply(seq(nrow(df)), foo, x=df$sample)
# [1] NA 0.23975006 0.12629071 0.45777934 0.84662051 0.83168998 0.11925118 0.50873996 0.06607348 0.63103339
You can put the result in your dataframe:
df$result <- sapply(seq(nrow(df)), foo, x=df$sample)
Here is a compact version of the calculation (from @lmo):
c(NA, sapply(2:10, function(i) tail(pnorm(df$sample[1:i], mean(df$sample[1:i]), sd(df$sample[1:i])), 1)))
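Both versions above recompute the mean and standard deviation from scratch for every prefix. For long series, a vectorised sketch based on running sums may be faster. It relies on the identity var = (sum(x^2) - n * mean^2) / (n - 1), which is an assumption about acceptable precision: the formula can lose accuracy when the values are large relative to their spread.

```r
set.seed(10)
x <- rnorm(10)

n <- seq_along(x)
run_mean <- cumsum(x) / n                            # mean of x[1:n]
run_var  <- (cumsum(x^2) - n * run_mean^2) / (n - 1) # sample variance of x[1:n]; 0/0 = NaN at n = 1
run_sd   <- sqrt(run_var)

# pnorm of the latest value against the running mean and sd
result <- pnorm(x, mean = run_mean, sd = run_sd)
result[1] <- NA  # undefined for a single observation
result
```

The vector can be assigned to a column in the usual way, in a data frame or a data.table.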

number elements in a vector with constraints

Given x and y I wish to create the desired.result below:
x <- 1:10
y <- c(2:4,6:7,8:9)
desired.result <- c(1,2,2,2,3,4,4,5,5,6)
where, in effect, each sequence in y is replaced in x by the first element in the sequence in y and then the elements of the new x are numbered.
The intermediate step for x would be:
x.intermediate <- c(1,2,2,2,5,6,6,8,8,10)
Below is code that does this. However, the code is not general and is overly complex:
x <- 1:10
y <- list(c(2:4),(6:7),(8:9))
unique.x <- 1:(length(x[-unlist(y)]) + length(y))
y1 <- rep(min(unlist(y[1])), length(unlist(y[1])))
y2 <- rep(min(unlist(y[2])), length(unlist(y[2])))
y3 <- rep(min(unlist(y[3])), length(unlist(y[3])))
new.x <- x
new.x[unlist(y[1])] <- y1
new.x[unlist(y[2])] <- y2
new.x[unlist(y[3])] <- y3
rep(unique.x, rle(new.x)$lengths)
[1] 1 2 2 2 3 4 4 5 5 6
Below is my attempt to generalize the code. However, I am stuck on the second lapply.
x <- 1:10
y <- list(c(2:4),(6:7),(8:9))
unique.x <- 1:(length(x[-unlist(y)]) + length(y))
y2 <- lapply(y, function(i) rep(min(i), length(i)))
new.x <- x
lapply(y2, function(i) new.x[i[1]:(i[1]-1+length(i))] = i)
rep(unique.x, rle(new.x)$lengths)
Thank you for any advice. I suspect there is a much simpler solution I am overlooking. I prefer a solution in base R.
A solution like this should work:
x <- 1:10
y <- list(c(2:4),(6:7),(8:9))
x[unlist(y)] <- rep(sapply(y, '[', 1), lengths(y))
rep(1:length(rle(x)$lengths), rle(x)$lengths)
## [1] 1 2 2 2 3 4 4 5 5 6
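An equivalent way to look at the problem: the counter goes up by one at every element of x except those that are a non-first member of one of the sequences in y. A minimal sketch of that idea, assuming (as in the example) that the values in y refer to values of x:

```r
x <- 1:10
y <- list(2:4, 6:7, 8:9)

# Elements that continue a sequence (everything but the first member of each)
followers <- unlist(lapply(y, `[`, -1))

# Start a new number everywhere except at a follower
numbered <- cumsum(!(x %in% followers))
numbered
```

This avoids both the intermediate vector and rle().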

How to combine two vectors into a data frame

I have two vectors like this
x <-c(1,2,3)
y <-c(100,200,300)
x_name <- "cond"
y_name <- "rating"
I'd like to output the dataframe like this:
> print(df)
cond rating
1 x 1
2 x 2
3 x 3
4 y 100
5 y 200
6 y 300
What's the way to do it?
While this does not answer the question asked, it answers a related question that many people have had:
x <-c(1,2,3)
y <-c(100,200,300)
x_name <- "cond"
y_name <- "rating"
df <- data.frame(x,y)
names(df) <- c(x_name,y_name)
print(df)
cond rating
1 1 100
2 2 200
3 3 300
x <-c(1,2,3)
y <-c(100,200,300)
x_name <- "cond"
y_name <- "rating"
library(reshape2)
df <- melt(data.frame(x,y))
colnames(df) <- c(x_name, y_name)
print(df)
UPDATE (2017-02-07):
In answer to @cdaringe's comment: there are multiple possible solutions; one of them is below.
library(dplyr)
library(magrittr)
x <- c(1, 2, 3)
y <- c(100, 200, 300)
z <- c(1, 2, 3, 4, 5)
x_name <- "cond"
y_name <- "rating"
# Helper function to create data.frame for the chunk of the data
prepare <- function(name, value, xname = x_name, yname = y_name) {
tibble(rep(name, length(value)), value) %>% # tibble() replaces the deprecated data_frame()
set_colnames(c(xname, yname))
}
bind_rows(
prepare("x", x),
prepare("y", y),
prepare("z", z)
)
This should do the trick, to produce the data frame you asked for, using only base R:
df <- data.frame(cond=c(rep("x", times=length(x)),
rep("y", times=length(y))),
rating=c(x, y))
df
cond rating
1 x 1
2 x 2
3 x 3
4 y 100
5 y 200
6 y 300
However, from your initial description, I'd say that this is perhaps a more likely use case:
df2 <- data.frame(x, y)
colnames(df2) <- c(x_name, y_name)
df2
cond rating
1 1 100
2 2 200
3 3 300
You can use the expand.grid() function. Note that it returns every combination of x and y (9 rows here) rather than the stacked layout shown in the question.
x <-c(1,2,3)
y <-c(100,200,300)
expand.grid(cond=x,rating=y)
Here's a simple function. It generates a data frame and automatically uses the names of the vectors as values for the first column.
myfunc <- function(a, b, names = NULL) {
setNames(data.frame(c(rep(deparse(substitute(a)), length(a)),
rep(deparse(substitute(b)), length(b))), c(a, b)), names)
}
An example:
x <-c(1,2,3)
y <-c(100,200,300)
x_name <- "cond"
y_name <- "rating"
myfunc(x, y, c(x_name, y_name))
cond rating
1 x 1
2 x 2
3 x 3
4 y 100
5 y 200
6 y 300
df = data.frame(cond=c(rep("x",3),rep("y",3)),rating=c(x,y))
An alternative simplification of the answer by https://stackoverflow.com/users/1969435/gx1sptdtda above:
cond <-c(1,2,3)
rating <-c(100,200,300)
df <- data.frame(cond, rating)
df
cond rating
1 1 100
2 2 200
3 3 300
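Another base R option for the stacked layout asked for in the question is stack(), which turns a named list into a values column and an ind column holding the list names. The sketch below is not from any of the answers above; the variable long is a name introduced here for illustration.

```r
x <- c(1, 2, 3)
y <- c(100, 200, 300)
x_name <- "cond"
y_name <- "rating"

# stack() returns two columns: values (the numbers) and ind (the list names)
long <- stack(list(x = x, y = y))

# Put the label column first and apply the requested column names
df <- setNames(long[c("ind", "values")], c(x_name, y_name))
df
```

The label column comes back as a factor; wrap it in as.character() if plain strings are needed. Vectors of different lengths work too, since each list element is stacked independently.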
