We have a set of 50 CSV files from participants, currently being read into a list as follows:
library(readr)  # for read_csv()
library(purrr)  # for set_names()
file_paths <- fs::dir_ls("data")
file_paths
file_contents <- list()
for (i in seq_along(file_paths)) {
  file_contents[[i]] <- read_csv(
    file = file_paths[[i]]
  )
}
dt <- set_names(file_contents, file_paths)
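As an aside, the read-and-name steps above can be collapsed into a single purrr pipeline; a minimal sketch, assuming the same "data" directory:
library(fs)
library(readr)
library(purrr)
# read every csv in data/ into a list named by file path, in one step
dt <- dir_ls("data") |>
  set_names() |>
  map(read_csv)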
My data looks like this:
level time X Y Type
1 1 355. -10.6 22.36 P
1 1 371. -33 24.85 O
1 2 389. -10.58 17.23 P
1 2 402. -16.7 30.46 O
1 3 419. -29.41 17.32 P
1 4 429. -10.28 26.36 O
2 5 438. -26.86 32.98 P
2 6 451. -21 17.06 O
2 7 463. -21 32.98 P
2 8 474. -19.9 17.06 O
We have 70 sets of coordinates per csv.
Time does not matter for this, but I would like to split up by the level column at some stage.
For every 'P' I want to compare it to 'O' and get the distance between the coordinates. The first P will always match with the first O, and so on.
For now, I have them split into two different lists, though this may be completely the wrong way to do it! I'm having trouble figuring out how to take all of these CSV files and get the distances for all of them; the list seems to cause issues with most functions (like dist).
Here is how I've pulled the right information so far:
pLoc <- oLoc <- pX <- pY <- pCoordinates <- list()
for (i in seq_along(dt)) {
  pLoc[[i]] <- dplyr::filter(dt[[i]], grepl("P", Type))
  oLoc[[i]] <- dplyr::filter(dt[[i]], grepl("O", Type))
  pX[[i]] <- pLoc[[i]] %>% pull(X) %>% as.numeric()
  pY[[i]] <- pLoc[[i]] %>% pull(Y) %>% as.numeric()
  pCoordinates[[i]] <- cbind(pX[[i]], pY[[i]])
}
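For what it's worth, the paired P-O distances can then be computed across the whole list with purrr::map; a minimal sketch, assuming each file has equal numbers of P and O rows in matching order:
library(dplyr)
library(purrr)
distances <- map(dt, function(d) {
  p <- dplyr::filter(d, Type == "P")
  o <- dplyr::filter(d, Type == "O")
  # row i of the P subset pairs with row i of the O subset
  sqrt((p$X - o$X)^2 + (p$Y - o$Y)^2)
})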
[EDITED] Following comments, here is how you can do it with the raster library:
library(raster)
library(dplyr)
df <- data.frame(
  x = c(10, 20, 15, 9),
  y = c(45, 34, 54, 24),
  type = c("P", "O", "P", "O")
)
df <- cbind(df[df$type == "P", ] %>%
              dplyr::select(-type) %>%
              dplyr::rename(xP = x,
                            yP = y),
            df[df$type == "O", ] %>%
              dplyr::select(-type) %>%
              dplyr::rename(xO = x,
                            yO = y))
The following could probably be achieved more efficiently with some form of the apply() function (a vectorized alternative is sketched after the output below):
v <- c()
for (i in 1:nrow(df)) {
  dist <- raster::pointDistance(lonlat = FALSE,
                                p1 = c(df$xP[i], df$yP[i]),
                                p2 = c(df$xO[i], df$yO[i]))
  v <- c(v, dist)
}
df$dist = v
print(df)
xP yP xO yO dist
1 10 45 20 34 14.86607
3 15 54 9 24 30.59412
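As promised above, pointDistance also accepts two-column matrices, so the per-row loop can be replaced by one vectorized call; a sketch against the same df:
# vectorized: one call over all rows instead of a loop
df$dist <- raster::pointDistance(p1 = as.matrix(df[, c("xP", "yP")]),
                                 p2 = as.matrix(df[, c("xO", "yO")]),
                                 lonlat = FALSE)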
I need help, please. I have two lists: the first contains NDVI time series for distinct points, the second contains precipitation time series for the same plots (the plots are in the same order in the two lists).
I need to combine the two lists. I want to add the column called precipitation from one list to the corresponding ndvi column from the other list, respecting the dates (represented here by letters in the row names), for a posterior analysis of correlation between the columns. However, the ndvi and precipitation time series have distinct lengths and distinct dates.
I created the two lists to be used as an example of my dataset. However, in my actual dataset the row names are monthly dates in the format "%Y-%m-%d".
library(tidyverse)
set.seed(100)
# First variable is ndvi.mon1 (monthly ndvi); each element gets a random
# number of rows (1 to 20) with random letters as row names
ndvi.mon1 <- vector("list", length = 3)
for (i in seq_along(ndvi.mon1)) {
  aux <- data.frame(ndvi = sample(randu$x,
                                  sample(1:20, 1),
                                  replace = TRUE))
  rownames(aux) <- sample(letters, size = nrow(aux))
  ndvi.mon1[[i]] <- aux
}
# Second variable is precipitation; each element has 26 rows with
# letters as row names
precipitation <- vector("list", length = 3)
for (i in seq_along(precipitation)) {
  prec_aux <- data.frame(precipitation = sample(randu$x * 500,
                                                26,
                                                replace = TRUE))
  rownames(prec_aux) <- letters[1:nrow(prec_aux)]
  precipitation[[i]] <- prec_aux
}
Can someone help me please?
Thank you!!!
Marcio.
Maybe like this?
library(dplyr)
library(purrr)
library(tibble)  # for rownames_to_column()
precipitation2 <- precipitation %>%
map(rownames_to_column) %>%
map(rename, precipitation = 2)
ndvi.mon2 <- ndvi.mon1 %>%
map(rownames_to_column) %>%
map(rename, ndvi = 2)
purrr::map2(ndvi.mon2, precipitation2, left_join, by = "rowname")
[[1]]
rowname ndvi precipitation
1 k 0.354886 209.7415
2 x 0.596309 103.3700
3 r 0.978769 403.8775
4 l 0.322291 354.2630
5 c 0.831722 348.9390
6 s 0.973205 273.6030
7 h 0.949827 218.6430
8 y 0.443353 61.9310
9 b 0.826368 8.3290
10 d 0.337308 291.2110
The below will return a list of data.frames that have been merged using the row names:
lapply(seq_along(ndvi.mon1), function(i) {
merge(
x = data.frame(date = rownames(ndvi.mon1[[i]]), ndvi = ndvi.mon1[[i]][,1]),
y = data.frame(date = rownames(precipitation[[i]]), precip = precipitation[[i]][,1]),
by="date"
)
})
Output:
[[1]]
date ndvi precip
1 b 0.826368 8.3290
2 c 0.831722 348.9390
3 d 0.337308 291.2110
4 h 0.949827 218.6430
5 k 0.354886 209.7415
6 l 0.322291 354.2630
7 r 0.978769 403.8775
8 s 0.973205 273.6030
9 x 0.596309 103.3700
10 y 0.443353 61.9310
[[2]]
date ndvi precip
1 g 0.415824 283.9335
2 k 0.573737 311.8785
3 p 0.582422 354.2630
4 y 0.952495 495.4340
[[3]]
date ndvi precip
1 b 0.656463 332.5700
2 c 0.347482 94.7870
3 d 0.215425 431.3770
4 e 0.063100 499.2245
5 f 0.419460 304.5190
6 g 0.712057 226.7125
7 h 0.666700 284.9645
8 i 0.778547 182.0295
9 k 0.902520 82.5515
10 l 0.593219 430.6630
11 m 0.788715 443.5345
12 n 0.347482 132.3950
13 q 0.719538 79.1835
14 r 0.911370 100.7025
15 s 0.258743 309.3575
16 t 0.940644 142.3725
17 u 0.626980 335.4360
18 v 0.167640 390.4915
19 w 0.826368 63.3760
20 x 0.937211 439.8685
I am working with the R programming language.
I generated the following random data set that contains x and y points:
set.seed(123)
x_cor = rnorm(10,100,100)
y_cor = rnorm(10,100,100)
my_data = data.frame(x_cor,y_cor)
x_cor y_cor
1 43.95244 222.40818
2 76.98225 135.98138
3 255.87083 140.07715
4 107.05084 111.06827
5 112.92877 44.41589
6 271.50650 278.69131
7 146.09162 149.78505
8 -26.50612 -96.66172
9 31.31471 170.13559
10 55.43380 52.72086
I am trying to write a "greedy search" algorithm that shows which point is located the "shortest distance" from some starting point.
For example, suppose we start at the point (-26.50612, -96.66172).
distance <- function(x1,x2, y1,y2) {
dist <- sqrt((x1-x2)^2 + (y1 - y2)^2)
return(dist)
}
Then I calculated the distance between (-26.50612, -96.66172) and each point:
results <- list()
for (i in 1:10){
distance_i <- distance(-26.50612, my_data[i,1], -96.66172, my_data[i,2] )
index = i
my_data_i = data.frame(distance_i, index)
results[[i]] <- my_data_i
}
results_df <- data.frame(do.call(rbind.data.frame, results))
However, I don't think this is working, because the distance between the starting point (-26.50612, -96.66172) and itself is not 0 (see the 8th row):
distance_i index
1 264.6443 1
2 238.7042 2
3 191.3048 3
4 185.0577 4
5 151.7506 5
6 306.4785 6
7 331.2483 7
8 223.3056 8
9 213.3817 9
10 331.6455 10
My Question:
Can someone please show me how to write a function that correctly finds the nearest point from an initial point,
(Step 1) then removes that nearest point and the initial point from "my_data",
(Step 2) and then re-calculates the nearest point in "my_data" from the nearest point identified in Step 1 (i.e. with the removed data),
and in the end shows the path that was taken (e.g. 5, 7, 1, 9, 3, etc.)?
Can someone please show me how to do this?
Thanks!
This could be helpful, and I think you can solve the further tasks by yourself:
start <- c(x = -26.50612, y = -96.66172)
library(dplyr)
my_data <- data.frame(x_cor, y_cor) %>%
  rowwise() %>%
  mutate(dist = distance(start["x"], x_cor, start["y"], y_cor))
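From there, the nearest point to the start is simply the row with the smallest dist:
# index and coordinates of the nearest point
which.min(my_data$dist)
my_data[which.min(my_data$dist), ]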
The solution is implemented as a recursive function distmin, which finds the closest point between an input x and a dataframe Y and then calls itself with the closest point and the dataframe without the closest point as arguments.
EDIT: I reimplemented distmin to use dataframes.
library(dplyr)
my_data <- data.frame(x_cor, y_cor) |>
  mutate(idx = row_number())
distmin <- function(x, Y) {
  if (nrow(Y) == 0) {
    NULL
  } else {
    dst <- sqrt((x$x_cor - Y$x_cor)^2 + (x$y_cor - Y$y_cor)^2)
    m <- which.min(dst)
    res <- data.frame(x, dist = dst[m], nearest = Y[m, "idx"])
    rbind(res, distmin(Y[m, ], Y[-m, ]))
  }
}
N <- 5
distmin(my_data[N,], my_data[-N,])
##> x_cor y_cor idx dist nearest
##> 5 112.92877 44.41589 5 58.09169 10
##> 10 55.43380 52.72086 10 77.90211 4
##> 4 107.05084 111.06827 4 39.04847 2
##> 2 76.98225 135.98138 2 57.02661 9
##> 9 31.31471 170.13559 9 53.77858 1
##> 1 43.95244 222.40818 1 125.32571 7
##> 7 146.09162 149.78505 7 110.20762 3
##> 3 255.87083 140.07715 3 139.49323 6
##> 6 271.50650 278.69131 6 479.27176 8
The following shows the order in which points are called.
library(ggplot2)
distmin(my_data[N, ], my_data[-N, ]) |>
  mutate(ord = row_number()) |>
  ggplot(aes(x = x_cor, y = y_cor)) +
  geom_text(aes(label = ord))
I have temperature profiler (tp) data with date, depth and temperature. The depth for each date is not exactly the same, so I need to unify it to the same depths and set the temperature for each depth by linear approximation. I was able to do this with a loop using the ‘approx’ function (see the first part of the enclosed code). But I know that I should do it better without a loop (considering I will have about 600,000 rows). I tried to do it with the ‘by’ function but was not successful in transforming the result (a list) into a data frame or matrix (see the second part of the code).
Keep in mind that the length of the rounded depth is not always the same as in the example.
The rounded depth is in the Depth2 column; the interpolated temperature goes in Temp2.
What is the ‘right’ way to solve this?
# create df manually
tp <- data.frame(Date=double(31), Depth=double(31), Temperature=double(31))
tp$Date[1:11] <- '2009-12-17' ; tp$Date[12:22] <- '2009-12-18'; tp$Date[23:31] <- '2009-12-19'
tp$Depth <- c(24.92,25.50,25.88,26.33,26.92,27.41,27.93,28.37,28.82,29.38,29.92,25.07,25.56,26.06,26.54,27.04,27.53,28.03,28.52,29.02,29.50,30.01,25.05,25.55,26.04,26.53,27.02,27.52,28.01,28.53,29.01)
tp$Temperature <- c(19.08,19.06,19.06,18.87,18.67,17.27,16.53,16.43,16.30,16.26,16.22,17.62,17.43,17.11,16.72,16.38,16.28,16.20,16.15,16.13,16.11,16.08,17.54,17.43,17.32,17.14,16.89,16.53,16.28,16.20,16.13)
# create rounded depth column
tp$Depth2 <- round(tp$Depth)
# loop on date to calculate linear approximation for rounded depth
dtgrp <- tp[!duplicated(tp[,1]),1]
for (i in dtgrp) {
x1 <- tp[tp$Date == i, "Depth"]
y1 <- tp[tp$Date == i, "Temperature"]
x2 <- tp[tp$Date == i, "Depth2"]
tpa <- approx(x=x1,y=y1,xout=x2, rule=2)
tp[tp$Date == i, "Temp2"] <- tpa$y
}
# reduce result to rounded depth
tp1 <- tp[!duplicated(tp[,-c(2:3)]),-c(2:3)]
# not part of the question, but the end need is for a matrix, so this complete it:
library(reshape2)
tpbydt <- acast(tp1, Date~Depth2, value.var="Temp2")
# second part: I tried to use the by function (instead of a loop) but got lost when trying to convert the result to a data frame or matrix
rdpth <- function(x1,y1,x2) {
tpa <- approx(x=x1,y=y1,xout=x2, rule=2)
return(tpa)
}
tp2 <- by(tp, tp$Date,function(tp) rdpth(tp$Depth,tp$Temperature,tp$Depth2), simplify = TRUE)
Very close with the by call, but remember it returns a list of objects. Therefore, consider building a list of data frames to be row-bound at the very end:
df_list <- by(tp, tp$Date, function(sub) {
tpa <- approx(x=sub$Depth, y=sub$Temperature, xout=sub$Depth2, rule=2)
df <- unique(data.frame(Date = sub$Date,
Depth2 = sub$Depth2,
Temp2 = tpa$y,
stringsAsFactors = FALSE))
return(df)
})
tp2 <- do.call(rbind, unname(df_list))
tp2
# Date Depth2 Temp2
# 1 2009-12-17 25 19.07724
# 2 2009-12-17 26 19.00933
# 5 2009-12-17 27 18.44143
# 7 2009-12-17 28 16.51409
# 9 2009-12-17 29 16.28714
# 11 2009-12-17 30 16.22000
# 12 2009-12-18 25 17.62000
# 21 2009-12-18 26 17.14840
# 4 2009-12-18 27 16.40720
# 6 2009-12-18 28 16.20480
# 8 2009-12-18 29 16.13080
# 10 2009-12-18 30 16.08059
# 13 2009-12-19 25 17.54000
# 22 2009-12-19 26 17.32898
# 41 2009-12-19 27 16.90020
# 61 2009-12-19 28 16.28510
# 81 2009-12-19 29 16.13146
And if you reset row.names, this is exactly identical to your tp1 output:
identical(data.frame(tp1, row.names = NULL),
data.frame(tp2, row.names = NULL))
# [1] TRUE
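As an aside, the same per-date interpolation can also be written with dplyr instead of by; a sketch, assuming dplyr >= 1.1 for reframe():
library(dplyr)
# one approx() per Date; the x/y columns returned by approx are unpacked
tp2_dplyr <- tp %>%
  group_by(Date) %>%
  reframe(as.data.frame(approx(x = Depth, y = Temperature,
                               xout = unique(Depth2), rule = 2))) %>%
  rename(Depth2 = x, Temp2 = y)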
I'm trying to make a heatmap using ggplot2. What I want to be plotted is in the form of a matrix which is the result of a function.
Here is the data:
Image A B C D E F
1 3 23 45 23 45 90
2 4 34 34 34 34 89
3 34 33 24 89 23 67
4 3 45 234 90 12 78
5 78 89 34 23 12 56
6 56 90 56 67 34 45
Here is the function:
vector_a <- names(master)[2:4]
vector_b <- names(master)[5:6]
heatmap_prep <- function(dataframe, vector_a, vector_b) {
  dummy <- as.data.frame(matrix(0, nrow = length(vector_a), ncol = length(vector_b)))
  for (i in 1:length(vector_a)) {
    first_value <- dataframe[[ vector_a[i] ]]
    # print(first_value)
    for (j in 1:length(vector_b)) {
      second_value <- dataframe[[ vector_b[j] ]]
      result <- cor(first_value, second_value, method = "spearman")
      dummy[i, j] <- result
    }
  }
  rownames(dummy) <- vector_a
  return(as.matrix(dummy))
}
heatmap_data_matrix1 <- heatmap_prep(master, vector_a, vector_b)
Using the data in heatmap_data_matrix1, I want to create a heatmap using the following code:
library(ggplot2)
p <- ggplot(data = heatmap_data_matrix1, aes(x = vector_a, y = vector_b)) +
  geom_tile(aes(fill = ))
However, this does not work. How should I reformat my data/code so this heatmap can be created? What should I put under "fill="?
Thanks!
Because many R functions are vectorized and, for the most part, you don't need to pre-allocate or define a vector, the for loop is unnecessary. You can simply run cor(x, y, method = "spearman") without the complications of the loop.
Regarding your question of what to put in for fill, you'll need to reshape your data to the configuration that ggplot2 uses (long format).
The gather function from tidyr does this, placing the rows/columns of the correlation into separate columns, and then using the r value for fill.
library(tidyverse) # for tidyr, tibble, ggplot2, and magrittr
heatmap_function <- function(df, a, b) {
  cor_data <- cor(df[a], df[b], method = "spearman") %>%
    as.data.frame() %>%
    rownames_to_column("x") %>%
    gather(y, fill, -x)
  ggplot(cor_data, aes(x = x, y = y, fill = fill)) +
    geom_tile()
}
This results in:
heatmap_function(master, c("A","B","C"), c("D","E"))
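Note that gather() has since been superseded in tidyr; the reshaping step inside heatmap_function could equally use pivot_longer(), a sketch of just that step:
# same long format as gather(y, fill, -x)
cor_data <- cor(df[a], df[b], method = "spearman") %>%
  as.data.frame() %>%
  rownames_to_column("x") %>%
  pivot_longer(-x, names_to = "y", values_to = "fill")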
I have written a function which takes a subset of the data based on the value of the name column. It computes the outliers for the column "mark" and replaces all of them.
However, when I try to combine these different subsets, the order of my elements changes. Is there any way I can maintain the order of my elements in the column "mark"?
My data set is:
name mark
A 100.0
B 0.5
C 100.0
A 50.0
B 90.0
B 1000.0
C 1200.0
C 5000.0
A 210.0
The function which I have written is:
data.frame(do.call("rbind", as.list(by(data, data$name, function(x) {
  apply(x[, .(mark)], 2, function(y) {
    y[y > (quantile(x$mark, na.rm = TRUE)[[3]][[1]] + 1.5 * IQR(x$mark))] <-
      (quantile(x$mark, na.rm = TRUE)[[3]][[1]] + 1.5 * IQR(x$mark))
    y
  })
}))))
The result of the above function is the first column below (I've manually added back name for illustrative purposes):
mark NAME
100.000 ----- A
50.000 ----- A
210.000 ----- A
0.500 ----- B
90.000 ----- B
839.625 ----- B
100.000 ----- C
1200.000 ----- C
4875.000 ----- C
In the above result, the order of the values in the mark column has changed. Is there any way I can maintain the order of the elements?
Are you sure that code is doing what you think it is?
It looks like you're replacing any value greater than the median (third returned value of quantile) with the median + 1.5*IQR. Maybe that's what you intend, I don't know. The bigger problem is that you're doing that in an apply function, so it's going to re-calculate that median and IQR each iteration, updated with the previous rows already being changed. I'd wager that's not what you intend, but I suppose I've seen stranger.
A better option might be to create an external function to do the work, which takes in all of the data, does the calculation, then outputs all the data. I like dplyr for this simply because it's clean.
Reading your data in (why the "----"?)
scores <- read.table(text="
name mark
A 100.0
B 0.5
C 100.0
A 50.0
B 90.0
B 1000.0
C 1200.0
C 5000.0
A 210.0", header=TRUE)
and creating a function that does something a little more sensible: it replaces any value greater than the 75% quantile (referenced by name so you know what it is), or less than the 25% quantile, with that limiting value
scale_outliers <- function(data) {
lim <- quantile(data, na.rm = TRUE)
data[data > lim["75%"]] <- lim["75%"]
data[data < lim["25%"]] <- lim["25%"]
return(data)
}
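(Equivalently, the two replacements can be collapsed into a single clamp with pmin/pmax; a one-expression sketch of the same idea:)
scale_outliers2 <- function(data) {
  lim <- quantile(data, na.rm = TRUE)
  # clamp everything into the [25%, 75%] interval in one expression
  pmin(pmax(data, lim["25%"]), lim["75%"])
}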
Chaining this processing into dplyr::mutate is neat, and can then be passed on to ggplot. Here's the original data:
library(dplyr)
library(ggplot2)
gg1 <- scores %>% ggplot(aes(x = name, y = mark))
gg1 <- gg1 + geom_point() + geom_boxplot() + coord_cartesian(ylim = range(scores$mark))
gg1
And if we alter it with the new function we get the data back without rows changed around
scores %>% mutate(new_mark = scale_outliers(mark))
#> name mark new_mark
#> 1 A 100.0 100
#> 2 B 0.5 90
#> 3 C 100.0 100
#> 4 A 50.0 90
#> 5 B 90.0 90
#> 6 B 1000.0 1000
#> 7 C 1200.0 1000
#> 8 C 5000.0 1000
#> 9 A 210.0 210
and we can plot that,
gg2 <- scores %>% mutate(new_mark = scale_outliers(mark)) %>% ggplot(aes(x=name, y=new_mark))
gg2 <- gg2 + geom_point() + geom_boxplot() + coord_cartesian(ylim=range(scores$mark))
gg2
Best of all, if you now want to do that quantile comparison group-wise (say, by the name column), it's as easy as adding dplyr::group_by(name):
gg3 <- scores %>% group_by(name) %>% mutate(new_mark = scale_outliers(mark)) %>% ggplot(aes(x=name, y=new_mark))
gg3 <- gg3 + geom_point() + geom_boxplot() + coord_cartesian(ylim=range(scores$mark))
gg3
A slightly refactored version of Hack-R's answer; you can add an index to your data.table:
library(data.table)
data <- data.table(name = c("A", "B", "C", "A", "B", "B", "C", "C", "A"),
                   mark = c(100, 0.5, 100, 50, 90, 1000, 1200, 5000, 210))
data[, i := .I]
Then you perform your calculation but you keep the name and i:
df <- data.frame(do.call("rbind", as.list(
  by(data, data$name, function(x) {
    cbind(i = x$i,
          name = x$name,
          apply(x[, .(mark)], 2, function(y) {
            y[y > (quantile(x$mark, na.rm = TRUE)[[3]][[1]] + 1.5 * IQR(x$mark))] <-
              (quantile(x$mark, na.rm = TRUE)[[3]][[1]] + 1.5 * IQR(x$mark))
            y
          }))
  })
)))
And finally you order using the index:
df[order(df$i),]
i name mark
1 1 A 100
4 2 B 0.5
7 3 C 100
2 4 A 50
5 5 B 90
6 6 B 839.625
8 7 C 1200
9 8 C 4875
3 9 A 210
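As an aside, since data is already a data.table, the capped values can also be computed in place with :=, which preserves row order automatically; a sketch of the same median + 1.5*IQR rule:
library(data.table)
# := adds the capped column by group without reordering any rows
data[, mark2 := pmin(mark, quantile(mark, na.rm = TRUE)[[3]] + 1.5 * IQR(mark)),
     by = name]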