Combine variables into numeric vector and find distance between them - r

I have four numeric variables that I would like to combine into two vectors, and then take the distance between those vectors.
df = data.frame(V1 = 1:10,
V2 = 11:20,
V3 = 21:30,
V4 = 31:40)
I can create the vectors this way:
df2 <- df %>%
mutate(vector1 = mapply(c, V1, V2, SIMPLIFY = F),
vector2 = mapply(c, V3, V4, SIMPLIFY = F))
But I haven't been able to force them to be numeric so I can't calculate the distance between them:
# want to be able to do something like this
df2 %>%
mutate(distance = sqrt(sum((vector1 - vector2) ^ 2)))
I've tried all sorts of combinations of:
df2$vector1 <- lapply(df2$vector1, as.numeric)
df2$vector1 <- as.numeric(as.character(df2$vector1))
I must be missing something quite obvious since this doesn't seem that difficult.

Might this be an option?
library(tidyverse)
df = data.frame(V1 = 1:10,
V2 = 11:20,
V3 = 21:30,
V4 = 31:40)
df %>%
rowwise() %>%
mutate(distance = sqrt(sum((c(V1,V2) - c(V3,V4)) ^ 2)))
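Since the distance is computed row by row, it can also be written in vectorised base R without rowwise(). The following is a minimal sketch on the same data; the name distance2 is chosen only for illustration, and the second variant shows how the list-column idea from the question could be made to work by applying the formula element by element with mapply.

```r
df <- data.frame(V1 = 1:10, V2 = 11:20, V3 = 21:30, V4 = 31:40)

# Vectorised Euclidean distance between (V1, V2) and (V3, V4), one value per row
df$distance <- sqrt((df$V1 - df$V3)^2 + (df$V2 - df$V4)^2)

# Same result via list columns: every list element is a vector of length 2
vector1 <- mapply(c, df$V1, df$V2, SIMPLIFY = FALSE)
vector2 <- mapply(c, df$V3, df$V4, SIMPLIFY = FALSE)

# Arithmetic is not defined on a list as a whole, so apply it per element
df$distance2 <- mapply(function(a, b) sqrt(sum((a - b)^2)), vector1, vector2)
```

The attempt in the question fails because vector1 - vector2 subtracts two lists, not two numeric vectors; converting the elements with as.numeric does not change that.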


Summing the lengths of lists inside a list in R

I have 2 lists inside a list in R. Each sublist contains a different number of dataframes. The data looks like this:
df1 <- data.frame(x = 1:5, y = letters[1:5])
df2 <- data.frame(x = 1:15, y = letters[1:15])
df3 <- data.frame(x = 1:25, y = letters[1:25])
df4 <- data.frame(x = 1:6, y = letters[1:6])
df5 <- data.frame(x = 1:8, y = letters[1:8])
l1 <- list(df1, df2)
l2 <- list(df3, df4, df5)
mylist <- list(l1, l2)
I want to count the total number of dataframes I have in mylist (answer should be 5, as I have 5 data frames in total).
Using lengths():
sum(lengths(mylist)) # 5
From the official documentation:
[...] a more efficient version of sapply(x, length)
library(purrr)
mylist |> map(length) |> simplify() |> sum()
You can try
lapply(mylist,length) |> unlist() |> sum()
How about this:
sum(sapply(mylist, length))
length(unlist(mylist, recursive = F)) should work.
Another possible solution:
library(tidyverse)
mylist %>% flatten %>% length
#> [1] 5
You can unlist and use length.
length(unlist(mylist, recursive = F))
# [1] 5
For lists of arbitrary length, one can use rrapply::rrapply:
length(rrapply(mylist, classes = "data.frame", how = "flatten"))
# 5
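If adding a package is not desirable, the same count can be sketched in base R with a small recursive helper. The function name count_dfs is hypothetical and not taken from any of the answers above; it assumes that anything that is neither a data frame nor a list should count as zero.

```r
df1 <- data.frame(x = 1:5, y = letters[1:5])
df2 <- data.frame(x = 1:15, y = letters[1:15])
df3 <- data.frame(x = 1:25, y = letters[1:25])
df4 <- data.frame(x = 1:6, y = letters[1:6])
df5 <- data.frame(x = 1:8, y = letters[1:8])
mylist <- list(list(df1, df2), list(df3, df4, df5))

# Recursively count data frames at any nesting depth.
# The data frame check must come first, because a data frame is also a list.
count_dfs <- function(x) {
  if (is.data.frame(x)) {
    1L
  } else if (is.list(x)) {
    sum(vapply(x, count_dfs, integer(1)))
  } else {
    0L
  }
}

n_dfs <- count_dfs(mylist)
```

Unlike sum(lengths(mylist)), this does not assume that the data frames all sit exactly two levels deep.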

Apply function to list of dataframes and columns matching pattern

I have a list of dataframes and I would like to apply a function to specific columns that follow a pattern across all the dataframes in the list.
Here is an example list of dataframes:
k_2 <- data.frame(Site = c(rep("A",3), rep("B",2)), V1 = c(1,2,3,4,5), V2 = c(1,2,3,4,5))
k_3 <- data.frame(Site = c(rep("A",3), rep("B",2)), V1 = c(1,2,3,4,5), V2 = c(1,2,3,4,5), V3 = c(1,2,3,4,5))
k_4 <- data.frame(Site = c(rep("A",3), rep("B",2)), V1 = c(1,2,3,4,5), V2 = c(1,2,3,4,5), V3 = c(1,2,3,4,5), V4 = c(1,2,3,4,5))
my.list <- list(k_2, k_3, k_4)
my.list
I want to apply this
k2_res <- ddply(k_2, "Site", function(x) colSums(x[c("V1", "V2")])/nrow(x))
to all the dataframes in the list. However, for k_3 the calculation needs to be colSums(x[c("V1","V2","V3")]), for k_4 it goes up to V4, and so on.
Ideas
I thought that maybe I could use some sort of grep or regex to automatically select all the columns beginning with V?
Are you looking for something like below?
library(plyr)
# "^V\\d+$" matches only columns named V followed by digits (V1, V2, ...)
lapply(
my.list,
function(df) ddply(df, "Site", function(x) colSums(x[grepl("^V\\d+$", names(x))]) / nrow(x))
)
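Dividing colSums by nrow within each site is the per-site column mean, so the same result can be sketched without plyr using aggregate from base R. This is only an illustrative alternative, not the answerer's method; the helper name site_means is hypothetical.

```r
k_2 <- data.frame(Site = c(rep("A", 3), rep("B", 2)), V1 = 1:5, V2 = 1:5)
k_3 <- data.frame(Site = c(rep("A", 3), rep("B", 2)), V1 = 1:5, V2 = 1:5, V3 = 1:5)
k_4 <- data.frame(Site = c(rep("A", 3), rep("B", 2)), V1 = 1:5, V2 = 1:5, V3 = 1:5, V4 = 1:5)
my.list <- list(k_2, k_3, k_4)

# Per-site mean of every column whose name is V followed by digits
site_means <- function(df) {
  v_cols <- grepl("^V\\d+$", names(df))
  aggregate(df[v_cols], by = list(Site = df$Site), FUN = mean)
}

res <- lapply(my.list, site_means)
```

Each element of res has one row per site and one column per matched V column, so k_4 automatically yields V1 to V4.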

Remove column after map

Naive question ahead: I would like to remove a column after a map.
Reprex:
tibble(a = rep(c("A", "B"), each = 5),
x = runif(10),
y = runif(10),
z = runif(10)) %>%
split(.$a) %>%
map(`[`, c("x", "y", "z"))
This selects the x, y, and z columns of each tibble.
What if I want to drop the column a instead?
(Same result, but easier for me.)
Using base R
# '^a$' matches only the column named exactly "a", not every name containing an "a"
map(~.x[grep('^a$', names(.x), invert = TRUE)])
#OR
map(function(x) x[grep('^a$', names(x), invert = TRUE)])
Using dplyr
map(~select(.x, -a))
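For comparison, here is a minimal self-contained sketch of the same idea without purrr or dplyr. It uses a plain data.frame instead of a tibble and setdiff to drop the column by exact name; the object names dat and parts are chosen only for illustration.

```r
set.seed(1)
dat <- data.frame(a = rep(c("A", "B"), each = 5),
                  x = runif(10),
                  y = runif(10),
                  z = runif(10))

# Split by group, then keep every column except "a" in each piece
parts <- lapply(split(dat, dat$a), function(d) d[setdiff(names(d), "a")])
```

Dropping by exact name avoids accidentally removing other columns whose names merely contain the letter "a".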

R Loop code over several lists of dataframes

I have several lists of dataframes and I want to format the date in each single dataframe within all lists of dataframes. Here is an example code:
v1 = c("2000-05-01", "2000-05-02", "2000-05-03", "2000-05-04", "2000-05-05")
v2 = seq(2,20, length = 5)
v3 = seq(-2,7, length = 5)
v4 = seq(-6,3, length = 5)
df1 = data.frame(Date = v1, df1_Tmax = v2, df1_Tmean = v3, df1_Tmin = v4)
dfl1 <- list(df1, df1, df1, df1)
names(dfl1) = c("ABC_1", "DEF_1", "GHI_1", "JKL_1")
v1 = c("2000-05-01", "2000-05-02", "2000-05-03", "2000-05-04", "2000-05-05")
v2 = seq(3,21, length = 5)
v3 = seq(-3,8, length = 5)
v4 = seq(-7,4, length = 5)
df2 = data.frame(Date = v1, df2_Tmax = v2, df2_Tmean = v3, df2_Tmin = v4)
dfl2 <- list(df2, df2, df2, df2)
names(dfl2) = c("ABC_2", "DEF_2", "GHI_2", "JKL_2")
v1 = c("2000-05-01", "2000-05-02", "2000-05-03", "2000-05-04", "2000-05-05")
v2 = seq(4,22, length = 5)
v3 = seq(-4,9, length = 5)
v4 = seq(-8,5, length = 5)
df3 = data.frame(Date = v1, df3_Tmax = v2, df3_Tmean = v3, df3_Tmin = v4)
dfl3 <- list(df3, df3, df3, df3)
names(dfl3) = c("ABC_3", "DEF_3", "GHI_3", "JKL_3")
v1 = c("2000-05-01", "2000-05-02", "2000-05-03", "2000-05-04", "2000-05-05")
v2 = seq(2,20, length = 5)
v3 = seq(-2,8, length = 5)
v4 = seq(-6,3, length = 5)
abc = data.frame(Date = v1, ABC_Tmax = v2, ABC_Tmean = v3, ABC_Tmin = v4)
abclist <-list(abc, abc, abc, abc)
names(abclist) = c("ABC_abc", "DEF_abc", "GHI_abc", "JKL_abc")
I know how to change the date-column manually:
dfl1$ABC_1$Date = as.Date(dfl1$ABC_1$Date,format="%Y-%m-%d")
class(dfl1$ABC_1$Date)
But how can I do that for each single Date-Column in all of my lists of dataframes?
Here is one option using get and assign
nms <- c('dfl1', 'dfl2', 'dfl3', 'abclist')
# Overwrite each list in the global environment with a converted copy
invisible(lapply(nms, function(x) assign(x, lapply(get(x),
function(y) {y$Date <- as.Date(y$Date, format="%Y-%m-%d")
return(y)}),
envir = .GlobalEnv)))
PS: Be careful with assign since it will change your global environment .GlobalEnv. Many R users will suggest the list solution over assign.
This can be done with lapply:
lapply(dfl1, function(x) {
x$Date <- as.Date(x$Date, format="%Y-%m-%d")
return(x)})
If you want to do this for all of your df-lists, you need to store them in a list and then use a nested version of the above call, so that every data frame in every list is converted:
df_list <- list(dfl1, dfl2, dfl3, abclist)
df_list <- lapply(df_list, function(x) {
lapply(x, function(y) {
y$Date <- as.Date(y$Date, format="%Y-%m-%d")
return(y)})})
This assumes that the Date column always has the same name, "Date".
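The list approach can be sketched end to end as follows, using a shortened version of the example data. The helper name convert_dates and the outer list name all_lists are hypothetical; the outer list is named so that each converted list can still be retrieved by name afterwards.

```r
v1 <- c("2000-05-01", "2000-05-02", "2000-05-03")
df1 <- data.frame(Date = v1, df1_Tmax = c(2, 11, 20))
df2 <- data.frame(Date = v1, df2_Tmax = c(3, 12, 21))
dfl1 <- list(ABC_1 = df1, DEF_1 = df1)
dfl2 <- list(ABC_2 = df2, DEF_2 = df2)

# Convert the Date column of every data frame in one list of data frames
convert_dates <- function(dfl) {
  lapply(dfl, function(d) {
    d$Date <- as.Date(d$Date, format = "%Y-%m-%d")
    d
  })
}

# Apply the helper to every list; lapply keeps the outer and inner names
all_lists <- list(dfl1 = dfl1, dfl2 = dfl2)
all_lists <- lapply(all_lists, convert_dates)

class(all_lists$dfl1$ABC_1$Date)
```

Note that lapply returns new objects, so the result has to be assigned; the original dfl1 and dfl2 are left unchanged.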

Correlations between dataframe and list of dataframes in R

I want to calculate correlations between a dataframe and a list of dataframes. Here is my sample:
library(lubridate)
v1 = seq(ymd('2000-05-01'),ymd('2000-05-10'),by='day')
v2 = seq(2,20, length = 10)
v3 = seq(-2,7, length = 10)
v4 = seq(-6,3, length = 10)
df1 = data.frame(Date = v1, Tmax = v2, Tmean = v3, Tmin = v4)
v1 = seq(ymd('2000-05-01'),ymd('2000-05-10'),by='day')
v2 = seq(3,21, length = 10)
v3 = seq(-3,8, length = 10)
v4 = seq(-7,4, length = 10)
abc = data.frame(Date = v1, ABC_Tmax = v2, ABC_Tmean = v3, ABC_Tmin = v4)
v1 = seq(ymd('2000-05-01'),ymd('2000-05-10'),by='day')
v2 = seq(4,22, length = 10)
v3 = seq(-4,9, length = 10)
v4 = seq(-8,5, length = 10)
def = data.frame(Date = v1, DEF_Tmax = v2, DEF_Tmean = v3, DEF_Tmin = v4)
v1 = seq(ymd('2000-05-01'),ymd('2000-05-10'),by='day')
v2 = seq(2,20, length = 10)
v3 = seq(-2,8, length = 10)
v4 = seq(-6,3, length = 10)
ghi = data.frame(Date = v1, GHI_Tmax = v2, GHI_Tmean = v3, GHI_Tmin = v4)
df2 <-list(abc, def, ghi)
names(df2) = c("ABC", "DEF", "GHI")
I want to have all correlation coefficients between df1 and df2, but only column-wise.
For example:
df1$Tmax and all df2*Tmax columns
df1$Tmean and all df2*Tmean columns
df1$Tmin and all df2*Tmin columns
I know that I can access all Tmax columns like that:
lapply(df2, "[[", 2)
I know how to calculate the correlation between two single columns:
cor.test(df1$Tmax, df2$ABC$ABC_Tmax, method = "spearman")
But how can I do it for all columns at once? I tried this, which is not working:
cor.test(df1$Tmax, lapply(df2, "[[", 2), method = "spearman")
Any ideas?
You could use lapply in combination with mapply to apply cor.test and extract a specific value from the test. For example, to get p.value and estimate we can do
lapply(2:4, function(i) mapply(function(x, y) {
a <- cor.test(x, y, method = "spearman")
c(setNames(a$p.value, "pvalue"), a$estimate)
}, lapply(df2, "[[", i), df1[i]))
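If only the coefficients are needed, not the p-values, a compact matrix of Spearman estimates can be sketched with cor() and column lookups by name. This is an illustrative variant and not the answer's method: it assumes the columns in each data frame of df2 follow the pattern <list name>_<variable>, as in the sample data, and the names vars and rho are chosen for illustration. The Date columns are omitted because they are not used.

```r
n <- 10
df1 <- data.frame(Tmax = seq(2, 20, length.out = n),
                  Tmean = seq(-2, 7, length.out = n),
                  Tmin = seq(-6, 3, length.out = n))
abc <- data.frame(ABC_Tmax = seq(3, 21, length.out = n),
                  ABC_Tmean = seq(-3, 8, length.out = n),
                  ABC_Tmin = seq(-7, 4, length.out = n))
def <- data.frame(DEF_Tmax = seq(4, 22, length.out = n),
                  DEF_Tmean = seq(-4, 9, length.out = n),
                  DEF_Tmin = seq(-8, 5, length.out = n))
ghi <- data.frame(GHI_Tmax = seq(2, 20, length.out = n),
                  GHI_Tmean = seq(-2, 8, length.out = n),
                  GHI_Tmin = seq(-6, 3, length.out = n))
df2 <- list(ABC = abc, DEF = def, GHI = ghi)

vars <- c("Tmax", "Tmean", "Tmin")

# Rows: data frames in df2; columns: variables of df1
rho <- sapply(vars, function(v) {
  sapply(names(df2), function(nm) {
    cor(df1[[v]], df2[[nm]][[paste0(nm, "_", v)]], method = "spearman")
  })
})
```

Looking columns up by name instead of by position keeps the pairing correct even if the column order differs between data frames.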
