Divide by last row in mutate in Tidyverse

This is a relatively simple problem. I have a dataset as below:
df <- structure(list(term = c("(Intercept)", "overall_quality", "overall_costs",
"wwpf"), estimate = c(0.388607224137536, 0.456477162621961, 0.485612564501229,
NA), std.error = c(0.499812263278414, 0.0987819420575201, 0.108042289289401,
NA), statistic = c(0.777506381273137, 4.62105879995918, 4.49465267438447,
NA), p.value = c(0.440597919486169, 0.0000279867005591494, 0.0000426773877613654,
NA), average = c(NA, 8.09615384615385, 7.86538461538461, 7.90384615384615
), Elasticity = c(NA, 3.69570933584318, 3.81952959386543, NA)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -4L))
I am trying to use the following:
df %>% mutate(Elasticity = average * estimate / average[nrow(df)])
Expected output: https://ibb.co/42ptLXx
Basically, I want to divide by the last row's value, and since I am trying to incorporate this in a function, the method needs to be dynamic rather than hard-coded.
Please help!

We can use n() to return the index of the last row for subsetting the value of that column:
library(dplyr)
df %>%
  mutate(Elasticity = average * estimate / average[n()])
If we need a function (rlang 0.4.0 or later), we can make use of the curly-curly operator ({{}}) for evaluation:
f1 <- function(dat, col1, col2) {
  dat %>%
    mutate(Elasticity = {{col1}} * {{col2}} / {{col1}}[n()])
}
f1(df, average, estimate)
# A tibble: 4 x 7
# term estimate std.error statistic p.value average Elasticity
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 (Intercept) 0.389 0.500 0.778 0.441 NA NA
#2 overall_quality 0.456 0.0988 4.62 0.0000280 8.10 0.468
#3 overall_costs 0.486 0.108 4.49 0.0000427 7.87 0.483
#4 wwpf NA NA NA NA 7.90 NA
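A slightly more readable equivalent (assuming dplyr is loaded) uses the last() helper, which returns the final element of a vector:
df %>%
  mutate(Elasticity = average * estimate / last(average))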

Apply function on dataframe by specific group in R

I have a dataframe that looks something like this:
dist id daytime season
3 1.11 Name1 day summer
4 2.22 Name2 night spring
5 3.33 Name1 day winter
6 4.44 Name3 night fall
I want a summary of dist by some specific columns in my dataframe.
So far I used a custom function:
path.summary <- function(x){
  df %>%
    group_by(x) %>%
    summarize(min = min(dist),
              q1 = quantile(dist, 0.25),
              median = median(dist),
              mean = mean(dist),
              q3 = quantile(dist, 0.75),
              max = max(dist))
}
And applied it to any specific column I wanted at the moment:
summary_ID <- path.summary(id)
I tried it a few weeks ago and would get something like this:
id min q1 median mean q3 max
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Name1 0 17.8 310. 788. 1023. 5832.
2 Name2 0 31.7 284. 570. 744. 9578.
3 Name3 0 17.0 325. 721. 1185. 5293.
4 Name4 0 11.9 197. 530. 865. 3476.
5 Name5 0 24.5 94.9 617. 966. 9567.
When I try it now I get an error:
Error in `group_by()`:
! Must group by variables found in `.data`.
✖ Column `x` is not found.
What changed and how do I get around the issue?
Here, we may use the curly-curly operator ({{}}) if the input is unquoted:
path_summary <- function(dat, x){
  dat %>%
    group_by({{x}}) %>%
    summarize(min = min(dist),
              q1 = quantile(dist, 0.25),
              median = median(dist),
              mean = mean(dist),
              q3 = quantile(dist, 0.75),
              max = max(dist))
}
Testing:
> path_summary(df, id)
# A tibble: 3 × 7
id min q1 median mean q3 max
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Name1 1.11 1.66 2.22 2.22 2.78 3.33
2 Name2 2.22 2.22 2.22 2.22 2.22 2.22
3 Name3 4.44 4.44 4.44 4.44 4.44 4.44
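If the grouping column will instead be passed as a quoted string, a variant using the .data pronoun works (a sketch; path_summary_chr is a hypothetical name):
path_summary_chr <- function(dat, x){
  dat %>%
    group_by(.data[[x]]) %>%
    summarize(min = min(dist),
              median = median(dist),
              max = max(dist))   # remaining summaries as above
}
path_summary_chr(df, "id")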
data
df <- structure(list(dist = c(1.11, 2.22, 3.33, 4.44), id = c("Name1",
"Name2", "Name1", "Name3"), daytime = c("day", "night", "day",
"night"), season = c("summer", "spring", "winter", "fall")),
class = "data.frame", row.names = c("3",
"4", "5", "6"))

Merging output from cor.test in R dataframe

I am trying to perform cor.test within a dataframe in R.
For a toy dataset of X and Y, I used the following:
library(dplyr)
library(broom)
X = c(0.88, 1.3, 5.6, 3.1)
Y = c(0, 1, 1, 1)
ft <- cor.test(X, Y)
tidy(ft) %>%
  select(estimate, p.value, conf.low, conf.high) %>%
  bind_rows(.id = 'grp')
which gives me the following result:
grp estimate p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl>
1 1 0.571 0.429 -0.864 0.989
Now, a short version of my dataframe is like:
df<-structure(list(X_sample1 = c(0.11, 0.98, 0.88), X_sample2 = c(0.13,
0, 1.3), X_sample3 = c(1.5, 3.5, 5.6), X_sample4 = c(3.2, 2.4,
3.1), Y_sample1 = c(0L, 1L, 0L), Y_sample2 = c(0L, 0L, 1L), Y_sample3 = c(1L,
1L, 1L), Y_sample4 = c(1L, 1L, 1L)), class = "data.frame", row.names = c("Product1",
"Product2", "Product3"))
I want to perform cor.test on each row of the df between the X and Y groups. Thus, in the above example df, the groups are:
X = c(0.11,0.13,1.5,3.2)
Y = c(0,0,1,1)
---------------
X = c(0.98,0,3.5,2.4)
Y = c(1,0,1,1)
---------------
X = c(0.88,1.3,5.6,3.1)
Y = c(0,1,1,1)
I want an output like:
grp estimate p.value conf.low conf.high
Product1 0.88 0.12 -0.525 0.997
Product2 0.743 0.257 -0.762 0.994
Product3 0.571 0.429 -0.864 0.989
Thanks for your help!
One option could be: convert the row names to a column, then use rowwise() with c_across() so that cor.test() receives each row's X and Y values as vectors; tidy() then splices the test results into columns:
library(dplyr)
library(tibble)
library(broom)
df %>%
  rownames_to_column(var = "grp") %>%
  rowwise() %>%
  transmute(grp,
            tidy(cor.test(c_across(starts_with("X")),
                          c_across(starts_with("Y"))))) %>%
  select(grp, estimate, p.value, conf.low, conf.high)
grp estimate p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl>
1 Product1 0.880 0.120 -0.525 0.997
2 Product2 0.743 0.257 -0.762 0.994
3 Product3 0.571 0.429 -0.864 0.989
You can use dplyr, tidyr, tibble, and broom: reshape to long format so each product's X and Y values sit in their own nested data frame, then run cor.test per group:
library(dplyr)
library(tidyr)
library(tibble)
library(broom)
df %>%
  rownames_to_column() %>%
  pivot_longer(-rowname, names_to = c(".value", "sample"),
               names_sep = "_sample") %>%
  nest_by(rowname) %>%
  summarize(cors1 = tidy(cor.test(data$X, data$Y)))
# A tibble: 3 x 2
# Groups: rowname [3]
rowname cors1$estimate $statistic $p.value $parameter $conf.low $conf.high
<chr> <dbl> <dbl> <dbl> <int> <dbl> <dbl>
1 Produc~ 0.880 2.62 0.120 2 -0.525 0.997
2 Produc~ 0.743 1.57 0.257 2 -0.762 0.994
3 Produc~ 0.571 0.984 0.429 2 -0.864 0.989
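The cors1 column above is a packed data frame column; to obtain the flat layout the question asks for, one option (a sketch) is to ungroup() the result and flatten it with tidyr's unpack():
df %>%
  rownames_to_column() %>%
  pivot_longer(-rowname, names_to = c(".value", "sample"),
               names_sep = "_sample") %>%
  nest_by(rowname) %>%
  summarize(cors1 = tidy(cor.test(data$X, data$Y))) %>%
  ungroup() %>%
  unpack(cors1) %>%
  select(grp = rowname, estimate, p.value, conf.low, conf.high)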

difference between first non-NA and last non-NA in each row

I have a data frame with up to 5 measurements (x) and their corresponding time:
df = structure(list(x1 = c(92.9595722286402, 54.2085219673818,
46.3227062573019,
NA, 65.1501442134141, 49.736451235317), time1 = c(43.2715277777778,
336.625, 483.975694444444, NA, 988.10625, 510.072916666667),
x2 = c(82.8368681534474, 53.7981639701784, 12.9993531230419,
NA, 64.5678816290574, 55.331442940348), time2 = c(47.8166666666667,
732, 506.747222222222, NA, 1455.25486111111, 958.976388888889
), x3 = c(83.5433119686794, 65.723072881366, 19.0147593408309,
NA, 65.1989838202356, 36.7000828457705), time3 = c(86.5888888888889,
1069.02083333333, 510.275, NA, 1644.21527777778, 1154.95694444444
), x4 = c(NA, 66.008102917677, 40.6243513885846, NA, 62.1694420909955,
29.0078249523063), time4 = c(NA, 1379.22986111111, 520.726388888889,
NA, 2057.20833333333, 1179.86805555556), x5 = c(NA, 61.0047472617535,
45.324715258421, NA, 59.862110645527, 45.883161439362), time5 = c(NA,
1825.33055555556, 523.163888888889, NA, 3352.26944444444,
1364.99513888889)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -6L))
"NA" means that the person (row) didn't have a measurement.
I would like to calculate the difference between the last existing measurement and the first one.
So for the first row it would be x3 minus x1 (about -9.4); for the second, x5 minus x1 (about 6.8), and so on.
I tried something like this, which didn't work:
df$diff = apply(df %>% select(., contains("x")), 1,
                function(x) head(x, na.rm = T) - tail(x, na.rm = T))
Any suggestions? Also, is apply/rowwise the most efficient way, or is there a vectorized function to do that?
A vectorized way would be using max.col, where we get the "first" and "last" non-NA positions via the ties.method parameter:
#Columns holding the measurements
x_cols <- grep("^x", names(df))
#Get column index of first and last non-NA value in each row
first_col <- max.col(!is.na(df[x_cols]), ties.method = "first")
last_col <- max.col(!is.na(df[x_cols]), ties.method = "last")
#Subset the dataframe to include only the "x" cols
new_df <- as.data.frame(df[x_cols])
#Subtract the first non-NA value from the last one
df$new_calc <- new_df[cbind(1:nrow(df), last_col)] -
               new_df[cbind(1:nrow(df), first_col)]
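The cbind(row, col) form is matrix indexing: it extracts one element per row in a single vectorized step. A minimal illustration with a made-up matrix:
m <- matrix(1:9, nrow = 3)   # filled column-wise
m[cbind(1:3, c(2, 1, 3))]    # picks m[1,2], m[2,1], m[3,3]
#[1] 4 2 9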
Using apply, you could do:
x_cols <- grep("^x", names(df))
df$new_calc <- apply(df[x_cols], 1, function(x) {
  new_x <- x[!is.na(x)]
  if (length(new_x) > 0)
    new_x[length(new_x)] - new_x[1L]
  else NA
})
We can use tidyverse methods on the tbl_df: create a row-names column (rownames_to_column), gather the 'x' columns into 'long' format while removing the NA elements (na.rm = TRUE), then, grouped by row name, take the difference between the first and last 'val'ues and bind the extracted column back to the original dataset 'df':
library(tidyverse)
rownames_to_column(df, 'rn') %>%
  select(rn, starts_with('x')) %>%
  gather(key, val, -rn, na.rm = TRUE) %>%
  group_by(rn) %>%
  summarise(Diff = diff(c(first(val), last(val)))) %>%
  mutate(rn = as.numeric(rn)) %>%
  complete(rn = min(rn):max(rn)) %>%
  pull(Diff) %>%
  bind_cols(df, new_col = .)
# A tibble: 6 x 11
# x1 time1 x2 time2 x3 time3 x4 time4 x5 time5 new_col
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 93.0 43.3 82.8 47.8 83.5 86.6 NA NA NA NA -9.42
#2 54.2 337. 53.8 732 65.7 1069. 66.0 1379. 61.0 1825. 6.80
#3 46.3 484. 13.0 507. 19.0 510. 40.6 521. 45.3 523. -0.998
#4 NA NA NA NA NA NA NA NA NA NA NA
#5 65.2 988. 64.6 1455. 65.2 1644. 62.2 2057. 59.9 3352. -5.29
#6 49.7 510. 55.3 959. 36.7 1155. 29.0 1180. 45.9 1365. -3.85
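With dplyr 1.0 or later, the same row-wise logic can also be written directly with c_across() (a sketch; not necessarily faster than the vectorized max.col approach above):
library(dplyr)
df %>%
  rowwise() %>%
  mutate(new_col = {
    v <- na.omit(c_across(starts_with("x")))
    if (length(v) > 0) last(v) - first(v) else NA_real_
  }) %>%
  ungroup()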

Calculate total sum of squares between clusters in R

My objective is to compare which of the two clustering methods I've used, cluster_method_1 or cluster_method_2, has the larger between-cluster sum of squares, in order to identify which one achieves better separation.
I'm basically looking for an efficient way to calculate the distance between each point of cluster 1 and all points of cluster 2,3,4, and so on.
example dataframe:
structure(list(x1 = c(0.01762376, -1.147739752, 1.073605848,
2.000420899, 0.01762376, 0.944438811, 2.000420899, 0.01762376,
-1.147739752, -1.147739752), x2 = c(0.536193126, 0.885609849,
-0.944699546, -2.242627057, -1.809984553, 1.834120637, 0.885609849,
0.96883563, 0.186776403, -0.678508604), x3 = c(0.64707104, -0.603759684,
-0.603759684, -0.603759684, -0.603759684, 0.64707104, -0.603759684,
-0.603759684, -0.603759684, 1.617857394), x4 = c(-0.72712328,
0.72730861, 0.72730861, -0.72712328, -0.72712328, 0.72730861,
0.72730861, -0.72712328, -0.72712328, -0.72712328), cluster_method_1 = structure(c(1L,
3L, 3L, 3L, 2L, 2L, 3L, 2L, 1L, 4L), .Label = c("1", "2", "4",
"6"), class = "factor"), cluster_method_2 = structure(c(5L, 3L,
1L, 3L, 4L, 2L, 1L, 1L, 1L, 6L), .Label = c("1", "2", "3", "4",
"5", "6"), class = "factor")), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame"))
x1 x2 x3 x4 cluster_method_1 cluster_method_2
<dbl> <dbl> <dbl> <dbl> <fct> <fct>
1 0.0176 0.536 0.647 -0.727 1 5
2 -1.15 0.886 -0.604 0.727 4 3
3 1.07 -0.945 -0.604 0.727 4 1
4 2.00 -2.24 -0.604 -0.727 4 3
5 0.0176 -1.81 -0.604 -0.727 2 4
6 0.944 1.83 0.647 0.727 2 2
7 2.00 0.886 -0.604 0.727 4 1
8 0.0176 0.969 -0.604 -0.727 2 1
9 -1.15 0.187 -0.604 -0.727 1 1
10 -1.15 -0.679 1.62 -0.727 6 6
The within sum-of-squares for a cluster S_i can be written as the sum of all pairwise squared Euclidean distances within that cluster, divided by twice the number of points in it (see e.g. the Wikipedia article on k-means clustering).
For convenience we define a function calc_SS that returns the within sum-of-squares for a (numeric) data.frame:
calc_SS <- function(df) sum(as.matrix(dist(df)^2)) / (2 * nrow(df))
It's then straightforward to calculate the within-cluster sum-of-squares for every cluster under each method:
library(tidyverse)
df %>%
  gather(method, cluster, cluster_method_1, cluster_method_2) %>%
  group_by(method, cluster) %>%
  nest() %>%
  transmute(
    method,
    cluster,
    within_SS = map_dbl(data, ~calc_SS(.x))) %>%
  spread(method, within_SS)
## A tibble: 6 x 3
# cluster cluster_method_1 cluster_method_2
# <chr> <dbl> <dbl>
#1 1 1.52 9.99
#2 2 10.3 0
#3 3 NA 10.9
#4 4 15.2 0
#5 5 NA 0
#6 6 0 0
The total within sum-of-squares is then just the sum of the within sum-of-squares of every cluster:
df %>%
  gather(method, cluster, cluster_method_1, cluster_method_2) %>%
  group_by(method, cluster) %>%
  nest() %>%
  transmute(
    method,
    cluster,
    within_SS = map_dbl(data, ~calc_SS(.x))) %>%
  group_by(method) %>%
  summarise(total_within_SS = sum(within_SS)) %>%
  spread(method, total_within_SS)
## A tibble: 1 x 2
# cluster_method_1 cluster_method_2
# <dbl> <dbl>
#1 27.0 20.9
By the way, we can confirm that calc_SS does indeed return the within sum-of-squares using the iris dataset:
set.seed(2018)
df2 <- iris[, 1:4]
kmeans <- kmeans(as.matrix(df2), 3)
df2$cluster <- kmeans$cluster
df2 %>%
  group_by(cluster) %>%
  nest() %>%
  mutate(within_SS = map_dbl(data, ~calc_SS(.x))) %>%
  arrange(cluster)
## A tibble: 3 x 3
# cluster data within_SS
# <int> <list> <dbl>
#1 1 <tibble [38 × 4]> 23.9
#2 2 <tibble [62 × 4]> 39.8
#3 3 <tibble [50 × 4]> 15.2
kmeans$withinss
#[1] 23.87947 39.82097 15.15100
The total sum of squares, sum_x sum_y ||x-y||^2 over all pairs of points, is constant for a given dataset and can be computed cheaply from the variance, without iterating over pairs. If you subtract the within-cluster sum of squares, where x and y belong to the same cluster, the between-cluster sum of squares remains. Computed this way, it takes O(n) time instead of O(n^2). Corollary: the solution with the smallest WCSS has the largest BCSS.
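In code, using the calc_SS helper from the answer above (a sketch; 27.0 and 20.9 are the per-method total within sums computed there):
#Total sum of squares around the overall centroid, computed once
total_SS <- calc_SS(df[, c("x1", "x2", "x3", "x4")])
#Between-cluster SS = total SS minus total within-cluster SS
between_SS_method_1 <- total_SS - 27.0
between_SS_method_2 <- total_SS - 20.9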
Consider the package clValid, which calculates a large number of indexes for validating clustering. The Dunn index is particularly appropriate for what you are trying to do: per the documentation, it is the ratio of the smallest distance between observations not in the same cluster to the largest intra-cluster distance. The documentation can be found at https://cran.r-project.org/web/packages/clValid/clValid.pdf.

Replace/Add column values based on another column

I am creating three new columns and trying to fill them based on the values in two other columns. The new columns are "0-5", "5-15" and "15-30". First, I wanted to know whether or not the cells are within these ranges according to the columns upper and lower, so I created rules to fill them in with 'y' (for yes). Now, if there is a y present, I would like to replace that y with the corresponding number in the value column. I am stuck on this part. I am also wondering if there is an easier way to fill in the "0-5", "5-15", and "15-30" columns directly with the number in "value" based on the upper/lower columns without having to put in "y" first.
x y upper lower 0-5 5-15 15-30 value
378828.1 1682697.2 2 12 y y NA 4.04
378828.1 1682697.2 12 37 NA y y 1.00
381625.6 1684852.5 0 63 y y y 1.96
388660.2 1704566.9 5 18 NA y y 2.65
We can accomplish this using the dplyr package as well as the match function:
library(dplyr)
dat %>%
  rowwise() %>%
  mutate(`0-5` = ifelse(any(match(lower:upper, 0:4)), value, NA),
         `5-15` = ifelse(any(match(lower:upper, 5:14)), value, NA),
         `15-30` = ifelse(any(match(lower:upper, 15:29)), value, NA))
x y upper lower value `0-5` `5-15` `15-30`
<dbl> <dbl> <int> <int> <dbl> <dbl> <dbl> <dbl>
1 378828.1 1682697 2 12 4.04 4.04 4.04 NA
2 378828.1 1682697 12 37 1.00 NA 1.00 1.00
3 381625.6 1684853 0 63 1.96 1.96 1.96 1.96
4 388660.2 1704567 5 18 2.65 NA 2.65 2.65
data
dat <- structure(list(x = c(378828.1, 378828.1, 381625.6, 388660.2),
y = c(1682697.2, 1682697.2, 1684852.5, 1704566.9), upper = c(2L,
12L, 0L, 5L), lower = c(12L, 37L, 63L, 18L), value = c(4.04,
1, 1.96, 2.65)), .Names = c("x", "y", "upper", "lower", "value"
), class = "data.frame", row.names = c(NA, -4L))
