Using dplyr to group the new calculations into one data frame - r

I have the following table and I have to obtain a standard deviation of y for each unique value of x.
ID x y
1 1 4
2 2 3
3 3 7
4 1 2
5 2 6
6 3 8
For example, for x = 1 I have y=4 and y=2, so the standard deviation will be:
x1 <- c(4,2)
sd(x1)
#output is 1.41
x2 <-c(3,6)
sd(x2)
#output is 2.12
x3 <-c(7,8)
sd(x3)
#output is 0.71
Instead of computing each value and putting it in a data frame the long way, is there a faster way to do this using dplyr and the pipe? I tried mutate and group_by, but it doesn't seem to work. I would like the result to look like the following, with count_y being the number of y values for each unique x:
x count_y Std_Dev
1 2 1.41
2 2 2.12
3 2 0.71

We don't need mutate here (mutate creates or transforms columns while keeping every row). The output needed is one row per group, which can be done with summarise:
library(dplyr)
df1 %>%
group_by(x) %>%
summarise(count_y = n(), Std_Dev = sd(y))
-output
# A tibble: 3 × 3
x count_y Std_Dev
<int> <int> <dbl>
1 1 2 1.41
2 2 2 2.12
3 3 2 0.707
data
df1 <- structure(list(ID = 1:6, x = c(1L, 2L, 3L, 1L, 2L, 3L), y = c(4L,
3L, 7L, 2L, 6L, 8L)), class = "data.frame", row.names = c(NA,
-6L))
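To make the difference between mutate and summarise concrete, here is a minimal sketch on the same df1. The object names with_mutate and with_summarise are only illustrative and not part of the original answer: mutate keeps all six rows and repeats the group statistic, while summarise returns one row per value of x.

```r
library(dplyr)

df1 <- data.frame(ID = 1:6,
                  x = c(1L, 2L, 3L, 1L, 2L, 3L),
                  y = c(4L, 3L, 7L, 2L, 6L, 8L))

# mutate(): one row per input row, the group statistic is repeated
with_mutate <- df1 %>%
  group_by(x) %>%
  mutate(Std_Dev = sd(y)) %>%
  ungroup()

# summarise(): one row per group
with_summarise <- df1 %>%
  group_by(x) %>%
  summarise(count_y = n(), Std_Dev = sd(y))

nrow(with_mutate)     # 6
nrow(with_summarise)  # 3
```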

Related

How can I calculate the sum of the column wise differences using dplyr

Despite using R and dplyr on a regular basis, I encountered the issue of not being able to calculate the sum of the absolute differences between all columns:
sum_diff=ABS(A-B)+ABS(B-C)+ABS(C-D)...
A B C D sum_diff
1 2 3 4 3
2 1 3 4 4
1 2 1 1 2
4 1 2 1 5
I know I could iterate using a for loop over all columns, but given the size of my data frame, I prefer a more elegant and fast solution.
Any help?
Thank you
We can take the data frame without its last column, subtract the data frame without its first column, and use rowSums on the absolute values in base R. This can be very efficient compared to a package solution:
df1$sum_diff <- rowSums(abs(df1[-ncol(df1)] - df1[-1]))
-output
> df1
A B C D sum_diff
1 1 2 3 4 3
2 2 1 3 4 4
3 1 2 1 1 2
4 4 1 2 1 5
Or another option is rowDiffs from matrixStats
library(matrixStats)
rowSums(abs(rowDiffs(as.matrix(df1))))
[1] 3 4 2 5
data
df1 <- structure(list(A = c(1L, 2L, 1L, 4L), B = c(2L, 1L, 2L, 1L),
C = c(3L, 3L, 1L, 2L), D = c(4L, 4L, 1L, 1L)), row.names = c(NA,
-4L), class = "data.frame")
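The shifted-column subtraction may be easier to follow with the intermediate pieces spelled out. The following is a sketch that works on a matrix copy of the same data; the names m, left, right and abs_diffs are introduced here only for illustration.

```r
df1 <- data.frame(A = c(1L, 2L, 1L, 4L), B = c(2L, 1L, 2L, 1L),
                  C = c(3L, 3L, 1L, 2L), D = c(4L, 4L, 1L, 1L))

m <- as.matrix(df1)
left  <- m[, -ncol(m)]  # columns A, B, C
right <- m[, -1]        # columns B, C, D

# Element-wise |A-B|, |B-C|, |C-D| for every row
abs_diffs <- abs(left - right)

sum_diff <- rowSums(abs_diffs)
sum_diff
# 3 4 2 5
```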
Data from akrun (many thanks)!
This is a bit complicated. The idea is to generate a list of the column pairs to compare. I tried combn, but that returns all possible combinations, so I created the list by hand.
With these combinations we can then use purrr's map_dfc and do some data wrangling afterwards:
library(tidyverse)
combinations <-list(c("A", "B"), c("B", "C"), c("C","D"))
purrr::map_dfc(combinations, ~{df <- tibble(a=data[[.[[1]]]]-data[[.[[2]]]])
names(df) <- paste0(.[[1]],"_v_",.[[2]])
df}) %>%
transmute(sum_diff = rowSums(abs(.))) %>%
bind_cols(data)
sum_diff A B C D
<dbl> <int> <int> <int> <int>
1 3 1 2 3 4
2 4 2 1 3 4
3 2 1 2 1 1
4 5 4 1 2 1
data:
data <- structure(list(A = c(1L, 2L, 1L, 4L), B = c(2L, 1L, 2L, 1L),
C = c(3L, 3L, 1L, 2L), D = c(4L, 4L, 1L, 1L)), row.names = c(NA,
-4L), class = "data.frame")
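If writing the pairs by hand is not practical, the adjacent pairs can probably be generated from the column names instead. This is a base R sketch, not part of the original answer; cols is a hypothetical vector holding the column names in order.

```r
cols <- c("A", "B", "C", "D")

# Pair each column name with its right-hand neighbour:
# head() drops the last name, tail() drops the first one
combinations <- unname(Map(c, head(cols, -1), tail(cols, -1)))
combinations
```

The result has the same shape as the hand-written list, so it can be passed to map_dfc in the same way.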
Here is a dplyr version of @akrun's elegant approach, which calculates the difference between the data frame and its shifted variant:
df %>%
mutate(sum_diff = rowSums(abs(identity(.) %>% select(1:last_col(1))
- identity(.) %>% select(2:last_col()))))
And here is the rowwise variant, which basically follows the same idea, but this time every row is treated as a vector from which its shifted self is subtracted.
df %>%
rowwise() %>%
mutate(sum_diff = map2_int(c_across(1:last_col(1)),
c_across(2:last_col()),
~ abs(.x - .y)) %>% sum())

With R, how can I separate continuous values from a dataframe with item NA and calculate the average of only variable Y?

X Y
1 1 2
2 2 4
3 NA NA
4 NA NA
5 NA NA
6 NA NA
7 1 4
8 2 6
9 1 8
10 1 10
It should work like this: in the first case the average of the values 2 and 4 is 3; in the second case the average of the values 4, 6, 8, 10 is 7; and so on.
Your data:
df = data.frame(X=c(1,2,NA,NA,NA,NA,1,2,1,1),Y=c(2,4,NA,NA,NA,NA,4,6,8,10))
You can identify blocks of consecutive rows with no NAs using diff(complete.cases(..)):
blocks = cumsum(c(0,diff(complete.cases(df)) != 0 ))
block_means = tapply(df$Y,blocks,mean)
0 1 2
3 NA 7
block_means[!is.na(block_means)]
0 2
3 7
Or if you don't need to know the order:
na.omit(as.numeric(tapply(df$Y,blocks,mean)))
[1] 3 7
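To see why this works, it may help to look at the intermediate vectors. This sketch only splits the one-liner above into steps; ok and change are names introduced here for illustration.

```r
df <- data.frame(X = c(1, 2, NA, NA, NA, NA, 1, 2, 1, 1),
                 Y = c(2, 4, NA, NA, NA, NA, 4, 6, 8, 10))

ok <- complete.cases(df)        # TRUE for rows without any NA
change <- diff(ok) != 0         # TRUE where a run of complete/incomplete rows starts
blocks <- cumsum(c(0, change))  # running block id per row
blocks
# 0 0 1 1 1 1 2 2 2 2
```

Every switch between complete and incomplete rows increments the counter, so each run of rows gets its own id, and the all-NA run (block 1) is what produces the NA mean.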
We can create groups of consecutive values using rleid from data.table and, within each group, calculate the mean of the Y values.
library(dplyr)
df %>%
group_by(gr = data.table::rleid(is.na(Y))) %>%
summarise(Y = mean(Y, na.rm = TRUE)) %>%
filter(!is.na(Y)) -> df1
df1
# gr Y
# <int> <dbl>
#1 1 3
#2 3 7
The data.table way of doing this would be:
library(data.table)
df1 <- setDT(df)[, .(Y = mean(Y, na.rm = TRUE)), rleid(is.na(Y))][!is.na(Y)]
data
df <- structure(list(X = c(1L, 2L, NA, NA, NA, NA, 1L, 2L, 1L, 1L),
Y = c(2L, 4L, NA, NA, NA, NA, 4L, 6L, 8L, 10L)),
class = "data.frame", row.names = c(NA, -10L))
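If data.table is not available, the run ids can likely be rebuilt with base R's rle. The following is a sketch under that assumption, not a replacement for rleid; runs, run_id and means are illustrative names.

```r
df <- data.frame(X = c(1L, 2L, NA, NA, NA, NA, 1L, 2L, 1L, 1L),
                 Y = c(2L, 4L, NA, NA, NA, NA, 4L, 6L, 8L, 10L))

# One id per run of consecutive NA / non-NA values in Y
runs <- rle(is.na(df$Y))
run_id <- rep(seq_along(runs$lengths), runs$lengths)

# Mean per run; the all-NA run yields NA and is dropped
means <- tapply(df$Y, run_id, mean)
means <- means[!is.na(means)]
means
```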

calculate descriptives for a nested variable

I want to calculate the M, min, and max of a variable. Data were collected at different visits. My data look like this:
id visit V1
1 1 18
1 2 24
2 2 NA
2 3 5
2 4 6
I want it to look like this, where I have columns for the M, SD, min, and max for V1 for each participant.
id visit V1 M MIN MAX
1 1 18 21 18 24
2 2 3 4.67 3 6
When calculating M, I want to take the number of visits into account (e.g., (18 + 24) / 2 visits). I tried this as a first step:
df %>%
group_by(id) %>%
mutate(M = mean(V1), MIN = min(V1), MAX = max(V1), na.rm = T)
When I try to handle the NAs by making sure they are not included, the na.rm = T results in a new column entitled "na.rm" with every value being true, which isn't what I want. Any thoughts on making this work?
The dplyr package makes this easy. You can group_by() a variable, and whatever you do after that only applies within the group. In dplyr notation, the %>% is a special operator that feeds the outcome of the function on the left into the first argument of the function on the right.
There are two ways to do it. The first way keeps all of the data, but your summary statistics are repeated in each row.
library(dplyr)
df %>%
group_by(id) %>%
mutate(M = mean(V1), MIN = min(V1), MAX = max(V1))
id visit V1 M MIN MAX
1 1 18 21 18 24
1 2 24 21 18 24
2 2 3 4.67 3 6
2 3 5 4.67 3 6
2 4 6 4.67 3 6
The second way provides only the summary statistics by the group.
library(dplyr)
df %>%
group_by(id) %>%
summarize(M = mean(V1), MIN = min(V1), MAX = max(V1))
id M MIN MAX
1 21 18 24
2 4.67 3 6
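Regarding the na.rm problem from the question: na.rm is an argument of mean, min and max, not of mutate or summarise, so it has to go inside each call. A minimal sketch, assuming the data contain the NA shown in the question (so the numbers for id 2 differ from the output above, which uses 3 instead of NA):

```r
library(dplyr)

df <- data.frame(id = c(1L, 1L, 2L, 2L, 2L),
                 visit = c(1L, 2L, 2L, 3L, 4L),
                 V1 = c(18, 24, NA, 5, 6))

# na.rm = TRUE is passed to each summary function, not to summarise()
res <- df %>%
  group_by(id) %>%
  summarise(M = mean(V1, na.rm = TRUE),
            MIN = min(V1, na.rm = TRUE),
            MAX = max(V1, na.rm = TRUE))
res
```

With the NA dropped, id 2 is summarised from the values 5 and 6 only.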
You can try this dplyr approach, similar to @ThomasIsCoding's, which produces something close to what you want:
library(dplyr)
#Data
df <- structure(list(id = c(1L, 1L, 2L, 2L, 2L), visit = c(1L, 2L,
2L, 3L, 4L), V1 = c(18L, 24L, 3L, 5L, 6L)), class = "data.frame", row.names = c(NA,
-5L))
The code:
df %>% group_by(id) %>% mutate(M=mean(V1),Min=min(V1),Max=max(V1),SD=sd(V1))
Output:
# A tibble: 5 x 7
# Groups: id [2]
id visit V1 M Min Max SD
<int> <int> <int> <dbl> <int> <int> <dbl>
1 1 1 18 21 18 24 4.24
2 1 2 24 21 18 24 4.24
3 2 2 3 4.67 3 6 1.53
4 2 3 5 4.67 3 6 1.53
5 2 4 6 4.67 3 6 1.53
Maybe you want something like below
transform(df,
M = ave(V1, id, FUN = mean),
MIN = ave(V1, id, FUN = min),
MAX = ave(V1, id, FUN = max)
)
which gives
id visit V1 M MIN MAX
1 1 1 18 21.000000 18 24
2 1 2 24 21.000000 18 24
3 2 2 3 4.666667 3 6
4 2 3 5 4.666667 3 6
5 2 4 6 4.666667 3 6
Data
> dput(df)
structure(list(id = c(1L, 1L, 2L, 2L, 2L), visit = c(1L, 2L,
2L, 3L, 4L), V1 = c(18L, 24L, 3L, 5L, 6L)), class = "data.frame", row.names = c(NA,
-5L))

Filter values relative to values in another column using dplyr

I have a column in a dataframe and I would like to filter out any rows that are over or under two standard deviations from the mean.
As an example, I would hope to get two rows out of this (only the rows that fall between the low and high standard deviations):
group value low_sd high_sd
a 4 2 8
a 1 2 8
b 6 4 9
b 12 4 9
I was hoping to use dplyr::between .
clean_df <- df%>%
filter(between(value, low_sd, high_sd))
But it seems between only takes numerical values.
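In older versions of dplyr, between is not vectorized over its left and right bounds (since dplyr 1.1.0 they can be vectors of the same length as x). Instead, this can be done using only the comparison operators (>= and <=), which mirror the inclusive bounds of between:
library(dplyr)
df %>%
filter(value >= low_sd, value <= high_sd)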
# group value low_sd high_sd
#1 a 4 2 8
#2 b 6 4 9
But if we wrap with Vectorize, it would work as well
df %>%
filter(Vectorize(dplyr::between)(value, low_sd, high_sd))
# group value low_sd high_sd
#1 a 4 2 8
#2 b 6 4 9
data
df <- structure(list(group = c("a", "a", "b", "b"), value = c(4L, 1L,
6L, 12L), low_sd = c(2L, 2L, 4L, 4L), high_sd = c(8L, 8L, 9L,
9L)), class = "data.frame", row.names = c(NA, -4L))
Alternatively, you can use between() from data.table:
df %>%
filter(data.table::between(value, low_sd, high_sd))
group value low_sd high_sd
1 a 4 2 8
2 b 6 4 9
Or if you want to stick just to dplyr:
df %>%
rowwise() %>%
filter(dplyr::between(value, low_sd, high_sd))
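As a cross-check without any package, the same inclusive comparison can be written with base R subsetting. This is only a sketch on the example data; keep is an illustrative name.

```r
df <- data.frame(group = c("a", "a", "b", "b"),
                 value = c(4L, 1L, 6L, 12L),
                 low_sd = c(2L, 2L, 4L, 4L),
                 high_sd = c(8L, 8L, 9L, 9L))

# Comparison operators work element-wise, so each row is
# compared against its own bounds
keep <- df$value >= df$low_sd & df$value <= df$high_sd
keep
# TRUE FALSE TRUE FALSE

df[keep, ]
```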

Complex data frame transposition in R

I've tried searching for an answer to this, but most data.frame/matrix transpositions aren't as complicated as what I am trying to accomplish. Basically I have a data.frame which looks like
F M A
2008_b 1 5 6
2008_r 3 3 6
2008_a 4 1 5
2009_b 1 1 2
2009_r 5 4 9
2009_a 2 2 4
I'm trying to transpose it and rename the column and row names as such:
F_b M_b A_b F_r M_r A_r F_a M_a A_a
2008 1 5 6 3 3 6 4 1 5
2009 1 1 2 5 4 9 2 2 4
Essentially every three rows are being collapsed into a single row. I assume this can be done with some clever plyr or reshape2 commands, but I'm at a total loss as to how to accomplish it.
You could try
library(dplyr)
library(tidyr)
lvl <- c(outer(colnames(df), unique(gsub(".*_", "", rownames(df))),
FUN=paste, sep="_"))
res <- cbind(Var1=row.names(df), df) %>%
gather(Var2, value, -Var1) %>%
separate(Var1, c('Var11', 'Var12')) %>%
unite(VarN, Var2, Var12) %>%
mutate(VarN=factor(VarN, levels=lvl)) %>%
spread(VarN, value)
row.names(res) <- res[,1]
res1 <- res[,-1]
res1
# F_b M_b A_b F_r M_r A_r F_a M_a A_a
#2008 1 5 6 3 3 6 4 1 5
#2009 1 1 2 5 4 9 2 2 4
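For comparison, here is a base R sketch of the same reshaping. It is not part of the original answer and assumes that every year has the same suffixes (b, r, a) in the same row order; parts, year, type and wide are illustrative names.

```r
df <- data.frame(F = c(1L, 3L, 4L, 1L, 5L, 2L),
                 M = c(5L, 3L, 1L, 1L, 4L, 2L),
                 A = c(6L, 6L, 5L, 2L, 9L, 4L),
                 row.names = c("2008_b", "2008_r", "2008_a",
                               "2009_b", "2009_r", "2009_a"))

# Split "2008_b" into year "2008" and suffix "b"
parts <- do.call(rbind, strsplit(rownames(df), "_"))
year <- parts[, 1]
type <- parts[, 2]

# For each year, flatten its rows (b, r, a) into one long row
wide <- t(sapply(split(df, year), function(d) as.vector(t(as.matrix(d)))))
colnames(wide) <- c(outer(colnames(df), unique(type), paste, sep = "_"))
wide
```

The result is an integer matrix with the years as row names rather than a data frame; as.data.frame(wide) converts it if needed.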
data
df <- structure(list(F = c(1L, 3L, 4L, 1L, 5L, 2L), M = c(5L, 3L, 1L,
1L, 4L, 2L), A = c(6L, 6L, 5L, 2L, 9L, 4L)), .Names = c("F",
"M", "A"), class = "data.frame", row.names = c("2008_b", "2008_r",
"2008_a", "2009_b", "2009_r", "2009_a"))
