Difference between two variables in R within the same group

I have a dataset with the following structure:
I would like to compute the difference between the two values within each group. The result I wish to obtain is the following:
Note that the difference must always be greater than or equal to 0. I would like to solve this in R.

Try group_by() together with diff(), wrapping the result in abs() so that it is never negative.
library(tidyverse)
df <- data.frame(group = rep(LETTERS[1:3], each = 2),
                 value = c(20, 5, 0, 30, 10, 2))
df %>%
  group_by(group) %>%
  summarise(difference = abs(diff(value)))
# A tibble: 3 × 2
group difference
<chr> <dbl>
1 A 15
2 B 30
3 C 8
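
Note that abs(diff(value)) relies on every group having exactly two rows; with three or more rows, diff() returns more than one value per group. As a minimal base R sketch, assuming that "difference" can be read as the largest value minus the smallest one in each group, the following gives the same numbers for the sample data and can never be negative:

```r
# Same sample data as in the answer above
df <- data.frame(group = rep(LETTERS[1:3], each = 2),
                 value = c(20, 5, 0, 30, 10, 2))

# Largest minus smallest value per group: never negative,
# and still a single number when a group has more than two rows
difference <- tapply(df$value, df$group, function(v) max(v) - min(v))

result <- data.frame(group = names(difference),
                     difference = as.vector(difference))
print(result)
```

For groups with exactly two rows, this is identical to abs(diff(value)).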

Related

R: Using GTsummary of table where ID can have multiple values

I have this example data frame:
df <- tibble(ID = c(1, 1, 2), value = c(0, 1, 3), group = c("group0", "group0", "group1")) %>% group_by(value)
ID value group
<dbl> <dbl> <chr>
1 1 0 group0
2 1 1 group0
3 2 3 group1
That is, an ID always belongs to exactly one group, but there may be more than one value associated with that ID.
I now want to summarise the occurrence of values within the different groups. For that I tried
df %>% gtsummary::tbl_summary(by = "group")
which gives me
However, as you can see in the header, the N numbers do not match what I need, because I only want to count the number of unique IDs in each group. Therefore, for both groups it should be N = 1.
Is there a way to achieve this with gtsummary?
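
To make the requirement concrete, here is a minimal base R sketch of the counts the header is expected to show. It only computes the number of unique IDs per group; it does not change the gtsummary table itself, and the name n_ids is made up for this illustration.

```r
# Same example data as in the question, as a plain data frame
df <- data.frame(ID = c(1, 1, 2),
                 value = c(0, 1, 3),
                 group = c("group0", "group0", "group1"))

# Number of distinct IDs in each group (the desired N per column header)
n_ids <- tapply(df$ID, df$group, function(id) length(unique(id)))
print(n_ids)
```

Both groups come out with one unique ID, which is the N = 1 asked for above.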

Finding the exact match in the values in the categorical variables

I want to find exact matches in the values across all three columns (rg1, rg2, rg3). Below is my data frame.
For instance, the first row has the combination (70, 71, 72). If this same combination appears in any of the remaining rows for the other user ids, then keep only those users and delete the rest.
To describe it further: the first row has (70, 71, 72), and if, say, row 10 had the same values in columns B, C and D, then I just want to display row 1 and row 10 (using R).
I tried clustering with kmodes, but I'm not getting the expected results. The current code groups all the rgs, but it seems to validate only the single rg that appears most frequently in the data frame (shown above) and ranks the rows accordingly.
Can someone please guide me on this? Is there a better way to do this?
kmodes <- klaR::kmodes(mapped_df, modes= 5, iter.max = 10, weighted = FALSE)
#Add these clusters to the main dataframe
final <- mapped_df %>%
mutate(cluster = kmodes$cluster)
You can sort across the columns, then look for duplicates.
set.seed(1234)
df <- tibble(Userids = 1:20,
             rg_1 = sample(1:20, 20, TRUE),
             rg_2 = sample(1:20, 20, TRUE),
             rg_3 = sample(1:20, 20, TRUE))
df[4, -1] <- rev(df[15, -1])
# sort across the columns
df_sorted <- t(apply(df[-1], 1, sort))
# return the duplicated rows
df[duplicated(df_sorted) | duplicated(df_sorted, fromLast = TRUE), ]
This will give you a data frame with every row whose combination of values occurs more than once, regardless of the column order. Once you have the sorted data frame, it should be easy enough to find what you need.
Userids rg_1 rg_2 rg_3
<int> <int> <int> <int>
1 4 16 17 6
2 15 6 17 16
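
Building on the sort-then-compare idea above, here is a minimal base R sketch that keeps only the rows whose combination matches the first row, ignoring column order. The sample values and the names key and matches are made up for this illustration.

```r
# Small made-up example: rows 3 and 4 contain the same values as row 1
data <- data.frame(Userids = 1:5,
                   rg1 = c(70, 1, 72, 70, 3),
                   rg2 = c(71, 11, 71, 71, 13),
                   rg3 = c(72, 21, 70, 72, 23))

# Build one order-independent key per row, e.g. "70_71_72"
key <- apply(data[-1], 1, function(r) paste(sort(r), collapse = "_"))

# Keep the rows whose key equals the key of the first row
matches <- data[key == key[1], ]
print(matches)
```

If the column order should matter, dropping the sort() call turns this into an exact positional match.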
I still do not understand precisely what you are looking for. Besides, it is always recommended to include the data frame you are referring to.
I can suggest a solution based on a threshold value: a row is kept only if all of the pairwise differences (rg1 - rg2, rg1 - rg3 and rg2 - rg3) are at or below the threshold; if any of them is higher, the row is discarded.
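library(dplyr)
threshold <- 5
# Compute only the pairwise absolute differences for each row
index <- mapped_df %>%
  transmute(g1_g2 = abs(rg1 - rg2),
            g1_g3 = abs(rg1 - rg3),
            g2_g3 = abs(rg2 - rg3)) %>%
  apply(1, function(x, threshold) all(x <= threshold),
        threshold = threshold)
# Keep the rows (not the columns) where every difference is within the threshold
mapped_df[index, ]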
Maybe you're (just) after some filtering?
library(tidyverse)
data <- tibble(Userids = 1:10,
               rg1 = c(70, 1:8, 70),
               rg2 = c(71, 11:18, 71),
               rg3 = c(72, 21:28, 72))
# Filter on hard-coded values
data |>
  filter(rg1 == 70,
         rg2 == 71,
         rg3 == 72)
# Or filter on whatever combination appears in the first row
data |>
  filter(rg1 == rg1[row_number() == 1],
         rg2 == rg2[row_number() == 1],
         rg3 == rg3[row_number() == 1])
Output:
# A tibble: 2 × 4
Userids rg1 rg2 rg3
<int> <dbl> <dbl> <dbl>
1 1 70 71 72
2 10 70 71 72
Or combine them for ease:
data |>
  unite(rg, starts_with("rg")) |>
  filter(rg == rg[row_number() == 1])
Output:
# A tibble: 2 × 2
Userids rg
<int> <chr>
1 1 70_71_72
2 10 70_71_72

Reshape a dataframe from 1 x 4 to 2 x 2?

I am working with the dplyr library and have created a dataframe in a pipe that looks something like this:
a <- c(1, 2, 2)
b <- c(3, 4, 4)
data <- data.frame(a, b)
data %>% summarize_all(c(min, max))
which gives me this dataframe:
a_fn1 b_fn1 a_fn2 b_fn2
1 3 2 4
and I am trying to reshape this dataframe so that the output of the pipe stacks multiple columns on top of each other in several rows that look like this:
A B
----
1 3
2 4
How would I go about this? I do not want to change how the functions are called because the summarize_all function helps me achieve the values I am looking for. I just want to know how to change this dataframe to the shape such that each value in each row is the value of the summarize function for the given column.
First, naming your functions in summarize_all() will make them appear in the result for easier wrangling.
Then, you can use pivot_longer() with the special .value sentinel in names_to to achieve what you want:
library(tidyverse)
a <- c(1, 2, 2)
b <- c(3, 4, 4)
data <- data.frame(a, b)
data %>%
  summarize_all(c(min = min, max = max)) %>%
  pivot_longer(everything(),
               names_to = c(".value", "variable"),
               names_pattern = "(.)_(.+)")
#> # A tibble: 2 x 3
#> variable a b
#> <chr> <dbl> <dbl>
#> 1 min 1 3
#> 2 max 2 4
Created on 2021-07-22 by the reprex package (v2.0.0)
Depending on what output you want, you can even switch the order to c("variable", ".value").
Note that summarize_all() is superseded and that you might want to use the newer, more verbose syntax: summarize(across(everything(), list(min = min, max = max))).
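
For comparison, a minimal base R sketch, assuming a matrix is an acceptable result instead of a data frame: sapply() applies a function that returns a named vector to every column and binds the results column-wise, which yields the 2 x 2 shape directly. The name stats is made up for this illustration.

```r
data <- data.frame(a = c(1, 2, 2), b = c(3, 4, 4))

# One named vector (min, max) per column, bound together into a matrix
stats <- sapply(data, function(x) c(min = min(x), max = max(x)))
print(stats)
```

The row names carry the function names and the column names carry the original variables; as.data.frame(stats) converts the result back to a data frame if needed.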

How to quickly create multiple summary tables with group_by() / summarise()?

I have a data frame with N vars, M categorical and 2 numeric. I would like to create M data frames, one for each categorical variable.
Eg.,
data %>%
  group_by(var1) %>%
  summarise(sumVar5 = sum(var5),
            meanVar6 = mean(var6))
data %>%
  group_by(varM) %>%
  summarise(sumVar5 = sum(var5),
            meanVar6 = mean(var6))
etc...
Is there a way to iterate through the categorical variables and generate each of the summary tables? That is, without needing to repeat the above chunks M times.
Alternatively, these summary tables don't have to be individual objects, as long as I can easily reference / pull the summaries for each of the M variables.
Here is a possible solution. It creates a list of data frames, one per grouping variable, using the summary you described:
library(tidyverse)
# Create sample data frame
data <- data.frame(var1 = sample(1:2, 5, replace = TRUE),
                   var2 = sample(1:2, 5, replace = TRUE),
                   var3 = sample(1:2, 5, replace = TRUE),
                   varM = sample(1:2, 5, replace = TRUE),
                   var5 = rnorm(5, 3, 6),
                   var6 = rnorm(5, 3, 6))
# Vars to be grouped (var1 until varM in this example)
vars_to_be_used <- names(select(data, var1:varM))
# Function to be used: x is the name of the grouping column, as a string
group_fun <- function(x, .df = data) {
  .df %>%
    group_by(across(all_of(x))) %>%
    summarise(sumVar5 = sum(var5),
              meanVar6 = mean(var6))
}
# Loop over vars
results <- map(vars_to_be_used, group_fun)
# Nice list names
names(results) <- vars_to_be_used
print(results)
You didn't supply a sample data set, so I created a small example to show how it works.
data <- tibble(var1 = rep(letters[1:5], 2),
               var2 = rep(LETTERS[11:15], 2),
               var3 = 1:10,
               var4 = 11:20)
A combination of tidyverse packages can get you where you need to be.
Steps used: First, we gather all the columns we want to group by into a cols column and keep the numeric variables separate. Next, we split the data frame into a list of data frames, so that every grouping column gets its own table with the two numeric variables. Now that everything is in a list, we use map() from the purrr package. With the first map() call, we spread each data frame again so that the column names are as we expect them to be. Finally, in a second map() call, we use group_by_if() to group by the character column and summarise the rest. All the outcomes are stored in a list from which you can access what you need.
Run the code in pieces to see what each step does.
library(dplyr)
library(purrr)
library(tidyr)
outcomes <- data %>%
  gather(cols, value, -c(var3, var4)) %>%
  split(.$cols) %>%
  map(~ spread(.x, cols, value)) %>%
  map(~ group_by_if(.x, is.character) %>%
        summarise(sumvar3 = sum(var3),
                  meanvar4 = mean(var4)))
outcomes
$var1
# A tibble: 5 x 3
var1 sumvar3 meanvar4
<chr> <int> <dbl>
1 a 7 13.5
2 b 9 14.5
3 c 11 15.5
4 d 13 16.5
5 e 15 17.5
$var2
# A tibble: 5 x 3
var2 sumvar3 meanvar4
<chr> <int> <dbl>
1 K 7 13.5
2 L 9 14.5
3 M 11 15.5
4 N 13 16.5
5 O 15 17.5
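
As a dependency-free alternative to the answers above, here is a minimal base R sketch of the same idea: loop over the names of the grouping columns and build one summary table per name. The helper summarise_by() and the tiny data frame are made up for this illustration; only the column names var5 and var6 and the two summaries come from the question.

```r
# Tiny made-up data set: two categorical and two numeric columns
data <- data.frame(var1 = c("a", "b", "a", "b"),
                   var2 = c("x", "x", "y", "y"),
                   var5 = c(1, 2, 3, 4),
                   var6 = c(10, 20, 30, 40))

# Hypothetical helper: summarise var5 and var6 by the column named in v
summarise_by <- function(v, df) {
  s <- tapply(df$var5, df[[v]], sum)   # sum of var5 per level
  m <- tapply(df$var6, df[[v]], mean)  # mean of var6 per level
  out <- data.frame(names(s), as.vector(s), as.vector(m))
  names(out) <- c(v, "sumVar5", "meanVar6")
  out
}

group_vars <- c("var1", "var2")

# Named list with one summary table per grouping variable
results <- lapply(setNames(group_vars, group_vars), summarise_by, df = data)
print(results$var2)
```

Because the list is named, each summary can be pulled out with results$var1, results[["var2"]], and so on.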

using replace_na() with indeterminate number of columns

My data frame looks like this:
df <- tibble(x = c(1, 2, NA),
y = c(1, NA, 3),
z = c(NA, 2, 3))
I want to replace NA with 0 using tidyr::replace_na(). As this function's documentation makes clear, it's straightforward to do this once you know which columns you want to perform the operation on.
df <- df %>% replace_na(list(x = 0, y = 0, z = 0))
But what if you have an indeterminate number of columns? (I say 'indeterminate' because I'm trying to create a function that does this on the fly using dplyr tools.) If I'm not mistaken, the base R equivalent to what I'm trying to achieve using the aforementioned tools is:
df[, 1:ncol(df)][is.na(df[, 1:ncol(df)])] <- 0
But I always struggle to get my head around this code. Thanks in advance for your help.
We can do this by creating a list of 0s, one for each column of the dataset, and setting its names to the column names:
library(tidyverse)
df %>%
replace_na(set_names(as.list(rep(0, length(.))), names(.)))
# A tibble: 3 x 3
# x y z
# <dbl> <dbl> <dbl>
#1 1 1 0
#2 2 0 2
#3 0 3 3
Another option is mutate_all() (or mutate_at() for selected columns, or mutate_if() based on conditions), applying replace_na() to each column:
df %>%
mutate_all(replace_na, replace = 0)
With base R, it is more straightforward:
df[is.na(df)] <- 0
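
The base R one-liner works because is.na(df) returns a logical matrix with the same shape as the data frame, so the assignment touches every NA cell no matter how many columns there are. Since the question asks for something usable inside a function, here is a minimal sketch; the name fill_na() and its value argument are made up for this illustration.

```r
# Hypothetical helper: replace every NA in a data frame with `value`
fill_na <- function(df, value = 0) {
  df[is.na(df)] <- value  # logical-matrix indexing covers all columns at once
  df
}

df <- data.frame(x = c(1, 2, NA),
                 y = c(1, NA, 3),
                 z = c(NA, 2, 3))

filled <- fill_na(df)
print(filled)
```

This sketch assumes that all columns can hold the replacement value; for example, assigning 0 into a factor column would need separate handling.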
