Lowest positive and least negative value among various columns in R?

I have a dataset looking like this:
df <- data.frame(ID=c(1, 1, 1, 2, 3, 3), timeA=c(-10, NA, NA, -15, -10, -5), timeB=c(5, 100, -10, -10, -15, 5), timeC=c(1, 160, 17, -5, -5, 2))
Question 1:
For each participant (ID), I want to create a column giving me the lowest positive value of time across all the time columns. If all of a participant's values are negative, I want to keep the value that is least negative instead.
Question 2: Is there a function looking for the value that is closest to 0?
So that my output would look like this:
df <- data.frame(ID=c(1,2,3), time_new=c(1, -5, 2))

I think you're looking for Closest() from the DescTools package.
library(tidyverse)
library(DescTools)
# your data
df <- data.frame(ID=c(1, 1, 1, 2, 3, 3),
timeA=c(-10, NA, NA, -15, -10, -5),
timeB=c(5, 100, -10, -10, -15, 5),
timeC=c(1, 160, 17, -5, -5, 2))
# your results
# I stacked the information for easier searching
df %>% pivot_longer(!ID,values_to = "value") %>%
group_by(ID) %>%
summarise(time_new = Closest(value, 0, na.rm = TRUE)) # closest value to zero
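One thing to watch: as far as I know, Closest() reports every tied value (for example both -5 and 5), which would give summarise() more than one value per group. Below is a minimal sketch of a tie-safe alternative that relies only on which.min(), which skips NA and returns the first minimum. The helper name closest_to_zero and the small data frame with a tie are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical data with a tie for ID 1 (-5 and 5 are equally close to zero)
df_tie <- data.frame(ID = c(1, 1, 2),
                     timeA = c(-5, NA, -15),
                     timeB = c(5, 100, -10))

# Hypothetical helper: which.min() ignores NA and returns the first minimum,
# so exactly one value comes back as long as the group has a non-NA value
closest_to_zero <- function(x) x[which.min(abs(x))]

result <- df_tie %>%
  pivot_longer(!ID, values_to = "value") %>%
  group_by(ID) %>%
  summarise(time_new = closest_to_zero(value))
```

On a tie, this keeps whichever value appears first in the stacked data; a group containing only NA would return nothing and needs separate handling.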

Simply calculate the distance to 0 and then filter.
For #1:
library(tidyverse)
# Returns a logical vector marking the value to keep, following the
# logic of #1: prefer the lowest positive value; if all values are
# negative, take the largest (least negative) one
filter_function <- function(x) {
result <- rep(0, length(x))
if (all(x < 0, na.rm = TRUE)) {
reference <- max(x, na.rm = TRUE)
} else {
reference <- min(x[x > 0], na.rm = TRUE)
}
result <- result + (x == reference)
result[is.na(result)] <- 0
as.logical(result)
}
# filter according to #1
df %>% pivot_longer(!ID,values_to = "value") %>%
# calculate the distance to ZERO for each value
mutate(distance_to_zero = 0 + value,
abs_distance_to_zero = abs(distance_to_zero)) %>%
group_by(ID) %>%
filter(filter_function(distance_to_zero))
#> # A tibble: 3 x 5
#> # Groups: ID [3]
#> ID name value distance_to_zero abs_distance_to_zero
#> <dbl> <chr> <dbl> <dbl> <dbl>
#> 1 1 timeC 1 1 1
#> 2 2 timeC -5 -5 5
#> 3 3 timeC 2 2 2
And this is for #2:
# keep the value closest to zero, whether positive or negative
df %>%
pivot_longer(!ID,values_to = "value") %>%
# calculate the distance to ZERO for each value
mutate(abs_distance_to_zero = abs(0 + value)) %>%
group_by(ID) %>%
# Then keep the rows equal to the minimum in each group; with ties this
# can return multiple rows per group in your actual data
filter(abs_distance_to_zero == min(abs_distance_to_zero, na.rm = TRUE) &
!is.na(abs_distance_to_zero)) %>%
ungroup()
#> # A tibble: 3 x 4
#> ID name value abs_distance_to_zero
#> <dbl> <chr> <dbl> <dbl>
#> 1 1 timeC 1 1
#> 2 2 timeC -5 5
#> 3 3 timeC 2 2
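If only the final value per ID is needed, as in the desired output of the question, the logic of #1 can also be folded into a small summary function. This is a sketch rather than a drop-in replacement: pick_time is a made-up helper name, and it assumes every ID has at least one non-missing value (otherwise max() returns -Inf with a warning).

```r
library(dplyr)
library(tidyr)

df <- data.frame(ID = c(1, 1, 1, 2, 3, 3),
                 timeA = c(-10, NA, NA, -15, -10, -5),
                 timeB = c(5, 100, -10, -10, -15, 5),
                 timeC = c(1, 160, 17, -5, -5, 2))

# Hypothetical helper: smallest positive value if one exists,
# otherwise the largest (least negative) remaining value
pick_time <- function(x) {
  x <- x[!is.na(x)]
  pos <- x[x > 0]
  if (length(pos) > 0) min(pos) else max(x)
}

result <- df %>%
  pivot_longer(!ID, values_to = "value") %>%
  group_by(ID) %>%
  summarise(time_new = pick_time(value))
```

With the example data this should give time_new values of 1, -5 and 2 for IDs 1, 2 and 3, matching the output requested in the question.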

Related

Referencing variable names in loops for dplyr

I know this has been discussed already, but can't find a solution that works for me. I have several binary (0/1) variables named "indic___1" to "indic___8" and one continuous variable "measure".
I would like to compute summary statistics for "measure" across each group, so I created this code:
library(dplyr)
indic___1 <- c(0, 1, 0, 1, 0)
indic___2 <- c(1, 1, 0, 1, 1)
indic___3 <- c(0, 0, 1, 0, 0)
indic___4 <- c(1, 1, 0, 1, 0)
indic___5 <- c(0, 0, 0, 1, 1)
indic___6 <- c(0, 1, 1, 1, 0)
indic___7 <- c(1, 1, 0, 1, 1)
indic___8 <- c(0, 1, 1, 1, 0)
measure <- c(28, 15, 26, 42, 12)
dataset <- data.frame(indic___1, indic___2, indic___3, indic___4, indic___5, indic___6, indic___7, indic___8, measure)
for (i in 1:8) {
variable <- paste0("indic___", i)
print(variable)
dataset %>% group_by(variable) %>% summarise(mean = mean(measure))
}
It returns an error:
Error in `group_by()`:
! Must group by variables found in `.data`.
x Column `variable` is not found.
Putting data into long format makes this generally solvable without a loop. You didn’t specify what you wanted to do with the data inside the loop so I had to guess, but the general form of the solution would look as follows:
library(tidyr)
results = dataset |>
pivot_longer(starts_with("indic___"), names_pattern = "indic___(.*)") |>
group_by(name, value) |>
summarize(mean = mean(measure), .groups = "drop")
# # A tibble: 16 × 3
# name value mean
# <chr> <dbl> <dbl>
# 1 1 0 22
# 2 1 1 28.5
# 3 2 0 26
# 4 2 1 24.2
# 5 3 0 24.2
# …
If you want to separate the results from the individual names, you can use a combination of nest and pull:
results |>
nest(data = c(value, mean), .by = name) |>
pull(data)
# [[1]]
# # A tibble: 2 × 2
# value mean
# <dbl> <dbl>
# 1 0 22
# 2 1 28.5
#
# [[2]]
# # A tibble: 2 × 2
# value mean
# <dbl> <dbl>
# 1 0 26
# 2 1 24.2
# …
… but at this point I’d ask myself why I am using table manipulation in the first place. The following seems a lot easier:
indices = unname(mget(ls(pattern = "^indic___")))
results = indices |>
lapply(split, x = measure) |>
lapply(vapply, mean, numeric(1L))
# [[1]]
# 0 1
# 22.0 28.5
#
# [[2]]
# 0 1
# 26.00 24.25
# …
Notably, in real code you shouldn’t need the first line since your data should not be in individual, numbered variables in the first place. The proper way to do this is to have the data in a joint list, as is done here. Also, note that I once again explicitly removed the unreadable indic___X names. You can of course retain them (just remove the unname call) but I don’t recommend it.
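If the loop from the question is to be kept anyway, the error itself comes from group_by() looking for a column literally called variable. One common way around this is the .data pronoun, which looks a column up by the string stored in variable. The sketch below uses a shortened two-indicator version of the data; collecting the results in a list (results) is my own choice, since a bare pipeline inside a for loop prints nothing.

```r
library(dplyr)

# Shortened version of the question's data (two indicators only)
dataset <- data.frame(
  indic___1 = c(0, 1, 0, 1, 0),
  indic___2 = c(1, 1, 0, 1, 1),
  measure   = c(28, 15, 26, 42, 12)
)

results <- list()
for (i in 1:2) {
  variable <- paste0("indic___", i)
  # .data[[variable]] refers to the column whose name is stored in `variable`
  results[[variable]] <- dataset %>%
    group_by(.data[[variable]]) %>%
    summarise(mean = mean(measure))
}
```

Each list element holds one small summary table; wrap the pipeline in print() if the tables should also be shown while the loop runs.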

How to use fill inside the tidyverse complete function to fill all dataframe columns?

Running the function state_inflow on the original data dataframe, as in the code below, generates the following test dataframe output:
> test
Previous_State 1 2 3
1: X0 2 0 0
2: X1 0 0 0
3: X2 0 0 1
library(data.table)
library(dplyr)
library(tidyverse)
data <-
data.frame(
ID = c(1,1,1,2,2,2,3,3,3),
Period_1 = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
Period_2 = c("2020-01","2020-02","2020-03","2020-04","2020-05","2020-06","2020-02","2020-03","2020-04"),
Values = c(5, 10, 15, 0, 2, 4, 3, 6, 9),
State = c("X0","X1","X2","X0","X2","X0", "X2","X1","X3")
)
state_inflow <- function(mydat, target_state, period_col_name, fct) {
dcast(
setDT(mydat)[, Previous_State := factor(shift(State, fill = target_state)), by = ID][
, period_factor := lapply(.SD, factor), .SDcols = period_col_name],
Previous_State ~ period_factor, fct,
value.var = "Values", subset = .(State == target_state), drop = FALSE
)
}
test <- state_inflow(data, "X0", "Period_1", length)
I'm adding a row to the dataframe to include those "state" combinations that never touch the target_state category (see ID 3 in the data dataframe; across periods it never touches the target state of X0 and is therefore excluded from the original test output shown above), and populating all of the columns of that new row with 0's. I am now doing this as follows:
test %>%
complete(Previous_State = unique(data$State)) %>%
replace(is.na(.), 0)
which gives me correct output of:
# A tibble: 4 x 4
Previous_State `1` `2` `3`
<chr> <int> <int> <int>
1 X0 2 0 0
2 X1 0 0 0
3 X2 0 0 1
4 X3 0 0 0
See how row 4, "X3", was added with all 0's? That's correct output.
I'm trying to learn how to use complete(... ,fill = ...). How would I accomplish what I did above, but by instead using fill = ... inside the complete(...) function?
The fill argument of complete expects a list to set the value for each individual column. By default, this is NA for all columns. You can change this by setting the desired filling value for each column separately:
test %>%
complete(Previous_State = unique(data$State),
fill = list(`1` = 0, `2` = 0, `3` = 0))
# A tibble: 4 x 4
# Previous_State `1` `2` `3`
# <chr> <dbl> <dbl> <dbl>
#1 X0 2 0 0
#2 X1 0 0 0
#3 X2 0 0 1
#4 X3 0 0 0
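With many period columns, typing out each entry of the fill list becomes tedious. The list can instead be built from the column names. The following is a sketch under assumptions: test here is a hand-made stand-in for the real dcast() result, states stands in for unique(data$State), and value_cols and fill_values are illustrative names.

```r
library(tidyr)

# Stand-in for the `test` table from the question
test <- data.frame(Previous_State = c("X0", "X1", "X2"),
                   `1` = c(2L, 0L, 0L),
                   `2` = c(0L, 0L, 0L),
                   `3` = c(0L, 0L, 1L),
                   check.names = FALSE)
states <- c("X0", "X1", "X2", "X3")  # stand-in for unique(data$State)

# Build list(`1` = 0L, `2` = 0L, `3` = 0L) from the column names
value_cols <- setdiff(names(test), "Previous_State")
fill_values <- setNames(rep(list(0L), length(value_cols)), value_cols)

out <- complete(test, Previous_State = states, fill = fill_values)
```

Using 0L rather than 0 is meant to keep the count columns as integers; whether that matters depends on what is done with the table afterwards.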
Since your question is about the tidyverse: A tidy data frame is normalized so that you usually have only one column for each property. This makes completing much easier while achieving the same result:
test %>%
pivot_longer(matches("^[0-9]+$")) %>%
complete(Previous_State = unique(data$State), name,
fill = list(value = 0)) %>%
pivot_wider()

Checking if columns in dataframe are "paired"

I have a very long data frame (~10,000 rows), in which two of the columns look something like this.
A B
1 5.5
1 5.5
2 201
9 18
9 18
2 201
9 18
... ...
Just scrubbing through the data it seems that the two columns are "paired" together, but is there any way of explicitly checking this?
You want to know if value x in column A always means value y in column B?
Let's group by A and count the distinct values in B:
library(dplyr)
df <- data.frame(
A = c(1, 1, 2, 9, 9, 2, 9),
B = c(5.5, 5.5, 201, 18, 18, 201, 18)
)
df %>%
group_by(A) %>%
distinct(B) %>%
summarize(n_unique = n())
# A tibble: 3 x 2
A n_unique
<dbl> <int>
1 1 1
2 2 1
3 9 1
If we now alter the df to the case that this is not true:
df <- data.frame(
A = c(1, 1, 2, 9, 9, 2, 9),
B = c(5.5, 5.4, 201, 18, 18, 201, 18)
)
df %>%
group_by(A) %>%
distinct(B) %>%
summarize(n_unique = n())
# A tibble: 3 x 2
A n_unique
<dbl> <int>
1 1 2
2 2 1
3 9 1
Observe the increased count for group 1. As you have about 10,000 rows, what remains is to check whether at least one group has n_unique > 1, for instance with filter(n_unique > 1).
If you run this (with dat being your data frame), you will see how many unique values of B there are for each value of A:
tapply(dat$B, dat$A, function(x) length(unique(x)))
So if the max of this vector is 1 then there are no values of A that have more than one corresponding value of B.
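Both checks above test one direction only, namely whether each value of A maps to a single B. If "paired" is meant as a one-to-one relationship, the reverse direction needs checking too. A base R sketch of both checks on the distinct (A, B) combinations; check_pairing is a made-up helper name.

```r
# Hypothetical helper: checks the mapping in both directions
check_pairing <- function(a, b) {
  pairs <- unique(data.frame(a = a, b = b))  # distinct (a, b) combinations
  c(a_determines_b = !anyDuplicated(pairs$a),  # no a with two different b
    b_determines_a = !anyDuplicated(pairs$b))  # no b with two different a
}

A <- c(1, 1, 2, 9, 9, 2, 9)
B_paired <- c(5.5, 5.5, 201, 18, 18, 201, 18)
B_broken <- c(5.5, 5.4, 201, 18, 18, 201, 18)

paired_result <- check_pairing(A, B_paired)
broken_result <- check_pairing(A, B_broken)
```

For the paired example both entries should be TRUE; for the altered example a_determines_b should turn FALSE because A = 1 now occurs with two different B values.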

slice lowest positive value in R

I have a dataset looking like this:
df <- data.frame(ID=c(1, 1, 1, 2, 2, 3), days=c(100, 10, -8, -5, 12, 10))
Now I only want to take the lowest positive value of "days" so that my output would look like this:
new <- data.frame(ID=c(1, 2, 3), days=c(10, 12, 10))
I have thought about this:
df%>%
group_by(ID)%>%
slice_min(days)
But of course this will return the lowest number even if it is negative. What can I do to only get the lowest positive values?
Preferably using dplyr.
Thanks so much!
Filtering for only positive values of days should do.
df <- data.frame(ID=c(1, 1, 1, 2, 2, 3), days=c(100, 10, -8, -5, 12, 10))
library(dplyr)
df %>%
group_by(ID) %>%
filter(days>0) %>%
slice_min(days)
#> # A tibble: 3 x 2
#> # Groups: ID [3]
#> ID days
#> <dbl> <dbl>
#> 1 1 10
#> 2 2 12
#> 3 3 10
You can use aggregate()
aggregate(days ~ ID, df, function(x){
min(x[x > 0])
})
# ID days
# 1 1 10
# 2 2 12
# 3 3 10
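Both answers behave differently for an ID that has no positive value at all: the filter() approach silently drops that ID, while min() on an empty vector returns Inf with a warning. If such IDs should be kept with a missing value instead, something along these lines may work. The extra ID 4 with only a negative value is added here purely for illustration.

```r
library(dplyr)

# Question data plus a hypothetical ID 4 that has no positive value
df <- data.frame(ID = c(1, 1, 1, 2, 2, 3, 4),
                 days = c(100, 10, -8, -5, 12, 10, -3))

result <- df %>%
  group_by(ID) %>%
  summarise(
    # lowest positive value, or NA when the group has none
    days = if (any(days > 0, na.rm = TRUE)) {
      min(days[days > 0], na.rm = TRUE)
    } else {
      NA_real_
    }
  )
```

This keeps one row per ID, with NA marking the IDs that have nothing to report.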

slice(which.max()) with condition

I have the following dataset:
ID, diff
1 -40
1 -21
1 -5
1 1
1 6
1 7
...
The ID variable has values 1,2,3,4,5,... while diff is a numeric variable. Now, from the dataset, for each ID I want to extract the row with a diff that is closest to zero AND is negative. So, I want the row with the highest negative value of diff. In the dataset above, for ID 1 I want to extract row 3, with values (1 -5).
The following code can extract rows where the absolute value is closest to 0:
library(dplyr)
dataset22 = dataset1 %>% group_by(ID) %>% slice(which.min(abs(diff)))
How can I extract the row with a negative number that is closest to zero?
Thanks in advance!
This works:
library(dplyr)
df <- data.frame(ID = c(1, 1, 1, 1, 1, 1),
diff = c(-40, -21, -5, 1, 6, 7))
df %>%
group_by(ID) %>%
filter(diff < 0) %>%
summarise(min_negative_diff = max(diff))
#> # A tibble: 1 x 2
#> ID min_negative_diff
#> <dbl> <dbl>
#> 1 1 -5
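The summarise() call returns only the value. Since the question asks for the row, a variant that keeps all columns can stay close to the original slice(which.min(...)) idea: filter to the negative values first, then take the maximum. In this sketch the second ID and the note column are hypothetical additions, included to show that other columns are carried along.

```r
library(dplyr)

# Question data extended with a hypothetical second ID and an extra column
df <- data.frame(ID = c(1, 1, 1, 1, 1, 1, 2, 2),
                 diff = c(-40, -21, -5, 1, 6, 7, -3, 4),
                 note = letters[1:8])

closest_negative <- df %>%
  group_by(ID) %>%
  filter(diff < 0) %>%          # keep negative values only
  slice(which.max(diff)) %>%    # largest negative = closest to zero
  ungroup()
```

IDs without any negative diff drop out at the filter() step, so they will be missing from the result.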
