Referencing variable names in loops for dplyr - r

I know this has been discussed already, but I can't find a solution that works for me. I have several binary (0/1) variables named "indic___1" to "indic___8" and one continuous variable "measure".
I would like to compute summary statistics for "measure" across each group, so I created this code:
library(dplyr)
indic___1 <- c(0, 1, 0, 1, 0)
indic___2 <- c(1, 1, 0, 1, 1)
indic___3 <- c(0, 0, 1, 0, 0)
indic___4 <- c(1, 1, 0, 1, 0)
indic___5 <- c(0, 0, 0, 1, 1)
indic___6 <- c(0, 1, 1, 1, 0)
indic___7 <- c(1, 1, 0, 1, 1)
indic___8 <- c(0, 1, 1, 1, 0)
measure <- c(28, 15, 26, 42, 12)
dataset <- data.frame(indic___1, indic___2, indic___3, indic___4, indic___5, indic___6, indic___7, indic___8, measure)
for (i in 1:8) {
variable <- paste0("indic___", i)
print(variable)
dataset %>% group_by(variable) %>% summarise(mean = mean(measure))
}
It returns an error:
Error in `group_by()`:
! Must group by variables found in `.data`.
x Column `variable` is not found.

Putting data into long format makes this generally solvable without a loop. You didn’t specify what you wanted to do with the data inside the loop so I had to guess, but the general form of the solution would look as follows:
library(tidyr) # provides pivot_longer() and nest()
results = dataset |>
  pivot_longer(starts_with("indic___"), names_pattern = "indic___(.*)") |>
  group_by(name, value) |>
  summarize(mean = mean(measure), .groups = "drop")
# # A tibble: 16 × 3
# name value mean
# <chr> <dbl> <dbl>
# 1 1 0 22
# 2 1 1 28.5
# 3 2 0 26
# 4 2 1 24.2
# 5 3 0 24.2
# …
If you want to separate the results from the individual names, you can use a combination of nest and pull:
results |>
nest(data = c(value, mean), .by = name) |>
pull(data)
# [[1]]
# # A tibble: 2 × 2
# value mean
# <dbl> <dbl>
# 1 0 22
# 2 1 28.5
#
# [[2]]
# # A tibble: 2 × 2
# value mean
# <dbl> <dbl>
# 1 0 26
# 2 1 24.2
# …
… but at this point I’d ask myself why I am using table manipulation in the first place. The following seems a lot easier:
indices = unname(mget(ls(pattern = "^indic___")))
results = indices |>
lapply(split, x = measure) |>
lapply(vapply, mean, numeric(1L))
# [[1]]
# 0 1
# 22.0 28.5
#
# [[2]]
# 0 1
# 26.00 24.25
# …
Notably, in real code you shouldn’t need the first line since your data should not be in individual, numbered variables in the first place. The proper way to do this is to have the data in a joint list, as is done here. Also, note that I once again explicitly removed the unreadable indic___X names. You can of course retain them (just remove the unname call) but I don’t recommend it.
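For completeness, the loop from the question can also be made to work. The error arises because group_by(variable) looks for a column literally named variable, not for the column whose name is stored in that string. The following is a minimal sketch, assuming dplyr 1.0.0 or later; the reduced two-indicator data frame and the list name results are chosen only for illustration. It wraps the string in across(all_of()):

```r
library(dplyr)

# Reduced copy of the question's data (two indicators, for brevity)
dataset <- data.frame(
  indic___1 = c(0, 1, 0, 1, 0),
  indic___2 = c(1, 1, 0, 1, 1),
  measure   = c(28, 15, 26, 42, 12)
)

# `results` is an illustrative name, not part of the original question
results <- list()
for (i in 1:2) {
  variable <- paste0("indic___", i)
  # all_of() turns the string into a column selection for group_by()
  results[[variable]] <- dataset %>%
    group_by(across(all_of(variable))) %>%
    summarise(mean = mean(measure), .groups = "drop")
}

results[["indic___1"]]
```

Values computed inside a for loop are not printed automatically, which is why this sketch stores each summary in a list instead of relying on auto-printing.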

Related

Alternative method to count number of single occurrences across columns of interest

I would like the number of single occurrences of some rows values across different columns. I have applied the following code:
dat = data.frame()
vector <- c(1, 2, 3)
for (i in names(data)){
for (j in vector){
dat[j,i] <- length(which(data[,i] == j))
}
}
print(dat)
That returns exactly the output I am looking for. Does this code contain any redundancies? Could you please suggest a more efficient alternative, either with an iterative method (including a for loop) or with the dplyr package?
Thanks
Here is a short extract of the dataset I am working on.
structure(list(run_set_1 = c(3, 3, 3, 3, 3, 3), run_set_2 = c(1,
1, 1, 1, 1, 1), run_set_3 = c(2, 2, 2, 2, 2, 2)), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
You could first match() each column to get the index in vector that
the column values correspond to, if any. Then tabulate() those to get the
counts, including 0s:
lapply(data, match, vector) |>
sapply(tabulate, length(vector))
#> run_set_1 run_set_2 run_set_3
#> [1,] 0 6 0
#> [2,] 0 0 6
#> [3,] 6 0 0
This can be modified to use dplyr-native iteration:
library(dplyr, warn.conflicts = FALSE)
data %>%
summarise(
across(everything(), match, vector) %>%
purrr::map_dfc(tabulate, length(vector))
)
#> # A tibble: 3 × 3
#> run_set_1 run_set_2 run_set_3
#> <int> <int> <int>
#> 1 0 6 0
#> 2 0 0 6
#> 3 6 0 0
EDIT : I added the case for a value that we expect but is missing (4 as example)
Here is the tidyverse version. I think it may be even shorter but I don't know yet.
vector = c(1:4)
library(dplyr)
library(tidyr)
data %>% pivot_longer(cols = everything()) %>%
mutate(value = factor(as.character(value), levels = vector)) %>%
count(name, value, .drop = FALSE) %>%
pivot_wider(names_from = name, values_from = n) %>%
arrange(value) %>% select(-value)
# last line only to remove the value column and fit your example
# # A tibble: 4 × 3
# run_set_1 run_set_2 run_set_3
# <int> <int> <int>
# 1 0 6 0
# 2 0 0 6
# 3 6 0 0
# 4 0 0 0
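A base R variant of the same idea, sketched here under the assumption that the data look like the extract in the question: converting each column to a factor with fixed levels makes table() report a count for every expected value, including values that never occur. The object name counts is illustrative.

```r
# Data rebuilt from the extract in the question
data <- data.frame(
  run_set_1 = rep(3, 6),
  run_set_2 = rep(1, 6),
  run_set_3 = rep(2, 6)
)
vector <- c(1, 2, 3)

# One column of counts per variable, one row per expected value
counts <- sapply(data, function(col) table(factor(col, levels = vector)))
counts
```

Adding a value to vector (for example 4) adds a row of zeros without any other change.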

How can I divide one variable into two variables in R?

I have a variable x which can take five values (0,1,2,3,4). I want to divide the variable into two variables. Variable 1 is supposed to contain the value 0 and variable two is supposed to contain the values 1,2,3 and 4.
I'm sure this is easy, but I can't figure out what I need to do.
what my data looks like:
|variable x|
|-----------|
|0|
|1|
|0|
|4|
|3|
|0|
|0|
|2|
So I get the table:
|0|1|2|3|4|
|---|---|---|---|---|
|125|34|14|15|15|
But I want my data to look like this:
|variable 1|variable 2|
|---|---|
|125|78|
So variable 1 is supposed to contain how often 0 is in my data
and variable 2 is supposed to contain the sum of how often 1,2,3 and 4 are in my data
You can convert the variable to logical by testing whether x == 0
x <- c(0, 1, 0, 4, 3, 0, 0, 2)
table(x)
#> x
#> 0 1 2 3 4
#> 4 1 1 1 1
table(x == 0)
#> FALSE TRUE
#> 4 4
If you want the exact headings, you can do:
# table(x != 0) lists FALSE (the zeros) first, so the labels line up with the counts
setNames(table(x != 0), c(0, paste(unique(sort(x[x != 0])), collapse = ",")))
#> 0 1,2,3,4
#> 4 4
And if you want to recode the variable as a factor in a data frame, you could do:
data.frame(x = factor(c("zero", "not zero")[1 + (x != 0)], levels = c("zero", "not zero")))
#> x
#> 1 zero
#> 2 not zero
#> 3 zero
#> 4 not zero
#> 5 not zero
#> 6 zero
#> 7 zero
#> 8 not zero
Created on 2022-04-02 by the reprex package (v2.0.1)
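Applied to the frequencies given in the question, the same logical test gives the two requested totals directly. This is a small sketch; the vector is rebuilt from the stated counts, and the names "variable 1" and "variable 2" simply mirror the question.

```r
# Rebuild x from the frequency table in the question
x <- rep(0:4, times = c(125, 34, 14, 15, 15))

# sum() over a logical vector counts the TRUE values
counts <- c("variable 1" = sum(x == 0), "variable 2" = sum(x != 0))
counts
```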
base R
You can use cbind:
x = sample(0:5, 200, replace = TRUE)
table(x)
# x
# 0 1 2 3 4 5
# 29 38 41 35 27 30
cbind(`0` = table(x)[1], `1,2,3,4` = sum(table(x)[2:5]))
# 0 1,2,3,4
# 0 29 141
tidyverse
library(tidyverse)
ta = as.data.frame(t(as.data.frame.array(table(x))))
ta %>%
mutate(!!paste(names(.[-1]), collapse = ",") := sum(c_across(`1`:`5`)), .keep = "unused")
# 0 1,2,3,4,5
# 1 29 171
Beginning with the vector, we can get the frequency from table then put it into a dataframe. Then, we can create a new column with the names collapsed (i.e., 1,2,3,4) and get the row sum for all columns except the first one.
library(tidyverse)
tab <- data.frame(value=c(0, 1, 2, 3, 4),
freq=c(125, 34, 14, 15, 15))
x <- rep(tab$value, tab$freq)
output <- data.frame(rbind(table(x))) %>%
rename_with(~str_remove(., 'X')) %>%
mutate(!!paste0(names(.)[-1], collapse = ",") := rowSums(select(., -1))) %>%
select(1, last_col())
Output
0 1,2,3,4
1 125 78
Then, to create the 2 variables in 2 dataframes, you can split the columns into a list, change the names, then put into the global environment.
list2env(setNames(
split.default(output, seq_along(output)),
c("variable 1", "variable 2")
), envir = .GlobalEnv)
Or you could just subset:
variable1 <- data.frame(`variable 1` = output$`0`, check.names = FALSE)
variable2 <- data.frame(`variable 2` = output$`1,2,3,4`, check.names = FALSE)
Update: deleted first answer:
df[paste(names(df[2:5]), collapse = ",")] <- rowSums(df[2:5])
df[, c(1,6)]
# A tibble: 1 × 2
`0` `1,2,3,4`
<dbl> <dbl>
1 125 78
data:
df <- structure(list(`0` = 125, `1` = 34, `2` = 14, `3` = 15, `4` = 15), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -1L))

Lowest positive and least negative value among various columns in R?

I have a dataset looking like this:
df <- data.frame(ID=c(1, 1, 1, 2, 3, 3), timeA=c(-10, NA, NA, -15, -10, -5), timeB=c(5, 100, -10, -10, -15, 5), timeC=c(1, 160, 17, -5, -5, 2))
Question 1:
I want to create a column giving the lowest positive value of time for each participant (ID). If all of a participant's values are negative, I want to keep the one that is least negative instead.
Question 2: Is there a function looking for the value that is closest to 0?
So that my output would look like this:
df <- data.frame(ID=c(1,2,3), time_new=c(1, -5, 2))
I think you're looking for Closest() from the DescTools package.
library(tidyverse)
library(DescTools)
# your data
df <- data.frame(ID=c(1, 1, 1, 2, 3, 3),
timeA=c(-10, NA, NA, -15, -10, -5),
timeB=c(5, 100, -10, -10, -15, 5),
timeC=c(1, 160, 17, -5, -5, 2))
# your results
# I stacked the information for easier searching
df %>% pivot_longer(!ID,values_to = "value") %>%
group_by(ID) %>%
summarise(time_new = Closest(value, 0, na.rm = T)) # closest value to zero
Simply calculate distance to 0 and then filter
For #1
library(tidyverse)
# Returns a logical vector marking the value to keep, following the logic of #1:
# prefer the lowest positive value; if no value is positive,
# take the maximum (least negative) value
filter_function <- function(x) {
result <- rep(0, length(x))
if (all(x < 0, na.rm = TRUE)) {
reference <- max(x, na.rm = TRUE)
} else {
reference <- min(x[x > 0], na.rm = TRUE)
}
result <- result + (x == reference)
result[is.na(result)] <- 0
as.logical(result)
}
# filter as #1 option
df %>% pivot_longer(!ID,values_to = "value") %>%
# calculate the distance to ZERO for each value
mutate(distance_to_zero = 0 + value,
abs_distance_to_zero = abs(distance_to_zero)) %>%
group_by(ID) %>%
filter(filter_function(distance_to_zero))
#> # A tibble: 3 x 5
#> # Groups: ID [3]
#> ID name value distance_to_zero abs_distance_to_zero
#> <dbl> <chr> <dbl> <dbl> <dbl>
#> 1 1 timeC 1 1 1
#> 2 2 timeC -5 -5 5
#> 3 3 timeC 2 2 2
And this is for #2
# filter as closest to ZERO no matter positive or negative
df %>%
pivot_longer(!ID,values_to = "value") %>%
# calculate the distance to ZERO for each value
mutate(abs_distance_to_zero = abs(0 + value)) %>%
group_by(ID) %>%
# Then filter by the one equal to minimum in each group can return multiple
# records in your actual data
filter(abs_distance_to_zero == min(abs_distance_to_zero, na.rm = TRUE) &
!is.na(abs_distance_to_zero)) %>%
ungroup()
#> # A tibble: 3 x 4
#> ID name value abs_distance_to_zero
#> <dbl> <chr> <dbl> <dbl>
#> 1 1 timeC 1 1
#> 2 2 timeC -5 5
#> 3 3 timeC 2 2
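The rule from Question 1 can also be written without any packages. The following is a sketch only: pick_time() is a hypothetical helper name, and the code assumes every ID has at least one non-missing value.

```r
df <- data.frame(ID = c(1, 1, 1, 2, 3, 3),
                 timeA = c(-10, NA, NA, -15, -10, -5),
                 timeB = c(5, 100, -10, -10, -15, 5),
                 timeC = c(1, 160, 17, -5, -5, 2))

# Hypothetical helper: lowest positive value, else the least negative one
pick_time <- function(x) {
  x <- x[!is.na(x)]
  positive <- x[x > 0]
  if (length(positive) > 0) min(positive) else max(x)
}

# Stack the three time columns under their ID
long <- data.frame(ID = rep(df$ID, times = 3),
                   value = c(df$timeA, df$timeB, df$timeC))

# The formula method of aggregate() drops rows with NA by default
result <- aggregate(value ~ ID, data = long, FUN = pick_time)
names(result)[2] <- "time_new"
result
```

For Question 2, replacing the body of the helper with x[which.min(abs(x))] would return the value closest to 0 regardless of sign.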

Subsetting when there are n consecutive dummies

I have a data frame in which I created a series of dummy variables and then combined them into a final column. I want to know whether there is a run of 3 consecutive 1's, i.e., is there a way to subset the data frame that gives me rows 3:5 in the following example?
df <- tibble(
a= c(0, 0, 1, 1, 1, 0, 1, 1)
)
df
# A tibble: 8 x 1
a
<dbl>
1 0
2 0
3 1
4 1
5 1
6 0
7 1
8 1
The data.table package has a handy function called rleid that assigns a new group id every time the value changes. Using that, you can do:
library(tidyverse)
df %>%
  group_by(grp = data.table::rleid(a)) %>%
  filter(n() >= 3 & all(a == 1))
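The same run-length idea is available in base R through rle(). This is a minimal sketch on the example vector; the names runs and keep are illustrative.

```r
a <- c(0, 0, 1, 1, 1, 0, 1, 1)

# Run-length encoding: lengths 2, 3, 1, 2 with values 0, 1, 0, 1
runs <- rle(a)

# Flag runs of 1s that are at least 3 long, then expand back to row level
keep <- rep(runs$values == 1 & runs$lengths >= 3, times = runs$lengths)

which(keep)
```

The logical vector keep can be used directly to subset a data frame, for example df[keep, ].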

Determining the first occurrence of a value by group and its position within the group

I would like to know per group in the column 'Participants' when the value '1' occurs for the first time in the column 'Signal' (by Participants). The count of the value '1' should refer to the group.
Here is an example data frame
> dfInput <- data.frame(Participants=c( 'A','A','A','B','B','B','B','C','C'), Signal=c(0, 1, 1, 0, 0, 0, 1, 1,0))
> dfInput
Participants Signal
1 A 0
2 A 1
3 A 1
4 B 0
5 B 0
6 B 0
7 B 1
8 C 1
9 C 0
And here is the output I am looking for:
> dfOutput <-data.frame(Participants=c( 'A','B','C'), RowNumberofFirst1=c(2, 4, 1))
> dfOutput
Participants RowNumberofFirst1
1 A 2
2 B 4
3 C 1
The problem is somewhat similar to this: Find first occurence of value in group using dplyr mutate
Yet, I could not adapt it accordingly, to create my output df
I think this is what you are looking for
library(dplyr)
dfInput %>%
group_by(Participants) %>%
summarise(RowNumberofFirst1 = which(Signal == 1)[1])
Another base R option, via aggregate:
aggregate(Signal~Participants, dfInput, function(i)which(i == 1)[1])
# Participants Signal
#1 A 2
#2 B 4
#3 C 1
dfInput <- data.frame(Participants=c( 'A','A','A','B','B','B','B','C','C'),
Signal=c(0, 1, 1, 0, 0, 0, 1, 1,0))
library(dplyr)
dfInput %>%
group_by(Participants) %>% # for each Participant
summarise(NumFirst1 = min(row_number()[Signal == 1])) # get the minimum number of row where signal equals 1
# # A tibble: 3 x 2
# Participants NumFirst1
# <fct> <int>
# 1 A 2
# 2 B 4
# 3 C 1
In case you want to return the row (i.e. all column values) that you've identified, you can use this:
set.seed(5)
dfInput <- data.frame(Participants=c( 'A','A','A','B','B','B','B','C','C'),
Signal=c(0, 1, 1, 0, 0, 0, 1, 1,0),
A = sample(c("C","D","F"),9, replace = T),
B = sample(c("N","M","K"),9, replace = T))
library(dplyr)
dfInput %>%
group_by(Participants) %>%
filter(row_number() == min(row_number()[Signal == 1])) %>%
ungroup()
# # A tibble: 3 x 4
# Participants Signal A B
# <fct> <dbl> <fct> <fct>
# 1 A 1 F N
# 2 B 1 D N
# 3 C 1 F M
So, in this case you use filter to return, for each participant, the row that is equal to the minimum row number where Signal is 1.
With tidyverse:
dfInput%>%
group_by(Participants)%>%
mutate(max=cumsum(Signal),
RowNumberofFirst1=row_number())%>%
filter(max==1)%>%
top_n(-1,RowNumberofFirst1)%>%
select(Participants,RowNumberofFirst1)
# A tibble: 3 x 2
# Groups: Participants [3]
Participants RowNumberofFirst1
<fct> <int>
1 A 2
2 B 4
3 C 1
Here is a solution with base R:
dfInput <- data.frame(Participants=c( 'A','A','A','B','B','B','B','C','C'), Signal=c(0, 1, 1, 0, 0, 0, 1, 1,0))
tapply(dfInput$Signal, dfInput$Participants, FUN=function(x) min(which(x==1)))
# > tapply(dfInput$Signal, dfInput$Participants, FUN=function(x) min(which(x==1)))
# A B C
# 2 4 1
If you want a dataframe you can do:
first1 <- tapply(dfInput$Signal, dfInput$Participants, FUN=function(x) min(which(x==1)))
data.frame(Participants=names(first1), f=first1)
Here is a variant with data.table:
library("data.table")
setDT(dfInput)
dfInput[, which(Signal==1)[1], "Participants"]
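One difference between the variants above shows up when a group contains no 1 at all: min(which(x == 1)) returns Inf with a warning, whereas which(x == 1)[1] returns NA. The sketch below uses match(), which also returns NA in that case. Participant D is a made-up addition to the example data to illustrate this.

```r
# Example data plus a hypothetical participant D whose Signal is never 1
dfInput <- data.frame(
  Participants = c("A", "A", "A", "B", "B", "B", "B", "C", "C", "D", "D"),
  Signal       = c(0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0)
)

# match(1, x) gives the position of the first 1 in x, or NA if there is none
first1 <- tapply(dfInput$Signal, dfInput$Participants, function(x) match(1, x))
first1
```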
