Alternative method to count number of single occurrences across columns of interest - r

I would like to count how often certain values occur in each of several columns. I have applied the following code:
dat = data.frame()
vector <- c(1, 2, 3)
for (i in names(data)){
  for (j in vector){
    dat[j,i] <- length(which(data[,i] == j))
  }
}
print(dat)
That returns exactly the output I am looking for. Does this code contain any redundancies? Could you please suggest a more efficient alternative, either iterative (with a for loop) or using the dplyr package?
Thanks
Here is a short extract of the dataset I am working on.
structure(list(run_set_1 = c(3, 3, 3, 3, 3, 3), run_set_2 = c(1,
1, 1, 1, 1, 1), run_set_3 = c(2, 2, 2, 2, 2, 2)), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))

You could first match() each column to get the index in vector that
the column values correspond to, if any. Then tabulate() those to get the
counts, including 0s:
lapply(data, match, vector) |>
  sapply(tabulate, length(vector))
#>      run_set_1 run_set_2 run_set_3
#> [1,]         0         6         0
#> [2,]         0         0         6
#> [3,]         6         0         0
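To see why this works, it can help to run the two steps on a single column. The sketch below uses a made-up column (x is hypothetical, not from the question) that contains a value, 5, which is not in vector:

```r
# Hypothetical toy column; 5 is deliberately absent from `vector`
x <- c(3, 3, 1, 5, 3)
vector <- c(1, 2, 3)

# match() gives the position of each element of x in `vector` (NA if absent)
idx <- match(x, vector)

# tabulate() counts how often each position 1..nbins occurs; NAs are ignored
counts <- tabulate(idx, nbins = length(vector))
print(counts)
```

Here counts is 1 0 3: one entry per element of vector, so a value that never occurs still gets a 0, and a value outside vector is simply dropped.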
This can be modified to use dplyr-native iteration:
library(dplyr, warn.conflicts = FALSE)
data %>%
  summarise(
    across(everything(), match, vector) %>%
      purrr::map_dfc(tabulate, length(vector))
  )
#> # A tibble: 3 × 3
#>   run_set_1 run_set_2 run_set_3
#>       <int>     <int>     <int>
#> 1         0         6         0
#> 2         0         0         6
#> 3         6         0         0

EDIT: I added the case of a value that is expected but missing (4, as an example).
Here is the tidyverse version. It can probably be made even shorter.
vector = c(1:4)
library(dplyr)
library(tidyr)
data %>% pivot_longer(cols = everything()) %>%
  mutate(value = factor(as.character(value), levels = vector)) %>%
  count(name, value, .drop = FALSE) %>%
  pivot_wider(names_from = name, values_from = n) %>%
  arrange(value) %>% select(-value)
# last line only to remove the value column and fit your example
# # A tibble: 4 × 3
#   run_set_1 run_set_2 run_set_3
#       <int>     <int>     <int>
# 1         0         6         0
# 2         0         0         6
# 3         6         0         0
# 4         0         0         0
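The key step above is turning the values into a factor with fixed levels, so that levels with no observations are still counted. A minimal base R sketch of the same idea, assuming the data from the question and the extended vector 1:4 (the counts name is illustrative):

```r
# Data from the question, rebuilt as a plain data frame
data <- data.frame(run_set_1 = rep(3, 6),
                   run_set_2 = rep(1, 6),
                   run_set_3 = rep(2, 6))
vector <- 1:4

# A factor with explicit levels makes table() report 0 for unseen values
counts <- sapply(data, function(col) table(factor(col, levels = vector)))
print(counts)
```

The result is a matrix with one row per element of vector and one column per column of data; the row for 4 contains only zeros.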

Related

Referencing variable names in loops for dplyr

I know this has been discussed already, but can't find a solution that works for me. I have several binary (0/1) variables named "indic___1" to "indic___8" and one continuous variable "measure".
I would like to compute summary statistics for "measure" across each group, so I created this code:
library(dplyr)
indic___1 <- c(0, 1, 0, 1, 0)
indic___2 <- c(1, 1, 0, 1, 1)
indic___3 <- c(0, 0, 1, 0, 0)
indic___4 <- c(1, 1, 0, 1, 0)
indic___5 <- c(0, 0, 0, 1, 1)
indic___6 <- c(0, 1, 1, 1, 0)
indic___7 <- c(1, 1, 0, 1, 1)
indic___8 <- c(0, 1, 1, 1, 0)
measure <- c(28, 15, 26, 42, 12)
dataset <- data.frame(indic___1, indic___2, indic___3, indic___4, indic___5, indic___6, indic___7, indic___8, measure)
for (i in 1:8) {
  variable <- paste0("indic___", i)
  print(variable)
  dataset %>% group_by(variable) %>% summarise(mean = mean(measure))
}
It returns an error:
Error in `group_by()`:
! Must group by variables found in `.data`.
x Column `variable` is not found.
Putting data into long format makes this generally solvable without a loop. You didn’t specify what you wanted to do with the data inside the loop so I had to guess, but the general form of the solution would look as follows:
library(tidyr)
results = dataset |>
  pivot_longer(starts_with("indic___"), names_pattern = "indic___(.*)") |>
  group_by(name, value) |>
  summarize(mean = mean(measure), .groups = "drop")
# # A tibble: 16 × 3
# name value mean
# <chr> <dbl> <dbl>
# 1 1 0 22
# 2 1 1 28.5
# 3 2 0 26
# 4 2 1 24.2
# 5 3 0 24.2
# …
If you want to separate the results from the individual names, you can use a combination of nest and pull:
results |>
  nest(data = c(value, mean), .by = name) |>
  pull(data)
# [[1]]
# # A tibble: 2 × 2
# value mean
# <dbl> <dbl>
# 1 0 22
# 2 1 28.5
#
# [[2]]
# # A tibble: 2 × 2
# value mean
# <dbl> <dbl>
# 1 0 26
# 2 1 24.2
# …
… but at this point I’d ask myself why I am using table manipulation in the first place. The following seems a lot easier:
indices = unname(mget(ls(pattern = "^indic___")))
results = indices |>
  lapply(split, x = measure) |>
  lapply(vapply, mean, numeric(1L))
# [[1]]
# 0 1
# 22.0 28.5
#
# [[2]]
# 0 1
# 26.00 24.25
# …
Notably, in real code you shouldn’t need the first line since your data should not be in individual, numbered variables in the first place. The proper way to do this is to have the data in a joint list, as is done here. Also, note that I once again explicitly removed the unreadable indic___X names. You can of course retain them (just remove the unname call) but I don’t recommend it.
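If the loop from the question is still wanted, the immediate cause of the error is that group_by(variable) looks for a column literally named variable. Below is a minimal sketch of one way around it, using across(all_of()) to select the column whose name is stored in a string. Only the first two indicator columns are reproduced to keep it short, and the results list is an illustrative addition, not part of the original code:

```r
library(dplyr)

# First two indicator columns and the measure from the question
dataset <- data.frame(indic___1 = c(0, 1, 0, 1, 0),
                      indic___2 = c(1, 1, 0, 1, 1),
                      measure   = c(28, 15, 26, 42, 12))

results <- list()
for (i in 1:2) {
  variable <- paste0("indic___", i)
  # all_of() treats `variable` as a column name held in a string
  results[[variable]] <- dataset %>%
    group_by(across(all_of(variable))) %>%
    summarise(mean = mean(measure), .groups = "drop")
}
print(results[["indic___1"]])
```

Storing each summary in a list also matters here, because the value of an expression inside a for loop is not printed automatically.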

Calculate and add percent columns to multiple columns inside a function

I have a large survey data set to summarise. I have calculated row counts across multiple columns, grouped by treatment and control conditions.
I need to add columns that calculate the percentage for each group, plus a 'difference' column, (percentage gp1) - (percentage gp2), but I can't work out how to do it. Help would be appreciated.
Below is an example of the data:
library(tidyverse)
library(janitor)
df <- data.frame(mhst = factor(c(0,1,1,0)),
q1 = factor(c(1, 4, 2, 2)),
q2 = factor(c(3, 4, 5, 1)),
q3 = factor(c(1, 4, 2, 5)),
q4 = factor(c(2, 1, 1, 3)),
WT1 = c(0.5, 0.3, 6, 1))
q_set_t1 <- c("q1", "q2", "q3", "q4") #choose cols to calc
wt1 <- c("WT1") # choose weights
make_output <- function(mycol, weight) { ## Make the output table for the chosen column and weight
  output <- df %>% group_by_at(c("mhst", mycol)) %>%
    summarise_at(weight, sum)
  names(output)[2] <- "Q_set" #Set the column names
  names(output)[3] <- "weighted_count"
  output <- output %>% pivot_wider(names_from = mhst, values_from = weighted_count) #sets MHST nonMHST side by side
  output <- output %>% janitor::adorn_totals()
  output <- output %>% mutate_all(~as.character(.)) #converts to character for easier manipulation when joining vertically
  output <- bind_rows(tibble("var_name"=mycol), #adds a blank row above the output table with the var name
                      output, #adds the output table
                      tibble("var_name"="")) #adds blank row at the bottom of each tibble
}
cols_output <- pmap(list(q_set_t1, wt1), ~make_output(..1,..2)) # Generates all the output tables by coercing the inputs into a list and feeding them through the make_output function
q_set_wt <- tibble() # Join them vertically
for(i in 1:length(cols_output)) {
  q_set_wt <- bind_rows(q_set_wt, cols_output[[i]])
}
I have tried adding this code:
output <- output %>% mutate(percent = mycol/sum(mycol)*100)
on the line below the pivot_wider() call, but I get an error:
"Caused by error in `sum()`:
! invalid 'type' (character) of argument"
For the difference column I have tried adding
output <- output %>% mutate("0" - "1") (0 and 1 being the names of the mhst levels)
on the line below the pivot_wider() call, but I get the error:
! Problem while computing `..1 = "0" - "1"`.
Caused by error in `"0" - "1"`:
! non-numeric argument to binary operator
We may modify the function as
library(purrr)
library(dplyr)
library(tidyr)
make_output <- function(mycol, weight) {
  df %>%
    group_by(across(c("mhst", all_of(mycol)))) %>%
    summarise(across(all_of(weight), sum), .groups = "drop") %>%
    rename(Q_set = 2, weighted_count = 3) %>%
    group_by(Q_set) %>%
    mutate(Perc = weighted_count/sum(weighted_count) * 100) %>%
    ungroup %>%
    pivot_wider(names_from = mhst,
                values_from = weighted_count, values_fill = 0) %>%
    mutate(Diff = `0` - `1`)
}
and then apply pmap
> pmap_dfr(list(q_set_t1, wt1), ~make_output(..1, ..2))
# A tibble: 15 × 5
   Q_set  Perc   `0`   `1`  Diff
   <fct> <dbl> <dbl> <dbl> <dbl>
 1 1     100     0.5   0     0.5
 2 2      14.3   1     0     1
 3 2      85.7   0     6    -6
 4 4     100     0     0.3  -0.3
 5 1     100     1     0     1
 6 3     100     0.5   0     0.5
 7 4     100     0     0.3  -0.3
 8 5     100     0     6    -6
 9 1     100     0.5   0     0.5
10 5     100     1     0     1
11 2     100     0     6    -6
12 4     100     0     0.3  -0.3
13 2     100     0.5   0     0.5
14 3     100     1     0     1
15 1     100     0     6.3  -6.3
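Both errors in the question come from quoting: mycol holds a column name as a character string, so sum(mycol) tries to add up text, and "0" - "1" subtracts two strings rather than two columns. Columns named 0 and 1 have to be referred to with backticks. The sketch below shows the arithmetic on the q1 weighted counts in base R, under the assumption that "percentage" means each answer's share of its mhst column total; the names pct_0, pct_1 and diff are illustrative:

```r
# Weighted counts for q1 by mhst level, taken from the example data;
# check.names = FALSE keeps the non-syntactic column names `0` and `1`
wide <- data.frame(Q_set = c("1", "2", "4"),
                   `0` = c(0.5, 1, 0),
                   `1` = c(0, 6, 0.3),
                   check.names = FALSE)

# Share of each answer within its mhst group, in percent
wide$pct_0 <- wide$`0` / sum(wide$`0`) * 100
wide$pct_1 <- wide$`1` / sum(wide$`1`) * 100

# (percentage gp1) - (percentage gp2)
wide$diff <- wide$pct_0 - wide$pct_1
print(wide)
```

Inside a dplyr pipeline the same expressions would go in mutate(), again with the backticked column names.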

Interpolation of values from list

I have a dataframe containing the results of a competition. In this example competitors b and c have tied for second place. The actual dataframe is very large and could contain multiple ties.
df <- data.frame(name = letters[1:4],
place = c(1, 2, 2, 4))
I also have point values for the respective places, where first place gets 4 points, 2nd gets 3, 3rd gets 1 and 4th gets 0.
points <- c(4, 3, 1, 0)
names(points) <- 1:4
I can match points to place to get each competitor's score
df %>%
  mutate(score = points[place])
  name place score
1    a     1     4
2    b     2     3
3    c     2     3
4    d     4     0
What I would like to do though is award points to b and c that are the mean of the point values for 2nd and 3rd, such that each receives 2 points like this:
  name place score
1    a     1     4
2    b     2     2
3    c     2     2
4    d     4     0
How can I accomplish this programmatically?
A solution using nested data frames and purrr.
library(dplyr)
library(tidyr)
library(purrr)
df <- data.frame(name = letters[1:4],
place = c(1, 2, 2, 4))
points <- c(4, 3, 1, 0)
names(points) <- 1:4
# a function to help expand the dataframe based on the number of ties
expand_all <- function(x,n){
  x:(x+n-1)
}
df %>%
  group_by(place) %>%
  tally() %>%
  mutate(new_place = purrr::map2(place,n, expand_all)) %>%
  unnest(new_place) %>%
  mutate(score = points[new_place]) %>%
  group_by(place) %>%
  summarize(score = mean(score)) %>%
  inner_join(df)
Robert Wilson's answer gave me an idea. Rather than mapping over nested data frames, the base rank() function can produce the same result:
df %>%
  mutate(new_place = rank(place, ties.method = "first")) %>%
  mutate(score = points[new_place]) %>%
  group_by(place) %>%
  summarize(score = mean(score)) %>%
  inner_join(df)
  place score name
  <dbl> <dbl> <chr>
1     1     4 a
2     2     2 b
3     2     2 c
4     4     0 d
For this example, this can be accomplished in a few lines with an ifelse() inside a mutate(). Note that the hard-coded + 1 happens to equal the points for the skipped 3rd place, so this does not generalise to other ties or point scales:
df %>%
  group_by(place) %>%
  mutate(n_ties = n()) %>%
  ungroup %>%
  mutate(score = (points[place] + ifelse(n_ties > 1, 1, 0))/ n_ties)
# A tibble: 4 x 4
  name  place n_ties score
  <chr> <dbl>  <int> <dbl>
1 a         1      1     4
2 b         2      2     2
3 c         2      2     2
4 d         4      1     0
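The rank-based idea can also be written without any packages. The sketch below assumes, as in the question, that tied competitors occupy consecutive places starting at their shared place; slot is an illustrative name:

```r
df <- data.frame(name = letters[1:4],
                 place = c(1, 2, 2, 4))
points <- c(4, 3, 1, 0)

# Break ties by order of appearance, so the two 2nd places become 2 and 3
slot <- rank(df$place, ties.method = "first")

# ave() replaces each value by the mean of its group (here: original place)
df$score <- ave(points[slot], df$place)
print(df)
```

Because the averaging is done over whatever places the tied group spans, the same two lines should handle three-way ties or several separate ties, as long as places are assigned in the usual "1, 2, 2, 4" style.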

map over columns and apply custom function

Missing something small here and struggling to pass columns to function. I just want to map (or lapply) over columns and perform a custom function on each of the columns. Minimal example here:
library(tidyverse)
set.seed(10)
df <- data.frame(id = c(1,1,1,2,3,3,3,3),
                 r_r1 = sample(c(0,1), 8, replace = T),
                 r_r2 = sample(c(0,1), 8, replace = T),
                 r_r3 = sample(c(0,1), 8, replace = T))
df
#   id r_r1 r_r2 r_r3
# 1  1    0    0    1
# 2  1    0    0    1
# 3  1    1    0    1
# 4  2    1    1    0
# 5  3    1    0    0
# 6  3    0    0    1
# 7  3    1    1    1
# 8  3    1    0    0
A function that filters and then counts the unique ids remaining in the dataset:
cnt_un <- function(var) {
  df %>%
    filter({{var}} == 1) %>%
    group_by({{var}}) %>%
    summarise(n_uniq = n_distinct(id)) %>%
    ungroup()
}
It works outside of map:
cnt_un(r_r1)
# A tibble: 1 x 2
   r_r1 n_uniq
  <dbl>  <int>
1     1      3
I want to apply the function over all r_r columns to get something like:
df2
# y n_uniq
# 1 r_r1 3
# 2 r_r2 2
# 3 r_r3 2
I thought the following would work, but it doesn't:
map(dplyr::select(df, matches("r_r")), ~ cnt_un(.x))
Any suggestions? Thanks
I'm not sure if there's a direct tidyeval way to do this with something like map. The issue you're running into is that in calling map(df, *whatever_function*), the function is being called on each column of df as a vector, whereas your function expects a bare column name in the tidyeval style. To verify that:
map(df, class)
will return "numeric" for each column.
An alternative is to iterate over column names as strings, and convert those to symbols; this takes just one additional line in the function.
library(dplyr)
library(tidyr)
library(purrr)
cnt_un_name <- function(varname) {
  var <- ensym(varname)
  df %>%
    filter({{var}} == 1) %>%
    group_by({{var}}) %>%
    summarise(n_uniq = n_distinct(id)) %>%
    ungroup()
}
Calling the function is a little awkward because it keeps only the relevant column names (calling on "r_r1" gets columns "r_r1" and "n_uniq", etc). One way is to get the vector of column names you want, name it so you can add an ID column in map_dfr, and drop the extra columns, since they'll be mostly NA.
grep("^r_r\\d+", names(df), value = TRUE) %>%
  set_names() %>%
  map_dfr(cnt_un_name, .id = "y") %>%
  select(y, n_uniq)
#> # A tibble: 3 x 2
#>   y     n_uniq
#>   <chr>  <int>
#> 1 r_r1       3
#> 2 r_r2       2
#> 3 r_r3       2
A better way is to call the function, then bind after reshaping.
grep("^r_r\\d+", names(df), value = TRUE) %>%
  map(cnt_un_name) %>%
  map_dfr(pivot_longer, 1, names_to = "y") %>%
  select(y, n_uniq)
# same output as above
Alternatively (and maybe better/more scalable) would be to do the column renaming inside the function definition.
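A sketch of that last idea, assuming the column name is passed as a string: the .data pronoun looks the column up by name, and the name itself is stored in a y column inside the function, so nothing needs reshaping afterwards. The data frame is typed in from the table printed in the question rather than regenerated with sample(), since random draws can differ between R versions; cnt_un_chr is a hypothetical name.

```r
library(dplyr)

# Data copied from the table printed in the question
df <- data.frame(id   = c(1, 1, 1, 2, 3, 3, 3, 3),
                 r_r1 = c(0, 0, 1, 1, 1, 0, 1, 1),
                 r_r2 = c(0, 0, 0, 1, 0, 0, 1, 0),
                 r_r3 = c(1, 1, 1, 0, 0, 1, 1, 0))

cnt_un_chr <- function(varname) {
  df %>%
    filter(.data[[varname]] == 1) %>%            # column looked up by its name
    summarise(y = varname, n_uniq = n_distinct(id))
}

cols <- grep("^r_r\\d+", names(df), value = TRUE)
res <- bind_rows(lapply(cols, cnt_un_chr))
print(res)
```

Each call returns a one-row data frame with the same two columns, so the pieces bind together directly.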
Here's a base R solution that uses lapply. The tricky bit is that your function isn't actually running on single columns; it's using id, too, so you can't use canned functions that iterate column-wise.
do.call(rbind, lapply(grep("r_r", colnames(df), value = TRUE), function(i) {
  X <- subset(df, df[,i] == 1)
  row <- data.frame(y = i, n_uniq = length(unique(X$id)), stringsAsFactors = FALSE)
}))
     y n_uniq
1 r_r1      2
2 r_r2      3
3 r_r3      2
Here is another solution. I changed the syntax of your function. Now you supply the pattern of the columns you want to select.
cnt_un <- function(var_pattern) {
  df %>%
    pivot_longer(cols = contains(var_pattern), values_to = "vals", names_to = "y") %>%
    filter(vals == 1) %>%
    group_by(y) %>%
    summarise(n_uniq = n_distinct(id)) %>%
    ungroup()
}
cnt_un("r_r")
#> # A tibble: 3 x 2
#>   y     n_uniq
#>   <chr>  <int>
#> 1 r_r1       2
#> 2 r_r2       3
#> 3 r_r3       2

Get last row of each group in R [duplicate]

I have some data similar in structure to:
a <- data.frame("ID" = c("A", "A", "B", "B", "C", "C"),
"NUM" = c(1, 2, 4, 3, 6, 9),
"VAL" = c(1, 0, 1, 0, 1, 0))
And I am trying to sort it by ID and NUM then get the last row.
This code works to get the last row and summarize down to a unique ID, however, it doesn't actually get the full last row like I want.
a <- a %>% arrange(ID, NUM) %>%
  group_by(ID) %>%
  summarise(max(NUM))
I understand why this code doesn't work, but I am looking for the dplyr way of getting the last row for each unique ID.
Expected Results:
  ID      NUM   VAL
  <fct> <dbl> <dbl>
1 A         2     0
2 B         4     1
3 C         9     0
Note: I will admit that though it is nearly a duplicate of Select first and last row from grouped data, the answers on that thread were not quite what I was looking for.
You might try:
a %>%
  group_by(ID) %>%
  arrange(NUM) %>%
  slice(n())
One dplyr option could be:
a %>%
  arrange(ID, NUM) %>%
  group_by(ID) %>%
  summarise_all(last)
  ID      NUM   VAL
  <fct> <dbl> <dbl>
1 A        2.    0.
2 B        4.    1.
3 C        9.    0.
Or since dplyr 1.0.0:
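a %>%
  arrange(ID, NUM) %>%
  group_by(ID) %>%
  summarise(across(everything(), last))
Or using slice_max() (by default, ties in NUM return more than one row per group; set with_ties = FALSE to keep exactly one):
a %>%
  group_by(ID) %>%
  slice_max(order_by = NUM, n = 1)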
tail() returns the last 6 items of a subsettable object by default. When using aggregate(), extra arguments for FUN are passed after the function, separated by commas; here 1 is matched to n = 1, which tells tail() to return only the last item. Because the last item is taken in the data's current row order, sort by ID and NUM first:
a_sorted <- a[order(a$ID, a$NUM), ]
aggregate(a_sorted[, c('NUM', 'VAL')], list(a_sorted$ID), tail, 1)
#   Group.1 NUM VAL
# 1       A   2   0
# 2       B   4   1
# 3       C   9   0
You can use top_n() (superseded by slice_max() since dplyr 1.0.0). The rows here are already in ID order, and sorting by NUM isn't necessary since only the largest value per group is kept.
library(dplyr)
a %>%
  group_by(ID) %>%
  top_n(1, NUM)
# # A tibble: 3 x 3
# # Groups:   ID [3]
#   ID      NUM   VAL
#   <fct> <dbl> <dbl>
# 1 A         2     0
# 2 B         4     1
# 3 C         9     0
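For comparison, a minimal base R sketch of the same "sort, then keep the last row per ID" logic, using duplicated() with fromLast = TRUE; a_sorted and last_rows are illustrative names:

```r
a <- data.frame(ID  = c("A", "A", "B", "B", "C", "C"),
                NUM = c(1, 2, 4, 3, 6, 9),
                VAL = c(1, 0, 1, 0, 1, 0))

# Sort so that the row wanted for each ID comes last within its group
a_sorted <- a[order(a$ID, a$NUM), ]

# Scanning from the end, the first time an ID is seen is its last row
last_rows <- a_sorted[!duplicated(a_sorted$ID, fromLast = TRUE), ]
print(last_rows)
```

This keeps exactly one full row per ID, including the VAL that belongs to the largest NUM.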
