Creating multiple new columns in a dataframe in an iterative way - r

-- Small edit made to test data. Columns are no longer grouped by round, but instead grouped by Team as is the case in the real dataset.
I have tried a variety of methods for creating multiple new columns while minimising repetition in the code. My initial, successful method requires excessive copying and pasting, which I would like to reduce as much as possible. Below is example data for the problem:
library(tidyverse)
df <- tribble(~R1TeamX, ~R2TeamX, ~R3TeamX, ~R1TeamY, ~R2TeamY, ~R3TeamY,
10, 11, 12, 15, 19, 20,
11, 13, 14, 25, 18, 15)
This example data is for three rounds with the scores for both team X and team Y. I am looking to create additional columns, finding the difference between the scores of Team X and Y. The real dataset has upwards of 30 rounds.
My initial solution used mutate and works as follows:
df <- df %>%
mutate(R1Diff = R1TeamX - R1TeamY,
R2Diff = R2TeamX - R2TeamY,
R3Diff = R3TeamX - R3TeamY)
While this does the job, it is not scalable. I have attempted to reduce the amount of code using str_c and mutate, but cannot identify the correct looping method to make this work across several lines of code. Below is my attempt at standardising the code so far:
teamx <- str_c("R", 1:3, "TeamX")
teamy <- str_c("R", 1:3, "TeamY")
round_diff <- str_c("R", 1:3, "Diff")
df <- df %>%
mutate(!!round_diff[1] := UQ(parse_quo(teamx[1], global_env())) - UQ(parse_quo(teamy[1], global_env())),
!!round_diff[2] := UQ(parse_quo(teamx[2], global_env())) - UQ(parse_quo(teamy[2], global_env())),
!!round_diff[3] := UQ(parse_quo(teamx[3], global_env())) - UQ(parse_quo(teamy[3], global_env())))
While additional code is required, this standardises my input to some degree and reduces some of the leg work, but I know there must be some way to reduce this to a single line. I have explored mutate_at and for loops to no avail. I suspect this problem could also be tackled with purrr::map, but I do not have enough ability in this area to identify the correct approach.
Any help would be greatly appreciated.

Whilst it's possible to do this with dplyr and tidyr functions, remember that you still have some useful base R options open to you. This method uses lapply and assumes that your columns alternate between Team X and Team Y (R1TeamX, R1TeamY, R2TeamX, ...), so round x occupies columns 2x - 1 and 2x:
seq(length(df)/2) %>%
lapply(function(x) df[[2 * x - 1]] - df[[2 * x]]) %>%
as.data.frame() %>%
setNames(paste0("R", seq(length(df)/2), "Diff")) %>%
cbind(df,.)
#> R1TeamX R1TeamY R2TeamX R2TeamY R3TeamX R3TeamY R1Diff R2Diff R3Diff
#> 1 10 11 12 15 19 20 -1 -3 -1
#> 2 11 13 14 25 18 15 -2 -11 3
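For the revised layout in the question, where the columns are grouped by team rather than alternating, a name-based variant avoids relying on column positions. This is only a sketch: it assumes rounds are numbered consecutively and that every round has both an X and a Y column; the helper vectors x_cols, y_cols and diff_cols are names made up for the illustration.

```r
# Revised example data: all Team X columns first, then all Team Y columns
df <- data.frame(
  R1TeamX = c(10, 11), R2TeamX = c(11, 13), R3TeamX = c(12, 14),
  R1TeamY = c(15, 25), R2TeamY = c(19, 18), R3TeamY = c(20, 15)
)

rounds <- 1:3  # assumption: rounds are numbered consecutively from 1
x_cols <- paste0("R", rounds, "TeamX")
y_cols <- paste0("R", rounds, "TeamY")
diff_cols <- paste0("R", rounds, "Diff")

# Subtracting two data frames of the same shape works element-wise,
# so all rounds are handled in one assignment
df[diff_cols] <- df[x_cols] - df[y_cols]
df
```

Because the columns are looked up by name, the same three lines should carry over to 30+ rounds by changing only `rounds`.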

Try doing it this way:
library(tidyverse)
df %>%
mutate(id = row_number()) %>%
pivot_longer(
-id,
names_to = c("set", ".value"),
names_pattern = "(R\\d+Team)(X|Y)"
) %>%
mutate(Diff = X - Y) %>%
pivot_longer(-c(id, set)) %>%
pivot_wider(id_cols = id, names_from = c(set, name), values_from = value, names_sep = "")

Here's a solution that I believe is robust with respect to the number of rounds, the number of opponents of Team X and the order in which the results are stored.
First, make the data tidy: remove information about Teams and Rounds from column names.
newDF <- df %>%
mutate(id = row_number()) %>%
pivot_longer(
-id,
names_to = c("Round", "Team"),
names_pattern = "R(\\d+)Team(X|Y)",
values_to="Score"
)
Now calculate the differences in scores
newDF %>%
# Calculate difference in scores
mutate(Team=ifelse(Team == "X", Team, "Opponent")) %>%
pivot_wider(values_from=Score, names_from=Team) %>%
mutate(Diff=X - Opponent) %>%
select(-Opponent) %>%
# Bring in identity of opponent
left_join(
newDF %>%
filter(Team != "X") %>%
select(-Score) %>%
rename(Opposition=Team),
by=c("id", "Round")
)
Giving
# A tibble: 6 x 5
id Round X Diff Opposition
<int> <chr> <dbl> <dbl> <chr>
1 1 1 10 -5 Y
2 1 2 11 -8 Y
3 1 3 12 -8 Y
4 2 1 11 -14 Y
5 2 2 13 -5 Y
6 2 3 14 -1 Y
based on OP's revised input data.

Related

Summing across in a dataframe with condition coming from another column

This is not a very good title for the question. I want to sum across certain columns in a data frame for each group, excluding one column for each of my groups. A simple example would be as follows:
df <- tibble(group_name = c("A", "B","C"), mean_A = c(1,2,3), mean_B = c(2,3,4), mean_C=c(3,4,5))
df %>% group_by(group_name) %>% mutate(m1 = sum(across(contains("mean"))))
This creates column m1, which is the sum across mean_A, mean_B and mean_C for each group. What I want to do is exclude mean_A for group A, mean_B for group B and mean_C for group C. The following does not work though (not surprisingly).
df %>% group_by(group_name) %>% mutate(m1 = sum(across(c(contains("mean") & !contains(group_name)))))
Do you have an idea how I could do this? My original data contains many more groups, so would be hard to do by hand.
Edit: I have tried the following way which solves it in a rudimentary fashion, but something (?grepl maybe) seems to not work great here and I get the wrong result.
df %>% pivot_longer(!group_name) %>% mutate(value2 = case_when(grepl(group_name, name) ~ 0, TRUE ~ value)) %>% group_by(group_name) %>% summarise(m1 = sum(value2))
Edit2: I found out what was wrong with the above (the group_by was missing before the case_when). The version below works but still produces a lot of warnings, so I recommend following TarJae's response below.
df %>% pivot_longer(!group_name) %>% group_by(group_name) %>% mutate(value2 = case_when(grepl(group_name, name) ~ 0, TRUE ~ value)) %>% group_by(group_name) %>% summarise(m1 = sum(value2))
Here is another option where you can just use group_name directly with the tidyselect helpers:
df %>%
rowwise() %>%
mutate(m1 = rowSums(select(across(starts_with("mean")), -ends_with(group_name)))) %>%
ungroup()
Output
group_name mean_A mean_B mean_C m1
<chr> <dbl> <dbl> <dbl> <dbl>
1 A 1 2 3 5
2 B 2 3 4 6
3 C 3 4 5 7
How it works
The row-wise output of across is a 1-row tibble containing only the variables that start with "mean".
select unselects the subset of the variables from output by across that end with the value from group_name.
At this point you are left with a 1 x 2 tibble, which is then summed using rowSums.
Here is one way how we could do it:
We create a helper column to match column names
We set the value of a mean column to zero if its column name matches the helper name.
Then we use transmute with select to calculate rowSums
Finally we cbind column m1 to df:
library(dplyr)
df %>%
mutate(helper = paste0("mean_", group_name)) %>%
mutate(across(starts_with("mean"), ~ifelse(cur_column()==helper, 0, .))) %>%
transmute(m1 = select(., contains("mean")) %>%
rowSums()) %>%
cbind(df)
m1 group_name mean_A mean_B mean_C
1 5 A 1 2 3
2 6 B 2 3 4
3 7 C 3 4 5
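For many groups, the same result can also be sketched in base R by subtracting each row's own-group value from the row total. This assumes every value in group_name has a matching mean_<group> column; mean_cols, own_col and own_val are illustrative names, not part of the original answers.

```r
df <- data.frame(
  group_name = c("A", "B", "C"),
  mean_A = c(1, 2, 3), mean_B = c(2, 3, 4), mean_C = c(3, 4, 5)
)

mean_cols <- grep("^mean_", names(df), value = TRUE)
m <- as.matrix(df[mean_cols])

# Column position of each row's own group
# (assumption: a mean_<group> column exists for every group_name)
own_col <- match(paste0("mean_", df$group_name), mean_cols)

# Indexing a matrix with (row, column) pairs picks one value per row
own_val <- m[cbind(seq_len(nrow(m)), own_col)]

# Row total minus the excluded column
df$m1 <- rowSums(m) - own_val
df
```

This avoids rowwise() altogether, which may matter when the data has many rows.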

filter using dplyr to filter where columns and rows match values

I have an example stripped down dataset combined using this:
library(tidyverse)
composite <- inner_join(shared, taxonomy, by="otu") %>%
group_by(group, taxonomy) %>%
filter(count > 3)
The data looks like this:
group otu count taxonomy
<fct> <fct> <dbl> <chr>
1 17VSD otu001 4559 Escherichia-Shigella
2 17VSD otu002 870 Enterobacteriaceae_unclassified
3 17VSD otu020 63 Cupriavidus
4 17VSD otu072 24 Escherichia-Shigella
5 17VSD otu080 16 Escherichia-Shigella
6 17VSD otu205 4 Escherichia-Shigella
7 YG1 otu001 15 Escherichia-Shigella
8 YG1 otu002 15 Enterobacteriaceae_unclassified
9 YG1 otu004 504 Corynebacterium
10 YG1 otu006 500 Cutibacterium
The group column has 3 distinct values.
I'm having a lot of trouble with the syntax for the next filter step. I want every group to contain exactly the same set of values in the otu column. In the data we can see row 9 - YG1 otu004 504 Corynebacterium. I would like to remove this row because group 17VSD does not have otu004. The same must also hold for the other group (YG2). Of course, this will get very complicated with the full data set, where there are 20+ groups and millions of otus.
I've tried expanding filter(count > 3 & ...) but that doesn't seem to be the right direction. Otherwise I have just done lots of searching through examples. My problem may also be that I'm not using the correct terminology to search for a solution.
As Ben had pointed out I can use n_distinct() to solve the issue. Thank you Ben!
to get what I needed for the dummy set:
composite <- inner_join(shared, taxonomy, by="otu") %>%
group_by(group, taxonomy) %>%
filter(count >3) %>%
group_by(otu) %>%
filter(n_distinct(group) == 3)
I also attempted Pablo Serrati's answer. It works, but it's a little bit complicated. Thank you Pablo.
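A self-contained sketch of the n_distinct() approach on made-up data follows. The group and otu values here are invented for illustration, and the number of groups is computed rather than hard-coded as 3, on the assumption that an otu should be kept only if it appears in every group that remains after the count filter.

```r
library(dplyr)

# Made-up stand-in for the joined and count-filtered data
composite <- data.frame(
  group = c("g1", "g1", "g2", "g2", "g3", "g3", "g3"),
  otu   = c("otu001", "otu002", "otu001", "otu004",
            "otu001", "otu002", "otu004"),
  count = c(40, 12, 15, 504, 9, 7, 30)
)

# Number of groups present, so the filter scales to 20+ groups
n_groups <- n_distinct(composite$group)

# Keep only otus that occur in every group
shared_otus <- composite %>%
  group_by(otu) %>%
  filter(n_distinct(group) == n_groups) %>%
  ungroup()

shared_otus
```

Here only otu001 survives, because otu002 is missing from g2 and otu004 is missing from g1.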
I don't know if this is the simple way, but... you can reshape your data with pivot_, then use this table to select columns with apply, and make a new reshape.
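library(dplyr)
library(tidyr)
# Example data (any seed will do; it only makes the example reproducible)
set.seed(1)
composite <- data.frame(group = rep(c("a", "b", "c"), c(9, 11, 5)),
otu = sample(paste0("otu00", 0:9),
size = 25, replace = TRUE),
count = sample(1:20, size = 25, replace = TRUE))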
# Deleting possible duplicated group+otu
composite <- composite[!duplicated(composite[c("group", "otu")]), ]
# reshape data
composite_reshape <- composite |>
arrange(group, otu) |>
pivot_wider(id_cols = "group", names_from = "otu", values_from = "count")
# Select complete otu columns
composite_reduce <- composite_reshape[!apply(composite_reshape , 2, anyNA)]
# New reshape
composite_final <- composite_reduce |>
pivot_longer(cols = -group )
composite_final

Sum duplicated columns in dataframe in R

Hello, I have the following dataframe:
colnames(tv_viewing_time) <- c("channel_1", "channel_2", "channel_1", "channel_2")
Each row gives the viewing time for an individual on channel 1 and channel 2. For instance, for individual 1 I get:
tv_viewing_time[1,] <- c(1,2,4,5)
What I would like is actually a dataframe that sums up the values of duplicated columns.
I.e. I would get
colnames(tv_viewing_time) <- c("channel_1", "channel_2")
where, for instance, for individual 1 I would get:
tv_viewing_time[1,] <- c(5,7)
since the entries in each row are summed whenever they share a column name.
I have looked for an answer, but none of the suggestions on other threads worked for my dataframe.
Note that there are many more duplicated columns, so I am looking for a solution that can be applied efficiently to all my duplicates.
We could use split.default with rowSums
sapply(split.default(tv_viewing_time,
sub("\\.\\d+$", "", names(tv_viewing_time))), rowSums)
-output
# channel_1 channel_2
# 5 7
Or using tidyverse
library(dplyr)
library(tidyr)
library(stringr)
tv_viewing_time %>%
pivot_longer(cols = everything()) %>%
group_by(name = str_remove(name, "\\.\\d+$")) %>%
summarise(value = sum(value)) %>%
pivot_wider(names_from = name, values_from = value)
# A tibble: 1 x 2
# channel_1 channel_2
# <dbl> <dbl>
#1 5 7
data
tv_viewing_time <- data.frame(channel_1 = 1, channel_2 = 2,
channel_1 = 4, channel_2 = 5)
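If the data frame really does carry exact duplicate names (no .1 suffixes), a base R sketch with rowsum() may be simpler. It assumes all columns are numeric and uses check.names = FALSE to keep the duplicated names; the two-row data and the names tv and summed are made up for the example.

```r
# Two individuals, with duplicated column names preserved
tv <- data.frame(channel_1 = c(1, 10), channel_2 = c(2, 20),
                 channel_1 = c(4, 40), channel_2 = c(5, 50),
                 check.names = FALSE)

# rowsum() adds up rows that share a group label, so transpose first
# (columns become rows), sum by column name, then transpose back
summed <- t(rowsum(t(as.matrix(tv)), group = names(tv)))

summed
```

The result is a matrix with one column per distinct name; wrap it in as.data.frame() if a data frame is needed.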

Generating demographic tables/frequencies of a large number of specific variables in R

I have larger data sets that for a 'first run' require basic frequencies by a group (groups are flagged with 1 or 0 in columns). The issue is that some of the basic frequencies are for a very large number of variables (180 or so) that aren't named with a specific prefix and aren't positioned say in columns 2:100. They could be in columns 2:80, and then say 90:117, etc.
I get the basic gist of doing this would be something like this:
mtcars %>% filter(gear == 4) %>% group_by(am) %>% summarise(n=n()) %>% mutate(perc = n / sum(n)*100)
One issue is that my 3-5 groups are all flagged in separate columns, so I would need to use filter(pop1 == 1), filter(pop2 == 1), etc. But is there a way for the group_by to go through 180 variables with an output for each variable? It's just a simple frequency of each variable's values or missing. So for the mtcars, it would be a frequency of just am, and then just vs. It would be fine for the output to be one long one with two columns.
Like this:
Variable Value n Perc
am 0 4 33.3
am 1 8 66.7
vs 0 2 16.7
vs 1 10 83.3
I recognize this might involve gather or pivot_longer, but could not figure out a way to only transform the 180 variables out of say 200 that I need frequencies on to long.
Edit:
I ended up using this to select for many columns:
positions <- c(4:176,198)
And using select(positions) to circumvent typing in all of the variables.
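One possible way to cover several flag columns in one pass is sketched below with mtcars. Here vs and am stand in for flag columns such as pop1 and pop2, and gear and cyl stand in for the variables of interest; the function name freq_for_flag is hypothetical.

```r
library(dplyr)
library(tidyr)

flags <- c("vs", "am")   # stand-ins for pop1, pop2, ... (flagged with 1 or 0)
vars  <- c("gear", "cyl")  # stand-ins for the ~180 variables to tabulate

# Frequencies of every variable in `vars` for rows where `flag` equals 1
freq_for_flag <- function(flag) {
  mtcars %>%
    filter(.data[[flag]] == 1) %>%
    select(all_of(vars)) %>%
    pivot_longer(everything(), names_to = "Variable", values_to = "Value") %>%
    count(Variable, Value) %>%
    group_by(Variable) %>%               # percentages within each variable
    mutate(Perc = n / sum(n) * 100) %>%
    ungroup() %>%
    mutate(Population = flag, .before = 1)
}

# One long table covering every flag
result <- bind_rows(lapply(flags, freq_for_flag))
result
```

With a vector of column positions instead of names, all_of(positions) should work in the same place as all_of(vars).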
We can use map from purrr to separately do the grouping and apply the same code as in the OP's post
library(dplyr)
library(purrr)
map_dfr(c('am', 'vs'), ~
mtcars %>%
filter(gear == 4) %>%
group_by(Variable = .x, Value = !!rlang::sym(.x)) %>%
summarise(n = n()) %>%
mutate(perc = n/sum(n) * 100))
# A tibble: 4 x 4
# Groups: Variable [2]
# Variable Value n perc
# <chr> <dbl> <int> <dbl>
#1 am 0 4 33.3
#2 am 1 8 66.7
#3 vs 0 2 16.7
#4 vs 1 10 83.3
Or another option is to convert to 'long' format with pivot_longer
library(tidyr)
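mtcars %>%
filter(gear == 4) %>%
select(vs, am) %>%
pivot_longer(everything()) %>%
count(name, value) %>%
# group by variable so each percentage is taken within that variable
group_by(name) %>%
mutate(perc = n/sum(n) * 100) %>%
ungroup()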

Parsing a Hierarchy in a String Value

I am trying to create an edge list from a single character vector. My list to be processed is over 93k elements long, but as an example I will provide a small excerpt.
The character strings are part of the ICD10 code hierarchy, and the parent-child relationships exist within the strings themselves. That means that a single string, "A0101", would have a parent of "A010".
It would look like this:
A00
A000
A001
A009
A01
A010
A0100
A0101
A02
A03
etc.
My vector does not contain any other data except the strings, but I basically need to convert
dat <- c("A00", "A000", "A001", "A009", "A01", "A010", "A0100", "A0101", "A02")
into an edge list formatted as follows...
# (A00, A000)
# (A00, A001)
# (A00, A009)
# (A01, A010)
# (A010, A0100)
# (A010, A0101)
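Taking the rule above literally (a code's parent is the same code with its last character removed), the edge list for the excerpt can be sketched in base R as follows. This assumes the parent is always exactly one character shorter and is itself present in the vector; codes whose candidate parent is absent, such as the three-character roots, are simply dropped.

```r
dat <- c("A00", "A000", "A001", "A009", "A01", "A010", "A0100", "A0101", "A02")

# Candidate parent: the code with its last character removed
parent <- substr(dat, 1, nchar(dat) - 1)

# Keep only pairs whose candidate parent actually occurs in the data
has_parent <- parent %in% dat

edges <- data.frame(parent = parent[has_parent],
                    child  = dat[has_parent],
                    stringsAsFactors = FALSE)
edges
```

Everything here is vectorised, so it should scale to a vector of 93k codes without a loop.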
I am fairly certain there are more efficient ways to accomplish this, but this excerpt of code should load the ICD10 CM data from the icd.data package, use the children detection system from the icd package, and then make extensive use of the tidyverse to return an edge list. I had to get a bit creative to connect the "top" of the hierarchies, since the ICD10 data does not include the chapters and sub-chapters as individual 2 or 1 digit codes.
Basically sub-chapters become 2 digit codes, chapters become 1 digit codes, and then there is a root node to connect everything at the top.
library(icd.data)
icd10 <- icd10cm2016
library(icd)
code_children <- lapply(icd10$code, children)
code_vec <- sapply(code_children, paste, collapse = ",")
code_df <- as.data.frame(code_vec, stringsAsFactors = F)
library(dplyr);library(stringr);library(tidyr)
code_df_new <- code_df %>%
mutate(parent = sapply(strsplit(code_vec,","), "[", 1)) %>%
separate(code_vec,
paste("code", 1:max(str_count(code_df$code_vec, ",")), sep ="."),
",",extra = "merge")
library(reshape2)
edgelist <- melt(code_df_new, id = "parent") %>%
filter(!is.na(value)) %>%
select(parent, child = value) %>%
arrange(parent)
edgelist <- subset(edgelist, edgelist$parent != edgelist$child)
edgelist <- subset(edgelist, nchar(edgelist$child) == nchar(edgelist$parent) + 1)
subchaps <- icd10 %>% select(three_digit, sub_chapter, chapter) %>%
mutate(two_digit = substr(three_digit, 1, 2)) %>%
select(parent = two_digit, child = three_digit) %>%
distinct()
chaps <- icd10 %>% select(three_digit, sub_chapter, chapter) %>%
mutate(
two_digit = substr(three_digit, 1, 2),
one_digit = substr(three_digit, 1, 1)) %>%
select(parent = one_digit, child = two_digit) %>%
distinct()
root <- icd10 %>% select(three_digit) %>%
mutate(parent = "root", child = substr(three_digit, 1, 1)) %>%
select(parent, child) %>%
distinct()
edgelist_final <- edgelist %>%
bind_rows(list(chaps, subchaps, root)) %>%
arrange(parent)
If anybody has any tips or methods to improve the efficiency of this code I am all ears. (eyes?)
On the assumption that the lengths of the node names in ICD10 fully define the order (with shorter ones being parents), here's an approach that connects each node with its immediate parent, if available.
While I think the logic is legible here, I'd be curious to see what a more streamlined solution would look like.
# Some longer fake data to prove that it works acceptably
# with 93k rows (took a few seconds). These are just
# numbers of different lengths, converted to characters, but they
# should suffice if the assumption about length = order is correct.
# Load packages first: the pipe and str_length() are used below
library(dplyr); library(tidyr); library(stringr)
set.seed(42)
fake <- runif(93000, 0, 500) %>%
magrittr::raise_to_power(3) %>%
as.integer() %>%
as.character()
# Step 1 - prep
fake_2 <- fake %>%
tibble(value = .) %>%
mutate(row = row_number()) %>%
# Step 2 - widen by level and fill in all parent nodes
mutate(level = str_length(value)) %>%
spread(level, value) %>%
fill(everything()) %>%
# Step 3 - Get two highest non-NA nodes
gather(level, code, -row) %>%
arrange(row, level) %>%
filter(!is.na(code)) %>%
group_by(row) %>%
top_n(2, wt = level) %>%
# Step 4 - Spread once more to get pairs
mutate(pos = row_number()) %>%
ungroup() %>%
select(-level) %>%
spread(pos, code)
Output on OP data
# A tibble: 9 x 3
row `1` `2`
<int> <chr> <chr>
1 1 A00 NA
2 2 A00 A000
3 3 A00 A001
4 4 A00 A009
5 5 A01 A009
6 6 A01 A010
7 7 A010 A0100
8 8 A010 A0101
9 9 A010 A0101
Output on 93k fake data
> head(fake, 10)
[1] "55174190" "50801321" "46771275" "6480673"
[5] "20447474" "879955" "4365410" "11434009"
[9] "5002257" "9200296"
> head(fake_2, 10)
# A tibble: 10 x 3
row `1` `2`
<int> <chr> <chr>
1 1 55174190 NA
2 2 50801321 NA
3 3 46771275 NA
4 4 6480673 46771275
5 5 6480673 20447474
6 6 6480673 20447474
7 7 4365410 20447474
8 8 4365410 11434009
9 9 5002257 11434009
10 10 9200296 11434009

Resources