Sum column with similar name - r

I have a Dataset in R like this (my real dataset has more rows and columns):
AB1 AB3 AB4 XB1 XB3 XB4
12 34 0 5 3 7
I need to sum the columns with similar names, like
AB1+XB1 AB3+XB3 AB4+XB4
What code can I use?

You could use
library(dplyr)
df %>%
mutate(across(starts_with("AB"),
~.x + df[[gsub("AB", "XB", cur_column())]],
.names = "sum_{.col}"))
This returns
# A tibble: 1 x 9
AB1 AB3 AB4 XB1 XB3 XB4 sum_AB1 sum_AB3 sum_AB4
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 12 34 0 5 3 7 17 37 7
We use across and mutate in this approach.
First we select all columns starting with AB. The desired sums are always ABn + XBn, so we can use this pattern.
Next we replace AB in the name of the current selected column with XB and sum those two columns. These sums are stored in a new column prefixed with sum_.
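As a rough cross-check of the pairing logic described above, here is a minimal base R sketch. It assumes every AB column has an XB partner with the same numeric suffix; the names ab_cols, xb_cols and sums are illustrative only and do not come from the answer.

```r
# Example data from the question (one row)
df <- data.frame(AB1 = 12, AB3 = 34, AB4 = 0, XB1 = 5, XB3 = 3, XB4 = 7)

# Columns starting with "AB", and the partner names obtained by swapping the prefix
ab_cols <- grep("^AB", names(df), value = TRUE)
xb_cols <- sub("^AB", "XB", ab_cols)

# Element-wise sum of the paired columns, named like the dplyr result
sums <- df[ab_cols] + df[xb_cols]
names(sums) <- paste0("sum_", ab_cols)

result <- cbind(df, sums)
print(result)
```

If an XB partner is missing, the df[xb_cols] step stops with an "undefined columns selected" error, which is arguably a useful safeguard here.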

Assuming only the first character changes and the remaining characters define the group:
df=read.table(text="
AB1 AB3 AB4 XB1 XB3 XB4
12 34 0 5 3 7
11 35 1 7 2 8",h=T)
sapply(
unique(substr(colnames(df),2,100)),
function(x){
rowSums(df[,grepl(x,colnames(df))])
}
)
B1 B3 B4
[1,] 17 37 7
[2,] 18 37 9

We can try the code below
cbind(
df,
list2DF(lapply(
split.default(df, gsub("\\D+", "", names(df))),
rowSums
))
)
which gives
AB1 AB3 AB4 XB1 XB3 XB4 1 3 4
1 12 34 0 5 3 7 17 37 7

Try this:
library(tidyverse)
tribble(
~AB1, ~AB3, ~AB4, ~XB1, ~XB3, ~XB4,
12, 34, 0, 5, 3, 7
) |>
pivot_longer(everything(), names_pattern = "(\\w\\w)(\\d)", names_to = c("prefix", "suffix")) |>
pivot_wider(names_from = prefix) |>
rowwise() |>
mutate(sum = sum(c_across(- suffix)))
#> # A tibble: 3 × 4
#> # Rowwise:
#> suffix AB XB sum
#> <chr> <dbl> <dbl> <dbl>
#> 1 1 12 5 17
#> 2 3 34 3 37
#> 3 4 0 7 7
Created on 2022-05-11 by the reprex package (v2.0.1)

If you know that structure is consistent (an "A" and "X" pair for everything), then this should work.
cols <- unique(substring(names(df), 2))
df[paste0("A", cols)] + df[paste0("X", cols)]
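Since this approach depends on every suffix having both an "A" and an "X" column, it may be worth checking that assumption before adding. The sketch below uses the two-row example data from another answer in this thread; needed, missing_cols and pair_sums are hypothetical names chosen for illustration.

```r
# Two-row example data
df <- read.table(text = "
AB1 AB3 AB4 XB1 XB3 XB4
12 34 0 5 3 7
11 35 1 7 2 8", header = TRUE)

# Suffixes shared by each pair, e.g. "B1", "B3", "B4"
cols <- unique(substring(names(df), 2))

# Stop early if any expected "A..." or "X..." column is absent
needed <- c(paste0("A", cols), paste0("X", cols))
missing_cols <- setdiff(needed, names(df))
if (length(missing_cols) > 0) {
  stop("Unpaired columns: ", paste(missing_cols, collapse = ", "))
}

# Pairwise sums; the result keeps the names of the "A" columns
pair_sums <- df[paste0("A", cols)] + df[paste0("X", cols)]
print(pair_sums)
```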

Using the 2-row DF2 in the Note as input, calculate the suffixes (s) and the unique suffixes (u), then perform the indicated matrix multiplication, giving m. Finally, convert that back to a data frame and set the names. No packages are used.
s <- substring(names(DF2), 2)
u <- unique(s)
m <- as.matrix(DF2) %*% outer(s, u, `==`)
sums <- setNames(as.data.frame(m), u); sums
## B1 B3 B4
## 1 17 37 7
## 2 17 37 7
If it is desired to append these as columns to DF2 then:
data.frame(DF2, sum = sums)
## AB1 AB3 AB4 XB1 XB3 XB4 sum.B1 sum.B3 sum.B4
## 1 12 34 0 5 3 7 17 37 7
## 2 12 34 0 5 3 7 17 37 7
Note
DF <- structure(list(AB1 = 12L, AB3 = 34L, AB4 = 0L, XB1 = 5L, XB3 = 3L,
XB4 = 7L), class = "data.frame", row.names = c(NA, -1L))
DF2 <- rbind(DF, DF)
DF2
## AB1 AB3 AB4 XB1 XB3 XB4
## 1 12 34 0 5 3 7
## 2 12 34 0 5 3 7
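To see why the matrix multiplication in this answer produces the grouped sums, it can help to look at the indicator matrix itself. The following is only an illustrative sketch on the one-row DF; the name ind is an assumption, and the logical matrix is multiplied by 1 purely to display it as 0/1.

```r
DF <- data.frame(AB1 = 12L, AB3 = 34L, AB4 = 0L, XB1 = 5L, XB3 = 3L, XB4 = 7L)

s <- substring(names(DF), 2)  # suffix of every column
u <- unique(s)                # one entry per group

# ind[i, j] is 1 when column i belongs to suffix group j, otherwise 0
ind <- outer(s, u, "==") * 1
dimnames(ind) <- list(names(DF), u)
print(ind)

# Each output column adds up the input columns flagged with 1
m <- as.matrix(DF) %*% ind
print(m)
```

Each column of ind has exactly two 1s (one for the A column and one for the X column), so every entry of m is the sum of one pair.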

An option with across2 from dplyover
library(dplyover)
df1 %>%
mutate(across2(starts_with('AB'), starts_with('XB'),
~ .x + .y, .names = "sum_{xcol}"))
AB1 AB3 AB4 XB1 XB3 XB4 sum_AB1 sum_AB3 sum_AB4
1 12 34 0 5 3 7 17 37 7

Related

How to get this outcome in R

I have multiple data frames. Here I have demonstrated 3 data frames with different rows.
dat1<-read.table (text=" D Size1
A1 12
A2 18
A3 16
A4 14
A5 11
A6 0
Value1 25
Score1 30
", header=TRUE)
dat2<-read.table (text=" D Size2
S12 5
S13 9
S14 11
S15 12
S16 12
Value2 40
Score2 45
", header=TRUE)
dat3<-read.table (text=" D Size2
S17 0
S19 1
S22 2
S33 1
Value3 22
Score3 60
", header=TRUE)
I want to get the following outcome:
D Value Score
1 25 30
2 40 45
3 22 60
I need a data frame containing only the value and score rows.
We may have to filter the rows after binding the datasets into a single data frame, and then use pivot_wider to reshape it back to wide format
library(dplyr)
library(tidyr)
library(stringr)
bind_rows(dat1, dat2, dat3) %>%
filter(str_detect(D, '(Value|Score)\\d+')) %>%
separate(D, into = c("colnm", "D"), sep = "(?<=[a-z](?=\\d))") %>%
group_by(colnm, D) %>%
transmute(Score = coalesce(Size1, Size2)) %>%
ungroup %>%
pivot_wider(names_from = colnm, values_from = Score)
-output
# A tibble: 3 × 3
D Value Score
<chr> <int> <int>
1 1 25 30
2 2 40 45
3 3 22 60
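The filtering and splitting steps in the pipeline above can be tried in isolation. This is a small base R sketch on a hypothetical vector of labels (not the full data frames), showing which labels survive the filter and how each one separates into a column name and an index.

```r
# Hypothetical sample of the D labels found in dat1 and dat2
labels <- c("A1", "A2", "Value1", "Score1", "S12", "Value2", "Score2")

# Keep only labels of the form Value<digits> or Score<digits>
keep <- grepl("^(Value|Score)\\d+$", labels)
kept <- labels[keep]

# Text part becomes the future column name, digit part becomes D
parts <- data.frame(
  colnm = sub("\\d+$", "", kept),
  D = sub("^\\D+", "", kept),
  stringsAsFactors = FALSE
)
print(parts)
```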
Or an option in base R
do.call(rbind, Map(function(dat, y) data.frame(D = y,
Value = dat[[2]][grepl('Value', dat$D)],
Score = dat[[2]][grepl('Score', dat$D)]), list(dat1, dat2, dat3), 1:3))
D Value Score
1 1 25 30
2 2 40 45
3 3 22 60

Define groups of columns and sum all i-th columns of each groups with dplyr

I have two groups of columns, each with 36 columns, and I want to sum the i-th column of group 1 with the i-th column of group 2, getting 36 columns. The number of columns in each group is not fixed in my code, although each group has the same number of them.
Example. What I have:
teste <- tibble(a1=c(1,2,3),a2=c(7,8,9),b1=c(4,5,6),b2=c(10,20,30))
a1 a2 b1 b2
<dbl> <dbl> <dbl> <dbl>
1 1 7 4 10
2 2 8 5 20
3 3 9 6 30
What I want:
resultado <- teste %>%
summarise(
a_b1 = a1+b1,
a_b2 = a2+b2
)
a_b1 a_b2
<dbl> <dbl>
1 5 17
2 7 28
3 9 39
It would be nice to perform this operation with dplyr.
I would appreciate any help.
You will struggle to find a dplyr solution as simple and elegant as the base R one:
teste[1:2] + teste[3:4]
#> a1 a2
#> 1 5 17
#> 2 7 28
#> 3 9 39
Though I guess in dplyr you get the same result with:
teste %>% select(starts_with("a")) + teste %>% select(starts_with("b"))
teste %>%
summarise(across(starts_with("a")) + across(starts_with("b")))
# A tibble: 3 x 2
a1 a2
<dbl> <dbl>
1 5 17
2 7 28
3 9 39
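Because the question says the number of columns per group is not fixed, the positional approach (teste[1:2] + teste[3:4]) could be generalized roughly as follows. This sketch assumes the first half of the columns is group 1 and the second half is group 2, in matching order; it uses a plain data.frame instead of a tibble so that no packages are needed.

```r
teste <- data.frame(a1 = c(1, 2, 3), a2 = c(7, 8, 9),
                    b1 = c(4, 5, 6), b2 = c(10, 20, 30))

# Number of columns per group, derived from the data
n <- ncol(teste) / 2
stopifnot(n == floor(n))  # the two groups must have equal size

# i-th column of group 1 plus i-th column of group 2
resultado <- teste[seq_len(n)] + teste[n + seq_len(n)]
names(resultado) <- paste0("a_b", seq_len(n))
print(resultado)
```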
This might also help in base R:
as.data.frame(do.call(cbind, lapply(split.default(teste, sub("\\D(\\d+)", "\\1", names(teste))), rowSums, na.rm = TRUE)))
1 2
1 5 17
2 7 28
3 9 39
Another dplyr solution. We can use rowwise and c_across together to sum the values per row. Notice that we can add na.rm = TRUE to the sum function in this case.
library(dplyr)
teste2 <- teste %>%
rowwise() %>%
transmute(a_b1 = sum(c_across(ends_with("1")), na.rm = TRUE),
a_b2 = sum(c_across(ends_with("2")), na.rm = TRUE)) %>%
ungroup()
teste2
# # A tibble: 3 x 2
# a_b1 a_b2
# <dbl> <dbl>
# 1 5 17
# 2 7 28
# 3 9 39

Putting row names and column names when converting from list to data frame

I know that this question might be similar to the previous ones, e.g., this and this. However, I found it confusing to add the row names and column names as a result of converting from list to data frame as follows:
library(FSA)
library(FSAdata)
data("RuffeSLRH92")
str(RuffeSLRH92)
ruffe2 <- Subset(RuffeSLRH92,!is.na(weight) & !is.na(length))
ruffe2$logL <- log(ruffe2$length)
ruffe2$logW <- log(ruffe2$weight)
data <- Subset(ruffe2,logW >= -0.5)
LWfunction <- function(x) {
fits <- lm(log(weight) ~ log(length), data = x)
a <- hoCoef(fits, 2,3)
b <- confint(fits)
output <- list(a, b)
return(output)
}
output <- by(data[c("weight", "length")], data[c("month", "year")], LWfunction)
df <- data.frame(matrix(unlist(output), nrow=7, byrow=TRUE),stringsAsFactors=FALSE)
df
The idea is to extract the hoCoef and confint results from a log-transformed linear regression of the fish length-weight relationship, and to aggregate the results into a readable data frame. With the code above I managed to extract the "raw" result:
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1 2 3 3.000857 0.03958601 0.02164589 58 9.828047e-01 -11.60960 2.921617 -10.86960
2 2 3 2.880604 0.03154619 -3.78478744 64 3.415156e-04 -10.94504 2.817584 -10.35515
3 2 3 2.859603 0.03171993 -4.42615042 152 1.821503e-05 -10.92607 2.796934 -10.33690
4 2 3 2.865718 0.01889957 -7.10501173 147 4.811825e-11 -10.74430 2.828368 -10.39930
5 2 3 2.893662 0.03124268 -3.40362699 67 1.126571e-03 -11.01110 2.831301 -10.45753
6 2 3 3.022135 0.03257380 0.67954496 114 4.981701e-01 -11.67896 2.957607 -11.08538
7 2 3 2.996446 0.03140263 -0.11316551 64 9.102536e-01 -11.51532 2.933712 -10.94305
X11
1 3.080097
2 2.943625
3 2.922272
4 2.903068
5 2.956022
6 3.086664
7 3.059180
So how can I get the desired output like this:
year month term Ho Value Estimate Std. Error T df p-value 2.5% 97.5%
In LWfunction, return a one-row data frame with all the required values in it.
library(FSA)
library(FSAdata)
library(dplyr)
library(tidyr)
LWfunction <- function(x) {
fits <- lm(log(weight) ~ log(length), data = x)
a <- hoCoef(fits, 2,3)
b <- confint(fits)
output <- cbind(a, data.frame(intercept_2.5 = b[1, 1],
intercept_97.5 = b[1, 2],
log_length_2.5 = b[2, 1],
log_length_97.5 = b[2, 2]))
return(output)
}
Apply it for each year and month:
result <- data %>%
group_by(month, year) %>%
summarise(output = list(LWfunction(cur_data()))) %>%
ungroup %>%
unnest(output)
result
# A tibble: 7 x 13
# month year term `Ho Value` Estimate `Std. Error` T df
# <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 4 1992 2 3 3.00 0.0396 0.0216 58
#2 5 1992 2 3 2.88 0.0315 -3.78 64
#3 6 1992 2 3 2.86 0.0317 -4.43 152
#4 7 1992 2 3 2.87 0.0189 -7.11 147
#5 8 1992 2 3 2.89 0.0312 -3.40 67
#6 9 1992 2 3 3.02 0.0326 0.680 114
#7 10 1992 2 3 3.00 0.0314 -0.113 64
# … with 5 more variables: `p value` <dbl>, intercept_2.5 <dbl>,
# intercept_97.5 <dbl>, log_length_2.5 <dbl>,
# log_length_97.5 <dbl>
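The same pattern (return one row per group, then bind the rows) can be sketched without FSA or dplyr. The example below is an assumption-laden illustration: it uses the built-in mtcars data instead of the Ruffe data, a hypothetical helper fit_one, and plain coef(summary()) output in place of hoCoef.

```r
# Hypothetical helper: one-row data frame per fitted group
fit_one <- function(d) {
  fit <- lm(log(mpg) ~ log(wt), data = d)
  est <- coef(summary(fit))["log(wt)", ]  # slope row of the coefficient table
  ci <- confint(fit)["log(wt)", ]         # 95% confidence interval of the slope
  data.frame(estimate = est[["Estimate"]],
             std_error = est[["Std. Error"]],
             ci_2.5 = ci[[1]],
             ci_97.5 = ci[[2]])
}

# Fit each group separately, then stack the one-row results
pieces <- lapply(split(mtcars, mtcars$cyl), fit_one)
result <- cbind(cyl = names(pieces), do.call(rbind, pieces))
print(result)
```

Because every group returns a data frame with identical column names, the row and column labels come out right without any manual relabeling.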

How to use R to create baseball Splits from Game Logs

I'm trying to use R to recreate Baseball Splits as found on MLB.com. The splits are created from Game Logs and provide different cuts of the data. For example, home games vs. away games, day games vs. night games, August vs. September and many more, all in one convenient table. I believe the ratios (AVG, OBP, SLG) can all be added via mutate once the basic splits have been totaled.
My question is: what is the best and most efficient way to create these splits, and how should the data be shaped? The game log obviously has additional (hidden) column(s) that contain the split topics. The nature of the problem leads me to believe purrr might be a tool to employ, but I can't quite wrap my mind around how to approach this one.
Here is how I believe the data should be shaped and a link to a sample game log. I would appreciate any thoughts, ideas or solutions to this problem.
Links and images of Game Logs and Splits for Nationals outfielder Juan Soto are set forth below.
Game Logs: Juan Soto Game Log
Splits: Juan Soto Game Splits
I've gone through the dataset, although I'm not sure whether the sums or the averages match the images above.
You're right about mutating for creating the values you suggest.
However, hopefully my approach can help you get what you're after.
library(tidyverse)
library(data.table)
game.splits <- "https://raw.githubusercontent.com/MundyMSDS/GAMELOG/main/SAMPLE_GAME_LOG.csv"
game.splits <- fread(game.splits, fill = TRUE)
game.splits.pivot <- game.splits
game.splits.pivot$Var1 <- ifelse(game.splits.pivot$Var1 %in% "HOME", 1, 0)
game.splits.pivot$Var2 <- ifelse(game.splits.pivot$Var2 %in% "NIGHT", 3, 2)
game.splits.pivot$Var3 <- ifelse(game.splits.pivot$Var3 %in% "SEPTEMBER", 5, 4)
game.splits.pivot <- game.splits.pivot %>% pivot_longer(-c(1:16, 20))
colnames(game.splits.pivot)[19] <- "name_c"
game.splits.pivot <- game.splits.pivot[, -c(17, 18)]
game.splits.pivot <- game.splits.pivot %>% pivot_longer(-c(1:3, 17))
#test
game.splits.pivot_test <- game.splits.pivot[, -c(1, 2, 3)]
game.splits.pivot_test <- aggregate(value ~ name_c + name, game.splits.pivot_test, sum)
game.splits.pivot_test <- game.splits.pivot_test %>% pivot_wider(names_from = name, values_from = value)
lc_name <- tibble(name_c = 0:5, split = c("HOME", "AWAY", "DAY", "NIGHT", "AUGUST", "SEPTEMBER"))
game.splits.pivot_test <- game.splits.pivot_test %>%
inner_join(lc_name, by = "name_c") %>%
arrange(name_c) %>%
select(-name_c)
game.splits.pivot_test <- game.splits.pivot_test[, c(14, 3, 9, 6, 1, 2, 7, 10, 4, 8, 12, 11, 5, 13)]
A look into the dataset:
# A tibble: 6 x 14
split AB R H `2B` `3B` HR RBI BB IBB SO SB CS TB
<chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 HOME 88 24 32 5 0 9 23 15 5 12 1 2 64
2 AWAY 66 15 22 9 0 4 14 26 7 16 5 0 43
3 DAY 29 21 18 4 0 5 17 12 4 3 4 0 37
4 NIGHT 125 18 36 10 0 8 20 29 8 25 2 2 70
5 AUGUST 90 21 33 6 0 11 25 13 1 13 1 1 72
6 SEPTEMBER 64 18 21 8 0 2 12 28 11 15 5 1 35
This turned out to be more straightforward than I had thought. The following solution relies upon pivot_longer to shape the data and summarise_if to tally the splits; no rbinds or purrr needed.
library(tidyverse)
game.splits <- "https://raw.githubusercontent.com/MundyMSDS/GAMELOG/main/SAMPLE_GAME_LOG.csv"
game.splits <- read_csv(game.splits)
game.splits %>%
pivot_longer(Var1:Var3, names_to = "split") %>%
group_by(split) %>%
arrange(split) %>%
select(split, value, everything()) %>%
ungroup() %>%
select(split, value, everything()) %>%
select(-Date, -OPP) %>%
mutate(value = str_c(split, "_", value)) %>%
group_by(value) %>%
summarise_if(is.numeric, sum) %>%
mutate(value= str_replace(value, "(Var\\d_)",""))
#> # A tibble: 6 x 14
#> value AB R H TB `2B` `3B` HR RBI BB IBB SO SB
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 AWAY 88 24 32 64 5 0 9 23 15 5 12 1
#> 2 HOME 66 15 22 43 9 0 4 14 26 7 16 5
#> 3 DAY 29 21 18 37 4 0 5 17 12 4 3 4
#> 4 NIGHT 125 18 36 70 10 0 8 20 29 8 25 2
#> 5 AUGUST 90 21 33 72 6 0 11 25 13 1 13 1
#> 6 SEPTE~ 64 18 21 35 8 0 2 12 28 11 15 5
Created on 2021-03-03 by the reprex package (v0.3.0)
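Once the totals exist, the ratio columns mentioned in the question can be added in one step. The sketch below copies the DAY and NIGHT totals from the output above into a small data frame and uses base R transform in place of mutate; it assumes the usual definitions AVG = H / AB and SLG = TB / AB. OBP is left out because it needs hit-by-pitch and sacrifice-fly counts that do not appear in the output above.

```r
# DAY and NIGHT totals copied from the summarised output
splits <- data.frame(
  value = c("DAY", "NIGHT"),
  AB = c(29, 125),
  H = c(18, 36),
  TB = c(37, 70),
  stringsAsFactors = FALSE
)

# Batting average and slugging percentage from the totals
splits <- transform(splits, AVG = H / AB, SLG = TB / AB)
print(splits)
```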

Pivot_longer to manipulate table

I would like to pivot variables nclaims, npatients, nproviders to show up underneath groups.
I believe I should be using pivot_longer but it doesn't work.
library(tidyr)
ptype <- c(0,1,2,0,1)
groups <- c(rep(1,3), rep(2,2))
nclaims <- c(10,23,32,12,8)
nproviders <- c(2,4,5,1,1)
npatients <- c(8, 20, 29, 9, 6)
dta <- data.frame(ptype=ptype, groups=groups, nclaims=nclaims, nproviders=nproviders, npatients=npatients)
table <- pivot_longer(everything(dta), names_to = "groups", values_to=c("nclaims", "npatients", "nproviders"))
Desired output:
We need to use pivot_longer, then pivot_wider:
dta %>%
pivot_longer(nclaims:npatients) %>%
# values_fill = 0 changes NA values to 0, as in your desired result
pivot_wider(names_from = ptype, values_from = value,
values_fill = 0)
groups name `0` `1` `2`
<dbl> <chr> <dbl> <dbl> <dbl>
1 1 nclaims 10 23 32
2 1 nproviders 2 4 5
3 1 npatients 8 20 29
4 2 nclaims 12 8 0
5 2 nproviders 1 1 0
6 2 npatients 9 6 0
Another approach, using reshape2::recast():
library( reshape2 )
recast( dta, groups + variable ~ ptype, id.var = c("ptype", "groups") )
# groups variable 0 1 2
# 1 1 nclaims 10 23 32
# 2 1 nproviders 2 4 5
# 3 1 npatients 8 20 29
# 4 2 nclaims 12 8 NA
# 5 2 nproviders 1 1 NA
# 6 2 npatients 9 6 NA
