Using loops with mutate in R to sum columns with partially matching column names - r

df <- data.frame(x_1_jr=c(1,2,3,4), x_2_jr=c(1,2,3,4), y_1_jr=c(4,3,2,1), y_2_jr=c(4,3,2,1))
x_1_jr x_2_jr y_1_jr y_2_jr
1 1 1 4 4
2 2 2 3 3
3 3 3 2 2
4 4 4 1 1
I am trying to generate new variables that are the sum of x and y with the same column name suffix, i.e.
df <- df %>% mutate(z_1_jr= x_1_jr + y_1_jr)
x_1_jr x_2_jr y_1_jr y_2_jr z_1_jr
1 1 1 4 4 5
2 2 2 3 3 5
3 3 3 2 2 5
4 4 4 1 1 5
I could write this out for each variable combination, but I have a large number of variables (more than 50 in each of the x and y groups) and would like to use a loop. However, I'm relatively new to R and am not sure where to begin!
Can someone help? Thank you!
EDIT: for additional clarity, the dataset contains other non-numeric variables. There are >700 columns (from a large survey). x_1_jr represents, for example, the number of male individuals ages 1 year, y_1_jr female individuals of 1 year. I am trying to get a total (male plus female of 1 year) for each age group.

An option with base R
df[c("z_1_jr", "z_2_jr")] <- sapply(split.default(df,
sub("^[a-z]+_", "", names(df))), rowSums)
df
# x_1_jr x_2_jr y_1_jr y_2_jr z_1_jr z_2_jr
#1 1 1 4 4 5 5
#2 2 2 3 3 5 5
#3 3 3 2 2 5 5
#4 4 4 1 1 5 5

One dplyr and purrr option could be:
library(dplyr)
library(purrr)
df %>%
bind_cols(map_dfc(.x = unique(sub(".*?_", "_", names(df))),
~ df %>%
transmute(!!paste0("z", .x) := rowSums(select(., ends_with(.x))))))
x_1_jr x_2_jr y_1_jr y_2_jr z_1_jr z_2_jr
1 1 1 4 4 5 5
2 2 2 3 3 5 5
3 3 3 2 2 5 5
4 4 4 1 1 5 5
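Since the question asks for a loop, and the real data has several hundred other columns (some non-numeric), here is a minimal base R sketch that loops over the suffixes of the x_ columns only. The `region` column is a hypothetical stand-in for the non-numeric survey variables, and the sketch assumes every column of interest is named `x_<suffix>` or `y_<suffix>`.

```r
df <- data.frame(x_1_jr = c(1, 2, 3, 4), x_2_jr = c(1, 2, 3, 4),
                 y_1_jr = c(4, 3, 2, 1), y_2_jr = c(4, 3, 2, 1),
                 region = c("a", "b", "c", "d"))  # hypothetical non-numeric column

# Suffixes carried by the x_ columns, here "1_jr" and "2_jr"
suffixes <- sub("^x_", "", grep("^x_", names(df), value = TRUE))

for (s in suffixes) {
  x_col <- paste0("x_", s)
  y_col <- paste0("y_", s)
  # Only create z_<suffix> when the matching y_ column exists
  if (y_col %in% names(df)) {
    df[[paste0("z_", s)]] <- df[[x_col]] + df[[y_col]]
  }
}

df
```

Because the loop builds the column names itself, unrelated columns are never touched, and the z_ names do not have to be typed out by hand.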

Related

create multiple columns at once, depending on the number of columns of another df, with dplyr

I want to use dplyr to create a data frame with n columns, where n depends on the number of VALUE columns in data. The column names share the root TIME.; every row of the first column equals 1, every row of the second equals 2, and so on. The number of rows is the same as in data.
data <- data.frame(ID=c(1:6), VALUE.1=c(2,5,7,1,3,5), VALUE.2=c(1,7,2,4,5,4), VALUE.3=c(9,2,6,3,4,4), VALUE.4=c(1,2,3,2,3,8))
The first column of data (ID) should also be the first column of the result. This is what I'd like to have:
ID TIME.1 TIME.2 TIME.3 TIME.4
1 1 2 3 4
2 1 2 3 4
3 1 2 3 4
4 1 2 3 4
5 1 2 3 4
6 1 2 3 4
Now I'm doing:
T1 <- data.frame(ID=unique(data$ID), TIME.1=rep(1, length(unique(data$ID))), TIME.2=rep(2, length(unique(data$ID))), TIME.3=rep(3, length(unique(data$ID))), TIME.4=rep(4, length(unique(data$ID))) )
We can replace the column contents with the suffix in the column name, then rename the columns from VALUE.n to TIME.n.
library(dplyr)
data %>%
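# fixed = TRUE matches the dot literally; as.numeric keeps the new columns numeric
mutate(across(starts_with("VALUE"), ~as.numeric(sub("VALUE.", "", cur_column(), fixed = TRUE)))) %>%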
rename_with(~sub("VALUE", "TIME", .x))
ID TIME.1 TIME.2 TIME.3 TIME.4
1 1 1 2 3 4
2 2 1 2 3 4
3 3 1 2 3 4
4 4 1 2 3 4
5 5 1 2 3 4
6 6 1 2 3 4
Here is a base R approach that may give a similar result. This involves creating a matrix based on your other data.frame data, using its dimensions for column names and determining the number of rows. We subtract 1 from number of columns given the first ID column present.
nc <- ncol(data) - 1
nr <- nrow(data)
as.data.frame(cbind(
ID = data$ID,
matrix(1:nc, ncol = nc, nrow = nr, byrow = TRUE, dimnames = list(NULL, paste0("TIME.", 1:nc)))
))
Output
ID TIME.1 TIME.2 TIME.3 TIME.4
1 1 1 2 3 4
2 2 1 2 3 4
3 3 1 2 3 4
4 4 1 2 3 4
5 5 1 2 3 4
6 6 1 2 3 4

Using "contain" function with two arguments in R

I have a dataset, for example like this:
dat1 <- read.table(header=TRUE, text="
Trust_01_T1 Trust_02_T1 Trust_03_T1 Trust_01_T2 Trust_02_T2 Trust_03_T2 Cont_01_T1 Cont_01_T2
5 1 2 1 5 3 1 1
3 1 3 3 4 2 1 2
2 1 3 1 3 1 2 2
4 2 5 5 3 2 3 3
5 1 4 1 2 2 4 5
")
I'd like to use the select function to gather the variables that contain Trust AND T1.
dat1 <- dat1 %>%
mutate(Trust_T1 = select(., contains("Trust")))
Does anybody know how to use two arguments there, so that a column must contain both Trust AND T1? If I use:
dat1 <- dat1 %>%
mutate(Trust_T1 = select(., contains("Trust"), contains("T1")))
it gives me the variables that contain EITHER Trust or T1.
best!
If we need both, use matches with a regex that specifies column names that start (^) with 'Trust' and end ($) with 'T1' (assuming these are the only patterns):
library(dplyr)
dat1 %>%
select(matches("^Trust_.*T1$"))
It is not clear what the mutate is meant to create, since several columns match 'Trust' followed by 'T1'. If the intention is to perform some operation on the selected columns, either across, or c_across with rowwise, could be used (this is not clear from the post).
One solution could be:
library(dplyr)
df %>% select(starts_with('Trust') | contains('_T1'))
#> Trust_01_T1 Trust_02_T1 Trust_03_T1 Trust_01_T2 Trust_02_T2 Trust_03_T2
#> 1 5 1 2 1 5 3
#> 2 3 1 3 3 4 2
#> 3 2 1 3 1 3 1
#> 4 4 2 5 5 3 2
#> 5 5 1 4 1 2 2
#> Cont_01_T1
#> 1 1
#> 2 1
#> 3 2
#> 4 3
#> 5 4
DATA
df <- read.table(text =
"
Trust_01_T1 Trust_02_T1 Trust_03_T1 Trust_01_T2 Trust_02_T2 Trust_03_T2 Cont_01_T1 Cont_01_T2
5 1 2 1 5 3 1 1
3 1 3 3 4 2 1 2
2 1 3 1 3 1 2 2
4 2 5 5 3 2 3 3
5 1 4 1 2 2 4 5
", header =T)
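The question asks for columns that match both conditions, so a sketch using `&` between two tidyselect helpers may be closer to what is wanted. This assumes a dplyr/tidyselect version that supports Boolean operators between helpers. The row mean at the end is only an assumed example of "some operation" on the selected columns; the original post does not say which operation is intended.

```r
library(dplyr)

dat1 <- read.table(header = TRUE, text = "
Trust_01_T1 Trust_02_T1 Trust_03_T1 Trust_01_T2 Trust_02_T2 Trust_03_T2 Cont_01_T1 Cont_01_T2
5 1 2 1 5 3 1 1
3 1 3 3 4 2 1 2
2 1 3 1 3 1 2 2
4 2 5 5 3 2 3 3
5 1 4 1 2 2 4 5
")

# Keep only columns that start with "Trust" AND end with "T1"
trust_t1 <- dat1 %>% select(starts_with("Trust") & ends_with("T1"))
names(trust_t1)

# Assumed example operation: one summary value per row
dat1$Trust_T1 <- rowMeans(trust_t1)
```

Using `|` instead of `&` gives the union of the two sets, which is why the Cont_01_T1 and Trust_*_T2 columns appear in the output above.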

Can I use Boolean operators with R tidy select functions

Is there a way I can use Boolean operators (e.g. | or &) with the tidyselect helper functions to select variables?
The code below illustrates what currently works and what, in my mind, should work but doesn't.
df<-sample(seq(1,4,1), replace=TRUE, size=400)
df<-data.frame(matrix(df, ncol=10))
#make variable names
library(tidyverse)
library(stringr)
vars1<-str_c('q1_', seq(1,5,1))
vars2<-str_c('q9_', seq(1,5,1))
#Assign
names(df)<-c(vars1, vars2)
names(df)
#This works
df %>%
select(starts_with('q1_'), starts_with('q9'))
#This does not work using |
df %>%
select(starts_with('q1_'| 'q9_'))
#This does not work with c()
df %>%
select(starts_with(c('q1_', 'q9_')))
You can use multiple starts_with, e.g.,
df %>% select(starts_with('q1_'), starts_with('q9_'))
You can use | in a regular expression and matches() (in this case, in combination with ^, the regex beginning-of-string)
df %>% select(matches('^q1_|^q9_'))
You can also approach it using purrr:
map(.x = c("q1_", "q9_"), ~ df %>%
select(starts_with(.x))) %>%
bind_cols()
q1_1 q1_2 q1_3 q1_4 q1_5 q9_1 q9_2 q9_3 q9_4 q9_5
1 2 4 3 1 2 2 3 1 1 3
2 1 3 3 4 4 3 2 2 1 3
3 2 2 3 4 3 4 1 3 2 4
4 1 2 4 2 4 3 3 1 3 3
5 3 1 2 3 3 2 2 3 3 3
6 4 2 3 4 1 4 2 4 2 4
7 3 1 4 1 4 2 4 4 1 2
8 2 2 3 2 1 3 3 3 1 4
9 1 4 2 3 4 4 1 1 3 4
10 1 1 2 4 1 1 4 4 1 2
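For what it is worth, more recent tidyselect releases (the 1.0.0 series and later, to my knowledge) do accept Boolean operators, as long as they are placed between helper calls rather than between the strings inside one call. A character vector inside starts_with() is also treated as a union there. The sketch below assumes such a version is installed; set.seed() is added only to make the sample reproducible.

```r
library(dplyr)

set.seed(1)
df <- data.frame(matrix(sample(1:4, size = 400, replace = TRUE), ncol = 10))
names(df) <- c(paste0("q1_", 1:5), paste0("q9_", 1:5))

# OR: the operator goes between two helper calls
either_op <- df %>% select(starts_with("q1_") | starts_with("q9_"))

# OR: a character vector is matched as a union
either_vec <- df %>% select(starts_with(c("q1_", "q9_")))

# AND: both conditions must hold for a column to be kept
both <- df %>% select(starts_with("q") & ends_with("_1"))
names(both)
```

The form `starts_with('q1_' | 'q9_')` still fails, because `|` is then applied to two character strings before starts_with() ever sees them.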

Split dataframe based on one column in r, with a non-fixed width column [duplicate]

I have a problem that is an extension of a well-covered issue here on SE, namely:
Split a column of a data frame to multiple columns
My data has a column with a string format, comma-separated, but of no fixed length.
data = data.frame(id = c(1,2,3), treatments = c("1,2,3", "2,3", "8,9,1,2,4"))
So I would like to have my dataframe eventually be in the proper tidy/long form of:
id treatments
1 1
1 2
1 3
...
3 1
3 2
3 4
Something like separate or strsplit doesn't seem to be the solution on its own. separate fails with warnings that various rows have too many values (NB id 3 has more values than id 1).
Thanks
You can use tidyr::separate_rows:
library(tidyr)
separate_rows(data, treatments)
# id treatments
#1 1 1
#2 1 2
#3 1 3
#4 2 2
#5 2 3
#6 3 8
#7 3 9
#8 3 1
#9 3 2
#10 3 4
Using dplyr and tidyr packages:
data %>%
separate(treatments, paste0("v", 1:5)) %>%
gather(var, treatments, -id) %>%
na.exclude %>%
select(id, treatments) %>%
arrange(id)
id treatments
1 1 1
2 1 2
3 1 3
4 2 2
5 2 3
6 3 8
7 3 9
8 3 1
9 3 2
10 3 4
You can also use unnest:
library(tidyverse)
data %>%
mutate(treatments = stringr::str_split(treatments, ",")) %>%
unnest(treatments)
id treatments
1 1 1
2 1 2
3 1 3
4 2 2
5 2 3
6 3 8
7 3 9
8 3 1
9 3 2
10 3 4
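For completeness, strsplit can do the job in base R when it is combined with rep and lengths: each id is repeated once per split value. This is a minimal sketch, assuming the treatments are always integers separated by plain commas; `stringsAsFactors = FALSE` is set so it also runs on R versions before 4.0.

```r
data <- data.frame(id = c(1, 2, 3),
                   treatments = c("1,2,3", "2,3", "8,9,1,2,4"),
                   stringsAsFactors = FALSE)

# One character vector of values per row
parts <- strsplit(data$treatments, ",", fixed = TRUE)

# Repeat each id as many times as its row has values, then flatten the values
long <- data.frame(id = rep(data$id, times = lengths(parts)),
                   treatments = as.integer(unlist(parts)))
long
```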

calculate each chunk by group using dplyr?

How can I get the expected calculation using dplyr package?
row value group expected
1 2 1 =NA
2 4 1 =4-2
3 5 1 =5-4
4 6 2 =NA
5 11 2 =11-6
6 12 1 =NA
7 15 1 =15-12
I tried
df=read.table(header=1, text=' row value group
1 2 1
2 4 1
3 5 1
4 6 2
5 11 2
6 12 1
7 15 1')
df %>% group_by(group) %>% mutate(expected=value-lag(value))
How can I calculate for each chunk (row 1-3, 4-5, 6-7) although row 1-3 and 6-7 are labelled as the same group number?
Here is a similar approach. I created a new group variable using cumsum. Whenever the difference between two numbers in group is not 0, R assigns a new group number. If you have more data, this approach may be helpful.
library(dplyr)
mutate(df, foo = cumsum(c(TRUE, diff(group) != 0))) %>%
group_by(foo) %>%
mutate(out = value - lag(value))
# row value group foo out
#1 1 2 1 1 NA
#2 2 4 1 1 2
#3 3 5 1 1 1
#4 4 6 2 2 NA
#5 5 11 2 2 5
#6 6 12 1 3 NA
#7 7 15 1 3 3
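To see why the cumsum step yields one id per consecutive chunk, it may help to look at the intermediate vectors. The sketch below uses base R only; `changed`, `run_id`, and `expected` are illustrative names, and ave() stands in for the grouped mutate.

```r
value <- c(2, 4, 5, 6, 11, 12, 15)
group <- c(1, 1, 1, 2, 2, 1, 1)

# TRUE wherever the group differs from the previous row
changed <- diff(group) != 0
changed
# FALSE FALSE  TRUE FALSE  TRUE FALSE

# The first row always starts a chunk; each TRUE bumps the counter by one
run_id <- cumsum(c(TRUE, changed))
run_id
# 1 1 1 2 2 3 3

# Within each chunk: NA for the first row, then the difference from the previous row
expected <- ave(value, run_id, FUN = function(v) c(NA, diff(v)))
expected
# NA  2  1 NA  5 NA  3
```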
As your group variable is not useful for this, create a new variable aux and use it as the grouping variable:
library(dplyr)
df$aux <- rep(seq_along(rle(df$group)$values), times = rle(df$group)$lengths)
df %>% group_by(aux) %>% mutate(expected = value - lag(value))
Source: local data frame [7 x 5]
Groups: aux
row value group aux expected
1 1 2 1 1 NA
2 2 4 1 1 2
3 3 5 1 1 1
4 4 6 2 2 NA
5 5 11 2 2 5
6 6 12 1 3 NA
7 7 15 1 3 3
Here is an option using data.table (version 1.9.5). The devel version introduced the new functions rleid and shift (for shift, the default type is "lag" and the default fill is NA), which can be useful for this.
library(data.table)
setDT(df)[, expected:=value-shift(value) ,by = rleid(group)][]
# row value group expected
#1: 1 2 1 NA
#2: 2 4 1 2
#3: 3 5 1 1
#4: 4 6 2 NA
#5: 5 11 2 5
#6: 6 12 1 NA
#7: 7 15 1 3
