Appending many columns - functions of existing columns - to data frame - r

I have a data frame with 200 columns: A_1, ..., A_100, B_1, ..., B_100. The entries of A are integers from 1 to 5 or NA, while the entries of B are -1, 0, 1, NA.
I want to append 100 more columns: C_1, ..., C_100 where C_i = A_i + B_i, except when it would yield 0 or 6, in which case it should stay as is.
What would be the best way to do this in R, in terms of clarity and computational complexity? There has to be a better way than a for loop or something like that, perhaps there are functions for this in some library? I'm going to have to do similar operations a lot so I'd like a streamlined method.

You can try:
library(tidyverse)
# some data
d <- data.frame(A_1 = 1:10,
                A_2 = 1:10,
                A_3 = 1:10,
                B_1 = 11:20,
                B_2 = 21:30,
                B_3 = 31:40)
d %>%
  gather(key, value) %>%
  separate(key, into = c("a", "b")) %>%
  group_by(b, a) %>%
  mutate(n = row_number()) %>%
  unite(a2, b, n) %>%
  spread(a, value) %>%
  mutate(Sum = A + B) %>%
  separate(a2, into = c("a", "b"), remove = TRUE) %>%
  select(-A, -B) %>%
  mutate(a = paste0("C_", a)) %>%
  spread(a, Sum) %>%
  arrange(as.numeric(b)) %>%
  left_join(d %>% rownames_to_column(), by = c("b" = "rowname"))
# A tibble: 10 x 10
   b       C_1   C_2   C_3   A_1   A_2   A_3   B_1   B_2   B_3
   <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int>
 1 1        12    22    32     1     1     1    11    21    31
 2 2        14    24    34     2     2     2    12    22    32
 3 3        16    26    36     3     3     3    13    23    33
 4 4        18    28    38     4     4     4    14    24    34
 5 5        20    30    40     5     5     5    15    25    35
 6 6        22    32    42     6     6     6    16    26    36
 7 7        24    34    44     7     7     7    17    27    37
 8 8        26    36    46     8     8     8    18    28    38
 9 9        28    38    48     9     9     9    19    29    39
10 10       30    40    50    10    10    10    20    30    40
The idea is to use tidyr's gather and spread to get the columns A and B side by side. Then you can calculate the sum and transform it back to the expected data.frame. This works as long as your data.frame has the same number of A and B columns. Note that it computes the plain sum and does not apply the exception for sums of 0 or 6.
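For the rule in the question (C_i = A_i + B_i unless the sum is 0 or 6), a vectorised base R sketch is also possible. It assumes that "stay as is" means C_i keeps the value of A_i, which the question does not state explicitly, and that the columns follow the A_<i> / B_<i> naming pattern. The toy data below is invented for illustration.

```r
# Toy data: 3 A columns and 3 B columns (the question has 100 of each)
d <- data.frame(
  A_1 = c(1, 5, 3, NA), A_2 = c(2, 1, 5, 4), A_3 = c(5, 1, 2, 3),
  B_1 = c(-1, 1, 0, 1), B_2 = c(1, -1, 1, NA), B_3 = c(1, -1, 0, 0)
)

# Derive matching column names so A_i is always paired with B_i
a_cols <- grep("^A_", names(d), value = TRUE)
b_cols <- sub("^A_", "B_", a_cols)
c_cols <- sub("^A_", "C_", a_cols)

# Work on whole matrices instead of looping over columns
A <- as.matrix(d[a_cols])
S <- A + as.matrix(d[b_cols])

# Assumption: a sum of 0 or 6 is out of range, so keep A_i there
out_of_range <- !is.na(S) & (S == 0 | S == 6)
S[out_of_range] <- A[out_of_range]

# Append the results under the C_ names; NA in A or B stays NA
d[c_cols] <- as.data.frame(S)
d
```

Because the arithmetic runs on matrices, the same few lines cover 100 column pairs without an explicit loop.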

Related

Data manipulation: gather or spread or both?

I am trying to change my data frame so I can look at it with some different plots. Essentially I want to compare different models. This is what I have:
variable = c('A','B','C','A','B','C')
optimal = c(10,20,30,40,80,100)
control = c(15,15,15,15,15,15)
method_1 = c(11,22,28,44,85,95)
method_2 = c(9, 19,31,39,79,102)
df = data.frame(variable, optimal, control, method_1, method_2)
df
and so it looks like this:
  variable optimal control method_1 method_2
1        A      10      15       11        9
2        B      20      15       22       19
3        C      30      15       28       31
4        A      40      15       44       39
5        B      80      15       85       79
6        C     100      15       95      102
And I need something that looks like this:
  variable  A  B   C
1  optimal 10 20  30
2  optimal 40 80 100
3  control 15 15  15
4  control 15 15  15
5 method_1 11 22  28
6 method_1 44 85  95
7 method_2  9 19  31
8 method_2 39 79 102
I've tried gather, spread and transpose, but nothing worked. Any thoughts? It feels like this should be an easy fix, but I could not get my head around it. Thanks in advance.
You have to go long first and then wide, i.e.
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(-1) %>%
  # each name/variable pair occurs twice, so this step yields list-columns
  pivot_wider(names_from = variable, values_from = value) %>%
  unnest(cols = c(A, B, C))
  name         A     B     C
  <chr>    <dbl> <dbl> <dbl>
1 optimal     10    20    30
2 optimal     40    80   100
3 control     15    15    15
4 control     15    15    15
5 method_1    11    22    28
6 method_1    44    85    95
7 method_2     9    19    31
8 method_2    39    79   102
I think you need both. Also note that gather and spread have been superseded by pivot_longer and pivot_wider.
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(cols = -variable) %>%
  group_by(variable) %>%
  mutate(row = row_number()) %>%
  pivot_wider(names_from = variable, values_from = value) %>%
  select(-row)
# name A B C
# <chr> <dbl> <dbl> <dbl>
#1 optimal 10 20 30
#2 control 15 15 15
#3 method_1 11 22 28
#4 method_2 9 19 31
#5 optimal 40 80 100
#6 control 15 15 15
#7 method_1 44 85 95
#8 method_2 39 79 102
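The row-number approach returns the rows in their original order rather than grouped by model as in the desired output. A minimal sketch of one way to get that grouping, assuming the desired order is the order of the model columns in df (the helper name model_order is made up for illustration):

```r
library(dplyr)
library(tidyr)

df <- data.frame(variable = c("A", "B", "C", "A", "B", "C"),
                 optimal  = c(10, 20, 30, 40, 80, 100),
                 control  = c(15, 15, 15, 15, 15, 15),
                 method_1 = c(11, 22, 28, 44, 85, 95),
                 method_2 = c(9, 19, 31, 39, 79, 102))

# Order in which the models should appear (hypothetical helper)
model_order <- names(df)[-1]

wide <- df %>%
  pivot_longer(cols = -variable) %>%
  group_by(variable) %>%
  mutate(row = row_number()) %>%   # makes each name/variable pair unique
  ungroup() %>%
  pivot_wider(names_from = variable, values_from = value) %>%
  arrange(match(name, model_order), row) %>%  # group rows by model
  select(-row)
wide
```

The row column is what lets pivot_wider keep the repeated measurements apart; sorting on it as a tie-breaker keeps the first block of A/B/C values ahead of the second within each model.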

row wise calculation and update entire row in dplyr

I want to do a row-wise calculation with the dplyr package in R. The result of the calculation is a series, and I want to replace the entire row with that series. Here is the code:
df <- tibble(id = 1:6, w = 10:15, x = 20:25, y = 30:35, z = 40:45)
I want to run isoreg, whose result is a series, and use that result to replace the values in columns w:z:
df %>% rowwise %>%
mutate(across(c_across(w:z), ~ isoreg(as.numeric(c_across(w:z)))$yf))
It seems this method is just for replacing one element, not the entire row.
The isoreg is just a sample function, we could use other functions that return a series not a single value as the output.
You don't need both across and c_across; for rowwise operations, c_across alone is enough. Also, in a rowwise mutate() each new column must evaluate to a single value per row, so you can't replace all the columns in one go. A workaround is to capture all the values in a list and use unnest_wider to spread them into separate columns.
library(dplyr)
df %>%
  rowwise() %>%
  mutate(output = list(isoreg(c_across(w:z))$yf)) %>%
  tidyr::unnest_wider(output)
# id w x y z ...1 ...2 ...3 ...4
# <int> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl>
#1 1 10 20 30 40 10 20 30 40
#2 2 11 21 31 41 11 21 31 41
#3 3 12 22 32 42 12 22 32 42
#4 4 13 23 33 43 13 23 33 43
#5 5 14 24 34 44 14 24 34 44
#6 6 15 25 35 45 15 25 35 45
Since the output of isoreg is not named, unnest_wider names the new columns ...1, ...2, etc. You can rename them if needed and remove the columns you don't need.
A base R option is to use apply:
df[-1] <- t(apply(df[-1], 1, function(x) isoreg(x)$yf))
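The t() in the apply version is easy to overlook: apply over rows returns one column per input row, so the result has to be transposed before it is written back. A small sketch on a plain data.frame with made-up values, where the first row is not monotone so isoreg actually changes it:

```r
# Made-up data: row 1 is not monotone, row 2 already is
df <- data.frame(id = 1:2,
                 w = c(1, 10), x = c(3, 20), y = c(2, 30), z = c(4, 40))

# One isotonic fit per row; apply() stores each fit as a COLUMN
fit <- apply(df[-1], 1, function(r) isoreg(r)$yf)
dim(fit)  # 4 rows (w, x, y, z) by 2 columns (the two input rows)

# Transpose so fits line up with the original rows, then overwrite w:z
df[-1] <- t(fit)
df
```

In the first row the out-of-order pair 3, 2 is pooled to 2.5, 2.5, while the second row is left unchanged.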
We could use pmap with unnest_wider
library(dplyr)
library(tidyr)
library(purrr)
df %>%
  mutate(new = pmap(select(., w:z), ~ isoreg(c(...))$yf)) %>%
  unnest_wider(c(new))
# A tibble: 6 x 9
# id w x y z ...1 ...2 ...3 ...4
# <int> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl>
#1 1 10 20 30 40 10 20 30 40
#2 2 11 21 31 41 11 21 31 41
#3 3 12 22 32 42 12 22 32 42
#4 4 13 23 33 43 13 23 33 43
#5 5 14 24 34 44 14 24 34 44
#6 6 15 25 35 45 15 25 35 45

Create multiple new columns in tibble in R based on value of previous row giving prefix to all

I have a tibble as so:
df <- tibble(a = seq(1:10),
b = seq(21,30),
c = seq(31,40))
I want to create a new tibble in which some columns are lagged. The new columns should be named prev_ followed by the lagged column name, e.g. prev_a.
In my actual data there are a lot of columns, so I don't want to write this out manually. Additionally, I only want to do it for some columns. In this example I have done it manually, but I would like to know if there is a way to use a function to do it.
df_new <- df %>%
  mutate(prev_a = lag(a),
         prev_b = lag(b),
         prev_c = lag(c))
Thanks for your help!
With the current dplyr version you can create new variables with mutate_at. If you pass a named list of functions, the list name is appended to each column name as a suffix. If you want it as a prefix, as in your example, you can use rename_at to fix the names afterwards. With your real data you will need to adjust the vars() selection; for the example data, matches("[a-c]") works.
library(dplyr)
df <- tibble(a = seq(1:10),
b = seq(21,30),
c = seq(31,40))
df %>%
  mutate_at(vars(matches("[a-c]")), list(prev = ~ lag(.x)))
#> # A tibble: 10 x 6
#> a b c a_prev b_prev c_prev
#> <int> <int> <int> <int> <int> <int>
#> 1 1 21 31 NA NA NA
#> 2 2 22 32 1 21 31
#> 3 3 23 33 2 22 32
#> 4 4 24 34 3 23 33
#> 5 5 25 35 4 24 34
#> 6 6 26 36 5 25 35
#> 7 7 27 37 6 26 36
#> 8 8 28 38 7 27 37
#> 9 9 29 39 8 28 38
#> 10 10 30 40 9 29 39
df %>%
  mutate_at(vars(matches("[a-c]")), list(prev = ~ lag(.x))) %>%
  rename_at(vars(contains("_prev")), list(~ paste("prev", gsub("_prev", "", .), sep = "_")))
#> # A tibble: 10 x 6
#> a b c prev_a prev_b prev_c
#> <int> <int> <int> <int> <int> <int>
#> 1 1 21 31 NA NA NA
#> 2 2 22 32 1 21 31
#> 3 3 23 33 2 22 32
#> 4 4 24 34 3 23 33
#> 5 5 25 35 4 24 34
#> 6 6 26 36 5 25 35
#> 7 7 27 37 6 26 36
#> 8 8 28 38 7 27 37
#> 9 9 29 39 8 28 38
#> 10 10 30 40 9 29 39
Created on 2020-04-29 by the reprex package (v0.3.0)
You could do it this way:
df_new <- bind_cols(
  df,
  df %>% mutate_at(.vars = vars("a", "b", "c"), function(x) lag(x))
)
The names are a bit ugly, but you can rename them afterwards. Or see @Bas's comment to get the names with a suffix.
# A tibble: 10 x 6
a b c a1 b1 c1
<int> <int> <int> <int> <int> <int>
1 1 21 31 NA NA NA
2 2 22 32 1 21 31
3 3 23 33 2 22 32
4 4 24 34 3 23 33
5 5 25 35 4 24 34
6 6 26 36 5 25 35
7 7 27 37 6 26 36
8 8 28 38 7 27 37
9 9 29 39 8 28 38
10 10 30 40 9 29 39
If you have dplyr 1.0 you can use the new across() function.
See some examples from the docs; instead of mean you would use lag:
df %>% mutate_if(is.numeric, mean, na.rm = TRUE)
# ->
df %>% mutate(across(where(is.numeric), mean, na.rm = TRUE))
df %>% mutate_at(vars(x, starts_with("y")), mean, na.rm = TRUE)
# ->
df %>% mutate(across(c(x, starts_with("y")), mean, na.rm = TRUE))
df %>% mutate_all(mean, na.rm = TRUE)
# ->
df %>% mutate(across(everything(), mean, na.rm = TRUE))
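Applied to the lag question, a minimal sketch with across() could look as follows. It assumes dplyr 1.0 or later; the vector cols_to_lag is a made-up name for the columns you want lagged. The .names argument builds the prev_ prefix directly, so no renaming step is needed.

```r
library(dplyr)

df <- tibble(a = 1:10, b = 21:30, c = 31:40)

# Hypothetical selection: only these columns get a lagged copy
cols_to_lag <- c("a", "b")

df_new <- df %>%
  mutate(across(all_of(cols_to_lag),
                ~ dplyr::lag(.x),          # lag each selected column by one row
                .names = "prev_{.col}"))   # name the results prev_a, prev_b
df_new
```

The original columns are kept and the lagged copies are appended after them; column c is untouched because it is not in the selection.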

Group_by / summarize by two variables within a function

I would like to write a function that summarizes the provided data by some specified criteria, in this case by age.
The example data is a table of users' age and their stats.
df <- data.frame('Age'=rep(18:25,2), 'X1'=10:17, 'X2'=28:35,'X4'=22:29)
Next I define the output columns that are relevant for the analysis
output_columns <- c('Age', 'X1', 'X2', 'X3')
This function computes the sums of X1, X2 and X3 grouped by age.
aggr <- function(data, criteria, output_columns){
  k <- data %>% .[, colnames(.) %in% output_columns] %>%
    group_by_(.dots = criteria) %>%
    #summarise_each(funs(count), age) %>%
    summarize_if(is.numeric, sum)
  return (k)
}
When I call it like this
> e <- aggr(df, "Age", output_columns)
> e
# A tibble: 8 x 3
Age X1 X2
<int> <int> <int>
1 18 20 56
2 19 22 58
3 20 24 60
4 21 26 62
5 22 28 64
6 23 30 66
7 24 32 68
8 25 34 70
I want to have another column called count which shows the number of observations in each age group. Desired output is
> desired
Age X1 X2 count
1 18 20 56 2
2 19 22 58 2
3 20 24 60 2
4 21 26 62 2
5 22 28 64 2
6 23 30 66 2
7 24 32 68 2
8 25 34 70 2
I have tried different ways to do that, e.g. tally(), summarize_each, etc. They all deliver wrong results.
I believe there should be an easy and simple way to do that.
Any help is appreciated.
Since you're already summing all variables, you can just add a column of all 1s before the summary function
aggr <- function(data, criteria, output_columns){
  data %>%
    .[, colnames(.) %in% output_columns] %>%
    group_by_(.dots = criteria) %>%
    mutate(n = 1L) %>%
    summarize_if(is.numeric, sum)
}
# A tibble: 8 x 4
Age X1 X2 n
<int> <int> <int> <int>
1 18 20 56 2
2 19 22 58 2
3 20 24 60 2
4 21 26 62 2
5 22 28 64 2
6 23 30 66 2
7 24 32 68 2
8 25 34 70 2
We could create the 'count' column before summarise_if
aggr <- function(data, criteria, output_columns){
  data %>%
    select(intersect(names(.), output_columns)) %>%
    group_by_at(criteria) %>%
    group_by(count = n(), add = TRUE) %>%
    summarize_if(is.numeric, sum) %>%
    select(setdiff(names(.), 'count'), count)
}
aggr(df,"Age",output_columns)
# A tibble: 8 x 4
# Groups: Age [8]
# Age X1 X2 count
# <int> <int> <int> <int>
#1 18 20 56 2
#2 19 22 58 2
#3 20 24 60 2
#4 21 26 62 2
#5 22 28 64 2
#6 23 30 66 2
#7 24 32 68 2
#8 25 34 70 2
In base R you could do
aggr <- function(data, criteria, output_columns){
  ds <- data[, colnames(data) %in% output_columns]
  d <- aggregate(ds, by = list(criteria), function(x) c(sum(x), length(x)))
  "names<-"(do.call(data.frame, d)[, -c(2:3, 5)], c(names(ds), "n"))
}
> with(df, aggr(df, Age, output_columns))
Age X1 X2 n
1 18 20 56 2
2 19 22 58 2
3 20 24 60 2
4 21 26 62 2
5 22 28 64 2
6 23 30 66 2
7 24 32 68 2
8 25 34 70 2
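The dplyr answers above rely on group_by_ and the scoped *_if / *_at verbs, which newer dplyr releases have deprecated or superseded. A sketch of the same idea with across(), assuming dplyr 1.0 or later; the function name aggr_count is made up for illustration:

```r
library(dplyr)

df <- data.frame(Age = rep(18:25, 2), X1 = 10:17, X2 = 28:35, X4 = 22:29)
output_columns <- c("Age", "X1", "X2", "X3")

# Hypothetical rewrite of aggr(): sums per group plus a group-size column
aggr_count <- function(data, criteria, output_columns) {
  data %>%
    select(any_of(output_columns)) %>%       # silently skips the missing X3
    group_by(across(all_of(criteria))) %>%   # criteria is a character vector
    summarise(across(where(is.numeric), sum),
              count = n(),                   # number of rows in each group
              .groups = "drop")
}

e <- aggr_count(df, "Age", output_columns)
e
```

Computing count with n() inside summarise() avoids adding a helper column of ones, and passing several names in criteria groups by all of them at once.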

summarise dataset conditioning on variable using dplyr

I want to summarise my dataset grouping the variable age into 5 years age groups, so instead of single age 0 1 2 3 4 5 6... I would have 0 5 10 15 etc. with 80 being my open-ended category. I could do this by categorizing everything by hand creating a new variable, but I am sure there must be a quicker way!
a <- cbind(age=c(rep(seq(0, 90, by=1), 2)), value=rnorm(182))
Any ideas?
Like this?
library(dplyr)
a %>% data.frame %>% group_by(age_group = (sapply(age,min,80) %/% 5)*5) %>%
  summarize(avg_val = mean(value))
# A tibble: 17 x 2
age_group avg_val
<dbl> <dbl>
1 0 -0.151470805
2 5 0.553619149
3 10 0.198915973
4 15 -0.436646287
5 20 -0.024193193
6 25 0.102671120
7 30 -0.350059839
8 35 0.010762264
9 40 0.339268917
10 45 -0.056448481
11 50 0.002982158
12 55 0.348232262
13 60 -0.364050091
14 65 0.177551510
15 70 -0.178885909
16 75 0.664215782
17 80 -0.376929230
Example data
set.seed(1)
df <- data.frame(age=runif(1000)*100,
value=runif(1000))
Simply append the maximum age to seq(0, 80, 5) with c(..., max(age)) to get an irregular final break, so that everything above 80 falls into one open-ended group:
library(dplyr)
df %>%
  mutate(age = cut(age, breaks = c(seq(0, 80, 5), max(age)))) %>%
  group_by(age) %>%
  summarise(value = mean(value))
Output
        age     value
     <fctr>     <dbl>
 1    (0,5] 0.4901119
 2   (5,10] 0.5131055
 3  (10,15] 0.5022297
 4  (15,20] 0.4712481
 5  (20,25] 0.5610872
 6  (25,30] 0.4207203
 7  (30,35] 0.5218318
 8  (35,40] 0.4377102
 9  (40,45] 0.5007616
10  (45,50] 0.4941768
11  (50,55] 0.5350272
12  (55,60] 0.5226967
13  (60,65] 0.5031688
14  (65,70] 0.4652641
15  (70,75] 0.5667020
16  (75,80] 0.4664898
17 (80,100] 0.4604779
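Two details of the answers above may matter for integer ages like those in the question. With the default right-closed intervals, cut() puts age 0 outside (0,5] and returns NA for it, and sapply(age, min, 80) can be written as the vectorised pmin(age, 80). A base R sketch on made-up ages, assuming the groups should be [0,5), [5,10), ... with 80 and above as the open-ended group:

```r
# Made-up ages covering the edge cases: 0, a bin boundary, and the 80+ group
age <- c(0, 4, 5, 79, 80, 90)

# Numeric group label: cap at 80, then round down to a multiple of 5
age_group <- pmin(age, 80) %/% 5 * 5
age_group

# Factor version: left-closed bins so that 0 and 80 land in [0,5) and [80,Inf)
age_bin <- cut(age, breaks = c(seq(0, 80, 5), Inf), right = FALSE)
table(age_bin)
```

Either result can be used as the grouping variable in group_by() before summarising.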
