dplyr: looping the creation of new columns

EDIT: My data (for reproducible research) looks as follows. dplyr will summarise the values for each win_name category:
inv_name  inv_province inv_town nip        win_name  value  start      duration year
CustomerA łódzkie      TownX    1111111111 CompX     233.50 2015-10-23       24 2017
CustomerA łódzkie      TownX    1111111111 CompX     300.5  2015-10-23       24 2017
CustomerA łódzkie      TownX    1111111111 CompX     200.5  2015-10-23       24 2017
CustomerB łódzkie      TownY    2222222222 CompY     200.5  2015-10-25       12 2017
CustomerB łódzkie      TownY    2222222222 CompY    1200.0  2015-10-25       12 2017
CustomerB łódzkie      TownY    2222222222 CompY     320.00 2015-10-25       12 2017
dplyr will summarise the values, and spread will then turn the summary into several numeric columns, one per win_name category.
I would like to create new columns with formatted text corresponding to the existing numeric columns, creating as many text columns as there are numeric columns. The number of these columns can change from analysis to analysis. My code so far looks like this:
county_marketshare <- df_monthly_val %>%
  select(win_name, value, inv_province) %>%
  group_by(win_name, inv_province) %>%
  summarise(value = round(sum(value), 0)) %>%
  spread(key = "win_name", value = value, fill = 0) %>%
  # now I need to create the "financially" formatted columns
  mutate(!!as.symbol(paste0(bestSup[1], "_lbl")) := formatC(!!as.symbol(bestSup[1]),
             digits = 0, big.mark = " ", format = "f", zero.print = ""),
         !!as.symbol(paste0(bestSup[2], "_lbl")) := formatC(!!as.symbol(bestSup[2]),
             digits = 0, big.mark = " ", format = "f", zero.print = ""),
         !!as.symbol(paste0(bestSup[3], "_lbl")) := formatC(!!as.symbol(bestSup[3]),
             digits = 0, big.mark = " ", format = "f", zero.print = ""))
Is there a way to loop the mutate() call so that as many columns are created as there are existing numeric columns? The relevant lines with the repetitive code are the last three. Each new formatted text column gets the name of an existing numeric column plus a suffix; !!as.symbol() makes it possible to build the target name from the source column name and the _lbl suffix.

You could, for example, use mutate_at with a function and a conditional, such as:
dat %>%
  mutate_at(.vars = c('num_col1', 'num_col2'),
            .funs = function(x) if (is.numeric(x)) as.character(x) else x)
This will replace the specified numeric columns with character columns. You can tweak the function to your needs, i.e. specify how the columns should look; one possible tweak is sketched below. We could help you a bit more with a better data example.
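For instance, a tweak in the spirit of the question (a sketch; fmt_lbl is a made-up helper name, and num_col1/num_col2 are placeholder columns) could reuse the formatC call, replacing the listed columns with "financially" formatted text:
# Hypothetical helper reusing the question's formatC call
fmt_lbl <- function(x) formatC(x, digits = 0, big.mark = " ",
                               format = "f", zero.print = "")
dat %>%
  mutate_at(.vars = c('num_col1', 'num_col2'), .funs = fmt_lbl)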
You can also filter only the numeric columns and then use mutate_all:
dat %>% Filter(is.numeric, .) %>% mutate_all(funs(as.character))
# Filter() is not dplyr but base R; caveat: capital 'F'!
# You can also use dat %>% .[sapply(., is.numeric)], with the same result,
# or dplyr::select_if
...:)
P.S. It's always worth citing the reference. Have a look at this gorgeous question:
Selecting only numeric columns from a data frame
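On newer dplyr versions (1.0 and later), where funs() and the _if/_all variants are superseded, the same idea can be sketched with across():
dat %>%
  mutate(across(where(is.numeric), as.character))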

Please consult the tidyverse documentation.
# mutate_if() is particularly useful for transforming variables from
# one type to another
iris %>% as_tibble() %>% mutate_if(is.factor, as.character)
#> # A tibble: 150 x 5
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>           <dbl>       <dbl>        <dbl>       <dbl> <chr>
#>  1         5.10        3.50         1.40       0.200 setosa
#>  2         4.90        3.00         1.40       0.200 setosa
#>  3         4.70        3.20         1.30       0.200 setosa
#>  4         4.60        3.10         1.50       0.200 setosa
#>  5         5.00        3.60         1.40       0.200 setosa
#>  6         5.40        3.90         1.70       0.400 setosa
#>  7         4.60        3.40         1.40       0.300 setosa
#>  8         5.00        3.40         1.50       0.200 setosa
#>  9         4.40        2.90         1.40       0.200 setosa
#> 10         4.90        3.10         1.50       0.100 setosa
#> # ... with 140 more rows

Unexpectedly I found a hint at http://stackoverflow.com/a/47971650/3480717
I did not realise that in the syntax
mtcars %>% mutate_at(columnstolog, funs(log = log(.)))
adding a name part ("log =") in funs() appends it to the names of the new columns. In effect, the following is enough in my case:
mutate_if(is.numeric, funs(lbl = formatC(., digits = 0, big.mark = " ", format = "f", zero.print = "")))
This generates as many new columns as there are original numeric columns, and the new columns have names suffixed with "_lbl". No need for loops or advanced syntax. Big thanks to Thebo and Nettle.
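For what it's worth, funs() has since been deprecated; on dplyr 1.0 and later the same result should be achievable with across() and its .names argument (a sketch, not from the original answer):
df %>%
  mutate(across(where(is.numeric),
                ~ formatC(.x, digits = 0, big.mark = " ", format = "f", zero.print = ""),
                .names = "{.col}_lbl"))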

Related

Looking for function or formula to create table with means and standard deviations for many groups and many variables using tidyverse

I need to prepare a table that includes the means and standard deviations for each level of several demographic variables, for many variables.
Consider the following data:
df <- tibble(
  place = c("London", "Paris", "London", "Rome", "Rome", "Madrid", "Madrid"),
  gender = c("m", "f", "f", "f", "m", "m", "f"),
  education = c(1, 1, 2, 3, 5, 5, 3),
  var1 = c(2.2, 3.1, 4.5, 1, 5, 1.4, 2.3),
  var2 = c(4.2, 2.1, 2.5, 4, 5, 4.4, 1.3),
  var3 = c(0.2, 0.1, 3.5, 3, 5, 2.4, 4.3)
)
I would like to get a dataframe that contains the grouping variables (place, gender, education) and their levels (e.g., London, Paris, etc.) in the first column and their means and standard deviations for each variable starting with var (var1, var2, var3) in additional columns.
I know how to do this for one group and several variables at a time. However, since I need to repeat this dozens of times I am looking for a way to automate this process. It would be great to have a function to which I simply need to pass (a) the names of the grouping variables (e.g., gender, education) and (b) the variables from which to get the M / SD (e.g. var1, var2).
The solution I am looking for should look like this (the stats are not correct in the example below):
my_results <- tibble(
  grouping_vars = c("place_London", "place_Paris", "place_Rome", "place_Madrid",
                    "gender_m", "gender_f", "last_element"),
  mean_var1 = c(1.3, 2.5, 4.5, 1.7, 2.5, 3.6, 4.0),
  sd_var1 = c(0.01, 0.41, 0.21, 0.12, 0.02, 0.38, 0.28),
  mean_var2 = c(4.3, 4.5, 4.0, 1.2, 2.5, 1.6, 2.3),
  sd_var2 = c(0.21, 0.1, 0.1, 0.32, 0.22, 0.18, 0.08),
  mean_var3 = c(2.3, 2.5, 2.0, 3.2, 3.5, 0.6, 5),
  sd_var3 = c(0.51, 0.15, 0.51, 0.52, 0.52, 0.15, 0.48)
)
  grouping_vars mean_var1 sd_var1 mean_var2 sd_var2 mean_var3 sd_var3
  <chr>             <dbl>   <dbl>     <dbl>   <dbl>     <dbl>   <dbl>
1 place_London       1.3     0.01      4.3     0.21      2.3     0.51
2 place_Paris        2.5     0.41      4.5     0.1       2.5     0.15
3 place_Rome         4.5     0.21      4       0.1       2       0.51
4 place_Madrid       1.7     0.12      1.2     0.32      3.2     0.52
5 gender_m           2.5     0.02      2.5     0.22      3.5     0.52
6 gender_f           3.6     0.38      1.6     0.18      0.6     0.15
7 last_element       4       0.28      2.3     0.08      5       0.48
Since I typically work with tidyverse, I would particularly appreciate solutions that use these packages (probably dplyr or purrr?).
EDIT:
I thought there would be an elegant way to do this using map(). Maybe there is, but I haven't found it yet. In the meantime, I figured out a way that simply restructures the data into an appropriate long format and then computes the statistics.
df %>%
  # all grouping vars need to be of the same type; here "factor" is most appropriate
  mutate_at(grouping_vars, list(factor)) %>%
  # pivot longer, so that each row is a unique combination of grouping variable and grouping level
  pivot_longer(
    cols = one_of(grouping_vars),
    names_to = "group_var",
    values_to = "group_level"
  ) %>%
  # merge grouping variable and group level into a single column
  unite(var_level, group_var, group_level, sep = "_") %>%
  # group by group level
  group_by(var_level) %>%
  # compute means and sd for each test variable
  summarise_at(test_vars, list(~ mean(., na.rm = TRUE), ~ sd(., na.rm = TRUE)))
The result seems fine; e.g., the mean of var1 for the two people who live in London, (2.2 + 4.5) / 2, is 3.35.
# A tibble: 10 x 7
   var_level    var1_mean var2_mean var3_mean var1_sd var2_sd var3_sd
   <chr>            <dbl>     <dbl>     <dbl>   <dbl>   <dbl>   <dbl>
 1 education_1       2.65      3.15      0.15   0.636   1.48   0.0707
 2 education_2       4.5       2.5       3.5   NA      NA     NA
 3 education_3       1.65      2.65      3.65   0.919   1.91   0.919
 4 education_5       3.2       4.7       3.7    2.55    0.424  1.84
 5 gender_f          2.72      2.48      2.72   1.47    1.13   1.83
 6 gender_m          2.87      4.53      2.53   1.89    0.416  2.40
 7 place_London      3.35      3.35      1.85   1.63    1.20   2.33
 8 place_Madrid      1.85      2.85      3.35   0.636   2.19   1.34
 9 place_Paris       3.1       2.1       0.1   NA      NA     NA
10 place_Rome        3         4.5       4      2.83    0.707  1.41
Any thoughts on possible risks of this approach or how this could be improved?
One option is the describeBy function from psych:
library(psych)
describeBy(df, group = c("gender", "education"), mat = TRUE)
Then subset what you want from there.
Another, surprisingly simple option with dplyr:
library(dplyr)
group.vars <- c("gender", "education")
measure.vars <- c("var1", "var2")
df %>%
  group_by_at(group.vars) %>%
  summarize_at(measure.vars,
               list(mean = ~ mean(.), sd = ~ sd(.)))
# A tibble: 5 x 6
# Groups: gender [2]
  gender education var1_mean var2_mean var1_sd var2_sd
  <chr>      <dbl>     <dbl>     <dbl>   <dbl>   <dbl>
1 f              1      3.1       2.1   NA      NA
2 f              2      4.5       2.5   NA      NA
3 f              3      1.65      2.65   0.919   1.91
4 m              1      2.2       4.2   NA      NA
5 m              5      3.2       4.7    2.55    0.424
You can continue adding additional functions to that list. For every element, the name will be appended to the variable name, and the result will become the column values. Recall that ~ is shorthand for function(x).
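On dplyr 1.0 and later, where the _at verbs are superseded, a sketch of the same table with across() (reusing the group.vars and measure.vars vectors above):
df %>%
  group_by(across(all_of(group.vars))) %>%
  summarize(across(all_of(measure.vars),
                   list(mean = mean, sd = sd)))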

Can someone find out why this dplyr table doesn't work?

I have to make a table in which the mean and median of the log sales per month are set out, using the "txhousing" dataset. The exercise I got is the following: "The table below shows the means and medians of the log of sales per month, sorted by mean."
Insert a new r chunk and type the code in it to display that table
Use na.omit to get rid of cases with missing values
Use the dplyr command mutate to make the variable logsales
Use the dplyr command group_by to group by month
Use the dplyr command summarise to display the table
Use the dplyr command arrange to sort by mean
Connect the commands with the pipe operator %>%
I've tried mixing up the code multiple times, but I can't find out why it keeps giving me NAs in my table.
library(tidyverse)
summary(txhousing)
na.omit(txhousing)
txhousing<- as.data.frame(txhousing)
logsales <- log(txhousing$sales)
group_by(txhousing, txhousing$month)
txhousing<- txhousing %>% mutate(logsales= log(txhousing$sales))
txhousing %>% group_by(txhousing$month) %>% summarise(mean(logsales), median(logsales)) %>% arrange(mean)
I expect to get a table with the mean and median of the log sales per month, but what I get is only NA in the mean and median columns, and arrange gives the following error:
Error: cannot arrange column of class 'function' at position 1
There are NA values in the columns, so you need to tell mean and median to ignore them. You also need to name the columns in summarise so that arrange can refer to the column named mean.
txhousing %>%
  group_by(txhousing$month) %>%
  summarise(mean = mean(logsales, na.rm = T),
            med = median(logsales, na.rm = T)) %>%
  arrange(mean) %>%
  rename(month = `txhousing$month`)
This creates the following tibble:
# A tibble: 12 x 3
   month  mean   med
   <int> <dbl> <dbl>
 1     1  4.95  4.74
 2     2  5.13  4.93
 3    11  5.19  4.96
 4    12  5.24  5.02
 5    10  5.29  5.08
 6     9  5.32  5.09
 7     3  5.38  5.15
 8     4  5.42  5.21
 9     5  5.52  5.29
10     7  5.53  5.30
11     8  5.53  5.33
12     6  5.56  5.34
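As a side note, grouping on the bare column name avoids both the `txhousing$month` pitfall and the rename() step; a sketch of the same pipeline:
txhousing %>%
  group_by(month) %>%
  summarise(mean = mean(logsales, na.rm = TRUE),
            med = median(logsales, na.rm = TRUE)) %>%
  arrange(mean)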

R: How to aggregate with NA values

To give a small working example, suppose I have the following data frame:
library(dplyr)
country <- rep(c("A", "B", "C"), each = 6)
year <- rep(c(1,2,3), each = 2, times = 3)
categ <- rep(c(0,1), times = 9)
pop <- rep(c(NA, runif(n=8)), each=2)
money <- runif(18)+100
df <- data.frame(Country = country,
                 Year = year,
                 Category = categ,
                 Population = pop,
                 Money = money)
Now, the data I'm actually working with has many more repetitions: for every country, year, and category there are many repeated rows corresponding to various sources of money, and I want to sum these all together. For now, however, it's enough to have one row for each country, year, and category and just trivially apply the sum() function to each row. This still exhibits the behavior I'm trying to get rid of.
Notice that for country A in year 1, the population listed is NA. Therefore when I run
aggregate(Money ~ Country+Year+Category+Population, df, sum)
the resulting data frame has dropped the rows corresponding to country A and year 1. I'm only using the ...+Population... bit of code because I want the output data frame to retain this column.
I'm wondering how to make the aggregate() function not drop things that have NAs in the columns by which the grouping occurs--it'd be nice if, for instance, the NAs themselves could be treated as values to group by.
My attempts: I tried turning the Population column into a factor, but that didn't change the behavior. I read about the na.action argument, but neither na.action = NULL nor na.action = na.skip changed the behavior. I thought about turning all the NAs into 0s, and I can't think of what that would hurt, but it feels like a hack that might bite me later on. I'm also not sure how I would do it: when I wrote a function with is.na() in it, the if (is.na(x)) test was not applied in a vectorized way, and R said it would just use the first element of the vector. I thought about using lapply() on the column, coercing the result back to a vector and sticking that in the column, but that also sounds hacky and needlessly round-about.
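For reference, the vectorized counterpart of that if (is.na(x)) test is ifelse() (or tidyr::replace_na()); a sketch of the NA-to-0 idea, if one did want to try it:
df$Population <- ifelse(is.na(df$Population), 0, df$Population)
# equivalently: df <- tidyr::replace_na(df, list(Population = 0))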
The solution here seemed to be about keeping the NA values out of the data frame in the first place, which I can't do: Aggregate raster in R with NA values
As you have already mentioned dplyr before your data, you can use the dplyr::summarise function. Grouping with dplyr keeps NA values as their own group, so those rows are not dropped.
library(dplyr)
df %>%
  group_by(Country, Year, Category, Population) %>%
  summarise(Money = sum(Money))
# # A tibble: 18 x 5
# # Groups: Country, Year, Category [?]
#    Country  Year Category Population Money
#    <fctr>  <dbl>    <dbl>      <dbl> <dbl>
#  1 A        1.00     0        NA       101
#  2 A        1.00     1.00     NA       100
#  3 A        2.00     0         0.482   101
#  4 A        2.00     1.00      0.482   101
#  5 A        3.00     0         0.600   101
#  6 A        3.00     1.00      0.600   101
#  7 B        1.00     0         0.494   101
#  8 B        1.00     1.00      0.494   101
#  9 B        2.00     0         0.186   100
# 10 B        2.00     1.00      0.186   100
# 11 B        3.00     0         0.827   101
# 12 B        3.00     1.00      0.827   101
# 13 C        1.00     0         0.668   100
# 14 C        1.00     1.00      0.668   101
# 15 C        2.00     0         0.794   100
# 16 C        2.00     1.00      0.794   100
# 17 C        3.00     0         0.108   100
# 18 C        3.00     1.00      0.108   100
Note: The OP's sample data doesn't have multiple rows for the same groups, so the number of summarised rows is the same as the number of input rows.
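If base R aggregate() must be used, one option (a sketch, untested) is to make NA an explicit factor level with addNA(), so the formula interface no longer drops those rows:
df2 <- df
df2$Population <- addNA(factor(df2$Population))  # NA becomes a real level
aggregate(Money ~ Country + Year + Category + Population, data = df2, FUN = sum)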

How to divide dataset into balanced sets based on multiple variables

I have a large dataset I need to divide into multiple balanced sets.
The set looks something like the following:
> data <- matrix(runif(4000, min = 0, max = 10), nrow = 500, ncol = 8)
> colnames(data) <- c("A", "B", "C", "D", "E", "F", "G", "H")
The sets, each containing for example 20 rows, need to be balanced across multiple variables, so that each subset ends up with a similar mean of B, C, and D compared to all the other subsets.
Is there a way to do that with R? Any advice would be much appreciated. Thank you in advance!
library(tidyverse)
# Reproducible data
set.seed(2)
data <- matrix(runif(4000, min = 0, max = 10), nrow = 500, ncol = 8)
colnames(data) <- c("A", "B", "C", "D", "E", "F", "G", "H")
data <- as.data.frame(data)
Updated Answer
It's probably not possible to get similar means across sets within each column if you want to keep the observations from a given row together. With 8 columns (as in your sample data), you'd need 25 20-row sets where each column A set has the same mean, each column B set has the same mean, and so on. That's a lot of constraints. There are probably algorithms, however, that could find the set-membership assignment that minimizes the differences in set means.
However, if you can separately take 20 observations from each column without regard to which row they came from, then here's one option:
# Group into sets with same means: sort the values within each column,
# then deal them out 1..25 and back 25..1 so each set mixes low and high values
same_means = data %>%
  gather(key, value) %>%
  arrange(value) %>%
  group_by(key) %>%
  mutate(set = c(rep(1:25, 10), rep(25:1, 10)))

# Check means by set for each column
same_means %>%
  group_by(key, set) %>%
  summarise(mean = mean(value)) %>%
  spread(key, mean) %>%
  as.data.frame
set A B C D E F G H
1 1 4.940018 5.018584 5.117592 4.931069 5.016401 5.171896 4.886093 5.047926
2 2 4.946496 5.018578 5.124084 4.936461 5.017041 5.172817 4.887383 5.048850
3 3 4.947443 5.021511 5.125649 4.929010 5.015181 5.173983 4.880492 5.044192
4 4 4.948340 5.014958 5.126480 4.922940 5.007478 5.175898 4.878876 5.042789
5 5 4.943010 5.018506 5.123188 4.924283 5.019847 5.174981 4.869466 5.046532
6 6 4.942808 5.019945 5.123633 4.924036 5.019279 5.186053 4.870271 5.044757
7 7 4.945312 5.022991 5.120904 4.919835 5.019173 5.187910 4.869666 5.041317
8 8 4.947457 5.024992 5.125821 4.915033 5.016782 5.187996 4.867533 5.043262
9 9 4.936680 5.020040 5.128815 4.917770 5.022527 5.180950 4.864416 5.043587
10 10 4.943435 5.022840 5.122607 4.921102 5.018274 5.183719 4.872688 5.036263
11 11 4.942015 5.024077 5.121594 4.921965 5.015766 5.185075 4.880304 5.045362
12 12 4.944416 5.024906 5.119663 4.925396 5.023136 5.183449 4.887840 5.044733
13 13 4.946751 5.020960 5.127302 4.923513 5.014100 5.186527 4.889140 5.048425
14 14 4.949517 5.011549 5.127794 4.925720 5.006624 5.188227 4.882128 5.055608
15 15 4.943008 5.013135 5.130486 4.930377 5.002825 5.194421 4.884593 5.051968
16 16 4.939554 5.021875 5.129392 4.930384 5.005527 5.197746 4.883358 5.052474
17 17 4.935909 5.019139 5.131258 4.922536 5.003273 5.204442 4.884018 5.059162
18 18 4.935830 5.022633 5.129389 4.927106 5.008391 5.210277 4.877859 5.054829
19 19 4.936171 5.025452 5.127276 4.927904 5.007995 5.206972 4.873620 5.054192
20 20 4.942925 5.018719 5.127394 4.929643 5.005699 5.202787 4.869454 5.055665
21 21 4.941351 5.014454 5.125727 4.932884 5.008633 5.205170 4.870352 5.047728
22 22 4.933846 5.019311 5.130156 4.923804 5.012874 5.213346 4.874263 5.056290
23 23 4.928815 5.021575 5.139077 4.923665 5.017180 5.211699 4.876333 5.056836
24 24 4.928739 5.024419 5.140386 4.925559 5.012995 5.214019 4.880025 5.055182
25 25 4.929357 5.025198 5.134391 4.930061 5.008571 5.217005 4.885442 5.062630
Original Answer
# Randomly group data into 20-row groups
set.seed(104)
data = data %>%
  mutate(set = sample(rep(1:(500/20), each = 20)))
head(data)
A B C D E F G H set
1 1.848823 6.920055 3.2283369 6.633721 6.794640 2.0288792 1.984295 2.09812642 10
2 7.023740 5.599569 0.4468325 5.198884 6.572196 0.9269249 9.700118 4.58840437 20
3 5.733263 3.426912 7.3168797 3.317611 8.301268 1.4466065 5.280740 0.09172101 19
4 1.680519 2.344975 4.9242313 6.163171 4.651894 2.2253335 1.175535 2.51299726 25
5 9.438393 4.296028 2.3563249 5.814513 1.717668 0.8130327 9.430833 0.68269106 19
6 9.434750 7.367007 1.2603451 5.952936 3.337172 5.2892300 5.139007 6.52763327 5
# Mean by set for each column
data %>%
  group_by(set) %>%
  summarise_all(mean)
set A B C D E F G H
1 1 5.240236 6.143941 4.638874 5.367626 4.982008 4.200123 5.521844 5.083868
2 2 5.520983 5.257147 5.209941 4.504766 4.231175 3.642897 5.578811 6.439491
3 3 5.943011 3.556500 5.366094 4.583440 4.932206 4.725007 5.579103 5.420547
4 4 4.729387 4.755320 5.582982 4.763171 5.217154 5.224971 4.972047 3.892672
5 5 4.824812 4.527623 5.055745 4.556010 4.816255 4.426381 3.520427 6.398151
6 6 4.957994 7.517130 6.727288 4.757732 4.575019 6.220071 5.219651 5.130648
7 7 5.344701 4.650095 5.736826 5.161822 5.208502 5.645190 4.266679 4.243660
8 8 4.003065 4.578335 5.797876 4.968013 5.130712 6.192811 4.282839 5.669198
9 9 4.766465 4.395451 5.485031 4.577186 5.366829 5.653012 4.550389 4.367806
10 10 4.695404 5.295599 5.123817 5.358232 5.439788 5.643931 5.127332 5.089670
# ... with 15 more rows
If the total number of rows in the data frame is not divisible by the number of rows you want in each set, then you can do the following when you create the sets:
data = data %>%
  mutate(set = sample(rep(1:ceiling(500/20), each = 20))[1:n()])
In this case, the set sizes will vary a bit when the number of data rows is not divisible by the desired number of rows per set.
The following approach could be worth trying for someone in a similar position.
It is based on the numerical balancing in groupdata2's fold() function, which allows creating groups with balanced means for a single column. By standardizing each of the columns and numerically balancing their rowwise sum, we might increase the chance of getting balanced means in the individual columns.
I compared this approach to creating groups randomly a few times and selecting the split with the least variance in means. It seems to be a bit better, but I'm not too convinced that this will hold in all contexts.
# Attach dplyr and groupdata2
library(dplyr)
library(groupdata2)
set.seed(1)

# Create the dataset
data <- matrix(runif(4000, min = 0, max = 10), nrow = 500, ncol = 8)
colnames(data) <- c("A", "B", "C", "D", "E", "F", "G", "H")
data <- dplyr::as_tibble(data)

# Standardize all columns and calculate row sums
data_std <- data %>%
  dplyr::mutate_all(.funs = function(x){(x - mean(x)) / sd(x)}) %>%
  dplyr::mutate(total = rowSums(across(where(is.numeric))))

# Create groups (new column called ".folds")
# We numerically balance the "total" column
data_std <- data_std %>%
  groupdata2::fold(k = 25, num_col = "total")  # k = 500/20 = 25

# Transfer the groups to the original (non-standardized) data frame
data$group <- data_std$.folds

# Check the means
data %>%
  dplyr::group_by(group) %>%
  dplyr::summarise_all(.funs = mean)
> # A tibble: 25 x 9
>    group     A     B     C     D     E     F     G     H
>    <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
>  1 1      4.48  5.05  4.80  5.65  5.04  4.60  5.12  4.85
>  2 2      5.57  5.17  3.21  5.46  4.46  5.89  5.06  4.79
>  3 3      4.33  6.02  4.57  6.18  4.76  3.79  5.94  3.71
>  4 4      4.51  4.62  4.62  5.27  4.65  5.41  5.26  5.23
>  5 5      4.55  5.10  4.19  5.41  5.28  5.39  5.57  4.23
>  6 6      4.82  4.74  6.10  4.34  4.82  5.08  4.89  4.81
>  7 7      5.88  4.49  4.13  3.91  5.62  4.75  5.46  5.26
>  8 8      4.11  5.50  5.61  4.23  5.30  4.60  4.96  5.35
>  9 9      4.30  3.74  6.45  5.60  3.56  4.92  5.57  5.32
> 10 10     5.26  5.50  4.35  5.29  4.53  4.75  4.49  5.45
> # … with 15 more rows
# Check the standard deviations of the means
# (could be used to compare methods)
data %>%
  dplyr::group_by(group) %>%
  dplyr::summarise_all(.funs = mean) %>%
  dplyr::summarise(across(where(is.numeric), sd))
> # A tibble: 1 x 8
>       A     B     C     D     E     F     G     H
>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
> 1 0.496 0.546 0.764 0.669 0.591 0.611 0.690 0.475
It might be best, though, to compare the means and mean variances (or standard deviations, as above) of the different methods on the standardized data. In that case, one could calculate the sum of the variances and minimize it:
data_std %>%
  dplyr::select(-total) %>%
  dplyr::group_by(.folds) %>%
  dplyr::summarise_all(.funs = mean) %>%
  dplyr::summarise(across(where(is.numeric), sd)) %>%
  sum()
> 1.643989
Comparing multiple balanced splits
The fold() function allows creating multiple unique grouping factors (splits) at once. So here, I will perform the numerically balanced split 20 times and find the grouping with the lowest sum of the standard deviations of the means. I'll further convert it to a function.
create_multi_balanced_groups <- function(data, cols, k, num_tries){

  # Extract the variables of interest
  # We assume these are numeric but we could add a check
  data_to_balance <- data[, cols]

  # Standardize all columns
  # and calculate rowwise sums
  data_std <- data_to_balance %>%
    dplyr::mutate_all(.funs = function(x){(x - mean(x)) / sd(x)}) %>%
    dplyr::mutate(total = rowSums(across(where(is.numeric))))

  # Create `num_tries` unique numerically balanced splits
  data_std <- data_std %>%
    groupdata2::fold(
      k = k,
      num_fold_cols = num_tries,
      num_col = "total"
    )

  # The new fold column names: ".folds_1", ".folds_2", etc.
  fold_col_names <- paste0(".folds_", seq_len(num_tries))

  # Remove the total column
  data_std <- data_std %>%
    dplyr::select(-total)

  # Calculate a score for each split
  # (this could probably be done more efficiently without a for loop)
  variance_scores <- c()
  for (fcol in fold_col_names){
    score <- data_std %>%
      dplyr::group_by(!!as.name(fcol)) %>%
      dplyr::summarise(across(where(is.numeric), mean)) %>%
      dplyr::summarise(across(where(is.numeric), sd)) %>%
      sum()
    variance_scores <- append(variance_scores, score)
  }

  # Get the fold column with the lowest score
  lowest_fcol_index <- which.min(variance_scores)
  best_fcol <- fold_col_names[[lowest_fcol_index]]

  # Add the best fold column / grouping factor to the original data
  data[["group"]] <- data_std[[best_fcol]]

  # Return the original data and the score of the best fold column
  list(data, min(variance_scores))
}
# Run with 20 splits
set.seed(1)
data_grouped_and_score <- create_multi_balanced_groups(
  data = data,
  cols = c("A", "B", "C", "D", "E", "F", "G", "H"),
  k = 25,
  num_tries = 20
)
# Check data
data_grouped_and_score[[1]]
> # A tibble: 500 x 9
> A B C D E F G H group
> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
> 1 5.86 6.54 0.500 2.88 5.70 9.67 2.29 3.01 2
> 2 0.0895 4.69 5.71 0.343 8.95 7.73 5.76 9.58 1
> 3 2.94 1.78 2.06 6.66 9.54 0.600 4.26 0.771 16
> 4 2.77 1.52 0.723 8.11 8.95 1.37 6.32 6.24 7
> 5 8.14 2.49 0.467 8.51 0.889 6.28 4.47 8.63 13
> 6 2.60 8.23 9.17 5.14 2.85 8.54 8.94 0.619 23
> 7 7.24 0.260 6.64 8.35 8.59 0.0862 1.73 8.10 5
> 8 9.06 1.11 6.01 5.35 2.01 9.37 7.47 1.01 1
> 9 9.49 5.48 3.64 1.94 3.24 2.49 3.63 5.52 7
> 10 0.731 0.230 5.29 8.43 5.40 8.50 3.46 1.23 10
> # … with 490 more rows
# Check score
data_grouped_and_score[[2]]
> 1.552656
By commenting out the num_col = "total" line, we can run this without the numerical balancing. For me, this gave a score of 1.615257.
Disclaimer: I am the author of the groupdata2 package. The fold() function can also balance a categorical column (cat_col) and keep all data points with the same ID in the same fold (id_col) (e.g. to avoid leakage in cross-validation). There's a very similar partition() function as well.
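For illustration, balancing a categorical column while keeping all rows with the same ID in the same fold might look like the sketch below (patient_data, diagnosis, and participant are hypothetical names):
folded <- groupdata2::fold(
  patient_data,           # hypothetical data frame
  k = 5,
  cat_col = "diagnosis",  # balance the distribution of this factor across folds
  id_col = "participant"  # keep all rows of a participant in one fold
)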

R - dplyr lag function

I am trying to calculate the absolute difference between lagged values over several columns. The first row of the resulting data set is NA, which is correct because there is no previous value to calculate the lag from. What I don't understand is why the lag isn't calculated for the last value. Note that the last value in the example below (temp) is the difference between the 2nd-to-last and the 3rd-to-last values; the difference between the last and 2nd-to-last values is missing.
library(tidyverse)
library(purrr)
dim(mtcars)  # 32 rows
temp <- map_df(mtcars, ~ abs(diff(lag(.x))))
names(temp) <- paste(names(temp), '.abs.diff.lag', sep = '')
dim(temp)  # 31 rows
It would be an awesome bonus if someone could show me how to pipe the renaming step; I played around with paste and enquo. The real dataset is too long to do a gather / new column name / spread approach.
Thanks in advance!
EDIT: added the libraries needed to run the script
I think the lag call in your existing code is unnecessary, as diff calculates the lagged difference automatically (although perhaps I don't properly understand what you are trying to do). You can also use rename_all to add a suffix to all the variable names.
library(purrr)
library(dplyr)
mtcars %>%
  map_df(~ abs(diff(.x))) %>%
  rename_all(funs(paste0(., ".abs.diff.lag")))
#> # A tibble: 31 x 11
#>    mpg.abs.diff.lag cyl.abs.diff.lag disp.abs.diff.lag hp.abs.diff.lag
#>               <dbl>            <dbl>             <dbl>           <dbl>
#>  1              0.0                0               0.0               0
#>  2              1.8                2              52.0              17
#>  3              1.4                2             150.0              17
#>  4              2.7                2             102.0              65
#>  5              0.6                2             135.0              70
#>  6              3.8                2             135.0             140
#>  7             10.1                4             213.3             183
#>  8              1.6                0               5.9              33
#>  9              3.6                2              26.8              28
#> 10              1.4                0               0.0               0
#> # ... with 21 more rows, and 7 more variables: drat.abs.diff.lag <dbl>,
#> #   wt.abs.diff.lag <dbl>, qsec.abs.diff.lag <dbl>, vs.abs.diff.lag <dbl>,
#> #   am.abs.diff.lag <dbl>, gear.abs.diff.lag <dbl>, carb.abs.diff.lag <dbl>
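On newer dplyr releases, where funs() is deprecated, rename_with() should give the same result (a sketch):
mtcars %>%
  map_df(~ abs(diff(.x))) %>%
  rename_with(~ paste0(.x, ".abs.diff.lag"))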
Maybe something like this:
dataCars <- mtcars %>%
  mutate(diffMPG = abs(mpg - lag(mpg)),
         diffHP = abs(hp - lag(hp)))
And then do the same for all the columns you are interested in.
I was not able to reproduce your issue with the lag function. When I execute your sample code, I get a data frame consisting of 31 rows, exactly as you mentioned, but the first row is not NA; it is already the difference between the 1st and 2nd rows.
Regarding your bonus question, the answer is provided here:
temp <- map_df(mtcars, ~ abs(diff(lag(.x)))) %>% setNames(paste0(names(.), '.abs.diff.lag'))
This should result in the desired column naming.
