Creating Groups based on Column Position - r

Good afternoon!
I think this is a pretty straightforward question, but I am missing a couple of steps. I would like to create groups based on column position.
I am working with a dataframe / tibble that is 33 rows long and 66 columns wide. Every sequence of 6 columns should really be separated into its own sub-dataframe / tibble.
The order of the columns is arbitrary; only their position matters. Below is an attempt with mtcars, where I am trying to group every 2 columns into its own sub-dataframe.
mtcars %>%
  tibble() %>%
  group_by(across(seq(1, 2, length.out = 11))) %>%
  nest()
However, that method generates errors. Something similar happens when working just within nest() as well.
Using mtcars, I would like to create groups for every 3 columns, or some other number.
I would ultimately like the mtcars dataframe to be...
Columns 1:3 to be group 1,
Columns 4:6 to be group 2,
Columns 7:9 to be group 3, etc., while retaining the information for the rows in each column.
Also considered something with pivot_longer...
mtcars %>%
  tibble() %>%
  pivot_longer(cols = seq(1, 3, by = 1))
...but that did not generate defined groups, or continue the sequencing along all columns of the dataframe.
Hope one of you can help me with this! It would make certain tasks at work much easier.
PS - A plus if you can keep the workflow tidyverse-centric :)

You could try this. It splits the dataframe into a list of dataframes based on the number of columns you want (3 in your example):
library(tidyverse)

list_of_dataframes <- mtcars %>%
  tibble() %>%
  mutate(row = row_number()) %>%                 # remember each row's position
  pivot_longer(-row) %>%                         # one row per (row, column) pair
  group_by(row) %>%
  mutate(group = ceiling(row_number() / 3)) %>%  # every 3 columns -> one group
  ungroup() %>%
  group_split(group) %>%
  map(
    ~ select(.x, row, name, value) %>%
      pivot_wider()                              # back to wide, one tibble per group
  )
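As a quick sanity check (not part of the original answer), you can inspect which of the original columns landed in each element. Each tibble keeps the helper row column plus up to three of the original columns; since mtcars has 11 columns, the last group only gets 2:
map(list_of_dataframes, names)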
EDIT
Here, based on comments from the question asker, we avoid pivoting the data. Instead, we map over the starting position of each group of columns.
list_of_dataframes <- map(
  seq(1, ncol(mtcars), by = 3),
  ~ mtcars %>%
    as_tibble() %>%
    select(all_of(.x:min(c(.x + 2, ncol(mtcars)))))
)
We can then wrap this in a function to make it a little easier to use and change group sizes and dataframes:
group_split_cols <- function(.data, ncols_per_group) {
  map(
    seq(1, ncol(.data), by = ncols_per_group),
    ~ .data %>%
      as_tibble() %>%
      select(all_of(.x:min(c(.x + ncols_per_group - 1, ncol(.data)))))
  )
}

list_of_dataframes <- group_split_cols(.data = mtcars, ncols_per_group = 3)
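Applied to the situation in the question (a 33 x 66 tibble split into 6-column chunks), the call would look like this; your_66_col_df is a stand-in for the asker's actual data:
# your_66_col_df is hypothetical; substitute your own 33 x 66 tibble
chunks <- group_split_cols(.data = your_66_col_df, ncols_per_group = 6)
length(chunks)  # 11 sub-tibbles of 6 columns each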

Related

Extract (or isolate) 'group-wise constant' columns from a data frame, *using dplyr/tidyverse*

How can I extract (or isolate) group-wise constant columns from a data frame, using dplyr/tidyverse?
This is an update of Dowle/Hadley's decades-old question here. The earlier poster's example...
Using a contrived example from iris (to generate a dataset with columns that are constant by group for this example):
irisX <- iris %>% mutate(
  numspec = as.numeric(Species),
  numspec2 = numspec * 2
)
Now I want to generate a dataset that keeps the columns Species, numspec, and numspec2 only (and keeps only one row for each).
And I don't want to have to tell it which columns these are (constant by group) -- I want it to find these for me.
So what I want is
Species      numspec  numspec2
setosa             1         2
versicolor         2         4
virginica          3         6
Unlike in the older linked question I want to do something using the tidyverse so I can understand it better and the code looks cleaner.
I tried something like
single_iris <- irisX %>%
  group_by(Species) %>%
  select_if(function(.) n_distinct(.) == 1)
But select_if ignores the groupings.
If we want to use select, do it outside of the grouping:
library(dplyr)

irisX %>%
  select(where(~ n_distinct(.) == n_distinct(irisX$Species))) %>%
  distinct()
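Note that this keeps any column whose overall distinct-value count happens to equal the number of species. A genuinely group-wise check (exactly one distinct value within every group) could be sketched as follows, assuming purrr is also loaded:
library(purrr)

# one distinct-value count per group and column
constant_cols <- irisX %>%
  group_by(Species) %>%
  summarise(across(everything(), n_distinct)) %>%
  select(-Species) %>%
  map_lgl(~ all(.x == 1))

irisX %>%
  select(Species, all_of(names(constant_cols)[constant_cols])) %>%
  distinct()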
You could do:
iris %>%
  group_by(Species) %>%
  summarise(numspec = as.numeric(first(Species)),
            numspec2 = numspec * 2)

2 Numeric Values In A Dataframe Field In R

I have a dataset in R with a little under 100 columns.
Some of the columns have numeric values stored as expressions such as 87+3, as opposed to 90.
I have been able to update each column with the following piece of code:
library(dplyr)

new_dataframe <- dataframe %>%
  rowwise() %>%
  mutate(new_value = eval(parse(text = value)))
However, I would like to update a list of 60 columns without simply repeating this line for each column.
Can someone help me find a more efficient way?
We can use mutate_at:
library(dplyr)

dataframe %>%
  rowwise() %>%
  mutate_at(1:60, list(new_value = ~ eval(parse(text = .))))
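In more recent dplyr (1.0+), across() supersedes mutate_at. An equivalent sketch, assuming the 60 target columns are the first 60 (adjust the selection to your actual columns):
dataframe %>%
  rowwise() %>%
  mutate(across(1:60, ~ eval(parse(text = .x)),
                .names = "{.col}_new_value")) %>%
  ungroup()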

Standardize data columns in R in subgroups

I'm struggling with standardization of data columns in R in subgroups.
I created the data frame:
df <- data.frame(
  salesPerson = sample(c('Alan', 'Bob', 'Cindy'), 20, replace = TRUE),
  quater = sample(c('Q1', 'Q2', 'Q3'), 20, replace = TRUE),
  salesValue = runif(20, 5.0, 7.5)
)
I would like to add an additional column to the data frame with scaled values of sales.
To scale the whole column I can use:
df$salesValueScaled <- scale(df$salesValue)
The problem is that I would like to scale sales separately for each combination of the columns salesPerson and quater. Something like:
df$salesValueScaled <- scale(df$salesValue, by = c(df$salesPerson, df$quater))
I have been searching this forum but couldn't find a solution to this problem.
Thank you in advance for your help.
You can use dplyr for this:
library(dplyr)
new_df <- df %>%
  group_by(salesPerson, quater) %>%
  # as.numeric() drops the one-column matrix that scale() returns
  mutate(scaled_Col = as.numeric(scale(salesValue))) %>%
  ungroup()
To work around groups that return NA (scale() of a single value yields NA, since the standard deviation is undefined), you can either keep the original values as they are or filter those groups out before scaling:
Keeping the original values (scaling only when the group has more than one row):
new_df <- df %>%
  group_by(salesPerson, quater) %>%
  # if ()/else rather than ifelse(): the condition has length 1, so
  # ifelse() would return only the first scaled value for the group
  mutate(scaled_Col = if (n() > 1) as.numeric(scale(salesValue)) else salesValue) %>%
  ungroup()
Filtering them out (as suggested by @steveb):
new_df <- df %>%
  group_by(salesPerson, quater) %>%
  filter(n() > 1) %>%
  mutate(scaled_Col = as.numeric(scale(salesValue))) %>%
  ungroup()
I hope this helps.

Run function on all pairs of objects in column of data frame

Suppose I have a data frame with factor "subject", and continuous variables "a" and "b". For each level of subject, I create a distance matrix from a and b:
data %>%
  group_by(subject) %>%
  select(a, b) %>%
  do(dmat = as.matrix(dist(.)))
This returns an n-by-2 data frame, with subject and dmat as columns. What I would like to do is compute the matrix norm of each pairwise difference. Something along the lines of:
norm(data$dmat[[1]]-data$dmat[[2]])
norm(data$dmat[[1]]-data$dmat[[3]])
# etc etc
Ideally, I'd get out an n^2-by-3 data frame, with the first two columns indicating the two subject levels that are being compared, and the third column containing this norm calculation.
Apologies for not providing a sample dataset. I'm hoping the answer is simple enough, but if one is required I will try to write some code to generate one.
You can use mapply for this.
data %>%
  group_by(subject) %>%
  select(a, b) %>%
  do(dmat = as.matrix(dist(.))) %>%
  ungroup() %>%
  do(data.frame(s1 = rep(.$subject, each = nrow(.)),
                s2 = rep(.$subject, times = nrow(.)),
                dist = mapply(rep(.$dmat, each = nrow(.)),
                              rep(.$dmat, times = nrow(.)),
                              FUN = function(x, y) norm(x - y))))
I would probably find the matrix representation of this result easier to understand:
data %>%
  group_by(subject) %>%
  select(a, b) %>%
  do(dmat = as.matrix(dist(.))) %>%
  ungroup() %>%
  do(data.frame(matrix(mapply(rep(.$dmat, each = nrow(.)),
                              rep(.$dmat, times = nrow(.)),
                              FUN = function(x, y) norm(x - y)),
                       nrow = nrow(.))))
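For what it's worth, the same pairwise computation can be written with current tidyr/purrr idioms instead of do(). This is a sketch, and like the code above it assumes every subject has the same number of rows so that the matrices conform:
library(tidyverse)

dmats <- data %>%
  group_by(subject) %>%
  select(a, b) %>%
  do(dmat = as.matrix(dist(.))) %>%
  ungroup()

# all ordered pairs of subjects, with the norm of each difference
expand_grid(i = seq_len(nrow(dmats)), j = seq_len(nrow(dmats))) %>%
  mutate(s1 = dmats$subject[i],
         s2 = dmats$subject[j],
         dist = map2_dbl(dmats$dmat[i], dmats$dmat[j],
                         ~ norm(.x - .y)))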

How to extract one specific group in dplyr

Given a grouped tbl, can I extract one or a few groups?
Such a function can be useful when prototyping code, e.g.:
mtcars %>%
  group_by(cyl) %>%
  select_first_n_groups(2) %>%
  do({ 'complicated expression' })
Surely, one can do an explicit filter before grouping, but that can be cumbersome.
With a bit of dplyr along with some nesting/unnesting (supported by the tidyr package), you could build a small helper to get the first (or any) group:
first <- function(x) x %>% nest() %>% ungroup() %>% slice(1) %>% unnest(data)

mtcars %>% group_by(cyl) %>% first()
By adjusting the slicing you could also extract the nth or any range of groups by index, but typically the first or the last is what most users want.
The name is inspired by functional APIs which all call it first (see stdlibs of i.e. kotlin, python, scala, java, spark).
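Following the same nesting pattern, the select_first_n_groups() helper the question imagines could be sketched like this (my naming, not a dplyr function); note the result is no longer grouped, so re-apply group_by() before a do() step:
select_first_n_groups <- function(x, n) {
  # keep the first n groups' rows, then flatten back to a tibble
  x %>% nest() %>% ungroup() %>% slice(seq_len(n)) %>% unnest(data)
}

mtcars %>% group_by(cyl) %>% select_first_n_groups(2)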
Edit: Faster Version
A more scalable version (>50x faster on large datasets) that avoids nesting would be
first_group <- function(x) x %>%
  select(group_cols()) %>%
  distinct() %>%
  ungroup() %>%
  slice(1) %>%
  { semi_join(x, .) }
Another positive side effect of this improved version is that it fails if no grouping is present in x.
Try this where groups is a vector of group numbers. Here 1:2 means the first two groups:
select_groups <- function(data, groups, ...)
  data[sort(unlist(attr(data, "indices")[groups])) + 1, ]

mtcars %>% group_by(cyl) %>% select_groups(1:2)
The selected rows appear in the original order. If you prefer that the rows appear in the order that the groups are specified (e.g. in the above example, the rows of the first group followed by the rows of the second group), then remove the sort.
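One caveat: attr(data, "indices") was a dplyr internal that was removed in dplyr 0.8; current versions expose the same information through the group_rows() accessor, whose indices are already 1-based. A sketch of the equivalent helper (assuming dplyr >= 0.8):
select_groups <- function(data, groups, ...) {
  # group_rows() returns a list of row-index vectors, one per group
  data[sort(unlist(group_rows(data)[groups])), ]
}

mtcars %>% group_by(cyl) %>% select_groups(1:2)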
