Suppose I have a data frame with factor "subject", and continuous variables "a" and "b". For each level of subject, I create a distance matrix from a and b:
data %>%
group_by(subject) %>%
select(a,b) %>%
do(dmat = as.matrix(dist(.)))
This returns an n-by-2 data frame, with subject and dmat as columns. What I would like to do matrix norms of each pairwise subtraction. Something along the lines of:
norm(data$dmat[[1]]-data$dmat[[2]])
norm(data$dmat[[1]]-data$dmat[[3]])
# etc etc
Ideally, I'd get out an n^2-by-3 data frame, with the first two columns indicating the two subject levels that are being compared, and the third column containing this norm calculation.
Apologies for not providing a sample dataset. I'm hoping the answer is simple enough, but if one is required I will try to write some code to generate one.
You can use mapply for this.
data %>%
group_by(subject) %>%
select(a,b) %>%
do(dmat = as.matrix(dist(.))) %>%
ungroup %>%
do(data.frame(s1 = rep(.$subject, each=nrow(.)),
s2 = rep(.$subject, times=nrow(.)),
dist = mapply(rep(.$dmat, each=nrow(.)),
rep(.$dmat, times=nrow(.)),
FUN=function(x, y) norm(x-y))))
I would probably find the matrix representation of this result easier to understand:
data %>%
group_by(subject) %>%
select(a,b) %>%
do(dmat = as.matrix(dist(.))) %>%
ungroup %>%
do(data.frame(matrix(mapply(rep(.$dmat, each=nrow(.)),
rep(.$dmat, times=nrow(.)),
FUN=function(x, y) norm(x-y)) , nrow=nrow(.))))
Related
I am having trouble doing cross-validation for a hierarchical dataset. There is a level 2 factor ("ID") that needs to be equally represented in each subset. For this dataset, there are 157 rows and 28 IDs. I want to divide my data up into five subsets, each containing 31 rows, where each of the 28 IDs is represented (a stand can be repeated within a subset).
I have gotten as far as:
library(dplyr)
df %>%
group_by(ID) %>%
and have no clue where to take it from there. Any help is appreciated!
Here's what I'd do: assign one row from each ID randomly to each of the 5 subsets, and then distribute the leftovers fully randomly. Without sample data this is untested, but it should at least get you on the right track.
df %>%
group_by(ID) %>%
mutate(
random_rank = rank(runif(n())),
strata = ifelse(random_rank <= 5, random_rank, sample(1:5, size = n(), replace = TRUE))
) %>%
select(-random_rank) %>%
ungroup()
This should create a strata column as described above. If you want to split the data into a list of data frames for each strata, ... %>% group_by(strata) %>% group_split().
Good afternoon!
I think this is pretty straight forward question, but I think I am missing a couple of steps. Would like to create groups based on column position.
Am working with a dataframe / tibble; 33 rows long, and 66 columns wide. However, every sequence of 6 columns, should really be separated into its own sub-dataframe / tibble.
The sequence of the number columns is arbitrary to the dataframe. Below is an attempt with mtcars, where I am trying to group every 2 columns into its own sub-dataframe.
mtcars %>%
tibble() %>%
group_by(across(seq(1,2, length.out = 11))) %>%
nest()
However, that method generates errors. Something similar applies when working just within nest() as well.
Using mtcars, would like to create groups using a sequence for every 3 columns, or some other number.
Would ultimately like the mtcars dataframe to be...
Columns 1:3 to be group 1,
Columns 4:6 to be group 2,
Columns 7:9 to be group 3, etc... while retaining the information for the rows in each column.
Also considered something with pivot_longer...
mtcars %>%
tibble() %>%
pivot_longer(cols = seq(1,3, by = 1))
...but that did not generate defined groups, or continue the sequencing along all columns of the dataframe.
Hope one of you can help me with this! Would make certain tasks for work much easier.
PS - A plus if you can keep the workflow to tidyverse centric code :)
You could try this. It splits the dataframe into a list of dataframes based on the number of columns you want (3 in your example):
library(tidyverse)
list_of_dataframes <- mtcars %>%
tibble() %>%
mutate(row = row_number()) %>%
pivot_longer(-row) %>%
group_by(row) %>%
mutate(group = ceiling(row_number()/ 3)) %>%
ungroup() %>%
group_split(group) %>%
map(
~select(.x, row, name, value) %>%
pivot_wider()
)
EDIT
Here, based on comments from the question asker, we will avoid pivoting the data. Instead, we map the groups across the dataframe.
list_of_dataframes <- map(seq(1, ncol(mtcars), by = 3),
~mtcars %>%
as_tibble() %>%
select(all_of(.x:min(c(.x+2, ncol(mtcars))))))
We can then wrap this in a function to make it a little easier to use and change group sizes and dataframes:
group_split_cols <- function(.data, ncols_per_group){
map(seq(1, ncol(.data), by = ncols_per_group),
~.data %>%
as_tibble() %>%
select(all_of(.x:min(c(.x+ncols_per_group-1, ncol(.data))))))
}
list_of_dataframes <- group_split_cols(.data = mtcars, ncols_per_group = 3)
I am trying to create summary statistics without losing column values. For example using the iris dataset, I want to group_by the species and find the summary statistics, such as the sd and mean.
Once I have done this and I want to add this back to the original dataset. How can I can do this, I can only do the first step.
library("tidyverse")
data <- (iris)
data<-data %>%
group_by(Species) %>%
summarise(mean.iris=mean(Sepal.Length), sd.iris=sd(Sepal.Length))
this looks like this
I want to then add the result of mean and sd to the original iris data, this is so that I can get the z score for each individual row if it belongs to that species.
For further explanation; essentially create groups by the species and then find z score of each individual plant based on their species.
Though there already is an accepted answer, here is a way of computing the Z scores for all numeric variables.
library(dplyr)
library(stringr)
iris %>%
group_by(Species) %>%
mutate(across(where(is.numeric), scale)) %>%
rename_with(~str_c(., "_Z"), where(is.numeric)) %>%
ungroup() %>%
left_join(iris, ., by = "Species") %>%
relocate(Species, .after = last_col())
You can use something like
library("tidyverse")
data <- (iris)
df <- data %>%
group_by(Species) %>%
summarise(mean.iris=mean(Sepal.Length), sd.iris=sd(Sepal.Length))
data %>% left_join(df, by = "Species") %>%
mutate(Z = (Sepal.Length-mean.iris)/sd.iris)
I have been following this code here:
calculating diversity by groups
However, when I implement the code it returns only the values. I have tried cbind to put it into another dataframe, however, I am afraid that the rows do not match. Is there a way to run that code, which places it in the same dataframe so the rows match with where they were taken from..
following the code given in the answer linked what you need is to use right_join to link the diversity vector with the method_one data.frame :
diversity(aggregate(. ~ LOC_ID + week + year + POSTCODE , method_one, sum)[, 5:25], MARGIN=1, index="simpson") %>%
tibble(diversity=., ID=1:length(.)) %>%
right_join(method_one %>% mutate(ID=1:nrow(.)))
Explanation :
method_one %>% mutate(ID=1:nrow(.)) adds an ID column.
tibble(diversity=., ID=1:length(.)) turns the result of the diversity call into a tibble with an ID column.
right_join(x, y, by=NULL, ...) : return all rows from y, and all columns from x and y. Rows in y with no match in x will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned. if by==NULL it will use columns with matching names in this case ID.
Here is an option with dplyr
library(dplyr)
library(vegan)
method_one %>%
group_by(LOC_ID, week, year, POSTCODE) %>%
summarise(across(everything(), sum, na.rm = TRUE)) %>%
ungroup %>%
select(5:25) %>%
diversity %>%
tibble(diversity = .) %>%
bind_cols(method_one, .)
I have a large data frame with on every rows enough data to calculate a correlation using specific columns of this data frame and add a new column containing the correlations calculated.
Here is a summary of what I would like to do (this one using dplyr):
example_data %>%
mutate(pearsoncor = cor(x = X001_F5_000_A:X030_F5_480_C, y = X031_H5_000_A:X060_H5_480_C))
Obviously it is not working this way as I get only NA's in the pearsoncor column, does anyone has a suggestion? Is there an easy way to do this?
Best,
Example data frame
With tidyr, you can gather separately all x- and y-variables, you'd like to compare. You get a tibble containing the correlation coefficients and their p-values for every combination you provided.
library(dplyr)
library(tidyr)
example_data %>%
gather(x_var, x_val, X001_F5_000_A:X030_F5_480_C) %>%
gather(y_var, y_val, X031_H5_000_A:X060_H5_480_C) %>%
group_by(x_var, y_var) %>%
summarise(cor_coef = cor.test(x_val, y_val)$estimate,
p_val = cor.test(x_val, y_val)$p.value)
edit, update some years later:
library(tidyr)
library(purrr)
library(broom)
library(dplyr)
longley %>%
pivot_longer(GNP.deflator:Armed.Forces, names_to="x_var", values_to="x_val") %>%
pivot_longer(Population:Employed, names_to="y_var", values_to="y_val") %>%
nest(data=c(x_val, y_val)) %>%
mutate(cor_test = map(data, ~cor.test(.x$x_val, .x$y_val)),
tidied = map(cor_test, tidy)) %>%
unnest(tidied)
Here is a solution using the reshape2 package to melt() the data frame into long form so that each value has its own row. The original wide-form data has 60 values per row for each of the 6 genes, while the melted long-form data frame has 360 rows, one for each value. Then we can easily use summarize() from dplyr to calculate the correlations without loops.
library(reshape2)
library(dplyr)
names1 <- names(example_data)[4:33]
names2 <- names(example_data)[34:63]
example_data_longform <- melt(example_data, id.vars = c('Gene','clusterFR','clusterHR'))
example_data_longform %>%
group_by(Gene, clusterFR, clusterHR) %>%
summarize(pearsoncor = cor(x = value[variable %in% names1],
y = value[variable %in% names2]))
You could also generate more detailed results, as in Eudald's answer, using do():
detailed_r <- example_data_longform %>%
group_by(Gene, clusterFR, clusterHR) %>%
do(cor = cor.test(x = .$value[.$variable %in% names1],
y = .$value[.$variable %in% names2]))
This outputs a tibble with the cor column being a list with the results of cor.test() for each gene. We can use lapply() to extract output from the list.
lapply(detailed_r$cor, function(x) c(x$estimate, x$p.value))
I had the same problem a few days back, and I know loops are not optimal in R but that's the only thing I could think of:
df$r = rep(0,nrow(df))
df$cor_p = rep(0,nrow(df))
for (i in 1:nrow(df)){
ct = cor.test(as.numeric(df[i,cols_A]),as.numeric(df[i,cols_B]))
df$r[i] = ct$estimate
df$cor_p[i] = ct$p.value
}