Format multilevel group_by in R

In R, when I run this group_by code, I obtain this result.
df <- tibble(y=c('a','a','a', 'b','b','b','b','b'), z=c(1,1,1,1,1,1,2,2))
df %>% group_by(z,y) %>% summarise(n())
z y n()
1 a 3
1 b 3
2 b 2
Is there a way to make it look like this?
z y n()
1 a 3
b 3
2 b 2
My goal is to have the formatting look the way it does in Pandas, where the multilevel index isn't repeated on each row.

Here's one possibility:
df <- tibble(y=c('a','a','a', 'b','b','b','b','b','a','b'), z=c(1,1,1,1,1,1,2,2,3,3))
df2 <-
df %>%
group_by(z,y) %>%
summarise(n = n()) %>%
group_by(z) %>%
mutate(z2 = if_else(row_number() == 1, as.character(z), " "), y, n) %>%
ungroup() %>%
transmute(z = z2, y, n)
df2 %>%
knitr::kable()
I'm having trouble thinking of ways to do this that don't involve grouping by the z column and finding the first row of each group. Unfortunately, that means you need a couple of extra steps, because a grouping variable can't be modified inside the mutate call.
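One possible way around the second grouping step, sketched here as an alternative rather than taken from the answer above, is to blank out repeated values with duplicated(). It assumes the rows are already sorted by z, which count() does by default; the names counts and display are illustrative only.

```r
library(dplyr)

df <- tibble(
  y = c("a", "a", "a", "b", "b", "b", "b", "b", "a", "b"),
  z = c(1, 1, 1, 1, 1, 1, 2, 2, 3, 3)
)

# count() gives the same counts as group_by(z, y) %>% summarise(n = n()),
# but returns an ungrouped tibble sorted by z and y
counts <- df %>%
  count(z, y)

# Keep the first occurrence of each z and replace the repeats with an empty string
display <- counts %>%
  mutate(z = if_else(duplicated(z), "", as.character(z)))

display
```

Because z is no longer a grouping variable at that point, it can be overwritten directly in mutate.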

Related

Summing across in a dataframe with condition coming from another column

This is not a very good title for the question. I want to sum across certain columns in a data frame for each group, excluding one column for each of my groups. A simple example would be as follows:
df <- tibble(group_name = c("A", "B","C"), mean_A = c(1,2,3), mean_B = c(2,3,4), mean_C=c(3,4,5))
df %>% group_by(group_name) %>% mutate(m1 = sum(across(contains("mean"))))
This creates column m1, which is the sum across mean_A, mean_B and mean_C for each group. What I want to do is exclude mean_A for group A, mean_B for group B and mean_C for group C. The following does not work though (not surprisingly).
df %>% group_by(group_name) %>% mutate(m1 = sum(across(c(contains("mean") & !contains(group_name)))))
Do you have an idea how I could do this? My original data contains many more groups, so it would be hard to do by hand.
Edit: I have tried the following, which solves it in a rudimentary fashion, but something (grepl, maybe) does not seem to work well here and I get the wrong result.
df %>% pivot_longer(!group_name) %>% mutate(value2 = case_when(grepl(group_name, name) ~ 0, TRUE ~ value)) %>% group_by(group_name) %>% summarise(m1 = sum(value2))
Edit 2: I found out what was wrong with the above. The version below works, but it still produces a lot of warnings, so I recommend following TarJae's answer below.
df %>% pivot_longer(!group_name) %>% group_by(group_name) %>% mutate(value2 = case_when(grepl(group_name, name) ~ 0, TRUE ~ value)) %>% group_by(group_name) %>% summarise(m1 = sum(value2))
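A likely source of those warnings is that grepl() uses only the first element when its pattern argument is a vector. The sketch below is one way to sidestep that, assuming every column to exclude is named exactly "mean_" followed by the group name: it compares the two columns element by element with == instead. The name res is illustrative.

```r
library(dplyr)
library(tidyr)

df <- tibble(
  group_name = c("A", "B", "C"),
  mean_A = c(1, 2, 3),
  mean_B = c(2, 3, 4),
  mean_C = c(3, 4, 5)
)

res <- df %>%
  pivot_longer(!group_name) %>%
  # == is vectorised, so each row is compared with its own group's column name
  mutate(value2 = if_else(name == paste0("mean_", group_name), 0, value)) %>%
  group_by(group_name) %>%
  summarise(m1 = sum(value2))

res
```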
Here is another option where you can just use group_name directly with the tidyselect helpers:
df %>%
rowwise() %>%
mutate(m1 = rowSums(select(across(starts_with("mean")), -ends_with(group_name)))) %>%
ungroup()
Output
group_name mean_A mean_B mean_C m1
<chr> <dbl> <dbl> <dbl> <dbl>
1 A 1 2 3 5
2 B 2 3 4 6
3 C 3 4 5 7
How it works
The row-wise output of across is a 1-row tibble containing only the variables that start with "mean".
select then drops, from the output of across, the variables whose names end with the value of group_name.
At this point you are left with a 1 x 2 tibble, which is then summed using rowSums.
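To make those steps concrete, the sketch below walks through them by hand for the first row only. It is an illustration of the explanation, not part of the original answer, and the names row1, means and kept are made up for the example.

```r
library(dplyr)

df <- tibble(
  group_name = c("A", "B", "C"),
  mean_A = c(1, 2, 3),
  mean_B = c(2, 3, 4),
  mean_C = c(3, 4, 5)
)

row1 <- df[1, ]                                     # the row for group "A"
means <- select(row1, starts_with("mean"))          # 1 x 3 tibble: mean_A, mean_B, mean_C
kept <- select(means, -ends_with(row1$group_name))  # drops mean_A, leaving a 1 x 2 tibble
m1 <- rowSums(kept)                                 # sum of mean_B and mean_C for this row

names(kept)
m1
```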
Here is one way we could do it:
We create a helper column to match the column names.
We set the value of a mean column to zero if its name matches the helper name.
Then we use transmute with select to calculate rowSums.
Finally we cbind column m1 to df:
library(dplyr)
df %>%
mutate(helper = paste0("mean_", group_name)) %>%
mutate(across(starts_with("mean"), ~ifelse(cur_column()==helper, 0, .))) %>%
transmute(m1 = select(., contains("mean")) %>%
rowSums()) %>%
cbind(df)
m1 group_name mean_A mean_B mean_C
1 5 A 1 2 3
2 6 B 2 3 4
3 7 C 3 4 5
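The same zero-out-then-sum idea can also be sketched in base R with a logical mask. This is an alternative illustration, not the answer's method, and it assumes every column to exclude is named exactly "mean_" plus the group name. The names mean_cols, m and own are illustrative.

```r
df <- data.frame(
  group_name = c("A", "B", "C"),
  mean_A = c(1, 2, 3),
  mean_B = c(2, 3, 4),
  mean_C = c(3, 4, 5)
)

mean_cols <- grep("^mean_", names(df), value = TRUE)
m <- as.matrix(df[mean_cols])

# own[i, j] is TRUE when column j is the "own" mean column of row i's group
own <- outer(paste0("mean_", df$group_name), mean_cols, "==")

m[own] <- 0                      # zero out each group's own column
df$m1 <- unname(rowSums(m))      # sum the remaining columns per row

df
```

With many groups this avoids a row-wise loop, since the mask is built in one vectorised call.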

R: unique column values, combine rows of second column

From a data frame I need a list of all unique values of one column. For a possible later check, we need to keep the information from a second column, combined into a single value for simplicity.
Sample data
df <- data.frame(id=c(1,3,1),source =c("x","y","z"))
df
id source
1 1 x
2 3 y
3 1 z
The desired outcome is
df2
id source
1 1 x,z
2 3 y
It should be pretty easy, but I still cannot find the proper function or syntax.
E.g. something like
df %>%
group_by(id) %>%
summarise(vlist = paste0(source, collapse = ","))
or
df %>%
distinct(id) %>%
summarise(vlist = paste0(source, collapse = ","))
What am I missing? Thanks for any advice!
You can use aggregate from stats to combine per group.
aggregate(source ~ id, df, paste, collapse = ",")
# id source
#1 1 x,z
#2 3 y
Using your code here is a solution:
library(dplyr)
df <- data.frame(id=c(1,3,1),source =c("x","y","z"))
df %>%
group_by(id) %>%
summarise(vlist = paste0(source, collapse = ",")) %>%
distinct(id, .keep_all = TRUE)
# A tibble: 2 x 2
id vlist
<dbl> <chr>
1 1 x,z
2 3 y
Your second approach doesn't work because you call distinct before you aggregate the data. Also, you need to use .keep_all = TRUE to also keep the other column.
Your first approach already returns one row per id after summarise, so the final distinct here only acts as a safeguard.
aggregate(source ~ id, df, toString)
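The two aggregate calls in this thread differ only in the separator they produce. A small sketch to compare them side by side (the names by_paste and by_tostring are illustrative):

```r
df <- data.frame(id = c(1, 3, 1), source = c("x", "y", "z"),
                 stringsAsFactors = FALSE)

# paste with collapse = "," joins without a space
by_paste <- aggregate(source ~ id, df, paste, collapse = ",")

# toString joins with a comma followed by a space
by_tostring <- aggregate(source ~ id, df, toString)

by_paste
by_tostring
```

If the combined column will be split again later, the exact separator matters, so it may be worth choosing one form and sticking to it.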

R run T-test/anova for each row with 2 groups with 3 samples

My dataset looks something like this:
df <- data.frame(compound = c("alanine ", "arginine", "asparagine", "aspartate"))
df <- matrix(rnorm(12*4), ncol = 12)
colnames(df) <- c("AC-1", "AC-2", "AC-3", "AM-1", "AM-2", "AM-3", "SC-1", "SC-2", "SC-3", "SM-1", "SM-2", "SM-3")
df <- data.frame(compound = c("alanine ", "arginine", "asparagine", "aspartate"), df)
df
compound AC.1 AC.2 AC.3 AM.1 AM.2 AM.3 SC.1 SC.2 SC.3 SM.1
1 alanine 1.18362683 -2.03779314 -0.7217692 -1.7569264 -0.8381042 0.06866567 0.2327702 -1.1558879 1.2077454 0.437707310
2 arginine -0.19610110 0.05361113 0.6478384 -0.1768597 0.5905398 -0.67945600 -0.2221109 1.4032349 0.2387620 0.598236199
3 asparagine 0.02540509 0.47880021 -0.1395198 0.8394257 1.9046667 0.31175358 -0.5626059 0.3596091 -1.0963363 -1.004673116
4 aspartate -1.36397906 0.91380826 2.0630076 -0.6817453 -0.2713498 -2.01074098 1.4619707 -0.7257269 0.2851122 -0.007027878
I want to perform a t-test for each row (compound) on the columns [2:4] as one, and [5:7] as one, and store all the p-values. Basically see if there is a difference between the AC group and AM group for each compound.
I am aware there is another topic on this, but I couldn't find a viable solution to my problem there.
PS. my real dataset has about 35000 rows (maybe it needs a different solution than only 4 rows)
After selecting the columns of interest, use pmap to apply t.test to each row, passing the first 3 and the next 3 observations as the two samples, and bind the extracted p-value to the original data as a new column.
library(tidyverse)
df %>%
select(AC.1:AM.3) %>%
pmap_dbl(~ c(...) %>%
{t.test(.[1:3], .[4:6])$p.value}) %>%
bind_cols(df, pval_AC_AM = .)
Or, after selecting the columns, gather them into 'long' format, separate the key, spread, apply t.test in summarise, and join the result with the original data.
df %>%
select(compound, AC.1:AM.3) %>%
gather(key, val, -compound) %>%
separate(key, into = c('key1', 'key2')) %>%
spread(key1, val) %>%
group_by(compound) %>%
summarise(pval_AC_AM = t.test(AC, AM)$p.value) %>%
right_join(df)
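gather and spread are superseded in current tidyr, so the same reshaping idea can be sketched with pivot_longer. This is an adaptation rather than the answer's code; it assumes the column names keep the "AC.1" pattern, and the names grp, rep_id, pvals and result are made up for the example.

```r
library(dplyr)
library(tidyr)

set.seed(1)
vals <- matrix(rnorm(12 * 4), ncol = 12)
colnames(vals) <- c("AC-1", "AC-2", "AC-3", "AM-1", "AM-2", "AM-3",
                    "SC-1", "SC-2", "SC-3", "SM-1", "SM-2", "SM-3")
df <- data.frame(compound = c("alanine", "arginine", "asparagine", "aspartate"), vals)

pvals <- df %>%
  select(compound, AC.1:AM.3) %>%
  # split "AC.1" into the group ("AC") and the replicate id ("1")
  pivot_longer(-compound, names_to = c("grp", "rep_id"), names_sep = "\\.") %>%
  group_by(compound) %>%
  summarise(pval_AC_AM = t.test(value[grp == "AC"], value[grp == "AM"])$p.value)

# left_join keeps the original row order of df
result <- left_join(df, pvals, by = "compound")
```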
Update
If there are rows where the values are all identical, t.test throws an error. One option is to run t.test anyway and return NA for those cases. This can be done with possibly:
posttest <- possibly(function(x, y) t.test(x, y)$p.value, otherwise = NA_real_)
df %>%
select(AC.1:AM.3) %>%
pmap_dbl(~ c(...) %>%
{posttest(.[1:3], .[4:6])}) %>%
bind_cols(df, pval_AC_AM = .)
posttest(rep(3,5), rep(1, 5))
#[1] NA
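For comparison, the same error-tolerant row-wise test can be sketched in base R with apply and tryCatch. This is an alternative illustration, not the answer's code; safe_p is a hypothetical helper name.

```r
set.seed(1)
vals <- matrix(rnorm(12 * 4), ncol = 12)
colnames(vals) <- c("AC-1", "AC-2", "AC-3", "AM-1", "AM-2", "AM-3",
                    "SC-1", "SC-2", "SC-3", "SM-1", "SM-2", "SM-3")
df <- data.frame(compound = c("alanine", "arginine", "asparagine", "aspartate"), vals)

# Return the p-value, or NA when t.test fails (e.g. constant data)
safe_p <- function(x, y) {
  tryCatch(t.test(x, y)$p.value, error = function(e) NA_real_)
}

# Columns 2:4 are the AC group, columns 5:7 the AM group
df$pval_AC_AM <- apply(as.matrix(df[, 2:7]), 1,
                       function(r) safe_p(r[1:3], r[4:6]))

df[, c("compound", "pval_AC_AM")]
```

On a table with tens of thousands of rows this still calls t.test once per row, so a dedicated row-wise test package may be faster.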
If you can use an external library:
library(matrixTests)
row_t_welch(df[,2:4], df[,5:7])$pvalue
[1] 0.67667626 0.39501003 0.26678161 0.01237438

add x lagged value to a tbl [duplicate]

I have a tibble like this:
df <- tibble(value = rnorm(500))
How can I add x (e.g. x = 10) lagged values to this tbl (ideally in a dplyr pipe)? I want to add these lagged variables as new columns.
I can do it for a single lag:
lag_df <- df %>%
mutate(value_lag = lag(value, n = 1)) %>% # first lag
filter(!is.na(value_lag)) # remove NA
Doing it manually for 3 lags would look like this:
lag_df <- df %>%
mutate(value_lag1 = lag(value, n = 1)) %>% # first lag
mutate(value_lag2 = lag(value, n = 2)) %>% # second lag
mutate(value_lag3 = lag(value, n = 3)) %>% # third lag
filter(!is.na(value_lag1)) %>% # remove NA
filter(!is.na(value_lag2)) %>% # remove NA
filter(!is.na(value_lag3)) # remove NA
Not a complete dplyr solution, but one way is to create a column for each lagged value, cbind them to the original data frame, and remove the rows with NA values using na.omit():
library(dplyr)
cbind(df, sapply(1:10, function(x) lag(df$value, n = x))) %>%
na.omit()
An ugly attempt to keep it completely in tidyverse with my broken skills
library(tidyverse)
tibble(n=1:10) %>% mutate(output = map2(list(df),n ,function(x,y){
x %>% mutate(value = lag(value,y))
})) %>% spread(n,output) %>% unnest() %>% na.omit()
The base R method is much cleaner than this but there should definitely be a better way to do it than this.
And a bit shorter version
map2(list(df), 1:10, function(x, y) {
x %>% mutate(value = lag(value,y))
}) %>%
bind_cols() %>% na.omit()
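A variant that also gives each lag column a descriptive name can be sketched as follows. It is one possible arrangement, not the accepted approach; n_lags, lag_names and lag_cols are illustrative names, and the value_lag prefix is an assumption carried over from the question.

```r
library(dplyr)
library(purrr)

set.seed(42)
df <- tibble(value = rnorm(500))
n_lags <- 10

# Build a named list: value_lag1, value_lag2, ..., one lagged vector per entry
lag_names <- paste0("value_lag", seq_len(n_lags))
lag_cols <- map(set_names(seq_len(n_lags), lag_names),
                ~ dplyr::lag(df$value, n = .x))

lag_df <- df %>%
  bind_cols(as_tibble(lag_cols)) %>%   # one new column per lag
  na.omit()                            # drops the first n_lags rows

dim(lag_df)
```

Changing n_lags is then the only edit needed to add more or fewer lags.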

Ordering rows when they are not numeric

Using df below, I made a table with frequencies for each unit according to each combination of group and year.
After obtaining absolute and relative frequencies, I pasted the values into a single column, Frequency.
After reshaping the table so that the units are on the rows, is there a way to order them in descending order of n for the Total group in 2016? I want my final output to contain only the Frequency rows, not the rows with n and prop.
df <- data.frame(cbind(sample(c('Controle','Tratado'),
10, replace = T),
sample(c(2012,2016), 10, T),
c('A','B','A','B','C','D','D','A','F','A')))
colnames(df) <- c('Group', 'Year', 'Unit')
table <- df %>%
group_by(Year, Group) %>%
count(Unit) %>%
mutate(prop = prop.table(n)) %>%
bind_rows(df %>%
mutate(Group ="Total") %>%
group_by(Year, Group) %>%
count(Unit)) %>%
mutate(prop = prop.table(n))
is.num <- sapply(table, is.numeric)
table[is.num] <- lapply(table[is.num], round, 4)
table <- table %>%
mutate(Frequency = paste0(n,' (', 100*prop,'%)'))
table <- table %>%
gather(type, measurement, -Year, -Group, -Unit) %>%
unite(year_group, Year:Group, sep = ":") %>%
spread(year_group, measurement)
Here is what I am expecting to generate:
Unit type 2012:Total 2012:Tratado 2016:Controle 2016:Total 2016:Tratado
1 A Frequency 2 (66.67%) 2 (66.67%) - 2 (28.57%) 2 (100%)
2 D Frequency - - 2 (40%) 2 (28.57%) -
3 B Frequency 1 (33.33%) 1 (33.33%) 1 (20%) 1 (14.29%) -
4 C Frequency - - 1 (20%) 1 (14.29%) -
5 F Frequency - - 1 (20%) 1 (14.29%) -
Notice that the results are ordered according to column 2016:Total
Just found out a way myself, probably not the best one.
After running the code on the question, I have done the following:
table <- subset.data.frame(table, type == 'Frequency')
table <- table %>%
# unite(..., sep = ":") names this column `2016:Total`
mutate(value = sub(" .*", "", `2016:Total`)) %>% # keep the count before " (xx%)"
mutate(value = as.numeric(value)) %>%
arrange(desc(value))
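The parsing step can be checked in isolation on a toy table. The values below are made up purely for illustration and are not the question's data; the sketch assumes each cell starts with the count followed by a space, and that missing combinations are NA.

```r
# Toy frequency table with made-up values
freq <- data.frame(
  Unit = c("A", "B", "C", "D"),
  `2016:Total` = c("2 (28.57%)", "1 (14.29%)", "3 (42.86%)", NA),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Take everything before the first space and convert it to a number
n_total <- as.numeric(sub(" .*", "", freq[["2016:Total"]]))

# order() places NA last by default, so units without a 2016 total sink to the bottom
freq_sorted <- freq[order(n_total, decreasing = TRUE), ]

freq_sorted
```

Extracting the count by pattern rather than by a fixed character offset keeps the ordering correct when the percentages have different lengths.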
