Grouping by specifying unique keys in R

In dplyr, I'm looking for a way to group by unique keys (for the problem at hand, by unique row numbers). Given a data frame such as the one below:
df <- data.frame(A = rep(1:5, each = 2), B = rnorm(10, 3, 3), C= runif(10, 1.5, 4.5))
#>     A          B        C
#> 1   1 -4.6399372 1.622857
#> 2   1  0.9933197 4.256062
#> 3   2  4.1381981 3.522439
#> 4   2  4.6943698 4.260124
#> 5   3  5.7183797 3.877568
#> 6   3 -3.6183500 2.236473
#> 7   4 -2.5711393 4.373780
#> 8   4  5.9092908 2.125349
#> 9   5  6.1531930 4.472758
#> 10  5 -1.9750869 1.516432
I would like to collapse three rows (df[4:6, ]) into a single row of their means, so that the result has only 8 rows in total after grouping and summarising. Normally, I would work it out in the following manner:
df %>%
  group_by(rownumber = c(1:3, rep(4, each = 3), 7:10)) %>%
  summarise_all(.funs = mean)
But I find the code overly explicit, in that each slice of the index has to be provided.
There must be a more efficient/succinct way to achieve the same feat. Thanks to anyone offering insights. Also, although the tidyverse community seems to dodge the row-naming convention, for now I'd like to have proper row numbering here.

One option would be to replace those elements with a specific value, so that we can avoid the rep and the later concatenation step:
df %>%
  group_by(grp = replace(row_number(), 4:6, 4)) %>%
  summarise_all(mean)
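If the rows to collapse are given as an arbitrary index vector, the same idea generalizes; here is a minimal sketch (the collapse_rows helper name is mine, not from the answer, and across() replaces the superseded summarise_all()):

library(dplyr)

# Hypothetical helper: collapse the rows in `idx` into one group,
# leaving every other row as its own group.
collapse_rows <- function(df, idx) {
  df %>%
    group_by(grp = replace(row_number(), idx, min(idx))) %>%
    summarise(across(everything(), mean))
}

collapse_rows(df, 4:6)  # same 8-row result as above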

In dplyr::mutate, refer to a value conditionally, based on the value of another column

Apologies for the unclear title; I couldn't think of a better way to describe this problem.
Here is a sample dataset I am working with
test = data.frame(
  Value = c(1:5, 5:1),
  Index = c(1:5, 1:5),
  GroupNum = c(rep.int(1, 5), rep.int(2, 5))
)
I want to create a new column (called "Value_Standardized") whose values are calculated by grouping the data by GroupNum and then dividing each Value observation by the Value observation of the group when the Index is 1.
Here's what I've come up with so far.
test2 = test %>%
  group_by(GroupNum) %>%
  mutate(Value_Standardized = Value / special_function(Value))
The special_function would represent a way to get the value where Index == 1.
That is also precisely the problem: I cannot figure out how to make the denominator the value where Index == 1 within that group. Unfortunately, the value where the Index is 1 is not necessarily the max or the min of the group.
Thanks in advance.
Edit: Emphasis added for clarity.
There is a super simple tidyverse way of doing this with cur_data(): it pulls the tibble for the current subset (group) of data and acts on it.
test2 <- test %>%
  group_by(GroupNum) %>%
  mutate(output = Value / cur_data()$Value[1])
cur_data() grabs the tibble; you then extract the Value column as you normally would with $Value, and because the denominator is always the first row of the group here, you just pick that element with [1].
Nice and neat. There is a whole bunch of cur_...() functions you can use; check them out here:
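Worth noting: cur_data() has since been deprecated in dplyr 1.1.0 (superseded by pick()), and plain data masking avoids it entirely. A minimal sketch that uses the Index == 1 condition directly rather than positional [1] (it assumes exactly one Index == 1 row per group):

library(dplyr)

test %>%
  group_by(GroupNum) %>%
  # Value[Index == 1] is length 1 per group, so it recycles across the group
  mutate(Value_Standardized = Value / Value[Index == 1])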
Not sure if this is what you meant, nor if it's the best way to do this but...
Instead of using a group_by I used a nested pipe, filtering and then left_joining the table to itself.
test = data.frame(
  Value = c(1:5, 5:1),
  Index = c(1:5, 1:5),
  GroupNum = c(rep.int(1, 5), rep.int(2, 5))
)

test %>%
  left_join(test %>%
              filter(Index == 1) %>%
              select(Value, GroupNum),
            by = "GroupNum",
            suffix = c('', '_Index_1')) %>%
  mutate(Value = Value / Value_Index_1)
output:
   Value Index GroupNum Value_Index_1
1    1.0     1        1             1
2    2.0     2        1             1
3    3.0     3        1             1
4    4.0     4        1             1
5    5.0     5        1             1
6    1.0     1        2             5
7    0.8     2        2             5
8    0.6     3        2             5
9    0.4     4        2             5
10   0.2     5        2             5
A quick base R solution:
test = data.frame(
  Value = c(1:5, 5:1),
  Index = c(1:5, 1:5),
  GroupNum = c(rep.int(1, 5), rep.int(2, 5)),
  Value_Standardized = NA
)

groups <- levels(factor(test$GroupNum))

for (currentGroup in groups) {
  test$Value_Standardized[test$GroupNum == currentGroup] <-
    test$Value[test$GroupNum == currentGroup] /
    test$Value[test$GroupNum == currentGroup & test$Index == 1]
}
This only works under the assumption that each group will have only one observation with a "1" index though. It's easy to run into trouble...
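A vectorized base R alternative (a sketch, not from the original answers) avoids the loop by looking up each group's Index == 1 value with match(); it rests on the same one-denominator-per-group assumption:

# Rows where Index == 1 hold each group's denominator
idx1 <- test[test$Index == 1, c("GroupNum", "Value")]

# For every row, find the position of its group's Index == 1 row
test$Value_Standardized <- test$Value / idx1$Value[match(test$GroupNum, idx1$GroupNum)]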

tidyverse rename_with giving error when trying to provide new names based on existing column values

Assuming the following data set:
df <- data.frame(...1 = c(1, 2, 3),
                 ...2 = c(1, 2, 3),
                 n_column = c(1, 1, 2))
I now want to rename all vars that start with "...". My real data sets could have different numbers of "..." vars. The information about how many such vars I have is in the n_column column; more precisely, it is the maximum of that column.
So I tried:
df %>%
  rename_with(.cols = starts_with("..."),
              .fn = paste0("new_name", 1:max(n_column)))
which gives an error:
# Error in paste0("new_name", 1:max(n_column)) :
# object 'n_column' not found
So I guess the problem is that the paste0 call does not look for the column I provide within the current data set. I'm not sure, though, how to make it do so. Any ideas?
I know I could bypass the whole thing by just creating an external scalar that contains the max. of n_column, but ideally I'd like to do everything in one pipeline.
You don't need the information from n_column; .cols will pass only those columns that satisfy the condition (starts_with("...")).
library(dplyr)
df %>% rename_with(~paste0("new_name", seq_along(.)), starts_with("..."))
#  new_name1 new_name2 n_column
#1         1         1        1
#2         2         2        1
#3         3         3        2
This is also safer than using max(n_column): if the data in n_column get corrupted, or the number of "..." columns changes, this will still work.
A way to refer to column values inside rename_with would be to use an anonymous function, so that you can use .$n_column.
df %>%
  rename_with(function(x) paste0("new_name", 1:max(.$n_column)),
              starts_with("..."))
I am assuming this is part of a longer chain, so you don't want to use max(df$n_column).
We can use str_c
library(dplyr)
library(stringr)
df %>%
  rename_with(~str_c("new_name", seq_along(.)), starts_with("..."))
Or using base R
i1 <- startsWith(names(df), "...")
names(df)[i1] <- sub("...", "new_name", names(df)[i1], fixed = TRUE)
df
  new_name1 new_name2 n_column
1         1         1        1
2         2         2        1
3         3         3        2
A completely different approach would be
df %>% janitor::clean_names()
  x1 x2 n_column
1  1  1        1
2  2  2        1
3  3  3        2

How to remove rows that do not have more than 3 values in R?

This is my first time asking a question, and hopefully I can get your help!
Using R, I need to remove rows for genes that have only one or two values.
Basically, I need to get rid of 50S, ABCC8, and ACAT1 because these have n < 3.
My desired output is the same table with those genes' rows removed.
Thank you very much!
If this is in a data.frame, you can use the dplyr package to do the manipulation. We can group the data by Genes and count how many instances there are, then simply set the filter criterion to drop those records.
require(dplyr)

df <- data.frame(
  Genes = c('50S', 'abcb1', 'abcb1', 'abcb1', 'ABCC8', 'ABL', 'ABL', 'ABL', 'ABL', 'ACAT1', 'ACAT1'),
  Values = c(-0.627323448, -0.226358414, 0.347305901, 0.371632631, 0.099485307, 0.078512979,
             -0.426643782, -1.060270668, -2.059157991, 0.608899174, -0.048795611)
)

# group, filter, and join back to subset the data
df %>%
  group_by(Genes) %>%
  summarize(count = n()) %>%
  filter(count >= 3) %>%
  inner_join(df) %>%
  select(Genes, Values)
As per @Lamia's comment, it is possible to simplify this to just:
df %>% group_by(Genes) %>% filter(n()>=3)
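A base R equivalent of that filter (a sketch, not from the original answers) uses ave() to attach each gene's row count to its rows:

# Keep only genes that appear at least 3 times
df[ave(seq_along(df$Genes), df$Genes, FUN = length) >= 3, ]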
# generating data
x <- c(NA, NA, NA, NA, 2, 3)  # has n < 3!
y <- c(1, 2, 3, 4, 5, 6)
z <- c(1, 2, 3, NA, 5, 6)
df <- data.frame(x, y, z)

colsToKeep <- c()  # empty vector to fill with column numbers

for (i in 1:ncol(df)) {            # for every column
  if (sum(!is.na(df[, i])) >= 3) { # if the column has at least 3 valid (non-NA) values...
    colsToKeep <- c(colsToKeep, i) # ...then save that column number into this vector
  }
}

df[, colsToKeep]  # then use that vector to call the columns you want
Note that R treats FALSE as 0 and TRUE as 1, so that is how the sum() function works here.
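The same counting idea vectorizes with colSums(), which sums those TRUE/FALSE values per column (a sketch, not from the original answer):

# Keep columns with at least 3 non-NA values, without an explicit loop
df[, colSums(!is.na(df)) >= 3, drop = FALSE]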
Another possible solution, using table:
gene <- c("A", "A", "A", "B", "B", "C", "C", "C", "C", "D")
value <- seq(1, 10, 1)
df <- data.frame(gene, value)
df
   gene value
1     A     1
2     A     2
3     A     3
4     B     4
5     B     5
6     C     6
7     C     7
8     C     8
9     C     9
10    D    10
su <- data.frame(table(df$gene))
df_keep <- df[which(df$gene %in% su[which(su$Freq > 2), 1]), ]
df_keep
  gene value
1    A     1
2    A     2
3    A     3
6    C     6
7    C     7
8    C     8
9    C     9

ACF by group in R

I would like to calculate the acf of a time series grouped by a grouping variable. Specifically, I have a data frame containing a single time series (variable a) and a grouping variable (e.g. weekday, variable b). Here is an example:
data <- data.frame(a=rnorm(1:150), b=rep(rep(1:3, each=5), 10))
Now, I would like to calculate the acf for the different values of the grouping variable. For example, for lag 2 and group 1 I would like to get the correlation between t and t-2, calculated only over time points t with b = 1 (the value of b at t-2 does not matter). I know that the function acf can easily calculate the acf, but I can't find a way to include the grouping variable.
I could manually calculate the desired correlation but as I have a large data set and a lot of lags and values for the grouping variables, I would hope that there is a more elegant and faster way. Here is the manual calculation for the example above (lag 2, b=1):
sel <- which(data$b == 1)
cor(data$a[sel[sel > 2]], data$a[sel[sel > 2] - 2])
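For reference, that manual computation generalizes directly to arbitrary lags and groups; a small sketch (the grouped_acf name is hypothetical):

# Correlation between a[t] and a[t - lag], restricted to rows t with b == g
grouped_acf <- function(data, g, lag) {
  sel <- which(data$b == g)
  sel <- sel[sel > lag]
  cor(data$a[sel], data$a[sel - lag])
}

# One column per group, one row per lag
sapply(1:3, function(g) sapply(1:5, function(lag) grouped_acf(data, g, lag)))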
If the time series object is a tsibble, the following works for me (ACF() here comes from the feasts package). Assuming the data frame is called df and the variable you are interested in is called var, you can additionally specify the maximum lag:
df %>% group_by(Region) %>% ACF(var, lag_max = 18) %>% autoplot()
I'm not sure I understand exactly what information you are looking for, but if you just want the acf values for multiple groups, this should accomplish that. Some people have mentioned creating a tidy solution, and this uses dplyr, tidyr, and purrr to do the grouped calculations.
library(dplyr)
library(tidyr)
library(purrr)
# dplyr::data_frame() is deprecated; tibble::tibble() is the modern equivalent
sample_data <- dplyr::data_frame(
  group = sample(c("a", "b", "c"), size = 100, replace = T),
  value = sample.int(30, size = 100, replace = T)
)
head(sample_data)
#> # A tibble: 6 × 2
#>   group value
#>   <chr> <int>
#> 1 c        28
#> 2 c         9
#> 3 c        13
#> 4 c        11
#> 5 a         9
#> 6 c         9
grouped_acf_values <- sample_data %>%
  tidyr::nest(-group) %>%
  dplyr::mutate(acf_results = purrr::map(data, ~ acf(.x$value, plot = F)),
                acf_values = purrr::map(acf_results, ~ drop(.x$acf))) %>%
  tidyr::unnest(acf_values) %>%
  dplyr::group_by(group) %>%
  dplyr::mutate(lag = seq(0, n() - 1))
head(grouped_acf_values)
#> Source: local data frame [6 x 3]
#> Groups: group [1]
#>
#>   group  acf_values   lag
#>   <chr>       <dbl> <int>
#> 1     c  1.00000000     0
#> 2     c -0.20192774     1
#> 3     c  0.07191805     2
#> 4     c -0.18440489     3
#> 5     c -0.31817935     4
#> 6     c  0.06368096     5
You can have a look at split to separate your data.frame into buckets and then lapply to apply your function to each group. Something like:
groups_data <- split(data, data$b)
groups_acf <- lapply(groups_data, acf, ...)
Then you have to extract the required information from the output list, for instance with sapply(groups_acf, function(acfobject) acfobject$acf).
For grouped computations, I would also definitely go with the new ways "à la" Hadley Wickham, with the %>% operator and group_by; studying that is on my todo list...
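For completeness, a sketch of that group_by flavour (assuming the data frame from the question; the list-column holds one acf object per group):

library(dplyr)

data %>%
  group_by(b) %>%
  summarise(acf_obj = list(acf(a, plot = FALSE)))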

Sorting a column in descending order in R excluding the first row

I have a data frame with 5 columns and a very large number of rows. I want to sort by column 3, but leave the first row in place, i.e. sort everything after the first row. (When calling this function I want the range to end at nrow.)
Example output:
Original:
4
7
9
6
8
New:
4
9
8
7
6
Thanks!
If I'm correctly understanding what you want to do, this approach should work:
z <- data.frame(x1 = seq(10), x2 = rep(c(2,3), 5), x3 = seq(14, 23))
zsub <- z[2:nrow(z),]
zsub <- zsub[order(-zsub[,3]),]
znew <- rbind(z[1,], zsub)
Basically, snip off the rows you want to sort, sort them in descending order on column 3, then reattach the first row.
And here's a piped version using dplyr, so you don't clutter the workspace with extra objects:
library(dplyr)
z <- z %>%
  slice(2:nrow(z)) %>%
  arrange(-x3) %>%
  rbind(slice(z, 1), .)
You might try this single line of code to modify the third column in your data frame df as described (note it reorders only that column's values, leaving the other columns in place):
df[, 3] <- c(df[1, 3], sort(df[-1, 3], decreasing = TRUE))
df$x[-1] <- df$x[-1][order(df$x[-1], decreasing=T)]
# x
# 1 4
# 2 9
# 3 8
# 4 7
# 5 6
