Grouping a data frame into 12 groups with the same column values - r

I have a large dataset with about 15 columns and more than 3 million rows.
Because the dataset is so big, I would like to use multidplyr on it.
Because of the nature of the data, it would not work to simply split my data frame into 12 arbitrary parts. Let's say there are columns col1 and col2, each of which has several different values that repeat (within each column separately).
How can I make 12 (or n) similarly sized groups such that all rows with the same values in both col1 and col2 end up in the same group?
Example: let's say one of the possible values in col1 is foo and one of the values in col2 is bar. Then all rows with this combination of values would end up in one group.
So that the question makes sense: there are always more than 12 unique combinations of col1 and col2.
If this were Python I would try something with for and while loops, but as this is R, there is probably a better way.

Try this:
# As you provided no example data, I created some data by repeating the letters three times.
# I used dplyr from the tidyverse, grouped by the two columns and then
# randomly sampled n = 2 rows per group.
library(tidyverse)
df <- data.frame(a = rep(LETTERS, 3), b = rep(letters, 3))
# The data:
df %>%
  arrange(a, b) %>%
  group_by(a, b) %>%
  mutate(n = 1:n())
# A tibble: 78 x 3
# Groups: a, b [26]
a b n
<fctr> <fctr> <int>
1 A a 1
2 A a 2
3 A a 3
4 B b 1
5 B b 2
6 B b 3
7 C c 1
8 C c 2
9 C c 3
10 D d 1
# ... with 68 more rows
Slicing the data down by randomly sampling two rows per group:
set.seed(123)
df %>%
  arrange(a, b) %>%
  group_by(a, b) %>%
  mutate(n = 1:n()) %>%
  sample_n(2)
# A tibble: 52 x 3
# Groups: a, b [26]
a b n
<fctr> <fctr> <int>
1 A a 1
2 A a 2
3 B b 2
4 B b 3
5 C c 3
6 C c 1
7 D d 2
8 D d 3
9 E e 2
10 E e 1
# ... with 42 more rows

# Create sample data
library(dplyr)
df <- data.frame(a = rep(LETTERS, 3), b = rep(letters, 3),
                 nobs = sample(1:100, 26 * 3, replace = TRUE),
                 stringsAsFactors = FALSE)

# Get all unique combinations of col1 and col2
combos <- df %>%
  group_by(a, b) %>%
  summarize(n = sum(nobs)) %>%
  as.data.frame(.)

# The 12 most common combinations
top12 <- combos %>%
  arrange(desc(n)) %>%
  top_n(12, n)
top12

l <- list()
for (i in 1:11) {
  l[[i]] <- combos[combos$a == top12[i, "a"] & combos$b == top12[i, "b"], ]
}
# Everything not in the top 12 goes into the last element
l[[12]] <- combos %>%
  anti_join(top12, by = c("a", "b"))
l
This produces a list l that contains twelve data frames: the 11 most commonly occurring pairs of col1 and col2, and all the rest in the 12th list element.
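If the goal is instead to split the actual rows into 12 groups of roughly equal size while keeping every (col1, col2) combination intact, one option is a greedy assignment: order the combinations by size and always put the next one into the currently smallest bin. A minimal sketch with dplyr, reusing the sample df above (the combo_sizes/bin names are my own, not from the answers):
library(dplyr)

n_bins <- 12

# Size of every (a, b) combination, largest first
combo_sizes <- df %>%
  count(a, b, wt = nobs, name = "size") %>%
  arrange(desc(size))

# Greedily assign each combination to the least loaded bin
bin_load <- numeric(n_bins)
combo_sizes$bin <- NA_integer_
for (i in seq_len(nrow(combo_sizes))) {
  j <- which.min(bin_load)
  combo_sizes$bin[i] <- j
  bin_load[j] <- bin_load[j] + combo_sizes$size[i]
}

# Attach the bin id to the original rows; all rows of a combination share one bin
df_binned <- df %>%
  left_join(select(combo_sizes, a, b, bin), by = c("a", "b"))

# Check that the bins are of comparable size
df_binned %>% count(bin, wt = nobs)
The bin column can then be used to split the data, e.g. with split(df_binned, df_binned$bin), or as the grouping variable handed to multidplyr.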

Related

How to filter rows according to the bigger value in another column?

I have a data frame like below
d1<-c('a','b','c','d','e','f','g','h','i','j','k','l')
d2<-c(1,5,1,2,13,2,32,2,1,2,4,5)
df1<-data.frame(d1,d2)
My goal is to filter the rows based on which value of d2 within every block of 3 rows is the biggest, keeping only that row (here: the rows for b, e, g and l).
Thank you!
We may use rollmax from zoo to filter the rows
library(dplyr)
library(zoo)
df1 %>%
  filter(d2 == na.locf0(rollmax(d2, k = 3, fill = NA)))
d1 d2
1 b 5
2 e 13
3 g 32
4 l 5
You can create a grouping variable that puts observations into groups of 3. First create a sequence from 1 to the total number of rows, incremented by 3, then repeat each number of that sequence 3 times and subset the result to the length of the data, in case the number of observations is not perfectly divisible by 3. Then simply filter the rows by the largest value of d2 in each group.
library(dplyr)
df1 %>%
  mutate(group = rep(seq(1, n(), by = 3), each = 3)[1:n()]) %>%
  group_by(group) %>%
  filter(d2 == max(d2))
# A tibble: 4 x 3
# Groups: group [4]
# d1 d2 group
# <chr> <dbl> <dbl>
# 1 b 5 1
# 2 e 13 4
# 3 g 32 7
# 4 l 5 10
Yet another solution:
library(tidyverse)
d1 <- c('a','b','c','d','e','f','g','h','i','j','k','l')
d2 <- c(1,5,1,2,13,2,32,2,1,2,4,5)
df1 <- data.frame(d1, d2)

df1 %>%
  mutate(id = rep(1:(n()/3), each = 3)) %>%
  group_by(id) %>%
  slice_max(d2) %>%
  ungroup() %>%
  select(-id)
#> # A tibble: 4 × 2
#> d1 d2
#> <chr> <dbl>
#> 1 b 5
#> 2 e 13
#> 3 g 32
#> 4 l 5
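Note that rep(1:(n()/3), each = 3) assumes the number of rows is an exact multiple of 3. A small variation (not from the answers above) that works for any number of rows uses integer division to build the group id:
library(dplyr)

df1 %>%
  mutate(id = (row_number() - 1) %/% 3) %>%  # 0,0,0,1,1,1,... regardless of length
  group_by(id) %>%
  slice_max(d2, n = 1) %>%
  ungroup() %>%
  select(-id)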

How to Pass column name in group by from a variable

I want to extract the max value of a column for each group of a data frame.
I have the column name in a variable that I want to pass to group_by, but it is failing.
I have below data frame:
df <- read.table(header = TRUE, text = 'Gene Value
A 12
A 10
B 3
B 5
B 6
C 1
D 3
D 4')
The column names are stored in variables:
columnselected <- c("Value")
groupbycol <- c("Gene")
My code is:
df %>% group_by(groupbycol) %>% top_n(1, columnselected)
This code gives an error. The expected output is:
Gene Value
A 12
B 6
C 1
D 4
You need to convert the column names to symbols using sym and then evaluate them with !!:
library(dplyr)
df %>% group_by(!!sym(groupbycol)) %>% top_n(1, !!sym(columnselected))
# Gene Value
# <fct> <int>
#1 A 12
#2 B 6
#3 C 1
#4 D 4
We can use group_by_at, without needing an additional package:
library(dplyr)
df %>%
  group_by_at(groupbycol) %>%
  top_n(1, !!as.name(columnselected))
# A tibble: 4 x 2
# Groups: Gene [4]
# Gene Value
# <fct> <int>
#1 A 12
#2 B 6
#3 C 1
#4 D 4
NOTE: There would be many dupes for this post :=)
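For what it's worth, in more recent dplyr versions (1.0 and later) the same idea can be written with across()/all_of() and the .data pronoun. A small sketch, assuming the same df, groupbycol and columnselected as above:
library(dplyr)

df %>%
  group_by(across(all_of(groupbycol))) %>%    # group by the column named in groupbycol
  slice_max(.data[[columnselected]], n = 1)   # keep the row(s) with the largest Value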

R Show duplicates in dataframe

I am trying to "highlight" duplicates in my data frame. I found various tutorials on dropping duplicates or on creating a new dataset containing only the duplicates. But since I suspect something went wrong in earlier stages of my data work, I would (for now) just like to see which observations appear to be duplicates, in order to understand what went wrong. I would like R to create column c:
a <- c("C","A","A","B","A","C","C")
b <- c(1,1,2,1,2,1,2)
c <- c(2,1,2,1,2,2,1)
df <-data.frame(a,b,c)
a <- c("C","A","A","B","A","C","C")
b <- c(1,1,2,1,2,1,2)
df <- data.frame(a, b)
With dplyr, you can count how often each combination of a and b occurs:
library(dplyr)
df %>%
  group_by(a, b) %>%   # for each combination of a and b
  mutate(c = n()) %>%  # count how many times it appears
  ungroup()
# # A tibble: 7 x 3
# a b c
# <fct> <dbl> <int>
# 1 C 1 2
# 2 A 1 1
# 3 A 2 2
# 4 B 1 1
# 5 A 2 2
# 6 C 1 2
# 7 C 2 1
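If a logical flag is preferred over a count, the same grouping idea works; a small variation (the column name duplicate is my own choice):
library(dplyr)

df %>%
  group_by(a, b) %>%
  mutate(duplicate = n() > 1) %>%  # TRUE when the (a, b) pair occurs more than once
  ungroup()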

Implementation of LIFO in converting comma separated row values into separate row values

I have a data frame where individual row values are comma separated, and I want to split those comma-separated values into separate rows. Using this SO post I was able to convert the comma-separated string values into individual rows, but the first value of the string becomes the first of the resulting rows, and I want the inverse of this, i.e. the first string value becomes the last row value.
# create data
library(tidyverse)
d <- data_frame(
  col1 = c("1,2,3")
)
Dataframe
# # A tibble: 1 x 1
#   col1
#   <chr>
# 1 1,2,3

# tidy data
separate_rows(d, col1, convert = TRUE)
Current Output
# # A tibble: 3 x 1
#   col1
#   <int>
# 1     1
# 2     2
# 3     3
Desired Output
# # A tibble: 3 x 1
#   col1
#   <int>
# 1     3
# 2     2
# 3     1
Split the column on commas, reverse the vector, construct a data frame.
Sample data:
> d = data.frame(col1=c("23,34,99","9,3,2"),stringsAsFactors=FALSE)
> d
col1
1 23,34,99
2 9,3,2
Do:
> data.frame(col1=do.call(c,lapply(strsplit(d$col1,","),rev)))
col1
1 99
2 34
3 23
4 2
5 3
6 9
We can reverse the data frame by selecting the row indices in reverse order with slice:
library(tidyverse)
separate_rows(d, col1, convert = TRUE) %>%
  slice(n():1)
# col1
# <int>
#1 3
#2 2
#3 1
For multiple rows, taking @Spacedman's example:
d = data.frame(col1=c("23,34,99","9,3,2"),stringsAsFactors=FALSE)
The above solution would give
separate_rows(d, col1, convert = TRUE) %>%
  slice(n():1)
# A tibble: 6 x 1
col1
<int>
#1 2
#2 3
#3 9
#4 99
#5 34
#6 23
However, if the OP needs to reverse the values of each row separately, we can create a group column with row_number and then reverse each group, as suggested by @Sotos:
d %>%
  mutate(group = row_number()) %>%
  separate_rows(col1, convert = TRUE) %>%
  group_by(group) %>%
  slice(n():1) %>%
  ungroup() %>%
  select(-group)
# A tibble: 6 x 1
# col1
# <int>
#1 99
#2 34
#3 23
#4 2
#5 3
#6 9
You can use stri_reverse from the stringi package to reverse the string prior to separating (this reverses the characters of the string, so it works here because every value is a single character), i.e.
library(tidyverse)
d %>%
  mutate(col1 = stringi::stri_reverse(col1)) %>%
  separate_rows(col1)
which gives
A tibble: 3 x 1
col1
<chr>
1 3
2 2
3 1

Create a variable capturing the most frequent occurence by group

Define:
df1 <- data.frame(
  id = c(rep(1, 3), rep(2, 3)),
  v1 = as.character(c("a", "b", "b", rep("c", 3)))
)
s.t.
> df1
id v1
1 1 a
2 1 b
3 1 b
4 2 c
5 2 c
6 2 c
I want to create a third variable freq that contains the most frequent observation in v1 by id s.t.
> df2
id v1 freq
1 1 a b
2 1 b b
3 1 b b
4 2 c c
5 2 c c
6 2 c c
You can do this using ddply and a custom function to pick out the most frequent value:
library(plyr)
myFun <- function(x){
  tbl <- table(x$v1)
  x$freq <- rep(names(tbl)[which.max(tbl)], nrow(x))
  x
}
ddply(df1, .(id), .fun = myFun)
Note that which.max will return the first occurrence of the maximum value, in the case of ties. See ??which.is.max in the nnet package for an option that breaks ties randomly.
Another way consists of using tidyverse functions:
- group first, using group_by(), and count the occurrences of the second variable using tally()
- arrange by the number of occurrences with arrange()
- summarize and pick out the first row with summarize() and first()
Therefore:
df1 %>%
  group_by(id, v1) %>%
  tally() %>%
  arrange(id, desc(n)) %>%
  summarize(freq = first(v1))
This will give you just the mapping (which I find cleaner):
# A tibble: 2 x 2
id freq
<dbl> <fctr>
1 1 b
2 2 c
You can then left_join your original data frame with that table.
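For completeness, a small sketch of that last step, joining the mapping back onto df1 (freq_map is just a name I am introducing here for the result of the pipeline above):
library(dplyr)

# The id -> most frequent v1 mapping from the pipeline above
freq_map <- df1 %>%
  group_by(id, v1) %>%
  tally() %>%
  arrange(id, desc(n)) %>%
  summarize(freq = first(v1))

# Attach the most frequent value to every original row
df2 <- df1 %>%
  left_join(freq_map, by = "id")
df2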
# Base R: most frequent value per id via ave() and a small mode helper
mode <- function(x) names(table(x))[which.max(table(x))]
df1$freq <- ave(df1$v1, df1$id, FUN = mode)
> df1
id v1 freq
1 1 a b
2 1 b b
3 1 b b
4 2 c c
5 2 c c
6 2 c c
