Selecting all rows that match a criteria selected randomly within dplyr - r

I am trying to select all rows in a repeated measures dataset that belong to a randomly selected group of people. I am trying to do it entirely in the tidyverse (for my own edification) but find myself having to fall back on base R functions. Here is how I do it with a combination of base R and dplyr commands.
set.seed(145)
df <- data.frame(id = rep(letters[1:10], each = 4),
score = rnorm(40))
ids <- sample(unique(df$id), 3)
smallDF <- df %>% dplyr::filter(id %in% ids)
smallDF
# id score
# 1 a 0.6869129
# 2 a 1.0663631
# 3 a 0.5367006
# 4 a 1.9060287
# 5 c 1.1677516
# 6 c 0.7926794
# 7 c -1.2135038
# 8 c -1.0056141
# 9 d 0.2085696
# 10 d 0.4461776
# 11 d -0.6208060
# 12 d 0.4413429
I can sample randomly from the id identifier using dplyr...
df %>% distinct(id) %>% sample_n(3)
# id
# 1 e
# 2 c
# 3 b
...but the fact that the output is a dataframe/tibble is making it difficult for me to get to that next step where I then filter the original df by the randomly selected id identifiers.
Can anyone help?

You can do a left_join to original df to get all the rows of randomly selected id's
library(dplyr)
set.seed(123)
df %>% distinct(id) %>% sample_n(3) %>% left_join(df)
#Joining, by = "id"
# id score
#1 b 1.063
#2 b 1.370
#3 b 0.528
#4 b 0.403
#5 f 0.343
#6 f -1.286
#7 f -0.534
#8 f 0.597
#9 c 1.168
#10 c 0.793
#11 c -1.214
#12 c -1.006

df %>% filter(id %in% sample(levels(id),3))

Related

How can I remove rows with the same value in 2 ore more rows in R

I have a dataframe in the following format with ID's and A/B's. The dataframe is very long, over 3000 ID's.
id
type
1
A
2
B
3
A
4
A
5
B
6
A
7
B
8
A
9
B
10
A
11
A
12
A
13
B
...
...
I need to remove all rows (A+B), where more than one A is behind another one or more. So I dont want to remove the duplicates. If there are a duplicate (2 or more A's), i want to remove all A's and the B until the next A.
id
type
1
A
2
B
6
A
7
B
8
A
9
B
...
...
Do I need a loop for this problem? I hope for any help,thank you!
This might be what you want:
First, define a function that notes the indices of what you want to remove:
row_sequence <- function(value) {
inds <- which(value == lead(value))
sort(unique(c(inds, inds + 1, inds +2)))
}
Apply the function to your dataframe by first extracting the rows that you want to remove into df1 and second anti_joining df1 with df to obtain the final dataframe:
library(dplyr)
df1 <- df %>% slice(row_sequence(type))
df2 <- df %>%
anti_join(., df1)
Result:
df2
id type
1 1 A
2 2 B
3 6 A
4 7 B
5 8 A
6 9 B
Data:
df <- data.frame(
id = 1:13,
type = c("A","B","A","A","B","A","B","A","B","A","A","A","B")
)
I imagined there is only one B after a series of duplicated A values, however if that is not the case just let me know to modify my codes:
library(dplyr)
library(tidyr)
library(data.table)
df %>%
mutate(rles = data.table::rleid(type)) %>%
group_by(rles) %>%
mutate(rles = ifelse(length(rles) > 1, NA, rles)) %>%
ungroup() %>%
mutate(rles = ifelse(!is.na(rles) & is.na(lag(rles)) & type == "B", NA, rles)) %>%
drop_na() %>%
select(-rles)
# A tibble: 6 x 2
id type
<int> <chr>
1 1 A
2 2 B
3 6 A
4 7 B
5 8 A
6 9 B
Data
df <- read.table(header = TRUE, text = "
id type
1 A
2 B
3 A
4 A
5 B
6 A
7 B
8 A
9 B
10 A
11 A
12 A
13 B")

Turning a data frame and a list into long format with dplyr

Here is a puzzle.
Assume you have a data frame and a list. The list has as many elements as the df has rows:
dd <- data.frame(ID=1:3, Name=LETTERS[1:3])
dl <- map(4:6, rnorm) %>% set_names(letters[1:3])
Is there a simple way (preferably with dplyr / tidyverse) to make a long format, such that the elements of the list are joined with the corresponding rows of the data frame? Here is what I have in mind illustrated with not-so-elegant way:
rows <- map(1:length(dl), ~ rep(., length(dl[[.]]))) %>% unlist()
dd <- dd[rows,]
dd$value <- unlist(dl)
As you can see, for each vector in dl, we replicated the corresponding row as many times as necessary to accommodate each value.
In base R, you can get your result with stack followed by merge:
res <- merge(stack(dl), dd, by.x="ind", by.y="Name")
head(res)
# ind values ID
#1 A -0.79616693 1
#2 A 0.37720953 1
#3 A 1.30273712 1
#4 A 0.19483859 1
#5 B 0.18770716 2
#6 B -0.02226917 2
NB: I supposed the names for dl were supposed to be in uppercases but if they are indeed lowercase, the following line needs to be pass instead:
res <- merge(stack(setNames(dl, toupper(names(dl)))), dd, by.x="ind", by.y="Name")
Since a dplyr solution has already been provided, another option is to subset dl for each Name value in dd using data.table grouping
library(data.table)
setDT(dd)
dd[, .(values = dl[[tolower(Name)]]), by = .(ID, Name)]
# ID Name values
# 1: 1 A -1.09633600
# 2: 1 A -1.26238190
# 3: 1 A 1.15220845
# 4: 1 A -1.45741071
# 5: 2 B -0.49318131
# 6: 2 B 0.59912670
# 7: 2 B -0.73117632
# 8: 2 B -1.09646143
# 9: 2 B -0.79409753
# 10: 3 C -0.08205888
# 11: 3 C 0.21503398
# 12: 3 C -1.17541571
# 13: 3 C -0.10020616
# 14: 3 C -1.01152362
# 15: 3 C -1.03693337
We can create a list column and unnest
library(tidyverse)
dd %>%
mutate(value = dl) %>%
unnest
# ID Name value
#1 1 A 1.57984385
#2 1 A 0.66831102
#3 1 A -0.45472145
#4 1 A 2.33807619
#5 2 B 1.56716709
#6 2 B 0.74982763
#7 2 B 0.07025534
#8 2 B 1.31174561
#9 2 B 0.57901536
#10 3 C -1.36629653
#11 3 C -0.66437155
#12 3 C 2.12506187
#13 3 C 1.20220402
#14 3 C 0.10687018
#15 3 C 0.15973401
Note that if the criteria is based on the compactness of code, if we remove the %>%
unnest(mutate(dd, value = dl))
Or another option is uncount and mutate
dd %>%
uncount(lengths(dl)) %>%
mutate(value = flatten_dbl(unname(dl)))
If it needs a join based on the names of the 'dl'
enframe(dl, name = 'Name') %>%
mutate(Name = toupper(Name)) %>%
left_join(dd) %>%
unnest
In base R, we can replicate the rows of 'dd' with lengths of 'dl' and transform to create the 'value' as unlisted 'dl'
transform(dd[rep(seq_len(nrow(dd)), lengths(dl)),], value = unlist(dl))

How to Pass column name in group by from a variable

Want to extract max values of a column of each group of data frame.
I have column name in a variable which i want to pass in group by condition but it is failing.
I have below data frame:
df <- read.table(header = TRUE, text = 'Gene Value
A 12
A 10
B 3
B 5
B 6
C 1
D 3
D 4')
Column values in Variables below:
columnselected <- c("Value")
groupbycol <- c("Gene")
My Code is :
df %>% group_by(groupbycol) %>% top_n(1, columnselected)
This code is giving error.
Gene Value
A 12
B 6
C 1
D 4
You need to convert column names to symbol using sym and then evaluate them using !!
library(dplyr)
df %>% group_by(!!sym(groupbycol)) %>% top_n(1, !!sym(columnselected))
# Gene Value
# <fct> <int>
#1 A 12
#2 B 6
#3 C 1
#4 D 4
We can use group_by_at and without using an additional package
library(dplyr)
df %>%
group_by_at(groupbycol) %>%
top_n(1, !! as.name(columnselected))
# A tibble: 4 x 2
# Groups: Gene [4]
# Gene Value
# <fct> <int>
#1 A 12
#2 B 6
#3 C 1
#4 D 4
NOTE: There would be many dupes for this post :=)

Grouping dataframe in 12 groups with same column values

I have a large dataset with about 15 columns and more than 3 million rows.
Because the dataset is so big, I would like to use multidplyron it .
Because of the data, it would be impossible to just split my data frame to 12 parts. Lets say that there are columns col1 and col2 which each have several different values but they repeat (in each column separately).
How can I make 12 (or n) similar sized groups which each of them contain rows that have the same value in both col1 and col2?
Example: Lets say one of the possible values in col1 foo and in col2 is bar. Then they would be grouped, all rows with this values would be in one group.
So that the question makes sense, there are always more than 12 unique combinations of col1 and col2.
I would try to do something with for and while loops if this was python but as this is R, there probably is another way.
Try this:
# As you provided no example data, I created some data repeating three times.
# I used dplyr within tidyverse. Then grouped by the columns and sliced
# the data by chance for n=2.
library(tidyverse)
df <- data.frame(a=rep(LETTERS,3), b=rep(letters,3))
# the data:
df %>%
arrange(a,b) %>%
group_by(a,b) %>%
mutate(n=1:n())
# A tibble: 78 x 3
# Groups: a, b [26]
a b n
<fctr> <fctr> <int>
1 A a 1
2 A a 2
3 A a 3
4 B b 1
5 B b 2
6 B b 3
7 C c 1
8 C c 2
9 C c 3
10 D d 1
# ... with 68 more rows
Slicing down the data by chance on two rows per group.
set.seed(123)
df %>%
arrange(a,b) %>%
group_by(a,b) %>%
mutate(n=1:n()) %>%
sample_n(2)
# A tibble: 52 x 3
# Groups: a, b [26]
a b n
<fctr> <fctr> <int>
1 A a 1
2 A a 2
3 B b 2
4 B b 3
5 C c 3
6 C c 1
7 D d 2
8 D d 3
9 E e 2
10 E e 1
# ... with 42 more rows
# Create sample data
library(dplyr)
df <- data.frame(a=rep(LETTERS,3), b=rep(letters,3),
nobs=sample(1:100, 26*3,replace=T), stringsAsFactors=F)
# Get all unique combinations of col1 and col2
combos <- df %>%
group_by(a,b) %>%
summarize(n=sum(nobs)) %>%
as.data.frame(.)
top12 <- combos %>%
arrange(desc(n)) %>%
top_n(12,n)
top12
l <- list()
for(i in 1:11){
l[[i]] <- combos[combos$a==top12[i,"a"] & combos$b==top12[i,"b"],]
}
l[[12]] <- combos %>%
anti_join(top12,by=c("a","b"))
l
# This produces a list 'l' that contains twelve data frames -- the top 11 most-commonly occuring pairs of col1 and col2, and all the rest of the data in the 12th list element.

Repeating rows of data.frame in dplyr [duplicate]

This question already has answers here:
Repeat each row of data.frame the number of times specified in a column
(10 answers)
Closed 2 years ago.
I have a trouble with repeating rows of my real data using dplyr. There is already another post in here repeat-rows-of-a-data-frame but no solution for dplyr.
Here I just wonder how could be the solution for dplyr
but failed with error:
Error: wrong result size (16), expected 4 or 1
library(dplyr)
df <- data.frame(column = letters[1:4])
df_rep <- df%>%
mutate(column=rep(column,each=4))
Expected output
>df_rep
column
#a
#a
#a
#a
#b
#b
#b
#b
#*
#*
#*
Using the uncount function will solve this problem as well. The column count indicates how often a row should be repeated.
library(tidyverse)
df <- tibble(letters = letters[1:4])
df
# A tibble: 4 x 1
letters
<chr>
1 a
2 b
3 c
4 d
df %>%
mutate(count = c(2, 3, 2, 4)) %>%
uncount(count)
# A tibble: 11 x 1
letters
<chr>
1 a
2 a
3 b
4 b
5 b
6 c
7 c
8 d
9 d
10 d
11 d
I was looking for a similar (but slightly different) solution. Posting here in case it's useful to anyone else.
In my case, I needed a more general solution that allows each letter to be repeated an arbitrary number of times. Here's what I came up with:
library(tidyverse)
df <- data.frame(letters = letters[1:4])
df
> df
letters
1 a
2 b
3 c
4 d
Let's say I want 2 A's, 3 B's, 2 C's and 4 D's:
df %>%
mutate(count = c(2, 3, 2, 4)) %>%
group_by(letters) %>%
expand(count = seq(1:count))
# A tibble: 11 x 2
# Groups: letters [4]
letters count
<fctr> <int>
1 a 1
2 a 2
3 b 1
4 b 2
5 b 3
6 c 1
7 c 2
8 d 1
9 d 2
10 d 3
11 d 4
If you don't want to keep the count column:
df %>%
mutate(count = c(2, 3, 2, 4)) %>%
group_by(letters) %>%
expand(count = seq(1:count)) %>%
select(letters)
# A tibble: 11 x 1
# Groups: letters [4]
letters
<fctr>
1 a
2 a
3 b
4 b
5 b
6 c
7 c
8 d
9 d
10 d
11 d
If you want the count to reflect the number of times each letter is repeated:
df %>%
mutate(count = c(2, 3, 2, 4)) %>%
group_by(letters) %>%
expand(count = seq(1:count)) %>%
mutate(count = max(count))
# A tibble: 11 x 2
# Groups: letters [4]
letters count
<fctr> <dbl>
1 a 2
2 a 2
3 b 3
4 b 3
5 b 3
6 c 2
7 c 2
8 d 4
9 d 4
10 d 4
11 d 4
This is rife with peril if the data.frame has other columns (there, I said it!), but the do block will allow you to generate a derived data.frame within a dplyr pipe (though, ceci n'est pas un pipe):
library(dplyr)
df <- data.frame(column = letters[1:4], stringsAsFactors = FALSE)
df %>%
do( data.frame(column = rep(.$column, each = 4), stringsAsFactors = FALSE) )
# column
# 1 a
# 2 a
# 3 a
# 4 a
# 5 b
# 6 b
# 7 b
# 8 b
# 9 c
# 10 c
# 11 c
# 12 c
# 13 d
# 14 d
# 15 d
# 16 d
As #Frank suggested, a much better alternative could be
df %>% slice(rep(1:n(), each=4))
I did a quick benchmark to show that uncount() is a lot faster than expand()
# for the pipe
library(magrittr)
# create some test data
df_test <-
tibble::tibble(
letter = letters,
row_count = sample(1:10, size = 26, replace = TRUE)
)
# benchmark
bench <- microbenchmark::microbenchmark(
expand = df_test %>%
dplyr::group_by(letter) %>%
tidyr::expand(row_count = seq(1:row_count)),
uncount = df_test %>%
tidyr::uncount(row_count)
)
# plot the benchmark
ggplot2::autoplot(bench)

Resources