Given a dataset
key <- rep(c('a', 'b', 'c'), 10)
value <- sample(30)
df <- data.frame(key, value)
I would like to draw a different number of samples for each group in key. A simple attempt using dplyr that obviously does not work for this task is
ns <- c('a'= 1, 'b'= 2, 'c' = 3)
df %>%
  mutate(n_s = ns[key]) %>%
  group_by(key) %>%
  sample_n(n_s)
Is there a solution that looks as simple as that?
You can use mapply with split(df, df$key) and ns as arguments, but note that the names of ns are not used: it is the order of the groups that counts, and if the number of groups doesn't match the length of ns, ns will be recycled.
set.seed(129)
mapply(sample_n, split(df, df$key), ns, SIMPLIFY = FALSE) %>%
  bind_rows()
# key value
# (fctr) (int)
#1 a 29
#2 b 14
#3 b 22
#4 c 10
#5 c 24
#6 c 3
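Note that sample_n() is superseded in current dplyr; assuming a recent version, a sketch with group_modify() and slice_sample() achieves the same thing while actually using the names of ns, so group order no longer matters:
library(dplyr)
set.seed(129)
df %>%
  group_by(key) %>%
  # look up the per-group sample size by the group's name, not its position
  group_modify(~ slice_sample(.x, n = ns[[as.character(.y$key)]])) %>%
  ungroup()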
You can look at the stratified function from my "splitstackshape" package:
library(splitstackshape)
ns <- c('a'= 1, 'b'= 2, 'c' = 3)
stratified(df, "key", size = ns)
# key value
# 1: a 7
# 2: b 10
# 3: b 13
# 4: c 4
# 5: c 20
# 6: c 9
I'm still learning R and was wondering if there is an elegant way of manipulating the below df to achieve df2.
I'm not sure if a loop is supposed to be used for this, but basically I want to extract the first non-NA "X_No" value if the "X_No" value in the first row is NA. This is perhaps best described through an example going from df to the desired df2.
A_ID <- c('A','B','I','N')
A_No <- c(11,NA,15,NA)
B_ID <- c('B','C','D','J')
B_No <- c(NA,NA,12,NA)
C_ID <- c('E','F','G','P')
C_No <- c(NA,13,14,20)
D_ID <- c('J','K','L','M')
D_No <- c(NA,NA,NA,40)
E_ID <- c('W','X','Y','Z')
E_No <- c(50,32,48,40)
df <- data.frame(A_ID,A_No,B_ID,B_No,C_ID,C_No,D_ID,D_No,E_ID,E_No)
ID <- c('A','D','F','M','W')
No <- c(11,12,13,40,50)
df2 <- data.frame(ID,No)
I'm hoping for an elegant solution to this, as there are over 1,000 columns similar to the example provided.
I've looked all over the web for a similar example that would reproduce the expected result, but to no avail.
Your help is very much appreciated.
Thank you
I don't know if I'd call it "elegant", but here is a potential solution:
library(tidyverse)
A_ID <- c('A','B','I','N')
A_No <- c(11,NA,15,NA)
B_ID <- c('B','C','D','J')
B_No <- c(NA,NA,12,NA)
C_ID <- c('E','F','G','P')
C_No <- c(NA,13,14,20)
D_ID <- c('J','K','L','M')
D_No <- c(NA,NA,NA,40)
E_ID <- c('W','X','Y','Z')
E_No <- c(50,32,48,40)
df <- data.frame(A_ID,A_No,B_ID,B_No,C_ID,C_No,D_ID,D_No,E_ID,E_No)
ID <- c('A','D','F','M','W')
No <- c(11,12,13,40,50)
df2 <- data.frame(ID,No)
output <- df %>%
  # split names like "A_ID" at the underscore: the prefix becomes a
  # grouping column ("Col"), the suffix ("ID"/"No") stays a column name
  pivot_longer(everything(),
               names_sep = "_",
               names_to = c("Col", ".value")) %>%
  drop_na() %>%           # drop rows where No (or ID) is missing
  group_by(Col) %>%
  slice_head(n = 1) %>%   # keep the first non-NA row per column group
  ungroup() %>%
  select(-Col)
df2
#> ID No
#> 1 A 11
#> 2 D 12
#> 3 F 13
#> 4 M 40
#> 5 W 50
output
#> # A tibble: 5 × 2
#> ID No
#> <chr> <dbl>
#> 1 A 11
#> 2 D 12
#> 3 F 13
#> 4 M 40
#> 5 W 50
all_equal(df2, output)
#> [1] TRUE
Created on 2023-02-08 with reprex v2.0.2
Using base R with max.col (assuming the columns alternate ID, No):
# transpose the No columns so each row is one column group, then find
# the first non-NA entry per group
ind <- max.col(!is.na(t(df[c(FALSE, TRUE)])), "first")
# (group, row) index pairs for matrix indexing
m1 <- cbind(seq_along(ind), ind)
# pull the matching ID and No for each column group
data.frame(ID = t(df[c(TRUE, FALSE)])[m1], No = t(df[c(FALSE, TRUE)])[m1])
ID No
1 A 11
2 D 12
3 F 13
4 M 40
5 W 50
Here is a data.table solution that should scale well to a (very) large dataset.
Functionally:
- Split the data.frame into a list of column chunks, based on their names: all columns starting with A_ go to the first element, all columns starting with B_ to the second, and so on.
- Stack these list elements on top of each other with data.table::rbindlist, ignoring the column names (this only works if every prefix group has the same number of columns).
- Get the first non-NA No value for each ID.
Code:
library(data.table)
# split based on what comes after the underscore
L <- split.default(df, f = gsub("(.*)_.*", "\\1", names(df)))
# bind together again
DT <- rbindlist(L, use.names = FALSE)
# extract the first value of the non-NA
DT[!is.na(A_No), .(No = A_No[1]), keyby = .(ID = A_ID)]
# ID No
# 1: A 11
# 2: D 12
# 3: F 13
# 4: G 14
# 5: I 15
# 6: M 40
# 7: P 20
# 8: W 50
# 9: X 32
#10: Y 48
#11: Z 40
I have a dataframe like this:
set.seed(123)
df <- data.frame(A = sample(LETTERS[1:5], 50, replace = TRUE),
                 B = sample(LETTERS[1:5], 50, replace = TRUE))
I want to filter the dataframe on two parameters: (i) the target rows that match a certain criterion and (ii) a certain number of rows that precede the target rows. Specifically, I want to filter rows where A == "A" & B == "A" as well as the five rows preceding the target row. I can do this with a two-step operation: first by defining a function, and second by using the function as input for slice:
Sequ <- function(col1, col2) {
  # get row indices of target rows with function `which`
  inds <- which(col1 == "A" & col2 == "A")
  # sort row indices of the rows before each target row AND the target row itself
  sort(unique(c(inds-5, inds-4, inds-3, inds-2, inds-1, inds)))
}
library(dplyr)
df %>%
  slice(Sequ(col1 = A, col2 = B))
A B
1 D C
2 D B
3 C B
4 C D
5 B B
6 A A
7 E B
8 E D
9 D C
10 D D
11 A A
12 C C
13 D E
14 B E
15 B E
16 B A
17 A A
18 C D
19 C B
20 B D
21 A B
22 A A
But surely there must be a more efficient replacement for this part: sort(unique(c(inds-5, inds-4, inds-3,inds-2, inds-1, inds))). If I want to filter not just the preceding 5 but, say, 10 or 100 rows, defining each index individually quickly becomes impractical. How can this part be coded more economically?
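(For reference, one compact way to generalize the index construction, sketched below with the illustrative name Sequ2, is to add the offsets -k:0 to every match with outer(); the answers that follow offer further alternatives.)
Sequ2 <- function(col1, col2, k = 5) {
  inds <- which(col1 == "A" & col2 == "A")
  # every match index plus its k preceding indices, deduplicated and sorted
  idx <- sort(unique(as.vector(outer(inds, -k:0, `+`))))
  idx[idx > 0] # guard against matches within the first k rows
}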
1) Define bothA which takes a matrix and returns TRUE if any row is all A's. Then use rollapply to apply it as a moving window.
library(zoo)
bothA <- function(x) any(rowSums(rbind(x) == "A") == 2)
ok <- rollapply(df, 6, bothA, align = "left", partial = TRUE, by.column = FALSE)
df[ok, ]
2) or in a pipe
df %>%
  filter(rollapply(., 6, bothA, align = "left", partial = TRUE, by.column = FALSE))
3) This also works:
ok <- rollapply(rowSums(df == "A") == 2, 6, any, align = "left", partial = TRUE)
df[ok, ]
Here is a dplyr solution that can be directly used in a pipe, with no need for filter.
Sequ <- function(x, col1, col2, value = "A") {
  x %>%
    mutate(grp = lag(cumsum({{col1}} == value & {{col2}} == value), default = 0)) %>%
    group_by(grp) %>%
    slice_tail(n = 5) %>%
    ungroup() %>%
    select(-grp)
}
df %>% Sequ(A, B)
## A tibble: 23 x 2
# A B
# <chr> <chr>
# 1 B D
# 2 C C
# 3 E A
# 4 D B
# 5 A A
# 6 C D
# 7 E E
# 8 C E
# 9 C C
#10 A A
## … with 13 more rows
One dplyr and purrr solution could be:
df %>%
  filter(row_number() %in% unlist(map(which(A == "A" & B == "A"), ~ (.x - 5):.x)))
I have a dataframe like below:
df = data.frame(a = runif(10, 0, 10),
                b = runif(10, 1, 10),
                c = runif(10, 0, 12))
How can I find the n largest values from this dataframe?
We can easily find top n from a vector. Is there any good way to find the top n from a dataframe?
Thanks a lot.
Maybe you can check out stack:
N=2
sort(stack(df)$values, decreasing=TRUE)[1:N]
[1] 10.884644 9.912067
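If you also want to see which column each value came from, a small variation (a sketch) keeps the ind column that stack() produces:
s <- stack(df)
head(s[order(s$values, decreasing = TRUE), ], N)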
You can use tidyr::gather() and dplyr::top_n().
First gather every column into one column using gather(key, value), then select the top n elements using top_n(). For example, the top 5:
library(tidyverse) # dplyr and tidyr
set.seed(10)
mydf <- data.frame(a = runif(10, 0, 10),
                   b = runif(10, 1, 10),
                   c = runif(10, 0, 12))
In gather(), you can freely choose the names of the key and value columns. The wt argument of top_n() should then match the value name you chose.
mydf %>%
  gather(key = "key", value = "value") %>%
  top_n(5, wt = value) %>%
  arrange(desc(value)) # sort by value
#> key value
#> 1 c 10.38
#> 2 c 10.06
#> 3 c 9.30
#> 4 c 9.25
#> 5 b 8.53
This gives the top-n values with their corresponding column names. However, if you want only the values, you can use unlist():
unlist(mydf) %>% # optionally, use.names = FALSE
  sort(decreasing = TRUE) %>%
  .[1:5]
#> c1 c7 c3 c9 b10
#> 10.38 10.06 9.30 9.25 8.53
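Note that gather() and top_n() are superseded in current tidyr/dplyr; assuming a recent version, a sketch of the modern equivalent uses pivot_longer() and slice_max():
library(dplyr)
library(tidyr)
mydf %>%
  pivot_longer(everything(), names_to = "key", values_to = "value") %>%
  slice_max(value, n = 5) # largest 5 values, sorted in decreasing order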
unlist() the data frame to convert it into a vector, sort it, and take the top values. So for the top 2 values we can do
tail(sort(unlist(df, use.names = FALSE)), 2)
#[1] 9.581705 9.591726
If it's a matrix, you won't need unlist:
tail(sort(as.matrix(df)), 2)
data
set.seed(1233)
df = data.frame(a = runif(10, 0, 10),
                b = runif(10, 1, 10),
                c = runif(10, 0, 12))
I suspect you're looking for slice_max().
Given, for example, the data below:
> df = data.frame(a = runif(5,0,10),
+ b = runif(5,1,10),
+ c = runif(5,-1,9))
> df
a b c
1 1.953615 6.663370 6.95084517
2 1.564794 2.376268 1.46826979
3 5.052276 3.609657 0.84467786
4 3.800541 5.506710 5.64018236
5 9.823815 9.158154 -0.03483406
We can get the three topmost rows (defined by the parameter n) sorted by the column a...
> slice_max(df, n=3, order_by=a)
a b c
1 9.823815 9.158154 -0.03483406
2 5.052276 3.609657 0.84467786
3 3.800541 5.506710 5.64018236
...column b...
> slice_max(df, n=3, order_by=b)
a b c
1 9.823815 9.158154 -0.03483406
2 1.953615 6.663370 6.95084517
3 3.800541 5.506710 5.64018236
...or column c:
> slice_max(df, n=3, order_by=c)
a b c
1 1.953615 6.663370 6.950845
2 3.800541 5.506710 5.640182
3 1.564794 2.376268 1.468270
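One design note: by default slice_max() keeps every row tied with the n-th value, so it can return more than n rows; pass with_ties = FALSE if you need exactly n:
slice_max(df, n = 3, order_by = a, with_ties = FALSE)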
I have a two-column data frame. The first column is a timestamp and the second column is some value. For example:
library(tidyverse)
set.seed(123)
data_df <- tibble(t = 1:15,
                  value = sample(letters, 15))
I have a another data frame that specifies the range of timestamps that need to be updated and their corresponding values. For example:
criteria_df <- tibble(start = c(1, 3, 7),
                      end = c(2, 5, 10),
                      value = c('a', 'b', 'c'))
This means that I need to mutate the value column in data_df so that its value from t=1 to t=2 is 'a', from t=3 to t=5 is 'b' and from t=7 to t=10 is 'c'.
What is the recommended way to do this in R?
The only way I could think of is to loop over each row in criteria_df and mutate the value column in data_df after filtering the t column, like so:
library(iterators)
library(foreach)
library(magrittr) # for the %<>% assignment pipe
row_iter <- iter(criteria_df, by = "row") # iterate over criteria_df one row at a time
foreach(row = row_iter, .combine = c) %do% {
  seg_start <- row$start
  seg_end <- row$end
  new_value <- row$value
  data_df %<>%
    mutate(value = if_else(between(t, seg_start, seg_end),
                           new_value,
                           value))
  NULL
}
We can do a two-step base R solution: first find the criteria_df values whose start/end range contains each data_df timestamp, then replace data_df's value with the matching criteria_df value where there is a match, keeping the original value otherwise.
inds <- sapply(data_df$t, function(x) criteria_df$value[x >= criteria_df$start &
                                                          x <= criteria_df$end])
data_df$value <- unlist(ifelse(lengths(inds) > 0, inds, data_df$value))
data_df
# t value
# <int> <chr>
# 1 1 a
# 2 2 a
# 3 3 b
# 4 4 b
# 5 5 b
# 6 6 a
# 7 7 c
# 8 8 c
# 9 9 c
#10 10 c
#11 11 p
#12 12 g
#13 13 r
#14 14 s
#15 15 b
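For larger data, an interval lookup with findInterval() avoids the per-element sapply() (a sketch, assuming the criteria ranges are sorted by start and non-overlapping):
# index of the last start <= t (0 if t precedes every start)
idx <- findInterval(data_df$t, criteria_df$start)
# a timestamp is covered only if it also lies at or before that range's end
hit <- idx > 0 & data_df$t <= criteria_df$end[pmax(idx, 1)]
data_df$value <- ifelse(hit, criteria_df$value[pmax(idx, 1)], data_df$value)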
I am trying to select the maximum value in a dataframe's third column based on the combinations of the values in the first two columns.
My problem is similar to this one but I can't find a way to implement what I need.
EDIT: Sample data changed to make the column names more obvious.
Here is some sample data:
library(tidyr)
set.seed(1234)
df <- data.frame(group1 = letters[1:4], group2 = letters[1:4])
df <- df %>% expand(group1, group2)
df <- subset(df, subset = group1!=group2)
df$score <- runif(n = 12,min = 0,max = 1)
df
# A tibble: 12 × 3
group1 group2 score
<fctr> <fctr> <dbl>
1 a b 0.113703411
2 a c 0.622299405
3 a d 0.609274733
4 b a 0.623379442
5 b c 0.860915384
6 b d 0.640310605
7 c a 0.009495756
8 c b 0.232550506
9 c d 0.666083758
10 d a 0.514251141
11 d b 0.693591292
12 d c 0.544974836
In this example rows 1 and 4 are 'duplicates'. I would like to select row 4 as the value in the score column is larger than in row 1. Ultimately I would like a dataframe to be returned with the group1 and group2 columns and the maximum value in the score column. So in this example, I expect there to be 6 rows returned.
How can I do this in R?
I'd prefer dealing with this problem in two steps:
library(dplyr)
# Create function for computing group IDs from data frame of groups (per column)
get_group_id <- function(groups) {
  apply(groups, 1, function(row) {
    paste0(sort(row), collapse = "_")
  })
}
group_id <- get_group_id(select(df, -score))
# Perform the computation
df %>%
  mutate(groupId = group_id) %>%
  group_by(groupId) %>%
  slice(which.max(score)) %>%
  ungroup() %>%
  select(-groupId)
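An alternative that skips building an ID string (a sketch, assuming the two group columns can be compared as characters) canonicalizes each pair with pmin()/pmax():
df %>%
  mutate(g1 = pmin(as.character(group1), as.character(group2)),
         g2 = pmax(as.character(group1), as.character(group2))) %>%
  group_by(g1, g2) %>%
  slice(which.max(score)) %>%
  ungroup() %>%
  select(-g1, -g2)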