I want to extract the row index of the minimum Number within each Group
Group <- c("A","A","A","A","A","B","B","C","C","C","C")
Number <- c(12,45,15,65,54,21,23,12,3,5,6)
data.frame(Group,Number)
Group Number
1 A 12
2 A 45
3 A 15
4 A 65
5 A 54
6 B 21
7 B 23
8 C 12
9 C 3
10 C 5
11 C 6
The result should be a vector that contains the indices:
Answer
vector <- c(1, 6, 9)
Create a row-number column, group by 'Group', summarise by returning the row number at the position of the minimum 'Number' (found with which.min), and pull that column as a vector
library(dplyr)
df1 %>%
mutate(rn = row_number()) %>%
group_by(Group) %>%
summarise(n = rn[which.min(Number)]) %>%
pull(n)
#[1] 1 6 9
data
df1 <- structure(list(Group = c("A", "A", "A", "A", "A", "B", "B", "C",
"C", "C", "C"), Number = c(12L, 45L, 15L, 65L, 54L, 21L, 23L,
12L, 3L, 5L, 6L)), class = "data.frame", row.names = c("1", "2",
"3", "4", "5", "6", "7", "8", "9", "10", "11"))
Does this work for you?
library(dplyr)
df %>%
mutate(row_n = row_number()) %>%
group_by(Group) %>%
slice_min(Number)
# A tibble: 3 x 3
# Groups: Group [3]
Group Number row_n
<chr> <dbl> <int>
1 A 12 1
2 B 12 7
3 C 3 8
The row numbers are in column row_n. If you want only the row numbers in the output, add %>% ungroup() %>% select(-c(1:2)) like so:
df %>%
mutate(row_n = row_number()) %>%
group_by(Group) %>%
slice_min(Number) %>%
ungroup() %>%
select(-c(1:2))
# A tibble: 3 x 1
row_n
<int>
1 1
2 7
3 8
Data:
Group <- c("A","A","A","A","A","B","B","C","C","C","C")
Number <- c(12,45,65,54,21,23,12,3,5,6,34)
df <- data.frame(Group,Number)
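One thing that may matter with the slice_min() approach above: as far as I know its default is with_ties = TRUE, so tied minima within a group are all returned, whereas which.min() returns only the first one. A minimal base R sketch of that difference, using a made-up vector x (not from the question):

```r
# Hypothetical vector with a tied minimum at positions 2 and 3
x <- c(3, 1, 1, 5)

first_min <- which.min(x)        # position of the first minimum only
all_min   <- which(x == min(x))  # positions of every tied minimum

print(first_min)
print(all_min)
```

If exactly one row per group is wanted, slice_min(Number, with_ties = FALSE) should behave more like which.min().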
This function returns the index i of the smallest value in v
FUN = function(v, i) i[which.min(v)]
Here are the values by group
v = split(df$Number, df$Group)
and the index into the original data.frame by group
i = split(seq_along(df$Number), df$Group)
Apply our function to each group
mapply(FUN, v, i)
In one go:
FUN = function(v, i) i[which.min(v)]
v = split(df$Number, df$Group)
i = split(seq_along(df$Number), df$Group)
mapply(FUN, v, i)
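The same idea can probably be written more compactly with tapply(), which splits the row indices by group and applies a function to each piece. This is only a sketch; it rebuilds the question's data under the name df_q (a name introduced here) so that it is self-contained:

```r
# Data as printed in the question; df_q is a name made up for this sketch
df_q <- data.frame(
  Group  = c("A", "A", "A", "A", "A", "B", "B", "C", "C", "C", "C"),
  Number = c(12, 45, 15, 65, 54, 21, 23, 12, 3, 5, 6),
  stringsAsFactors = FALSE
)

# Split the row indices by Group; within each piece, keep the index
# whose Number is smallest
idx <- tapply(seq_len(nrow(df_q)), df_q$Group,
              function(i) i[which.min(df_q$Number[i])])

# Drop the group names to get a plain vector of row indices
idx_vec <- unname(as.vector(idx))
print(idx_vec)
```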
I have a dataset with ids and associated values:
df <- data.frame(id = c("1", "2", "3"), value = c("12", "20", "16"))
I have a lookup table that matches the id to another reference label ref:
lookup <- data.frame(id = c("1", "1", "1", "2", "2", "3", "3", "3", "3"), ref = c("a", "b", "c", "a", "d", "d", "e", "f", "a"))
Note that id to ref is a many-to-many match: the same id can be associated with multiple ref, and the same ref can be associated with multiple id.
I'm trying to split the value associated with the df$id column equally into the associated ref columns. The output dataset would look like:
output <- data.frame(ref = c("a", "b", "c", "d", "e", "f"), value = c("18", "4", "4", "14", "4", "4"))
ref value
a 18
b 4
c 4
d 14
e 4
f 4
I tried splitting this into four steps:
(1) calling pivot_wider on lookup, turning rows with the same id value into columns (e.g., a, b, c)
(2) merging the two datasets based on id
(3) dividing each df$value equally into the a, b, c, etc. columns that are not empty
(4) transposing the dataset and summing across the id columns
I can't figure out how to make step (3) work, though, and I suspect there's a much easier approach.
A variation of @thelatemail's answer with base pipes.
merge(df, lookup) |> type.convert(as.is=TRUE) |>
transform(value=ave(value, id, FUN=\(x) x/length(x))) |>
with(aggregate(list(value=value), list(ref=ref), sum))
# ref value
# 1 a 18
# 2 b 4
# 3 c 4
# 4 d 14
# 5 e 4
# 6 f 4
Here's a potential logic. Merge value from df into lookup by id, divide value by number of matching rows, then group by ref and sum. Then take your pick of how you want to do it.
Base R
tmp <- merge(lookup, df, by="id", all.x=TRUE)
tmp$value <- ave(as.numeric(tmp$value), tmp$id, FUN=\(x) x/length(x) )
aggregate(value ~ ref, tmp, sum)
dplyr
library(dplyr)
lookup %>%
left_join(df, by="id") %>%
group_by(id) %>%
mutate(value = as.numeric(value) / n() ) %>%
group_by(ref) %>%
summarise(value = sum(value))
data.table
library(data.table)
setDT(df)
setDT(lookup)
lookup[df, on="id", value := as.numeric(value)/.N, by=.EACHI][
, .(value = sum(value)), by=ref]
# ref value
#1: a 18
#2: b 4
#3: c 4
#4: d 14
#5: e 4
#6: f 4
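A quick way to convince yourself that the splitting logic is right is to check that the total is conserved: the values going in (12 + 20 + 16) should equal the values coming out. The sketch below reruns the base R variant on the question's data; the names tmp, res, total_in and total_out are just for illustration:

```r
df <- data.frame(id = c("1", "2", "3"), value = c("12", "20", "16"),
                 stringsAsFactors = FALSE)
lookup <- data.frame(
  id  = c("1", "1", "1", "2", "2", "3", "3", "3", "3"),
  ref = c("a", "b", "c", "a", "d", "d", "e", "f", "a"),
  stringsAsFactors = FALSE
)

# Attach each id's value to every matching lookup row
tmp <- merge(lookup, df, by = "id", all.x = TRUE)

# Divide each value by the number of rows sharing that id
tmp$value <- ave(as.numeric(tmp$value), tmp$id,
                 FUN = function(x) x / length(x))

# Sum the shares per ref
res <- aggregate(value ~ ref, tmp, sum)

# The split should neither create nor lose anything
total_in  <- sum(as.numeric(df$value))
total_out <- sum(res$value)
print(res)
print(c(total_in = total_in, total_out = total_out))
```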
This may work
library(dplyr)
lookup %>%
  left_join(lookup %>%
              group_by(id) %>%
              summarise(n = n()) %>%                    # number of refs per id
              left_join(df, by = "id") %>%              # attach each id's value
              mutate(value = as.numeric(value)) %>%
              mutate(repl = value/n) %>%                # equal share per ref
              select(id, repl),
            by = "id"
  ) %>% select(ref, repl) %>%
  group_by(ref) %>% summarise(value = sum(repl))
ref value
<chr> <dbl>
1 a 18
2 b 4
3 c 4
4 d 14
5 e 4
6 f 4
It's a bit tricky to explain, but I'll try my best; query below. I have a df as shown below. I need to filter rows by group based on the maximum pop, but only among countries that have not already occurred in the earlier groups. (As per the output image, A does not feature in Group 2 because it had already occurred in Group 1.)
In short, I need unique values in the country column while also getting the maximum value of pop at the group level. I hope the picture conveys what I could not. (Tidyverse solution preferred.)
[Image: expected output]
df<- structure(list(Group = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L), country = c("A", "B", "C", "A", "E", "F", "A", "E", "G"), pop = c(200L, 100L, 50L, 200L, 150L, 120L, 200L, 150L,
100L)), class = "data.frame", row.names = c(NA, -9L))
I think this will do. Explanation of the syntax:
split the data into a list of tibbles, one per group
set aside the first group (it is used as .init in the next step, after being filtered for its max pop value)
use purrr::reduce to reduce the list of tibbles to a single tibble
in each iteration of reduce:
.init is the filtered first group
countries already chosen in previous groups are removed through anti_join
the remaining data are filtered for max pop again
the previously chosen rows are added back with bind_rows()
Thus, in the end we have the desired tibble.
library(dplyr)
library(purrr)
df %>% group_split(Group) %>% .[-1] %>%
  reduce(.init = df %>% group_split(Group) %>% .[[1]] %>%
           filter(pop == max(pop)),
         ~ .y %>%
           anti_join(.x, by = c("country" = "country")) %>%
           filter(pop == max(pop)) %>%
           bind_rows(.x) %>% arrange(Group))
# A tibble: 3 x 3
Group country pop
<int> <chr> <int>
1 1 A 200
2 2 E 150
3 3 G 100
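For anyone who finds reduce() hard to follow, the same logic can be sketched as a plain base R loop: walk through the groups in order, drop countries that were already picked, and keep the max-pop row among what is left. This is an illustrative sketch rather than a drop-in replacement; the function name pick_unique_max and the variables seen, cand and best are made up here:

```r
df <- data.frame(
  Group   = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L),
  country = c("A", "B", "C", "A", "E", "F", "A", "E", "G"),
  pop     = c(200L, 100L, 50L, 200L, 150L, 120L, 200L, 150L, 100L),
  stringsAsFactors = FALSE
)

pick_unique_max <- function(df) {
  seen  <- character(0)  # countries picked in earlier groups
  picks <- list()
  for (g in unique(df$Group)) {
    # candidates: rows of this group whose country was not picked before
    cand <- df[df$Group == g & !(df$country %in% seen), ]
    if (nrow(cand) == 0) next
    best <- cand[cand$pop == max(cand$pop), ]
    seen <- c(seen, best$country)
    picks[[length(picks) + 1]] <- best
  }
  do.call(rbind, picks)
}

res <- pick_unique_max(df)
print(res)
```

Like the reduce() version, this assumes the groups should be processed in the order they appear in the data.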
You can create a helper function that writes the maximum pop from each group in a vector and use it to filter the dataframe.
library(tidyverse)
max_values <- c()
helper <- function(dat, ...){
dat <- dat[!(dat %in% max_values)] # exclude maximum values from previous groups
max_value <- max(dat) # get current max. value
max_values <<- c(max_values, max_value) # append
return(max_value)
}
df %>%
group_by(Group) %>%
filter(pop == helper(pop))
which gives you:
# A tibble: 3 x 3
# Groups: Group [3]
Group country pop
<int> <chr> <int>
1 1 A 200
2 2 E 150
3 3 H 120
Data used:
> df
Group country pop
1 1 A 200
2 1 B 100
3 1 C 50
4 2 A 200
5 2 E 150
6 2 F 120
7 3 A 200
8 3 E 150
9 3 G 100
10 3 H 120
Here is another possibility, but it is overly simplified in that it does not take into account the possibility of a country having a higher population in a Group where it does not win.
library(dplyr)
df<- structure(list(Group = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L), country = c("A", "B", "C", "A", "E", "F", "A", "E", "G"), pop = c(200L, 100L, 50L, 200L, 150L, 120L, 200L, 150L,
100L)), class = "data.frame", row.names = c(NA, -9L))
df %>%
group_by(country) %>%
summarize(popmax = max(pop)) %>%
inner_join(df, by = c("popmax" = 'pop')) %>%
rename(country = country.y) %>%
select(-country.x) %>%
group_by(country) %>%
arrange(Group) %>%
slice(1) %>%
ungroup() %>%
group_by(Group) %>%
arrange(country) %>%
slice(1) %>%
select(Group, country, popmax) %>%
rename(pop = popmax)
My answer fails (while other answers don't) with this data set:
df <- tribble(
~Group, ~ country, ~pop,
1 , 'A', 200,
1 , 'B', 100,
1 , 'C', 50,
1 , 'G', 150,
2 , 'A', 200,
2 , 'E', 150,
2 , 'F', 120,
3 , 'A', 200,
3 , 'E', 150,
3 , 'G', 100
)
Update
@Crestor, who claims that my answer is not correct:
My answer is correct because my code gives the desired output as requested by OP.
Your objection that my code does not work on another scenario may be correct, but in this setting it is irrelevant, as my answer was only intended to solve the task at hand.
Here is the answer to your raised scenario with this dataset:
df1 <- structure(list(Group = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L),
country = c("A", "B", "C", "A", "E", "F", "A", "E", "G"),
pop = c(200L, 100L, 250L, 220L, 150L, 120L, 200L, 150L, 100L
)), row.names = c(NA, -9L), class = c("tbl_df", "tbl", "data.frame"
))
expected output by Crestor:
# A tibble: 3 x 3
Group country pop
<int> <chr> <int>
1 1 C 250
2 2 A 220
3 3 E 150
My code for your scenario, @crestor:
library(dplyr)
df1 %>%
group_by(country) %>%
arrange(Group) %>%
filter(pop == max(pop)) %>%
group_by(Group) %>%
filter(pop == max(pop))
Output:
Group country pop
<int> <chr> <int>
1 1 C 250
2 2 A 220
3 3 E 150
Original answer to the question by OP
To keep it simple: first arrange to bring the dataset into position. Then group_by country and keep the first row in each group with slice. Then group_by Group and filter for the max pop
library(dplyr)
df %>%
arrange(country, pop) %>%
group_by(country) %>%
slice(1) %>%
group_by(Group) %>%
filter(pop==max(pop))
Output:
Group country pop
<int> <chr> <int>
1 1 A 200
2 2 E 150
3 3 G 100
I have a dataset of events, grouped by let like so:
set.seed(3)
events <- data.frame(
let = rep(LETTERS[1:2], each=3),
age = c(0,sample(1:20, size=2),
0,sample(1:20, size=2)),
value = sample(1:100, size=6))
let age value
1 A 0 61
2 A 4 60
3 A 16 13
4 B 0 29
5 B 8 56
6 B 7 99
How can I cast the data frame so that age is multiple columns grouped into weeks? So for each column, take the value of the largest age that is less than or equal to 0, 7, 14, 21 days.
events.cast <- data.frame(
let = LETTERS[1:2],
T0_value = c(61,29),
T1_value = c(60,99),
T2_value = c(60,56),
T3_value = c(13,56))
let T0_value T1_value T2_value T3_value
1 A 61 60 60 13
2 B 29 99 56 56
One option is to cut the 'age' into buckets, get the max row by that group and 'let', then reshape into 'wide' format
library(dplyr)
library(tidyr)
library(stringr)
events %>%
group_by(grp = cut(age, breaks = c(-Inf,0, 7, 14, 21),
labels = str_c("T", 0:3, "_value")), let) %>%
slice(which.max(value)) %>%
ungroup %>%
select(-age) %>%
group_by(let) %>%
complete(grp = unique(.$grp)) %>%
fill(value) %>%
pivot_wider(names_from = grp, values_from = value)
# A tibble: 2 x 5
# Groups: let [2]
# let T0_value T1_value T2_value T3_value
# <chr> <int> <int> <int> <int>
#1 A 61 60 60 13
#2 B 29 99 56 56
data
events <- structure(list(let = c("A", "A", "A", "B", "B", "B"), age = c(0L,
4L, 16L, 0L, 8L, 7L), value = c(61L, 60L, 13L, 29L, 56L, 99L)),
class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))
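To see what the bucketing step in the answer above does on its own, here is a small base R sketch of cut() with the same breaks, applied to the non-duplicated ages from the example. By default cut() builds right-closed intervals (right = TRUE), which is why age 7 lands in T1 and age 8 in T2. The variable names age and bucket are just for illustration, and paste0() stands in for str_c():

```r
# Ages from the example data (0 appears for both groups, listed once here)
age <- c(0, 4, 16, 8, 7)

# Intervals are (-Inf,0], (0,7], (7,14], (14,21]
bucket <- cut(age, breaks = c(-Inf, 0, 7, 14, 21),
              labels = paste0("T", 0:3, "_value"))

print(data.frame(age, bucket))
```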
I have two df.
df1
col1
1 a
2 b
3 c
4 c
df2
setID col1
1 1 a
2 1 b
3 1 b
4 1 a
5 2 w
6 2 v
7 2 c
8 2 b
9 3 a
10 3 a
11 3 b
12 3 a
13 4 a
14 4 b
15 4 c
16 4 a
I'm using the following code to match them.
scorematch <- function ()
{
require("dplyr")
#to make sure every element is preceded by the one before that element
combm <- rev(sapply(rev(seq_along(df1$col1)), function(i) paste0(df1$col1[i-1], df1$col1[i])));
tempdf <- df2
#group the history by their ID
tempdf <- group_by(tempdf, setID)
#collapse strings in history
tempdf <- summarise(tempdf, ss = paste(col1, collapse = ""))
tempdf <- rowwise(tempdf)
#add score based on how it matches compared to path
tempdf <- mutate(tempdf, score = sum(sapply(combm, function(x) sum(grepl(x, ss)))))
tempdf <- ungroup(tempdf)
#filter so that only IDs with scores more than 0 are available
tempdf <- filter(tempdf, score != 0)
tempdf <- pull(tempdf, setID)
#filter original history to reflect new history
tempdf2 <- filter(df2, setID %in% tempdf)
tempdf2
}
This code works great. But I want to take this further. I want to apply a sliding window function to get the df1 values I want to match against df2. So far I'm using this function as my sliding window.
slidingwindow <- function(data, window, step)
{
#data is dataframe with colname
total <- length(data)
#spots are start of each window
spots <- seq(from=1, to=(total-step), by=step)
result <- vector(length = length(spots))
for(i in 1:length(spots)){
...
}
return(result)
}
The scorematch function will be nested inside slidingwindow function. I'm unsure how to proceed from there though. Ideally df1 will be split into windows. Starting from the first window it will be matched against df2 using the scorematch function to get a filtered out df2. Then I want the second window of df1 to match against the newly filtered df2 and so on. The loop should end when df2 has been filtered down so that it contains only 1 distinct setID value. The final output can either be the whole filtered df2 or just the remaining setID.
Ideal output would be either
setID col1
1 4 a
2 4 b
3 4 c
4 4 a
or
[1] "4"
Here is a solution without using a for-loop. I use stringr because of its nice consistent syntax, purrr for map (although lapply would be sufficient in this case) and dplyr to group_by setID and collapse the strings for each group.
library(dplyr)
library(purrr)
library(stringr)
First I collapse the string for each group. This makes it easier to use pattern matching with str_detect later:
df2_collapse <- df2 %>%
group_by(setID) %>%
summarise(string = str_c(col1, collapse = ""))
df2_collapse
# A tibble: 4 x 2
# setID string
# <int> <chr>
# 1 1 abba
# 2 2 wvcb
# 3 3 aaba
# 4 4 abca
The "look-up" string is collapsed as well, and then the substrings (i.e. sliding windows) are extracted with str_sub. Here I work along the length of the string (str_length) and extract all possible substrings starting at each letter in the string.
string <- str_c(df1$col1, collapse = "")
string
# [1] "abcc"
substrings <-
unlist(map(1:str_length(string), ~ str_sub(string, start = .x, end = .x:str_length(string))))
substrings
# [1] "a" "ab" "abc" "abcc" "b" "bc" "bcc" "c" "cc" "c"
Store the substrings in a tibble with their length as score.
substrings <- tibble(substring = substrings,
score = str_length(substrings))
substrings
# A tibble: 10 x 2
# substring score
# <chr> <int>
# 1 a 1
# 2 ab 2
# 3 abc 3
# 4 abcc 4
# 5 b 1
# 6 bc 2
# 7 bcc 3
# 8 c 1
# 9 cc 2
# 10 c 1
For each setID, we extract the maximum score it matches in the substring data and then keep the row with the maximum score across all setIDs.
df2_collapse %>%
mutate(score = map_dbl(string,
~ max(substrings$score[str_detect(.x, substrings$substring)]))) %>%
filter(score == max(score))
# A tibble: 1 x 3
# setID string score
# <int> <chr> <dbl>
# 1 4 abca 3
Data
df1 <- structure(list(col1 = c("a", "b", "c", "c")),
class = "data.frame", row.names = c("1", "2", "3", "4"))
df2 <-
structure(list(setID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L),
col1 = c("a", "b", "b", "a", "w", "v", "c", "b", "a", "a", "b", "a", "a", "b", "c", "a")),
class = "data.frame",
row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16"))
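If you would rather avoid the extra packages, the substring and scoring steps can be sketched in base R as well. This is only an illustration of the same technique: substring() is vectorised over its end position, and grepl(fixed = TRUE) stands in for str_detect(). The names subs, strings and score are made up for this sketch, and the collapsed strings are typed in directly from the df2_collapse output above:

```r
s <- "abcc"  # collapsed look-up string from df1
n <- nchar(s)

# All substrings starting at each position i and ending at i, i+1, ..., n
subs <- unlist(lapply(seq_len(n), function(i) substring(s, i, i:n)))

# Collapsed history strings, one per setID (1 to 4)
strings <- c("abba", "wvcb", "aaba", "abca")

# Score of a string = length of the longest substring it contains
score <- vapply(strings, function(x) {
  hit <- vapply(subs, grepl, logical(1), x = x, fixed = TRUE)
  max(nchar(subs)[hit])
}, integer(1))

best_setID <- which(score == max(score))
print(subs)
print(score)
print(unname(best_setID))
```

Note that max() would warn and return -Inf for a history string that matches no substring at all; that case does not occur in this data.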
For example, suppose that you had a function that applied some dplyr functions, but you couldn't expect datasets passed to this function to have the same column names.
For a simplified example of what I mean, say you had a data frame, arizona.trees:
arizona.trees
group arizona.redwoods arizona.oaks
A 23 11
A 24 12
B 9 8
B 10 7
C 88 22
and another very similar data frame, california.trees:
california.trees
group california.redwoods california.oaks
A 25 50
A 11 33
B 90 5
B 77 3
C 90 35
And you wanted to implement a function that returns the mean for the given groups (A, B, ... Z) for a given type of tree that would work for both of these data frames.
foo <- function(dataset, group1, group2, tree.type) {
  column.name <- colnames(dataset[2])
  result <- filter(dataset, group %in% c(group1, group2)) %>%
    select(group, contains(tree.type)) %>%
    group_by(group) %>%
    summarize("mean" = mean(column.name))
  return(result)
}
A desired output for a call of foo(california.trees, A, B, redwoods) would be:
result
mean
A 18
B 83.5
For some reason, doing something like the implementation of foo() just doesn't seem to work. This is likely due to some error with the data frame indexing - the function seems to think I am attempting to get the mean of the column.name string, rather than retrieving the column and passing the column to mean(). I'm not sure how to avoid this. There's the issue of the implicit passing of the modified dataframe that can't be directly referenced with the pipe operator that may be causing the issue.
Why is this? Is there some alternative implementation that would work?
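On the "why": column.name holds a character string, so mean(column.name) is asked to average the text itself and returns NA with a warning. The column has to be looked up by that name first. A minimal base R sketch of the difference, plus a hypothetical helper group_mean() (not part of the question) that takes the groups and tree type as strings:

```r
california.trees <- data.frame(
  group = c("A", "A", "B", "B", "C"),
  california.redwoods = c(25L, 11L, 90L, 77L, 90L),
  california.oaks = c(50L, 33L, 5L, 3L, 35L),
  stringsAsFactors = FALSE
)

column.name <- "california.redwoods"
m_string <- suppressWarnings(mean(column.name))   # mean of a string: NA
m_column <- mean(california.trees[[column.name]]) # mean of the actual column

# Hypothetical helper: find the column by partial name, then average by group
group_mean <- function(dataset, groups, tree.type) {
  col <- grep(tree.type, names(dataset), fixed = TRUE, value = TRUE)
  stopifnot(length(col) == 1)
  keep <- dataset$group %in% groups
  means <- tapply(dataset[[col]][keep], dataset$group[keep], mean)
  data.frame(group = names(means), mean = as.vector(means),
             stringsAsFactors = FALSE)
}

res <- group_mean(california.trees, c("A", "B"), "redwoods")
print(res)
```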
We can use the quosure-based approach from the development version of dplyr (soon to be released as 0.6.0)
foo <- function(dataset, group1, group2, tree.type){
group1 <- quo_name(enquo(group1))
group2 <- quo_name(enquo(group2))
colN <- rlang::parse_quosure(names(dataset)[2])
tree.type <- quo_name(enquo(tree.type))
dataset %>%
filter(group %in% c(group1, group2)) %>%
select(group, contains(tree.type)) %>%
group_by(group) %>%
summarise(mean = mean(UQ(colN)))
}
foo(california.trees, A, B, redwoods)
# A tibble: 2 × 2
# group mean
# <chr> <dbl>
#1 A 18.0
#2 B 83.5
foo(arizona.trees, A, B, redwoods)
# A tibble: 2 × 2
# group mean
# <chr> <dbl>
#1 A 23.5
#2 B 9.5
enquo takes the input arguments and converts them to quosures; quo_name converts those to strings for use with %in%. The second column name is converted from a string to a quosure using parse_quosure, and it is then unquoted (UQ or !!) for evaluation within summarise
NOTE: This is based on the OP's function about selecting the second column
The above solution was based on selecting the column based on position (as per the OP's code) and it may not work for other columns. So, we can match the 'tree.type' and get the 'mean' of the columns based on that
foo1 <- function(dataset, group1, group2, tree.type){
group1 <- quo_name(enquo(group1))
group2 <- quo_name(enquo(group2))
tree.type <- quo_name(enquo(tree.type))
dataset %>%
filter(group %in% c(group1, group2)) %>%
select(group, contains(tree.type)) %>%
group_by(group) %>%
summarise_at(vars(contains(tree.type)), funs(mean = mean(.)))
}
The function can be tested for different columns in the two datasets
foo1(arizona.trees, A, B, oaks)
# A tibble: 2 × 2
# group mean
# <chr> <dbl>
#1 A 11.5
#2 B 7.5
foo1(arizona.trees, A, B, redwood)
# A tibble: 2 × 2
# group mean
# <chr> <dbl>
#1 A 23.5
#2 B 9.5
foo1(california.trees, A, B, redwood)
# A tibble: 2 × 2
# group mean
# <chr> <dbl>
#1 A 18.0
#2 B 83.5
foo1(california.trees, A, B, oaks)
# A tibble: 2 × 2
# group mean
# <chr> <dbl>
#1 A 41.5
#2 B 4.0
data
arizona.trees <- structure(list(group = c("A", "A", "B", "B", "C"),
arizona.redwoods = c(23L,
24L, 9L, 10L, 88L), arizona.oaks = c(11L, 12L, 8L, 7L, 22L)),
.Names = c("group",
"arizona.redwoods", "arizona.oaks"), class = "data.frame",
row.names = c(NA, -5L))
california.trees <- structure(list(group = c("A", "A", "B", "B", "C"),
california.redwoods = c(25L,
11L, 90L, 77L, 90L), california.oaks = c(50L, 33L, 5L, 3L, 35L
)), .Names = c("group", "california.redwoods", "california.oaks"
), class = "data.frame", row.names = c(NA, -5L))
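As far as I know, some of the helpers used above (parse_quosure, UQ, funs) were deprecated in later dplyr/rlang releases. With a more recent dplyr, one possible sketch uses the .data pronoun to look a column up by its name. Note the assumptions: foo_modern is a made-up name, the groups and tree type are passed as strings rather than bare names, and only the first matching column is used:

```r
library(dplyr)

california.trees <- data.frame(
  group = c("A", "A", "B", "B", "C"),
  california.redwoods = c(25L, 11L, 90L, 77L, 90L),
  california.oaks = c(50L, 33L, 5L, 3L, 35L),
  stringsAsFactors = FALSE
)

# keep: character vector of groups to retain
# tree.type: string matched against the column names
foo_modern <- function(dataset, keep, tree.type) {
  col <- grep(tree.type, names(dataset), fixed = TRUE, value = TRUE)[1]
  dataset %>%
    filter(group %in% keep) %>%
    group_by(group) %>%
    summarise(mean = mean(.data[[col]]))  # look the column up by its name
}

res_modern <- foo_modern(california.trees, c("A", "B"), "redwoods")
print(res_modern)
```

Passing strings keeps the function free of quoting and unquoting altogether, at the cost of a slightly different call than foo(california.trees, A, B, redwoods).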