I have a data frame with IDs and string values, of which some I prefer over others:
library(dplyr)
d1<-data.frame(id=c("a", "a", "b", "b"),
value=c("good", "better", "good", "good"))
I want to handle this the same way as the following example with numbers:
d2<-data.frame(id=c("a", "a", "b", "b"),
value=c(1, 2, 1, 1))
d2 %>% group_by(id) %>%
summarize(max(value))
So if an ID has multiple values, I will always get the highest number for each ID:
# A tibble: 2 x 2
id `max(value)`
<fct> <dbl>
1 a 2
2 b 1
Equivalently, if an ID has multiple strings, I always want to extract the preferred string for the d1 data frame: if we have "good", use that row; if another row has "better", use that row instead, thus eliminating duplicated IDs.
The example is arbitrary; it could just as well be: if we have "yes" and "unknown", take "yes", else take "unknown".
So is there an "extract best string" function for the dplyr::summarize() function?
The result should look like this:
id | value
----------
"a"| "better"
"b"| "good"
You can try a factor approach.
First you need an ordered vector of your strings like:
my_levels <- c("better", "good")
Then you change the levels accordingly, transform to numeric, summarize and transform back.
d1 %>%
mutate(value_num = factor(value, levels = my_levels) %>% as.numeric) %>%
group_by(id) %>%
summarize(res = min(value_num)) %>%
mutate(res_fac = factor(my_levels[res], levels = my_levels)) # index back into my_levels; also works when only some levels survive
# A tibble: 2 x 3
id res res_fac
<chr> <dbl> <fct>
1 a 1 better
2 b 2 good
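A more compact variant of the same idea, as a minimal sketch: instead of converting to a factor and back, match() looks up each string's position in the preference vector, min() picks the most preferred position, and that position indexes back into the vector. It assumes my_levels is ordered from most to least preferred and that every value appears in it; the name best is only illustrative.

```r
library(dplyr)

d1 <- data.frame(id = c("a", "a", "b", "b"),
                 value = c("good", "better", "good", "good"))

# Preference order, most preferred first (assumed to cover all values)
my_levels <- c("better", "good")

best <- d1 %>%
  group_by(id) %>%
  # position of each string in my_levels; the smallest position wins
  summarize(value = my_levels[min(match(value, my_levels))])

best
```

A value missing from my_levels would yield NA for its group, so it may be worth checking all(d1$value %in% my_levels) first.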
Similar to @Roman's answer, but using the data.table package, you could do the following to keep the "better" rows:
require(data.table)
setDT(d1)
# convert value to factor
d1[ , value := factor(value, levels = c('better', 'good'))]
# return first ordered value by each id group
d1[ , .SD[order(value)][1], id]
Below is my example, let me explain what I am trying to do, although it is not working as I want it to.
I need to find all instances where there are 2+ unique values in column z on the same date for the same person. However, I need to find where a specific list of values in column z are present.
library(tidyverse)
x <- c("Person A","Person A","Person A","Person A","Person A","Person A")
y <- c("2022-01-01","2022-01-01","2022-01-20","2022-02-01","2022-02-01","2022-02-01")
z <- c("A","D","A","A","C","B")
df <- data.frame(x,y,z)
df
df %>%
group_by(x,y) %>%
mutate(unique_z = n_distinct(z)) %>%
# ungroup() %>%
filter(unique_z > 1,
z %in% c("C","B"))
Below is an image of what I want the output to be, but I cannot figure it out.
Row 1 and 2 should be removed because even though it is 2 unique values of z on the same date for the same person, it does not include "C" or "B".
Row 3 is removed because it is only one unique value for that person and date.
Rows 4, 5, and 6 should all stay because that person/date combination has three unique values of z, and "C" and/or "B" occur in these rows. For some reason, row 4 is removed every time. I want to see the other values of z for that person/date combination. I thought grouping and filtering would do this, but it does not seem to work the way I am doing it.
You need to use any to check for the presence of c("B", "C") within each group and not at each row; see below:
library(dplyr)
df %>%
group_by(x,y) %>%
mutate(unique_z = n_distinct(z)) %>%
filter(unique_z > 1,
any(z %in% c("B","C")))
## any(z %in% c("C")) & any(z %in% c("B")))
## use this one instead if you want B and C present at the same time ...
## ... and two B's or two C's are not desired
# # A tibble: 3 x 4
# # Groups: x, y [1]
# x y z unique_z
# <fct> <fct> <fct> <int>
# 1 Person A 2022-02-01 A 3
# 2 Person A 2022-02-01 C 3
# 3 Person A 2022-02-01 B 3
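To make the difference concrete, here is a small sketch that runs both filters on the question's data. The names row_wise and group_wise are only illustrative. Without any(), the condition z %in% c("B","C") is evaluated per row, so the "A" row of the qualifying group is dropped; with any(), it collapses to one TRUE/FALSE per group and the whole group is kept.

```r
library(dplyr)

df <- data.frame(
  x = rep("Person A", 6),
  y = c("2022-01-01", "2022-01-01", "2022-01-20",
        "2022-02-01", "2022-02-01", "2022-02-01"),
  z = c("A", "D", "A", "A", "C", "B")
)

# Row-level test: only the rows that are themselves "B" or "C" survive
row_wise <- df %>%
  group_by(x, y) %>%
  filter(n_distinct(z) > 1, z %in% c("B", "C")) %>%
  ungroup()

# Group-level test: every row of a group containing "B" or "C" survives
group_wise <- df %>%
  group_by(x, y) %>%
  filter(n_distinct(z) > 1, any(z %in% c("B", "C"))) %>%
  ungroup()
```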
I have a df like this:
df <- data.frame(
id = c("A", "A", "B", NA, "A", "B", "B", "B"),
speech = c("hi", "how are you [Larry]?", "[uh]", "(0.123)", "I'm fine [you 'n Mary] how's it [goin]?", "[erm]", "(0.4)", "well")
)
I want to filter out those rows (1) where speech is made up entirely of an expression wrapped in square brackets [...] from string start to string end AND (2) those rows by the same ID which follow the row where [...] makes up the whole speech. I know how to filter out the rows with [...]:
df %>%
group_by(grp = rleid(id)) %>%
filter(grepl("^\\[.*?\\]$", speech))
but I don't know how to also filter out the same-ID rows that follow the [...] row. The desired output is this:
df
id speech
1 B [uh]
2 B [erm]
3 B (0.4)
4 B well
Create the grouping index with rleid as in the OP's code, then filter out the groups that don't have a [ at the start of the first element of 'speech', and ungroup:
library(dplyr)
library(data.table)
library(stringr)
df %>%
group_by(grp = rleid(id)) %>%
filter(str_detect(first(speech), "^\\[")) %>%
ungroup %>%
select(-grp)
-output
# A tibble: 4 x 2
# id speech
# <chr> <chr>
#1 B [uh]
#2 B [erm]
#3 B (0.4)
#4 B well
EDIT: Based on @ChrisRuehlemann's comments
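For readers who want to see what the run-length index looks like, here is a sketch that exposes the grp column before filtering. It assumes a stricter test than the answer's: the first speech of each run must be wrapped in brackets from start to end, as the question describes, instead of merely starting with "[". The names with_grp and kept are illustrative.

```r
library(dplyr)

df <- data.frame(
  id = c("A", "A", "B", NA, "A", "B", "B", "B"),
  speech = c("hi", "how are you [Larry]?", "[uh]", "(0.123)",
             "I'm fine [you 'n Mary] how's it [goin]?", "[erm]", "(0.4)", "well")
)

# rleid() gives each run of consecutive identical ids its own index
with_grp <- df %>%
  mutate(grp = data.table::rleid(id))

# keep only runs whose first speech is entirely one [...] expression
kept <- with_grp %>%
  group_by(grp) %>%
  filter(grepl("^\\[.*\\]$", speech[1])) %>%
  ungroup() %>%
  select(-grp)
```

On this data both tests select the same two runs, so kept matches the output shown above.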
First time posting here! Been struggling with this for about two days but I have a dataframe that looks like this:
code.1 <- factor(c(rep("x",3), rep("y",2), rep("z",3)))
type.1 <- factor(c(rep("small", 2), rep("medium", 2), rep("large", 4)))
df <- cbind.data.frame(type.1, code.1)
df
And am trying to get it to return this:
code.2 <- factor(c("x", "y", "z"))
type.2 <- factor(c("multiple", "multiple", "large"))
df2 <- cbind.data.frame(type.2, code.2)
df2
I've tried all manner of If/Else and apply functions grouping by "code" to return these results but am stuck. Any help appreciated!
You can do that with dplyr: you group by code.1, then all you have to do is to summarize type.1 with an if/else: if there is only a single value, you return it, else you return "multiple".
The code is slightly more complicated for practical reasons: type.1 has to be converted to character, and because if_else() evaluates its TRUE branch even when the condition is FALSE, that branch must always return a single value (hence first()):
df %>%
group_by(code.1) %>%
summarize(type.2 = if_else(n_distinct(type.1) == 1,
as.character(first(type.1)),
"multiple"),
type.2 = as.factor(type.2))
# A tibble: 3 x 2
# code.1 type.2
# <fct> <fct>
# 1 x multiple
# 2 y multiple
# 3 z large
EDIT: here is a different formulation of the same approach without converting to character, might be better suited for large problems, and might give a different view of the same question:
# default value when multiple
iffalse <- as.factor("multiple")
df %>%
group_by(code.1) %>%
mutate(type.1 = factor(type.1, levels = c(levels(type.1), levels(iffalse)))) %>% # add possible level to type.1
summarize(type.2 = if_else(n_distinct(type.1) == 1,
first(type.1),
iffalse))
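Since n_distinct(type.1) == 1 is a single TRUE/FALSE per group, a plain if/else inside summarize() is another option. This is only a sketch of that alternative, not the answer's method; it sidesteps if_else()'s type strictness and only evaluates the branch that is taken. The name res is illustrative.

```r
library(dplyr)

code.1 <- factor(c(rep("x", 3), rep("y", 2), rep("z", 3)))
type.1 <- factor(c(rep("small", 2), rep("medium", 2), rep("large", 4)))
df <- data.frame(type.1, code.1)

res <- df %>%
  group_by(code.1) %>%
  # scalar condition per group, so base if/else is sufficient
  summarize(type.2 = if (n_distinct(type.1) == 1) {
    as.character(first(type.1))
  } else {
    "multiple"
  })

res
```

Here type.2 comes back as character; wrap it in factor() afterwards if a factor column is needed.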
I have the following dataframe:
Each client's cap could be upgraded at some point in time defined by column Date. I would like to aggregate on ID and show on what date the cap has been upgraded. Sometimes this could happen twice. The output should look like this:
Thank you in advance!
library(tidyverse)
df <- tibble(
ID = c(1,1,1,2,2,2,2,3,3),
Cap = c("S", "S", "M", "S", "M", "L", "L", "S", "L"),
Date = paste("01", c(1:2, 4, 3:6, 2:3), "2000") %>% lubridate::dmy()
)
df2 <- df %>%
group_by(ID) %>% # looking at each ID separately
mutate(prev = lag(Cap), # what is the row - 1 value
change = !(Cap == prev)) %>% # is the row - 1 value different than the current row value
filter(change) %>% # filtering where there are changes
select(ID, "From" = prev, "To" = Cap, Date) # renaming columns and selecting the relevant ones
You can use the lag function here to create a column holding the previous row's value of Cap. Then you simply filter out each ID's first entry and the rows where the value did not change.
out <- dat %>%
## calculate lag within unique subjects
group_by(ID) %>%
mutate(
## copy previous row value to new column
from=lag(Cap),
to=Cap
) %>%
ungroup() %>%
## ignore first entry for each subject
drop_na(from) %>%
## ignore all rows where Cap didn't change
filter(from != to) %>%
## reorder columns
select(ID, from, to, Date)
This gives us output matching your expected format
> out
# A tibble: 4 x 4
ID from to Date
<dbl> <fct> <fct> <dbl>
1 1 S M 4
2 2 S M 4
3 2 M L 5
4 3 S L 3
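Because dat is not defined in the post, the following sketch applies the same lag-and-filter steps to the sample tibble reconstructed earlier in this thread. The dates are an assumption carried over from that reconstruction (first of the month, year 2000), and the name changes is illustrative.

```r
library(dplyr)

df <- tibble(
  ID   = c(1, 1, 1, 2, 2, 2, 2, 3, 3),
  Cap  = c("S", "S", "M", "S", "M", "L", "L", "S", "L"),
  # assumed dates: first day of the given month in 2000
  Date = as.Date(sprintf("2000-%02d-01", c(1, 2, 4, 3, 4, 5, 6, 2, 3)))
)

changes <- df %>%
  group_by(ID) %>%
  mutate(from = lag(Cap)) %>%             # previous Cap within the same ID
  ungroup() %>%
  filter(!is.na(from), from != Cap) %>%   # drop first rows and unchanged rows
  select(ID, from, to = Cap, Date)

changes
```

ID 2 appears twice in the result because its cap was upgraded twice (S to M, then M to L).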
I would like to find the minimum value of a variable (time) at which each of several other variables equals 1 (or any other value). Basically, my application is finding the first year in which x == 1, for several x. I know how to find this for one x, but I would like to avoid generating multiple reduced data frames of minima and then merging them together. Is there an efficient way to do this? Here is my example data and my solution for one variable.
d <- data.frame(cat = c(rep("A",10), rep("B",10)),
time = c(1:10),
var1 = c(0,0,0,1,1,1,1,1,1,1,0,0,0,0,0,0,1,1,1,1),
var2 = c(0,0,0,0,1,1,1,1,1,1,0,0,0,0,0,0,0,1,1,1))
library(plyr) # provides ddply() and .()
ddply(d[d$var1==1,], .(cat), summarise,
start= min(time))
How about this using dplyr
d %>%
group_by(cat) %>%
summarise_at(vars(contains("var")), funs(time[which(. == 1)[1]]))
Which gives
# A tibble: 2 x 3
# cat var1 var2
# <fct> <int> <int>
# 1 A 4 5
# 2 B 7 8
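In more recent dplyr releases funs() and the summarise_at() family are deprecated or superseded, so the same idea can be sketched with across(). This assumes dplyr 1.0 or later; the name first_time is illustrative. Inside the lambda, .x is the current var column and time refers to the time column of the current group.

```r
library(dplyr)

d <- data.frame(cat  = c(rep("A", 10), rep("B", 10)),
                time = c(1:10),
                var1 = c(0,0,0,1,1,1,1,1,1,1,0,0,0,0,0,0,1,1,1,1),
                var2 = c(0,0,0,0,1,1,1,1,1,1,0,0,0,0,0,0,0,1,1,1))

first_time <- d %>%
  group_by(cat) %>%
  # for each var column: the time at the first row where it equals 1
  summarise(across(starts_with("var"), ~ time[which(.x == 1)[1]]))

first_time
```

If a column never equals 1 within a group, which(...)[1] is NA and the result for that cell is NA as well. Taking the first match gives the minimum time only when rows are sorted by time within each group, as they are here.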
We can use base R to get the minimum 'time' among all the columns of 'var' grouped by 'cat'
sapply(split(d[-1], d$cat), function(x)
x$time[min(which(x[-1] ==1, arr.ind = TRUE)[, 1])])
#A B
#4 7
Is this something you are expecting?
library(dplyr)
df <- d %>%
group_by(cat, var1, var2) %>%
summarise(start = min(time)) %>%
filter()
I have left a blank filter argument that you can use to specify any filter condition you want (say var1 == 1 or cat == "A")