I have the following dataframe:
Each client's cap could be upgraded at some point in time defined by column Date. I would like to aggregate on ID and show on what date the cap has been upgraded. Sometimes this could happen twice. The output should look like this:
Thank you in advance!
library(tidyverse)
df <- tibble(
ID = c(1,1,1,2,2,2,2,3,3),
Cap = c("S", "S", "M", "S", "M", "L", "L", "S", "L"),
Date = paste("01", c(1:2, 4, 3:6, 2:3), "2000") %>% lubridate::dmy()
)
df2 <- df %>%
group_by(ID) %>% # looking at each ID separately
mutate(prev = lag(Cap), # what is the row - 1 value
change = !(Cap == prev)) %>% # is the row - 1 value different than the current row value
filter(change) %>% # filtering where there are changes
select(ID, "From" = prev, "To" = Cap, Date) # renaming columns and selecting the relevant ones
You can use lag() here to create a column holding the previous row's value of Cap. Then you simply drop the first entry for each ID and the rows where the value did not change.
## `dat` is the data frame from the question; needs dplyr and tidyr (for drop_na)
out <- dat %>%
## calculate lag within unique subjects
group_by(ID) %>%
mutate(
## copy previous row value to new column
from=lag(Cap),
to=Cap
) %>%
ungroup() %>%
## ignore first entry for each subject
drop_na(from) %>%
## ignore all rows where Cap didn't change
filter(from != to) %>%
## reorder columns
select(ID, from, to, Date)
This gives us output matching your expected format:
> out
# A tibble: 4 x 4
ID from to Date
<dbl> <fct> <fct> <dbl>
1 1 S M 4
2 2 S M 4
3 2 M L 5
4 3 S L 3
I have a data set that I want to pivot to long format depending on whether the variable name contains any of the strings in list_a <- c("a", "b", "c") and list_b <- c("usd", "eur", "gbp"). The data set only contains values in one row. I want the values in list_b to become column names and the values in list_a to become row names in the resulting data set. Please see the reproducible example data set below.
I currently solve this by applying the following R code once for each value in list_b, resulting in three data frames called "df_usd", "df_eur" and "df_gbp", which I then merge on the column "name". This is cumbersome, and I would very much appreciate help finding a more elegant solution: the variables in list_b change from month to month (list_a stays the same), and updating the existing code manually is both time-consuming and error-prone.
# Current solution for df_usd:
df_usd <- df %>%
select(date, contains("usd")) %>%
pivot_longer(cols = contains(c("a_", "b_", "c_")),
names_to = "name", values_to = "usd") %>% mutate(name = case_when(
str_detect(name, "a_") ~ "a",
str_detect(name, "b_") ~ "b",
str_detect(name, "c_") ~ "c")) %>%
select(-date)
A screenshot of the starting point in Excel
A screenshot of the result I want to achieve in Excel
# Example data to copy and paste into R for easy reproduction of problem:
df <- data.frame (date = c("2020-12-31"),
a_usd = c(1000),
b_usd = c(2000),
c_usd = c(3000),
a_eur = c(100),
b_eur =c(200),
c_eur = c(300),
a_gbp = c(10),
b_gbp = c(20),
c_gbp = c(30))
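One way is to specify names_sep together with names_to in pivot_longer; the special ".value" entry turns the second part of each name into the new column names:
library(dplyr)
library(tidyr)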
df %>%
pivot_longer(cols = -date, names_to = c("grp", ".value"), names_sep = "_")
-output
# A tibble: 3 x 5
# date grp usd eur gbp
# <chr> <chr> <dbl> <dbl> <dbl>
#1 2020-12-31 a 1000 100 10
#2 2020-12-31 b 2000 200 20
#3 2020-12-31 c 3000 300 30
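Because the column names are split on the separator rather than matched against a fixed list, the same call should keep working when the currencies change from month to month. A minimal sketch, assuming a hypothetical extra currency `sek` that is not in the original data (`df_new` and `long_new` are names made up for this illustration):

```r
library(tidyr)

# Hypothetical data: same layout as the question, plus an assumed "sek" currency
df_new <- data.frame(date = "2021-01-31",
                     a_usd = 1000, b_usd = 2000, c_usd = 3000,
                     a_sek = 9000, b_sek = 18000, c_sek = 27000)

# ".value" tells pivot_longer that the second name piece (the currency)
# supplies the output column names; the first piece goes into "grp"
long_new <- pivot_longer(df_new, cols = -date,
                         names_to = c("grp", ".value"), names_sep = "_")
long_new
```

No currency name appears in the call itself, so nothing should need editing when list_b changes.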
A base R option using reshape
reshape(
setNames(df, gsub("(\\w+)_(\\w+)", "\\2.\\1", names(df))),
direction = "long",
varying = -1
)
gives
date time usd eur gbp id
1.a 2020-12-31 a 1000 100 10 1
1.b 2020-12-31 b 2000 200 20 1
1.c 2020-12-31 c 3000 300 30 1
I have a data frame with IDs and string values, of which some I prefer over others:
library(dplyr)
d1<-data.frame(id=c("a", "a", "b", "b"),
value=c("good", "better", "good", "good"))
I want to handle that in a way equivalent to the following example with numbers:
d2<-data.frame(id=c("a", "a", "b", "b"),
value=c(1, 2, 1, 1))
d2 %>% group_by(id) %>%
summarize(max(value))
So if an ID has multiple values, I will always get the highest number for each ID:
# A tibble: 2 x 2
id `max(value)`
<fct> <dbl>
1 a 2
2 b 1
Equivalently, if an ID has multiple strings, I always want to extract the preferred string for the d1 data frame: if we have "good", use that row; if another row has "better", use that row instead, thus eliminating duplicated IDs.
The example is arbitrary, could also be >>if we have "yes" and "unknown" then take "yes", else take "unknown"<<
So is there an "extract best string" function for the dplyr::summarize() function?
The result should look like this:
id | value
----------
"a"| "better"
"b"| "good"
You can try a factor approach.
First you need an ordered vector of your strings like:
my_levels <- c("better", "good")
Then you change the levels accordingly, transform to numeric, summarize and transform back.
d1 %>%
mutate(value_num = factor(value, levels = my_levels) %>% as.numeric) %>%
group_by(id) %>%
summarize(res = min(value_num)) %>%
mutate(res_fac = factor(my_levels[res], levels = my_levels)) # index by rank so the label stays correct even if some level never wins
# A tibble: 2 x 3
id res res_fac
<chr> <dbl> <fct>
1 a 1 better
2 b 2 good
Similar to @Roman's answer, but using the data.table package, you could do the following to keep the "better" rows:
require(data.table)
setDT(d1)
# convert value to factor
d1[ , value := factor(value, levels = c('better', 'good'))]
# return first ordered value by each id group
d1[ , .SD[order(value)][1], id]
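As far as I know dplyr has no built-in "extract best string" function, but one can be sketched in a couple of lines. The helper below, `pick_preferred()`, is a hypothetical name introduced here, not part of dplyr; it assumes every group contains at least one string listed in the preference vector.

```r
library(dplyr)

d1 <- data.frame(id = c("a", "a", "b", "b"),
                 value = c("good", "better", "good", "good"))

# Preference order, most preferred first (same vector as in the factor answer)
my_levels <- c("better", "good")

# pick_preferred() is a hypothetical helper, not a dplyr function:
# match() gives each string its rank in `levels`, which.min() picks the best one
pick_preferred <- function(x, levels) {
  x[which.min(match(x, levels))]
}

best <- d1 %>%
  group_by(id) %>%
  summarize(value = pick_preferred(value, my_levels))
best
```

Swapping in another preference vector, for example c("yes", "unknown"), should cover the second case mentioned in the question.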
I have a df like this:
df <- data.frame(
id = c("A", "A", "B", NA, "A", "B", "B", "B"),
speech = c("hi", "how are you [Larry]?", "[uh]", "(0.123)", "I'm fine [you 'n Mary] how's it [goin]?", "[erm]", "(0.4)", "well")
)
I want to keep only those rows (1) where speech consists entirely of an expression wrapped in square brackets [...], from string start to string end, AND (2) the rows by the same ID that directly follow such a row. I know how to select the rows with [...]:
df %>%
group_by(grp = rleid(id)) %>%
filter(grepl("^\\[.*?\\]$", speech))
but I don't know how to also keep the same-ID rows that follow the [...] row. The desired output is this:
df
id speech
1 B [uh]
2 B [erm]
3 B (0.4)
4 B well
Create the grouping index with rleid as in the OP's code, then keep only the groups whose first element of 'speech' starts with a [, and ungroup:
library(dplyr)
library(data.table)
library(stringr)
df %>%
group_by(grp = rleid(id)) %>%
filter(str_detect(first(speech), "^\\[")) %>%
ungroup %>%
select(-grp)
-output
# A tibble: 4 x 2
# id speech
# <chr> <chr>
#1 B [uh]
#2 B [erm]
#3 B (0.4)
#4 B well
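The check above only looks at the opening bracket. If the first utterance of a run has to consist of a bracketed expression from start to end, as the question states, the question's own pattern can be applied to the first element instead. A sketch of that variant, calling `data.table::rleid()` by its full name so that data.table's `first()` does not mask dplyr's (`res` is a name chosen here):

```r
library(dplyr)

df <- data.frame(
  id = c("A", "A", "B", NA, "A", "B", "B", "B"),
  speech = c("hi", "how are you [Larry]?", "[uh]", "(0.123)",
             "I'm fine [you 'n Mary] how's it [goin]?", "[erm]", "(0.4)", "well"),
  stringsAsFactors = FALSE
)

res <- df %>%
  # consecutive rows with the same id form one run
  group_by(grp = data.table::rleid(id)) %>%
  # keep a run only if its first utterance matches the question's
  # whole-string bracket pattern
  filter(grepl("^\\[.*?\\]$", speech[1])) %>%
  ungroup() %>%
  select(-grp)
res
```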
EDIT: Based on @ChrisRuehlemann's comments
This is pretty tricky. Let's say I have, for example, a first dataset df:
sample id name
1 ID200,ID300,ID299 first
2 ID2,ID123 second
3 ID90 third
And a second dataset df_1:
ids condition
ID200 y
ID300 n
ID299 n
ID2 y
ID123 y
ID90 n
I have to filter, from the first dataset, all the rows in which all ID values satisfy a condition on the second table, like y.
So the filtering in this example should give:
sample id name
2 ID2,ID123 second
I was thinking to use something like:
new_df = df %>%
filter(grepl('ID', id), df_1$condition == 'y')
But obviously I need something different, can you give me some clues?
Edit: As I said in the comment, what happens if I have df's id column populated with other text, like this?
sample id name
1 ID = ID200,ID300,ID299,abcd first
2 ID = ID2,ID123, dfg second
3 ID = ID90, text third
Perhaps a bit inelegant, but this would give you the final condition status of each sample.
library(tidyverse)
df <- tibble(sample = c(1, 2, 3),
id = c("ID200,ID300,ID299", "ID2,ID123", "ID90"),
name = c("first", "second", "third"))
df_1 <- tibble(ids = c("ID200", "ID300", "ID299", "ID2", "ID123", "ID90"),
condition = c("y", "n", "n", "y", "y", "n"))
df2 <- df %>%
mutate(ids = str_split(id, ",")) %>%
unnest(ids) %>% # name the list column explicitly; a bare unnest() is deprecated in current tidyr
inner_join(df_1, by = "ids") %>%
group_by(sample) %>%
summarise(condition = min(condition))
You could then join that to the original data frame for filtering.
filtered <- inner_join(df, df2, by = "sample") %>%
filter(condition == "y")
I'd start by tidying df so that id contains one observation per row:
library(tidyr)
library(dplyr)
df %>%
separate_rows(id)
sample id name
1 1 ID200 first
2 1 ID300 first
3 1 ID299 first
4 2 ID2 second
5 2 ID123 second
6 3 ID90 third
The same operation, followed by a join with df_1:
df %>%
separate_rows(id) %>%
left_join(df_1, by = c("id" = "ids"))
sample id name condition
1 1 ID200 first y
2 1 ID300 first n
3 1 ID299 first n
4 2 ID2 second y
5 2 ID123 second y
6 3 ID90 third n
Now you can group on sample and filter for cases where the only condition is "y":
new_df <- df %>%
separate_rows(id) %>%
left_join(df_1, by = c("id" = "ids")) %>%
group_by(sample) %>%
filter(condition == "y",
n_distinct(condition) == 1) %>%
ungroup()
Result:
sample id name condition
<int> <chr> <chr> <chr>
1 2 ID2 second y
2 2 ID123 second y
If you really want to transform back to the original format with comma-separated ids in a column:
library(purrr)
new_df %>%
nest(data = id) %>% # current tidyr expects the new list column to be named
mutate(newid = map_chr(data, ~paste(.$id, collapse = ","))) %>%
select(sample, id = newid, name)
sample id name
<int> <chr> <chr>
1 2 ID2,ID123 second
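For the edited case, where the id column carries extra text, one possible approach is to extract only the tokens that look like IDs before joining. The sketch below assumes that a real ID is always "ID" followed by digits, which is an assumption about the data rather than something stated in the question; `keep` and `all_y` are names chosen for this illustration.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Data from the edit to the question: the id column carries extra text
df <- tibble(sample = c(1, 2, 3),
             id = c("ID = ID200,ID300,ID299,abcd",
                    "ID = ID2,ID123, dfg",
                    "ID = ID90, text"),
             name = c("first", "second", "third"))
df_1 <- tibble(ids = c("ID200", "ID300", "ID299", "ID2", "ID123", "ID90"),
               condition = c("y", "n", "n", "y", "y", "n"))

keep <- df %>%
  # assumption: a real ID is "ID" followed by digits; everything else is noise
  mutate(ids = str_extract_all(id, "ID\\d+")) %>%
  unnest(ids) %>%
  left_join(df_1, by = "ids") %>%
  group_by(sample) %>%
  # TRUE only when every ID of the sample has condition "y";
  # an ID missing from df_1 gives NA, so a sample containing one is never kept
  summarise(all_y = all(condition == "y"), .groups = "drop") %>%
  filter(all_y)

# semi_join keeps the original rows (and the original id text) untouched
new_df <- semi_join(df, keep, by = "sample")
new_df
```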
I have a dataset of unique matches like this. Each row is a match with result.
date <- c('2017/12/01','2017/11/01','2017/10/01','2017/09/01','2017/08/01','2017/07/01','2017/06/01')
team1 <- c('A','B','B','C','D','A','B')
team1_score <- c(1,0,4,3,5,6,7)
team2 <- c('B','A','A','B','C','C','A')
team2_score <- c(0,1,5,4,6,9,10)
matches <- data.frame(date, team1, team1_score, team2, team2_score)
I want to create 2 new columns: the forms of team 1 and team 2. The result of a match is determined by which team has the larger score, or it is a draw. The form is a team's results in its last 2 matches before the current one, that is, the form of team1 and team2 going into a match. When a team has fewer than 2 previous matches, a result of NULL is sufficient. For example, for the first 3 rows, the forms of team 1 and team 2 respectively are:
Form1: W-W, L-W, W-L
Form2: L-L, W-L, L-W
In the actual data set, there are a lot more than just 4 unique teams. I have been thinking but can't think of a good way to create these 2 variables.
Here is my solution:
library(tidyverse)
date <- as.Date(c('2017/12/01','2017/11/01','2017/10/01','2017/09/01','2017/08/01','2017/07/01','2017/06/01', '2017/05/30'))
team1 <- c('A','B','B','C','D','A','B','A')
team1_score <- c(1,0,4,3,5,6,7,0)
team2 <- c('B','A','A','B','C','C','A','D')
team2_score <- c(0,1,5,4,6,9,10,0)
matches <- data.frame(date, team1, team1_score, team2, team2_score)
## 1. Create a unique identifier for each match. It assumes that teams can only play each other once a day.
matches$UID <- paste(matches$date, matches$team1, matches$team2, sep = "-")
## 2. Create a score difference variable reflecting team1's score
matches <- matches %>% mutate(score_dif_team1 = team1_score - team2_score)
## 3. Create a Result (WDL) reflecting team1's results
matches <- matches %>% mutate(results_team1 = if_else(score_dif_team1 < 0, true = "L", false = if_else(score_dif_team1 > 0, true = "W", false = "D")))
## 4. Cosmetic step: Reorder variables for easier comparison across variables
matches <- matches %>% select(UID, date:results_team1)
## 5. Reshape the table into a long format based on the teams. Each observation will now reflect the results of 1 team within a match. Each game will have two observations.
matches <- matches %>% gather(key = old_team_var, value = team, team1, team2)
## 6. Establish a common results variable for each observation. It essentially inverts the results_team1 variable for team2 rows, and keeps results_team1 unchanged for team1 rows
matches <- matches %>%
mutate(results = if_else(old_team_var == "team2",
true = if_else(results_team1 == "W",
true = "L",
false = if_else(results_team1 == "L",
true = "W",
false = "D")),
false = results_team1))
## Final step: Filter the matches table by the dates you are interested in, then reshape it into one row per team with the D, L and W counts as columns.
Results_table <- matches %>% filter(date <= as.Date("2017-12-01")) %>% group_by(team, results) %>% summarise(cases = n()) %>% spread(key = results, value = cases, fill = 0)
## Results:
# A tibble: 4 x 4
# Groups: team [4]
team D L W
* <chr> <dbl> <dbl> <dbl>
1 A 1 1 4
2 B 0 4 1
3 C 0 1 2
4 D 1 1 0
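The answer above tabulates overall W/D/L counts. The form going into each match, as asked in the question, could be derived from the same one-row-per-team layout with lag(). A rough sketch using the question's original seven matches, under the assumption that a team plays at most once per date; `long`, `with_form`, `form1` and `form2` are names chosen here for illustration, and NA stands in for the NULL mentioned in the question.

```r
library(dplyr)

matches <- data.frame(
  date = as.Date(c("2017-12-01", "2017-11-01", "2017-10-01", "2017-09-01",
                   "2017-08-01", "2017-07-01", "2017-06-01")),
  team1 = c("A", "B", "B", "C", "D", "A", "B"),
  team1_score = c(1, 0, 4, 3, 5, 6, 7),
  team2 = c("B", "A", "A", "B", "C", "C", "A"),
  team2_score = c(0, 1, 5, 4, 6, 9, 10),
  stringsAsFactors = FALSE
)

# One row per team per match, with that team's own result (W/D/L)
long <- bind_rows(
  transmute(matches, date, team = team1, diff = team1_score - team2_score),
  transmute(matches, date, team = team2, diff = team2_score - team1_score)
) %>%
  mutate(result = case_when(diff > 0 ~ "W", diff < 0 ~ "L", TRUE ~ "D")) %>%
  arrange(team, date) %>%
  group_by(team) %>%
  # form going into a match: most recent previous result first, then the one
  # before it; NA when the team has fewer than 2 earlier matches
  mutate(form = if_else(is.na(lag(result, 2)), NA_character_,
                        paste(lag(result, 1), lag(result, 2), sep = "-"))) %>%
  ungroup() %>%
  select(date, team, form)

# Attach each side's form back onto the match table
with_form <- matches %>%
  left_join(long, by = c("date", "team1" = "team")) %>%
  rename(form1 = form) %>%
  left_join(long, by = c("date", "team2" = "team")) %>%
  rename(form2 = form)
with_form
```

Grouping by team means the number of distinct teams does not matter, so the same code should carry over to the larger data set.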