Track changes for a given field given some ID and date

I have the following dataframe:
Each client's cap could be upgraded at some point in time defined by column Date. I would like to aggregate on ID and show on what date the cap has been upgraded. Sometimes this could happen twice. The output should look like this:
Thank you in advance !

library(tidyverse)
df <- tibble(
  ID = c(1,1,1,2,2,2,2,3,3),
  Cap = c("S", "S", "M", "S", "M", "L", "L", "S", "L"),
  Date = paste("01", c(1:2, 4, 3:6, 2:3), "2000") %>% lubridate::dmy()
)
df2 <- df %>%
  group_by(ID) %>% # looking at each ID separately
  mutate(prev = lag(Cap), # the value in the previous row
         change = !(Cap == prev)) %>% # is the previous row's value different from the current row's value
  filter(change) %>% # keeping only the rows where there is a change
  select(ID, "From" = prev, "To" = Cap, Date) # renaming columns and selecting the relevant ones

You can use lag() here to create a column holding the previous row's value of Cap. Then you simply filter out each ID's first entry and the rows where Cap did not change.
# dat is the data frame from the question
out <- dat %>%
  ## calculate lag within unique subjects
  group_by(ID) %>%
  mutate(
    ## copy previous row value to new column
    from = lag(Cap),
    to = Cap
  ) %>%
  ungroup() %>%
  ## ignore first entry for each subject
  drop_na(from) %>%
  ## ignore all rows where Cap didn't change
  filter(from != to) %>%
  ## reorder columns
  select(ID, from, to, Date)
This gives us output matching your expected format
> out
# A tibble: 4 x 4
ID from to Date
<dbl> <fct> <fct> <dbl>
1 1 S M 4
2 2 S M 4
3 2 M L 5
4 3 S L 3
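For readers without the tidyverse installed, the same lag-and-compare idea can be sketched in base R. This is only an illustration: it rebuilds the sample data from the first answer, and the column names From and To and the helper logic are choices made here, not part of either answer.

```r
# Sample data, same values as the tibble in the first answer
df <- data.frame(
  ID   = c(1, 1, 1, 2, 2, 2, 2, 3, 3),
  Cap  = c("S", "S", "M", "S", "M", "L", "L", "S", "L"),
  Date = as.Date(sprintf("2000-%02d-01", c(1:2, 4, 3:6, 2:3))),
  stringsAsFactors = FALSE
)

# Sort so that "previous row" means "previous date within the same ID"
df <- df[order(df$ID, df$Date), ]

# Previous Cap within each ID; NA for the first row of each ID
df$From <- ave(df$Cap, df$ID,
               FUN = function(x) c(NA_character_, head(x, -1)))

# Keep only rows that have a previous value and where that value differs
changed <- !is.na(df$From) & df$From != df$Cap
out <- data.frame(
  ID   = df$ID[changed],
  From = df$From[changed],
  To   = df$Cap[changed],
  Date = df$Date[changed],
  stringsAsFactors = FALSE
)
print(out)
```

Sorting by ID and Date first matters: lag-style comparisons assume the rows of each ID are already in chronological order.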

Related

R: Pivot numeric data from columns to rows based on string in variable name

I have a data set that I want to pivot to long format depending on whether the variable name contains any of the strings: list_a <- c("a", "b", "c") and list_b <- c("usd", "eur", "gbp"). The data set only contains values in one row. I want the values in list_b to become column names and the values in list_a to become row names in the resulting dataset. Please see the reproducible example data set below.
I currently solve this issue by applying the following R code (once for each value in list_b) resulting in three data frames called "df_usd", "df_eur" and "df_gbp" which I then merge based on the column "name". This is however a bit cumbersome and I would very much appreciate if you could help me with finding a more elegant solution since the variables in list_b change from month to month (list_a stays the same each month) and updating the existing code manually is both time consuming and opens up for manual error.
# Current solution for df_usd:
df_usd <- df %>%
select(date, contains("usd")) %>%
pivot_longer(cols = contains(c("a_", "b_", "c_")),
names_to = "name", values_to = "usd") %>% mutate(name = case_when(
str_detect(name, "a_") ~ "a",
str_detect(name, "b_") ~ "b",
str_detect(name, "c_") ~ "c")) %>%
select(-date)
A screenshot of the starting point in Excel
A screenshot of the result I want to achieve in Excel
# Example data to copy and paste into R for easy reproduction of problem:
df <- data.frame (date = c("2020-12-31"),
a_usd = c(1000),
b_usd = c(2000),
c_usd = c(3000),
a_eur = c(100),
b_eur =c(200),
c_eur = c(300),
a_gbp = c(10),
b_gbp = c(20),
c_gbp = c(30))

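One option is to specify names_sep together with names_to in pivot_longer. The ".value" entry tells pivot_longer to use that part of each column name as the name of an output column.
library(dplyr)
library(tidyr) # pivot_longer lives in tidyr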
df %>%
pivot_longer(cols = -date, names_to = c("grp", ".value"), names_sep = "_")
-output
# A tibble: 3 x 5
# date grp usd eur gbp
# <chr> <chr> <dbl> <dbl> <dbl>
#1 2020-12-31 a 1000 100 10
#2 2020-12-31 b 2000 200 20
#3 2020-12-31 c 3000 300 30
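To see why splitting the names on "_" copes with currencies that change from month to month, here is a rough base R sketch of the same idea. It is not what pivot_longer does internally; the object names vals, parts and wide are made up for this illustration.

```r
# Example data from the question
df <- data.frame(date = "2020-12-31",
                 a_usd = 1000, b_usd = 2000, c_usd = 3000,
                 a_eur = 100,  b_eur = 200,  c_eur = 300,
                 a_gbp = 10,   b_gbp = 20,   c_gbp = 30)

# One named value per column, dropping the date column
vals <- unlist(df[-1])

# Split each column name into its two parts: group ("a") and currency ("usd")
parts <- do.call(rbind, strsplit(names(vals), "_", fixed = TRUE))

# Arrange the values as group x currency; each cell holds exactly one value,
# so sum() just returns that value
wide <- tapply(vals, list(grp = parts[, 1], currency = parts[, 2]), sum)
print(wide)
```

Because the currency is read from the column names, a new currency column such as a_sek would show up as a new column in the result without any change to the code.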

A base R option using reshape
reshape(
setNames(df, gsub("(\\w+)_(\\w+)", "\\2.\\1", names(df))),
direction = "long",
varying = -1
)
gives
date time usd eur gbp id
1.a 2020-12-31 a 1000 100 10 1
1.b 2020-12-31 b 2000 200 20 1
1.c 2020-12-31 c 3000 300 30 1

dplyr summarize by preferred string value

I have a data frame with IDs and string values, of which some I prefer over others:
library(dplyr)
d1<-data.frame(id=c("a", "a", "b", "b"),
value=c("good", "better", "good", "good"))
I want to handle that in a way equivalent to the following example with numbers:
d2<-data.frame(id=c("a", "a", "b", "b"),
value=c(1, 2, 1, 1))
d2 %>% group_by(id) %>%
summarize(max(value))
So if an ID has multiple values, I will always get the highest number for each ID:
# A tibble: 2 x 2
id `max(value)`
<fct> <dbl>
1 a 2
2 b 1
Equivalently, if an ID has multiple strings, I always want to extract the preferred string for the d1 dataframe: if we have "good", use that row; if another row has "better", use that row instead, thus eliminating duplicated IDs.
The example is arbitrary, could also be >>if we have "yes" and "unknown" then take "yes", else take "unknown"<<
So is there an "extract best string" function for the dplyr::summarize() function?
The result should look like this:
id | value
----------
"a"| "better"
"b"| "good"

You can try a factor approach.
First you need a vector of your strings, ordered from most to least preferred, like:
my_levels <- c("better", "good")
Then you convert the strings to a factor with those levels, transform to numeric, summarize and transform back.
d1 %>%
  mutate(value_num = as.numeric(factor(value, levels = my_levels))) %>%
  group_by(id) %>%
  summarize(res = min(value_num)) %>%
  # index into my_levels so this also works when one of the levels never wins
  mutate(res_fac = factor(my_levels[res], levels = my_levels))
# A tibble: 2 x 3
id res res_fac
<chr> <dbl> <fct>
1 a 1 better
2 b 2 good
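The same ranking idea can be sketched without factors by using match() against the preference vector. This is a minimal base R illustration; the names preference, rank and best are chosen here and do not come from the answer.

```r
d1 <- data.frame(id = c("a", "a", "b", "b"),
                 value = c("good", "better", "good", "good"),
                 stringsAsFactors = FALSE)

# Most preferred string first
preference <- c("better", "good")

# Position of each value in the preference vector (1 = most preferred)
rank <- match(d1$value, preference)

# Smallest rank, i.e. the most preferred value, within each id
best <- tapply(rank, d1$id, min)

result <- data.frame(id = names(best),
                     value = preference[as.vector(best)],
                     stringsAsFactors = FALSE)
print(result)
```

Any value that is not listed in preference gets a rank of NA, so in real data it is worth checking for NA in rank before summarizing.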

Similar to @roman's answer, but using the data.table package you could do the following to keep the preferred row for each id:
require(data.table)
setDT(d1)
# convert value to factor
d1[ , value := factor(value, levels = c('better', 'good'))]
# return first ordered value by each id group
d1[ , .SD[order(value)][1], id]

Filter rows based on regex pattern and ID

I have a df like this:
df <- data.frame(
id = c("A", "A", "B", NA, "A", "B", "B", "B"),
speech = c("hi", "how are you [Larry]?", "[uh]", "(0.123)", "I'm fine [you 'n Mary] how's it [goin]?", "[erm]", "(0.4)", "well")
)
I want to filter out those rows (1) where speech is made up entirely of an expression wrapped in square brackets [...] from string start to string end AND (2) those rows by the same ID which follow the row where [...] makes up the whole speech. I know how to filter out the rows with [...]:
df %>%
group_by(grp = rleid(id)) %>%
filter(grepl("^\\[.*?\\]$", speech))
but I don't know how to also filter out the same-ID rows that follow the [...] row. The desired output is this:
df
id speech
1 B [uh]
2 B [erm]
3 B (0.4)
4 B well

Create the grouping index with rleid as in the OP's code, then keep only the groups whose first element of 'speech' starts with a [, and ungroup:
library(dplyr)
library(data.table)
library(stringr)
df %>%
group_by(grp = rleid(id)) %>%
filter(str_detect(first(speech), "^\\[")) %>%
ungroup %>%
select(-grp)
-output
# A tibble: 4 x 2
# id speech
# <chr> <chr>
#1 B [uh]
#2 B [erm]
#3 B (0.4)
#4 B well
EDIT: Based on @ChrisRuehlemann's comments
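As a cross-check, the run-based grouping can also be sketched in base R, with rle() standing in for data.table's rleid. One deliberate difference, stated as an assumption: this version tests the first speech of each run against the full pattern from the question (brackets from string start to string end) rather than only a leading [.

```r
df <- data.frame(
  id = c("A", "A", "B", NA, "A", "B", "B", "B"),
  speech = c("hi", "how are you [Larry]?", "[uh]", "(0.123)",
             "I'm fine [you 'n Mary] how's it [goin]?", "[erm]", "(0.4)", "well"),
  stringsAsFactors = FALSE
)

# Number consecutive runs of the same id; rle() starts a new run at each NA
runs <- rle(df$id)
run_id <- rep(seq_along(runs$lengths), runs$lengths)

# First speech of each run, repeated for every row of that run
first_speech <- ave(df$speech, run_id, FUN = function(x) x[1])

# Keep runs whose first speech consists entirely of one [...] expression
keep <- grepl("^\\[.*\\]$", first_speech)
out <- df[keep, ]
print(out)
```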

Filter rows using column values as condition for another dataset

This is pretty tricky. Let's say I have, for example, a first dataset df:
sample id name
1 ID200,ID300,ID299 first
2 ID2,ID123 second
3 ID90 third
And a second dataset df_1:
ids condition
ID200 y
ID300 n
ID299 n
ID2 y
ID123 y
ID90 n
I have to filter, from the first dataset, all the rows in which all ID values satisfy a condition on the second table, like y.
So the filtering in this example should give:
sample id name
2 ID2,ID123 second
I was thinking to use something like:
new_df = df %>%
filter(grepl('ID', id), df_1$condition == 'y')
But obviously I need something different, can you give me some clues?
Edit: As I said in the comment, what happens if I have df's id column populated with other text, like this?
sample id name
1 ID = ID200,ID300,ID299,abcd first
2 ID = ID2,ID123, dfg second
3 ID = ID90, text third

Perhaps a bit inelegant, but this would give you the final condition status of each sample.
library(tidyverse)
df <- tibble(sample = c(1, 2, 3),
id = c("ID200,ID300,ID299", "ID2,ID123", "ID90"),
name = c("first", "second", "third"))
df_1 <- tibble(ids = c("ID200", "ID300", "ID299", "ID2", "ID123", "ID90"),
condition = c("y", "n", "n", "y", "y", "n"))
df2 <- df %>%
mutate(ids = str_split(id, ",")) %>%
unnest(ids) %>%
inner_join(df_1, by = "ids") %>%
group_by(sample) %>%
summarise(condition = min(condition))
You could then join that to the original data frame for filtering.
filtered <- inner_join(df, df2, by = "sample") %>%
filter(condition == "y")

I'd start by tidying df so that id contains one observation per row:
library(tidyr)
library(dplyr)
df %>%
separate_rows(id)
sample id name
1 1 ID200 first
2 1 ID300 first
3 1 ID299 first
4 2 ID2 second
5 2 ID123 second
6 3 ID90 third
The same operation, followed by a join with df_1:
df %>%
separate_rows(id) %>%
left_join(df_1, by = c("id" = "ids"))
sample id name condition
1 1 ID200 first y
2 1 ID300 first n
3 1 ID299 first n
4 2 ID2 second y
5 2 ID123 second y
6 3 ID90 third n
Now you can group on sample and filter for cases where the only condition is "y":
new_df <- df %>%
separate_rows(id) %>%
left_join(df_1, by = c("id" = "ids")) %>%
group_by(sample) %>%
filter(condition == "y",
n_distinct(condition) == 1) %>%
ungroup()
Result:
sample id name condition
<int> <chr> <chr> <chr>
1 2 ID2 second y
2 2 ID123 second y
If you really want to transform back to the original format with comma-separated ids in a column:
library(purrr)
new_df %>%
nest(data = id) %>%
mutate(newid = map_chr(data, ~paste(.$id, collapse = ","))) %>%
select(sample, id = newid, name)
sample id name
<int> <chr> <chr>
1 2 ID2,ID123 second
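Neither answer covers the edit in the question, where the id column also contains other text. One possible approach, sketched here in base R under the assumption that every real ID looks like "ID" followed by digits, is to extract the IDs with a regular expression instead of splitting on commas. The names id_list and all_yes are made up for this sketch.

```r
df <- data.frame(
  sample = 1:3,
  id = c("ID = ID200,ID300,ID299,abcd", "ID = ID2,ID123, dfg", "ID = ID90, text"),
  name = c("first", "second", "third"),
  stringsAsFactors = FALSE
)
df_1 <- data.frame(
  ids = c("ID200", "ID300", "ID299", "ID2", "ID123", "ID90"),
  condition = c("y", "n", "n", "y", "y", "n"),
  stringsAsFactors = FALSE
)

# Pull out every token of the form ID<digits>; "ID = " and free text are ignored
id_list <- regmatches(df$id, gregexpr("ID[0-9]+", df$id))

# A row passes only if it has at least one ID, every ID is found in df_1,
# and every matching condition is "y"
all_yes <- vapply(id_list, function(ids) {
  cond <- df_1$condition[match(ids, df_1$ids)]
  length(cond) > 0 && !anyNA(cond) && all(cond == "y")
}, logical(1))

new_df <- df[all_yes, ]
print(new_df)
```

Rows whose IDs are missing from df_1 are dropped here; whether that is the right choice depends on the data, so treat it as an assumption.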

Matching rows within the same dataset in R

I have a dataset of unique matches like this. Each row is a match with result.
date <- c('2017/12/01','2017/11/01','2017/10/01','2017/09/01','2017/08/01','2017/07/01','2017/06/01')
team1 <- c('A','B','B','C','D','A','B')
team1_score <- c(1,0,4,3,5,6,7)
team2 <- c('B','A','A','B','C','C','A')
team2_score <- c(0,1,5,4,6,9,10)
matches <- data.frame(date, team1, team1_score, team2, team2_score)
I want to create 2 new columns: the form of team 1 and the form of team 2 going into each match. The result of a match is determined by which team has the larger score, or it is a draw. The form is a team's results in its last 2 matches before the current one. For example, for the first 3 rows, the forms of team 1 and team 2 respectively are shown below. Sometimes a team will not have 2 earlier matches, in which case a result of NULL is sufficient.
Form1: W-W, L-W, W-L
Form2: L-L, W-L, L-W
In the actual data set, there are a lot more than just 4 unique teams. I have been thinking but can't think of a good way to create these 2 variables.

Here is my solution:
library(tidyverse)
date <- as.Date(c('2017/12/01','2017/11/01','2017/10/01','2017/09/01','2017/08/01','2017/07/01','2017/06/01', '2017/05/30'))
team1 <- c('A','B','B','C','D','A','B','A')
team1_score <- c(1,0,4,3,5,6,7,0)
team2 <- c('B','A','A','B','C','C','A','D')
team2_score <- c(0,1,5,4,6,9,10,0)
matches <- data.frame(date, team1, team1_score, team2, team2_score)
## 1. Create a unique identifier for each match. It assumes that teams can only play each other once a day.
matches$UID <- paste(matches$date, matches$team1, matches$team2, sep = "-")
## 2. Create a Score Difference Variable reflecting team1's score
matches <- matches %>% mutate(score_dif_team1 = team1_score - team2_score)
## 3. Create a Result (WDL) reflecting team1's results
matches <- matches %>% mutate(results_team1 = if_else(score_dif_team1 < 0, true = "L", false = if_else(score_dif_team1 > 0, true = "W", false = "D")))
## 4. Cosmetic step: Reorder variables for easier comparison across variables
matches <- matches %>% select(UID, date:results_team1)
## 5. Reshape the table into a long format based on the teams. Each observation will now reflect the results of 1 team within a match. Each game will have two observations.
matches <- matches %>% gather(key = old_team_var, value = team, team1, team2)
## 6. Establish a common results variable for each observation. It essentially inverts the results_team1 variable for team2 rows, and keeps results_team1 identical for team1 rows
matches <- matches %>%
mutate(results = if_else(old_team_var == "team2",
true = if_else(results_team1 == "W",
true = "L",
false = if_else(results_team1 == "L",
true = "W",
false = "D")),
false = results_team1))
## Final step: Filter the matches table by the dates you are interested in, and then reshape the table to show a data frame of D/L/W counts per team.
Results_table <- matches %>% filter(date <= as.Date("2017-12-01")) %>% group_by(team, results) %>% summarise(cases = n()) %>% spread(key = results, value = cases, fill = 0)
## Results:
# A tibble: 4 x 4
# Groups: team [4]
team D L W
* <chr> <dbl> <dbl> <dbl>
1 A 1 1 4
2 B 0 4 1
3 C 0 1 2
4 D 1 1 0
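The table above counts results per team, while the question asks for each team's form going into a match. A possible base R sketch of that last step follows. It uses the seven matches from the question, assumes "form" means the two most recent earlier results with the most recent first, and returns NA when a team has fewer than two earlier matches. The helpers result_for and form_before are hypothetical names introduced here.

```r
matches <- data.frame(
  date = as.Date(c("2017/12/01", "2017/11/01", "2017/10/01", "2017/09/01",
                   "2017/08/01", "2017/07/01", "2017/06/01"), format = "%Y/%m/%d"),
  team1 = c("A", "B", "B", "C", "D", "A", "B"),
  team1_score = c(1, 0, 4, 3, 5, 6, 7),
  team2 = c("B", "A", "A", "B", "C", "C", "A"),
  team2_score = c(0, 1, 5, 4, 6, 9, 10),
  stringsAsFactors = FALSE
)

# W/L/D from the point of view of the team that scored `own`
result_for <- function(own, other) {
  ifelse(own > other, "W", ifelse(own < other, "L", "D"))
}

# One row per team per match
long <- rbind(
  data.frame(date = matches$date, team = matches$team1,
             result = result_for(matches$team1_score, matches$team2_score),
             stringsAsFactors = FALSE),
  data.frame(date = matches$date, team = matches$team2,
             result = result_for(matches$team2_score, matches$team1_score),
             stringsAsFactors = FALSE)
)

# Results of the n most recent matches a team played strictly before `date`
form_before <- function(team, date, n = 2) {
  prev <- long[long$team == team & long$date < date, ]
  if (nrow(prev) < n) return(NA_character_)
  prev <- prev[order(prev$date, decreasing = TRUE), ]
  paste(prev$result[seq_len(n)], collapse = "-")
}

rows <- seq_len(nrow(matches))
matches$form1 <- vapply(rows, function(i)
  form_before(matches$team1[i], matches$date[i]), character(1))
matches$form2 <- vapply(rows, function(i)
  form_before(matches$team2[i], matches$date[i]), character(1))
print(matches)
```

This loops over the rows and filters the long table each time, which is fine for small data; for a large data set a grouped lag on the long table would likely be faster.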
