How to rank order dates in R [duplicate]


I have a data frame covering multiple products with different date ranges. I want to assign a unique value to each date so that, even if the starting dates differ across products, I can group by date.
df
acc product date
a1 p1 d1
a1 p1 d2
a1 p1 d3
a1 p1 d4
a1 p2 d1
a1 p2 d2
a1 p2 d3
a1 p3 d3
a1 p3 d4
I want to arrange the dates so that there is a unique identifier each for d1, d2, d3 etc.
I used the following code to try this:
df <- df %>% group_by(acc, product) %>% mutate(t = row_number())
Output
df
acc product date t EXPECTED
a1 p1 d1 1 1
a1 p1 d2 2 2
a1 p1 d3 3 3
a1 p1 d4 4 4
a1 p2 d1 1 1
a1 p2 d2 2 2
a1 p2 d3 3 3
a1 p3 d3 1 3
a1 p3 d4 2 4
Any suggestions for this?

Use dplyr::dense_rank():
df %>% mutate(new = dense_rank(date))
acc product date new
1 a1 p1 d1 1
2 a1 p1 d2 2
3 a1 p1 d3 3
4 a1 p1 d4 4
5 a1 p2 d1 1
6 a1 p2 d2 2
7 a1 p2 d3 3
8 a1 p3 d3 3
9 a1 p3 d4 4
If, however, you want to restart the ranks for each acc, add group_by(acc) before the mutate() call.
Data used (dput output):
df <- structure(list(acc = c("a1", "a1", "a1", "a1", "a1", "a1", "a1",
"a1", "a1"), product = c("p1", "p1", "p1", "p1", "p2", "p2",
"p2", "p3", "p3"), date = c("d1", "d2", "d3", "d4", "d1", "d2",
"d3", "d3", "d4")), class = "data.frame", row.names = c(NA, -9L
))
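The group_by variant mentioned above could look like the following. This is only a sketch: the data frame df2 and its second account a2 are made up for illustration and are not part of the question. Note also that dense_rank() on character columns ranks alphabetically, so real dates are better stored as Date values first.

```r
library(dplyr)

# Hypothetical data: two accounts whose date ranges start at different points
df2 <- data.frame(
  acc     = c("a1", "a1", "a1", "a2", "a2"),
  product = c("p1", "p1", "p2", "p1", "p1"),
  date    = c("d1", "d2", "d2", "d2", "d3"),
  stringsAsFactors = FALSE
)

ranked <- df2 %>%
  group_by(acc) %>%                    # ranks restart within each acc
  mutate(new = dense_rank(date)) %>%   # equal dates share the same rank
  ungroup()

ranked$new
```

Within a2 the earliest date (d2) gets rank 1, even though the same date has rank 2 within a1.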

Related

How to use filter across and str_detect together to filter conditional on multiple columns

I have this dataframe:
df <- structure(list(col1 = c("Z2", "A2", "B2", "C2", "A2", "E2", "F2",
"G2"), col2 = c("Z2", "Z2", "A2", "B2", "C2", "D2", "A2", "F2"
), col3 = c("A2", "B2", "C2", "D2", "E2", "F2", "G2", "Z2")), class = "data.frame", row.names = c(NA, -8L))
> df
col1 col2 col3
1 Z2 Z2 A2
2 A2 Z2 B2
3 B2 A2 C2
4 C2 B2 D2
5 A2 C2 E2
6 E2 D2 F2
7 F2 A2 G2
8 G2 F2 Z2
I would like to explicitly use filter, across and str_detect in a tidyverse setting to keep all rows in which any of col1:col3 starts with an A.
Expected result:
col1 col2 col3
1 Z2 Z2 A2
2 A2 Z2 B2
3 B2 A2 C2
4 A2 C2 E2
5 F2 A2 G2
I have tried:
library(dplyr)
library(stringr)
df %>%
filter(across(c(col1, col2, col3), ~str_detect(., "^A")))
This gives:
[1] col1 col2 col3
<0 rows> (or 0-length row.names)
I want to learn why this code is not working using filter, across and str_detect!

We can use if_any here. Inside filter(), across() combines the per-column results with &, so a row is kept only if every selected column meets the condition.
library(dplyr)
library(stringr)
df %>%
filter(if_any(everything(), ~str_detect(., "^A")))
-output
col1 col2 col3
1 Z2 Z2 A2
2 A2 Z2 B2
3 B2 A2 C2
4 A2 C2 E2
5 F2 A2 G2
According to ?across
if_any() and if_all() apply the same predicate function to a selection of columns and combine the results into a single logical vector: if_any() is TRUE when the predicate is TRUE for any of the selected columns, if_all() is TRUE when the predicate is TRUE for all selected columns.
across() supersedes the family of "scoped variants" like summarise_at(), summarise_if(), and summarise_all().
if_any() and if_all() are not part of the scoped variants.
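To make the difference concrete, here is a small sketch on a made-up data frame (toy is not from the question) that runs the same predicate through if_any() and if_all():

```r
library(dplyr)
library(stringr)

# Hypothetical toy data: row 1 has "A" in both columns, row 2 in one, row 3 in none
toy <- data.frame(
  col1 = c("A1", "A2", "B3"),
  col2 = c("A4", "B5", "B6"),
  stringsAsFactors = FALSE
)

# Keep rows where at least one column starts with "A"
any_a <- toy %>% filter(if_any(everything(), ~ str_detect(.x, "^A")))

# Keep rows where every column starts with "A"
all_a <- toy %>% filter(if_all(everything(), ~ str_detect(.x, "^A")))

nrow(any_a)  # rows 1 and 2 are kept
nrow(all_a)  # only row 1 is kept
```

The if_all() result is what the across() attempt in the question was effectively asking for, which is why it returned no rows there.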

Create a data frame where corresponding values are t interval back

I have a huge data frame with the following structure (the four variables are just an example; there are many more):
Date. Ticker. Revenue. Price.
a1 b1 c1 d1
a2 b1 c2 d2
a3 b1 c3 d3
a4 b1 c4 d4
a5 b1 c5 d5
a1 b2 c6 d6
a2 b2 c7 d7
a3 b2 c8 d8
a4 b2 c9 d9
a5 b2 c10 d10
...
The tickers b1 and b2 are in order in the example, but in the real df they might be mixed up.
What I want is to create a new data frame in which the prices are shifted t intervals back. For example, if I need to go 3 years back, the result will be:
Date. Ticker. Revenue. Price.
a1 b1 c1
a2 b1 c2
a3 b1 c3
a4 b1 c4 d1
a5 b1 c5 d2
a1 b2 c6
a2 b2 c7
a3 b2 c8
a4 b2 c9 d6
a5 b2 c10 d7
...

We can use lag in dplyr to go back t intervals.
library(dplyr)
t <- 3
df %>% group_by(Ticker) %>% mutate(Price= lag(Price, t))
# Date Ticker Revenue Price
# <chr> <chr> <chr> <chr>
# 1 a1 b1 c1 NA
# 2 a2 b1 c2 NA
# 3 a3 b1 c3 NA
# 4 a4 b1 c4 d1
# 5 a5 b1 c5 d2
# 6 a1 b2 c6 NA
# 7 a2 b2 c7 NA
# 8 a3 b2 c8 NA
# 9 a4 b2 c9 d6
#10 a5 b2 c10 d7
Or shift in data.table :
library(data.table)
setDT(df)[, Price := shift(Price, t), Ticker]
data
df <- structure(list(Date = c("a1", "a2", "a3", "a4", "a5", "a1", "a2",
"a3", "a4", "a5"), Ticker = c("b1", "b1", "b1", "b1", "b1", "b2",
"b2", "b2", "b2", "b2"), Revenue = c("c1", "c2", "c3", "c4",
"c5", "c6", "c7", "c8", "c9", "c10"), Price = c("d1", "d2", "d3",
"d4", "d5", "d6", "d7", "d8", "d9", "d10")),
class = "data.frame", row.names = c(NA, -10L))
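Since the question mentions that the tickers may be mixed up in the real data, it may be safer to sort within each ticker before lagging. A minimal sketch, assuming the Date column sorts correctly as stored; the data frame shuffled and the name n_back are invented for illustration:

```r
library(dplyr)

# Hypothetical shuffled input: rows are not ordered by Ticker or Date
shuffled <- data.frame(
  Date   = c("a2", "a1", "a1", "a3", "a2", "a3"),
  Ticker = c("b1", "b2", "b1", "b1", "b2", "b2"),
  Price  = c("d2", "d4", "d1", "d3", "d5", "d6"),
  stringsAsFactors = FALSE
)
n_back <- 1  # number of intervals to go back

lagged <- shuffled %>%
  arrange(Ticker, Date) %>%                 # put each ticker in date order first
  group_by(Ticker) %>%
  mutate(Price_lag = lag(Price, n_back)) %>%
  ungroup()

lagged
```

Without the arrange() step, lag() would simply take whatever row happens to sit n_back positions earlier within the group.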

We can use data.table methods
library(data.table)
setDT(df)[, Price. := shift(Price., 3, fill = ""), Ticker.]
or with dplyr
library(dplyr)
df %>%
group_by(Ticker.) %>%
mutate(Price = lag(Price., 3, default = ""))
-output
# A tibble: 10 x 5
# Groups: Ticker. [2]
# Date. Ticker. Revenue. Price. Price
# <chr> <chr> <chr> <chr> <chr>
# 1 a1 b1 c1 d1 ""
# 2 a2 b1 c2 d2 ""
# 3 a3 b1 c3 d3 ""
# 4 a4 b1 c4 d4 "d1"
# 5 a5 b1 c5 d5 "d2"
# 6 a1 b2 c6 d6 ""
# 7 a2 b2 c7 d7 ""
# 8 a3 b2 c8 d8 ""
# 9 a4 b2 c9 d9 "d6"
#10 a5 b2 c10 d10 "d7"
Or using base R with ave
df$Price <- with(df, ave(Price., Ticker., FUN =
function(x) c(rep('', 3), head(x, -3))))
data
df <- structure(list(Date. = c("a1", "a2", "a3", "a4", "a5", "a1",
"a2", "a3", "a4", "a5"), Ticker. = c("b1", "b1", "b1", "b1",
"b1", "b2", "b2", "b2", "b2", "b2"), Revenue. = c("c1", "c2",
"c3", "c4", "c5", "c6", "c7", "c8", "c9", "c10"), Price. = c("d1",
"d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10")), class = "data.frame",
row.names = c(NA,
-10L))

R: Merge Data While Retaining Values for One Dataset in Duplicates

I have two data sets, data1 and data2:
data1 <- data.frame(ID = 1:6,
A = c("a1", "a2", NA, "a4", "a5", NA),
B = c("b1", "b2", "b3", NA, "b5", NA),
stringsAsFactors = FALSE)
data1
ID A B
1 a1 b1
2 a2 b2
3 NA b3
4 a4 NA
5 a5 b5
6 NA NA
and
data2 <- data.frame(ID = 1:6,
A = c(NA, "a2", "a3", NA, "a5", "a6"),
B = c(NA, "b2.wrong", NA, "b4", "b5", "b6"),
stringsAsFactors = FALSE)
data2
ID A B
1 NA NA
2 a2 b2.wrong
3 a3 NA
4 NA b4
5 a5 b5
6 a6 b6
I would like to merge them by ID so that the resulting merged dataset, data.merged, populates fields from both datasets but chooses the value from data1 whenever both datasets have one.
I.e., I would like the final dataset, data.merged, to be:
ID A B
1 a1 b1
2 a2 b2
3 a3 b3
4 a4 b4
5 a5 b5
6 a6 b6
I have looked around, finding similar but not exact answers.

You can join the data and use coalesce to select the first non-NA value.
library(dplyr)
data1 %>%
inner_join(data2, by = 'ID') %>%
mutate(A = coalesce(A.x, A.y),
B = coalesce(B.x, B.y)) %>%
select(names(data1))
# ID A B
#1 1 a1 b1
#2 2 a2 b2
#3 3 a3 b3
#4 4 a4 b4
#5 5 a5 b5
#6 6 a6 b6
Or in base R comparing values with NA :
transform(merge(data1, data2, by = 'ID'),
A = ifelse(is.na(A.x), A.y, A.x),
B = ifelse(is.na(B.x), B.y, B.x))[names(data1)]
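If there are many columns, writing one coalesce() per column gets tedious. One possible alternative, assuming dplyr 1.0.0 or later and that every ID in data2 also appears in data1, is rows_patch(), which fills only the NA values of the first table from the second:

```r
library(dplyr)

data1 <- data.frame(ID = 1:6,
                    A = c("a1", "a2", NA, "a4", "a5", NA),
                    B = c("b1", "b2", "b3", NA, "b5", NA),
                    stringsAsFactors = FALSE)
data2 <- data.frame(ID = 1:6,
                    A = c(NA, "a2", "a3", NA, "a5", "a6"),
                    B = c(NA, "b2.wrong", NA, "b4", "b5", "b6"),
                    stringsAsFactors = FALSE)

# Values already present in data1 are kept; only its NAs are patched from data2
data.merged <- rows_patch(data1, data2, by = "ID")
data.merged
```

Because existing values are never overwritten, "b2" from data1 wins over "b2.wrong" from data2.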

how to create new variables from one variable using two rules

I would appreciate any help to create new variables from one variable.
Specifically, I need help to simultaneously create one row per ID and several columns of E, where each new column (that is, E1, E2, E3) contains the values of E for that ID. I tried doing this with melt followed by spread, but I am getting the error:
Error: Duplicate identifiers for rows (4, 7, 9), (1, 3, 6), (2, 5, 8)
Additionally, I tried the solutions discussed here and here but these did not work for my case because I need to be able to create row identifiers for rows (4, 1, 2), (7, 3, 5), and (9, 6, 8). That is, E for rows (4, 1, 2) should be named E1, E for rows (7, 3, 5) should be named E2, E for rows (9, 6, 8) should be named E3, and so on.
#data
dT<-structure(list(A = c("a1", "a2", "a1", "a1", "a2", "a1", "a1",
"a2", "a1"), B = c("b2", "b2", "b2", "b1", "b2", "b2", "b1",
"b2", "b1"), ID = c("3", "4", "3", "1", "4", "3", "1", "4", "1"
), E = c(0.621142094943352, 0.742109450696123, 0.39439152996948,
0.40694392882818, 0.779607277916503, 0.550579323666347, 0.352622183880119,
0.690660491345867, 0.23378944873769)), class = c("data.table",
"data.frame"), row.names = c(NA, -9L))
#my attempt
A B ID E
1: a1 b2 3 0.6211421
2: a2 b2 4 0.7421095
3: a1 b2 3 0.3943915
4: a1 b1 1 0.4069439
5: a2 b2 4 0.7796073
6: a1 b2 3 0.5505793
7: a1 b1 1 0.3526222
8: a2 b2 4 0.6906605
9: a1 b1 1 0.2337894
aTempDF <- melt(dT, id.vars = c("A", "B", "ID"))
A B ID variable value
1: a1 b2 3 E 0.6211421
2: a2 b2 4 E 0.7421095
3: a1 b2 3 E 0.3943915
4: a1 b1 1 E 0.4069439
5: a2 b2 4 E 0.7796073
6: a1 b2 3 E 0.5505793
7: a1 b1 1 E 0.3526222
8: a2 b2 4 E 0.6906605
9: a1 b1 1 E 0.2337894
aTempDF%>%spread(variable, value)
Error: Duplicate identifiers for rows (4, 7, 9), (1, 3, 6), (2, 5, 8)
#expected output
A B ID E1 E2 E3
1: a1 b2 3 0.6211421 0.3943915 0.5505793
2: a2 b2 4 0.7421095 0.7796073 0.6906605
3: a1 b1 1 0.4069439 0.3526222 0.2337894
Thanks in advance for any help.

You can use dcast from data.table
library(data.table)
dcast(dT, A + B + ID ~ paste0("E", rowid(ID)))
# A B ID E1 E2 E3
#1 a1 b1 1 0.4069439 0.3526222 0.2337894
#2 a1 b2 3 0.6211421 0.3943915 0.5505793
#3 a2 b2 4 0.7421095 0.7796073 0.6906605
You need to create the correct 'time variable' first which is what rowid(ID) does.
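To see what rowid(ID) contributes, it can be run on the ID column alone. A small sketch using the ID values from the question; the variable name occurrence is made up for illustration:

```r
library(data.table)

ID <- c("3", "4", "3", "1", "4", "3", "1", "4", "1")

# rowid() numbers each occurrence of a value: 1 for its first appearance, 2 for the second, ...
occurrence <- rowid(ID)
occurrence

# These labels become the new column names in dcast
paste0("E", occurrence)
```

Each ID gets the counters 1, 2, 3 in order of appearance, which is exactly the unique identifier that spread was missing.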

For those looking for a tidyverse solution:
library(tidyverse)
dT <- structure(
list(
A = c("a1", "a2", "a1", "a1", "a2", "a1", "a1", "a2", "a1"),
B = c("b2", "b2", "b2", "b1", "b2", "b2", "b1", "b2", "b1"),
ID = c("3", "4", "3", "1", "4", "3", "1", "4", "1"),
E = c(0.621142094943352, 0.742109450696123, 0.39439152996948, 0.40694392882818,
0.779607277916503, 0.550579323666347, 0.352622183880119, 0.690660491345867,
0.23378944873769)),
class = c("data.table",
"data.frame"),
row.names = c(NA, -9L))
dT %>%
as_tibble() %>% # since dataset is a data.table object
group_by(A, B, ID) %>%
# Just so columns are "E1", "E2", etc.
mutate(rn = glue::glue("E{row_number()}")) %>%
ungroup() %>%
spread(rn, E) %>%
# not necessary, just making output in the same order as your expected output
arrange(desc(B))
# A tibble: 3 x 6
# A B ID E1 E2 E3
# <chr> <chr> <chr> <dbl> <dbl> <dbl>
#1 a1 b2 3 0.621 0.394 0.551
#2 a2 b2 4 0.742 0.780 0.691
#3 a1 b1 1 0.407 0.353 0.234
As mentioned in the accepted answer, you need a "key" variable to spread on first. This is created using row_number() and glue where glue just gives you the proper E1, E2, etc. variable names.
The group_by piece just makes sure that the row numbers are with respect to A, B and ID.
EDIT for tidyr >= 1.0.0
The (not-so) new pivot_ functions supersede gather and spread and eliminate the need to glue the new variable names together in a mutate.
dT %>%
as_tibble() %>% # since dataset is a data.table object
group_by(A, B, ID) %>%
# no longer need to glue (or paste) the names together but still need a row number
mutate(rn = row_number()) %>%
ungroup() %>%
pivot_wider(names_from = rn, values_from = E, names_glue = "E{.name}") %>% # names_glue argument allows for easy transforming of the new variable names
# not necessary, just making output in the same order as your expected output
arrange(desc(B))
# A tibble: 3 x 6
# A B ID E1 E2 E3
# <chr> <chr> <chr> <dbl> <dbl> <dbl>
#1 a1 b2 3 0.621 0.394 0.551
#2 a2 b2 4 0.742 0.780 0.691
#3 a1 b1 1 0.407 0.353 0.234

Merging two data.tables with common ID but different Columns [duplicate]

I am trying to combine two data.tables in R based on a common ID but with different columns, and I also want to collapse duplicate ID rows. My data look like this:
dt1 dt2
ID X1 Y1 Z1 ID X2 Y2 Z2
1 a1 a2 a3 1 A1 A2 A3
2 b1 b2 b3 2 B1 NA B3
3 c1 c2 NA 3 C1 C2 C3
4 d1 d2 d3 5 E1 E2 E3
6 f1 f2 f3 6 F1 F2 F3
Using rbind(dt1, dt2, fill = TRUE) gives me:
dt_merged
ID X1 Y1 Z1 X2 Y2 Z2
1 a1 a2 a3 NA NA NA
1 NA NA NA A1 A2 A3
2 b1 b2 b3 NA NA NA
2 NA NA NA B1 NA B3
3 c1 c2 NA NA NA NA
3 NA NA NA C1 C2 C3
4 d1 d2 d3 NA NA NA
5 NA NA NA E1 E2 E3
6 f1 f2 f3 NA NA NA
6 NA NA NA F1 F2 F3
My problem now is that I don't know how to merge the duplicate ID rows and fill in the NAs with the corresponding data from the duplicate ID rows. My desired output data.table would be:
ID X1 Y1 Z1 X2 Y2 Z2
1 a1 a2 a3 A1 A2 A3
2 b1 b2 b3 B1 NA B3
3 c1 c2 NA C1 C2 C3
4 d1 d2 d3 NA NA NA
5 NA NA NA E1 E2 E3
6 f1 f2 f3 F1 F2 F3
I hope my description is good enough to give you an overview of my problem. Any kind of help would be highly appreciated, and please excuse the basic question, but data.table wrangling sometimes gives me a very hard time.

Simply do a full join. It is very simple with full_join from the dplyr package, or with merge and all = TRUE.
library(dplyr)
dt1 <- data.frame("ID" = c(1,2,3,4,6),
"X1" = c("a1", "b1", "c1", "d1", "f1"),
"Y1" = c("a2", "b2", "c2", "d2", "f2"),
"Z1" = c("a3", "b3", NA, "d3", "f3")
)
dt2 <- data.frame("ID" = c(1,2,3,5,6),
"X2" = c("A1", "B1", "C1", "E1", "F1"),
"Y2" = c("A2", NA, "C2", "E2", "F2"),
"Z2" = c("A3", "B3", "C3", "E3", "F3")
)
dt3 <- full_join(x = dt1, y = dt2, by = "ID") %>%
arrange(ID)
dt4 <- merge(dt1, dt2, by = "ID", all = TRUE)
dt3
dt4
Updated:
If you ever need to join more tables (as per OP's comment), just chain them:
dt5 <- data.frame("ID" = c(1,3,4,5,7),
"X3" = c("A1", "C1", "D1", "E1","G1"),
"Y3" = c(NA, "C2", "D2", "E2", "G2"),
"Z3" = c("A3","C3", "D3", "E3", NA)
)
dt6 <- full_join(x = dt1, y = dt2, by = "ID") %>%
full_join( x = ., y = dt5, by = "ID") %>%
arrange(ID)
dt6
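When the tables are already in a list, the chaining can also be written with Reduce. This is a sketch in base R using cut-down versions of the tables above (one value column each, to keep it short); the names tables and joined are made up for illustration:

```r
# Cut-down versions of dt1, dt2 and dt5 with a single value column each
dt1 <- data.frame(ID = c(1, 2, 3, 4, 6), X1 = c("a1", "b1", "c1", "d1", "f1"))
dt2 <- data.frame(ID = c(1, 2, 3, 5, 6), X2 = c("A1", "B1", "C1", "E1", "F1"))
dt5 <- data.frame(ID = c(1, 3, 4, 5, 7), X3 = c("A1", "C1", "D1", "E1", "G1"))

tables <- list(dt1, dt2, dt5)

# Fold the list pairwise with a full (outer) join on ID
joined <- Reduce(function(x, y) merge(x, y, by = "ID", all = TRUE), tables)
joined
```

The result has one row per ID that appears in any of the tables, with NA where a table has no entry for that ID.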
