I have this dataframe:
df <- structure(list(col1 = c("Z2", "A2", "B2", "C2", "A2", "E2", "F2",
"G2"), col2 = c("Z2", "Z2", "A2", "B2", "C2", "D2", "A2", "F2"
), col3 = c("A2", "B2", "C2", "D2", "E2", "F2", "G2", "Z2")), class = "data.frame", row.names = c(NA, -8L))
> df
col1 col2 col3
1 Z2 Z2 A2
2 A2 Z2 B2
3 B2 A2 C2
4 C2 B2 D2
5 A2 C2 E2
6 E2 D2 F2
7 F2 A2 G2
8 G2 F2 Z2
I would like to explicitly use filter, across and str_detect in a tidyverse setting to keep all rows in which any of col1:col3 starts with an A.
Expected result:
col1 col2 col3
1 Z2 Z2 A2
2 A2 Z2 B2
3 B2 A2 C2
4 A2 C2 E2
5 F2 A2 G2
I have tried:
library(dplyr)
library(stringr)
df %>%
filter(across(c(col1, col2, col3), ~str_detect(., "^A")))
This gives:
[1] col1 col2 col3
<0 rows> (or 0-length row.names)
I want to learn why this code is not working using filter, across and str_detect!
We can use if_any, because across inside filter combines the per-column results with &, i.e. a row is kept only if every selected column meets the condition.
library(dplyr)
library(stringr)
df %>%
filter(if_any(everything(), ~str_detect(., "^A")))
-output
col1 col2 col3
1 Z2 Z2 A2
2 A2 Z2 B2
3 B2 A2 C2
4 A2 C2 E2
5 F2 A2 G2
According to ?across
if_any() and if_all() apply the same predicate function to a selection of columns and combine the results into a single logical vector: if_any() is TRUE when the predicate is TRUE for any of the selected columns, if_all() is TRUE when the predicate is TRUE for all selected columns.
across() supersedes the family of "scoped variants" like summarise_at(), summarise_if(), and summarise_all().
The if_any/if_all are not part of the scoped variants
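To see the difference on the question's data, here is a minimal sketch that contrasts if_all() (every selected column must match, which is how the across() attempt behaved) with if_any() (at least one column must match). The object names any_rows and all_rows are only illustrative, and if_any()/if_all() assume a reasonably recent dplyr.

```r
library(dplyr)
library(stringr)

df <- data.frame(
  col1 = c("Z2", "A2", "B2", "C2", "A2", "E2", "F2", "G2"),
  col2 = c("Z2", "Z2", "A2", "B2", "C2", "D2", "A2", "F2"),
  col3 = c("A2", "B2", "C2", "D2", "E2", "F2", "G2", "Z2")
)

# Keep rows where at least one column starts with "A"
any_rows <- df %>% filter(if_any(col1:col3, ~ str_detect(.x, "^A")))

# Keep rows where every column starts with "A" (no row qualifies here)
all_rows <- df %>% filter(if_all(col1:col3, ~ str_detect(.x, "^A")))

nrow(any_rows)  # 5
nrow(all_rows)  # 0
```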
I have a huge data frame with following syntax (the four variables are just for example, there are many more variables):
Date. Ticker. Revenue. Price.
a1 b1 c1 d1
a2 b1 c2 d2
a3 b1 c3 d3
a4 b1 c4 d4
a5 b1 c5 d5
a1 b2 c6 d6
a2 b2 c7 d7
a3 b2 c8 d8
a4 b2 c9 d9
a5 b2 c10 d10
...
The ticker b1 and b2 are in order in the example, but in the real df it might be mixed up.
What I want is to create a new data frame in which the prices are shifted t intervals back. For example, if I need to go 3 years back, the result will be:
Date. Ticker. Revenue. Price.
a1 b1 c1
a2 b1 c2
a3 b1 c3
a4 b1 c4 d1
a5 b1 c5 d2
a1 b2 c6
a2 b2 c7
a3 b2 c8
a4 b2 c9 d6
a5 b2 c10 d7
...
We can use lag in dplyr to go back t intervals.
library(dplyr)
t <- 3
df %>% group_by(Ticker) %>% mutate(Price= lag(Price, t))
# Date Ticker Revenue Price
# <chr> <chr> <chr> <chr>
# 1 a1 b1 c1 NA
# 2 a2 b1 c2 NA
# 3 a3 b1 c3 NA
# 4 a4 b1 c4 d1
# 5 a5 b1 c5 d2
# 6 a1 b2 c6 NA
# 7 a2 b2 c7 NA
# 8 a3 b2 c8 NA
# 9 a4 b2 c9 d6
#10 a5 b2 c10 d7
Or shift in data.table :
library(data.table)
setDT(df)[, Price := shift(Price, t), Ticker]
data
df <- structure(list(Date = c("a1", "a2", "a3", "a4", "a5", "a1", "a2",
"a3", "a4", "a5"), Ticker = c("b1", "b1", "b1", "b1", "b1", "b2",
"b2", "b2", "b2", "b2"), Revenue = c("c1", "c2", "c3", "c4",
"c5", "c6", "c7", "c8", "c9", "c10"), Price = c("d1", "d2", "d3",
"d4", "d5", "d6", "d7", "d8", "d9", "d10")),
class = "data.frame", row.names = c(NA, -10L))
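Since the question mentions that rows may be mixed up in the real data, one option is to let lag() do the ordering through its order_by argument. The sketch below uses a hypothetical shuffled copy of the example (prices, t_back and Price_lag are made-up names) and assumes that Date sorts correctly, which holds for the placeholder strings here but would call for a proper Date column in real data.

```r
library(dplyr)

# Hypothetical shuffled version of the example data
prices <- data.frame(
  Date   = c("a3", "a1", "a5", "a2", "a4", "a2", "a5", "a1", "a4", "a3"),
  Ticker = c("b1", "b2", "b1", "b1", "b2", "b2", "b2", "b1", "b1", "b2"),
  Price  = c("d3", "d6", "d5", "d2", "d9", "d7", "d10", "d1", "d4", "d8")
)

t_back <- 3

lagged <- prices %>%
  group_by(Ticker) %>%
  # order_by makes the lag follow Date order, not row order
  mutate(Price_lag = lag(Price, n = t_back, order_by = Date)) %>%
  ungroup() %>%
  arrange(Ticker, Date)

lagged
```

Sorting first with arrange(Ticker, Date) and then calling lag() without order_by should give the same result.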
We can use data.table methods
library(data.table)
setDT(df)[, Price. := shift(Price., 3, fill = ""), Ticker.]
or with dplyr
library(dplyr)
df %>%
group_by(Ticker.) %>%
mutate(Price = lag(Price., 3, default = ""))
-output
# A tibble: 10 x 5
# Groups: Ticker. [2]
# Date. Ticker. Revenue. Price. Price
# <chr> <chr> <chr> <chr> <chr>
# 1 a1 b1 c1 d1 ""
# 2 a2 b1 c2 d2 ""
# 3 a3 b1 c3 d3 ""
# 4 a4 b1 c4 d4 "d1"
# 5 a5 b1 c5 d5 "d2"
# 6 a1 b2 c6 d6 ""
# 7 a2 b2 c7 d7 ""
# 8 a3 b2 c8 d8 ""
# 9 a4 b2 c9 d9 "d6"
#10 a5 b2 c10 d10 "d7"
Or using base R with ave
df$Price <- with(df, ave(Price., Ticker., FUN =
function(x) c(rep('', 3), head(x, -3))))
data
df <- structure(list(Date. = c("a1", "a2", "a3", "a4", "a5", "a1",
"a2", "a3", "a4", "a5"), Ticker. = c("b1", "b1", "b1", "b1",
"b1", "b2", "b2", "b2", "b2", "b2"), Revenue. = c("c1", "c2",
"c3", "c4", "c5", "c6", "c7", "c8", "c9", "c10"), Price. = c("d1",
"d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10")), class = "data.frame",
row.names = c(NA,
-10L))
I have two data sets, data1 and data2:
data1 <- data.frame(ID = 1:6,
A = c("a1", "a2", NA, "a4", "a5", NA),
B = c("b1", "b2", "b3", NA, "b5", NA),
stringsAsFactors = FALSE)
data1
ID A B
1 a1 b1
2 a2 b2
3 NA b3
4 a4 NA
5 a5 b5
6 NA NA
and
data2 <- data.frame(ID = 1:6,
A = c(NA, "a2", "a3", NA, "a5", "a6"),
B = c(NA, "b2.wrong", NA, "b4", "b5", "b6"),
stringsAsFactors = FALSE)
data2
ID A B
1 NA NA
2 a2 b2.wrong
3 a3 NA
4 NA b4
5 a5 b5
6 a6 b6
I would like to merge them by ID so that the resulting merged dataset, data.merged, is populated with fields from both datasets, but takes the value from data1 whenever both datasets have a value.
I.e., I would like the final dataset, data.merged, to be:
ID A B
1 a1 b1
2 a2 b2
3 a3 b3
4 a4 b4
5 a5 b5
6 a6 b6
I have looked around, finding similar but not exact answers.
You can join the data and use coalesce to select the first non-NA value.
library(dplyr)
data1 %>%
inner_join(data2, by = 'ID') %>%
mutate(A = coalesce(A.x, A.y),
B = coalesce(B.x, B.y)) %>%
select(names(data1))
# ID A B
#1 1 a1 b1
#2 2 a2 b2
#3 3 a3 b3
#4 4 a4 b4
#5 5 a5 b5
#6 6 a6 b6
Or in base R comparing values with NA :
transform(merge(data1, data2, by = 'ID'),
A = ifelse(is.na(A.x), A.y, A.x),
B = ifelse(is.na(B.x), B.y, B.x))[names(data1)]
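When there are many columns, writing one coalesce() per column gets tedious. As a sketch of an alternative, dplyr's rows_patch() fills only the NA cells of its first argument with values from the second, matched by key. It assumes that every ID in data2 also exists in data1, as is the case here; the name patched is illustrative.

```r
library(dplyr)

data1 <- data.frame(ID = 1:6,
                    A = c("a1", "a2", NA, "a4", "a5", NA),
                    B = c("b1", "b2", "b3", NA, "b5", NA),
                    stringsAsFactors = FALSE)
data2 <- data.frame(ID = 1:6,
                    A = c(NA, "a2", "a3", NA, "a5", "a6"),
                    B = c(NA, "b2.wrong", NA, "b4", "b5", "b6"),
                    stringsAsFactors = FALSE)

# Only NA cells of data1 are overwritten; existing values such as "b2" are kept
patched <- rows_patch(data1, data2, by = "ID")
patched
```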
I would appreciate any help to create new variables from one variable.
Specifically, I need help to simultaneously create one row per ID and several columns of E, where each of the new columns (that is, E1, E2, E3) contains the values of E for each row of ID. I tried doing this with melt followed by spread but I am getting the error:
Error: Duplicate identifiers for rows (4, 7, 9), (1, 3, 6), (2, 5, 8)
Additionally, I tried the solutions discussed here and here but these did not work for my case because I need to be able to create row identifiers for rows (4, 1, 2), (7, 3, 5), and (9, 6, 8). That is, E for rows (4, 1, 2) should be named E1, E for rows (7, 3, 5) should be named E2, E for rows (9, 6, 8) should be named E3, and so on.
#data
dT<-structure(list(A = c("a1", "a2", "a1", "a1", "a2", "a1", "a1",
"a2", "a1"), B = c("b2", "b2", "b2", "b1", "b2", "b2", "b1",
"b2", "b1"), ID = c("3", "4", "3", "1", "4", "3", "1", "4", "1"
), E = c(0.621142094943352, 0.742109450696123, 0.39439152996948,
0.40694392882818, 0.779607277916503, 0.550579323666347, 0.352622183880119,
0.690660491345867, 0.23378944873769)), class = c("data.table",
"data.frame"), row.names = c(NA, -9L))
#my attempt
A B ID E
1: a1 b2 3 0.6211421
2: a2 b2 4 0.7421095
3: a1 b2 3 0.3943915
4: a1 b1 1 0.4069439
5: a2 b2 4 0.7796073
6: a1 b2 3 0.5505793
7: a1 b1 1 0.3526222
8: a2 b2 4 0.6906605
9: a1 b1 1 0.2337894
aTempDF <- melt(dT, id.vars = c("A", "B", "ID"))
A B ID variable value
1: a1 b2 3 E 0.6211421
2: a2 b2 4 E 0.7421095
3: a1 b2 3 E 0.3943915
4: a1 b1 1 E 0.4069439
5: a2 b2 4 E 0.7796073
6: a1 b2 3 E 0.5505793
7: a1 b1 1 E 0.3526222
8: a2 b2 4 E 0.6906605
9: a1 b1 1 E 0.2337894
aTempDF%>%spread(variable, value)
Error: Duplicate identifiers for rows (4, 7, 9), (1, 3, 6), (2, 5, 8)
#expected output
A B ID E1 E2 E3
1: a1 b2 3 0.6211421 0.3943915 0.5505793
2: a2 b2 4 0.7421095 0.7796073 0.6906605
3: a1 b1 1 0.4069439 0.3526222 0.2337894
Thanks in advance for any help.
You can use dcast from data.table
library(data.table)
dcast(dT, A + B + ID ~ paste0("E", rowid(ID)))
# A B ID E1 E2 E3
#1 a1 b1 1 0.4069439 0.3526222 0.2337894
#2 a1 b2 3 0.6211421 0.3943915 0.5505793
#3 a2 b2 4 0.7421095 0.7796073 0.6906605
You need to create the correct 'time variable' first which is what rowid(ID) does.
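To make the 'time variable' concrete, this small sketch shows what rowid() returns for the ID column of the example, next to a base R equivalent built with ave(). The vector name ids and the result names are only for illustration.

```r
library(data.table)

ids <- c("3", "4", "3", "1", "4", "3", "1", "4", "1")

# Running count of occurrences within each distinct ID
occurrence <- rowid(ids)
occurrence  # 1 1 2 1 2 3 2 3 3

# Base R equivalent: number the elements within each group
occurrence_base <- ave(seq_along(ids), ids, FUN = seq_along)

paste0("E", occurrence)
```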
For those looking for a tidyverse solution:
library(tidyverse)
dT <- structure(
list(
A = c("a1", "a2", "a1", "a1", "a2", "a1", "a1", "a2", "a1"),
B = c("b2", "b2", "b2", "b1", "b2", "b2", "b1", "b2", "b1"),
ID = c("3", "4", "3", "1", "4", "3", "1", "4", "1"),
E = c(0.621142094943352, 0.742109450696123, 0.39439152996948, 0.40694392882818,
0.779607277916503, 0.550579323666347, 0.352622183880119, 0.690660491345867,
0.23378944873769)),
class = c("data.table",
"data.frame"),
row.names = c(NA, -9L))
dT %>%
as_tibble() %>% # since dataset is a data.table object
group_by(A, B, ID) %>%
# Just so columns are "E1", "E2", etc.
mutate(rn = glue::glue("E{row_number()}")) %>%
ungroup() %>%
spread(rn, E) %>%
# not necessary, just making output in the same order as your expected output
arrange(desc(B))
# A tibble: 3 x 6
# A B ID E1 E2 E3
# <chr> <chr> <chr> <dbl> <dbl> <dbl>
#1 a1 b2 3 0.621 0.394 0.551
#2 a2 b2 4 0.742 0.780 0.691
#3 a1 b1 1 0.407 0.353 0.234
As mentioned in the accepted answer, you need a "key" variable to spread on first. This is created using row_number() and glue where glue just gives you the proper E1, E2, etc. variable names.
The group_by piece just makes sure that the row numbers are with respect to A, B and ID.
EDIT for tidyr >= 1.0.0
The (not-so) new pivot_ functions supersede gather and spread and eliminate the need to glue the new variable names together in a mutate.
dT %>%
as_tibble() %>% # since dataset is a data.table object
group_by(A, B, ID) %>%
# no longer need to glue (or paste) the names together but still need a row number
mutate(rn = row_number()) %>%
ungroup() %>%
pivot_wider(names_from = rn, values_from = E, names_glue = "E{.name}") %>% # names_glue argument allows for easy transforming of the new variable names
# not necessary, just making output in the same order as your expected output
arrange(desc(B))
# A tibble: 3 x 6
# A B ID E1 E2 E3
# <chr> <chr> <chr> <dbl> <dbl> <dbl>
#1 a1 b2 3 0.621 0.394 0.551
#2 a2 b2 4 0.742 0.780 0.691
#3 a1 b1 1 0.407 0.353 0.234
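If only a fixed prefix is needed, pivot_wider() also has a names_prefix argument, which may be a simpler alternative to names_glue. The sketch below uses a reduced, hypothetical data set (small, with rounded E values) rather than the full example.

```r
library(dplyr)
library(tidyr)

# Hypothetical reduced data: two E values per (A, ID) pair
small <- tibble(
  A  = c("a1", "a2", "a1", "a1", "a2", "a1"),
  ID = c("3", "4", "3", "1", "4", "1"),
  E  = c(0.62, 0.74, 0.39, 0.41, 0.78, 0.35)
)

wide <- small %>%
  group_by(A, ID) %>%
  mutate(rn = row_number()) %>%   # 1, 2, ... within each group
  ungroup() %>%
  pivot_wider(names_from = rn, values_from = E, names_prefix = "E")

names(wide)  # "A" "ID" "E1" "E2"
```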
I am trying to combine two data.tables in R based on a common ID but with different columns, and I also want to drop duplicate ID rows. My data look like this:
dt1 dt2
ID X1 Y1 Z1 ID X2 Y2 Z2
1 a1 a2 a3 1 A1 A2 A3
2 b1 b2 b3 2 B1 NA B3
3 c1 c2 NA 3 C1 C2 C3
4 d1 d2 d3 5 E1 E2 E3
6 f1 f2 f3 6 F1 F2 F3
Using rbind(dt1, dt2, fill = TRUE) gives me:
dt_merged
ID X1 Y1 Z1 X2 Y2 Z2
1 a1 a2 a3 NA NA NA
1 NA NA NA A1 A2 A3
2 b1 b2 b3 NA NA NA
2 NA NA NA B1 NA B3
3 c1 c2 NA NA NA NA
3 NA NA NA C1 C2 C3
4 d1 d2 d3 NA NA NA
5 NA NA NA E1 E2 E3
6 f1 f2 f3 NA NA NA
6 NA NA NA F1 F2 F3
My problem now is that I don't know how to merge the duplicate ID rows and fill in the NAs with the corresponding data from the duplicate ID rows. My desired output data.table would be:
ID X1 Y1 Z1 X2 Y2 Z2
1 a1 a2 a3 A1 A2 A3
2 b1 b2 b3 B1 NA B3
3 c1 c2 NA C1 C2 C3
4 d1 d2 d3 NA NA NA
5 NA NA NA E1 E2 E3
6 f1 f2 f3 F1 F2 F3
I hope my description is good enough to give you an overview of my problem. Any kind of help would be highly appreciated, and excuse my foolish question, but data.table wrangling sometimes gives me a very hard time.
Simply do a full join. It is very simple with full_join from the dplyr package, or with merge(..., all = TRUE), which works on plain data frames and also has a data.table method.
library(dplyr)
dt1 <- data.frame("ID" = c(1,2,3,4,6),
"X1" = c("a1", "b1", "c1", "d1", "f1"),
"Y1" = c("a2", "b2", "c2", "d2", "f2"),
"Z1" = c("a3", "b3", NA, "d3", "f3")
)
dt2 <- data.frame("ID" = c(1,2,3,5,6),
"X2" = c("A1", "B1", "C1", "E1", "F1"),
"Y2" = c("A2", NA, "C2", "E2", "F2"),
"Z2" = c("A3", "B3", "C3", "E3", "F3")
)
dt3 <- full_join(x = dt1, y = dt2, by = "ID") %>%
arrange(ID)
dt4 <- merge(dt1, dt2, by = "ID", all = TRUE)
dt3
dt4
Updated:
If you ever need to join more tables (as per OP's comment), just chain them:
dt5 <- data.frame("ID" = c(1,3,4,5,7),
"X3" = c("A1", "C1", "D1", "E1","G1"),
"Y3" = c(NA, "C2", "D2", "E2", "G2"),
"Z3" = c("A3","C3", "D3", "E3", NA)
)
dt6 <- full_join(x = dt1, y = dt2, by = "ID") %>%
full_join( x = ., y = dt5, by = "ID") %>%
arrange(ID)
dt6
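For an arbitrary number of tables, the chaining can be replaced by folding a list with Reduce(). This is a sketch in base R using three small hypothetical tables (tables and merged are made-up names); the same pattern should work with full_join in place of merge.

```r
# Hypothetical list of tables sharing an ID column
tables <- list(
  data.frame(ID = c(1, 2, 3), X1 = c("a1", "b1", "c1")),
  data.frame(ID = c(1, 2, 5), X2 = c("A1", "B1", "E1")),
  data.frame(ID = c(3, 5, 7), X3 = c("C1", "E1", "G1"))
)

# Full outer join of all tables, two at a time
merged <- Reduce(function(x, y) merge(x, y, by = "ID", all = TRUE), tables)

merged$ID  # 1 2 3 5 7
```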