I have a data frame with a column containing code numbers and another with dates. I am trying to use dplyr and intersect to find the common elements among days.
Sample data:
df <- data.frame(A=c(2289,490,3940,1745,855,3954,2289,555,3940,667,855,3954,2289,490,12,1745,3000,3954,2289,490,3940,28,855,3954),B=as.Date(c("2019-08-01","2019-08-01","2019-08-01","2019-08-01","2019-08-01","2019-08-01","2019-08-02","2019-08-02","2019-08-02","2019-08-02","2019-08-02","2019-08-02","2019-08-03","2019-08-03","2019-08-03","2019-08-03","2019-08-03","2019-08-03","2019-08-04","2019-08-04","2019-08-04","2019-08-04","2019-08-04","2019-08-04")))
I am trying something like this:
df %>% group_by(B) %>% intersect(A)
The expected output is the set of codes that appear on every single day. For instance, 2289 is an expected value but 28 is not.
I wonder whether I can use intersect in this case.
Appreciate any help
Regards
Here's one way -
df %>%
  # filter(!duplicated(.)) %>% # add this if there can be duplicates
  count(A) %>%
  filter(n == n_distinct(df$B))
# A tibble: 2 x 2
      A     n
  <dbl> <int>
1  2289     4
2  3954     4
A base R solution if you prefer intersect, although I'd guess the method above would be faster if the number of groups is high -
Reduce(intersect, split(df$A, df$B))
[1] 2289 3954
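To see why this works: split(df$A, df$B) yields one vector of codes per day, and Reduce() folds intersect() over that list, so only codes present on every day survive. The first two list elements, for example:
split(df$A, df$B)[1:2]
# $`2019-08-01`
# [1] 2289  490 3940 1745  855 3954
#
# $`2019-08-02`
# [1] 2289  555 3940  667  855 3954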
As a side note, you can also do this in base R: table(df) cross-tabulates A by B, and (absent duplicates) a code present on every day has a row mean of 1:
sort(unique(df$A))[rowMeans(table(df)) == 1]
# [1] 2289 3954
You can also try:
df %>%
  group_by(A) %>%
  summarize(if_all = length(intersect(B, unique(df$B))) == length(unique(df$B)))
which uses intersect.
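If you want the codes themselves rather than one logical per code, a small extension of the same idea (a sketch) is to filter on that flag and pull A:
df %>%
  group_by(A) %>%
  summarize(if_all = length(intersect(B, unique(df$B))) == length(unique(df$B))) %>%
  filter(if_all) %>%
  pull(A)
# [1] 2289 3954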
Hi, I have not seen a similar solution to this problem. I am trying to write a regex pattern to extract the characters following the word major within { } and place them in a major column. However, major repeats in row 2, and I need to extract and combine all characters within both { } following major. Ideally I would do this for the minor and incidental attributes as well. Not sure what I am getting wrong here. Thanks!
test <- data.frame(lith = c("major{basalt} minor{andesite} incidental{dacite rhyolite}",
                            "major {andesite flows} major {dacite flows}",
                            "major{andesite} minor{dacite}",
                            "major{basaltic andesitebasalt}"))
test %>%
  mutate(major = str_extract_all(test$lith, "[major].*[{](\\D[a-z]*)[}]") %>%
           map_chr(toString))
What I am looking for:
  major                        minor    incidental
1 basalt                       andesite dacite rhyolite
2 andesite flows, dacite flows <NA>     <NA>
3 basaltic andesitebasalt      <NA>     <NA>
First, (almost) never use test$ within a dplyr pipe that starts with test %>%. At best it's just a little inefficient; if any intermediate step re-orders, alters, or filters the data, the result will be either (a) an error, which is the preferable outcome, or (b) silently wrong. The reason: let's say you do
test %>%
  filter(grepl("[wy]", lith)) %>%
  mutate(major = str_extract_all(test$lith, ...))
In this case, the filter reduced the data from 4 rows to just 2 rows. However, since you're using test$lith, that's taken from the contents of test before the pipe started, so here test$lith is length-4 where we need it to be length-2.
Alternatively (and preferred),
test %>%
  filter(grepl("[wy]", lith)) %>%
  mutate(major = str_extract_all(lith, ...))
Here, the str_extract_all(lith, ...) sees only two values, not the original four.
On to the regularly-scheduled answer ...
I'll add a row-number column rn as an original-row reference (an id for each source row). This is both functional (things need it to work internally) and useful in case you need to tie the results back to the original data. I'm inferring that you want the values grouped together as strings instead of list-columns, though it's easy enough to switch to the latter if desired.
library(dplyr)
library(stringr) # str_extract_all
library(tidyr) # unnest, pivot_wider
test %>%
  mutate(
    rn = row_number(),
    tmp = str_extract_all(lith, "\\b([[:alpha:]]+) ?\\{([^}]+)\\}"),
    tmp = lapply(tmp, function(z) strcapture("^([^{}]*) ?\\{(.*)\\}", z, list(categ = "", val = "")))
  ) %>%
  unnest(tmp) %>%
  mutate(across(c(categ, val), trimws)) %>%
  group_by(rn, categ) %>%
  summarize(val = paste(val, collapse = ", ")) %>%
  pivot_wider(id_cols = rn, names_from = "categ", values_from = "val") %>%
  ungroup()
# # A tibble: 4 x 4
# rn incidental major minor
# <int> <chr> <chr> <chr>
# 1 1 dacite rhyolite basalt andesite
# 2 2 NA andesite flows, dacite flows NA
# 3 3 NA andesite dacite
# 4 4 NA basaltic andesitebasalt NA
Sample data frame
Guest <- c("ann","ann","beth","beth","bill","bill","bob","bob","bob","fred","fred","ginger","ginger")
State <- c("TX","IA","IA","MA","AL","TX","TX","AL","MA","MA","IA","TX","AL")
df <- data.frame(Guest,State)
Desired output: one row per distinct chain of visited states (e.g. AL-MA-TX) together with a count of how many guests have that chain.
I have tried about a dozen different ideas but am not getting close. The closest was setting up a crosstab, but I didn't know how to get counts from it. Long/wide reshaping got me nowhere. Too new to think outside the box, I guess.
Try this approach. You can arrange your values and then use group_by() and summarise() twice to reach a structure similar to the one expected:
library(dplyr)
library(tidyr)
#Code
new <- df %>%
  arrange(Guest, State) %>%
  group_by(Guest) %>%
  summarise(Chain = paste0(State, collapse = '-')) %>%
  group_by(Chain, .drop = T) %>%
  summarise(N = n())
Output:
# A tibble: 4 x 2
Chain N
<chr> <int>
1 AL-MA-TX 1
2 AL-TX 2
3 IA-MA 2
4 IA-TX 1
We can use base R with aggregate and table
table(aggregate(State ~ Guest, df[do.call(order, df), ], paste, collapse = '-')$State)
-output
# AL-MA-TX AL-TX IA-MA IA-TX
# 1 2 2 1
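If you already use data.table, the same two-step grouping translates directly (a sketch, equivalent in spirit to the dplyr version above):
library(data.table)
setDT(df)[order(Guest, State),
          .(Chain = paste(State, collapse = '-')), by = Guest][, .N, by = Chain]
#       Chain N
# 1:    IA-TX 1
# 2:    IA-MA 2
# 3:    AL-TX 2
# 4: AL-MA-TX 1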
I'm fairly new to R and struggling to find a solution for the following problem:
I have a tibble consisting of 3 columns. The first column holds stock ids (e.g. ID1, ID2, ...), the second the date of observation, and the third the corresponding return (ID | Date | Return).
To tidy my dataset I need to delete all zero returns, starting from the end of the sample period and going backwards until I reach the first non-zero return.
(The original question included a screenshot, DatasetExample, highlighting the trailing zero returns that need to be deleted.)
Hence, one needs to first group by ID and then iterate over the table from bottom to top until reaching a non-zero return.
I already found a way by converting the tibble into a matrix and looping over each element, but this approach is rather naive and does not perform well on large datasets (2+ million observations), which is exactly my case.
Is there any more efficient way to achieve this? Solutions using dplyr would be highly appreciated.
Thanks in advance.
Here is a dplyr solution. The idea is to reverse each group by date so that the trailing zeros come first, then keep rows only from the first non-zero return onwards.
library(dplyr)
df1 %>%
  mutate(Date = as.Date(Date, format = "%d.%m.%Y")) %>%
  group_by(ID) %>%
  arrange(desc(Date), .by_group = TRUE) %>%
  # in reverse date order the trailing zeros come first;
  # cumany() stays FALSE until the first non-zero return, then turns TRUE
  filter(cumany(Return != 0)) %>%
  arrange(Date, .by_group = TRUE)
# A tibble: 7 x 3
# Groups:   ID [2]
# ID Date Return
# <int> <date> <dbl>
#1 1 2020-09-20 0.377
#2 1 2020-09-21 0
#3 1 2020-09-22 -1.10
#4 2 2020-09-20 0.721
#5 2 2020-09-21 0
#6 2 2020-09-22 0
#7 2 2020-09-23 1.76
Test data creation code
set.seed(2020)
df1 <- data.frame(ID = rep(1:2, each = 5), Date = Sys.Date() - 5:1, Return = rnorm(10))
df1$Date <- format(df1$Date, "%d.%m.%Y")
df1$Return[sample(1:5, 2)] <- 0
df1$Return[sample(6:10, 2)] <- 0
df1$Return[10] <- 0
There might be a more elegant way but this could work:
split_data <- split(df1, df1$ID)
split_tidy_data <- lapply(split_data, function(x) x[seq_len(max(which(x$Return != 0))), ])
tidy_data <- do.call(rbind, split_tidy_data)
Note: this only works if every ID has at least one Return that is not equal to 0, and it assumes the rows are already ordered by Date within each ID.
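A vectorised base R alternative that avoids the split/rbind round trip (a sketch, assuming rows are ordered by Date within each ID, as in df1 above): base R has no cumany(), but rev(cummax(rev(z))) on a logical vector plays the same role, marking everything up to the last non-zero return.
keep <- as.logical(ave(df1$Return != 0, df1$ID,
                       FUN = function(z) rev(cummax(rev(z)))))
tidy_data <- df1[keep, ]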
I have this input:
t <- data.frame(x=c(1,2,8,4), y=c(2,3,4,5), k=c(3,4,5,1))
I want the nth-lowest element of each row (ordered by the values within that row), so that the output is something like this (example for nth_element = 2):
[1] 2 3 5 4
I tried a function like this:
apply(t, 1, nth, n=1, order_by = .)
But this does not work. Two questions:
What should I type in the order_by argument to make this function work?
What is the best way to summarise rows with my own summary function if I don't want to mention the column names in the rowwise summary?
Sidenote:
I don't want to mention the column names specifically; I want the function to use all columns in the dataset.
I tried the rownth function from the Rfast package, but it only returns one result. Does anybody know what I am doing wrong?
We can use apply and sort to do this.
d <- data.frame(x=c(1,2,8,4), y=c(2,3,4,5), k=c(3,4,5,1))
nth_lowest <- 2
apply(d, 1, FUN = function(x) sort(x)[nth_lowest])
# [1] 2 3 5 4
Note that I am calling the data d instead of t: t is already the name of a base R function (matrix transpose), so it's best not to mask it.
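To answer the order_by part of the question directly: dplyr::nth() expects a vector the same length as x, so you can pass the row itself through an anonymous function (a sketch):
library(dplyr)
apply(d, 1, function(r) nth(r, n = 2, order_by = r))
# [1] 2 3 5 4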
Not as elegant as #bouncyball's answer, but using dplyr (and tidyr), one possibility is to do:
library(dplyr)
library(tidyr)
t %>%
  mutate(Row = row_number()) %>%
  pivot_longer(-Row, names_to = "Col", values_to = "Val") %>%
  group_by(Row) %>%
  arrange(Val) %>%
  slice(2) %>%
  select(Val)
Adding missing grouping variables: `Row`
# A tibble: 4 x 2
# Groups: Row [4]
Row Val
<int> <dbl>
1 1 2
2 2 3
3 3 5
4 4 4
Using Rfast you could reduce the run time for big inputs, though it works on matrices only.
d <- data.frame(x = c(1,2,8,4), y = c(2,3,4,5), k = c(3,4,5,1))
d <- Rfast::data.frame.to_matrix(d)
nth_lowests <- rep(2, nrow(d))  # one value of n per row
Rfast::rownth(d, nth_lowests)
# [1] 2 3 5 4
You could also use the parallel version of Rfast::rownth
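For example (a sketch; parallel = TRUE assumes your Rfast build has parallel support enabled):
Rfast::rownth(d, nth_lowests, parallel = TRUE)
# [1] 2 3 5 4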
This feels like a common enough task that I assume there's an established function/method for accomplishing it. I'm imagining a function like dplyr::filter_after() but there doesn't seem to be one.
Here's the method I'm using as a starting point:
#Setup:
library(dplyr)
threshold <- 3
test.df <- data.frame("num"=c(1:5,1:5),"let"=letters[1:10])
#Drop every row that follows the first 3, including that row:
out.df <- test.df %>%
  mutate(pastThreshold = cumsum(num >= threshold)) %>%
  filter(pastThreshold == 0) %>%
  dplyr::select(-pastThreshold)
This produces the desired output:
> out.df
num let
1 1 a
2 2 b
Is there another solution that's less verbose?
dplyr provides the window functions cumany() and cumall(), which let you keep all rows before a condition becomes false for the first time (cumall) or after it becomes true for the first time (cumany); see ?dplyr::cumall for the documentation.
test.df %>%
  filter(cumall(num < threshold)) # all rows until condition violated for first time
# num let
# 1 1 a
# 2 2 b
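Equivalently, you can negate cumany() to keep rows until the condition has been true at least once:
test.df %>%
  filter(!cumany(num >= threshold))
# num let
# 1   1   a
# 2   2   b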
You can do:
test.df %>%
  slice(seq_len(which.max(num == threshold) - 1))
num let
1 1 a
2 2 b
We can use the same logic inside filter, without needing to create an extra column and remove it later:
library(dplyr)
test.df %>%
  filter(cumsum(num >= threshold) == 0)
# num let
#1 1 a
#2 2 b
Or another option is match with slice
test.df %>%
  slice(seq_len(match(threshold - 1, num)))
Or another option is rleid
library(data.table)
test.df %>%
  filter(rleid(num >= threshold) == 1)