Subselect in dplyr - r

I am looking for the dplyr equivalent of the following SQL:
SELECT x
FROM ABT1
WHERE x IN (SELECT z FROM ABT2 WHERE q = ABT1.q)
I need this to add a new column to a data frame based on values in another data frame. I might be doing this the wrong way (I hope you can tell me), but the idea I have is along the lines of:
ABT1 <- ABT1 %>% mutate(x = ifelse(ABT2 %>% filter(x = ABT1.x) %>% count() > 0, 0, 1))
The code above does not work as I don't know how to finish it. ABT1 and ABT2 are both data frames.
Does anyone know how I can solve this?

With dplyr, we can do
library(dplyr)
inner_join(ABT1, select(ABT2, q, z), by = 'q') %>%
filter(x == z) %>% # compare within each joined row, so z comes from the same q
select(x) %>%
distinct()
# x
#1 4
#2 3
Testing with sqldf:
library(sqldf)
sqldf('SELECT x
FROM ABT1
WHERE x IN (SELECT z FROM ABT2 WHERE q = ABT1.q)')
# x
#1 4
#2 3
data
ABT1 <- data.frame(q = rep(letters[1:3], each = 2), x = c(1, 3, 5, 2, 4, 3))
ABT2 <- data.frame(q = rep(letters[2:4], each = 3),
z = c(4, 9, 12, 3, 1, 4, 10, 6, 5))
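The correlated IN subquery keeps a row of ABT1 when some row of ABT2 has the same q and a z equal to x, which is what a semi-join expresses. A minimal sketch, assuming the sample data above (res is only an illustrative name):

```r
library(dplyr)

# Sample data copied from above
ABT1 <- data.frame(q = rep(letters[1:3], each = 2), x = c(1, 3, 5, 2, 4, 3))
ABT2 <- data.frame(q = rep(letters[2:4], each = 3),
                   z = c(4, 9, 12, 3, 1, 4, 10, 6, 5))

# Keep rows of ABT1 that have a match in ABT2 on q and on x == z;
# semi_join() never duplicates rows of ABT1 and adds no columns from ABT2
res <- ABT1 %>%
  semi_join(ABT2, by = c("q", "x" = "z")) %>%
  select(x)

print(res)
```

Unlike a join followed by distinct(), this keeps repeated x values from ABT1, as the SQL query does.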

Related

combine lists nested in a tibble

I want to count the number of unique values in a table that contains several from...to pairs like below:
tmp <- tribble(
~group, ~from, ~to,
1, 1, 10,
1, 5, 8,
1, 15, 20,
2, 1, 10,
2, 5, 10,
2, 15, 18
)
I tried to nest all values in a list for each row (works), but combining these nested lists into one vector and counting the uniques doesn't work as expected.
tmp %>%
group_by(group) %>%
rowwise() %>%
mutate(nrs = list(c(from:to))) %>%
summarise(n_uni = length(unique(unlist(list(nrs)))))
The desired output looks like this:
tibble(group = c(1, 2),
n_uni = c(length(unique(unlist(list(tmp$nrs[tmp$group == 1])))),
length(unique(unlist(list(tmp$nrs[tmp$group == 2]))))))
# # A tibble: 2 × 2
# group n_uni
# <dbl> <int>
#1 1 16
#2 2 14
Any help would be much appreciated!
tmp %>%
rowwise() %>%
mutate(nrs = list(from:to)) %>%
group_by(group) %>%
summarise(n_uni = n_distinct(unlist(nrs)))
The issue with the OP's approach is that rowwise() replaces the initial group_by() grouping with one group per row, so summarise() runs once per row instead of once per group. We therefore have to group_by() after creating nrs in the rowwise step.
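The rowwise step can also be avoided altogether. A minimal sketch, assuming purrr is available and using map2() to build one sequence per row (res is an illustrative name, not from the question):

```r
library(dplyr)
library(purrr)
library(tibble)

tmp <- tribble(
  ~group, ~from, ~to,
  1,  1, 10,
  1,  5,  8,
  1, 15, 20,
  2,  1, 10,
  2,  5, 10,
  2, 15, 18
)

res <- tmp %>%
  mutate(nrs = map2(from, to, seq)) %>%        # one sequence per row, stored in a list column
  group_by(group) %>%
  summarise(n_uni = n_distinct(unlist(nrs)))   # flatten per group, then count unique values

print(res)
```

Because no rowwise() call is involved, the group_by() grouping is the only grouping in play when summarise() runs.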

Specifying several separate filter conditions and run code for each of the conditions

In my analysis I want to cut and slice my data in several different ways (i.e. want to apply different filters), but otherwise run the same follow-up code (here count) and then ideally glue this together. So what I would do is:
df <- data.frame(x = c(1, 1, 2, 2, 3, 3),
y = c(1, 2, 3, 3, 2, 1),
z = c(1, 1, 0, 1, 0, 1))
r1 <- df %>%
filter(x < 3 & y == 3) %>%
count(z)
r2 <- df %>%
filter(x != 2 & y == 1) %>%
count(z)
full_join(r1, r2, by = "z")
which gives:
z n.x n.y
1 0 1 NA
2 1 1 2
Now, I have a lot of different filter variants, so I'm wondering if there is an easier (tidy) way to achieve this. My idea was to somehow provide my filter conditions as an input, apply them to my data frame, and then bind the results together. I guess this might be possible with some purrr magic, but I have no idea how. Anyone?
It depends on the format your filters are in. For example, you could use eval + parse:
library(dplyr)
library(purrr)
filters <- c(
"x < 3 & y == 3",
"x != 2 & y == 1"
)
filters %>%
map(~filter(df, eval(parse(text = .x))) %>% count(z)) %>%
reduce(full_join, by = "z")
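If the conditions can be written in code rather than read from text, they can be stored as unevaluated expressions and injected into filter(), which avoids parsing strings. A sketch, assuming the rlang package is installed; count_for is a hypothetical helper name introduced only for this example:

```r
library(dplyr)
library(purrr)

df <- data.frame(x = c(1, 1, 2, 2, 3, 3),
                 y = c(1, 2, 3, 3, 2, 1),
                 z = c(1, 1, 0, 1, 0, 1))

# Conditions captured as unevaluated expressions instead of strings
filters <- rlang::exprs(
  x < 3 & y == 3,
  x != 2 & y == 1
)

# Hypothetical helper: inject one condition into filter() with !!, then count
count_for <- function(cond) {
  df %>%
    filter(!!cond) %>%
    count(z)
}

res <- filters %>%
  map(count_for) %>%
  reduce(full_join, by = "z")

print(res)
```

With more than two conditions, the suffixes that full_join() generates become hard to read, so renaming the n column inside the helper may be preferable.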
I'm an idiot. I just realized that I could simply do:
df |>
group_by(z) |>
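# use new column names: reusing x would make the second sum() see the summarised x, not the original column
summarize(n1 = sum(x < 3 & y == 3),
n2 = sum(x != 2 & y == 1))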

Sum of a list in R data frame

I have a column of type "list" in my data frame, I want to create a column with the sum.
I guess there is no visual difference, but my column consists of list(1,2,3)s and not c(1,2,3)s :
tibble(
MY_DATA = list(
list(2, 7, 8),
list(3, 10, 11),
list(4, 2, 8)
),
NOT_MY_DATA = list(
c(2, 7, 8),
c(3, 10, 11),
c(4, 2, 8)
)
)
Unfortunately when I try
mutate(NEW_COL = MY_LIST_COL_D %>% unlist() %>% sum())
the result is that every cell in the new column contains the sum of the entire source column (so a value in the millions)
I tried reduce and it did work, but was slow, I was looking for a better solution.
You could use purrr::map_dbl, which applies a function to each element and returns a double vector:
library(tibble)
library(dplyr)
library(purrr)
df = tibble(
MY_LIST_COL_D = list(
c(2, 7, 8),
c(3, 10, 11),
c(4, 2, 8)
)
)
df %>%
mutate(NEW_COL= map_dbl(MY_LIST_COL_D, sum), .keep = 'unused')
# NEW_COL
# <dbl>
# 1 17
# 2 24
# 3 14
Is this what you were looking for? If you don't want to remove the list column, just omit the .keep argument.
Update
With the underlying structure being lists, you can still apply the same logic, but one way to solve the issue is to unlist:
df = tibble(
MY_LIST_COL_D = list(
list(2, 7, 8),
list(3, 10, 11),
list(4, 2, 8)
)
)
df %>%
mutate(NEW_COL = map_dbl(MY_LIST_COL_D, ~ sum(unlist(.x))), .keep = 'unused')
# NEW_COL
# <dbl>
# 1 17
# 2 24
# 3 14
You can use rowwise in dplyr (this works as written when each element is a numeric vector):
library(dplyr)
df %>% rowwise() %>% mutate(NEW_COL = sum(MY_LIST_COL_D))
rowwise will also make your attempt work:
df %>% rowwise() %>% mutate(NEW_COL = MY_LIST_COL_D %>% unlist() %>% sum())
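You can also use sapply in base R (the unlist() call is needed when the elements are lists rather than numeric vectors):
df$NEW_COL <- sapply(df$MY_LIST_COL_D, function(el) sum(unlist(el)))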
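The behaviour described in the question, where every cell receives the sum of the entire column, comes from unlist() flattening the whole list column at once. A small base R sketch of the difference, using the question's values (col, whole_column and per_row are illustrative names):

```r
# A list column whose elements are themselves lists, as in the question
col <- list(
  list(2, 7, 8),
  list(3, 10, 11),
  list(4, 2, 8)
)

# Flattening the whole column first gives a single grand total
whole_column <- sum(unlist(col))

# Flattening one element at a time gives one sum per row;
# vapply() also checks that each result is a single number
per_row <- vapply(col, function(el) sum(unlist(el)), numeric(1))

print(whole_column)
print(per_row)
```

Both rowwise() and map_dbl() amount to the second form: the function is applied to one element of the column at a time.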

Top_n return both max and min value - R

Is it possible for top_n() to return both the max and min values at the same time?
Using the example from the reference page https://dplyr.tidyverse.org/reference/top_n.html
I tried the following
df <- data.frame(x = c(10, 4, 1, 6, 3, 1, 1))
df %>% top_n(c(1,-1)) ## returns an error
df <- data.frame(x = c(10, 4, 1, 6, 3, 1, 1))
df %>% top_n(1) %>% top_n(-1) ## returns only max value
Thanks
Not really involving top_n(), but you can try:
df %>%
arrange(x) %>%
slice(c(1, n()))
x
1 1
2 10
Or:
df %>%
slice(which(x == max(x) | x == min(x))) %>%
distinct()
Or (provided by @Gregor):
df %>%
slice(c(which.min(x), which.max(x)))
Or using filter():
df %>%
filter(x %in% range(x) & !duplicated(x))
An idea similar to @Jakub's answer, using purrr::map_dfr:
library(tidyverse) # dplyr, and purrr for map_dfr
df %>%
map_dfr(c(1, -1), top_n, wt = x, x = .)
# x
# 1 10
# 2 1
# 3 1
# 4 1
Here is an option with top_n: pass a logical vector that is TRUE for the min/max values (found with range), then take the distinct rows, because ties in the range values leave duplicate elements.
library(dplyr)
df %>%
top_n(x %in% range(x), 1) %>%
distinct
# x
#1 10
#2 1
I like @tmfmnk's answer. If you want to use the top_n function, you can do this (note that ties are kept, so every row equal to the minimum is returned):
df <- data.frame(x = c(10, 4, 1, 6, 3, 1, 1))
bind_rows(
df %>% top_n(1),
df %>% top_n(-1)
)
# this solution addresses the specification in comments
df %>%
group_by(y) %>%
summarise(min = min(x),
max = max(x),
average = mean(x))
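In dplyr 1.0.0 and later, top_n() is superseded by slice_max() and slice_min(), which expose tie handling through with_ties. A minimal sketch of the same idea with those functions, assuming one row per extreme is wanted (res is an illustrative name):

```r
library(dplyr)

df <- data.frame(x = c(10, 4, 1, 6, 3, 1, 1))

# with_ties = FALSE returns exactly one row per call, even when the
# minimum (here 1) occurs several times
res <- bind_rows(
  slice_max(df, x, n = 1, with_ties = FALSE),
  slice_min(df, x, n = 1, with_ties = FALSE)
)

print(res)
```

Leaving with_ties at its default of TRUE would return all three rows where x is 1, matching what top_n(-1) does.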

How to use dplyr to make modifications in a dataframe equivalent to the use of a 'which'?

If I have a data frame, say
df <- data.frame(x = c(1, 2, 3), y = c(2, 4, 7), z = c(3, 6, 10))
then I can modify entries with the which function:
w <- which(df[,"y"] == 7)
df[w,c("y", "z")] <- data.frame(6, 9)
One way I see to do this with the package dplyr is the following:
df <- df %>%
mutate(W = (y==7),
y = ifelse(W, 6, y),
z = ifelse(W, 9, z)) %>%
select(-W)
But I find it a bit inelegant, and I am not sure it would replace all kinds of which uses. Ideally I would imagine something like:
df <- df %>%
keep(y == 7) %>%
mutate(y = 6) %>%
unkeep()
where keep would provisionally select rows where modifications are to be made, and unkeep would unselect them to recover the full data frame.
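dplyr has no keep/unkeep pair like the one imagined above, but the helper column W can at least be dropped. One possible sketch, assuming a single condition drives all the replacements: update the dependent column first, while the column being tested still holds its original values.

```r
library(dplyr)

df <- data.frame(x = c(1, 2, 3), y = c(2, 4, 7), z = c(3, 6, 10))

# mutate() evaluates its arguments in order, so z is updated first,
# while y still holds the original values that the condition tests
df <- df %>%
  mutate(z = if_else(y == 7, 9, z),
         y = if_else(y == 7, 6, y))

print(df)
```

Swapping the two arguments would change y before z is evaluated, so the condition would no longer match any row when z is computed.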
