Is it possible to COUNT on a condition and include NULL as well as the other values as 0? - sqlite

I have a problem: I need to COUNT the rows where status = 1 (GROUP BY name).
However, the result should also include the names whose rows are all != 1 or NULL, counted as 0.
I have tried a CTE, CASE WHEN, and WHERE status = 1 OR status IS NULL. The latter does include NULL as 0, but there are names containing both 1 and 0, or only 0.
If I use WHERE status IS NULL OR status = 1, a name whose only status is 0 is not counted at all.
If I use
CASE WHEN status IS NULL THEN 0
WHEN status = 0 THEN 0
WHEN status = 1 THEN COUNT(DISTINCT name)
END
then a name containing both 1 and 0 is counted as 0.
TABLE:
INSERT INTO
students
(name, student_id, exercise_id, status)
VALUES
('Uolevi', 1, 1, 0),
('Uolevi', 1, 1, 0),
('Uolevi', 1, 1, 1),
('Uolevi', 1, 2, 0),
('Uolevi', 1, 2, 0),
('Uolevi', 1, 2, 1),
('Maija', 2, 1, 1),
('Maija', 2, 2, 1),
('Maija', 2, 2, 1),
('Maija', 2, 2, 1),
('Maija', 2, 3, 0),
('Juuso', 3, 1, 0),
('Juuso', 3, 2, 0),
('Juuso', 3, 3, 0),
('Miiko', NULL, NULL, NULL);

EDITED since you specified your expected result.
You can add a FILTER clause to the aggregate function COUNT() to only count the rows matched by the filter:
SELECT name,
COUNT(DISTINCT exercise_id) FILTER (WHERE status=1) as exercise_count
FROM students
GROUP BY student_id
ORDER BY exercise_count DESC
Here is a fiddle with the values you provided.
My query doesn't take into account the transmissions table you used in your answer, because you didn't specify its structure or contents, but you can adapt the answer to your real needs.
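Not part of the original answer, but the FILTER query is easy to sanity-check outside SQL with Python's built-in sqlite3 module (the FILTER clause requires SQLite >= 3.30, which ships with recent Python builds):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE students (name TEXT, student_id INTEGER,"
    " exercise_id INTEGER, status INTEGER)"
)
rows = [
    ("Uolevi", 1, 1, 0), ("Uolevi", 1, 1, 0), ("Uolevi", 1, 1, 1),
    ("Uolevi", 1, 2, 0), ("Uolevi", 1, 2, 0), ("Uolevi", 1, 2, 1),
    ("Maija", 2, 1, 1), ("Maija", 2, 2, 1), ("Maija", 2, 2, 1),
    ("Maija", 2, 2, 1), ("Maija", 2, 3, 0),
    ("Juuso", 3, 1, 0), ("Juuso", 3, 2, 0), ("Juuso", 3, 3, 0),
    ("Miiko", None, None, None),
]
con.executemany("INSERT INTO students VALUES (?, ?, ?, ?)", rows)

# COUNT(...) FILTER counts only the rows matching the filter, but every
# group still produces a row, so Juuso and Miiko show up with count 0.
result = con.execute("""
    SELECT name,
           COUNT(DISTINCT exercise_id) FILTER (WHERE status = 1) AS exercise_count
    FROM students
    GROUP BY student_id
    ORDER BY exercise_count DESC
""").fetchall()
print(sorted(result))
# [('Juuso', 0), ('Maija', 2), ('Miiko', 0), ('Uolevi', 2)]
```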

You can try it like this:
SELECT s.name, COUNT(DISTINCT exercise_id)
FROM students s
LEFT JOIN transmissions t
ON s.id = t.student_id
AND t.state = 1
GROUP BY s.student_id

Related

R Find Distance Between Two values By Group

HAVE = data.frame(INSTRUCTOR = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3),
STUDENT = c(1, 2, 2, 2, 1, 3, 1, 1, 1, 1, 2, 1),
SCORE = c(10, 1, 0, 0, 7, 3, 5, 2, 2, 4, 10, 2),
TIME = c(1,1,2,3,2,1,1,2,3,1,1,2))
WANT = data.frame(INSTRUCTOR = c(1, 2, 3),
SCORE.DIF = c(-9, NA, 6))
For each INSTRUCTOR, I wish to find the SCOREs of the first and second STUDENT and subtract them. The STUDENT codes vary, so I do not want to hard-code '==1' vs '==2'.
I try:
HAVE[, .SD[1:2], by = 'INSTRUCTOR']
but do not know how to subtract vertically and obtain 'WANT' data frame from 'HAVE'
library(data.table)
setDT(HAVE)
unique(HAVE, by = c("INSTRUCTOR", "STUDENT")
)[, .(SCORE.DIF = diff(SCORE[1:2])), by = INSTRUCTOR]
# INSTRUCTOR SCORE.DIF
# <num> <num>
# 1: 1 -9
# 2: 2 NA
# 3: 3 6
To use your new TIME variable, we can do
HAVE[, .SD[which.min(TIME),], by = .(INSTRUCTOR, STUDENT)
][, .(SCORE.DIF = diff(SCORE[1:2])), by = INSTRUCTOR]
# INSTRUCTOR SCORE.DIF
# <num> <num>
# 1: 1 -9
# 2: 2 NA
# 3: 3 6
One might be tempted to replace SCORE[1:2] with head(SCORE, 2), but that won't work: head(SCORE, 2) returns a length-1 vector if the input is length-1, as it is with instructor 2 (who has only one student, albeit multiple times). Running diff on a length-1 vector (e.g., diff(1)) returns a zero-length vector, which in the data.table code above reduces to zero rows for instructor 2. However, when there is only one student, SCORE[1:2] resolves to c(SCORE[1], NA), for which the diff is length-1 (as needed) and NA (as needed).
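Not part of the original answer, but the "first score of each unique student, then diff the first two per instructor" logic can be cross-checked in plain Python (with None standing in for R's NA):

```python
# HAVE from the question, as (INSTRUCTOR, STUDENT, SCORE, TIME) tuples.
rows = [
    (1, 1, 10, 1), (1, 2, 1, 1), (1, 2, 0, 2), (1, 2, 0, 3),
    (1, 1, 7, 2), (1, 3, 3, 1), (2, 1, 5, 1), (2, 1, 2, 2),
    (2, 1, 2, 3), (3, 1, 4, 1), (3, 2, 10, 1), (3, 1, 2, 2),
]
seen = set()     # (instructor, student) pairs already counted
scores = {}      # instructor -> first score of each student, in order
for instr, student, score, _time in rows:
    if (instr, student) not in seen:
        seen.add((instr, student))
        scores.setdefault(instr, []).append(score)
# diff of the first two scores; None when an instructor has one student
score_dif = {i: (s[1] - s[0] if len(s) >= 2 else None)
             for i, s in scores.items()}
print(score_dif)  # {1: -9, 2: None, 3: 6}
```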

R Manipulating List of Lists With Conditions / Joining Data

I have the following data showing 5 possible kids to invite to a party and what neighborhoods they live in.
I have a list of solutions as well (binary indicators of whether each kid is invited or not; e.g., the first solution invites Kelly, Gina, and Patty).
data <- data.frame(c("Kelly", "Andrew", "Josh", "Gina", "Patty"), c(1, 1, 0, 1, 0), c(0, 1, 1, 1, 0))
names(data) <- c("Kid", "Neighborhood A", "Neighborhood B")
solutions <- list(c(1, 0, 0, 1, 1), c(0, 0, 0, 1, 1), c(0, 1, 0, 1, 1), c(1, 0, 1, 0, 1), c(0, 1, 0, 0, 1))
I'm looking for a way to now filter the solutions in the following ways:
a) Only keep solutions where there are at least 3 kids from both neighborhood A and neighborhood B (one kid can count as one for both if they're part of both)
b) Only keep solutions that have at least 3 kids selected (i.e., sum >= 3)
I think I need to somehow join data to the solutions in solutions, but I'm a bit lost on how to manipulate everything since the solutions are stuck in lists. Basically looking for a way to add entries to every solution in the list indicating a) how many kids the solution has, b) how many kids from neighborhood A, and c) how many kids from neighborhood B. From there I'd have to somehow filter the lists to only keep the solutions that satisfy >= 3?
Thank you in advance!
I wrote a little function to check each solution and return TRUE or FALSE based on your requirements. Passing your solutions to this using sapply() will give you a logical vector, with which you can subset solutions to retain only those that met the requirements.
check_solution <- function(solution, data) {
data <- data[as.logical(solution),]
sum(data[["Neighborhood A"]]) >= 3 && sum(data[["Neighborhood B"]]) >= 3
}
### No need to also test whether `sum(solution) >= 3`, since
### that is *always* true whenever either neighborhood sum is >= 3.
tests <- sapply(solutions, check_solution, data = data)
# FALSE FALSE FALSE FALSE FALSE
solutions[tests]
# list()
### none of the `solutions` provided actually meet criteria
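The check itself is language-agnostic; as a rough cross-check (not part of the original answer), the same filter in plain Python:

```python
# Each solution keeps a subset of kids; a solution passes only if at
# least `min_kids` of the kept kids belong to each neighborhood.
kids = ["Kelly", "Andrew", "Josh", "Gina", "Patty"]
neighborhood_a = [1, 1, 0, 1, 0]
neighborhood_b = [0, 1, 1, 1, 0]
solutions = [
    [1, 0, 0, 1, 1], [0, 0, 0, 1, 1], [0, 1, 0, 1, 1],
    [1, 0, 1, 0, 1], [0, 1, 0, 0, 1],
]

def check_solution(sol, min_kids=3):
    a = sum(n for s, n in zip(sol, neighborhood_a) if s)
    b = sum(n for s, n in zip(sol, neighborhood_b) if s)
    return a >= min_kids and b >= min_kids

kept = [s for s in solutions if check_solution(s)]
print(kept)  # [] -- as in the R answer, no example solution qualifies
```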
Edit: OP asked in the comments how to test against all neighborhoods in the data, and return TRUE if a specified number of neighborhoods have enough kids. Below is a solution using dplyr.
library(dplyr)
data <- data.frame(
c("Kelly", "Andrew", "Josh", "Gina", "Patty"),
c(1, 1, 0, 1, 0),
c(0, 1, 1, 1, 0),
c(1, 1, 1, 0, 1),
c(0, 1, 1, 1, 1)
)
names(data) <- c("Kid", "Neighborhood A", "Neighborhood B", "Neighborhood C",
"Neighborhood D")
solutions <- list(c(1, 0, 0, 1, 1), c(0, 0, 0, 1, 1), c(0, 1, 0, 1, 1),
c(1, 0, 1, 0, 1), c(0, 1, 0, 0, 1))
check_solution <- function(solution,
data,
min_kids = 3,
min_neighborhoods = NULL) {
neighborhood_tests <- data %>%
filter(as.logical(solution)) %>%
summarize(across(starts_with("Neighborhood"), ~ sum(.x) >= min_kids)) %>%
as.logical()
# require all neighborhoods by default
if (is.null(min_neighborhoods)) min_neighborhoods <- length(neighborhood_tests)
sum(neighborhood_tests) >= min_neighborhoods
}
tests1 <- sapply(solutions, check_solution, data = data)
solutions[tests1]
# list()
tests2 <- sapply(
solutions,
check_solution,
data = data,
min_kids = 2,
min_neighborhoods = 3
)
solutions[tests2]
# [[1]]
# [1] 1 0 0 1 1
#
# [[2]]
# [1] 0 1 0 1 1

match every row whose `region_ID=0` with the rows whose `region_ID=1` and calculate a certain distance

I have a dataset that looks like the following:
structure(list(X = c(36, 37, 38, 39, 40, 41, 1, 2, 3, 4, 5, 6
), Y = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), region_ID = c(0,
0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1)), row.names = c(NA, -12L), class = c("data.table",
"data.frame"), .internal.selfref = <pointer: 0x7fb8fc819ae0>)
I want to match every row whose region_ID=0 with the rows whose region_ID=1 and calculate
dist_to_r1=sqrt((X - i.X)^2 + (Y - i.Y)^2))
where i. prefix refers to the latter rows. I want to do this using data table syntax.
I have been trying to do this with left joins, but couldn't make it work.
You want a full (cartesian) join, such that each of the six rows in region 0 is joined to all six rows in region 1?
In that case, you can simply set allow.cartesian = TRUE:
data[, id:=1][region_ID==0][data[region_ID==1], on ="id", allow.cartesian=T][, dist_to_r1:=sqrt((X-i.X)^2 + (Y-i.Y)^2)][]
Edit: OP clarified that only the minimum distance to a point in region 0 is required. In this case, we can do something like this:
data[,id:=1]
region0 = data[region_ID==0]
# function that gets the minimum distance between two regions
get_min_dist <- function(region_a, region_b) {
region_a[region_b, on="id", allow.cartesian=T][,min(sqrt((X-i.X)^2 + (Y-i.Y)^2))]
}
# apply the function above to every region
data[,
.(min_dist_to_zero = get_min_dist(
region_a = region0,
region_b = data[region_ID == .BY$region_ID]
)),
by = region_ID]
Output:
region_ID min_dist_to_zero
1: 0 0
2: 1 30
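The cartesian-product-then-minimum idea is independent of data.table; a quick plain-Python cross-check (not from the original answer):

```python
from math import hypot, inf

# (X, Y, region_ID) points from the question's dataset.
points = [(36, 1, 0), (37, 1, 0), (38, 1, 0), (39, 1, 0), (40, 1, 0),
          (41, 1, 0), (1, 1, 1), (2, 1, 1), (3, 1, 1), (4, 1, 1),
          (5, 1, 1), (6, 1, 1)]
region0 = [(x, y) for x, y, r in points if r == 0]

# For every region, the minimum Euclidean distance from any of its
# points to the nearest region-0 point (cross join + min).
min_dist = {}
for x, y, r in points:
    d = min(hypot(x - x0, y - y0) for x0, y0 in region0)
    min_dist[r] = min(min_dist.get(r, inf), d)
print(min_dist)  # {0: 0.0, 1: 30.0}
```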

Changing rank across time in R as opposed to by group

I am having an issue in R where I want to add a rank (or index) column. But instead of the rank changing only when a new combination appears, I want it to change every time the current group differs from the previous row's group. I will illustrate what I mean in the code below.
df <- data.frame(id = c(1, 1, 1, 1, 1),
time = c(1, 2, 3, 4, 5),
group = c(1, 2, 2, 1, 3),
rank1 = c(1, 2, 2, 1, 3),
rank2 = c(1, 2, 2, 3, 4))
In the example I am ranking by group. rank1 is consistent with what I have been able to do so far: at time 4 the rank is 1 because there was a previous instance of that group. I want something like rank2, which accounts for the gap between the group == 1 instances and assigns a new rank accordingly (i.e., at time 4, rank2 is 3 as opposed to 1).
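(Not part of the original question.) What rank2 describes is a run-length id: the index increments whenever the group differs from the previous row, which is what data.table::rleid(group) computes in R. The underlying logic, sketched in Python:

```python
from itertools import groupby

group = [1, 2, 2, 1, 3]
# groupby() splits consecutive runs of equal values; numbering the runs
# gives a rank that restarts for the second appearance of group 1.
rank2 = []
for run_id, (_, run) in enumerate(groupby(group), start=1):
    rank2.extend(run_id for _ in run)
print(rank2)  # [1, 2, 2, 3, 4]
```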

r - how to subtract first date entry from last date entry in grouped data and control output format

This question is very similar to one asked in another thread, which can be found here. I'm trying to achieve something similar: within groups (events), subtract the first date from the last date. I'm using the dplyr package and code provided in the answers of that thread. Subtracting the first date from the last date works, but the results are not satisfactory: the time difference is displayed as a bare number, with no distinction between time units (e.g., minutes and hours). The subtractions in the first 2 events are correct, but the 3rd is not, i.e., it should be in minutes. How can I manipulate the dplyr output so that the resulting subtractions correctly reflect the time difference? Below you will find a sample of my data (1 group only) and the code that I used:
df<- structure(list(time = structure(c(1428082860, 1428083340, 1428084840,
1428086820, 1428086940, 1428087120, 1428087240, 1428087360, 1428087480,
1428087720, 1428088800, 1428089160, 1428089580, 1428089700, 1428090120,
1428090240, 1428090480, 1428090660, 1428090780, 1428090960, 1428091080,
1428091200, 1428091500, 1428091620, 1428096060, 1428096420, 1428096540,
1428096600, 1428097560, 1428097860, 1428100440, 1428100560, 1428100680,
1428100740, 1428100860, 1428101040, 1428101160, 1428101400, 1428101520,
1428101760, 1428101940, 1428102240, 1428102840, 1428103080, 1428103620,
1428103980, 1428104100, 1428104160, 1428104340, 1428104520, 1428104700,
1428108540, 1428108840, 1428108960, 1428110340, 1428110460, 1428110640
), class = c("POSIXct", "POSIXt"), tzone = ""), event = c(1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3)), .Names = c("time",
"event"), class = "data.frame", row.names = c(NA, 57L))
df1 <- df %>%
group_by(event) %>%
summarize(first(time),last(time),difference = last(time)-first(time))
We can use difftime() and specify the units to get all the differences in the same unit.
df %>%
group_by(event) %>%
summarise(First = first(time),
Last = last(time),
difference = difftime(last(time), first(time), units = 'hours'))
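The same pitfall exists outside R: many date libraries pick a display unit automatically, so it is safer to convert explicitly. For illustration only (hypothetical timestamps, not the question's data), the Python equivalent of pinning the unit:

```python
from datetime import datetime

first = datetime(2015, 4, 3, 18, 21)
last = datetime(2015, 4, 3, 20, 44)
# total_seconds() is unit-free; divide to get the unit you want,
# rather than relying on an automatically chosen display unit.
diff_hours = (last - first).total_seconds() / 3600
print(round(diff_hours, 2))  # 2.38
```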
