I have a data frame data like this:
data
id time var1
1 a 3 0
2 a 2 2
3 a 1 3
4 b 3 2
5 b 4 6
I want to get the row with the second-largest time for each id, like this:
data2
id time var1
1 a 2 2
2 b 3 2
I tried using sqldf:
sqldf("select * from data order by time desc limit 2,1 group by id")
but I got an error:
Error in sqliteSendQuery(con, statement, bind.data) :
error in statement: near "group": syntax error
I also tried:
select max(time),* from data where time not in(select max(time) from data group by id) group by id
but I only got one row, not one per id, so I can't get the right answer.
Thanks!
Try taking the maximum time among the rows whose time is less than the maximum for that id:
sqldf("select id, max(time) time, var1
from data a
where time < (select max(b.time)
from data b
where b.id = a.id)
group by id")
I have to retrieve IDs for employees who have completed the minimum number of jobs. There are multiple employees who have completed 1 job. My current sqldf query retrieves only 1 row of data, while there are multiple employee IDs who have completed just 1 job. Why does it stop at the first minimum value? And how do I fetch all rows with the minimum value in a column? Here is a data sample:
ID TaskCount
1 74
2 53
3 10
4 5
5 1
6 1
7 1
The code I have used:
sqldf("select id, min(taskcount) as Jobscompleted
from (select id,count(id) as taskcount
from MyData
where id is not null
group by id order by id)")
The output is:
ID Jobscompleted
5 1
What I want is all the rows with the minimum number of jobs completed:
ID Jobscompleted
5 1
6 1
7 1
min(...) always returns one row, as do all SQL aggregate functions. Try this instead:
sqldf("select ID, TaskCount TasksCompleted from MyData
where TaskCount = (select min(TaskCount) from MyData)")
giving:
ID TasksCompleted
1 5 1
2 6 1
3 7 1
Note: The input in reproducible form is:
Lines <- "
ID TaskCount
1 74
2 53
3 10
4 5
5 1
6 1
7 1"
MyData <- read.table(text = Lines, header = TRUE)
As an alternative to sqldf, you could use data.table:
library(data.table)
dt <- data.table(ID=1:7, TaskCount=c(74, 53, 10, 5, 1, 1, 1))
dt[TaskCount==min(TaskCount)]
## ID TaskCount
## 1: 5 1
## 2: 6 1
## 3: 7 1
I have a dataset with id and speed.
id <- c(1,1,1,1,2,2,2,2,3,3,3)
speed <- c(40,30,50,40,45,50,30,55,50,50,60)
i <- cbind(id, speed)
limit <- 35
Say each time speed crosses above limit, it counts as one violation, and it is counted again only after speed drops below the limit and crosses it again.
I want the data to look like this:
id | Speed Viol.
---|------------
 1 | 2
 2 | 2
 3 | 1
Here are the violating speeds for each id, with the violation number in parentheses:
id 1: (1) 40 (2) 50, 40
id 2: (1) 45, 50 (2) 55
id 3: (1) 50, 50, 60
How can I do this without using if()?
Here's a method using tapply, as suggested in the comments, applied to the original vectors:
tapply(speed, id, FUN=function(x) sum(c(x[1] > limit, diff(x > limit)) > 0))
1 2 3
2 2 1
tapply applies a function to each group, here, by ID. The function checks if the first element of an ID is over 35, and then concatenates this to the output of diff, whose argument is checking if subsequent observations are greater than 35. Thus diff checks if an ID returns to above 35 after dropping below that level. Negative values in the resulting vector are converted to FALSE (0) with > 0 and these results are summed.
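To make the steps concrete, here is the expression evaluated by hand for id 1 (speeds 40, 30, 50, 40):

x <- c(40, 30, 50, 40)
x > limit                                  # TRUE FALSE TRUE TRUE
diff(x > limit)                            # -1  1  0
c(x[1] > limit, diff(x > limit))           #  1 -1  1  0
sum(c(x[1] > limit, diff(x > limit)) > 0)  # 2

The two positive entries mark the start of the two violation episodes (the initial 40 and the later 50).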
tapply returns a named vector, which can be fairly nice to work with. However, if you want a data.frame, then you could use aggregate instead as suggested by d.b:
aggregate(speed, list(id=id), FUN=function(x) sum(c(x[1] > limit, diff(x > limit)) > 0))
id x
1 1 2
2 2 2
3 3 1
Here's a dplyr solution. I group by id then check if speed is above the limit in each row, but wasn't in the previous entry. (I get the previous row using lag). If this is the case, it produces TRUE. Or, if it's the first row for the id (i.e., row_number()==1) and it's above the limit, this gives a TRUE, too. Then, I sum all the TRUE values for each id using summarise.
id <- c(1,1,1,1,2,2,2,2,3,3,3)
speed <- c(40,30,50,40,45,50,30,55,50,50,60)
i <- data.frame(id, speed)
limit <- 35
library(dplyr)
i %>%
group_by(id) %>%
mutate(viol=(speed>limit&lag(speed)<limit)|(row_number()==1&speed>limit)) %>%
summarise(sum(viol))
# A tibble: 3 x 2
id `sum(viol)`
<dbl> <int>
1 1 2
2 2 2
3 3 1
Here is another option with data.table. rleid(speed > limit) assigns a run id to each consecutive run of over- or under-limit readings within each id, so counting the distinct run ids among the rows where speed > limit gives the number of violation episodes:
library(data.table)
setDT(i)[, id1 := rleid(speed > limit), by = id][
speed > limit, .(violations = uniqueN(id1)), by = id][]
which gives:
id violations
1: 1 2
2: 2 2
3: 3 1
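To see what rleid is doing, here it is on the speeds for id 1; each consecutive run of over- or under-limit readings gets its own id:

library(data.table)
rleid(c(40, 30, 50, 40) > 35)
## [1] 1 2 3 3

Rows 1, 3, and 4 are the violating ones, and their run ids (1, 3, 3) contain two distinct values, hence two violation episodes.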
aggregate(speed~id, data.frame(i), function(x) sum(rle(x>limit)$values))
# id speed
#1 1 2
#2 2 2
#3 3 1
The main idea is that x > limit will check for instances when the speed limit is violated and rle(x) will group those instances into consecutive violations or consecutive non-violations. Then all you need to do is to count the groups of consecutive violations (when rle(x>limit)$values is TRUE).
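For example, for id 1:

x <- c(40, 30, 50, 40)
rle(x > limit)
## Run Length Encoding
##   lengths: int [1:3] 1 1 2
##   values : logi [1:3] TRUE FALSE TRUE
sum(rle(x > limit)$values)
## [1] 2

Summing a logical vector counts its TRUE values, so this counts the runs of consecutive violations.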
I have a column in an Access table with repeated values, and I just want to count the number of records for each distinct value, e.g.:
Column A Count
4 3
4 3
4 3
3 2
3 2
1 1
Can anyone help me do this?
You do it using a query:
SELECT [Column A], Count([Column A]) AS CountOfColumnA
FROM tbl
GROUP BY [Column A];
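If you want the count repeated on every row, as in your sample output, one sketch is to join back against that aggregate (assuming the table is named tbl; recent Access versions accept a derived table like this, otherwise save the grouped query separately and join to it):

SELECT t.[Column A], q.CountOfColumnA AS [Count]
FROM tbl AS t
INNER JOIN (
    SELECT [Column A], Count([Column A]) AS CountOfColumnA
    FROM tbl
    GROUP BY [Column A]
) AS q
ON t.[Column A] = q.[Column A];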
How can I count the total number of transactions by id and by date?
Sample data:
f<- data.frame(
id=c("A","A","A","A","C","C","D","D","E"),
start_date=c("6/3/2012","7/3/2012","7/3/2012","8/3/2012","5/3/2012","6/3/2012","6/3/2012","6/3/2012","5/3/2012")
)
Expected output:
id | count
A | 3
C | 2
D | 1
E | 1
Logic:
A appears on 6 March, 7 March, and 8 March, so its count is 3.
C appears on 5 March and 6 March, so its count is 2.
And so on.
I tried the following code, but I think it only counts the number of times each id occurs in the data:
library(lubridate)
f$date <- mdy(f$start_date)
f1 <- f[order(f$id, f$date), ]
How can I change this code to get my desired outcome?
[Note: The actual data is huge, so performance needs to be considered.]
Thanks in advance.
Note that simply counting rows gives a different answer, because some ids have duplicated dates (A has two rows on 7/3/2012 and D has two on 6/3/2012):
with(f, tapply(start_date, id, length))
A C D E
4 2 2 1
You can try the following. f[!duplicated(f), ] removes duplicate rows from f, and then aggregate counts the start_date values for each id:
aggregate(start_date ~ id, f[!duplicated(f), ], length)
## id start_date
## 1 A 3
## 2 C 2
## 3 D 1
## 4 E 1
Not sure what format you want the results in, but
rowSums(with(f, table(id, start_date)>0))
will return a named vector with the count of distinct days for each ID.
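To see why this works: table(id, start_date) counts rows per id and date, > 0 marks which dates occur at all, and rowSums then counts the distinct dates per id:

with(f, table(id, start_date))
##    start_date
## id  5/3/2012 6/3/2012 7/3/2012 8/3/2012
##   A        0        1        2        1
##   C        1        1        0        0
##   D        0        2        0        0
##   E        1        0        0        0
rowSums(with(f, table(id, start_date) > 0))
## A C D E
## 3 2 1 1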
I have a problem which is a bit beyond me (I'm really awfully glad I'm a Beta) involving duplicates (so GROUP BY, HAVING, COUNT), compounded by keeping the solution within the standard functions that came with SQLite. I am using the sqlite3 module from Python.
Example table workers, Columns:
* ID: integer, auto-incrementing
* ColA: integer
* ColB: varchar(20)
* UserType: varchar(20)
* LoadMe: Boolean
(Yes, SQLite's datatypes are nominal)
My data table, Workers, at start looks like:
ID ColA ColB UserType LoadMe
1 1 a Alpha 0
2 1 b Beta 0
3 2 a Alpha 0
4 2 a Beta 0
5 2 b Delta 0
6 2 b Alpha 0
7 1 a Delta 0
8 1 b Epsilon 0
9 1 c Gamma 0
10 4 b Delta 0
11 5 a Alpha 0
12 5 a Beta 0
13 5 b Gamma 0
14 5 a Alpha 0
I would like to enable, for Loading onto trucks at a new factory, all workers who have unique combinations between ColA and ColB. For those duplicates (twins, triplets, etc., perhaps via Bokanovsky's Process) where unique combinations of ColA and ColB have more than one worker, I would like to select only one from each set of duplicates. To make the problem harder, I would like to additionally be able to make the selection one from each set of duplicates on the basis of UserType in some form of ORDER BY. I may wish to select the first "duplicate" with a UserType of "Alpha," to work on a frightfully clever problem, or ORDER BY UserType DESC, that I may issue an order for black tunics for the lowest of the workers.
You can see that IDs 9, 10, and 13 have unique combinations of ColA and ColB and are most easily identified. The 1-a, 1-b, 2-a, 2-b, and 5-a combinations, however, have duplicates within them.
My current process, as it stands so far:
0) Everyone comes with a unique ID number. This is done at birth.
1) SET all Workers to LoadMe = 1.
UPDATE Workers
SET LoadMe = 1
2) Find my duplicates based on their similarity in two columns (GROUP BY ColA, ColB):
SELECT Wk1.*
FROM Workers AS Wk1
INNER JOIN (
SELECT ColA, ColB
FROM Workers
GROUP BY ColA, ColB
HAVING COUNT(*) > 1
) AS Wk2
ON Wk1.ColA = Wk2.ColA
AND Wk1.ColB = Wk2.ColB
ORDER BY ColA, ColB
3) SET all of my duplicates to LoadMe = 0.
UPDATE Workers
SET LoadMe = 0
WHERE ID IN (
SELECT Wk1.ID
FROM Workers AS Wk1
INNER JOIN (
SELECT ColA, ColB
FROM Workers
GROUP BY ColA, ColB
HAVING COUNT(*) > 1
) AS Wk2
ON Wk1.ColA = Wk2.ColA
AND Wk1.ColB = Wk2.ColB
)
4) For each set of duplicates in my GROUP BY, ORDERed BY UserType, SELECT only one, the first in the list, to have LoadMe SET to 1.
This table would look like:
ID ColA ColB UserType LoadMe
1 1 a Alpha 1
2 1 b Beta 1
3 2 a Alpha 1
4 2 a Beta 0
5 2 b Delta 0
6 2 b Alpha 1
7 1 a Delta 0
8 1 b Epsilon 0
9 1 c Gamma 1
10 4 b Delta 1
11 5 a Alpha 1
12 5 a Beta 0
13 5 b Gamma 1
14 5 a Alpha 0
ORDERed BY ColA, ColB, UserType, then ID, broken out by the GROUP BY columns (and finally spaced for clarity), that same data might look like:
ID ColA ColB UserType LoadMe
1 1 a Alpha 1
7 1 a Delta 0
2 1 b Beta 1
8 1 b Epsilon 0
9 1 c Gamma 1
3 2 a Alpha 1
4 2 a Beta 0
6 2 b Alpha 1
5 2 b Delta 0
10 4 b Delta 1
11 5 a Alpha 1
14 5 a Alpha 0
12 5 a Beta 0
13 5 b Gamma 1
I am confounded on the last step and feel like an Epsilon-minus semi-moron. I had previously been pulling the duplicates out of the database into program space and working within Python, but this situation arises not infrequently and I would like to more permanently solve this.
I like to break a problem like this up a bit. The first step is to identify the unique ColA,ColB pairs:
SELECT ColA,ColB FROM Workers GROUP BY ColA,ColB
Now for each of these pairs you want to find the highest-priority record. A join won't work because you'll end up with multiple records for each unique pair, but a subquery will work:
SELECT ColA,ColB,
(SELECT id FROM Workers w1
WHERE w1.ColA=w2.ColA AND w1.ColB=w2.ColB
ORDER BY UserType LIMIT 1) AS id
FROM Workers w2 GROUP BY ColA,ColB;
You can change the ORDER BY clause in the subquery to control the priority. LIMIT 1 ensures that there is only one record for each subquery (otherwise SQLite will return the last record that matches the WHERE clause, although I'm not sure that that's guaranteed).
The result of this query is a list of records to be loaded with ColA, ColB, id. I would probably work directly from that and get rid of LoadMe but if you want to keep it you could do this:
BEGIN TRANSACTION;
UPDATE Workers SET LoadMe=0;
UPDATE Workers SET LoadMe=1
WHERE id IN (SELECT
(SELECT id FROM Workers w1
WHERE w1.ColA=w2.ColA AND w1.ColB=w2.ColB
ORDER BY UserType LIMIT 1) AS id
FROM Workers w2 GROUP BY ColA,ColB);
COMMIT;
That clears the LoadMe flag and then sets it to 1 for each of the records returned by our last query. The transaction guarantees that this all takes place or fails as one step and never leaves your LoadMe fields in an inconsistent state.
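As an aside, if your SQLite build is 3.25 or newer (when window functions were added), a window-function sketch of the same update might look like:

BEGIN TRANSACTION;
UPDATE Workers SET LoadMe = 0;
UPDATE Workers SET LoadMe = 1
WHERE ID IN (
    SELECT ID FROM (
        SELECT ID,
               ROW_NUMBER() OVER (
                   PARTITION BY ColA, ColB
                   ORDER BY UserType, ID) AS rn
        FROM Workers)
    WHERE rn = 1);
COMMIT;

Changing the ORDER BY inside OVER (...) controls which duplicate survives, e.g. ORDER BY UserType DESC for the reverse priority.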