Calculate transactions per ID by date - R

How do I count the total number of transactions by id and by date?
Sample data:
f <- data.frame(
  id = c("A","A","A","A","C","C","D","D","E"),
  start_date = c("6/3/2012","7/3/2012","7/3/2012","8/3/2012","5/3/2012","6/3/2012","6/3/2012","6/3/2012","5/3/2012")
)
Expected output:
id | count
A | 3
C | 2
D | 1
E | 1
Logic:
A appears on 6 March, 7 March and 8 March (three distinct dates), so its count is 3.
C appears on 5 March and 6 March, so its count is 2.
And so on...
I tried the following code, but I think it only counts the number of times each ID occurs in the data.
library(lubridate)
f$date <- mdy(f$start_date)
f1 <- f[order(f$id, f$date), ]
How can I modify this code to get my desired outcome?
[Note: The actual data is huge, so performance needs to be considered.]
Thanks in advance.

I'm getting a different answer:
with(f, tapply(start_date, id, length))
A C D E
4 2 2 1
This counts every row, so A's duplicated 7/3/2012 entry is counted twice.

You can try the following: f[!duplicated(f), ] removes duplicate rows from f, and then aggregate does the aggregation using the length function, i.e. it gives the count of start_date values for each id.
aggregate(start_date ~ id, f[!duplicated(f), ], length)
## id start_date
## 1 A 3
## 2 C 2
## 3 D 1
## 4 E 1

Not sure what format you want the results in, but
rowSums(with(f, table(id, start_date)>0))
will return a named vector with the count of distinct days for each ID.
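Given the note about data volume, a data.table sketch may also help (assuming data.table is an acceptable dependency; uniqueN counts distinct values per group):
library(data.table)
# count distinct start_date values for each id, using the sample f from the question
setDT(f)[, .(count = uniqueN(start_date)), by = id]
##    id count
## 1:  A     3
## 2:  C     2
## 3:  D     1
## 4:  E     1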

Related

R Check for levels within a group and duplicate row if not present

I have a problem in Shiny which I will show with a simple example:
I have the following data:
Group<-c("A","A","B","C","C","D")
Value<-c(1,2,6,7,3,9)
df<-data.frame(Group, Value)
Group Value
A 1
A 2
B 6
C 7
C 3
D 9
Then I add a column to see how many reps a group has:
df$num <- ave(df$Value, df$Group, FUN = seq_along)
Group Value num
A 1 1
A 2 2
B 6 1
C 7 1
C 3 2
D 9 1
Now, I would like to check whether a group contains a 2nd rep, and if not, duplicate the 1st row of the group (the one with num = 1), setting num to 2.
So I would like to end up with:
Group Value num
A 1 1
A 2 2
B 6 1
B 6 2 #this row is added
C 7 1
C 3 2
D 9 1
D 9 2 #this row is added
I have searched for a solution, but I mostly found cases where the condition is based on a certain value, rather than on conditions within a group.
Could someone help me? I would appreciate it a lot!
Can this code do the trick?
res <- lapply(unique(df$Group), function(x) {
  a <- df[df$Group == x, ]
  if (nrow(a) == 1) {
    a <- a[rep(row.names(a), 2), ]
    a$num <- 1:2
  }
  a
})
do.call(rbind, res)
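For comparison, here is a vectorized base R sketch (assuming df already has the num column from above, and that only single-rep groups need padding): copy the rows of groups that occur once, set their num to 2, and bind them back on.
# rows whose Group occurs only once in df
singles <- df[ave(df$Value, df$Group, FUN = length) == 1, ]
singles$num <- 2                   # the copied row becomes the 2nd rep
out <- rbind(df, singles)
out[order(out$Group, out$num), ]   # keep the groups together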

Counting the instances of a variable that exceeds a threshold

I have a dataset with id and speed.
id <- c(1,1,1,1,2,2,2,2,3,3,3)
speed <- c(40,30,50,40,45,50,30,55,50,50,60)
i <- cbind(id, speed)
limit <- 35
Say, if speed crosses limit, we count it as 1, and we count again only if speed first drops below the limit and then crosses it again.
I want data to be like.
id | Speed Viol.
---|------------
 1 | 2
 2 | 2
 3 | 1
Here is each id with its violation episodes numbered in parentheses:
id 1: (1) 40 (2) 50, 40
id 2: (1) 45, 50 (2) 55
id 3: (1) 50, 50, 60
How can I do this without using if()?
Here's a method with tapply, as suggested in the comments, using the original vectors.
tapply(speed, id, FUN=function(x) sum(c(x[1] > limit, diff(x > limit)) > 0))
1 2 3
2 2 1
tapply applies a function to each group, here, by ID. The function checks if the first element of an ID is over 35, and then concatenates this to the output of diff, whose argument is checking if subsequent observations are greater than 35. Thus diff checks if an ID returns to above 35 after dropping below that level. Negative values in the resulting vector are converted to FALSE (0) with > 0 and these results are summed.
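For example, for id 1's speeds (40, 30, 50, 40), the expression evaluates like this:
x <- c(40, 30, 50, 40)             # speeds for id 1
c(x[1] > limit, diff(x > limit))   # diff coerces the logicals to 0/1
# [1]  1 -1  1  0
sum(c(x[1] > limit, diff(x > limit)) > 0)
# [1] 2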
tapply returns a named vector, which can be fairly nice to work with. However, if you want a data.frame, then you could use aggregate instead as suggested by d.b:
aggregate(speed, list(id=id), FUN=function(x) sum(c(x[1] > limit, diff(x > limit)) > 0))
id x
1 1 2
2 2 2
3 3 1
Here's a dplyr solution. I group by id then check if speed is above the limit in each row, but wasn't in the previous entry. (I get the previous row using lag). If this is the case, it produces TRUE. Or, if it's the first row for the id (i.e., row_number()==1) and it's above the limit, this gives a TRUE, too. Then, I sum all the TRUE values for each id using summarise.
id <- c(1,1,1,1,2,2,2,2,3,3,3)
speed <- c(40,30,50,40,45,50,30,55,50,50,60)
i <- data.frame(id, speed)
limit <- 35
library(dplyr)
i %>%
  group_by(id) %>%
  mutate(viol = (speed > limit & lag(speed) < limit) | (row_number() == 1 & speed > limit)) %>%
  summarise(sum(viol))
# A tibble: 3 x 2
id `sum(viol)`
<dbl> <int>
1 1 2
2 2 2
3 3 1
Here is another option with data.table,
library(data.table)
setDT(i)[, id1 := rleid(speed > limit), by = id][
speed > limit, .(violations = uniqueN(id1)), by = id][]
which gives,
id violations
1: 1 2
2: 2 2
3: 3 1
aggregate(speed~id, data.frame(i), function(x) sum(rle(x>limit)$values))
# id speed
#1 1 2
#2 2 2
#3 3 1
The main idea is that x > limit checks for instances when the speed limit is violated, and rle(x > limit) groups those instances into runs of consecutive violations or consecutive non-violations. Then all you need to do is count the runs of violations (where rle(x > limit)$values is TRUE).
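To see the intermediate result, here is the rle output for id 1's speeds:
x <- c(40, 30, 50, 40)       # speeds for id 1
rle(x > limit)
# Run Length Encoding
#   lengths: int [1:3] 1 1 2
#   values : logi [1:3] TRUE FALSE TRUE
sum(rle(x > limit)$values)   # each TRUE run is one violation episode
# [1] 2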

Count of unique values across all columns in a data frame

We have a data frame as below:
raw <- data.frame(v1 = c("A","B","C","D"), v2 = c(NA,"B","C","A"), v3 = c(NA,"A",NA,"D"), v4 = c(NA,"D",NA,NA))
I need a result data frame in the following format:
result <- data.frame(v1 = c("A","B","C","D"), v2 = c(3,2,2,3))
I used the following code to get the count for one particular column:
count_raw <- sqldf("SELECT DISTINCT(v1) AS V1, COUNT(v1) AS count FROM raw GROUP BY v1")
This returns the count for an individual column only; I need it across all columns.
Any help would be highly appreciated.
Use this
table(unlist(raw))
Output
A B C D
3 2 2 3
For data frame type output wrap this with as.data.frame.table
as.data.frame.table(table(unlist(raw)))
Output
Var1 Freq
1 A 3
2 B 2
3 C 2
4 D 3
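If you need the exact column names from the expected result, as.data.frame.table accepts a responseName argument, and the first column can be renamed afterwards:
res <- as.data.frame.table(table(unlist(raw)), responseName = "v2")
names(res)[1] <- "v1"
res
##   v1 v2
## 1  A  3
## 2  B  2
## 3  C  2
## 4  D  3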
If you want a total count,
sapply(unique(raw[!is.na(raw)]), function(i) length(which(raw == i)))
#A B C D
#3 2 2 3
We can use apply with MARGIN = 1
cbind(raw[1], v2=apply(raw, 1, function(x) length(unique(x[!is.na(x)]))))
If it is for each column
sapply(raw, function(x) length(unique(x[!is.na(x)])))
Or if we need the count based on all the columns, convert to matrix and use the table
table(as.matrix(raw))
# A B C D
# 3 2 2 3
If you have only character values in your data frame, as you've provided, you can unlist it and use unique; or, to count the frequencies, use count from plyr:
> library(plyr)
> raw<-data.frame(v1=c("A","B","C","D"),v2=c(NA,"B","C","A"),v3=c(NA,"A",NA,"D"),v4=c(NA,"D",NA,NA))
> unique(unlist(raw))
[1] A B C D <NA>
Levels: A B C D
> count(unlist(raw))
x freq
1 A 3
2 B 2
3 C 2
4 D 3
5 <NA> 6

Order multiple rows in a data frame

I have a data frame like this
ID EPOCH
B 2
B 3
A 1
A 2
A 3
C 0
and what I would like to do is order it by each ID's first appearance (i.e. the minimum value of EPOCH for each ID), so that I get
ID EPOCH
C 0
A 1
A 2
A 3
B 2
B 3
I only managed to order the data frame by EPOCH and then by ID
df[order(df$EPOCH,df$ID),]
but then it is no longer grouped by ID, i.e.
C 0
A 1
A 2
B 2
A 3
B 3
Many thanks
First add a column with the minimum EPOCH for each ID to the data.frame:
data <- read.table(textConnection("ID EPOCH
B 2
B 3
A 1
A 2
A 3
C 0"), header=TRUE)
a <- aggregate(data$EPOCH, data["ID"], min)
names(a)[2] <- "min_EPOCH"
data <- merge(data, a)
Then sort on that new column:
o <- order(data$min_EPOCH, data$ID, data$EPOCH)
data[o, ]
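Equivalently, the group minimum can be computed inline with ave, which skips the merge step (a one-line sketch on the same data):
# primary sort key: each row's group-minimum EPOCH
data[order(ave(data$EPOCH, data$ID, FUN = min), data$ID, data$EPOCH), ]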

Multirow deletion: delete row depending on other row

I'm stuck with a quite complex problem. I have a data frame with three columns: id, info and row. The data looks like this:
id info row
1 a 1
1 b 2
1 c 3
2 a 4
3 b 5
3 a 6
4 b 7
4 c 8
What I want to do now is delete all other rows of an id if one of its rows contains the info value a. This means, for example, that rows 2 and 3 should be removed because row 1's info column contains the value a. Please note that the info values are not ordered (id 3 / rows 5 & 6) and cannot be ordered due to other data limitations.
I solved the case using a for loop:
# select all ids containing an "a" value
a_val <- data$id[grep("a", data$info)]
# check every id containing an "a" value
for (i in a_val) {
  temp_data <- data[which(data$id == i), ]
  # only go on if the given id contains more than one row
  if (nrow(temp_data) > 1) {
    for (ii in seq_len(nrow(temp_data))) {
      if (temp_data$info[ii] != "a") {
        temp <- temp_data$row[ii]
        if (!exists("delete_rows")) {
          delete_rows <- temp
        } else {
          delete_rows <- c(delete_rows, temp)
        }
      }
    }
  }
}
My solution works quite well. Nevertheless, it is very slow, as the original data contains more than 700k rows, and more than 150k rows have an "a" value.
I could use a foreach loop with 4 cores to speed it up, but maybe someone could give me a hint for a better solution.
Best regards,
Arne
[UPDATE]
The outcome should be:
id info row
1 a 1
2 a 4
3 a 6
4 b 7
4 c 8
Here is one possible solution.
First find ids where info contains "a":
ids <- with(data, unique(id[info == "a"]))
Subset the data:
subset(data, (id %in% ids & info == "a") | !id %in% ids)
Output:
id info row
1 1 a 1
4 2 a 4
6 3 a 6
7 4 b 7
8 4 c 8
An alternative solution (maybe harder to decipher):
subset(data, info == "a" | !rep.int(tapply(info, id, function(x) any(x == "a")),
table(id)))
Note: @BenBarnes pointed out that this solution only works if the data frame is ordered by id.
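A variant of the same idea that does not depend on the row order uses ave to flag, for every row, whether its id contains an "a" (a sketch on the same data):
# keep the "a" rows, plus all rows of ids that contain no "a" at all
subset(data, info == "a" | !ave(info == "a", id, FUN = any))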
You might want to investigate the data.table package:
EDIT: If the row variable is not a sequential numbering of each row in your data (as I assumed it was), you could create such a variable to obtain the original row order:
library(data.table)
# Create data.table of your data
dt <- as.data.table(data)
# Create index to maintain row order
dt[, idx := seq_len(nrow(dt))]
# Set a key on id and info
setkeyv(dt, c("id", "info"))
# Determine unique ids
uid <- dt[, unique(id)]
# subset your data to select rows with "a"
dt2 <- dt[J(uid, "a"), nomatch = 0]
# identify rows of dataset where the id doesn't have an "a"
dt3 <- dt[J(dt2[, setdiff(uid, id)])]
# rbind those two data.tables together
(dt4 <- rbind(dt2, dt3))
# id info row idx
# 1: 1 a 1 1
# 2: 2 a 4 4
# 3: 3 a 6 6
# 4: 4 b 7 7
# 5: 4 c 8 8
# And if you need the original ordering of rows,
dt5 <- dt4[order(idx)]
Note that setting a key for the data.table will order the rows according to the key columns. The last step (creating dt5) sets the row order back to the original.
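On recent data.table versions, a similar selection can be written without setting a key, which also preserves the original row order (a sketch, not part of the original answer):
library(data.table)
dt <- as.data.table(data)
# per id: indices of the "a" rows if any exist, otherwise all rows of the group
idx <- dt[, .I[if (any(info == "a")) info == "a" else rep(TRUE, .N)], by = id]$V1
dt[sort(idx)]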
Here is a way using ddply:
df <- read.table(text="id info row
1 a 1
1 b 2
1 c 3
2 a 4
3 b 5
3 a 6
4 b 7
4 c 8",header=TRUE)
library("plyr")
ddply(df, .(id), subset, rep(!'a' %in% info, length(info)) | info == 'a')
Returns:
id info row
1 1 a 1
2 2 a 4
3 3 a 6
4 4 b 7
5 4 c 8
If df is as below (re Sacha's answer above), you can use match, which just finds the index of the first occurrence:
df <- read.table(text="id info row
1 a 1
1 b 2
1 c 3
2 a 4
3 b 5
3 a 6
4 b 7
4 c 8",header=TRUE)
# the first info row matching 'a' and all other rows that are not 'a'
with(df, df[c(match('a',info), which(info != 'a')),])
id info row
1 1 a 1
2 1 b 2
3 1 c 3
5 3 b 5
7 4 b 7
8 4 c 8
Take a look at subset; it's quite easy to use and it will solve your problem.
You just need to specify the value of the column that you want to subset on; alternatively you can choose more columns.
http://stat.ethz.ch/R-manual/R-devel/library/base/html/subset.html
http://www.statmethods.net/management/subset.html
