Counting Records Based on Unique Date in R

I have a data frame with a column of dates (some dates have multiple records) and a numeric column. I want a data frame with one record per unique date, the sum of the numbers for each date, and the number of records for each date.
Starting frame:
SomeDate SomeNum
10/1/2013 2
10/1/2013 3
10/2/2013 5
10/3/2013 4
10/3/2013 1
10/3/2013 1
I can get the sum of SomeNum per unique date with the following:
newDF <- unique(within(df, {
  SumOfSomeNums <- ave(SomeNum, SomeDate, FUN = sum)
}))
But I can't figure out how to get a count of the number of times each unique SomeDate occurs.
I want:
SomeDate SumOfSomeNums CountOfSomeDate
10/1/2013 5 2
10/2/2013 5 1
10/3/2013 6 3
What would get me the CountOfSomeDate data?
Thanks!

Continuing with your approach, use length as your aggregation function:
unique(within(mydf, {
  SumOfSomeNums <- ave(SomeNum, SomeDate, FUN = sum)
  CountOfSomeDate <- ave(SomeDate, SomeDate, FUN = length)
  rm(SomeNum)
}))
# SomeDate CountOfSomeDate SumOfSomeNums
# 1 10/1/2013 2 5
# 3 10/2/2013 1 5
# 4 10/3/2013 3 6
However, there are many alternative ways to get here.
Here's an aggregate approach:
do.call(data.frame, aggregate(SomeNum ~ SomeDate, mydf, function(x) c(sum(x), length(x))))
# SomeDate SomeNum.1 SomeNum.2
# 1 10/1/2013 5 2
# 2 10/2/2013 5 1
# 3 10/3/2013 6 3
And a data.table approach:
library(data.table)
DT <- data.table(mydf)
DT[, list(Count = length(SomeNum), Sum = sum(SomeNum)), by = SomeDate]
# SomeDate Count Sum
# 1: 10/1/2013 2 5
# 2: 10/2/2013 1 5
# 3: 10/3/2013 3 6
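For completeness, the same summary can be sketched with dplyr (an editorial addition, not from the original answers; it assumes the dplyr package is installed):

```r
library(dplyr)

mydf <- data.frame(
  SomeDate = c("10/1/2013", "10/1/2013", "10/2/2013",
               "10/3/2013", "10/3/2013", "10/3/2013"),
  SomeNum  = c(2, 3, 5, 4, 1, 1)
)

mydf %>%
  group_by(SomeDate) %>%
  summarise(SumOfSomeNums   = sum(SomeNum),  # sum per date
            CountOfSomeDate = n())           # row count per date
```

Here n() counts the rows in each group, which is exactly the CountOfSomeDate column asked for.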


Merge the rows with least abundance

I would like to merge the rows whose totals fall below a specific threshold, like:
ID A B C
Apple 1 1 1
Banana 2 2 2
Cherry 3 3 3
Dates 4 4 4
For Apple, the total amount across A, B and C is 3, which is 10% of the overall total (3/30 × 100% = 10%).
I would like to merge the rows whose amount is 20% or lower of the total into an "Others" row, like:
ID A B C
Cherry 3 3 3
Dates 4 4 4
Others 3 3 3
May I know how to write a function to achieve this?
Any suggestion or idea is appreciated.
I'd do it like this:
## Your original data
df <- read.table(text="ID A B C
Apple 1 1 1
Banana 2 2 2
Cherry 3 3 3
Dates 4 4 4" ,stringsAsFactors = FALSE)
names(df) <- df[1,] ## use the first row as column names
df <- df[-1,] ## remove the header row
df[,-1] <- lapply(df[,-1], as.numeric) ## convert to numeric
rownames(df) <- df[,1] ## use the ID values as rownames
df <- df[,-1] ## remove the ID column
df$tots <- apply(df, 1, sum)
df$proportion <- df$tots/sum(df$tots)
df <- rbind(df[which(df$proportion > 0.2), ],
            Others = apply(df[which(df$proportion <= 0.2), ], 2, sum))
df <- subset(df, select = -c(tots, proportion))
The result:
> df
       A B C
Cherry 3 3 3
Dates  4 4 4
Others 3 3 3
One option is to create a logical index by dividing the rowSums of the numeric columns by the total sum and checking whether the result is less than or equal to 0.2. Then assign "Others" to the 'ID' of those rows (assuming the "ID" column is of character class), and aggregate the columns by 'ID' to get the sums.
i1 <- rowSums(df1[-1])/sum(as.matrix(df1[-1])) <= 0.2
df1$ID[i1] <- "Others"
aggregate(.~ ID, df1, sum)
# ID A B C
#1 Cherry 3 3 3
#2 Dates 4 4 4
#3 Others 3 3 3
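A dplyr version of the same idea (an editorial sketch, not from the thread; it assumes dplyr >= 1.0 for across() and that ID is, or is coerced to, character):

```r
library(dplyr)

df1 <- data.frame(ID = c("Apple", "Banana", "Cherry", "Dates"),
                  A = 1:4, B = 1:4, C = 1:4)

df1 %>%
  # relabel rows whose share of the grand total is 20% or lower
  mutate(ID = ifelse(rowSums(across(A:C)) / sum(across(A:C)) <= 0.2,
                     "Others", as.character(ID))) %>%
  group_by(ID) %>%
  summarise(across(A:C, sum), .groups = "drop")
```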

Obtaining descriptive statistics of observations with years of complete data in R

I have the following panel dataset
id year Value
1 1 50
2 1 55
2 2 40
3 1 48
3 2 54
3 3 24
4 2 24
4 3 57
4 4 30
I would like to obtain descriptive statistics of the number of years for which observations have information available. For example: one individual has only one year of information, one has two years, and two have three years of available information.
In base R, using table and its faster cousin tabulate:
table(tabulate(dat$id))
1 2 3
1 1 2
or
table(table(dat$id))
Convert to a data.frame:
data.frame(table(tabulate(dat$id)))
Var1 Freq
1 1 1
2 2 1
3 3 2
Another base R option, using ave to get each id's year count and split to collect the ids per count:
lapply(split(df$id, ave(df$year, df$id, FUN = length)), function(x) length(unique(x)))
#$`1`
#[1] 1
#$`2`
#[1] 1
#$`3`
#[1] 2
We can use data.table. Convert the 'data.frame' to 'data.table' (setDT(df1)); grouped by 'id', get the number of unique 'year' values; then, grouped by that column, get the number of rows (.N).
library(data.table)
setDT(df1)[, uniqueN(year), .(id)][, .N, V1]
# V1 N
#1: 1 1
#2: 2 1
#3: 3 2
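The same count-of-counts is also a two-step count in dplyr (an editorial sketch; it assumes the data frame is called dat, as in the table answer above):

```r
library(dplyr)

dat <- data.frame(id   = c(1, 2, 2, 3, 3, 3, 4, 4, 4),
                  year = c(1, 1, 2, 1, 2, 3, 2, 3, 4))

dat %>%
  count(id, name = "years") %>%        # years of data per individual
  count(years, name = "individuals")   # individuals per number of years
```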

Checking whether dates in one data frame fall within date ranges in another data frame using R

I am trying to determine if multiple dates in one data frame are within multiple date ranges from another data frame. The dates and date ranges should be compared within each ID. I'd then like to update the data from the first data frame with information from the second data frame. Both data frames can potentially have 0 to multiple records for each ID. For example, df1 might look like this:
UID1 ID Date
1 1 05/12/10
2 1 07/25/11
3 1 07/31/12
4 2 11/04/03
5 2 10/06/04
6 3 10/07/08
7 3 06/16/12
While df2 might look like this (note ID=2 has no records in df2):
UID2 ID StartDate EndDate
1 1 07/22/09 09/13/09
2 1 03/19/10 11/29/10
3 1 05/09/11 09/04/11
4 3 05/18/12 08/15/12
5 3 01/15/13 04/21/13
I would like to end up with a new df1 that looks like this:
UID1 ID Date UID2 InRange DaysSinceStart
1 1 05/12/10 2 TRUE 54
2 1 07/25/11 3 TRUE 77
3 1 07/31/12 NA FALSE NA
4 2 11/04/03 NA FALSE NA
5 2 10/06/04 NA FALSE NA
6 3 10/07/08 NA FALSE NA
7 3 06/16/12 4 TRUE 29
Suggestions?
A suggestion using data.table, with the explanation inline.
Data:
dt1 <- fread("
UID1 ID Date
1 1 05/12/10
2 1 07/25/11
3 1 07/31/12
4 2 11/04/03
5 2 10/06/04
6 3 10/07/08
7 3 06/16/12
")[, Date:=as.Date(Date, "%m/%d/%y")]
cols <- c("StartDate", "EndDate")
dt2 <- fread("
UID2 ID StartDate EndDate
1 1 07/22/09 09/13/09
2 1 03/19/10 11/29/10
3 1 05/09/11 09/04/11
4 3 05/18/12 08/15/12
5 3 01/15/13 04/21/13
")[, (cols) := lapply(.SD, function(x) as.Date(x, "%m/%d/%y")), .SDcols=cols]
The actual work starts here:
#left join dt1 with dt2
dt <- dt2[dt1, on="ID", allow.cartesian=TRUE]
#check date range, get unique row
res <- dt[, {
  if (!all(is.na(StartDate <= Date & Date <= EndDate)) &&
      any(StartDate <= Date & Date <= EndDate)) {
    # case where Date falls within a range
    chosen <- StartDate <= Date & Date <= EndDate
    list(UID2 = UID2[chosen], StartDate = StartDate[chosen])
  } else {
    list(UID2 = NA_integer_, StartDate = as.Date(NA))
  }
}, by = c("UID1", "ID", "Date")]
#count DaysSinceStart
res[, ':=' (InRange=!is.na(UID2),
DaysSinceStart=as.numeric(Date - StartDate))][,
StartDate:=NULL]
res
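With a reasonably recent data.table (non-equi joins, 1.9.8+), the range check can be expressed directly in the join instead of a cartesian join plus filter. This is an editorial sketch of the same result; the dt1/dt2 construction from above is recreated compactly here:

```r
library(data.table)

dt1 <- data.table(UID1 = 1:7, ID = c(1, 1, 1, 2, 2, 3, 3),
                  Date = as.Date(c("2010-05-12", "2011-07-25", "2012-07-31",
                                   "2003-11-04", "2004-10-06",
                                   "2008-10-07", "2012-06-16")))
dt2 <- data.table(UID2 = 1:5, ID = c(1, 1, 1, 3, 3),
                  StartDate = as.Date(c("2009-07-22", "2010-03-19", "2011-05-09",
                                        "2012-05-18", "2013-01-15")),
                  EndDate   = as.Date(c("2009-09-13", "2010-11-29", "2011-09-04",
                                        "2012-08-15", "2013-04-21")))

# non-equi join: for each dt1 row, take the first dt2 range (if any) containing Date;
# non-matching rows get NA via the default nomatch
dt1[, c("UID2", "DaysSinceStart") :=
      dt2[dt1, on = .(ID, StartDate <= Date, EndDate >= Date), mult = "first",
          .(x.UID2, as.numeric(i.Date - x.StartDate))]]
dt1[, InRange := !is.na(UID2)]
```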

Summing the number of times a value appears in either of 2 columns

I have a large data set - around 32mil rows. I have information on the telephone number, the origin of the call, and the destination.
For each telephone number, I want to count the number of times it appeared either as Origin or as Destination.
An example data table is as follows:
library(data.table)
dt <- data.table(Tel=seq(1,5,1), Origin=seq(1,5,1), Destination=seq(3,7,1))
Tel Origin Destination
1: 1 1 3
2: 2 2 4
3: 3 3 5
4: 4 4 6
5: 5 5 7
I have working code, but it takes too long for my data since it involves a for loop. How can I optimize it?
Here it is:
for (i in unique(dt$Tel)) {
  index <- (dt$Origin == i | dt$Destination == i)
  dt[dt$Tel == i, "N"] <- sum(index)
}
Result:
Tel Origin Destination N
1: 1 1 3 1
2: 2 2 4 1
3: 3 3 5 2
4: 4 4 6 2
5: 5 5 7 2
Here N indicates that Tel=1 appears once, Tel=2 appears once, and Tel=3, 4 and 5 each appear twice.
We can do a melt and match: stack Origin and Destination into a single value column, then use tabulate(match(...)) to count, for each Tel, how many stacked values equal it:
dt[, N := melt(dt, id.var = "Tel")[, tabulate(match(value, Tel))]]
Another option is to loop through columns 2 and 3, using %in% to check whether the values in 'Tel' are present; then, with Reduce and +, get the sum of the logical elements for each 'Tel' and assign (:=) the values to 'N'.
dt[, N := Reduce(`+`, lapply(.SD, function(x) Tel %in% x)), .SDcols = 2:3]
dt
# Tel Origin Destination N
#1: 1 1 3 1
#2: 2 2 4 1
#3: 3 3 5 2
#4: 4 4 6 2
#5: 5 5 7 2
A second method constructs a temporary data.table which is then joined to the original. This is longer and likely less efficient than @akrun's, but can be useful to see.
# get a temporary data.table with the combined Origin and Destination frequencies
temp <- setnames(data.table(table(unlist(dt[, .(Origin, Destination)], use.names=FALSE))),
                 c("Tel", "N"))
# convert the variables to integers (Tel holds the names of the table, and is thus character)
temp <- temp[, lapply(temp, as.integer)]
Now, join it to the original table:
dt <- temp[dt, on="Tel"]
dt
Tel N Origin Destination
1: 1 1 1 3
2: 2 1 2 4
3: 3 2 3 5
4: 4 2 4 6
5: 5 2 5 7
You can get the desired column order using setcolorder
setcolorder(dt, c("Tel", "Origin", "Destination", "N"))
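For reference, the same count can be written in dplyr by stacking the two columns and joining back (an editorial sketch; note it counts total appearances, so it matches the loop only when a number never equals both Origin and Destination of the same row):

```r
library(dplyr)

dt_df <- data.frame(Tel = 1:5, Origin = 1:5, Destination = 3:7)

# appearances of each number across both columns
counts <- data.frame(Tel = c(dt_df$Origin, dt_df$Destination)) %>%
  count(Tel, name = "N")

dt_df %>% left_join(counts, by = "Tel")
```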

R - compare rows consecutively in two data frames and return a value

I have the following two data frames:
df1 <- data.frame(month=c("1","1","1","1","2","2","2","3","3","3","3","3"),
temp=c("10","15","16","25","13","17","20","5","16","25","30","37"))
df2 <- data.frame(period=c("1","1","1","1","1","1","1","1","2","2","2","2","2","2","3","3","3","3","3","3","3","3","3","3","3","3"),
max_temp=c("9","13","16","18","30","37","38","39","10","15","16","25","30","32","8","10","12","14","16","18","19","25","28","30","35","40"),
group=c("1","1","1","2","2","2","3","3","3","3","4","4","5","5","5","5","5","6","6","6","7","7","7","7","8","8"))
I would like to:
Consecutively for each row, check if the value in the month column in df1 matches that in the period column of df2, i.e. df1$month == df2$period.
If step 1 is not TRUE, i.e. df1$month != df2$period, then repeat step 1 and compare the value in df1 with the value in the next row of df2, and so forth until df1$month == df2$period.
If df1$month == df2$period, check if the value in the temp column of df1 is less than or equal to that in the max_temp column of df2, i.e. df1$temp <= df2$max_temp.
If df1$temp <= df2$max_temp, return the value in that row of the group column in df2 and add it to df1, in a new column called "new_group".
If step 3 is not TRUE, i.e. df1$temp > df2$max_temp, then go back to step 1 and compare the same row in df1 with the next row in df2.
An example of the output data frame I'd like is:
df3 <- data.frame(month=c("1","1","1","1","2","2","2","3","3","3","3","3"),
temp=c("10","15","16","25","13","17","20","5","16","25","30","37"),
new_group=c("1","1","1","2","3","4","4","5","6","7","7","8"))
I've been playing around with the ifelse function and need some help or re-direction. Thanks!
I found the procedure for computing new_group hard to follow as stated. As I understand it, you're trying to create a variable called new_group in df1. For row i of df1, the new_group value is the group value of the first row in df2 that:
Is indexed i or higher
Has a period value matching df1$month[i]
Has a max_temp value no less than df1$temp[i]
I approached this by using sapply called on the row indices of df1:
fxn = function(idx) {
  # Potentially matching indices in df2
  pm = idx:nrow(df2)
  # Matching indices in df2
  m = pm[df2$period[pm] == df1$month[idx] &
         as.numeric(as.character(df1$temp[idx])) <=
           as.numeric(as.character(df2$max_temp[pm]))]
  # Return the group associated with the first matching index
  return(df2$group[m[1]])
}
df1$new_group = sapply(seq(nrow(df1)), fxn)
df1
# month temp new_group
# 1 1 10 1
# 2 1 15 1
# 3 1 16 1
# 4 1 25 2
# 5 2 13 3
# 6 2 17 4
# 7 2 20 4
# 8 3 5 5
# 9 3 16 6
# 10 3 25 7
# 11 3 30 7
# 12 3 37 8
A data.table alternative:
library(data.table)
dt1 <- data.table(df1, key="month")
dt2 <- data.table(df2, key="period")
## add a row index
dt1[, rn1 := seq(nrow(dt1))]
dt3 <-
unique(dt1[dt2, allow.cartesian=TRUE][, new_group := group[min(which(temp <= max_temp))], by="rn1"], by="rn1")
## Keep only the columns you want
dt3[, c("month", "temp", "max_temp", "new_group"), with=FALSE]
month temp max_temp new_group
1: 1 1 19 1
2: 1 3 19 1
3: 1 4 19 1
4: 1 7 19 1
5: 2 2 1 3
6: 2 5 1 3
7: 2 6 1 4
8: 3 10 18 5
9: 3 4 18 5
10: 3 7 18 5
11: 3 8 18 5
12: 3 9 18 5
