impute missing values using the minimum of a class in R

I'm new to R and need help with imputing missing values in one of the columns of a dataset I'm currently working on. The image below shows the missing value I want to impute along with a few of the columns.
I want to fill in the value with the minimum qty for that customer, using its previous entries, as I think this best fits my situation and data. For example, in the image I should be able to fill in the missing value with 1 (the min of 1, 5, 2).
During my search, I mainly came across methods that use mean for a given class, and not minimum or maximum.
Any help or pointers would really be appreciated.
Edit: Here is the output from dput.
structure(list(YEAR = c(2011L, 2012L, 2014L, 2015L, 2011L, 2012L
), CustomerId = c("00000063", "00000063", "00000063", "00000063",
"00000065", "00000065"), MemberType = structure(c(2L, 2L, 2L,
2L, 2L, 2L), .Label = c("GROUP", "INDIVIDUAL", "PARTNER"), class = "factor"),
MembershipTypeCode = structure(c(6L, 6L, 6L, 10L, 6L, 6L), .Label = c("EGROUP",
"EINDIV", "EINDIV2", "EPARTNER", "GROUP", "INDIV", "INDIV2",
"INDIV3", "PARTNER", "PLUS", "PLUS2", "PLUS20", "PLUS3",
"PLUSENTERPRI", "PLUSGROUP", "PLUSGROUP2", "PROF_ENTERPR",
"PROF_GROUP", "PROF_GROUP2", "PROF_INDIV", "PROF_INDIV2",
"PROF_INDIV3"), class = "factor"), MembershipPeriodBegin = structure(c(15279,
15677, 16071, 16436, 15006, 15371), class = "Date"), MembershipPeriodEnd = structure(c(15644,
16070, 16435, 16800, 15370, 15736), class = "Date"), ConsecutiveYearsAsMember = c(14L,
15L, 17L, 18L, 8L, 9L), AllocationUsage = c(0, 0, 0, 0, 0,
0), SetCOPPreference = structure(c(2L, 2L, 2L, 2L, 2L, 2L
), .Label = c("Y", "N"), class = "factor"), Purchase.Qty = c(2L,
5L, 1L, NA, 7L, 27L), Webcast.Registration = c(0L, 0L, 0L,
0L, 0L, 1L), Web.Visits = c(0L, 0L, 42L, 0L, 0L, 0L), Web.Page.Views = c(0L,
0L, 98L, 0L, 0L, 0L), Blog.Visits = c(0L, 0L, 3L, 0L, 0L,
0L), Blog.Page.Views = c(0L, 0L, 4L, 0L, 0L, 0L), Forum.Visits = c(0L,
0L, 45L, 0L, 0L, 0L), Forum.Page.Views = c(0L, 0L, 102L,
0L, 0L, 0L), ParatureTickets = c(0L, 0L, 0L, 0L, 0L, 0L),
ParatureChats = c(0L, 0L, 0L, 0L, 0L, 0L), Registered.for.Edu = c(0L,
0L, 0L, 0L, 0L, 0L), Attended.ICE = structure(c(2L, 2L, 2L,
2L, 2L, 2L), .Label = c("Y", "N"), class = "factor"), Attended.TK = structure(c(2L,
2L, 2L, 2L, 2L, 2L), .Label = c("Y", "N"), class = "factor"),
Frugal = structure(c(2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Y",
"N"), class = "factor"), Chapter.Board = structure(c(2L,
2L, 2L, 2L, 2L, 2L), .Label = c("Y", "N"), class = "factor"),
Retained = structure(c(5L, 5L, 5L, 1L, 5L, 5L), .Label = c("Active",
"Awaiting Renewal", "Future Dated", "Lost", "Retained"), class = "factor"),
ProfileCompletion = c(60, 60, 60, 60, 60, 60), NumberofLogins = c(1L,
1L, 15L, 0L, 0L, 4L), Downloads = c(NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_), ForumMember = structure(c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_
), .Label = "N", class = "factor"), FreeUpgrade = structure(c(1L,
1L, 1L, 1L, 1L, 1L), .Label = c("Y", "N"), class = "factor")), .Names = c("YEAR",
"CustomerId", "MemberType", "MembershipTypeCode", "MembershipPeriodBegin",
"MembershipPeriodEnd", "ConsecutiveYearsAsMember", "AllocationUsage",
"SetCOPPreference", "Purchase.Qty", "Webcast.Registration", "Web.Visits",
"Web.Page.Views", "Blog.Visits", "Blog.Page.Views", "Forum.Visits",
"Forum.Page.Views", "ParatureTickets", "ParatureChats", "Registered.for.Edu",
"Attended.ICE", "Attended.TK", "Frugal", "Chapter.Board", "Retained",
"ProfileCompletion", "NumberofLogins", "Downloads", "ForumMember",
"FreeUpgrade"), row.names = c(NA, 6L), class = "data.frame")
Thanks,
Pratik

We can use na.aggregate with FUN = min. We convert the 'data.frame' to a 'data.table' (setDT(df1)); grouped by 'CustomerID', we apply na.aggregate to 'PurchaseQty' and assign (:=) the output back to 'PurchaseQty'.
library(data.table)
library(zoo)
setDT(df1)[, PurchaseQty := na.aggregate(PurchaseQty, FUN = min), by = CustomerID]
Data:
df1 <- data.frame(CustomerID = rep(1:2, each = 4), PurchaseQty = c(4, 3, NA, 3, 1, 9, NA, 4))
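Applied to the data posted in the question (column names taken from the dput; assuming that dput output has been assigned to df), the same idea would look roughly like this:
library(data.table)
library(zoo)
# impute each customer's missing Purchase.Qty with that customer's own minimum
setDT(df)[, Purchase.Qty := na.aggregate(Purchase.Qty, FUN = min), by = CustomerId]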

Since you provide no data, here is a toy example of how I would do it in base R:
# simple sample data
data <- data.frame( a = rep( 10:12, each = 4 ), b = 12:1 )
data[ c( 3, 5, 12 ), 2 ] <- NA
# for each unique a value, get the row index with the min b value,
# and write that min value to col b where b is NA
for( i in unique( data$a ) )
  data[ which( is.na( data$b ) & data$a == i ), "b" ] <-
    min( data[ data$a == i, "b" ], na.rm = TRUE )
data
a b
1 10 12
2 10 11
3 10 9
4 10 9
5 11 5
6 11 7
7 11 6
8 11 5
9 12 4
10 12 3
11 12 2
12 12 2
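The same logic can also be written more compactly with ave(); a sketch on the same toy data (note that a group whose values are all NA would still end up as Inf, with a warning from min):
# replace each NA in b with the minimum of the non-missing b values in its a-group
data$b <- ave(data$b, data$a,
              FUN = function(x) replace(x, is.na(x), min(x, na.rm = TRUE)))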

Related

How to calculate the average value in one column for the 10 maximum values in another column?

I have a dataset and the task: "Average number of major credit cards held for people with top 10 income".
dput(head(creditcard))
structure(list(card = structure(c(2L, 2L, 2L, 2L, 2L, 2L), levels = c("no","yes"), class = "factor"), reports = c(0L, 0L, 0L, 0L, 0L, 0L), age = c(37.66667, 33.25, 33.66667, 30.5, 32.16667, 23.25), income = c(4.52, 2.42, 4.5, 2.54, 9.7867, 2.5), share = c(0.03326991, 0.005216942, 0.004155556, 0.06521378, 0.06705059, 0.0444384), expenditure = c(124.9833, 9.854167, 15, 137.8692, 546.5033, 91.99667), owner = structure(c(2L, 1L, 2L, 1L, 2L, 1L), levels = c("no", "yes"), class = "factor"), selfemp = structure(c(1L, 1L, 1L, 1L, 1L, 1L), levels = c("no", "yes"), class = "factor"),
dependents = c(3L, 3L, 4L, 0L, 2L, 0L), days = c(54L, 34L,58L, 25L, 64L, 54L), majorcards = c(1L, 1L, 1L, 1L, 1L, 1L), active = c(12L, 13L, 5L, 7L, 5L, 1L), income_fam = c(1.13, 0.605, 0.9, 2.54, 3.26223333333333, 2.5)), row.names = c("1","2", "3", "4", "5", "6"), class = "data.frame")
I tried to do the task like this:
round(mean(creditcard[order(creditcard$income, decreasing = TRUE), ]$majorcards[1:10]))
But my solution turned out to be suboptimal, and I do not understand how to correct it.
You can get the 10 observations with the highest income using slice_max, then create a new dataset with the mean of majorcards:
library(dplyr)
creditcard %>%
  slice_max(income, n = 10) %>%
  summarise(mean(majorcards))
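One caveat: slice_max keeps ties by default, so more than 10 rows can come back if several incomes are tied at the cutoff. If exactly 10 rows are wanted, with_ties = FALSE handles that; a small variation on the code above:
creditcard %>%
  slice_max(income, n = 10, with_ties = FALSE) %>%
  summarise(mean(majorcards))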
If your dataset is one row per person, then you can do this:
library(dplyr)
creditcard %>%
  arrange(desc(income)) %>%
  slice_head(n = 10) %>%
  summarize(mean_cards = mean(majorcards, na.rm = TRUE))
Maybe something like:
mean(creditcard$majorcards[which(creditcard$income %in% sort(creditcard$income, decreasing = TRUE)[1:10])])
Using base R
with(creditcard, mean(head(majorcards[order(-income)], 10)))
Or in data.table
library(data.table)
setDT(creditcard)[order(-income), mean(head(majorcards, 10))]

create a dataframe for multiple line plot for ggplot R

This question is about arranging data for a ggplot line plot. I have been doing this manually in Excel and I want to work out a way to do it in R.
I have reviewed this post which is similar
Arrange dataframe format for ggplot - R
I have a dataset that looks like this:
(screenshot of the data omitted; see the dput output under "Data:" below)
I want to convert it to a dataframe that is divided into the groups (N, A, G) and into age brackets, with the proportion per age group.
An example of what I am trying to achieve:
Appreciate your help.
Data:
structure(list(ID = 1:10, Age = c(9L, 16L, 12L, 13L, 29L, 24L,
23L, 24L, 16L, 40L), Sex = structure(c(1L, 1L, 2L, 1L, 1L, 2L,
2L, 1L, 1L, 1L), .Label = c("F", "M"), class = "factor"), Age_group =
c(1L,
2L, 2L, 2L, 3L, 3L, 3L, 3L, 2L, 4L), N = c(1L, 1L, 1L, 1L, 0L,
0L, 0L, 0L, 0L, 0L), A = c(0L, 0L, 0L, 0L, 1L, 1L, 1L, 0L, 0L,
0L), G = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L)), class = "data.frame",
row.names = c(NA,
-10L))
We can pivot to 'long' format with pivot_longer, then create a grouping variable with cut on 'Age' and get the sum of 'n' and the 'proportion':
library(dplyr)
library(tidyr)
df1 %>%
  pivot_longer(cols = N:G, names_to = 'group', values_to = 'n') %>%
  group_by(Age_group_new = cut(Age, breaks = c(-Inf, 0, seq(10, 70, by = 10), 100, Inf)), group) %>%
  summarise(n = sum(n)) %>%
  group_by(Age_group_new) %>%
  mutate(proportion = n/sum(n),
         proportion = replace(proportion, is.nan(proportion), 0))
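Since the goal is a ggplot line plot, the summarised result can be fed into ggplot directly; a minimal sketch, assuming the pipeline above has been assigned to df_long:
library(ggplot2)
ggplot(df_long, aes(x = Age_group_new, y = proportion, colour = group, group = group)) +
  geom_line() +
  geom_point()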

R program visualization + How to plot between categorical and numerical data

I want to develop a visualization with the date on the x axis and the A, B, C values on the y axis. A is categorical data and B, C are numerical data. On the x axis, the date should be shown as day and month, like 01/07, 02/07. The problem here is that A is categorical while B and C are numerical. The chart should not be a bar chart; something like scatter plots.
I don't know how to do this. Your help would be highly appreciated. Thanks.
dput(df)
structure(list(Date = structure(list(sec = c(0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0), min = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L), hour = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L), mday = 1:31, mon = c(6L, 6L, 6L,
6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L,
6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L), year = c(116L,
116L, 116L, 116L, 116L, 116L, 116L, 116L, 116L, 116L, 116L, 116L,
116L, 116L, 116L, 116L, 116L, 116L, 116L, 116L, 116L, 116L, 116L,
116L, 116L, 116L, 116L, 116L, 116L, 116L, 116L), wday = c(5L,
6L, 0L, 1L, 2L, 3L, 4L, 5L, 6L, 0L, 1L, 2L, 3L, 4L, 5L, 6L, 0L,
1L, 2L, 3L, 4L, 5L, 6L, 0L, 1L, 2L, 3L, 4L, 5L, 6L, 0L), yday = 182:212,
isdst = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L), zone = c("JST", "JST", "JST", "JST", "JST",
"JST", "JST", "JST", "JST", "JST", "JST", "JST", "JST", "JST",
"JST", "JST", "JST", "JST", "JST", "JST", "JST", "JST", "JST",
"JST", "JST", "JST", "JST", "JST", "JST", "JST", "JST"),
gmtoff = c(NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_)), .Names = c("sec", "min", "hour",
"mday", "mon", "year", "wday", "yday", "isdst", "zone", "gmtoff"
), class = c("POSIXlt", "POSIXt"), tzone = "Asia/Tokyo"), A =
structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("N",
"YES"), class = "factor"), B = c(0L, 0L, 0L, 0L, 1L, 3L, 5L,
10L, 9L, 8L, 7L, 6L, 5L, 4L, 3L, 2L, 1L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), C = c(0L, 0L, 0L, 0L, 1L,
3L, 5L, 10L, 9L, 8L, 7L, 6L, 5L, 4L, 3L, 2L, 1L, 5L, 10L, 9L,
8L, 7L, 6L, 5L, 4L, 3L, 2L, 1L, 10L, 9L, 8L)), .Names = c("Date",
"A", "B", "C"), row.names = c(NA, -31L), class = "data.frame")
sample data:
Date A B C
1 2016-07-01 N 0 0
2 2016-07-02 N 0 0
3 2016-07-03 N 0 0
4 2016-07-04 N 0 0
5 2016-07-05 N 1 1
6 2016-07-06 N 3 3
7 2016-07-07 N 5 5
8 2016-07-08 N 10 10
9 2016-07-09 N 9 9
10 2016-07-10 N 8 8
11 2016-07-11 N 7 7
12 2016-07-12 N 6 6
13 2016-07-13 N 5 5
14 2016-07-14 N 4 4
15 2016-07-15 N 3 3
16 2016-07-16 N 2 2
17 2016-07-17 N 1 1
18 2016-07-18 N 0 5
19 2016-07-19 N 0 10
20 2016-07-20 N 0 9
21 2016-07-21 N 0 8
22 2016-07-22 N 0 7
23 2016-07-23 N 0 6
24 2016-07-24 YES 0 5
25 2016-07-25 N 0 4
26 2016-07-26 N 0 3
27 2016-07-27 N 0 2
28 2016-07-28 N 0 1
29 2016-07-29 N 0 10
30 2016-07-30 N 0 9
31 2016-07-31 N 0 8
I tried this code. It works well for one value like B, but when I try to plot C along with B, the graph looks wrong.
library(scales)
library(ggplot2)
ggplot(df, aes(x = Date, y = B)) +
  geom_line(aes(y = B, group = 1), colour = "#000099") +
  geom_point(size = 2, colour = "#CC0000") +
  scale_y_continuous(breaks = seq(0, 50, by = 1)) +
  scale_x_datetime(date_breaks = "2 day")
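One way to keep everything in ggplot2 is to reshape B and C into long format and map the categorical A to a point shape; a rough sketch (the day/month labels follow the 01/07 style asked for; df_long is just a name chosen here):
library(tidyr)
library(ggplot2)
df$Date <- as.Date(df$Date)                      # POSIXlt -> Date
df_long <- pivot_longer(df, cols = c(B, C),
                        names_to = "series", values_to = "value")
ggplot(df_long, aes(x = Date, y = value, colour = series)) +
  geom_line() +
  geom_point(aes(shape = A), size = 2) +
  scale_x_date(date_breaks = "2 days", date_labels = "%d/%m")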
Here's one possible way to do it. I am not sure if this is ideal, but hopefully it helps in your case:
Using Markers
#library(devtools)
#install_github("ropensci/plotly")
library(plotly)
library(zoo)
df$Date <- as.yearmon(df$Date)
df$Date <- as.Date(df$Date)
plot_ly(df, x = ~Date) %>%
  add_markers(y = ~B, marker = list(size = 15, symbol = "cross"),
              name = "B") %>%
  add_markers(y = ~C, marker = list(size = 10, symbol = "circle"),
              name = "C") %>%
  add_markers(y = ~A, marker = list(size = 20, symbol = "diamond-open"),
              name = "A", yaxis = "y2") %>%
  layout(xaxis = list(domain = c(0, 0.9)),
         yaxis2 = list(side = "right", anchor = "xaxis", overlaying = "y", title = "A"),
         yaxis = list(title = "B/C"))
Using lines
#library(devtools)
#install_github("ropensci/plotly")
library(plotly)
df$x <- 1:nrow(df)
df$Date <- as.character(as.Date(df$Date))
plot_ly(df, x = ~x) %>%
  add_lines(y = ~B, line = list(width = 5), opacity = "0.5",
            name = "B") %>%
  add_lines(y = ~C, line = list(width = 5, dash = "5px"),
            name = "C") %>%
  add_markers(y = ~A, marker = list(size = 15, symbol = "cross"),
              name = "A", yaxis = "y2") %>%
  layout(xaxis = list(domain = c(0, 0.9), title = "",
                      tickmode = "array",
                      tickvals = ~x,
                      ticktext = ~Date,
                      tickfont = list(size = 10)),
         yaxis2 = list(side = "right", anchor = "xaxis", overlaying = "y", title = "A"),
         yaxis = list(title = "B/C"))

group by and then count unique observations [duplicate]

I have a data frame that looks like this:
date time id datetime
1 2015-01-02 14:27:22.130 999000000007628 2015-01-02 14:27:22
2 2015-01-02 14:41:27.720 989001002807730 2015-01-02 14:41:27
3 2015-01-02 14:41:27.940 989001002807730 2015-01-02 14:41:27
4 2015-01-02 14:41:28.140 989001002807730 2015-01-02 14:41:28
5 2015-01-02 14:41:28.170 989001002807730 2015-01-02 14:41:28
6 2015-01-02 14:41:28.350 989001002807730 2015-01-02 14:41:28
I need to find the number of unique "id"s for each "date" in that data frame.
I tried this:
sums <- data.frame(date = unique(data$date), numIDs = 0)
for(i in unique(data$date)){
  sums[sums$date == i, ]$numIDs <- length(unique(data[data$date == i, ]$id))
}
and I got the following error:
Error in `$<-.data.frame`(`*tmp*`, "numIDs", value = 0L) :
replacement has 1 row, data has 0
In addition: Warning message:
In `==.default`(data$date, i) :
longer object length is not a multiple of shorter object length
Any ideas?? Thank you!
Hopefully this helps!
data <- structure(list(date = structure(list(sec = c(0, 0, 0, 0, 0, 0,
0, 0, 0, 0), min = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L),
hour = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), mday = c(2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), mon = c(0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L), year = c(115L, 115L, 115L, 115L,
115L, 115L, 115L, 115L, 115L, 115L), wday = c(5L, 5L, 5L,
5L, 5L, 5L, 5L, 5L, 5L, 5L), yday = c(1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L), isdst = c(0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L), zone = c("PST", "PST", "PST", "PST", "PST",
"PST", "PST", "PST", "PST", "PST"), gmtoff = c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_)), .Names = c("sec",
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst",
"zone", "gmtoff"), class = c("POSIXlt", "POSIXt")), time = c("14:27:22.130",
"14:41:27.720", "14:41:27.940", "14:41:28.140", "14:41:28.170",
"14:41:28.350", "14:41:28.390", "14:41:28.520", "14:41:28.630",
"14:41:28.740"), id = c("999000000007628", "989001002807730",
"989001002807730", "989001002807730", "989001002807730", "989001002807730",
"989001002807730", "989001002807730", "989001002807730", "989001002807730"
), datetime = structure(list(sec = c(22.13, 27.72, 27.94, 28.14,
28.17, 28.35, 28.39, 28.52, 28.63, 28.74), min = c(27L, 41L,
41L, 41L, 41L, 41L, 41L, 41L, 41L, 41L), hour = c(14L, 14L, 14L,
14L, 14L, 14L, 14L, 14L, 14L, 14L), mday = c(2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L), mon = c(0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L), year = c(115L, 115L, 115L, 115L, 115L, 115L, 115L,
115L, 115L, 115L), wday = c(5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L,
5L), yday = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), isdst = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), zone = c("PST", "PST", "PST",
"PST", "PST", "PST", "PST", "PST", "PST", "PST"), gmtoff = c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_)), .Names = c("sec",
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst",
"zone", "gmtoff"), class = c("POSIXlt", "POSIXt")), site = c("Chivato",
"Chivato", "Chivato", "Chivato", "Chivato", "Chivato", "Chivato",
"Chivato", "Chivato", "Chivato")), .Names = c("date", "time",
"id", "datetime", "site"), row.names = c(NA, 10L), class = "data.frame")
You can use the uniqueN function from data.table:
library(data.table)
setDT(df)[, uniqueN(id), by = date]
or (as per the comment of @Richard Scriven):
aggregate(id ~ date, df, function(x) length(unique(x)))
Or we could use n_distinct from library(dplyr)
library(dplyr)
df %>%
  group_by(date) %>%
  summarise(id = n_distinct(id))
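One wrinkle with the posted dput (assigned to data above): the date column is stored as POSIXlt, which dplyr and data.table may refuse to handle; converting it to Date first is the safe route. On this 10-row sample the result should be a single row, 2015-01-02 with 2 unique ids. A sketch:
library(dplyr)
data$date <- as.Date(data$date)   # POSIXlt -> Date so it can be grouped
data %>%
  group_by(date) %>%
  summarise(numIDs = n_distinct(id))
# expected: one row with date = 2015-01-02 and numIDs = 2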
This answer is in response to this post: group by and then count unique observations, which was marked as a duplicate as I was writing this draft. It is not in response to the duplicate target here: How to find number of unique ids corresponding to each date in a data frame, which asks about finding unique IDs. I'm not sure the second post actually answers the OP's question, which is:
"I want to create a table with the number of unique id for each
combination of group1 and group2."
The keyword here is 'combination'. The interpretation is that each id has a particular value for group1 and a particular value for group2, so the data of interest is the particular set of values c(id, group1, group2).
Here is the data.frame the OP provided:
df1 <- data.frame(id = sample(letters, 10000, replace = T),
                  group1 = sample(1:2, 10000, replace = T),
                  group2 = sample(100:101, 10000, replace = T))
Using data.table inspired by this post -- https://stackoverflow.com/a/13017723/5220858:
>library(data.table)
>DT <- data.table(df1)
>DT[, .N, by = .(group1, group2)]
group1 group2 N
1: 1 100 2493
2: 1 101 2455
3: 2 100 2559
4: 2 101 2493
N is the count of ids that have a particular group1 value and a particular group2 value. Expanding the grouping to include id returns a table of 104 unique (id, group1, group2) combinations.
>DT[, .N, by = .(id, group1, group2)]
id group1 group2 N
1: t 1 100 107
2: g 1 101 85
3: l 1 101 98
4: a 1 100 83
5: j 1 101 98
---
100: p 1 101 96
101: r 2 101 91
102: y 1 101 104
103: g 1 100 83
104: r 2 100 77
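For completeness, a dplyr sketch of the same per-combination summary on the simulated df1 above (n() gives the row count per combination and n_distinct(id) the number of unique ids; .groups = "drop" assumes dplyr >= 1.0):
library(dplyr)
df1 %>%
  group_by(group1, group2) %>%
  summarise(N = n(), unique_ids = n_distinct(id), .groups = "drop")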


Resources