Getting a sample of a data.frame in R

I have the following data frame in R:
id<-c(1,2,3,4,10,2,4,5,6,8,2,1,5,7,7)
date<-c(19970807,19970902,19971010,19970715,19991212,19961212,19980909,19990910,19980707,19991111,19970203,19990302,19970605,19990808,19990706)
spent<-c(1997,19,199,134,654,37,876,890,873,234,643,567,23,25,576)
df<-data.frame(id,date,spent)
I need to take a random sample of 3 customers (based on id) such that all observations for the sampled customers are extracted.

You want to use %in% and unique
df[df$id %in% sample(unique(df$id),3),]
## id date spent
## 4 4 19970715 134
## 7 4 19980909 876
## 8 5 19990910 890
## 10 8 19991111 234
## 13 5 19970605 23
Using data.table to avoid $ referencing
library(data.table)
DT <- data.table(df)
DT[id %in% sample(unique(id),3)]
## id date spent
## 1: 1 19970807 1997
## 2: 4 19970715 134
## 3: 4 19980909 876
## 4: 1 19990302 567
## 5: 7 19990808 25
## 6: 7 19990706 576
This ensures that you are always evaluating the expressions within the data.table.
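The same idea carries over to dplyr, if you prefer (a sketch, assuming the dplyr package is available; inside filter() the bare name id refers to the column):
library(dplyr)
df %>% filter(id %in% sample(unique(id), 3))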

Use something like:
df[sample(nrow(df), 3), ]  # samples 3 random rows, not necessarily 3 customers
# id date spent
# 1 1 19970807 1997
# 5 10 19991212 654
# 8 5 19990910 890
Of course, your samples would be different.
Update
If you want unique customers, you can aggregate first.
df2 = aggregate(list(date = df$date, spent = df$spent), list(id = df$id), c)
df2[sample(nrow(df2), 3), ]
# id date spent
# 4 4 19970715, 19980909 134, 876
# 5 5 19990910, 19970605 890, 23
# 8 8 19991111 234
Or, an option without aggregate:
df[df$id %in% sample(unique(df$id), 3), ]
# id date spent
# 1 1 19970807 1997
# 3 3 19971010 199
# 12 1 19990302 567
# 14 7 19990808 25
# 15 7 19990706 576
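One caveat that applies to every approach above: sample() draws at random, so each run selects different customers. Wrap the call with set.seed() when you need a reproducible draw:
set.seed(1)  # any fixed seed gives the same sample on every run
df[df$id %in% sample(unique(df$id), 3), ]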

Related

How to replace the NAs in a data frame with the average value for the corresponding id

I have a data frame like this:
nums id
1233 1
3232 2
2334 3
3330 1
1445 3
3455 3
7632 2
NA 3
NA 1
I can get the average "nums" for each "id" using:
id_avg <- aggregate(nums ~ id, data = dat, FUN = mean)
I would like to replace each NA with the average for the corresponding id. For example, if the average "nums" for ids 1, 2, and 3 are 1000, 2000, and 3000 respectively, then the NA with id == 3 will be replaced by 3000, and the last NA, whose id == 1, will be replaced by 1000.
I tried the following code to achieve this:
temp <- dat[is.na(dat$nums),]$id
dat[is.na(dat$nums),]$nums <- id_avg[id_avg[,"id"] ==temp,]$nums
However, the second part
id_avg[id_avg[,"id"] ==temp,]$nums
is always NA, so I end up assigning NA to the very NAs I want to replace.
Where did I go wrong? Or is there a better method for this?
Thank you
Or you can fix it by indexing id_avg$nums by position. temp holds the ids of the NA rows, and because the ids here are 1, 2, 3, each id doubles as a row index into id_avg:
dat[is.na(dat$nums),]$nums <- id_avg$nums[temp]
nums id
1 1233.000 1
2 3232.000 2
3 2334.000 3
4 3330.000 1
5 1445.000 3
6 3455.000 3
7 7632.000 2
8 2411.333 3
9 2281.500 1
What you want is contained in the zoo package.
library(zoo)
na.aggregate(dat, by = dat$id)
nums id
1 1233.000 1
2 3232.000 2
3 2334.000 3
4 3330.000 1
5 1445.000 3
6 3455.000 3
7 7632.000 2
8 2411.333 3
9 2281.500 1
Here is a dplyr way:
library(dplyr)
dat %>%
  group_by(id) %>%
  mutate(nums = replace(nums, is.na(nums), as.integer(mean(nums, na.rm = TRUE))))
# Source: local data frame [9 x 2]
# Groups: id [3]
# nums id
# <int> <int>
# 1 1233 1
# 2 3232 2
# 3 2334 3
# 4 3330 1
# 5 1445 3
# 6 3455 3
# 7 7632 2
# 8 2411 3
# 9 2281 1
You essentially want to merge id_avg back into the original data frame by the id column, so you can also use match to follow your original logic:
dat$nums[is.na(dat$nums)] <- id_avg$nums[match(dat$id[is.na(dat$nums)], id_avg$id)]
dat
# nums id
# 1: 1233.000 1
# 2: 3232.000 2
# 3: 2334.000 3
# 4: 3330.000 1
# 5: 1445.000 3
# 6: 3455.000 3
# 7: 7632.000 2
# 8: 2411.333 3
# 9: 2281.500 1
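For completeness, a data.table sketch of the same group-mean replacement (assuming the data.table package; nums is converted to double first so the group means are not truncated):
library(data.table)
setDT(dat)                       # convert to a data.table by reference
dat[, nums := as.numeric(nums)]  # avoid integer truncation of the means
dat[, nums := replace(nums, is.na(nums), mean(nums, na.rm = TRUE)), by = id]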

Loops in custom R function to transform data

I use this code to create a sample dataframe of events:
set.seed(100)
mydf <- data.frame(time = (1:100),
                   status = sample(c('OK','UNKNOWN'), 1000, replace = TRUE),
                   event = sample(1:10, 1000, replace = TRUE))
The data looks like this:
head(mydf)
time status event
1 1 OK 1
2 2 OK 2
3 3 UNKNOWN 7
4 4 OK 7
5 5 OK 4
6 6 UNKNOWN 2
I would like to create a new dataset like this:
StartTime EndTime SeqID Sequence
1 1 3 1 {1,2,7}
2 4 6 2 {7,4,2}
Essentially I'd like to create a column named Sequence that is an array of the events, but I'd like to start over after the status column is equal to UNKNOWN. I've tried a for loop with a while loop, but no success.
Here's a data.table solution:
library(data.table);
dt <- as.data.table(mydf);
dt[,.(StartTime=time[1L],EndTime=time[length(time)],Sequence=list(event)),.(SeqID=cumsum(status=='UNKNOWN')+1L)];
## SeqID StartTime EndTime Sequence
## 1: 1 1 2 1,2
## 2: 2 3 6 7,7,4,2
## 3: 3 7 8 1,5
## 4: 4 9 10 6,10
## 5: 5 11 11 4
## ---
## 513: 513 90 92 7,3,5
## 514: 514 93 93 2
## 515: 515 94 95 8,10
## 516: 516 96 99 3,2,3,1
## 517: 517 100 100 7
I believe you've made a mistake with your expected output. If the sequence starts over every time the status column is equal to UNKNOWN, then the first array should be 1,2 rather than 1,2,7.
Update: If you want the sequence to start over in the row after the status column equalled UNKNOWN, then you can do this:
dt[,.(StartTime=time[1L],EndTime=time[length(time)],Sequence=list(event)),.(SeqID=c(0L,cumsum(status[-length(status)]=='UNKNOWN'))+1L)];
## SeqID StartTime EndTime Sequence
## 1: 1 1 3 1,2,7
## 2: 2 4 7 7,4,2,1
## 3: 3 8 9 5,6
## 4: 4 10 11 10, 4
## 5: 5 12 12 2
## ---
## 512: 512 89 90 2,7
## 513: 513 91 93 3,5,2
## 514: 514 94 94 8
## 515: 515 95 96 10, 3
## 516: 516 97 100 2,3,1,7
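To see why this version starts a new group on the row after each UNKNOWN, run the index computation on a small vector. Dropping the last element and prepending a zero shifts the counter down one row, so each group ends on its UNKNOWN row:
status <- c('OK','OK','UNKNOWN','OK','UNKNOWN','OK')
c(0L, cumsum(status[-length(status)] == 'UNKNOWN')) + 1L
## [1] 1 1 1 2 2 3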
Note that your expected output is still incorrect; the second group should be 7,4,2,1 rather than 7,4,2 under this design. Edit: Actually, I think perhaps the issue is with a discrepancy in mydf; I get this with your sample creation code:
head(mydf,10L);
## time status event
## 1 1 OK 1
## 2 2 OK 2
## 3 3 UNKNOWN 7
## 4 4 OK 7
## 5 5 OK 4
## 6 6 OK 2
## 7 7 UNKNOWN 1
## 8 8 OK 5
## 9 9 UNKNOWN 6
## 10 10 OK 10
Please try running your sample creation code again with the seed of 100. We should be getting the same result for mydf.
Here's a base R solution built around by():
with(list(SeqID=c(0L,cumsum(mydf$status[-nrow(mydf)]=='UNKNOWN'))+1L),
  do.call(rbind,by(cbind(mydf,SeqID),SeqID,function(x)
    data.frame(
      SeqID=x$SeqID[1L],
      StartTime=x$time[1L],
      EndTime=x$time[length(x$time)],
      Sequence=I(list(x$event))
    )
  ))
);
## SeqID StartTime EndTime Sequence
## 1 1 1 3 1, 2, 7
## 2 2 4 7 7, 4, 2, 1
## 3 3 8 9 5, 6
## 4 4 10 11 10, 4
## 5 5 12 12 2
##
## ... snip ...
##
## 512 512 89 90 2, 7
## 513 513 91 93 3, 5, 2
## 514 514 94 94 8
## 515 515 95 96 10, 3
## 516 516 97 100 2, 3, 1, 7
Benchmarking
library(data.table);
library(microbenchmark);
bgoldst1 <- function(dt) dt[,.(StartTime=time[1L],EndTime=time[length(time)],Sequence=list(event)),.(SeqID=c(0L,cumsum(status[-length(status)]=='UNKNOWN'))+1L)];
bgoldst2 <- function(mydf) with(list(SeqID=c(0L,cumsum(mydf$status[-nrow(mydf)]=='UNKNOWN'))+1L),do.call(rbind,by(cbind(mydf,SeqID),SeqID,function(x) data.frame(SeqID=x$SeqID[1L],StartTime=x$time[1L],EndTime=x$time[length(x$time)],Sequence=I(list(x$event))))));
lebatsnok <- function(mydf) { mydfs <- split(mydf, head(cumsum(c("", mydf$status) == "UNKNOWN"), -1)); res <- lapply(mydfs, function(x) data.frame(StartTime = x$time[1], EndTime = tail(x$time,1), SeqID = NA, Sequence = paste(x$event, collapse=","))); res <- do.call(rbind, res); res$SeqID <- seq_len(NROW(res)); res; };
set.seed(100L);
mydf <- data.frame(time=1:100,status=sample(c('OK','UNKNOWN'),1000L,T),event=sample(1:10,1000L,T),stringsAsFactors=F);
dt <- as.data.table(mydf);
ex <- as.data.frame(bgoldst1(dt)); o <- names(ex);
all.equal(ex,bgoldst2(mydf)[o],check.attributes=F);
## [1] TRUE
all.equal(transform(ex,Sequence=factor(sapply(Sequence,paste,collapse=','))),lebatsnok(mydf)[o],check.attributes=F);
## [1] TRUE
microbenchmark(bgoldst1(dt),bgoldst2(mydf),lebatsnok(mydf));
## Unit: milliseconds
## expr min lq mean median uq max neval
## bgoldst1(dt) 1.363785 1.671909 1.896345 1.839763 2.041828 3.900621 100
## bgoldst2(mydf) 217.960902 234.978058 244.491406 243.867674 251.392438 298.083774 100
## lebatsnok(mydf) 254.961413 273.434086 284.439844 283.864322 291.889867 337.319627 100
A base R solution (relies on stringsAsFactors being FALSE, so mydf is redefined):
set.seed(100)
mydf <- data.frame(time = (1:100),
                   status = sample(c('OK','UNKNOWN'), 1000, replace = TRUE),
                   event = sample(1:10, 1000, replace = TRUE),
                   stringsAsFactors = FALSE)
mydfs <- split(mydf, head(cumsum(c("", mydf$status) == "UNKNOWN"), -1))
res <- lapply(mydfs, function(x)
  data.frame(StartTime = x$time[1],
             EndTime = tail(x$time, 1),
             SeqID = NA,
             Sequence = paste(x$event, collapse = ",")))
res <- do.call(rbind, res)
res$SeqID <- seq_len(NROW(res))
head(res)
# StartTime EndTime SeqID Sequence
# 0 1 3 1 1,2,7
# 1 4 7 2 7,4,2,1
# 2 8 9 3 5,6
# 3 10 11 4 10,4
# 4 12 12 5 2
# 5 13 15 6 10,1,8
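For reference, the same grouping can be sketched in dplyr (assuming the dplyr package; lag() shifts the UNKNOWN flags down one row, so the counter increments on the row after each UNKNOWN):
library(dplyr)
mydf %>%
  mutate(SeqID = cumsum(lag(status == "UNKNOWN", default = FALSE)) + 1L) %>%
  group_by(SeqID) %>%
  summarise(StartTime = first(time),
            EndTime = last(time),
            Sequence = paste(event, collapse = ","))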

How to add a date to each row for a column in a data frame?

df <- data.frame(DAY = character(), ID = character())
I'm running a loop (for (i in DAYS)) that gets the IDs for each day and stores them in a data frame:
df <- rbind(df, data.frame(ID = IDs))
Within the loop, I want to add DAYS[i] as a second column across each of the rows just added.
How do I do that?
As #Pascal says, this isn't the best way to create a data frame in R. R is a vectorised language, so generally you don't need for loops.
I'm assuming each ID is unique, so you can create a vector of IDs from 1 to 10:
ID <- 1:10
Then you need a vector for your DAYs, which can be the same length as your IDs or shorter, in which case it is recycled (i.e. if you only have a few days that repeat in the same order, a smaller vector is reused). Use c() to create a vector with more than one value:
DAY <- c(1, 2, 9, 4, 4)
df <- data.frame(ID, DAY)
df
# ID DAY
# 1 1 1
# 2 2 2
# 3 3 9
# 4 4 4
# 5 5 4
# 6 6 1
# 7 7 2
# 8 8 9
# 9 9 4
# 10 10 4
Or with a vector for DAY that includes unique values:
DAY <- sample(1:100, 10, replace = TRUE)
df <- data.frame(ID, DAY)
df
# ID DAY
# 1 1 61
# 2 2 30
# 3 3 32
# 4 4 97
# 5 5 32
# 6 6 74
# 7 7 97
# 8 8 73
# 9 9 16
# 10 10 98
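That said, if the IDs really do arrive one day at a time inside a loop, data.frame() recycles a scalar across all rows created in that iteration, so tagging each batch with its day works (a sketch; get_ids_for_day() is a hypothetical stand-in for however you fetch a day's IDs):
df <- data.frame(DAY = character(), ID = character())
for (i in seq_along(DAYS)) {
  IDs <- get_ids_for_day(DAYS[i])                       # hypothetical helper
  df <- rbind(df, data.frame(DAY = DAYS[i], ID = IDs))  # DAYS[i] recycled across rows
}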

Using one data frame to sum a range of data from another data frame in R

I'm migrating from SAS to R. I need help figuring out how to sum up weather data for date ranges. In SAS, I take the date ranges, use a data step to create a record for every date (with startdate, enddate, date) in the range, merge with weather and then summarize (VAR hdd cdd; CLASS=startdate enddate sum=) to sum up the values for the date range.
R code:
startdate <- c(100,103,107)
enddate <- c(105,104,110)
billperiods <-data.frame(startdate,enddate);
to get:
> billperiods
startdate enddate
1 100 105
2 103 104
3 107 110
R code:
weatherdate <- c(100:103,105:110)
hdd <- c(0,0,4,5,0,0,3,1,9,0)
cdd <- c(4,1,0,0,5,6,0,0,0,10)
weather <- data.frame(weatherdate,hdd,cdd)
to get:
> weather
weatherdate hdd cdd
1 100 0 4
2 101 0 1
3 102 4 0
4 103 5 0
5 105 0 5
6 106 0 6
7 107 3 0
8 108 1 0
9 109 9 0
10 110 0 10
Note: weatherdate = 104 is missing. I may not have weather for a day.
I can't figure out how to get to:
> billweather
startdate enddate sumhdd sumcdd
1 100 105 9 10
2 103 104 5 0
3 107 110 13 10
where sumhdd is the sum of the hdd's from startdate to enddate in the weather data.frame.
Any ideas?
Here's a method using IRanges and data.table. For this question it may seem like overkill, but in general I find it convenient to use IRanges to deal with intervals, however simple they may be.
# load packages
require(IRanges)
require(data.table)
# convert data.frames to data.tables
dt1 <- data.table(billperiods)
dt2 <- data.table(weather)
# construct Ranges to get overlaps
ir1 <- IRanges(dt1$startdate, dt1$enddate)
ir2 <- IRanges(dt2$weatherdate, width=1) # start = end
# find Overlaps
olaps <- findOverlaps(ir1, ir2)
# Hits of length 10
# queryLength: 3
# subjectLength: 10
# queryHits subjectHits
# <integer> <integer>
# 1 1 1
# 2 1 2
# 3 1 3
# 4 1 4
# 5 1 5
# 6 2 4
# 7 3 7
# 8 3 8
# 9 3 9
# 10 3 10
# get billweather (final output)
billweather <- cbind(dt1[queryHits(olaps)],
                     dt2[subjectHits(olaps),
                         list(hdd, cdd)])[, list(sumhdd = sum(hdd),
                                                 sumcdd = sum(cdd)),
                                          by = list(startdate, enddate)]
# startdate enddate sumhdd sumcdd
# 1: 100 105 9 10
# 2: 103 104 5 0
# 3: 107 110 13 10
Code breakdown for the last line: first, using queryHits, subjectHits, and cbind, I construct an intermediate data.table; then I group by startdate and enddate and take the sum of hdd and the sum of cdd. The line is easier to understand when split in two, as shown below.
# split for easier understanding
billweather <- cbind(dt1[queryHits(olaps)],
                     dt2[subjectHits(olaps), list(hdd, cdd)])
billweather <- billweather[, list(sumhdd = sum(hdd),
                                  sumcdd = sum(cdd)),
                           by = list(startdate, enddate)]
A base R alternative, applying over the rows of billperiods and summing the matching weather rows:
cbind(billperiods, t(sapply(apply(billperiods, 1, function(x)
  weather[weather$weatherdate >= x[1] &
          weather$weatherdate <= x[2], c("hdd", "cdd")]), colSums)))
startdate enddate hdd cdd
1 100 105 9 10
2 103 104 5 0
3 107 110 13 10
Or, selecting each period's dates with %in%:
billweather <- cbind(billperiods,
  t(apply(billperiods, 1, function(x) {
    colSums(weather[weather[, 1] %in% c(x[1]:x[2]), 2:3])
  })))
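For completeness, newer versions of data.table offer foverlaps() for interval joins, which removes the need for IRanges here (a sketch; foverlaps() requires the y table to be keyed on its range columns):
library(data.table)
dt1 <- data.table(billperiods)
dt2 <- data.table(weather)
dt2[, c("start", "end") := .(weatherdate, weatherdate)]  # each day is a width-1 range
setkey(dt1, startdate, enddate)
foverlaps(dt2, dt1, by.x = c("start", "end"), nomatch = 0L)[
  , .(sumhdd = sum(hdd), sumcdd = sum(cdd)), by = .(startdate, enddate)]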

Remove rows based on factor-levels

I have a data.frame df in "long" format:
df <- data.frame(site = rep(c("A","B","C"), 1, 7),
                 time = c(11,11,11,22,22,22,33),
                 value = ceiling(rnorm(7)*10))
df <- df[order(df$site), ]
df
site time value
1 A 11 12
2 A 22 -24
3 A 33 -30
4 B 11 3
5 B 22 16
6 C 11 3
7 C 22 9
Question
How do I remove the rows whose df$time value is not present for every level of df$site?
In this case I want to remove df[3, ], because the timestamp 33 is present only for site A, not for sites B and C.
Desired output:
df.trimmed
site time value
1 A 11 12
2 A 22 -24
4 B 11 3
5 B 22 16
6 C 11 3
7 C 22 9
The data.frame has easily 800k rows and 200k unique timestamps. I don't want to use loops but I don't know how to use vectorized functions like apply() or lapply() for this case.
Here's another possible solution using the data.table package:
unTime <- unique(df$time)
library(data.table)
DT <- data.table(df, key = "site")
(notInAll <- unique(DT[, list(ans = which(!unTime %in% time)), by = key(DT)]$ans))
# [1] 3
DT[time %in% unTime[-notInAll]]
# site time value
# [1,] A 11 3
# [2,] A 22 11
# [3,] B 11 -6
# [4,] B 22 -2
# [5,] C 11 -19
# [6,] C 22 -14
EDIT from Matthew
Nice. Or a slightly more direct way:
DT = as.data.table(df)
tt = DT[,length(unique(site)),by=time]
tt
time V1
1: 11 3
2: 22 3
3: 33 1
tt = tt[V1==max(V1)] # See * below
tt
time V1
1: 11 3
2: 22 3
DT[time %in% tt$time]
site time value
1: A 11 7
2: A 22 -2
3: B 11 8
4: B 22 -10
5: C 11 3
6: C 22 1
In case no time is present in all sites, where the final result should be empty (as Ben pointed out in comments), the step marked * above could be:
tt = tt[V1==length(unique(DT$site))]
Would rle work for you?
df <- df[order(df$time), ]
df <- subset(df, time != rle(df$time)$value[rle(df$time)$lengths == 1])
df <- df[order(df$site), ]
df
## site time value
## 1 A 11 17
## 4 A 22 -3
## 2 B 11 8
## 5 B 22 5
## 3 C 11 0
## 6 C 22 13
Re-looking at your data, though, it seems that this solution might be too simple for your needs.
Update
Here's an approach that should be better than the rle solution I put above. Rather than looking for a run length of "1", it deletes rows that do not meet certain conditions based on the results of table(df$site, df$time). To illustrate, I've also added some more fake data.
df <- data.frame(site = rep(c("A","B","C"), 1, 7),
                 time = c(11,11,11,22,22,22,33),
                 value = ceiling(rnorm(7)*10))
df2 <- data.frame(site = rep(c("A","B","C"), 1, 7),
                  time = c(14,14,15,15,16,16,16),
                  value = ceiling(rnorm(7)*10))
df <- rbind(df, df2)
df <- df[order(df$site), ]
temp <- as.numeric(names(which(colSums(with(df, table(site, time)))
>= length(levels(df$site)))))
df2 <- merge(df, data.frame(temp), by.x = "time", by.y = "temp")
df2 <- df2[order(df2$site), ]
df2
## time site value
## 3 11 A -2
## 4 16 A -2
## 7 22 A 2
## 1 11 B -16
## 5 16 B 3
## 8 22 B -6
## 2 11 C 8
## 6 16 C 11
## 9 22 C -10
Here's the result of tabulating and summing up the site/time combination:
colSums(with(df, table(site, time)))
## 11 14 15 16 22 33
## 3 2 2 3 3 1
Thus, if we were interested in keeping timestamps shared by at least two sites, we could change >= length(levels(df$site)) (in this example, 3) to >= length(levels(df$site)) - 1 (here, 2).
Not sure if this solution is useful to you at all, but I thought I would share it to show the flexibility in solutions we have with R.
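As a final footnote, the same filtering can be sketched compactly in dplyr (assuming the dplyr package; n_distinct(df$site) is evaluated on the full frame, so the condition asks whether every site contributed that timestamp):
library(dplyr)
df %>%
  group_by(time) %>%
  filter(n_distinct(site) == n_distinct(df$site)) %>%
  ungroup()
Relaxing the threshold, e.g. n_distinct(site) >= 2, keeps times shared by at least two sites, mirroring the table()-based variant above.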
