fill missing group data with 0s in a data.table [duplicate] - r

This question already has answers here:
add missing rows to a data table
(2 answers)
Closed 6 years ago.
This isn't a dupe of the linked question. That question deals with rows that already have NAs in them; my question deals with missing rows for which there should be a data point of 0.
Let's say I have this data.table
dt <- data.table(id = c(1,2,4,5,6,1,3,4,5,6),
                 varname = c(rep('banana',5), rep('apple',5)),
                 thedata = runif(10,1,10))
What's the best way to add, for each varname, the missing ids with a 0 for thedata?
At the moment I dcast with fill=0 and then melt again, but this doesn't seem very efficient.
melt(dcast.data.table(dt, id ~ varname, value.var = 'thedata', fill = 0),
     id.var = 'id', variable.factor = FALSE,
     variable.name = 'varname', value.name = 'thedata')
I also just thought of doing it this way, but it gets a little clunky to fill in the NAs at the end:
merge(dt[, CJ(id = unique(id), varname = unique(varname))], dt,
      by = c('varname', 'id'), all = TRUE
)[, .(varname, id, thedata = ifelse(!is.na(thedata), thedata, 0))]
In this example I only used one id column, but any suggestion should be extensible to more than one id column.
EDIT
I ran system.time on each approach with a largish data set: the melt/dcast approach took 2-3 seconds, while the merge/CJ approach took 12-13 seconds.
EDIT2
Roland's CJ approach is much better than mine, as it only took 4-5 seconds on my dataset.
Is there a better way to do this?

setkey(dt, varname, id)
dt[CJ(unique(varname), unique(id))]
# id varname thedata
# 1: 1 apple 9.083738
# 2: 2 apple NA
# 3: 3 apple 7.332652
# 4: 4 apple 3.610315
# 5: 5 apple 7.113414
# 6: 6 apple 9.046398
# 7: 1 banana 3.973751
# 8: 2 banana 9.907012
# 9: 3 banana NA
#10: 4 banana 9.308346
#11: 5 banana 1.572314
#12: 6 banana 7.753611
Then substitute NA with 0 if you must (usually not appropriate).
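If the zeros are genuinely needed, the join result can be updated by reference in one extra step. A minimal sketch (res is just a hypothetical name for the joined result):
res <- dt[CJ(unique(varname), unique(id))]
res[is.na(thedata), thedata := 0]  # replace the NAs introduced by the cross join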

Related

R: how to aggregate rows by count

This is my data frame:
ID <- c(1,2,3,4,5,6,7,8,9,10,11,12)
favFruit <- c('apple','lemon','pear',
              'apple','apple','pear',
              'apple','lemon','pear',
              'pear','pear','pear')
surveyDate <- c('1/1/2005','1/1/2005','1/1/2005',
                '2/1/2005','2/1/2005','2/1/2005',
                '3/1/2005','3/1/2005','3/1/2005',
                '4/1/2005','4/1/2005','4/1/2005')
df <- data.frame(ID, favFruit, surveyDate)
I need to aggregate it so I can plot a line graph in R of the count of favFruit by date, split by favFruit, but I am unable to create the aggregate table. My data has 45000 rows, so a manual solution is not possible. The desired output looks like:
surveyDate favFruit count
1/1/2005 apple 1
1/1/2005 lemon 1
1/1/2005 pear 1
2/1/2005 apple 2
2/1/2005 lemon 0
2/1/2005 pear 1
... etc
I tried this but R printed an error
df2 <- aggregate(df, favFruit, FUN = sum)
and I tried this, another error
df2 <- aggregate(df, date ~ favFruit, sum)
I checked for solutions online, but their data generally included a column of quantities, which I don't have, and the solutions were overly complex. Is there an easy way to do this? Thanks in advance. Thank you to whoever suggested the link as a possible duplicate, but it only has date and number of rows; my question needs the number of rows by date and favFruit (one more column).
Update:
Ronak Shah's solution worked. Thanks!
The solution provided by Ronak is very good, but in case you prefer to keep the zero counts in your dataframe, you could use the table function:
data.frame(with(df, table(favFruit, surveyDate)))
Output:
favFruit surveyDate Freq
1 apple 1/1/2005 1
2 lemon 1/1/2005 1
3 pear 1/1/2005 1
4 apple 2/1/2005 2
5 lemon 2/1/2005 0
6 pear 2/1/2005 1
7 apple 3/1/2005 1
8 lemon 3/1/2005 1
9 pear 3/1/2005 1
10 apple 4/1/2005 0
11 lemon 4/1/2005 0
12 pear 4/1/2005 3
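For reference, the same zero-filled counts can also be produced with dplyr and tidyr; this is only a sketch of that approach, not necessarily the accepted answer mentioned above:
library(dplyr)
library(tidyr)
df %>%
  count(surveyDate, favFruit) %>%                     # one row per observed date/fruit pair
  complete(surveyDate, favFruit, fill = list(n = 0))  # add the missing pairs with n = 0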

Having aggregated data - wanna have data for each element [duplicate]

This question already has answers here:
Repeat each row of data.frame the number of times specified in a column
(10 answers)
Closed 2 years ago.
Hi,
My aim is to draw a histogram. For that I need unaggregated data, but unfortunately I only have it in aggregated form.
My data:
dat <- tribble(~date, ~groupsize,
               "2020-09-01", 3,
               "2020-09-02", 2,
               "2020-09-03", 1,
               "2020-09-04", 2)
I want to have:
tribble(~date, ~n,
        "2020-09-01", 1,
        "2020-09-01", 1,
        "2020-09-01", 1,
        "2020-09-02", 1,
        "2020-09-02", 1,
        "2020-09-03", 1,
        "2020-09-04", 1,
        "2020-09-04", 1)
I think this is really simple, but I am at a loss. Sorry for that!
What can I do? I really like dplyr solutions :-)
Thank you!
Repeat the date according to groupsize:
res <- data.frame(date=rep(dat$date, dat$groupsize), n=1)
res
# date n
# 1 2020-09-01 1
# 2 2020-09-01 1
# 3 2020-09-01 1
# 4 2020-09-02 1
# 5 2020-09-02 1
# 6 2020-09-03 1
# 7 2020-09-04 1
# 8 2020-09-04 1
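Since the question asks for a dplyr-style solution: tidyr's uncount() does exactly this kind of row expansion. A minimal sketch, assuming the input tibble is stored as dat as above:
library(dplyr)
library(tidyr)
dat %>%
  uncount(groupsize) %>%  # repeat each row groupsize times; the weights column is dropped
  mutate(n = 1)           # add the constant count column from the expected output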

Complex merging in R with duplicate matching values in y set producing problems

So I'm trying to merge two dataframes. Dataframe x looks something like:
Name ParentID
Steve 1
Kevin 1
Stacy 1
Paula 4
Evan 7
Dataframe y looks like:
ParentID OtherStuff
1 things
2 stuff
3 item
4 ideas
5 short
6 help
7 me
The dataframe I want would look like:
Name ParentID OtherStuff
Steve 1 things
Kevin 1 things
Stacy 1 things
Paula 4 ideas
Evan 7 me
Using a left merge gives me substantially more observations than I want, with many duplicates. Any idea how to merge things, where y is duplicated where appropriate to match x?
I'm working with databases set up similarly to the example: x has 5013 observations, while y has 6432. Using the merge function as described by Joel and thelatemail gives me 1627727 observations.
We can use match from base R (where df1 is the question's x and df2 is its y):
df1$OtherStuff <- with(df1, df2$OtherStuff[match(ParentID, df2$ParentID)])
df1
# Name ParentID OtherStuff
#1 Steve 1 things
#2 Kevin 1 things
#3 Stacy 1 things
#4 Paula 4 ideas
#5 Evan 7 me
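If a merge is preferred, the row explosion described in the question usually comes from repeated ParentID values in y, so deduplicating y on the join key first avoids it. A sketch, under the assumption that repeated ParentID rows in y carry the same OtherStuff:
y_unique <- y[!duplicated(y$ParentID), ]                     # keep one row per ParentID
merged <- merge(x, y_unique, by = "ParentID", all.x = TRUE)  # now a clean left join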

Vectorize calculation across relational dataframes in R

Is it possible in R to vectorize a calculation on data in a dataframe, where one criterion for the calculation comes from an external dataframe? This can be done with a for loop, but it is slow.
The full task involves asking questions of 15 years of medical laboratory data in relational format. For example: what is the lowest haemoglobin level recorded for a patient in the three months following a surgical procedure? This draws on two tables: one with dates of surgery (~6,000 rows, often multiple per patient) and one of dated haemoglobin levels (~200,000 rows, multiple per patient). A loop as below takes ~30 minutes per query.
In this MWE data is in two tables and is linked by an index.
##create two dataframes
a<-c("ID1","ID2","ID3","ID2","ID1")
b<-c(1,2,3,4,5)
c<-as.Date(c("2005-01-01","2002-01-01","2003-01-01","2004-01-01","2001-01-01"))
df.1<-cbind.data.frame(a,b,c,stringsAsFactors=FALSE)
d<-c("ID1","ID2","ID1")
e<-as.Date(c("2002-02-01","2001-02-01","2000-01-01"))
df.2<-cbind.data.frame(d,e,stringsAsFactors=FALSE)
>df.1
a b c
1 ID1 1 2005-01-01
2 ID2 2 2002-01-01
3 ID3 3 2003-01-01
4 ID2 4 2004-01-01
5 ID1 5 2001-01-01
>df.2
d e
1 ID1 2002-02-01
2 ID2 2001-02-01
3 ID1 2000-01-01
out <- rep(NA, length(df.2$d))
for (i in 1:length(df.2$d)) {
  out[i] <- max(df.1$b[df.1$a == df.2$d[i] & df.1$c > df.2$e[i]])
}
> cbind(df.2,out)
d e out
1 ID1 2002-02-01 1
2 ID2 2001-02-01 4
3 ID1 2000-01-01 5
To answer your question, you can vectorize a calculation in R with Vectorize. However, I'm not sure what "slow" means here, and there are probably better ways to accomplish your task, but I would rather read a word problem than code.
##create two dataframes
a<-c("ID1","ID2","ID3","ID2","ID1")
b<-c(1,2,3,4,5)
c<-as.Date(c("2005-01-01","2002-01-01","2003-01-01","2004-01-01","2001-01-01"))
df.1<-cbind.data.frame(a,b,c,stringsAsFactors=FALSE)
d<-c("ID1","ID2","ID1")
e<-as.Date(c("2002-02-01","2001-02-01","2000-01-01"))
df.2<-cbind.data.frame(d,e,stringsAsFactors=FALSE)
f <- function(i)
## your code here
max(df.1$b[df.1$a==df.2$d[i] & df.1$c>df.2$e[i]])
vf <- Vectorize(f)
vf(1:3)
# [1] 1 4 5
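Note that Vectorize is essentially a wrapper around mapply, so it mainly tidies the code rather than speeding it up. For the scale described in the question, a non-equi join is usually far faster; a sketch using data.table (an assumption here, since the question works with plain data frames):
library(data.table)
setDT(df.1); setDT(df.2)
# for each row of df.2: max(b) over df.1 rows with a matching id and a later date
df.2[, out := df.1[df.2, on = .(a == d, c > e), max(b), by = .EACHI]$V1]
df.2
#      d          e out
# 1: ID1 2002-02-01   1
# 2: ID2 2001-02-01   4
# 3: ID1 2000-01-01   5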

First element in a data.table aggregation

I have a data.table of tick data, which I want to aggregate into seconds timeframe. While getting max, min and last is pretty straightforward:
data[, list(max(value), min(value), last(value)), by=time]
I am struggling to get the first datapoint which corresponds to a certain second timestamp. There is nothing in the manual. Is there an easy way to do it, like say, SQL TOP?
I managed to find the solution. The query to get the first element is to just subset that column's first value using [:
data[, list(value[1], max(value), min(value), last(value)), by = time]
Maybe it helps someone.
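For readable column names, each aggregate can be named inside .(); a small sketch extending the code above:
data[, .(first = value[1],
         max = max(value),
         min = min(value),
         last = last(value)),
     by = time]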
It seems that first is a valid aggregation.
foo <- data.table(x = 1:10, y = 11:20)
foo
x y
1: 1 11
2: 2 12
3: 3 13
4: 4 14
5: 5 15
6: 6 16
7: 7 17
8: 8 18
9: 9 19
10: 10 20
foo[, .(first(x), last(x))]
V1 V2
1: 1 10
