While replicating a 2011 example script, the aggregate() function of base R produces NAs. Do I need to use a more recent version of aggregate() or a similar function? Please advise.
Example s1s2.df can be found here: https://www.dropbox.com/s/dsqina3vuy0774u/df.csv?dl=0
Code that produces NA instead of summarised values:
s1.no.present <- aggregate(s1s2.df$no.present[s1s2.df$sabap == -1],
                           by = list(s1s2.df$month.n[s1s2.df$sabap == -1]), sum)[, 2]
s1.no.cards <- aggregate(s1s2.df$no.cards[s1s2.df$sabap == -1],
                         by = list(s1s2.df$month.n[s1s2.df$sabap == -1]), sum)[, 2]
s2.no.present <- aggregate(s1s2.df$no.present[s1s2.df$sabap == 1],
                           by = list(s1s2.df$month.n[s1s2.df$sabap == 1]), sum)[, 2]
s2.no.cards <- aggregate(s1s2.df$no.cards[s1s2.df$sabap == 1],
                         by = list(s1s2.df$month.n[s1s2.df$sabap == 1]), sum)[, 2]
Incorrect output:
> tibble(s1.no.present)
# A tibble: 12 × 1
s1.no.present
<int>
1 NA
2 NA
3 NA
4 NA
5 NA
6 NA
7 NA
8 NA
9 NA
10 NA
11 NA
12 NA
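The cause here is not the version of aggregate(): sum() returns NA whenever its input contains an NA, and each monthly group in this data evidently does. A quick check (a sketch against the linked data):
anyNA(s1s2.df$no.present[s1s2.df$sabap == -1])
# [1] TRUE  (assumed, given that every group sums to NA)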
Use a custom sum function to remove NAs:
# data: local copy of the linked CSV
s1s2.df <- read.csv("tmp.csv")
mySum <- function(x) { sum(x, na.rm = TRUE) }
aggregate(s1s2.df$no.present[s1s2.df$sabap == -1],
          by = list(s1s2.df$month.n[s1s2.df$sabap == -1]),
          mySum)
# Group.1 x
# 1 1 218
# 2 2 369
# 3 3 590
# 4 4 1471
# 5 5 1880
# 6 6 2241
# 7 7 2306
# 8 8 1827
# 9 9 1377
# 10 10 774
# 11 11 281
# 12 12 280
Or use the formula interface, whose na.action argument defaults to na.omit, so the NAs are dropped before summing:
aggregate(formula = no.present ~ month.n,
          data = s1s2.df[s1s2.df$sabap == -1, ],
          FUN = sum)
# month.n no.present
# 1 1 218
# 2 2 369
# 3 3 590
# 4 4 1471
# 5 5 1880
# 6 6 2241
# 7 7 2306
# 8 8 1827
# 9 9 1377
# 10 10 774
# 11 11 281
# 12 12 280
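For completeness, aggregate() also forwards extra arguments to FUN through ..., so na.rm = TRUE can be passed directly without a wrapper; a minimal sketch using the same subset:
aggregate(s1s2.df$no.present[s1s2.df$sabap == -1],
          by = list(s1s2.df$month.n[s1s2.df$sabap == -1]),
          FUN = sum, na.rm = TRUE)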
I have 2 data frames
Data Frame A:
Time Reading
1 20
2 23
3 25
4 22
5 24
6 23
7 24
8 23
9 23
10 22
Data Frame B:
TimeStart TimeEnd Alarm
2 5 556
7 9 556
I would like to create the following joined dataframe:
Time Reading Alarmtime Alarm alarmno
   1      20       n/a   n/a     n/a
   2      23         2   556       1
   3      25             556       1
   4      22             556       1
   5      24         5   556       1
   6      23       n/a   n/a     n/a
   7      24         7   556       2
   8      23             556       2
   9      23         9   556       2
  10      22       n/a   n/a     n/a
I can do the join easily enough; however, I'm struggling with filling the following rows with the alarm until the time the alarm ended, and with numbering each individual alarm so that even if two alarms share the same code they are counted separately. Any thoughts on how I can do this would be great.
Thanks
One option is sqldf, with a range join on the alarm interval:
library(sqldf)
df_b$AlarmNo <- seq_len(nrow(df_b))
sqldf('
select a.Time
, a.Reading
, case when a.Time in (b.TimeStart, b.TimeEnd)
then a.Time
else NULL
end as AlarmTime
, b.Alarm
, b.AlarmNo
from df_a a
left join df_b b
on a.Time between b.TimeStart and b.TimeEnd
')
# Time Reading AlarmTime Alarm AlarmNo
# 1 1 20 NA NA NA
# 2 2 23 2 556 1
# 3 3 25 NA 556 1
# 4 4 22 NA 556 1
# 5 5 24 5 556 1
# 6 6 23 NA NA NA
# 7 7 24 7 556 2
# 8 8 23 NA 556 2
# 9 9 23 9 556 2
# 10 10 22 NA NA NA
Or
library(data.table)
setDT(df_b)
df_c <-
df_b[, .(Time = seq(TimeStart, TimeEnd), Alarm, AlarmNo = .GRP)
, by = TimeStart]
merge(df_a, df_c, by = 'Time', all.x = T)
# Time Reading TimeStart Alarm AlarmNo
# 1: 1 20 NA NA NA
# 2: 2 23 2 556 1
# 3: 3 25 2 556 1
# 4: 4 22 2 556 1
# 5: 5 24 2 556 1
# 6: 6 23 NA NA NA
# 7: 7 24 7 556 2
# 8: 8 23 7 556 2
# 9: 9 23 7 556 2
# 10: 10 22 NA NA NA
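As a side note, recent dplyr (1.1.0 or later, an assumption about your setup) can express the same range join directly with join_by():
library(dplyr)
df_b$AlarmNo <- seq_len(nrow(df_b))
# non-equi join: each Time matched to the interval [TimeStart, TimeEnd]
left_join(df_a, df_b, by = join_by(between(Time, TimeStart, TimeEnd)))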
Data used:
df_a <- fread('
Time Reading
1 20
2 23
3 25
4 22
5 24
6 23
7 24
8 23
9 23
10 22
')
df_b <- fread('
TimeStart TimeEnd Alarm
2 5 556
7 9 556
')
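If you also need the AlarmTime column from the desired output (populated only at the endpoints of each alarm interval), a small post-processing step on the merged result works; a sketch reusing df_b and df_c from above:
res <- merge(df_a, df_c, by = 'Time', all.x = TRUE)
# AlarmTime only where Time is an endpoint of a matched alarm interval
res$AlarmTime <- ifelse(!is.na(res$Alarm) & res$Time %in% c(df_b$TimeStart, df_b$TimeEnd),
                        res$Time, NA)
res$TimeStart <- NULL  # drop the helper column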
I have a data frame df, and a list L of indices at which I should put 0 instead of the current values of df.
Example:
DF:
# A tibble: 10 x 3
A B C
<dbl> <dbl> <dbl>
1724 4 2013
1758 4 2013
1612 3 2013
1692 3 2013
1260 33 2014
1157 22 2014
1359 63 2014
1414 27 2014
387 3 2016
374 3 2016
L:
[[1]]
[1] 3 4
[[2]]
[1] 1 2 3 4 5
[[3]]
[1] 1
So in this example, I have to put zeros in rows 3, 4 of column A, in rows 1:5 in column B and row 1 in column C.
Is there a way to do it as a one-liner in R? A dplyr or base R solution would be great! Also, I would like to avoid apply or loops, since I have to do this very efficiently.
A loop looks very fast to me. I haven't done the complexity comparison, but if you have your replacements in list form and want to replace them with val, simply:
df
a b c
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
6 6 6 6
7 7 7 7
8 8 8 8
9 9 9 9
10 10 10 10
val <- 0
for (i in seq_along(L)) {
  df[L[[i]], i] <- val
}
df
a b c
1 1 0 0
2 2 0 2
3 0 0 3
4 0 0 4
5 5 0 5
6 6 6 6
7 7 7 7
8 8 8 8
9 9 9 9
10 10 10 10
I tested it on x, a 10,000-row and 10,000-column df:
> b<-Sys.time()
> for(i in 1:length(L)){
+ x[L[[i]],i]<-0
+ }
> Sys.time()-b
Time difference of 0.490464 secs
Looks pretty quick :) I know it's obvious but hope it helps!
******** EDIT 1 ********
If we look at the method by @mt1022 using unlist and cbind:
> b<-Sys.time()
> Lcol <- rep(seq_along(L), lengths(L))
> x[cbind(unlist(L), Lcol)] <- 0
> Sys.time()-b
Time difference of 7.467723 secs
Clearly much slower (because when we unlist, we essentially loop through each and every element in L instead of each vector in L). ;)
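The gap is specific to data frames, though: [<-.data.frame is expensive with a matrix index, while the same indexing on an actual matrix is vectorised and very fast. A sketch, assuming x holds columns of a single type:
m <- as.matrix(x)  # one-time conversion; only sensible if all columns share a type
m[cbind(unlist(L), rep(seq_along(L), lengths(L)))] <- 0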
Another way using matrix of indices:
# DF <- read.table(textConnection('A B C
# 1724 4 2013
# 1758 4 2013
# 1612 3 2013
# 1692 3 2013
# 1260 33 2014
# 1157 22 2014
# 1359 63 2014
# 1414 27 2014
# 387 3 2016
# 374 3 2016'), header = T)
#
# L <- list(c(3, 4), c(1, 2, 3, 4, 5), c(1))
Lcol <- rep(seq_along(L), lengths(L))
DF[cbind(unlist(L), Lcol)] <- 0
# > DF
# A B C
# 1 1724 0 0
# 2 1758 0 2013
# 3 0 0 2013
# 4 0 0 2013
# 5 1260 0 2014
# 6 1157 22 2014
# 7 1359 63 2014
# 8 1414 27 2014
# 9 387 3 2016
# 10 374 3 2016
Another option is to use mapply in combination with do.call.
do.call(cbind, mapply(function(x, y) {
  df[x, y] <- 0
  df[y]
}, mylist, seq_along(mylist)))
# A B C
# [1,] 1724 0 0
# [2,] 1758 0 2013
# [3,] 0 0 2013
# [4,] 0 0 2013
# [5,] 1260 0 2014
# [6,] 1157 22 2014
# [7,] 1359 63 2014
# [8,] 1414 27 2014
# [9,] 387 3 2016
# [10,] 374 3 2016
Data:
df <- read.table(text =
"A B C
1724 4 2013
1758 4 2013
1612 3 2013
1692 3 2013
1260 33 2014
1157 22 2014
1359 63 2014
1414 27 2014
387 3 2016
374 3 2016", header = TRUE)
mylist <- list(c(3, 4), c(1, 2, 3, 4, 5), c(1))
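Since the question asked for a one-liner, base R's Map() with replace() pairs each column with its index vector; a sketch using the df and mylist objects above (assuming one index vector per column, in column order):
df[] <- Map(replace, df, mylist, 0)  # replace(column, indices, 0), column by column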
I am looking to separate rows of data by Cue and add a row which calculates averages per subject. Here is an example:
Before:
Cue ITI a b c
1 0 16 0.82062 0.52185 0.27679
2 0 24 0.53894 0.49957 0.35767
3 4 22 0.26855 0.17487 0.22461
4 4 20 0.15106 0.48767 0.49072
5 7 18 0.11627 0.12604 0.2832
6 7 24 0.50201 0.14252 0.21454
7 12 16 0.27649 0.96008 0.42114
8 12 18 0.60852 0.21637 0.18799
9 22 20 0.32867 0.65308 0.29388
10 22 24 0.25726 0.37048 0.32379
After:
Cue ITI a b c
1 0 16 0.82062 0.52185 0.27679
2 0 24 0.53894 0.49957 0.35767
3 0.67978 0.51071 0.31723
4 4 22 0.26855 0.17487 0.22461
5 4 20 0.15106 0.48767 0.49072
6 0.209 0.331 0.357
7 7 18 0.11627 0.12604 0.2832
8 7 24 0.50201 0.14252 0.21454
9 0.309 0.134 0.248
10 12 16 0.27649 0.96008 0.42114
11 12 18 0.60852 0.21637 0.18799
12 0.442 0.588 0.304
13 22 20 0.32867 0.65308 0.29388
14 22 24 0.25726 0.37048 0.32379
15 0.292 0.511 0.308
So in the "after" example, line 3 is the average of lines 1 and 2 (line 6 is the average of lines 4 and 5, etc...).
Any help/information would be greatly appreciated!
Thank you!
You can use base R to do something like:
Reduce(rbind, by(data, data[1], function(x) rbind(x, c(NA, NA, colMeans(x[-(1:2)])))))
Cue ITI a b c
1 0 16 0.820620 0.521850 0.276790
2 0 24 0.538940 0.499570 0.357670
3 NA NA 0.679780 0.510710 0.317230
32 4 22 0.268550 0.174870 0.224610
4 4 20 0.151060 0.487670 0.490720
31 NA NA 0.209805 0.331270 0.357665
5 7 18 0.116270 0.126040 0.283200
6 7 24 0.502010 0.142520 0.214540
33 NA NA 0.309140 0.134280 0.248870
7 12 16 0.276490 0.960080 0.421140
8 12 18 0.608520 0.216370 0.187990
34 NA NA 0.442505 0.588225 0.304565
9 22 20 0.328670 0.653080 0.293880
10 22 24 0.257260 0.370480 0.323790
35 NA NA 0.292965 0.511780 0.308835
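The odd row names (32, 31, ...) are an artifact of rbind; if they bother you, they can be reset on the stored result, e.g.:
res <- Reduce(rbind, by(data, data[1], function(x) rbind(x, c(NA, NA, colMeans(x[-(1:2)])))))
rownames(res) <- NULL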
Here is one idea. Split the data frame, perform the analysis on each piece, and then combine the pieces back together.
DF_list <- split(DF, f = DF$Cue)
DF_list2 <- lapply(DF_list, function(x){
df_temp <- as.data.frame(t(colMeans(x[, -c(1, 2)])))
df_temp[, c("Cue", "ITI")] <- NA
df <- rbind(x, df_temp)
return(df)
})
DF2 <- do.call(rbind, DF_list2)
rownames(DF2) <- 1:nrow(DF2)
DF2
# Cue ITI a b c
# 1 0 16 0.820620 0.521850 0.276790
# 2 0 24 0.538940 0.499570 0.357670
# 3 NA NA 0.679780 0.510710 0.317230
# 4 4 22 0.268550 0.174870 0.224610
# 5 4 20 0.151060 0.487670 0.490720
# 6 NA NA 0.209805 0.331270 0.357665
# 7 7 18 0.116270 0.126040 0.283200
# 8 7 24 0.502010 0.142520 0.214540
# 9 NA NA 0.309140 0.134280 0.248870
# 10 12 16 0.276490 0.960080 0.421140
# 11 12 18 0.608520 0.216370 0.187990
# 12 NA NA 0.442505 0.588225 0.304565
# 13 22 20 0.328670 0.653080 0.293880
# 14 22 24 0.257260 0.370480 0.323790
# 15 NA NA 0.292965 0.511780 0.308835
DATA
DF <- read.table(text = " Cue ITI a b c
1 0 16 0.82062 0.52185 0.27679
2 0 24 0.53894 0.49957 0.35767
3 4 22 0.26855 0.17487 0.22461
4 4 20 0.15106 0.48767 0.49072
5 7 18 0.11627 0.12604 0.2832
6 7 24 0.50201 0.14252 0.21454
7 12 16 0.27649 0.96008 0.42114
8 12 18 0.60852 0.21637 0.18799
9 22 20 0.32867 0.65308 0.29388
10 22 24 0.25726 0.37048 0.32379", header = TRUE)
A data.table approach, but if someone can offer some improvements I'd be keen to hear them.
library(data.table)
dt <- data.table(df)
dt2 <- dt[, lapply(.SD, mean), by = Cue][,ITI := NA][]
data.table(rbind(dt, dt2))[order(Cue)][is.na(ITI), Cue := NA][]
Cue ITI a b c
1: 0 16 0.820620 0.521850 0.276790
2: 0 24 0.538940 0.499570 0.357670
3: NA NA 0.679780 0.510710 0.317230
4: 4 22 0.268550 0.174870 0.224610
5: 4 20 0.151060 0.487670 0.490720
6: NA NA 0.209805 0.331270 0.357665
If you want to leave the Cue values as-is to confirm group, just drop the [is.na(ITI), Cue := NA] from the last line.
I would use group_by and summarise from the dplyr package to get a data frame with the average values, then rbind the new data frame with the old one and sort by Cue:
library(dplyr)
df_averages <- df_orig %>%
  group_by(Cue) %>%
  summarise(ITI = NA, a = mean(a), b = mean(b), c = mean(c)) %>%
  ungroup()
df_all <- rbind(df_orig, df_averages) %>% arrange(Cue)
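If a newer dplyr (1.0 or later, an assumption) is available, across() avoids listing each column by hand; a sketch:
df_averages <- df_orig %>%
  group_by(Cue) %>%
  summarise(ITI = NA, across(c(a, b, c), mean)) %>%
  ungroup()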
These are examples of two dataframes I am working on. 'Claims' has fewer rows than 'lastaction'.
My attempts give the following errors.
newtable <- merge(claims, lastaction, by = "X", all = TRUE)
Error in `[<-.data.frame`(`*tmp*`, value, value = NA) :
  new columns would leave holes after existing columns
newtable <- merge(claims, lastaction, by.x = claims$X, by.y = lastaction$X, all = TRUE)
Error in fix.by(by.x, x) : 'by' must match numbers of columns
The merge function works fine for me. As both data frames have the same column name X, it can be used to merge via by:
claims = data.frame(X = c(10,24,30,35,64,104),
TransactionDateTime = c('JUL-15','APR-17','SEP-15','JUL-15','APR-16','SEP-15'))
claims
# X TransactionDateTime
# 1 10 JUL-15
# 2 24 APR-17
# 3 30 SEP-15
# 4 35 JUL-15
# 5 64 APR-16
# 6 104 SEP-15
# note the duplicated lastvalue argument: data.frame() keeps both columns
# and renames the second to lastvalue.1 (check.names)
lastaction = data.frame(X = c(10,24,30,35,40,57),
                        lastvalue = c(6,1,4,6,6,1),
                        Approvalmonth = c('15-OCT','17-JAN','16-MAR','15-OCT','15-SEP','17-JUN'),
                        lastvalue = c(0,1,0,0,0,1))
lastaction
# X lastvalue Approvalmonth lastvalue.1
# 1 10 6 15-OCT 0
# 2 24 1 17-JAN 1
# 3 30 4 16-MAR 0
# 4 35 6 15-OCT 0
# 5 40 6 15-SEP 0
# 6 57 1 17-JUN 1
merge(claims, lastaction, by = "X", all = TRUE)
# X TransactionDateTime lastvalue Approvalmonth lastvalue.1
# 1 10 JUL-15 6 15-OCT 0
# 2 24 APR-17 1 17-JAN 1
# 3 30 SEP-15 4 16-MAR 0
# 4 35 JUL-15 6 15-OCT 0
# 5 40 <NA> 6 15-SEP 0
# 6 57 <NA> 1 17-JUN 1
# 7 64 APR-16 NA <NA> NA
# 8 104 SEP-15 NA <NA> NA
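On the second error: by.x and by.y expect column names, not the column vectors themselves. If the key columns had different names (hypothetical names below), the call would be:
# e.g. if the key were named claim_id in claims but X in lastaction
merge(claims, lastaction, by.x = "claim_id", by.y = "X", all = TRUE)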
dplyr's full_join works as well:
dplyr::full_join(claims, lastaction, by = 'X')
X TransactionDateTime lastvalue Approvalmonth lastvalue.1
1 10 JUL-15 6 15-OCT 0
2 24 APR-17 1 17-JAN 1
3 30 SEP-15 4 16-MAR 0
4 35 JUL-15 6 15-OCT 0
5 64 APR-16 NA <NA> NA
6 104 SEP-15 NA <NA> NA
7 40 <NA> 6 15-SEP 0
8 57 <NA> 1 17-JUN 1
I am trying to shorten a chunk of code to make it faster and easier to modify. This is a short example of my data.
order obs year var1 var2 var3
1 3 1 1 32 588 NA
2 4 1 2 33 689 2385
3 5 1 3 NA 678 2369
4 33 3 1 10 214 1274
5 34 3 2 10 237 1345
6 35 3 3 10 242 1393
7 78 6 1 5 62 NA
8 79 6 2 5 75 296
9 80 6 3 5 76 500
10 93 7 1 NA NA NA
11 94 7 2 4 86 247
12 95 7 3 3 54 207
Basically, what I want is R to find every possible and unique combination of two values (observations) in column "obs", within the same year, to create a new matrix or DF whose observations are the aggregation of the originals. Order is not important, so 1+6 = 6+1. For instance, with 150 observations I would expect choose(150, 2) = 11,175 feasible combinations each year.
I sort of got what I want with basic coding but, as you will see, it is way too long (I have built 66 different new data sets this way, so it does not really make sense) and I am wondering how to shorten it. I did some trials (plyr, ...) with no real success. Here is what I did:
# For the 1st year, groups of 2 obs
newmatrix <- data.frame(t(combn(unique(data$obs[data$year==1]), 2)))
colnames(newmatrix) <- c("obs1", "obs2")
newmatrix$name <- do.call(paste, c(newmatrix[c("obs1", "obs2")], sep = "_"))
# and the aggregation of var. using indexes, which I will skip here to save your time :)
To illustrate, here is the result, considering the above sample, of what I would get for the 1st year. NA appears because I only computed those pairs where both values were valid, and only for variables 1 and 3. Moreover, I used the sum, but it could be any other function:
order obs1 obs2 name var1 var3
1 1 1 3 1_3 42 NA
2 2 1 6 1_6 37 NA
3 3 1 7 1_7 NA NA
4 4 3 6 3_6 15 NA
5 5 3 7 3_7 NA NA
6 6 6 7 6_7 NA NA
As for the 2 first lines in the 3rd year, same type of matrix:
order obs1 obs2 name var1 var3
1 1 1 3 1_3 NA 3762
2 2 1 6 1_6 NA 2868
.......... etc ............
I hope I explained myself. Thank you in advance for your hints on how to do this more efficiently.
I would use split-apply-combine to split by year, find all the combinations, and then combine back together:
do.call(rbind, lapply(split(data, data$year), function(x) {
p <- combn(nrow(x), 2)
data.frame(order=paste(x$order[p[1,]], x$order[p[2,]], sep="_"),
obs1=x$obs[p[1,]],
obs2=x$obs[p[2,]],
year=x$year[1],
var1=x$var1[p[1,]] + x$var1[p[2,]],
var2=x$var2[p[1,]] + x$var2[p[2,]],
var3=x$var3[p[1,]] + x$var3[p[2,]])
}))
# order obs1 obs2 year var1 var2 var3
# 1.1 3_33 1 3 1 42 802 NA
# 1.2 3_78 1 6 1 37 650 NA
# 1.3 3_93 1 7 1 NA NA NA
# 1.4 33_78 3 6 1 15 276 NA
# 1.5 33_93 3 7 1 NA NA NA
# 1.6 78_93 6 7 1 NA NA NA
# 2.1 4_34 1 3 2 43 926 3730
# 2.2 4_79 1 6 2 38 764 2681
# 2.3 4_94 1 7 2 37 775 2632
# 2.4 34_79 3 6 2 15 312 1641
# 2.5 34_94 3 7 2 14 323 1592
# 2.6 79_94 6 7 2 9 161 543
# 3.1 5_35 1 3 3 NA 920 3762
# 3.2 5_80 1 6 3 NA 754 2869
# 3.3 5_95 1 7 3 NA 732 2576
# 3.4 35_80 3 6 3 15 318 1893
# 3.5 35_95 3 7 3 13 296 1600
# 3.6 80_95 6 7 3 8 130 707
This enables you to be very flexible in how you combine the data for pairs of observations within a year: x[p[1,],] represents the year-specific data for the first element in each pair and x[p[2,],] the year-specific data for the second. You can return a year-specific data frame with any combination of data for the pairs, and the year-specific data frames are combined into a single final data frame with do.call and rbind.
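Since the question notes the sum "could be any other possible function", the combining step generalises cleanly; a sketch with the pairwise operation pulled out as a parameter (pmax is just an arbitrary example):
f <- pmax  # any vectorised pairwise function: `+`, pmin, function(a, b) (a + b) / 2, ...
do.call(rbind, lapply(split(data, data$year), function(x) {
  p <- combn(nrow(x), 2)
  data.frame(order = paste(x$order[p[1, ]], x$order[p[2, ]], sep = "_"),
             obs1 = x$obs[p[1, ]],
             obs2 = x$obs[p[2, ]],
             year = x$year[1],
             var1 = f(x$var1[p[1, ]], x$var1[p[2, ]]),
             var2 = f(x$var2[p[1, ]], x$var2[p[2, ]]),
             var3 = f(x$var3[p[1, ]], x$var3[p[2, ]]))
}))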