Replicating an Excel SUMIFS formula - r

I need to replicate - or at least find an alternative solution - for a SUMIFS function I have in Excel.
I have a transactional database:
SegNbr Index Revenue SUMIF
A 1 10 30
A 1 20 30
A 2 30 100
A 2 40 100
B 1 50 110
B 1 60 110
B 3 70 260
B 3 80 260
and I need to create another column that sums the Revenue, by SegmentNumber, for all indexes that are equal or less the Index in that row. It is a distorted rolling revenue as it will be the same for each SegmentNumber/Index key. This is the formula is this one:
=SUMIFS([Revenue],[SegNbr],[#SegNbr],[Index],"<="&[#Index])

Let's say you have this sample data.frame
dd<-read.table(text="SegNbr Index Revenue
A 1 10
A 1 20
A 2 30
A 2 40
B 1 50
B 1 60
B 3 70
B 3 80", header=T)
Now if we make sure the data is ordered by segment and index, we can do
dd<-dd[order(dd$SegNbr, dd$Index), ] #sort data
dd$OUT<-with(dd,
ave(
ave(Revenue, SegNbr, FUN=cumsum), #get running sum per seg
interaction(SegNbr, Index, drop=T),
FUN=max, na.rm=T) #find largest sum per index per seg
)
dd
This gives
SegNbr Index Revenue OUT
1 A 1 10 30
2 A 1 20 30
3 A 2 30 100
4 A 2 40 100
5 B 1 50 110
6 B 1 60 110
7 B 3 70 260
8 B 3 80 260
as desired.

Related

Select row meeting condition and all subsequent rows by group

Let's assume I have a data frame consisting of a categorical variable and a numerical one.
df <- data.frame(group=c(1,1,1,1,1,2,2,2,2,2),days=floor(runif(10, min=0, max=101)))
df
group days
1 1 54
2 1 61
3 1 31
4 1 52
5 1 21
6 2 22
7 2 18
8 2 50
9 2 46
10 2 35
I would like to select the row corresponding to the maximum number of days by group as well as all the following/subsequent group rows. For the example above, my subset df2 should look as follows:
df2
group days
2 1 61
3 1 31
4 1 52
5 1 21
8 2 50
9 2 46
10 2 35
Please note that the groups could have different lengths.
For a base R solution, aggregate days by group using a function that keeps the elements with index greater than or equal to the maximum, and then reshape as a long data.frame
df0 = aggregate(days ~ group, df, function(x) x[seq_along(x) >= which.max(x)])
data.frame(group=rep(df0$group, lengths(df0$days)),
days=unlist(df0$days, use.names=FALSE)))
leading to
group days
1 1 84
2 1 31
3 1 65
4 1 23
5 2 94
6 2 69
7 2 45
You can use which.max to find out the index of the maximum of the days and then use slice from dplyr to select all the rows after that, where n() gives the number of rows in each group:
library(dplyr)
df %>% group_by(group) %>% slice(which.max(days):n())
#Source: local data frame [7 x 2]
#Groups: group [2]
# group days
# <int> <int>
#1 1 61
#2 1 31
#3 1 52
#4 1 21
#5 2 50
#6 2 46
#7 2 35
data.table syntax would be similar, .N is similar to n() in dplyr and gives the number of rows in each group:
library(data.table)
setDT(df)[, .SD[which.max(days):.N], group]
# group days
#1: 1 61
#2: 1 31
#3: 1 52
#4: 1 21
#5: 2 50
#6: 2 46
#7: 2 35
We can use a faster option with data.table where we find the row index (.I) and then subset the rows based on that.
library(data.table)
setDT(df)[df[ , .I[which.max(days):.N], by = group]$V1]
# group days
#1: 1 61
#2: 1 31
#3: 1 52
#4: 1 21
#5: 2 50
#6: 2 46
#7: 2 35

Select specific rows based on previous row value (in the same column)

I've been trying to figure a way to script this through R, but just can't get it. I have a dataset like this:
Trial Type Correct Latency
1 55 0 0
3 30 1 766
4 10 1 344
6 40 1 716
7 10 1 326
9 30 1 550
10 10 1 350
11 64 0 0
13 30 1 683
14 10 1 270
16 30 1 666
17 10 1 297
19 40 1 616
20 10 1 315
21 64 0 0
23 40 1 850
24 10 1 322
26 30 1 566
27 20 0 766
28 40 1 500
29 20 1 230
which goes for much longer(around 1000 rows).
From this one dataset, I would like to create 4 separate data.frames/tables I can export tables with as well as do my own calculations
I would like to have a data.frame (4 in total), one for each of these bullet points:
type 10 rows which are preceded by a type 30 row
type 10 rows which are preceded by a type 40 row
type 20 rows which are preceded by a type 30 row
type 20 rows which are preceded by a type 40 row
I would like for all the columns in the relevant rows to be placed into these new tables, but only including the column info of row types 10 or 20.
For example, the first table (type 10 preceded by type 30) would like this based on the sample data:
Trial Type Correct Latency
4 10 1 344
10 10 1 350
14 10 1 270
17 10 1 297
Second table (type 10 preceded by type 40):
Trial Type Correct Latency
7 10 1 326
20 10 1 315
24 10 1 322
Third table (type 20 preceded by type 30):
Trial Type Correct Latency
27 20 0 766
Fourth table (table 20 preceded by type 40):
Trial Type Correct Latency
29 20 1 230
I can subset just fine to get one table only of type 10 rows and another for type 20 rows, but I can't figure out how to create different tables for type 10 and 20 rows based on the previous type value. Also, an issue is that "Trials" is not in order (skips numbers).
Any help would be greatly appreciated. Thank you.
Also, is there a way to include the previous row as well, so the output for the fourth table would look something like this:
Fourth table (table 20 preceded by type 40):
Trial Type Correct Latency
28 40 1 500
29 20 1 230
For the fourth example, you could use which() in combination with lag() from dplyr, to attain the indices that meet your criteria. Then you can use these to subset the data.frame.
# Get indices of rows that meet condition
ind2 <- which(df$Type==20 & dplyr::lag(df$Type)==40)
# Get indices of rows before the ones that meet condition
ind1 <- which(df$Type==20 & dplyr::lag(df$Type)==40)-1
# Subset data
> df[c(ind1,ind2)]
Trial Type Correct Latency
1: 28 40 1 500
2: 29 20 1 230
Here is an example code if you always want to delete the first trials of your data.
var1 <- c(1,2,1,2,1,2,1,2,1,2)
var2 <- c(1,1,1,2,2,2,2,3,3,3)
dat <- data.frame(var1, var2)
var1 var2
1 1 1
2 2 1
3 1 1
4 2 2
5 1 2
6 2 2
7 1 2
8 2 3
9 1 3
10 2 3
#delete only this line directly
filter(dat,lag(var2)==var2)
var1 var2
1 1 1
2 2 1
3 1 1
6 2 2
7 1 2
10 2 3
#delete the first 2 trials
#make a list of all rows where var2[n-1]!=var2[n] --> using lag from dplyr
drops <- c(1,2,which(lag(dat$var2)!=dat$var2), which(lag(dat$var2)!=dat$var2)+1)
if (!identical(drops,numeric(0))) { dat <- dat[-drops,] }
var1 var2
3 1 1
6 2 2
7 1 2
10 2 3

Calculate the mean of a column for each batch of n rows in R

Suppose I have a data frame like this...
> head(x)
round value
1 1 0.37207016
2 2 0.51954917
3 3 -0.70684976
4 4 0.76105557
5 5 0.09252876
6 6 -2.42223178
> tail(x)
round value
95 95 -0.6799075
96 96 -0.4109732
97 97 0.9740048
98 98 -0.8877499
99 99 0.1501041
100 100 -0.5415825
...and I want to get the mean value over each 10-round interval. I've posted one answer below, but a common thing to want to do, so is there is a more straightforward way?
I can do some gymnastics to create a data frame with an extra column for the "batch" index, and then group by that to calculate the mean.
> y <- data.frame(x$round, x$value, rep(1:10, each=10))
> colnames(y) <- c("round","value", "batch")
> head(y)
round value batch
1 1 0.37207016 1
2 2 0.51954917 1
3 3 -0.70684976 1
4 4 0.76105557 1
5 5 0.09252876 1
6 6 -2.42223178 1
> tail(y)
round value batch
95 95 -0.6799075 10
96 96 -0.4109732 10
97 97 0.9740048 10
98 98 -0.8877499 10
99 99 0.1501041 10
100 100 -0.5415825 10
> tapply(y$value, y$batch, mean)
1 2 3 4 5 6
-0.13784753 -0.15969468 0.41346173 0.09019686 -0.26467052 -0.29677632
7 8 9 10
0.06489254 0.17609739 0.35029525 -0.19669901
Try using modulo division. Need to subtract 1 to get first group of size 10:
tapply(y$yvalue, (nrow(x)-1) %/% 10, mean)

R: Modifying Subsets of Dataframe using Calculations on that Subset

I am going to ask my question through example, because I don't know what the best way to phrase it in general is. Using the ChickWeight dataset built into R:
> head(ChickWeight)
weight Time Chick Diet
1 42 0 1 1
2 51 2 1 1
3 59 4 1 1
4 64 6 1 1
5 76 8 1 1
6 93 10 1 1
> tail(ChickWeight)
weight Time Chick Diet
573 155 12 50 4
574 175 14 50 4
575 205 16 50 4
576 234 18 50 4
577 264 20 50 4
578 264 21 50 4
I can use ddply to calculate mean for each unique Diet, for example
> ddply(d, .(Diet), summarise, mean_weight=mean(weight, na.rm=TRUE))
Diet mean_weight
1 1 102.6455
2 2 122.6167
3 3 142.9500
4 4 135.2627
What do I do if I wanted to easily create a data frame that modifies the 'weight' column in ChickWeight by dividing it by the mean_weight of it's corresponding diet?
A solution with data.table that's short, fast and readable:
library(data.table)
cw <- data.table(ChickWeight)
cw[, pct_mw_diet:=weight/mean(weight, na.rm=T), by=Diet]
Now you have a column with percent of mean weight by diet

In R, can I concatenate, then call variable column=concatenated string?

I am trying to identify the time of primary ambulance arrival for a number of patients in my dataframe=data.
The primary ambulance is either the 1st, 2nd, 3rd or 4th vehicle on scene (data$prim.amb.num=1, 2, 3, or 4 for each patient/row).
data$time_v1, data$time_v2, data$time_v3 and data$time_v4 have a time or a missing value, which corresponds to the 1st, 2nd, 3rd and 4th vehicles, where relevant.
What I would like to do is make a new variable=prim.amb.time with the time that corresponds to primary ambulance arrival time. Suppose for patient=1, the ambulance was the first. Then I want data[1,"prim.amb.time"]=data[1,"time_v1"].
I can figure out the correct time_v* with the following:
paste("time_v", data$prim.amb.num, sep="")
But I'm stuck as to how to pass the resulting information to call the correct column.
My hope was to simply have something like:
data$prim.amb.time<-data$paste("time_v", data$prim.amb.num, sep="")
but of course, this doesn't work. I'm not even sure how to Google for this; I tried various combinations of this title but to no avail. Any suggestions?
Although I liked the answer by #mhermans, if you want a one-liner, one solution is to use ?apply as follows:
#From #mhermans
zz <- textConnection("patient.id prime.amb.num time_v1 time_v2 time_v3 time_v4
1000 1 30 40 60 100
1001 3 40 50 60 80
1002 2 10 30 40 45
1003 1 24 40 45 60
")
d <- read.table(zz, header = TRUE)
close(zz)
#Take each row of d and pull out time_vn where n = d$prime.amb.num
d$prime.amb.time <- apply(d, 1, function(x) {x[x['prime.amb.num'] + 2]})
> d
patient.id prime.amb.num time_v1 time_v2 time_v3 time_v4 prime.amb.time
1 1000 1 30 40 60 100 30
2 1001 3 40 50 60 80 60
3 1002 2 10 30 40 45 30
4 1003 1 24 40 45 60 24
EDIT - or with paste:
d$prime.amb.time <-
apply(
d,
1,
function(x) {
x[paste('time_v', x['prime.amb.num'], sep = '')]
}
)
#Gives the same result
Set up example data:
# read in basic example data for four patients, wide format
zz <- textConnection("patient.id prime.amb.num time_v1 time_v2 time_v3 time_v4
1000 1 30 40 60 100
1001 3 40 50 60 80
1002 2 10 30 40 45
1003 1 24 40 45 60
")
d <- read.table(zz, header = TRUE)
close(zz)
In the example dataset I'm thus assuming your data looks like this:
patient.id prime.amb.num time_v1 time_v2 time_v3 time_v4
1 1000 1 30 40 60 100
2 1001 3 40 50 60 80
3 1002 2 10 30 40 45
4 1003 1 24 40 45 60
Given that data structure, it is perhaps easier to work with a dataset with a vehicle per row, instead of a patient per row. This can be accomplised by using reshape() to convert from a wide to a long format.
dl <- reshape(d, direction='long', idvar="patient.id", varying=list(3:6))
# ordering & rename var for aesth. reasons:
dl <- dl[order(dl$patient.id, dl$time),]
dl$vehicle.id <- dl$time
dl$time <- NULL
dl
This gives a long dataset, with a row per vehicle:
patient.id prime.amb.num time_v1 vehicle.id
1000.1 1000 1 30 1
1000.2 1000 1 40 2
1000.3 1000 1 60 3
1000.4 1000 1 100 4
1001.1 1001 3 40 1
1001.2 1001 3 50 2
1001.3 1001 3 60 3
1001.4 1001 3 80 4
1002.1 1002 2 10 1
1002.2 1002 2 30 2
1002.3 1002 2 40 3
1002.4 1002 2 45 4
1003.1 1003 1 24 1
1003.2 1003 1 40 2
1003.3 1003 1 45 3
1003.4 1003 1 60 4
Getting the arrival time of the first ambulance per patient then become a simple oneliner:
dl[dl$prime.amb.num == dl$vehicle.id,]
which gives
patient.id prime.amb.num time_v1 vehicle.id
1000.1 1000 1 30 1
1001.3 1001 3 60 3
1002.2 1002 2 30 2
1003.1 1003 1 24 1

Resources