Is it possible to avoid a FOR loop in this very simple R code?

The answers below are very helpful, but I oversimplified my original question. I figured I would learn more by simplifying and then adapting to my actual need, but now I am stuck. Other factors drive the amortization. I like the response using "amort$end_bal <- begin_bal * (1 - mpr)^amort$period" and "amort$pmt <- c(0, diff(amort$end_bal))* -1", but in addition npr increases the ending balance and ch_off decreases it. Here's the more complete code:
n_periods <- 8
begin_bal <- 10000
yld <- .20
npr <- .09
mpr <- .10
co <- .10
period = seq(0,n_periods,1)
fin = 0
pur = 0
pmt = 0
ch_off = 0
end_bal = begin_bal
for(i in 1:n_periods){
{fin[i+1] = end_bal[i]*yld/12}
{pur[i+1] = end_bal[i]*npr}
{pmt[i+1] = end_bal[i]*mpr}
{ch_off[i+1] = end_bal[i]*co/12}
end_bal[i+1] = end_bal[i]+pur[i+1]-pmt[i+1]-ch_off[i+1]}
amort <- data.frame(period,fin,pur,pmt,ch_off,end_bal)
Which gives the below correct output:
print(amort,row.names=FALSE)
period fin pur pmt ch_off end_bal
0 0.0000 0.0000 0.0000 0.00000 10000.000
1 166.6667 900.0000 1000.0000 83.33333 9816.667
2 163.6111 883.5000 981.6667 81.80556 9636.694
3 160.6116 867.3025 963.6694 80.30579 9460.022
4 157.6670 851.4020 946.0022 78.83351 9286.588
5 154.7765 835.7929 928.6588 77.38823 9116.334
6 151.9389 820.4700 911.6334 75.96945 8949.201
7 149.1534 805.4281 894.9201 74.57668 8785.132
8 146.4189 790.6619 878.5132 73.20944 8624.072
I'm new to R, and I understand one of its features is matrix/vector manipulation. In the example below I amortize an asset over 8 months, where each payment ("pmt") is 10% ("mpr") of the prior period's balance ("end_bal"). The code works fine, but it uses a FOR loop. I understand FOR loops can be slow in large models and that R's vector/matrix functions are usually a better solution. I didn't know how to apply them here, since each monthly payment is calculated from the prior period's ending balance.
So my questions are:
Is there a more efficient way to do the below?
How do I replace the 0 for pmt in period 0, with an empty space?
R code:
n_periods <- 8
begin_bal <- 100
mpr <- .10
# Example loan amortization
pmt = 0
end_bal = begin_bal
for(i in 1:n_periods){
{pmt[i+1] = end_bal[i]*mpr}
end_bal[i+1] = end_bal[i]-pmt[i+1]}
amort <- data.frame(period = 0:n_periods,pmt,end_bal)
amort
Results, which are correct:
> amort
period pmt end_bal
1 0 0.000000 100.00000
2 1 10.000000 90.00000
3 2 9.000000 81.00000
4 3 8.100000 72.90000
5 4 7.290000 65.61000
6 5 6.561000 59.04900
7 6 5.904900 53.14410
8 7 5.314410 47.82969
9 8 4.782969 43.04672

Use R's vectorised calculations
n_periods <- 8
begin_bal <- 100
mpr <- .10
amort <- data.frame(period = seq(0, n_periods, 1))
amort$end_bal <- begin_bal * (1 - mpr)^amort$period
amort$pmt <- c(0, diff(amort$end_bal))* -1
amort
#> period end_bal pmt
#> 1 0 100.00000 0.000000
#> 2 1 90.00000 10.000000
#> 3 2 81.00000 9.000000
#> 4 3 72.90000 8.100000
#> 5 4 65.61000 7.290000
#> 6 5 59.04900 6.561000
#> 7 6 53.14410 5.904900
#> 8 7 47.82969 5.314410
#> 9 8 43.04672 4.782969
Created on 2021-05-12 by the reprex package (v2.0.0)
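The closed form works because each period multiplies the balance by the same factor (1 - mpr). A minimal sketch of the same idea with cumprod(), which would also cover per-period factors that differ (the vector name factors is only illustrative):

```r
n_periods <- 8
begin_bal <- 100
mpr <- 0.10

# Reference: the explicit loop from the question
end_loop <- numeric(n_periods + 1)
end_loop[1] <- begin_bal
for (i in seq_len(n_periods)) {
  end_loop[i + 1] <- end_loop[i] * (1 - mpr)
}

# Vectorised: cumulative product of the per-period factors
# (factors could hold a different rate for each period)
factors <- rep(1 - mpr, n_periods)
end_vec <- begin_bal * cumprod(c(1, factors))

# Payments are the period-to-period drop in the balance
pmt <- c(0, -diff(end_vec))

isTRUE(all.equal(end_loop, end_vec))
```

With a constant rate this is equivalent to begin_bal * (1 - mpr)^period; cumprod() is only needed when the factor changes from period to period.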
The same approach with dplyr, for a different set of inputs:
n_periods <- 15
begin_bal <- 1000
mpr <- .07
library(dplyr)
seq(0, n_periods, 1) %>% as.data.frame() %>%
setNames('period') %>%
mutate(end_bal = begin_bal * (1 - mpr)^period,
pmt = -1 * c(0, diff(end_bal)))
#> period end_bal pmt
#> 1 0 1000.0000 0.00000
#> 2 1 930.0000 70.00000
#> 3 2 864.9000 65.10000
#> 4 3 804.3570 60.54300
#> 5 4 748.0520 56.30499
#> 6 5 695.6884 52.36364
#> 7 6 646.9902 48.69819
#> 8 7 601.7009 45.28931
#> 9 8 559.5818 42.11906
#> 10 9 520.4111 39.17073
#> 11 10 483.9823 36.42878
#> 12 11 450.1035 33.87876
#> 13 12 418.5963 31.50725
#> 14 13 389.2946 29.30174
#> 15 14 362.0439 27.25062
#> 16 15 336.7009 25.34308
Created on 2021-05-12 by the reprex package (v2.0.0)
Although the OP added a second question in the edit, here is the suggested approach for that scenario (for future reference). Each period the balance shrinks by the net rate mpr + co/12 - npr:
n_periods <- 8
begin_bal <- 10000
yld <- .20
npr <- .09
mpr <- .10
co <- .10
library(dplyr)
seq(0, n_periods, 1) %>% as.data.frame() %>%
setNames('period') %>%
mutate(end_bal = begin_bal * (1 - (mpr + co/12 - npr))^period,
fin = c(0, (end_bal * yld/12)[-nrow(.)]),
pur = c(0, (end_bal * npr)[-nrow(.)]),
pmt = c(0, (end_bal * mpr)[-nrow(.)]),
ch_off = c(0, (end_bal * co/12)[-nrow(.)]))
#> period end_bal fin pur pmt ch_off
#> 1 0 10000.000 0.0000 0.0000 0.0000 0.00000
#> 2 1 9816.667 166.6667 900.0000 1000.0000 83.33333
#> 3 2 9636.694 163.6111 883.5000 981.6667 81.80556
#> 4 3 9460.022 160.6116 867.3025 963.6694 80.30579
#> 5 4 9286.588 157.6670 851.4020 946.0022 78.83351
#> 6 5 9116.334 154.7765 835.7929 928.6588 77.38823
#> 7 6 8949.201 151.9389 820.4700 911.6334 75.96945
#> 8 7 8785.132 149.1534 805.4281 894.9201 74.57668
#> 9 8 8624.072 146.4189 790.6619 878.5132 73.20944
Created on 2021-05-13 by the reprex package (v2.0.0)
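As a sanity check on the net-rate formula, the sketch below (base R only; net_rate and prev_bal are illustrative names) compares the loop from the edited question with the closed form and rebuilds the other columns from the prior period's balance:

```r
n_periods <- 8
begin_bal <- 10000
yld <- 0.20
npr <- 0.09
mpr <- 0.10
co  <- 0.10

# Reference: loop version of the ending balance
end_loop <- numeric(n_periods + 1)
end_loop[1] <- begin_bal
for (i in seq_len(n_periods)) {
  pur    <- end_loop[i] * npr
  pmt    <- end_loop[i] * mpr
  ch_off <- end_loop[i] * co / 12
  end_loop[i + 1] <- end_loop[i] + pur - pmt - ch_off
}

# Closed form: purchases add npr, payments and charge-offs remove mpr + co/12
net_rate   <- mpr + co / 12 - npr
end_closed <- begin_bal * (1 - net_rate)^(0:n_periods)

# Every flow is a fixed share of the prior period's ending balance
prev_bal <- c(0, head(end_closed, -1))
amort <- data.frame(
  period  = 0:n_periods,
  fin     = prev_bal * yld / 12,
  pur     = prev_bal * npr,
  pmt     = prev_bal * mpr,
  ch_off  = prev_bal * co / 12,
  end_bal = end_closed
)

isTRUE(all.equal(end_loop, end_closed))
```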

If you are "lazy" (don't want to formulate the general expression for pmt and end_bal), you can define a recursive function f like below (it uses the mpr defined above and a hard-coded starting balance of 100)
f <- function(k) {
if (k == 1) {
return(data.frame(pmt = 100 * mpr, end_bal = 100))
}
u <- f(k - 1)
end_bal <- with(tail(u, 1), end_bal - pmt)
pmt <- mpr * end_bal
rbind(u, data.frame(pmt, end_bal))
}
n_periods <- 8
res <- transform(
cbind(period = 0:n_periods, f(n_periods + 1)),
pmt = c(0, head(pmt, -1))
)
and you will see
> res
period pmt end_bal
1 0 0.000000 100.00000
2 1 10.000000 90.00000
3 2 9.000000 81.00000
4 3 8.100000 72.90000
5 4 7.290000 65.61000
6 5 6.561000 59.04900
7 6 5.904900 53.14410
8 7 5.314410 47.82969
9 8 4.782969 43.04672
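A possible alternative to hand-written recursion is Reduce() with accumulate = TRUE, which carries the balance forward one period at a time. The sketch also touches on the second part of the question: one way (an assumption about what "empty space" should mean) is to turn pmt into a character column with an empty string for period 0.

```r
n_periods <- 8
begin_bal <- 100
mpr <- 0.10

# Carry the balance forward; the second argument (period index) is unused
end_bal <- Reduce(
  function(bal, i) bal - bal * mpr,
  seq_len(n_periods),
  init = begin_bal,
  accumulate = TRUE
)

# Payment is the drop in balance; there is no payment in period 0
pmt <- c(NA, -diff(end_bal))

res <- data.frame(
  period  = 0:n_periods,
  # Display column: blank for period 0, formatted number otherwise
  pmt     = ifelse(is.na(pmt), "", format(pmt)),
  end_bal = end_bal
)
res
```

Note that the blank makes pmt a character column, so it is best done only for display; keeping NA in the numeric column is safer for further calculations.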

Related

How to reduce the if statement for multiple arguments?

I would like to reduce this code:
carro$custo_tprivate = with(carro, ifelse(decile_renda == 1,
renda_fa*0.116,
ifelse(decile_renda == 2, renda_fa*0.106,
ifelse(decile_renda == 3, renda_fa*0.102,
ifelse(decile_renda == 4, renda_fa*0.115,
ifelse(decile_renda == 5, renda_fa*0.124,
ifelse(decile_renda == 6, renda_fa*0.125,
ifelse(decile_renda == 7, renda_fa*0.137,
ifelse(decile_renda == 8, renda_fa*0.141,
ifelse(decile_renda == 9, renda_fa*0.156,
ifelse(decile_renda == 10, renda_fa*0.131, 0)))))))))))
Someone could teach me how to do that?
Thank you very much!
You can use match() to replace the nested ifelse() calls, as below
# put your case data into a data frame
data = data.frame(x= seq(1, 10, by = 1)
, y = runif(10))
# creating your actual data
carro = data.frame(decile_renda = sample(1:10, 10, replace =T)
,renda_fa = runif(10)
)
#Match it to get positions of case statement
pos = with(carro, match(decile_renda, data$x, 0))
# multiply to get results
data$y[pos]*carro$renda_fa
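A minimal sketch of the match() idea using the actual factors from the question (the sample carro rows are made up for illustration). Leaving nomatch at its default of NA keeps the result aligned with the rows of carro; with nomatch = 0, indexing by pos would silently drop the non-matching rows.

```r
# Factors for deciles 1..10, taken from the nested ifelse() in the question
fa <- c(0.116, 0.106, 0.102, 0.115, 0.124, 0.125, 0.137, 0.141, 0.156, 0.131)

# Illustrative data, including two deciles (0 and 11) with no factor
carro <- data.frame(decile_renda = c(0, 1, 5, 10, 11), renda_fa = 100)

# Position of each decile in 1:10, NA where there is no match
pos <- match(carro$decile_renda, 1:10)

# Non-matching rows get 0, as in the final branch of the original ifelse()
carro$custo_tprivate <- ifelse(is.na(pos), 0, carro$renda_fa * fa[pos])
carro
```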
Alternatively, this can be solved by left-joining with a lookup table.
If I understand the nested ifelse() construct correctly, the OP wants to multiply renda_fa by a factor which depends on the value of decile_renda. The factors are given for 10 distinct values of decile_renda. In all other cases, the result must be zero.
Using left join will find matching values of decile_renda in the lookup table. Non-matching rows in carro will get an NA value. These need to be replaced by zero, subsequently.
The lookup table treats the single use cases as data instead of hard-coding. This gives the flexibility to add or change the use cases without changing the code.
Create lookup table and test dataset
# create lookup table
lut <- data.frame(
decile_renda = 1:10,
fa = c(0.116, 0.106, 0.102, 0.115, 0.124, 0.125, 0.137, 0.141, 0.156, 0.131)
)
lut
decile_renda fa
1 1 0.116
2 2 0.106
3 3 0.102
4 4 0.115
5 5 0.124
6 6 0.125
7 7 0.137
8 8 0.141
9 9 0.156
10 10 0.131
# create test dataset
carro <- data.frame(decile_renda = 0:11, renda_fa = 100)
# randomize row order
set.seed(1L) # required for reproducible data
carro <- carro[sample(nrow(carro)), ]
carro
decile_renda renda_fa
9 8 100
4 3 100
7 6 100
1 0 100
2 1 100
5 4 100
3 2 100
8 7 100
6 5 100
11 10 100
12 11 100
10 9 100
Note that the test dataset has been chosen to allow for easy verification of the results.
Base R: merge()
carro <- merge(carro, lut, all.x = TRUE, by = "decile_renda")
carro$custo_tprivate <- with(carro, ifelse(is.na(fa), 0, renda_fa * fa))
carro
decile_renda renda_fa fa custo_tprivate
1 0 100 NA 0.0
2 1 100 0.116 11.6
3 2 100 0.106 10.6
4 3 100 0.102 10.2
5 4 100 0.115 11.5
6 5 100 0.124 12.4
7 6 100 0.125 12.5
8 7 100 0.137 13.7
9 8 100 0.141 14.1
10 9 100 0.156 15.6
11 10 100 0.131 13.1
12 11 100 NA 0.0
Note that the result column custo_tprivate shows 0.0 for the rows with non-matching decile_renda values of 0 and 11 as requested.
However, the drawback here is that merge() does not maintain the original row order (this is why the test dataset uses a random row order for demonstration). Also, the result contains the fa column which is no longer needed.
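If the original row order matters with merge(), one common workaround is to record it in a helper column first and sort by it afterwards. A sketch, where row_id is a hypothetical helper column and the carro rows are illustrative:

```r
lut <- data.frame(
  decile_renda = 1:10,
  fa = c(0.116, 0.106, 0.102, 0.115, 0.124, 0.125, 0.137, 0.141, 0.156, 0.131)
)
carro <- data.frame(decile_renda = c(8, 3, 0, 11, 1), renda_fa = 100)

# Remember the original row order before merging
carro$row_id <- seq_len(nrow(carro))

merged <- merge(carro, lut, by = "decile_renda", all.x = TRUE)

# Restore the original order, compute the result, drop helper columns
merged <- merged[order(merged$row_id), ]
merged$custo_tprivate <- with(merged, ifelse(is.na(fa), 0, renda_fa * fa))
merged$fa <- NULL
merged$row_id <- NULL
merged
```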
dplyr
library(dplyr)
carro %>%
left_join(lut, by = "decile_renda") %>%
mutate(custo_tprivate = if_else(is.na(fa), 0, renda_fa * fa)) %>%
select(-fa)
decile_renda renda_fa custo_tprivate
1 8 100 14.1
2 3 100 10.2
3 6 100 12.5
4 0 100 0.0
5 1 100 11.6
6 4 100 11.5
7 2 100 10.6
8 7 100 13.7
9 5 100 12.4
10 10 100 13.1
11 11 100 0.0
12 9 100 15.6
Here, the original row order is kept and the fa column has been removed.
data.table
With data.table we can do an update join where the matching rows of carro are being updated by reference, i.e., without copying the whole object. Only the result column custo_tprivate is appended to carro but not fa which would have to be removed afterwards. Also, for replacing the NA values only the affected rows are updated in place.
This might be an advantage in terms of speed and memory consumption in case of large datasets.
library(data.table)
setDT(carro)[lut, on = .(decile_renda), custo_tprivate := renda_fa * fa]
carro[is.na(custo_tprivate), custo_tprivate := 0]
carro
decile_renda renda_fa custo_tprivate
1: 8 100 14.1
2: 3 100 10.2
3: 6 100 12.5
4: 0 100 0.0
5: 1 100 11.6
6: 4 100 11.5
7: 2 100 10.6
8: 7 100 13.7
9: 5 100 12.4
10: 10 100 13.1
11: 11 100 0.0
12: 9 100 15.6

When a variable switches from 1 to 2, delete some data from the other variables and average what's left?

I am analysing some data and need help.
Basically, I have a dataset that looks like this:
date <- seq(as.Date("2017-04-01"),as.Date("2017-05-09"),length.out=40)
switch <- c(rep(1:2,each=10),rep(1:2,each=10))
O2 <- runif(40,min=21.02,max=21.06)
CO2 <- runif(40,min=0.076,max=0.080)
test.data <- data.frame(date,switch,O2,CO2)
As can be seen, there's a switch column that switches between 1 and 2 every 10 data points. I want to write a code that does: when the "switch" column changes its value (from 1 to 2, or 2 to 1), delete the first 5 rows of data after the switch (i.e. leaving the 5 last data points for all the 4 variables), average the rest of the data points for O2 and CO2, and put them in 2 new columns (avg.O2 and avg.CO2) before the next switch. Then repeat this process until the end.
It's quite easy to do manually on paper or in Excel, but my real dataset will comprise thousands of data points and I would like R to do it automatically. Does anyone have any ideas that could help me?
Please find my edits, which should work for both regular and irregular switching
date <- seq(as.Date("2017-04-01"),as.Date("2017-05-09"),length.out=40)
switch <- c(rep(1:2,each=10),rep(1:2,each=10))
O2 <- runif(40,min=21.02,max=21.06)
CO2 <- runif(40,min=0.076,max=0.080)
test.data <- data.frame(date,switch,O2,CO2)
CleanMachineData <- function(Data, SwitchData, UnreliableRows = 5){
# First, we can properly turn your switch column into a grouping column (1,2,1,2)->(1,2,3,4)
grouplength <- rle(Data[,"switch"])$lengths
# mapply lets us input vector arguments into typically one/first-element only argument functions.
# In this case we create a sequence of lengths (output is a list/vector)
grouping <- mapply(seq, grouplength)
# Here we want it to become a single vector representing groups
groups <- mapply(rep, 1:length(grouplength), each = grouplength)
# if frequency was irregular, it will be a list, if regular it will be a matrix
# convert either into a vector by doing as follows:
if (is.list(groups)) {
groups <- unlist(groups)
} else {
groups <- as.vector(groups)
}
Data$group <- groups
#
# vector of the first row of each new switch (except the starting 0)
switchRow <- c(0,which(abs(diff(SwitchData)) == 1))+1
# I use "as.vector" to turn the matrix output of mapply into a sequence of numbers.
# "ToRemove" will have all the row numbers to get rid of from your original data, except for what happens before (in this case) row 10
ToRemove <- c(1:UnreliableRows, as.vector(mapply(seq, switchRow, switchRow+(UnreliableRows)-1)))
# I concatenate the missing beginning (1,2,3,4,5) with the other rows to remove using c(), then drop them from the row sequence with "-"
Keep <- seq(nrow(Data))[-c(1:UnreliableRows,ToRemove)]
# Create the new data, (in case you don't know: data[<ROW>,<COLUMN>])
newdat <- Data[-ToRemove,]
# print the results
newdat
}
dat <- CleanMachineData(test.data, test.data$switch, 5)
dat
date switch O2 CO2 group
6 2017-04-05 1 21.03922 0.07648886 1
7 2017-04-06 1 21.04071 0.07747368 1
8 2017-04-07 1 21.05742 0.07946615 1
9 2017-04-08 1 21.04673 0.07782362 1
10 2017-04-09 1 21.04966 0.07936446 1
16 2017-04-15 2 21.02526 0.07833825 2
17 2017-04-16 2 21.04511 0.07747774 2
18 2017-04-17 2 21.03165 0.07662803 2
19 2017-04-18 2 21.03252 0.07960098 2
20 2017-04-19 2 21.04032 0.07892145 2
26 2017-04-25 1 21.03691 0.07691438 3
27 2017-04-26 1 21.05846 0.07857017 3
28 2017-04-27 1 21.04128 0.07891908 3
29 2017-04-28 1 21.03837 0.07817021 3
30 2017-04-29 1 21.02334 0.07917546 3
36 2017-05-05 2 21.02890 0.07723042 4
37 2017-05-06 2 21.04606 0.07979641 4
38 2017-05-07 2 21.03822 0.07985775 4
39 2017-05-08 2 21.04136 0.07781525 4
40 2017-05-09 2 21.05375 0.07941123 4
aggregate(cbind(O2,CO2) ~ group, dat, mean)
group O2 CO2
1 1 21.04675 0.07812336
2 2 21.03497 0.07819329
3 3 21.03967 0.07834986
4 4 21.04166 0.07882221
# crazier, irregular switching
test.data2 <- test.data
test.data2$switch <- unlist(mapply(rep, 1:2, times = 1, each = c(10,8,10,5,3,10)))[1:20]
dat2 <- CleanMachineData(test.data2, test.data2$switch, 5)
dat2
date switch O2 CO2 group
6 2017-04-05 1 21.03922 0.07648886 1
7 2017-04-06 1 21.04071 0.07747368 1
8 2017-04-07 1 21.05742 0.07946615 1
9 2017-04-08 1 21.04673 0.07782362 1
10 2017-04-09 1 21.04966 0.07936446 1
16 2017-04-15 2 21.02526 0.07833825 2
17 2017-04-16 2 21.04511 0.07747774 2
18 2017-04-17 2 21.03165 0.07662803 2
24 2017-04-23 1 21.05658 0.07669662 3
25 2017-04-24 1 21.04452 0.07983165 3
26 2017-04-25 1 21.03691 0.07691438 3
27 2017-04-26 1 21.05846 0.07857017 3
28 2017-04-27 1 21.04128 0.07891908 3
29 2017-04-28 1 21.03837 0.07817021 3
30 2017-04-29 1 21.02334 0.07917546 3
36 2017-05-05 2 21.02890 0.07723042 4
37 2017-05-06 2 21.04606 0.07979641 4
38 2017-05-07 2 21.03822 0.07985775 4
# You can try removing a vector with the following
lapply(5:7, function(x) {
dat <- CleanMachineData(test.data2, test.data2$switch, x)
list(data = dat, means = aggregate(cbind(O2,CO2)~group, dat, mean))
})
Use
test.data[rep(c(FALSE, TRUE), each=5),]
to select always the last five rows from the group of 10 rows.
Then you can use aggregate:
d2 <- test.data[rep(c(FALSE, TRUE), each=5),]
aggregate(cbind(O2, CO2) ~ 1, data=d2, FUN=mean)
If you want the average for every 5-rows-group:
aggregate(cbind(O2, CO2) ~ gl(k=5, n=nrow(d2)/5L), data=d2, FUN=mean)
Here is a generalization for the situation of arbitrary number of rows in test.data:
stay <- rep(c(FALSE, TRUE), each=5, length.out=nrow(test.data))
d2 <- test.data[stay,]
group <- gl(k=5, n=nrow(d2)/5L+1L, length=nrow(d2))
aggregate(cbind(O2, CO2) ~ group, data=d2, FUN=mean)
Here is a variant for mixing the data with the averages:
group <- gl(k=10, n=nrow(test.data)/10L+1L, length=nrow(test.data))
L <- split(test.data, group)
mySummary <- function(x) {
if (nrow(x) <= 5) return(NULL)
x <- x[-(1:5),]
d.avg <- aggregate(cbind(O2, CO2) ~ 1, data=x, FUN=mean)
rbind(x, cbind(date=NA, switch=-1, d.avg))
}
lapply(L, mySummary) # as list of dataframes
do.call(rbind, lapply(L, mySummary)) # as one dataframe
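For switches of irregular length, a compact base R sketch is to number the runs with rle() and the position inside each run with sequence(), then drop the first 5 rows of every run. The column names run_id and pos_in_run are illustrative, and the seed is only there to make the example reproducible.

```r
set.seed(42)
test.data <- data.frame(
  date   = seq(as.Date("2017-04-01"), as.Date("2017-05-09"), length.out = 40),
  switch = c(rep(1:2, each = 10), rep(1:2, each = 10)),
  O2     = runif(40, min = 21.02, max = 21.06),
  CO2    = runif(40, min = 0.076, max = 0.080)
)

# Run-length encode the switch column: one run per stretch of equal values
runs <- rle(test.data$switch)
test.data$run_id     <- rep(seq_along(runs$lengths), times = runs$lengths)
test.data$pos_in_run <- sequence(runs$lengths)

# Drop the first 5 rows after every switch
kept <- test.data[test.data$pos_in_run > 5, ]

# Average what is left, per run, and attach the averages to the kept rows
avg <- aggregate(cbind(O2, CO2) ~ run_id, data = kept, FUN = mean)
names(avg)[-1] <- c("avg.O2", "avg.CO2")
result <- merge(kept, avg, by = "run_id")
head(result)
```

Runs shorter than 6 rows simply vanish from kept, so they produce no average.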

Find rows with time between two values in R

I have a dataframe temp which looks like the following:
Time Count Colour
01:02:30 11.000000 Red
17:05:49 52.000000 White
04:06:07 4.000000 Blue
01:07:03 30.000000 Red
20:08:30 4.000000 Yellow
The Time was initially an ISODate, so I stripped it down to just the time, which is what I wanted, using the code below.
temp$Time = parse_iso_8601(temp$Time)
temp$Time <- as.POSIXlt(temp$Time)
library(chron)
temp$Time=times(format(temp$Time, format="%H:%M:%S"))
Now, I wish to find rows with time between 02:00:00 and 05:00:00. Can you please suggest how this can be done? Thanks!
Here is my try,
temp <- read.table(text = "Time Count Colour
01:02:30 11.000000 Red
17:05:49 52.000000 White
04:06:07 4.000000 Blue
01:07:03 30.000000 Red
20:08:30 4.000000 Yellow", header = TRUE)
library(chron)
temp$Time <- times(format(temp$Time, format="%H:%M:%S"))
temp[temp$Time >= 2/24 & temp$Time <= 5/24, ]
Output:
> temp[temp$Time >= 2/24 & temp$Time <= 5/24, ]
Time Count Colour
3 04:06:07 4 Blue
The logic:
The code below shows that the function times should be mapping [00:00:00, 23:59:59.999...) to [0, 1)
> as.numeric(times(paste(0:23, ":00:00", sep = "")))
[1] 0.00000000 0.04166667 0.08333333 0.12500000 0.16666667 0.20833333
[7] 0.25000000 0.29166667 0.33333333 0.37500000 0.41666667 0.45833333
[13] 0.50000000 0.54166667 0.58333333 0.62500000 0.66666667 0.70833333
[19] 0.75000000 0.79166667 0.83333333 0.87500000 0.91666667 0.95833333
Thus, to find if the Time is between 02:00:00 and 05:00:00, you can check whether it is greater than or equal to 2/24 and smaller than or equal to 5/24.
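The same fraction-of-a-day comparison can be sketched without chron, assuming Time is held as plain "HH:MM:SS" strings (to_day_fraction is a hypothetical helper, not part of any package):

```r
temp <- data.frame(
  Time   = c("01:02:30", "17:05:49", "04:06:07", "01:07:03", "20:08:30"),
  Count  = c(11, 52, 4, 30, 4),
  Colour = c("Red", "White", "Blue", "Red", "Yellow"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: "HH:MM:SS" -> fraction of a day in [0, 1)
to_day_fraction <- function(x) {
  parts <- do.call(rbind, strsplit(x, ":", fixed = TRUE))
  storage.mode(parts) <- "double"
  (parts[, 1] * 3600 + parts[, 2] * 60 + parts[, 3]) / 86400
}

frac <- to_day_fraction(temp$Time)

# 02:00:00 is 2/24 of a day, 05:00:00 is 5/24 of a day
selected <- temp[frac >= 2 / 24 & frac <= 5 / 24, ]
selected
```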
The gap problem:
Not sure if it is what you want.
Assume temp is ordered by date and time, like the one below
library(chron)
temp <- data.frame(
Record = 1:8,
Day = c(1, 1, 1, 1, 1, 2, 2, 2),
Time = c("01:02:30", "01:07:03", "04:06:07", "17:05:49", "20:08:30", "02:00:00", "02:15:00", "04:07:00")
)
temp$Time <- times(format(temp$Time, format="%H:%M:%S"))
> temp
Record Day Time
1 1 1 01:02:30
2 2 1 01:07:03
3 3 1 04:06:07
4 4 1 17:05:49
5 5 1 20:08:30
6 6 2 02:00:00
7 7 2 02:15:00
8 8 2 04:07:00
R code to do your task:
temp$Gap <- 24 * (c(NA, diff(temp$Time)) + c(NA, diff(temp$Day) > 0))
temp$Gap3hr <- temp$Gap >= 3 # 3 hour gap
temp$HourFromFirst <- 24 * (temp$Time - temp$Time[1]) + 24 * (temp$Day - temp$Day[1])
tempSelected <- lapply(which(temp$Gap3hr == TRUE), function(i) {
BeforeGap1hr <- (temp$HourFromFirst[i - 1] - temp$HourFromFirst) <= 1 & (temp$HourFromFirst[i - 1] - temp$HourFromFirst) >= 0
AfterGap1hr <- (temp$HourFromFirst - temp$HourFromFirst[i]) <= 1 & (temp$HourFromFirst - temp$HourFromFirst[i]) >= 0
temp[BeforeGap1hr | AfterGap1hr, ]
}
)
Output:
> tempSelected
[[1]]
Record Day Time Gap Gap3hr HourFromFirst
3 3 1 04:06:07 2.984444 FALSE 3.060278
4 4 1 17:05:49 12.995000 TRUE 16.055278
[[2]]
Record Day Time Gap Gap3hr HourFromFirst
4 4 1 17:05:49 12.995000 TRUE 16.05528
5 5 1 20:08:30 3.044722 TRUE 19.10000
[[3]]
Record Day Time Gap Gap3hr HourFromFirst
5 5 1 20:08:30 3.044722 TRUE 19.10000
6 6 2 02:00:00 5.858333 TRUE 24.95833
7 7 2 02:15:00 0.250000 FALSE 25.20833
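The gap calculation may be easier to follow on a single absolute time axis. A base R sketch under the same assumptions (rows ordered by Day and Time; to_hours and AbsHours are illustrative names). Unlike diff(temp$Day) > 0, which adds at most one day, this also handles jumps of more than one day:

```r
# Hypothetical helper: "HH:MM:SS" -> hours since midnight
to_hours <- function(x) {
  parts <- do.call(rbind, strsplit(x, ":", fixed = TRUE))
  storage.mode(parts) <- "double"
  parts[, 1] + parts[, 2] / 60 + parts[, 3] / 3600
}

temp <- data.frame(
  Record = 1:8,
  Day    = c(1, 1, 1, 1, 1, 2, 2, 2),
  Time   = c("01:02:30", "01:07:03", "04:06:07", "17:05:49",
             "20:08:30", "02:00:00", "02:15:00", "04:07:00"),
  stringsAsFactors = FALSE
)

# Hours since midnight of the first day
temp$AbsHours <- 24 * (temp$Day - temp$Day[1]) + to_hours(temp$Time)

# Gap to the previous record, in hours (NA for the first record)
temp$Gap    <- c(NA, diff(temp$AbsHours))
temp$Gap3hr <- temp$Gap >= 3

which(temp$Gap3hr)
```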

interaction and first differences in zelig

I have a dataset with this structure:
# libraries
library(Zelig) # 5.0-12
library(data.table)
# create data
time <- factor(rep(-12:12, 50))
treatment <- rbinom(length(time), 1, .75)
outcome <- rnorm(length(time), 1, 3) + 3 * treatment
dat <- data.table(outcome, time, treatment)
dat
outcome time treatment
1: 5.2656458 -12 0
2: 4.8888805 -11 1
3: 2.6322592 -10 1
4: 8.2449092 -9 1
5: 0.5752739 -8 0
---
1246: 2.1865924 8 0
1247: 1.6028838 9 1
1248: 2.4056725 10 1
1249: 2.0257008 11 1
1250: 6.1503307 12 1
I run an LS model interacting time and treatment:
z <- zls$new()
z$zelig(outcome ~ time * treatment, data = dat)
summary(z)
Here is a trimmed output:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.40264 0.71552 3.358 0.00081
time-11 -1.61292 1.08177 -1.491 0.13622
time-10 -1.03283 0.99850 -1.034 0.30116
time-9 -1.47934 1.02667 -1.441 0.14987
time-8 -0.35614 1.02667 -0.347 0.72874
time-7 -1.05803 1.04304 -1.014 0.31061
time-6 -2.25316 1.16178 -1.939 0.05269
....
treatment 1.28097 0.89440 1.432 0.15234
time-11:treatment 2.86965 1.30927 2.192 0.02859
time-10:treatment 1.69479 1.25788 1.347 0.17813
time-9:treatment 1.78684 1.27330 1.403 0.16078
time-8:treatment 0.82332 1.27330 0.647 0.51801
time-7:treatment 1.62808 1.28334 1.269 0.20482
time-6:treatment 2.64653 1.36895 1.933 0.05344
time-5:treatment 3.08572 1.36895 2.254 0.02437
....
I would like to estimate the first differences (treatment = 1, treatment = 0) for each time so that I can plot the effects by time.
Any ideas?
Thank you in advance
Here is a solution using a loop.
m <- zelig(outcome ~ time * treatment, model = "ls", data = dat)
output <- NULL
for (i in unique(dat$time)) {
t0 <- setx(m, treatment = 0, time = i)
t1 <- setx(m, treatment = 1, time = i)
ss <- sim(m, x = t0, x1 = t1, num = 10000)
fd <- unlist(ss$sim.out[["x1"]][["fd"]])
r <- data.table(time = i, mean = mean(fd), low = quantile(fd, .025), high = quantile(fd, 0.975))
output <- rbind(output, r)
}
output
time mean low high
1: -12 1.506365 -0.30605416 3.347631
2: -11 1.013915 -0.83479749 2.817791
3: -10 2.673004 0.72371241 4.645537
4: -9 1.291547 -0.62162353 3.183365
5: -8 2.985348 0.59834003 5.351312
6: -7 3.911258 1.95825840 5.878157
7: -6 4.222870 1.86773822 6.567400
8: -5 3.152967 0.81620039 5.483884
9: -4 3.893867 1.77629999 6.003647
10: -3 2.319123 0.35445149 4.278032
11: -2 1.942848 0.03771276 3.844245
12: -1 3.879313 1.92915419 5.852765
13: 0 1.388601 -0.93881332 3.703387
14: 1 3.576107 1.54679622 5.567298
15: 2 2.413652 0.58863014 4.225094
16: 3 2.160988 0.03251586 4.266438
17: 4 2.203825 0.28985053 4.080361
18: 5 4.445642 2.40569051 6.510071
19: 6 1.504513 -0.27797349 3.251175
20: 7 2.542558 0.77794333 4.269277
21: 8 2.682681 0.93322199 4.449863
22: 9 4.271228 2.39189897 6.137469
23: 10 2.540004 0.66875643 4.454354
24: 11 3.454584 1.54938921 5.340096
25: 12 3.682521 1.85539403 5.501669
time mean low high
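For a plain least-squares model, the first difference at each time level is just the treatment coefficient plus the matching interaction coefficient, so it can also be computed without simulation. The sketch below swaps in base R lm() with normal-approximation intervals instead of Zelig's simulated ones, and assumes every time level contains both treated and untreated rows:

```r
set.seed(1)
time      <- factor(rep(-12:12, 50))
treatment <- rbinom(length(time), 1, 0.75)
outcome   <- rnorm(length(time), 1, 3) + 3 * treatment
dat <- data.frame(outcome, time, treatment)

fit <- lm(outcome ~ time * treatment, data = dat)
b <- coef(fit)
V <- vcov(fit)

# For each time level, build a weight vector that picks out
# "treatment" plus the interaction term for that level (if any)
fd <- t(sapply(levels(dat$time), function(l) {
  w <- setNames(numeric(length(b)), names(b))
  w["treatment"] <- 1
  nm <- paste0("time", l, ":treatment")
  if (nm %in% names(b)) w[nm] <- 1   # the reference level has no interaction term
  est <- sum(w * b)
  se  <- sqrt(drop(t(w) %*% V %*% w))
  c(mean = est, low = est - 1.96 * se, high = est + 1.96 * se)
}))
head(fd)
```

The simulated quantiles from sim() should land close to these values, but they will not match exactly.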

Selecting top finite number of rows for each unique value of a column in a data frame in R

I have a data frame with 3 columns: a, b, c. There are multiple rows for each unique value of column a. I want to select the top 5 rows for each unique value of column a. Column c holds some value, and the data frame is already sorted by it in descending order, so that would not be a problem. Can anyone please suggest how I can do this in R?
Stealing @ptocquin's example, here's how you can use the base function by. You can flatten the result using do.call (see below).
> by(data = data, INDICES = data$a, FUN = function(x) head(x, 5))
# or by(data = data, INDICES = data$a, FUN = head, 5)
data$a: 1
a b c
21 1 0.1188552 1.6389895
41 1 1.0182033 1.4811359
61 1 -0.8795879 0.7784072
81 1 0.6485745 0.7734652
31 1 1.5102255 0.7107957
------------------------------------------------------------
data$a: 2
a b c
15 2 -1.09704040 1.1710693
85 2 0.42914795 0.8826820
65 2 -1.01480957 0.6736782
45 2 -0.07982711 0.3693384
35 2 -0.67643885 -0.2170767
------------------------------------------------------------
A similar thing could be achieved by splitting your data.frame based on a and then using lapply to step through each element subsetting first n rows.
split.data <- split(data, data$a)
subsetted.data <- lapply(split.data, FUN = function(x) head(x, 5)) # or ..., FUN = head, 5) like above
flatten.data <- do.call("rbind", subsetted.data)
head(flatten.data)
a b c
1.21 1 0.11885516 1.63898947
1.41 1 1.01820329 1.48113594
1.61 1 -0.87958790 0.77840718
1.81 1 0.64857445 0.77346517
1.31 1 1.51022545 0.71079568
2.15 2 -1.09704040 1.17106930
2.85 2 0.42914795 0.88268205
2.65 2 -1.01480957 0.67367823
2.45 2 -0.07982711 0.36933837
2.35 2 -0.67643885 -0.21707668
Here is my try :
library(plyr)
data <- data.frame(a=rep(sample(1:20,10),10),b=rnorm(100),c=rnorm(100))
data <- data[rev(order(data$c)),]
head(data, 15)
a b c
28 6 1.69611039 1.720081
91 11 1.62656460 1.651574
70 9 -1.17808386 1.641954
6 15 1.23420550 1.603140
23 7 0.70854914 1.588352
51 11 -1.41234359 1.540738
19 10 2.83730734 1.522825
49 10 0.39313579 1.370831
80 9 -0.59445323 1.327825
59 10 -0.55538404 1.214901
18 6 0.08445888 1.152266
86 15 0.53027267 1.066034
69 10 -1.89077464 1.037447
62 1 -0.43599566 1.026505
3 7 0.78544009 1.014770
result <- ddply(data, .(a), "head", 5)
head(result, 15)
a b c
1 1 -0.43599566 1.02650544
2 1 -1.55113486 0.36380251
3 1 0.68608364 0.30911430
4 1 -0.85406406 0.05555500
5 1 -1.83894595 -0.11850847
6 5 -1.79715809 0.77760033
7 5 0.82814909 0.22401278
8 5 -1.52726859 0.06745849
9 5 0.51655092 -0.02737905
10 5 -0.44004646 -0.28106808
11 6 1.69611039 1.72008079
12 6 0.08445888 1.15226601
13 6 -1.99465060 0.82214319
14 6 0.43855489 0.76221979
15 6 -2.15251353 0.64417757
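Another base R option, sketched here on made-up data, is to number the rows within each group with ave() and keep those numbered 5 or less. This keeps the overall sort order instead of regrouping the rows by a:

```r
set.seed(1)
data <- data.frame(a = rep(1:4, each = 8), b = rnorm(32), c = rnorm(32))

# Sort by c in descending order, as in the question
data <- data[order(data$c, decreasing = TRUE), ]

# Running row number within each value of a, in the current row order
rank_in_group <- ave(seq_len(nrow(data)), data$a, FUN = seq_along)

# Keep the first 5 rows of every group
top5 <- data[rank_in_group <= 5, ]
table(top5$a)
```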
