For Loop and Table Printing in R

I'm trying to compute confidence intervals for many rows of a table using a for loop, and would like output that is more readable. Here is a snippet of the data.
QUESTION X_YEAR X_PARTNER X_CAMP X_N X_CODE1
1 Q1 2011 SCSD ITC 15 4
2 Q1 2011 SCSD Nottingham 4 1
3 Q1 2011 SCSD ALL 19 5
4 Q1 2011 CP CP1 18 4
5 Q1 2011 ALL ALL 37 9
6 Q1 2012 SCSD ITC 8 1
7 Q1 2012 SCSD Nottingham 8 2
8 Q1 2012 SCSD ALL 16 3
9 Q1 2012 CP CP1 18 2
10 Q1 2012 CP CP1 22 2
11 Q1 2012 CP ALL 40 4
I'm trying to print out a confidence interval, with the Question, Year and Camp included. I'd like the output to be in table form like this
QUESTION YEAR CAMP X N MEAN LOWER UPPER
Q1 2011 ITC 4 15 0.26 0.07 0.55
Q1 2011 NOTTINGHAM 1 4 0.25 0.006 0.8
with the first three columns being taken directly from the data table, and the latter 4 extracted from a confidence interval test I'm using.
The code I'm currently using:
for (i in 1:26){
print(data[i,1],max.levels=0)
print(data[i,2],max.levels=0)
print(data[i,4],max.levels=0)
print(binom.confint(data[i,6],data[i,5],conf.level=0.95,methods="exact"))
}
produces output that will be far too time-consuming to sift through (I have a lot more data than the snippet):
[1] Q1
[1] 2011
[1] ITC
method x n mean lower upper
1 exact 4 15 0.2666667 0.07787155 0.5510032
[1] Q1
[1] 2011
[1] Nottingham
method x n mean lower upper
1 exact 1 4 0.25 0.006309463 0.8058796
Any advice is appreciated!

If df is the name of your data, and you only want to do this for the rows where QUESTION is Q1 (see comments), then
library(binom)
df2 <- df[df$QUESTION == "Q1",]
x <- vector("list", nrow(df2))
for(i in seq_len(nrow(df2))) {
x[[i]] <- binom.confint(df2[i,6], df2[i,5], methods = "exact")
}
cbind(df2[c(1,2,4)], do.call(rbind, x)[,-1])
# QUESTION X_YEAR X_CAMP x n mean lower upper
# 1 Q1 2011 ITC 4 15 0.26666667 0.077871546 0.5510032
# 2 Q1 2011 Nottingham 1 4 0.25000000 0.006309463 0.8058796
# 3 Q1 2011 ALL 5 19 0.26315789 0.091465785 0.5120293
# 4 Q1 2011 CP1 4 18 0.22222222 0.064092048 0.4763728
# 5 Q1 2011 ALL 9 37 0.24324324 0.117725174 0.4119917
# 6 Q1 2012 ITC 1 8 0.12500000 0.003159724 0.5265097
# 7 Q1 2012 Nottingham 2 8 0.25000000 0.031854026 0.6508558
# 8 Q1 2012 ALL 3 16 0.18750000 0.040473734 0.4564565
# 9 Q1 2012 CP1 2 18 0.11111111 0.013751216 0.3471204
# 10 Q1 2012 CP1 2 22 0.09090909 0.011205586 0.2916127
# 11 Q1 2012 ALL 4 40 0.10000000 0.027925415 0.2366374
Note that conf.level = 0.95 is the default setting for binom.confint, so you don't need to include it in your call.
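If installing the binom package is not an option, the same table can be sketched in base R. This is only an illustration under a few assumptions: it uses binom.test from the stats package (which also reports a Clopper-Pearson "exact" interval), it rebuilds just the first two rows of the question's data, and the helper names ci and result are made up for the example.

```r
# Two rows copied from the question's snippet (only the columns used here)
df <- data.frame(
  QUESTION = "Q1",
  X_YEAR   = c(2011, 2011),
  X_CAMP   = c("ITC", "Nottingham"),
  X_N      = c(15, 4),
  X_CODE1  = c(4, 1)
)

# One exact interval per row: mapply() walks over successes and trials in parallel
ci <- t(mapply(
  function(x, n) binom.test(x, n, conf.level = 0.95)$conf.int,
  df$X_CODE1, df$X_N
))

# Bind the identifying columns to the interval columns
result <- data.frame(
  df[c("QUESTION", "X_YEAR", "X_CAMP")],
  x     = df$X_CODE1,
  n     = df$X_N,
  mean  = df$X_CODE1 / df$X_N,
  lower = ci[, 1],
  upper = ci[, 2]
)
print(result)
```

The result has one row per input row, so it can be printed, rounded, or written to a file in one step instead of being read line by line from the console.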

Related

counting NA from R Dataframe in a for loop

I have a time series data frame in R covering 2011 to 2018. How can I write a for loop that counts the number of NA values for each year separately and, if a given year has more than x % missing, drops that year or handles it some other way?
please refer to the image to see how my Dataframe looks like.
https://i.stack.imgur.com/2fwDk.png
years_values <- 2011:2020
years = pretty(years_values,n=10)
count = 0
for (y in years){
for (j in df$Flow == y) {
if (is.na(df$Flow[j]){
count = count+1
}
}
if (count) > 1{
bfi = BFI(df$Flow == y)}
else {bfi = NA}
}
I am trying to use this code to loop over each year and count the NA values. If the share of NA values is greater than 1%, I do not want to compute the BFI; if it is less, I want to compute it. The BFI function itself works well. The problem I have is formulating this loop.
Since you have not included any reproducible data, let us take a simple example that captures the essence of your own data. We have a column called Year and one called Flow that contains some missing values:
df <- data.frame(Year = rep(2011:2013, each = 4),
Flow = c(1, 2, NA, NA, 5, 6, NA, 8, 9, 10, 11, 12))
df
#> Year Flow
#> 1 2011 1
#> 2 2011 2
#> 3 2011 NA
#> 4 2011 NA
#> 5 2012 5
#> 6 2012 6
#> 7 2012 NA
#> 8 2012 8
#> 9 2013 9
#> 10 2013 10
#> 11 2013 11
#> 12 2013 12
Now suppose we want to count the number of missing values in each year. We can use table and is.na, like this:
tab <- table(df$Year, is.na(df$Flow))
tab
#>
#> FALSE TRUE
#> 2011 2 2
#> 2012 3 1
#> 2013 4 0
We can see that these are the absolute counts of missing values, but we can convert this into proportions by dividing the second column by the row sums of this table:
props <- tab[,2] / rowSums(tab)
props
#> 2011 2012 2013
#> 0.50 0.25 0.00
Now, suppose we want to find and remove the years where more than 33% of cases are missing. We can just filter the values of props that are greater than 0.33 and get the associated year (or years):
years_to_drop <- names(props)[props > 0.33]
years_to_drop
#> [1] "2011"
Now we can use this to remove the years with more than 33% missing values from our original data frame by doing:
df[!df$Year %in% years_to_drop,]
#> Year Flow
#> 5 2012 5
#> 6 2012 6
#> 7 2012 NA
#> 8 2012 8
#> 9 2013 9
#> 10 2013 10
#> 11 2013 11
#> 12 2013 12
Created on 2022-11-14 with reprex v2.0.2
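The original question also wanted to run a function per year only when the share of missing values is small enough. A minimal sketch of that step, assuming the same synthetic data: summarise_flow is a hypothetical stand-in for the asker's BFI function (which is not shown in the question), and threshold is an arbitrary cut-off chosen for the example.

```r
df <- data.frame(Year = rep(2011:2013, each = 4),
                 Flow = c(1, 2, NA, NA, 5, 6, NA, 8, 9, 10, 11, 12))

threshold <- 0.33  # assumed cut-off: skip years with more than 33% missing

# Hypothetical placeholder for the asker's BFI(); here just a mean of observed values
summarise_flow <- function(flow) mean(flow, na.rm = TRUE)

# split() gives one Flow vector per year; sapply() applies the rule to each
per_year <- sapply(split(df$Flow, df$Year), function(flow) {
  if (mean(is.na(flow)) > threshold) NA_real_ else summarise_flow(flow)
})
print(per_year)
```

The result is a named vector with one entry per year, NA for the years that were skipped.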
As Allan Cameron suggests, there's no need to use a loop, and R is usually more efficient with vectorized operations anyway.
I would suggest a solution based on ave (using the synthetic data from the previous answer)
df$NA_fraction <- ave(df$Flow, df$Year, FUN = \(values) mean(is.na(values)))
df
Year Flow NA_fraction
1 2011 1 0.50
2 2011 2 0.50
3 2011 NA 0.50
4 2011 NA 0.50
5 2012 5 0.25
6 2012 6 0.25
7 2012 NA 0.25
8 2012 8 0.25
9 2013 9 0.00
10 2013 10 0.00
11 2013 11 0.00
12 2013 12 0.00
You can then pick whatever threshold and filter by it
> df[df$NA_fraction < 0.3,]
Year Flow NA_fraction
5 2012 5 0.25
6 2012 6 0.25
7 2012 NA 0.25
8 2012 8 0.25
9 2013 9 0.00
10 2013 10 0.00
11 2013 11 0.00
12 2013 12 0.00

Search in a column based on the value of a different column

I have a simple table with three columns ("Year", "Target", "Value") and I would like to create a new column (Resp) containing, for each row, the first later "Year" in which "Value" is higher than that row's "Target".
This is part of the table:
db <- data.frame(Year=2010:2017, Target=c(3,5,2,7,5,8,3,6), Value=c(4,5,2,7,4,9,5,8))
print(db)
Year Target Value
1 2010 3 4
2 2011 5 5
3 2012 2 2
4 2013 7 3
5 2014 5 4
6 2015 8 9
7 2016 3 5
8 2017 6 8
The intended result is:
Year Target Value Resp
1 2010 3 4 2011
2 2011 5 5 2015
3 2012 2 2 2013
4 2013 7 3 2015
5 2014 5 4 2015
6 2015 8 9 NA
7 2016 3 5 2017
8 2017 6 8 NA
Any suggestion how can I solve this problem?
In addition to the 'Resp' column, I want to create a new one (Black.Y) containing the "Year" corresponding to the minimum of "Value" until 'Value' is higher than "Target".
The intended result is:
Year Target Value Resp Black.Y
1 2010 3 4 2011 NA
2 2011 5 5 2015 2012
3 2012 2 2 2013 NA
4 2013 7 3 2015 2014
5 2014 5 4 2015 NA
6 2015 8 9 NA 2016
7 2016 3 5 2017 NA
8 2017 6 8 NA NA
Any suggestion how can I solve this problem?
Here's an approach in base R:
o <- outer(db$Target, db$Value, `<`) # compute a logical matrix
o[lower.tri(o, diag = TRUE)] <- FALSE # replace lower.tri and diag with FALSE
idx <- max.col(o, ties.method = "first") # get the index of the first maximum
idx <- replace(idx, rowSums(o) == 0, NA) # take care of cases without greater Value
db$Resp <- db$Year[idx] # add new column
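The resulting table is shown below. It was computed from the data.frame() call in the question, where Value for 2013 is 7; the question's printed table shows 3 instead, and with that value Resp for 2011 would be 2015, as in the expected result: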
# Year Target Value Resp
# 1 2010 3 4 2011
# 2 2011 5 5 2013
# 3 2012 2 2 2013
# 4 2013 7 7 2015
# 5 2014 5 4 2015
# 6 2015 8 9 NA
# 7 2016 3 5 2017
# 8 2017 6 8 NA
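For comparison, here is a more literal sketch of the same rule, written row by row. It assumes the Value column as printed in the question's table (3 for 2013, not the 7 in the data.frame() call), and the helper variable later is introduced only for this example.

```r
# Value column taken from the question's printed table (2013 -> 3)
db <- data.frame(Year   = 2010:2017,
                 Target = c(3, 5, 2, 7, 5, 8, 3, 6),
                 Value  = c(4, 5, 2, 3, 4, 9, 5, 8))

n <- nrow(db)

# For each row i, find the first later row whose Value exceeds Target[i]
db$Resp <- sapply(seq_len(n), function(i) {
  later <- which(seq_len(n) > i & db$Value > db$Target[i])
  if (length(later) == 0) NA_integer_ else db$Year[later[1]]
})
print(db)
```

This is slower than the outer() approach on large tables, but it makes the rule explicit: only rows after the current one are considered, and rows with no later match get NA.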

R, lag( ) has inconsistent behavior for xts and ts objects

I would like to take a lag of an xts variable, and the lag() function returns a lag. However, if I use it on a ts variable, it gives a lead. Is this a bug, or working as intended?
library('xts')
a = as.xts(ts(c(5,3,7,2,4,8,3), start=c(1980,1), freq=4))
cbind(a, lag(a)) # provides lag 1
# ..1 ..2
# 1980 Q1 5 NA
# 1980 Q2 3 5
# 1980 Q3 7 3
# 1980 Q4 2 7
# 1981 Q1 4 2
# 1981 Q2 8 4
# 1981 Q3 3 8
b = ts(c(5,3,7,2,4,8,3), start=c(1980,1), freq=4)
cbind(b, lag(b)) # provides lead 1
# b lag(b)
# 1979 Q4 NA 5
# 1980 Q1 5 3
# 1980 Q2 3 7
# 1980 Q3 7 2
# 1980 Q4 2 4
# 1981 Q1 4 8
# 1981 Q2 8 3
# 1981 Q3 3 NA
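As the documentation in ?lag.xts points out, this is the intended behavior: lag.xts was deliberately written so that a positive k returns the value from the previous period (the usual time-series meaning of a lag), whereas stats::lag on a ts object shifts the time base back by k, which reads as a lead when the two series are bound together.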
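To get a conventional lag from a ts object, a negative k can be passed to stats::lag. A small sketch using the series from the question (the names lagged and both are just for illustration):

```r
b <- ts(c(5, 3, 7, 2, 4, 8, 3), start = c(1980, 1), frequency = 4)

# Negative k moves the time base forward, so each value lines up with the next period
lagged <- stats::lag(b, k = -1)

# cbind() on ts objects aligns on time and pads with NA
both <- cbind(b, lagged)
print(both)
```

In the combined series the first row has NA in the second column, and every later row shows the previous quarter's value next to the current one, which matches what lag() gives for the xts object.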

Subset by multiple conditions

Maybe it's something basic, but I couldn't find the answer.
I have
Id Year V1
1 2009 33
1 2010 67
1 2011 38
2 2009 45
3 2009 65
3 2010 74
4 2009 47
4 2010 51
4 2011 14
I need to select only the rows whose Id appears in all three years 2009, 2010 and 2011.
Id Year V1
1 2009 33
1 2010 67
1 2011 38
4 2009 47
4 2010 51
4 2011 14
I try
d1_3 <- subset(d1, Year==2009 |Year==2010 |Year==2011 )
but it doesn't work.
Can anyone suggest how I can do this in R?
I think ave could be useful here. I call your original data frame 'df'. For each Id, check if 2009-2011 is present in Year (2009:2011 %in% x). This gives a logical vector, which can be summed. Test if the sum equals 3 (if all Years are present, the sum is 3), which results in a new logical vector, which is used to subset rows of the data frame.
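# ave() returns 0/1 here because Year is numeric, so convert back to logical before subsetting
df[as.logical(ave(df$Year, df$Id, FUN = function(x) sum(2009:2011 %in% x) == 3)), ]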
# Id Year V1
# 1 1 2009 33
# 2 1 2010 67
# 3 1 2011 38
# 7 4 2009 47
# 8 4 2010 51
# 9 4 2011 14
Another way of using ave
DF
## Id Year V1
## 1 1 2009 33
## 2 1 2010 67
## 3 1 2011 38
## 4 2 2009 45
## 5 3 2009 65
## 6 3 2010 74
## 7 4 2009 47
## 8 4 2010 51
## 9 4 2011 14
DF[ave(DF$Year, DF$Id, FUN = function(x) all(2009:2011 %in% x)) == 1, ]
## Id Year V1
## 1 1 2009 33
## 2 1 2010 67
## 3 1 2011 38
## 7 4 2009 47
## 8 4 2010 51
## 9 4 2011 14
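The comparison with 1 in the call above matters because ave() writes the result back into a copy of its first argument, so a logical result is stored as 0/1 when Year is numeric. A short sketch that makes this visible, assuming the data are rebuilt by hand (flag and keep are names made up for the example):

```r
DF <- data.frame(Id   = c(1, 1, 1, 2, 3, 3, 4, 4, 4),
                 Year = c(2009, 2010, 2011, 2009, 2009, 2010, 2009, 2010, 2011),
                 V1   = c(33, 67, 38, 45, 65, 74, 47, 51, 14))

# all() returns TRUE/FALSE per Id, but ave() stores it in a numeric vector
flag <- ave(DF$Year, DF$Id, FUN = function(x) all(2009:2011 %in% x))
print(class(flag))

# Turn the 0/1 flags back into a logical index before subsetting
keep <- flag == 1
print(DF[keep, ])
```

Indexing with the raw 0/1 vector would select row 1 repeatedly instead of filtering, which is why the explicit comparison (or as.logical) is needed.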
This should do the job :)
library(plyr)
ds<-ddply(ds,.(Id),mutate,Nobs=length(Year))
ds[ds$Nobs == 3 & ds$Year %in% 2009:2011,]
I think an approach using ave is reasonable. But there are lots of ways to solve this problem. I show a few other ways using base R. Then in the last 2 examples I'll introduce the package data.table.
Again, just throwing this out there to provide some options to use different aspects of the language.
d1 <- data.frame(ID=c(1,1,1,2,3,3,4,4,4), Year=c(2009,2010,2011, 2009,2009, 2010, 2009, 2010, 2011), V1=c(33, 67, 38, 45, 65, 74, 47, 51, 14))
# long way
use_years <- as.character(2009:2011)
cnts <- table(d1[,c("ID","Year")])[,use_years]
use_id <- rownames(cnts)[rowSums(cnts)==length(use_years)]
d1[d1[,"ID"]%in%use_id,]
# ID Year V1
# 1 1 2009 33
# 2 1 2010 67
# 3 1 2011 38
# 7 4 2009 47
# 8 4 2010 51
# 9 4 2011 14
# another longish way
ind1 <- d1[,"Year"]%in%2009:2011
d1_ind <- d1[ind1,"ID"]
ind2 <- d1_ind %in% unique(d1_ind)[tabulate(d1_ind)==3]
d1[ind1,][ind2,]
# ID Year V1
# 1 1 2009 33
# 2 1 2010 67
# 3 1 2011 38
# 7 4 2009 47
# 8 4 2010 51
# 9 4 2011 14
OK, let's try out a couple methods using data.table. One of my favorite packages of all time. Can be a little tricky at first though, so make sure your boots are on tight (Oh, yeah, it's fast!) :)
# medium way
library(data.table)
d2 <- as.data.table(d1)
d2[ID%in%d2[Year%in%2009:2011, list(logic=nrow(.SD)==3),by="ID"][(logic),ID]]
# ID Year V1
# 1: 1 2009 33
# 2: 1 2010 67
# 3: 1 2011 38
# 4: 4 2009 47
# 5: 4 2010 51
# 6: 4 2011 14
# short way
d2[Year%in%2009:2011][ID%in%unique(ID)[table(ID)==3]]
# ID Year V1
# 1: 1 2009 33
# 2: 1 2010 67
# 3: 1 2011 38
# 4: 4 2009 47
# 5: 4 2010 51
# 6: 4 2011 14

How to calculate growth rate in long format data frame?

With data structured as follows...
df <- data.frame(Category=c(rep("A",6),rep("B",6)),
Year=rep(2010:2015,2),Value=1:12)
I'm having a tough time creating a growth rate column (by year) within category. Can anyone help with code to create something like this...
Category Year Value Growth
A 2010 1
A 2011 2 1.000
A 2012 3 0.500
A 2013 4 0.333
A 2014 5 0.250
A 2015 6 0.200
B 2010 7
B 2011 8 0.143
B 2012 9 0.125
B 2013 10 0.111
B 2014 11 0.100
B 2015 12 0.091
For these sorts of questions ("how do I compute XXX by category YYY?") there are always solutions based on by(), the data.table package, and plyr. I generally prefer plyr, which is often slower, but (to me) more transparent/elegant.
df <- data.frame(Category=c(rep("A",6),rep("B",6)),
Year=rep(2010:2015,2),Value=1:12)
library(plyr)
ddply(df,"Category",transform,
Growth=c(NA,exp(diff(log(Value)))-1))
The main difference between this answer and @krlmr's is that I am using a geometric-mean trick (taking differences of logs and then exponentiating) while @krlmr computes an explicit ratio.
Mathematically, diff(log(Value)) is taking the differences of the logs, i.e. log(x[t+1])-log(x[t]) for all t. When we exponentiate that we get the ratio x[t+1]/x[t] (because exp(log(x[t+1])-log(x[t])) = exp(log(x[t+1]))/exp(log(x[t])) = x[t+1]/x[t]). The OP wanted the fractional change rather than the multiplicative growth rate (i.e. x[t+1]==x[t] corresponds to a fractional change of zero rather than a multiplicative growth rate of 1.0), so we subtract 1.
I am also using transform() for a little bit of extra "syntactic sugar", to avoid creating a new anonymous function.
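The equivalence between the log-difference form and the explicit ratio can be checked numerically. A minimal sketch using the category B values from the question (via_logs and via_ratio are names chosen for the example):

```r
value <- c(7, 8, 9, 10, 11, 12)

# Difference of logs, exponentiated, minus 1
via_logs <- exp(diff(log(value))) - 1

# Explicit fractional change: (x[t+1] - x[t]) / x[t]
via_ratio <- diff(value) / head(value, -1)

# The two agree up to floating-point error
print(all.equal(via_logs, via_ratio))
print(round(via_ratio, 3))
```

Both vectors have one element fewer than the input, which is why the answers prepend NA for the first year of each category.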
You can simply use the dplyr package:
> library(dplyr)
> df %>% group_by(Category) %>% mutate(Growth = (Value - lag(Value))/lag(Value))
which will produce the following result:
# A tibble: 12 x 4
# Groups: Category [2]
Category Year Value Growth
<fct> <int> <int> <dbl>
1 A 2010 1 NA
2 A 2011 2 1
3 A 2012 3 0.5
4 A 2013 4 0.333
5 A 2014 5 0.25
6 A 2015 6 0.2
7 B 2010 7 NA
8 B 2011 8 0.143
9 B 2012 9 0.125
10 B 2013 10 0.111
11 B 2014 11 0.1
12 B 2015 12 0.0909
Using R base function (ave)
> df$Growth <- with(df, ave(Value, Category,
FUN=function(x) c(NA, diff(x)/x[-length(x)]) ))
> df
Category Year Value Growth
1 A 2010 1 NA
2 A 2011 2 1.00000000
3 A 2012 3 0.50000000
4 A 2013 4 0.33333333
5 A 2014 5 0.25000000
6 A 2015 6 0.20000000
7 B 2010 7 NA
8 B 2011 8 0.14285714
9 B 2012 9 0.12500000
10 B 2013 10 0.11111111
11 B 2014 11 0.10000000
12 B 2015 12 0.09090909
@Ben Bolker's answer is easily adapted to ave:
transform(df, Growth=ave(Value, Category,
FUN=function(x) c(NA,exp(diff(log(x)))-1)))
Very easy with plyr:
library(plyr)
ddply(df, .(Category),
function (d) {
d$Growth <- c(NA, tail(d$Value, -1) / head(d$Value, -1) - 1)
d
}
)
We have two problems here:
Splitting by category
Computing the growth rate
ddply is the workhorse: the split and the function that computes the growth rate are both defined by arguments to this function.
A more elegant variant based on Ben's idea with the new gdiff function in my R package:
df <- data.frame(Category=c(rep("A",6),rep("B",6)),
Year=rep(2010:2015,2),Value=1:12)
library(plyr)
ddply(df, "Category", transform,
Growth=c(NA, kimisc::gdiff(Value, FUN = `/`)-1))
Here, gdiff is used to compute a lagged rate (instead of a lagged difference as diff would).
Many years later: the tsbox package aims to work with all kinds of time series objects, including data frames, and offers a standard time series toolkit. Thus, calculating growth rates is as simple as:
df <- data.frame(Category=c(rep("A",6),rep("B",6)),
Year=rep(2010:2015,2),Value=1:12)
library(tsbox)
ts_pc(df)
#> [time]: 'Year' [value]: 'Value'
#> Category Year Value
#> 1 A 2010-01-01 NA
#> 2 A 2011-01-01 100.000000
#> 3 A 2012-01-01 50.000000
#> 4 A 2013-01-01 33.333333
#> 5 A 2014-01-01 25.000000
#> 6 A 2015-01-01 20.000000
#> 7 B 2010-01-01 NA
#> 8 B 2011-01-01 14.285714
#> 9 B 2012-01-01 12.500000
#> 10 B 2013-01-01 11.111111
#> 11 B 2014-01-01 10.000000
#> 12 B 2015-01-01 9.090909
The collapse package, available on CRAN, provides an easy and fully C/C++ based solution to these kinds of problems, with the generic function fgrowth and the associated growth operator G:
df <- data.frame(Category=c(rep("A",6),rep("B",6)),
Year=rep(2010:2015,2),Value=1:12)
library(collapse)
G(df, by = ~Category, t = ~Year)
Category Year G1.Value
1 A 2010 NA
2 A 2011 100.000000
3 A 2012 50.000000
4 A 2013 33.333333
5 A 2014 25.000000
6 A 2015 20.000000
7 B 2010 NA
8 B 2011 14.285714
9 B 2012 12.500000
10 B 2013 11.111111
11 B 2014 10.000000
12 B 2015 9.090909
# fgrowth is more of a programmer's function; you can do:
fgrowth(df$Value, 1, 1, df$Category, df$Year)
[1] NA 100.000000 50.000000 33.333333 25.000000 20.000000 NA 14.285714 12.500000 11.111111 10.000000 9.090909
# Which means: Calculate the growth rate of Value, using 1 lag, and iterated 1 time (you can compute arbitrary sequences of lagged / leaded and iterated growth rates with these functions), identified by Category and Year.
fgrowth / G also has methods for the plm::pseries and plm::pdata.frame classes available in the plm package.
