Counting NA values per year in an R data frame with a for loop - r

I have a time series data frame in R covering 2011 to 2018. How can I write a for loop that counts the number of NA values in each year separately and, if a given year has more than x% missing values, drops that year or takes some other action?
Please refer to the image to see what my data frame looks like.
https://i.stack.imgur.com/2fwDk.png
years_values <- 2011:2020
years = pretty(years_values,n=10)
count = 0
for (y in years){
for (j in df$Flow == y) {
if (is.na(df$Flow[j]){
count = count+1
}
}
if (count) > 1{
bfi = BFI(df$Flow == y)}
else {bfi = NA}
}
I am trying to use this code to loop over each year and count the NA values. If the share of NA values in a year is greater than 1%, I do not want to compute the BFI; if it is lower, I want to compute the BFI. My BFI function already works well. The problem is formulating this loop.

Since you have not included any reproducible data, let us take a simple example that captures the essence of your own data. We have a column called Year and one called Flow that contains some missing values:
df <- data.frame(Year = rep(2011:2013, each = 4),
Flow = c(1, 2, NA, NA, 5, 6, NA, 8, 9, 10, 11, 12))
df
#> Year Flow
#> 1 2011 1
#> 2 2011 2
#> 3 2011 NA
#> 4 2011 NA
#> 5 2012 5
#> 6 2012 6
#> 7 2012 NA
#> 8 2012 8
#> 9 2013 9
#> 10 2013 10
#> 11 2013 11
#> 12 2013 12
Now suppose we want to count the number of missing values in each year. We can use table and is.na, like this:
tab <- table(df$Year, is.na(df$Flow))
tab
#>
#> FALSE TRUE
#> 2011 2 2
#> 2012 3 1
#> 2013 4 0
We can see that these are the absolute counts of missing values, but we can convert this into proportions by dividing the second column by the row sums of this table:
props <- tab[,2] / rowSums(tab)
props
#> 2011 2012 2013
#> 0.50 0.25 0.00
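A more direct route to the same proportions, sketched here with base R only, relies on the fact that the mean of a logical vector is the share of TRUE values. The name props_alt is introduced purely for this illustration.

```r
df <- data.frame(Year = rep(2011:2013, each = 4),
                 Flow = c(1, 2, NA, NA, 5, 6, NA, 8, 9, 10, 11, 12))

# mean() of a logical vector is the proportion of TRUE values,
# so this gives the fraction of missing Flow values within each year
props_alt <- tapply(is.na(df$Flow), df$Year, mean)
props_alt
#> 2011 2012 2013
#> 0.50 0.25 0.00
```

This skips the intermediate table, which can be convenient when only the proportions are needed.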
Now, suppose we want to find and remove the years where more than 33% of cases are missing. We can just filter the values of props that are greater than 0.33 and get the associated year (or years):
years_to_drop <- names(props)[props > 0.33]
years_to_drop
#> [1] "2011"
Now we can use this to remove the years with more than 33% missing values from our original data frame by doing:
df[!df$Year %in% years_to_drop,]
#> Year Flow
#> 5 2012 5
#> 6 2012 6
#> 7 2012 NA
#> 8 2012 8
#> 9 2013 9
#> 10 2013 10
#> 11 2013 11
#> 12 2013 12
Created on 2022-11-14 with reprex v2.0.2
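If the goal is to run a function on each year only when that year passes the missing-value threshold, as in the question, a loop over the years is one possible formulation. The sketch below is only an illustration: bfi_stub is a hypothetical stand-in for the asker's BFI function (which is not shown in the question), and the 0.33 threshold is the one used in the example above.

```r
df <- data.frame(Year = rep(2011:2013, each = 4),
                 Flow = c(1, 2, NA, NA, 5, 6, NA, 8, 9, 10, 11, 12))

# Hypothetical stand-in for the asker's BFI() function (not defined here);
# it simply averages the non-missing flows so the example runs on its own
bfi_stub <- function(flow) mean(flow, na.rm = TRUE)

threshold <- 0.33                        # maximum tolerated share of NA per year
years <- unique(df$Year)
bfi <- setNames(rep(NA_real_, length(years)), years)

for (y in years) {
  flow_y <- df$Flow[df$Year == y]        # flows recorded in year y
  if (mean(is.na(flow_y)) <= threshold) {
    bfi[as.character(y)] <- bfi_stub(flow_y)
  }                                      # otherwise the entry stays NA
}
bfi
```

Pre-filling the result with NA means years that fail the threshold need no explicit else branch.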

As Allan Cameron suggests, there's no need to use a loop, and R is usually more efficient with vectorised operations anyway.
I would suggest a solution based on ave (using the synthetic data from the previous answer):
df$NA_fraction <- ave(df$Flow, df$Year, FUN = \(values) mean(is.na(values)))
df
Year Flow NA_fraction
1 2011 1 0.50
2 2011 2 0.50
3 2011 NA 0.50
4 2011 NA 0.50
5 2012 5 0.25
6 2012 6 0.25
7 2012 NA 0.25
8 2012 8 0.25
9 2013 9 0.00
10 2013 10 0.00
11 2013 11 0.00
12 2013 12 0.00
You can then pick whatever threshold and filter by it
> df[df$NA_fraction < 0.3,]
Year Flow NA_fraction
5 2012 5 0.25
6 2012 6 0.25
7 2012 NA 0.25
8 2012 8 0.25
9 2013 9 0.00
10 2013 10 0.00
11 2013 11 0.00
12 2013 12 0.00

Related

how to replace missing values with previous year's binned mean

I have a data frame as below
Bin_p1 and Bin_f1 were calculated with the cut function:
Bins <- function(x) cut(x, breaks = c(0, seq(1, 1000, by = 5)), labels = 1:200)
binned <- as.data.frame (sapply(df[,-1], Bins))
colnames(binned) <- paste("Bin", colnames(binned), sep = "_")
df<- cbind(df, binned)
Now, how can I calculate the mean of the previous two years and use it to replace the NA values within that bin?
For example, in row 5 the value of p1 is NA and f1 is 30, with corresponding bin 7. I want to replace the NA with the mean of p1 over the previous two years for the same bin (7), i.e.
df
ID year p1 f1 Bin_p1 Bin_f1
1 2013 20 30 5 7
2 2013 24 29 5 7
3 2014 10 16 2 3
4 2014 11 17 2 3
5 2015 NA 30 NA 7
6 2016 10 NA 2 NA
df1
ID year p1 f1 Bin_p1 Bin_f1
1 2013 20 30 5 7
2 2013 24 29 5 7
3 2014 10 16 2 3
4 2014 11 17 2 3
5 2015 **22** 30 NA 7
6 2016 10 **16.5** 2 NA
Thanks in advance
I believe the following code produces the desired output. There's probably a much more elegant way than using mean(rev(lag(f1))[1:2]) to get the average of the last two values of f1 but this should do the trick anyway.
library(dplyr)
df %>%
arrange(year) %>%
mutate_at(c("p1", "f1"), "as.double") %>%
group_by(Bin_p1) %>%
mutate(f1 = ifelse(is.na(f1), mean(rev(lag(f1))[1:2]), f1)) %>%
group_by(Bin_f1) %>%
mutate(p1 = ifelse(is.na(p1), mean(rev(lag(p1))[1:2]), p1)) %>%
ungroup
and the output is:
# A tibble: 6 x 6
ID year p1 f1 Bin_p1 Bin_f1
<int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2013 20 30.0 5 7
2 2 2013 24 29.0 5 7
3 3 2014 10 16.0 2 3
4 4 2014 11 17.0 2 3
5 5 2015 22 30.0 NA 7
6 6 2016 10 16.5 2 NA

Search in a column based on the value of a different column

I have a simple table with three columns ("Year", "Target", "Value") and I would like to create a new column (Resp) containing the "Year" where "Value" is higher than "Target". The selected value (from column "Year") corresponds to the first time that "Value" is higher than "Target".
This is part of the table:
db <- data.frame(Year=2010:2017, Target=c(3,5,2,7,5,8,3,6), Value=c(4,5,2,7,4,9,5,8))
print(db)
Year Target Value
1 2010 3 4
2 2011 5 5
3 2012 2 2
4 2013 7 3
5 2014 5 4
6 2015 8 9
7 2016 3 5
8 2017 6 8
The intended result is:
Year Target Value Resp
1 2010 3 4 2011
2 2011 5 5 2015
3 2012 2 2 2013
4 2013 7 3 2015
5 2014 5 4 2015
6 2015 8 9 NA
7 2016 3 5 2017
8 2017 6 8 NA
Any suggestions on how I can solve this problem?
In addition to the 'Resp' column, I want to create a new one (Black.Y) containing the "Year" corresponding to the minimum of "Value" until "Value" is higher than "Target".
The intended result is:
Year Target Value Resp Black.Y
1 2010 3 4 2011 NA
2 2011 5 5 2015 2012
3 2012 2 2 2013 NA
4 2013 7 3 2015 2014
5 2014 5 4 2015 NA
6 2015 8 9 NA 2016
7 2016 3 5 2017 NA
8 2017 6 8 NA NA
Any suggestions on how I can solve this problem?
Here's an approach in base R:
o <- outer(db$Target, db$Value, `<`) # compute a logical matrix
o[lower.tri(o, diag = TRUE)] <- FALSE # replace lower.tri and diag with FALSE
idx <- max.col(o, ties.method = "first") # get the index of the first maximum
idx <- replace(idx, rowSums(o) == 0, NA) # take care of cases without greater Value
db$Resp <- db$Year[idx] # add new column
The resulting table is:
# Year Target Value Resp
# 1 2010 3 4 2011
# 2 2011 5 5 2013
# 3 2012 2 2 2013
# 4 2013 7 7 2015
# 5 2014 5 4 2015
# 6 2015 8 9 NA
# 7 2016 3 5 2017
# 8 2017 6 8 NA
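For readers who find the matrix approach hard to follow, the same logic can be sketched row by row. This is only an illustrative alternative, not the answer's method: resp_alt is a name introduced here, and the data are those produced by the data.frame() call in the question.

```r
db <- data.frame(Year   = 2010:2017,
                 Target = c(3, 5, 2, 7, 5, 8, 3, 6),
                 Value  = c(4, 5, 2, 7, 4, 9, 5, 8))

n <- nrow(db)
resp_alt <- vapply(seq_len(n), function(i) {
  # rows after row i whose Value exceeds row i's Target
  later <- which(seq_len(n) > i & db$Value > db$Target[i])
  # keep the Year of the first such row, or NA if there is none
  if (length(later) > 0) db$Year[later[1]] else NA_integer_
}, integer(1))
resp_alt
#> [1] 2011 2013 2013 2015 2015   NA 2017   NA
```

The outer() version does all comparisons at once and is likely faster on large tables; the row-wise version trades speed for readability.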

Performing a dplyr full_join without a common variable to blend data frames

Using the dplyr full_join() operation, I am trying to perform the equivalent of a basic merge() operation in which no common variables exist (unable to satisfy the "by=" argument). This will blend two data frames and return all possible combinations.
However, the current full_join() function requires a common variable. I am unable to locate another dplyr function that can help with this. How can I perform this operation using functions specific to the dplyr library?
df_a = data.frame(department=c(1,2,3,4))
df_b = data.frame(period=c(2014,2015,2016,2017))
#This works as desired
big_df = merge(df_a,df_b)
#I'd like to perform the following in a much bigger operation:
big_df = dplyr::full_join(df_a,df_b)
#Error: No common variables. Please specify `by` param.
You can use crossing from tidyr:
crossing(df_a,df_b)
department period
1 1 2014
2 1 2015
3 1 2016
4 1 2017
5 2 2014
6 2 2015
7 2 2016
8 2 2017
9 3 2014
10 3 2015
11 3 2016
12 3 2017
13 4 2014
14 4 2015
15 4 2016
16 4 2017
If there are duplicate rows, crossing doesn't give the same result as merge.
Instead use full_join with by = character() to perform a cross-join which generates all combinations of df_a and df_b.
library("tidyverse") # version 1.3.2
# Add duplicate rows for illustration.
df_a <- tibble(department = c(1, 2, 3, 3))
df_b <- tibble(period = c(2014, 2015, 2016, 2017))
merge doesn't de-duplicate.
df_a_merge_b <- merge(df_a, df_b)
df_a_merge_b
#> department period
#> 1 1 2014
#> 2 2 2014
#> 3 3 2014
#> 4 3 2014
#> 5 1 2015
#> 6 2 2015
#> 7 3 2015
#> 8 3 2015
#> 9 1 2016
#> 10 2 2016
#> 11 3 2016
#> 12 3 2016
#> 13 1 2017
#> 14 2 2017
#> 15 3 2017
#> 16 3 2017
crossing drops duplicate rows.
df_a_crossing_b <- crossing(df_a, df_b)
df_a_crossing_b
#> # A tibble: 12 × 2
#> department period
#> <dbl> <dbl>
#> 1 1 2014
#> 2 1 2015
#> 3 1 2016
#> 4 1 2017
#> 5 2 2014
#> 6 2 2015
#> 7 2 2016
#> 8 2 2017
#> 9 3 2014
#> 10 3 2015
#> 11 3 2016
#> 12 3 2017
full_join doesn't remove duplicates either.
df_a_full_join_b <- full_join(df_a, df_b, by = character())
df_a_full_join_b
#> # A tibble: 16 × 2
#> department period
#> <dbl> <dbl>
#> 1 1 2014
#> 2 1 2015
#> 3 1 2016
#> 4 1 2017
#> 5 2 2014
#> 6 2 2015
#> 7 2 2016
#> 8 2 2017
#> 9 3 2014
#> 10 3 2015
#> 11 3 2016
#> 12 3 2017
#> 13 3 2014
#> 14 3 2015
#> 15 3 2016
#> 16 3 2017
packageVersion("tidyverse")
#> [1] '1.3.2'
Created on 2023-01-13 with reprex v2.0.2
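In dplyr 1.1.0 and later there is also a dedicated cross_join() function, and by = character() is deprecated in its favour. A minimal sketch, assuming such a version of dplyr is installed:

```r
library(dplyr)  # cross_join() requires dplyr 1.1.0 or later

# Same illustrative data as above, including the duplicated department
df_a <- data.frame(department = c(1, 2, 3, 3))
df_b <- data.frame(period = c(2014, 2015, 2016, 2017))

# Every row of df_a is paired with every row of df_b; duplicates are kept
big_df <- cross_join(df_a, df_b)
nrow(big_df)
#> [1] 16
```

Like merge and full_join, this keeps the duplicated rows, so the result has 16 rows rather than the 12 returned by crossing.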

For Loop and Table Printing in R

I'm trying to compute confidence intervals for many rows of a table using a for loop, and would like output that is more readable. Here is a snippet of the data.
QUESTION X_YEAR X_PARTNER X_CAMP X_N X_CODE1
1 Q1 2011 SCSD ITC 15 4
2 Q1 2011 SCSD Nottingham 4 1
3 Q1 2011 SCSD ALL 19 5
4 Q1 2011 CP CP1 18 4
5 Q1 2011 ALL ALL 37 9
6 Q1 2012 SCSD ITC 8 1
7 Q1 2012 SCSD Nottingham 8 2
8 Q1 2012 SCSD ALL 16 3
9 Q1 2012 CP CP1 18 2
10 Q1 2012 CP CP1 22 2
11 Q1 2012 CP ALL 40 4
I'm trying to print out a confidence interval, with the Question, Year and Camp included. I'd like the output to be in table form like this
QUESTION YEAR CAMP X N MEAN LOWER UPPER
Q1 2011 ITC 4 15 0.26 0.07 0.55
Q1 2011 NOTTINGHAM 1 4 0.25 0.006 0.8
with the first three columns being taken directly from the data table, and the latter 4 extracted from a confidence interval test I'm using.
The code I'm currently using:
for (i in 1:26){
print(data[i,1],max.levels=0)
print(data[i,2],max.levels=0)
print(data[i,4],max.levels=0)
print(binom.confint(data[i,6],data[i,5],conf.level=0.95,methods="exact"))
}
provides output that (I have a lot more data than the snippet) will be far too time consuming to sift through...
[1] Q1
[1] 2011
[1] ITC
method x n mean lower upper
1 exact 4 15 0.2666667 0.07787155 0.5510032
[1] Q1
[1] 2011
[1] Nottingham
method x n mean lower upper
1 exact 1 4 0.25 0.006309463 0.8058796
Any advice is appreciated!
If df is the name of your data, and you only want to do this for where QUESTION is Q1 (see comments), then
library(binom)
df2 <- df[df$QUESTION == "Q1",]
x <- vector("list", nrow(df2))
for(i in seq_len(nrow(df2))) {
x[[i]] <- binom.confint(df2[i,6], df2[i,5], methods = "exact")
}
cbind(df2[c(1,2,4)], do.call(rbind, x)[,-1])
# QUESTION X_YEAR X_CAMP x n mean lower upper
# 1 Q1 2011 ITC 4 15 0.26666667 0.077871546 0.5510032
# 2 Q1 2011 Nottingham 1 4 0.25000000 0.006309463 0.8058796
# 3 Q1 2011 ALL 5 19 0.26315789 0.091465785 0.5120293
# 4 Q1 2011 CP1 4 18 0.22222222 0.064092048 0.4763728
# 5 Q1 2011 ALL 9 37 0.24324324 0.117725174 0.4119917
# 6 Q1 2012 ITC 1 8 0.12500000 0.003159724 0.5265097
# 7 Q1 2012 Nottingham 2 8 0.25000000 0.031854026 0.6508558
# 8 Q1 2012 ALL 3 16 0.18750000 0.040473734 0.4564565
# 9 Q1 2012 CP1 2 18 0.11111111 0.013751216 0.3471204
# 10 Q1 2012 CP1 2 22 0.09090909 0.011205586 0.2916127
# 11 Q1 2012 ALL 4 40 0.10000000 0.027925415 0.2366374
Note that conf.level = 0.95 is the default setting for binom.confint, so you don't need to include it in your call.
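If the binom package is not available, a similar table can be sketched with base R's binom.test, whose conf.int element is the exact (Clopper-Pearson) interval. The example below rebuilds only the first two rows of the question's data; the names ci and result are introduced here for illustration.

```r
# First two rows of the question's data, enough to illustrate the idea
df <- data.frame(QUESTION = c("Q1", "Q1"),
                 X_YEAR   = c(2011, 2011),
                 X_CAMP   = c("ITC", "Nottingham"),
                 X_N      = c(15, 4),
                 X_CODE1  = c(4, 1))

# One exact 95% interval per row; mapply() returns one column per row,
# so transpose to get a lower and an upper column
ci <- t(mapply(function(x, n) binom.test(x, n)$conf.int,
               df$X_CODE1, df$X_N))

result <- data.frame(df[c("QUESTION", "X_YEAR", "X_CAMP")],
                     x = df$X_CODE1, n = df$X_N,
                     mean = df$X_CODE1 / df$X_N,
                     lower = ci[, 1], upper = ci[, 2])
result
```

The result is a single data frame with one row per input row, which is the table layout asked for in the question.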

How calculate growth rate in long format data frame?

With data structured as follows...
df <- data.frame(Category=c(rep("A",6),rep("B",6)),
Year=rep(2010:2015,2),Value=1:12)
I'm having a tough time creating a growth rate column (by year) within category. Can anyone help with code to create something like this...
Category Year Value Growth
A 2010 1
A 2011 2 1.000
A 2012 3 0.500
A 2013 4 0.333
A 2014 5 0.250
A 2015 6 0.200
B 2010 7
B 2011 8 0.143
B 2012 9 0.125
B 2013 10 0.111
B 2014 11 0.100
B 2015 12 0.091
For these sorts of questions ("how do I compute XXX by category YYY?") there are always solutions based on by(), the data.table package, and plyr. I generally prefer plyr, which is often slower, but (to me) more transparent/elegant.
df <- data.frame(Category=c(rep("A",6),rep("B",6)),
Year=rep(2010:2015,2),Value=1:12)
library(plyr)
ddply(df,"Category",transform,
Growth=c(NA,exp(diff(log(Value)))-1))
The main difference between this answer and @krlmr's is that I am using a geometric-mean trick (taking differences of logs and then exponentiating) while @krlmr computes an explicit ratio.
Mathematically, diff(log(Value)) is taking the differences of the logs, i.e. log(x[t+1])-log(x[t]) for all t. When we exponentiate that we get the ratio x[t+1]/x[t] (because exp(log(x[t+1])-log(x[t])) = exp(log(x[t+1]))/exp(log(x[t])) = x[t+1]/x[t]). The OP wanted the fractional change rather than the multiplicative growth rate (i.e. x[t+1]==x[t] corresponds to a fractional change of zero rather than a multiplicative growth rate of 1.0), so we subtract 1.
I am also using transform() for a little bit of extra "syntactic sugar", to avoid creating a new anonymous function.
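The equivalence between the log-difference form and the explicit ratio can be checked numerically. A small sketch using the Value entries of category "B" from the example data (the names via_logs and via_ratio are introduced here):

```r
x <- c(7, 8, 9, 10, 11, 12)             # Value for category "B"

via_logs  <- exp(diff(log(x))) - 1      # difference of logs, then exponentiate
via_ratio <- x[-1] / x[-length(x)] - 1  # explicit ratio x[t+1] / x[t], minus 1

all.equal(via_logs, via_ratio)
#> [1] TRUE
round(via_ratio, 3)
#> [1] 0.143 0.125 0.111 0.100 0.091
```

Both forms agree up to floating-point error, so the choice between them is mostly a matter of taste.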
You can simply use dplyr package:
> library(dplyr)
> df %>% group_by(Category) %>% mutate(Growth = (Value - lag(Value))/lag(Value))
which will produce the following result:
# A tibble: 12 x 4
# Groups: Category [2]
Category Year Value Growth
<fct> <int> <int> <dbl>
1 A 2010 1 NA
2 A 2011 2 1
3 A 2012 3 0.5
4 A 2013 4 0.333
5 A 2014 5 0.25
6 A 2015 6 0.2
7 B 2010 7 NA
8 B 2011 8 0.143
9 B 2012 9 0.125
10 B 2013 10 0.111
11 B 2014 11 0.1
12 B 2015 12 0.0909
Using R base function (ave)
> df$Growth <- with(df, ave(Value, Category,
FUN=function(x) c(NA, diff(x)/x[-length(x)]) ))
> df
Category Year Value Growth
1 A 2010 1 NA
2 A 2011 2 1.00000000
3 A 2012 3 0.50000000
4 A 2013 4 0.33333333
5 A 2014 5 0.25000000
6 A 2015 6 0.20000000
7 B 2010 7 NA
8 B 2011 8 0.14285714
9 B 2012 9 0.12500000
10 B 2013 10 0.11111111
11 B 2014 11 0.10000000
12 B 2015 12 0.09090909
@Ben Bolker's answer is easily adapted to ave:
transform(df, Growth=ave(Value, Category,
FUN=function(x) c(NA,exp(diff(log(x)))-1)))
Very easy with plyr:
library(plyr)
ddply(df, .(Category),
function (d) {
d$Growth <- c(NA, tail(d$Value, -1) / head(d$Value, -1) - 1)
d
}
)
We have two problems here:
Splitting by category
Computing the growth rate
ddply is the workhorse: the split and the function that computes the growth rate are both specified through its arguments.
A more elegant variant based on Ben's idea with the new gdiff function in my R package:
df <- data.frame(Category=c(rep("A",6),rep("B",6)),
Year=rep(2010:2015,2),Value=1:12)
library(plyr)
ddply(df, "Category", transform,
Growth=c(NA, kimisc::gdiff(Value, FUN = `/`)-1))
Here, gdiff is used to compute a lagged rate (instead of a lagged difference as diff would).
Many years later: the tsbox package aims to work with all kinds of time series objects, including data frames, and offers a standard time series toolkit. Thus, calculating growth rates is as simple as:
df <- data.frame(Category=c(rep("A",6),rep("B",6)),
Year=rep(2010:2015,2),Value=1:12)
library(tsbox)
ts_pc(df)
#> [time]: 'Year' [value]: 'Value'
#> Category Year Value
#> 1 A 2010-01-01 NA
#> 2 A 2011-01-01 100.000000
#> 3 A 2012-01-01 50.000000
#> 4 A 2013-01-01 33.333333
#> 5 A 2014-01-01 25.000000
#> 6 A 2015-01-01 20.000000
#> 7 B 2010-01-01 NA
#> 8 B 2011-01-01 14.285714
#> 9 B 2012-01-01 12.500000
#> 10 B 2013-01-01 11.111111
#> 11 B 2014-01-01 10.000000
#> 12 B 2015-01-01 9.090909
The collapse package, available on CRAN, provides an easy and fully C/C++ based solution to these kinds of problems with the generic function fgrowth and the associated growth operator G:
df <- data.frame(Category=c(rep("A",6),rep("B",6)),
Year=rep(2010:2015,2),Value=1:12)
library(collapse)
G(df, by = ~Category, t = ~Year)
Category Year G1.Value
1 A 2010 NA
2 A 2011 100.000000
3 A 2012 50.000000
4 A 2013 33.333333
5 A 2014 25.000000
6 A 2015 20.000000
7 B 2010 NA
8 B 2011 14.285714
9 B 2012 12.500000
10 B 2013 11.111111
11 B 2014 10.000000
12 B 2015 9.090909
# fgrowth is more of a programmer's function; you can do:
fgrowth(df$Value, 1, 1, df$Category, df$Year)
[1] NA 100.000000 50.000000 33.333333 25.000000 20.000000 NA 14.285714 12.500000 11.111111 10.000000 9.090909
# Which means: Calculate the growth rate of Value, using 1 lag, and iterated 1 time (you can compute arbitrary sequences of lagged / leaded and iterated growth rates with these functions), identified by Category and Year.
fgrowth / G also has methods for the plm::pseries and plm::pdata.frame classes available in the plm package.

Resources