I have a time series data frame in R spanning 2011 to 2018. How can I write a for loop that counts the number of NA values for each year separately and, if that year has more than x% missing, drops that year (or does something else with it)?
Please refer to the image to see what my data frame looks like:
https://i.stack.imgur.com/2fwDk.png
years <- 2011:2018
for (y in years) {
  flows <- df$Flow[df$Year == y]           # assumes a Year column, as in the screenshot
  n_missing <- sum(is.na(flows))           # count of NA values in year y
  if (n_missing / length(flows) <= 0.01) { # at most 1% missing: compute BFI
    bfi <- BFI(flows)
  } else {
    bfi <- NA
  }
}
I am trying to use this loop to go over each year and count the NAs: if more than 1% of a year's values are NA, I want to skip the BFI computation, and otherwise compute it. I do have the BFI function working well; the problem I have is formulating this loop.
Since you have not included any reproducible data, let us take a simple example that captures the essence of your own data. We have a column called Year and one called Flow that contains some missing values:
df <- data.frame(Year = rep(2011:2013, each = 4),
Flow = c(1, 2, NA, NA, 5, 6, NA, 8, 9, 10, 11, 12))
df
#> Year Flow
#> 1 2011 1
#> 2 2011 2
#> 3 2011 NA
#> 4 2011 NA
#> 5 2012 5
#> 6 2012 6
#> 7 2012 NA
#> 8 2012 8
#> 9 2013 9
#> 10 2013 10
#> 11 2013 11
#> 12 2013 12
Now suppose we want to count the number of missing values in each year. We can use table and is.na, like this:
tab <- table(df$Year, is.na(df$Flow))
tab
#>
#> FALSE TRUE
#> 2011 2 2
#> 2012 3 1
#> 2013 4 0
We can see that these are the absolute counts of missing values, but we can convert this into proportions by dividing the second column by the row sums of this table:
props <- tab[,2] / rowSums(tab)
props
#> 2011 2012 2013
#> 0.50 0.25 0.00
Now, suppose we want to find and remove the years where more than 33% of cases are missing. We can just filter the values of props that are greater than 0.33 and get the associated year (or years):
years_to_drop <- names(props)[props > 0.33]
years_to_drop
#> [1] "2011"
Now we can use this to remove the years with more than 33% missing values from our original data frame by doing:
df[!df$Year %in% years_to_drop,]
#> Year Flow
#> 5 2012 5
#> 6 2012 6
#> 7 2012 NA
#> 8 2012 8
#> 9 2013 9
#> 10 2013 10
#> 11 2013 11
#> 12 2013 12
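To tie this back to the BFI computation in the question: once the per-year proportions are known, you could compute BFI only for the years under your threshold. A minimal sketch, assuming your BFI() accepts a plain numeric vector of flow values:
threshold <- 0.01  # skip years with more than 1% missing
bfi_by_year <- sapply(split(df$Flow, df$Year), function(flows) {
  if (mean(is.na(flows)) > threshold) NA else BFI(flows)
})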
As Allan Cameron suggests, there's no need to use a loop, and R is usually more efficient working vectorially anyway.
I would suggest a solution based on ave (using the synthetic data from the previous answer)
df$NA_fraction <- ave(df$Flow, df$Year, FUN = \(values) mean(is.na(values)))
df
Year Flow NA_fraction
1 2011 1 0.50
2 2011 2 0.50
3 2011 NA 0.50
4 2011 NA 0.50
5 2012 5 0.25
6 2012 6 0.25
7 2012 NA 0.25
8 2012 8 0.25
9 2013 9 0.00
10 2013 10 0.00
11 2013 11 0.00
12 2013 12 0.00
You can then pick whatever threshold you like and filter by it:
df[df$NA_fraction < 0.3,]
Year Flow NA_fraction
5 2012 5 0.25
6 2012 6 0.25
7 2012 NA 0.25
8 2012 8 0.25
9 2013 9 0.00
10 2013 10 0.00
11 2013 11 0.00
12 2013 12 0.00
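For comparison, the same grouped fraction and filter can be written with dplyr (a sketch using the same toy data):
library(dplyr)
df %>%
  group_by(Year) %>%
  filter(mean(is.na(Flow)) < 0.3) %>%
  ungroup()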
I have a data frame from which I created a reproducible example:
country <- c('A','A','A','B','B','C','C','C','C')
year <- c(2010,2011,2015,2008,2009,2008,2009,2011,2015)
score <- c(1,2,2,1,4,1,1,3,2)
df <- data.frame(country, year, score)
df
country year score
1 A 2010 1
2 A 2011 2
3 A 2015 2
4 B 2008 1
5 B 2009 4
6 C 2008 1
7 C 2009 1
8 C 2011 3
9 C 2015 2
And I am trying to calculate the average percentage increase (or decrease) in the score for each country, by calculating [(current score - previous score) / (previous score)] for each year and averaging it over the number of years.
country year score change
1 A 2010 1 NA
2 A 2011 2 1
3 A 2015 2 0
4 B 2008 1 NA
5 B 2009 4 3
6 C 2008 1 NA
7 C 2009 1 0
8 C 2011 3 2
9 C 2015 2 -0.33
The final result I am hoping to obtain:
country avg_change
1 A 0.5
2 B 3
3 C 0.55
As you can see, the trick is that the countries span different years, sometimes with missing years in between. I have tried different ways to do it manually, but I am struggling. If someone could hint at a solution, that would be great. Many thanks.
With dplyr, we can group_by country and take the mean of the differences between consecutive scores.
library(dplyr)
df %>%
group_by(country) %>%
summarise(avg_change = mean(c(NA, diff(score)), na.rm = TRUE))
# country avg_change
# <fct> <dbl>
#1 A 0.500
#2 B 3.00
#3 C 0.333
Using base R aggregate with the same logic:
aggregate(score~country, df, function(x) mean(c(NA, diff(x)), na.rm = TRUE))
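For reference, this returns the same three averages; with the formula interface the result column keeps the name score:
#   country     score
# 1       A 0.5000000
# 2       B 3.0000000
# 3       C 0.3333333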
We can use data.table to group by 'country' and take the mean of the difference between 'score' and its lagged value (shift):
library(data.table)
setDT(df)[, .(avg_change = mean(score - shift(score), na.rm = TRUE)), .(country)]
# country avg_change
#1: A 0.5000000
#2: B 3.0000000
#3: C 0.3333333
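One caveat applies to both answers: diff(score) measures absolute year-to-year changes, which matches the expected output for A and B only because their initial scores are 1. The expected 0.55 for C is the mean of the relative changes (0, 2, -1/3). To reproduce that, divide each difference by the preceding score, e.g. with dplyr:
df %>%
  group_by(country) %>%
  summarise(avg_change = mean(diff(score) / head(score, -1)))
# country avg_change
#1 A 0.500
#2 B 3.00
#3 C 0.556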
I am manipulating a dataset but I can't get it right.
Here's an example, where df is the name of the data frame:
year ID value
2013 1 10
2013 2 20
2013 3 10
2014 1 20
2014 2 20
2014 3 30
2015 1 20
2015 2 10
2015 3 30
So I tried to make another data frame: df1 <- aggregate(value ~ year, df, mean, na.rm = TRUE)
And made this data frame df1:
year value
2013 13.3
2014 23.3
2015 20.0
But I want to add each mean by year into each row of df.
The expected form is:
year ID value
2013 1 10
2013 2 20
2013 3 10
2013 avg 13.3
2014 1 20
2014 2 20
2014 3 30
2014 avg 23.3
2015 1 20
2015 2 10
2015 3 30
2015 avg 20
Here is an option with data.table. We convert the 'data.frame' to a 'data.table' (setDT(df)); then, grouped by 'year', we take the mean of 'value' with 'ID' set to 'avg', use rbindlist to rbind both datasets, and order by 'year':
library(data.table)
rbindlist(list(setDT(df), df[, .(ID = 'avg', value = mean(value)), year]))[order(year)]
# year ID value
# 1: 2013 1 10.00000
# 2: 2013 2 20.00000
# 3: 2013 3 10.00000
# 4: 2013 avg 13.33333
# 5: 2014 1 20.00000
# 6: 2014 2 20.00000
# 7: 2014 3 30.00000
# 8: 2014 avg 23.33333
# 9: 2015 1 20.00000
#10: 2015 2 10.00000
#11: 2015 3 30.00000
#12: 2015 avg 20.00000
Or, using the OP's method, rbind both datasets and then order by year:
df2 <- rbind(df, transform(df1, ID = 'avg'))
df2 <- df2[order(df2$year),]
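The same result can also be produced with dplyr (a sketch; ID is converted to character so the numeric IDs and "avg" can share a column):
library(dplyr)
df %>%
  mutate(ID = as.character(ID)) %>%
  bind_rows(df %>%
              group_by(year) %>%
              summarise(value = mean(value), ID = "avg")) %>%
  arrange(year)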
I want to replace each NA value with the mean of the adjacent non-missing values in the "return" column, grouped by "id". Let's assume that there are only two months (1 and 2) in a year.
df <- data.frame(id = c("A","A","A","A","B","B","B","B"),
year = c(2014,2014,2015,2015),
month = c(1, 2),
marketcap = c(4,6,2,6,23,2,5,34),
return = c(NA,0.23,0.2,0.1,0.4,0.9,NA,0.6))
df
id year month marketcap return
1: A 2014 1 4 NA # <-
2: A 2014 2 6 0.23
3: A 2015 1 2 0.20
4: A 2015 2 6 0.10
5: B 2014 1 23 0.40
6: B 2014 2 2 0.90
7: B 2015 1 5 NA # <-
8: B 2015 2 34 0.60
Desired data
desired_df <- data.frame(id = c("A","A","A","A","B","B","B","B"),
year = c(2014,2014,2015,2015),
month = c(1,2),
marketcap = c(4,6,2,6,23,2,5,34),
return = c(0.23,0.23,0.2,0.1,0.4,0.9,0.75,0.6))
desired_df
id year month marketcap return
1 A 2014 1 4 0.23 # <-
2 A 2014 2 6 0.23
3 A 2015 1 2 0.20
4 A 2015 2 6 0.10
5 B 2014 1 23 0.40
6 B 2014 2 2 0.90
7 B 2015 1 5 0.75 # <-
8 B 2015 2 34 0.60
The second NA (row 7) should be replaced by the mean of the values before and after, i.e. (0.9 + 0.6)/2 = 0.75.
Note that the first NA (row 1), has no previous data. Here NA should be replaced with the next non-missing value, 0.23 ("last observation carried backwards").
A data.table solution is preferred, if possible.
UPDATE:
When I use the following code (which works for the sample):
df[, returnInterpolate := na.approx(return, rule = 2), by = id]
I encounter the error:
Error in approx(x[!na], y[!na], xout, ...) :
need at least two non-NA values to interpolate
I guess there may be some id groups that do not have enough non-NA values to interpolate. Any suggestions?
library(data.table)
df <- data.frame(id=c("A","A","A","A","B","B","B","B"),
year=c(2014,2014,2015,2015),
month=c(1,2),
marketcap=c(4,6,2,6,23,2,5,34),
return=c(NA,0.23,0.2,0.1,0.4,0.9,NA,0.6))
setDT(df)
library(zoo)
df[, returnInterpol := na.approx(return, rule = 2), by = id]
# id year month marketcap return returnInterpol
#1: A 2014 1 4 NA 0.23
#2: A 2014 2 6 0.23 0.23
#3: A 2015 1 2 0.20 0.20
#4: A 2015 2 6 0.10 0.10
#5: B 2014 1 23 0.40 0.40
#6: B 2014 2 2 0.90 0.90
#7: B 2015 1 5 NA 0.75
#8: B 2015 2 34 0.60 0.60
Edit:
If you have groups with only NA values or only one non-NA, you could do this:
df <- data.frame(id=c("A","A","A","A","B","B","B","B","C","C","C","C"),
year=c(2014,2014,2015,2015),
month=c(1,2),
marketcap=c(4,6,2,6,23,2,5,34, 1:4),
return=c(NA,0.23,0.2,0.1,0.4,0.9,NA,0.6,NA,NA,0.3,NA))
setDT(df)
df[, returnInterpol := switch(as.character(sum(!is.na(return))),
"0" = return,
"1" = {na.omit(return)},
na.approx(return, rule = 2)), by = id]
# id year month marketcap return returnInterpol
# 1: A 2014 1 4 NA 0.23
# 2: A 2014 2 6 0.23 0.23
# 3: A 2015 1 2 0.20 0.20
# 4: A 2015 2 6 0.10 0.10
# 5: B 2014 1 23 0.40 0.40
# 6: B 2014 2 2 0.90 0.90
# 7: B 2015 1 5 NA 0.75
# 8: B 2015 2 34 0.60 0.60
# 9: C 2014 1 1 NA 0.30
# 10: C 2014 2 2 NA 0.30
# 11: C 2015 1 3 0.30 0.30
# 12: C 2015 2 4 NA 0.30
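The switch dispatches on the number of non-NA values in each group: groups with none are left as NA, groups with exactly one are filled with that single value (the length-one result of na.omit is recycled across the group), and everything else is interpolated as before.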
The easy imputeTS solution, ignoring the ID grouping, would be:
library("imputeTS")
df$return <- na.interpolation(df$return)
Since the imputation should be done per ID, it is a little more complicated, since it seems there are often not enough values left once you filter by ID. I would take the solution Roland posted and use imputeTS::na.interpolation() where possible; in the other cases, the overall mean via imputeTS::na.mean() or a random guess within the overall bounds via imputeTS::na.random() could be used.
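A minimal sketch of that combination, using the dotted imputeTS function names from this era (newer imputeTS releases rename them to na_interpolation() and so on), with the overall column mean as the fallback for groups that have fewer than two non-NA values:
library(data.table)
library(imputeTS)
setDT(df)
overall_mean <- mean(df$return, na.rm = TRUE)      # fallback for sparse groups
df[, returnInterpol := if (sum(!is.na(return)) >= 2) {
      na.interpolation(return)                     # enough points: interpolate within id
    } else {
      replace(return, is.na(return), overall_mean) # too few points: use the overall mean
    }, by = id]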
In this case it might also be a very good idea to look beyond univariate time series imputation. There are a lot of other variables that could help in estimating the missing values (if there is a correlation). Packages like Amelia could help here.
I'm trying to compute confidence intervals for many rows of a table using a for loop, and would like output that is more readable. Here is a snippet of how the data looks:
QUESTION X_YEAR X_PARTNER X_CAMP X_N X_CODE1
1 Q1 2011 SCSD ITC 15 4
2 Q1 2011 SCSD Nottingham 4 1
3 Q1 2011 SCSD ALL 19 5
4 Q1 2011 CP CP1 18 4
5 Q1 2011 ALL ALL 37 9
6 Q1 2012 SCSD ITC 8 1
7 Q1 2012 SCSD Nottingham 8 2
8 Q1 2012 SCSD ALL 16 3
9 Q1 2012 CP CP1 18 2
10 Q1 2012 CP CP1 22 2
11 Q1 2012 CP ALL 40 4
I'm trying to print out a confidence interval, with the Question, Year and Camp included. I'd like the output to be in table form like this
QUESTION YEAR CAMP X N MEAN LOWER UPPER
Q1 2011 ITC 4 15 0.26 0.07 0.55
Q1 2011 NOTTINGHAM 1 4 0.25 0.006 0.8
with the first three columns taken directly from the data table, and the remaining five extracted from the confidence interval test I'm using.
The code I'm currently using:
for (i in 1:26){
print(data[i,1],max.levels=0)
print(data[i,2],max.levels=0)
print(data[i,4],max.levels=0)
print(binom.confint(data[i,6],data[i,5],conf.level=0.95,methods="exact"))
}
provides output that (since I have a lot more data than this snippet) will be far too time-consuming to sift through:
[1] Q1
[1] 2011
[1] ITC
method x n mean lower upper
1 exact 4 15 0.2666667 0.07787155 0.5510032
[1] Q1
[1] 2011
[1] Nottingham
method x n mean lower upper
1 exact 1 4 0.25 0.006309463 0.8058796
Any advice is appreciated!
If df is the name of your data, and you only want to do this where QUESTION is Q1, then:
library(binom)
df2 <- df[df$QUESTION == "Q1",]
x <- vector("list", nrow(df2))
for(i in seq_len(nrow(df2))) {
x[[i]] <- binom.confint(df2[i,6], df2[i,5], methods = "exact")
}
cbind(df2[c(1,2,4)], do.call(rbind, x)[,-1])
# QUESTION X_YEAR X_CAMP x n mean lower upper
# 1 Q1 2011 ITC 4 15 0.26666667 0.077871546 0.5510032
# 2 Q1 2011 Nottingham 1 4 0.25000000 0.006309463 0.8058796
# 3 Q1 2011 ALL 5 19 0.26315789 0.091465785 0.5120293
# 4 Q1 2011 CP1 4 18 0.22222222 0.064092048 0.4763728
# 5 Q1 2011 ALL 9 37 0.24324324 0.117725174 0.4119917
# 6 Q1 2012 ITC 1 8 0.12500000 0.003159724 0.5265097
# 7 Q1 2012 Nottingham 2 8 0.25000000 0.031854026 0.6508558
# 8 Q1 2012 ALL 3 16 0.18750000 0.040473734 0.4564565
# 9 Q1 2012 CP1 2 18 0.11111111 0.013751216 0.3471204
# 10 Q1 2012 CP1 2 22 0.09090909 0.011205586 0.2916127
# 11 Q1 2012 ALL 4 40 0.10000000 0.027925415 0.2366374
Note that conf.level = 0.95 is the default setting for binom.confint, so you don't need to include it in your call.
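As an aside, binom.confint is vectorised over x and n, so the loop can be dropped entirely. A sketch of the same computation without the loop:
library(binom)
df2 <- df[df$QUESTION == "Q1", ]
ci <- binom.confint(df2$X_CODE1, df2$X_N, methods = "exact")
cbind(df2[c("QUESTION", "X_YEAR", "X_CAMP")], ci[-1]) # drop the method column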