How to count how many values were used in a mean() function? - r

I am trying to create a column in a data frame containing how many values were used in the mean function for each line.
First, I had a data frame df like this:
df <- data.frame(tree_id = rep(c("CHC01", "CHC02"), each = 8),
                 rad = c(rep("A", 4), rep("B", 4), rep("A", 4), rep("C", 4)),
                 year = rep(2015:2018, 4),
                 growth = c(NA, NA, 1.2, 3.2, 2.1, 1.5, 2.3, 2.7,
                            NA, NA, NA, 1.7, 3.5, 1.4, 2.3, 2.7))
Then, I created a new data frame called avg_df, containing only the mean values of growth grouped by tree_id and year
library(dplyr)
avg_df <- df %>%
  group_by(tree_id, year) %>%
  summarise(avg_growth = mean(growth, na.rm = TRUE))
Now, I would like to add a new column in avg_df containing how many values were used to calculate the mean growth for each tree_id and year, ignoring the NAs.
Example: for CHC01 in 2015 the result is 1, because the mean was taken over 2.1 and an NA, and
for CHC01 in 2018 it is 2, because the result is the average of 3.2 and 2.7.
Here is the expected output:
avg_df$radii <- c(1,1,2,2,1,1,1,2)
tree_id year avg_growth radii
CHC01 2015 2.1 1
CHC01 2016 1.5 1
CHC01 2017 1.75 2
CHC01 2018 2.95 2
CHC02 2015 3.5 1
CHC02 2016 1.4 1
CHC02 2017 2.3 1
CHC02 2018 2.2 2
*In my real data, the values in radii will vary from 1 to 4.
Could anyone help me with this?
Thank you very much!

We can get the sum of non-NA elements (!is.na(growth)) after grouping by 'tree_id' and 'year'
library(dplyr)
df %>%
  group_by(tree_id, year) %>%
  summarise(avg_growth = mean(growth, na.rm = TRUE),
            radii = sum(!is.na(growth)))
# A tibble: 8 x 4
# Groups: tree_id [2]
# tree_id year avg_growth radii
# <fct> <int> <dbl> <int>
#1 CHC01 2015 2.1 1
#2 CHC01 2016 1.5 1
#3 CHC01 2017 1.75 2
#4 CHC01 2018 2.95 2
#5 CHC02 2015 3.5 1
#6 CHC02 2016 1.4 1
#7 CHC02 2017 2.3 1
#8 CHC02 2018 2.2 2
Or using data.table
library(data.table)
setDT(df)[, .(avg_growth = mean(growth, na.rm = TRUE),
              radii = sum(!is.na(growth))), by = .(tree_id, year)]
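If you prefer base R, the same summary can be sketched with aggregate(); since the formula interface drops NA rows before applying the function, mean() and length() both see only the non-NA values. (One caveat: a group whose values are all NA is dropped entirely, unlike the dplyr version, which returns NaN and 0.)

```r
# same df as in the question
df <- data.frame(tree_id = rep(c("CHC01", "CHC02"), each = 8),
                 rad = c(rep("A", 4), rep("B", 4), rep("A", 4), rep("C", 4)),
                 year = rep(2015:2018, 4),
                 growth = c(NA, NA, 1.2, 3.2, 2.1, 1.5, 2.3, 2.7,
                            NA, NA, NA, 1.7, 3.5, 1.4, 2.3, 2.7))

# aggregate() with a formula drops NA rows by default (na.action = na.omit),
# so each group contains only the values that actually enter the mean
agg <- aggregate(growth ~ tree_id + year, data = df,
                 FUN = function(x) c(avg_growth = mean(x), radii = length(x)))
agg <- do.call(data.frame, agg)  # flatten the matrix column into two columns
```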

Related

How best to parse fields in R?

Below is the sample data. This is how it comes from the current population survey. There are 115 columns in the original; below is just a subset. At the moment, I simply append a new row each month and leave it as is. However, there has been a new request that it be made longer and parsed a bit.
For some context, the first character is the race: a = all, b = black, w = white, and h = hispanic. The second character is the gender: x = all, m = male, and f = female. The third element, which does not appear in all columns, is the age. These values are 2024 for ages 20-24, 3039 for ages 30-39, and so on. Each column name ends in one of the terms laborforce, unemp, or unemprate.
stfips <- c(32,32,32,32,32,32,32,32)
areatype <- c(01,01,01,01,01,01,01,01)
periodyear <- c(2021,2021,2021,2021,2021,2021,2021,2021)
period <- c(01,02,03,04,05,06,07,08)
xalaborforce <- c(1210.9,1215.3,1200.6,1201.6,1202.8,1209.3,1199.2,1198.9)
xaunemp <- c(55.7,55.2,65.2,321.2,77.8,88.5,92.4,102.6)
xaunemprate <- c(2.3,2.5,2.7,2.9,3.2,6.5,6.0,12.5)
walaborforce <- c(1000.0,999.2,1000.5,1001.5,998.7,994.5,999.2,1002.8)
waunemp <- c(50.2,49.5,51.6,251.2,59.9,80.9,89.8,77.8)
waunemprate <- c(3.4,3.6,3.8,4.0,4.2,4.5,4.1,2.6)
balaborforce <- c(5.5,5.7,5.2,6.8,9.2,2.5,3.5,4.5)
ba2024laborforce <- c(1.2,1.4,1.2,1.3,1.6,1.7,1.4,1.5)
ba2024unemp <- c(.2,.3,.2,.3,.4,.5,.02,.19)
ba2024unemprate <- c(2.1,2.2,3.2,3.2,3.3,3.4,1.2,2.5)
test2 <- data.frame(stfips, areatype, periodyear, period, xalaborforce, xaunemp, xaunemprate,
                    walaborforce, waunemp, waunemprate, balaborforce,
                    ba2024laborforce, ba2024unemp, ba2024unemprate)
Desired result
stfips areatype periodyear period race gender age laborforce unemp unemprate
32 01 2021 01 x a all 1210.9 55.7 2.3
32 01 2021 02 x a all 1215.3 55.2 2.5
.....(the other six rows for race = x and gender = a)
32 01 2021 01 w a all 1000.0 50.2 3.4
32 01 2021 02 w a all 999.2 49.5 3.6
....(the other six rows for race = w and gender = a)
32 01 2021 01 b a 2024 1.2 .2 2.1
Edit -- added handling for columns with an age prefix. Mostly there, but it would be nice to have a concise way to add the - to make 2024 into 20-24....
library(tidyverse)  # pivot_longer/separate from tidyr, parse_number from readr

test2 %>%
  pivot_longer(xalaborforce:ba2024unemprate) %>%
  separate(name, c("race", "gender", "stat"), sep = c(1, 2)) %>%
  mutate(age = coalesce(parse_number(stat) %>% as.character, "all"),
         stat = str_remove_all(stat, "[0-9]")) %>%
  pivot_wider(names_from = stat, values_from = value)
# A tibble: 32 × 10
   stfips areatype periodyear period race  gender age   laborforce unemp unemprate
    <dbl>    <dbl>      <dbl>  <dbl> <chr> <chr>  <chr>      <dbl> <dbl>     <dbl>
 1     32        1       2021      1 x     a      all        1211.  55.7       2.3
 2     32        1       2021      1 w     a      all        1000   50.2       3.4
 3     32        1       2021      1 b     a      all          5.5  NA        NA
 4     32        1       2021      1 b     a      2024         1.2   0.2       2.1
 5     32        1       2021      2 x     a      all        1215.  55.2       2.5
 6     32        1       2021      2 w     a      all         999.  49.5       3.6
 7     32        1       2021      2 b     a      all          5.7  NA        NA
 8     32        1       2021      2 b     a      2024         1.4   0.3       2.2
 9     32        1       2021      3 x     a      all        1201.  65.2       2.7
10     32        1       2021      3 w     a      all        1000.  51.6       3.8
# … with 22 more rows
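For the edit's follow-up (turning 2024 into 20-24 concisely), a base R sub() call works; the ages vector below is just an illustrative stand-in for the age column produced above:

```r
# insert a hyphen between the two digit pairs of a four-digit age code;
# "all" passes through unchanged because the pattern requires exactly four digits
ages <- c("all", "2024", "3039")
sub("^(\\d{2})(\\d{2})$", "\\1-\\2", ages)
```

Inside the pipeline this would be e.g. `mutate(age = sub("^(\\d{2})(\\d{2})$", "\\1-\\2", age))`.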

R Expand time series data based on start and end point

I think I have a pretty simple request. I have the following dataframe, where "place" is a unique identifier, while start_date and end_date may overlap. The values are unique for each ID "place".
place start_date end_date value
1 2007-09-01 2010-10-12 0.5
2 2013-09-27 2015-10-11 0.7
...
What I need is to create a year-based version of the data, expanding the time series so that each calendar year between start_date and end_date (each new year starting on the first of January, e.g. 2011-01-01) gets its own row for that particular "place" and "value". I mean something like this:
place year value
1 2007 0.5
1 2008 0.5
1 2009 0.5
1 2010 0.5
2 2013 0.7
2 2014 0.7
2 2015 0.7
...
There are some cases with overlap (i.e. "place" = 1 and "year" = 2007) across two separate observations, where one observation ends in a given year and another observation continues from that year. In that case I would prefer the "value" that ends in that specific year: if one observation for place = 1 ends in March 2007 and another for place = 1 starts in April 2007, then year = 2007 for place = 1 should take the previous "ending" value, if that makes sense.
I've only gotten this far:
library(data.table)
data <- data.table(dat)
data[, `:=`(start_date = as.Date(start_date), end_date = as.Date(end_date))]
data[, num_mons := length(seq(from = start_date, to = end_date, by = 'year')), by = 1:nrow(data)]
I guess writing a loop makes the most sense?
Thank you for your help and advice.
A tidyverse solution could look like this:
library(dplyr)
library(stringr)
library(purrr)
library(tidyr)
data <- tibble(place = c(1, 2),
               start_date = c('2007-09-01', '2013-09-27'),
               end_date = c('2010-10-12', '2015-10-11'),
               value = c(0.5, 0.7))

data %>%
  mutate(year = map2(start_date,
                     end_date,
                     ~ as.character(str_extract(.x, '\\d{4}'):
                                      str_extract(.y, '\\d{4}')))) %>%
  separate_rows(year) %>%
  filter(!year %in% c('c', '')) %>%
  select(place, year, value)
# place year value
# <dbl> <chr> <dbl>
# 1 1 2007 0.5
# 2 1 2008 0.5
# 3 1 2009 0.5
# 4 1 2010 0.5
# 5 2 2013 0.7
# 6 2 2014 0.7
# 7 2 2015 0.7
I'm having problems understanding the third paragraph of your question ("There are ..."). It seems to me to be a separate question. If that is the case, please consider moving the question to a separate post here on SO. If it is not a separate question, please reformulate the paragraph.
You could do the following:
library(lubridate)
library(tidyverse)
df %>%
  group_by(place) %>%
  mutate(year = list(seq(year(ymd(start_date)), year(ymd(end_date))))) %>%
  unnest(year) %>%
  select(place, year, value)
# A tibble: 7 x 3
# Groups: place [2]
place year value
<int> <int> <dbl>
1 1 2007 0.5
2 1 2008 0.5
3 1 2009 0.5
4 1 2010 0.5
5 2 2013 0.7
6 2 2014 0.7
7 2 2015 0.7
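If you'd rather avoid extra packages, the same expansion can be sketched in base R with Map() over the start and end dates (using the two-row example data from above):

```r
data <- data.frame(place = c(1, 2),
                   start_date = c("2007-09-01", "2013-09-27"),
                   end_date = c("2010-10-12", "2015-10-11"),
                   value = c(0.5, 0.7))

# one integer sequence of years per row, then repeat place/value to match
years <- Map(function(s, e) seq(as.integer(format(as.Date(s), "%Y")),
                                as.integer(format(as.Date(e), "%Y"))),
             data$start_date, data$end_date)
expanded <- data.frame(place = rep(data$place, lengths(years)),
                       year = unlist(years, use.names = FALSE),
                       value = rep(data$value, lengths(years)))
```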

Select unique entries showing at least one value from another column

I have the following dataset (32,000 entries) of annual means of water chemical compounds, organized by monitoring site and sampling year:
data <- data.frame(Site_ID = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                   Year = c(1976, 1977, 1978, 2004, 2005, 2006, 2003, 2004, 2005),
                   AnnualMean = c(1.1, 1.2, 1.1, 2.1, 2.6, 3.1, 2.7, 2.6, 1.9))
Site_ID Year AnnualMean
1 1976 1.1
1 1977 1.2
1 1978 1.1
2 2004 2.1
2 2005 2.6
2 2006 3.1
3 2003 2.7
3 2004 2.6
3 2005 1.9
I would like to select only the data from monitoring sites that have at least one measurement in 2005 in their time range. With the above dataset, the expected output would be:
Site_ID Year AnnualMean
2 2004 2.1
2 2005 2.6
2 2006 3.1
3 2003 2.7
3 2004 2.6
3 2005 1.9
I am completely new to R and have been spinning my head around data manipulation, so thank you in advance!
With dplyr:
library(dplyr)
data %>%
  group_by(Site_ID) %>%
  filter(2005 %in% Year)
Here is a base R solution using subset + ave:
dfout <- subset(data, !!ave(Year, Site_ID, FUN = function(x) 2005 %in% x))
such that
> dfout
Site_ID Year AnnualMean
4 2 2004 2.1
5 2 2005 2.6
6 2 2006 3.1
7 3 2003 2.7
8 3 2004 2.6
9 3 2005 1.9
An option with data.table
library(data.table)
setDT(data)[, .SD[2005 %in% Year], Site_ID]
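A base R one-liner in the same spirit, sketched without ave(): collect the Site_IDs that have a 2005 measurement and keep their rows with %in%:

```r
data <- data.frame(Site_ID = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                   Year = c(1976, 1977, 1978, 2004, 2005, 2006, 2003, 2004, 2005),
                   AnnualMean = c(1.1, 1.2, 1.1, 2.1, 2.6, 3.1, 2.7, 2.6, 1.9))

# keep every row whose Site_ID appears at least once with Year == 2005
out <- data[data$Site_ID %in% data$Site_ID[data$Year == 2005], ]
```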

Average column in daily information at every n-th row

I am very new to R. I have daily observations of temperature and PP for a 12-year period (6574 rows, 6 columns, some NAs). I want to calculate, for example, the average from the 1st to the 10th day of January 2001, then the 11th to the 20th, and finally the 21st to the 31st, and so on for every month through December, for each year in the period mentioned before.
I also run into problems because February sometimes has 28 or 29 days (leap years).
This is how I open my file, a CSV, with read.table:
# READ CSV
setwd("C:\\Users\\GVASQUEZ\\Documents\\ESTUDIO_PAMPAS\\R_sheet")
huancavelica <- read.table("huancavelica.csv", header = TRUE, sep = ",",
                           dec = ".", fileEncoding = "latin1", nrows = 6574)
This is the output of my CSV file
Año Mes Dia PT101 TM102 TM103
1 1998 1 1 6.0 15.6 3.4
2 1998 1 2 8.0 14.4 3.2
3 1998 1 3 8.6 13.8 4.4
4 1998 1 4 5.6 14.6 4.6
5 1998 1 5 0.4 17.4 3.6
6 1998 1 6 3.4 17.4 4.4
7 1998 1 7 9.2 14.6 3.2
8 1998 1 8 2.2 16.8 2.8
9 1998 1 9 8.6 18.4 4.4
10 1998 1 10 6.2 15.0 3.6
. . . . . . .
With the data setup you have, a fairly tried-and-true method should work:
# add 0 in front of single-digit month values so that months 1 and 10 sort correctly
huancavelica$MesChar <- ifelse(nchar(huancavelica$Mes) == 1,
                               paste0("0", huancavelica$Mes),
                               as.character(huancavelica$Mes))
# get time-of-month ID (1 = days 1-10, 2 = days 11-20, 3 = days 21 onward)
huancavelica$timeMonth <- ifelse(huancavelica$Dia < 11, 1,
                                 ifelse(huancavelica$Dia > 20, 3, 2))
# get final ID
huancavelica$ID <- paste(huancavelica$Año, huancavelica$MesChar,
                         huancavelica$timeMonth, sep = ".")
# average stat
huancavelica$myStat <- ave(huancavelica$PT101, huancavelica$ID,
                           FUN = function(x) mean(x, na.rm = TRUE))
We can try
library(data.table)
setDT(huancavelica)[, Grp := (Dia - 1) %/% 10 + 1, by = .(Año, Mes)
                    ][Grp > 3, Grp := 3
                      ][, lapply(.SD, mean, na.rm = TRUE), by = .(Año, Mes, Grp)]
It adds a bit more complexity, but you could cut each month into thirds and get the average for each third. For example:
library(dplyr)
library(lubridate)

# Fake data
set.seed(10)
df = data.frame(date = seq(as.Date("2015-01-01"), as.Date("2015-12-31"), by = "1 day"),
                value = rnorm(365))

# Cut months into thirds
df = df %>%
  mutate(mon_yr = paste0(month(date, label = TRUE, abbr = TRUE), " ", year(date))) %>%
  group_by(mon_yr) %>%
  mutate(cutMonth = cut(day(date),
                        breaks = c(0, round(1/3 * n()), round(2/3 * n()), n()),
                        labels = c("1st third", "2nd third", "3rd third")),
         cutMonth = paste0(mon_yr, ", ", cutMonth)) %>%
  ungroup %>%
  mutate(cutMonth = factor(cutMonth, levels = unique(cutMonth)))
date value cutMonth
1 2015-01-01 0.01874617 Jan 2015, 1st third
2 2015-01-02 -0.18425254 Jan 2015, 1st third
3 2015-01-03 -1.37133055 Jan 2015, 1st third
...
363 2015-12-29 -1.3996571 Dec 2015, 3rd third
364 2015-12-30 -1.2877952 Dec 2015, 3rd third
365 2015-12-31 -0.9684155 Dec 2015, 3rd third
# Summarise to get average value for each 1/3 of a month
df.summary = df %>%
  group_by(cutMonth) %>%
  summarise(average.value = mean(value))
cutMonth average.value
1 Jan 2015, 1st third -0.49065685
2 Jan 2015, 2nd third 0.28178222
3 Jan 2015, 3rd third -1.03870698
4 Feb 2015, 1st third -0.45700203
5 Feb 2015, 2nd third -0.07577199
6 Feb 2015, 3rd third 0.33860882
7 Mar 2015, 1st third 0.12067388
...
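The fixed 1-10 / 11-20 / 21-to-end split from the data.table answer can also be sketched in base R with aggregate(); the one-month huancavelica frame below is made-up stand-in data (with Ano in place of Año) just to show the shape:

```r
set.seed(1)
huancavelica <- data.frame(Ano = 1998, Mes = 1, Dia = 1:31,
                           PT101 = runif(31, 0, 10))

# days 1-10 -> group 1, days 11-20 -> group 2, day 21 onward -> group 3
huancavelica$Grp <- pmin((huancavelica$Dia - 1) %/% 10 + 1, 3)

# na.action = na.pass keeps NA rows so that na.rm = TRUE can handle them per column
avg <- aggregate(PT101 ~ Ano + Mes + Grp, data = huancavelica,
                 FUN = mean, na.rm = TRUE, na.action = na.pass)
```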

Matching DFs on two columns and multiplying

I have a dataframe such as the following one, only with many more columns and an additional ID variable.
data <- data.frame(year = c(rep(2014, 12), rep(2015, 12)),
                   month = c(seq(1, 12), seq(1, 12)),
                   value = c(rep(5, 24)))
The data for some year/month combinations is incorrect, and must be adjusted by multiplying by a factor for the periods shown below.
fix <- data.frame(year = c(2014, 2014, 2015), month = c(1, 5, 6), f = c(.9, 1.1, 12))
I'm currently doing this via ddply, but I'm looking for a more elegant solution:
library(plyr)
factorize <- function(x) {
  x$value <- x$value * fix[fix$year == unique(x$year) & fix$month == unique(x$month), 3]
  x
}
data2 <- ddply(data, c("year", "month"), factorize)
Any thoughts or suggestions?
Thanks!
Here's a base R approach:
transform(merge(data, fix, all.x=TRUE), value = ifelse(is.na(f), value, value*f), f=NULL)
And in case you need faster performance you can use data.table:
library(data.table)
data <- merge(setDT(data), setDT(fix), all.x = TRUE, by = c("year", "month"))
data[!is.na(f), value := value*f]
data[,f := NULL]
With dplyr and ifelse you can get this in one line, but note that it relies on fix recycling evenly into data (it only lines up here because nrow(data) is a multiple of nrow(fix) and the matching rows happen to align), so prefer a merge/join for real data:
data %>% mutate(value = ifelse(year == fix$year &
                                 month == fix$month,
                               value * fix$f, value))
   year month value
1  2014     1   4.5
2  2014     2   5.0
3  2014     3   5.0
4  2014     4   5.0
5  2014     5   5.5
6  2014     6   5.0
7  2014     7   5.0
8  2014     8   5.0
9  2014     9   5.0
10 2014    10   5.0
11 2014    11   5.0
12 2014    12   5.0
13 2015     1   5.0
14 2015     2   5.0
15 2015     3   5.0
16 2015     4   5.0
17 2015     5   5.0
18 2015     6  60.0
19 2015     7   5.0
20 2015     8   5.0
21 2015     9   5.0
22 2015    10   5.0
23 2015    11   5.0
24 2015    12   5.0
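A sketch that avoids relying on row order entirely: look up each row in fix by a combined year-month key with match() and treat a missing factor as 1 (base R only):

```r
data <- data.frame(year = c(rep(2014, 12), rep(2015, 12)),
                   month = rep(1:12, 2), value = rep(5, 24))
fix <- data.frame(year = c(2014, 2014, 2015), month = c(1, 5, 6),
                  f = c(0.9, 1.1, 12))

# match() returns the fix row for each data row, or NA when there is no correction
i <- match(paste(data$year, data$month), paste(fix$year, fix$month))
data$value <- data$value * ifelse(is.na(i), 1, fix$f[i])
```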
