How to sum parts of a column in R?

I am doing a project working out the flashiness index for 15-minute flow data.
I already have code that works out the lagged flow differences:
# new variable for lag time
flow_lagged_S <- S %>% mutate(
  flow_lag = lag(flow, n = 1),      # first calculate the lag
  Qi_Qi1 = abs(flow - flow_lag))    # then the abs value of the difference
# calculate sums following the formula
RB_index_S <- flow_lagged_S %>%
  summarise(RB_index = sum(Qi_Qi1, na.rm = T) / sum(flow, na.rm = T))
The data covers several years, and at the moment I can calculate the flashiness for the whole station but not for every year.
For the last bit of the code I need to change it so that it calculates the sums for each year. How do I do that? Instead of summing the whole Qi_Qi1 column, I need to sum Qi_Qi1 for, say, the year 2002 only.
My table flow_lagged_S looks like this:
time_stamp            flow   year  flow_lag  Qi_Qi1
2002-10-24 22:45:00   9.50   2002  NA        NA
2002-10-24 23:00:00  10.00   2002  9.50      0.50
2002-10-24 23:15:00  10.50   2002  10.00     0.50
2002-10-24 23:30:00  11.00   2002  10.50     0.70

You can use the group_by() function from the dplyr package:
library(dplyr)

df <- data.frame(time_stamp = c("2002-10-24 22:45:00", "2002-10-24 23:00:00",
                                "2002-10-24 23:15:00", "2002-10-24 23:30:00"),
                 flow = c(9.5, 10, 10.5, 11),
                 year = c(2002, 2002, 2002, 2002),
                 flow_lag = c(NA, 9.5, 10, 10.5),
                 Qi_Qi1 = c(NA, .5, .5, .7))

df %>%
  group_by(year) %>%
  summarize(total = sum(Qi_Qi1, na.rm = TRUE))
The result is:
# A tibble: 1 x 2
year total
<dbl> <dbl>
1 2002 1.7
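Applied back to the original pipeline, a minimal sketch (assuming flow_lagged_S carries the year column shown above) would be:
library(dplyr)

# group by year so both sums run within each year's rows
RB_index_S <- flow_lagged_S %>%
  group_by(year) %>%
  summarise(RB_index = sum(Qi_Qi1, na.rm = TRUE) / sum(flow, na.rm = TRUE))
This returns one RB_index per year instead of a single value for the whole station.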


R: Filtering rows based on a group criterion

I have a data frame with over 100,000 rows and about 40 columns. The schools column has about 100 distinct schools. I have data from 1980 to 2023.
I want to keep all data from schools that have at least 10 rows for each of the years 2018 through 2022. Schools that do not meet that criterion should have all rows deleted.
In my minimal example, Schools, I have three schools.
Computing a table makes it apparent that only Washington should be retained. Adams only has 5 rows for 2018 and Jefferson has 0 for 2018.
Schools2 is what the result should look like.
How do I use the table computation or a dplyr computation to perform the filter?
Schools <-
  data.frame(school = c(rep('Washington', 60),
                        rep('Adams', 70),
                        rep('Jefferson', 100)),
             year = c(rep(2016, 5), rep(2018:2022, each = 10), rep(2023, 5),
                      rep(2017, 25), rep(2018, 5), rep(2019:2022, each = 10),
                      rep(2019:2023, each = 20)),
             stuff = rnorm(230))

Schools2 <-
  data.frame(school = rep('Washington', 60),
             year = c(rep(2016, 5), rep(2018:2022, each = 10), rep(2023, 5)),
             stuff = rnorm(60))
table(Schools$school, Schools$year)

library(dplyr)
Schools |> group_by(school, year) |> summarize(counts = n())
Within each school (via the .by argument of filter), tabulate the years 2018 to 2022 and keep the school only if all of those years are present and each has a count of at least 10:
library(dplyr) # version >= 1.1.0
Schools %>%
  filter(all(table(year[year %in% 2018:2022]) >= 10) &
           all(2018:2022 %in% year), .by = "school") %>%
  as_tibble()
Output:
# A tibble: 60 × 3
school year stuff
<chr> <dbl> <dbl>
1 Washington 2016 0.680
2 Washington 2016 -1.14
3 Washington 2016 0.0420
4 Washington 2016 -0.603
5 Washington 2016 2.05
6 Washington 2018 -0.810
7 Washington 2018 0.692
8 Washington 2018 -0.502
9 Washington 2018 0.464
10 Washington 2018 0.397
# … with 50 more rows
Or using count (is_weakly_greater_than() is magrittr's alias for >=):
library(magrittr)
Schools %>%
  filter(tibble(year) %>%
           filter(year %in% 2018:2022) %>%
           count(year) %>%
           pull(n) %>%
           is_weakly_greater_than(10) %>%
           all,
         all(2018:2022 %in% year), .by = "school")
As it turns out, a friend just helped me come up with a base R solution.
# form a 2-way table, school against year
sdTable <- table(Schools$school, Schools$year)
# we want years 2018-2022 to have lots of rows in the school data
sdTable <- sdTable[, as.character(2018:2022)]
# which schools have >= 10 rows in all years 2018-2022?
allGtEq <- function(oneRow) all(oneRow >= 10)
whichToKeep <- which(apply(sdTable, 1, allGtEq))
# whichToKeep holds row numbers from the table; get the school names
whichToKeep <- names(whichToKeep)
# back to the school data
whichOrigRowsToKeep <- which(Schools$school %in% whichToKeep)
newSchools <- Schools[whichOrigRowsToKeep, ]
newSchools

Sum of column based on a condition in R

I would like to print out the total amount for each date, so that my new data frame will have date and total amount columns.
My data frame looks like this:

permitnum   amount
6/1/2022    na
ascas       30.00
olic        40.41
6/2/2022    na
avrey       17.32
fev         32.18
grey        12.20
Any advice on how to go about this will be appreciated.
Here is a tidyverse option. First convert the date entries to Date class and reformat them (the non-date rows become NA), then fill the date down so it can be used for grouping, and finally get the sum for each date.
library(tidyverse)

df %>%
  mutate(permitnum = format(as.Date(permitnum, "%m/%d/%Y"), "%m/%d/%Y")) %>%
  fill(permitnum, .direction = "down") %>%
  group_by(permitnum) %>%
  summarise(total_amount = sum(as.numeric(amount), na.rm = TRUE))
Output
permitnum total_amount
<chr> <dbl>
1 06/01/2022 70.4
2 06/02/2022 61.7
Data
df <- structure(list(permitnum = c("6/1/2022", "ascas", "olic", "6/2/2022",
"avrey", "fev", "grey"), amount = c("na", "30.00", "40.41", "na",
"17.32", "32.18", "12.20")), class = "data.frame", row.names = c(NA,
-7L))
Here is another option: split the data at each row whose permitnum contains a digit (i.e. each date row), then summarise the total amount within each chunk and attach the date.
library(tidyverse)

dat <- read_table("permitnum amount
6/1/2022 na
ascas 30.00
olic 40.41
6/2/2022 na
avrey 17.32
fev 32.18
grey 12.20")

dat |>
  group_split(id = cumsum(grepl("\\d", permitnum))) |>
  map_dfr(\(x){
    date <- x$permitnum[[1]]
    x |>
      slice(-1) |>
      summarise(date = date,
                total_amount = sum(as.numeric(amount)))
  })
#> # A tibble: 2 x 2
#> date total_amount
#> <chr> <dbl>
#> 1 6/1/2022 70.4
#> 2 6/2/2022 61.7

How can I group a variable so I can calculate the median from it?

I have a dataset with three variables:
US states (categorical)
year (continuous)
GDP per capita (continuous)
I want to create a table with the median GDP for each decade in my dataset (1955-60, 1961-70, 1971-80, 1981-90, 1991-97) for all US states, so that I end up with two columns and five rows.
So far, I produced the following code:
dataset %>%
  group_by(year) %>%
  summarise(median_gdp = median(gdpcap))
It creates the following table:
year median_gdp
<dbl> <dbl>
1 1955 2.39
2 1956 2.54
3 1957 2.68
4 1958 2.73
5 1959 2.77
6 1960 2.97
7 1961 3.14
8 1962 3.37
9 1963 3.61
10 1964 3.68
As seen in the table, I haven't yet grouped the years into a new 'decade' variable. I can't figure out how to do it. So far, I can only show the median for each individual year...
Also, how would I have to adjust the median command in my code?
Any help would be appreciated.
Thanks!
You should do:
dataset %>%
  group_by(year = (year - 1) %/% 10) %>%
  summarise(median_gdp = median(gdpcap))
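A quick check of the integer-division trick (subtracting 1 shifts round years like 1960 back into the preceding decade):
(c(1955, 1960, 1961, 1970, 1971) - 1) %/% 10
#> [1] 195 195 196 196 197
So 1955-1960 share the group label 195 and 1961-1970 share 196, matching the requested bins; multiply the label by 10 if you prefer a year-like value.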
You can use cut to divide year into decade breaks, paste the min and max year of each group together to label the decade, and calculate the median gdpcap for each group.
library(dplyr)
dataset %>%
  group_by(group = cut(year, c(1955, seq(1960, 1990, 10), 1997),
                       include.lowest = TRUE)) %>%
  summarise(decade = paste(min(year), max(year), sep = '-'),
            median_gdp = median(gdpcap)) -> dataset1

dataset1

Calculating the average over a few months at the turn of the year, for 3 different indexes and 30 years

I don't have real date values. I have one column with Year and another with Month, plus 3 more columns for 3 different indexes, with one index value per month (so 12 months per year over 30 years; it is a lot of numbers). I'd like to compute the average value over a few months.
I need this index information to predict the pollen season in summer, so I would like an average over the winter months (Dec-Jan-Feb-Mar) for NAO, and likewise for AO and SO (3 averages for 3 indexes). I'd also like to get this value not just for one year but for all years. The complication is that the window spans the turn of the year, e.g. Dec 1988 - Jan 1989 - Feb 1989. If I succeed with this I will try different combinations of months.
Year Month NAO AO SO
1 1988 1 1.02 0.26 -0.1
2 1988 2 0.76 -1.07 -0.4
3 1988 3 -0.17 -0.20 0.6
4 1988 4 -1.17 -0.56 0.1
5 1988 5 0.63 -0.85 0.9
6 1988 6 0.88 0.06 0.1
7 1988 7 -0.35 -0.14 1.0
8 1988 8 0.04 0.25 1.5
9 1988 9 -0.99 1.04 1.8
10 1988 10 -1.08 0.03 1.4
11 1988 11 -0.34 -0.03 1.7
12 1988 12 0.61 1.68 1.2
13 1989 1 1.17 3.11 1.5
14 1989 2 2.00 3.28 1.2
...
366 2018 6 1.09 0.38 -0.1
367 2018 7 1.39 0.61 0.2
368 2018 8 1.97 0.84 -0.3
index$Month <- as.character(index$Month)

# define function to compute average by consecutive season of interest/month_combination
compute_avg_season <- function(index, month_combination){
  index <- index %>%
    mutate(date = paste(Year, Month, "01", sep = "-")) %>%
    mutate(date = as.Date(date, "%Y-%b-%d")) %>%
    arrange(date) %>%
    mutate(winter_mths = ifelse(Month %in% month_combination, 1, NA))
  index <- setDT(index)[, id := rleid(winter_mths)] %>%
    filter(!is.na(winter_mths)) %>%
    group_by(id)%>%
    summarise(mean_winter_NAO=mean(NAO, na.rm = TRUE)),
Error: unexpected ',' in:
"group_by(id)%>%
summarise(mean_winter_NAO=mean(NAO, na.rm = TRUE)),"
summarise(mean_winter_NAO=mean(NAO, na.rm = TRUE),
+ mean_winter_AO=mean(AO, na.rm = TRUE),
+ mean_winter_SO=mean(SO, na.rm=TRUE))
Error in mean(NAO, na.rm = TRUE) : object 'NAO' not found
View(index)
Why do I get these errors?
The unexpected ',' error comes from the trailing comma after the summarise() call, which leaves the expression incomplete; the subsequent object 'NAO' not found error occurs because the second summarise() was then run on its own, outside the pipe, where the NAO column is not in scope. I updated the answer with the new insights from your comments:
# load libraries
library(dplyr)
library(data.table)

# pre-processing
index$Month <- as.character(index$Month)  # Month is a factor, make it character
colnames(index)[1] <- "Year"              # simplify the name of the Year column

# define a function to compute the average by consecutive season of
# interest/month_combination (do not modify this function)
compute_avg_season <- function(df, month_combination) {
  # mark the combination of months as 1, else NA
  df <- df %>%
    # correction for month MAY
    mutate(Month = replace(Month, Month == "MAI", "MAY")) %>%
    # create a date
    mutate(date = paste(Year, Month, "01", sep = "-")) %>%
    mutate(date = as.Date(date, "%Y-%b-%d")) %>%
    # sort by date (you want the average over consecutive months: DEC, JAN, FEB, MAR)
    arrange(date) %>%
    mutate(winter_mths = ifelse(Month %in% month_combination, 1, NA))
  # add an index for each set of months of interest and compute the mean by index value
  df <- setDT(df)[, id := rleid(winter_mths)] %>%
    filter(!is.na(winter_mths)) %>%
    group_by(id) %>%
    summarise(mean_winter_NAO = mean(NAO, na.rm = TRUE),
              mean_winter_AO = mean(AO, na.rm = TRUE),
              mean_winter_SO = mean(SO, na.rm = TRUE))
  return(df)
}

# use the above-defined function to compute mean values for the desired month combination:
# set the month combination
month_combination <- c("DEC", "JAN", "FEB", "MAR")
# compute mean values by month combination
compute_avg_season(index, month_combination)
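Since you plan to try different combinations of months, the same function can be reused as-is; for example (assuming your Month values use the same upper-case abbreviations):
# another consecutive combination spanning the turn of the year
compute_avg_season(index, c("NOV", "DEC", "JAN"))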

Sum duplicates then remove all but first occurrence

I have a data frame (~5000 rows, 6 columns) that contains some duplicate values for an id variable. I have another continuous variable x, whose values I would like to sum for each duplicate id. The observations are time dependent, there are year and month variables, and I'd like to keep the chronologically first observation of each duplicate id and add the subsequent dupes to this first observation.
I've included dummy data that resembles what I have: dat1. I've also included a data set that shows the structure of my desired outcome: outcome.
I've tried two strategies, neither of which quite gives me what I want (see below). The first strategy gives me the correct values for x, but I lose my year and month columns - I need to retain these for the first of each set of duplicate id values. The second strategy doesn't sum the values of x correctly.
Any suggestions for how to get my desired outcome would be much appreciated.
# dummy data set
set.seed(179)
dat1 <- data.frame(id = c(1234, 1321, 4321, 7423, 4321, 8503, 2961, 1234, 8564, 1234),
                   year = rep(c("2006", "2007"), each = 5),
                   month = rep(c("December", "January"), each = 5),
                   x = round(rnorm(10, 10, 3), 2))

# desired outcome
outcome <- data.frame(id = c(1234, 1321, 4321, 7423, 8503, 2961, 8564),
                      year = c(rep("2006", 4), rep("2007", 3)),
                      month = c(rep("December", 4), rep("January", 3)),
                      x = c(36.42, 11.55, 17.31, 5.97, 12.48, 10.22, 11.41))
# strategy 1:
library(plyr)
dat2 <- ddply(dat1, .(id), summarise, x = sum(x))

# strategy 2:
# partition into two data frames - one with unique cases, one with dupes
dat1_unique <- dat1[!duplicated(dat1$id), ]
dat1_dupes <- dat1[duplicated(dat1$id), ]
# merge these data frames while summing the x variable for duplicated ids
# with plyr
dat3 <- ddply(merge(dat1_unique, dat1_dupes, all.x = TRUE),
              .(id), summarise, x = sum(x))
# in base R
dat4 <- aggregate(x ~ id, data = merge(dat1_unique, dat1_dupes, all.x = TRUE),
                  FUN = sum)
I got different sums, but that was because I forgot the seed:
> dat1$x <- ave(dat1$x, dat1$id, FUN=sum)
> dat1[!duplicated(dat1$id), ]
id year month x
1 1234 2006 December 25.18
2 1321 2006 December 15.06
3 4321 2006 December 15.50
4 7423 2006 December 7.16
6 8503 2007 January 13.23
7 2961 2007 January 7.38
9 8564 2007 January 7.21
(To be safe, it would be better to work on a copy. And you might need to add an ordering step; see the sketch below.)
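A minimal base R sketch of such an ordering step (assuming full English month names, as in dat1):
# sort chronologically so the first row kept per id really is the earliest one
dat1 <- dat1[order(dat1$year, match(dat1$month, month.name)), ]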
You could do this with data.table (quicker and more memory-efficient than plyr), with a bit of self-joining fun using mult = 'first'. Keying by id, year and month will sort by id, then year, then month.
library(data.table)
DT <- data.table(dat1, key = c('id', 'year', 'month'))
# setnames is required as there are two x columns that get renamed x, x.1
DT1 <- setnames(DT[DT[, list(x = sum(x)), by = id], mult = 'first'][, x := NULL], 'x.1', 'x')
Or a simpler approach:
DT <- as.data.table(dat1)
DT[, x := sum(x), by = id][!duplicated(id)]
id year month x
1: 1234 2006 December 36.42
2: 1321 2006 December 11.55
3: 4321 2006 December 17.31
4: 7423 2006 December 5.97
5: 8503 2007 January 12.48
6: 2961 2007 January 10.22
7: 8564 2007 January 11.41
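For completeness, the same idea as a dplyr sketch (group_by() keeps the original row order within groups, so slice(1) retains the first occurrence; note the result is ordered by id rather than by first appearance):
library(dplyr)
dat1 %>%
  group_by(id) %>%
  mutate(x = sum(x)) %>%  # total x per id
  slice(1) %>%            # keep the first row of each id
  ungroup()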
