R: Filtering rows based on a group criterion

I have a data frame with over 100,000 rows and with about 40 columns. The schools column has about 100 distinct schools. I have data from 1980 to 2023.
I want to keep all data from schools that have at least 10 rows for each of the years 2018 through 2022. Schools that do not meet that criterion should have all rows deleted.
In my minimal example, Schools, I have three schools.
Computing a table makes it apparent that only Washington should be retained: Adams has only 5 rows for 2018, and Jefferson has 0.
Schools2 is what the result should look like.
How do I use the table computation or a dplyr computation to perform the filter?
Schools <- data.frame(
  school = c(rep('Washington', 60),
             rep('Adams', 70),
             rep('Jefferson', 100)),
  year = c(rep(2016, 5), rep(2018:2022, each = 10), rep(2023, 5),
           rep(2017, 25), rep(2018, 5), rep(2019:2022, each = 10),
           rep(2019:2023, each = 20)),
  stuff = rnorm(230)
)
Schools2 <- data.frame(
  school = rep('Washington', 60),
  year = c(rep(2016, 5), rep(2018:2022, each = 10), rep(2023, 5)),
  stuff = rnorm(60)
)
table(Schools$school, Schools$year)
Schools |> group_by(school, year) |> summarize(counts = n())
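For reference, table(Schools$school, Schools$year) for this example comes out as follows; Washington is the only school with at least 10 rows in every one of the 2018-2022 columns:
             2016 2017 2018 2019 2020 2021 2022 2023
  Adams         0   25    5   10   10   10   10    0
  Jefferson     0    0    0   20   20   20   20   20
  Washington    5    0   10   10   10   10   10    5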

Keep only the years 2018 through 2022 inside a per-school check with filter(): a school is retained only if every one of those years is present and each of them has at least 10 rows. With dplyr >= 1.1.0 the grouping can be done directly in filter() via the .by argument.
library(dplyr)  # version >= 1.1.0 for the .by argument
Schools %>%
  filter(all(table(year[year %in% 2018:2022]) >= 10) &
           all(2018:2022 %in% year),
         .by = c("school")) %>%
  as_tibble()
Output:
# A tibble: 60 × 3
school year stuff
<chr> <dbl> <dbl>
1 Washington 2016 0.680
2 Washington 2016 -1.14
3 Washington 2016 0.0420
4 Washington 2016 -0.603
5 Washington 2016 2.05
6 Washington 2018 -0.810
7 Washington 2018 0.692
8 Washington 2018 -0.502
9 Washington 2018 0.464
10 Washington 2018 0.397
# … with 50 more rows
Or using count
library(magrittr)
Schools %>%
  filter(tibble(year) %>%
           filter(year %in% 2018:2022) %>%
           count(year) %>%
           pull(n) %>%
           is_weakly_greater_than(10) %>%
           all,
         all(2018:2022 %in% year),
         .by = "school")
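If you prefer to first build the set of qualifying schools and then join, here is an equivalent sketch (it also assumes dplyr >= 1.1.0 for the .by argument; keep is just an intermediate name):
keep <- Schools |>
  filter(year %in% 2018:2022) |>            # restrict to the years of interest
  count(school, year) |>                    # rows per school and year
  summarise(ok = all(n >= 10) && n() == 5,  # at least 10 rows in each of the 5 years
            .by = school) |>
  filter(ok)
Schools |> semi_join(keep, by = "school")   # keep every row of the qualifying schools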

As it turns out, a friend just helped me come up with a base R solution.
# form 2-way table, school against year
sdTable = table(Schools$school, Schools$year)
# say want years 2018-2022 having lots of rows in school data
sdTable = sdTable[,3:7]
# which have >= 10 rows in all years 2018-2022
allGtEq = function(oneRow) all(oneRow >= 10)
whichToKeep = which(apply(sdTable,1,allGtEq))
# now whichToKeep is row numbers from the table; get the school names
whichToKeep = names(whichToKeep)
# back to school data
whichOrigRowsToKeep = which(Schools$school %in% whichToKeep)
newSchools = Schools[whichOrigRowsToKeep,]
newSchools
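A slightly more defensive variant of the same idea selects the year columns by name rather than by position (a sketch; the hard-coded 3:7 above silently breaks if the range of years in the data changes):
sdTable <- table(Schools$school, Schools$year)
yearCols <- as.character(2018:2022)
keepSchools <- rownames(sdTable)[apply(sdTable[, yearCols, drop = FALSE] >= 10, 1, all)]
newSchools <- Schools[Schools$school %in% keepSchools, ]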

Related

Filtering twice with multiple variables and counting rows

I have this data frame, which is grouped by id_station, id_parameter, zona, and date:
id_station id_parameter zona year month day mediaDiaria sdDiaria Count
1 AJM CO SO 2019 1 1 0.281 0.181 21
2 AJM CO SO 2019 1 2 0.367 0.230 24
3 AJM CO SO 2019 1 3 0.371 0.160 24
4 AJM CO SO 2019 1 4 0.312 0.185 24
5 AJM CO SO 2019 1 5 0.296 0.168 24
6 AJM CO SO 2019 1 6 0.225 0.142 24
7 AJM CO SO 2019 1 7 0.281 0.0873 21
8 AJM CO SO 2019 1 8 0.388 0.236 24
9 AJM CO SO 2019 1 9 0.421 0.265 24
10 AJM CO SO 2019 1 10 0.225 0.103 24
What I want to do is filter the rows from March 1st, 2019 to February 29, 2020, and treat that window as "Year 1". Afterwards, I want to count, per id_station, the number of days in Year 1 with Count > 18, and eliminate all rows from stations that have fewer than 275 such days.
I have tried the following with filter:
Year1in2019CO <- datosCO %>%
filter(year == 2019, month %in% c(3:12)) %>%
group_by(id_station, id_parameter, zona, year, month, day) %>%
summarise(mediaDiaria = mean(valor, na.rm = TRUE), sdDiaria = sd(valor, na.rm = TRUE),
Count = sum(!is.na(valor)))
Year1in2020CO <- datosCO %>%
filter(year == 2020, month %in% c(1:2)) %>%
group_by(id_station, id_parameter, zona, year, month, day) %>%
summarise(mediaDiaria = mean(valor, na.rm = TRUE), sdDiaria = sd(valor, na.rm = TRUE),
Count = sum(!is.na(valor)))
Year1CO <- bind_rows(Year1in2019CO, Year1in2020CO)
It does the job. But is there a way to do this while only creating one data frame, instead of 3?
And I have tried the following for the counting rows part:
YEAR1dfCO_2 <- Year1CO %>%
group_by(id_station) %>%
summarise(dws = sum(Count > 18))
And while it does give me the counts I need, I do not know how to use them to eliminate, from Year1CO, all data from stations with fewer than 275 days where Count > 18.
Can you please help me?
This might work. First filter the Year 1 rows using a constructed date, then drop the stations that do not meet the condition you described.
library(tidyverse)
yr1 <- datosCO %>%
  mutate(d = as.Date(paste(year, month, day, sep = "-"))) %>%
  filter(between(d, as.Date("2019-03-01"), as.Date("2020-02-29"))) %>%
  group_by(id_station, id_parameter, zona, d) %>%
  summarise(mediaDiaria = mean(valor, na.rm = TRUE),
            sdDiaria = sd(valor, na.rm = TRUE),
            Count = sum(!is.na(valor)))
yr1 %>%
  group_by(id_station) %>%
  filter(sum(Count > 18) >= 275) %>%  # keep only stations with at least 275 days where Count > 18
  ungroup()
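An equivalent sketch that mirrors the counting step you already wrote: first compute the qualifying stations, then keep only their rows (good_stations is just an intermediate name):
good_stations <- yr1 %>%
  group_by(id_station) %>%
  summarise(dws = sum(Count > 18)) %>%  # days in Year 1 with Count > 18, per station
  filter(dws >= 275) %>%
  pull(id_station)
yr1 %>%
  filter(id_station %in% good_stations)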

dplyr, filter if both values are above a number [duplicate]

This question already has answers here: dplyr filter with condition on multiple columns (6 answers). Closed 2 years ago.
I have a data set like this:
df <- data.frame(
  Business = c('HR', 'HR', 'Finance', 'Finance', 'Legal', 'Legal', 'Research'),
  Country = c('Iceland', 'Iceland', 'Norway', 'Norway', 'US', 'US', 'France'),
  Gender = c('Female', 'Male', 'Female', 'Male', 'Female', 'Male', 'Male'),
  Value = c(10, 5, 20, 40, 10, 20, 50)
)
I need to keep only the Business/Country groups where both the male and the female value are >= 10. For example, Iceland HR should be removed (its male value is 5), as should Research France (it has no female row).
I've tried df %>% group_by(Business, Country) %>% filter(Value >= 10), but this only drops the individual rows with a value below 10 rather than whole groups. Any ideas?
Maybe this can help:
library(reshape2)
library(dplyr)  # for %>%, mutate() and filter()
df2 <- reshape(df, idvar = c('Business', 'Country'), timevar = 'Gender', direction = 'wide')
df2 %>%
  mutate(Index = ifelse(Value.Female >= 10 & Value.Male >= 10, 1, 0)) %>%
  filter(Index == 1) -> df3
df4 <- reshape2::melt(df3[, -5], id.vars = c('Business', 'Country'))
df4
Business Country variable value
1 Finance Norway Value.Female 20
2 Legal US Value.Female 10
3 Finance Norway Value.Male 40
4 Legal US Value.Male 20
You could just use two ave steps, one with length, one with min.
df <- df[with(df, ave(Value, Country, FUN=length)) == 2, ]
df[with(df, ave(Value, Country, FUN=min)) >= 10, ]
# Business Country Gender Value
# 3 Finance Norway Female 20
# 4 Finance Norway Male 40
# 5 Legal US Female 10
# 6 Legal US Male 20
Notice that this also works if we shuffle the rows of the data frame.
set.seed(42)
df2 <- df[sample(1:nrow(df)), ]
df2 <- df2[with(df2, ave(Value, Country, FUN=length)) == 2, ]
df2[with(df2, ave(Value, Country, FUN=min)) >= 10, ]
# Business Country Gender Value
# 5 Legal US Female 10
# 6 Legal US Male 20
# 3 Finance Norway Female 20
# 4 Finance Norway Male 40
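A dplyr-only sketch of the same idea, assuming each Business/Country pair has at most one Female and one Male row as in the example data:
library(dplyr)
df %>%
  group_by(Business, Country) %>%
  filter(n() == 2, all(Value >= 10)) %>%  # both rows present and both values >= 10
  ungroup()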

Aggregating and counting elements in the variables of a dataset

I may not have searched for this the right way; apologies if it is a duplicate.
I have a dataset with multiple columns:
helena:
Year US$ Euros Country Regions
2001 12 13 US America
2000 13 15 UK Europe
2003 14 19 China Asia
I want to group the dataset so that, for each region and year, I get the total of the earnings columns plus a column showing how many countries reported data for that region in that year. The result should look like this:
Year US$ Euros Regions Number of countries per region per Year
2000 150 135 America 2
2001 135 151 Europe 15
2002 142 1900 Asia 18
So far I have tried
count(helena, c("Regions", "Year"))
but it does not do what I need, since it returns only the columns used for counting.
Here is the data.table way; I have added a row for Canada for the year 2000 to test the code:
library(data.table)
df <- data.frame(Year = c(2000, 2001, 2003, 2000),
                 US = c(13, 12, 14, 13),
                 Euros = c(15, 13, 19, 15),
                 Country = c('US', 'UK', 'China', 'Canada'),
                 Regions = c('America', 'Europe', 'Asia', 'America'))
df <- data.table(df)
df[, .(sum_US = sum(US),
       sum_Euros = sum(Euros),
       number_of_countries = uniqueN(Country)),
   by = .(Regions, Year)]
Regions Year sum_US sum_Euros number_of_countries
1: America 2000 26 30 2
2: Europe 2001 12 13 1
3: Asia 2003 14 19 1
With dplyr:
library(dplyr)
your_data %>%
  group_by(Regions, Year) %>%
  summarize(
    US = sum(US),
    Euros = sum(Euros),
    N_countries = n_distinct(Country)
  )
Using dplyr:
library(dplyr)
df %>%
  group_by(Regions, Year) %>%
  summarise(Earnings_US = sum(`US$`),
            Earnings_Euros = sum(Euros),
            N_Countries = length(Country))
This aggregates the data set by region and year, summing the earnings columns and taking the length of the country column (assuming countries are unique within each region and year).
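For completeness, here is a base R sketch of the same aggregation; it uses the df defined in the data.table answer above (where the dollar column is named US rather than US$), so adjust the column names for your real data:
sums <- aggregate(cbind(US, Euros) ~ Regions + Year, data = df, FUN = sum)
counts <- aggregate(Country ~ Regions + Year, data = df,
                    FUN = function(x) length(unique(x)))
names(counts)[3] <- "N_Countries"
merge(sums, counts, by = c("Regions", "Year"))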
Using tidyverse and building the example
library(tidyverse)
df <- tibble(Year = c(2000, 2001, 2003, 2000),
             US = c(13, 12, 14, 13),
             Euros = c(15, 13, 19, 15),
             Country = c('US', 'UK', 'China', 'Canada'),
             Regions = c('America', 'Europe', 'Asia', 'America'))
df %>%
  group_by(Regions, Year) %>%
  summarise(US = sum(US),
            Euros = sum(Euros),
            Countries = n_distinct(Country))
updated to reflect the data in the original question

Calculate difference between values using different column and with gaps using R

Can anyone help me figure out how to calculate differences in values based on my monthly data? For example, I would like to calculate the difference in groundwater values between Jan-Jul, Feb-Aug, Mar-Sep, etc., for each well by year. Note that in some years some months will be missing. Any tidyverse solutions would be appreciated.
Well year month value
<dbl> <dbl> <fct> <dbl>
1 222 1995 February 8.53
2 222 1995 March 8.69
3 222 1995 April 8.92
4 222 1995 May 9.59
5 222 1995 June 9.59
6 222 1995 July 9.70
7 222 1995 August 9.66
8 222 1995 September 9.46
9 222 1995 October 9.49
10 222 1995 November 9.31
# ... with 18,400 more rows
This is what I have tried so far for a single pair of months (February-August):
df1 <- subset(df, month %in% c("February", "August"))
test <- df1 %>%
  dcast(site + year + Well ~ month, value.var = "value") %>%
  mutate(Diff = February - August)
Thanks,
Simon
So I attempted to manufacture a data set and use dplyr to create a solution. It is best practice to include a method of generating a sample data set, so please do so in future questions.
# load required library
library(dplyr)
# generate data set of all site, well, and month combinations
## define valid values
sites <- letters[1:3]
wells <- 1:5
months <- month.name
## perform a series of merges (cross joins)
full_sites_wells_months_set <-
  merge(sites, wells) %>%
  dplyr::rename(sites = x, wells = y) %>%  # this line and the prior could be replaced on your system with initial_tibble %>% dplyr::select(sites, wells) %>% unique()
  merge(months) %>%
  dplyr::rename(months = y) %>%
  dplyr::arrange(sites, wells)
# create sample initial_tibble
## define fraction of records to simulate missing months
data_availability <- 0.8
initial_tibble <-
  full_sites_wells_months_set %>%
  dplyr::sample_frac(data_availability) %>%
  dplyr::mutate(values = runif(nrow(full_sites_wells_months_set) * data_availability))  # generate random groundwater values
# generate final result by joining the full expected set of sites, wells, and months
# to the actual data, then grouping by site and well and taking the lag difference
final_tibble <-
  full_sites_wells_months_set %>%
  dplyr::left_join(initial_tibble) %>%
  dplyr::group_by(sites, wells) %>%
  dplyr::mutate(trailing_difference_6_months = values - dplyr::lag(values, 6L))
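A sketch that stays closer to the question's own columns (Well, year, month, value), assuming one row per well, year, and month as in the printed data: tidyr::complete() fills the missing months with NA so that a 6-month lag lines up with the calendar.
library(dplyr)
library(tidyr)
df %>%
  mutate(month = factor(month, levels = month.name)) %>%
  group_by(Well, year) %>%
  complete(month) %>%                                  # expand to all 12 months; missing ones get NA
  arrange(month, .by_group = TRUE) %>%
  mutate(diff_6_months = value - lag(value, 6)) %>%    # e.g. the July row gets July minus January
  ungroup()
Flip the sign (or use lead() instead of lag()) if you want the earlier month minus the later one, as in your February - August example.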

Using conditions in group_by()/summarize() loop

I have a dataframe that looks something like this (I have a lot more years and variables):
Name State2014 State2015 State2016 Tuition2014 Tuition2015 Tuition2016 StateGrants2014
Jared CA CA MA 22430 23060 40650 5000
Beth CA CA CA 36400 37050 37180 4200
Steven MA MA MA 18010 18250 18720 NA
Lary MA CA MA 24080 30800 24600 6600
Tom MA OR OR 40450 15800 16040 NA
Alfred OR OR OR 23570 23680 23750 3500
Cathy OR OR OR 32070 32070 33040 4700
My objective (in this example) is to get the mean tuition for each state, and the sum of state grants for each state. My thought was to subset the data by year:
State2014 Tuition2014 StateGrants2014
CA 22430 5000
CA 36400 4200
MA 18010 NA
MA 24080 6600
MA 40450 NA
OR 23570 3500
OR 32070 4700
State2015 Tuition2015
CA 23060
CA 37050
MA 18250
CA 30800
OR 15800
OR 23680
OR 32070
State2016 Tuition2016
MA 40650
CA 37180
MA 18720
MA 24600
OR 16040
OR 23750
OR 33040
Then I would group_by state and summarize (and save each as a separate df) to get the following:
State2014 Tuition2014 StateGrants2014
CA 29415 9200
MA 27513 6600
OR 27820 8200
State2015 Tuition2015
CA 30303
MA 18250
OR 23850
State2016 Tuition2016
CA 37180
MA 27990
OR 24277
Then I would merge them by state. Here is my code:
years = c(2014,2015,2016)
for (i in seq_along(years)) {
#grab the variables from a certain year and save as a new df.
df_year <- df[, grep(paste(years[[i]],"$",sep=""), colnames(df))]
#Take off the year from each variable name (to make it easier to summarize)
names(df_year) <- gsub(years[[i]], "", names(df_year), fixed = TRUE)
df_year <- df_year %>%
group_by(state) %>%
summarize(Tuition = mean(Tuition, na.rm = TRUE),
#this part of the code does not work. In this example, I only want to have this part if the year is 2016.
if (years[[i]]=='2016')
{Stategrant = mean(Stategrant, na.rm = TRUE)})
#rename df_year to df####
assign(paste("df",years[[i]],sep=''),df_year)
}
I have about 50 years of data, and a good amount of variables, so I wanted to use a loop. So my question is, how do i add a conditional statement (summarize certain variables conditioned on the year) in the group_by()/summarize() function? Thanks!
*Edit: I realize that I could take the if{} out of the function, and do something like:
if (years[[i]]==2016){
df_year <- df_year %>%
group_by(state) %>%
summarize(Tuition = mean(Tuition, na.rm = TRUE),
Stategrant = mean(Stategrant, na.rm = TRUE))
#rename df_year to df####
assign(paste("df",years[[i]],sep=''),df_year)
}
else{
df_year <- df_year %>%
group_by(state) %>%
summarize(Tuition = mean(Tuition, na.rm = TRUE))
#rename df_year to df####
assign(paste("df",years[[i]],sep=''),df_year)
}
but there are so many combinations of variables that writing a separate branch for each case would not be efficient or useful.
This is so much easier with tidy data, so let me show you how to tidy up your data. See http://r4ds.had.co.nz/tidy-data.html.
library(tidyr)
library(dplyr)
df <- gather(df, key, value, -Name) %>%
  # separate the years from the variable names
  separate(key, c("var", "year"), sep = -5) %>%
  # the above line splits up e.g. State2014 into State and 2014 by counting
  # character positions from the end of the string. Depending on your tidyr
  # version the position may need to be sep = -4 instead, and it assumes your
  # naming conventions are consistent, so check the result.
  spread(var, value) %>%
  # turn numbers back to numeric (gather() made everything character)
  mutate_at(.vars = c("Tuition", "StateGrants"), as.numeric) %>%
  gather(var, val, -Name, -year, -State) %>%
  # group by the variables of interest. Note that `var` here refers to
  # Tuition and StateGrants. If you have more variables, they will be
  # included here as well; to exclude further variables from `var`, add more
  # "-colName" entries in the `gather` statement above.
  group_by(year, State, var) %>%
  # summarize:
  summarise(mean_values = mean(val))
This gives you:
Source: local data frame [18 x 4]
Groups: year, State [?]
year State var mean_values
<chr> <chr> <chr> <dbl>
1 2014 CA StateGrants 4600.00
2 2014 CA Tuition 29415.00
3 2014 MA StateGrants NA
4 2014 MA Tuition 27513.33
5 2014 OR StateGrants 4100.00
6 2014 OR Tuition 27820.00
7 2015 CA StateGrants NA
8 2015 CA Tuition 30303.33
9 2015 MA StateGrants NA
10 2015 MA Tuition 18250.00
11 2015 OR StateGrants NA
12 2015 OR Tuition 23850.00
13 2016 CA StateGrants NA
14 2016 CA Tuition 37180.00
15 2016 MA StateGrants NA
16 2016 MA Tuition 27990.00
17 2016 OR StateGrants NA
18 2016 OR Tuition 24276.67
If you don't like the shape of this, you can e.g. add a %>% spread(var, mean_values) after the summarise statement to have the means for Tuition and StateGrants in different columns.
If you want to compute different functions for Tuition and StateGrants (e.g. the mean of Tuition and the sum of StateGrants), you could do the following:
df <- gather(df, key, value, -Name) %>%
  separate(key, c("var", "year"), sep = -5) %>%
  spread(var, value) %>%
  mutate_at(.vars = c("Tuition", "StateGrants"), as.numeric) %>%
  group_by(year, State) %>%
  summarise(Grant_Sum = sum(StateGrants, na.rm = TRUE), Tuition_Mean = mean(Tuition))
This gives you:
Source: local data frame [9 x 4]
Groups: year [?]
year State Grant_Sum Tuition_Mean
<chr> <chr> <dbl> <dbl>
1 2014 CA 9200 29415.00
2 2014 MA 6600 27513.33
3 2014 OR 8200 27820.00
4 2015 CA 0 30303.33
5 2015 MA 0 18250.00
6 2015 OR 0 23850.00
7 2016 CA 0 37180.00
8 2016 MA 0 27990.00
9 2016 OR 0 24276.67
Note that I used sum here, with na.rm = T, which returns 0 if all elements are NAs. Make sure this makes sense in your use case.
Also, just to mention it, to get your individual data.frames that you asked for, you can use filter(year == 2014) etc, as in df_2014 <- filter(df, year == 2014).
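With more recent tidyr/dplyr versions, the reshaping step can also be written with pivot_longer()/pivot_wider(); this is a sketch, assuming your column names all follow the VariableYYYY pattern shown in the question:
library(dplyr)
library(tidyr)
tidy_df <- df %>%
  pivot_longer(-Name,
               names_to = c("var", "year"),
               names_pattern = "([A-Za-z]+)(\\d{4})",
               values_transform = list(value = as.character)) %>%  # State columns are character, so coerce everything
  pivot_wider(names_from = var, values_from = value) %>%
  mutate(across(c(Tuition, StateGrants), as.numeric))
tidy_df %>%
  group_by(year, State) %>%
  summarise(Grant_Sum = sum(StateGrants, na.rm = TRUE),
            Tuition_Mean = mean(Tuition),
            .groups = "drop")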
