dplyr, filter if both values are above a number [duplicate]

This question already has answers here:
dplyr filter with condition on multiple columns
(6 answers)
Closed 2 years ago.
I have a data set like such.
df = data.frame(Business = c('HR','HR','Finance','Finance','Legal','Legal','Research'), Country = c('Iceland','Iceland','Norway','Norway','US','US','France'), Gender=c('Female','Male','Female','Male','Female','Male','Male'), Value =c(10,5,20,40,10,20,50))
I need to filter so that only the rows where both the male value and the female value are >= 10 remain. For example, Iceland HR should be removed, as should Research France.
I've tried df %>% group_by(Business,Country) %>% filter(Value >= 10), but that only drops the individual rows with a value below 10 rather than the whole group. Any ideas?
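For reference, the grouped condition can also be expressed directly in dplyr (a minimal sketch, not one of the posted answers), keeping only the Business/Country groups that have both genders and where every Value is at least 10:
library(dplyr)
df %>%
  group_by(Business, Country) %>%
  filter(all(c("Female", "Male") %in% Gender), all(Value >= 10)) %>%
  ungroup()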

Maybe this can help:
library(dplyr)
library(reshape2)
df2 <- reshape(df,idvar = c('Business','Country'),timevar = 'Gender',direction = 'wide')
df2 %>% mutate(Index=ifelse(Value.Female>=10 & Value.Male>=10,1,0)) %>%
filter(Index==1) -> df3
df4 <- reshape2::melt(df3[,-5], id.vars = c('Business','Country'))
Business Country variable value
1 Finance Norway Value.Female 20
2 Legal US Value.Female 10
3 Finance Norway Value.Male 40
4 Legal US Value.Male 20

You could just use two ave steps, one with length, one with min.
df <- df[with(df, ave(Value, Country, FUN=length)) == 2, ]
df[with(df, ave(Value, Country, FUN=min)) >= 10, ]
# Business Country Gender Value
# 3 Finance Norway Female 20
# 4 Finance Norway Male 40
# 5 Legal US Female 10
# 6 Legal US Male 20
Notice that this also works if we shuffle the rows of the data frame.
set.seed(42)
df2 <- df[sample(1:nrow(df)), ]
df2 <- df2[with(df2, ave(Value, Country, FUN=length)) == 2, ]
df2[with(df2, ave(Value, Country, FUN=min)) >= 10, ]
# Business Country Gender Value
# 5 Legal US Female 10
# 6 Legal US Male 20
# 3 Finance Norway Female 20
# 4 Finance Norway Male 40

R: Filtering rows based on a group criterion

I have a data frame with over 100,000 rows and with about 40 columns. The schools column has about 100 distinct schools. I have data from 1980 to 2023.
I want to keep all data from schools that have at least 10 rows for each of the years 2018 through 2022. Schools that do not meet that criterion should have all rows deleted.
In my minimal example, Schools, I have three schools.
Computing a table makes it apparent that only Washington should be retained. Adams only has 5 rows for 2018 and Jefferson has 0 for 2018.
Schools2 is what the result should look like.
How do I use the table computation or a dplyr computation to perform the filter?
Schools =
data.frame(school = c(rep('Washington', 60),
rep('Adams',70),
rep('Jefferson', 100)),
year = c(rep(2016, 5), rep(2018:2022, each = 10), rep(2023, 5),
rep(2017, 25), rep(2018, 5), rep(2019:2022, each = 10),
rep(2019:2023, each = 20)),
stuff = rnorm(230)
)
Schools2 =
data.frame(school = c(rep('Washington', 60)),
year = c(rep(2016, 5), rep(2018:2022, each = 10), rep(2023, 5)),
stuff = rnorm(60)
)
table(Schools$school, Schools$year)
Schools |> group_by(school, year) |> summarize(counts = n())
Within each 'school' group, check that every year from 2018 to 2022 is present and that the count for each of those years is at least 10; keep only the groups that satisfy both conditions.
library(dplyr) # version >= 1.1.0, for the .by argument
Schools %>%
filter(all(table(year[year %in% 2018:2022]) >= 10) &
all(2018:2022 %in% year), .by = c("school")) %>%
as_tibble()
Output:
# A tibble: 60 × 3
school year stuff
<chr> <dbl> <dbl>
1 Washington 2016 0.680
2 Washington 2016 -1.14
3 Washington 2016 0.0420
4 Washington 2016 -0.603
5 Washington 2016 2.05
6 Washington 2018 -0.810
7 Washington 2018 0.692
8 Washington 2018 -0.502
9 Washington 2018 0.464
10 Washington 2018 0.397
# … with 50 more rows
Or using count
library(magrittr)
Schools %>%
filter(tibble(year) %>%
filter(year %in% 2018:2022) %>%
count(year) %>%
pull(n) %>%
is_weakly_greater_than(10) %>%
all, all(2018:2022 %in% year) , .by = "school")
As it turns out, a friend just helped me come up with a base R solution.
# form 2-way table, school against year
sdTable = table(Schools$school, Schools$year)
# we want years 2018-2022, i.e. columns 3 to 7 of the table
sdTable = sdTable[,3:7]
# which have >= 10 rows in all years 2018-2022
allGtEq = function(oneRow) all(oneRow >= 10)
whichToKeep = which(apply(sdTable,1,allGtEq))
# now whichToKeep is row numbers from the table; get the school names
whichToKeep = names(whichToKeep)
# back to school data
whichOrigRowsToKeep = which(Schools$school %in% whichToKeep)
newSchools = Schools[whichOrigRowsToKeep,]
newSchools
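For reference, the same two-way-table idea can be written more compactly (a sketch along the same lines, not part of the original answer):
sdTable <- table(Schools$school, Schools$year)
# schools whose counts for every year 2018-2022 are at least 10
keep <- rownames(sdTable)[apply(sdTable[, as.character(2018:2022)] >= 10, 1, all)]
Schools[Schools$school %in% keep, ]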

Age calculation for observation data in R [duplicate]

This question already has answers here:
Return date range by group
(3 answers)
Closed 3 years ago.
I have very simple big observation data hypothetically structured as below:
> df = data.frame(ID = c("oak", "birch", rep("oak",2), "pine", "birch", "oak", rep("pine",2), "birch", "oak"),
+ yearobs = c(rep(1998,3), rep(1999,2), rep(2000,3),rep(2001,2), 2002))
> df
ID yearobs
1 oak 1998
2 birch 1998
3 oak 1998
4 oak 1999
5 pine 1999
6 birch 2000
7 oak 2000
8 pine 2000
9 pine 2001
10 birch 2001
11 oak 2002
What I want to do is calculate the age by taking the difference between the years ( max(yearobs) - min(yearobs) ) for each unique ID (tree species in this example). I have tried to work with the lubridate + dplyr packages; however, the number of observations per unique ID varies in my data, and I want to create an age column in the fastest way possible without storing the minimum and maximum values separately (avoiding for loops, since my data is huge).
Desired output:
ID age
1 oak 4
2 birch 3
3 pine 2
Any suggestion would be appreciated.
In base R you can do:
aggregate(yearobs ~ ID, data = df, FUN = function(x) max(x) - min(x))
# ID yearobs
# 1 birch 3
# 2 oak 4
# 3 pine 2
An option is to group by 'ID' and get the difference between the max and min of the 'yearobs' column
library(dplyr)
df %>%
group_by(ID) %>%
summarise(age = max(yearobs) - min(yearobs))
Also, if we need to do this fast, then data.table would be another option
library(data.table)
setDT(df)[, .(age = max(yearobs) - min(yearobs)), by = ID]
Or using base R
by(df['yearobs'], df$ID, FUN = function(x) max(x)- min(x))
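For completeness, diff(range()) expresses the same max-minus-min difference in a single call (a small sketch, not one of the posted answers):
library(dplyr)
df %>%
  group_by(ID) %>%
  summarise(age = diff(range(yearobs)))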

R - dplyr - How to mutate rows or divisions between rows

I found that dplyr is speedy and simple for aggregating and summarising data. But I can't find out how to solve the following problem with dplyr.
Given these data frames:
df_2017 <- data.frame(
expand.grid(1:195,1:65,1:39),
value = sample(1:1000000,(195*65*39)),
period = rep("2017",(195*65*39)),
stringsAsFactors = F
)
df_2017 <- df_2017[sample(1:(195*65*39),450000),]
names(df_2017) <- c("company", "product", "acc_concept", "value", "period")
df_2017$company <- as.character(df_2017$company)
df_2017$product <- as.character(df_2017$product)
df_2017$acc_concept <- as.character(df_2017$acc_concept)
df_2017$value <- as.numeric(df_2017$value)
ratio_df <- data.frame(concept=c("numerator","numerator","numerator","denom", "denom", "denom","name"),
ratio1=c("1","","","4","","","Sales over Assets"),
ratio2=c("1","","","5","6","","Sales over Expenses A + B"), stringsAsFactors = F)
where the columns in df_2017 are:
company = This is a categorical variable with companies from 1 to 195
product = This is a categorical variable, with home appliance products from 1 to 65. For example, 1 could be equal to irons, 2 to televisions, etc.
acc_concept = This is a categorical variable with accounting concepts from 1 to 39. For example, 1 would be equal to "Sales", 2 to "Total Expenses", 3 to "Returns", 4 to "Assets", etc.
value = This is a numeric variable, with USD values from 1 to 100,000,000
period = Categorical variable. Always 2017
As the expand.grid implies, the combinations of company - product - acc_concept are never duplicated, but it can happen that a given company does not have every company - product - acc_concept combination. That's why the line df_2017 <- df_2017[sample(1:(195*65*39),450000),] is there, and that's why the output can turn out to contain NA (see below).
And where the columns in ratio_df are:
concept = which acc_concept corresponds to the numerator, which to the denominator, and which row holds the name of the ratio
ratio1 = acc_concept and name for ratio1
ratio2 = acc_concept and name for ratio2
I want to calculate 2 ratios (ratio_df) between acc_concept, for each product within each company.
For example:
I take the first ratio "acc_concepts" and "name" from ratio_df:
num_acc_concept <- ratio_df[ratio_df$concept == "numerator", 2]
denom_acc_concept <- ratio_df[ratio_df$concept == "denom", 2]
ratio_name <- ratio_df[ratio_df$concept == "name", 2]
Then I calculate the ratio for one product of one company, just to show what I want to do:
ratio1_value <- sum(df_2017[df_2017$company == 1 & df_2017$product == 1 & df_2017$acc_concept %in% num_acc_concept, 4]) / sum(df_2017[df_2017$company == 1 & df_2017$product == 1 & df_2017$acc_concept %in% denom_acc_concept, 4])
Output:
output <- data.frame(Company="1", Product="1", desc_ratio=ratio_name, ratio_value = ratio1_value, stringsAsFactors = F)
As I said before, I want to do this for each product within each company.
The output data.frame could be something like this (the ratios aren't the true ones because I haven't done the calculations yet):
company product desc_ratio ratio_value
1 1 Sales over Assets 0.9303675
1 2 Sales over Assets 1.30
1 3 Sales over Assets NaN
1 4 Sales over Assets Inf
1 5 Sales over Assets 2.32
1 6 Sales over Assets NA
.
.
.
1 1 Sales over Expenses A + B 3.25
.
.
.
2 1 Sales over Assets 0.256
and so on...
NaN when ratio is 0 / 0
Inf when ratio is number / 0
NA when there is no data for certain company and product.
I hope I have made myself clear this time :)
Is there any way to solve this row problem with dplyr? Should I cast the df_2017 for mutating? In this case, which is the best way for casting?
Any help would be welcome!
This is one way of doing it. At the end I timed the code on all of your records.
First, create a function that computes all the ratios for one subset of the data. Note that it is designed to be called on one company/product group at a time (here, from inside the dplyr code below).
ratio <- function(data){
  result <- data.frame(desc_ratio = rep(NA, ncol(ratio_df) - 1), ratio_value = rep(NA, ncol(ratio_df) - 1))
  for(i in 2:ncol(ratio_df)){
    num <- ratio_df[ratio_df$concept == "numerator", i]
    denom <- ratio_df[ratio_df$concept == "denom", i]
    result$desc_ratio[i-1] <- ratio_df[ratio_df$concept == "name", i]
    result$ratio_value[i-1] <- sum(ifelse(data$acc_concept %in% num, data$value, 0)) / sum(ifelse(data$acc_concept %in% denom, data$value, 0))
  }
  return(result)
}
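As a quick sanity check (an illustrative call, not part of the original answer), the function can be applied to a single company/product slice, which should reproduce the manual calculation shown in the question:
# assumes df_2017 and ratio_df as defined above
ratio(df_2017[df_2017$company == "1" & df_2017$product == "1", ])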
Using dplyr, tidyr and purrr to put everything together: first group the data, nest the columns needed by the function, and run the function with mutate on the nested data. Then drop the no-longer-needed nested column and unnest to get the wanted output. I leave the sorting up to you.
library(dplyr)
library(purrr)
library(tidyr)
output <- df_2017 %>%
  group_by(company, product, period) %>%
  nest() %>%
  mutate(ratios = map(data, ratio)) %>%
  select(-data) %>%
  unnest()
output
# A tibble: 25,350 x 5
company product period desc_ratio ratio_value
<chr> <chr> <chr> <chr> <dbl>
1 103 2 2017 Sales over Assets 0.733
2 103 2 2017 Sales over Expenses A + B 0.219
3 26 26 2017 Sales over Assets 0.954
4 26 26 2017 Sales over Expenses A + B 1.01
5 85 59 2017 Sales over Assets 4.14
6 85 59 2017 Sales over Expenses A + B 1.83
7 186 38 2017 Sales over Assets 7.85
8 186 38 2017 Sales over Expenses A + B 0.722
9 51 25 2017 Sales over Assets 2.34
10 51 25 2017 Sales over Expenses A + B 0.627
# ... with 25,340 more rows
The time it took to run this code on my machine, measured with system.time:
user system elapsed
6.75 0.00 6.81
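If you want the rows ordered as in the question's example output, a small optional step (not part of the original answer) is:
output %>%
  arrange(as.numeric(company), as.numeric(product), desc_ratio)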

How do I melt or reshape binned data in R? [duplicate]

This question already has answers here:
Count number of rows within each group
(17 answers)
Closed 7 years ago.
I have binned data reflecting the width of rivers across each continent. Below is a sample dataset. I pretty much just want to get the data into the form I have shown.
dat <- read.table(text =
"width continent bin
5.32 Africa 10
6.38 Africa 10
10.80 Asia 20
9.45 Africa 10
22.66 Africa 30
9.45 Asia 10",header = TRUE)
How do I melt the above toy dataset to create this dataframe?
Bin Count Continent
10 3 Africa
10 1 Asia
20 1 Asia
30 1 Africa
We could use any of several grouped-aggregation approaches.
The data.table option is to convert the 'data.frame' to a 'data.table' (setDT(dat)); then, grouping by the 'continent' and 'bin' variables, we get the number of elements per group (.N).
library(data.table)
setDT(dat)[,list(Count=.N) ,.(continent,bin)]
# continent bin Count
#1: Africa 10 3
#2: Asia 20 1
#3: Africa 30 1
#4: Asia 10 1
Or a similar option with dplyr: group by the same variables and then use n() instead of .N to get the count.
library(dplyr)
dat %>%
group_by(continent, bin) %>%
summarise(Count=n())
Or we can use aggregate from base R: with the formula method, we get the length (count) per group.
aggregate(cbind(Count=width)~., dat, FUN=length)
# continent bin Count
#1 Africa 10 3
#2 Asia 10 1
#3 Asia 20 1
#4 Africa 30 1
From #Frank's and #David Arenburg's comments, some additional options using data.table and dplyr. We convert the dataset to a data.table (setDT(dat)), convert it to 'wide' format with dcast, then reconvert it back to 'long' using melt, and subset the rows (value > 0).
library(data.table)
melt(dcast(setDT(dat),continent~bin))[value>0]
Using count from dplyr
library(dplyr)
count(dat, bin, continent)
With sqldf:
library(sqldf)
sqldf("SELECT bin, continent, COUNT(continent) AS count
FROM dat
GROUP BY bin, continent")
Output:
bin continent count
1 10 Africa 3
2 10 Asia 1
3 20 Asia 1
4 30 Africa 1
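Another base R route (a sketch, not one of the posted answers) is to tabulate and then drop the empty combinations:
counts <- as.data.frame(table(bin = dat$bin, continent = dat$continent))
subset(counts, Freq > 0)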

How to calculate time-weighted average and create lags

I have searched the forum but found nothing that answers or hints at how to do what I need.
I have yearly measurements of exposure data from which I wish to calculate an individual-level annual average based on each individual's entry into the study. For each row, the one-year exposure assignment should include data from the preceding 12 months, starting from the last month before joining the study.
As an example, the first person in the sample data joined the study on Feb 7, 2002. His exposure will include a contribution from January 2002 (annual average is 18) and from February to December 2001 (annual average is 19). The time-weighted average for this person would be (1/12*18) + (11/12*19). The two-year average exposure for the same person would extend back from January 2002 to February 2000.
Similarly, the last person, who joined the study in December 2004, will include contributions from 11 months in 2004 and one month in 2003; his annual average exposure will be (11/12*5), derived from 2004, plus (1/12*6), which comes from the annual average of 2003.
How can I calculate the 1-, 2- and 5-year average exposure going back from the date of entry into the study? How can I use lags in the manner that I have described?
Sample data is accessed from this link
https://drive.google.com/file/d/0B_4NdfcEvU7La1ZCd2EtbEdaeGs/view?usp=sharing
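To make the weighting concrete, here is a minimal sketch of the one-year calculation described above (the function name and arguments are illustrative, not taken from the linked data set):
# entry_month: calendar month of entry (1-12); the entry month itself is excluded,
# so the entry year contributes (entry_month - 1) months and the previous year the rest
one_year_avg <- function(entry_month, avg_entry_year, avg_prev_year) {
  m <- entry_month - 1
  (m / 12) * avg_entry_year + ((12 - m) / 12) * avg_prev_year
}
one_year_avg(2, 18, 19) # Feb 2002 entry: (1/12)*18 + (11/12)*19 = 18.92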
This is not an elegant answer, but I would like to leave what I tried. I first arranged the data frame. I wanted to identify which year is the key year for each subject, so I created id. variable comes from the column names (e.g., pol_2000) in your original data set, entryYear and entryMonth come from entry in your data, and check identifies which year is the base year for each participant. In the next step, I extracted six rows for each participant using getMyRows from the SOfun package. Then I used lapply and did the math as you described in your question; for the two- and five-year averages, I divided the total values by the number of years (2 or 5). I was not sure what the final output should look like, so I decided to use the base year for each subject and add three columns to it.
library(reshape2) # melt() is used below
library(stringi)
library(SOfun)
devtools::install_github("hadley/tidyr")
library(tidyr)
library(dplyr)
### Big thanks to BondedDust for this function
### http://stackoverflow.com/questions/6987478/convert-a-month-abbreviation-to-a-numeric-month-in-r
mo2Num <- function(x) match(tolower(x), tolower(month.abb))
### Arrange the data frame.
ana <- foo %>%
mutate(id = 1:n()) %>%
melt(id.vars = c("id","entry")) %>%
arrange(id) %>%
mutate(variable = as.numeric(gsub("^.*_", "", variable)),
entryYear = as.numeric(stri_extract_last(entry, regex = "\\d+")),
entryMonth = mo2Num(substr(entry, 3,5)) - 1,
check = ifelse(variable == entryYear, "Y", "N"))
### Find a base year for each subject and get some parts of data for each participant.
indx <- which(ana$check == "Y")
bob <- getMyRows(ana, pattern = indx, -5:0)
### Get one-year average
cathy <- lapply(bob, function(x){
x$one <- ((x[6,6] / 12) * x[6,4]) + (((12-x[5,6])/12) * x[5,4])
x
})
one <- unnest(lapply(cathy, `[`, i = 6, j = 8))
### Get two-year average
cathy <- lapply(bob, function(x){
x$two <- (((x[6,6] / 12) * x[6,4]) + x[5,4] + (((12-x[4,6])/12) * x[4,4])) / 2
x
})
two <- unnest(lapply(cathy, `[`, i = 6, j =8))
### Get five-year average
cathy <- lapply(bob, function(x){
x$five <- (((x[6,6] / 12) * x[6,4]) + x[5,4] + x[4,4] + x[3,4] + x[2,4] + (((12-x[2,6])/12) * x[1,4])) / 5
x
})
five <- unnest(lapply(cathy, `[`, i =6 , j =8))
### Combine the results with the key observations
final <- cbind(ana[which(ana$check == "Y"),], one, two, five)
colnames(final) <- c(names(ana), "one", "two", "five")
# id entry variable value entryYear entryMonth check one two five
#6 1 07feb2002 2002 18 2002 1 Y 18.916667 18.500000 18.766667
#14 2 06jun2002 2002 16 2002 5 Y 16.583333 16.791667 17.150000
#23 3 16apr2003 2003 14 2003 3 Y 15.500000 15.750000 16.050000
#31 4 26may2003 2003 16 2003 4 Y 16.666667 17.166667 17.400000
#39 5 11jun2003 2003 13 2003 5 Y 13.583333 14.083333 14.233333
#48 6 20feb2004 2004 3 2004 1 Y 3.000000 3.458333 3.783333
#56 7 25jul2004 2004 2 2004 6 Y 2.000000 2.250000 2.700000
#64 8 19aug2004 2004 4 2004 7 Y 4.000000 4.208333 4.683333
#72 9 19dec2004 2004 5 2004 11 Y 5.083333 5.458333 4.800000
