Conditional statement that modifies two variables - r

I have the following data frame:
dat<-data.frame(site=c(rep("A", 3), rep("B", 3)),
landuse= rep(c("urban", "dev", "undev"),2),
percent= c(30,30,40, 50, 30, 20))
For each site, I want to keep the rows where percent is greater than or equal to 50. For sites where no landuse category meets this threshold, I want a single row with landuse set to "mixed" and percent set to 100.
The result data frame would look like this:
result<- data.frame(site= c("A", "B"), landuse=c("mixed", "urban"), percent= c(100, 50))

With dplyr you can try :
library(dplyr)
dat %>%
group_by(site) %>%
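# Test the original percent column for both outputs, so that percent is taken
# from the same row as landuse rather than from the first row of the group.
summarise(landuse = if (all(percent < 50)) "mixed" else landuse[percent >= 50],
percent = if (all(percent < 50)) 100 else percent[percent >= 50])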
# site landuse percent
# <chr> <chr> <dbl>
#1 A mixed 100
#2 B urban 50

Does this work:
library(dplyr)
dat %>% group_by(site) %>% mutate(landuse = case_when(all(percent < 50) ~ 'mixed', TRUE ~ landuse),
percent= case_when(landuse == 'mixed' ~ 100, TRUE ~ percent)) %>%
filter(percent >= 50) %>% distinct()
# A tibble: 2 x 3
# Groups: site [2]
site landuse percent
<chr> <chr> <dbl>
1 A mixed 100
2 B urban 50
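For comparison, the same rule can be sketched in base R. This is only an illustration under stated assumptions: the helper name dominant_landuse and its threshold argument are hypothetical and do not come from either answer.

```r
dat <- data.frame(site = c(rep("A", 3), rep("B", 3)),
                  landuse = rep(c("urban", "dev", "undev"), 2),
                  percent = c(30, 30, 40, 50, 30, 20),
                  stringsAsFactors = FALSE)

# Hypothetical helper: keep the rows of one site that meet the threshold,
# or return a single "mixed" row when none of them do.
dominant_landuse <- function(d, threshold = 50) {
  keep <- d$percent >= threshold
  if (!any(keep)) {
    data.frame(site = d$site[1], landuse = "mixed", percent = 100,
               stringsAsFactors = FALSE)
  } else {
    d[keep, , drop = FALSE]
  }
}

# Apply the helper to each site and stack the pieces back together.
result <- do.call(rbind, lapply(split(dat, dat$site), dominant_landuse))
rownames(result) <- NULL
result
```

Unlike a one-row-per-group summary, this keeps every qualifying row if a site has more than one category at or above the threshold.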

Related

How to easily generate/simulate example data with different groups for modelling

How to easily generate/simulate meaningful example data for modelling: e.g. telling that give me n rows of data, for 2 groups, their sex distributions and mean age should differ by X and Y units, respectively? Is there a simple way for doing it automatically? Any packages?
For example, what would be the simplest way for generating such data?
groups: two groups: A, B
sex: different sex distributions: A 30%, B 70%
age: different mean ages: A 50, B 70
PS! Tidyverse solutions are especially welcome.
My best try so far is still quite a lot of code:
n=100
d = bind_rows(
#group A females
tibble(group = rep("A"),
sex = rep("Female"),
age = rnorm(n*0.4, 50, 4)),
#group B females
tibble(group = rep("B"),
sex = rep("Female"),
age = rnorm(n*0.3, 45, 4)),
#group A males
tibble(group = rep("A"),
sex = rep("Male"),
age = rnorm(n*0.20, 60, 6)),
#group B males
tibble(group = rep("B"),
sex = rep("Male"),
age = rnorm(n*0.10, 55, 4)))
d %>% group_by(group, sex) %>%
summarise(n = n(),
mean_age = mean(age))
There are lots of ways to sample from vectors and to draw from random distributions in R. For example, the data set you requested could be created like this:
set.seed(69) # Makes samples reproducible
df <- data.frame(groups = rep(c("A", "B"), each = 100),
sex = c(sample(c("M", "F"), 100, TRUE, prob = c(0.3, 0.7)),
sample(c("M", "F"), 100, TRUE, prob = c(0.5, 0.5))),
age = c(runif(100, 25, 75), runif(100, 50, 90)))
And we can use the tidyverse to show it does what was expected:
library(dplyr)
df %>%
group_by(groups) %>%
summarise(age = mean(age),
percent_male = length(which(sex == "M")))
#> # A tibble: 2 x 3
#> groups age percent_male
#> <chr> <dbl> <int>
#> 1 A 49.4 29
#> 2 B 71.0 50
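If the group specifications change often, one option is to keep them in a small table and generate each group from it. The sketch below uses base R only. The helper simulate_group, the spec table, the reading of "30% / 70%" as the share of males, and the age standard deviation of 5 are all assumptions made for illustration.

```r
# Hypothetical helper: simulate n people for one group.
simulate_group <- function(group, n, p_male, mean_age, sd_age = 5) {
  data.frame(group = group,
             sex = sample(c("M", "F"), n, replace = TRUE,
                          prob = c(p_male, 1 - p_male)),
             age = rnorm(n, mean = mean_age, sd = sd_age),
             stringsAsFactors = FALSE)
}

set.seed(1)  # makes the draws reproducible

# One row per group: size, assumed share of males, and target mean age.
spec <- data.frame(group = c("A", "B"),
                   n = c(100, 100),
                   p_male = c(0.3, 0.7),
                   mean_age = c(50, 70),
                   stringsAsFactors = FALSE)

sim <- do.call(rbind, Map(simulate_group, spec$group, spec$n,
                          spec$p_male, spec$mean_age))
rownames(sim) <- NULL

# Check the simulated groups against the specification.
sim$male <- sim$sex == "M"
aggregate(cbind(age, male) ~ group, data = sim, FUN = mean)
```

Adding a third group then only means adding a row to spec.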

finding outliers and counting number of occurrence

I have a data frame as below
raw_data <- data.frame(
"id" = c(1, 1, 1, 2, 2),
"salary" = c(10000,15000,20000,40000,50000),
"expenditure" = c(10000,15000,20000,30000,40000))
A salary greater than 15000 should be flagged as an outlier, and an expenditure greater than 10000 should also be flagged as an outlier. The problem is how to count how many times an outlier (for each variable separately) occurred for a specific id. The output should look like the following df
output <- data.frame(
"id"=c(1,1,1,2,2,2,2),
"question_name"=c("expenditure", "salary","expenditure","salary","expenditure","salary","expenditure"),
"values"=c(15000,20000,20000,30000,40000,500000,40000),
"count"=c(1,1,1,1,1,1,1))
Here's a dplyr solution:
raw_data %>%
mutate(salary_flag =
ifelse(salary > 15000, 1, 0),
expenditure_flag = ifelse(expenditure > 10000, 1, 0)) %>%
group_by(id) %>%
mutate(total_outlier = sum(salary_flag) + sum(expenditure_flag))
You are flagging for salary and expenditure, then grouping by id and calculating the sum of all salary_flag and the sum of all expenditure_flag for each id.
id salary expenditure salary_flag expenditure_flag total_outlier
<int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 10000 10000 0 0 0
2 2 15000 15000 0 1 1
3 3 20000 20000 1 1 2
4 4 40000 30000 1 1 2
5 5 50000 40000 1 1 2
If you're only concerned with the total outliers, @MartinGal provided a very nice option:
raw_data %>%
group_by(id) %>%
mutate(total_outlier = sum(salary>15000, expenditure>10000))
Gives us:
id salary expenditure total_outlier
<int> <dbl> <dbl> <int>
1 1 10000 10000 0
2 2 15000 15000 1
3 3 20000 20000 2
4 4 40000 30000 2
5 5 50000 40000 2
edit:
This seems to get the end result that you're looking for:
raw_data %>%
group_by(id) %>%
summarise(count = sum(salary>15000, expenditure>10000),
value = min(salary)) %>%
mutate(title = "salary") %>%
select(id, title, value, count)
Which gives you:
id title value count
<int> <chr> <dbl> <int>
1 1 salary 10000 0
2 2 salary 15000 1
3 3 salary 20000 2
4 4 salary 40000 2
5 5 salary 50000 2
The raw data is:
raw_data <- data.frame("id" = c(1, 1, 1, 2, 2),
"salary" = c(10000,15000,20000,40000,50000),
"expenditure" = c(10000,15000,15000,30000,40000))
And the solution is:
library(dplyr)
library(tidyr)
raw_data <- raw_data %>% filter(salary>15000 | expenditure>10000)
entry_variables <- raw_data %>% select(id,salary,expenditure) %>%
pivot_longer(cols = -id,
names_to = "Question_name", values_to= "Value",
values_drop_na = TRUE) %>%
count(id, Question_name, Value)
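A variant worth considering applies each threshold to its own column after reshaping, so that a value is only counted when it exceeds the threshold of the variable it belongs to. The following is a minimal base R sketch of that idea. The thresholds vector and the intermediate names long, outliers, and counts are illustrative assumptions, and the data are those from the question.

```r
raw_data <- data.frame(id = c(1, 1, 1, 2, 2),
                       salary = c(10000, 15000, 20000, 40000, 50000),
                       expenditure = c(10000, 15000, 20000, 30000, 40000))

# Thresholds from the question, one per column.
thresholds <- c(salary = 15000, expenditure = 10000)

# Stack the columns into long form: one row per (id, question, value).
long <- data.frame(
  id = rep(raw_data$id, times = length(thresholds)),
  question_name = rep(names(thresholds), each = nrow(raw_data)),
  values = unlist(raw_data[names(thresholds)], use.names = FALSE),
  stringsAsFactors = FALSE
)

# Keep only values above the threshold of their own column.
outliers <- long[long$values > thresholds[long$question_name], ]

# Count occurrences of each (id, question, value) combination.
outliers$count <- 1
counts <- aggregate(count ~ id + question_name + values,
                    data = outliers, FUN = sum)
counts <- counts[order(counts$id, counts$question_name, counts$values), ]
counts
```

Further variables can be covered by adding entries to thresholds, provided the names match the column names.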
You can try the following
raw_data <- data.frame("id" = 1:5,
"salary" = c(10000,15000,20000,40000,50000),
"expenditure" = c(10000,15000,20000,30000,40000))
raw_data$SalaryOutlier <- ifelse(
raw_data$salary > 15000, TRUE, FALSE)
raw_data$ExpenditureOutlier <- ifelse(
raw_data$expenditure > 10000, TRUE, FALSE)
You can then use the aggregate function to summarize the data, e.g. for each id by using FUN=sum. This should look like
aggregate(raw_data, by=list(id = raw_data$id), FUN=sum)
This works because TRUE is coerced to 1 (and FALSE to 0) when summed.
I hope this helps.
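The coercion of logical values to numbers can be checked directly. A small illustration using the three salaries of id 1 (the variable names are only for this example):

```r
# Comparing a numeric vector with a threshold gives a logical vector.
flags <- c(10000, 15000, 20000) > 15000

# sum() treats TRUE as 1 and FALSE as 0, so it counts the TRUE values.
n_outliers <- sum(flags)
n_outliers
```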
EDIT
Based on your comment, I guess you are looking for
raw_data <- data.frame("id" = c(1, 1, 1, 2, 2),
"salary" = c(10000,15000,20000,40000,50000),
"expenditure" = c(10000,15000,20000,30000,40000))
raw_data$SalaryOutlier <- ifelse(
raw_data$salary > 15000, TRUE, FALSE)
raw_data$ExpenditureOutlier <- ifelse(
raw_data$expenditure > 10000, TRUE, FALSE)
raw_data_aggregate <- aggregate(raw_data, by=list(id = raw_data$id), FUN=sum)
raw_data_aggregate$count <- raw_data_aggregate$SalaryOutlier + raw_data_aggregate$ExpenditureOutlier
EDIT TWO
If you want to aggregate over two variables, just exchange the above aggregate with
raw_data_aggregate <- aggregate(
SalaryOutlier + ExpenditureOutlier ~ id + salary + expenditure, raw_data, FUN=sum)
EDIT THREE
Based on the comments below, I created the following code
raw_data <- data.frame(
"id" = c(1, 1, 1, 2, 2),
"salary" = c(10000,15000,20000,40000,50000),
"expenditure" = c(10000,15000,20000,30000,40000))
# Identify salary outliers
raw_data$SalaryOutlier <- ifelse(
raw_data$salary > 15000, TRUE, FALSE)
# Identify expenditure outliers
raw_data$ExpenditureOutlier <- ifelse(
raw_data$expenditure > 10000, TRUE, FALSE)
# Aggregate over id + salary
raw_data_aggregate_salary <- aggregate(
SalaryOutlier ~ id + salary, raw_data, FUN=sum)
# Aggregate over id + expenditure
raw_data_aggregate_expenditure <- aggregate(
ExpenditureOutlier ~ id + expenditure, raw_data, FUN=sum)
# Just some renaming to fit with desired output data frame.
raw_data_aggregate_salary$question_name <- "salary"
raw_data_aggregate_expenditure$question_name <- "expenditure"
colnames(raw_data_aggregate_salary)[2] <- "values"
colnames(raw_data_aggregate_expenditure)[2] <- "values"
colnames(raw_data_aggregate_salary)[3] <- "count"
colnames(raw_data_aggregate_expenditure)[3] <- "count"
# Bind result together into one df.
raw_data_aggregate <- rbind(
raw_data_aggregate_salary, raw_data_aggregate_expenditure)
# Only select entries where we actually have a count.
raw_data_aggregate <- subset(
raw_data_aggregate,
raw_data_aggregate$count > 0)
# Order to fit with desired output
raw_data_aggregate <- raw_data_aggregate[ order(raw_data_aggregate$id), ]
In data.table this would look like
library(data.table)
setDT(raw_data) # convert the data frame to a data.table by reference
raw_data[, flag0 := (salary > 15000) + (expenditure > 10000)]
raw_data[, flag := sum(flag0), by = "id"]
Here flag0 is the flag by row (which can later be deleted if you like) and flag would be the final result.
Edit: Seeing your reply to @Matt, you seem to want the totals for salary and expenditure separately. You can do something like
raw_data[, flag_salary := as.integer(salary > 15000)]
raw_data[, flag_expenditure := as.integer(expenditure > 10000)]
raw_data[, flag_salary := sum(flag_salary), by = "id"]
raw_data[, flag_expenditure := sum(flag_expenditure), by = "id"]

More efficient way to perform calculations on multiple (combined) columns by group

What is a more efficient way to perform calculations on multiple combined columns by group?
I have a dataset with Manager Effectiveness & Team Effectiveness components. How can I quickly calculate the number of 5s for each component by gender?
The desired outcome is like so:
Number of 5s for 'Manager effectiveness' = 2
Number of 5s for 'Team effectiveness' = 0
So far, I've tried the dplyr method:
Data %>%
group_by(gender) %>%
summarise(sum(c(`Manager EQ`, `Manager IQ`) == 5))
Data %>%
group_by(gender) %>%
summarise(sum(c(`Team collaboration`, `Team friendliness`) == 5))
Though it works, typing each column name quickly becomes tedious and error-prone as more columns are involved.
We can use summarise_at
library(dplyr)
Data %>%
group_by(gender) %>%
summarise_at(vars(starts_with('Manager')), ~ sum(. == 5))
Or if we are checking the sum of all numeric columns, use summarise_if
Data %>%
group_by(gender) %>%
summarise_if(is.numeric, ~ sum(. == 5))
This can be wrapped in a function
f1 <- function(dat, colPrefix, grp, val) {
dat %>%
group_by_at(grp) %>%
summarise_at(vars(starts_with(colPrefix)), ~ sum(. == val))
}
f1(Data, "Manager", "gender", 5)
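In dplyr 1.0.0 and later, the scoped verbs such as summarise_at are superseded by across(). A minimal sketch of the same count follows. The data here are hypothetical, since the question's data set is not shown, and only the column-name style is taken from the question.

```r
library(dplyr)

# Hypothetical data; the question's data set is not shown.
Data <- tibble(
  gender = c("F", "F", "M", "M"),
  `Manager EQ` = c(5, 3, 5, 4),
  `Manager IQ` = c(2, 5, 1, 5),
  `Team collaboration` = c(4, 4, 5, 3)
)

# Count the 5s in every column whose name starts with "Manager", by gender.
fives <- Data %>%
  group_by(gender) %>%
  summarise(across(starts_with("Manager"), ~ sum(.x == 5)))
fives
```

Swapping starts_with("Manager") for where(is.numeric) would play the role of summarise_if.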
Mostly expanding on @akrun's answer:
library(tidyverse)
## Made-up data, 100 observations
set.seed(133)
dat <- 1:5
gen <- c("M", "F")
z <- tibble(me = sample(dat, 100, TRUE),
mi = sample(dat, 100, TRUE),
tc = sample(dat, 100, TRUE),
tf = sample(dat, 100, TRUE),
gender = sample(gen, 100, TRUE))
# Grouping by gender, counting 5's, and reshaping data
z %>%
group_by(gender) %>%
summarise_at(vars(everything()), ~ sum(. == 5)) %>%
pivot_longer(me:tf) %>%
mutate(name = paste0("# 5's for ", name)) %>%
pivot_wider(gender)
Output:
# A tibble: 2 x 5
gender `# 5's for me` `# 5's for mi` `# 5's for tc` `# 5's for tf`
<chr> <int> <int> <int> <int>
1 F 6 6 8 5
2 M 10 14 20 5
This is starting to get a little hacky, but in response to Amanda's comment and my misunderstanding of the question:
z %>%
group_by(gender) %>%
summarise_at(vars(everything()), ~ sum(. == 5)) %>%
pivot_longer(me:tf) %>%
mutate(name = paste0("# 5's for ", name)) %>%
mutate(grp = ifelse(str_detect(name, 'm'), 'manager', 'team')) %>%
group_by(gender, grp) %>%
summarise(total_5s = sum(value))
Gives results:
# A tibble: 4 x 3
# Groups: gender [2]
gender grp total_5s
<chr> <chr> <int>
1 F manager 12
2 F team 13
3 M manager 24
4 M team 25
Unfortunately this relies heavily on making a distinction and group based on the column names of the original data.
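If the columns of each component share a prefix, the dependence on individual column names can be reduced to a dependence on that prefix. The sketch below uses base R. The column names (manager_eq, team_collab, and so on) and the helper count_by_prefix are hypothetical, chosen only to illustrate the idea.

```r
set.seed(133)
n <- 100

# Made-up scores with a shared prefix per component (hypothetical names).
scores <- data.frame(
  gender = sample(c("M", "F"), n, TRUE),
  manager_eq = sample(1:5, n, TRUE),
  manager_iq = sample(1:5, n, TRUE),
  team_collab = sample(1:5, n, TRUE),
  team_friendly = sample(1:5, n, TRUE),
  stringsAsFactors = FALSE
)

# Hypothetical helper: total number of cells equal to val, per group,
# across all columns whose name starts with prefix.
count_by_prefix <- function(dat, prefix, grp, val = 5) {
  cols <- names(dat)[startsWith(names(dat), prefix)]
  hits_per_row <- rowSums(dat[cols] == val)
  tapply(hits_per_row, dat[[grp]], sum)
}

manager_5s <- count_by_prefix(scores, "manager_", "gender")
team_5s <- count_by_prefix(scores, "team_", "gender")
manager_5s
team_5s
```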

How to set different set.seed() per group and then sample()

I would like to sample a number between the min and max columns of a data.frame after grouping, with every group using a different seed. I've tried a few approaches, which you can see in the reproducible example below, but none of them work.
The data.frame consists of four columns:
letters - my grouping variable
seed - an integer that is dynamic and group/letter specific
min - minimum value for the sample()
max - maximum value for the sample()
Here is a reproducible example:
set.seed(123)
data.frame(letter = sample(letters[1:3],20, replace=TRUE)) %>%
group_by(letter) %>%
summarise(seed = n()) %>%
mutate(min = ifelse(letter == "a", 20,
ifelse(letter == "b", 40, 60)),
max = ifelse(letter == "a", 30,
ifelse(letter == "b", 50, 70))) %>%
group_by(letter) %>%
# set.seed(seed) %>% # or mutate(randomNumber = sample(min:max, 1, set.seed(seed))) # these aren't working, but I hope you get my point
mutate(randomNumber = sample(min:max, 1))
Many thanks in advance!
I would suggest using pmap from the purrr package in your last line:
library(tidyverse)
set.seed(123)
data.frame(letter = sample(letters[1:3],20, replace=TRUE)) %>%
group_by(letter) %>%
summarise(seed = n()) %>%
mutate(min = ifelse(letter == "a", 20,
ifelse(letter == "b", 40, 60)),
max = ifelse(letter == "a", 30,
ifelse(letter == "b", 50, 70))) %>%
group_by(letter) %>%
mutate(randomNumber = pmap_dbl(list(min, max, seed), function(x, y, z){set.seed(z); sample(x:y, 1)}))
# A tibble: 3 x 5
# Groups: letter [3]
letter seed min max randomNumber
<fct> <int> <dbl> <dbl> <dbl>
1 a 5 20 30 21
2 b 7 40 50 49
3 c 8 60 70 63
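The same idea can be sketched without purrr by using mapply from base R. The table params below simply mirrors the seed, min, and max values shown in the output above, and draw_seeded is a hypothetical helper. Note that set.seed changes the global random number state, so later draws in the same session are affected too.

```r
params <- data.frame(letter = c("a", "b", "c"),
                     seed = c(5L, 7L, 8L),
                     min = c(20, 40, 60),
                     max = c(30, 50, 70),
                     stringsAsFactors = FALSE)

# Hypothetical helper: seed the generator, then draw one integer from min..max.
draw_seeded <- function(seed, min, max) {
  set.seed(seed)
  candidates <- seq(min, max)
  # Index with sample.int() so that min == max returns that single value;
  # sample(x, 1) with a single number x would draw from 1:x instead.
  candidates[sample.int(length(candidates), 1)]
}

params$randomNumber <- mapply(draw_seeded, params$seed, params$min, params$max)
params
```

Because each row sets its own seed, rerunning the code gives the same number for each letter regardless of row order.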

dplyr summarise() with special functions for edge cases

I would like to make a groupwise summarise() operation in dplyr, but where a different function applies if edge cases are met.
I have count data that looks like this. The concentration and the standard deviation are calculated like this:
library(dplyr)
testdata <- tibble(sample = sort(rep(1:3, 4)),
volume = rep(c(1e-1, 1e-1, 1e-2, 1e-2), 3),
count = c(400, 400, 40, 40, 0, 0, 0, 0, 400, 400, 400, 400))
testdata %>%
group_by(sample) %>%
summarise(concentration = sum(count) / sum(volume),
sd = sqrt(sum(count)))
However, only counts between 25 and 250 are to be included in the calculation, which I could achieve with:
testdata %>%
group_by(sample) %>%
filter((count >= 25) & (count <= 250)) %>%
summarise(concentration = sum(count) / sum(volume),
sd = sqrt(sum(count)))
But then samples 2 & 3 have no concentration.
The edge cases for each group might be calculated with something like:
if (all(count <= 25)){
summarise(concentration = 25 / min(volume),
sd = NA)
}
else if (all(count >= 250)){
summarise(concentration = 250 / max(volume),
sd = NA)
}
Can such edge cases be integrated into the summarise() function?
I would ideally also like a flag to indicate an edge case which returns result = "OK" for all cases except edge cases that return:
if (all(count <= 25)){
summarise(concentration = 25 / min(volume),
sd = NA,
result = "LOW")
}
else if (all(count >= 250)){
summarise(concentration = 250 / max(volume),
sd = NA,
result = "HIGH")
}
One way is to encode your logic within summarise using ifelse:
library(dplyr)
result <- testdata %>% group_by(sample) %>%
summarise(concentration = ifelse(all(count <= 25),
25 / min(volume),
ifelse(all(count >= 250),
250 / max(volume),
sum(count) / sum(volume))),
sd = ifelse(all(count <= 25),
NA,
ifelse(all(count >= 250),
NA,
sqrt(sum(count)))),
result = ifelse(all(count <= 25),
"LOW",
ifelse(all(count >= 250),
"HIGH",
"OK")))
print(result)
### A tibble: 3 x 4
## sample concentration sd result
## <int> <dbl> <dbl> <chr>
##1 1 4000 29.66479 OK
##2 2 2500 NA LOW
##3 3 2500 NA HIGH
Updated approach
Another approach, which is hopefully closer to what the OP asks, is to define a function:
summarise.func <- function(count, volume) {
if (all(count <= 25)) {
concentration <- 25 / min(volume)
sd <- NA
result <- "LOW"
} else if (all(count >= 250)) {
concentration <- 250 / max(volume)
sd <- NA
result <- "HIGH"
} else {
concentration <- sum(count) / sum(volume)
sd <- sqrt(sum(count))
result <- "OK"
}
data.frame(concentration=concentration, sd=sd, result=result, stringsAsFactors=FALSE)
}
This handles both the regular case and the edge cases. The key is that the function returns a data.frame containing the summarised results. summarise then creates a list column containing these data frames, which can be expanded with tidyr::unnest:
library(dplyr)
library(tidyr)
result <- testdata %>% group_by(sample) %>%
summarise(csr=list(summarise.func(count, volume))) %>%
unnest(csr)
print(result)
### A tibble: 3 x 4
## sample concentration sd result
## <int> <dbl> <dbl> <chr>
##1 1 4000 29.66479 OK
##2 2 2500 NA LOW
##3 3 2500 NA HIGH
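The question also asks that only counts between 25 and 250 enter the regular calculation. One possible way to combine that filter with the edge cases is sketched below in base R. The helper summarise_sample is hypothetical, and so is the "UNDEFINED" label for a sample that has no in-range counts but is neither all low nor all high, since the question does not cover that case.

```r
testdata <- data.frame(sample = rep(1:3, each = 4),
                       volume = rep(c(1e-1, 1e-1, 1e-2, 1e-2), 3),
                       count = c(400, 400, 40, 40, 0, 0, 0, 0,
                                 400, 400, 400, 400))

# Hypothetical helper: summarise one sample, using only in-range counts
# when any exist and falling back to the edge-case rules otherwise.
summarise_sample <- function(d, lower = 25, upper = 250) {
  in_range <- d$count >= lower & d$count <= upper
  sd <- NA_real_
  if (any(in_range)) {
    concentration <- sum(d$count[in_range]) / sum(d$volume[in_range])
    sd <- sqrt(sum(d$count[in_range]))
    result <- "OK"
  } else if (all(d$count < lower)) {
    concentration <- lower / min(d$volume)
    result <- "LOW"
  } else if (all(d$count > upper)) {
    concentration <- upper / max(d$volume)
    result <- "HIGH"
  } else {
    concentration <- NA_real_
    result <- "UNDEFINED"  # assumption: not specified in the question
  }
  data.frame(sample = d$sample[1], concentration = concentration,
             sd = sd, result = result, stringsAsFactors = FALSE)
}

out <- do.call(rbind, lapply(split(testdata, testdata$sample), summarise_sample))
rownames(out) <- NULL
out
```

With this variant the standard deviation for sample 1 is based only on the two in-range counts, so it differs from the value in the output above, which uses all four counts.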
