dplyr relative frequency within group - r

(hopefully) simplified
I asked farmers of two farm types (organic and conventional) to report whether each species (A, B) occurs (0/1) on their land.
So, I have
df<-data.frame(id=1:10,
farmtype=c(rep("org",4), rep("conv",6)),
spA=c(0,0,0,1,1,1,1,1,1,1),
spB=c(1,1,1,0,0,0,0,0,0,0)
)
And my question is pretty simple... In what percentage of organic or conventional farms do the species occur?
Expected result:
sp A occurs in 25% of org farms and 100% of conv farms
sp B occurs in 75% of org farms and 0% of conv farms
None of the solutions outlined below achieve that.
**Additional question**
All I want is a simple ggplot with the species on the x-axis and the percentage of detection on the y-axis (once for org and once for conv).
ggplot(df.melt)+
geom_bar(aes(x=species, fill=farmtype))
### but, of course, this should show the species detections, not just the farm types

janitor's tabyl is your friend. What you're calculating is "row"-percentages, but what you want is "col"-percentages. E.g.
library(janitor)
set.seed(1234)
df <- data.frame(farmtype=sample(c("organic","conventional"),100, replace=T),
species=sample(letters[1:4], 100, replace=T),
occ=sample(c("yes","no"),100, replace=T))
df |>
tabyl(species,farmtype) |>
adorn_percentages("col")
# species conventional organic
# a 0.2553191 0.2641509
# b 0.2765957 0.2452830
# c 0.2553191 0.1886792
# d 0.2127660 0.3018868
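To make the row versus column distinction concrete, here is a minimal base R sketch that uses prop.table() rather than janitor. The counts for species a and b are copied from the grouped output shown further down in this answer; the object names (`tab`, `row_pct`, `col_pct`) are only illustrative.

```r
# Counts of species a and b per farm type (copied from the example output)
tab <- matrix(c(12, 13, 14, 13), nrow = 2,
              dimnames = list(species = c("a", "b"),
                              farmtype = c("conventional", "organic")))

# margin = 1: proportions within each species (rows sum to 1)
row_pct <- prop.table(tab, margin = 1)
# margin = 2: proportions within each farm type (columns sum to 1)
col_pct <- prop.table(tab, margin = 2)

rowSums(row_pct)
colSums(col_pct)
```

Which margin is the right one depends on the denominator you want: all farms reporting a species, or all reports from a farm type.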
But you could also use your own approach. Group by farmtype in the second group_by and remember to save the dataframe. This would be easier to use with ggplot2 as it is already in a long format.
library(dplyr)
df <-
df %>%
group_by(species, farmtype) %>%
dplyr::summarise(count = n()) %>%
group_by(farmtype) %>%
dplyr::mutate(prop = count/sum(count))
df
# A tibble: 8 × 4
# Groups: farmtype [2]
# species farmtype count prop
# <chr> <chr> <int> <dbl>
# a conventional 12 0.255
# a organic 14 0.264
# b conventional 13 0.277
# b organic 13 0.245
# c conventional 12 0.255
# c organic 10 0.189
# d conventional 10 0.213
# d organic 16 0.302
library(ggplot2)
df %>%
ggplot(aes(x = species, y = prop, fill = farmtype)) +
geom_col()
Update: A variant of the second option was also suggested by Isaac Bravo.
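For the wide 0/1 data in the original question, one possible sketch is to reshape to long format and take the group mean of the 0/1 column. This assumes dplyr (>= 1.0) and tidyr (>= 1.0) are installed; the names `pct`, `occ` and `percent` are chosen here for illustration and are not from the question.

```r
library(dplyr)
library(tidyr)

# Data from the question: one row per farm, 0/1 per species
df <- data.frame(id = 1:10,
                 farmtype = c(rep("org", 4), rep("conv", 6)),
                 spA = c(0, 0, 0, 1, 1, 1, 1, 1, 1, 1),
                 spB = c(1, 1, 1, 0, 0, 0, 0, 0, 0, 0))

pct <- df %>%
  # long format: one row per farm and species
  pivot_longer(c(spA, spB), names_to = "species", values_to = "occ") %>%
  group_by(farmtype, species) %>%
  # the mean of a 0/1 column is the share of farms where the species occurs
  summarise(percent = 100 * mean(occ), .groups = "drop")

pct
```

Since `pct` is already in long format, something like ggplot(pct, aes(x = species, y = percent, fill = farmtype)) + geom_col(position = "dodge") should give one bar per species and farm type.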

Here you can have another option using your approach:
df %>%
group_by(farmtype, species) %>%
summarize(n = n()) %>%
mutate(percentage = n/sum(n))
OUTPUT:
farmtype species n percentage
<chr> <chr> <int> <dbl>
1 conventional a 12 0.235
2 conventional b 12 0.235
3 conventional c 12 0.235
4 conventional d 15 0.294
5 organic a 16 0.327
6 organic b 9 0.184
7 organic c 14 0.286
8 organic d 10 0.204

If I understand the poster's first question correctly, the poster seeks the proportion of organic versus conventional farm types among farms that grew a given species. This can also be accomplished using the data.table package as follows.
First, the example data set is recreated by setting the seed.
set.seed(1234) ##setting seed for reproducible example
df<-data.frame(farmtype=sample(c("organic","conventional"),100, replace=T),
species=sample(letters[1:4], 100, replace=T),
occ=sample(c("yes","no"),100, replace=T))
library(data.table)
df <- data.table(df)
Next, the "no" answers are filtered out because we are only interested in farms that reported growing the species in the "occ" column. We then count the occurrences of the species for each farm type. The column "N" gives the count.
#Filter out "no" answers because they shouldn't affect the result sought
#and count the number of farmtypes that reported each species
ans = df[occ == "yes",.N,by = .(farmtype,species)]
ans
# farmtype species N
#1: conventional a 8
#2: conventional c 8
#3: organic a 6
#4: conventional d 11
#5: organic d 5
#6: organic c 7
#7: organic b 4
#8: conventional b 6
The total occurrences of each species across both farm types are then counted. As a check for this result, each row for a given species should give the same species total.
#Total number of farms that reported the species
ans[,species_total := sum(N), by = species]
ans
# farmtype species N species_total
#1: conventional a 8 14
#2: conventional c 8 15
#3: organic a 6 14
#4: conventional d 11 16
#5: organic d 5 16
#6: organic c 7 15
#7: organic b 4 10
#8: conventional b 6 10
Finally, the columns are combined to calculate the proportion of organic or conventional farms for each species that was reported. As a check against the result, the proportion of organic and the proportion of conventional for each species should sum to 1 because there are only two farm types.
##Calculate the proportion of each farm type reported for each species
ans[, proportion := N/species_total]
ans
# farmtype species N species_total proportion
#1: conventional a 8 14 0.5714286
#2: conventional c 8 15 0.5333333
#3: organic a 6 14 0.4285714
#4: conventional d 11 16 0.6875000
#5: organic d 5 16 0.3125000
#6: organic c 7 15 0.4666667
#7: organic b 4 10 0.4000000
#8: conventional b 6 10 0.6000000
##Gives the proportion of organic farms specifically
ans[farmtype == "organic"]
# farmtype species N species_total proportion
#1: organic a 6 14 0.4285714
#2: organic d 5 16 0.3125000
#3: organic c 7 15 0.4666667
#4: organic b 4 10 0.4000000
If, on the other hand, you wanted to calculate the share of each species among all species occurrences reported for organic or conventional farms, you could use this code:
ans = df[,.N, by = .(species, farmtype,occ)] ##count by species,farmtype, and occurrence
ans[, spf := sum(N), by = .(occ,farmtype)] ##spf is the total number of times an occurrence was reported for each type
ans[, prop := N/spf]
ans = ans[occ == "yes"] ##proportion of the given species to all species occurrences reported for each farm type
ans
# species farmtype occ N spf prop
#1: a conventional yes 8 33 0.2424242
#2: c conventional yes 8 33 0.2424242
#3: a organic yes 6 22 0.2727273
#4: d conventional yes 11 33 0.3333333
#5: d organic yes 5 22 0.2272727
#6: c organic yes 7 22 0.3181818
#7: b organic yes 4 22 0.1818182
#8: b conventional yes 6 33 0.1818182
This result means that, for example, conventional farmers reported species "a" about 24.2% of the times that they reported any species. The result can be verified by selecting a species and farmtype and calculating manually as a spot check.
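The manual spot check mentioned above could look roughly like this in base R. It is only a sketch: the names `yes`, `manual` and `tab` are illustrative, and the exact numbers depend on the R version used for sampling, so no expected value is shown.

```r
set.seed(1234)
df <- data.frame(farmtype = sample(c("organic", "conventional"), 100, replace = TRUE),
                 species = sample(letters[1:4], 100, replace = TRUE),
                 occ = sample(c("yes", "no"), 100, replace = TRUE))

# Keep only the reported occurrences
yes <- df[df$occ == "yes", ]

# Manual calculation for one cell: species "a" on conventional farms
manual <- sum(yes$species == "a" & yes$farmtype == "conventional") /
  sum(yes$farmtype == "conventional")

# The same proportions for all cells at once (columns sum to 1)
tab <- prop.table(table(yes$species, yes$farmtype), margin = 2)

manual
tab["a", "conventional"]
```

If the two printed values agree, the `prop` column is being computed against the intended denominator.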

Related

How to remove rows that contain duplicate characters in R

I want to remove an entire row if the values in two columns are duplicates of each other. Any quick help in doing so in R (for a very large dataset) would be highly appreciated. For example:
mydf <- data.frame(p1=c('a','c','a','b','d','b','c','c','e'),
p2=c('b','c','d','c','d','b','d','e','e'),
value=c(10,20,10,11,12,13,14,15,16))
This gives:
mydf
p1 p2 value
1 a b 10
2 c c 20
3 a d 10
4 b c 11
5 d d 12
6 b b 13
7 c d 14
8 c e 15
9 e e 16
I want to get:
p1 p2 value
1 a b 10
2 a d 10
3 b c 11
4 c d 14
5 c e 15
Your note in the comments suggests your actual problem is more complex. There's some preprocessing you could do to your strings before you compare p1 to p2. You will have the domain expertise to know what steps are appropriate, but here's a first start. I remove all spaces and punctuation from p1 and p2. I then convert them all to uppercase before testing for equality. You can modify the clean_str function to include more / different cleaning operations.
Additionally, you may consider approximate matching to address typos / colloquial naming conventions. Package stringdist is a good place to start.
mydf <- data.frame(p1=c('New York','New York','New York','TokYo','LosAngeles','MEMPHIS','memphis','ChIcAGo','Cleveland'),
p2=c('new York','New.York','MEMPHIS','Chicago','knoxville','tokyo','LosAngeles','Chicago','CLEVELAND'),
value=c(10,20,10,11,12,13,14,15,16),
stringsAsFactors = FALSE)
mydf[mydf$p1 != mydf$p2,]
#> p1 p2 value
#> 1 New York new York 10
#> 2 New York New.York 20
#> 3 New York MEMPHIS 10
#> 4 TokYo Chicago 11
#> 5 LosAngeles knoxville 12
#> 6 MEMPHIS tokyo 13
#> 7 memphis LosAngeles 14
#> 8 ChIcAGo Chicago 15
#> 9 Cleveland CLEVELAND 16
clean_str <- function(col){
# removes all punctuation and blanks, then converts to uppercase
d <- gsub("[[:punct:][:blank:]]+", "", col)
d <- toupper(d)
return(d)
}
mydf$p1 <- clean_str(mydf$p1)
mydf$p2 <- clean_str(mydf$p2)
mydf[mydf$p1 != mydf$p2,]
#> p1 p2 value
#> 3 NEWYORK MEMPHIS 10
#> 4 TOKYO CHICAGO 11
#> 5 LOSANGELES KNOXVILLE 12
#> 6 MEMPHIS TOKYO 13
#> 7 MEMPHIS LOSANGELES 14
Created on 2020-05-03 by the reprex package (v0.3.0)
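For the approximate matching idea, here is a small sketch that uses base R's adist() (generalized Levenshtein distance) instead of the stringdist package. The three rows, including the "NEWYROK" typo, and the tolerance `max_typos` are made up for illustration.

```r
# Hypothetical cleaned data with one typo in p2
mydf <- data.frame(p1 = c("NEWYORK", "TOKYO", "MEMPHIS"),
                   p2 = c("NEWYROK", "CHICAGO", "MEMPHIS"),
                   value = c(10, 11, 12),
                   stringsAsFactors = FALSE)

# Edit distance between p1 and p2, computed row by row
d <- mapply(function(a, b) adist(a, b)[1, 1],
            mydf$p1, mydf$p2, USE.NAMES = FALSE)

# Treat pairs within this many edits as "the same" (an assumed tolerance)
max_typos <- 2
kept <- mydf[d > max_typos, ]
kept
```

A fixed edit-distance threshold is crude for strings of very different lengths, so the tolerance would need tuning on real data.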
There are several ways to do that. Among them:
Base R
mydf[mydf$p1 != mydf$p2, ]
dplyr
library(dplyr)
mydf %>% filter(p1 != p2)
data.table
library(data.table)
setDT(mydf)
mydf[p1 != p2]
Here's a two-step solution based on @Chase's data:
First step (as suggested by @Chase) - preprocess your data in p1 and p2 to make them comparable:
# set to lower-case:
mydf[,c("p1", "p2")] <- lapply(mydf[,c("p1", "p2")], tolower)
# remove anything that's not alphanumeric between words:
mydf[,c("p1", "p2")] <- lapply(mydf[,c("p1", "p2")], function(x) gsub("(\\w+)\\W(\\w+)", "\\1\\2", x))
Second step - (i) using apply, paste the rows together, (ii) use grepl and backreference \\1 to look out for immediately adjacent duplicates in these rows, and (iii) drop those rows with a negated logical index (unlike -which(), this still returns all rows when nothing matches):
mydf[!grepl("\\b(\\w+)\\s+\\1\\b", apply(mydf, 1, paste0, collapse = " ")), ]
p1 p2 value
3 newyork memphis 10
4 tokyo chicago 11
5 losangeles knoxville 12
6 memphis tokyo 13
7 memphis losangeles 14

Create Dataframe w/All Combinations of 2 Categorical Columns then Sum 3rd Column by Each Combination

I have a large messy dataset but want to accomplish a straightforward thing. Essentially I want to fill a tibble based on every combination of two columns and sum a third column.
As a hypothetical example, say each observation has the company_name (Wendys, BK, McDonalds), the food_option (burgers, fries, frosty), and the total_spending (in $). I would like to make a 9x3 tibble with the company, food, and total as a sum of every observation. Here's my code so far:
df_table <- df %>%
group_by(company_name, food_option) %>%
summarize(total= sum(total_spending))
company_name food_option total
<chr> <chr> <dbl>
1 Wendys Burgers 757
2 Wendys Fries 140
3 Wendys Frosty 98
4 McDonalds Burgers 1044
5 McDonalds Fries 148
6 BK Burgers 669
7 BK Fries 38
The problem is that McDonalds and BK have zero observations with "Frosty" as the food_option. Consequently, I get a partial table. I'd like to fill that with rows that show:
8 McDonalds Frosty 0
9 BK Frosty 0
I know I can add the rows manually, but the actual dataset has over a hundred combinations so it will be tedious and complicated. Also, I'm constantly modifying the upstream data and I want the code to automatically fill correctly.
Thank you SO MUCH to anyone who can help. This forum has really been a godsend, really appreciate all of you.
Try:
library(dplyr)
df %>%
mutate(food_option = factor(food_option, levels = unique(food_option))) %>%
group_by(company_name, food_option, .drop = FALSE) %>%
summarise(total = sum(total_spending))
Newer versions of dplyr have a .drop argument to group_by where if you've got a factor with pre-defined levels they will not be dropped (and you'll get the zeros).
You can use tidyr::expand_grid():
tidyr::expand_grid(company_name = c("Wendys", "McDonalds", "BK"),
food_option = c("Burgers", "Fries", "Frosty"))
to create all possible variations
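The grid on its own does not contain the totals yet. One way to finish the idea, sketched here under the assumption that dplyr and tidyr are installed, is to join the summarised table onto the grid and replace the resulting NA values with 0. `df_table` mirrors the partial output from the question; `full_table` is a name introduced here.

```r
library(dplyr)
library(tidyr)

# Summarised totals as in the question (two combinations are missing)
df_table <- data.frame(
  company_name = c("Wendys", "Wendys", "Wendys", "McDonalds", "McDonalds", "BK", "BK"),
  food_option  = c("Burgers", "Fries", "Frosty", "Burgers", "Fries", "Burgers", "Fries"),
  total        = c(757, 140, 98, 1044, 148, 669, 38),
  stringsAsFactors = FALSE
)

full_table <- expand_grid(company_name = unique(df_table$company_name),
                          food_option  = unique(df_table$food_option)) %>%
  # combinations absent from df_table get NA in total
  left_join(df_table, by = c("company_name", "food_option")) %>%
  # turn those NAs into explicit zeros
  mutate(total = coalesce(total, 0))

full_table
```

Taking the levels from unique() rather than typing them out means the grid should follow changes in the upstream data.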
library(tidyverse)
# example data
df = read.table(text = "
company_name food_option total
1 Wendys Burgers 757
2 Wendys Fries 140
3 Wendys Frosty 98
4 McDonalds Burgers 1044
5 McDonalds Fries 148
6 BK Burgers 669
7 BK Fries 38
", header=T)
df %>% complete(company_name, food_option, fill=list(total = 0))
# # A tibble: 9 x 3
# company_name food_option total
# <fct> <fct> <dbl>
# 1 BK Burgers 669
# 2 BK Fries 38
# 3 BK Frosty 0
# 4 McDonalds Burgers 1044
# 5 McDonalds Fries 148
# 6 McDonalds Frosty 0
# 7 Wendys Burgers 757
# 8 Wendys Fries 140
# 9 Wendys Frosty 98

How can I alter the values of certain rows in a column, based on a condition from another column in a dataframe, using the ifelse function?

So I have this first dataframe (fish18) which consists of data on fish specimens, and a column "grade" that is to be filled with conditions in an ifelse function.
species BIN collectors country grade species_frequency
1 Poecilothrissa congica BOLD:AAF7519 mljs et al, Democratic Republic of the Congo NA 2
2 Acanthurus triostegus BOLD:AAA9362 Vinothkumar S, Kaleshkumar K and Rajaram R. India NA 54
3 Pseudogramma polyacantha BOLD:AAC5137 Allan D. Connell South Africa NA 15
4 Pomadasys commersonnii BOLD:AAD1338 Allan D. Connell South Africa NA 12
5 Secutor insidiator BOLD:AAB2487 Allan D. Connell South Africa NA 18
6 Sebastes macdonaldi BOLD:AAJ7419 Merit McCrea United States NA 3
BIN_per_species collector_per_species countries_per_species species_per_bin
1 2 1 1 1
2 1 21 15 1
3 3 6 6 1
4 1 2 1 1
5 4 5 4 2
6 1 1 1 1
And after filling the grade column I have something like this (fish19)
species BIN collectors country grade species_frequency
1 Poecilothrissa congica BOLD:AAF7519 mljs et al, Democratic Republic of the Congo D 2
2 Acanthurus triostegus BOLD:AAA9362 Vinothkumar S, Kaleshkumar K and Rajaram R. India A 54
3 Pseudogramma polyacantha BOLD:AAC5137 Allan D. Connell South Africa C 15
4 Pomadasys commersonnii BOLD:AAD1338 Allan D. Connell South Africa A 12
5 Secutor insidiator BOLD:AAB2487 Allan D. Connell South Africa E 18
6 Sebastes macdonaldi BOLD:AAJ7419 Merit McCrea United States B 3
BIN_per_species collector_per_species countries_per_species species_per_bin
1 2 1 1 1
2 1 21 15 1
3 3 6 6 1
4 1 2 1 1
5 4 5 4 2
6 1 1 1 1
Both dataframes have many specimens belonging to the same species of fish, and the grades are supposed to be assigned per species, so every specimen of a species gets the same grade. The problem I'm having is that some rows belonging to the same species are getting different grades, especially in the case of the grades "C" and "E". What I want to incorporate into my ifelse function is: change the grade from "C" to "E" wherever two or more specimens belonging to the same species are assigned "C" in one row and "E" in another row. Because if one species has grade "E", every other row with that species name should also have grade "E".
So far I've tried the %in% function and just using "=="
Trying with %in%
assign_grades=function(fish18){
fish19<-fish18 %>%
mutate(grade = ifelse(species_frequency<3,"D",ifelse(BIN_per_species==1 & (collector_per_species>1 | countries_per_species>1),"A",ifelse(BIN_per_species==1 & collector_per_species==1 | countries_per_species==1,"B",ifelse(BIN_per_species>1 & species_per_bin==1,"C",ifelse(species_per_bin>1,"E",ifelse(fish19$species[fish19$grade=="E"]%in%fish19$species[fish19$grade=="C"]==TRUE,"E",NA))) ))))
assign('fish19',fish19,envir=.GlobalEnv)
}
assign_grades(fish18)
Trying with "=="
assign_grades=function(fish18){
fish19<-fish18 %>%
mutate(grade = ifelse(species_frequency<3,"D",ifelse(BIN_per_species==1 & (collector_per_species>1 | countries_per_species>1),"A",ifelse(BIN_per_species==1 & collector_per_species==1 | countries_per_species==1,"B",ifelse(BIN_per_species>1 & species_per_bin==1,"C",ifelse(species_per_bin>1,"E",ifelse(fish19$species[fish19$grade=="E"]==fish19$species[fish19$grade=="C"],"E",NA))) ))))
assign('fish19',fish19,envir=.GlobalEnv)
}
assign_grades(fish18)
Neither of these two options worked. The desired output is that if one occurrence of a specific species name has the grade "E" assigned to it, so should all other occurrences with that same species name.
I'm sorry if this was confusing, but I tried to be as clear as I could. Thank you in advance for any responses.
Kind of a long winded answer, but:
library(dplyr)
dat <- data.frame(species = c('a','b','c','a','a','b'), grade = c('E','E','C','C','C','D'))
dat %>% left_join(dat %>%
group_by(species) %>%
summarize(sum_e = sum(grade=='E')),by='species')
Then you could do an ifelse for sum_e>0
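The ifelse step is not spelled out above. A possible sketch, assuming the rule is "turn C into E whenever the same species has at least one E" and leaving other grades alone, skips the join and uses a grouped mutate instead. The name `fixed` is introduced here for illustration.

```r
library(dplyr)

dat <- data.frame(species = c("a", "b", "c", "a", "a", "b"),
                  grade   = c("E", "E", "C", "C", "C", "D"),
                  stringsAsFactors = FALSE)

fixed <- dat %>%
  group_by(species) %>%
  # any(grade == "E") is evaluated once per species group
  mutate(grade = ifelse(grade == "C" & any(grade == "E"), "E", grade)) %>%
  ungroup()

fixed
```

If every grade of such a species should become "E" (not only the "C" rows), dropping the `grade == "C" &` part of the condition would do that instead.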

cbind arguments in large dataframe

I have searched unsuccessfully for several days for an answer to this question: I have a dataframe with 279 columns and want to generate subtotals using aggregate(), or indeed, anything suitable. Here is a subset:
LGA off.cat sub.cat Jan1995 Feb1995
1 Albury Homicide Murder * 0 0
2 Albury Homicide Attempted murder 0 0
3 Albury Homicide Murder accessory, conspiracy 0 0
4 Albury Homicide Manslaughter * 0 0
5 Albury Assault Domestic violence related assault 7 7
6 Albury Assault Non-domestic violence related assault 29 20
7 Albury Assault Assault Police 12 3
8 Albury Sexual offences Sexual assault 4 3
The full dataframe contains dozens of LGA values, and many more date columns. I would like to obtain subtotals for each unique LGA value grouped by unique values of off.cat and sub.cat, summed over all dates. I tried using cbind in aggregate, but found no way to generate the 276 date column names that would not cause errors. Explicit column names worked fine. Apologies for the lack of clarity in the earlier post, and thanks to those who valiantly tried to interpret my meaning.
Your question is a bit unclear, but you may be successful using the formula syntax of aggregate. Here's an example:
df <- data.frame(group = letters[1:5],
x = 1:5,
y = 6:10,
z = 11:15)
group x y z
1 a 1 6 11
2 b 2 7 12
3 c 3 8 13
4 d 4 9 14
5 e 5 10 15
We now sum all three variables x, y and z by the levels of group, using setdiff to get a vector of column names except group, and pasting them together to use in as.formula:
aggregate(as.formula(paste(paste(setdiff(names(df), c("group")), collapse = "+"), "~ group")), data = df, sum)
group x + y + z
1 a 18
2 b 21
3 c 24
4 d 27
5 e 30
Hope this helps.
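Applied to the crime data, a sketch that avoids building a long formula is to pick the date columns by name with setdiff(), sum them per row, and then aggregate. The rows below are copied from the excerpt in the question; `crime`, `id_cols`, `date_cols`, `subtotals` and `monthly` are names chosen here.

```r
crime <- data.frame(
  LGA = "Albury",
  off.cat = c("Homicide", "Homicide", "Assault", "Assault", "Assault", "Sexual offences"),
  sub.cat = c("Murder *", "Attempted murder", "Domestic violence related assault",
              "Non-domestic violence related assault", "Assault Police", "Sexual assault"),
  Jan1995 = c(0, 0, 7, 29, 12, 4),
  Feb1995 = c(0, 0, 7, 20, 3, 3),
  stringsAsFactors = FALSE
)

# Everything that is not an identifier column is treated as a date column
id_cols   <- c("LGA", "off.cat", "sub.cat")
date_cols <- setdiff(names(crime), id_cols)

# Sum over all dates per row, then subtotal per LGA and offence category
crime$total <- rowSums(crime[date_cols])
subtotals <- aggregate(total ~ LGA + off.cat, data = crime, FUN = sum)

# Alternative: keep one subtotal per date column using the dot formula
monthly <- aggregate(. ~ LGA + off.cat,
                     data = crime[c("LGA", "off.cat", date_cols)], FUN = sum)

subtotals
monthly
```

Adding sub.cat to the right-hand side of either formula would give subtotals at the finer level.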

Looping over a data frame and adding a new column in R with certain logic

I have a data frame which contains information about sales branches, customers and sales.
branch <- c("Chicago","Chicago","Chicago","Chicago","Chicago","Chicago","LA","LA","LA","LA","LA","LA","LA","Tampa","Tampa","Tampa","Tampa","Tampa","Tampa","Tampa","Tampa")
customer <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21)
sales <- c(33816,24534,47735,1467,39389,30659,21074,20195,45165,37606,38967,41681,47465,3061,23412,22993,34738,19408,11637,36234,23809)
data <- data.frame(branch, customer, sales)
What I need to accomplish is to iterate over each branch, take each customer in the branch and divide the sales for that customer by the total of the branch. I need to do it to find out how much each customer is contributing towards the total sales of the corresponding branch. E.g. for customer 1 I would like to divide 33816/177600 and store this value in a new column. (177600 is the total of the Chicago branch.)
I have tried to write a function to iterate over each row in a for loop but I am not sure how to do it at a branch level. Any guidance is appreciated.
Consider base R's ave, which adds an inline aggregate as a new column; this version also handles a customer with multiple records within the same branch:
data$customer_contribution <- ave(data$sales, data$customer, FUN=sum) /
ave(data$sales, data$branch, FUN=sum)
data
# branch customer sales customer_contribution
# 1 Chicago 1 33816 0.190405405
# 2 Chicago 2 24534 0.138141892
# 3 Chicago 3 47735 0.268778153
# 4 Chicago 4 1467 0.008260135
# 5 Chicago 5 39389 0.221784910
# 6 Chicago 6 30659 0.172629505
# 7 LA 7 21074 0.083576241
# 8 LA 8 20195 0.080090263
# 9 LA 9 45165 0.179117441
# 10 LA 10 37606 0.149139610
# 11 LA 11 38967 0.154537126
# 12 LA 12 41681 0.165300433
# 13 LA 13 47465 0.188238887
# 14 Tampa 14 3061 0.017462291
# 15 Tampa 15 23412 0.133560003
# 16 Tampa 16 22993 0.131169705
# 17 Tampa 17 34738 0.198172193
# 18 Tampa 18 19408 0.110718116
# 19 Tampa 19 11637 0.066386372
# 20 Tampa 20 36234 0.206706524
# 21 Tampa 21 23809 0.135824795
Or less wordy:
data$customer_contribution <- with(data, ave(sales, customer, FUN=sum) /
ave(sales, branch, FUN=sum))
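When each customer appears only once, as in the example data, a shorter variant is to pass prop.table as the function to ave. The sketch below also checks that the shares add up to 1 within each branch; the column name `share` and the object `check` are introduced here.

```r
branch   <- c(rep("Chicago", 6), rep("LA", 7), rep("Tampa", 8))
customer <- 1:21
sales    <- c(33816, 24534, 47735, 1467, 39389, 30659,
              21074, 20195, 45165, 37606, 38967, 41681, 47465,
              3061, 23412, 22993, 34738, 19408, 11637, 36234, 23809)
data <- data.frame(branch, customer, sales)

# prop.table(x) returns x / sum(x); ave applies it within each branch
data$share <- ave(data$sales, data$branch, FUN = prop.table)

# Sanity check: the shares of each branch should sum to 1
check <- tapply(data$share, data$branch, sum)
check

# Customer 1: 33816 / 177600
data$share[1]
```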
We can use dplyr::group_by and dplyr::mutate to calculate fractional sales of total by branch.
library(dplyr)  # also provides the %>% pipe
data %>%
group_by(branch) %>%
mutate(sales.norm = sales / sum(sales))
## A tibble: 21 x 4
## Groups: branch [3]
# branch customer sales sales.norm
# <fct> <dbl> <dbl> <dbl>
# 1 Chicago 1. 33816. 0.190
# 2 Chicago 2. 24534. 0.138
# 3 Chicago 3. 47735. 0.269
# 4 Chicago 4. 1467. 0.00826
# 5 Chicago 5. 39389. 0.222
# 6 Chicago 6. 30659. 0.173
# 7 LA 7. 21074. 0.0836
# 8 LA 8. 20195. 0.0801
# 9 LA 9. 45165. 0.179
#10 LA 10. 37606. 0.149
