Create Dataframe w/All Combinations of 2 Categorical Columns then Sum 3rd Column by Each Combination - r

I have a large, messy dataset but want to accomplish a straightforward thing. Essentially I want to fill a tibble based on every combination of two columns and sum a third column.
As a hypothetical example, say each observation has the company_name (Wendys, BK, McDonalds), the food_option (burgers, fries, frosty), and the total_spending (in $). I would like to make a 9x3 tibble with the company, food, and total as a sum of every observation. Here's my code so far:
df_table <- df %>%
  group_by(company_name, food_option) %>%
  summarize(total = sum(total_spending))
  company_name food_option total
  <chr>        <chr>       <dbl>
1 Wendys       Burgers       757
2 Wendys       Fries         140
3 Wendys       Frosty         98
4 McDonalds    Burgers      1044
5 McDonalds    Fries         148
6 BK           Burgers       669
7 BK           Fries          38
The problem is that McDonalds has zero observations with "Frosty" as the food_option. Consequently, I get a partial table. I'd like to fill that with a row that shows:
8 McDonalds    Frosty          0
9 BK           Frosty          0
I know I can add the rows manually, but the actual dataset has over a hundred combinations so it will be tedious and complicated. Also, I'm constantly modifying the upstream data and I want the code to automatically fill correctly.
Thank you SO MUCH to anyone who can help. This forum has really been a godsend, really appreciate all of you.

Try:
library(dplyr)

df %>%
  mutate(food_option = factor(food_option, levels = unique(food_option))) %>%
  group_by(company_name, food_option, .drop = FALSE) %>%
  summarise(total = sum(total_spending))
Newer versions of dplyr (0.8.0 and later) have a .drop argument to group_by(): if you group by a factor with pre-defined levels, the empty levels will not be dropped, and you'll get the zero rows.
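If a whole company could be absent from the data as well, the same idea extends to both columns. A minimal sketch, assuming the full level sets are known up front (take them from your real data):

df %>%
  mutate(company_name = factor(company_name, levels = c("Wendys", "McDonalds", "BK")),  # assumed level set
         food_option = factor(food_option, levels = c("Burgers", "Fries", "Frosty"))) %>%
  group_by(company_name, food_option, .drop = FALSE) %>%
  summarise(total = sum(total_spending))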

You can use tidyr::expand_grid() to create all possible combinations:
tidyr::expand_grid(company_name = c("Wendys", "McDonalds", "BK"),
                   food_option = c("Burgers", "Fries", "Frosty"))
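A sketch of how you might combine that grid with your existing summary (left_join() and replace_na() are standard dplyr/tidyr; df_table is the summarized table from the question):

library(dplyr)
library(tidyr)

all_combos <- expand_grid(company_name = c("Wendys", "McDonalds", "BK"),
                          food_option = c("Burgers", "Fries", "Frosty"))

# Missing combinations get NA from the join, which we turn into 0
all_combos %>%
  left_join(df_table, by = c("company_name", "food_option")) %>%
  mutate(total = replace_na(total, 0))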

library(tidyverse)
# example data
df = read.table(text = "
company_name food_option total
1 Wendys Burgers 757
2 Wendys Fries 140
3 Wendys Frosty 98
4 McDonalds Burgers 1044
5 McDonalds Fries 148
6 BK Burgers 669
7 BK Fries 38
", header=T)
df %>% complete(company_name, food_option, fill=list(total = 0))
# # A tibble: 9 x 3
# company_name food_option total
# <fct> <fct> <dbl>
# 1 BK Burgers 669
# 2 BK Fries 38
# 3 BK Frosty 0
# 4 McDonalds Burgers 1044
# 5 McDonalds Fries 148
# 6 McDonalds Frosty 0
# 7 Wendys Burgers 757
# 8 Wendys Fries 140
# 9 Wendys Frosty 98
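Note that complete() above runs on data that already has one row per observed combination. Starting from raw observation-level data, you would summarise first and then complete; a sketch, where df_raw stands in for your original data (the .groups argument requires dplyr >= 1.0.0):

df_raw %>%
  group_by(company_name, food_option) %>%
  summarise(total = sum(total_spending), .groups = "drop") %>%
  complete(company_name, food_option, fill = list(total = 0))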


dplyr relative frequency within group

Edit (hopefully simplified):
I asked farmers of a specific farm type (organic or conventional) to report whether species A and B occur (0/1) on their land.
So, I have
df <- data.frame(id = 1:10,
                 farmtype = c(rep("org", 4), rep("conv", 6)),
                 spA = c(0,0,0,1,1,1,1,1,1,1),
                 spB = c(1,1,1,0,0,0,0,0,0,0))
And my question is pretty simple... In what percentage of organic or conventional farms do the species occur?
Desired solution:
sp A occurs in 25% of org farms and 100% of conv farms
sp B occurs in 75% of org farms and 0% of conv farms
None of the solutions outlined below achieve that.
Additional question:
All I want is a simple ggplot with the species on the x-axis and the percentage of detection on the y-axis (once for org and once for conv).
ggplot(df.melt) +
  geom_bar(aes(x = species, fill = farmtype))
# but, of course, this plots raw counts of the species records, not the percentage of detection per farm type
janitor's tabyl is your friend. What you're calculating is "row"-percentages, but what you want is "col"-percentages. E.g.
set.seed(1234)
df <- data.frame(farmtype = sample(c("organic","conventional"), 100, replace = T),
                 species = sample(letters[1:4], 100, replace = T),
                 occ = sample(c("yes","no"), 100, replace = T))

library(janitor)
df |>
  tabyl(species, farmtype) |>
  adorn_percentages("col")
# species conventional organic
# a 0.2553191 0.2641509
# b 0.2765957 0.2452830
# c 0.2553191 0.1886792
# d 0.2127660 0.3018868
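If you want those shown as formatted percentages rather than proportions, janitor's adorn_pct_formatting() chains onto the same pipeline:

df |>
  tabyl(species, farmtype) |>
  adorn_percentages("col") |>
  adorn_pct_formatting(digits = 1)  # e.g. 0.2553191 becomes "25.5%"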
But you could also use your own approach. Group by farmtype in the second group_by and remember to save the dataframe. This is easier to use with ggplot2, as the result is already in long format.
df <- df %>%
  group_by(species, farmtype) %>%
  dplyr::summarise(count = n()) %>%
  group_by(farmtype) %>%
  dplyr::mutate(prop = count / sum(count))
df
# A tibble: 8 × 4
# Groups: farmtype [2]
# species farmtype count prop
# <chr> <chr> <int> <dbl>
# a conventional 12 0.255
# a organic 14 0.264
# b conventional 13 0.277
# b organic 13 0.245
# c conventional 12 0.255
# c organic 10 0.189
# d conventional 10 0.213
# d organic 16 0.302
df %>%
  ggplot(aes(x = species, y = prop, fill = farmtype)) +
  geom_col()
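To get one bar for org and one for conv side by side (rather than stacked), pass position = "dodge" to geom_col():

df %>%
  ggplot(aes(x = species, y = prop, fill = farmtype)) +
  geom_col(position = "dodge")  # side-by-side bars per species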
Update: a variant of the second option was also suggested by Isaac Bravo.
Here you can have another option using your approach:
df %>%
  group_by(farmtype, species) %>%
  summarize(n = n()) %>%
  mutate(percentage = n/sum(n))
OUTPUT:
farmtype species n percentage
<chr> <chr> <int> <dbl>
1 conventional a 12 0.235
2 conventional b 12 0.235
3 conventional c 12 0.235
4 conventional d 15 0.294
5 organic a 16 0.327
6 organic b 9 0.184
7 organic c 14 0.286
8 organic d 10 0.204
If I understand the poster's first question correctly, the poster seeks the proportion of organic versus conventional farm types among farms that grew a given species. This can also be accomplished using the data.table package as follows.
First, the example data set is recreated by setting the seed.
set.seed(1234)  ## setting seed for reproducible example
df <- data.frame(farmtype = sample(c("organic","conventional"), 100, replace = T),
                 species = sample(letters[1:4], 100, replace = T),
                 occ = sample(c("yes","no"), 100, replace = T))
require(data.table)
df = data.table(df)
Next, the "no" answers are filtered out because we are only interested in farms that reported the species as occurring (the "occ" column). We then count the occurrences of each species for each farm type. The column "N" gives the count.
#Filter out "no" answers because they shouldn't affect the result sought
#and count the number of farmtypes that reported each species
ans = df[occ == "yes",.N,by = .(farmtype,species)]
ans
# farmtype species N
#1: conventional a 8
#2: conventional c 8
#3: organic a 6
#4: conventional d 11
#5: organic d 5
#6: organic c 7
#7: organic b 4
#8: conventional b 6
The total occurrences of each species for either farm type are then counted. As a check for this result, each row for a given species should give the same species total.
#Total number of farms that reported the species
ans[, species_total := sum(N), by = species]
ans
# farmtype species N species_total
#1: conventional a 8 14
#2: conventional c 8 15
#3: organic a 6 14
#4: conventional d 11 16
#5: organic d 5 16
#6: organic c 7 15
#7: organic b 4 10
#8: conventional b 6 10
Finally, the columns are combined to calculate the proportion of organic or conventional farms for each species that was reported. As a check against the result, the proportion of organic and the proportion of conventional for each species should sum to 1 because there are only two farm types.
##Calculate the proportion of each farm type reported for each species
ans[, proportion := N/species_total]
ans
# farmtype species N species_total proportion
#1: conventional a 8 14 0.5714286
#2: conventional c 8 15 0.5333333
#3: organic a 6 14 0.4285714
#4: conventional d 11 16 0.6875000
#5: organic d 5 16 0.3125000
#6: organic c 7 15 0.4666667
#7: organic b 4 10 0.4000000
#8: conventional b 6 10 0.6000000
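That sum-to-1 check can be done directly in data.table:

## Check: organic + conventional proportions sum to 1 for each species
ans[, .(check = sum(proportion)), by = species]
#   species check
#1:       a     1
#2:       c     1
#3:       d     1
#4:       b     1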
##Gives the proportion of organic farms specifically
ans[farmtype == "organic"]
# farmtype species N species_total proportion
#1: organic a 6 14 0.4285714
#2: organic d 5 16 0.3125000
#3: organic c 7 15 0.4666667
#4: organic b 4 10 0.4000000
If, on the other hand, one wanted to calculate the fraction of each species to all species occurrences reported for organic or conventional farms, you could use this code:
ans = df[,.N, by = .(species, farmtype,occ)] ##count by species,farmtype, and occurrence
ans[, spf := sum(N), by = .(occ,farmtype)] ##spf is the total number of times an occurrence was reported for each type
ans[, prop := N/spf]
ans = ans[occ == "yes"] ##proportion of the given species to all species occurrences reported for each farm type
ans
# species farmtype occ N spf prop
#1: a conventional yes 8 33 0.2424242
#2: c conventional yes 8 33 0.2424242
#3: a organic yes 6 22 0.2727273
#4: d conventional yes 11 33 0.3333333
#5: d organic yes 5 22 0.2272727
#6: c organic yes 7 22 0.3181818
#7: b organic yes 4 22 0.1818182
#8: b conventional yes 6 33 0.1818182
This result means that, for example, conventional farmers reported species "a" about 24.2% of the times that they reported any species. The result can be verified by selecting a species and farmtype and calculating manually as a spot check.
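For instance, the spot check for species "a" on conventional farms, using the counts from the table above:

## 8 conventional reports of "a" out of 33 total conventional "yes" reports
8 / 33
#[1] 0.2424242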

Unable to Group and Sum Properly

I have data similar to this Sample Data:
Cities Country Date Cases
1 BE A 2/12/20 12
2 BD A 2/12/20 244
3 BF A 2/12/20 1
4 V 2/12/20 13
5 Q 2/13/20 2
6 D 2/14/20 4
7 GH N 2/15/20 6
8 DA N 2/15/20 624
9 AG J 2/15/20 204
10 FS U 2/16/20 433
11 FR U 2/16/20 38
I want to organize the data by date and country and then sum each country's daily cases. However, when I try something like this, it returns the grand total instead:
my_data %>%
  group_by(Country, Date) %>%
  summarize(Cases = sum(Cases))
Your summarize function is likely being called from another package (plyr?). Try calling dplyr::summarize like this:
my_data %>%
  group_by(Country, Date) %>%
  dplyr::summarize(Cases = sum(Cases))
# A tibble: 7 x 3
# Groups: Country [7]
Country Date Cases
<fct> <fct> <int>
1 A 2/12/20 257
2 D 2/14/20 4
3 J 2/15/20 204
4 N 2/15/20 630
5 Q 2/13/20 2
6 U 2/16/20 471
7 V 2/12/20 13
I sympathize with you; this can be very frustrating. I have gotten into the habit of always writing dplyr::select, dplyr::filter, and dplyr::summarize. Otherwise you spend needless time frustrated about why your code isn't working.
We can also use aggregate
aggregate(Cases ~ Country + Date, my_data, sum)
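If the masking indeed comes from plyr (an assumption about your session), the conflicted package offers another way to pin the winner explicitly:

library(conflicted)
conflict_prefer("summarize", "dplyr")  # always resolve summarize() to dplyr's version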

Add row with group sum in new column at the end of group category

I have been searching this information since yesterday but so far I could not find a nice solution to my problem.
I have the following dataframe:
CODE CONCEPT P. NR. NAME DEPTO. PRICE
1 Lunch 11 John SALES 160
1 Lunch 11 John SALES 120
1 Lunch 11 John SALES 10
1 Lunch 13 Frank IT 200
2 Internet 13 Frank IT 120
and I want to add a column with the sum of rows by group, for instance, the total amount of concept: Lunch, code: 1 by name in order to get an output like this:
CODE CONCEPT P. NR. NAME DEPTO. PRICE TOTAL
1 Lunch 11 John SALES 160 NA
1 Lunch 11 John SALES 120 NA
1 Lunch 11 John SALES 10 290
1 Lunch 13 Frank IT 200 200
2 Internet 13 Frank IT 120 120
So far, I tried with:
aggregate(PRICE~NAME+CODE, data = df, FUN = sum)
But this retrieves just the total of the concepts like this:
NAME CODE TOTAL
John 1 290
Frank 1 200
Frank 2 120
And not the table with the rest of the data as I would like to have it.
I also tried adding an extra column with NA but somehow I cannot paste the total in a specific row position.
Any suggestions? I would like something I can do in base R.
Thanks!!
In base R you can use ave to add the new column. We insert the group sum only if the row is the last one in its group.
df$TOTAL <- with(df, ave(PRICE, CODE, CONCEPT, PNR, NAME, FUN = function(x)
  ifelse(seq_along(x) == length(x), sum(x), NA)))
df
# CODE CONCEPT PNR NAME DEPTO. PRICE TOTAL
#1 1 Lunch 11 John SALES 160 NA
#2 1 Lunch 11 John SALES 120 NA
#3 1 Lunch 11 John SALES 10 290
#4 1 Lunch 13 Frank IT 200 200
#5 2 Internet 13 Frank IT 120 120
Similar logic using dplyr
library(dplyr)
df %>%
  group_by(CODE, CONCEPT, PNR, NAME) %>%
  mutate(TOTAL = ifelse(row_number() == n(), sum(PRICE), NA))
For a base R option, you may try merging the original data frame and aggregate:
df2 <- aggregate(PRICE~NAME+CODE, data = df, FUN = sum)
out <- merge(df[ , !(names(df) %in% c("PRICE"))], df2, by=c("NAME", "CODE"))
out[with(out, order(CODE, NAME)), ]
NAME CODE CONCEPT PNR DEPT PRICE
1 Frank 1 Lunch 13 IT 200
3 John 1 Lunch 11 SALES 290
4 John 1 Lunch 11 SALES 290
5 John 1 Lunch 11 SALES 290
2 Frank 2 Internet 13 IT 120
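If you also need the NA-except-last-row layout from the desired output, a base R sketch on top of that merged result: duplicated() with fromLast = TRUE flags every row but the last within each NAME/CODE group (the rows are already ordered, so groups are contiguous):

## Blank out the sum on all but the last row of each group
out$PRICE[duplicated(out[c("NAME", "CODE")], fromLast = TRUE)] <- NA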

How to create a Markdown table with different column lengths based on a dataframe in long format in R?

I'm working on an R Markdown file that I would like to submit as a manuscript to an academic journal. I would like to create a table that shows which three words (item2) co-occur most frequently with some keywords (item1). Note that some keywords have more than three co-occurring words. The data that I am currently working with:
item1 <- c("water","water","water","water","water","sun","sun","sun","sun","moon","moon","moon")
item2 <- c("tree","dog","cat","fish","eagle","bird","table","bed","flower","house","desk","tiger")
n <- c("200","83","34","34","34","300","250","77","77","122","46","46")
df <- data.frame(item1,item2,n)
Which gives this dataframe:
item1 item2 n
1 water tree 200
2 water dog 83
3 water cat 34
4 water fish 34
5 water eagle 34
6 sun bird 300
7 sun table 250
8 sun bed 77
9 sun flower 77
10 moon house 122
11 moon desk 46
12 moon tiger 46
Ultimately, I would like to pass the data to the function papaja::apa_table, which requires a data.frame (or a matrix / list). I therefore need to reshape the data.
My question:
How can I reshape the data (preferably with dplyr) to get the following structure?
water_item2 water_n sun_item2 sun_n moon_item2 moon_n
1 tree 200 bird 300 house 122
2 dog 83 table 250 desk 46
3 cat 34 bed 77 tiger 46
4 fish 34 flower 77 <NA> <NA>
5 eagle 34 <NA> <NA> <NA> <NA>
We can borrow an approach from an old answer of mine to a different question, and modify a classic gather(), unite(), spread() strategy by creating unique identifiers by group to avoid duplicate identifiers, then dropping that variable:
library(dplyr)
library(tidyr)
item1 <- c("water","water","water","water","water","sun","sun","sun","sun","moon","moon","moon")
item2 <- c("tree","dog","cat","fish","eagle","bird","table","bed","flower","house","desk","tiger")
n <- c("200","83","34","34","34","300","250","77","77","122","46","46")
# Owing to Richard Telford's excellent comment,
# I use data_frame() (or equivalently for our purposes,
# data.frame(..., stringsAsFactors = FALSE))
# to avoid turning the strings into factors
df <- data_frame(item1,item2,n)
df %>%
  group_by(item1) %>%
  mutate(id = 1:n()) %>%
  ungroup() %>%
  gather(temp, val, item2, n) %>%
  unite(temp2, item1, temp, sep = '_') %>%
  spread(temp2, val) %>%
  select(-id)
# A tibble: 5 x 6
moon_item2 moon_n sun_item2 sun_n water_item2 water_n
<chr> <chr> <chr> <chr> <chr> <chr>
1 house 122 bird 300 tree 200
2 desk 46 table 250 dog 83
3 tiger 46 bed 77 cat 34
4 NA NA flower 77 fish 34
5 NA NA NA NA eagle 34
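gather()/spread() still work but have been superseded; here is a sketch of the same reshape with pivot_wider() (names_glue and names_vary require a reasonably recent tidyr, 1.2.0 or later to be safe). names_vary = "slowest" interleaves the columns as water_item2, water_n, sun_item2, and so on, matching the desired output:

df %>%
  group_by(item1) %>%
  mutate(id = row_number()) %>%  # unique row id within each keyword
  ungroup() %>%
  pivot_wider(names_from = item1,
              values_from = c(item2, n),
              names_glue = "{item1}_{.value}",
              names_vary = "slowest") %>%
  select(-id)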

Best method to Merge two Datasets (Maybe if function?)

I have two data sets I am working with: TestA and TestB (below is how to create them in R).
Instructor <- c('Mr.A','Mr.A','Mr.B', 'Mr.C', 'Mr.D')
Class <- c('French','French','English', 'Math', 'Geometry')
Section <- c('1','2','3','5','5')
Time <- c('9:00-10:00','10:00-11:00','9:00-10:00','9:00-10:00','10:00-11:00')
Date <- c('MWF','MWF','TR','TR','MWF')
Enrollment <- c('30','40','24','29','40')
TestA <- data.frame(Instructor,Class,Section,Time,Date,Enrollment)
rm(Instructor,Class,Section,Time,Date,Enrollment)
Student <- c("Frances","Cass","Fern","Pat","Peter","Kory","Cole")
ID <- c('123','121','101','151','456','789','314')
Instructor <- c('','','','','','','')
Time <- c('','','','','','','')
Date <- c('','','','','','','')
Enrollment <- c('','','','','','','')
Class <- c('French','French','French','French','English', 'Math', 'Geometry')
Section <- c('1','1','2','2','3','5','5')
TestB <- data.frame(Student, ID, Instructor, Class, Section, Time, Date, Enrollment)
rm(Instructor,Class,Section,Time,Date,Enrollment,ID,Student)
I would like to merge both datasets (if possible, without using merge()) so that all the columns of TestA are filled in with the information provided by TestB, matched on Class and Section.
I tried using merge(TestA, TestB, by=c('Class','Section'), all.x=TRUE) but it adds observations to the original TestA. This is just a test but in the datasets I am using there are hundreds of observations. It worked when I did it with these smaller frames but something is happening to the bigger set. That's why I'd like to know if there is a merge alternative.
Any ideas on how to do this?
The output should look like this
Class Section Instructor Time Date Enrollment Student ID
English 3 Mr.B 9:00-10:00 TR 24 Peter 456
French 1 Mr.A 9:00-10:00 MWF 30 Frances 123
French 1 Mr.A 9:00-10:00 MWF 30 Cass 121
French 2 Mr.A 10:00-11:00 MWF 40 Fern 101
French 2 Mr.A 10:00-11:00 MWF 40 Pat 151
Geometry 5 Mr.D 10:00-11:00 MWF 40 Cole 314
Math 5 Mr.C 9:00-10:00 TR 29 Kory 789
I was once a big fan of merge() until I learned about dplyr's join functions.
Try this instead:
library(dplyr)
TestA %>%
  left_join(TestB, by = c("Class", "Section")) %>% # here, you're joining by just the "Class" and "Section" columns of TestA and TestB
  select(Class,
         Section,
         Instructor = Instructor.x,
         Time = Time.x,
         Date = Date.x,
         Enrollment = Enrollment.x,
         Student,
         ID) %>%
  arrange(Class, Section) # added to match your output
The select statement is keeping only those columns that are specifically named and, in some cases, renaming them.
Output:
Class Section Instructor Time Date Enrollment Student ID
1 English 3 Mr.B 9:00-10:00 TR 24 Peter 456
2 French 1 Mr.A 9:00-10:00 MWF 30 Frances 123
3 French 1 Mr.A 9:00-10:00 MWF 30 Cass 121
4 French 2 Mr.A 10:00-11:00 MWF 40 Fern 101
5 French 2 Mr.A 10:00-11:00 MWF 40 Pat 151
6 Geometry 5 Mr.D 10:00-11:00 MWF 40 Cole 314
7 Math 5 Mr.C 9:00-10:00 TR 29 Kory 789
The key is to drop the empty but duplicate columns from TestB before merging / joining as shown by SymbolixAU.
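In dplyr terms, dropping those columns up front looks like this; it is equivalent to the select()/rename step in the first answer, but because the empty columns never enter the join, no .x/.y suffixes appear:

library(dplyr)
# Keep only the columns TestB actually contributes, then join
TestA %>%
  left_join(TestB %>% select(Student, ID, Class, Section),
            by = c("Class", "Section"))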
Here is an implementation in data.table syntax:
library(data.table)
setDT(TestB)[, .(Student, ID, Class, Section)][setDT(TestA), on = .(Class, Section)]
Student ID Class Section Instructor Time Date Enrollment
1: Frances 123 French 1 Mr.A 9:00-10:00 MWF 30
2: Cass 121 French 1 Mr.A 9:00-10:00 MWF 30
3: Fern 101 French 2 Mr.A 10:00-11:00 MWF 40
4: Pat 151 French 2 Mr.A 10:00-11:00 MWF 40
5: Peter 456 English 3 Mr.B 9:00-10:00 TR 24
6: Kory 789 Math 5 Mr.C 9:00-10:00 TR 29
7: Cole 314 Geometry 5 Mr.D 10:00-11:00 MWF 40
