Trying to calculate share - summarize function not working - r

I'm trying to calculate each country's share of a certain variable cost, relative to the total. However, when I try to create the "share" column through mutate, every share comes out as 1.
The code I'm using is as follows:
db %>%
group_by(country,group) %>%
summarize(cost=sum(cost)) %>%
mutate(share=cost/sum(cost))
This is the table it is generating:
# Groups: country [18]
country group cost share
<chr> <chr> <dbl> <dbl>
1 AT A 7810. 1
2 AU C 7786. 1
3 CA C 5920. 1
4 KO B 172702. 1
5 DE A 40894. 1
6 ES A 26357. 1
7 FR A 65735. 1
8 GB C 11240. 1
9 IT A 85045. 1
10 JP B 10069. 1
I've tried swapping the positions of group and country in group_by(), but the share column still returns shares within a subgroup instead of shares of the total sum. Why is this happening and how can I fix it?

It's because summarise's default behaviour when you group by more than one variable is to return a grouped data frame: it drops only the last grouping variable (group) and keeps the rest (country), so the subsequent mutate computes each share within its country rather than over the whole table.
To solve it you can add an ungroup:
db %>%
group_by(country,group) %>%
summarize(cost=sum(cost)) %>%
ungroup() %>%
mutate(share=cost/sum(cost))
Or, from dplyr 1.0.0 onwards, drop the grouping directly inside summarize():
db %>%
group_by(country,group) %>%
summarize(cost=sum(cost), .groups = "drop") %>%
mutate(share=cost/sum(cost))
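To see the difference concretely, here is a minimal sketch with made-up data (this db and its values are invented for illustration): while the result is still grouped by country, sum(cost) is taken per country, so each country's shares sum to 1; after dropping the grouping they sum to 1 over the whole table.
library(dplyr)

db <- tibble(
  country = c("AT", "AT", "DE", "DE"),
  group   = c("A", "B", "A", "B"),
  cost    = c(10, 30, 20, 40)
)

# Still grouped by country: shares are within-country
db %>%
  group_by(country, group) %>%
  summarize(cost = sum(cost)) %>%
  mutate(share = cost / sum(cost))   # each country's shares sum to 1

# Grouping dropped: shares are of the overall total
db %>%
  group_by(country, group) %>%
  summarize(cost = sum(cost), .groups = "drop") %>%
  mutate(share = cost / sum(cost))   # all shares sum to 1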

Related

Cumulative sum of unique values based on multiple criteria

I've got a df with multiple columns containing information on species sightings over the years at different sites, so each year may have multiple records. I would like to filter my df and calculate some operations based on certain columns, but I'd like to keep all columns for further analyses. I had some previous code using summarise, but since I want to keep all columns I was trying to avoid it.
Let's say the columns I'm interested to work with at the moment are as follows:
df <- data.frame("Country" = LETTERS[1:5], "Site"=LETTERS[6:10], "species"=1:5, "Year"=1981:2010)
I would like to calculate:
1- The cumulative sum of the records in which a species has been documented within each site creating a new column "Spsum".
2- The number of different years that each species has been seen on a particular site, this could be done as cumulative sum as well, on a new column "nYear".
For example, if species 1 has been recorded 5 times in 1981, and 2 times in 1982 in Site G, Spsum would show 7 (cumulative sum of records) whereas nYear would show 2 as it was spotted over two different years. So far I've got this, but nYear is displaying 0s as a result.
Df1 <- df %>%
  filter(Year > 1980) %>%
  group_by(Country, Site, Species, Year) %>%
  mutate(nYear = n_distinct(Year[Species %in% Site])) %>%
  ungroup()
Thanks!
This could help, without the need for a join. Note that rowid() comes from data.table:
library(dplyr)
library(data.table)  # for rowid()

df %>%
  arrange(Country, Site, species, Year) %>%
  filter(Year > 1980) %>%
  group_by(Site, species) %>%
  mutate(nYear = length(unique(Year))) %>%  # number of distinct years per site/species
  mutate(spsum = rowid(species))            # running count of records
# A tibble: 30 x 6
# Groups: Site, species [5]
Country Site species Year nYear spsum
<chr> <chr> <int> <int> <int> <int>
1 A F 1 1981 6 1
2 A F 1 1986 6 2
3 A F 1 1991 6 3
4 A F 1 1996 6 4
5 A F 1 2001 6 5
6 A F 1 2006 6 6
7 B G 2 1982 6 1
8 B G 2 1987 6 2
9 B G 2 1992 6 3
10 B G 2 1997 6 4
# ... with 20 more rows
If the table contains multiple records per Country+Site+species+Year combination, I would first aggregate those and then calculate the cumulative counts from that. The counts can then be joined back to the original table.
Something along these lines:
cumulative_counts <- df %>%
  count(Country, Site, species, Year) %>%
  group_by(Country, Site, species) %>%
  arrange(Year) %>%
  mutate(Spsum = cumsum(n), nYear = row_number())
df %>%
left_join(cumulative_counts)
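left_join() here infers the keys by matching every shared column name; if you prefer to make them explicit, a minimal equivalent sketch:
library(dplyr)

df %>%
  left_join(cumulative_counts, by = c("Country", "Site", "species", "Year"))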

Adding new, combined values to existing dataframe in R

This is an approximation of the original dataframe. In the original, there are many more columns than are shown here.
id init_cont family description value
1 K S impacteach 1
1 K S impactover 3
1 K S read 2
2 I S impacteach 2
2 I S impactover 4
2 I S read 1
3 K D impacteach 3
3 K D impactover 5
3 K D read 3
I want to combine the values for impacteach and impactover to generate an average value that is just called impact. I would like the final table to look like the following:
id init_cont family description value
1 K S impact 2
1 K S read 2
2 I S impact 3
2 I S read 1
3 K D impact 4
3 K D read 3
I have not been able to figure out how to generate this table. However, I have been able to create a dataframe that looks like this:
id description value
1 impact 2
1 read 2
2 impact 3
2 read 1
3 impact 4
3 read 3
What is the best way for me to take these new values and add them to the original dataframe? I also need to remove the original values (like impacteach and impactover) in the original dataframe. I would prefer to modify the original dataframe as opposed to creating an entirely new dataframe because the original dataframe has many columns.
In case it is useful, this is a summary of the code I used to create the shorter dataframe with impact as a combination of impacteach and impactover:
df %>%
  mutate(newdescription = case_when(
    description %in% c("impacteach", "impactover") ~ "impact",
    TRUE ~ description
  )) %>%
  group_by(id, newdescription) %>%
  summarise(value = mean(as.numeric(value)))
What if you changed the description column first so that it could be included in the grouping? (substr() simply returns the whole string when it has fewer than six characters, so "read" is left unchanged.)
df %>%
mutate(description = substr(description, 1, 6)) %>%
group_by(id, init_cont, family, description) %>%
summarise(value = mean(value))
# A tibble: 6 x 5
# Groups: id, init_cont, family [?]
# id init_cont family description value
# <int> <chr> <chr> <chr> <dbl>
# 1 1 K S impact 2.
# 2 1 K S read 2.
# 3 2 I S impact 3.
# 4 2 I S read 1.
# 5 3 K D impact 4.
# 6 3 K D read 3.
You just need to modify your group_by statement: try group_by(id, init_cont, family, newdescription).
Because your id already maps one-to-one to init_cont and family, adding these variables won't change your summarisation result, and you get all the columns you want with no extra work.
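A minimal sketch of that suggestion applied to your original code (same data as above):
library(dplyr)

df %>%
  mutate(newdescription = case_when(
    description %in% c("impacteach", "impactover") ~ "impact",
    TRUE ~ description
  )) %>%
  group_by(id, init_cont, family, newdescription) %>%  # extra columns ride along
  summarise(value = mean(as.numeric(value)))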
If you have a lot of columns you could try something like the code below. Essentially, left_join your summarised data back onto your original data, using . so you don't have to store an intermediate data frame. Once joined (by id and the description we modified in place) you'll have two value columns, suffixed with .x and .y; drop the original, rename the new one, and use distinct to remove the duplicated 'impact' rows.
df %>%
  mutate(description = case_when(description %in% c("impacteach", "impactover") ~ "impact",
                                 TRUE ~ description)) %>%
  left_join(. %>%
              group_by(id, description) %>%
              summarise(value = mean(as.numeric(value))),
            by = c("id", "description")) %>%
  select(-value.x) %>%
  rename(value = value.y) %>%
  distinct()
gsub can be used to shorten any description starting with impact to just impact, and then group_by from the dplyr package helps in summarising the value.
df %>%
  group_by(id, init_cont, family,
           description = gsub("^(impact).*", "\\1", description)) %>%
  summarise(value = mean(value))
# # A tibble: 6 x 5
# # Groups: id, init_cont, family [?]
# id init_cont family description value
# <int> <chr> <chr> <chr> <dbl>
# 1 1 K S impact 2.00
# 2 1 K S read 2.00
# 3 2 I S impact 3.00
# 4 2 I S read 1.00
# 5 3 K D impact 4.00
# 6 3 K D read 3.00

tidyr::gather() %>% mutate() %>% spread() returns NA's unexpectedly

My ultimate goal is to run a series of chisq.tests on this data, comparing the values of 'dealer', 'store' and 'transport' by 'gender'. I'm using gather and spread to create a 'female' column and a 'male' column, then planned to use group_by and map to run chisq.test per group of 'key', which is created in my gather call. I'm doing something wrong, because I'm getting grouped NAs back.
The code below produces my dilemma.
set.seed(123)
df_ <- data_frame(gender = sample(c('male','female'), 100, T),
                  dealer = sample(1:5, 100, T),
                  store = sample(1:5, 100, T),
                  transport = sample(1:5, 100, T))
df_ %>%
gather(key,value,-gender) %>%
mutate(id = 1:nrow(.)) %>%
spread(gender,value)
Here is a data_frame of my desired outcome.
data_frame(key = sample(c('dealer','store','transport'), 50, T),
           male = sample(1:5, 50, T),
           female = sample(1:5, 50, T))
You need to group_by(gender) before adding your id and spreading, i.e.
library(tidyverse)
df_ %>%
gather(key, value, -gender) %>%
group_by(gender) %>%
mutate(id = row_number()) %>%
spread(gender, value)
NOTE: substituting row_number() with 1:nrow(.) will fail because of the grouping. It builds a sequence over the whole data frame (rather than one per group) and tries to assign it to each group, hence the length error:
Error in mutate_impl(.data, dots) :
Column id must be length 156 (the group size) or one, not 300
If you instead write ... %>% mutate(id = 1:length(key)) it will be fine, because length(key) equals the size of the current group.
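For completeness, a minimal sketch of that variant (it gives the same result as the row_number() version):
library(tidyverse)

df_ %>%
  gather(key, value, -gender) %>%
  group_by(gender) %>%
  mutate(id = 1:length(key)) %>%  # length(key) is the current group's size
  spread(gender, value)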
The result with either row_number() or 1:length(key) is:
# A tibble: 168 x 4
key id female male
* <chr> <int> <int> <int>
1 dealer 1 3 4
2 dealer 2 3 2
3 dealer 3 1 4
4 dealer 4 5 3
5 dealer 5 4 4
6 dealer 6 5 2
7 dealer 7 3 3
8 dealer 8 1 2
9 dealer 9 2 5
10 dealer 10 2 2
# ... with 158 more rows
@elliot, while @Sotos has given a great answer to the tidyverse challenge you were having, I'm a bit confused about why you're going through all that extra effort. Your stated ultimate goal was to run chisq.test for gender against each of the others (dealer, store and transport), and your original dataset doesn't need any modification for that!
require(tidyverse)
set.seed(123)
yourdata <- data_frame(gender = sample(c('male','female'), 100, T),
                       dealer = sample(1:5, 100, T),
                       store = sample(1:5, 100, T),
                       transport = sample(1:5, 100, T))
yourdata
# A tibble: 100 x 4
gender dealer store transport
<chr> <int> <int> <int>
1 female 2 2 5
2 male 2 4 2
3 female 2 2 1
It can be used exactly as it stands! You may have other reasons to reshape the data, but it is already tidy, representing one case or person per row.
Edited (January 16th): to achieve your stated ultimate goal you just have to:
require(dplyr)
require(broom)
allofthem <- lapply(yourdata[-1], function(y) tidy(chisq.test(x = yourdata$gender, y = y )))
allofthem <- bind_rows(allofthem, .id = "dependentv")
allofthem
You may also want to look at the lsr package, which runs chi-square independence (association) tests and provides much more informative output. Also note that, from a statistical perspective, you are running multiple tests and should correct your confidence level appropriately; see for example http://rpubs.com/ibecav/290361
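For instance, a minimal sketch of one such correction on the tidied results above (p.value is the column name broom::tidy() produces for chisq.test results):
library(dplyr)

allofthem %>%
  mutate(p.adjusted = p.adjust(p.value, method = "holm"))  # Holm correction across the tests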

Duplicate identifier error in tidyr

I am using tidyr from R and am running into an issue when using the spread() command with duplicate identifiers.
Here is a mock example that illustrates the problem:
X = data.frame(name = c("Eric","Bob","Mark","Bob","Bob","Mark","Eric","Bob","Mark"),
               metric = c("height","height","height","weight","weight","weight","grade","grade","grade"),
               values = c(6,5,4,120,118,180,"A","B","C"),
               stringsAsFactors = FALSE)
tidyr::spread(X,metric,values)
So when I run this command I get the following error:
Error: Duplicate identifiers for rows (4, 5)
which makes sense as an error, because Bob is recorded twice for weight. It's actually not a mistake: Bob did have his weight recorded twice. What I would like is to run the command and get back the following:
name height weight grade
Eric 6 NA A
Bob 5 120 B
Bob 5 118 B
Mark 4 180 C
Is spread not the command I should be using to accomplish this? And if there isn't an easy solution, is there a simple way to remove the record with the lowest weight for duplicates when running the spread() command?
After making unique identifiers, which can be done by adding a variable holding the row index within each group, you can use fill to copy the first "Bob" row's "height" and "grade" into the second.
The index variable can be removed at the end via select.
library(dplyr)
library(tidyr)
X %>%
  group_by(name, metric) %>%
  mutate(row = row_number()) %>%
  spread(metric, values) %>%
  fill(grade, height) %>%
  select(-row)
# A tibble: 4 x 4
# Groups: name [3]
name grade height weight
<chr> <chr> <chr> <chr>
1 Bob B 5 120
2 Bob B 5 118
3 Eric A 6 <NA>
4 Mark C 4 180
To filter to the maximum value of each name/metric group (note that values is a character column here, so max() compares alphabetically):
X %>%
group_by(name, metric) %>%
filter(values == max(values)) %>%
spread(metric, values)
# A tibble: 3 x 4
# Groups: name [3]
name grade height weight
* <chr> <chr> <chr> <chr>
1 Bob B 5 120
2 Eric A 6 <NA>
3 Mark C 4 180
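On current tidyr (>= 1.0.0), spread() is superseded by pivot_wider(); a hedged sketch of the same idea with it (pivot_wider() keeps rows in first-appearance order, so an explicit arrange() is needed before filling):
library(dplyr)
library(tidyr)

X %>%
  group_by(name, metric) %>%
  mutate(row = row_number()) %>%   # disambiguate the duplicate identifiers
  ungroup() %>%
  pivot_wider(names_from = metric, values_from = values) %>%
  arrange(name, row) %>%
  group_by(name) %>%
  fill(grade, height) %>%          # copy Bob's grade/height into his second row
  ungroup() %>%
  select(-row)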

Using dplyr to summarise a running total of distinct factors

I'm trying to generate a species saturation curve for a camera trapping survey. I have thousands of observations and do most of my manipulations in dplyr.
I have three field sites, with observation records of different animal species from a number of weeks of trapping. In some weeks there are no animals; in other weeks there may be more than one species. I want to generate a separate figure for each site to compare how quickly new species are encountered over the sequential weeks of the study. These encounters of new species should eventually saturate once the total species diversity of the area has been captured. Some field sites are likely to saturate faster than others.
The problem is that I have not come across a way of counting the number of distinct species to provide a running total by time. A simple dummy dataset is below.
field_site <- c(rep("A", 4), rep("B", 4), rep("C", 4))
week <- c(1, 2, 2, 3, 2, 3, 4, 4, 1, 2, 3, 4)
animal <- c("dog", "dog", "cat", "rabbit", "dog", "dog", "dog", "rabbit", "cat", "cat", "rabbit", "dog")
df <- as.data.frame(cbind(field_site, week, animal))
I can easily generate the number of unique species within each week grouping, e.g.
tbl_df(df) %>%
  group_by(field_site, week) %>%
  summarise(no_of_sp = n_distinct(animal))
But this is not sensitive to the fact that some species are encountered again in subsequent weeks. What I really need is a running total that counts the distinct species per site from week 1 downward through the rows, assuming the data is sorted by increasing time from the start of the survey.
The cumulative total of species encountered over the course of the study by week in the example for field Site A would be: week 1 = 1 species, week 2 = 2 species, week 3 = 3 species, week 4 = still 3 species.
For site B cumulative total of species would be: week 1 = 0 species, week 2 = 1 species, week 3 = 1 species,week 4 = 1 species, etc...
Any advice would be greatly appreciated.
cheers in advance!
I'm making two assumptions:
Site B, week 4 = 2 species, both "dog" and "rabbit"; and
All sites share the same weeks, so if at least one site has week 4, then all sites should include it. This only drives the mt (empty) variable; feel free to adjust it.
I first suggest an "empty" data.frame to ensure sites have the requisite week numbers populated:
mt <- expand.grid(field_site = unique(df$field_site),
                  week = unique(df$week))
The use of tidyr helps:
library(dplyr)
library(tidyr)
df %>%
mutate(fake = TRUE) %>%
# ensure all species are "represented" on each row
spread(animal, fake) %>%
# ensure all weeks are shown, even if no species
full_join(mt, by = c("field_site", "week")) %>%
# ensure the presence of a species persists at a site
arrange(week) %>%
group_by(field_site) %>%
mutate_if(is.logical, funs(cummax(!is.na(.)))) %>%
ungroup() %>%
# helps to contain variable number of species columns in one place
nest(-field_site, -week, .key = "species") %>%
group_by(field_site, week) %>%
# could also use purrr::map in place of sapply
mutate(n = sapply(species, sum)) %>%
ungroup() %>%
select(-species) %>%
arrange(field_site, week)
# # A tibble: 12 × 3
# field_site week n
# <fctr> <fctr> <int>
# 1 A 1 1
# 2 A 2 2
# 3 A 3 3
# 4 A 4 3
# 5 B 1 0
# 6 B 2 1
# 7 B 3 1
# 8 B 4 2
# 9 C 1 1
# 10 C 2 1
# 11 C 3 2
# 12 C 4 3
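An alternative sketch of the same running total: keep only the first sighting of each species per site, count those per week, complete the missing weeks with zeroes, and take a cumulative sum. This assumes week is numeric, so the data frame is rebuilt from the vectors above without cbind() (which coerces every column to factor):
library(dplyr)
library(tidyr)

df2 <- data.frame(field_site, week, animal)

df2 %>%
  arrange(field_site, week) %>%
  distinct(field_site, animal, .keep_all = TRUE) %>%  # first sighting per site
  count(field_site, week) %>%                         # new species per site/week
  complete(field_site, week = full_seq(week, 1),
           fill = list(n = 0)) %>%                    # weeks with no new species
  group_by(field_site) %>%
  mutate(n = cumsum(n)) %>%
  ungroup()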
