r: Summarise for rowSums after group_by - r

I've tried searching a number of posts on SO but I'm not sure what I'm doing wrong here, and I imagine the solution is quite simple. I'm trying to group a dataframe by one variable and figure the mean of several variables within that group.
Here is what I am trying:
head(airquality)
target_vars = c("Ozone","Temp","Solar.R")
airquality %>% group_by(Month) %>% select(target_vars) %>% summarise(rowSums(.))
But I get the error that my lenghts don't match. I've tried variations using mutate to create the column or summarise_all, but neither of these seem to work. I need the row sums within group, and then to compute the mean within group (yes, it's nonsensical here).
Also, I want to use select because I'm trying to do this over just certain variables.
I'm sure this could be a duplicate, but I can't find the right one.
EDIT FOR CLARITY
Sorry, my original question was not clear. Imagine the grouping variable is the calendar month, and we have v1, v2, and v3. I'd like to know, within month, what was the average of the sums of v1, v2, and v3. So if we have 12 months, the result would be a 12x1 dataframe. Here is an example if we just had 1 month:
Month v1 v2 v3 Sum
1 1 1 0 2
1 1 1 1 3
1 1 0 0 3
Then the result would be:
Month Average
1 8/3

You can try:
library(tidyverse)
airquality %>%
select(Month, target_vars) %>%
gather(key, value, -Month) %>%
group_by(Month) %>%
summarise(n=length(unique(key)),
Sum=sum(value, na.rm = T)) %>%
mutate(Average=Sum/n)
# A tibble: 5 x 4
Month n Sum Average
<int> <int> <int> <dbl>
1 5 3 7541 2513.667
2 6 3 8343 2781.000
3 7 3 10849 3616.333
4 8 3 8974 2991.333
5 9 3 8242 2747.333
The idea is to convert the data from wide to long using tidyr::gather(), then group by Month and calculate the sum and the average.

This seems to deliver what you want. It's regular R. The sapply function keeps the months separated by "name". The sum function applied to each dataframe will not keep the column sums separate. (Correction # 2: used only target_vars):
sapply( split( airquality[target_vars], airquality$Month), sum, na.rm=TRUE)
5 6 7 8 9
7541 8343 10849 8974 8242
If you wanted the per number of variable results, then you would divide by the number of variables:
sapply( split( airquality[target_vars], airquality$Month), sum, na.rm=TRUE)/
(length(target_vars))
5 6 7 8 9
2513.667 2781.000 3616.333 2991.333 2747.333

Perhaps this is what you're looking for
library(dplyr)
library(purrr)
library(tidyr) # forgot this in original post
airquality %>%
group_by(Month) %>%
nest(Ozone, Temp, Solar.R, .key=newcol) %>%
mutate(newcol = map_dbl(newcol, ~mean(rowSums(.x, na.rm=TRUE))))
# A tibble: 5 x 2
# Month newcol
# <int> <dbl>
# 1 5 243.2581
# 2 6 278.1000
# 3 7 349.9677
# 4 8 289.4839
# 5 9 274.7333
I've never encountered a situation where all the answers disagreed. Here's some validation (at least I think) for the 5th month
airquality %>%
filter(Month == 5) %>%
select(Ozone, Temp, Solar.R) %>%
mutate(newcol = rowSums(., na.rm=TRUE)) %>%
summarise(sum5 = sum(newcol), mean5 = mean(newcol))
# sum5 mean5
# 1 7541 243.2581

Related

Cumulative sum of unique values based on multiple criteria

I've got a df with multiple columns containing information of species sightings over the years in different sites, therefore each year might show multiple records. I would like to filter my df and calculate some operations based on certain columns, but I'd like to keep all columns for further analyses. I had some previous code using summarise but as I would like to keep all columns I was trying to avoid using it.
Let's say the columns I'm interested to work with at the moment are as follows:
df <- data.frame("Country" = LETTERS[1:5], "Site"=LETTERS[6:10], "species"=1:5, "Year"=1981:2010)
I would like to calculate:
1- The cumulative sum of the records in which a species has been documented within each site creating a new column "Spsum".
2- The number of different years that each species has been seen on a particular site, this could be done as cumulative sum as well, on a new column "nYear".
For example, if species 1 has been recorded 5 times in 1981, and 2 times in 1982 in Site G, Spsum would show 7 (cumulative sum of records) whereas nYear would show 2 as it was spotted over two different years. So far I've got this, but nYear is displaying 0s as a result.
Df1 <- df %>%
filter(Year>1980)%>%
group_by(Country, Site, Species, Year) %>%
mutate(nYear = n_distinct(Year[Species %in% Site]))%>%
ungroup()
Thanks!
This cound help, without the need for a join.
df %>% arrange(Country, Site, species, Year) %>%
filter(Year>1980) %>%
group_by(Site, species) %>%
mutate(nYear = length(unique(Year))) %>%
mutate(spsum = rowid(species))
# A tibble: 30 x 6
# Groups: Site, species [5]
Country Site species Year nYear spsum
<chr> <chr> <int> <int> <int> <int>
1 A F 1 1981 6 1
2 A F 1 1986 6 2
3 A F 1 1991 6 3
4 A F 1 1996 6 4
5 A F 1 2001 6 5
6 A F 1 2006 6 6
7 B G 2 1982 6 1
8 B G 2 1987 6 2
9 B G 2 1992 6 3
10 B G 2 1997 6 4
# ... with 20 more rows
If the table contains multiple records per Country+Site+species+Year combination, I would first aggregate those and then calculate the cumulative counts from that. The counts can then be joined back to the original table.
Something along these lines:
cumulative_counts <- df %>%
count(Country, Site, species, Year) %>%
group_by(Country, Site, species) %>%
arrange(Year) %>%
mutate(Spsum = cumsum(n), nYear = row_number())
df %>%
left_join(cumulative_counts)

How to find the first occurrence of a negative value for each factor

I am working with weather data and trying to find the first time a temperature is negative for each winter season. I have a data frame with a column for the winter season (1,2,3,etc.), the temperature, and the ID.
I can get the first time the temperature is negative with this code:
FirstNegative <- min(which(df$temp<=0))
but it only returns the first value, and not one for each season.
I know I somehow need to group_by season, but how do I incorporate this?
For example,
season<-c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5)
temp<-c(2,-1,0,-1,3,-1,0,-1,2,-1,4,5,-1,-1,2)
ID<-c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15)
df <- cbind(season,temp,ID)
Ideally I want a table that looks like this from the above dummy code:
table
season id_firstnegative
[1,] 1 2
[2,] 2 4
[3,] 3 8
[4,] 4 10
[5,] 5 13
A base R option using subset and aggregate
aggregate(ID ~ season, subset(df, temp < 0), head, 1)
# season ID
#1 1 2
#2 2 4
#3 3 8
#4 4 10
#5 5 13
library(dplyr)
season<-c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5)
temp<-c(2,-1,0,-1,3,-1,0,-1,2,-1,4,5,-1,-1,2)
ID<-c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15)
df<-as.data.frame(cbind(season,temp,ID))
df %>%
dplyr::filter(temp < 0) %>%
group_by(season) %>%
dplyr::filter(row_number() == 1) %>%
ungroup()
As you said, I believe you could solve this by simply grouping season and examining the first index of IDs below zero within that grouping. However, the ordering of your data will be important, so ensure that each season has the correct ordering before using this possible solution.
library(dplyr)
library(tibble)
season<-c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5)
temp<-c(2,-1,0,-1,3,-1,0,-1,2,-1,4,5,-1,-1,2)
ID<-c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15)
df<- tibble(season,temp,ID)
df <- df %>%
group_by(season) %>%
mutate(firstNeg = ID[which(temp<0)][1]) %>%
distinct(season, firstNeg) # Combine only unique values of these columns for reduced output
This will provide output like:
# A tibble: 5 x 2
# Groups: season [5]
season firstNeg
<dbl> <dbl>
1 1 2
2 2 4
3 3 8
4 4 10
5 5 13

Remove first 10 and last 10 values

I have a file that contains multiple individuals and multiple values for the same individual.
I need to remove the first 10 and last 10 values of each individual, putting all the leftover values in a new table.
This is what my data kinda looks like:
Cow Data
NL123456 123
NL123456 456
I tried doing a for-loop, counting per individual how many values there were (but I think, I already got stuck there, because I am not using the right command I think? All variables in Cow are a factor).
I figured removing the first and last had to be something like this:
data1[c(11: n-10),]
If you know you always have more than 20 datapoints by cow you can do the following, illustrated on the iris dataset :
library(dplyr)
dim(iris)
# [1] 150 5
iris_trimmed <-
iris %>%
group_by(Species) %>%
slice(11:(n()-10)) %>%
ungroup()
dim(iris_trimmed)
# [1] 90 5
On your data :
res <-
your_data %>%
group_by(Cow) %>%
slice(11:(n()-10)) %>%
ungroup()
In base R you can do :
iris_trimmed <- do.call(
rbind,
lapply(split(iris, iris$Species),
function(x) head(tail(x,-10),-10)))
dim(iris_trimmed)
# [1] 90 5
Using data.table:
library(data.table)
idt <- as.data.table(iris)
idt[, .SD[11:(.N-10)], Species]
Same logic in base R:
do.call(
rbind,
lapply(
split(iris, iris[["Species"]]),
function(x) x[11:(nrow(x)-10), ]
)
)
Here a solution with dplyr.
In my example I cut only the first and last values. (you can adapt it by changing 2 with any number in filter).
The idea is to add after you group_by id the number of row per each observation starting from the top (n) and in reverse from the bottom (n1), then you simply filter out.
library(dplyr)
data %>%
group_by(id) %>%
mutate(n=1:n(),
n1 = n():1) %>% # n and n1 are the row numbers
filter(n >= 2,n1 >= 2) %>% # change 2 with 10, or whatever
# filter() keeps only the rows that you want
select(-n, -n1) %>%
ungroup()
# # A tibble: 4 x 2
# id value
# <dbl> <int>
# 1 1 6
# 2 1 8
# 3 2 1
# 4 2 2
Data:
set.seed(123)
data <- data.frame(id = c(rep(1,4), rep(2,4)), value=sample(8))
data
# id value
# 1 1 3
# 2 1 6
# 3 1 8
# 4 1 5
# 5 2 4
# 6 2 1
# 7 2 2
# 8 2 7

R create sequence in dplyr by id beginning with zero

I'm looking for the cleanest way to create a sequence beginning with zero by id in a dataframe.
df <- data.frame (id=rep(1:10,each=10))
If I wanted to start sequence at 1 the following would do:
library(dplyr)
df<-df %>% group_by(id) %>%
mutate(start = 1:n()) %>%
ungroup()
but starting at 0 doesn't work because it creates an extra number (0-10 compared to 1-10) so I need to add an extra row, is there a way to do this all in one step, perhaps using dplyr? There is obviously a number of work arounds such as creating another dataset and appending it to the original.
df1 <- data.frame (id=1:10,
start=0)
new<-rbind(df,df1)
That just seems a bit awkward and not that tidy. I know you can use rbind in dplyr but not sure how to incorporate everything in one step especially if you had other non-timing varying variables you just wanted to copy over into the new row. Interested to see suggestions, thanks.
You could use complete() from the tidyverse:
library(tidyverse)
df %>%
group_by(id) %>%
mutate(start = 1:n()) %>%
complete(start = c(0:10)) %>%
ungroup()
Which yields
# A tibble: 110 x 2
id start
<int> <int>
1 1 0
2 1 1
3 1 2
4 1 3
5 1 4
6 1 5
7 1 6
8 1 7
9 1 8
10 1 9
# ... with 100 more rows

tidyr::gather() %>% mutate() %>% spread() returns NA's unexpectedly

My ultimate goal is to do a series of chisq.test's on this data, comparing the values of 'dealer','store' and 'transport' by 'gender'. I'm using spread and gather to create a column of 'female' and one for 'males' then planned to use group_by & map to run the chisq.test by group of 'key', which is created in my gather argument. I'm doing something wrong because I'm getting grouped NA's back.
The code below produces my dilemma.
set.seed(123)
df_ <- data_frame(gender = sample(c('male','female'),100,T),
dealer = sample(1:5,100,T),
store = sample(1:5,100,T),
transport = sample(1:5,100,T))
df_ %>%
gather(key,value,-gender) %>%
mutate(id = 1:nrow(.)) %>%
spread(gender,value)
Here is a data_frame of my desired outcome.
data_frame(key = sample(c('dealer','store','transport'),50,T),
male = sample(1:5,50,T),
female = sample(1:5,50,T))
You need to group_by(gender) before adding your id and spreading, i.e.
library(tidyverse)
df_ %>%
gather(key, value, -gender) %>%
group_by(gender) %>%
mutate(id = row_number()) %>%
spread(gender, value)
NOTE Substituting row_number() with 1:nrow(.) will fail because of the grouping. This is because it takes the sequence of the whole data frame (rather than a sequence for each group) and tries to assign it to each group. Hence the error you get with the length
Error in mutate_impl(.data, dots) :
Column id must be length 156 (the group size) or one, not 300
If you do say ... %>%mutate(id = 1:length(key)) It will be fine
The result in both (row_number and 1:length(key)) is,
# A tibble: 168 x 4
key id female male
* <chr> <int> <int> <int>
1 dealer 1 3 4
2 dealer 2 3 2
3 dealer 3 1 4
4 dealer 4 5 3
5 dealer 5 4 4
6 dealer 6 5 2
7 dealer 7 3 3
8 dealer 8 1 2
9 dealer 9 2 5
10 dealer 10 2 2
# ... with 158 more rows
#elliot while #Sotos has given a great answer to the challenge you were having with the tidyverse, I'm a bit confused by why you're going through all that extra effort. Your ultimate goal as stated was to run chisq.test for gender against each of the others (dealer, store & transport). Your original dataset doesn't need any modification to do that!
require(tidyverse)
set.seed(123)
yourdata <- data_frame(gender = sample(c('male','female'),100,T),
dealer = sample(1:5,100,T),
store = sample(1:5,100,T),
transport = sample(1:5,100,T))
yourdata
# A tibble: 100 x 4
gender dealer store transport
<chr> <int> <int> <int>
1 female 2 2 5
2 male 2 4 2
3 female 2 2 1
Can be used exactly as it stands! You may have other reasons to want to change the data but it is tidy as it is representing one case or person per row.
Edited (January 16th) To achieve your stated ultimate goal you just have to:
require(dplyr)
require(broom)
allofthem <- lapply(yourdata[-1], function(y) tidy(chisq.test(x = yourdata$gender, y = y )))
allofthem <- bind_rows(allofthem, .id = "dependentv")
allofthem
You may also want to look at the lsr package which will do Chi-square independence (association tests) and provide a much more informative output. Also note that from a statistical perspective you are running very many tests and should correct your confidence appropriately... see for example http://rpubs.com/ibecav/290361

Resources