I've got a df with multiple columns containing information on species sightings over the years at different sites, so each year might show multiple records. I would like to filter my df and calculate some operations based on certain columns, but I'd like to keep all columns for further analyses. I had some previous code using summarise, but since I would like to keep all columns I was trying to avoid it.
Let's say the columns I'm interested to work with at the moment are as follows:
df <- data.frame("Country" = LETTERS[1:5], "Site"=LETTERS[6:10], "species"=1:5, "Year"=1981:2010)
I would like to calculate:
1- The cumulative sum of the records in which a species has been documented within each site, in a new column "Spsum".
2- The number of different years in which each species has been seen at a particular site; this could be done as a cumulative sum as well, in a new column "nYear".
For example, if species 1 was recorded 5 times in 1981 and 2 times in 1982 at Site G, Spsum would show 7 (the cumulative sum of records) whereas nYear would show 2, as it was spotted in two different years. So far I've got this, but nYear is displaying 0s as a result.
Df1 <- df %>%
  filter(Year > 1980) %>%
  group_by(Country, Site, Species, Year) %>%
  mutate(nYear = n_distinct(Year[Species %in% Site])) %>%
  ungroup()
Thanks!
This could help, without the need for a join:
library(dplyr)
library(data.table)  # for rowid()

df %>%
  arrange(Country, Site, species, Year) %>%
  filter(Year > 1980) %>%
  group_by(Site, species) %>%
  mutate(nYear = length(unique(Year))) %>%  # number of distinct years per site/species
  mutate(spsum = rowid(species))            # running count of records within the group
# A tibble: 30 x 6
# Groups: Site, species [5]
Country Site species Year nYear spsum
<chr> <chr> <int> <int> <int> <int>
1 A F 1 1981 6 1
2 A F 1 1986 6 2
3 A F 1 1991 6 3
4 A F 1 1996 6 4
5 A F 1 2001 6 5
6 A F 1 2006 6 6
7 B G 2 1982 6 1
8 B G 2 1987 6 2
9 B G 2 1992 6 3
10 B G 2 1997 6 4
# ... with 20 more rows
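If nYear should instead grow row by row (the question allows a cumulative count of distinct years), one possible variant is to count each year the first time it appears within the group; a sketch:

library(dplyr)

df %>%
  arrange(Country, Site, species, Year) %>%
  filter(Year > 1980) %>%
  group_by(Site, species) %>%
  mutate(nYear = cumsum(!duplicated(Year)),  # running count of distinct years
         spsum = row_number()) %>%           # running count of records
  ungroup()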
If the table contains multiple records per Country+Site+species+Year combination, I would first aggregate those and then calculate the cumulative counts from that. The counts can then be joined back to the original table.
Something along these lines:
cumulative_counts <- df %>%
  count(Country, Site, species, Year) %>%
  group_by(Country, Site, species) %>%
  arrange(Year) %>%
  mutate(Spsum = cumsum(n), nYear = row_number())

df %>%
  left_join(cumulative_counts)
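By default left_join() matches on every shared column and prints a joining message; if you prefer the keys to be explicit, a sketch (assuming the column names above):

df %>%
  left_join(cumulative_counts,
            by = c("Country", "Site", "species", "Year"))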
I'm preparing data for a Cox regression model and I have a dataset that shows all of the years that participants were registered as living in the province. There is a variable that identifies how many days they were registered as living in the province for each year. I want their start year to be the first year that they were fully registered (>=365 days) as living in the province. I also want the last year that they were fully registered as living in the province. However, some participants left the province and then returned later for at least one full year. For this analysis, I want participants' follow-up to end when they leave the first time, as we can't track health outcomes that may have occurred while outside the province.
Imagine I have already sorted the dataset by ID, then year. I then removed any observations where there were less than 365 days registered.
Here is a test dataset:
df <- data.frame(
  ID = c(1,1,1,1,1,2,2,2,2,2,2,3,3,3,3),
  values = c(1996,1998,1999,2000,2001,2001,2002,2003,2004,2007,2008,2004,2005,2006,2007)
)
df_inc <- df %>%
  group_by(ID) %>%
  filter(row_number(values) == 1)

This works as intended, returning the first fully registered year per participant.
df_lastoverall <- df %>%
  group_by(ID) %>%
  filter(row_number(values) == n())
This works, but returns the last fully registered year, regardless of whether their years were all consecutive, or they left the province then returned to have at least one full year. This gives a last year of 2001 for ID1, 2008 for ID2, and 2007 for ID3.
Here's where I'm at and could use some help... I'm looking for some way to identify the last full year after a consecutive run from their start year (just in case there are people that left and returned more than once). This should return a last year of 1996 for ID1, 2004 for ID2, and 2007 for ID3.
Something like this, perhaps?
df_last <- df %>%
  group_by(ID) %>%
  filter(row_number(values)[cumsum(c(1, diff(values) != 1))])

# OR

df_last <- df %>%
  group_by(ID) %>%
  filter(row_number(values) == max(values[cumsum(c(1, diff(values) != 1))]))
You can leverage data.table::rleid() as follows:
library(dplyr)

group_by(df, ID) %>%
  filter(data.table::rleid(c(1, diff(values))) == 1)  # keep only the first run of consecutive years
Output:
ID values
<dbl> <dbl>
1 1 1996
2 2 2001
3 2 2002
4 2 2003
5 2 2004
6 3 2004
7 3 2005
8 3 2006
9 3 2007
If you wanted only the last year of each group, you can add a second filter at the end:
group_by(df, ID) %>%
  filter(data.table::rleid(c(1, diff(values))) == 1) %>%
  filter(row_number() == n())
Output:
ID values
<dbl> <dbl>
1 1 1996
2 2 2004
3 3 2007
You could use a tidyverse approach:
library(dplyr)
library(tidyr)

df_first <- df %>%
  group_by(ID) %>%
  filter(cumsum(c(1, diff(values)) - 1) == 0) %>%  # keep only the initial consecutive run
  slice_min(values) %>%
  ungroup()

df_last <- df %>%
  group_by(ID) %>%
  filter(cumsum(c(1, diff(values)) - 1) == 0) %>%
  slice_max(values) %>%
  ungroup()
This returns
#> df_first
# A tibble: 3 × 2
ID values
<dbl> <dbl>
1 1 1996
2 2 2001
3 3 2004
and
#> df_last
# A tibble: 3 × 2
ID values
<dbl> <dbl>
1 1 1996
2 2 2004
3 3 2007
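Both endpoints can also come out of a single pass; a sketch using the same consecutive-run filter:

df %>%
  group_by(ID) %>%
  filter(cumsum(c(1, diff(values)) - 1) == 0) %>%  # keep the initial consecutive run
  summarise(first_year = min(values), last_year = max(values))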
I am working with weather data and trying to find the first time a temperature is negative for each winter season. I have a data frame with a column for the winter season (1,2,3,etc.), the temperature, and the ID.
I can get the first time the temperature is negative with this code:
FirstNegative <- min(which(df$temp<=0))
but it only returns the first value, and not one for each season.
I know I somehow need to group_by season, but how do I incorporate this?
For example,
season <- c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5)
temp <- c(2,-1,0,-1,3,-1,0,-1,2,-1,4,5,-1,-1,2)
ID <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15)
df <- cbind(season, temp, ID)  # note: cbind() returns a matrix here, not a data frame
Ideally I want a table that looks like this from the above dummy code:
table
season id_firstnegative
[1,] 1 2
[2,] 2 4
[3,] 3 8
[4,] 4 10
[5,] 5 13
A base R option using subset and aggregate
aggregate(ID ~ season, subset(df, temp < 0), head, 1)
# season ID
#1 1 2
#2 2 4
#3 3 8
#4 4 10
#5 5 13
library(dplyr)

season <- c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5)
temp <- c(2,-1,0,-1,3,-1,0,-1,2,-1,4,5,-1,-1,2)
ID <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15)
df <- as.data.frame(cbind(season, temp, ID))

df %>%
  dplyr::filter(temp < 0) %>%
  group_by(season) %>%
  dplyr::filter(row_number() == 1) %>%
  ungroup()
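Equivalently, slice(1) can replace the second filter; a minor variant of the same idea, assuming the data are ordered by ID within season:

df %>%
  dplyr::filter(temp < 0) %>%
  group_by(season) %>%
  slice(1) %>%  # first negative-temperature row per season
  ungroup()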
As you said, I believe you could solve this by simply grouping by season and taking the first index of IDs below zero within that grouping. However, the ordering of your data is important, so ensure that each season is ordered correctly before using this possible solution.
library(dplyr)
library(tibble)

season <- c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5)
temp <- c(2,-1,0,-1,3,-1,0,-1,2,-1,4,5,-1,-1,2)
ID <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15)
df <- tibble(season, temp, ID)

df <- df %>%
  group_by(season) %>%
  mutate(firstNeg = ID[which(temp < 0)][1]) %>%
  distinct(season, firstNeg)  # keep only unique season/firstNeg pairs for reduced output
This will provide output like:
# A tibble: 5 x 2
# Groups: season [5]
season firstNeg
<dbl> <dbl>
1 1 2
2 2 4
3 3 8
4 4 10
5 5 13
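A summarise() version reaches the same table a little more directly (same ordering assumption as above); a sketch:

df %>%
  group_by(season) %>%
  summarise(firstNeg = ID[which(temp < 0)[1]])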
My ultimate goal is to run a series of chisq.test()s on this data, comparing the values of 'dealer', 'store' and 'transport' by 'gender'. I'm using spread and gather to create a column of 'female' and one of 'male', then planned to use group_by and map to run chisq.test by group of 'key', which is created in my gather call. I'm doing something wrong, because I'm getting grouped NAs back.
The code below produces my dilemma.
set.seed(123)
df_ <- data_frame(gender = sample(c('male','female'), 100, T),
                  dealer = sample(1:5, 100, T),
                  store = sample(1:5, 100, T),
                  transport = sample(1:5, 100, T))
df_ %>%
  gather(key, value, -gender) %>%
  mutate(id = 1:nrow(.)) %>%
  spread(gender, value)
Here is a data_frame of my desired outcome.
data_frame(key = sample(c('dealer','store','transport'), 50, T),
           male = sample(1:5, 50, T),
           female = sample(1:5, 50, T))
You need to group_by(gender) before adding your id and spreading, i.e.
library(tidyverse)

df_ %>%
  gather(key, value, -gender) %>%
  group_by(gender) %>%
  mutate(id = row_number()) %>%
  spread(gender, value)
NOTE: Substituting row_number() with 1:nrow(.) will fail because of the grouping. This is because it takes the sequence of the whole data frame (rather than a sequence for each group) and tries to assign it to each group, hence the length error:
Error in mutate_impl(.data, dots) :
Column id must be length 156 (the group size) or one, not 300
If you instead use ... %>% mutate(id = 1:length(key)), it will be fine. The result in both cases (row_number() and 1:length(key)) is:
# A tibble: 168 x 4
key id female male
* <chr> <int> <int> <int>
1 dealer 1 3 4
2 dealer 2 3 2
3 dealer 3 1 4
4 dealer 4 5 3
5 dealer 5 4 4
6 dealer 6 5 2
7 dealer 7 3 3
8 dealer 8 1 2
9 dealer 9 2 5
10 dealer 10 2 2
# ... with 158 more rows
@elliot, while @Sotos has given a great answer to the challenge you were having with the tidyverse, I'm a bit confused about why you're going through all that extra effort. Your ultimate goal as stated was to run chisq.test for gender against each of the others (dealer, store & transport). Your original dataset doesn't need any modification to do that!
require(tidyverse)
set.seed(123)
yourdata <- data_frame(gender = sample(c('male','female'), 100, T),
                       dealer = sample(1:5, 100, T),
                       store = sample(1:5, 100, T),
                       transport = sample(1:5, 100, T))
yourdata
# A tibble: 100 x 4
gender dealer store transport
<chr> <int> <int> <int>
1 female 2 2 5
2 male 2 4 2
3 female 2 2 1
It can be used exactly as it stands! You may have other reasons to want to change the data, but it is tidy as it is, representing one case or person per row.
Edited (January 16th): to achieve your stated ultimate goal, you just have to:
require(dplyr)
require(broom)
allofthem <- lapply(yourdata[-1], function(y) tidy(chisq.test(x = yourdata$gender, y = y)))
allofthem <- bind_rows(allofthem, .id = "dependentv")
allofthem
You may also want to look at the lsr package which will do Chi-square independence (association tests) and provide a much more informative output. Also note that from a statistical perspective you are running very many tests and should correct your confidence appropriately... see for example http://rpubs.com/ibecav/290361
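For instance, a minimal sketch of such a correction on the combined results (the "holm" method is just one common choice):

allofthem <- allofthem %>%
  mutate(p.adjusted = p.adjust(p.value, method = "holm"))  # adjust p-values for multiple tests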
I've searched a number of posts on SO but I'm not sure what I'm doing wrong here, and I imagine the solution is quite simple. I'm trying to group a dataframe by one variable and compute the mean of several variables within that group.
Here is what I am trying:
head(airquality)
target_vars <- c("Ozone", "Temp", "Solar.R")
airquality %>% group_by(Month) %>% select(target_vars) %>% summarise(rowSums(.))
But I get the error that my lengths don't match. I've tried variations using mutate to create the column, or summarise_all, but neither of these seems to work. I need the row sums within group, and then to compute the mean within group (yes, it's nonsensical here).
Also, I want to use select because I'm trying to do this over just certain variables.
I'm sure this could be a duplicate, but I can't find the right one.
EDIT FOR CLARITY
Sorry, my original question was not clear. Imagine the grouping variable is the calendar month, and we have v1, v2, and v3. I'd like to know, within month, what was the average of the sums of v1, v2, and v3. So if we have 12 months, the result would be a 12x1 dataframe. Here is an example if we just had 1 month:
Month v1 v2 v3 Sum
1 1 1 0 2
1 1 1 1 3
1 1 1 1 3
Then the result would be:
Month Average
1 8/3
You can try:
library(tidyverse)
airquality %>%
  select(Month, target_vars) %>%
  gather(key, value, -Month) %>%
  group_by(Month) %>%
  summarise(n = length(unique(key)),
            Sum = sum(value, na.rm = TRUE)) %>%
  mutate(Average = Sum / n)
# A tibble: 5 x 4
Month n Sum Average
<int> <int> <int> <dbl>
1 5 3 7541 2513.667
2 6 3 8343 2781.000
3 7 3 10849 3616.333
4 8 3 8974 2991.333
5 9 3 8242 2747.333
The idea is to convert the data from wide to long using tidyr::gather(), then group by Month and calculate the sum and the average.
This seems to deliver what you want, in regular R. The sapply function keeps the months separated by name, while the sum function applied to each dataframe does not keep the column sums separate. (Correction #2: used only target_vars.)
sapply(split(airquality[target_vars], airquality$Month), sum, na.rm = TRUE)
5 6 7 8 9
7541 8343 10849 8974 8242
If you wanted the per number of variable results, then you would divide by the number of variables:
sapply(split(airquality[target_vars], airquality$Month), sum, na.rm = TRUE) /
  length(target_vars)
5 6 7 8 9
2513.667 2781.000 3616.333 2991.333 2747.333
Perhaps this is what you're looking for
library(dplyr)
library(purrr)
library(tidyr)  # forgot this in the original post

airquality %>%
  group_by(Month) %>%
  nest(Ozone, Temp, Solar.R, .key = newcol) %>%
  mutate(newcol = map_dbl(newcol, ~ mean(rowSums(.x, na.rm = TRUE))))
# A tibble: 5 x 2
# Month newcol
# <int> <dbl>
# 1 5 243.2581
# 2 6 278.1000
# 3 7 349.9677
# 4 8 289.4839
# 5 9 274.7333
I've never encountered a situation where all the answers disagreed. Here's some validation (at least I think) for the 5th month:
airquality %>%
  filter(Month == 5) %>%
  select(Ozone, Temp, Solar.R) %>%
  mutate(newcol = rowSums(., na.rm = TRUE)) %>%
  summarise(sum5 = sum(newcol), mean5 = mean(newcol))
# sum5 mean5
# 1 7541 243.2581
I'm trying to generate a species saturation curve for a camera trapping survey. I have thousands of observations and do most of my manipulations in dplyr.
I have three field sites, with observation records of different animal species from a number of weeks of trapping. In some weeks there are no animals; in other weeks there may be more than one species. I want to generate a separate figure for each site to compare how quickly new species are encountered over the sequential weeks of the study. These observations of new species should eventually saturate once the total species diversity has been captured in the area. Some field sites are likely to saturate faster than others.
The problem is that I have not come across a way of counting the number of distinct species to provide a running total by time. A simple dummy dataset is below.
field_site <- c(rep("A", 4), rep("B", 4), rep("C", 4))
week <- c(1,2,2,3,2,3,4,4,1,2,3,4)
animal <- c("dog","dog","cat","rabbit","dog","dog","dog","rabbit","cat","cat","rabbit","dog")
df <- as.data.frame(cbind(field_site, week, animal))  # cbind() coerces everything, so all columns end up as factors
I can easily generate the number of unique species within each week grouping, e.g.
tbl_df(df) %>%
  group_by(field_site, week) %>%
  summarise(no_of_sp = n_distinct(animal))
But this is not sensitive to the fact that some species are encountered again in subsequent weeks. What I really need is a running count of the different species that counts the unique species per site from week 1 going down through the rows, assuming that the data is sorted by increasing time from the start of the survey.
The cumulative total of species encountered over the course of the study by week in the example for field Site A would be: week 1 = 1 species, week 2 = 2 species, week 3 = 3 species, week 4 = still 3 species.
For site B the cumulative total of species would be: week 1 = 0 species, week 2 = 1 species, week 3 = 1 species, week 4 = 1 species, etc...
Any advice would be greatly appreciated.
cheers in advance!
I'm making two assumptions:
Site B, week 4 = 2 species, both "dog" and "rabbit"; and
All sites share the same weeks, so if at least one site has week 4, then all sites should include it. This only drives the mt (empty) variable; feel free to update it.
I first suggest an "empty" data.frame to ensure sites have the requisite week numbers populated:
mt <- expand.grid(field_site = unique(df$field_site),
                  week = unique(df$week))
The use of tidyr helps:
library(dplyr)
library(tidyr)

df %>%
  mutate(fake = TRUE) %>%
  # ensure all species are "represented" on each row
  spread(animal, fake) %>%
  # ensure all weeks are shown, even if no species
  full_join(mt, by = c("field_site", "week")) %>%
  # ensure the presence of a species persists at a site
  arrange(week) %>%
  group_by(field_site) %>%
  mutate_if(is.logical, funs(cummax(!is.na(.)))) %>%
  ungroup() %>%
  # helps to contain a variable number of species columns in one place
  nest(-field_site, -week, .key = "species") %>%
  group_by(field_site, week) %>%
  # could also use purrr::map in place of sapply
  mutate(n = sapply(species, sum)) %>%
  ungroup() %>%
  select(-species) %>%
  arrange(field_site, week)
# # A tibble: 12 × 3
# field_site week n
# <fctr> <fctr> <int>
# 1 A 1 1
# 2 A 2 2
# 3 A 3 3
# 4 A 4 3
# 5 B 1 0
# 6 B 2 1
# 7 B 3 1
# 8 B 4 2
# 9 C 1 1
# 10 C 2 1
# 11 C 3 2
# 12 C 4 3
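A shorter route to the same running totals, if you prefer to skip the spread/nest machinery: count each species the first time it appears at a site, then take a cumulative sum per site. A sketch, assuming tidyr::complete() is available to pad the weeks with no sightings:

library(dplyr)
library(tidyr)

df %>%
  mutate(week = as.integer(as.character(week))) %>%  # week was coerced to a factor by cbind()
  arrange(field_site, week) %>%
  group_by(field_site) %>%
  mutate(new_sp = !duplicated(animal)) %>%  # TRUE the first time a species appears at a site
  group_by(field_site, week) %>%
  summarise(n_new = sum(new_sp)) %>%
  ungroup() %>%
  complete(field_site, week, fill = list(n_new = 0)) %>%  # pad weeks with no sightings
  group_by(field_site) %>%
  mutate(n = cumsum(n_new)) %>%
  ungroup()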