Display in data.frame a conditional row count by group - r

I am struggling with creating a new variable in my data.frame. I apologize for the question title, which might not be very clear. I have a data set that looks like this:
obs year type
1 2015 A
2 2015 A
3 2015 B
4 2014 A
5 2014 B
I want to add to the current data.frame a column (freq2015) that gives the number of rows by type for 2015, and I want that count reported on every row of the same type, whatever the row's own year. Here is the output I am looking for:
obs year type freq2015
1 2015 A 2 (there are 2 obs. of type A in 2015)
2 2015 A 2 (there are 2 obs. of type A in 2015)
3 2015 B 1 (there is 1 obs. of type B in 2015)
4 2014 A 2 (there are 2 obs. of type A in 2015)
5 2014 B 1 (there is 1 obs. of type B in 2015)
I know how to add to my data.frame the number of rows by type by year using dplyr:
data <- data %>%
group_by(year, type) %>%
mutate(freq = n())
But then, for year == 2014, the added column will display the count of 2014 rows by type instead of that of 2015.
I know how to isolate into a new data.frame the number of rows by type for 2015:
data2015 <- data[data$year == 2015, ] %>%
group_by(type) %>%
mutate(freq2015 = n())
But I don't know how to add a column (with the count of rows by type for 2015) to the entire data.frame, matched on type (as shown in the example). I am looking for a solution that does not require me to spell out the levels of the "type" variable. That is, I don't want code telling R: do this if type == A, do that otherwise. The reason for this restriction is that I have far too many types.
Any ideas? Thank you in advance.

If you group_by using only type, you can sum the rows when year == 2015.
data %>%
group_by(type) %>%
mutate(freq2015 = sum(year == 2015))
Source: local data frame [5 x 4]
Groups: type [2]
obs year type freq2015
<int> <int> <fctr> <int>
1 1 2015 A 2
2 2 2015 A 2
3 3 2015 B 1
4 4 2014 A 2
5 5 2014 B 1
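A minimal, self-contained sketch of the grouped sum above, assuming the sample data from the question is rebuilt by hand (the name `result` is only for illustration). `year == 2015` is a logical vector, so `sum()` counts the 2015 rows within each type, and `mutate()` repeats that count on every row of the group:

```r
library(dplyr)

# Sample data reconstructed from the question.
data <- data.frame(
  obs  = 1:5,
  year = c(2015, 2015, 2015, 2014, 2014),
  type = c("A", "A", "B", "A", "B")
)

# TRUE counts as 1 and FALSE as 0, so the sum is the number of 2015 rows per type.
result <- data %>%
  group_by(type) %>%
  mutate(freq2015 = sum(year == 2015)) %>%
  ungroup()

result$freq2015
# [1] 2 2 1 2 1
```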

Using data.table, we could do:
library(data.table)
setDT(df)
setkey(df, type)
# count the 2015 rows per type, then join the counts back onto df by its key
df[df[year == 2015, .(freq2015 = .N), by = type]]
Result:
obs year type freq2015
1: 1 2015 A 2
2: 2 2015 A 2
3: 4 2014 A 2
4: 3 2015 B 1
5: 5 2014 B 1
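One thing to watch with the join above: a type that has no 2015 rows does not appear in the inner table, so its rows drop out of the result. A hedged alternative sketch, assuming data.table is installed and using a made-up type "C" that only occurs in 2014, adds the column by reference instead, which keeps every row in its original order:

```r
library(data.table)

df <- data.table(
  obs  = 1:6,
  year = c(2015, 2015, 2015, 2014, 2014, 2014),
  type = c("A", "A", "B", "A", "B", "C")  # "C" is hypothetical: no 2015 rows
)

# Grouped assignment by reference: no join, no reordering, no dropped rows.
df[, freq2015 := sum(year == 2015), by = type]

df$freq2015
# [1] 2 2 1 2 1 0
```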

You could use a left_join(), as follows:
temp <- data %>%
filter(year == 2015) %>%
group_by(type) %>%
summarize(freq2015 = n())
data <- data %>% left_join(temp, by = "type")

We can do this with base R using ave (without any external packages) and it is reasonably fast as well.
df1$freq2015 <- with(df1, ave(year == 2015, type, FUN = sum))
df1$freq2015
#[1] 2 2 1 2 1

Related

R: Delete Rows After First "Break" Occurs

I am working with the R programming language.
I have the following dataset:
library(dplyr)
my_data = data.frame(id = c(1,1,1,1,1,1, 2,2,2) , year = c(2010, 2011, 2012, 2013, 2015, 2016, 2015, 2016, 2020), var = c(1,7,3,9,5,6, 88, 12, 5))
> my_data
id year var
1 1 2010 1
2 1 2011 7
3 1 2012 3
4 1 2013 9
5 1 2015 5
6 1 2016 6
7 2 2015 88
8 2 2016 12
9 2 2020 5
My Question: For each ID - I want to find out when the first "non-consecutive" year occurs, and then delete all remaining rows.
For example:
When ID = 1, the first "jump" occurs at 2013 (i.e. there is no 2014). Therefore, I would like to delete all rows after 2013.
When ID = 2, the first "jump" occurs at 2016 - therefore, I would like to delete all rows after 2016.
This was my attempt to write the code for this problem:
final = my_data %>%
group_by(id) %>%
mutate(break_index = which(diff(year) > 1)[1]) %>%
group_by(id, add = TRUE) %>%
slice(1:break_index)
The code appears to be working - but I get the following warning messages which are concerning me:
Warning messages:
1: In 1:break_index :
numerical expression has 6 elements: only the first used
2: In 1:break_index :
numerical expression has 3 elements: only the first used
Can someone please tell me if I have done this correctly?
Thanks!
You get the warning because break_index is a column, so within each group it holds one value per row. All of those values are identical within a group, so 1:break_index uses only the first one and your attempt works. If you want to avoid the warning, select a single value of break_index explicitly: try slice(1:break_index[1]) or slice(1:first(break_index)).
Here is another way to handle this.
library(dplyr)
my_data %>%
group_by(id) %>%
filter(row_number() <= which(diff(year) > 1)[1])
# id year var
# <dbl> <dbl> <dbl>
#1 1 2010 1
#2 1 2011 7
#3 1 2012 3
#4 1 2013 9
#5 2 2015 88
#6 2 2016 12
With dplyr 1.1.0 or later, we can use temporary grouping with .by:
my_data %>%
filter(row_number() <= which(diff(year) > 1)[1], .by = id)
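Both versions above rely on `which(...)[1]`, which is NA for an id whose years never jump, so all rows of such an id would be filtered out. A possible variant, sketched here with a made-up id 3 that has no gap and assuming the rows are already sorted by year within each id, flags each jump and keeps rows only while no jump has been seen:

```r
library(dplyr)

my_data <- data.frame(
  id   = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3),  # id 3 is hypothetical
  year = c(2010, 2011, 2012, 2013, 2015, 2016, 2015, 2016, 2020, 2001, 2002),
  var  = c(1, 7, 3, 9, 5, 6, 88, 12, 5, 4, 8)
)

final <- my_data %>%
  group_by(id) %>%
  # diff(year) > 1 marks a row that follows a gap; the leading FALSE covers the
  # first row. cumsum() stays at 0 until the first gap, so those rows are kept.
  filter(cumsum(c(FALSE, diff(year) > 1)) == 0) %>%
  ungroup()

final$year
# [1] 2010 2011 2012 2013 2015 2016 2001 2002
```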

I have three datasets of libraries for the past 3 years. I want to take those datasets to make a new dataframe

I have three datasets of Ontario libraries for the past 3 years. The datasets contain various information about the libraries: their address, city, card holders, etc. I combined all of them into one new dataset called data_combined, like so:
data_2017<- read.csv("Downloads/2017.csv")
data_2016<- read.csv("Downloads/2016.csv")
data_2015<- read.csv("Downloads/2015.csv")
common_columns <- Reduce(intersect, list(colnames(data_2017), colnames(data_2016),colnames(data_2015)))
data_combined <- rbind(
subset(data_2017, select = common_columns),
subset(data_2016, select = common_columns),
subset(data_2015, select = common_columns)
)
write.csv(data_combined, "Downloads.csv")
What I need help with is writing code that creates a single dataset from which I can output a table listing the number of libraries in each city for the last 3 years. In Excel I would use the COUNT function to see how many libraries each city has and create a new table; I need the equivalent in R. The new table should have the city names as row headers, and the columns should be the number of libraries for each year: 2015, 2016 and 2017.
I want to make a new dataframe like this, except that instead of 1999, 2000 and 2001 I want the columns to say 2015, 2016 and 2017.
The datasets for 2015, 2016 and 2017 can be found here; only use 2015, 2016 and 2017.
thanks
This sounds like "Calculate the mean by group" for summarizing by group, followed by "Reshape multiple value columns to wide format" for pivoting from long to wide. However, it is complicated by the fact that some of the data contain commas, so those columns are read as character instead of numeric, and rbind-ing them will be problematic. Here's a pipe that should take care of all of that.
I've downloaded those three files to my ~/Downloads/ directory, then:
library(dplyr)
alldat <- lapply(grep("ontario", list.files("~/Downloads/", full.names=TRUE), value = TRUE), read.csv)
common_columns <- Reduce(intersect, lapply(alldat, names))
data_combined <- alldat %>%
lapply(function(dat) as.data.frame(
lapply(dat, function(z) if (all(grepl("^[0-9.,]*$", z))) type.convert(gsub(",", "", z), as.is = TRUE) else z)
)) %>%
lapply(subset, select = common_columns) %>%
bind_rows() %>%
tibble() %>%
count(City = A1.10.City.Town, Year = Survey.Year.From) %>%
tidyr::pivot_wider(id_cols = City, names_from = Year, values_from = n)
data_combined
# # A tibble: 336 x 4
# City `2015` `2016` `2017`
# <chr> <int> <int> <int>
# 1 Addison 1 1 1
# 2 Ajax 1 1 1
# 3 Alderville 1 1 1
# 4 Algoma Mills 1 1 1
# 5 Alliston 2 2 2
# 6 Almonte 1 1 1
# 7 Amaranth 1 1 1
# 8 Angus 1 1 1
# 9 Apsley 1 1 1
# 10 Arnprior 2 2 2
# # ... with 326 more rows
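The core of that pipe is the count-then-pivot step, which plays the role of Excel's count. A small sketch on made-up data (the `libraries` data frame, its `City` and `Year` columns, and the values are all illustrative, not taken from the real files) shows the shape of the result; `values_fill = 0` is an optional choice so that a city with no libraries in a year shows 0 rather than NA:

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the combined library data: one row per library per year.
libraries <- data.frame(
  City = c("Ajax", "Ajax", "Alliston", "Alliston", "Alliston", "Ajax"),
  Year = c(2015, 2016, 2015, 2015, 2016, 2017)
)

city_counts <- libraries %>%
  count(City, Year) %>%                      # rows per City/Year pair
  pivot_wider(id_cols = City,                # one row per city
              names_from = Year,             # one column per year
              values_from = n,
              values_fill = 0)               # missing pairs become 0

city_counts
```

With this toy input, Ajax has one library in each year, while Alliston has two in 2015, one in 2016 and none in 2017.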

Cumulative sum of unique values based on multiple criteria

I've got a df with multiple columns containing information of species sightings over the years in different sites, therefore each year might show multiple records. I would like to filter my df and calculate some operations based on certain columns, but I'd like to keep all columns for further analyses. I had some previous code using summarise but as I would like to keep all columns I was trying to avoid using it.
Let's say the columns I'm interested to work with at the moment are as follows:
df <- data.frame("Country" = LETTERS[1:5], "Site"=LETTERS[6:10], "species"=1:5, "Year"=1981:2010)
I would like to calculate:
1- The cumulative sum of the records in which a species has been documented within each site creating a new column "Spsum".
2- The number of different years that each species has been seen on a particular site, this could be done as cumulative sum as well, on a new column "nYear".
For example, if species 1 has been recorded 5 times in 1981, and 2 times in 1982 in Site G, Spsum would show 7 (cumulative sum of records) whereas nYear would show 2 as it was spotted over two different years. So far I've got this, but nYear is displaying 0s as a result.
Df1 <- df %>%
filter(Year>1980)%>%
group_by(Country, Site, Species, Year) %>%
mutate(nYear = n_distinct(Year[Species %in% Site]))%>%
ungroup()
Thanks!
This could help, without the need for a join. Note that rowid() comes from data.table.
df %>% arrange(Country, Site, species, Year) %>%
filter(Year > 1980) %>%
group_by(Site, species) %>%
mutate(nYear = length(unique(Year))) %>%
mutate(spsum = data.table::rowid(species))
# A tibble: 30 x 6
# Groups: Site, species [5]
Country Site species Year nYear spsum
<chr> <chr> <int> <int> <int> <int>
1 A F 1 1981 6 1
2 A F 1 1986 6 2
3 A F 1 1991 6 3
4 A F 1 1996 6 4
5 A F 1 2001 6 5
6 A F 1 2006 6 6
7 B G 2 1982 6 1
8 B G 2 1987 6 2
9 B G 2 1992 6 3
10 B G 2 1997 6 4
# ... with 20 more rows
If the table contains multiple records per Country+Site+species+Year combination, I would first aggregate those and then calculate the cumulative counts from that. The counts can then be joined back to the original table.
Something along these lines:
cumulative_counts <- df %>%
count(Country, Site, species, Year) %>%
group_by(Country, Site, species) %>%
arrange(Year) %>%
mutate(Spsum = cumsum(n), nYear = row_number())
df %>%
left_join(cumulative_counts, by = c("Country", "Site", "species", "Year"))
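To check the aggregate-then-join idea against the example in the question (species 1 recorded 5 times in 1981 and 2 times in 1982 in Site G), here is a small sketch. The `sightings` data frame and the country value are made up for illustration, and `.by_group = TRUE` is added so the sort explicitly happens within each group:

```r
library(dplyr)

# Hypothetical records: 5 sightings in 1981 and 2 in 1982, same site and species.
sightings <- data.frame(
  Country = "B", Site = "G", species = 1,
  Year = c(rep(1981, 5), rep(1982, 2))
)

cumulative_counts <- sightings %>%
  count(Country, Site, species, Year) %>%     # records per year
  group_by(Country, Site, species) %>%
  arrange(Year, .by_group = TRUE) %>%
  mutate(Spsum = cumsum(n),                   # running total of records
         nYear = row_number()) %>%            # running count of distinct years
  ungroup()

result <- sightings %>%
  left_join(cumulative_counts, by = c("Country", "Site", "species", "Year"))

unique(result[result$Year == 1982, c("Spsum", "nYear")])
```

The 1982 rows end up with Spsum 7 and nYear 2, matching the expectation stated in the question, and all original rows and columns are kept.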

How to insert 0 or NA while reshaping data in R when the key variable has different length across a given group

Below is my data structure (df). The objective is to convert the data from long to wide, spreading by the variable "status" (df.long), using R. I am aware that statuses "a" and "b" do not appear in every month.
year month status n
2018 1 a 10
2018 1 b 2
2018 2 a 9
2018 3 a 13
2018 3 b 1
For this I use this code in R:
df.long <- df %>% spread(df, key = status, value = n)
This works, except that when a status is missing in a given month (e.g. status "b" for month 2 in the example above), the cells with no "status" are filled with character values such as c("2018", "2019"...).
The question is: how do I code this so that those cells contain NA or 0 when there is no status value in a given month?
You can use pivot_wider() instead of spread().
library(tidyverse)
df %>% pivot_wider(names_from = status, values_from = n)
# A tibble: 3 x 4
year month a b
<int> <int> <int> <int>
1 2018 1 10 2
2 2018 2 9 NA
3 2018 3 13 1
You can take a look at the documentation, which states:
pivot_wider() is an updated approach to spread(), designed to be both simpler to use and to handle more use cases. We recommend you use pivot_wider() for new code; spread() isn't going away but is no longer under active development.
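If 0 is wanted instead of NA, `pivot_wider()` has a `values_fill` argument. The sketch below rebuilds the sample data by hand (the name `df_wide` is illustrative). As a side note, the odd character cells in the original attempt probably come from `df` being passed twice: `df %>% spread(df, key = status, value = n)` expands to `spread(df, df, ...)`, and the second `df` most likely ends up as the `fill` argument. That reading is an assumption based on the argument order of `spread()`, not something confirmed in the thread.

```r
library(tidyr)

# Sample data reconstructed from the question; status "b" is absent in month 2.
df <- data.frame(
  year   = 2018,
  month  = c(1, 1, 2, 3, 3),
  status = c("a", "b", "a", "a", "b"),
  n      = c(10, 2, 9, 13, 1)
)

# values_fill supplies the value for year/month/status combinations with no row.
df_wide <- pivot_wider(df, names_from = status, values_from = n, values_fill = 0)

df_wide$b
# [1] 2 0 1
```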

R Conditional Summarizing [duplicate]

I have a column for company, one for sales and another for country. I need to sum the sales within each country separately, so that every company (name) row also shows the total sales in its country. The sales in all of the countries are expressed in the same currency.
I have tried several ways of doing so, but none of them work:
df$total_country_sales = if(df$country[row] == df$country) { sum(df$sales)}
This sums all valuations, not only the ones that I need.
Name Sales Country Total Country Sales (the new column I would like to have)
abc 122 US 5022
abc 100 Canada
aad 4900 US
I need to have the values in the same dataframe, but in a new column.
Since it is a large dataset, I cannot make a function to do so, but rather need to save it directly as a variable. (Or have I understood incorrectly that making functions is not the best way to solve such issues?)
I am new to R and programming in general, so I might be addressing the issue in an incorrect way.
Sorry for probably a stupid question.
Thanks!
If I understand your question correctly, this solves your problem:
df = data.frame(sales=c(1,3,2,4,5),region=c("A","A","B","B","B"))
library(dplyr)
totals = df %>% group_by(region) %>% summarize(total = sum(sales))
df = left_join(df, totals, by = "region")
It adds the group totals as a separate column, like this:
sales region total
1 1 A 4
2 3 A 4
3 2 B 11
4 4 B 11
5 5 B 11
Hope this helps.
We can use base R to do this
df$total_country_sales <- with(df, ave(sales, country, FUN = sum))
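Applied to the three sample rows from the question, the `ave()` approach can be sketched as follows. The data frame is rebuilt by hand here, and the column names `Name`, `Sales` and `Country` follow the question's table rather than the lowercase names used in the one-liner above:

```r
# Sample rows from the question.
df <- data.frame(
  Name    = c("abc", "abc", "aad"),
  Sales   = c(122, 100, 4900),
  Country = c("US", "Canada", "US")
)

# ave() applies sum() within each Country and returns a vector as long as the
# input, so the total lines up with every original row.
df$total_country_sales <- ave(df$Sales, df$Country, FUN = sum)

df$total_country_sales
# [1] 5022  100 5022
```

The US rows get 5022 (122 + 4900), which matches the value shown in the question, and no packages are needed.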
It can be achieved using dplyr's mutate()
df = data.frame(sales=c(1,3,2,4,5),country=c("A","A","B","B","B"))
df
# sales country
# 1 1 A
# 2 3 A
# 3 2 B
# 4 4 B
# 5 5 B
df %>% group_by(country) %>% mutate(total_sales = sum(sales))
# Source: local data frame [5 x 3]
# Groups: country [2]
#
# # A tibble: 5 x 3
# sales country total_sales
# <dbl> <fctr> <dbl>
# 1 1 A 4
# 2 3 A 4
# 3 2 B 11
# 4 4 B 11
# 5 5 B 11
Using data.table:
library(data.table)
setDT(df)[, total_sales := sum(sales), by = country]
df
# sales country total_sales
# 1: 1 A 4
# 2: 3 A 4
# 3: 2 B 11
# 4: 4 B 11
# 5: 5 B 11
