Summing my data by two specific columns in R

I have my data frame below, and I want to sum the data as in the first row of the image below (labelled row 38): the total flowering summed over Sections A-D for each date. I also have multiple plots, not just Dry1 but Dry2, Dry3, etc.
It's so simple to do in my head, but I can't work out how to do it in R.
Essentially I want to do this:
with(dat1, sum(dat1$TotalFlowering[dat1$Date=="1997-07-01" & dat1$Plot=="Dry1"]))
Which tells me that the sum of total flowers for sections "A,B,C,D" in plot "Dry1" for the date "1997-07-01" = 166
I want a way to write code that does this for every date and plot combination and then puts the result in the data frame, in the same format as the first row in the image I included :)

Based on your comment, it seems like you just want to add a variable that keeps track of TotalFlowering for every Date and Plot combination. If that's the case, you could just add a field like TotalCount below:
library(dplyr)
df %>%
  group_by(Date, Plot) %>%
  mutate(TotalCount = sum(TotalFlowering)) %>%
  ungroup()
Or, alternatively, if all you want is the sum, you could make use of dplyr's summarise like below:
library(dplyr)
df %>%
  group_by(Date, Plot) %>%
  summarise(TotalCount = sum(TotalFlowering))
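A small made-up example may make the difference between the two approaches clearer. This is only a sketch: the values below are hypothetical (chosen so that the first group sums to 166, as in the question), and the Section column is assumed from the question's description.

```r
library(dplyr)

# Hypothetical data: one plot, two dates, four sections per date
dat1 <- data.frame(
  Date = rep(c("1997-07-01", "1997-07-08"), each = 4),
  Plot = "Dry1",
  Section = rep(c("A", "B", "C", "D"), times = 2),
  TotalFlowering = c(40, 50, 36, 40, 10, 20, 30, 40)
)

# mutate() keeps all 8 rows and repeats the group total on every row
with_totals <- dat1 %>%
  group_by(Date, Plot) %>%
  mutate(TotalCount = sum(TotalFlowering)) %>%
  ungroup()

# summarise() collapses the data to one row per Date/Plot combination
totals <- dat1 %>%
  group_by(Date, Plot) %>%
  summarise(TotalCount = sum(TotalFlowering), .groups = "drop")
```

Use the mutate() version if the per-section rows should be kept, and the summarise() version if only one total per date and plot is needed.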

Related

how to average a set of columns and exclude other specific columns in R using the summarise command?

I'm racking my brain over some academic work. I have a data.frame with several numeric columns, and I am using summarise and group_by to compute group averages.
I tried summarise(across(where(is.numeric), mean), -c(Mes, year_date)), but it averages every numeric column in the data.frame and, in addition, creates a new column named -c(Mes, year_date). I would like some numeric columns to be excluded from the mean calculation but still remain in the data.frame.
Note that I tried -c(Mes, year_date) to exclude these two columns from the average calculation, but it didn't work.
I tried
library(tidyr)
library(dplyr)
library(lubridate)
sample_station <-c('A','A','A','A','A','A','A','A','A','A','A','B','B','B','B','B','B','B','B','B','B','C','C','C','C','C','C','C','C','C','C','A','B','C','A','B','C')
Date_dmy <-c('01/01/2000','08/08/2000','16/03/2001','22/09/2001','01/06/2002','05/01/2002','26/01/2002','16/02/2002','09/03/2002','30/03/2002','20/04/2002','04/01/2000','11/08/2000','19/03/2001','25/09/2001','04/06/2002','08/01/2002','29/01/2002','19/02/2002','12/03/2002','13/09/2001','08/01/2000','15/08/2000','23/03/2001','29/09/2001','08/06/2002','12/01/2002','02/02/2002','23/02/2002','16/03/2002','06/04/2002','01/02/2000','01/02/2000','01/02/2000','02/11/2001','02/11/2001','02/11/2001')
temperature <-c(17,20,24,19,17,19,23,26,19,19,21,15,23,18,22,22,23,18,19,26,21,22,23,27,19,19,21,23,24,25,26,29,30,21,25,24,23)
wind_speed<-c(3.001,6.332,9.321,10.9091,6.38,10.5882,10.5,10.4348,10.3846,10.3448,10.3125,8.35,10.2632,10.2439,10.2273,10.2128,10.2,10.1887,10.1786,12,10.1613,10.1538,10.1471,10.1408,10.1351,10.1299,10.125,2.36,10.1163,10.1124,10.1087,11.2,10.102,10.099,10.0962,10.0935,10.0909)
esp<-c(11.6,11.3,11,10.7,10.4,10.1,9.8,9.5,9.2,8.9,8.6,8.3,8,11.2,10.9,10.6,10.3,10,12.8,12.5,12.2,11.9,11.6,11.3,11,4.36,4.06,3.76,3.46,3.16,2.86,2.56,2.26,1.96,1.66,1.36,23)
volum<-c(300,300,300,300,300,300,300,300,250,250,250,250,250,250,400,400,400,400,400,105,105,105,105,105,105,105,105,105,105,81,81,81,81,81,81,81,81)
df <- data.frame(sample_station, Date_dmy, temperature, wind_speed, esp, volum) %>%
  mutate(Date_dmy = dmy(Date_dmy)) %>%
  mutate(year_date = floor_date(Date_dmy, 'year')) %>%
  mutate(Ano = year(Date_dmy)) %>%
  mutate(Mes = month(Date_dmy)) %>%
  mutate(Epoca = ifelse(Mes %in% 4:9, 'dry', 'rainy')) %>%
  group_by(sample_station, Epoca, Ano) %>%
  summarise(across(where(is.numeric), mean), -c(Mes, year_date))
I have several columns that I don't want to be averaged (even though they are numeric), for example the columns esp and volum.
Update: expected output
Because you are summarising only part of the data, you need to specify which rows of the un-summarised data you want to keep. In your example, you don't want to summarise Mes and year_date, but these two columns take multiple values within each group (sample_station, Epoca, Ano).
Which values of these unsummarised columns do you want to keep?
If you want to keep all values of the unsummarised columns, you may want to include Mes and year_date inside group_by(sample_station, Epoca, Ano) before summarising.
Alternatively, you may use mutate() rather than summarise() to get summary values in a new column for each row of the original dataframe, then choose your rows from there.
Update:
Again, with your edited post including the desired output: what values do you expect for Mes? For example, when sample_station == 'A', Epoca == 'rainy' and Ano == 2000, Mes takes the values 1 and 2 with the same year_date, but summarise() wants to calculate one single summary value for this group.
You can use across(c(where(is.numeric), -Mes), mean). Note that year_date is not included in the calculation because it is not of class numeric and also because it is included in group_by.
You can also combine multiple mutate statements into one.
If you want to exclude certain columns from the average calculation but keep them in the data frame, you need to decide which value you want to keep. For example, to keep the first value you can use first.
library(dplyr)
library(lubridate)
data.frame(sample_station, Date_dmy, temperature, wind_speed) %>%
  mutate(Date_dmy = dmy(Date_dmy),
         year_date = floor_date(Date_dmy, 'year'),
         Ano = year(Date_dmy),
         Mes = month(Date_dmy),
         Epoca = ifelse(Mes %in% 4:9, 'dry', 'rainy')) %>%
  group_by(sample_station, year_date, Epoca) %>%
  summarise(across(c(where(is.numeric), -Mes), mean),
            across(Mes, first))

Changing a Column to an Observation in a Row in R

I am currently struggling to transition one of my columns in my data to a row as an observation. Below is a representative example of what my data looks like:
library(tidyverse)
test_df <- tibble(unit_name=rep("Chungcheongbuk-do"),unit_n=rep(2),
can=c("Cho Bong-am","Lee Seung-man","Lee Si-yeong","Shin Heung-woo"),
pev1=rep(510014),vot1=rep(457815),vv1=rep(445955),
ivv1=rep(11860),cv1=c(25875,386665,23006,10409),
abstention=rep(52199))
As seen above, the abstention column exists at the end of my data frame, and I would like my data to look like the following:
library(tidyverse)
desired_df <- tibble(unit_name=rep("Chungcheongbuk-do"),unit_n=rep(2),
can=c("Cho Bong-am","Lee Seung-man","Lee Si-yeong","Shin Heung-woo","abstention"),
pev1=rep(510014),vot1=rep(457815),vv1=rep(445955),
ivv1=rep(11860),cv1=c(25875,386665,23006,10409,52199))
Here, abstentions are treated like a candidate, in the can column. Thus, the rest of the data is maintained, and the abstention values are their own observation in the cv1 column.
I have tried using pivot_wider, but I am unsure how to use the arguments to get what I want. I have also considered t() to transpose the column into a row, but I am having a hard time slotting it back into my data. Any help is appreciated! Thanks!
Here's a strategy that will work if you have multiple unit_names:
test_df %>%
  group_split(unit_name) %>%
  map(function(group_data) {
    slice(group_data, 1) %>%
      mutate(can = "abstention", cv1 = abstention) %>%
      add_row(group_data, .) %>%
      select(-abstention)
  }) %>%
  bind_rows()
Basically we split the data up by unit_name, then we grab the first row for each group and move the values around. Append that as a new row to each group, and then re-combine all the groups.
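As an alternative sketch (not the approach above), the abstention rows can be built with distinct() and appended with bind_rows(). This assumes that the unit-level columns (unit_n, pev1, vot1, vv1, ivv1, abstention) are constant within each unit_name, as they are in the example data; abstention_rows and result are names made up for illustration.

```r
library(dplyr)

test_df <- tibble(
  unit_name = "Chungcheongbuk-do", unit_n = 2,
  can = c("Cho Bong-am", "Lee Seung-man", "Lee Si-yeong", "Shin Heung-woo"),
  pev1 = 510014, vot1 = 457815, vv1 = 445955,
  ivv1 = 11860, cv1 = c(25875, 386665, 23006, 10409),
  abstention = 52199
)

# One row per unit, with the abstention count moved into cv1
abstention_rows <- test_df %>%
  distinct(unit_name, unit_n, pev1, vot1, vv1, ivv1, abstention) %>%
  mutate(can = "abstention") %>%
  rename(cv1 = abstention)

# Drop the abstention column and append the new rows (matched by column name)
result <- test_df %>%
  select(-abstention) %>%
  bind_rows(abstention_rows)
```

With several units, the abstention rows end up at the bottom of the combined data frame rather than after each unit's candidates, so a final sort by unit_name may be wanted.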

Adding a "beep" column

I'm new to coding with R, especially with time series. My problem is that I'd like to include a "beep" column in a dataset. More specifically, the dataset has 3 columns: ID, date and time.
It would be really useful to add, next to these columns, a corresponding beep number, since the individuals received several beeps per day on some days.
How could I do that?
library(dplyr)                      # Load package dplyr
mydata <- mydata %>%                # Take the dataframe, then...
  group_by(Name, Dates) %>%         # Group it by name and dates, then...
  mutate(beep = row_number()) %>%   # Add a beep column with a sequential number by name & date
  ungroup()                         # Remove grouping
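Here is a minimal sketch of the same idea on made-up data. The column names Name, Dates and Time and all values are hypothetical, since the original screenshots are not available. Because row_number() counts rows in their current order, the sketch sorts by time first so that beep 1 is the earliest beep of the day.

```r
library(dplyr)

# Hypothetical data: two people, several beeps per day, deliberately unsorted
mydata <- data.frame(
  Name  = c("a", "a", "a", "b", "b"),
  Dates = c("2020-01-01", "2020-01-01", "2020-01-02", "2020-01-01", "2020-01-01"),
  Time  = c("12:30", "09:00", "10:15", "17:20", "08:45")
)

mydata <- mydata %>%
  arrange(Name, Dates, Time) %>%    # Sort so beeps are numbered chronologically
  group_by(Name, Dates) %>%
  mutate(beep = row_number()) %>%   # 1, 2, ... within each person and day
  ungroup()
```

Sorting times stored as text only works here because they are zero-padded "HH:MM" strings; a proper time class would be safer for real data.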

R: Take the mean of duplicate rows within a dataset when some columns have text within them

Hi, I am trying to take the mean of duplicate sample rows within a data frame. I can produce the mean of all columns within the two rows; however, some of my columns have text within them, which results in a lot of NAs. How can I work around this?
If the rows are truly duplicated (as in, all of the values are the same), and assuming you have an ID variable that groups these duplicated rows, then you can simply take the first row for each ID.
Something like this may work:
library(dplyr)
new_data <- duplicated_data %>%
  group_by(ID) %>%
  slice(1) %>%
  ungroup()
Where duplicated_data is your original dataset, and ID is the ID variable that you use to determine whether a sample is duplicated or not.
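If the duplicated rows are not identical (for instance, repeated measurements of the same sample), one possible sketch is to average the numeric columns and keep the first value of each text column. The data and column names below are hypothetical, and keeping the first text value is an assumption about what is wanted.

```r
library(dplyr)

# Hypothetical data: sample "s1" appears twice with different measurements
duplicated_data <- data.frame(
  ID = c("s1", "s1", "s2"),
  site = c("north", "north", "south"),
  value1 = c(1, 3, 5),
  value2 = c(10, 20, 30),
  stringsAsFactors = FALSE
)

new_data <- duplicated_data %>%
  group_by(ID) %>%
  summarise(
    across(where(is.numeric), mean),     # Average the numeric columns
    across(where(is.character), first),  # Keep the first value of text columns
    .groups = "drop"
  )
```

This avoids the NA values that come from calling mean() on text columns, because each kind of column gets its own summary function.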

Finding Avg/Sum of a Column Value

I have a nice jitter plot of my data, but I'd like to dig further into the data by finding the mean, sum, median, etc.
I don't know the syntax to separate the data by column value.
My data frame consists of 2 variables: Year (2010-2017) and Followers (numeric).
Code I used:
ggplot(MyData, aes(factor(Date), Followers)) +
  geom_jitter(aes(color = factor(Date)))
This separated the numeric data points into groups, one for each year.
I was able to use sum(MyData$Followers) to get the total Followers for all years, as well as count(MyData, 'Date') to get the frequency for each year.
But I'm not sure how to combine them to get total followers/avg followers for each individual year.
You can use dplyr:
library(dplyr)
df <- MyData %>%
  group_by(Year) %>%
  # n() takes no arguments; it counts the rows in each group
  summarize(Mean = mean(Followers), Total = sum(Followers), Count = n())
