group_by and summarize usage in the tidyverse package in R

I am analyzing COVID-19 data in R and I want to get the total number of cases for each continent.
total_cases_continent <- covid_data %>%
  select(continent, new_cases) %>%
  group_by(continent) %>%
  summarize(total_cases = sum(new_cases))
I get the result below: instead of one total per continent, the output is a single row containing one overall total.

It looks like there might be some issues with the values of your variable "continent". I would recommend checking the class of the variable, as well as making sure all of the values are what you expected them to be. This is probably causing the issues within your group_by statement.
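The checks suggested above can be sketched as follows, using a small made-up covid_data (the values are invented for illustration). The last step calls dplyr::summarise explicitly and passes na.rm = TRUE. Both are assumptions about further possible causes rather than a confirmed diagnosis: if plyr is loaded after dplyr, its summarize masks dplyr's and ignores the grouping, which also produces a single row, and missing values in new_cases would turn a group's total into NA.

```r
library(dplyr)

# Hypothetical stand-in for the real covid_data
covid_data <- data.frame(
  continent = c("Asia", "Asia", "Europe", "Europe", NA),
  new_cases = c(10, NA, 5, 7, 3)
)

# Inspect the grouping variable: its class and the values it actually holds
class(covid_data$continent)
count(covid_data, continent)

# Call dplyr's summarise explicitly and skip missing case counts
total_cases_continent <- covid_data %>%
  group_by(continent) %>%
  dplyr::summarise(total_cases = sum(new_cases, na.rm = TRUE))

print(total_cases_continent)
```

Rows with a missing continent form their own NA group, which is one quick way to spot unexpected values in the grouping column.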

Related

Trying to use ddply to subset a dataframe by two column variables, then find the maximum of a third column in r?

I have a dataframe called data with variables for date, time, temperature, and a group number called Box #. I'm trying to subset the data to find the maximum temperature for each day, for each box, along with the time at which that temperature occurred. Ideally I could place this into a new dataframe with the date, the maximum temperature and the time it occurred at.
I tried using ddply, but the code only returns one line of output
ddply(data, .('Box #', 'Date'), summarize, max('Temp'))
I was able to find the maximum temperatures for each day using tapply on separate dataframes that only contain the values for individual groups
mx_day_2 <- tapply(box2$Temp, box2$Date, max)
I was unable to apply this to the larger dataframe with all groups and cannot figure out how to also get time from this code.
Is it possible to have ddply subset by both Box # and Date, then return two separate outputs of both maximum temperature and time, or do I need to use a different function here?
Edit: I managed to get the maximum temperatures using a version of the code in the answer below, but still haven't figured out how to find the time at which the max occurs in the same data. The code that worked for the first part was
max_data <- data %>%
group_by(data$'Box #', data$'Date')
max_values <- summarise(max_data, max_temp=max(Temp, na.rm=TRUE))
I would use dplyr/tidyverse instead of plyr; dplyr is the successor to plyr. Also clean the column names with janitor, since spaces in column names are difficult to work with (clean_names() changes 'Box #' to box_number).
library(tidyverse)
library(janitor)
mx_day2 <- data %>%
  clean_names() %>%
  group_by(date, box_number) %>%
  summarise(max_temp = max(temp, na.rm = TRUE))
I found a solution that pulls full rows from the initial dataframe into a new dataframe, keeping only the rows that hold each group's maximum value. Full code for the solution below:
max_data_v2 <- data %>%
group_by(data$'Box #', data$'Date') %>%
filter(Temp == max(Temp, na.rm=TRUE))
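As an alternative sketch of the same idea, the grouping columns can be referenced directly with backticks instead of data$..., and dplyr's slice_max() (available from dplyr 1.0.0) keeps the row with the largest Temp in each group, so the time comes along with it. The data frame below is invented for illustration, and the Time column name is an assumption.

```r
library(dplyr)

# Hypothetical stand-in for the real data; check.names = FALSE keeps "Box #" as is
data <- data.frame(
  `Box #` = c(1, 1, 1, 2, 2),
  Date = c("2020-01-01", "2020-01-01", "2020-01-02", "2020-01-01", "2020-01-01"),
  Time = c("08:00", "14:00", "09:00", "10:00", "16:00"),
  Temp = c(20.5, 25.1, 18.0, 22.3, 21.0),
  check.names = FALSE
)

# One row per box and day: the row holding that group's maximum temperature
max_rows <- data %>%
  group_by(`Box #`, Date) %>%
  slice_max(Temp, n = 1) %>%
  ungroup()

print(max_rows)
```

Like the filter() version, slice_max() returns every tied row by default; setting with_ties = FALSE keeps a single row per group.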

R mutate which.max by group

I'm running into an issue that I feel should be simple but cannot figure out. I have searched the board for comparable problems/questions but was unable to find an answer.
In short, I have data from a variety of motor vehicles and I want to know the average speed of each vehicle when it is at maximal acceleration. I also want the opposite: the average acceleration at top speed.
I am able to do this for the whole dataset using the following code
data<-data %>% group_by(Name) %>%
mutate(speedATaccel= with(data, avg.Speed[which.max(top.Accel)]),
accelATspeed= with(data, avg.Accel[which.max(top.Speed)]))
However, the group_by function doesn't appear to be working: it just provides the values across the whole dataset rather than for each individual vehicle group.
Any help would be appreciated.
Thanks,
Using with(data, ...) bypasses the grouping set by group_by: the expression is evaluated on the whole data frame, so which.max returns an index over the entire dataset. Instead, use tidyverse methods, i.e. remove the with(data, ...) wrapper. Note that in the tidyverse we don't need any of the base R extraction methods such as $, [[ or with; instead, specify the unquoted column name
library(dplyr)
data %>%
  group_by(Name) %>%
  mutate(speedATaccel = avg.Speed[which.max(top.Accel)],
         accelATspeed = avg.Accel[which.max(top.Speed)])
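To see the per-group behaviour, here is a small sketch on invented data (the vehicle names and numbers are made up; the column names follow the question). Inside a grouped mutate(), which.max() is evaluated once per group, so each vehicle gets its own value.

```r
library(dplyr)

# Hypothetical vehicles; two rows each
data <- data.frame(
  Name = c("A", "A", "B", "B"),
  avg.Speed = c(50, 60, 30, 40),
  top.Speed = c(80, 90, 70, 65),
  avg.Accel = c(1.0, 1.5, 2.0, 2.5),
  top.Accel = c(3.0, 2.0, 4.0, 5.0)
)

result <- data %>%
  group_by(Name) %>%
  mutate(speedATaccel = avg.Speed[which.max(top.Accel)],
         accelATspeed = avg.Accel[which.max(top.Speed)]) %>%
  ungroup()

print(result)
```

Vehicle A reaches its highest top.Accel in its first row, so speedATaccel is 50 for both of its rows, while vehicle B gets 40.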

Count occurrence in one variable based on another when having duplicated values

I know there are many threads answering questions similar to mine, but none of the answers out there are specific enough to what I want to obtain.
I've got the following dataset:
I want to count the number of patients (found in "Var_name") that harbour each mutation (found in "var_id") and display the count in a new column ("var_freq"). I've tried things like:
y <- ALL_merged %>%
group_by(var_id, Var_name) %>%
summarise(n_counts = n(), var_freq = sum(var_id == Var_name))
NOTE: In case it is relevant for the answers, I had to convert "var_id" and "Var_name" to characters to make this work because they were factors.
However, this does not give me the output I want. Instead, I get the count of how many times each "var_id" appears per patient. For each "var_id", the same "Var_name" appears many times (because rows contain additional columns with different information), so the final outcome gives me a higher count than I would expect:
I also want to add this new column to the original dataset, I believe this could be done for example by using "mutate". But not sure how to set up everything...
So, in sum, what I want to get is: for each "var_id", how many different "Var_name" values I have, taking into account that the data are duplicated.
Thanks in advance!
It is not entirely clear what you are looking for. It would help to provide data without images (such as using dput) and clearly show what your output should be for example data.
Based on what you describe this might be helpful. First, group_by just var_id alone. Then in summarise, you can include n() to get the number of observations/rows of data for each var_id, and then n_distinct to get the number of unique Var_name for each var_id:
library(dplyr)
df %>%
group_by(var_id) %>%
summarise(n_counts = n(),
var_freq = n_distinct(Var_name))
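The question also asks how to attach the count to the original dataset. One possible sketch is to swap summarise() for mutate(), which keeps every row and repeats the per-group value. The data below are invented for illustration, with duplicated patient rows as described in the question.

```r
library(dplyr)

# Hypothetical data: mutation m1 is seen in patients p1 (twice) and p2
df <- data.frame(
  var_id = c("m1", "m1", "m1", "m2", "m2"),
  Var_name = c("p1", "p1", "p2", "p1", "p1"),
  stringsAsFactors = FALSE
)

# mutate() keeps all original rows and adds the per-mutation patient count
df_with_freq <- df %>%
  group_by(var_id) %>%
  mutate(var_freq = n_distinct(Var_name)) %>%
  ungroup()

print(df_with_freq)
```

Here m1 gets var_freq 2 and m2 gets 1, regardless of how many duplicated rows each patient contributes.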

What is the average age of daily riders?

I have a simple problem I am trying to solve with the tidyverse, particularly dplyr (I believe this is the appropriate package).
What is the average age of daily riders?
There is a data.frame named BikeData with two columns of interest: cyc_freq, which includes the value Daily, and age, which contains the riders' ages.
I am attempting to write a script that returns the average age of those who ride their bikes Daily. I was able to solve the problem but feel like my solution was inefficient.
Is there a simpler way to achieve my answer using dplyr?
bavg <- filter(BikeData, cyc_freq == "Daily", age)
mean(bavg$age)
It could be done within summarise itself without the need to have another step with filter
library(dplyr)
BikeData %>%
summarise(Mean = mean(age[cyc_freq == "Daily"]))
Or in base R
with(BikeData, mean(age[cyc_freq == "Daily"]))
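For comparison, here is a sketch of the filter-then-summarise form in a single pipe, plus a grouped version that returns the mean age for every riding frequency at once. The BikeData values are invented, and na.rm = TRUE is an assumption that missing ages should be ignored.

```r
library(dplyr)

# Hypothetical stand-in for BikeData
BikeData <- data.frame(
  cyc_freq = c("Daily", "Daily", "Weekly", "Daily", "Never"),
  age = c(30, 40, 25, NA, 60)
)

# Mean age of daily riders only
daily_mean <- BikeData %>%
  filter(cyc_freq == "Daily") %>%
  summarise(Mean = mean(age, na.rm = TRUE))

# Mean age for each riding frequency
by_freq <- BikeData %>%
  group_by(cyc_freq) %>%
  summarise(Mean = mean(age, na.rm = TRUE))

print(daily_mean)
print(by_freq)
```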

In R, how to sum multiple columns based on value of another column?

In R, I have a dataframe with one column holding the name of a country, a number of numeric variables (population, number of cars, etc.) and a final column that represents the region.
I would like to sum the variables (1, 2, ...) based on the value of the last column, region. I think this should be possible with dplyr and summarise each, but I cannot get it to work.
Would someone be able to help me please? Thanks a lot.
Going by your description (although this may change if you can share some of your dataframe), something like this should work:
library(dplyr)
summarized_df <- df %>%
group_by(region) %>%
summarise(var1=sum(variable1), var2=sum(variable2), var3=sum(variable3))
If this doesn't seem to work, maybe you can post your code and the errors even if you can't post the dataframe.
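When there are many columns to sum, listing each one by hand gets tedious. A possible sketch, assuming dplyr 1.0.0 or later, uses across() to sum every numeric column per region. The data frame and its column names are invented for illustration.

```r
library(dplyr)

# Hypothetical stand-in for the real dataframe
df <- data.frame(
  country = c("A", "B", "C", "D"),
  population = c(10, 20, 30, 40),
  cars = c(1, 2, 3, 4),
  region = c("North", "North", "South", "South")
)

# Sum every numeric column within each region
summarized_df <- df %>%
  group_by(region) %>%
  summarise(across(where(is.numeric), sum))

print(summarized_df)
```

The non-numeric country column is dropped from the result because it is neither a grouping column nor selected by where(is.numeric).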
