Getting sum of top n variables in a data frame - r

totdal_deaths_confirmed_cases_60_days_only = swine_flu_cases %>%
filter(Confirmed >0) %>%
group_by(Country) %>%
summarise(top_n((total_confirmed_cases = sum(Confirmed), total_deaths = sum(Deaths),60))
So I have a dataframe called swine_flu_cases, and have variables such as:
Country Date Confirmed Recovered Death
I want to sum the Confirmed and Deaths variables for each country, but only over the first 60 rows (entries) per country. I tried using the top_n function, but I am not sure how to apply it to my data frame before I do the summary. I also tried the slice_max function, but it doesn't seem to be available on my PC even though I loaded the dplyr package, so I can't figure that out either.
Any suggestions on how I could accomplish this would be appreciated.

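# Keep only rows with at least one confirmed case
step_1 = swine_flu_cases %>%
  filter(Confirmed > 0)
# Within each country, keep the 60 rows with the earliest dates
# (the pipe must end the previous line, not start the next one)
step_2 = step_1 %>%
  group_by(Country) %>%
  top_n(-60, Date)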
Then for the final phase
totdal_deaths_confirmed_cases_60_days_only = step_2 %>%
  summarise(total_confirmed_cases = sum(Confirmed), total_deaths = sum(Deaths))
I don't know if anyone has a much shorter way of doing this, but I did it in steps so that it was easier for me to understand
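For what it's worth, the steps can probably be collapsed into one pipeline. This is only a sketch: the toy data frame and n_days are made up for illustration (the question would use 60), and it assumes "first" means earliest by Date. It relies on filter(row_number(Date) <= n_days), which should also run on dplyr versions older than 1.0.0; slice_min() and slice_max() were only added in dplyr 1.0.0, which may be why slice_max was not found.

```r
library(dplyr)

# Hypothetical toy data standing in for swine_flu_cases
swine_flu_cases <- data.frame(
  Country   = rep(c("A", "B"), each = 4),
  Date      = rep(as.Date("2009-05-01") + 0:3, times = 2),
  Confirmed = c(0, 1, 2, 3, 5, 0, 1, 4),
  Deaths    = c(0, 0, 1, 1, 0, 0, 0, 2)
)

n_days <- 2  # assumption for the toy data; the question uses 60

first_n_totals <- swine_flu_cases %>%
  filter(Confirmed > 0) %>%
  group_by(Country) %>%
  # row_number(Date) ranks rows by date within each country
  filter(row_number(Date) <= n_days) %>%
  summarise(
    total_confirmed_cases = sum(Confirmed),
    total_deaths = sum(Deaths)
  )

print(first_n_totals)
```

Unlike top_n(), this keeps exactly n_days rows per country even when dates are tied.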

Related

Sum of selected columns works on subset of data but not full data set

I have a large R data set with over 90K observations and 400 variables representing patient diagnoses. I want to calculate the sum of the values in selected columns (named Code1 through Code200) and store the value in a new column (mytotal). The code below works when I run it with a subset (around 2K) of the observations.
mysubset <- mysubset %>%
mutate(mytotal = select(., Code1:Code200) %>%
rowSums(na.rm = TRUE))
However, when I try to run the same code on the full (90K observations, same dataframe structure) dataframe, I get an error:
Adding missing grouping variables: patient_num
Error in mutate():
! Problem while computing utils = select(., Code1:Code200) %>% rowSums(na.rm = TRUE).
✖ utils must be size 1, not 92574.
ℹ The error occurred in group 1: patient_num = 123456789.
I've searched online for hours to try to resolve the problem or to find an alternative solution, with no luck. If anyone has insights, I'd really appreciate them. Thank you.
Update: Just to save anyone else the hours I wasted trying to figure out the problem, it finally occurred to me to compare the subset and the full data set using class(). It turns out that the full data set had been saved as a grouped dataframe. Once I used ungroup(), the original code worked on the full data set. Apologies for the newbie distress call and thanks for the helpful responses!
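For anyone hitting the same error, here is a small sketch of the check and the fix described in the update. The patients data frame and its three Code columns are invented for illustration, and rowSums(across(...)) is swapped in for the original select(., ...) call (it assumes dplyr 1.0.0 or later).

```r
library(dplyr)

# Hypothetical miniature of the full data set, saved as a grouped data frame
patients <- data.frame(
  patient_num = 1:3,
  Code1 = c(1, 0, 1),
  Code2 = c(0, 0, 1),
  Code3 = c(1, NA, 1)
) %>%
  group_by(patient_num)

# A grouped data frame reports "grouped_df" in its class
was_grouped <- is_grouped_df(patients)

# Drop the grouping first, then sum across the selected columns row by row
fixed <- patients %>%
  ungroup() %>%
  mutate(mytotal = rowSums(across(Code1:Code3), na.rm = TRUE))

print(fixed)
```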
Here's a tidyverse approach, where we could take just the columns we want and reshape them into longer data, which will be simpler to sum.
set.seed(42)
df <- matrix(rnorm(9E4*400), nrow= 9E4) |> as.data.frame()
library(tidyverse)
df_sums <- df %>%
mutate(row = row_number()) %>%
select(row, V1:V200) %>%
pivot_longer(-row) %>%
count(row, wt = value, name = "mytotal")
df %>%
bind_cols(df_sums)

Lead and lag issue using dplyr

I have a data frame with 365 rows, one per day of the calendar year. I am trying to shift the county name columns up by one row. The data frame doesn't contain any missing values.
I tried using the following code to shift it, but the resulting table has values that are all NA.
covid_shift <- covid_pivot %>%
mutate(Maricopa = lag(Maricopa), Cook = lag(Cook), Harris = lag(Harris))
Does anyone know what might be the issue?
Since covid_pivot is grouped by date, and each of these groups has one row, the lead and lag functions return NA.
Try:
covid_shift <- covid_pivot %>%
ungroup() %>%
mutate(Maricopa = lag(Maricopa), Cook = lag(Cook), Harris = lag(Harris))
You might also consider using across()
covid_pivot %>%
ungroup() %>%
mutate(across(-date, ~lag(.x)))
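To see the effect on a small case, here is a sketch with a made-up four-row covid_pivot (the values are invented). It also shows lead(), since lag() moves values down a row while lead() moves them up, which seems to be what the question asks for.

```r
library(dplyr)

# Hypothetical stand-in for covid_pivot, grouped by date as in the question
covid_pivot <- data.frame(
  date     = as.Date("2020-03-01") + 0:3,
  Maricopa = c(1, 2, 3, 4),
  Cook     = c(10, 20, 30, 40)
) %>%
  group_by(date)

# Each group has a single row, so there is no previous row to take from
grouped_lag <- covid_pivot %>%
  mutate(Maricopa = lag(Maricopa))

# After ungroup(), lead() shifts every county column up by one row
ungrouped_lead <- covid_pivot %>%
  ungroup() %>%
  mutate(across(-date, lead))

print(grouped_lag)
print(ungrouped_lead)
```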

I need to use a loop in R but don't know where to start

I have a calculation that I have to perform for 23 people (each person has a varying number of rows, so it is difficult to do in Excel). What I'd like to do is take the total time each person took to complete a test and divide it into 5 time categories (20% each) so that I can look at their reaction time in more detail.
I will just do this by hand but it will take quite a while because they have 8 sets of data each. I'm hoping someone can show me the best way to use a loop or automate this process even just a little. I have tried to understand the examples but I'm afraid I don't have the skill. So by hand I would do it like I have below where I just filter by each subject.
I started by selecting the relevant columns, then filtered by subject so that I could calculate the time they started and the time they finished and used that to create a variable (testDuration) that could be used to create the 20% proportions of RTs that I'm after. I have no idea how to get the individual subjects' test start, end, duration and timeBin sizes to appear in one column. Any help very gratefully received.
Subj1 <- rtTrialsYA_s1 %>%
  select(Subject, RetRating.OnsetTime, RetRating.RT, RetRating.RTTime) %>%
  filter(Subject == 1) %>%
  summarise(
    testStart = min(RetRating.OnsetTime),
    testEnd = max(RetRating.RTTime)
  ) %>%
  mutate(
    testDuration = testEnd - testStart,
    timeBin = testDuration / 5
  )
Subj2 <- rtTrialsYA_s1 %>%
  select(Subject, RetRating.OnsetTime, RetRating.RT, RetRating.RTTime) %>%
  filter(Subject == 2) %>%
  summarise(
    testStart = min(RetRating.OnsetTime),
    testEnd = max(RetRating.RTTime)
  ) %>%
  mutate(
    testDuration = testEnd - testStart,
    timeBin = testDuration / 5
  )
I'm not positive that I understand your code, but this function can be called for any Subject value and then return the output:
myfunction <- function(subjectNumber){
Subj <- rtTrialsYA_s1 %>%
select(Subject, RetRating.OnsetTime, RetRating.RT, RetRating.RTTime) %>%
filter(Subject==subjectNumber) %>%
summarise(testStart = min(RetRating.OnsetTime), testEnd = max(RetRating.RTTime)) %>%
mutate(testDuration = testEnd -testStart) %>%
mutate(timeBin = testDuration/5)
return(Subj)
}
Subj1 <- myfunction(1)
Subj2 <- myfunction(2)
To loop through this, I'll need to know what your data and the desired output look like.
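In the meantime, one way the loop could look is sketched below. The toy rtTrialsYA_s1 values are invented, and summarise_subject is a hypothetical variant of the function above that also records the subject number so the rows can be told apart once stacked.

```r
library(dplyr)

# Hypothetical toy data with the column names from the question
rtTrialsYA_s1 <- data.frame(
  Subject             = c(1, 1, 1, 2, 2),
  RetRating.OnsetTime = c(100, 200, 300, 50, 150),
  RetRating.RT        = c(20, 30, 25, 40, 10),
  RetRating.RTTime    = c(120, 230, 325, 90, 160)
)

# Same calculation as above, plus a Subject column on the one-row result
summarise_subject <- function(subjectNumber) {
  rtTrialsYA_s1 %>%
    filter(Subject == subjectNumber) %>%
    summarise(
      testStart = min(RetRating.OnsetTime),
      testEnd = max(RetRating.RTTime)
    ) %>%
    mutate(
      Subject = subjectNumber,
      testDuration = testEnd - testStart,
      timeBin = testDuration / 5
    )
}

# Call the function once per subject and stack the results into one table
all_subjects <- lapply(unique(rtTrialsYA_s1$Subject), summarise_subject) %>%
  bind_rows()

print(all_subjects)
```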
I think you're missing one piece and that is simply dplyr::group_by.
You can use it as follows to break your dataset into groups, each containing the observations belonging to only one subject, and then summarise on those groups with whatever it is you want to analyze.
library(dplyr)
df <- rtTrialsYA_s1 %>%
group_by(Subject) %>%
summarise(
testStart = min(RetRating.OnsetTime),
testEnd = max(RetRating.RTTime),
testDuration = testEnd - testStart,
timeBin = testDuration/5,
.groups = "drop"
)
There is no need to do separate mutate calls in your code, btw. Also, you can continue to do column calculations right within summarise, as long as the result vectors have the same length as your aggregated columns.
And since summarise retains only the grouping columns and whatever you are defining, there is no real need to do a select statement before, either.
Update:
You say you need all your calculated columns to appear within one single column. For that you can use tidyr::pivot_longer. Using the df we calculated above:
library(tidyr)
df_long <- df %>%
pivot_longer(-Subject)
The code above takes every column except Subject and pivots them into two columns: name, holding the former column name, and value, holding the former value.
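If the end goal is to label each trial with the 20% slice of the test it falls into, something along these lines might work. This is a guess at the intent: the toy data is invented, timeBinIndex is a hypothetical column name, and trials are assigned by their onset time relative to the subject's own test start.

```r
library(dplyr)

# Hypothetical toy data with the column names from the question
rtTrialsYA_s1 <- data.frame(
  Subject             = c(1, 1, 1, 2, 2),
  RetRating.OnsetTime = c(100, 200, 300, 50, 150),
  RetRating.RT        = c(20, 30, 25, 40, 10),
  RetRating.RTTime    = c(120, 230, 325, 90, 160)
)

binned <- rtTrialsYA_s1 %>%
  group_by(Subject) %>%
  mutate(
    testStart = min(RetRating.OnsetTime),
    testDuration = max(RetRating.RTTime) - testStart,
    timeBin = testDuration / 5,
    # Bin 1 covers the first 20% of the test, bin 5 the last 20%;
    # pmin() keeps a trial at the very end of the test inside bin 5
    timeBinIndex = pmin(floor((RetRating.OnsetTime - testStart) / timeBin) + 1, 5)
  ) %>%
  ungroup()

print(binned)
```

From there, group_by(Subject, timeBinIndex) followed by summarise() on RetRating.RT would give one reaction-time summary per subject and bin.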

How to mutate paneldata with dplyr in R?

I have panel data (person-year combinations) for which I need to investigate the impact that your partner's characteristics (several "x") have on your outcome variable (y). Everything is given in one tibble/dataframe. Partner information is given by "pid".
paneldata = data.frame(id=c(1,1,1,2,2,2,3,3,3,4,4,4), time=seq(1:3), pid=c(3,3,NA,4,4,3,1,1,2,2,2,NA),
y=c(9,10,11,12,13,14,15,16,17,18,19,20), x=c(21,22,23,24,25,26,27,28,29,30,31,32),
x_partner=c(27,28,NA,30,31,29,21,22,26,24,25,NA))
library(dplyr)
paneldata %>%
group_by(id, time) %>%
mutate(x_pid = x[pid])
I want to achieve x_partner, but what I have so far is x_pid. I'm trying to catch the index while running through group_by "id" and "time": get the "pid" (not unique!) and look up x at the pid-time combination.
You shouldn't be grouping by id, only by time.
paneldata %>%
group_by(time) %>%
mutate(x_partner = x[match(id, pid)])
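A quick way to check the lookup against the expected column is sketched below. Note that it uses match(pid, id), which finds the partner's own row within each time period; on this data it gives the same result as match(id, pid) because every partnership is mutual, but that is an assumption about the data rather than something guaranteed in general.

```r
library(dplyr)

paneldata <- data.frame(
  id = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4),
  time = rep(1:3, times = 4),
  pid = c(3, 3, NA, 4, 4, 3, 1, 1, 2, 2, 2, NA),
  y = c(9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20),
  x = c(21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32),
  x_partner = c(27, 28, NA, 30, 31, 29, 21, 22, 26, 24, 25, NA)
)

checked <- paneldata %>%
  group_by(time) %>%
  # For each row, find the row in the same period whose id equals this row's pid
  mutate(x_pid = x[match(pid, id)]) %>%
  ungroup()

# TRUE if the computed column reproduces the expected x_partner
matches_expected <- isTRUE(all.equal(checked$x_pid, checked$x_partner))
print(matches_expected)
```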

Group and summarize with iterative filter using dplyr

Upfront apology if this has been asked, I have been searching all day and have not found an answer I can apply to my problem.
I am trying to solve this issue using dplyr (and co.) because my previous method (for loops) was too inefficient. I have a dataset of event times, at sites, that are in groups. I want to summarize the number (and proportion) of events that occur in a moving window along a sequence.
# Example data
set.seed(1)
sites = rep(letters[1:10],10)
groups = c('red','blue','green','yellow')
times = round(runif(length(sites),1,100))
timePeriod = seq(1,100)
# Example dataframe
df = data.frame(site = sites,
group = rep(groups,length(sites)/length(groups)),
time = times)
This is my attempt to summarize the number of sites from each group that contain a time (event) within a given moving window of time.
The goal is to move through each element of the vector timePeriod and summarize how many events in each group occurred at timePeriod[i] +/- half-window. Ultimately storing them in, e.g., a dataframe with a column for each group, and a row for each time step, is ideal.
df %>%
filter(time > timePeriod[i]-25 & time < timePeriod[i]+25) %>%
group_by(group) %>%
summarise(count = n())
How can I do this without looping through my sequence of time and storing the summary table for each group individually? Thanks!
Combining lapply and dplyr, you can do the following, which is close to what you had so far.
lapply(timePeriod, function(i){
df %>%
filter(time > (i - 25) & time < ( i + 25 ) ) %>%
group_by(group) %>%
summarise(count = n()) %>%
mutate(step = i)
}) %>%
bind_rows()
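To get the layout described in the question (one row per time step, one column per group), the long result can probably be reshaped with tidyr::pivot_wider. A sketch, reusing the example data from the question; counts_long and counts_wide are names made up here, and values_fill = 0 is an assumption that a group missing from a window should count as zero.

```r
library(dplyr)
library(tidyr)

# Example data from the question
set.seed(1)
sites <- rep(letters[1:10], 10)
groups <- c("red", "blue", "green", "yellow")
times <- round(runif(length(sites), 1, 100))
timePeriod <- seq(1, 100)
df <- data.frame(
  site = sites,
  group = rep(groups, length(sites) / length(groups)),
  time = times
)

# Long format: one row per (step, group) that has at least one event
counts_long <- lapply(timePeriod, function(i) {
  df %>%
    filter(time > (i - 25) & time < (i + 25)) %>%
    group_by(group) %>%
    summarise(count = n()) %>%
    mutate(step = i)
}) %>%
  bind_rows()

# Wide format: one row per step, one column per group
counts_wide <- counts_long %>%
  pivot_wider(names_from = group, values_from = count, values_fill = 0)

print(head(counts_wide))
```

Proportions could then be computed by dividing each group column by the row total.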
