I have a dataset with 51 columns and I want to add summary rows for most of these variables. Columns 5:48 are various metrics, with each row being one area in one quarter. I am summing each metric over all quarters for each area and ultimately creating a rate. The code below works fine for one individual metric, but I need to run it for 44 different columns.
example <- test %>%
group_by(Area) %>%
summarise(`Metric 1`= (sum(`Metric 1`))/(mean(Population))*10000) %>%
bind_rows(test) %>%
arrange(Area,-Quarter)%>% # sort so that total rows come after the quarters
mutate(Timeframe=if_else(is.na(Quarter),'12 month rolling', 'Quarterly'))
I have tried creating a for loop and using the column index values; however, that hasn't worked and just returns various errors. I've also been unable to get the above script working with index values. The code below gives an error ('Error: unexpected '=' in: " group_by_at(Local_Authority) %>% summarise(u17_12mo[5]="'):
example <- test %>%
group_by_at(Area) %>%
summarise(test[5]= (sum(test[5]))/(mean(test[4]))*10000) %>%
bind_rows(test) %>%
arrange(Area,-Quarter)%>% # sort so that total rows come after the quarters
mutate(Timeframe=if_else(is.na(Quarter),'12 month rolling', 'Quarterly'))
Any help on setting up a for loop for this, or another way entirely, would be great.
Without data, it's tough to help, but maybe this would work for you:
library(tidyverse)
example <- test %>%
group_by(Area) %>%
summarise(across(5:48, ~(sum(.))/(mean(Population))*10000))
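For what it's worth, here is a minimal sketch of the full pipeline (summary rows plus the Timeframe label) on made-up data. The toy `test` tibble and its values are assumptions, not the asker's data, and the metric columns are selected by name with `starts_with("Metric")` instead of by position, which sidesteps any question of how positions line up once the grouping column is set aside.

```r
library(dplyr)

# Hypothetical toy data: two areas, two quarters, two metrics
test <- tibble(
  Area = c("A", "A", "B", "B"),
  Quarter = c(1, 2, 1, 2),
  Population = c(1000, 1000, 2000, 2000),
  `Metric 1` = c(1, 3, 2, 2),
  `Metric 2` = c(10, 10, 0, 20)
)

# One total row per area: summed metric per 10,000 population
totals <- test %>%
  group_by(Area) %>%
  summarise(
    across(starts_with("Metric"), ~ sum(.x) / mean(Population) * 10000),
    .groups = "drop"
  )

# Stack totals with the quarterly rows; total rows have Quarter = NA,
# and arrange() always places NA last, so they follow the quarters
example <- totals %>%
  bind_rows(test) %>%
  arrange(Area, desc(Quarter)) %>%
  mutate(Timeframe = if_else(is.na(Quarter), "12 month rolling", "Quarterly"))
```

If the real metric columns don't share a prefix, a name range such as `first_metric:last_metric` (hypothetical names) inside `across()` should work the same way.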
I have a large R data set with over 90K observations and 400 variables representing patient diagnoses. I want to calculate the sum of the values in selected columns (named Code1 through Code200) and store the value in a new column (mytotal). The code below works when I run it with a subset (around 2K) of the observations.
mysubset <- mysubset %>%
mutate(mytotal = select(., Code1:Code200) %>%
rowSums(na.rm = TRUE))
However, when I try to run the same code on the full (90K observations, same dataframe structure) dataframe, I get an error:
Adding missing grouping variables: patient_num
Error in mutate():
! Problem while computing utils = select(., Code1:Code200) %>% rowSums(na.rm = TRUE).
✖ utils must be size 1, not 92574.
ℹ The error occurred in group 1: patient_num = 123456789.
I've searched online for hours to try to resolve the problem or to find an alternative solution, with no luck. If anyone has insights, I'd really appreciate them. Thank you.
Update: Just to save anyone else the hours I wasted trying to figure out the problem, it finally occurred to me to compare the subset and the full data set using class(). It turns out that the full data set had been saved as a grouped dataframe. Once I used ungroup(), the original code worked on the full data set. Apologies for the newbie distress call and thanks for the helpful responses!
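For anyone hitting the same error, a small sketch of the check and the fix described in the update. The `full` tibble below is a made-up stand-in with three code columns, not the real patient data.

```r
library(dplyr)

# Hypothetical stand-in: a data frame that was saved while still grouped
full <- tibble(
  patient_num = c(1, 1, 2),
  Code1 = c(1, NA, 0),
  Code2 = c(0, 1, 1),
  Code3 = c(1, 1, NA)
) %>%
  group_by(patient_num)

# Check whether grouping is still attached
grouped <- is_grouped_df(full)

# Drop the grouping, then compute the row-wise total over the code columns
fixed <- full %>%
  ungroup() %>%
  mutate(mytotal = rowSums(across(Code1:Code3), na.rm = TRUE))
```

Here `across()` with no function simply returns the selected columns, so `rowSums()` sees only Code1 through Code3.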
Here's a tidyverse approach, where we could take just the columns we want and reshape them into longer data, which will be simpler to sum.
set.seed(42)
df <- matrix(rnorm(9E4*400), nrow= 9E4) |> as.data.frame()
library(tidyverse)
df_sums <- df %>%
mutate(row = row_number()) %>%
select(row, V1:V200) %>%
pivot_longer(-row) %>%
count(row, wt = value, name = "mytotal")
df %>%
bind_cols(df_sums)
I am having trouble combining multiple rows into one row. Below is my current data:
I want one row of symptoms for each VAERS_ID. However, because the number of rows for each VAERS_ID is inconsistent, I am having trouble.
I have tried this:
test= data %>%
select(VAERS_ID, SYMPTOM1, SYMPTOM2, SYMPTOM3, SYPMTOM4, SYMPTOM5) %>%
group_by(VAERS_ID) %>%
mutate(Grp = paste0(SYMPTOM1,SYMPTOM1, SYMPTOM2, SYMPTOM3, SYPMTOM4, SYMPTOM5, collapse
= ",")) %>%
distinct(VAERS_ID, Grp, .keep_all = TRUE)
This gives me the original data, plus another column labeled Grp containing all of the symptoms for each VAERS_ID pasted together, with a comma between each set.
Any help would be appreciated.
Your approach seems right, but since the data cannot be copied and tested, I am not able to reproduce your problem. Here are some suggested changes you can try.
You want all symptoms in one place for each VAERS_ID, which is a common real-world use case that I face often. If you don't need the original data in the output, simply use this:
data %>%
group_by(VAERS_ID) %>%
summarise(Symptoms = paste0(SYMPTOM1, SYMPTOM2, SYMPTOM3, SYPMTOM4, SYMPTOM5, collapse = ","))
With mutate you get the original data back, since it only adds a new column.
To address the warning about ungrouping, add %>% ungroup() at the end, or add .groups = "drop" within summarise.
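As an alternative sketch (not what the code above does), you could reshape the symptom columns to long form first, which drops the empty cells and puts a comma between every symptom. The toy `data` tibble and its two symptom columns are assumptions for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical miniature of the symptom table
data <- tibble(
  VAERS_ID = c(1, 1, 2),
  SYMPTOM1 = c("Fever", "Rash", "Cough"),
  SYMPTOM2 = c("Chills", NA, "Fatigue")
)

symptoms <- data %>%
  # One row per (VAERS_ID, symptom); missing symptoms are dropped
  pivot_longer(starts_with("SYMPTOM"),
               values_to = "symptom",
               values_drop_na = TRUE) %>%
  group_by(VAERS_ID) %>%
  # Collapse all symptoms of an ID into one comma-separated string
  summarise(Symptoms = paste(symptom, collapse = ","), .groups = "drop")
```

This gives exactly one row per VAERS_ID no matter how many rows each ID had originally.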
I have a calculation that I have to perform for 23 people (each person has a varying number of rows, so it is difficult to do in Excel). What I'd like to do is take the total time each person took to complete a test and divide it into five time bins (20% each) so that I can look at their reaction time in more detail.
I will just do this by hand but it will take quite a while because they have 8 sets of data each. I'm hoping someone can show me the best way to use a loop or automate this process even just a little. I have tried to understand the examples but I'm afraid I don't have the skill. So by hand I would do it like I have below where I just filter by each subject.
I started by selecting the relevant columns, then filtered by subject so that I could calculate the time they started and the time they finished and used that to create a variable (testDuration) that could be used to create the 20% proportions of RTs that I'm after. I have no idea how to get the individual subjects' test start, end, duration and timeBin sizes to appear in one column. Any help very gratefully received.
Subj1 <- rtTrialsYA_s1 %>%
  select(Subject, RetRating.OnsetTime, RetRating.RT, RetRating.RTTime) %>%
  filter(Subject == 1) %>%
  summarise(
    testStart = min(RetRating.OnsetTime),
    testEnd = max(RetRating.RTTime)
  ) %>%
  mutate(
    testDuration = testEnd - testStart,
    timeBin = testDuration / 5
  )
Subj2 <- rtTrialsYA_s1 %>%
  select(Subject, RetRating.OnsetTime, RetRating.RT, RetRating.RTTime) %>%
  filter(Subject == 2) %>%
  summarise(
    testStart = min(RetRating.OnsetTime),
    testEnd = max(RetRating.RTTime)
  ) %>%
  mutate(
    testDuration = testEnd - testStart,
    timeBin = testDuration / 5
  )
I'm not positive that I understand your code, but this function can be called for any Subject value and then return the output:
myfunction <- function(subjectNumber){
Subj <- rtTrialsYA_s1 %>%
select(Subject, RetRating.OnsetTime, RetRating.RT, RetRating.RTTime) %>%
filter(Subject==subjectNumber) %>%
summarise(testStart = min(RetRating.OnsetTime), testEnd = max(RetRating.RTTime)) %>%
mutate(testDuration = testEnd -testStart) %>%
mutate(timeBin = testDuration/5)
return(Subj)
}
Subj1 <- myfunction(1)
Subj2 <- myfunction(2)
To loop through this, I'll need to know what your data and the desired output look like.
I think you're missing one piece and that is simply dplyr::group_by.
You can use it as follows to break your dataset into groups, each containing the observations belonging to only one subject, and then summarise on those groups with whatever it is you want to analyze.
library(dplyr)
df <- rtTrialsYA_s1 %>%
group_by(Subject) %>%
summarise(
testStart = min(RetRating.OnsetTime),
testEnd = max(RetRating.RTTime),
testDuration = testEnd - testStart,
timeBin = testDuration/5,
.groups = "drop"
)
There is no need to do separate mutate calls in your code, btw. Also, you can continue to do column calculations right within summarise, as long as the result vectors have the same length as your aggregated columns.
And since summarise retains only the grouping columns and whatever you are defining, there is no real need to do a select statement before, either.
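If the end goal is to label each trial with the 20% bin it falls into, one possible next step is a grouped mutate instead of a summarise. This is only a sketch under assumptions: the toy `rt_toy` data, the `bin` column, and the choice to bin trials by their onset time are all mine, not taken from the question.

```r
library(dplyr)

# Hypothetical stand-in for rtTrialsYA_s1
rt_toy <- tibble(
  Subject = c(1, 1, 2, 2),
  RetRating.OnsetTime = c(0, 500, 100, 900),
  RetRating.RTTime = c(400, 1000, 600, 2100)
)

binned <- rt_toy %>%
  group_by(Subject) %>%
  mutate(
    testStart = min(RetRating.OnsetTime),
    timeBin = (max(RetRating.RTTime) - testStart) / 5,
    # Bin 1..5 by how far into the subject's test the trial started;
    # pmin() keeps a trial starting exactly at the end in bin 5
    bin = pmin(floor((RetRating.OnsetTime - testStart) / timeBin) + 1, 5)
  ) %>%
  ungroup()
```

Because of the group_by, min() and max() are computed per subject, so every subject gets their own bin width without any loop.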
// update
You say you need all your calculated columns to appear within one single column. For that you can use tidyr::pivot_longer. Using the df we calculated above:
library(tidyr)
df_long <- df %>%
pivot_longer(-Subject)
The code above takes all columns except Subject and pivots them into two columns: name, containing the former column name, and value, containing the former value.
I am trying to make country-level (by year) summaries of a long-form aggregated dataset that has individual-level data. I have tried using dplyr to summarize the average of the variable I am interested in to create a new dataset. However... there appears to be something wrong with my group_by because the answer is only one observation that appears to be the mean of every observation.
data named: "finaldata.giniE",
country variable: "iso3c",
year variable: "date",
individual-level variable of interest: "Ladder.Life.Present"
Note: there are more variables in my data-- could this be an issue?
country_summmary <- finaldata.giniE %>%
select(iso3c, date, Ladder.Life.Present) %>%
group_by(iso3c, date) %>%
summarize(averaged.M = mean(Ladder.Life.Present))
country_summmary
My output appears like this:
> country_summmary
averaged.M
1 5.505455
Thank you!
I actually just changed something and added your suggested code to the front and it worked! Here is the code that was able to work!
library(dplyr)
country_summary <- finaldata.gini %>%
group_by(iso3c, date) %>%
select(Ladder.Life.Present) %>%
summarise_each(funs(mean))
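A side note for later readers: summarise_each() and funs() are deprecated in current dplyr. Also, a single-row result despite group_by is commonly caused by plyr being loaded after dplyr, so that plyr's summarize masks dplyr's; I can't confirm that was the case here. Below is a sketch of the same summary with current functions, calling `dplyr::summarise` explicitly to rule out masking. The toy data is made up.

```r
library(dplyr)

# Hypothetical miniature of the individual-level data
toy <- tibble(
  iso3c = c("USA", "USA", "FRA"),
  date = c(2010, 2010, 2010),
  Ladder.Life.Present = c(6, 8, 5)
)

country_summary <- toy %>%
  group_by(iso3c, date) %>%
  # Namespace-qualified call, so a masking summarise/summarize cannot interfere
  dplyr::summarise(
    averaged.M = mean(Ladder.Life.Present, na.rm = TRUE),
    .groups = "drop"
  )
```

Extra columns in the data are not a problem; summarise keeps only the grouping columns and the columns you define.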
I am trying to understand the way the group_by function works in dplyr. I am using the airquality data set, which comes with the datasets package.
I understand that if I do the following, it should arrange the records in increasing order of the Temp variable:
airquality_max1 <- airquality %>% arrange(Temp)
I see that is the case in airquality_max1. I now want to arrange the records by increasing order of Temp but grouped by Month. So the end result should first have all the records for Month == 5 in increasing order of Temp. Then it should have all records of Month == 6 in increasing order of Temp and so on, so I use the following command
airquality_max2 <- airquality %>% group_by(Month) %>% arrange(Temp)
However, what I find is that the results are still in increasing order of Temp only, not grouped by Month, i.e., airquality_max1 and airquality_max2 are equal.
I am not sure why the grouping by Month does not happen before the arrange function. Can anyone help me understand what I am doing wrong here?
More than the problem of trying to sort the data frame by columns, I am trying to understand the behavior of group_by as I am trying to use this to explain the application of group_by to someone.
arrange ignores group_by; see the breaking changes in dplyr 0.5.0. If you need to order by two columns, you can do:
airquality %>% arrange(Month, Temp)
For a grouped data frame, you can also set the .by_group argument to TRUE to sort by the grouping variable first.
airquality %>% group_by(Month) %>% arrange(Temp, .by_group = TRUE)
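A quick way to see the three behaviours side by side, using only the built-in airquality data (the object names are just for illustration):

```r
library(dplyr)

# Sort by Month, then Temp, without any grouping
by_cols <- airquality %>%
  arrange(Month, Temp)

# Grouped, with .by_group = TRUE: Month is used as the first sort key
by_group <- airquality %>%
  group_by(Month) %>%
  arrange(Temp, .by_group = TRUE) %>%
  ungroup()

# Grouped, default .by_group = FALSE: the grouping is ignored by arrange
ignored <- airquality %>%
  group_by(Month) %>%
  arrange(Temp) %>%
  ungroup()

same_order <- identical(by_cols$Temp, by_group$Temp)
grouping_ignored <- identical(ignored$Temp, sort(airquality$Temp))
```

Both checks should come out TRUE: the first two pipelines order Temp identically, while the third is sorted by Temp alone.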