Loop for aggregated data frame in R - r

I have a data frame with 58 columns labeled SD1 through to SD58 along with columns for date info (Date, Year, Month, Day).
I'm trying to find the date of the maximum value of each of the SD columns each year using the following code:
maxs<-aggregate(SD1~Year, data=SDtime, max)
SDMax<-merge(maxs,SDtime)
I only need the dates so I made a new df and relabeled the column as below:
SD1Max = subset(SDMax, select = c(Year, Date))
SD1Max %>%
rename(
SD1=Date
)
I want to do the same thing for every SD column but I don't want to have to repeat these steps 58 times. Is there a way to loop the process?

Assuming there are no ties (multiple days on which the variable reached its maximum), this probably does what you want:
library('tidyverse')
SDtime %>%
pivot_longer(
cols = matches('^SD[0-9]{1,2}$')
) %>%
group_by(Year, name) %>% # the maximum is wanted per year and per SD column
filter(value == max(value, na.rm = TRUE)) %>%
ungroup()
You might want to pivot_wider afterwards.
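If ties are possible, or you want one column of dates per SD variable, a minimal sketch along the same lines might look like this. The toy SDtime below is made up for illustration (two SD columns instead of 58), and slice_max(..., with_ties = FALSE) is swapped in for the filter() step so that each Year/column pair keeps exactly one row.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data standing in for SDtime (only two SD columns shown)
SDtime <- tibble(
  Date = as.Date(c("2020-01-01", "2020-06-01", "2021-01-01", "2021-06-01")),
  Year = c(2020, 2020, 2021, 2021),
  SD1  = c(1, 5, 7, 2),
  SD2  = c(9, 3, 4, 4)
)

max_dates <- SDtime %>%
  pivot_longer(cols = matches("^SD[0-9]{1,2}$")) %>%   # one row per Date and SD column
  group_by(Year, name) %>%
  slice_max(value, n = 1, with_ties = FALSE) %>%       # exactly one row per Year/column, even on ties
  ungroup() %>%
  select(Year, name, Date) %>%
  pivot_wider(names_from = name, values_from = Date)   # one date column per SD variable

max_dates
```

The result has one row per Year and one column per SD variable, each holding the date of that year's maximum.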

Related

Counting occurrence of diagnosis code across multiple columns in large R dataset

I'm using two years of NIS data (already combined) to search for a diagnosis code across all of the DX columns. The columns run from I10_DX1 to I10_DX40 (which are columns #18-57). I want to create a new dataset containing the observations that have this diagnosis code in any of these columns.
I've tried loops and the ICD packages but haven't been able to get it right. Most recently tried code as follows:
get_icd_labels(icd3 = c("J80"), year = 2018:2019) %>%
arrange(year, icd_sub) %>%
filter(icd_sub %in% c("J80")) %>%
select(year, icd_normcode, label) %>%
knitr::kable(row.names = FALSE)
This is a tidyverse (dplyr) solution. If you don't already have a unique id for each record, I'd start out by adding one.
df <-
df %>%
mutate(my_id = row_number())
Next, I'd gather the diagnosis codes into a table where each record is a single diagnosis.
diagnoses <-
df %>%
select(my_id, 18:57) %>%
gather("diag_num","diag_code",2:ncol(.)) %>%
filter(!is.na(diag_code)) #No need to keep a bunch of empty rows
Finally, I would join my original df to the diagnoses data frame and filter for the code I want.
df %>%
inner_join(diagnoses, by = "my_id") %>%
filter(diag_code == "J80")
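Note that the join above returns a record once for every column in which the code appears. If you only want each matching record once, a shorter sketch using dplyr's if_any() (available in dplyr 1.0.4 and later) could look like the following. The toy data frame and its KEY column are hypothetical, with three diagnosis columns instead of forty; selecting the columns by position with 18:57 instead of starts_with() should work the same way.

```r
library(dplyr)

# Hypothetical toy data: three diagnosis columns instead of forty
df <- tibble(
  KEY     = 1:4,
  I10_DX1 = c("J80", "A419", "I10", NA),
  I10_DX2 = c(NA, "J80", "E119", NA),
  I10_DX3 = c(NA, "J80", NA, NA)
)

# Keep a row if any DX column equals the code; %in% treats NA as "no match"
j80_cases <- df %>%
  filter(if_any(starts_with("I10_DX"), ~ .x %in% "J80"))

j80_cases
```

Record 2 has the code in two columns but still appears only once in j80_cases.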

How to mutate new columns in R based on earliest and latest dates for other variables

In a dataset where each patient had multiple test administrations and a score on each test date, I have to identify the earliest & latest test dates, then subtract the difference of the scores of those dates. I think I've identified the first & last dates through dplyr, creating new columns for those:
SplitDates <- SortedDates %>%
group_by(PatientID) %>%
mutate(EarliestTestDate = min(AdministrationDate),
LatestTestDate = max(AdministrationDate)) %>%
arrange(desc(PatientID))
Score column is TotalScore
Now how do I extract the scores from these 2 dates (for each patient) to create new columns of earliest & latest scores? Haven't been able to figure out a mutate with case_when or if_else to create a score based on a record with a certain date.
Have you tried using a join verb, such as left_join?
SplitDates <- SortedDates %>%
group_by(PatientID) %>%
mutate(EarliestTestDate = min(AdministrationDate),
LatestTestDate = max(AdministrationDate)) %>%
ungroup() %>%
left_join(select(SortedDates, PatientID, AdministrationDate, EarliestScore = TotalScore),
by = c("PatientID", "EarliestTestDate" = "AdministrationDate")) %>% # picking the score of EarliestTestDate
left_join(select(SortedDates, PatientID, AdministrationDate, LatestScore = TotalScore),
by = c("PatientID", "LatestTestDate" = "AdministrationDate")) %>% # picking the score of LatestTestDate
arrange(desc(PatientID)) # now you can do the mutate that you need, e.g. LatestScore - EarliestScore
I suggest you take a look at the dplyr cheatsheet.
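If one row per patient is enough, a sketch without joins is also possible: index TotalScore by the position of the earliest and latest date inside summarise(). The toy SortedDates below is invented for illustration, and the names EarliestScore, LatestScore and ScoreDifference are my own. It also assumes each patient has a single record on the earliest and on the latest date.

```r
library(dplyr)

# Hypothetical toy data standing in for SortedDates
SortedDates <- tibble(
  PatientID = c(1, 1, 1, 2, 2),
  AdministrationDate = as.Date(c("2020-01-10", "2020-03-05", "2020-06-20",
                                 "2021-02-01", "2021-04-15")),
  TotalScore = c(10, 14, 18, 30, 25)
)

ScoreChange <- SortedDates %>%
  group_by(PatientID) %>%
  summarise(
    EarliestTestDate = min(AdministrationDate),
    LatestTestDate   = max(AdministrationDate),
    # which.min()/which.max() give the row position of the earliest/latest date
    EarliestScore    = TotalScore[which.min(AdministrationDate)],
    LatestScore      = TotalScore[which.max(AdministrationDate)],
    ScoreDifference  = LatestScore - EarliestScore
  )

ScoreChange
```

Use mutate() instead of summarise() if you would rather keep every test record and repeat the per-patient values on each row.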

Choose top n variables in R when matching values

I have a large time series dataset, and would like to choose the top 10 observations from each date based on the values in one of my columns.
I am able to do this using group_by(Date) %>% top_n(10)
However, if the values for the 10th and 11th observation are equal, then they are both picked, so that I get 11 observations instead of 10.
Does anyone know what I can do to make sure that only 10 observations are chosen?
You can arrange the data and select first 10 rows in each group.
library(dplyr)
df %>% arrange(Date, desc(col_name)) %>% group_by(Date) %>% slice(1:10)
Similarly, with filter
df %>%
arrange(Date, desc(col_name)) %>%
group_by(Date) %>%
filter(row_number() <= 10)
With data.table you can do
library(data.table)
setDT(df)
df[order(Date, -value)][, .SD[1:10], by = Date]
Change value to the name of the column you rank by; the row order decides which observation is kept in case of ties. You can also do:
df[order(Date, -value)][, head(.SD, 10), by = Date]
We can use base R
df1 <- df[with(df, order(Date, -value)),]
df1[with(df1, ave(seq_along(Date), Date, FUN = seq_along) <= 10), ]
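In current dplyr (1.0.0 and later), top_n() is superseded by slice_max(), which has a with_ties argument for exactly this case. A small sketch, using a made-up df with a value column and n = 2 rather than 10 so the effect is visible:

```r
library(dplyr)

# Hypothetical toy data: the first date has a three-way tie for the top value
df <- tibble(
  Date  = as.Date(c(rep("2020-01-01", 4), rep("2020-01-02", 3))),
  value = c(5, 9, 9, 9, 1, 2, 3)
)

top2 <- df %>%
  group_by(Date) %>%
  slice_max(value, n = 2, with_ties = FALSE) %>%  # never more than n rows per group
  ungroup()

top2
```

With with_ties = TRUE (the default) the first date would return three rows; with FALSE each date returns exactly two.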

Counting the rows based on two other column values, and manipulate the value in a loop through one of these column values in R

There are three columns: website, Date ("%Y %m"), click_tracking (T/F). I would like to add a variable describing the number of websites whose click tracking = T in each month / the number of all websites in that month.
I thought the steps would be something like:
aggregate(sum(df$click_tracking = TRUE), by=list(Category=df$Date), FUN = sum)
as.data.frame(table(Date))
Then somehow loop through Date and divide the two variables above which would have been already grouped by Date. How can I achieve this? Many thanks!
If we are creating a column, then group by 'Date' and get the sum of 'click_tracking' (assuming it is a logical column - TRUE/FALSE) in mutate
library(dplyr)
df %>%
group_by(Date) %>%
mutate(countTRUE = sum(click_tracking))
If the column is factor, convert to logical with as.logical
df %>%
group_by(Date) %>%
mutate(countTRUE = sum(as.logical(click_tracking)))
If it is to create a summarised output
df %>%
group_by(Date) %>%
summarise(countTRUE = sum(click_tracking))
In the OP's code, = (assignment) is used instead of == in sum(df$click_tracking = TRUE) and there is no need to do a comparison on a logical column
aggregate(cbind(click_tracking = as.logical(click_tracking)) ~ Date, data = df, FUN = sum)
This will create the proportion of websites with click tracking (out of all websites) per month.
aggregate(data=df, click_tracking ~ Date, mean)
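For completeness, here is a dplyr sketch that returns the count, the total and the ratio the question asks for in one step. The toy df is invented, and it assumes one row per website per month with click_tracking stored as logical; the column names n_tracking, n_websites and prop_tracking are my own.

```r
library(dplyr)

# Hypothetical toy data: one row per website per month
df <- tibble(
  website        = c("a", "b", "c", "a", "b"),
  Date           = c("2020 01", "2020 01", "2020 01", "2020 02", "2020 02"),
  click_tracking = c(TRUE, FALSE, TRUE, FALSE, FALSE)
)

monthly <- df %>%
  group_by(Date) %>%
  summarise(
    n_tracking    = sum(click_tracking),   # websites with click tracking
    n_websites    = n(),                   # all websites in that month
    prop_tracking = mean(click_tracking)   # same as n_tracking / n_websites
  )

monthly
```

If a website can appear more than once in a month, n_distinct(website) would be a safer denominator than n().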

Adding Hourly Gaps into a Frequency Table

I am trying to create two frequency tables, one that is daily, and one that is hourly. I am able to get the daily values fairly easily.
C<-Data
C$Data<-format(C$Data, "%m/%d/%Y")
Freq_Day<- C %>% group_by(Data) %>% summarise(frequency = n())
However when I try to get the hourly frequency by doing the following
B<-Data
B$Data<-format(B$Data,"%m/%d/%Y %H:%M")
Freq_HRLY<-B %>% group_by(Data) %>% summarise(frequency = n())
It omits hours that simply did not occur in the data set, so it returns a column shorter than (# of days) * 24. How would I go about getting a column of dates in one-hour increments with their corresponding frequency, such that an hour with no occurrence in 'Data' just has a value of 0?
One way is to use tidyr::complete on the already-computed Freq_HRLY to fill in the missing hours, by generating an hourly sequence between the minimum and maximum of Data.
library(dplyr)
Freq_HRLY %>%
ungroup() %>%
mutate(Data = as.POSIXct(Data, format = "%m/%d/%Y %H:%M")) %>%
tidyr::complete(Data = seq(min(Data), max(Data), by = "1 hour"),
fill = list(frequency = 0))
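One caveat: the "%m/%d/%Y %H:%M" format in the question groups by minute, so the hourly sequence only lines up if every timestamp falls exactly on the hour. A sketch that truncates to the hour first, then counts and completes, might look like this. The events data frame and the Hour column are hypothetical, and the time zone is fixed to UTC just to keep the example reproducible.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: a data frame with a POSIXct column called Data
events <- tibble(
  Data = as.POSIXct(c("2021-03-01 00:15", "2021-03-01 00:40", "2021-03-01 03:05"),
                    tz = "UTC")
)

hourly <- events %>%
  mutate(Hour = as.POSIXct(trunc(Data, units = "hours"))) %>%  # drop minutes and seconds
  count(Hour, name = "frequency") %>%                          # occurrences per observed hour
  complete(Hour = seq(min(Hour), max(Hour), by = "1 hour"),    # add the hours with no rows
           fill = list(frequency = 0L))

hourly
```

Here the hours 01:00 and 02:00 have no events and come back with a frequency of 0.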
