Getting Summary Data for Longitudinal Data in R

I have a set of longitudinal data: a number of patients followed up over several years at irregular time points. I am unable to post it due to confidentiality issues.
Essentially, each row represents a single patient encounter, with admission date, discharge date, patient identifier and various demographic (e.g. ethnicity) and other variables, e.g.:
Patient   Admission Date   Ethnicity
1         26-01-2007       White
1         28-08-2008       White
2         12-02-2001       Black
2         01-12-2015       Black
2         03-12-2018       Black
I've tried using various packages such as brolgar and tsibble, but I am unable to get simple summary statistics such as the number of individual patients, the number of encounters per patient, or the time from first to last attendance per patient in each ethnic category (this last one probably deserves another question as it's probably a lot more difficult).
In a standard dataset you could use dplyr to do something like:
df %>%
  group_by(Ethnicity) %>%
  summarise(n = n())
to count the number of patients per group.
But I'm not sure how to do it for this dataset, despite having gone through packages like brolgar/tsibble.
I would be grateful for any advice.
Thanks a lot.

I've done this:
library(dplyr)

df <- data.frame("Patient" = c(rep(1, 2), rep(2, 3)),
                 "Admission Date" = c("26-01-2007", "28-08-2008", "12-02-2001", "01-12-2015", "03-12-2018"),
                 "Ethnicity" = c(rep("White", 2), rep("Black", 3)),
                 stringsAsFactors = FALSE)  # data.frame() turns "Admission Date" into Admission.Date

# Number of distinct patients
individual_patient <- n_distinct(df$Patient)

# Number of encounters per patient
df2 <- df %>% group_by(Patient) %>% summarise(Encounter_number = n())

Are there other things you need to compute?
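To cover the remaining summaries from the question, here is a possible sketch (not tested on the real data; it assumes the dates are day-month-year strings, uses the Admission.Date column name produced above, and parses dates with lubridate):

library(dplyr)
library(lubridate)

df <- df %>% mutate(Admission.Date = dmy(Admission.Date))

# Number of distinct patients per ethnic category
df %>%
  group_by(Ethnicity) %>%
  summarise(n_patients = n_distinct(Patient))

# Encounters and time from first to last attendance per patient,
# keeping Ethnicity so the result can be summarised further by group
df %>%
  group_by(Ethnicity, Patient) %>%
  summarise(encounters = n(),
            follow_up_days = as.numeric(max(Admission.Date) - min(Admission.Date)),
            .groups = "drop")

Because each row is an encounter rather than a patient, n_distinct() rather than n() is what gives the patient count per group.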

Related

summarise function is not grouping the data by groups when used with group_by()

I have a large dataset of COVID-19 cases, with the number of cases for each date. These data are in the dat dataframe. I am trying to summarize the data by a variable which contains the ID of all districts and by the date variable (Meldedatum), but for some reason the output in the new data frame is just 1 row with the total cases for the entire period, and it is not grouped by the ID and date variables. I don't know why that is. I am adding a sample of the dataset to show what it looks like. Can someone help?
Sample of the data: there are more than 100,000 observations in total for 44 districts; I am just including a sample with 2 different districts and dates.
dat <- data.frame(Landkreis = c("Sk Stuttgart", "Sk Stuttgart", "Lk Freiburg", "Lk Freiburg"),
                  AnzahlFall = c(1, 1, 1, 1),
                  AnzahlTodesfall = c(0, 1, 2, 1),
                  Meldedatum = c("09-03-2020", "18-03-2020", "09-03-2020", "20-03-2020"),
                  IdLandkreis = c(8111, 8111, 8116, 8116))

datAggMelde <- dat %>%
  group_by(IdLandkreis, Meldedatum) %>%
  summarize(sumCount = sum(AnzahlFall, na.rm = TRUE),
            sumDeath = sum(AnzahlTodesfall, na.rm = TRUE),
            Landkreis = first(Landkreis))
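No answer is recorded here, but a very common cause of this exact symptom (a single row instead of one row per group) is that plyr is attached after dplyr, so plyr::summarise masks dplyr::summarise and ignores the grouping. A possible fix, assuming that is what is happening, is to call the dplyr verbs explicitly:

library(dplyr)

datAggMelde <- dat %>%
  dplyr::group_by(IdLandkreis, Meldedatum) %>%
  dplyr::summarize(sumCount = sum(AnzahlFall, na.rm = TRUE),   # dplyr:: in case plyr masks summarize
                   sumDeath = sum(AnzahlTodesfall, na.rm = TRUE),
                   Landkreis = dplyr::first(Landkreis),
                   .groups = "drop")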

How can I group a dataframe's observation 3 by 3?

I am struggling with a dataframe of exchange-rate observations taken 3 times a day for approximately 30 days, which means the dataframe currently has 90 observations. For the purpose of my research I need to reduce this to 1 observation per day (30 observations), possibly by taking the mean of every 3 observations. In sum, I need code that takes the observations 3 by 3 and outputs one observation for every 3. I have tried several different approaches but my attempts have all failed. I was wondering if someone has had to do something similar and managed it.
Thanks!
Use group_by and summarise like this:
library(tidyverse)
df <- tibble(
  day = rep(1:30, each = 3),   # three observations per day
  rate = rnorm(90)
)

df %>%
  group_by(day) %>%
  summarise(mrate = mean(rate))
P.S.
Please attach your data; it will be easier to help with specific data.
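If the real data has no explicit day column, one possible sketch (assuming the rows are already in chronological order and the value column is called rate) is to build the 3-by-3 grouping index from the row number:

library(dplyr)

df %>%
  mutate(day = (row_number() - 1) %/% 3 + 1) %>%   # 1,1,1,2,2,2,... for consecutive rows
  group_by(day) %>%
  summarise(mrate = mean(rate))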

How do I count the number of occurrences of a factor within another factor?

I am very new to R so please bear with me!
I have a dataset with moth species, names of people who recorded the moths (Recorders), the year in which they were recorded, etc.
I would like to create a new table in which I have the number of different moth recorders per year. So far I have managed to make a table that gives me the total recordings made per year, but it's not quite what I need.
Here is the code I have used, would anybody be able to offer amendments or perhaps alternative ways to go about this?
# create table with number of moth recorders per year
library(plyr)
diversity <- ddply(mydata4, c("Year"), summarise,
                   N = length(Recorder))
diversity
Thank you!
As you are new to R and actively learning by the sounds of it, I'll give you a nudge in the right direction. I've always found things stick best when I've figured them out myself, and I don't want to rob you of that.
So: It sounds like what you want is to have a count of the distinct recorders grouped by year. (Hint hint)
I suggest having a look at the dplyr and tidyr packages (for which there is a handy cheatsheet) as they are very useful for this sort of manipulation of data frames.
Also, as you are just picking up R, another useful thing worth taking a look at (though not relevant to your immediate problem) is the Tidyverse Code Style Guide.
For those looking to have the answer spelled out, see below. Look away now if you want to figure it out yourself.
The original question states that there is a data set with the following properties:
Moth Species
Name of person who Recorded it
Year the moth was Recorded in.
The code provided in the question was reported to produce a table of the total number of recordings made per year. From this we can infer that the original table has one row per recording.
The question also refers to two specific columns: Year and Recorder. From this information and the fact that the question mentioned the data set included moth species we can infer that the data set has at least three columns:
Species
Recorder
Year
So, let's make up some sample data:
mydata4 <- data.frame(
  Species = c("Red", "Blue", "Red", "Blue", "Green"),
  Year = c("2019", "2019", "2019", "2018", "2018"),
  Recorder = c("Alice", "Alice", "Bob", "Alice", "Alice")
)
Now, as I mentioned above, we desire a count of distinct Recorders grouped by year... so:
library(dplyr)
mydata4 %>% group_by(Year) %>% distinct(Recorder) %>% count()
We group by year, we make sure that the rows in each group are distinct by Recorder and finally we count the rows in each group, as by this point we have made sure that each group only has one row per Recorder who recorded at least one moth in that year.
# A tibble: 2 x 2
# Groups:   Year [2]
  Year      n
  <fct> <int>
1 2018      1
2 2019      2
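An equivalent, slightly more compact option (a sketch using the same assumed column names) is to count distinct recorders directly inside summarise:

library(dplyr)

mydata4 %>%
  group_by(Year) %>%
  summarise(N_recorders = n_distinct(Recorder))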

R: Create column showing days leading up to/ since the maximum value in another column was reached?

I have a dataset with repeated measures: measurements nested within participants (ID) nested in groups. A variable G (with range 0-100) was measured on the group-level. I want to create a new column that shows:
The first day on which the maximum value of G was reached in a group, coded as zero.
How many days each measurement (in this same group) occurred before or after the day on which the maximum was reached. For example: a measurement taken 2 days before the maximum is then coded -2, and a measurement 5 days after the maximum is coded as 5.
Here is an example of what I'm aiming for (shown as an image in the original post).
I highlighted the days on which the maximum value of G was reached in the different groups. The column 'New' is what I'm trying to get.
I've been trying with dplyr and I managed to get for each group the maximum with group_by, arrange(desc), slice. I then recoded those maxima into zero and joined this dataframe with my original dataframe. However, I cannot manage to do the 'sequence' of days leading up to/ days from the maximum.
EDIT: sorry I didn't include a reprex. I used this code so far:
To find the maximum value, first order the data (by G, then Date):
data <- data[with(data, order(G, Date)), ]
Find maximum and join with original data:
data2 <- data %>%
  dplyr::group_by(Group) %>%
  arrange(desc(G), .by_group = TRUE) %>%
  slice(1) %>%
  ungroup()

data2$New <- data2$G
data2 <- data2 %>%
  dplyr::select(c("ID", "New", "Date"))

data3 <- full_join(data, data2, by = c("ID", "Date"))
data3$New[!is.na(data3$New)] <- 0
This gives me the maxima coded as zero and all the other measurements in column New as NA but not yet the number of days leading up to this, and the number of days since. I have no idea how to get to this.
It would help if you would be able to provide the data using dput() in your question, as opposed to using an image.
It looked like you wanted to group_by(Group) in your example to compute the number of days before and after the maximum date in a Group. However, you have an ID of 3 with a Group of A, which suggests otherwise and maybe could be clarified.
Here is one approach using tidyverse I hope will be helpful. After grouping and arranging by Date, you can look at the difference in dates comparing to the Date where G is maximum (the first maximum detected in date order).
Also note, as.numeric is included to provide a number, as the result for New is a difftime (e.g., "7 days").
library(tidyverse)
data %>%
  group_by(Group) %>%
  arrange(Date) %>%
  mutate(New = as.numeric(Date - Date[which.max(G)]))
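Since the original data are only available as an image, here is a small made-up example (all values are assumptions) to illustrate the approach:

library(tidyverse)

data <- tibble(
  ID    = c(1, 1, 1, 1, 2),
  Group = c("A", "A", "A", "A", "B"),
  Date  = as.Date(c("2021-01-01", "2021-01-03", "2021-01-05", "2021-01-08", "2021-01-02")),
  G     = c(10, 50, 50, 20, 30)
)

data %>%
  group_by(Group) %>%
  arrange(Date) %>%
  mutate(New = as.numeric(Date - Date[which.max(G)]))
# In group A the first maximum of G falls on 2021-01-03, so New is -2, 0, 2, 5;
# the single row in group B is its own maximum, so New is 0.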

How to mutate variables on a rolling time window by groups with unequal time distances?

I have a large df with around 40,000,000 rows, covering a total time period of 2 years and more than 400k unique users.
The time variable is formatted as POSIXct and I have a unique user_id per user. I observe each user over several points in time.
Each row is therefore a unique combination of user_id, time and a set of variables.
Based on a set of dummy variables (df$v1, df$v2), a category variable (df$category_var) and the time variable (df$time_var), I now want to calculate 3 new variables on a user_id level over a rolling time window of the previous 30 days.
So in each row, the new variable should be calculated over the values of the previous 30 days of the input variables.
I do not observe all users over the same time period, some enter later some leave earlier, also the distances between times are not equal, therefore I can not calculate the variables just by number of rows.
So far I only managed to calculate my new variables per user_id over the whole observation period, but I couldn’t achieve to calculate the variables for the previous 30 days rolling window per user.
After checking and trying all the related posts here, I assume a data.table solution is the most suitable, but since I have so far mainly worked with dplyr, my attempts to calculate these variables on a rolling time window at a group_by user_id level have taken more than a week without any results. I would be so grateful for your support!
My df basically looks like:
user_id <- c(1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3)
time_var <- c(1, 2, 3, 4, 5, 1.5, 2, 3, 4.5, 1, 2.5, 3, 4, 5)  # first value assumed; garbled in the original post
category_var <- c("A", "A", "B", "B", "A", "A", "C", "C", "A", …)
v1 <- c(0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, …)
v2 <- c(1, 1, 0, 1, 0, 1, 1, 0, …)
My first needed new variable (new_x1) is basically a cumulative sum based on a condition in dummy variable v1. What I achieved so far:
df <- df %>% group_by(user_id) %>% mutate(new_x1 = cumsum(v1 == 1))
What I need: this variable counting only over the previous 30 days per user.
Needed new variable (new_x2): Basically cumulative count of v1 if v2 has a (so far) unique value. So for each new value in v2 given v1==1, count.
What I achieved so far:
df <- df %>%
  group_by(user_id, category_var) %>%
  mutate(new_x2 = cumsum(!duplicated(v2) & v1 == 1))
I also need this based on the previous 30 days and not the whole observation period per user.
My third variable of interest (new_x3):
The time between two observations given a certain condition (v1==1)
# Inter-event time
df2 <- df %>%
  group_by(user_id) %>%
  filter(v1 == 1) %>%
  mutate(time_between_events = time - lag(time))
I would also need this over the previous 30 days.
Thank you so much!
Edit after John Springs' post:
My potential solution would then be
setDT(df)[, `:=`(new_x1 = cumsum(df$v1 == 1[df$user_id == user_id][between(df$time[df$user_id == user_id], time - 30, time, incbounds = TRUE)]),
                 new_x2 = cumsum(!duplicated(df$v1 == 1[df$user_id == user_id][between(df$time[df$user_id == user_id], time - 30, time, incbounds = TRUE)]))),
          by = eval(c("user_id", "time"))]
I'm really not familiar with data.table and am not sure if I can nest my conditions on cumsum in data.table like that.
Any suggestions?
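No answer is recorded here; one possible sketch (untested, using a simplified numeric time and the column names from the reprex above, all of which are assumptions rather than the original data) is a non-equi self-join in data.table, which gathers each user's rows from the preceding 30 time units and summarises them per row:

library(data.table)

# Assumed toy data mirroring the reprex above (numeric time; with POSIXct
# you would subtract 30 * 24 * 60 * 60 instead of 30)
df <- data.table(user_id  = c(1, 1, 1, 2, 2),
                 time_var = c(1, 20, 40, 5, 10),
                 v1       = c(1, 1, 1, 0, 1),
                 v2       = c(1, 2, 2, 1, 1))

df[, window_start := time_var - 30]

# For every row (i), join back onto the table (x), keeping the same user's rows
# whose time falls in [window_start, time_var], then summarise per i-row
rolling <- df[df,
              on = .(user_id, time_var >= window_start, time_var <= time_var),
              .(new_x1 = sum(v1 == 1),                      # events in the 30-day window
                new_x2 = sum(!duplicated(v2) & v1 == 1)),   # mirrors the cumsum(!duplicated(...)) logic
              by = .EACHI]

df[, `:=`(new_x1 = rolling$new_x1, new_x2 = rolling$new_x2)]
df

new_x3 (time since the previous v1 == 1 event) could then be handled with shift() within user_id on the v1 == 1 rows, keeping only gaps of at most 30.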
