Calculate Difference between dates by group in R

I'm using a logistic exposure to calculate hatching success for bird nests. My data set is quite extensive and I have ~2,000 nests, each with a unique ID ("ClutchID"). I need to calculate the number of days a given nest was exposed ("Exposure"), or more simply, the difference between the 1st and last day. I used the following code:
HS_Hatch$Exposure <- NA
for (i in 2:nrow(HS_Hatch)) {
  HS_Hatch$Exposure[i] <- HS_Hatch$DateVisit[i] - HS_Hatch$DateVisit[i - 1]
}
where HS_Hatch is my dataset and DateVisit is the actual date. The only problem is that R also calculates an exposure value for the 1st date of each nest, using the previous nest's last visit (which doesn't make sense).
What I really need is to calculate the difference between the 1st and last date for a given clutch. I've also looked into the following:
Exposure <- ddply(HS_Hatch, "ClutchID", summarize,
                  orderfrequency = as.numeric(diff.Date(DateVisit)))

df %>%
  mutate(Exposure = as.Date(HS_Hatch$DateVisit, "%Y-%m-%d")) %>%
  group_by(ClutchID) %>%
  arrange(Exposure) %>%
  mutate(lag = lag(DateVisit), difference = DateVisit - lag)
I'm still learning R so any help would be greatly appreciated.
Edit:
Below is a sample of the data I'm using
HS_Hatch <- structure(list(ClutchID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L
), DateVisit = c("3/15/2012", "3/18/2012", "3/20/2012", "4/1/2012",
"4/3/2012", "3/18/2012", "3/20/2012", "3/22/2012", "4/3/2012",
"4/4/2012", "3/22/2012", "4/3/2012", "4/4/2012", "3/18/2012",
"3/20/2012", "3/22/2012", "4/2/2012", "4/3/2012", "4/4/2012",
"3/20/2012", "3/22/2012", "3/25/2012", "3/27/2012", "4/4/2012",
"4/5/2012"), Year = c(2012L, 2012L, 2012L, 2012L, 2012L, 2012L,
2012L, 2012L, 2012L, 2012L, 2012L, 2012L, 2012L, 2012L, 2012L,
2012L, 2012L, 2012L, 2012L, 2012L, 2012L, 2012L, 2012L, 2012L,
2012L), Survive = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -25L), .Names = c("ClutchID",
"DateVisit", "Year", "Survive"), spec = structure(list(cols = structure(list(
ClutchID = structure(list(), class = c("collector_integer",
"collector")), DateVisit = structure(list(), class = c("collector_character",
"collector")), Year = structure(list(), class = c("collector_integer",
"collector")), Survive = structure(list(), class = c("collector_integer",
"collector"))), .Names = c("ClutchID", "DateVisit", "Year",
"Survive")), default = structure(list(), class = c("collector_guess",
"collector"))), .Names = c("cols", "default"), class = "col_spec"))

Collecting some of the comments...
Load dplyr
We only need the dplyr package for this problem. Loading other packages as well, e.g. plyr, can cause conflicts when both packages have functions with the same name, so let's load only dplyr.
library(dplyr)
In the future, you may wish to load tidyverse instead -- it includes dplyr and other related packages, for graphics, etc.
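For example, once the tidyverse packages are installed, a single call attaches the core set:
library(tidyverse)  # attaches dplyr, tidyr, ggplot2, readr, and other core tidyverse packages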
Converting dates
Let's convert the DateVisit variable from character strings to something R can interpret as a date. Once we do this, it allows R to calculate differences in days by subtracting two dates from each other.
HS_Hatch <- HS_Hatch %>%
  mutate(date_visit = as.Date(DateVisit, "%m/%d/%Y"))
The date format %m/%d/%Y is different from your original code. This date format needs to match how dates look in your data. DateVisit has dates as month/day/year, so we use %m/%d/%Y.
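As a quick sanity check on a single value (the input string here is just for illustration):
as.Date("3/15/2012", format = "%m/%d/%Y")
# [1] "2012-03-15"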
Also, you don't need to specify the dataset for DateVisit inside mutate, as in HS_Hatch$DateVisit, because it's already looking in HS_Hatch. The code HS_Hatch %>% ... says 'use HS_Hatch for the following steps'.
Calculating exposures
To calculate exposure, we need to find the first date, last date, and then the difference between the two, for each set of rows by ClutchID. We use summarize, which collapses the data to one row per ClutchID.
exposure <- HS_Hatch %>%
  group_by(ClutchID) %>%
  summarize(first_visit = min(date_visit),
            last_visit = max(date_visit),
            exposure = last_visit - first_visit)
first_visit = min(date_visit) will find the minimum date_visit for each ClutchID separately, since we are using group_by(ClutchID).
exposure = last_visit - first_visit takes the newly-calculated first_visit and last_visit and finds the difference in days.
This creates the following result:
ClutchID first_visit last_visit exposure
<int> <date> <date> <dbl>
1 1 2012-03-15 2012-04-03 19
2 2 2012-03-18 2012-04-04 17
3 3 2012-03-22 2012-04-04 13
4 4 2012-03-18 2012-04-04 17
5 5 2012-03-20 2012-04-05 16
If you want to keep all the original rows, you can use mutate in place of summarize.
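A sketch of that variant, also wrapping the result in as.numeric() in case you want exposure as a plain number of days rather than a difftime:
HS_Hatch <- HS_Hatch %>%
  group_by(ClutchID) %>%
  mutate(exposure = as.numeric(max(date_visit) - min(date_visit))) %>%
  ungroup()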

Here is a similar solution if you want a difftime result in days, computed from a date column, without producing NA values in the new column, and grouping by several conditions/groups. Because diff() returns one fewer value than it is given, c(0, diff(date)) pads the first row of each group with 0 instead of NA.
Make sure your date column has been converted to the Date format, as explained above.
dat2 <- dat %>%
  select(group1, group2, date) %>%
  arrange(group1, group2, date) %>%
  group_by(group1, group2) %>%
  mutate(diff_date = c(0, diff(date)))
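A minimal, self-contained illustration; dat, group1, group2 and date are hypothetical names standing in for your own columns:
library(dplyr)
dat <- tibble(
  group1 = c("A", "A", "A", "B", "B"),
  group2 = c("x", "x", "x", "y", "y"),
  date   = as.Date(c("2012-03-15", "2012-03-18", "2012-03-20",
                     "2012-03-18", "2012-03-22"))
)
dat %>%
  arrange(group1, group2, date) %>%
  group_by(group1, group2) %>%
  mutate(diff_date = c(0, diff(date)))
# diff_date is 0 on the first visit of each group, then the gap in days from the previous visit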

Related

Count the number of times a string appears in a column

Can you think of an intuitive way to count the number of times the word "space" appears in a certain column? Any other viable solution would also work.
I basically want to know how many times the space key was pressed; however, some participants pressed other keys by mistake, which should also be counted as a mistake. So I was wondering whether I should instead use the "key_resp.rt" column and count the number of response times. If you have any idea how to do both, that would be great, as I may need to use both.
I used the following code but the results do not conform to the data.
Data %>%
  group_by(Participant, Session) %>%
  summarise(false_start = sum(str_count(key_resp.keys, "space")))
Here is a snippet of my data:
Participant RT Session key_resp.keys key_resp.rt
X 0.431265 1 ["space"] [2.3173399999941466]
X 0.217685 1
X 0.317435 2 ["space","space"] [0.6671900000001187,2.032510000000002] 2020.1.3 4
Y 0.252515 1
Y 0.05127 2 ["space","space","space","space","space","space","space","space","space"] [4.917419999999765,6.151149999999689,6.333714999999771,6.638249999999971,6.833514999999338,7.0362499999992,7.217724999999504,7.38576999999988,7.66913999999997]
dput(droplevels(head(Data_PVT)))
structure(list(Interval_stimulus = c(4.157783411, 4.876139922,
5.67011868, 9.338167417, 9.196342656, 7.62448411), Participant = structure(c(1L,
1L, 1L, 1L, 1L, 1L), .Label = "ADH80254", class = "factor"),
RT = c(431.265, 277.99, 253.515, 310.53, 299.165, 539.46),
Session = c(1L, 1L, 1L, 1L, 1L, 1L), date = structure(c(1L,
1L, 1L, 1L, 1L, 1L), .Label = "2020-06-12_11h11.47.141", class = "factor"),
key_resp.keys = structure(c(2L, 1L, 1L, 1L, 1L, 1L), .Label = c("",
"[\"space\"]"), class = "factor"), key_resp.rt = structure(c(2L,
1L, 1L, 1L, 1L, 1L), .Label = c("", "[2.3173399999941466]"
), class = "factor"), psychopyVersion = structure(c(1L, 1L,
1L, 1L, 1L, 1L), .Label = "2020.1.3", class = "factor"),
Trials = 0:5, Reciprocal = c(2.31875992719094, 3.59725169970143,
3.94453977082224, 3.22030077609249, 3.3426370063343, 1.85370555740926
)), row.names = c(NA, 6L), class = "data.frame")
Expected output:
Participant Session false_start
x 1 0
x 2 1
y 1 2
y 2 1
z 1 10
z 2 3
We can use str_count to count occurrences of "space" in key_resp.keys for each Participant and Session, and sum them to get false_start. For all_false_start we count every word (i.e. every key name) in the column.
library(dplyr)
library(stringr)
df %>%
  group_by(Participant, Session) %>%
  summarise(false_start = sum(str_count(key_resp.keys, '\\bspace\\b')),
            all_false_start = sum(str_count(key_resp.keys, '\\b\\w+\\b')))
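If you also want a count based on key_resp.rt instead, one hedged sketch is to count the numeric entries inside each bracketed list; this assumes the column holds strings (or factors) that look like "[2.317..., 6.151...]":
df %>%
  group_by(Participant, Session) %>%
  summarise(n_responses = sum(str_count(as.character(key_resp.rt), '\\d+\\.?\\d*')))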

Reduce the range of time in sequence analysis with R

I have sequences that happen over a very long period of time. I tried 8 different algorithms to classify my sequences (OM, CHi2, ...). Time goes from 1 to 123. I have 110 individuals and 8 events.
My results are not as expected. First, they are very difficult to read. Second, one category contains too many representative sequences (group 3). Third, the number of sequences per group is really unbalanced.
It may come from the fact that my time variable has a range of 123. I searched for articles dealing with a time range that is too long. I read in Sabherwal and Robey (1993) and in Shi and Prescott (2011) that you can standardize "each sequence by dividing the number of transformations required by the length of the longer sequence". How can I do that in R?
Please find underneath a description of my data:
library(TraMineRextras)
head(seq.tse.data)
seq.tse.data <- structure(list(
ID = c(1L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L,
4L, 4L, 4L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 6L, 6L,
6L, 6L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L),
Year = c(2008L, 2010L, 2012L, 2007L, 2009L, 2010L, 2012L,
2013L, 1996L, 1997L, 1999L, 2003L, 2006L, 2008L,
2012L, 2007L, 2007L, 2008L, 2003L, 2007L, 2007L,
2009L, 2009L, 2011L, 2014L, 2016L, 2006L, 2009L,
2011L, 2013L, 2013L, 2015L, 2015L, 2016L),
Event = c(5L, 4L, 5L, 3L, 1L, 5L, 5L, 5L, 3L,3L,3L,3L,3L,5L, 1L, 5L,
5L,5L,4L,5L, 5L, 5L, 5L, 5L, 5L,5L,5L,5L, 4L, 4L, 1L, 4L, 1L,5L)),
class = "data.frame", row.names = c(NA, -34L)
)
seq.sts <- TSE_to_STS(seq.tse.data,
id = 1, timestamp = 2, event = 3,
stm =NULL, tmin = 1935, tmax = 2018,
firstState = "None")
seq.SPS <- seqformat(seq.sts, 1:84, from = "STS", to = "SPS")
seq.obj <- seqdef(seq.SPS)
> head(seq.tse.data)
ID Year Event
1 1 2008 5
2 2 2010 4
3 2 2012 5
4 3 2007 3
5 3 2009 1
6 3 2010 5
> head(seq.obj)
Sequence
[1] (None,74)-(5,10)-1
[2] (None,76)-(4,2)-(5.4,6)-2
[3] (None,73)-(3,2)-(3.1,1)-(5.3.1,8)-3
[4] (None,62)-(3,12)-(5.3,4)-(5.3.1,6)-3
[5] (None,73)-(5,11)-1
[6] (None,69)-(4,4)-(5.4,11)-2
> head(alphabet(seq.obj),10)
[1] "(1,1)" "(1,10)" "(1,11)" "(1,12)" "(1,14)" "(1,19)" "(1,2)" "(1,21)" "(1,25)" "(1,3)"
...
[145] "(5.4.3.1,5)" "(5.4.3.1,6)" "(5.4.3.1,7)" "(5.4.3.1,8)" "(5.4.3.1.2,9)" "(None,1)" "(None,11)" "(None,20)"
[153] "(None,26)" "(None,30)" "(None,38)" "(None,41)" "(None,42)" "(None,44)" "(None,45)" "(None,49)"
[161] "(None,51)" "(None,53)" "(None,55)" "(None,57)" "(None,58)" "(None,59)" "(None,60)" "(None,61)"
[169] "(None,62)" "(None,64)" "(None,65)" "(None,66)" "(None,67)" "(None,68)" "(None,69)" "(None,7)"
[177] "(None,70)" "(None,71)" "(None,72)" "(None,73)" "(None,74)" "(None,75)" "(None,76)" "(None,77)"
[185] "(None,78)" "(None,79)"
Thanks in advance,
Antonin
I guess that your question is about normalizing the dissimilarities between sequences. For example, Sabherwal and Robey (1993, p. 557) refer to the distance standardization proposed by Abbott & Hrycak (1990) and do not consider the standardization of a sequence at all. In any case, I cannot figure out what the standardization of a sequence itself would mean.
The seqdist function of TraMineR has a norm argument that can be used to normalize some of the distance measures proposed. Here is an excerpt from the seqdist help page:
Distances can optionally be normalized by means of the norm argument.
If set to "auto", Elzinga's normalization (similarity divided by
geometrical mean of the two sequence lengths) is applied to "LCS",
"LCP" and "RLCP" distances, while Abbott's normalization (distance
divided by length of the longer sequence) is used for "OM", "HAM" and
"DHD". Elzinga's method can be forced with "gmean" and Abbott's rule
with "maxlength". With "maxdist" the distance is normalized by its
maximal possible value. For more details, see Gabadinho et al. (2009,
2011). Finally, "YujianBo" is the normalization proposed by Yujian and
Bo (2007) that preserves the triangle inequality.
Let me warn you that while normalization makes distances between two short sequences (say of length 10) more comparable with distances between two long sequences (say of length 100), it does not solve the issue of comparing sequences of different lengths.
You find a detailed discussion on the normalization of distance and similarity in sequence analysis in Elzinga & Studer (2016).
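For instance, a sketch of computing normalized OM distances on your seq.obj; the indel and substitution costs here are placeholders you would adapt to your data:
library(TraMineR)
dist.om <- seqdist(seq.obj, method = "OM", indel = 1, sm = "CONSTANT",
                   norm = "maxlength")  # Abbott's rule: divide by the length of the longer sequence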

How can I find a subsequent trial based on a condition?

I am using R to manipulate a large dataset (dataset) that consists of 20,000+ rows. In my data, I have three important columns to focus on for this question: Trial_Nr (consisting of 90 trials), seconds (increasing in .02 second increments), and threat (fixation to threat: 1=yes, 0=no, NA). Within each trial, I need to answer: once the participant first fixates on the threat (1), how long does it take for them to stop fixating on the threat (0)? So basically, within each trial, I would need to find the first threat=1 and the subsequent threat=0 and subtract the times. I am able to get the first threat with this code:
initalfixthreat <- dataset %>%
group_by(Trial_Nr) %>%
slice(which(threat == '1')[1])
I am stumped on how to get the subsequent threat=0 within that trial number.
Here is an example of the data (sorry, I don't know how to format it better):
So for Trial_Nr=1, I would be interested in 689.9 seconds - 689.8 seconds.
For Trial_Nr=2, I would want 690.04 - 689.96.
Please let me know if I was unclear and thank you all for your help!
One approach is:
library(dplyr)
df %>%
  group_by(Trial_Nr) %>%
  filter(!is.na(threat)) %>%                                        # drop rows with missing fixation data
  mutate(flag = ifelse(threat == 1, 1, threat - lag(threat))) %>%   # flag is 1 on fixation rows, -1 where fixation drops from 1 to 0
  filter(abs(flag) == 1 & !duplicated(flag)) %>%                    # keep the first 1 and the first -1 in each trial
  summarise(timediff = ifelse(length(seconds) == 1, NA, diff(seconds)))  # NA if the fixation never returns to 0
# A tibble: 2 x 2
Trial_Nr timediff
<int> <dbl>
1 1 0.1
2 2 0.0800
Data:
df <- structure(list(Trial_Nr = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L), seconds = c(689.76, 689.78, 689.8, 689.82,
689.84, 689.86, 689.88, 689.9, 689.92, 689.94, 689.96, 689.98,
690, 690.02, 690.04), threat = c(0L, 0L, 1L, 1L, 1L, NA, NA,
0L, 1L, 0L, 1L, NA, NA, 1L, 0L)), class = "data.frame", row.names = c(NA,
-15L))

R: dplyr summarize, sum only values of uniques

I am having trouble with a pesky calculation in a summary I'm building with the dplyr package. It's easiest to explain with some example data:
structure(list(Date = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L),
Name = structure(c(3L, 3L, 4L, 3L, 2L, 3L, 2L, 4L, 1L), .Label = c("George",
"Jack", "John", "Mary"), class = "factor"), Birth.Year = c(1995L,
1995L, 1997L, 1995L, 1999L, 1995L, 1999L, 1997L, 1997L),
Special_Balance = c(10L, 40L, 30L, 5L, 10L, 15L, 2L, 1L,
100L), Total_Balance = c(100L, 100L, 50L, 200L, 20L, 200L,
20L, 100L, 1600L)), .Names = c("Date", "Name", "Birth.Year",
"Special_Balance", "Total_Balance"), class = "data.frame", row.names = c(NA,
-9L))
Two simple summaries are my goal. First, I'd like to summarize just by Date, with the code seen below. The part that is wrong is the total_balance_sum calculation, in which I want to sum the balance of each person, but counting each person only once. For instance, the result of my command for Date=1 is total_balance_sum=100, but it should be 150 (add John's total_balance of 100 once to Mary's total_balance of 50 once). This wrong calculation obviously messes up the final pct calc.
example_data %>%
group_by(Date) %>%
summarise(
total_people=n_distinct(Name),
total_loan_exposures=n(),
special_sum=sum(Special_Balance,na.rm=TRUE),
total_balance_sum=sum(Total_Balance[n_distinct(Name)]),
total_pct=special_sum/total_balance_sum
) -> example_summary
In the second summary (below), I group by both date and birth year, and again am calculating total_balance_sum incorrectly.
example_data %>%
group_by(Date,Birth.Year) %>%
summarise(
total_people=n_distinct(Name),
total_loan_exposures=n(),
special_sum=sum(Special_Balance,na.rm=TRUE),
total_balance_sum=sum(Total_Balance[n_distinct(Name)]),
total_pct=special_sum/total_balance_sum
) -> example_summary_birthyear
What is the correct way to achieve my goal? Clearly the n_distinct I'm using is only taking one of the values and not summing it properly across names.
Thanks for your help.
I'm a little unclear on what you're asking for, but does this do what you'd like (just for the first example)?
example_data %>%
  group_by(Date, Name) %>%                 # first collapse to one row per person per date
  summarise(
    total_loan_exposures = n(),
    total_SpecialPerson = sum(Special_Balance, na.rm = TRUE),
    total_balance_sumPerson = Total_Balance[1]) %>%   # each person's balance counted once
  ungroup() %>%
  group_by(Date) %>%                       # then roll the per-person rows up by date
  summarise(
    total_people = n(),
    total_loan_exposures = sum(total_loan_exposures),
    special_sum = sum(total_SpecialPerson, na.rm = TRUE),
    total_balance_sum = sum(total_balance_sumPerson)) %>%
  mutate(total_pct = (special_sum / total_balance_sum)) -> example_summary
> example_summary
Source: local data frame [3 x 6]
Date total_people total_loan_exposures special_sum total_balance_sum total_pct
1 1 2 3 80 150 0.53333333
2 2 2 4 32 220 0.14545455
3 3 2 2 101 1700 0.05941176
For the second example (for the first, just remove the Birth.Year):
library(dplyr)
example_data %>%
  group_by(Date, Birth.Year) %>%
  mutate(special_sum = sum(Special_Balance),
         total_loan_exposure = n()) %>%
  distinct(Name, Total_Balance, .keep_all = TRUE) %>%   # .keep_all retains special_sum etc. (needed in dplyr >= 0.5)
  summarise(Total_balance_sum = sum(Total_Balance),
            special_sum = special_sum[1],
            total_people = n(),
            total_loan_exposure = total_loan_exposure[1],
            special_sum / Total_balance_sum)
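If you prefer a single summarise closer to your original attempt, here is a sketch (assuming each Name carries a single Total_Balance within a group) that sums each person's balance only at their first occurrence; use group_by(Date, Birth.Year) for the second summary:
example_data %>%
  group_by(Date) %>%
  summarise(total_people = n_distinct(Name),
            total_loan_exposures = n(),
            special_sum = sum(Special_Balance, na.rm = TRUE),
            total_balance_sum = sum(Total_Balance[!duplicated(Name)]),  # count each person's balance once
            total_pct = special_sum / total_balance_sum)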

Automate Data Frame Element Division

I have a dataframe from which I want to obtain the percent treated, where % treated = Treated / Total visits.
e.g. % treated for Acute Maxillary Sinusitis = 93470/93470 = 100%
dput(droplevels(head(magma)))
structure(list(DIAG_CODE_1 = structure(c(1L, 1L, 2L, 2L, 2L,
2L), .Label = c("4610 SINUSITIS MAXILLARY ACUT", "4619 SINUSITIS ACUTE UNSP"
), class = "factor"), GENDER = structure(c(1L, 1L, 1L, 1L, 1L,
1L), .Label = "FEMALE", class = "factor"), AGE = structure(c(1L,
1L, 1L, 1L, 1L, 1L), .Label = "0-2", class = "factor"), Mention_DRGU = c(5460L,
5460L, 17790L, 17790L, 9400L, 9400L), treatment_status = structure(c(1L,
2L, 1L, 2L, 1L, 2L), .Label = c("Total visits", "Treated"), class = "factor"),
diag_class_1 = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "Acute sinusitis", class = "factor"),
year = c(2007L, 2007L, 2007L, 2007L, 2008L, 2008L)), .Names = c("DIAG_CODE_1",
"GENDER", "AGE", "Mention_DRGU", "treatment_status", "diag_class_1",
"year"), row.names = c(1285L, 1286L, 1407L, 1410L, 1408L, 1411L
), class = "data.frame")
However, with 432 rows it's possible I could calculate all that manually, but it would be incredibly time consuming. Isn't that what computers are for :p. If you guys could help me find ways to automate tasks within R, that would be greatly appreciated.
Is there a way R could create a resulting data frame that would tell me the DIAG_CODE_1, GENDER, AGE, % treated, and the year? I've created (in Excel) what I want the output to look like so you guys can see what I mean.
I will be doing this sort of calculation for other respiratory diseases, so I'm looking to learn now, so that I can make my life easier in the long run.
You could use dplyr together with tidyr:
library(dplyr)
library(tidyr)
magma %>%
  spread(treatment_status, Mention_DRGU) %>%
  mutate(PercentageTreated = 100 * (Treated / `Total visits`)) %>%
  select(-diag_class_1, -`Total visits`, -Treated)
# DIAG_CODE_1 GENDER AGE year PercentageTreated
#1 4610 SINUSITIS MAXILLARY ACUT FEMALE 0-2 2007 100
#2 4619 SINUSITIS ACUTE UNSP FEMALE 0-2 2007 100
#3 4619 SINUSITIS ACUTE UNSP FEMALE 0-2 2008 100
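If you are on tidyr 1.0 or later, pivot_wider() is the current replacement for spread(); a sketch of the same reshape:
library(dplyr)
library(tidyr)
magma %>%
  pivot_wider(names_from = treatment_status, values_from = Mention_DRGU) %>%
  mutate(PercentageTreated = 100 * Treated / `Total visits`) %>%
  select(-diag_class_1, -`Total visits`, -Treated)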
Try this:
magma2 <- reshape(magma,
                  idvar = c("DIAG_CODE_1", "GENDER", "AGE", "diag_class_1", "year"),
                  timevar = "treatment_status", direction = "wide")
# reshape() names the new columns "Mention_DRGU.<level>"; rename by name rather than by position
names(magma2)[names(magma2) == "Mention_DRGU.Treated"] <- "Treated"
names(magma2)[names(magma2) == "Mention_DRGU.Total visits"] <- "TotVisits"
magma2$PercentageTreated <- magma2$Treated / magma2$TotVisits
head(magma2)
