species <- c("frog1","frog1","frog1","frog1","frog1","frog1","frog1","frog1"
,"frog1","frog1","frog2","frog2","frog2","frog2","frog2",
"frog2","frog2","frog2","frog2","frog2")
month <- c(1,12,5,8,10,3,5,7,9,4,2,4,6,7,6,3,8,9,11,1)
number <- c(3,4,5,1,2,3,4,7,6,7,3,5,6,7,8,9,9,5,3,1)
a<- data.frame(species,month,number)
My data frame records two kinds of frogs, frog1 and frog2, caught in different numbers in different months.
I would like to convert the months into 4 seasons. The first season is months 1, 12 and 2; the second is 4, 3 and 5; the third is 7, 6 and 8; and the fourth is 10, 9 and 11. The months within each season have a priority order: in the first season I would choose month 1 first, month 12 second and month 2 last; likewise, in the second season I would choose month 4 first, month 3 second and month 5 last, and so on.
How do I create a column that picks the highest-priority month present in each of the 4 seasons for each kind of frog? For instance, frog1 has both month 1 and month 12, so I would like to pick month 1 rather than month 12 for the first season.
My expected output is:
species <- c("frog1","frog1","frog1","frog1","frog1","frog1","frog1","frog1"
,"frog1","frog1","frog2","frog2","frog2","frog2","frog2",
"frog2","frog2","frog2","frog2","frog2")
month <- c(1,12,5,8,10,3,5,7,9,4,2,4,6,7,6,3,8,9,11,1)
number <- c(3,4,5,1,2,3,4,7,6,7,3,5,6,7,8,9,9,5,3,1)
choosemonth <- c("season1","","","","season4","","","season3","","season2",
"","season2","","season3","","","","season4","","season1")
b<- data.frame(species,month,number,choosemonth)
I'm guessing a little at your final desired result, but here's how to create the season and the importance. I'm solving for the most important month for each species.
Here's a way with dplyr:
library(dplyr)
a %>%
# Seasons are quarters shifted back one month: 12,1,2 / 3,4,5 / 6,7,8 / 9,10,11
mutate(season = (month %% 12) %/% 3,
# month %% 3 of 0, 1, 2 maps to importance 2, 3, 1 (higher is preferred)
importance = c(2,3,1)[month %% 3 + 1]) %>%
group_by(season, species) %>%
# keep only those with the max importance
filter(importance == max(importance))
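EDIT: It looks like you just want to flag the row with the most importance in each season for each species, so here's how to do that:
a %>%
# Seasons are quarters shifted back one month: 12,1,2 / 3,4,5 / 6,7,8 / 9,10,11
mutate(season = (month %% 12) %/% 3,
# month %% 3 of 0, 1, 2 maps to importance 2, 3, 1 (higher is preferred)
importance = c(2,3,1)[month %% 3 + 1]) %>%
group_by(species, season) %>%
# flag the most important month that is actually present in the group
mutate(choosemonth = ifelse(importance == max(importance), paste0('season',season + 1),'')) %>%
ungroup()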
EDIT 2: edited one more time, was dividing into 3 seasons rather than 4
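As a cross-check against the expected output in the question, here is a minimal base R sketch of the same idea. It assumes the seasons are exactly {12, 1, 2}, {3, 4, 5}, {6, 7, 8} and {9, 10, 11} as the question describes. The names season, rank_in_season and best are illustrative only, and if the best month appears twice for one species, every occurrence is flagged.

```r
species <- rep(c("frog1", "frog2"), each = 10)
month <- c(1,12,5,8,10,3,5,7,9,4,2,4,6,7,6,3,8,9,11,1)

# Season 1..4: months 12,1,2 / 3,4,5 / 6,7,8 / 9,10,11
season <- (month %% 12) %/% 3 + 1

# Priority inside a season (1 = preferred): month %% 3 of 1, 0, 2
# corresponds to first, second, third choice
rank_in_season <- c(2, 1, 3)[month %% 3 + 1]

# Best (lowest) rank actually present for each species and season
best <- ave(rank_in_season, species, season, FUN = min)

choosemonth <- ifelse(rank_in_season == best, paste0("season", season), "")
data.frame(species, month, choosemonth)
```

Taking the minimum per group, rather than testing for a fixed rank, is what lets frog2 fall back to month 9 in the fourth season, where month 10 is absent.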
I have a tibble in which each row represents an image of an eye and contains the following relevant variables: patientId, laterality (left or right), date, imageId.
I would like to manipulate this to create another tibble showing the number of followUpYears for each eye (patientId, laterality). followUpYears is defined in a somewhat unusual way:
1. In order to meet the requirements for follow-up in a particular year, there must be two different imaging dates during that year, i.e. between days 0-365 for year 1, days 366-730 for year 2, etc. The first image date is always the baseline and followUpYears is always an integer.
2. Only one image per date is considered.
3. Follow-up ceases as soon as the requirement for 2 imaging dates in a year is not met, i.e. if there is only 1 imaging date in the first year, followUpYears is 0 regardless of how many images are taken subsequently.
4. There is no requirement for there to be at least n years between the first and last image date for an eye to have n followUpYears.
The following dummy data demonstrates these points:
data <- tibble(patientId = c('A','A','A','A','A','A','B','B','B','B','B','B','B'),
laterality = c('L','L','L','L','L','L','R','R','R','R','L','L','L'),
date = as.Date(c('2000-05-05','2000-05-05','2001-05-06','2001-05-07','2002-05-06','2002-05-07','2000-09-08','2001-09-07','2001-09-09','2001-09-10','2000-09-08','2001-09-07','2001-09-10')),
imageId = 1:13)
expected_output <- tibble(patientId = c('A','B','B'),
laterality = c('L','R','L'),
followUpYears = c(0, 2, 1))
Patient A's left eye has 0 followUpYears because of points 2 and 3. Patient B's right eye has 2 followUpYears because of point 4 (despite the fact that there is only slightly more than 1 year between the first and last image date). Patient B's left eye only has 1 year of follow up since it doesn't meet the requirement for 2 image dates in year 2.
I am familiar with the basic dplyr verbs but I can't think of how to frame this type of variable. Note that patients might have one or both eyes included and some might have 10+ years of follow up. Finally, a solution that considers 1 year to be 365 days regardless of leap years is fine.
Thank you!
Here's a way with ifelse. diff_year is a helper function that computes the difference between two dates in years, rounded up to the next integer.
library(dplyr)
diff_year <- function(date1, date2) ceiling(as.numeric(difftime(date1, date2)) / 365)
data %>%
group_by(patientId) %>%
summarise(followUpYears = ifelse(diff_year(date[date != first(date)][1], first(date)) <= 1,
diff_year(max(date), min(date)), 0))
#A tibble: 2 × 2
# patientId followUpYears
# <chr> <dbl>
#1 A 0
#2 B 2
Update with OP's comment. This should work with all conditions:
diff_year <- function(date1, date2) as.numeric((date1 - date2) / 365)
data %>%
distinct(patientId, laterality, date, .keep_all = TRUE) %>%
group_by(patientId, laterality) %>%
mutate(diffYear = floor(diff_year(date, min(date)))) %>%
add_count(count = diffYear) %>%
filter(!cumany(lag(n == 1, default = 0)) | row_number() == 1) %>%
summarise(followUpYears = ifelse(any(n > 1), ceiling(diff_year(max(date[n != 1]), min(date))), 0))
# patientId laterality followUpYears
#1 A L 0
#2 B L 1
#3 B R 2
Below is my approach, which should cover all four conditions. I'm not sure, however, how you get:
#> # A tibble: 1 x 3
#> patientId laterality followUpYears
#> <chr> <chr> <dbl>
#> 1 B L 1
since, by your logic, it should fall into the two-year band: from 2000-09-08 to 2001-09-10 is 367 days, which rounds up to two years.
The idea is that we first calculate a followup_flag, which checks whether each date is within 365 days of the previous date, and then take the cummin() so that the series breaks as soon as there is no direct follow-up year.
Then we keep only the rows where followup_flag == 1.
For the remaining rows we check how many years lie between the first and the last date, and since we want to count 367 days as 2 years we have to take the ceiling().
library(dplyr)
library(lubridate)
data %>%
group_by(patientId, laterality) %>%
mutate(followup_flag = cummin(date - dplyr::lag(date, default = first(date)) <= 365)) %>%
filter(followup_flag == 1) %>%
summarise(followUpYears = as.numeric(
difftime(last(date), first(date), units = "days") / 365) %>%
ceiling()
)
#> `summarise()` has grouped output by 'patientId'. You can override using the
#> `.groups` argument.
#> # A tibble: 3 x 3
#> # Groups: patientId [2]
#> patientId laterality followUpYears
#> <chr> <chr> <dbl>
#> 1 A L 0
#> 2 B L 2
#> 3 B R 2
Data used:
data <- tibble(patientId = c('A','A','A','A','A','A','B','B','B','B','B','B','B'),
laterality = c('L','L','L','L','L','L','R','R','R','R','L','L','L'),
date = as.Date(c('2000-05-05','2000-05-05','2001-05-06','2001-05-07','2002-05-06','2002-05-07','2000-09-08','2001-09-07','2001-09-09','2001-09-10','2000-09-08','2001-09-07','2001-09-10')),
imageId = 1:13)
Created on 2023-02-08 by the reprex package (v2.0.1)
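For comparison, the definition in the question can also be applied literally. The base R sketch below assigns each distinct imaging date to a follow-up year (days 0-365 are year 1, days 366-730 are year 2, and so on) and counts the leading years that contain at least two dates. The helper name follow_up_years is hypothetical, and a plain data.frame is used in place of the tibble so that the example needs no packages.

```r
data <- data.frame(
  patientId = c('A','A','A','A','A','A','B','B','B','B','B','B','B'),
  laterality = c('L','L','L','L','L','L','R','R','R','R','L','L','L'),
  date = as.Date(c('2000-05-05','2000-05-05','2001-05-06','2001-05-07',
                   '2002-05-06','2002-05-07','2000-09-08','2001-09-07',
                   '2001-09-09','2001-09-10','2000-09-08','2001-09-07',
                   '2001-09-10')))

follow_up_years <- function(dates) {
  d <- sort(unique(dates))               # one image per date
  days <- as.numeric(d - d[1])           # days since baseline
  yr <- pmax(1, ceiling(days / 365))     # day 0 belongs to year 1
  ok <- tabulate(yr) >= 2                # at least two dates in each year?
  sum(cumprod(ok))                       # count years until the first failure
}

# One vector of dates per eye, named "<patientId>.<laterality>"
eyes <- split(data$date, list(data$patientId, data$laterality), drop = TRUE)
res <- sapply(eyes, follow_up_years)
res
```

With the dummy data this gives 0 for A.L, 1 for B.L and 2 for B.R, which agrees with expected_output in the question.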
I am struggling a little with the logic for recoding nested data into a long "continuous" format based on dates in R.
Below is a dummy example of my data. I have three sets of dates: the start and stop times for a participant, which are stored in long format, and the start of another incident, which is stored as wide data.
GC_ID HMIS_Start HMIS_Stop CPS Start CPS Start 2 CPS Start 3
------- ------------ ----------- ----------- ------------- -------------
1 1/10/14 1/20/14 1/15/14 6/2/14 NA
1 4/10/14 5/30/14 1/15/14 6/2/14 NA
1 12/1/14 12/2/14 1/15/14 6/2/14 NA
1 1/1/15 2/28/15 1/15/14 6/2/14 NA
2 8/13/13 8/17/14 NA NA NA
3 5/1/15 5/2/15 1/16/13 6/26/14 7/27/15
3 6/4/16 7/10/16 1/16/13 6/26/14 7/27/15
4 10/15/13 10/25/13 2/18/15 NA NA
4 12/25/13 1/18/14 2/18/15 NA NA
4 2/8/15 7/20/15 2/18/15 NA NA
My goal is to create two long continuous variables with one value for each month from August 2013 to December 2015. For the first variable, I would like to code a 1 for each month that falls within a participant's HMIS_Start and HMIS_Stop dates AND has at least one CPS Start date within that month. The second variable would be similar, but coded 1 if a CPS Start date fell in the month after the HMIS_Stop date.
So participant 1's data could look like this:
I assume I need to create a blank data set with the ID variable and a month/year variable. Then I would use a for loop over each ID to run an "if_then" statement checking IF the month is greater than the HMIS start and less than the HMIS stop AND the CPS start is within that month too.
I am mostly struggling with how to set up that process and use the for loop logically, given that there are long data already in the file and multiple rows per participant that need to be compared against all possible CPS start dates.
Any thoughts or code tips on how to tackle this?
I am not sure how you came to your answers, and I will update this code once that is provided. But I used library(tidyverse) and library(lubridate) for this:
dat <- data.frame(GC_ID = c(1,1,1,1,2,3,3,4,4,4),
HMIS_Start = c("1/10/14", "4/10/14", "12/1/14", "1/1/15", "8/13/13", "5/1/15", "6/4/16", "10/15/13", "12/25/13", "2/8/15"),
HMIS_Stop = c("1/20/14", "5/30/14", "12/2/14", "2/28/15", "8/17/14", "5/2/15", "7/10/16", "10/25/13", "1/18/14", "7/20/15"),
CPS_Start = c("1/15/14", "1/15/14", "1/15/14", "1/15/14", NA, "1/16/13", "1/16/13", "2/18/15", "2/18/15", "2/18/15"),
# second CPS start for participant 1 is 6/2/14, as in the question's table
CPS_Start_2 = c("6/2/14", "6/2/14", "6/2/14", "6/2/14", NA, "6/26/14", "6/26/14", NA, NA, NA),
CPS_Start_3 = c(NA, NA, NA, NA, NA, "7/27/15", "7/27/15", NA, NA, NA))
dats <- dat %>%
mutate_if(is.factor, as.character) %>%
mutate_if(is.character, ~as.Date(., format = "%m/%d/%y")) %>%
gather(Var, Dates, -GC_ID, -HMIS_Start, -HMIS_Stop) %>%
filter(!is.na(Dates)) %>%
mutate(HMIS_CPS_SAME = if_else(month(HMIS_Start) == month(HMIS_Stop) &
year(HMIS_Start) == year(HMIS_Stop) &
month(HMIS_Start) == month(Dates) &
year(HMIS_Start) == year(Dates), 1, 0 ),
CPS_After = if_else(month(HMIS_Stop) + 1 == month(Dates) &
year(HMIS_Stop) == year(Dates), 1,0 ),
Months = month(HMIS_Start),
Years = year(HMIS_Start)) %>%
arrange(GC_ID, HMIS_Start, Dates) %>%
group_by(GC_ID, Months, Years) %>%
summarise(HMIS_CPS_SAME = max(HMIS_CPS_SAME),
CPS_After = max(CPS_After)) %>%
ungroup()
full_dat <- merge(data.frame(GC_ID = unique(dat$GC_ID)), data.frame(Dates = seq.Date(as.Date("2013-08-01"), as.Date("2015-12-01"), by = "month"))) %>%
mutate(Months = month(Dates), Years = year(Dates)) %>%
left_join(dats, by = c("GC_ID", "Months", "Years")) %>%
mutate_if(is.numeric , replace_na, replace = 0)
First I recreated the data in R. Then I converted the five date columns to Date format. I made the data long to do the comparisons specified, then took the max for each GC_ID, Months and Years. Next I used a Cartesian join of every month with every GC_ID, extracted the months and years, and joined dats to full_dat by GC_ID, Months and Years. The last mutate_if converts all NA values to 0. No looping needed! :-)
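The month comparisons can also be done on a single integer month index, which handles stays that span several months and the December-to-January rollover. The sketch below is one reading of the question, not a definitive one: it assumes "within HMIS start and stop" is meant at month level, uses participant 1's dates from the question's table, and the names month_index, to_date, cps_during_hmis and cps_after_hmis are illustrative.

```r
# Month index: consecutive calendar months differ by exactly 1
month_index <- function(d) {
  as.integer(format(d, "%Y")) * 12L + as.integer(format(d, "%m"))
}
to_date <- function(x) as.Date(x, format = "%m/%d/%y")

# Participant 1 from the question's table
hmis_start <- to_date(c("1/10/14", "4/10/14", "12/1/14", "1/1/15"))
hmis_stop  <- to_date(c("1/20/14", "5/30/14", "12/2/14", "2/28/15"))
cps_start  <- to_date(c("1/15/14", "6/2/14"))

months <- seq(as.Date("2013-08-01"), as.Date("2015-12-01"), by = "month")
m <- month_index(months)

# Does the month overlap any HMIS stay?
in_stay <- sapply(m, function(k) {
  any(month_index(hmis_start) <= k & k <= month_index(hmis_stop))
})
# Does the month contain a CPS start? Is it the month after an HMIS stop?
has_cps    <- m %in% month_index(cps_start)
after_stop <- m %in% (month_index(hmis_stop) + 1L)

out <- data.frame(month = format(months, "%Y-%m"),
                  cps_during_hmis = as.integer(in_stay & has_cps),
                  cps_after_hmis  = as.integer(after_stop & has_cps))
out[out$cps_during_hmis == 1 | out$cps_after_hmis == 1, ]
```

For participant 1 this flags January 2014 in the first variable and June 2014 in the second. Running the same steps per GC_ID (for example with split and lapply) would give the full long data set.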
I have data for hospitalisations that records date of admission and the number of days spent in the hospital:
ID date ndays
1 2005-06-01 15
2 2005-06-15 60
3 2005-12-25 20
4 2005-01-01 400
4 2006-06-04 15
I would like to create a dataset of days spent in hospital per year, so I need to deal with cases like ID 3, whose stay runs past the end of the year, and ID 4, whose stay is longer than one year. There is also the problem that some people already have a record in the following year, and I would like to add the 'surplus' days to it when this happens.
So far I have come up with this solution:
library(lubridate)
# last day of the admission year
year_end <- as.Date(paste(year(data$date), "12-31", sep = "-"), format = "%Y-%m-%d")
# days left in the admission year, capped at the length of the stay
ndays_new <- ifelse(as.numeric(year_end - data$date) < data$ndays,
as.numeric(year_end - data$date),
data$ndays)
However, I can't think of a way to get those 'surplus' days that go over the end of the year and assign them to a new record starting in the next year. Can anyone point me to a good solution? I use dplyr, so solutions with that package would be especially welcome, but I'm willing to try any other tool if needed.
My solution isn't compact. But, I tried to employ dplyr and did the following. I initially changed column names for my own understanding. I calculated another date (i.e., date.2) by adding ndays to date.1. If the years of date.1 and date.2 match, that means you do not have to consider the following year. If the years do not match, you need to consider the following year. ndays.2 is basically ndays for the following year. Then, I reshaped the data using do. After filtering unnecessary rows with NAs, I changed date to year and aggregated the data by ID and year.
library(dplyr)
rename(mydf, date.1 = date, ndays.1 = ndays) %>%
mutate(date.1 = as.POSIXct(date.1, format = "%Y-%m-%d"),
date.2 = date.1 + (60 * 60 * 24) * ndays.1,
ndays.2 = ifelse(as.character(format(date.1, "%Y")) == as.character(format(date.2, "%Y")), NA,
date.2 - as.POSIXct(paste0(as.character(format(date.2, "%Y")),"-01-01"), format = "%Y-%m-%d")),
ndays.1 = ifelse(ndays.2 %in% NA, ndays.1, ndays.1 - ndays.2)) %>%
do(data.frame(ID = .$ID, date = c(.$date.1, .$date.2), ndays = c(.$ndays.1, .$ndays.2))) %>%
filter(complete.cases(ndays)) %>%
mutate(date = as.numeric(format(date, "%Y"))) %>%
rename(year = date) %>%
group_by(ID, year) %>%
summarise(ndays = sum(ndays))
# ID year ndays
#1 1 2005 15
#2 2 2005 60
#3 3 2005 7
#4 3 2006 13
#5 4 2005 365
#6 4 2006 50
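A stay that crosses more than one year boundary can be handled by expanding each stay into one row per day and counting days per year. The base R sketch below assumes, as the output above does, that the admission date counts as the first day of the stay. The names stays, per_stay, daily and result are illustrative.

```r
stays <- data.frame(
  ID = c(1, 2, 3, 4, 4),
  date = as.Date(c("2005-06-01", "2005-06-15", "2005-12-25",
                   "2005-01-01", "2006-06-04")),
  ndays = c(15, 60, 20, 400, 15))

# One row per hospital day, labelled with the calendar year it falls in
per_stay <- lapply(seq_len(nrow(stays)), function(i) {
  days <- seq(stays$date[i], by = "day", length.out = stays$ndays[i])
  data.frame(ID = stays$ID[i], year = as.integer(format(days, "%Y")))
})
daily <- do.call(rbind, per_stay)
daily$ndays <- 1

# Total days per person and year
result <- aggregate(ndays ~ ID + year, data = daily, FUN = sum)
result <- result[order(result$ID, result$year), ]
result
```

This builds one row per day, so it trades memory for simplicity. For the five example records it gives the same six totals as the output above.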
Hope you're doing well. I am working on an assignment related to data pre-processing and I need some help in R.
I have a column of days in which there are 711 unique values. In total I have 2 million observations. The data were collected over 2 years and each value corresponds to one day of the week.
For example, day 1 is Monday, day 8 is Monday as well, day 15 is Monday, and so on.
Could someone help me replace these with 1 to 7? If day 1 is Monday, I want the cells containing the value 8 to be replaced by 1, those containing 15 to be replaced by 1, and so on.
I hope this makes sense.
thank you for your help.
Regards
A
Following the comments (since I can't comment), try this:
# An example data.frame
mydata <- data.frame(DAY= 1:21, ABC= letters[1:21])
mydata
# Do "mod 7" on variable DAY, so DAY now has values from 0 to 6,
# then assign back to variable DAY
mydata$DAY <- mydata$DAY %% 7
mydata
# Replace 0 with 7 in the DAY variable
mydata$DAY <- ifelse(mydata$DAY == 0, 7, mydata$DAY)
mydata
# Save final data.frame
write.csv(mydata, file='mydata.csv')
Rather than issue 7 separate commands (one for each day) you can use dplyr:
require(dplyr)
d <- data.frame(day = 1:711)
mutate(d, day = (day - 1) %% 7 + 1)
What we're doing here is shifting the day number down by 1, finding its remainder when divided by 7 (a value from 0 to 6), and adding 1 back, so that days 1, 8, 15, ... map to 1 and days 7, 14, ... map to 7 rather than 0.
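A quick check of the remainder arithmetic on a few day numbers, assuming day 1 is a Monday as in the question. Subtracting 1 before taking the remainder keeps day 1 at 1 and day 7 at 7. The weekday labels are illustrative.

```r
day <- c(1, 7, 8, 14, 15, 711)

# Shift to 0-based, wrap every 7 days, shift back to 1-based
weekday <- (day - 1) %% 7 + 1

# Optional labels, assuming day 1 is a Monday
weekday_name <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")[weekday]

data.frame(day, weekday, weekday_name)
```

Days 1, 8 and 15 all come out as 1 (Mon), days 7 and 14 as 7 (Sun), and day 711 as 4 (Thu).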
I struggle mightily with dates in R and could do this pretty easily in SPSS, but I would love to stay within R for my project.
I have a date column in my data frame and want to remove the year completely in order to leave the month and day. Here is a peek at my original data.
> head(ds$date)
[1] "2003-10-09" "2003-10-11" "2003-10-13" "2003-10-15" "2003-10-18" "2003-10-20"
> class((ds$date))
[1] "Date"
I "want" it to be.
> head(ds$date)
[1] "10-09" "10-11" "10-13" "10-15" "10-18" "10-20"
> class((ds$date))
[1] "Date"
If possible, I would love to set the first date to be October 1st instead of January 1st.
Any help you can provide will be greatly appreciated.
EDIT: I felt like I should add some context. I want to plot an NHL player's performance over the course of a season which starts in October and ends in April. To add to this, I would like to facet the plots by each season which is a separate column in my data frame. Because I want to compare cumulative performance over the course of the season, I believe that I need to remove the year portion, but maybe I don't; as I indicated, I struggle with dates in R. What I am looking to accomplish is a plot that compares cumulative performance over relative dates by season and have the x-axis start in October and end in April.
> d = as.Date("2003-10-09", format="%Y-%m-%d")
> format(d, "%m-%d")
[1] "10-09"
Is this what you are looking for?
library(ggplot2)
library(reshape2) # provides melt()
## make up data for two seasons a and b
a = as.Date("2010/10/1")
b = as.Date("2011/10/1")
a.date <- seq(a, by='1 week', length=28)
b.date <- seq(b, by='1 week', length=28)
## make up some score data
a.score <- abs(trunc(rnorm(28, mean = 10, sd = 5)))
b.score <- abs(trunc(rnorm(28, mean = 10, sd = 5)))
## create a data frame
df <- data.frame(a.date, b.date, a.score, b.score)
df
## Since I am using ggplot, I had better create a "long-format" data frame
df.molt <- melt(df, measure.vars = c("a.score", "b.score"))
levels(df.molt$variable) <- c("First season", "Second season")
df.molt
Then, I am using ggplot2 for plotting the data:
## plot it
ggplot(aes(y = value, x = a.date), data = df.molt) + geom_point() +
geom_line() + facet_wrap(~variable, ncol = 1) +
scale_x_date("Date", date_labels = "%m-%d")
If you want to modify the x-axis (e.g., the display format), then you'll probably be interested in scale_x_date.
You have to remember that Date is a numeric format, representing the number of days passed since the "origin" of the internal date counting:
> str(Date)
Class 'Date' num [1:10] 14245 14360 14475 14590 14705 ...
This is the same as in Excel, if you want a reference. Hence the solution with format is perfectly valid.
Now if you want to set the first date of a year as October 1st, you can construct a year index like this:
redefine.year <- function(x,start="10-1"){
year <- as.numeric(strftime(x,"%Y"))
yearstart <- as.Date(paste(year,start,sep="-"))
year + (x >= yearstart) - min(year) + 1
}
Testing code :
Start <- as.Date("2009-1-1")
Stop <- as.Date("2011-11-1")
Date <- seq(Start,Stop,length.out=10)
data.frame( Date=as.character(Date),
year=redefine.year(Date))
gives
Date year
1 2009-01-01 1
2 2009-04-25 1
3 2009-08-18 1
4 2009-12-11 2
5 2010-04-05 2
6 2010-07-29 2
7 2010-11-21 3
8 2011-03-16 3
9 2011-07-09 3
10 2011-11-01 4
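To put several seasons on a common October-to-April axis, one option is to compute the number of days since the most recent October 1st and plot against that. This is a minimal sketch in the same spirit as redefine.year. The function name season_day and the dummy start date 2000-10-01 are assumptions made for illustration.

```r
season_day <- function(x, start = "10-01") {
  yr <- as.numeric(format(x, "%Y"))
  this_start <- as.Date(paste(yr, start, sep = "-"))
  prev_start <- as.Date(paste(yr - 1, start, sep = "-"))
  # Use this year's October 1st if it has been reached, otherwise last year's
  season_start <- ifelse(x >= this_start,
                         as.numeric(this_start),
                         as.numeric(prev_start))
  as.numeric(x) - season_start
}

dates <- as.Date(c("2003-10-01", "2003-10-09", "2004-01-01"))
season_day(dates)

# Map every season onto one dummy season so axis labels can show month-day
relative_date <- as.Date("2000-10-01") + season_day(dates)
format(relative_date, "%m-%d")
```

For the three example dates season_day returns 0, 8 and 92. Because relative_date is a Date again, it can be used as the x aesthetic with a month-day label format while the real season stays in its own column for faceting.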