I'm a bit lost with all the date formatting and can't get my age function to work. I'm not sure whether the format option applies to the source or the destination. My goal is to attach an AGE column. dx$BIRTH_DATE is defined exactly as in my source data: a factor in this format, which I cannot change. I don't want to use lubridate for now, as I need to migrate the code to a different environment. Is this possible?
Thanks very much for your help!
Mario
age_years <- function(from, to)
{
lt <- as.POSIXlt(c(from, to))
age <- lt$year[2] - lt$year[1]
mons <- lt$mon + lt$mday/50
if(mons[2] < mons[1]) age <- age -1
age
}
today <- Sys.Date() #,format="%m/%d/%Y")
class(today)
age_years("1988-06-30", "2003-07-12")
age_years("1988-06-30", date) ### doesn't work ???
as.character(dx$BIRTH_DATE)
as.Date(dx$BIRTH_DATE)
dx <- data.frame(ID = factor(c(1,2,3)), BIRTH_DATE = c("1/11/1953","2/12/1977","3/13/2000"), FEM = (c(11,22,33)))
dx
str(dx)
### <#>><
dx$AGE <- age_years(as.Date(dx$BIRTH_DATE), today) ## Age=1 ?????
Using your code, here is a solution. You had two major problems. First, the format of your birth dates was not being declared (as noted by Pierre). Second, you were using only the first and second elements to compute the age, rather than the entire column of the data frame. Now from and to are handled as vectors, so one age is returned per row. Hope this helps!
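age_years <- function(from, to)
{
  from <- as.POSIXlt(from)
  to <- as.POSIXlt(to)
  age <- to$year - from$year
  # month plus a fraction of the month, for both dates
  mons_from <- from$mon + from$mday/50
  mons_to <- to$mon + to$mday/50
  # subtract one year where the birthday has not yet been reached in the 'to' year
  ifelse(mons_to < mons_from, age - 1, age)
}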
today <- Sys.Date() #,format="%m/%d/%Y")
dx <- data.frame(ID = factor(c(1,2,3)), BIRTH_DATE = c("1/11/1953","2/12/1977","3/13/2000"), FEM = (c(11,22,33)))
dx$AGE <- age_years(from=as.Date(dx$BIRTH_DATE,format = "%m/%d/%Y"), to=today)
> dx
ID BIRTH_DATE FEM AGE
1 1 1/11/1953 11 62
2 2 2/12/1977 22 38
3 3 3/13/2000 33 15
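On the question of where the format option applies: in as.Date() it describes the layout of the input text, not the output. The sketch below illustrates this together with a vectorised age calculation in base R only. The helper name age_on and the fixed reference date are assumptions made for illustration (a fixed date keeps the result reproducible, unlike Sys.Date()).

```r
# Hypothetical helper: whole years between 'from' and 'to' (both Date-like)
age_on <- function(from, to) {
  from <- as.POSIXlt(from)
  to   <- as.POSIXlt(to)
  # TRUE where the birthday has not yet been reached in the 'to' year
  before_birthday <- (to$mon < from$mon) |
    (to$mon == from$mon & to$mday < from$mday)
  to$year - from$year - before_birthday
}

# BIRTH_DATE stored as a factor, as in the question
birth_factor <- factor(c("1/11/1953", "2/12/1977", "3/13/2000"))

# 'format' tells as.Date how to read the input strings
birth <- as.Date(as.character(birth_factor), format = "%m/%d/%Y")

ref  <- as.Date("2015-06-01")  # assumed reference date, not from the original post
ages <- age_on(birth, ref)
ages
# 62 38 15
```

Comparing month and day directly avoids the mday/50 approximation, and because every operation is vectorised the helper returns one age per row.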
My code is meant to order a table called Football (imported csv2) and then, using a for loop, go through the data and return the row number of the start year and end year.
Football[order(Football$Year),]
start_year <- min(Football$Year)
end_year <- max(Football$Year)
for (i in 1:nrow(Football)
{
if (Football$Year[i] = start_year)
{
row_of_start <- i
}
if (Football$Year[i] = end_year)
{
row_of_end <- i
}
}
This produces the following error:
> if (Football$Year[1] = start_year) row_of_start <- 1
Error: unexpected '=' in "if (Football$Year[1] ="
I appreciate there are probably ways of doing this without a for loop (which I would be very appreciative to know) although I would also like to know how to make the for loop work (to further my understanding).
You can skip the loop entirely using which(). This will usually be faster and more legible:
# Create example data
set.seed(123)
Football <- data.frame(Year = sample(1990:2000, size = 10),
foo = sample(letters, size = 10))
# Sort the data as you have done
Football_sort <- Football[order(Football$Year), ]
# Get the row numbers of the min and max (start and end years)
which(with(Football_sort, Year == min(Year)))
#> [1] 1
which(with(Football_sort, Year == max(Year)))
#> [1] 10
Depending upon what you actually want to do, you can skip the ordering step as well. Both of the below depend upon the dplyr package to work.
If you just want the start and end year rows rather than their row numbers:
library(dplyr)
Football %>%
filter(Year %in% c(min(Year), max(Year)))
#> Year foo
#> 1 2000 e
#> 2 1990 d
If you want the "year number" of the start and end year:
Football %>%
summarise(start_year = 1,
end_year = max(Year) - min(Year))
#> start_year end_year
#> 1 1 10
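Returning to the for loop in the question: the parse error comes from using = (assignment) where == (comparison) is needed, and the for (...) header is also missing its closing parenthesis. A minimal repaired sketch follows; the four-row Football data frame here is made up purely for illustration. Note that the sorted result has to be assigned back, otherwise the ordering is discarded.

```r
# Made-up example data (the real Football table comes from a csv file)
Football <- data.frame(Year = c(1995, 1990, 2000, 1998))

# Assign the sorted data back; order() alone does not modify Football
Football <- Football[order(Football$Year), , drop = FALSE]

start_year <- min(Football$Year)
end_year   <- max(Football$Year)

for (i in seq_len(nrow(Football))) {
  # '==' tests equality; '=' would be an assignment and is a syntax error here
  if (Football$Year[i] == start_year) {
    row_of_start <- i
  }
  if (Football$Year[i] == end_year) {
    row_of_end <- i
  }
}

row_of_start
row_of_end
```

If a year occurs more than once, this loop keeps the last matching row for each, whereas which() returns all of them.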
I am new to R and am running into difficulty with more advanced filtering. I have a data frame containing 1500 rows of people in households and need to filter out everyone who is part of a household where at least 1 person is older than 24. For example, in the sample set below I would only want to keep rows 3,4, and 5.
PersonalID DOB HouseholdID
1 1961-04-15 123
2 2017-01-12 123
3 2000-01-02 122
4 2001-03-05 122
5 1996-08-22 122
Initially I just filtered to get a new data frame with everyone in that age range and then filtered the original data frame again (and again and again and so on...) with each HouseholdID of someone under 25 to check if anyone else with that HouseholdID is over 24.
Whenever I'm doing the same thing over and over it seems like there's probably a way to use a function instead but I'm having a hard time coming up with one that works. This is my current attempt but I know there's plenty wrong with it:
UNDER25df <- filter(df, DOB >= "yyyy-mm-dd")
for (UNDER25df$HouseholdID in df) {
if (all(df$DOB >= "yyyy-mm-dd")) {
view(filter(df, HouseholdID == "$HouseholdID"))
}
}
The error I get is:
unexpected '}' in "}"
but I'm pretty sure that I can nest an if statement in a for loop in R and that I've been careful about the positioning of the brackets so I don't know exactly what it's referring to.
What I'm not sure of is if I can iterate through a data frame in this way or if this even makes sense. I've read that vectoring might be better in general for advanced filtering but tried to read the documentation on it and couldn't really see how to make that jump to this problem. Does anyone have a suggestion or a direction I should be looking in?
You do not need a loop for this. Try
library(lubridate)
library(dplyr)
set.seed(1)
df <- tibble(DOB = Sys.Date() - sample(3000:12000, 6),
personalID = 1:6,
HouseholdID = c(1,1,2,2,2,3))
df$DOB
# grab householdID from all persons that are at least 24
oldies <- df[(lubridate::today() - lubridate::ymd(df$DOB)) > years(24),
"HouseholdID", TRUE]
# base R way: anyone born before the cutoff date is older than 24 (cutoff hard-coded here)
oldies <- df[as.Date(df$DOB) < as.Date("1993-2-10"),
"HouseholdID", TRUE]
# household members in a household with someone 24 or older
df %>%
filter(HouseholdID %in% oldies)
# household members in a household with no one 24 or older
df %>%
filter(!(HouseholdID %in% oldies))
I am not sure whether you want to keep only the households (grouped by HouseholdID) in which every member is 24 years old or younger. If so, you can try the code below:
library(lubridate)
dfout <- subset(df, ave(floor(time_length(Sys.Date()-as.Date(DOB),"years"))<=24, HouseholdID, FUN = all))
If you really want to use a for loop, here is an example:
dfout <- data.frame()
for (id in unique(df$HouseholdID)) {
subdf <- subset(df,HouseholdID == id)
if (with(subdf, all(floor(time_length(Sys.Date()-as.Date(DOB),"years"))<=24))) {
dfout <- rbind(dfout,subdf)
}
}
Both approaches above give the following result:
> dfout
PersonalID DOB HouseholdID
3 3 2000-01-02 122
4 4 2001-03-05 122
5 5 1996-08-22 122
DATA
df <- structure(list(PersonalID = 1:5, DOB = c("1961-04-15", "2017-01-12",
"2000-01-02", "2001-03-05", "1996-08-22"), HouseholdID = c(123L,
123L, 122L, 122L, 122L)), class = "data.frame", row.names = c(NA,
-5L))
I am not sure whether you want to select households where all the people are above 24 or where at least one person is above 24. In any case, you can use subset with ave:
# as.Date() is needed when DOB is stored as character; it is harmless if DOB is already a Date
subset(df, ave(as.integer(format(Sys.Date(), "%Y")) -
as.integer(format(as.Date(DOB), "%Y")) >= 24, HouseholdID, FUN = any))
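This selects households where at least one person is above 24. If you want to select households where all people are above 24, use all instead of any in the FUN argument. Note that this compares calendar years only, so it approximates age rather than computing it exactly.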
Similarly, using dplyr, we can use
library(dplyr)
df %>%
group_by(HouseholdID) %>%
filter(any(as.integer(format(Sys.Date(), "%Y")) -
as.integer(format(as.Date(DOB), "%Y")) >= 24))
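For the filter the question actually asks for (drop every household that contains someone older than 24), a base R sketch with an exact age calculation might look like the following. The fixed reference date is an assumption chosen so the result is reproducible; in practice Sys.Date() would take its place, and the rows kept will change as people age.

```r
# Sample data from the question
df <- data.frame(
  PersonalID  = 1:5,
  DOB         = as.Date(c("1961-04-15", "2017-01-12", "2000-01-02",
                          "2001-03-05", "1996-08-22")),
  HouseholdID = c(123L, 123L, 122L, 122L, 122L)
)

ref_date <- as.Date("2020-01-01")  # assumed reference date, for reproducibility

# Exact age in whole years: subtract 1 if the birthday has not yet occurred
b <- as.POSIXlt(df$DOB)
r <- as.POSIXlt(ref_date)
age <- r$year - b$year -
  ((r$mon < b$mon) | (r$mon == b$mon & r$mday < b$mday))

# One TRUE/FALSE per household: does it contain anyone older than 24?
has_over_24 <- tapply(age > 24, df$HouseholdID, any)

# Look the household flag up for every row and keep the households without one
keep  <- !unname(has_over_24[as.character(df$HouseholdID)])
dfout <- df[keep, ]
dfout
```

With this reference date, household 123 is dropped because of the person born in 1961, and rows 3, 4 and 5 remain.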
I am unable to figure out how to use nested for loops in R for solving my problem. Here's a miniature version of what I'm trying to solve:
I have two files, test1 and test2 which look like this:
head(test1)
Date Settlement
2008-08-28 138.29
2008-08-29 135.34
2008-09-01 135.23
2008-09-02 123.36
2008-09-03 126.41
2008-09-04 128.68
2008-09-05 123.70
2008-09-08 124.60
2008-09-09 122.33
2008-09-10 120.85
2008-09-11 120.15
2008-09-12 121.17
2008-09-15 118.97
2008-09-16 114.90
2008-09-17 115.78
2008-09-18 115.60
2008-09-19 115.90
2008-09-22 120.49
2008-09-23 124.10
And here is test2:
test2
X1 X2 X3
2008-08-31 2008-09-05 2008-09-11
2008-09-05 2008-09-11 2008-09-14
2008-09-11 2008-09-14 2008-09-18
2008-09-14 2008-09-18 2009-09-22
The logic that I need to put in is:
Select Dates [1,1] and [1,2] from test2
Find all Settlement Prices between those 2 dates in test1
Get average of those prices, place it in [1,1] of a new dataframe.
Repeat by increasing columns, and then rows in pt1.
The end-result of this would look like this:
X1 X2
128.42 122.87
122.87 120.66
120.66 116.55
116.55 115.75
So, the 1st value in X1 is an average of Settlement prices between 31-Aug-08 (including) and 5-Sep-08 (excluding), and the 1st value in X2 is an average of Settlement prices between 5-Sep-08 (including) and 11-Sep-08 (excluding), and so on for the rows below.
Here's my code that works (if I pass it fixed dates from test2 as given below):
temp1 <- test1 %>%
group_by(Date >= test2$X1[1] & Date < test2$X2[1]) %>%
summarise(AvgPrice2 = mean(Settlement, na.rm = T))
temp1 <- filter(temp1, temp1[,1]==TRUE)
However, no matter what I try (over last 3 days !) I cannot figure out how to put this into a for loop. Even tried rollapply, sapply...not able to get anything to work. The code need not be time efficient, I just need to automate this process.
I have been working with R for sometime, but clearly this is a problem for advanced users...Would deeply appreciate any help on this.
Many thanks in advance.
I would use an SQL-like approach through the sqldf package (which lets you apply SQL syntax to your data frames):
ds = data.frame(Date = c("2008-08-28", "2008-08-29", "2008-09-01", "2008-09-02", "2008-09-03", "2008-09-04", "2008-09-05", "2008-09-08", "2008-09-09", "2008-09-10", "2008-09-11", "2008-09-12", "2008-09-15", "2008-09-16", "2008-09-17", "2008-09-18", "2008-09-19", "2008-09-22", "2008-09-23"),
Settlement = c(138.29, 135.34, 135.23, 123.36, 126.41, 128.68, 123.70, 124.60, 122.33, 120.85, 120.15, 121.17, 118.97, 114.90, 115.78, 115.60, 115.90, 120.49, 124.10))
dr = data.frame(d1=c("2008-08-31", "2008-09-05", "2008-09-11", "2008-09-14"),
d2=c("2008-09-05", "2008-09-11", "2008-09-14", "2008-09-18"),
d3=c("2008-09-11", "2008-09-14", "2008-09-18", "2009-09-22"))
# add a variable which I will use to identify the rows
dr$g = 1:NROW(dr);
library(sqldf);
output = sqldf("SELECT dr.g, AVG(s1.Settlement) AS X1, AVG(s2.Settlement) AS X2
FROM dr
JOIN ds AS s1 ON dr.d1 <= s1.Date AND s1.Date < dr.d2
JOIN ds AS s2 ON dr.d2 <= s2.Date AND s2.Date < dr.d3
GROUP BY dr.g");
I found the suggested package in this post. In the same post another user suggested the data.table package, but I don't feel as confident with data.table syntax as with SQL :)
The documentation of sqldf and some usage examples can be found on the GitHub project page.
I'm not sure I got it right, as one of my results differs from your wanted output. First, make sure the dates are of class Date.
test1$Date <- as.Date(test1$Date)
test2$X1 <- as.Date(test2$X1)
test2$X2 <- as.Date(test2$X2)
test2$X3 <- as.Date(test2$X3)
Now, for the computations you've described.
res1 <- numeric(nrow(test2))
res2 <- numeric(nrow(test2))
for(i in seq_len(nrow(test2))){
inx <- test2$X1[i] <= test1$Date & test1$Date < test2$X2[i]
res1[i] <- mean(test1$Settlement[inx])
inx <- test2$X2[i] <= test1$Date & test1$Date < test2$X3[i]
res2[i] <- mean(test1$Settlement[inx])
}
result <- data.frame(X1 = res1, X2 = res2)
result
X1 X2
1 128.42 122.8700
2 122.87 120.6600
3 120.66 116.5500
4 116.55 119.0225
The value that is different is the very last one, result$X2[4]. Your output is 115.75 and here it's 119.0225.
Your data
Ensuring dates are Dates
library(lubridate)
test1 = data.frame(Date = ymd(c("2008-08-28", "2008-08-29", "2008-09-01", "2008-09-02", "2008-09-03", "2008-09-04", "2008-09-05", "2008-09-08", "2008-09-09", "2008-09-10", "2008-09-11", "2008-09-12", "2008-09-15", "2008-09-16", "2008-09-17", "2008-09-18", "2008-09-19", "2008-09-22", "2008-09-23")),
Settlement = c(138.29, 135.34, 135.23, 123.36, 126.41, 128.68, 123.70, 124.60, 122.33, 120.85, 120.15, 121.17, 118.97, 114.90, 115.78, 115.60, 115.90, 120.49, 124.10))
test2 = data.frame(d1=ymd(c("2008-08-31", "2008-09-05", "2008-09-11", "2008-09-14")),
d2=ymd(c("2008-09-05", "2008-09-11", "2008-09-14", "2008-09-18")),
d3=ymd(c("2008-09-11", "2008-09-14", "2008-09-18", "2009-09-22")))
tidyverse solution
library(tidyverse)
result <- map_df(1:nrow(test2), ~data.frame(X1=(filter(test1, Date >= test2$d1[.x] & Date < test2$d2[.x]) %>% summarise(m=mean(Settlement)))$m,
X2=(filter(test1, Date >= test2$d2[.x] & Date < test2$d3[.x]) %>% summarise(m=mean(Settlement)))$m))
Output
X1 X2
1 128.42 122.8700
2 122.87 120.6600
3 120.66 116.5500
4 116.55 119.0225
Thanks a lot for all the answers, I tried all of them, but none seemed to fit my needs given that the files above were a miniaturized version of actual files - so coding by column names / splitting data manually into rows didn't seem like a good option for me.
But I finally figured out what'll work nicely in this case:
library(lubridate)
library(dplyr) # needed for filter()
Makingrows <- function(test1, test2, j){
  # one mean per row of test2, for the date window [column j, column j+1)
  res <- numeric(nrow(test2))
  for(i in seq_len(nrow(test2))){
    d1 <- ymd(test2[i,j])
    d2 <- ymd(test2[i,j+1])
    X1 <- filter(test1, Date < d2 & Date >= d1)
    res[i] <- mean(X1$Settlement, na.rm = T)
  }
  return(res)
}
mcol1 <- ncol(test2)-1
finalres <- lapply(1:mcol1, function(x) Makingrows(test1, test2, x))
finalres <- as.data.frame(finalres)
And yes, I was also getting the last value as 119.02...and I realized that by mistake I put the year as 2009 in the last cell in test2 file. Due to this, the code was picking up all the values till the end.
Thanks a lot everyone. I hope you'll agree with me as I mark this as the answer to my question.
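The same column-pair idea can also be sketched in base R without dplyr or lubridate, and without naming the columns of test2 explicitly. The helper name window_mean is hypothetical, and the last date in X3 uses 2008 rather than 2009, following the correction described above.

```r
test1 <- data.frame(
  Date = as.Date(c("2008-08-28", "2008-08-29", "2008-09-01", "2008-09-02",
                   "2008-09-03", "2008-09-04", "2008-09-05", "2008-09-08",
                   "2008-09-09", "2008-09-10", "2008-09-11", "2008-09-12",
                   "2008-09-15", "2008-09-16", "2008-09-17", "2008-09-18",
                   "2008-09-19", "2008-09-22", "2008-09-23")),
  Settlement = c(138.29, 135.34, 135.23, 123.36, 126.41, 128.68, 123.70,
                 124.60, 122.33, 120.85, 120.15, 121.17, 118.97, 114.90,
                 115.78, 115.60, 115.90, 120.49, 124.10)
)
test2 <- data.frame(
  X1 = as.Date(c("2008-08-31", "2008-09-05", "2008-09-11", "2008-09-14")),
  X2 = as.Date(c("2008-09-05", "2008-09-11", "2008-09-14", "2008-09-18")),
  X3 = as.Date(c("2008-09-11", "2008-09-14", "2008-09-18", "2008-09-22"))
)

# Hypothetical helper: mean Settlement for start <= Date < end
window_mean <- function(start, end) {
  in_window <- test1$Date >= start & test1$Date < end
  mean(test1$Settlement[in_window], na.rm = TRUE)
}

# Outer loop over adjacent column pairs, inner loop over rows of test2
n_pairs <- ncol(test2) - 1
result <- as.data.frame(lapply(seq_len(n_pairs), function(j) {
  vapply(seq_len(nrow(test2)),
         function(i) window_mean(test2[[j]][i], test2[[j + 1]][i]),
         numeric(1))
}))
names(result) <- names(test2)[seq_len(n_pairs)]
result
```

Because the loops run over column positions, the sketch scales to a test2 with any number of date columns.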
I'm trying to use the 'relsurv' package in R to compare the survival of a cohort to national life tables. The code below shows my problem using the example from relsurv but changing the life-table data. I've just used two years and two ages in the life-table data below, the actual data is much larger but gives the same error. The error is 'invalid ratetable argument' but I've formatted it as per the example life-tables 'slopop' and 'survexp.us'.
library(survival)
library(relsurv)
data(rdata) # example data from relsurv
raw = read.table(header=T, stringsAsFactors = F, sep=' ', text='
Year Age sex qx
1980 30 1 0.00189
1980 31 1 0.00188
1981 30 1 0.00191
1981 31 1 0.00191
1980 30 2 0.00077
1980 31 2 0.00078
1981 30 2 0.00076
1981 31 2 0.00074
')
ages = c(30,40) # in years
years = c(1980, 1990)
rtab = array(data=NA, dim=c(length(ages), 2, length(years))) # set up blank array: ages, sexes, years
for (y in unique(raw$Year)){
for (s in 1:2){
rtab[ , s, y-min(years)+1] = -1 * log(1-subset(raw, Year==y&sex==s)$qx) / 365.24 # probability of death in next year, transformed to hazard (see ratetables help)
}
}
attributes(rtab)$dimnames[[1]] = as.character(ages)
attributes(rtab)$dimnames[[2]] = c('male','female')
attributes(rtab)$dimnames[[3]] = as.character(years)
attributes(rtab)$dimid <- c("age", "sex", 'year')
attributes(rtab)$dim <- c(length(ages), 2, length(years))
attributes(rtab)$factor = c(0,0,1)
attributes(rtab)$type = c(2,1,4)
attributes(rtab)$cutpoints[[1]] = ages*365.24 # must be in days
attributes(rtab)$cutpoints[[2]] = NULL
attributes(rtab)$cutpoints[[3]] = as.date(paste("1Jan", years, sep='')) # must be date
attributes(rtab)$class = "ratetable"
# example from relsurv
rsmul(Surv(time,cens) ~ sex+as.factor(agegr)+
ratetable(age=age*365.24, sex=sex, year=year),
data=rdata, ratetable=rtab, int=1)
Try using the transrate function from the relsurv package to reformat the data. That should give you a compatible dataset.
Regards,
Josh
Three things to add:
You should set attributes(rtab)$factor = c(0,1,0), since sex (the second dimension) is a factor (i.e., doesn't change over time).
A good way to check whether something is a valid rate table is to use the is.ratetable() function. is.ratetable(rtab, verbose = TRUE) will even return a message stating what was wrong.
Check the result of is.ratetable without using verbose first, because it will lie about valid rate tables.
The rest of this comment is about this lie.
If the type attribute isn't given, is.ratetable will calculate it using the factor attribute; you can see this by just printing the function. However, it seems to do so incorrectly. It uses type <- 1 * (fac == 1) + 2 * (fac == 0) + 4 * (fac > 0), where fac is attributes(rtab)$factor.
But the next section, which checks the type attribute if it's provided, says the only valid values are 1, 2, 3, and 4. It's impossible to get 1 from the code above.
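The arithmetic can be checked directly. This sketch only evaluates the expression as quoted above for a few values of fac; it does not call is.ratetable itself, and the expression has not been checked here against the current package source.

```r
# Candidate values of the factor attribute
fac <- c(0, 1, 2)

# Expression as quoted above
type <- 1 * (fac == 1) + 2 * (fac == 0) + 4 * (fac > 0)
type
# 2 5 4
```

A value of fac == 1 triggers both the first and the third term, giving 5 rather than 1, so the computed type can never be 1.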
For example, let's examine the slopop ratetable provided with the relsurv package.
library(relsurv)
data(slopop)
is.ratetable(slopop)
# [1] TRUE
is.ratetable(slopop, verbose = TRUE)
# [1] "wrong length for cutpoints 3"
I think this is where your rate table is being hung up.
I am sorry for the cryptic title but I didn't know how to adequately summarise my problem. So here's my question. I have a data frame with dates and a name for several entities:
df <- data.frame(
time=rep(as.Date(seq(as.Date("2004/1/1"), as.Date("2005/12/1"), by = "1 month ")),2),
name=c(rep("a",24),rep("b",24))
)
str(df)
'data.frame': 48 obs. of 2 variables:
$ time: Date, format: "2004-01-01" "2004-02-01" ...
$ name: Factor w/ 2 levels "a","b": 1 1 1 1 1 1 1 1 1 1 ...
And I have another dataframe with several unevenly spaced events:
events <- data.frame(
time = c("2004-12-1", "2005-8-1", "2005-6-1", "2004-4-1"),
event = c("normal", "extraordinary", "normal", "extraordinary"),
name = c("a", "a", "b", "b")
)
I want to merge these two data frames so that each event is assigned either from the beginning of the data set up to the event, or from the previous event up to the next event or the end of the data set. This would look something like:
date name event
2004-01-01 a normal
2004-01-02 a normal
...
2004-12-01 a extraordinary
2005-01-01 a extraordinary
Is there an easy way doing this in R that I don't see or do I merge these by hand? Thank you very much for your help!
I don't know of any function that does this, but here is some R code to do it yourself:
# Needed type coercions (Date for comparisons, characters to avoid 'factor' problems)
events$time <- as.Date(events$time)
events$event <- as.character(events$event)
events$name <- as.character(events$name)
df$name <- as.character(df$name)
# Events ordering (needed to detect previous events as non NA)
events <- events[ order(events$time) ,]
# Updates
df$event = NA
for(i in 1:nrow(events)) {
# Update where time is lesser than the limit, if names correspond and if an event was not already assigned to the row
df[ df$time <= events[i,"time"] & df$name == events[i,"name"] & is.na(df$event) , "event" ] = events[i,"event"]
}
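The same assignment rule (each row gets the first event of the same name dated on or after the row's date, and rows after the last event stay NA) can also be sketched with findInterval() instead of repeated subsetting. This is an alternative formulation under that reading of the loop above, not a different result; the data is rebuilt here so the block runs on its own.

```r
df <- data.frame(
  time = rep(seq(as.Date("2004-01-01"), as.Date("2005-12-01"), by = "1 month"), 2),
  name = rep(c("a", "b"), each = 24),
  stringsAsFactors = FALSE
)
events <- data.frame(
  time  = as.Date(c("2004-12-01", "2005-08-01", "2005-06-01", "2004-04-01")),
  event = c("normal", "extraordinary", "normal", "extraordinary"),
  name  = c("a", "a", "b", "b"),
  stringsAsFactors = FALSE
)

df$event <- NA_character_
for (nm in unique(events$name)) {
  ev   <- events[events$name == nm, ]
  ev   <- ev[order(ev$time), ]
  rows <- which(df$name == nm)
  # Count of events strictly before each row's date, plus one, is the
  # index of the first event on or after that date
  idx <- findInterval(as.numeric(df$time[rows]), as.numeric(ev$time),
                      left.open = TRUE) + 1
  # An index past the last event yields NA, matching the loop above
  df$event[rows] <- ev$event[idx]
}
```

For name "a", rows up to and including 2004-12-01 receive "normal", rows up to 2005-08-01 receive "extraordinary", and later rows remain NA.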
Here is a function to do what you want:
event.aligning <- function(time.dataframe, events){
if(!inherits(events[["time"]], 'Date')){
events[["time"]] <- as.Date(events[["time"]])
}
## lets sort on time
events <- events[order(events[["time"]]),]
## setup event column as a factor whose levels are the event types
## (event.types has to be defined before it is used here)
event.types <- levels(as.factor(events$event))
time.dataframe$event <- factor(rep(NA, nrow(time.dataframe)), levels = event.types)
rownames.tdf <- rownames(time.dataframe)
res.time.dataframe <- NULL
for( i in 1:length(levels(events$name))){
i.name <- levels(events$name)[i]
i.name.events <- subset(events, name == i.name)
first.time <- time.dataframe$time[time.dataframe$name == i.name][1]
first.event <- i.name.events$time[1]
## assume 2 events
first.event.type <- i.name.events$event[1]
second.event.type <- unique(i.name.events$event[i.name.events$event != first.event.type])
event.types <- levels(i.name.events$event)
sub.time.df <- time.dataframe[time.dataframe$name == i.name,]
rownames(sub.time.df) <- 1:length(sub.time.df[,1])
sub.time.df[1:(as.numeric(rownames(sub.time.df[sub.time.df$time == first.event,])) - 1),]$event <- second.event.type
cur.event <- first.event
for( j in 2:length(i.name.events[,1])){
next.event <- i.name.events$time[j]
sub.time.df[rownames( sub.time.df[ sub.time.df[["time"]] == cur.event,]) :
(as.numeric(rownames( sub.time.df[sub.time.df[["time"]] == next.event,])) - 1),]$event <- i.name.events$event[j-1]
cur.event <- next.event
next.event.type = i.name.events$event[j]
}
last.time <- sub.time.df$time[length(sub.time.df$time)]
last.event <- i.name.events$time[length(i.name.events$time)]
sub.time.df[rownames( sub.time.df[sub.time.df$time == last.event,]):length(sub.time.df$time),]$event <- next.event.type
res.time.dataframe <- rbind(res.time.dataframe, sub.time.df)
}
rownames(res.time.dataframe) <- rownames.tdf
return(res.time.dataframe)
}
df2 <- event.aligning(df, events)