Transforming Dates in R

I have a large CSV file in which the relevant dates are categorical and formatted in one column as follows: "Thu, 21 Jan 2012 04:59:00 -0000". I am trying to use as.Date, but it doesn't seem to be working. It would be great to have several columns for weekday, day, month, and year, but I am happy to settle for one column at this point. Any suggestions?
UPDATE QUESTION: Each row has a different date in the above format (weekday, day, month, year, hour, minutes, seconds). I did not make that clear. How do I transform each date in the column?

The anytime package can parse this without a format:
R> anytime("Thu, 21 Jan 2012 04:59:00 -0000")
[1] "2012-01-21 04:59:00 CST"
R>
It returns a POSIXct you can then operate on, or just format(), at will. It also has a simpler variant anydate() which returns a Date object instead.

library(lubridate)
my_date <- "Thu, 21 Jan 2012 04:59:00 -0000"
# Get it into date format
my_date <- dmy_hms(my_date)
# Use convenience functions to set up the columns you wanted
data.frame(day = day(my_date), month = month(my_date), year = year(my_date),
           timestamp = my_date)
#   day month year           timestamp
# 1  21     1 2012 2012-01-21 04:59:00

We can use
as.Date(str1, "%a, %d %b %Y")
#[1] "2012-01-21"
If we need DateTime format
v1 <- strptime(str1, '%a, %d %b %Y %H:%M:%S %z', tz = "UTC")
v1
#[1] "2012-01-21 04:59:00 UTC"
Or using lubridate
library(lubridate)
dmy_hms(str1)
#[1] "2012-01-21 04:59:00 UTC"
data
str1 <- "Thu, 21 Jan 2012 04:59:00 -0000"
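Since the update asks how to transform every date in the column: all of these parsers are vectorized, so they apply to the whole column at once. A minimal sketch in base R (the data frame and column name df$date_str are made up for illustration) that also builds the weekday/day/month/year columns the question mentions:

```r
# Hypothetical data frame standing in for the CSV column
df <- data.frame(date_str = c("Thu, 21 Jan 2012 04:59:00 -0000",
                              "Fri, 22 Jun 2012 13:30:00 -0000"))

# strptime() is vectorized: it parses the entire column in one call.
# Note %a/%b are locale-dependent; "Thu"/"Jan" need an English locale.
dt <- strptime(df$date_str, "%a, %d %b %Y %H:%M:%S %z", tz = "UTC")

df$weekday <- weekdays(dt)       # e.g. "Thursday"
df$day     <- dt$mday            # day of month
df$month   <- dt$mon + 1         # POSIXlt months are 0-based
df$year    <- dt$year + 1900     # POSIXlt years count from 1900
df
```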

If you really want the separation in components then start with Dirk's powerful suggestion and then transpose the output of as.POSIXlt:
library(anytime)
times <- c("2004-03-21 12:45:33.123456",   # example from ?anytime
           "2004/03/21 12:45:33.123456",
           "20040321 124533.123456",
           "03/21/2004 12:45:33.123456",
           "03-21-2004 12:45:33.123456",
           "2004-03-21",
           "20040321",
           "03/21/2004",
           "03-21-2004",
           "20010101")
t(sapply(anytime::anytime(times),
         function(x) unlist(as.POSIXlt(x))))
sec min hour mday mon year wday yday isdst
[1,] "33.1234560012817" "45" "12" "21" "2" "104" "0" "80" "0"
[2,] "33.1234560012817" "45" "12" "21" "2" "104" "0" "80" "0"
[3,] "33.1234560012817" "45" "12" "21" "2" "104" "0" "80" "0"
[4,] "33.1234560012817" "45" "12" "21" "2" "104" "0" "80" "0"
[5,] "33.1234560012817" "45" "12" "21" "2" "104" "0" "80" "0"
[6,] "0" "0" "0" "21" "2" "104" "0" "80" "0"
[7,] "0" "0" "0" "21" "2" "104" "0" "80" "0"
[8,] "0" "0" "0" "21" "2" "104" "0" "80" "0"
[9,] "0" "0" "0" "21" "2" "104" "0" "80" "0"
[10,] "0" "0" "0" "1" "9" "101" "1" "273" "1"
zone gmtoff
[1,] "PST" "-28800"
[2,] "PST" "-28800"
[3,] "PST" "-28800"
[4,] "PST" "-28800"
[5,] "PST" "-28800"
[6,] "PST" "-28800"
[7,] "PST" "-28800"
[8,] "PST" "-28800"
[9,] "PST" "-28800"
[10,] "PDT" "-25200"

Related

How to obtain values from a matrix using stored numbers as indexes in R

I am really new to R and I can't find a way to subset matrix rows given a list of indexes.
I have a dataframe called 'demo' with 855 rows and 3 columns that looks like this:
## Subject AGE DX
## 1 011_S_0002_bl 74.3 0
## 2 011_S_0003_bl 81.3 1
## 3 011_S_0005_bl 73.7 0
## 4 022_S_0007_bl 75.4 1
## 5 011_S_0008_bl 84.5 0
## 6 011_S_0010_bl 73.9 1
From this, I want to extract the indexes for all the rows that match DX == 1. So I do:
rownames(demo[demo$DX == 1,])
Which returns:
## [1] "2" "4" "6" "14" "20" "31" "33" "34" "36" "39" "40" "41"
## [13] "46" "47" "53" "54" "55" "58" "64" "67" "69" "70" "72" "81"
## [25] "84" "87" "88" "92" "96" "98" "100" "101" "106" "108" "109" "112"
....
Now I have a matrix called T_hat with 855 rows and 1 column that looks like this:
## [,1]
## [1,] 5.812925
## [2,] 10.477721
## [3,] 1.519726
## [4,] -0.221328
## [5,] 1.784920
What I want is to use those row numbers to subset T_hat, keeping the values in the matching positions, and get something like this:
## [,1]
## [2,] 10.477721
## [4,] -0.221328
...and so on.
I've tried all these options:
T_hat_a <- T_hat[rownames(demo[demo$DX == 1,]),1]
T_hat_b <- T_hat[is.numeric(rownames(demo[demo$DX == 1,])),1]
T_hat_c <- T_hat[rownames(T_hat) %in% rownames(demo[demo$DX == 1,]),1]
T_hat_d <- T_hat[rownames(T_hat) %in% is.numeric(rownames(demo[demo$DX == 1,])),1]
But none returns what I expect.
T_hat_a = ERROR "no 'dimnames' attribute for array"
T_hat_b = numeric(0)
T_hat_c = numeric(0)
T_hat_d = numeric(0)
I've also tried converting my matrix to a data frame; then only the T_hat_a option returns a result, but it is not at all what I want, since it returns different values...
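For what it's worth, the likely cause is that T_hat was created without dimnames, so character subscripts have nothing to match against, and is.numeric() tests a vector's type (returning a single TRUE/FALSE) rather than converting it. A sketch of two fixes, using small made-up stand-ins for demo and T_hat and assuming the matrix rows line up one-to-one with the rows of demo:

```r
# Toy stand-ins (values and sizes are assumptions, not the real data)
demo  <- data.frame(Subject = paste0("S", 1:6),
                    AGE = c(74.3, 81.3, 73.7, 75.4, 84.5, 73.9),
                    DX  = c(0, 1, 0, 1, 0, 1))
T_hat <- matrix(c(5.81, 10.48, 1.52, -0.22, 1.78, 3.10), ncol = 1)

# Fix 1: skip rownames entirely and index logically/positionally
T_hat_a <- T_hat[demo$DX == 1, 1]

# Fix 2: convert the character rownames to integers first
idx     <- as.integer(rownames(demo[demo$DX == 1, ]))
T_hat_b <- T_hat[idx, 1]

identical(T_hat_a, T_hat_b)   # TRUE: both select rows 2, 4, 6
```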

anova() does not work properly with lme objects after updating - am I missing something?

I had code that worked fine so far. I want to test things with gls, lme and gamm (from packages nlme and mgcv), and I compared different models with anova(). However, I needed another package that did not work with my R version (which was almost a year old), so I updated R (via the updater package) and RStudio.
The issue now is that anova() either gives no output at all after running, or prints only "Denom. DF: 91" and nothing else.
I have tried different things and searched a lot, but I found no current thread dealing with such a problem, and the help files say it should work the way I am using it. Thus, I suspect I am missing something essential (probably even obvious), but I don't see it. I hope you can tell me where I am going wrong.
Here is some data to play with (copied from txt-file):
"treat" "x" "time" "nest"
"1" "1" 49.37 1 "K1"
"2" "1" 48.68 1 "K2"
"3" "2" 44.7 1 "T7"
"4" "2" 49.3 1 "T8"
"5" "1" 48.78 1 "K3"
"6" "2" 42.37 1 "T10"
"7" "1" 39.26 1 "K4"
"8" "2" 46.36 1 "T11"
"9" "1" 40.36 1 "K5"
"10" "2" 47.14 1 "T9"
"11" "1" 48.81 1 "K6"
"12" "1" 40.4 1 "K10"
"13" "2" 53.42 1 "T4"
"14" "2" 46.85 1 "T5"
"15" "2" 44.58 1 "T2"
"16" "2" 47.51 1 "T6"
"17" "1" 51.7 1 "K8"
"18" "1" 48.16 1 "K7"
"19" "2" 48.86 1 "T3"
"20" "1" 44.6 1 "K11"
"21" "1" 49.71 1 "K9"
"22" "2" 44.54 1 "T1"
"23" "2" 41.55 2 "T3"
"24" "1" 32.55 2 "K3"
"25" "1" 42.15 2 "K1"
"26" "2" 51.06 2 "T1"
"27" "1" 38.43 2 "K11"
"28" "2" 39.91 2 "T11"
"29" "1" 36.73 2 "K7"
"30" "2" 50.19 2 "T4"
"31" "1" 42.26 2 "K8"
"32" "1" 43.02 2 "K6"
"33" "2" 37.6 2 "T10"
"34" "1" 33.42 2 "K4"
"35" "2" 39.64 2 "T5"
"36" "2" 43.56 2 "T2"
"37" "2" 35.31 2 "T7"
"38" "2" 37 2 "T8"
"39" "2" 40.87 2 "T6"
"40" "1" 35.29 2 "K9"
"41" "2" 41.83 2 "T9"
"42" "1" 37.88 2 "K10"
"43" "1" 36.5 2 "K5"
"44" "1" 34.21 3 "K4"
"45" "1" 38.04 3 "K6"
"46" "1" 35.14 3 "K3"
"47" "2" 38.18 3 "T10"
"48" "1" 40.26 3 "K11"
"49" "2" 37.09 3 "T3"
"50" "2" 43.1 3 "T11"
"51" "2" 34.26 3 "T7"
"52" "1" 36.58 3 "K9"
"53" "1" 35.81 3 "K2"
"54" "1" 39.83 3 "K10"
"55" "2" 37.65 3 "T6"
"56" "1" 39.8 3 "K7"
"57" "1" 36.41 3 "K8"
"58" "1" 35.22 3 "K5"
"59" "2" 39.68 3 "T8"
"60" "2" 41.12 3 "T1"
"61" "2" 36.93 3 "T9"
"62" "1" 35.66 3 "K1"
"63" "2" 36.91 3 "T4"
"64" "2" 38.84 3 "T5"
"65" "2" 34.31 3 "T2"
"66" "1" 32.71 4 "K9"
"67" "2" 37.84 4 "T11"
"68" "1" 28.01 4 "K10"
"69" "2" 39.69 5 "T11"
"70" "2" 35.08 4 "T10"
"71" "2" 34.43 4 "T9"
"72" "1" 32.12 4 "T8"
"73" "2" 30.41 4 "T7"
"74" "1" 31.81 4 "K7"
"75" "2" 36.41 4 "T6"
"76" "1" 29.17 5 "K6"
"77" "1" 28.59 4 "K6"
"78" "2" 33.99 4 "T5"
"79" "1" 30.41 4 "K5"
"80" "1" 29.8 4 "K4"
"81" "2" 34.72 4 "T4"
"82" "2" 34.38 4 "T3"
"83" "1" 28.12 4 "K3"
"84" "2" 34.62 4 "T2"
"85" "1" 31.88 4 "K2"
"86" "1" 29.35 4 "K1"
"87" "2" 37.95 4 "T1"
"88" "2" 40.85 5 "T4"
"89" "2" 35.07 5 "T5"
"90" "2" 36.15 5 "T8"
"91" "2" 36.48 5 "T10"
"92" "1" 33.73 4 "K8"
"93" "1" 28.17 5 "K9"
"94" "1" 32.81 5 "K10"
"95" "1" 32.17 4 "K11"
And this is basically one of the models I try to run:
test <- read.table(file="C:/Users/marvi_000/Desktop/testdata.txt")
str(test)
test$treat <- as.factor(test$treat)
test$nest <- as.factor(test$nest)
library(nlme)
m.test <- gls(x ~ treat * time,
correlation = corAR1(form =~ time | nest),
test, na.action = na.omit)
anova(m.test)
The output is:
Denom. DF: 91
When comparing models with anova(m1, m2) nothing happens at all.
The same is true when I run a gamm from package mgcv and using anova(m$lme) or anova(m1$lme, m2$lme).
I would appreciate any help or hint, pointing me towards the right direction. Thanks a lot!
EDIT:
After some discussion, I found out that it is a problem with the scripts. I'm using RStudio and R Markdown. When I run the code (with Ctrl+Enter, line by line) within the markdown script, the anova(lmemodel) command does not work as it should. However, if I copy that single command into a plain R script (still using the current environment), the command executes properly and shows the desired output.
I have no clue what is happening there. If anybody has an idea where the problem is, or how to solve it, I would still be happy to hear it.
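One low-risk thing to try (a guess at the cause, not a confirmed fix): when output is swallowed inside an R Markdown chunk, wrapping the call in an explicit print() forces the anova print method to run. A self-contained stand-in using the Orthodont data that ships with nlme (the formula is illustrative only):

```r
library(nlme)

# Stand-in gls model on built-in data; replace with your own m.test
m <- gls(distance ~ age * Sex, data = Orthodont)

a <- anova(m)
print(a)   # explicit print(), in case auto-printing is suppressed
```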

R: find first non-NA observation in data.table column by group

I have a data.table with many missing values and I want a variable that gives me a 1 for the first non-missing value in each group.
Say I have such a data.table:
library(data.table)
DT <- data.table(iris)[,.(Petal.Width,Species)]
DT[c(1:10,15,45:50,51:70,101:134),Petal.Width:=NA]
which now has missing values at the beginning, at the end, and in between. I have tried two versions. One is:
DT[min(which(!is.na(Petal.Width))),first_available:=1,by=Species]
but it only finds the global minimum (in this case, setosa gets the correct 1), not the minimum by group. I think this is the case because data.table first subsets by i, then sorts by group, correct? So it will only work with the row that is the global minimum of which(!is.na(Petal.Width)) which is the first non-NA value.
A second attempt with the test in j:
DT[,first_available:= ifelse(min(which(!is.na(Petal.Width))),1,0),by=Species]
which just returns a column of 1s. Here, I don't have a good explanation as to why it doesn't work.
my goal is this:
DT[,first_available:=0]
DT[c(11,71,135),first_available:=1]
but in reality I have hundreds of groups. Any help would be appreciated!
Edit: this question does come close but is not targeted at NA's and does not solve the issue here if I understand it correctly. I tried:
DT <- data.table(DT, key = c('Species'))
DT[unique(DT[,key(DT), with = FALSE]), mult = 'first']
Here's one way:
DT[!is.na(Petal.Width), first := as.integer(seq_len(.N) == 1L), by = Species]
We can try
DT[DT[, .I[which.max(!is.na(Petal.Width))], Species]$V1,
   first_available := 1][is.na(first_available), first_available := 0]
Or a slightly more compact option is
DT[, first_available := as.integer(1:nrow(DT) %in%
DT[, .I[!is.na(Petal.Width)][1L], by = Species]$V1)][]
> DT[!is.na(DT$Petal.Width) & DT$first_available == 1]
# Petal.Width Species first_available
# 1: 0.2 setosa 1
# 2: 1.8 versicolor 1
# 3: 1.4 virginica 1
> rownames(DT)[!is.na(DT$Petal.Width) & DT$first_available == 1]
# [1] "11" "71" "135"
> rownames(DT)[!is.na(DT$Petal.Width) & DT$first_available == 0]
# [1] "12" "13" "14" "16" "17" "18" "19" "20" "21" "22" "23" "24"
# [13] "25" "26" "27" "28" "29" "30" "31" "32" "33" "34" "35" "36"
# [25] "37" "38" "39" "40" "41" "42" "43" "44" "72" "73" "74" "75"
# [37] "76" "77" "78" "79" "80" "81" "82" "83" "84" "85" "86" "87"
# [49] "88" "89" "90" "91" "92" "93" "94" "95" "96" "97" "98" "99"
# [61] "100" "136" "137" "138" "139" "140" "141" "142" "143" "144" "145" "146"
# [73] "147" "148" "149" "150"

Reformatting Panel Data according to a time and event variable

I have a panel dataset with many variables. The three most relevant variables are: "cid" (country code), "time" (0-65), and "event" (0, 1, 2, 3, 4, 5, 6).
I am trying to run a Cox regression (using coxph); however, since the time variable has different starting and ending points for each country, I first need to create start-time and end-time variables. This is where I run into my problem.
Here is what a sample of the three main variables may look like:
> data
cid time event
[1,] "AFG" "20" "0"
[2,] "AFG" "21" "0"
[3,] "AFG" "22" "0"
[4,] "AFG" "23" "0"
[5,] "AFG" "24" "0"
[6,] "AFG" "25" "0"
[7,] "AFG" "26" "1"
[8,] "AFG" "27" "1"
[9,] "AFG" "28" "1"
[10,] "AFG" "29" "1"
The idea is to convert this data into the following:
> data
cid time1 time2 event
[1,] "AFG" "20" "25" "0"
[2,] "AFG" "26" "29" "1"
How exactly does one go about doing this (keeping in mind that there are quite a few other explanatory variables in my dataset)?
You could use dplyr and the pipe. This solution will work if your data is always ordered sequentially, as in your example.
data<-data.frame(cid=rep("AFG",10),time=seq(20,29,1),event=c(0,0,0,0,0,0,1,1,1,1))
library(dplyr)
data %>% group_by(cid,event) %>%
summarise(time1=min(time),time2=max(time))
Alternatively, the same result in base R:
subset1 <- data[data$event == 0, ]
subset1
subset2<- data[data$event==1,]
subset2
s1<- cbind(cid="AFG",time1=min(subset1$time),time2=max(subset1$time),event = 0)
s1
s2<- cbind(cid="AFG",time1=min(subset2$time),time2=max(subset2$time),event = 1)
s2
data1=rbind(s1,s2)
data1
# cid time1 time2 event
# [1,] "AFG" "20" "25" "0"
# [2,] "AFG" "26" "29" "1"
Hope this helps a little.
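One caveat on the dplyr approach above: grouping by (cid, event) will merge spells that are not adjacent in time, e.g. a country that goes 0 → 1 → 0 would have both 0-spells collapsed into one row. If that can happen in your data, grouping by runs of identical event values is safer; a sketch with dplyr's cumsum()/lag():

```r
library(dplyr)

data <- data.frame(cid = rep("AFG", 10), time = 20:29,
                   event = c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1))

spells <- data %>%
  group_by(cid) %>%
  # new run id every time the event code changes within a country
  mutate(run = cumsum(event != lag(event, default = first(event)))) %>%
  group_by(cid, run) %>%
  summarise(time1 = min(time), time2 = max(time),
            event = first(event), .groups = "drop") %>%
  select(cid, time1, time2, event)
spells
```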

Lubridate week() gives me the 'wrong' week, possible TZ issue?

I'm trying to get the number of each particular week: 1 for the first week, 2 for the second, etc.
My data starts on Jan 1, 2012, and all dates/times are assumed to be in the Chicago (CST6CDT) timezone. Right off the bat I seem to be having a problem (with either my understanding or my code) getting the week function to give me what I need.
For example...
x=seq(as.POSIXlt("2012-1-1"), as.POSIXlt("2012-1-10"), by="day")
cbind(as.character(x), week(x))
...gives me...
[,1] [,2]
[1,] "2012-01-01" "1"
[2,] "2012-01-02" "1"
[3,] "2012-01-03" "1"
[4,] "2012-01-04" "1"
[5,] "2012-01-05" "1"
[6,] "2012-01-06" "1"
[7,] "2012-01-07" "2"
[8,] "2012-01-08" "2"
[9,] "2012-01-09" "2"
[10,] "2012-01-10" "2"
January 7th, 2012, a Saturday, should be considered as part of the 1st week, right? Setting the timezone doesn't seem to help.
x=seq(as.POSIXlt("2012-1-1", tz="CST6CDT"), as.POSIXlt("2012-1-10", tz="CST6CDT"), by="day")
Is there a way around this?
What you want is probably isoweek(), not week(). I always have the same issue with my calendar weeks :)
This has to do with the way the function week is written in the package:
> week
function (x)
yday(x) %/% 7 + 1
In your case, for January 7, 2012:
x = as.POSIXlt("2012-1-7")
yday(x) = 7
Then:
week(x) = (7 %/% 7) + 1 = 2
For it to work as you wish, try this:
x = seq(as.POSIXlt("2012-1-1", tz = "UTC"), as.POSIXlt("2012-1-20", tz = "UTC"), by = "day")
cbind(as.character(x), (yday(x)-1)%/%7+1)
You get the following output:
# [,1] [,2]
# [1,] "2012-01-01" "1"
# [2,] "2012-01-02" "1"
# [3,] "2012-01-03" "1"
# [4,] "2012-01-04" "1"
# [5,] "2012-01-05" "1"
# [6,] "2012-01-06" "1"
# [7,] "2012-01-07" "1" <<<
# [8,] "2012-01-08" "2"
# [9,] "2012-01-09" "2"
#[10,] "2012-01-10" "2"
#[11,] "2012-01-11" "2"
#[12,] "2012-01-12" "2"
#[13,] "2012-01-13" "2"
#[14,] "2012-01-14" "2"
#[15,] "2012-01-15" "3"
#[16,] "2012-01-16" "3"
#[17,] "2012-01-17" "3"
#[18,] "2012-01-18" "3"
#[19,] "2012-01-19" "3"
#[20,] "2012-01-20" "3"
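To round out the isoweek() suggestion above: ISO-8601 weeks start on Monday and week 1 contains the year's first Thursday, so Sunday, Jan 1, 2012 actually lands in ISO week 52 of 2011. For the Sunday-start numbering the question seems to expect, lubridate's epiweek() may be the closer match; a quick comparison of the three conventions:

```r
library(lubridate)

x <- as.Date("2012-01-01") + 0:9
data.frame(date    = x,
           week    = week(x),      # 7-day blocks counted from Jan 1
           isoweek = isoweek(x),   # ISO-8601: weeks start Monday
           epiweek = epiweek(x))   # US CDC: weeks start Sunday
```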
