I'm trying to get the number of each particular week, i.e. 1 for the first week, 2 for the second, etc.
My data starts with Jan 1, 2012, and works under the assumption that all dates/times are in the Chicago/CST6CDT time zone. Right off the bat I seem to be having a problem (with either my understanding or my code) getting the week function to give me what I need.
For example...
x=seq(as.POSIXlt("2012-1-1"), as.POSIXlt("2012-1-10"), by="day")
cbind(as.character(x), week(x))
...gives me...
[,1] [,2]
[1,] "2012-01-01" "1"
[2,] "2012-01-02" "1"
[3,] "2012-01-03" "1"
[4,] "2012-01-04" "1"
[5,] "2012-01-05" "1"
[6,] "2012-01-06" "1"
[7,] "2012-01-07" "2"
[8,] "2012-01-08" "2"
[9,] "2012-01-09" "2"
[10,] "2012-01-10" "2"
January 7th, 2012, a Saturday, should be considered as part of the 1st week, right? Setting the timezone doesn't seem to help.
x=seq(as.POSIXlt("2012-1-1", tz="CST6CDT"), as.POSIXlt("2012-1-10", tz="CST6CDT"), by="day")
Is there a way around this?
What you want is probably isoweek(), not week(). I always have the same issue with my calendar weeks :)
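For reference, here is a minimal sketch (not from the original comment) of what isoweek() gives for these dates. Note that ISO weeks start on Monday, so Jan 1, 2012 (a Sunday) is counted as week 52 of 2011, which may or may not be the numbering you want:
library(lubridate)
x <- seq(as.POSIXct("2012-01-01"), as.POSIXct("2012-01-10"), by = "day")
cbind(as.character(x), isoweek(x))
# "2012-01-01" -> 52 (last ISO week of 2011), "2012-01-02" to "2012-01-08" -> 1, then 2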
This has to do with the way the function week is written in the package:
> week
function (x)
yday(x)%/%7 + 1
In your case, for January 7, 2012:
x = as.POSIXlt("2012-1-7")
yday(x) = 7
Then:
week(x) = (7 %/% 7) + 1 = 2
For it to work as you wish, try this:
x=seq(as.POSIXlt("2012-1-1", tz = "UTC"), as.POSIXlt("2012-1-20", tz = "UTC"), by="day")
cbind(as.character(x), (yday(x)-1)%/%7+1)
You get the following output:
# [,1] [,2]
# [1,] "2012-01-01" "1"
# [2,] "2012-01-02" "1"
# [3,] "2012-01-03" "1"
# [4,] "2012-01-04" "1"
# [5,] "2012-01-05" "1"
# [6,] "2012-01-06" "1"
# [7,] "2012-01-07" "1" <<<
# [8,] "2012-01-08" "2"
# [9,] "2012-01-09" "2"
#[10,] "2012-01-10" "2"
#[11,] "2012-01-11" "2"
#[12,] "2012-01-12" "2"
#[13,] "2012-01-13" "2"
#[14,] "2012-01-14" "2"
#[15,] "2012-01-15" "3"
#[16,] "2012-01-16" "3"
#[17,] "2012-01-17" "3"
#[18,] "2012-01-18" "3"
#[19,] "2012-01-19" "3"
#[20,] "2012-01-20" "3"
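If you'd prefer not to depend on how the package defines week(), here is a minimal base R sketch of the same "Jan 1 starts week 1" numbering (week_of_year is a hypothetical helper name, not from the answer):
week_of_year <- function(x) (as.integer(strftime(x, "%j")) - 1) %/% 7 + 1  # %j = day of year
week_of_year(as.POSIXlt("2012-01-07"))  # 1
week_of_year(as.POSIXlt("2012-01-08"))  # 2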
I am really new to R and I can't find a way to subset matrix rows given a list of indexes.
I have a dataframe called 'demo' with 855 rows and 3 columns that looks like this:
## Subject AGE DX
## 1 011_S_0002_bl 74.3 0
## 2 011_S_0003_bl 81.3 1
## 3 011_S_0005_bl 73.7 0
## 4 022_S_0007_bl 75.4 1
## 5 011_S_0008_bl 84.5 0
## 6 011_S_0010_bl 73.9 1
From this, I want to extract the indexes for all the rows that match DX == 1. So I do:
rownames(demo[demo$DX == 1,])
Which returns:
## [1] "2" "4" "6" "14" "20" "31" "33" "34" "36" "39" "40" "41"
## [13] "46" "47" "53" "54" "55" "58" "64" "67" "69" "70" "72" "81"
## [25] "84" "87" "88" "92" "96" "98" "100" "101" "106" "108" "109" "112"
....
Now I have a matrix called T_hat with 855 rows and 1 column that looks like this:
## [,1]
## [1,] 5.812925
## [2,] 10.477721
## [3,] 1.519726
## [4,] -0.221328
## [5,] 1.784920
What I want is to use the numbers in that vector of row names to subset T_hat and get the values at the corresponding row positions, something like this:
## [,1]
## [2,] 10.477721
## [4,] -0.221328
...and so on.
I've tried all these options:
T_hat_a <- T_hat[rownames(demo[demo$DX == 1,]),1]
T_hat_b <- T_hat[is.numeric(rownames(demo[demo$DX == 1,])),1]
T_hat_c <- T_hat[rownames(T_hat) %in% rownames(demo[demo$DX == 1,]),1]
T_hat_d <- T_hat[rownames(T_hat) %in% is.numeric(rownames(demo[demo$DX == 1,])),1]
But none returns what I expect.
T_hat_a = ERROR "no 'dimnames' attribute for array"
T_hat_b = numeric(0)
T_hat_c = numeric(0)
T_hat_d = numeric(0)
I've also tried converting my matrix to a data frame; in that case only the T_hat_a option returns a result, but it is not at all what I want, since it returns different values...
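A minimal sketch of what should work, under the assumption that the rows of T_hat line up one-to-one with the rows of demo: index the matrix by position (or by the logical condition directly), since T_hat has no row names for character indexing to match:
idx <- which(demo$DX == 1)               # numeric row positions
T_hat_sub <- T_hat[idx, , drop = FALSE]  # keeps the one-column matrix shape
# equivalently, coerce the character row names back to numbers:
# T_hat[as.numeric(rownames(demo[demo$DX == 1, ])), , drop = FALSE]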
The following code creates a date sequence of 10 years with a 16-day interval.
library(chron)
seq.dates("01/01/2008","12/31/2017", 16)
Output
[1] 01/01/08 01/17/08 02/02/08 02/18/08 03/05/08 03/21/08 04/06/08 04/22/08 05/08/08
[10] 05/24/08 06/09/08 06/25/08 07/11/08 07/27/08 08/12/08 08/28/08 09/13/08 09/29/08
[19] 10/15/08 10/31/08 11/16/08 12/02/08 12/18/08 **01/03/09** 01/19/09 02/04/09 02/20/09
[28] 03/08/09 03/24/09 04/09/09 04/25/09 05/11/09 ..........
........................
...........................
[208] 01/25/17 02/10/17 02/26/17 03/14/17 03/30/17 04/15/17 05/01/17 05/17/17 06/02/17
[217] 06/18/17 07/04/17 07/20/17 08/05/17 08/21/17 09/06/17 09/22/17 10/08/17 10/24/17
[226] 11/09/17 11/25/17 12/11/17 12/27/17
I want the first entry for every year to be January 1st, not the day that falls 16 days after the last entry of the previous year (the bold entry in the example sequence), with the subsequent entries adjusted accordingly.
A long way to do this would be to create a date sequence for each year separately and then merge them into a single vector. I'm curious whether there is a way to do this in a single line of code.
How's this work for you? It uses sapply to pass a vector of starting points and then makes seq.dates do more limited sequences. The sapply function will simplify the result to an array if possible.
dates(sapply(seq.dates("01/01/2008", "01/01/2017", by = "years"),
             function(x) seq.dates(x, to = x + 365, by = 16, length = 23)))
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] 01/01/08 01/01/09 01/01/10 01/01/11 01/01/12 01/01/13 01/01/14 01/01/15
[2,] 01/17/08 01/17/09 01/17/10 01/17/11 01/17/12 01/17/13 01/17/14 01/17/15
[3,] 02/02/08 02/02/09 02/02/10 02/02/11 02/02/12 02/02/13 02/02/14 02/02/15
[4,] 02/18/08 02/18/09 02/18/10 02/18/11 02/18/12 02/18/13 02/18/14 02/18/15
[5,] 03/05/08 03/06/09 03/06/10 03/06/11 03/05/12 03/06/13 03/06/14 03/06/15
[6,] 03/21/08 03/22/09 03/22/10 03/22/11 03/21/12 03/22/13 03/22/14 03/22/15
[7,] 04/06/08 04/07/09 04/07/10 04/07/11 04/06/12 04/07/13 04/07/14 04/07/15
[8,] 04/22/08 04/23/09 04/23/10 04/23/11 04/22/12 04/23/13 04/23/14 04/23/15
[9,] 05/08/08 05/09/09 05/09/10 05/09/11 05/08/12 05/09/13 05/09/14 05/09/15
[10,] 05/24/08 05/25/09 05/25/10 05/25/11 05/24/12 05/25/13 05/25/14 05/25/15
[11,] 06/09/08 06/10/09 06/10/10 06/10/11 06/09/12 06/10/13 06/10/14 06/10/15
[12,] 06/25/08 06/26/09 06/26/10 06/26/11 06/25/12 06/26/13 06/26/14 06/26/15
[13,] 07/11/08 07/12/09 07/12/10 07/12/11 07/11/12 07/12/13 07/12/14 07/12/15
[14,] 07/27/08 07/28/09 07/28/10 07/28/11 07/27/12 07/28/13 07/28/14 07/28/15
[15,] 08/12/08 08/13/09 08/13/10 08/13/11 08/12/12 08/13/13 08/13/14 08/13/15
[16,] 08/28/08 08/29/09 08/29/10 08/29/11 08/28/12 08/29/13 08/29/14 08/29/15
[17,] 09/13/08 09/14/09 09/14/10 09/14/11 09/13/12 09/14/13 09/14/14 09/14/15
[18,] 09/29/08 09/30/09 09/30/10 09/30/11 09/29/12 09/30/13 09/30/14 09/30/15
[19,] 10/15/08 10/16/09 10/16/10 10/16/11 10/15/12 10/16/13 10/16/14 10/16/15
[20,] 10/31/08 11/01/09 11/01/10 11/01/11 10/31/12 11/01/13 11/01/14 11/01/15
[21,] 11/16/08 11/17/09 11/17/10 11/17/11 11/16/12 11/17/13 11/17/14 11/17/15
[22,] 12/02/08 12/03/09 12/03/10 12/03/11 12/02/12 12/03/13 12/03/14 12/03/15
[23,] 12/18/08 12/19/09 12/19/10 12/19/11 12/18/12 12/19/13 12/19/14 12/19/15
[,9] [,10]
[1,] 01/01/16 01/01/17
[2,] 01/17/16 01/17/17
[3,] 02/02/16 02/02/17
[4,] 02/18/16 02/18/17
[5,] 03/05/16 03/06/17
[6,] 03/21/16 03/22/17
[7,] 04/06/16 04/07/17
[8,] 04/22/16 04/23/17
[9,] 05/08/16 05/09/17
[10,] 05/24/16 05/25/17
[11,] 06/09/16 06/10/17
[12,] 06/25/16 06/26/17
[13,] 07/11/16 07/12/17
[14,] 07/27/16 07/28/17
[15,] 08/12/16 08/13/17
[16,] 08/28/16 08/29/17
[17,] 09/13/16 09/14/17
[18,] 09/29/16 09/30/17
[19,] 10/15/16 10/16/17
[20,] 10/31/16 11/01/17
[21,] 11/16/16 11/17/17
[22,] 12/02/16 12/03/17
[23,] 12/18/16 12/19/17
I was a bit surprised at this result since I thought the value would be a character matrix, but str shows it's a matrix of chron date elements. You can remove the apparent "matrix" (actually a "dates" object with a dimension attribute) structure with a call to c:
str(c(dates(sapply( seq.dates("01/01/2008", "01/01/2017", by="years") , function(x) seq.dates(x, to=x+365, by=16, length=23))) ))
'dates' num [1:230] 01/01/08 01/17/08 02/02/08 02/18/08 03/05/08 ...
- attr(*, "format")= chr "m/d/y"
- attr(*, "origin")= num [1:3] 1 1 1970
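For comparison, here is a rough base R sketch (my assumption, not part of the answer above) that restarts the 16-day sequence on each January 1 without chron, keeping the 23 steps per year used above:
starts <- seq(as.Date("2008-01-01"), as.Date("2017-01-01"), by = "year")
do.call(c, lapply(starts, function(s) seq(s, by = "16 days", length.out = 23)))
# returns a Date vector; 23 steps of 16 days stay inside each calendar year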
I have a library of words and punctuation. I am trying to make a dataframe out of it so I can use it later on. The original data set is a list of 2,000,000 rows that contain punctuation. I am having trouble separating the punctuation in the list from the rest of the words. I would like a space between each punctuation character and the surrounding words. I can easily do this in Excel with find and replace, but I want to do it in R. Below is an example input called df and the output I want in R called output, along with the code I have so far. I tried str_split on "How" but it deleted "How" and returned an empty string "".
#--------Upload 1st dataset and edit-------#
library("stringr")
sent1<-c("How did Quebec? 1 2 3")
sent2<-c("Why does valve = .245? .66")
sent3<-c("How do I use a period (.) comma [,] and hyphen {-} to columns?")
df <- data.frame(text = c(sent1,sent2,sent3))
df <- as.matrix(df)
str_split(df, " ")#spaces
#-------------output-------------#
words1<-c("How", "did" ,"Quebec"," ? ","1", "2" ,"3")
words2<-c('Why', "does", "valve"," = ",".245","?" ,".66")
words3<-c("How" ,"do", "I", "use", "a", "period", '(',".",')', "comma" ,'[',",","]" ,"and" ,"hyphen" ,"{","-",'}' ,"to" ,"columns",'?')
output<-data.frame(words1,words2,words3)
Here is a rough concept that gets the job done:
First split on all characters that are not word characters (inspired by another answer). Then get the maximum length and fill in the others to have the same length.
dfsplt <- strsplit( gsub("([^\\w])","~\\1~", df, perl = TRUE), "~")
dfsplt <- lapply(dfsplt, function(x) x[!x %in% c("", " ")])
n <- max(lengths(dfsplt))
sapply(dfsplt, function(x) {x <- rep(x, ceiling(n / length(x))); x[1:n]})
# or
sapply(dfsplt, function(x) x[(1:n - 1) %% length(x) + 1])
[,1] [,2] [,3]
[1,] "How" "Why" "How"
[2,] "did" "does" "do"
[3,] "Quebec" "valve" "I"
[4,] "?" "=" "use"
[5,] "1" "." "a"
[6,] "2" "245" "period"
[7,] "3" "?" "("
[8,] "How" "." "."
[9,] "did" "66" ")"
[10,] "Quebec" "Why" "comma"
[11,] "?" "does" "["
[12,] "1" "valve" ","
[13,] "2" "=" "]"
[14,] "3" "." "and"
[15,] "How" "245" "hyphen"
[16,] "did" "?" "{"
[17,] "Quebec" "." "-"
[18,] "?" "66" "}"
[19,] "1" "Why" "to"
[20,] "2" "does" "columns"
[21,] "3" "valve" "?"
Here is an option where we create a space around each punctuation character and then scan it separately:
# df was coerced to a matrix above, so index the column rather than use $
do.call(cbind, lapply(gsub("([[:punct:]])", " \\1 ", df[, "text"]),
                      function(x) scan(text = x, what = "", quiet = TRUE)))
# [,1] [,2] [,3]
# [1,] "How" "Why" "How"
# [2,] "did" "does" "do"
# [3,] "Quebec" "valve" "I"
# [4,] "?" "=" "use"
# [5,] "1" "." "a"
# [6,] "2" "245" "period"
# [7,] "3" "?" "("
# [8,] "How" "." "."
# [9,] "did" "66" ")"
#[10,] "Quebec" "Why" "comma"
#[11,] "?" "does" "["
#[12,] "1" "valve" ","
#[13,] "2" "=" "]"
#[14,] "3" "." "and"
#[15,] "How" "245" "hyphen"
#[16,] "did" "?" "{"
#[17,] "Quebec" "." "-"
#[18,] "?" "66" "}"
#[19,] "1" "Why" "to"
#[20,] "2" "does" "columns"
#[21,] "3" "valve" "?"
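If a rectangular matrix isn't actually required, a simpler hedged sketch is to keep one token vector per sentence in a list; the recycling visible in both outputs above only exists to pad each column to the same length:
tokens <- strsplit(gsub("([[:punct:]])", " \\1 ", c(sent1, sent2, sent3)), "\\s+")
tokens
# a list of 3 character vectors with 7, 9 and 21 tokens respectively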
I have a large csv file in which the relevant dates are categorical and formatted in one column as follows: "Thu, 21 Jan 2012 04:59:00 -0000". I am trying to use as.Date, but it doesn't seem to be working. It would be great to have several columns for weekday, day, month, and year, but I am happy to settle for one column at this point. Any suggestions?
UPDATE: Each row has a different date in the above format (weekday, day, month, year, hour, minutes, seconds); I did not make that clear. How do I transform each date in the column?
The anytime package can parse this without a format:
R> anytime("Thu, 21 Jan 2012 04:59:00 -0000")
[1] "2012-01-21 04:59:00 CST"
R>
It returns a POSIXct you can then operate on, or just format(), at will. It also has a simpler variant anydate() which returns a Date object instead.
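As a small follow-up sketch (my addition, not part of the answer), the separate weekday/day/month/year columns the question asked for can be derived from that POSIXct with base R's weekdays() and format(); these calls are vectorised, so they work on an entire column at once:
library(anytime)
x <- anytime("Thu, 21 Jan 2012 04:59:00 -0000")
data.frame(weekday = weekdays(x),
           day     = as.integer(format(x, "%d")),
           month   = as.integer(format(x, "%m")),
           year    = as.integer(format(x, "%Y")))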
library(lubridate)
my_date <- "Thu, 21 Jan 2012 04:59:00 -0000"
# Get it into date format
my_date <- dmy_hms(my_date)
# Use convenience functions to set up the columns you wanted
data.frame(day = day(my_date), month = month(my_date), year = year(my_date),
           timestamp = my_date)
day month year timestamp
1 21 1 2012 2012-01-21 04:59:00
We can use
as.Date(str1, "%a, %d %b %Y")
#[1] "2012-01-21"
If we need DateTime format
v1 <- strptime(str1, '%a, %d %b %Y %H:%M:%S %z', tz = "UTC")
v1
#[1] "2012-01-21 04:59:00 UTC"
Or using lubridate
library(lubridate)
dmy_hms(str1)
#[1] "2012-01-21 04:59:00 UTC"
data
str1 <- "Thu, 21 Jan 2012 04:59:00 -0000"
If you really want the separation into components, then start with Dirk's powerful suggestion and then transpose the output of as.POSIXlt:
library(anytime)
times <- c("2004-03-21 12:45:33.123456", # example from ?anytime
"2004/03/21 12:45:33.123456",
"20040321 124533.123456",
"03/21/2004 12:45:33.123456",
"03-21-2004 12:45:33.123456",
"2004-03-21",
"20040321",
"03/21/2004",
"03-21-2004",
"20010101")
t(sapply(anytime::anytime(times),
         function(x) unlist(as.POSIXlt(x))))
sec min hour mday mon year wday yday isdst
[1,] "33.1234560012817" "45" "12" "21" "2" "104" "0" "80" "0"
[2,] "33.1234560012817" "45" "12" "21" "2" "104" "0" "80" "0"
[3,] "33.1234560012817" "45" "12" "21" "2" "104" "0" "80" "0"
[4,] "33.1234560012817" "45" "12" "21" "2" "104" "0" "80" "0"
[5,] "33.1234560012817" "45" "12" "21" "2" "104" "0" "80" "0"
[6,] "0" "0" "0" "21" "2" "104" "0" "80" "0"
[7,] "0" "0" "0" "21" "2" "104" "0" "80" "0"
[8,] "0" "0" "0" "21" "2" "104" "0" "80" "0"
[9,] "0" "0" "0" "21" "2" "104" "0" "80" "0"
[10,] "0" "0" "0" "1" "9" "101" "1" "273" "1"
zone gmtoff
[1,] "PST" "-28800"
[2,] "PST" "-28800"
[3,] "PST" "-28800"
[4,] "PST" "-28800"
[5,] "PST" "-28800"
[6,] "PST" "-28800"
[7,] "PST" "-28800"
[8,] "PST" "-28800"
[9,] "PST" "-28800"
[10,] "PDT" "-25200"
I have a panel dataset with many variables. The three most relevant variables are: "cid" (country code), "time" (0-65), and "event" (0, 1, 2, 3, 4, 5, 6).
I am trying to run a Cox regression (using coxph); however, since the time variable has different starting and ending points for each country, I first need to create start-time and end-time variables. Here is where I run into my problem.
Here is what a sample of the three main variables may look like:
> data
cid time event
[1,] "AFG" "20" "0"
[2,] "AFG" "21" "0"
[3,] "AFG" "22" "0"
[4,] "AFG" "23" "0"
[5,] "AFG" "24" "0"
[6,] "AFG" "25" "0"
[7,] "AFG" "26" "1"
[8,] "AFG" "27" "1"
[9,] "AFG" "28" "1"
[10,] "AFG" "29" "1"
The idea is to convert this data into the following:
> data
cid time1 time2 event
[1,] "AFG" "20" "25" "0"
[2,] "AFG" "26" "29" "1"
How exactly does one go about doing this (keeping in mind that there are quite a few other explanatory variables in my dataset)?
You could use dplyr and the pipe. This solution will work if your data is always ordered sequentially, as in your example.
data <- data.frame(cid = rep("AFG", 10), time = seq(20, 29, 1),
                   event = c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1))
library(dplyr)
data %>%
  group_by(cid, event) %>%
  summarise(time1 = min(time), time2 = max(time))
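If a country can switch back and forth between event values (so the same cid/event pair could span more than one spell), a hedged extension of the same idea is to group on runs of event instead, e.g. with a cumulative change counter (continuing from the data above; assumes dplyr >= 1.0 for the .groups argument):
data %>%
  group_by(cid) %>%
  mutate(spell = cumsum(event != lag(event, default = first(event)))) %>%  # new spell at each change
  group_by(cid, spell, event) %>%
  summarise(time1 = min(time), time2 = max(time), .groups = "drop")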
Alternatively, in base R, you can subset by event and bind the per-group summaries:
subset1 <- data[data$event == 0, ]
subset1
subset2 <- data[data$event == 1, ]
subset2
s1 <- cbind(cid = "AFG", time1 = min(subset1$time), time2 = max(subset1$time), event = 0)
s1
s2 <- cbind(cid = "AFG", time1 = min(subset2$time), time2 = max(subset2$time), event = 1)
s2
data1 <- rbind(s1, s2)
data1
# cid time1 time2 event
# [1,] "AFG" "20" "25" "0"
# [2,] "AFG" "26" "29" "1"
Hope this helps a little.