This is my minimal dataset:
df=structure(list(ID = c(3942504L, 3199413L, 1864266L, 4037617L,
2030477L, 1342330L, 5434070L, 3200378L, 4810153L, 4886225L),
MI_TIME = c(1101L, 396L, 1140L, 417L, 642L, 1226L, 1189L,
484L, 766L, 527L), MI_Status = c(0L, 0L, 1L, 0L, 0L, 0L,
0L, 0L, 1L, 0L), Stroke_status = c(1L, 0L, 1L, 0L, 0L, 0L,
0L, 1L, 1L, 0L), Stroke_time = c(1101L, 396L, 1140L, 417L,
642L, 1226L, 1189L, 484L, 766L, 527L), Arrhythmia_status = c(NA,
NA, TRUE, NA, NA, TRUE, NA, NA, TRUE, NA), Arrythmia_time = c(1101L,
356L, 1122L, 7L, 644L, 126L, 118L, 84L, 76L, 5237L)), row.names = c(NA,
10L), class = "data.frame")
As you can see, I mainly have two types of variables: "_status" and "_time".
I am preparing my dataset for a survival analysis, and "time" is time to event in days.
The problem arises when I try to create a variable called "any cardiovascular outcome" (df$CV), which I have defined as follows:
df$CV = NA
df$CV <- with(df, ifelse(MI_Status=='1' | Stroke_status=='1' | Arrhythmia_status== 'TRUE' ,'1', '0'))
df$CV = as.factor(df$CV)
The problem I have is selecting the right time to event: I now have the new variable df$CV, but three different "_time" variables.
So I would like to create a new column, df$CV_time, containing the time of the event that happened first.
There is a slight difficulty in this problem though, so here is an example:
If we have a subject with MI_status==1, Arrythmia_status==NA, stroke_status==1 and MI_time==200, Arrythmia_time==100, stroke_time==220, then the correct time for df$CV would be 200, as that is the time of the earliest event.
However, in a case where MI_status==0, Arrythmia_status==NA, stroke_status==0 and MI_time==200, Arrythmia_time==100, stroke_time==220, the correct time for df$CV would be 220, as the latest follow-up is 220 days.
How could I select the correct time for df$CV based on these conditions?
This might be one approach using the tidyverse.
First, you may want to make sure your column names have consistent spelling and case (here using rename).
Then, you can explicitly define your "Arrhythmia" outcome as TRUE or FALSE (instead of leaving it as NA).
You can put your data into long form with pivot_longer and then group_by your ID. You can include the specific columns related to MI, stroke, and arrhythmia here (where "time" and "status" columns are available). Note that in your actual dataset (the one you show with glimpse) it is unclear what you want for arrhythmia - there is a pif column name, but nothing specific for time or status.
Your cardiovascular outcome will be TRUE if the status for MI or Stroke is 1, or if Arrhythmia is TRUE.
The time to event would then be the minimum time if there was a cardiovascular outcome; otherwise, use the censored time of the latest follow-up, i.e. the maximum time.
Let me know if this gives you the desired output.
library(tidyverse)

df %>%
  rename(MI_time = MI_TIME, MI_status = MI_Status, Arrhythmia_time = Arrythmia_time) %>%
  replace_na(list(Arrhythmia_status = FALSE)) %>%
  pivot_longer(cols = c(starts_with("MI_"), starts_with("Stroke_"), starts_with("Arrhythmia_")),
               names_to = c("event", ".value"),
               names_sep = "_") %>%
  group_by(ID) %>%
  summarise(
    any_cv_outcome = any(status[event %in% c("MI", "Stroke")] == 1 | status[event == "Arrhythmia"]),
    cv_time_to_event = ifelse(any_cv_outcome, min(time), max(time))
  )
Output
        ID any_cv_outcome cv_time_to_event
     <int> <lgl>                     <int>
 1 1342330 TRUE                         126
 2 1864266 TRUE                        1122
 3 2030477 FALSE                        644
 4 3199413 FALSE                        396
 5 3200378 TRUE                          84
 6 3942504 TRUE                        1101
 7 4037617 FALSE                        417
 8 4810153 TRUE                          76
 9 4886225 FALSE                       5237
10 5434070 FALSE                       1189
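For comparison, here is a rough base R sketch of the same rule (earliest time if any cardiovascular event occurred, latest follow-up time otherwise), using the original column names. Treat it as an untested illustration rather than a drop-in replacement:

# Base R sketch: assumes the same rule as above (min time if any event, max time otherwise)
has_event <- with(df, MI_Status == 1 | Stroke_status == 1 | Arrhythmia_status %in% TRUE)
time_cols <- c("MI_TIME", "Stroke_time", "Arrythmia_time")
df$CV      <- as.integer(has_event)
df$CV_time <- ifelse(has_event,
                     do.call(pmin, df[time_cols]),
                     do.call(pmax, df[time_cols]))

The %in% TRUE comparison treats the NA arrhythmia statuses as FALSE, mirroring the replace_na() step above.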
I would appreciate any help to randomly select a subset of 5 out of the 10 var.w_X variables in my sample data sampleDT, while keeping all the other variables that do not start with var.w_.
Below is the sample data sampleDT, which contains, among other variables (those to be kept altogether), X variables whose names start with var.w_ (those from which to draw the random sample).
In the current example, X=10, so var.w_ runs from var.w_1 to var.w_10, and I want to draw a random sample of 5 out of these 10. However, in my actual data, X>1,000,000, and I might want to draw a sample of 7,500 var.w_ variables out of those X>1,000,000.
Therefore, efficiency is paramount in any solution, since I recently experienced some performance issues with mutate_at whose cause I still cannot explain.
Importantly, the variables to keep (those that do not start with var.w_) are not guaranteed to appear in any pre-specified order; they might be located before, between, and/or after the var.w_ variables. So solutions that rely on column order will not work.
#sample data
sampleDT<-structure(list(n = c(62L, 96L, 17L, 41L, 212L, 143L, 143L, 143L,
73L, 73L), r = c(3L, 1L, 0L, 2L, 170L, 21L, 0L, 33L, 62L, 17L
), p = c(0.0483870967741935, 0.0104166666666667, 0, 0.0487804878048781,
0.80188679245283, 0.146853146853147, 0, 0.230769230769231, 0.849315068493151,
0.232876712328767), var.w_8 = c(1.94254385942857, 1.18801169942857,
3.16131123942857, 3.16131123942857, 1.13482609242857, 1.13042157942857,
2.13042157942857, 1.13042157942857, 1.12335579942857, 1.12335579942857
), var.w_9 = c(1.942365288, 1.187833128, 3.161132668, 3.161132668,
1.134647521, 1.130243008, 2.130243008, 1.130243008, 1.123177228,
1.123177228), var.w_10 = c(1.94222639911111, 1.18769423911111,
3.16099377911111, 3.16099377911111, 1.13450863211111, 1.13010411911111,
2.13010411911111, 1.13010411911111, 1.12303833911111, 1.12303833911111
), group = c(1L, 1L, 0L, 1L, 0L, 1L, 1L, 0L,
0L, 0L), treat = c(0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 1L), c1 = c(1.941115288,
1.186583128, 1.159882668, 1.159882668, 1.133397521, 1.128993008,
1.128993008, 1.128993008, 1.121927228, 1.121927228), var.w_6 = c(1.939115288, 1.184583128,
3.157882668, 3.157882668, 1.131397521, 1.126993008, 2.126993008,
1.126993008, 1.119927228, 1.119927228), var.w_7 = c(1.94278195466667,
1.18824979466667, 3.16154933466667, 3.16154933466667, 1.13506418766667,
1.13065967466667, 2.13065967466667, 1.13065967466667, 1.12359389466667,
1.12359389466667), c2 = c(0.1438,
0.237, 0.2774, 0.2774, 0.2093, 0.1206, 0.1707, 0.0699, 0.1351,
0.1206), var.w_1 = c(1.941115288, 1.186583128, 3.159882668, 3.159882668,
1.133397521, 1.128993008, 2.128993008, 1.128993008, 1.121927228,
1.121927228), var.w_2 = c(1.931115288, 1.176583128, 3.149882668,
3.149882668, 1.123397521, 1.118993008, 2.118993008, 1.118993008,
1.111927228, 1.111927228), var.w_3 = c(1.946115288, 1.191583128,
3.164882668, 3.164882668, 1.138397521, 1.133993008, 2.133993008,
1.133993008, 1.126927228, 1.126927228), var.w_4 = c(1.93778195466667,
1.18324979466667, 3.15654933466667, 3.15654933466667, 1.13006418766667,
1.12565967466667, 2.12565967466667, 1.12565967466667, 1.11859389466667,
1.11859389466667), var.w_5 = c(1.943615288, 1.189083128, 3.162382668,
3.162382668, 1.135897521, 1.131493008, 2.131493008, 1.131493008,
1.124427228, 1.124427228)), class = "data.frame", row.names = c(NA, -10L))
# my attempt
# based on the comment by @akrun - this does not keep the other variables as specified above
myvars <- sample(grep("var\\.w_", names(sampleDT), value = TRUE), 5)
sampleDT_test <- sampleDT[myvars]
Thanks in advance for any help
Apologies, I had to step into a meeting for a little bit. I think you could adapt akrun's solution by also keeping the columns that do not match the var.w_ pattern. Let me know how this scales on the full data frame. Also, thanks for clarifying further.
> # Subsetting the variable names not matching your pattern using grepl
> names(sampleDT)[!grepl("var\\.w_", names(sampleDT))]
[1] "n" "r" "p" "group" "treat" "c1" "c2"
>
> # Combine that with akrun's solution
> myvars <- c(names(sampleDT)[!grepl("var\\.w_", names(sampleDT))],
+ sample(grep("var\\.w_", names(sampleDT), value = TRUE), 5))
> head(sampleDT[myvars])
n r p group treat c1 c2 var.w_6 var.w_1 var.w_4 var.w_3 var.w_8
1 62 3 0.04838710 1 0 1.941115 0.1438 1.939115 1.941115 1.937782 1.946115 1.942544
2 96 1 0.01041667 1 0 1.186583 0.2370 1.184583 1.186583 1.183250 1.191583 1.188012
3 17 0 0.00000000 0 0 1.159883 0.2774 3.157883 3.159883 3.156549 3.164883 3.161311
4 41 2 0.04878049 1 0 1.159883 0.2774 3.157883 3.159883 3.156549 3.164883 3.161311
5 212 170 0.80188679 0 0 1.133398 0.2093 1.131398 1.133398 1.130064 1.138398 1.134826
6 143 21 0.14685315 1 1 1.128993 0.1206 1.126993 1.128993 1.125660 1.133993 1.130422
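If you prefer to stay in the tidyverse, here is a minimal dplyr sketch of the same idea (sampling the names in base R and selecting with all_of). This is an untested illustration, and I have not benchmarked it against the base-R subsetting above:

library(dplyr)

# columns to keep (everything not matching the pattern) plus a random sample of 5 var.w_ columns
keep_vars    <- grep("var\\.w_", names(sampleDT), value = TRUE, invert = TRUE)
sampled_vars <- sample(grep("var\\.w_", names(sampleDT), value = TRUE), 5)

sampleDT_sub <- sampleDT %>%
  select(all_of(c(keep_vars, sampled_vars)))

For very wide data, subsetting by a character vector of names (as in the base-R approach above) should already be reasonably fast, since the unselected columns are never touched.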
I have a data frame with answers from a survey that looks like this:
df = structure(list(Part.2..Question.1..Response = c("You did not know about the existence of the course",
"The email you received was confusing and you did not know what to do",
"Other:", "You did not know about the existence of the course",
"The email you received was confusing and you did not know what to do",
"The email you received was confusing and you did not know what to do|Other:",
"You think is not worth your time", "No Answer", "You think is not worth your time",
"You think is not worth your time", "You did not know about the existence of the course",
"You did not know about the existence of the course", "You think is not worth your time|The email you received was confusing and you did not know what to do|You did not know about the existence of the course",
"You think is not worth your time", "You did not know about the existence of the course",
"You did not know about the existence of the course", "You think is not worth your time|Other:",
"You think is not worth your time", "No Answer", "You did not know about the existence of the course",
"You think is not worth your time", "You think is not worth your time",
"You did not know about the existence of the course", "You did not know about the existence of the course",
"You think is not worth your time"), group = structure(c(1L,
2L, 1L, 3L, 2L, 3L, 2L, 2L, 2L, 3L, 1L, 1L, 1L, 2L, 3L, 1L, 3L,
3L, 3L, 3L, 2L, 2L, 2L, 1L, 3L), .Label = c("control", "treatment1",
"treatment2"), class = "factor")), .Names = c("Part.2..Question.1..Response",
"group"), row.names = c(151L, 163L, 109L, 188L, 141L, 158L, 131L,
32L, 86L, 53L, 148L, 64L, 89L, 30L, 159L, 23L, 40L, 101L, 173L,
165L, 15L, 156L, 2L, 174L, 41L), class = "data.frame")
Some people select multiple answers, for example:
df$Part.2..Question.1..Response[13]
I want to create a table with the number of people that selected a given answer, for each "group":
                                                                      control treatment1 treatment2
You think is not worth your time                                            0          0          0
The email you received was confusing and you did not know what to do      10          1          4
You did not know about the existence of the course                          4          4          1
What is the best way of doing this?
I would first split the responses on "|" and turn multiple responses into multiple rows. After doing that, a simple table() gives the counts.
dd <- do.call(rbind, Map(data.frame,
                         group = df$group,
                         resp = strsplit(df$Part.2..Question.1..Response, "|", fixed = TRUE)))
with(dd, table(resp, group))
You will get results like:
                                            group
resp                                         control treatment1 treatment2
  You did not know about the existence ...        6          1          3
  The email you received was confusing ...        1          2          1
  Other:                                          1          0          2
  You think is not worth your time                1          5          4
  No Answer                                       0          1          1
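If you prefer a tidyverse version, here is a rough sketch of the same split-then-count idea (it assumes tidyr >= 1.0 for pivot_wider and does not reproduce the exact table() layout above):

library(dplyr)
library(tidyr)

df %>%
  separate_rows(Part.2..Question.1..Response, sep = "\\|") %>%   # one row per selected answer
  count(Part.2..Question.1..Response, group) %>%                 # count answer/group combinations
  pivot_wider(names_from = group, values_from = n, values_fill = 0)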
I have a CSV file, and when I use this command
SOLK<-read.table('Book1.csv',header=TRUE,sep=';')
I get this output
> SOLK
Time Close Volume
1 10:27:03,6 0,99 1000
2 10:32:58,4 0,98 100
3 10:34:16,9 0,98 600
4 10:35:46,0 0,97 500
5 10:35:50,6 0,96 50
6 10:35:50,6 0,96 1000
7 10:36:10,3 0,95 40
8 10:36:10,3 0,95 100
9 10:36:10,4 0,95 500
10 10:36:10,4 0,95 100
. . . .
. . . .
. . . .
285 17:09:44,0 0,96 404
Here is the result of dput(SOLK[1:10,]):
> dput(SOLK[1:10,])
structure(list(Time = structure(c(1L, 2L, 3L, 4L, 5L, 5L, 6L,
6L, 7L, 7L), .Label = c("10:27:03,6", "10:32:58,4", "10:34:16,9",
"10:35:46,0", "10:35:50,6", "10:36:10,3", "10:36:10,4", "10:36:30,8",
"10:37:23,3", "10:37:38,2", "10:37:39,3", "10:37:45,9", "10:39:07,5",
"10:39:07,6", "10:39:46,6", "10:41:21,8", "10:43:20,6", "10:43:36,4",
"10:43:48,8", "10:43:48,9", "10:43:54,6", "10:44:01,5", "10:44:08,4",
"10:45:47,2", "10:46:16,7", "10:47:03,6", "10:47:48,6", "10:47:55,0",
"10:48:09,9", "10:48:30,6", "10:49:20,6", "10:50:31,9", "10:50:34,6",
"10:50:38,1", "10:51:02,8", "10:51:11,5", "10:55:57,7", "10:57:57,2",
"10:59:06,9", "10:59:33,5", "11:00:31,0", "11:00:31,1", "11:04:46,4",
"11:04:53,4", "11:04:54,6", "11:04:56,1", "11:04:58,9", "11:05:02,0",
"11:05:02,6", "11:05:24,7", "11:05:56,7", "11:06:15,8", "11:13:24,1",
"11:13:24,2", "11:13:32,1", "11:13:36,2", "11:13:37,2", "11:13:44,5",
"11:13:46,8", "11:14:12,7", "11:14:19,4", "11:14:19,8", "11:14:21,2",
"11:14:38,7", "11:14:44,0", "11:14:44,5", "11:15:10,5", "11:15:10,6",
"11:15:12,9", "11:15:16,6", "11:15:23,3", "11:15:31,4", "11:15:36,4",
"11:15:37,4", "11:15:49,5", "11:16:01,4", "11:16:06,0", "11:17:56,2",
"11:19:08,1", "11:20:17,2", "11:26:39,4", "11:26:53,2", "11:27:39,5",
"11:28:33,0", "11:30:42,3", "11:31:00,7", "11:33:44,2", "11:39:56,1",
"11:40:07,3", "11:41:02,1", "11:41:30,1", "11:45:07,0", "11:45:26,6",
"11:49:50,8", "11:59:58,1", "12:03:49,9", "12:04:12,6", "12:06:05,8",
"12:06:49,2", "12:07:56,0", "12:09:37,7", "12:14:25,5", "12:14:32,1",
"12:15:42,1", "12:15:55,2", "12:16:36,9", "12:16:44,2", "12:18:00,3",
"12:18:12,8", "12:28:17,8", "12:28:17,9", "12:28:23,7", "12:28:51,1",
"12:36:33,2", "12:37:45,0", "12:39:22,2", "12:40:19,5", "12:42:22,1",
"12:58:46,3", "13:06:05,8", "13:06:05,9", "13:07:17,6", "13:07:17,7",
"13:09:01,3", "13:09:01,4", "13:09:11,3", "13:09:31,0", "13:10:07,8",
"13:35:43,8", "13:38:27,7", "14:11:16,0", "14:17:31,5", "14:26:13,9",
"14:36:11,8", "14:38:43,7", "14:38:47,8", "14:38:51,8", "14:48:26,7",
"14:52:07,4", "14:52:13,8", "15:09:24,7", "15:10:25,8", "15:29:12,1",
"15:31:55,9", "15:34:04,1", "15:44:10,8", "15:45:07,1", "15:57:04,9",
"15:57:13,9", "16:16:27,9", "16:21:41,7", "16:36:01,5", "16:36:13,2",
"16:46:10,5", "16:46:10,6", "16:47:37,3", "16:50:52,4", "16:50:52,5",
"16:51:44,5", "16:55:11,5", "16:56:21,8", "16:56:37,5", "16:57:37,9",
"16:58:18,6", "16:58:44,5", "17:00:39,1", "17:01:50,7", "17:03:13,2",
"17:03:28,3", "17:03:46,7", "17:03:47,0", "17:04:30,4", "17:08:41,8",
"17:09:44,0"), class = "factor"), Close = structure(c(8L, 7L,
7L, 6L, 5L, 5L, 4L, 4L, 4L, 4L), .Label = c("0,92", "0,93", "0,94",
"0,95", "0,96", "0,97", "0,98", "0,99"), class = "factor"), Volume = c(1000L,
100L, 600L, 500L, 50L, 1000L, 40L, 100L, 500L, 100L)), .Names = c("Time",
"Close", "Volume"), row.names = c(NA, 10L), class = "data.frame")
The first column includes the time stamp of every transaction during a stock's exchange daily session. I would like to convert the Close and Volume columns to an xts object ordered by the Time column.
UPDATE: From your edits, it appears you imported your data using two different commands. It also appears you should be using read.csv2. I've updated my answer with Lines that (I assume) look more like your original CSV (I have to guess because you don't say what the file looks like). The rest of the answer doesn't change.
You have to add a date to your times because xts stores all index values internally as POSIXct (I just used today's date).
I had to convert the "," decimal notation to the "." convention (using gsub), but that may be locale-dependent and you may not need to. Then paste today's date with the (possibly converted) time and convert it to POSIXct to create an index suitable for xts.
I've also formatted the index so you can see the fractional seconds.
Lines <- "Time;Close;Volume
10:27:03,6;0,99;1000
10:32:58,4;0,98;100
10:34:16,9;0,98;600
10:35:46,0;0,97;500
10:35:50,6;0,96;50
10:35:50,6;0,96;1000
10:36:10,3;0,95;40
10:36:10,3;0,95;100
10:36:10,4;0,95;500
10:36:10,4;0,95;100"
SOLK <- read.csv2(con <- textConnection(Lines))
close(con)
solk <- xts(SOLK[, c("Close", "Volume")],
            as.POSIXct(paste("2011-09-02", gsub(",", ".", SOLK[, 1]))))
indexFormat(solk) <- "%Y-%m-%d %H:%M:%OS6"
solk
# Close Volume
# 2011-09-02 10:27:03.599999 0.99 1000
# 2011-09-02 10:32:58.400000 0.98 100
# 2011-09-02 10:34:16.900000 0.98 600
# 2011-09-02 10:35:46.000000 0.97 500
# 2011-09-02 10:35:50.599999 0.96 50
# 2011-09-02 10:35:50.599999 0.96 1000
# 2011-09-02 10:36:10.299999 0.95 40
# 2011-09-02 10:36:10.299999 0.95 100
# 2011-09-02 10:36:10.400000 0.95 500
# 2011-09-02 10:36:10.400000 0.95 100
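For reference, if your actual Book1.csv looks like the Lines above (semicolon-separated with a comma as the decimal mark, which is an assumption on my part), reading it directly should just be:

# read.csv2 defaults to sep = ";" and dec = ",", so no extra arguments are needed
# (assuming Book1.csv matches the layout guessed above)
SOLK <- read.csv2("Book1.csv", header = TRUE)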
That's an odd structure. Translating it to dput syntax:
SOLK <- structure(list(structure(c(1L, 2L, 3L, 4L, 5L, 5L, 6L, 6L, 7L,
7L), .Label = c("10:27:03,6", "10:32:58,4", "10:34:16,9", "10:35:46,0",
"10:35:50,6", "10:36:10,3", "10:36:10,4"), class = "factor"),
Close = c(0.99, 0.98, 0.98, 0.97, 0.96, 0.96, 0.95, 0.95,
0.95, 0.95), Volume = c(1000L, 100L, 600L, 500L, 50L, 1000L,
40L, 100L, 500L, 100L)), .Names = c("", "Close", "Volume"
), class = "data.frame", row.names = c("1", "2", "3", "4", "5",
"6", "7", "8", "9", "10"))
I'm assuming the comma in the timestamp is a decimal separator.
library("chron")
time.idx <- times(gsub(",",".",as.character(SOLK[[1]])))
Unfortunately, it seems xts won't take this as a valid order.by, so a date (today, for lack of a better choice) must be included to make xts happy.
xts(SOLK[[2]], order.by=chron(Sys.Date(), time.idx))