Split text to create a table with answers from a survey? - R

I have a data frame with answers from a survey that looks like this:
df = structure(list(Part.2..Question.1..Response = c("You did not know about the existence of the course",
"The email you received was confusing and you did not know what to do",
"Other:", "You did not know about the existence of the course",
"The email you received was confusing and you did not know what to do",
"The email you received was confusing and you did not know what to do|Other:",
"You think is not worth your time", "No Answer", "You think is not worth your time",
"You think is not worth your time", "You did not know about the existence of the course",
"You did not know about the existence of the course", "You think is not worth your time|The email you received was confusing and you did not know what to do|You did not know about the existence of the course",
"You think is not worth your time", "You did not know about the existence of the course",
"You did not know about the existence of the course", "You think is not worth your time|Other:",
"You think is not worth your time", "No Answer", "You did not know about the existence of the course",
"You think is not worth your time", "You think is not worth your time",
"You did not know about the existence of the course", "You did not know about the existence of the course",
"You think is not worth your time"), group = structure(c(1L,
2L, 1L, 3L, 2L, 3L, 2L, 2L, 2L, 3L, 1L, 1L, 1L, 2L, 3L, 1L, 3L,
3L, 3L, 3L, 2L, 2L, 2L, 1L, 3L), .Label = c("control", "treatment1",
"treatment2"), class = "factor")), .Names = c("Part.2..Question.1..Response",
"group"), row.names = c(151L, 163L, 109L, 188L, 141L, 158L, 131L,
32L, 86L, 53L, 148L, 64L, 89L, 30L, 159L, 23L, 40L, 101L, 173L,
165L, 15L, 156L, 2L, 174L, 41L), class = "data.frame")
Some people select multiple answers, for example:
df$Part.2..Question.1..Response[13]
I want to create a table that has the number of people that selected a given answer, for each "group":
control treatment1 treatment2
You think is not worth your time 0 0 0
The email you received was confusing and you did not know what to do 10 1 4
You did not know about the existence of the course 4 4 1
What is the best way of doing this?

I would first split the responses on "|" and turn multiple responses into multiple rows. After that, a simple table() gives the counts:
dd <- do.call(rbind, Map(data.frame,
  group = df$group,
  resp = strsplit(df$Part.2..Question.1..Response, "|", fixed = TRUE)
))
with(dd, table(resp, group))
You will get results like
group
resp control treatment1 treatment2
You did not know about the existence ... 6 1 3
The email you received was confusing ... 1 2 1
Other: 1 0 2
You think is not worth your time 1 5 4
No Answer 0 1 1
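The same expand-then-tabulate idea can also be written with base rep() and lengths(); a minimal sketch, using a small inline stand-in (resp_df, a name of my choosing) for the survey data frame above:

```r
# Small inline stand-in for the survey data: multi-select answers joined by "|"
resp_df <- data.frame(
  resp  = c("A|B", "A", "B|C"),
  group = c("control", "treatment1", "control")
)
parts <- strsplit(resp_df$resp, "|", fixed = TRUE)  # split on the literal "|"
dd <- data.frame(
  group = rep(resp_df$group, lengths(parts)),       # one row per selected answer
  resp  = unlist(parts)
)
table(dd$resp, dd$group)
```

rep(group, lengths(parts)) repeats each respondent's group once per answer they ticked, which is exactly what the Map(data.frame, ...) call does row by row.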


Create a column value based on a matching regular expression

I have the following character string in a column called "Sentences" of a data frame:
I like an apple
I would like to create a second column, called Type, whose values are determined by matching strings. I would like to take the regular expression \bapple\b, match it against the sentence and, if it matches, put the value Fruit_apple in the Type column.
In the long run I'd like to do this with several other strings and types.
Is there an easy way to do this using a function?
dataset (survey_1):
structure(list(slider_8.response = c(1L, 1L, 3L, 7L, 7L, 7L,
1L, 3L, 2L, 1L, 1L, 7L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 6L, 1L, 7L,
7L, 7L, 1L, 1L, 7L, 6L, 6L, 1L, 1L, 7L, 1L, 7L, 7L, 1L, 7L, 7L,
7L, 7L, 7L, 6L, 7L, 7L, 7L, 1L, 1L, 6L, 1L, 1L, 1L, 1L, 7L, 2L
), Sentences = c("He might could do it.", "I ever see the film.",
"I may manage to come visit soon.", "She’ll never be forgotten.",
"They might find something special.", "It might not be a good buy.",
"Maybe my pain will went away.", "Stephen maybe should fix your bicycle.",
"It used to didnʼt matter if you walked in late.", "He’d could climb the stairs.",
"Only Graeme would might notice that.", "I used to cycle a lot. ",
"Your dad belongs to disagree with this. ", "We can were pleased to see her.",
"He may should take us to the city.", "I could never forgot his deep voice.",
"I should can turn this thing over to Ann.", "They must knew who they really are.",
"We used to runs down three flights.", "I don’t care what he may be up to. ",
"That’s something I ain’t know about.", "That must be quite a skill.",
"We must be able to invite Jim.", "She used to play with a trolley.",
"He is done gone. ", "You might can check this before making a decision.",
"It would have a positive effect on the team. ", "Ruth can maybe look for it later.",
"You should tag along at the dance.", "They’re finna leave town.",
"A poem should looks like that.", "I can tell you didn’t do your homework. ",
"I can driving now.", "They should be able to put a blanket over it.",
"We could scarcely see each other.", "I might says I was never good at maths.",
"The next dance will be a quickstep. ", "I might be able to find myself a seat in this place.",
"Andrew thinks we shouldn’t do it.", "Jack could give a hand.",
"She’ll be able to come to the event.", "She’d maybe keep the car the way it is.",
"Sarah used to be able to agree with this proposal.", "I’d like to see your lights working. ",
"I’d be able to get a little bit more sleep.", "John may has a second name.",
"You must can apply for this job.", "I maybe could wait till the 8 o’clock train.",
"She used to could go if she finished early.", "That would meaned something else, eh?",
"You’ll can enjoy your holiday.", "We liketa drowned that day. ",
"I must say it’s a nice feeling.", "I eaten my lunch."), construct = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA)), row.names = c(NA, 54L), class = "data.frame")
type_list:
list("DM_will_can"=c("ll can","will can"), "DM_would_could"=c("d could","would could"),
"DM_might_can"="might can","DM_might_could"="might could","DM_used_to_could"="used to could",
"DM_should_can"="should can","DM_would_might"=c("d might", "would might"),"DM_may_should"="may should",
"DM_must_can"="must can", "SP_will_be_able"=c("ll be able","will be able"),
"SP_would_be_able"=c("d be able","would be able"),"SP_might_be_able"="might be able",
"SP_maybe_could"="maybe could","SP_used_to_be_able"="used to be able","SP_should_be_able"=
"should be able","SP_would_maybe"=c("d maybe", "would maybe"), "SP_maybe_should"="maybe should",
"SP_must_be_able"="must be able", "Filler_will_a"="quickstep","Filler_will_b"="forgotten",
"Filler_would_a"="lights working","Filler_would_b"="positive effect","Filler_can_a"="homework",
"Filler_can_b"="Ruth","Filler_could_a"="scarcely","Filler_could_b"="Jack", "Filler_may_a"="may be up to",
"Filler_may_b"="visit soon", "Filler_might_a"="good buy","Filler_might_be"="something special",
"Filler_should_a"="tag along","Filler_should_b"="Andrew","Filler_used_to_a"="trolley",
"Filler_used_to_b"="cycle a lot","Filler_must_a"="quite a skill","Filler_must_b"="nice feeling",
"Dist_gram_will_went"="will went","Dist_gram_meaned"="meaned","Dist_gram_can_were"="can were",
"Dist_gram_forgot"="never forgot", "Dist_gram_may_has"="may has",
"Dist_gram_might_says"="might says","Dist_gram_used_to_runs"="used to runs",
"Dist_gram_should_looks"="should looks","Dist_gram_must_knew"="must knew","Dist_dial_liketa"="liketa",
"Dist_dial_belongs"="belongs to disagree","Dist_dial_finna"="finna","Dist_dial_used_to_didnt"="used to didn't matter",
"Dist_dial_eaten"="I eaten", "Dist_dial_can_driving"="can driving","Dist_dial_aint_know"="That's something",
"Dist_dial_ever_see"="ever see the film","Dist_dial_done_gone"="done gone")
I would do this with a Python dictionary, but since we're talking about R, I've more or less translated that approach. There is probably a more idiomatic way to do this in R than two for loops, but this should work:
# Define data
df <- data.frame(
  id = 1:5,
  sentences = c("I like apples", "I like dogs", "I have cats", "Dogs are cute", "I like fish")
)
#   id     sentences
# 1  1 I like apples
# 2  2   I like dogs
# 3  3   I have cats
# 4  4 Dogs are cute
# 5  5   I like fish

type_list <- list(
  "fruit" = c("apples", "oranges"),
  "animals" = c("dogs", "cats")
)

types <- names(type_list)
df$type <- NA
df$item <- NA
for (type in types) {
  for (item in type_list[[type]]) {
    matches <- grep(item, df$sentences, ignore.case = TRUE)
    df[matches, "type"] <- type
    df[matches, "item"] <- item
  }
}
# Output:
# id sentences type item
# 1 1 I like apples fruit apples
# 2 2 I like dogs animals dogs
# 3 3 I have cats animals cats
# 4 4 Dogs are cute animals dogs
# 5 5 I like fish <NA> <NA>
EDIT
Added after the data was posted. If I read in your data and call it df, and read in your type list and call it type_list, the following works:
types <- names(type_list)
df$type <- NA
df$item <- NA
for (type in types) {
  for (item in type_list[[type]]) {
    matches <- grep(item, df$Sentences, ignore.case = TRUE)
    df[matches, "type"] <- type
    df[matches, "item"] <- item
  }
}
This is exactly the same as my previous code, except that Sentences has an upper-case S in your data frame.
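The double loop can also be collapsed into a single (type, item) lookup table plus one pass over the sentences; a sketch, assuming the same toy df/type_list shapes as above:

```r
# Flatten type_list into a lookup of (type, item), then find each sentence's
# first matching item in one pass.
type_list <- list(fruit = c("apples", "oranges"), animals = c("dogs", "cats"))
lookup <- data.frame(
  type = rep(names(type_list), lengths(type_list)),
  item = unlist(type_list, use.names = FALSE)
)
sentences <- c("I like apples", "Dogs are cute", "I like fish")
# For each sentence, the index of the first matching item (NA if none matches)
hit <- vapply(sentences, function(s) {
  m <- which(vapply(lookup$item, grepl, logical(1), x = s, ignore.case = TRUE))
  if (length(m)) m[1] else NA_integer_
}, integer(1))
data.frame(sentences, type = lookup$type[hit], item = lookup$item[hit])
```

Indexing lookup$type with an NA position naturally yields NA for unmatched sentences, mirroring the loop version's behavior.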

Select the optimal number based on conditions

This is my minimal dataset:
df=structure(list(ID = c(3942504L, 3199413L, 1864266L, 4037617L,
2030477L, 1342330L, 5434070L, 3200378L, 4810153L, 4886225L),
MI_TIME = c(1101L, 396L, 1140L, 417L, 642L, 1226L, 1189L,
484L, 766L, 527L), MI_Status = c(0L, 0L, 1L, 0L, 0L, 0L,
0L, 0L, 1L, 0L), Stroke_status = c(1L, 0L, 1L, 0L, 0L, 0L,
0L, 1L, 1L, 0L), Stroke_time = c(1101L, 396L, 1140L, 417L,
642L, 1226L, 1189L, 484L, 766L, 527L), Arrhythmia_status = c(NA,
NA, TRUE, NA, NA, TRUE, NA, NA, TRUE, NA), Arrythmia_time = c(1101L,
356L, 1122L, 7L, 644L, 126L, 118L, 84L, 76L, 5237L)), row.names = c(NA,
10L), class = "data.frame")
As you can see, I mainly have two types of variables, "_status" and "_time".
I am preparing my dataset for a survival analysis, and "time" is the time to event in days.
The problem arises when I try to create a variable called "any cardiovascular outcome" (df$CV), which I have defined as follows:
df$CV = NA
df$CV <- with(df, ifelse(MI_Status=='1' | Stroke_status=='1' | Arrhythmia_status== 'TRUE' ,'1', '0'))
df$CV = as.factor(df$CV)
The problem I have is with selecting the optimal time to event: I now have a new variable df$CV, but 3 different "_time" variables.
So I would like to create a new column, df$CV_time, where the time is the time of the event that happened first.
There is a slight difficulty in this problem, though, and I give an example:
If we have a subject with MI_status==1, Arrythmia_status==NA, Stroke_status==1 and MI_time==200, Arrythmia_time==100, Stroke_time==220 --> the correct time for df$CV would be 200, as it is the time of the earliest event that actually occurred.
However, in a case where MI_status==0, Arrythmia_status==NA, Stroke_status==0 and MI_time==200, Arrythmia_time==100, Stroke_time==220 --> the correct time for df$CV would be 220, as the latest follow-up is at 220 days.
How could I select the optimal number for df$CV based on these conditions?
This might be one approach using the tidyverse.
First, you may want to make sure your column names are consistent in spelling and case (here using rename).
Then, you can explicitly define your "Arrhythmia" outcome as TRUE or FALSE (instead of using NA).
You can put your data into long form with pivot_longer and then group_by your ID. You can include the specific columns related to MI, stroke, and arrhythmia here (where "time" and "status" columns are available). Note that in your actual dataset (where you use glimpse) it is unclear what you want for arrhythmia - there's a pif column name, but nothing specific for time or status.
Your cardiovascular outcome will include a status for MI or Stroke that is 1, or an Arrhythmia that is TRUE.
The time to event would be the min time if there was a cardiovascular outcome; otherwise use the censored time of latest follow-up, or max time.
Let me know if this gives you the desired output.
library(tidyverse)

df %>%
  rename(MI_time = MI_TIME, MI_status = MI_Status, Arrhythmia_time = Arrythmia_time) %>%
  replace_na(list(Arrhythmia_status = FALSE)) %>%
  pivot_longer(cols = c(starts_with("MI_"), starts_with("Stroke_"), starts_with("Arrhythmia_")),
               names_to = c("event", ".value"),
               names_sep = "_") %>%
  group_by(ID) %>%
  summarise(
    any_cv_outcome = any(status[event %in% c("MI", "Stroke")] == 1 | status[event == "Arrhythmia"]),
    cv_time_to_event = ifelse(any_cv_outcome, min(time), max(time))
  )
Output
ID any_cv_outcome cv_time_to_event
<int> <lgl> <int>
1 1342330 TRUE 126
2 1864266 TRUE 1122
3 2030477 FALSE 644
4 3199413 FALSE 396
5 3200378 TRUE 84
6 3942504 TRUE 1101
7 4037617 FALSE 417
8 4810153 TRUE 76
9 4886225 FALSE 5237
10 5434070 FALSE 1189
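The same logic can be sketched in base R. Note that the question's first example implies the event time should be the minimum over the times of events that actually occurred (200, not the 100 of the arrhythmia that never happened); the sketch below follows that rule, using a tiny stand-in data frame with the question's column names (including the Arrythmia_time spelling):

```r
# Minimal stand-in with the question's column names
df <- data.frame(
  ID = 1:3,
  MI_Status = c(1L, 0L, 0L), MI_TIME = c(200L, 200L, 300L),
  Stroke_status = c(1L, 0L, 1L), Stroke_time = c(220L, 220L, 150L),
  Arrhythmia_status = c(NA, NA, TRUE), Arrythmia_time = c(100L, 100L, 400L)
)
# Which events occurred (NA arrhythmia counts as no event via %in%)
event <- with(df, cbind(MI_Status == 1, Stroke_status == 1,
                        Arrhythmia_status %in% TRUE))
times <- as.matrix(df[, c("MI_TIME", "Stroke_time", "Arrythmia_time")])
evt_times <- times; evt_times[!event] <- NA   # keep times only where an event occurred
df$CV <- as.integer(rowSums(event) > 0)
df$CV_time <- ifelse(df$CV == 1,
                     suppressWarnings(apply(evt_times, 1, min, na.rm = TRUE)),  # earliest actual event
                     apply(times, 1, max))                                      # latest follow-up
df[, c("ID", "CV", "CV_time")]
# Row 1 reproduces the question's example: CV_time is 200 (the MI),
# not the arrhythmia time of 100, since no arrhythmia occurred.
```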

Comparing two apparently identical levels from factors

I have two data frame columns whose factor levels are apparently identical, but they don't match:
levels(train$colA)
## [1] "I am currently using (least once over the last 2 weeks)"
## [2] "I have never tried nor used"
## [3] "I have tried or used at some point in time"
levels(test$colA)
## [1] "I am currently using (least once over the last 2 weeks)"
## [2] "I have never tried nor used"
## [3] "I have tried or used at some point in time"
levels(train$colA) == levels(test$colA)
## [1] FALSE TRUE TRUE
I have tried comparing both sentences and actually they are equal:
"I am currently using (least once over the last 2 weeks)" == "I am currently using (least once over the last 2 weeks)"
## [1] TRUE
I am trying to apply a trained xgboost model to test data. The trained model comes from the train data frame. Now I am trying to apply it to test, but with no success, as I get an error that test has a new factor.
Edited:
Here is the output of dput():
dput(head(train$colA))
structure(c(1L, 1L, 1L, 2L, 1L, 1L), .Label = c("I am currently using (least once over the last 2 weeks)",
"I have never tried nor used", "I have tried or used at some point in time"
), class = "factor")
dput(head(test$colA))
structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("I am currently using (least once over the last 2 weeks)",
"I have never tried nor used", "I have tried or used at some point in time"
), class = "factor")
I can see there is a difference between c(1L, 1L, 1L, 2L, 1L, 1L) and c(1L, 1L, 1L, 1L, 1L, 1L), so I guess the key is here, although I don't know what it exactly means.
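When two levels print identically yet compare unequal, an invisible character (a non-breaking space, a trailing blank, a curly quote) is a common culprit. A diagnostic sketch, with a non-breaking space deliberately injected to stand in for whatever the hidden difference is:

```r
a <- "I am currently using (least once over the last 2 weeks)"
b <- sub(" \\(", "\u00a0(", a)       # same text, but with a non-breaking space
a == b                               # FALSE, though both print alike
setdiff(utf8ToInt(b), utf8ToInt(a))  # 160: the non-breaking space code point
```

If such a character turns up, normalizing both columns (e.g. gsub("\u00a0", " ", ...) before factoring, then re-levelling test$colA with levels(train$colA)) usually resolves the "new factor" error.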

How to find the most shared species between 3 groups with dplyr

I'm looking to find the most common species (the "spid" variable, a code made from the first 4 letters of the genus name followed by the first 4 letters of the species name) in a data frame with different habitats (variable "hab", levels: TA, TB, TC).
I don't know how to apply the "max n" ("slice(which.max(n))") per habitat to select the species that are the most common across those habitats. As an example, if one species has been counted 50 times in 1 habitat and 0 times in the others, compared to a species that has 10 counts in each habitat, the latter would be the more common one.
Here is the code I started with :
brk %>%
  dplyr::select(spid, hab) %>%
  dplyr::group_by(spid) %>%
  dplyr::mutate(n = length(unique(hab))) %>%
  filter(n == 3)
At first I thought to filter the species present in all 3 habitats, but I couldn't select those species. And then how can I apply my "max" function to select the most shared species? Is an "apply" function a good approach?
Here is reproducible code:
library(dplyr)
brk %>%
  dplyr::select(spid, hab) %>%
  dplyr::sample_n(20) %>%
  dput()
structure(list(spid = structure(c(157L, 21L, 181L, 128L, 191L,
197L, 202L, 122L, 179L, 150L, 15L, 162L, 43L, 202L, 154L, 179L,
57L, 229L, 231L, 183L), .Label = c("ACROEMER", "ACROMEGA", "AEROSUBPM",
"AMAZDIPL", "ANASAURI", "ANASPILI", "ANDRABER", "ANDRBILO", "ANEULATI",
"BAZZDECR", "BAZZDECRM", "BAZZMASC", "BAZZNITI", "BAZZPRAE",
"BAZZROCA", "BRACEURY", "BUCKMEMB", "CALYARGU", "CALYFISS", "CALYMASC",
"CALYPALI", "CALYPERU", "CAMPARCTM", "CAMPAURE", "CAMPCRAT",
"CAMPFLEX", "CAMPJAME", "CAMPROBI", "CAMPTHWA", "CEPHVAGI", "CERABELA",
"CERACORN", "CERAZENK", "CHEICAME", "CHEICORDI", "CHEIDECU",
"CHEIMONT", "CHEISERP", "CHEISURR", "CHEITRIF", "CHEIUSAM", "CHEIXANT",
"COLOCEAT", "COLOHASK", "COLOHILD", "COLOOBLI", "COLOPEPO", "COLOTANZ",
"COLOZENK", "COLUBENO", "COLUCALY", "COLUDIGI", "COLUHUMB", "COLUOBES",
"COLUTENU", "CONOTRAP", "CRYPMART", "CUSPCONT", "CYCLBORB", "CYCLBREV",
"CYLIKIAE", "DALTANGU", "DALTLATI", "DENDBORB", "DICRBILLB",
"DIPLCAVI", "DIPLCOGO", "DIPLCORN", "DREPCULT", "DREPHELE", "DREPMADA",
"DREPPHYS", "ECTRREGU", "ECTRVALE", "FISSASPL", "FISSMEGAH",
"FISSSCIO", "FRULAPIC", "FRULAPICU", "FRULBORB", "FRULCAPE",
"FRULGROS", "FRULHUMB", "FRULLIND", "FRULREPA", "FRULSCHI", "FRULSERR",
"FRULUSAMR", "FRULVARI", "FUSCCONN", "GOTTNEES", "GOTTSCHI",
"GOTTSPHA", "GROULAXO", "HAPLSTIC", "HERBDICR", "HERBJUNI", "HERBMAUR",
"HETEDUBI", "HETESPLE", "HETESPN", "HOLOBORB", "HOLOCYLI", "HYPNCUPR",
"ISOPCHRY", "ISOPCITR", "ISOPINTO", "ISOTAUBE", "JAEGSOLI", "JAEGSOLIR",
"KURZCAPI", "KURZCAPIS", "LEJEALAT", "LEJEANIS", "LEJECONF",
"LEJEECKL", "LEJEFLAV", "LEJELOMA", "LEJEOBTU", "LEJERAMO", "LEJETABU",
"LEJETUBE", "LEJEVILL", "LEPIAFRI", "LEPICESP", "LEPIDELE", "LEPIHIRS",
"LEPISTUH", "LEPISTUHP", "LEPTFLEX", "LEPTINFU", "LEPTMACU",
"LEUCANGU", "LEUCBIFI", "LEUCBORY", "LEUCCANDI", "LEUCCAPI",
"LEUCCINC", "LEUCDELI", "LEUCGRAN", "LEUCHILD", "LEUCISLE", "LEUCLEPE",
"LEUCMAYO", "LEUCSEYC", "LOPHBORB", "LOPHCOAD", "LOPHCONC", "LOPHDIFF",
"LOPHEULO", "LOPHMULT", "LOPHMURI", "LOPHNIGR", "LOPHSUBF", "MACRACID",
"MACRMAUR", "MACRMICR", "MACRPALL", "MACRSERP", "MACRSULC", "MACRTENU",
"MASTDICL", "METZCONS", "METZFURC", "METZLEPT", "METZMADA", "MICRAFRI",
"MICRANKA", "MICRDISP", "MICRINFL", "MICRKAME", "MICROBLO", "MICRSTRA",
"MITTLIMO", "MNIOFUSC", "PAPICOMP", "PLAGANGU", "PLAGDREP", "PLAGPECT",
"PLAGRENA", "PLAGREPA", "PLAGRODR", "PLAGTERE", "PLEUGIGA", "PLICHIRT",
"POLYCOMM", "POROELON", "POROMADA", "POROUSAG", "PRIOGRAT", "PSEUDECI",
"PTYCSTRI", "PYRRSPIN", "RACOAFRI", "RADUANKE", "RADUAPPR", "RADUBORB",
"RADUBORY", "RADUCOMO", "RADUEVEL", "RADUFULV", "RADUMADA", "RADUSTEN",
"RADUTABU", "RADUVOLU", "RHAPCRIS", "RHAPGRAC", "RHAPRUBR", "RICCAMAZ",
"RICCEROS", "RICCFAST", "RICCLIMB", "RICCLONG", "SCHLBADI", "SCHLMICRO",
"SCHLOANGU", "SCHLSQUA", "SEMACRAS", "SEMASCHI", "SEMASUBP",
"SERPCYRT", "SOLEBORG", "SOLEONRA", "SOLESPHA", "SPHATUMI", "SPHEMINU",
"SYRRAFRI", "SYRRAPER", "SYRRDIMO", "SYRRGAUD", "SYRRHISP", "SYRRPOTT",
"SYRRPROL", "SYRRPROLA", "SYZYPURP", "TAXICONFO", "TELACOAC",
"TELADIAC", "TELANEMA", "TRICADHA", "TRICDEBE", "TRICPERV", "ULOTFULV",
"WARBLEPT", "ZYGOINTE", "ZYGOREIN"), class = "factor"), hab = structure(c(3L,
2L, 2L, 1L, 1L, 2L, 3L, 2L, 3L, 1L, 2L, 3L, 3L, 2L, 1L, 2L, 2L,
1L, 1L, 2L), .Label = c("TA", "TB", "TC"), class = "factor")), row.names = c(NA,
-20L), class = "data.frame")
Thank you for your help,
Germain V
We can try
library(dplyr)
brk %>%
  group_by(spid) %>%
  summarise(n = n_distinct(hab)) %>%
  slice(which.max(n))
Thank you for your answer.
I would like to have a list of the species (the spid codes) that are the most common across those 3 habitats, based on the counts of those species in each habitat.
spid n_TA n_TB n_TC
DREPPHYS 6 1 1
BUCKMEMB 4 4 4
LEIJCOLE 0 0 0
In this random example (I have a total of 246 species in all), I would like to run a computation on this array to select the "most common" or "most shared" species - in this case that would be BUCKMEMB, then DREPPHYS, and finally LEIJCOLE.
Maybe the "max" function isn't a good approach?
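One way to make "most shared" precise is to score each species by its smallest per-habitat count, so a species must be well represented in all three habitats to rank highly (this matches the follow-up's example ordering). A base-R sketch, with a toy data frame standing in for brk:

```r
# Toy stand-in for brk: one row per observed individual
brk <- data.frame(
  spid = c(rep("DREPPHYS", 8), rep("BUCKMEMB", 12)),
  hab  = c(rep("TA", 6), "TB", "TC", rep(c("TA", "TB", "TC"), 4))
)
counts <- table(brk$spid, brk$hab)  # species x habitat abundance
shared <- apply(counts, 1, min)     # score = count in the species' worst habitat
sort(shared, decreasing = TRUE)     # BUCKMEMB (min 4) ranks above DREPPHYS (min 1)
```

A dominant-in-one-habitat species like DREPPHYS (6/1/1) scores only 1 here, while the evenly spread BUCKMEMB (4/4/4) scores 4, which is the ranking the question asks for.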

How can I convert a data frame with a factor column to an xts object?

I have a CSV file, and when I use this command
SOLK <- read.table('Book1.csv', header = TRUE, sep = ';')
I get this output
> SOLK
Time Close Volume
1 10:27:03,6 0,99 1000
2 10:32:58,4 0,98 100
3 10:34:16,9 0,98 600
4 10:35:46,0 0,97 500
5 10:35:50,6 0,96 50
6 10:35:50,6 0,96 1000
7 10:36:10,3 0,95 40
8 10:36:10,3 0,95 100
9 10:36:10,4 0,95 500
10 10:36:10,4 0,95 100
. . . .
. . . .
. . . .
285 17:09:44,0 0,96 404
Here is the result of dput(SOLK[1:10,]):
> dput(SOLK[1:10,])
structure(list(Time = structure(c(1L, 2L, 3L, 4L, 5L, 5L, 6L,
6L, 7L, 7L), .Label = c("10:27:03,6", "10:32:58,4", "10:34:16,9",
"10:35:46,0", "10:35:50,6", "10:36:10,3", "10:36:10,4", "10:36:30,8",
"10:37:23,3", "10:37:38,2", "10:37:39,3", "10:37:45,9", "10:39:07,5",
"10:39:07,6", "10:39:46,6", "10:41:21,8", "10:43:20,6", "10:43:36,4",
"10:43:48,8", "10:43:48,9", "10:43:54,6", "10:44:01,5", "10:44:08,4",
"10:45:47,2", "10:46:16,7", "10:47:03,6", "10:47:48,6", "10:47:55,0",
"10:48:09,9", "10:48:30,6", "10:49:20,6", "10:50:31,9", "10:50:34,6",
"10:50:38,1", "10:51:02,8", "10:51:11,5", "10:55:57,7", "10:57:57,2",
"10:59:06,9", "10:59:33,5", "11:00:31,0", "11:00:31,1", "11:04:46,4",
"11:04:53,4", "11:04:54,6", "11:04:56,1", "11:04:58,9", "11:05:02,0",
"11:05:02,6", "11:05:24,7", "11:05:56,7", "11:06:15,8", "11:13:24,1",
"11:13:24,2", "11:13:32,1", "11:13:36,2", "11:13:37,2", "11:13:44,5",
"11:13:46,8", "11:14:12,7", "11:14:19,4", "11:14:19,8", "11:14:21,2",
"11:14:38,7", "11:14:44,0", "11:14:44,5", "11:15:10,5", "11:15:10,6",
"11:15:12,9", "11:15:16,6", "11:15:23,3", "11:15:31,4", "11:15:36,4",
"11:15:37,4", "11:15:49,5", "11:16:01,4", "11:16:06,0", "11:17:56,2",
"11:19:08,1", "11:20:17,2", "11:26:39,4", "11:26:53,2", "11:27:39,5",
"11:28:33,0", "11:30:42,3", "11:31:00,7", "11:33:44,2", "11:39:56,1",
"11:40:07,3", "11:41:02,1", "11:41:30,1", "11:45:07,0", "11:45:26,6",
"11:49:50,8", "11:59:58,1", "12:03:49,9", "12:04:12,6", "12:06:05,8",
"12:06:49,2", "12:07:56,0", "12:09:37,7", "12:14:25,5", "12:14:32,1",
"12:15:42,1", "12:15:55,2", "12:16:36,9", "12:16:44,2", "12:18:00,3",
"12:18:12,8", "12:28:17,8", "12:28:17,9", "12:28:23,7", "12:28:51,1",
"12:36:33,2", "12:37:45,0", "12:39:22,2", "12:40:19,5", "12:42:22,1",
"12:58:46,3", "13:06:05,8", "13:06:05,9", "13:07:17,6", "13:07:17,7",
"13:09:01,3", "13:09:01,4", "13:09:11,3", "13:09:31,0", "13:10:07,8",
"13:35:43,8", "13:38:27,7", "14:11:16,0", "14:17:31,5", "14:26:13,9",
"14:36:11,8", "14:38:43,7", "14:38:47,8", "14:38:51,8", "14:48:26,7",
"14:52:07,4", "14:52:13,8", "15:09:24,7", "15:10:25,8", "15:29:12,1",
"15:31:55,9", "15:34:04,1", "15:44:10,8", "15:45:07,1", "15:57:04,9",
"15:57:13,9", "16:16:27,9", "16:21:41,7", "16:36:01,5", "16:36:13,2",
"16:46:10,5", "16:46:10,6", "16:47:37,3", "16:50:52,4", "16:50:52,5",
"16:51:44,5", "16:55:11,5", "16:56:21,8", "16:56:37,5", "16:57:37,9",
"16:58:18,6", "16:58:44,5", "17:00:39,1", "17:01:50,7", "17:03:13,2",
"17:03:28,3", "17:03:46,7", "17:03:47,0", "17:04:30,4", "17:08:41,8",
"17:09:44,0"), class = "factor"), Close = structure(c(8L, 7L,
7L, 6L, 5L, 5L, 4L, 4L, 4L, 4L), .Label = c("0,92", "0,93", "0,94",
"0,95", "0,96", "0,97", "0,98", "0,99"), class = "factor"), Volume = c(1000L,
100L, 600L, 500L, 50L, 1000L, 40L, 100L, 500L, 100L)), .Names = c("Time",
"Close", "Volume"), row.names = c(NA, 10L), class = "data.frame")
The first column contains the time stamp of every transaction during a stock's daily trading session. I would like to convert the Close and Volume columns to an xts object ordered by the Time column.
UPDATE: From your edits, it appears you imported your data using two different commands. It also appears you should be using read.csv2. I've updated my answer with Lines that (I assume) look more like your original CSV (I have to guess, because you don't say what the file looks like). The rest of the answer doesn't change.
You have to add a date to your times, because xts stores all index values internally as POSIXct (I just used today's date).
I had to convert the "," decimal notation to the "." convention (using gsub), but that may be locale-dependent and you may not need to. Then paste today's date onto the (possibly converted) time and convert the result to POSIXct to create an index suitable for xts.
I've also formatted the index so you can see the fractional seconds.
Lines <- "Time;Close;Volume
10:27:03,6;0,99;1000
10:32:58,4;0,98;100
10:34:16,9;0,98;600
10:35:46,0;0,97;500
10:35:50,6;0,96;50
10:35:50,6;0,96;1000
10:36:10,3;0,95;40
10:36:10,3;0,95;100
10:36:10,4;0,95;500
10:36:10,4;0,95;100"
SOLK <- read.csv2(con <- textConnection(Lines))
close(con)
solk <- xts(SOLK[, c("Close", "Volume")],
            as.POSIXct(paste("2011-09-02", gsub(",", ".", SOLK[, 1]))))
indexFormat(solk) <- "%Y-%m-%d %H:%M:%OS6"
solk
# Close Volume
# 2011-09-02 10:27:03.599999 0.99 1000
# 2011-09-02 10:32:58.400000 0.98 100
# 2011-09-02 10:34:16.900000 0.98 600
# 2011-09-02 10:35:46.000000 0.97 500
# 2011-09-02 10:35:50.599999 0.96 50
# 2011-09-02 10:35:50.599999 0.96 1000
# 2011-09-02 10:36:10.299999 0.95 40
# 2011-09-02 10:36:10.299999 0.95 100
# 2011-09-02 10:36:10.400000 0.95 500
# 2011-09-02 10:36:10.400000 0.95 100
That's an odd structure. Translating it to dput syntax:
SOLK <- structure(list(structure(c(1L, 2L, 3L, 4L, 5L, 5L, 6L, 6L, 7L,
7L), .Label = c("10:27:03,6", "10:32:58,4", "10:34:16,9", "10:35:46,0",
"10:35:50,6", "10:36:10,3", "10:36:10,4"), class = "factor"),
Close = c(0.99, 0.98, 0.98, 0.97, 0.96, 0.96, 0.95, 0.95,
0.95, 0.95), Volume = c(1000L, 100L, 600L, 500L, 50L, 1000L,
40L, 100L, 500L, 100L)), .Names = c("", "Close", "Volume"
), class = "data.frame", row.names = c("1", "2", "3", "4", "5",
"6", "7", "8", "9", "10"))
I'm assuming the comma in the timestamp is a decimal separator.
library("chron")
time.idx <- times(gsub(",",".",as.character(SOLK[[1]])))
Unfortunately, it seems xts won't accept this as a valid order.by, so a date (today, for lack of a better choice) must be included to make xts happy.
xts(SOLK[[2]], order.by=chron(Sys.Date(), time.idx))
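The fractional-seconds part of the conversion can be checked in base R alone, without xts: %OS parses sub-second digits once the comma has been turned into a decimal point. A small sketch:

```r
op <- options(digits.secs = 6)   # let POSIXct print fractional seconds
x <- as.POSIXct(paste("2011-09-02", gsub(",", ".", "10:27:03,6")),
                format = "%Y-%m-%d %H:%M:%OS")
format(x, "%H:%M:%OS6")
# May display as 10:27:03.599999 rather than .600000, because 0.6 is not
# exactly representable in floating point - the same effect visible in the
# xts index output above.
options(op)
```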
