Distance between points in R: distCosine() for a table?

I have a list of positions and I would like to know the distances between the closest points. I tried to use distCosine(), but there is an issue. Here is what I did:
My data, sorted by lat:
structure(list(lat = c(53.56478, 53.919724, 54.109047, 54.109047,
54.36612, 55.48143, 56.2335, 56.682796, 56.93616, 57.804092,
58.82089, 59.297623, 59.335075, 59.907795, 60.125046, 60.274445,
60.289204, 60.386665, 60.591167, 64.68329), long = c(14.585611,
14.286517, 13.807847, 13.807847, 10.997632, 18.182697, 16.454927,
16.564703, 18.221214, 23.258204, 17.84381, 18.172949, 18.126884,
23.217615, 20.65724, 26.44062, 27.189545, 19.847534, 28.5585,
24.534185)), .Names = c("lat", "long"), row.names = c(2L, 3L,
6L, 11L, 1L, 17L, 15L, 20L, 13L, 19L, 7L, 14L, 4L, 5L, 10L, 12L,
18L, 9L, 8L, 16L), class = "data.frame")
I tried to use distCosine(), following another discussion on Stack Overflow, to add a new column holding the distance to the closest lat (this is why I sorted by lat):
data$a<-outer(seq(nrow(data)),
seq(nrow(data)),
Vectorize(function(i, j) distCosine(data[1,], data[2,]))
)
The result is wrong: it is not the distance for each point.
Is there an easier way to use distCosine() for my request?

I think you just have to replace distCosine(data[1,], data[2,]) by distCosine(data[i,c("long","lat")], data[j,c("long","lat")]):
data <- head(data,5) # smaller example
data$a<-outer( seq(nrow(data)),
seq(nrow(data)),
Vectorize(
function(i, j) distCosine(data[i,c("long","lat")], data[j,c("long","lat")])
)
)
Result:
> data
lat long a.1 a.2 a.3 a.4 a.5
2 53.56478 14.58561 0.00 44146.92 79251.87 79251.87 251291.54
3 53.91972 14.28652 44146.92 0.00 37741.81 37741.81 220118.16
6 54.10905 13.80785 79251.87 37741.81 0.00 0.00 185040.01
11 54.10905 13.80785 79251.87 37741.81 0.00 0.00 185040.01
1 54.36612 10.99763 251291.54 220118.16 185040.01 185040.01 0.00
>
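If what you ultimately want is the distance from each point to its closest neighbour, the pairwise matrix above can be reduced row by row. Here is a minimal sketch in base R only. The haversine() helper, the pts data frame and the nearest_* column names are my own illustration, not part of geosphere, and the radius 6378137 m is an assumption chosen to match geosphere's default. Identical coordinates (such as the duplicated rows 6 and 11 in the question) would give a nearest distance of 0, so the sketch uses four distinct points from the question.

```r
# Haversine great-circle distance in metres (base R, vectorised).
# r = 6378137 is an assumption: the equatorial radius geosphere uses by default.
haversine <- function(lon1, lat1, lon2, lat2, r = 6378137) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

# Four distinct points taken from the question's data
pts <- data.frame(
  lat  = c(53.56478, 53.919724, 54.109047, 54.36612),
  long = c(14.585611, 14.286517, 13.807847, 10.997632)
)

# d[i, j] is the distance from point i to point j
idx <- seq_len(nrow(pts))
d <- outer(idx, idx, function(i, j) {
  haversine(pts$long[i], pts$lat[i], pts$long[j], pts$lat[j])
})

diag(d) <- Inf                               # ignore each point's distance to itself
pts$nearest_dist <- apply(d, 1, min)         # distance to the closest other point
pts$nearest_row  <- apply(d, 1, which.min)   # position of that point in pts
```

Because haversine() is already vectorised, no Vectorize() wrapper is needed inside outer().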

Got it with another function, distHaversine:
data<-data[c("long","lat")]
t<-distHaversine(p1 = data[-nrow(data),],
p2 = data[-1,])
a<-0
final<-c(a,t)
data$dist<-final

Related

kmeans complains "NA/NaN/Inf in foreign function call (arg 1)", when there are none?

I'm trying to run kmeans clustering analysis on a relatively simple data frame. However, 
kmeans(sample_data, centers = 4) 
doesn't work, as R states there are "NA/NaN/Inf in foreign function call (arg 1)" (not true). Anyway, I tried
kmeans(na.omit(sample_data), centers = 4)
based on the answers here (and other posts), and that didn't work. The only workaround I found was to exclude the non-numeric column (i.e., the observation names) using
kmeans(sample_data[, 2:5], centers = 4)
Unfortunately, this makes the clusters much less informative, since the points now have numbers instead of names. What's going on? Or how could I get the clustering with the right labels?
Edit: I'm trying to reproduce this procedure / result, but with a different data set. Notice that when the author visualizes the clusters, the points are labelled according to the observations (the states, in that case; or "obs1, obs2, etc." in mine.)
Because of the workaround above (which drops the column with observation names), I get a sequence of numeric labels instead.
Code and dput below:
library(factoextra)
cluster <- kmeans(sample_data, centers = 4) #this doesn't work
cluster <- kmeans(sample_data[, 2:5], centers = 4) #this works
fviz_cluster(cluster, sample_data)
sample_data:
structure(list(name = structure(c(1L, 12L, 19L, 20L, 21L, 22L,
23L, 24L, 25L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 13L,
14L, 15L, 16L, 17L, 18L), .Label = c("obs1", "obs10", "obs11",
"obs12", "obs13", "obs14", "obs15", "obs16", "obs17", "obs18",
"obs19", "obs2", "obs20", "obs21", "obs22", "obs23", "obs24",
"obs25", "obs3", "obs4", "obs5", "obs6", "obs7", "obs8", "obs9"
), class = "factor"), variable1 = c(0, 0.383966783938484, 0.541654398529028,
0.469060314591266, 0.397636449124337, 0.3944696359856, 0.368740430902284,
0.998695171590958, 0.60013559365688, 0.543416096609665, 1, 0.287523586757021,
0.57818096701751, 0.504722587360754, 0.284825226469556, 0.295250085072615,
0.509782836343032, 0.392942062325636, 0.602608457169149, 0.474668174468815,
0.219951650206242, 0.263837738487209, 0.530976492805559, 0.312401708505963,
0.828799458392802), variable2 = c(0, 0.21094954480341, 0.374890541082605,
0.502470003202637, 0.385212751959443, 0.499052863381439, 0.172887314327707,
0.319869014605517, 0.484308813708282, 0.348608342250238, 0.474464311565186,
0.380406312920036, 1, 0.618253544624658, 0.560290273167607, 0.676315913606924,
0.339157532529115, 0.479005841710258, 0.576094917240369, 0.819742646967549,
0.472559283375261, 0.45594685111211, 0.160720270709769, 0.494360626922513,
0.658705091697224), variable3 = c(0, 0.0391726961740698, 0.157000498692027,
0.194883594782107, 0.133290754949737, 0.199085094994071, 0.000551185924636259,
0.418045152251051, 0.434858475480003, 0.443442199844268, 0.257231662911141,
0.195570389942169, 0.46503468971732, 0.358104620337886, 0.391852363829371,
0.39834809992812, 0.258870156344325, 0.38555892877453, 0.480559759927908,
1, 0.15662554228071, 0.279363773961277, 0.11211821625736, 0.180885222092932,
0.339650099009323), variable4 = c(0, 0.0464395032429444, 0.323768557597659,
0.201813172242373, 0.302710768912681, 0.446027132614423, 0.542018940773003,
1, 0.738123811706962, 0.550819613183929, 0.679555989322392, 0.563126171437818,
0.470328070009844, 0.316069092919459, 0.344421820993065, 0.222931758003036,
0.250406547916021, 0.381098780580988, 0.9526031202384, 0.174161621337361,
0.260548409706516, 0.288399563112687, 0.617089845066814, 0.265314653254406,
0.330637996311329)), class = "data.frame", row.names = c(NA,
-25L))
K-means only works on continuous variables.
It probably tried to convert your labels into numbers, and that did not work.
Never include identifier columns in analysis!
Proper data preprocessing is crucial and 90% of the work; you need to understand the requirements precisely. It is not sufficient to just make it run somehow - it is easy to make it run, but return useless results...
The key is to convert the column with the desired labels to row names with
df <- tibble::column_to_rownames(df, var = "labels")
That way the clustering algorithm won't even consider the labels, but they will still be applied to the points in the cluster plot.
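The same idea can be sketched in base R without tibble. The toy data frame below is made up for illustration (it is not the question's sample_data), and its column names are assumptions. The point is only that once the labels live in the row names, kmeans() sees numeric columns only and the cluster vector comes back named by observation.

```r
set.seed(1)  # kmeans() picks random starting centres

# Hypothetical toy data: one label column plus two numeric columns
toy <- data.frame(
  name      = paste0("obs", 1:8),
  variable1 = c(0.00, 0.10, 0.05, 0.90, 1.00, 0.95, 0.50, 0.55),
  variable2 = c(0.00, 0.05, 0.10, 1.00, 0.90, 0.95, 0.50, 0.45)
)

# Move the labels into the row names, then drop the label column
rownames(toy) <- toy$name
toy$name <- NULL

cl <- kmeans(toy, centers = 2)
cl$cluster  # cluster assignments, named obs1 ... obs8
```

As far as I know, fviz_cluster() takes its point labels from the row names of the data, so it should then show the observation names rather than 1, 2, 3, ...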

Plot frequency of events with time in R

I have a reasonable amount of time data, and I'd like to put it in a frequency graph, where the X-axis would be several intervals of time and the Y-axis would be the amount of data I've collected in such period. See this example:
Let's suppose I have this list:
[10:17:55, 10:37:40, 10:40:26, 10:48:18, 11:00:17, 11:01:12, 11:06:58, 11:09:20, 11:43:41, 11:48:24, 11:49:14, 12:07:31, 12:10:52, 12:10:52, 12:19:00, 12:19:00, 12:19:43, 12:20:55, 12:38:27, 12:38:27, 12:55:09, 12:55:10, 12:57:31, 12:57:31, 13:04:16, 13:04:16, 13:06:51, 13:06:51, 14:55:06, 14:56:10, 15:01:30, 15:28:42, 15:29:17, 15:35:33, 15:58:32, 16:05:07, 16:09:16, 16:10:36, 16:32:57, 16:32:57, 16:34:32, 16:38:16, 17:43:27, 17:53:01, 17:56:14, 18:08:21, 18:17:23, 18:37:23, 18:37:23, 18:43:13, 18:43:13, 18:51:43, 18:51:43, 19:05:39, 19:05:39]
And I'd like to plot a histogram showing how many values are there in intervals of 1h, or 30 minutes (still deciding), such as:
10h - 11h: 4
11h - 12h: 7
.
.
.
19h - 20h: 2
But all that represented in a graph. I know the very basics of how to plot a histogram in R and couldn't figure out how to do that. I've seen some answers making plots throughout the days, which is not much applicable, because these values were collected in different days... Can you guys help me?
EDIT: Here's a dput() of the list:
structure(c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L,
13L, 13L, 14L, 14L, 15L, 16L, 17L, 17L, 18L, 19L, 20L, 20L, 21L,
21L, 22L, 22L, 23L, 24L, 25L, 26L, 27L, 28L, 29L, 30L, 31L, 32L,
33L, 33L, 34L, 35L, 36L, 37L, 38L, 39L, 40L, 41L, 41L, 42L, 42L,
43L, 43L, 44L, 44L), .Label = c("10:17:55", "10:37:40", "10:40:26",
"10:48:18", "11:00:17", "11:01:12", "11:06:58", "11:09:20", "11:43:41",
"11:48:24", "11:49:14", "12:07:31", "12:10:52", "12:19:00", "12:19:43",
"12:20:55", "12:38:27", "12:55:09", "12:55:10", "12:57:31", "13:04:16",
"13:06:51", "14:55:06", "14:56:10", "15:01:30", "15:28:42", "15:29:17",
"15:35:33", "15:58:32", "16:05:07", "16:09:16", "16:10:36", "16:32:57",
"16:34:32", "16:38:16", "17:43:27", "17:53:01", "17:56:14", "18:08:21",
"18:17:23", "18:37:23", "18:43:13", "18:51:43", "19:05:39"), class = "factor")
There are range, trunc and seq methods for POSIXt or Date objects. Assuming you assign that structure object to a name such as tms this would convert to POSIXct and then construct a range, a sequence of breaks that spanned the hours and then bin within 30 minute intervals:
> tms <- as.POSIXct(tms, format="%H:%M:%S")
> brks <- trunc(range(tms), "hours")
Warning message:
In if (isdst == -1) { :
the condition has length > 1 and only the first element will be used
> hist(tms, breaks=seq(brks[1], brks[2]+3600, by="30 min") )
Notice that the plot method for POSIXt objects handles the x-axis labeling.
I suppose you could check to see if the second "brks" was within the half-hour window for a 30 minute plot. So this would be the code to avoid a blank bin, if targeting half-hour bins:
hist(tms, breaks=seq(brks[1],
brks[2]+ if( as.numeric( difftime(max(tms), brks[2], units="mins") ) < 30) # time difference, forced to minutes
{1800} else{3600},
by="30 min")
)
Here is the method I used to obtain what you are after (times is the vector of time strings).
This will work for hours and half hours. Not the prettiest, but I think it serves your purpose. You will need to do some massaging of the axes so they display the information you desire. Hopefully that helps!
hours <- as.numeric( format( strptime( times , format = "%H:%M:%S" ) , "%H" ) )
hist( hours , breaks = unique( hours ) )
half_hours <- hours + ( as.numeric( format( strptime( times , format = "%H:%M:%S" ) , "%M" ) ) /60 )
hist(half_hours , breaks = c( unique( hours ) , unique( hours ) + 0.5 ) )
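If you want the counts per interval as numbers (like the "10h - 11h: 4" listing in the question) and not only the plot, here is a minimal base-R sketch. It works on the time strings directly, so no date gets attached. The short times vector is just a few values from the question, and bin_width, bin_start and counts are names I made up.

```r
# A few of the question's values; replace with as.character() of the full factor
times <- c("10:17:55", "10:37:40", "10:40:26", "10:48:18",
           "11:00:17", "11:01:12", "19:05:39", "19:05:39")

# Seconds since midnight for each time string
secs <- vapply(strsplit(times, ":", fixed = TRUE),
               function(p) sum(as.numeric(p) * c(3600, 60, 1)),
               numeric(1))

bin_width <- 3600  # one hour; use 1800 for half-hour bins
bin_start <- floor(secs / bin_width) * bin_width

# Enumerate every bin between the first and last one, so empty bins count as 0
all_bins <- seq(min(bin_start), max(bin_start), by = bin_width)
labels   <- sprintf("%02d:%02d", all_bins %/% 3600, (all_bins %% 3600) %/% 60)
counts   <- table(factor(bin_start, levels = all_bins, labels = labels))

counts
if (interactive()) barplot(counts, xlab = "Interval start", ylab = "Events")
```

Going through factor() with explicit levels is what keeps the empty intervals in the table, which hist() would otherwise handle for you through its breaks.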

Decimal hours in r (excluding todays date)

For a sample dataframe:
light <- structure(list(daylight.hours = structure(c(62L, 22L, 60L, 58L,
34L, 37L), .Label = c("07:12:05", "07:14:41", "07:18:24", "07:28:59",
"07:31:07", "07:45:51", "07:48:08", "07:51:29", "07:52:06", "07:58:18",
"08:01:16", "08:07:25", "08:10:08", "08:18:16", "08:23:33", "08:27:03",
"08:30:36", "08:34:13", "08:41:35", "08:46:01", "08:53:52", "08:54:17",
"09:31:16", "09:35:29", "09:39:44", "10:27:19", "10:31:45", "10:36:12",
"11:53:41", "12:11:39", "12:16:10", "12:20:23", "12:34:10", "14:18:26",
"14:22:41", "14:26:55", "14:35:21", "14:39:49", "14:44:00", "14:48:09",
"14:54:29", "14:59:08", "15:03:18", "15:11:01", "15:15:38", "15:15:52",
"15:19:09", "15:58:22", "16:07:10", "16:08:33", "16:24:12", "16:27:14",
"16:42:57", "16:55:32", "16:57:52", "17:00:06", "17:02:15", "17:03:49",
"17:04:17", "17:05:24", "17:06:14", "17:06:53", "17:08:05", "17:09:38",
"17:11:04", "17:12:24", "17:13:26", "17:13:47", "17:14:22", "17:14:32",
"17:14:42", "17:14:44", "17:15:39", "17:15:40", "17:16:22", "17:16:51",
"17:17:55"), class = "factor"), school.id = c(4L, 4L, 4L, 4L,
14L, 14L)), .Names = c("daylight.hours", "school.id"), row.names = c(NA,
6L), class = "data.frame")
I want to create another variable called d.daylight that holds the daylight hours as a decimal (i.e. 18:30:00 would read 18.5).
When I use the following, it automatically adds today's date, which is not what I am after (everything is under 24 hours).
light$d.daylight <- as.POSIXlt(light$daylight.hours, format="%H:%M:%S")
Could anyone advise me how to rectify this?
The times function from package chron is useful if you need to deal with times (without dates).
library(chron)
light$d.daylight <- as.numeric(times(light$daylight.hours)) * 24
# daylight.hours school.id d.daylight
#1 17:06:53 4 17.114722
#2 08:54:17 4 8.904722
#3 17:05:24 4 17.090000
#4 17:03:49 4 17.063611
#5 14:18:26 14 14.307222
#6 14:35:21 14 14.589167
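If you would rather not depend on chron, the same conversion is plain arithmetic on the three fields. A minimal base-R sketch follows. The function name to_decimal_hours is my own, and it assumes every value is in HH:MM:SS form.

```r
# Convert "HH:MM:SS" strings (or a factor of them) to decimal hours.
# Assumes every element has exactly three ":"-separated numeric fields.
to_decimal_hours <- function(x) {
  parts <- strsplit(as.character(x), ":", fixed = TRUE)
  vapply(parts, function(p) {
    p <- as.numeric(p)
    p[1] + p[2] / 60 + p[3] / 3600
  }, numeric(1))
}

to_decimal_hours(c("18:30:00", "17:06:53"))
```

With the question's data this would be used as light$d.daylight <- to_decimal_hours(light$daylight.hours). The as.character() call matters because the column is a factor.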

How to extract block of rows in R

Here is an example of my data frame (the original has ~10,000 rows). I would like to extract blocks of rows based on VariableC. I only want to keep the rows between FALSEs, and only "blocks" with at least 10 rows (they are randomly located in the data frame); the others should be discarded. In other words, I want to split my data frame into sub data frames (i.e. blocks of rows). An alternative would be to create a new column in which each block has an individual number or letter. The end goal is to plot (regression) VariableA against VariableB for each block and extract the regression and slope coefficients of each block. I know how to do the last part, but I can't find a solution for extracting the blocks.
dput(DF)
structure(list(VariableA = c(-0.427796831, -0.985783635, 0.07381913,
-0.788768923, 2.088999368, 1.634064399, -0.396180684, 1.242763624,
-0.925287904, -1.127545153, -1.392674655, -0.988900906, -0.08007986,
1.123984722, 0.698530819, -0.983565282, 0.568517376, -0.349446274,
0.451443794, -0.525897224, -0.932426185, -1.026114049, -0.502973503,
0.779152951, -0.636137726, -0.488850226, 0.281389897, -0.058183652,
-0.490377469, 0.541441864, 0.101754052, -0.16701156, 0.830697787,
0.383672008, 0.376444634, 0.377695822, -0.167281753, 0.85629382,
0.213632586, -0.180474289, 1.008370316, -0.039110304, -0.498537412,
-2.804652051, -0.308652164, -0.57234963, 0.599951896, 0.52484456,
0.008141731, -0.355182154, -0.401441593, 1.201478908, 0.656311257,
0.459034655), VariableB = c(-0.599169932, -0.874625086, -0.879367189,
0.068133167, -0.800781757, -0.746429115, -0.231178499, -0.905456972,
0.40165965, 0.664579078, -0.386614574, -0.700272577, 1.844891234,
0.277616227, 0.560119708, -2.874313318, 0.835592571, -0.66310824,
0.770336487, 1.547635124, -0.604065751, 1.009519877, -0.54792181,
-0.904229067, -0.309270319, 0.16088111, 0.325712725, -0.931632811,
-1.124531146, -0.24012375, -0.887921437, -1.531276383, 1.565233292,
0.462452663, 0.836271408, -0.721959208, 1.92215585, 0.189964832,
1.661140854, -1.604886269, -1.237132008, 0.811584528, -0.965798536,
2.604504203, -1.124331258, 0.240004185, -0.34902354, -0.447056073,
0.051475583, 0.159486311, -1.86620661, -1.671688795, -1.268626575,
-1.734731137), VariableC = structure(c(11L, 19L, 9L, 36L, 36L,
26L, 7L, 24L, 36L, 5L, 17L, 15L, 33L, 30L, 29L, 21L, 31L, 10L,
36L, 36L, 36L, 36L, 36L, 36L, 36L, 36L, 36L, 36L, 36L, 36L, 8L,
16L, 35L, 25L, 28L, 4L, 32L, 27L, 34L, 18L, 36L, 36L, 14L, 2L,
13L, 3L, 36L, 23L, 22L, 1L, 20L, 6L, 36L, 12L), .Label = c("-0.019569584",
"-0.020014785", "-0.033234545", "-0.034426339", "-0.046296608",
"-0.047020989", "-0.062735918", "-0.078616739", "-0.080554806",
"-0.101255451", "-0.102696676", "-0.127569648", "-0.143298342",
"-0.146433595", "-0.168917348", "-0.169828794", "-0.177928923",
"-0.178536056", "-0.186040872", "-0.22676482", "-0.38578786",
"0.005961731", "0.007778849", "0.033730665", "0.084612467", "0.088763528",
"0.104625865", "0.121271604", "0.125865053", "0.140160095", "0.140410995",
"0.17548741", "0.176481137", "0.187477344", "0.239593108", "FALSE"
), class = "factor")), .Names = c("VariableA", "VariableB", "VariableC"
), class = "data.frame", row.names = c(NA, -54L))
Here's an approach:
# create indicator variable
df$ind <- cumsum(df$VariableC == "FALSE")
# remove "FALSE" rows
df_sub <- df[df$VariableC != "FALSE", ]
# run a regression for each unique ind value
library(lme4) # lmList() comes from lme4 (nlme also provides one); it is not in MASS
lmList(VariableA ~ VariableB | ind, data = df_sub)
The result:
Call: lmList(formula = VariableA ~ VariableB | ind, data = df_sub)
Coefficients:
(Intercept) VariableB
0 -0.40531670 0.05261483
2 -0.93213791 -2.80237922
3 -0.26593782 0.31197216
15 0.24240710 0.10646927
17 -0.92256481 -0.65475348
18 0.02793152 -0.22209490
19 0.45903466 NA
Degrees of freedom: 35 total; 21 residual
Residual standard error: 0.6656342
How to create a plot?
library(ggplot2)
ggplot(df_sub, aes(x = VariableB, y = VariableA)) +
geom_point() +
facet_wrap( ~ ind) +
geom_smooth(method = lm)
You could do as follows:
falseIdx <- which(as.character(DF$VariableC) == "FALSE")
# at least 2 FALSE's must be present...
if(length(falseIdx) >= 2){
blocks <-
lapply(2:length(falseIdx),FUN=function(idx){ # include the last pair of FALSEs too
currFalse <- falseIdx[idx]
prevFalse <- falseIdx[idx-1]
# we build a block only if it has at least 10 rows
if(currFalse - prevFalse - 1 >= 10){
return(DF[(prevFalse+1):(currFalse-1),])
}else{
return(NULL)
}
})
# remove nulls
blocks[sapply(blocks, is.null)] <- NULL
}else{
blocks <- list()
}
Computing on your example data, blocks contains only one data.frame:
> blocks
[[1]]
VariableA VariableB VariableC
31 0.1017541 -0.8879214 -0.078616739
32 -0.1670116 -1.5312764 -0.169828794
33 0.8306978 1.5652333 0.239593108
34 0.3836720 0.4624527 0.084612467
35 0.3764446 0.8362714 0.121271604
36 0.3776958 -0.7219592 -0.034426339
37 -0.1672818 1.9221558 0.17548741
38 0.8562938 0.1899648 0.104625865
39 0.2136326 1.6611409 0.187477344
40 -0.1804743 -1.6048863 -0.178536056
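The alternative mentioned in the question (a column or id per block) can be sketched with rle(), which finds the runs of FALSE and non-FALSE rows in one pass. This is only a sketch on made-up toy data, with min_rows set to 3 so the example stays short (the question wants 10). The names toy, min_rows, run_id and run_len are my own.

```r
# Hypothetical toy data: "FALSE" marks the separators, as in the question
toy <- data.frame(
  VariableA = 1:11,
  VariableC = c("0.1", "FALSE", "0.2", "0.3", "0.4", "FALSE",
                "0.5", "FALSE", "FALSE", "0.6", "0.7"),
  stringsAsFactors = FALSE
)
min_rows <- 3  # the question asks for 10

is_false <- as.character(toy$VariableC) == "FALSE"
r <- rle(is_false)

# One id and one length per row, taken from the run the row belongs to
run_id  <- rep(seq_along(r$lengths), r$lengths)
run_len <- rep(r$lengths, r$lengths)

# Keep non-FALSE runs that are long enough and enclosed by FALSEs on both
# sides (the first and last runs are not between two FALSEs)
keep <- !is_false & run_len >= min_rows &
  run_id > 1 & run_id < length(r$lengths)

blocks <- split(toy[keep, ], run_id[keep])
blocks
```

If you prefer a single data frame over a list, toy$block <- ifelse(keep, run_id, NA) gives the extra column, which can then serve as the grouping variable for the per-block regressions.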

How can I convert a data frame with a factor column to an xts object?

I have a csv file and when I use this command
SOLK<-read.table('Book1.csv',header=TRUE,sep=';')
I get this output
> SOLK
Time Close Volume
1 10:27:03,6 0,99 1000
2 10:32:58,4 0,98 100
3 10:34:16,9 0,98 600
4 10:35:46,0 0,97 500
5 10:35:50,6 0,96 50
6 10:35:50,6 0,96 1000
7 10:36:10,3 0,95 40
8 10:36:10,3 0,95 100
9 10:36:10,4 0,95 500
10 10:36:10,4 0,95 100
. . . .
. . . .
. . . .
285 17:09:44,0 0,96 404
Here is the result of dput(SOLK[1:10,]):
> dput(SOLK[1:10,])
structure(list(Time = structure(c(1L, 2L, 3L, 4L, 5L, 5L, 6L,
6L, 7L, 7L), .Label = c("10:27:03,6", "10:32:58,4", "10:34:16,9",
"10:35:46,0", "10:35:50,6", "10:36:10,3", "10:36:10,4", "10:36:30,8",
"10:37:23,3", "10:37:38,2", "10:37:39,3", "10:37:45,9", "10:39:07,5",
"10:39:07,6", "10:39:46,6", "10:41:21,8", "10:43:20,6", "10:43:36,4",
"10:43:48,8", "10:43:48,9", "10:43:54,6", "10:44:01,5", "10:44:08,4",
"10:45:47,2", "10:46:16,7", "10:47:03,6", "10:47:48,6", "10:47:55,0",
"10:48:09,9", "10:48:30,6", "10:49:20,6", "10:50:31,9", "10:50:34,6",
"10:50:38,1", "10:51:02,8", "10:51:11,5", "10:55:57,7", "10:57:57,2",
"10:59:06,9", "10:59:33,5", "11:00:31,0", "11:00:31,1", "11:04:46,4",
"11:04:53,4", "11:04:54,6", "11:04:56,1", "11:04:58,9", "11:05:02,0",
"11:05:02,6", "11:05:24,7", "11:05:56,7", "11:06:15,8", "11:13:24,1",
"11:13:24,2", "11:13:32,1", "11:13:36,2", "11:13:37,2", "11:13:44,5",
"11:13:46,8", "11:14:12,7", "11:14:19,4", "11:14:19,8", "11:14:21,2",
"11:14:38,7", "11:14:44,0", "11:14:44,5", "11:15:10,5", "11:15:10,6",
"11:15:12,9", "11:15:16,6", "11:15:23,3", "11:15:31,4", "11:15:36,4",
"11:15:37,4", "11:15:49,5", "11:16:01,4", "11:16:06,0", "11:17:56,2",
"11:19:08,1", "11:20:17,2", "11:26:39,4", "11:26:53,2", "11:27:39,5",
"11:28:33,0", "11:30:42,3", "11:31:00,7", "11:33:44,2", "11:39:56,1",
"11:40:07,3", "11:41:02,1", "11:41:30,1", "11:45:07,0", "11:45:26,6",
"11:49:50,8", "11:59:58,1", "12:03:49,9", "12:04:12,6", "12:06:05,8",
"12:06:49,2", "12:07:56,0", "12:09:37,7", "12:14:25,5", "12:14:32,1",
"12:15:42,1", "12:15:55,2", "12:16:36,9", "12:16:44,2", "12:18:00,3",
"12:18:12,8", "12:28:17,8", "12:28:17,9", "12:28:23,7", "12:28:51,1",
"12:36:33,2", "12:37:45,0", "12:39:22,2", "12:40:19,5", "12:42:22,1",
"12:58:46,3", "13:06:05,8", "13:06:05,9", "13:07:17,6", "13:07:17,7",
"13:09:01,3", "13:09:01,4", "13:09:11,3", "13:09:31,0", "13:10:07,8",
"13:35:43,8", "13:38:27,7", "14:11:16,0", "14:17:31,5", "14:26:13,9",
"14:36:11,8", "14:38:43,7", "14:38:47,8", "14:38:51,8", "14:48:26,7",
"14:52:07,4", "14:52:13,8", "15:09:24,7", "15:10:25,8", "15:29:12,1",
"15:31:55,9", "15:34:04,1", "15:44:10,8", "15:45:07,1", "15:57:04,9",
"15:57:13,9", "16:16:27,9", "16:21:41,7", "16:36:01,5", "16:36:13,2",
"16:46:10,5", "16:46:10,6", "16:47:37,3", "16:50:52,4", "16:50:52,5",
"16:51:44,5", "16:55:11,5", "16:56:21,8", "16:56:37,5", "16:57:37,9",
"16:58:18,6", "16:58:44,5", "17:00:39,1", "17:01:50,7", "17:03:13,2",
"17:03:28,3", "17:03:46,7", "17:03:47,0", "17:04:30,4", "17:08:41,8",
"17:09:44,0"), class = "factor"), Close = structure(c(8L, 7L,
7L, 6L, 5L, 5L, 4L, 4L, 4L, 4L), .Label = c("0,92", "0,93", "0,94",
"0,95", "0,96", "0,97", "0,98", "0,99"), class = "factor"), Volume = c(1000L,
100L, 600L, 500L, 50L, 1000L, 40L, 100L, 500L, 100L)), .Names = c("Time",
"Close", "Volume"), row.names = c(NA, 10L), class = "data.frame")
The first column includes the time stamp of every transaction during a stock's exchange daily session. I would like to convert the Close and Volume columns to an xts object ordered by the Time column.
UPDATE: From your edits, it appears you imported your data using two different commands. It also appears you should be using read.csv2. I've updated my answer with Lines that (I assume) look more like your original CSV (I have to guess because you don't say what the file looks like). The rest of the answer doesn't change.
You have to add a date to your times because xts stores all index values internally as POSIXct (I just used today's date).
I had to convert the "," decimal notation to the "." convention (using gsub), but that may be locale-dependent and you may not need to. paste today's date with the (possibly converted) time and then convert it to POSIXct to create an index suitable for xts.
I've also formatted the index so you can see the fractional seconds.
Lines <- "Time;Close;Volume
10:27:03,6;0,99;1000
10:32:58,4;0,98;100
10:34:16,9;0,98;600
10:35:46,0;0,97;500
10:35:50,6;0,96;50
10:35:50,6;0,96;1000
10:36:10,3;0,95;40
10:36:10,3;0,95;100
10:36:10,4;0,95;500
10:36:10,4;0,95;100"
library(xts)
SOLK <- read.csv2(con <- textConnection(Lines))
close(con)
solk <- xts(SOLK[,c("Close","Volume")],
as.POSIXct(paste("2011-09-02", gsub(",",".",SOLK[,1]))))
indexFormat(solk) <- "%Y-%m-%d %H:%M:%OS6"
solk
# Close Volume
# 2011-09-02 10:27:03.599999 0.99 1000
# 2011-09-02 10:32:58.400000 0.98 100
# 2011-09-02 10:34:16.900000 0.98 600
# 2011-09-02 10:35:46.000000 0.97 500
# 2011-09-02 10:35:50.599999 0.96 50
# 2011-09-02 10:35:50.599999 0.96 1000
# 2011-09-02 10:36:10.299999 0.95 40
# 2011-09-02 10:36:10.299999 0.95 100
# 2011-09-02 10:36:10.400000 0.95 500
# 2011-09-02 10:36:10.400000 0.95 100
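To see the index construction on its own, here is a minimal base-R sketch of the two steps described above: reading the file with a comma decimal mark, then pasting a date onto the times. It does not need xts. The two-row Lines2 string is a shortened copy of the data above, and the names Lines2, dat, stamp and idx are my own. I set tz = "UTC" only to keep the example reproducible.

```r
Lines2 <- "Time;Close;Volume
10:27:03,6;0,99;1000
10:32:58,4;0,98;100"

# read.csv2() defaults to sep = ";" and dec = ",", so Close becomes numeric
dat <- read.csv2(text = Lines2)
str(dat)

# The Time column is not parsed as a number, so its decimal comma is swapped by hand
stamp <- paste("2011-09-02", sub(",", ".", dat$Time, fixed = TRUE))

# %OS reads seconds including the fractional part
idx <- as.POSIXct(stamp, format = "%Y-%m-%d %H:%M:%OS", tz = "UTC")

# Elapsed seconds between the two trades, fractional part included
diff(as.numeric(idx))
```

The resulting idx is the kind of POSIXct vector that xts() takes as its order.by argument.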
That's an odd structure. Translating it to dput syntax
SOLK <- structure(list(structure(c(1L, 2L, 3L, 4L, 5L, 5L, 6L, 6L, 7L,
7L), .Label = c("10:27:03,6", "10:32:58,4", "10:34:16,9", "10:35:46,0",
"10:35:50,6", "10:36:10,3", "10:36:10,4"), class = "factor"),
Close = c(0.99, 0.98, 0.98, 0.97, 0.96, 0.96, 0.95, 0.95,
0.95, 0.95), Volume = c(1000L, 100L, 600L, 500L, 50L, 1000L,
40L, 100L, 500L, 100L)), .Names = c("", "Close", "Volume"
), class = "data.frame", row.names = c("1", "2", "3", "4", "5",
"6", "7", "8", "9", "10"))
I'm assuming the comma in the timestamp is the decimal separator.
library("chron")
time.idx <- times(gsub(",",".",as.character(SOLK[[1]])))
Unfortunately, it seems xts won't take this as a valid order.by; so a date (today, for lack of a better choice) must be included to make xts happy.
xts(SOLK[[2]], order.by=chron(Sys.Date(), time.idx))

Resources