R: High frequency data statistical analysis

I'm working with tick data and would like to have some basic information about the distribution of the change in tick prices. My database is made of tick data during a period of 10 open days.
I've taken the first difference of the tick prices :
Tick spread
2010-02-02 08:00:04 -1
2010-02-02 08:00:04 1
2010-02-02 08:00:04 0
2010-02-02 08:00:04 0
2010-02-02 08:00:04 0
2010-02-02 08:00:04 -1
2010-02-02 08:00:05 1
2010-02-02 08:00:05 1
I've created an array which provides me with the first and last tick of each day :
Open Close
[1,] 1 59115
[2,] 59116 119303
[3,] 119304 207300
[4,] 207301 351379
[5,] 351380 426553
[6,] 426554 516742
[7,] 516743 594182
[8,] 594183 683840
[9,] 683841 754962
[10,] 754963 780725
I would like to know each day the empirical distribution of my tick spreads.
I know that I can use the R function table(), but the problem is that it gives me a table object whose length varies from day to day. The second problem is that on some days I have spreads of 3 points, whereas on other days I only have spreads of less than 3 points.
first day table() output :
-3 -2 -1 0 1 2 3
1 19 6262 46494 6321 16 2
second day table() output :
-2 -1 0 1 2 3 5
27 5636 48902 5588 33 1 1
What I would like is to create a data frame with all table()'s output for my whole tick sample.
Any idea?
thanks

Just use a 2-dimensional table, using as.Date(index(x)) as the rows:
# create some example data (.xts() and index() come from the xts package)
library(xts)
set.seed(21)
p <- sort(runif(6))*(1:6)^2
p <- c(p,rev(p)[-1])
p <- p/sum(p)
P <- sample(-5:5, 1e5, TRUE, p)
x <- .xts(P, (1:1e5)*5)
# create table
table(as.Date(index(x)), x)
# x
# -5 -4 -3 -2 -1 0 1 2 3 4 5
# 1970-01-01 22 141 527 1623 2968 6647 2953 1700 538 139 21
# 1970-01-02 31 142 548 1596 2937 6757 2874 1677 529 167 22
# 1970-01-03 26 172 547 1599 2858 6814 2896 1681 504 163 20
# 1970-01-04 23 178 537 1645 2855 6805 2891 1626 537 165 18
# 1970-01-05 23 139 490 1597 3028 6740 2848 1724 505 158 28
# 1970-01-06 21 134 400 1304 2266 5496 2232 1213 397 112 26

If you want the frequency distribution for the entire 10 day period just concatenate the data and do the same. Is that what you want to do?
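For the layout in the question (a plain vector of spreads plus an Open/Close index matrix), a base R sketch along the following lines may also work. The names `spread`, `idx`, `lv` and `daily` are hypothetical stand-ins, and the data is simulated. The idea is to convert each day's spreads to a factor with one common set of levels, so every day's table() has the same length and the rows can be bound into a data frame.

```r
# Hypothetical stand-ins for the asker's data: 'spread' holds the tick
# differences, 'idx' holds the first and last tick index of each day.
set.seed(1)
spread <- sample(-3:3, 300, replace = TRUE, prob = c(1, 5, 20, 48, 20, 5, 1))
idx <- cbind(Open = c(1, 101, 201), Close = c(100, 200, 300))

# One common set of levels, so that every day yields a table of equal length;
# spreads that do not occur on a given day get a count of 0.
lv <- sort(unique(spread))
counts <- t(sapply(seq_len(nrow(idx)), function(i) {
  table(factor(spread[idx[i, "Open"]:idx[i, "Close"]], levels = lv))
}))

# One row per day, one column per spread value.
daily <- as.data.frame(counts)
daily
```

Dividing each row by its row sum (for example with prop.table(counts, 1)) would give the empirical distribution per day rather than raw counts.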

Related

duplicate low frequency values in a factor

I need to duplicate those levels whose frequency in my factor variable called groups is less than 500.
> head(groups)
[1] 0000000 1000000 1000000 1000000 0000000 0000000
75 Levels: 0000000 0000001 0000010 0000100 0000110 0001000 0001010 0001100 0001110 0010000 0010010 0010100 0010110 ... 1111110
For example:
> table(group)
group
0000000 0000001 0000010 0000100 0000110 0001000 0001010 0001100 0001110 0010000 0010010 0010100 0010110 0011000 0011010 0011100
58674 6 1033 654 223 1232 31 222 17 818 132 32 15 42 9 9
0011110 0100000 0100001 0100010 0100100 0100101 0100110 0101000 0101010 0101100 0101110 0110000 0110010 0110100 0110110 0111000
1 10609 1 487 64 1 58 132 11 12 3 142 27 9 7 11
0111010 0111100 0111110 1000000 1000001 1000010 1000011 1000100 1000101 1000110 1001000 1001001 1001010 1001100 1001110 1010000
5 1 2 54245 10 1005 1 329 1 138 573 1 31 71 11 969
1010010 1010100 1010110 1011000 1011010 1011100 1011110 1100000 1100001 1100010 1100011 1100100 1100110 1101000 1101010 1101011
147 29 21 63 15 10 4 14161 6 770 1 142 96 260 23 1
1101100 1101110 1110000 1110001 1110010 1110100 1110110 1111000 1111010 1111100 1111110
34 16 439 2 103 13 26 36 13 8 5
Groups 0000001, 0000110, 0001010, 0001100... must be duplicated up to 500.
The ideal would be a balanced sample of groups: levels with a frequency below 500 are duplicated until they reach 500, and the rest (levels with a frequency above 500) are cut down to 500.
We can use rep on the levels of 'group' for the desired 'n'
factor(rep(levels(group), each = n))
If we need to use the table results as well
factor(rep(levels(group), table(group) + n-table(group)) )
Or with pmax
factor(rep(levels(group), pmax(n, table(group))))
data
set.seed(24)
group <- factor(sample(letters[1:6], 3000, replace = TRUE))
n <- 500
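If the goal is the balanced sample described in the question (rare levels duplicated up to n, frequent levels cut down to n), one possible base R sketch is to resample the positions of each level. This is not the answer's method; `idx` and `balanced` are hypothetical names, the `prob` argument is only there to make the toy data unbalanced, and the sketch assumes every level occurs at least once.

```r
set.seed(24)
# Toy data with deliberately unequal level frequencies.
group <- factor(sample(letters[1:6], 3000, replace = TRUE,
                       prob = c(0.5, 0.3, 0.1, 0.05, 0.03, 0.02)))
n <- 500

# For each level, draw exactly n positions: with replacement when the level
# has fewer than n observations (duplication), without replacement otherwise.
idx <- unlist(lapply(split(seq_along(group), group), function(i) {
  i[sample.int(length(i), n, replace = length(i) < n)]
}))

balanced <- group[idx]
table(balanced)
```

The same `idx` can be used to subset the rows of a data frame that contains `group`, so the other columns stay aligned with the resampled factor.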

Stacking time series data vertically

I am struggling with the manipulation of time series data. The first column of the dataset contains the time points of data collection; the second column onwards contains data from different studies. I have several hundred studies. As an example, I have included sample data for 5 studies. I want to stack the dataset vertically, with time and data points for each study. The example data set looks like this:
TIME Study1 Study2 Study3 Study4 Study5
0.00 52.12 53.66 52.03 50.36 51.34
90.00 49.49 51.71 49.49 48.48 50.19
180.00 47.00 49.83 47.07 46.67 49.05
270.00 44.63 48.02 44.77 44.93 47.95
360.00 42.38 46.28 42.59 43.25 46.87
450.00 40.24 44.60 40.50 41.64 45.81
540.00 38.21 42.98 38.53 40.08 44.78
I am looking for an output in the form of:
TIME Study ID
0 52.12 1
90 49.49 1
180 47 1
270 44.63 1
360 42.38 1
450 40.24 1
540 38.21 1
0 53.66 2
90 51.71 2
180 49.83 2
270 48.02 2
360 46.28 2
450 44.6 2
540 42.98 2
0 52.03 3
90 49.49 3
180 47.07 3
270 44.77 3
...
This is a classic 'wide to long' dataset manipulation. Below, I show the use of the base function ?reshape for your data:
d.l <- reshape(d, varying=list(c("Study1","Study2","Study3","Study4","Study5")),
v.names="Y", idvar="TIME", times=1:5, timevar="Study",
direction="long")
d.l <- d.l[,c(2,1,3)]
rownames(d.l) <- NULL
d.l
# Study TIME Y
# 1 1 0 52.12
# 2 1 90 49.49
# 3 1 180 47.00
# 4 1 270 44.63
# 5 1 360 42.38
# 6 1 450 40.24
# 7 1 540 38.21
# 8 2 0 53.66
# 9 2 90 51.71
# 10 2 180 49.83
# 11 2 270 48.02
# 12 2 360 46.28
# 13 2 450 44.60
# 14 2 540 42.98
# 15 3 0 52.03
# 16 3 90 49.49
# 17 3 180 47.07
# ...
However, there are many ways to do this in R: the most basic reference on SO (of which this is probably a duplicate) is Reshaping data.frame from wide to long format, but there are many other relevant threads (see this search: [r] wide to long). Beyond using reshape, @lmo's method can be used, as well as methods based on the reshape2, tidyr, and data.table packages (presumably among others).
Here is one method using cbind and stack:
longdf <- cbind(df$TIME, stack(df[,-1]))
names(longdf) <- c("TIME", "Study", "id")
This returns
longdf
TIME Study id
1 0 52.12 Study1
2 90 49.49 Study1
3 180 47.00 Study1
4 270 44.63 Study1
5 360 42.38 Study1
6 450 40.24 Study1
7 540 38.21 Study1
8 0 53.66 Study2
9 90 51.71 Study2
...
If you want to change id to integers as in your example, use
longdf$id <- as.integer(longdf$id)
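With several hundred studies, typing the column names by hand is impractical. A minimal base R sketch that derives the study columns from the data instead might look like this. The toy data frame `d` uses the first rows of the question's example; `study_cols` and `long` are hypothetical names, and the sketch assumes every column other than TIME is a study.

```r
# First three rows and three studies of the example data in the question.
d <- data.frame(TIME   = c(0, 90, 180),
                Study1 = c(52.12, 49.49, 47.00),
                Study2 = c(53.66, 51.71, 49.83),
                Study3 = c(52.03, 49.49, 47.07))

# Every column except TIME is treated as a study (an assumption).
study_cols <- setdiff(names(d), "TIME")

long <- data.frame(
  TIME  = rep(d$TIME, times = length(study_cols)),       # TIME repeated per study
  Study = unlist(d[study_cols], use.names = FALSE),      # columns stacked in order
  ID    = rep(seq_along(study_cols), each = nrow(d))     # 1, 2, 3, ... per study
)
long
```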

Subsetting dataset when data contains no clear groups to subset [closed]

Hi Stackoverflow community,
Background:
I'm using a population modelling program to try to predict genetic outcomes of threatened species populations given a range of management scenarios. At the end of each of my population modelling scenarios I have a .csv file containing information on all the final living individuals over all 1,000 iterations of the modeled population which includes information on all surviving individual's genotypes.
What I want:
From this .csv output file I'd like to determine the frequency of the allele "6" in the columns "Allele2a" and "Allele2b" in each of the 1,000 iterations of the model contained in the file.
The Problem:
The .csv file I'm trying to determine allele 6's frequency from does not contain information that can be used to easily subset the data (from what I can see) into the separate iterations. I have no idea how to split this dataset into its respective iterations, given that the number of individuals surviving to the end of the model (and consequently the number of rows in each iteration) is not the same, and there are no clear points at which to subset.
Any guidance on how to separate this data into iteration units which can be analysed, or how to determine the frequency of the allele without complex subsetting would be very greatly appreciated. If any further information is required please don't hesitate to ask.
Thanks!
EDIT: When input into R the data looks like this:
Living<-read.csv("Living Ind.csv", header=F)
colnames(Living) <- c("Iteration","ID","Pop","Sex","alive","Age","DamID","SireID","F","FInd","MtDNA","Alle1a","Alle1b","Alle2a","Alle2b")
attach(Living)
Living
Iteration ID Pop Sex alive Age DamID SireID F FInd MtDNA Alle1a Alle1b Alle2a Alle2b
1 Iteration 1 NA NA NA NA NA NA NA NA NA NA NA NA
2 NA NA NA NA NA NA NA NA NA NA NA NA NA
3 2511 2 M TRUE 19 545 1376 0.000 0.000 545 1089 2751 6 6
4 2515 2 F TRUE 18 590 1783 0.000 0.000 590 1180 3566 5 5
5 2519 2 F TRUE 18 717 1681 0.000 0.000 717 1434 3362 4 6
6 2526 2 M TRUE 17 412 1780 0.000 0.000 412 823 3559 4 6
7 2529 2 F TRUE 17 324 1473 0.000 0.000 324 647 2945 5 6
107 2676 2 F TRUE 1 2576 2526 0.000 0.000 621 3876 3559 6 4
108 NA NA NA NA NA NA NA NA NA NA NA NA NA
109 Iteration 2 NA NA NA NA NA NA NA NA NA NA NA NA
110 NA NA NA NA NA NA NA NA NA NA NA NA NA
111 2560 2 M TRUE 18 703 1799 0.000 0.000 703 1406 3598 6 6
112 2564 2 M TRUE 18 420 1778 0.000 0.000 420 840 3555 4 6
113 2578 2 F TRUE 17 347 1778 0.000 0.000 347 693 3555 3 5
114 2581 2 M TRUE 16 330 1454 0.000 0.000 330 659 2907 6 6
115 2584 2 F TRUE 16 568 1593 0.000 0.000 568 1135 3185 6 5
116 2591 2 F TRUE 13 318 1423 0.000 0.000 318 635 2846 3 6
117 2593 2 M TRUE 13 341 1454 0.000 0.000 341 682 2907 6 6
118 2610 2 M TRUE 8 2578 2582 0.000 0.000 347 693 2908 5 6
119 2612 2 M TRUE 8 2578 2582 0.000 0.000 347 3555 660 3 6
Just a total mess I'm afraid.
Here's a link to a copy of the .csv file.
https://www.dropbox.com/s/pl6ncy5i0152uv1/Living%20Ind.csv?dl=0
Thank you for providing your data. It makes this much easier. In future you should always do this with questions on SO.
The basic issue is transforming your original data into something easier to manipulate in R. Since your dataset is fairly large, I'm using the data.table package, but you could also do basically the same thing using data.frames in base R.
library(data.table)
url <- "https://www.dropbox.com/s/pl6ncy5i0152uv1/Living%20Ind.csv?dl=1"
DT <- fread(url,header=FALSE, showProgress = FALSE) # import data
DT <- DT[!is.na(V2)] # remove blank lines (rows)
brks <- which(DT$V1=="Iteration") # identify iteration header rows
iter <- DT[brks,]$V2 # extract iteration numbers
DT <- DT[-brks,Iter:=rep(iter,diff(c(brks,nrow(DT)+1))-1)] # assign iteration number to each row
DT <- DT[-brks] # remove iteration header rows
DT[,V1:=NULL] # remove first column
setnames(DT, c("ID","Pop","Sex","alive","Age","DamID","SireID","F","FInd","MtDNA","Alle1a","Alle1b","Alle2a","Alle2b","Iteration"))
# now the fraction of individuals carrying allele 6 (in either copy) is easy to compute.
DT[,list(frac=sum(Alle2a==6 | Alle2b==6)/.N), by=Iteration]
# Iteration frac
# 1: 1 0.7619048
# 2: 2 0.9130435
# 3: 3 0.6091954
# 4: 4 0.8620690
# 5: 5 0.8850575
# ---
If you are going to be analyzing large datasets like this a lot, it would probably be worth your while to learn how to use data.table.
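Note that `frac` above is the share of individuals carrying at least one copy of allele 6. If the allele frequency in the usual sense is wanted (copies of allele 6 divided by all allele copies at the locus), a base R sketch could look like the following. The toy data frame `raw` only mimics the layout of the file with made-up values, `is_header`, `ind` and `freq` are hypothetical names, and the sketch assumes the iterations are numbered consecutively from 1 so that a running count of header rows reproduces the iteration number.

```r
# Toy data mimicking the file layout: header rows, blank rows, individuals.
raw <- data.frame(
  V1     = c("Iteration", NA, "", "", "", NA, "Iteration", NA, "", ""),
  V2     = c(1, NA, 2511, 2515, 2519, NA, 2, NA, 2560, 2564),
  Alle2a = c(NA, NA, 6, 5, 4, NA, NA, NA, 6, 4),
  Alle2b = c(NA, NA, 6, 5, 6, NA, NA, NA, 6, 6)
)

# Header rows mark the start of each iteration; a running count of headers
# labels every following row with its iteration.
is_header <- !is.na(raw$V1) & raw$V1 == "Iteration"
raw$Iteration <- cumsum(is_header)

# Keep only rows describing individuals (drop headers and blank rows).
ind <- raw[!is_header & !is.na(raw$V2), ]

# Allele frequency: copies of allele 6 over total copies (2 per individual).
freq <- sapply(split(ind, ind$Iteration), function(g) {
  (sum(g$Alle2a == 6) + sum(g$Alle2b == 6)) / (2 * nrow(g))
})
freq
```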

row average of columns that match string

I have a data frame below and I want to find the average row value for all columns with header *R and all columns with *G.
The output should then be four columns: Rfam, Classes, avg.rowR, avg.rowG
I was playing around with the rowMeans() function, but I am not sure how to specify the columns.
Rfam Classes 26G 26R 35G 35R 46G 46R 48G 48R 55G 55R
5_8S_rRNA rRNA 63 39 8 27 26 17 28 43 41 17
5S_rRNA rRNA 171 149 119 109 681 47 95 161 417 153
7SK 7SK 53 282 748 371 248 42 425 384 316 198
ACA64 Other 7 8 19 2 10 1 36 10 10 4
let-7 miRNA 121825 73207 25259 75080 54301 63510 30444 53800 78961 47533
lin-4 miRNA 10149 16263 5629 19680 11297 37866 3816 9677 11713 10068
Metazoa_SRP SRP 317 1629 1008 418 1205 407 1116 1225 1413 1075
mir-1 miRNA 3 4 1 2 0 26 1 1 0 4
mir-10 miRNA 912163 1411287 523793 1487160 517017 1466085 107597 551381 727720 788201
mir-101 miRNA 461 320 199 553 174 460 278 297 256 254
mir-103 miRNA 937 419 202 497 318 217 328 343 891 439
mir-1180 miRNA 110 32 4 17 53 47 6 29 35 22
mir-1226 miRNA 11 3 0 3 6 0 1 2 5 4
mir-1237 miRNA 3 2 1 1 0 1 0 2 1 1
mir-1249 miRNA 5 14 2 9 4 5 9 5 7 7
newcols <- sapply(c("R$", "G$"), function(x) rowMeans(df[grep(x, names(df))]))
setNames(cbind(df[1:2], newcols), c(names(df)[1:2], "avg.rowR", "avg.rowG"))
# Rfam Classes avg.rowR avg.rowG
# 1 5_8S_rRNA rRNA 28.6 33.2
# 2 5S_rRNA rRNA 123.8 296.6
# 3 7SK 7SK 255.4 358.0
# 4 ACA64 Other 5.0 16.4
# 5 let-7 miRNA 62626.0 62158.0
# 6 lin-4 miRNA 18710.8 8520.8
# 7 Metazoa_SRP SRP 950.8 1011.8
# 8 mir-1 miRNA 7.4 1.0
# 9 mir-10 miRNA 1140822.8 557658.0
# 10 mir-101 miRNA 376.8 273.6
# 11 mir-103 miRNA 383.0 535.2
# 12 mir-1180 miRNA 29.4 41.6
# 13 mir-1226 miRNA 2.4 4.6
# 14 mir-1237 miRNA 1.4 1.0
# 15 mir-1249 miRNA 8.0 5.4
One way to look for patterns in column names is to use the grep family of functions. The function call grep("R$", names(df)) will return the index of all column names that end with R. When we use it with sapply we can search for the R and G columns in one expression.
The core of the second line is cbind(df[1:2], newcols). That binds the first two columns of df to the two new columns of mean values. Wrapping it with setNames(.., c(names(df)[1:2], ...)) formats the column names to match your desired output.
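The column selection can be checked step by step on a small example. The sketch below uses two rows and four count columns taken from the question's table; `r_cols`, `g_cols` and `out` are hypothetical names. Note that "Rfam" is not matched by "R$", because the pattern is anchored to the end of the name.

```r
df <- data.frame(Rfam = c("5_8S_rRNA", "ACA64"),
                 Classes = c("rRNA", "Other"),
                 `26G` = c(63, 7), `26R` = c(39, 8),
                 `35G` = c(8, 19), `35R` = c(27, 2),
                 check.names = FALSE)  # keep names that start with a digit

r_cols <- grep("R$", names(df))  # positions of columns whose name ends in R
g_cols <- grep("G$", names(df))  # positions of columns whose name ends in G

out <- data.frame(df[1:2],
                  avg.rowR = rowMeans(df[r_cols]),
                  avg.rowG = rowMeans(df[g_cols]))
out
```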

Shift up rows in R

This is a simple example of what my data looks like.
Suppose I have the following data:
>x
Year a b c
1962 1 2 3
1963 4 5 6
. . . .
. . . .
2001 7 8 9
I need to form a time series from x with 7 columns containing the following variables:
Year a lag(a) b lag(b) c lag(c)
What I did is the following:
> x<-ts(x) # converting x to a time series
> x<-cbind(x,x[,-1]) # adding the same variables to the time series without repeating the year column
> x
Year a b c a b c
1962 1 2 3 1 2 3
1963 4 5 6 4 5 6
. . . . . . .
. . . . . . .
2001 7 8 9 7 8 9
I need to shift the last three columns up so that they give the lags of a, b, c. Then I will rearrange them.
Here's an approach using dplyr
df <- data.frame(
a=1:10,
b=21:30,
c=31:40)
library(dplyr)
# mutate_each() and funs() are deprecated; across() requires dplyr >= 1.0.0
df %>% mutate(across(everything(), lead)) %>% cbind(df, .)
# a b c a b c
#1 1 21 31 2 22 32
#2 2 22 32 3 23 33
#3 3 23 33 4 24 34
#4 4 24 34 5 25 35
#5 5 25 35 6 26 36
#6 6 26 36 7 27 37
#7 7 27 37 8 28 38
#8 8 28 38 9 29 39
#9 9 29 39 10 30 40
#10 10 30 40 NA NA NA
You can change the names afterwards using colnames(df) <- c("a", "b", ...)
As @nrussel noted in his answer, what you described is a leading variable. If you want a lagging variable, you can change the lead in my answer to lag.
X <- data.frame(
a=1:100,
b=2*(1:100),
c=3*(1:100),
laga=1:100,
lagb=2*(1:100),
lagc=3*(1:100),
stringsAsFactors=FALSE)
##
Xts <- ts(X)
Xts[1:(nrow(Xts)-1),c(4,5,6)] <- Xts[2:nrow(Xts),c(4,5,6)]
Xts[nrow(Xts),c(4,5,6)] <- c(NA,NA,NA)
> head(Xts)
a b c laga lagb lagc
[1,] 1 2 3 2 4 6
[2,] 2 4 6 3 6 9
[3,] 3 6 9 4 8 12
[4,] 4 8 12 5 10 15
[5,] 5 10 15 6 12 18
[6,] 6 12 18 7 14 21
##
> tail(Xts)
a b c laga lagb lagc
[95,] 95 190 285 96 192 288
[96,] 96 192 288 97 194 291
[97,] 97 194 291 98 196 294
[98,] 98 196 294 99 198 297
[99,] 99 198 297 100 200 300
[100,] 100 200 300 NA NA NA
I'm not sure if by shift up you literally mean shift the rows up 1 place like above (because that would mean you are using leading values, not lagging values), but here's the other direction ("true" lagged values):
X2 <- data.frame(
a=1:100,
b=2*(1:100),
c=3*(1:100),
laga=1:100,
lagb=2*(1:100),
lagc=3*(1:100),
stringsAsFactors=FALSE)
##
Xts2 <- ts(X2)
Xts2[2:nrow(Xts2),c(4,5,6)] <- Xts2[1:(nrow(Xts2)-1),c(4,5,6)]
Xts2[1,c(4,5,6)] <- c(NA,NA,NA)
##
> head(Xts2)
a b c laga lagb lagc
[1,] 1 2 3 NA NA NA
[2,] 2 4 6 1 2 3
[3,] 3 6 9 2 4 6
[4,] 4 8 12 3 6 9
[5,] 5 10 15 4 8 12
[6,] 6 12 18 5 10 15
##
> tail(Xts2)
a b c laga lagb lagc
[95,] 95 190 285 94 188 282
[96,] 96 192 288 95 190 285
[97,] 97 194 291 96 192 288
[98,] 98 196 294 97 194 291
[99,] 99 198 297 98 196 294
[100,] 100 200 300 99 198 297
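Both directions can also be wrapped in one small base R helper and applied to the layout asked for in the question (Year, a, lag(a), b, lag(b), c, lag(c)). The function `shift()` and the `lag_` prefix are hypothetical names introduced for illustration, the data is made up, and the helper assumes the shift is no longer than the vector.

```r
# Shift a vector by k positions: k > 0 gives a lag (values move down, NA on
# top), k < 0 gives a lead (values move up, NA at the bottom).
shift <- function(v, k) {
  n <- length(v)
  if (k >= 0) c(rep(NA, k), head(v, n - k)) else c(tail(v, n + k), rep(NA, -k))
}

x <- data.frame(Year = 1962:1966, a = 1:5, b = 6:10, c = 11:15)

# Lag every variable by one row and give the new columns distinct names.
lagged <- as.data.frame(lapply(x[c("a", "b", "c")], shift, k = 1))
names(lagged) <- paste0("lag_", names(lagged))

# Interleave the original and lagged columns as in the question.
out <- cbind(x, lagged)[c("Year", "a", "lag_a", "b", "lag_b", "c", "lag_c")]
out
```

Calling shift(v, -1) instead reproduces the "shift up" (lead) behaviour described in the question.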
