I'm trying to figure out a formula to take the max and min number inside each interval and divide their sum by two.
x <- sample(10:40, 100, rep = TRUE)
factorx <- factor(cut(x, breaks = nclass.Sturges(x)))
xout <- as.data.frame(table(factorx))
xout <- transform(xout, cumFreq = cumsum(Freq), relative = prop.table(Freq))
Running the above code in R, I get the following:
xout
factorx Freq cumFreq relative
1 (9.97,13.8] 14 14 0.14
2 (13.8,17.5] 13 27 0.13
3 (17.5,21.2] 16 43 0.16
4 (21.2,25] 5 48 0.05
5 (25,28.8] 11 59 0.11
6 (28.8,32.5] 8 67 0.08
7 (32.5,36.2] 16 83 0.16
8 (36.2,40] 17 100 0.17
What I want to know is whether there is a way to calculate the midpoint of each interval. For example, for the first interval it would be:
(13.8 + 9.97)/2
It's called the class midpoint in statistics, I believe.
Here's a one-liner that is probably close to what you want:
> sapply(strsplit(levels(xout$factorx), ","), function(x) sum(as.numeric(gsub("[[:space:]]", "", chartr(old = "(]", new = " ", x))))/2)
[1] 11.885 15.650 19.350 23.100 26.900 30.650 34.350 38.100
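For readability, here is the same idea unrolled step by step (a sketch equivalent to the one-liner; averaging the two endpoints is the same as summing and halving):
lev  <- levels(xout$factorx)                     # e.g. "(9.97,13.8]"
ends <- strsplit(gsub("\\(|]", "", lev), ",")    # drop brackets, split on the comma
sapply(ends, function(e) mean(as.numeric(e)))    # class midpoints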
# One possible solution is to split each level on "(", ",", and "]" (xout is your data frame)
x1 <- strsplit(as.character(xout$factorx), ",|\\(|]")
x2 <- do.call(rbind, x1)
xout$lower <- as.numeric(x2[, 2])
xout$higher <- as.numeric(x2[, 3])
xout$ave <- rowMeans(xout[, c("lower", "higher")])
> head(xout, 3)
      factorx Freq cumFreq relative lower higher    ave
1 (9.97,13.7]   15      15     0.15  9.97   13.7 11.835
2 (13.7,17.5]   14      29     0.14 13.70   17.5 15.600
3 (17.5,21.2]   12      41     0.12 17.50   21.2 19.350
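A third option, if the exact bins are negotiable: hist() computes class midpoints directly. A minimal sketch (note hist() treats a numeric breaks argument only as a suggestion, so its bins may not match cut()'s exactly):
# hist() returns the class midpoints in $mids without plotting
h <- hist(x, breaks = nclass.Sturges(x), plot = FALSE)
h$mids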
I have a trajectory in 2D (a list of x, y positions).
I am trying to measure the angles of the motion between consecutive points.
So I calculate the scalar product of two consecutive vectors and divide by the vector norms, which gives me the cosine of the angle I am looking for.
However, when I generate totally random trajectories (by generating random x and random y), I always get a large number of cosine values very close to -1 or 1, whereas I was expecting all cosines between -1 and 1 to be equally likely.
Here's my code to generate the trajectories (after correction from the comments below) and calculate the cosines:
cost = c()
t = seq(0, 500, 0.5)
x = runif(length(t), -1, 1)
y = runif(length(t), -1, 1)
x = cumsum(x)
y = cumsum(y)
step = 1
dstep = 2
for (j in 1:(length(t) - dstep))
{
  # displacement vectors between consecutive points
  x1 = x[j + step] - x[j]
  y1 = y[j + step] - y[j]
  x2 = x[j + dstep] - x[j + step]
  y2 = y[j + dstep] - y[j + step]
  n1 = sqrt(x1*x1 + y1*y1)
  n2 = sqrt(x2*x2 + y2*y2)
  if ((n1*n2) > 0)
  {
    scal = x1*x2 + y1*y2          # dot product
    cost = c(cost, scal/(n1*n2))  # cosine of the turning angle
    #print(paste(n1, " ", n2, " ", n1*n2, " ", scal, " ", x1, " ", x2, " ", scal/(n1*n2), sep = ""))
  }
}
When I look at the histogram of the cost results, I always see many values very close to -1 and 1:
> hist(cost, plot=F)
$breaks
[1] -1.00 -0.95 -0.90 -0.85 -0.80 -0.75 -0.70 -0.65 -0.60 -0.55 -0.50 -0.45
[13] -0.40 -0.35 -0.30 -0.25 -0.20 -0.15 -0.10 -0.05 0.00 0.05 0.10 0.15
[25] 0.20 0.25 0.30 0.35 0.40 0.45 0.50 0.55 0.60 0.65 0.70 0.75
[37] 0.80 0.85 0.90 0.95 1.00
$counts
[1] 108 43 32 20 22 21 19 20 19 17 16 19 8 19 23 17 15 10 18
[20] 22 15 19 14 15 18 16 21 11 18 20 16 35 23 24 24 20 23 33
[39] 37 107
Any idea where I'm wrong, or why it behaves this way?
Thanks for the help.
In case somebody else meets this problem, here's a summary of the solution from the comments:
Actually, this distribution of the cosine is exactly what you get when the angles are uniformly distributed! Consider hist(cos(runif(1000, min = 0, max = 2*pi))). So it's working as expected: cos just moves quickly over 0 and slowly near 1 and -1. See plot(cos, from = 0, to = 2*pi).
Which is indeed explained there: https://math.stackexchange.com/questions/1153339/distribution-of-cosine-of-uniformly-random-variables
The upshot is that it is normal to get more cosine values close to 1 and -1 when the underlying angles are totally random.
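To make the comment concrete, here is a small sketch comparing the histogram of cos(U), with U uniform on [0, 2*pi], against the theoretical density 1/(pi*sqrt(1 - x^2)) on (-1, 1):
# cos(U) for U ~ Uniform(0, 2*pi) has density 1/(pi*sqrt(1 - x^2)),
# which diverges at the endpoints -- hence the spikes at -1 and 1
u <- runif(10000, min = 0, max = 2*pi)
hist(cos(u), breaks = 40, freq = FALSE)
curve(1/(pi*sqrt(1 - x^2)), from = -0.99, to = 0.99, add = TRUE, col = "red")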
How can I apply a package function to a data frame?
I have a data set (df) with two columns (total and n). I would like to apply the pois.exact function (pois.exact(x, pt = 1, conf.level = 0.95)) from the epitools package, with x = df$n and pt = df$total, and get a "new" data frame (new_df) with three more columns holding the corresponding rounded rates and lower and upper confidence bounds.
df <- data.frame("total" = c(35725302,35627717,34565295,36170648,38957933,36579643,29628394,18212075,39562754,1265055), "n" = c(24,66,166,461,898,1416,1781,1284,329,12))
> df
      total    n
1  35725302   24
2  35627717   66
3  34565295  166
4  36170648  461
5  38957933  898
6  36579643 1416
7  29628394 1781
8  18212075 1284
9  39562754  329
10  1265055   12
In fact, the real data frame is much longer.
For example, for the first row the desired results are:
require(epitools)
round(pois.exact(24, pt = 35725302, conf.level = 0.95) * 100000, 2)[3:5]
rate lower upper
1 0.07 0.04 0.1
The new data frame, with the results of applying the pois.exact function added, should look like this:
> new_df
     total    n incidence lower_95IC upper_95IC
1 35725302 24 0.07 0.04 0.10
2 35627717 66 0.19 0.14 0.24
3 34565295 166 0.48 0.41 0.56
4 36170648 461 1.27 1.16 1.40
5 38957933 898 2.31 2.16 2.46
6 36579643 1416 3.87 3.67 4.08
7 29628394 1781 6.01 5.74 6.03
8 18212075 1284 7.05 6.67 7.45
9 9562754 329 3.44 3.08 3.83
Thanks.
library(epitools)
library(dplyr)

df %>%
  cbind( pois.exact(df$total, df$n) ) %>%
  dplyr::select( total, n, rate, lower, upper )
# total n rate lower upper
# 1 35725302 24 1488554.25 1488066.17 1489042.45
# 2 35627717 66 539813.89 539636.65 539991.18
# 3 34565295 166 208224.67 208155.26 208294.10
# 4 36170648 461 78461.28 78435.71 78486.85
# 5 38957933 898 43383.00 43369.38 43396.62
# 6 36579643 1416 25833.08 25824.71 25841.45
# 7 29628394 1781 16635.82 16629.83 16641.81
# 8 18212075 1284 14183.86 14177.35 14190.37
# 9 39562754 329 120251.53 120214.06 120289.01
# 10 1265055 12 105421.25 105237.62 105605.12
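If the goal is the per-100,000 incidence columns from the question, here is a sketch along the lines of the asker's own one-row example (the new column names are assumptions copied from the desired new_df):
library(epitools)

# Per-100,000 rates with exact 95% CIs; scaling the whole result by 1e5
# scales rate, lower and upper alike, as in the question's one-row example
res <- round(pois.exact(df$n, pt = df$total, conf.level = 0.95) * 1e5, 2)
new_df <- cbind(df, setNames(res[, c("rate", "lower", "upper")],
                             c("incidence", "lower_95IC", "upper_95IC")))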
Suppose we have observed X(1), X(2), ..., X(N) from a continuous-time process. How can I discretize the times of these data onto the grid {0, 1/N, ..., (N-1)/N, 1} using R?
I really appreciate any help. Thanks.
This would be the way to do it in continuous time:
x <- cumsum(abs(rnorm(20)))
n <- (x - min(x))/diff(range(x))
> n
[1] 0.00000000 0.01884929 0.02874295 0.07230612 0.11253305 0.19770821 0.26356939
[8] 0.33310811 0.36687944 0.47041629 0.53331128 0.61724640 0.72534086 0.74782335
[15] 0.79829820 0.83023417 0.85336221 0.85528100 0.90023497 1.00000000
To get a numeric vector analogous to what you might get from cut or Hmisc::cut2, you can use findInterval:
> findInterval(n, seq(0, 1, length = length(n)))
 [1]  1  1  1  2  3  4  6  7  7  9 11 12 14 15 16 16 17 17 18 20
And "normalizing to [0,1] is then simple, even trivial;
> findInterval(n, seq(0,1,length=length(n) ))/length(n)
[1] 0.05 0.05 0.05 0.10 0.15 0.20 0.30 0.35 0.35 0.45 0.55 0.60 0.70 0.75 0.80 0.80 0.85
[18] 0.85 0.90 1.00
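If the literal grid {0, 1/N, ..., (N-1)/N, 1} from the question is wanted, a minimal sketch building on n from above:
# Snap the normalized values n onto the grid with N+1 points;
# each value is rounded down to the nearest grid point
N <- length(n)
grid <- (0:N)/N
snapped <- grid[findInterval(n, grid)]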
I am trying to extract tables from text files and have found several earlier posts here that address similar questions, but none seems to work well for my problem. The most helpful answer I have found is the one to an earlier question of mine: R: removing header, footer and sporadic column headings when reading csv file
An example dummy text file contains:
>
>
> ###############################################################################
>
> # Display AICc Table for the models above
>
>
> collect.models(, adjust = FALSE)
model npar AICc DeltaAICc weight Deviance
13 P1 19 94 0.00 0.78 9
12 P2 21 94 2.64 0.20 9
10 P3 15 94 9.44 0.02 9
2 P4 11 94 619.26 0.00 9
>
>
> ###############################################################################
>
> # the three lines below count the number of errors in the code above
>
> cat("ERROR COUNT:", .error.count, "\n")
ERROR COUNT: 0
> options(error = old.error.fun)
> rm(.error.count, old.error.fun, new.error.fun)
>
> ##########
>
>
I have written the following code to extract the desired table:
my.data <- readLines('c:/users/mmiller21/simple R programs/dummy.log')
top <- '> collect.models\\(, adjust = FALSE)'
bottom <- '> # the three lines below count the number of errors in the code above'
my.data <- my.data[-c(grep(bottom, my.data):length(my.data))]
my.data <- my.data[-c(1:grep(top, my.data))]
my.data <- my.data[c(1:(length(my.data)-4))]
aa <- as.data.frame(my.data)
aa
write.table(my.data, 'c:/users/mmiller21/simple R programs/dummy.log.extraction.txt', quote = FALSE, col.names = FALSE, row.names = FALSE)
my.data2 <- read.table('c:/users/mmiller21/simple R programs/dummy.log.extraction.txt', header = TRUE, row.names = c(1))
my.data2
model npar AICc DeltaAICc weight Deviance
13 P1 19 94 0.00 0.78 9
12 P2 21 94 2.64 0.20 9
10 P3 15 94 9.44 0.02 9
2 P4 11 94 619.26 0.00 9
I would prefer to avoid having to write and then read my.data to obtain the desired data frame. Prior to that step, the current code returns a character vector for my.data:
[1] " model npar AICc DeltaAICc weight Deviance" "13 P1 19 94 0.00 0.78 9"
[3] "12 P2 21 94 2.64 0.20 9" "10 P3 15 94 9.44 0.02 9"
[5] "2 P4 11 94 619.26 0.00 9"
Is there some way I can convert the above vector of strings into a data frame like that in dummy.log.extraction.txt without writing and then reading my.data?
The line:
aa <- as.data.frame(my.data)
returns the following, which looks like what I want:
# my.data
# 1 model npar AICc DeltaAICc weight Deviance
# 2 13 P1 19 94 0.00 0.78 9
# 3 12 P2 21 94 2.64 0.20 9
# 4 10 P3 15 94 9.44 0.02 9
# 5 2 P4 11 94 619.26 0.00 9
However:
dim(aa)
# [1] 5 1
If I can split aa into columns then I think I will have what I want without having to write and then read my.data.
I found the post Extracting Data from Text Files. However, in the posted answer the table in question seems to have a fixed number of rows, whereas in my case the number of rows can vary between 1 and 20. Also, I would prefer to use base R. In my case I think the number of rows between bottom and the last row of the table is constant (here 4).
I also found the post How to extract data from a text file using R or PowerShell? However, in my case the column widths are not fixed and I do not know how to split the strings (rows) so that there are only seven columns.
Given all of the above, perhaps my question is really how to split the object aa into columns. Thank you for any advice or assistance.
EDIT:
The actual logs are produced by a supercomputer and contain up to 90,000 lines. However, the number of lines varies greatly among logs. That is why I was making use of top and bottom.
Maybe your real log file is totally different and more complex, but with this one you can use read.table directly; you just have to play with the right parameters.
data <- read.table("c:/users/mmiller21/simple R programs/dummy.log",
comment.char = ">",
nrows = 4,
skip = 1,
header = TRUE,
row.names = 1)
str(data)
## 'data.frame': 4 obs. of 6 variables:
## $ model : Factor w/ 4 levels "P1","P2","P3",..: 1 2 3 4
## $ npar : int 19 21 15 11
## $ AICc : int 94 94 94 94
## $ DeltaAICc: num 0 2.64 9.44 619.26
## $ weight : num 0.78 0.2 0.02 0
## $ Deviance : int 9 9 9 9
data
## model npar AICc DeltaAICc weight Deviance
## 13 P1 19 94 0.00 0.78 9
## 12 P2 21 94 2.64 0.20 9
## 10 P3 15 94 9.44 0.02 9
## 2 P4 11 94 619.26 0.00 9
read.table and its family now have an option to read text:
> df <- read.table(text = paste(my.data, collapse = "\n"))
> df
model npar AICc DeltaAICc weight Deviance
13 P1 19 94 0.00 0.78 9
12 P2 21 94 2.64 0.20 9
10 P3 15 94 9.44 0.02 9
2 P4 11 94 619.26 0.00 9
> summary(df)
model npar AICc DeltaAICc weight Deviance
P1:1 Min. :11.0 Min. :94 Min. : 0.00 Min. :0.000 Min. :9
P2:1 1st Qu.:14.0 1st Qu.:94 1st Qu.: 1.98 1st Qu.:0.015 1st Qu.:9
P3:1 Median :17.0 Median :94 Median : 6.04 Median :0.110 Median :9
P4:1 Mean :16.5 Mean :94 Mean :157.84 Mean :0.250 Mean :9
3rd Qu.:19.5 3rd Qu.:94 3rd Qu.:161.90 3rd Qu.:0.345 3rd Qu.:9
Max. :21.0 Max. :94 Max. :619.26 Max. :0.780 Max. :9
It looks strange that you have to read an R console log, but whatever: you can use the fact that your table lines begin with a number and extract the interesting lines with something like ^[0-9]+. Then read.table, as shown by @kohske, does the rest.
ll <- readLines('c:/users/mmiller21/simple R programs/dummy.log')
idx <- which(grepl('^[0-9]+', ll))
idx <- c(min(idx) - 1, idx)  ## prepend the header line
read.table(text = ll[idx])
model npar AICc DeltaAICc weight Deviance
13 P1 19 94 0.00 0.78 9
12 P2 21 94 2.64 0.20 9
10 P3 15 94 9.44 0.02 9
2 P4 11 94 619.26 0.00 9
Thank you to those who posted answers. Because of the size, complexity and variability of the actual log files, I think I need to continue to make use of the variables top and bottom. However, I used elements of dickoa's answer to come up with the following.
my.data <- readLines('c:/users/mmiller21/simple R programs/dummy.log')
top <- '> collect.models\\(, adjust = FALSE)'
bottom <- '> # the three lines below count the number of errors in the code above'
my.data <- my.data[-c(grep(bottom, my.data):length(my.data))]
my.data <- my.data[-c(1:grep(top, my.data))]
x <- read.table(text=my.data, comment.char = ">")
x
# model npar AICc DeltaAICc weight Deviance
# 13 P1 19 94 0.00 0.78 9
# 12 P2 21 94 2.64 0.20 9
# 10 P3 15 94 9.44 0.02 9
# 2 P4 11 94 619.26 0.00 9
Here is even simpler code:
my.data <- readLines('c:/users/mmiller21/simple R programs/dummy.log')
top <- '> collect.models\\(, adjust = FALSE)'
bottom <- '> # the three lines below count the number of errors in the code above'
my.data <- my.data[grep(top, my.data):grep(bottom, my.data)]
x <- read.table(text=my.data, comment.char = ">")
x
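If this pattern recurs across many logs, it can be wrapped in a small helper; a sketch (extract_table is a name of my choosing, and it assumes each marker matches exactly one line per log):
# Pull the table sitting between two marker lines out of a console log
extract_table <- function(file, top, bottom) {
  ll <- readLines(file)
  ll <- ll[grep(top, ll):grep(bottom, ll)]
  read.table(text = ll, comment.char = ">")
}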
I haven't found something which precisely matches what I need, so I thought I'd post this.
I have a number of computations which basically apply a function over a rolling index of a variable, and whose results should naturally flow back into the data frame they came from.
For example,
data <- as.data.frame(as.matrix(seq(1:30)))
data$V1 <- data$V1/100
str(data)
data$V1_MA5d <- NA  # preallocate the rolling 5-day product column
for (i in 5:nrow(data)) {
  start <- i - 5
  end <- i
  data$V1_MA5d[i] <- (prod((data$V1[start:end]/100) + 1) - 1)*100
}
data
> head(data,15)
V1 V1_MA5d
1 0.01 NA
2 0.02 NA
3 0.03 NA
4 0.04 NA
5 0.05 0.1500850
6 0.06 0.2101751
7 0.07 0.2702952
8 0.08 0.3304453
9 0.09 0.3906255
10 0.10 0.4508358
11 0.11 0.5110762
12 0.12 0.5713467
13 0.13 0.6316473
14 0.14 0.6919780
15 0.15 0.7523389
But really, I should be able to do something like:
data$V1_MA5d<-sapply(data$V1, function(x) prod(((data$V1[i-5:i]/100)+1))-1)*100
But I'm not sure what that would look like.
Likewise, the count of a variable by another variable:
data$V1_MA5_cat <- NA
data$V1_MA5_cat[data$V1_MA5d < .5] <- 0
data$V1_MA5_cat[data$V1_MA5d > .5] <- 1
data$V1_MA5_cat[data$V1_MA5d > 1.5] <- 2
table(data$V1_MA5_cat)
data$V1_MA5_cat_n <- NA
data$V1_MA5_cat_n[data$V1_MA5_cat == 0] <- nrow(subset(data, V1_MA5_cat == 0))
data$V1_MA5_cat_n[data$V1_MA5_cat == 1] <- nrow(subset(data, V1_MA5_cat == 1))
data$V1_MA5_cat_n[data$V1_MA5_cat == 2] <- nrow(subset(data, V1_MA5_cat == 2))
> head(data,15)
V1 V1_MA5d V1_MA5_cat V1_MA5_cat_n
1 0.01 NA NA NA
2 0.02 NA NA NA
3 0.03 NA NA NA
4 0.04 NA NA NA
5 0.05 0.1500850 0 6
6 0.06 0.2101751 0 6
7 0.07 0.2702952 0 6
8 0.08 0.3304453 0 6
9 0.09 0.3906255 0 6
10 0.10 0.4508358 0 6
11 0.11 0.5110762 1 17
12 0.12 0.5713467 1 17
13 0.13 0.6316473 1 17
14 0.14 0.6919780 1 17
15 0.15 0.7523389 1 17
I know there is a better way - help!
You can do this one of a few ways. It's worth mentioning that you did write a "correct" for loop in R: you preallocated the vector by assigning data$V1_MA5d <- NA, so you are filling rather than growing, which is actually fairly efficient. However, if you want to use the apply family:
sapply(5:nrow(data), function(i) (prod(data$V1[(i-5):i]/100 + 1)-1)*100)
[1] 0.1500850 0.2101751 0.2702952 0.3304453 0.3906255 0.4508358 0.5110762 0.5713467 0.6316473 0.6919780 0.7523389 0.8127299
[13] 0.8731511 0.9336024 0.9940839 1.0545957 1.1151376 1.1757098 1.2363122 1.2969448 1.3576077 1.4183009 1.4790244 1.5397781
[25] 1.6005622 1.6613766
Notice that my code inside the [] is different from yours. Check out the difference:
i <- 10
i - 5:i    # 5 4 3 2 1 0   (the subtraction applies after 5:i is built)
(i - 5):i  # 5 6 7 8 9 10  (the window you intended)
Or you can use rollapply from the zoo package:
library(zoo)
myfun <- function(x) (prod(x/100 + 1)-1)*100
rollapply(data$V1, 5, myfun)
[1] 0.1500850 0.2001551 0.2502451 0.3003552 0.3504853 0.4006355 0.4508057 0.5009960 0.5512063 0.6014367 0.6516872 0.7019577
[13] 0.7522484 0.8025591 0.8528899 0.9032408 0.9536118 1.0040030 1.0544142 1.1048456 1.1552971 1.2057688 1.2562606 1.3067726
[25] 1.3573047 1.4078569
As per the comment, this will give you a vector of length 26; instead, you can add a few arguments to rollapply to make the result line up with your initial data:
rollapply(data$V1, 5, myfun, fill=NA, align='right')
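To write the result back into the data frame, a sketch (V1_MA5d_roll is a new column name of my choosing; note the width-5 window intentionally differs from the loop's six-element start:end window, so the values will not match the loop's exactly):
# Right-aligned rolling product, padded with NA for the first four rows
data$V1_MA5d_roll <- rollapply(data$V1, 5, myfun, fill = NA, align = "right")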
In regard to your second question, plyr is handy here.
library(plyr)
data$cuts <- cut(data$V1_MA5d, breaks=c(-Inf, 0.5, 1.5, Inf))
ddply(data, .(cuts), transform, V1_MA5_cat_n=length(cuts))
But there are many other choices too.
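For completeness, a base-R sketch of the same per-category count, reusing the cuts column created above (it fills the V1_MA5_cat_n column from the question):
# Tabulate once per level, then look the counts back up via each row's
# factor code; rows where cuts is NA stay NA
data$V1_MA5_cat_n <- as.integer(table(data$cuts)[as.integer(data$cuts)])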