Counting values within a certain range in a sliding window - r

The problem I would like to solve is a sliding window going over the measurement data with a defined window width and a controllable step width (here 1).
Within the window I need to detect the values that lie within a certain range of the
first value, e.g. 2.2 +- 0.3, and count how many such values occur in a row,
e.g. 2.2, 2.3, 2.1, 1.8, 2.2, 2.5, 2.1 --> 3, 1, 3
d <- read.table(text="Number Time.s Potential.V Current.A
1 0.0000 0.075 -0.7653
2 0.0285 0.074 -0.7597
3 0.0855 0.076 -0.7549
17 0.8835 0.074 -0.7045
18 0.9405 0.073 -0.5983
19 0.9975 0.071 -0.1370
19 1.0175 0.070 -0.1370
20 1.0545 0.072 0.1295
21 1.1115 0.073 0.2680
8013 1.6555 0.076 -1.1070
8014 1.7125 0.075 -1.1850
8015 1.7695 0.073 -1.2610
8016 1.8265 0.072 -1.3460
8017 1.8835 0.071 -1.4380
8018 1.9405 0.070 -1.4350
8019 1.9975 0.061 -1.0720
8020 2.1045 0.062 -0.8823
8021 2.1115 0.058 -0.7917
8022 2.1685 0.060 -0.7481", header=TRUE)
rle(round(diff(d$Time.s[d$Time.s>1 & d$Time.s<2]),digits=2))
I can't use rle, because there is no acceptance interval one could define. Working with
a for loop is possible, but seems very un-R-ish.
width <- 4
bound.low <- 0.00
bound.high <- 0.03
Ergebnis <- data.frame(
  Potential.V = seq(1, nrow(d) - width, by = 1),
  count = seq(1, nrow(d) - width, by = 1))
for (a in 1:(nrow(d) - width)) {
  temp <- d[a:(a + width), c("Time.s", "Potential.V")]
  counter <- 0
  # count how many leading values stay within the band around the first value
  for (b in 1:nrow(temp)) {
    if (temp$Potential.V[1] >= (temp$Potential.V[b] - bound.low) &
        temp$Potential.V[1] <= (temp$Potential.V[b] + bound.high)) {
      counter <- counter + 1
    } else { break }
  }
  Ergebnis$Potential.V[a] <- temp$Potential.V[1]
  Ergebnis$count[a] <- counter
}
print(Ergebnis)
Result
Potential.V count
1 0.075 2
2 0.074 1
3 0.076 5
4 0.074 5
5 0.073 5
6 0.071 2
7 0.070 1
8 0.072 1
9 0.073 1
10 0.076 5
11 0.075 5
12 0.073 5
13 0.072 5
14 0.071 5
15 0.070 5
rle(Ergebnis$count)
Run Length Encoding
lengths: int [1:6] 1 1 3 1 3 6
values : num [1:6] 2 1 5 2 1 5
So I find the needed counts in the lengths vector.
Is there a more elegant way of solving such problems? My experiments with xts and zoo didn't work out the way I had hoped.
best regards,
IInatas
P.S.
The reason for this data analysis is log data from an experiment that suffers from a degradation problem whose severity increases at certain voltages. In the end there is a lifetime budget, and I am trying to calculate how much of it is left, based on this log data.

Here's a solution using zoo::rollapply to calculate Ergebnis, but you still need to run rle on the result.
library(zoo)

# the function we're going to apply to each window
f <- function(x, upper = 0.03, lower = 0.00) {
  # logical test: is the window's first value within [x - lower, x + upper]
  # for each value x in the window?
  l <- x[1] >= (x - lower) & x[1] <= (x + upper)
  # position of the first FALSE value (or the window length if all are TRUE)
  m <- if (any(!l)) which.min(l) else length(l)
  c(Potential.V = x[1], count = sum(l[1:m]))
}
Ergebnis <- data.frame(rollapply(d$Potential.V, 5, f, align='left'))
rle(Ergebnis$count)
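To see the mechanics on the toy series from the question, here is a quick check; the window width of 3 and the symmetric band (upper = lower = 0.3) are assumptions chosen purely for illustration:
library(zoo)
toy <- c(2.2, 2.3, 2.1, 1.8, 2.2, 2.5, 2.1)
# count = number of leading values within +-0.3 of each window's first value
rollapply(toy, 3, f, upper = 0.3, lower = 0.3, align = "left")  # counts: 3 2 3 1 3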

Related

R filter not retaining values less than -10

I am using R to retain values that meet a condition. It works, but it does not include rows where the log10 column has values below -10. Below is part of the table. I want to select all the values in the log10 column that lie between -8 and -14.
My table:
chr rs ps af beta p_wald log10
1 5 S5_10683198 10683198 0.025 0.5628516 9.422555e-15 -14.0258313188689
2 8 S8_361882 361882 0.025 0.5295581 6.825981e-13 -12.1658349249069
3 8 S8_7385592 7385592 0.021 0.5421677 5.944847e-12 -11.2258593181539
4 2 S2_276875 276875 0.025 0.4899961 1.342672e-11 -10.8720300677393
5 3 S3_7418268 7418268 0.021 0.4906429 1.711510e-11 -10.7666205590256
6 2 S2_14021380 14021380 0.025 0.5080194 2.511021e-11 -10.6001496552098
7 3 S3_13987777 13987777 0.021 0.4595375 2.664140e-11 -10.5744429568178
30 8 S8_7395237 7395237 0.021 0.4186731 3.995514e-09 -8.39842734325747
31 6 S6_7387034 7387034 0.028 0.4266190 4.957138e-09 -8.30476899075709
32 5 S5_11495292 11495292 0.028 0.4575658 5.080677e-09 -8.29407841413843
33 1 S1_15059335 15059335 0.025 0.4183669 5.106630e-09 -8.29186560773669
19430 7 S7_14557672 14557672 0.037 -0.1892856 0.005395347 -2.26798061857347
19431 7 S7_2217818 2217818 0.055 -0.1286663 0.005396288 -2.26790488007629
19432 8 S8_3013554 3013554 0.030 0.1430241 0.005396304 -2.26790359239457
19433 4 S4_1154225 1154225 0.045 -0.1572871 0.005396518 -2.2678863700186
19434 1 S1_21402062 21402062 0.074 0.1159478 0.005396604 -2.26787944906923
19435 2 S2_6176209 6176209 0.030 0.1522105 0.005396680 -2.26787333297322
I am using this R code:
maf2 <- maf1[ which(maf1$log10 >= -8.00), ]
My results:
chr rs ps af beta p_wald log10
30 8 S8_7395237 7395237 0.021 0.4186731 3.995514e-09 -8.39842734325747
31 6 S6_7387034 7387034 0.028 0.4266190 4.957138e-09 -8.30476899075709
32 5 S5_11495292 11495292 0.028 0.4575658 5.080677e-09 -8.29407841413843
33 1 S1_15059335 15059335 0.025 0.4183669 5.106630e-09 -8.29186560773669
The results skip the first 7 rows, which have log10 values below -10.
What should I change in the code?
Thanks,
Vinod
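A minimal sketch of the two-sided filter described above; the as.numeric conversion is an assumption, in case log10 was imported as text (which would also explain why the one-sided comparison behaved oddly):
maf1$log10 <- as.numeric(as.character(maf1$log10))  # assumption: force numeric
maf2 <- maf1[which(maf1$log10 >= -14 & maf1$log10 <= -8), ]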

How could I use R to pull a few select lines out of a large text file?

I am fairly new to stack overflow but did not find this in the search engine. Please let me know if this question should not be asked here.
I have a very large text file. It has 16 entries and each entry looks like this:
AI_File 10
Version
Date 20200708 08:18:41
Prompt1 LOC
Resp1 H****
Prompt2 QUAD
Resp2 1012
TransComp c-p-s
Model Horizontal
### Computed Results
LAI 4.36
SEL 0.47
ACF 0.879
DIFN 0.031
MTA 40.
SEM 1.
SMP 5
### Ring Summary
MASK 1 1 1 1 1
ANGLES 7.000 23.00 38.00 53.00 68.00
AVGTRANS 0.038 0.044 0.055 0.054 0.030
ACFS 0.916 0.959 0.856 0.844 0.872
CNTCT# 3.539 2.992 2.666 2.076 1.499
STDDEV 0.826 0.523 0.816 0.730 0.354
DISTS 1.008 1.087 1.270 1.662 2.670
GAPS 0.028 0.039 0.034 0.032 0.018
### Contributing Sensors
### Observations
A 1 20200708 08:19:12 x 31.42 38.30 40.61 48.69 60.28
L 2 20200708 08:19:12 1 5.0e-006
B 3 20200708 08:19:21 x 2.279 2.103 1.408 5.027 1.084
B 4 20200708 08:19:31 x 1.054 0.528 0.344 0.400 0.379
B 5 20200708 08:19:39 x 0.446 1.255 2.948 3.828 1.202
B 6 20200708 08:19:47 x 1.937 2.613 5.909 3.665 5.964
B 7 20200708 08:19:55 x 0.265 1.957 0.580 0.311 0.551
Almost all of this is junk information, and I am looking to run some code for the whole file that will only give me the lines for "Resp2" and "LAI" for all 16 of the entries. Is a task like this doable in R? If so, how would I do it?
Thanks very much for any help and please let me know if there's any more information I can give to clear anything up.
I've saved your file as a text file and read in the lines. Then you can use regex to extract the desired rows. However, I feel that my approach is rather clumsy; I bet there are more elegant ways (maybe also with (unix) command-line tools).
data <- readLines("testfile.txt")
library(stringr)
resp2 <- as.numeric(str_trim(str_extract(data, "(?m)(?<=^Resp2).*$")))
lai <- as.numeric(str_trim(str_extract(data, "(?m)(?<=^LAI).*$")))
data_extract <- data.frame(
  resp2 = resp2[!is.na(resp2)],
  lai = lai[!is.na(lai)]
)
data_extract
resp2 lai
1 1012 4.36
A solution based on the tidyverse can look as follows.
library(dplyr)
library(vroom)
library(stringr)
library(tibble)
library(tidyr)
vroom_lines('data') %>%
  enframe() %>%
  filter(str_detect(value, 'Resp2|LAI')) %>%
  transmute(value = str_squish(value)) %>%
  separate(value, into = c('name', 'value'), sep = ' ')
# name value
# <chr> <chr>
# 1 Resp2 1012
# 2 LAI 4.36
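For comparison, a base-R sketch that needs no additional packages (the file name is an assumption):
# grep the two tags at line starts, then parse the name/value pairs
lines <- readLines("testfile.txt")  # "testfile.txt" is an assumed name
read.table(text = lines[grepl("^(Resp2|LAI)\\b", lines)],
           col.names = c("name", "value"))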

How to calculate the Bonferroni Lower and Upper limits in R?

With the following data, I am trying to calculate the Chi-square statistic and the Bonferroni lower and upper confidence limits. The column "Data_No" identifies the dataset (the calculations need to be done separately for each dataset).
Data_No Area Observed
1 3353 31
1 2297 2
1 1590 15
1 1087 16
1 817 2
1 847 10
1 1014 28
1 872 29
1 1026 29
1 1215 21
2 3353 31
2 2297 2
2 1590 15
3 1087 16
3 817 2
The code I used is
library(dplyr)
setwd("F:/GIS/July 2019/")
total_data <- read.csv("test.csv")
result_data <- NULL
for(i in unique(total_data$Data_No)){
  data <- total_data[which(total_data$Data_No == i),]
  data <- data %>%
    mutate(RelativeArea = Area/sum(Area),
           Expected = RelativeArea*sum(Observed),
           OminusE = Observed-Expected,
           O2 = OminusE^2,
           O2divE = O2/Expected,
           APU = Observed/sum(Observed),
           Alpha = 0.05/2*count(Data_No),
           Zvalue = qnorm(Alpha, lower.tail = FALSE),
           lower = APU - Zvalue*sqrt(APU*(1-APU)/sum(Observed)),
           upper = APU + Zvalue*sqrt(APU*(1-APU)/sum(Observed)))
  result_data <- rbind(result_data,data)
}
write.csv(result_data, file = 'final_result.csv')
And the error message I get is:
Error in UseMethod("summarise_") : no applicable method for
'summarise_' applied to an object of class "c('integer', 'numeric')"
The column that I am calling "Alpha" is the alpha value of 0.05/(2K), where K is the number of categories: in my example, the first dataset has 10 categories (rows with Data_No = 1), so Alpha needs to be 0.05/20 = 0.0025, and its corresponding Z value is 2.807. The second dataset has 3 categories (so 0.05/6) and the third has 2 categories (0.05/4) in my example table ("Data_No" column). Using the values from the newly calculated Alpha column, I then need to calculate the Zvalue column (Zvalue = qnorm(Alpha, lower.tail = FALSE)), which I then use to calculate the lower and upper confidence limits.
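Those Z values can be checked directly with qnorm:
qnorm(0.05 / (2 * 10), lower.tail = FALSE)  # 2.807  (K = 10, dataset 1)
qnorm(0.05 / (2 * 3),  lower.tail = FALSE)  # 2.394  (K = 3,  dataset 2)
qnorm(0.05 / (2 * 2),  lower.tail = FALSE)  # 2.241  (K = 2,  dataset 3)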
From the above data, here are the results that I should get, but note that I have had to manually calculate Alpha column and Zvalue, rather than insert those calculations within the R code:
Data_No Area Observed RelativeArea Alpha Z value lower upper
1 3353 31 0.237 0.003 2.807 0.092 0.247
1 2297 2 0.163 0.003 2.807 -0.011 0.033
1 1590 15 0.113 0.003 2.807 0.025 0.139
1 1087 16 0.077 0.003 2.807 0.029 0.146
1 817 2 0.058 0.003 2.807 -0.011 0.033
1 847 10 0.060 0.003 2.807 0.007 0.102
1 1014 28 0.072 0.003 2.807 0.078 0.228
1 872 29 0.062 0.003 2.807 0.083 0.234
1 1026 29 0.073 0.003 2.807 0.083 0.234
1 1215 21 0.086 0.003 2.807 0.049 0.181
2 3353 31 0.463 0.008 2.394 0.481 0.811
2 2297 2 0.317 0.008 2.394 -0.027 0.111
2 1590 15 0.220 0.008 2.394 0.152 0.473
3 1087 16 0.571 0.013 2.241 0.723 1.055
3 817 2 0.429 0.013 2.241 -0.055 0.277
Please note that I only included some of the columns generated from the code.
# You need to check the closing bracket for the sqrt value in the lower confidence limit. The following code should work.
data <- read.csv("test.csv")
data <- data %>%
  mutate(RelativeArea = Area/sum(Area),
         Expected = RelativeArea*sum(Observed),
         OminusE = Observed-Expected,
         O2 = OminusE^2,
         O2divE = O2/Expected,
         APU = Observed/sum(Observed),
         lower = APU - 2.394*sqrt(APU*(1-APU)/sum(Observed)),
         upper = APU + 2.394*sqrt(APU*(1-APU)/sum(Observed)))
#Answer to follow-up question.
#Sample Data
Data_No Area Observed
1 3353 31
1 2297 2
2 1590 15
2 1087 16
#Code to run
total_data <- read.csv("test.csv")
result_data <- NULL
for(i in unique(total_data$Data_No)){
  data <- total_data[which(total_data$Data_No == i),]
  data <- data %>%
    mutate(RelativeArea = Area/sum(Area),
           Expected = RelativeArea*sum(Observed),
           OminusE = Observed-Expected,
           O2 = OminusE^2,
           O2divE = O2/Expected,
           APU = Observed/sum(Observed),
           lower = APU - 2.394*sqrt(APU*(1-APU)/sum(Observed)),
           upper = APU + 2.394*sqrt(APU*(1-APU)/sum(Observed)))
  result_data <- rbind(result_data,data)
}
write.csv(result_data, file = 'final_result.csv')
#Issue in calculating Alpha. I have updated the code.
library(dplyr)
setwd("F:/GIS/July 2019/")
total_data <- read.csv("test.csv")
#Creating the NO_OF_CATEGORIES column based on your question.
total_data$NO_OF_CATEGORIES <- 0
total_data[which(total_data$Data_No==1),]$NO_OF_CATEGORIES <- 10
total_data[which(total_data$Data_No==2),]$NO_OF_CATEGORIES <- 3
total_data[which(total_data$Data_No==3),]$NO_OF_CATEGORIES <- 2
#Actual code
result_data <- NULL
for(i in unique(total_data$Data_No)){
  data <- total_data[which(total_data$Data_No == i),]
  data <- data %>%
    mutate(RelativeArea = Area/sum(Area),
           Expected = RelativeArea*sum(Observed),
           OminusE = Observed-Expected,
           O2 = OminusE^2,
           O2divE = O2/Expected,
           APU = Observed/sum(Observed),
           Alpha = 0.05/(2*(unique(data$NO_OF_CATEGORIES))),
           Zvalue = qnorm(Alpha, lower.tail = FALSE),
           lower = APU - Zvalue*sqrt(APU*(1-APU)/sum(Observed)),
           upper = APU + Zvalue*sqrt(APU*(1-APU)/sum(Observed)))
  result_data <- rbind(result_data,data)
}
write.csv(result_data, file = 'final_result.csv')
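As a side note, the per-dataset loop and the hand-built NO_OF_CATEGORIES column can both be avoided with a grouped mutate. This is only a sketch of that alternative: dplyr::n() gives the number of rows (categories) within each Data_No group, which is exactly what Alpha needs.
library(dplyr)
result_data <- total_data %>%
  group_by(Data_No) %>%
  mutate(RelativeArea = Area/sum(Area),
         Expected = RelativeArea*sum(Observed),
         OminusE = Observed-Expected,
         O2 = OminusE^2,
         O2divE = O2/Expected,
         APU = Observed/sum(Observed),
         Alpha = 0.05/(2*n()),  # n() = rows per group, no helper column needed
         Zvalue = qnorm(Alpha, lower.tail = FALSE),
         lower = APU - Zvalue*sqrt(APU*(1-APU)/sum(Observed)),
         upper = APU + Zvalue*sqrt(APU*(1-APU)/sum(Observed))) %>%
  ungroup()
write.csv(result_data, file = 'final_result.csv')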

Non linear regression for exponential decay model in R

I have the following problem:
I asked 5 people (i=1, ..., 5) to forecast next period's return of 3 different stocks. This gives me the following data:
S_11_i_c <-read.table(text = "
i c_1 c_2 c_3
1 0.150 0.70 0.190
2 0.155 0.70 0.200
3 0.150 0.75 0.195
4 0.160 0.80 0.190
5 0.150 0.75 0.180
",header = T)
In words, in period t=10 participant i=1 expects the return of stock c_1 to be 0.15 in period t=11.
The forecasts are based on past returns of the stocks. These are the following:
S_t_c <-read.table(text = "
time S_c_1 S_c_2 S_c_3
1 0.020 0.015 0.040
2 0.045 0.030 0.050
3 0.060 0.045 0.060
4 0.075 0.060 0.060
5 0.090 0.070 0.060
6 0.105 0.070 0.090
7 0.120 0.070 0.120
8 0.125 0.070 0.140
9 0.130 0.070 0.160
10 0.145 0.070 0.180
",header = T)
In words, stock c=1 had a return of 0.145 in period 10.
So, the variables in table S_11_i_c are the dependent variables.
The variables in table S_t_c are the independent variables.
The model I want to estimate is the following:
My problem with coding this is as follows:
I only know how to express the sum over past returns with the help of a loop. As in:
Sum_S_t_c <- data.frame(
  s = 0:9,
  c_1 = numeric(10),
  c_2 = numeric(10),
  c_3 = numeric(10)
)
for (c in 2:4) {
  running_sum <- 0
  for (s in 0:9) {
    # accumulate S_t_c[10 - s, c]: the return s periods before t = 10
    running_sum <- running_sum + S_t_c[10 - s, c]
    Sum_S_t_c[s + 1, c] <- running_sum
  }
}
However, loops within a regression formula are not possible. So, my other solution would be to rewrite the sum by writing out each of its terms explicitly.
However, as my actual problem has a much larger n, this isn't really workable for me.
Any ideas?
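One way to get rid of the explicit loop, as a sketch: each running sum is just a cumulative sum of the reversed return series, so the whole table can be built vectorized (row s then holds S_10 + S_9 + ... + S_(10-s) for each stock):
# Sketch: running sums over past returns, vectorized with rev() and cumsum()
Sum_S_t_c <- data.frame(
  s = 0:9,
  sapply(S_t_c[, -1], function(col) cumsum(rev(col)))
)
A regressor precomputed this way can then be used inside nls() or similar without any loop in the formula.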

Conditional sorting / reordering of column values in R

I have a data set similar to the following with 1 column and 60 rows:
value
1 0.0423
2 0.0388
3 0.0386
4 0.0342
5 0.0296
6 0.0276
7 0.0246
8 0.0239
9 0.0234
10 0.0214
.
40 0.1424
.
60 -0.0312
I want to reorder the rows so that certain conditions are met. For example, one condition could be: sum(df$value[4:7]) > 0.1000 & sum(df$value[4:7]) < 0.1100
The data set would then look like this, for example:
value
1 0.0423
2 0.0388
3 0.0386
4 0.1312
5 -0.0312
6 0.0276
7 0.0246
8 0.0239
9 0.0234
10 0.0214
.
.
.
60 0.0342
What I tried was using repeat and sample as in the following:
repeat {
  # reshuffle the values (sampling without replacement gives a permutation)
  df1 <- as_tibble(sample(df$value, replace = FALSE))
  if (sum(df1$value[4:7]) > 0.1000 & sum(df1$value[4:7]) < 0.1100) break
}
Unfortunately, this method takes quite some time, and I was wondering if there is a faster way to reorder rows based on mathematical conditions such as sum or prod.
Here's a quick implementation of the hill-climbing method I outlined in my comment. I've had to slightly reframe the desired condition as "distance of sum(x[4:7]) from 0.105" to make it continuous, although you can still use the exact condition when doing the check that all requirements are satisfied. The benefit is that you can add extra conditions to the distance function easily.
# Using same example data as Jon Spring
set.seed(42)
vs = rnorm(60, 0.05, 0.08)
get_distance = function(x) {
  distance = abs(sum(x[4:7]) - 0.105)
  # Add to the distance with further conditions if needed
  distance
}
max_attempts = 10000
best_distance = Inf
swaps_made = 0
for (step in 1:max_attempts) {
  # Copy the vector and swap two random values
  new_vs = vs
  swap_inds = sample.int(length(vs), 2, replace = FALSE)
  new_vs[swap_inds] = rev(new_vs[swap_inds])
  # Keep the new vector if the distance has improved
  new_distance = get_distance(new_vs)
  if (new_distance < best_distance) {
    vs = new_vs
    best_distance = new_distance
    swaps_made = swaps_made + 1
  }
  complete = (sum(vs[4:7]) < 0.11) & (sum(vs[4:7]) > 0.1)
  if (complete) {
    print(paste0("Solution found in ", step, " steps"))
    break
  }
}
sum(vs[4:7])
There's no real guarantee that this method will reach a solution, but I often try this kind of basic hill-climbing when I'm not sure if there's a "smart" way to approach a problem.
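For example, if a hypothetical second requirement also constrained positions 10 to 13 (say, towards a target sum of 0.2), the distance function would simply gain a term:
get_distance = function(x) {
  # Hypothetical extension: also steer sum(x[10:13]) towards 0.2
  abs(sum(x[4:7]) - 0.105) + abs(sum(x[10:13]) - 0.2)
}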
Here's an approach using combn from base R, and then filtering using dplyr. (I'm sure there's a way w/o it but my base-fu isn't there yet.)
With only 4 numbers from a pool of 60, there are "only" about 488k different combinations (ignoring order: 60*59*58*57/(4*3*2*1) = 487,635), so it's quick to brute force in about a second.
# Make a vector of 60 numbers like your example
set.seed(42)
my_nums <- rnorm(60, 0.05, 0.08);
all_combos <- combn(my_nums, 4) # Get all unique combos of 4 numbers
library(tidyverse)
combos_table <- all_combos %>%
  t() %>%
  as_tibble() %>%
  mutate(sum = V1 + V2 + V3 + V4) %>%
  filter(sum > 0.1, sum < 0.11)
> combos_table
# A tibble: 8,989 x 5
V1 V2 V3 V4 sum
<dbl> <dbl> <dbl> <dbl> <dbl>
1 0.160 0.00482 0.0791 -0.143 0.100
2 0.160 0.00482 0.101 -0.163 0.103
3 0.160 0.00482 0.0823 -0.145 0.102
4 0.160 0.00482 0.0823 -0.143 0.104
5 0.160 0.00482 -0.0611 -0.00120 0.102
6 0.160 0.00482 -0.0611 0.00129 0.105
7 0.160 0.00482 0.0277 -0.0911 0.101
8 0.160 0.00482 0.0277 -0.0874 0.105
9 0.160 0.00482 0.101 -0.163 0.103
10 0.160 0.00482 0.0273 -0.0911 0.101
# … with 8,979 more rows
This says that in this example, there are about 9000 different sets of 4 numbers from my sequence which meet the criteria. We could pick any of these and put them in positions 4-7 to meet your requirement.
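As a last step, one such combo can be spliced into positions 4 to 7; this sketch assumes the 60 values are distinct (true for rnorm draws), so match() finds each one unambiguously:
combo <- unlist(combos_table[1, 1:4])           # pick any qualifying set of 4
rest  <- my_nums[-match(combo, my_nums)]        # the remaining 56 values
reordered <- c(rest[1:3], combo, rest[-(1:3)])  # place the combo at positions 4:7
sum(reordered[4:7])                             # falls in (0.10, 0.11)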
