I am new to writing R functions; I have always preferred to use packages and avoid loops. Now, however, I am trying to write a function for a specific question that I have: I would like to subset a dataset based on ranges. I think the code below is self-explanatory.
dt = as.data.frame(sample(1:100))
names(dt) = "num"
subs.it <- function(x) {
subs <- subset(dt, num >= (x - 5) & num <= (x + 5))
return(subs)
}
subs.it(c(15, 50))
wrong output:
num
44 55
47 20
65 19
77 17
83 12
91 16
92 51
100 54
correct:
num
4 15
18 11
47 20
50 13
54 10
65 19
66 14
77 17
82 18
83 12
91 16
17 48
19 53
29 45
33 52
39 46
44 55
45 50
49 49
89 47
92 51
100 54
I can't find what I am doing wrong.
Thanks
It seems like the function you are looking for is subset itself. Try:
subset(dt, num > 15 & num < 50)
edit:
Ah, I see: you want two different ranges. You can do this:
x = 15
y = 50
subset(dt, (num >= x-5 & num <= x+5) | (num >= y-5 & num <= y+5))
or a more compact version using absolute values:
subset(dt, (abs(num - x) <= 5 | abs(num - y) <= 5))
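The original call subs.it(c(15, 50)) misfires because num >= (x - 5) recycles the length-2 x element-wise across the 100-row num column. As a sketch, here is one way to write a version that accepts any number of centers (the argument names data, centers, and width are mine, not from the question):

```r
dt <- data.frame(num = 1:100)

subs.it <- function(data, centers, width = 5) {
  # TRUE for rows where num falls within `width` of ANY of the centers
  keep <- Reduce(`|`, lapply(centers, function(ctr) abs(data$num - ctr) <= width))
  data[keep, , drop = FALSE]
}

subs.it(dt, c(15, 50))  # rows with num in 10:20 or 45:55
```

Reduce(`|`, ...) ORs together one logical vector per center, so the function works for one, two, or many ranges without recycling surprises.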
Here you go.
set.seed(12345)
library(dplyr)
subs.it <- function(x, y, z) {
  x %>%
    filter((num >= (y - 5) & num <= (y + 5)) |
           (num >= (z - 5) & num <= (z + 5)))
}
subs.it(dt, 15, 55)
num
1 16
2 14
3 15
4 55
5 52
6 17
7 56
8 13
9 57
10 54
11 18
12 53
13 11
14 58
15 19
16 10
17 51
18 60
19 20
20 50
21 12
22 59
Related
We are looking to create a vector with the following sequence:
1,4,5,8,9,12,13,16,17,20,21,...
Start with 1, then skip 2 numbers, then add 2 numbers, then skip 2 numbers, etc., not going above 2000. We also need the inverse sequence 2,3,6,7,10,11,...
We can use a recycled logical vector to filter the sequence:
(1:21)[c(TRUE, FALSE, FALSE, TRUE)]
[1] 1 4 5 8 9 12 13 16 17 20 21
Here's an approach using rep and cumsum. Effectively, "add up alternating increments of 1 (successive #s) and 3 (skip two)."
cumsum(rep(c(1,3), 500))
and
cumsum(rep(c(3,1), 500)) - 1
Got this one myself - head(sort(c(seq(1, 2000, 4), seq(4, 2000, 4))), 20)
We can try the following:
> (v <- seq(21))[v %% 4 %in% c(0, 1)]
[1] 1 4 5 8 9 12 13 16 17 20 21
You may arrange the data in a matrix and extract the 1st and 4th columns.
val <- 1:100
sort(c(matrix(val, ncol = 4, byrow = TRUE)[, c(1, 4)]))
# [1] 1 4 5 8 9 12 13 16 17 20 21 24 25 28 29 32 33
#[18] 36 37 40 41 44 45 48 49 52 53 56 57 60 61 64 65 68
#[35] 69 72 73 76 77 80 81 84 85 88 89 92 93 96 97 100
A tidyverse option.
library(purrr)
library(dplyr)
map_int(1:11, ~ case_when(. == 1      ~ 1L,
                          . %% 2 == 0 ~ as.integer(. * 2),
                          TRUE        ~ as.integer(. * 2 - 1)))
# [1] 1 4 5 8 9 12 13 16 17 20 21
I have a function like this
extract = function(x)
{
a = x$2007[6:18]
b = x$2007[30:42]
c = x$2007[54:66]
}
the subsetting needs to continue up to 744 in this way. I need to skip the first 6 data points, and then pull out every other 12 points into a new object or a list. Is there a more elegant way to do this with a for loop or apply?
Side note: if 2007 is truly a column name, you would have had to set it explicitly; by default R converts numeric names to syntactically valid ones starting with a letter (see make.names("2007"), which gives "X2007"). If the name really is "2007", then x$"2007"[6:18] (etc.) should work for column reference.
To generate that sequence of integers, let's try
nr <- 100
ind <- seq(6, nr, by = 12)
ind
# [1] 6 18 30 42 54 66 78 90
ind[ seq_along(ind) %% 2 == 1 ]
# [1] 6 30 54 78
ind[ seq_along(ind) %% 2 == 0 ]
# [1] 18 42 66 90
Map(seq, ind[ seq_along(ind) %% 2 == 1 ], ind[ seq_along(ind) %% 2 == 0 ])
# [[1]]
# [1] 6 7 8 9 10 11 12 13 14 15 16 17 18
# [[2]]
# [1] 30 31 32 33 34 35 36 37 38 39 40 41 42
# [[3]]
# [1] 54 55 56 57 58 59 60 61 62 63 64 65 66
# [[4]]
# [1] 78 79 80 81 82 83 84 85 86 87 88 89 90
So you can use this in your function to create a list of subsets:
nr <- nrow(x)
ind <- seq(6, nr, by = 12)
out <- lapply(Map(seq, ind[ seq_along(ind) %% 2 == 1 ], ind[ seq_along(ind) %% 2 == 0 ]),
function(i) x$"2007"[i])
We could use
split(x[7:744], cut(7:744, seq(7, 744, 12)))
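Since the block starts (6, 30, 54, ...) advance by 24 and each block spans 13 values, the index vectors can also be generated directly. A sketch with a toy x standing in for your data (assuming the column really is named "2007"):

```r
# toy data frame with a literal "2007" column, as in the question
x <- data.frame("2007" = seq_len(744), check.names = FALSE)

starts <- seq(6, 744 - 12, by = 24)   # 6, 30, 54, ..., 726
blocks <- lapply(starts, function(s) x$"2007"[s:(s + 12)])

blocks[[1]]  # the values at positions 6:18
blocks[[2]]  # the values at positions 30:42
```

This avoids building the odd/even index machinery by hand; each element of blocks is one 13-point slice.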
I have a csv file with three columns. The first column is pentad dates (73 pentads in a year) while the second and third columns are for precipitation values.
What I want to do:
[1]. Get the first pentad when the precipitation exceeds the "annual mean" in "at least three consecutive pentads".
I can subset the first column like this:
dat <- read.csv("test.csv", header = TRUE, sep = ",")
aa <- which(dat$RR > mean(dat$RR))
This gives me the following:
[1] 27 28 29 30 31 34 36 37 38 41 42 43 44 45 46 52 53 54 55 56 57
The correct output should be P27 in this case.
In the second column:
[1] 31 32 36 38 39 40 41 42 43 44 45 46 47 48 49 50 53 54 55 57 59 60 61
The correct output should be P38.
How can I add a conditional statement here taking into consideration the "three consecutive pentads"?
I don't know how I can implement this in R (in a code). I'll appreciate any suggestion.
I have the following data:
Pentad RR YY
1 0 0.5771428571
2 0.0142857143 0
3 0 1.2828571429
4 0.0885714286 1.4457142857
5 0.0714285714 0.1114285714
6 0 0.36
7 0.0657142857 0
8 0.0285714286 0
9 0.0942857143 0
10 0.0114285714 1
11 0 0.0114285714
12 0 0.0085714286
13 0 0.3057142857
14 0 0
15 0 0
16 0 0
17 0.04 0
18 0 0.8
19 0.8142857143 0.0628571429
20 0.2857142857 0
21 1.14 0
22 5.3342857143 0
23 2.3514285714 0
24 1.9857142857 0.0133333333
25 1.4942857143 0.0433333333
26 2.0057142857 1.4866666667
27 20.0485714286 0
28 25.0085714286 2.4866666667
29 16.32 1.9433333333
30 11.0685714286 0.7733333333
31 8.9657142857 8.1066666667
32 3.9857142857 7.7333333333
33 5.2028571429 0.5
34 7.8028571429 4.3566666667
35 4.4514285714 2.66
36 9.22 6.6266666667
37 32.0485714286 4.4042857143
38 19.5057142857 7.9771428571
39 3.1485714286 12.9428571429
40 2.4342857143 18.4942857143
41 9.0571428571 7.3571428571
42 28.7085714286 11.0828571429
43 34.1514285714 9.0342857143
44 33.0257142857 14.2914285714
45 46.5057142857 34.6142857143
46 70.6171428571 45.3028571429
47 3.1685714286 6.66
48 1.9285714286 6.7028571429
49 7.0314285714 5.9628571429
50 0.9028571429 14.8542857143
51 5.3771428571 2.1
52 11.3571428571 2.8371428571
53 15.0457142857 7.3914285714
54 11.6628571429 32.0371428571
55 21.24 9.0057142857
56 11.4371428571 3.5257142857
57 11.6942857143 12.32
58 2.9771428571 2.32
59 4.3371428571 7.9942857143
60 0.8714285714 6.5657142857
61 1.3914285714 4.7714285714
62 0.8714285714 2.3542857143
63 1.1457142857 0.0057142857
64 2.3171428571 2.5085714286
65 0.1828571429 0.8171428571
66 0.2828571429 2.8857142857
67 0.3485714286 0.8971428571
68 0 0
69 0.3457142857 0
70 0.1428571429 0
71 0.18 0
72 4.8942857143 0.1457142857
73 0.0371428571 0.4342857143
Something like this should do it:
first_exceed_seq <- function(x, thresh = mean(x), len = 3)
{
# Logical vector, does x exceed the threshold
exceed_thresh <- x > thresh
# Indices of transition points; where exceed_thresh[i - 1] != exceed_thresh[i]
transition <- which(diff(c(0, exceed_thresh)) != 0)
# Reference index, grouping observations after each transition
index <- vector("numeric", length(x))
index[transition] <- 1
index <- cumsum(index)
# Break x into groups following the transitions
exceed_list <- split(exceed_thresh, index)
# Get the number of TRUE values (exceedances) in each group
num_exceed <- vapply(exceed_list, sum, numeric(1))
# Get the starting index of the first run where at least len values exceed thresh
transition[as.numeric(names(which(num_exceed >= len))[1])]
}
first_exceed_seq(dat$RR)
first_exceed_seq(dat$YY)
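An alternative sketch using rle(), which compresses the logical exceedance vector into runs; the first run of TRUEs of length at least len gives the starting pentad (same threshold and run-length assumptions as above, and the function name is mine):

```r
first_exceed_rle <- function(x, thresh = mean(x), len = 3) {
  # collapse the logical vector into runs of TRUE/FALSE
  r <- rle(x > thresh)
  # index of the first TRUE run lasting at least `len`
  i <- which(r$values & r$lengths >= len)[1]
  if (is.na(i)) return(NA_integer_)
  # starting position = total length of all preceding runs + 1
  sum(r$lengths[seq_len(i - 1)]) + 1L
}

first_exceed_rle(c(0, 0, 9, 9, 0, 9, 9, 9, 0), thresh = 1)  # 6
```

On your data, first_exceed_rle(dat$RR) and first_exceed_rle(dat$YY) should pick out the same pentads as first_exceed_seq.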
I'm trying to apply a tapply function I wrote to filter a dataset. Here is a sample data frame (df) below to describe what I'm trying to do.
I want to keep in my data frame the rows where the value of df$Cumulative_Time is closest to the value of 14. It should do this for each factor level in df$ID (keep row closest the value 14 for each ID factor).
ID Date Results TimeDiff Cumulative_Time
A 7/10/2015 71 0 0
A 8/1/2015 45 20 20
A 8/22/2015 0 18 38
A 9/12/2015 79 17 55
A 10/13/2015 44 26 81
A 11/27/2015 98 37 118
B 7/3/2015 75 0 0
B 7/24/2015 63 18 18
B 8/21/2015 98 24 42
B 9/26/2015 70 30 72
C 8/15/2015 77 0 0
C 9/2/2015 69 15 15
C 9/4/2015 49 2 17
C 9/8/2015 88 2 19
C 9/12/2015 41 4 23
C 9/19/2015 35 6 29
C 10/10/2015 33 18 47
C 10/14/2015 31 3 50
D 7/2/2015 83 0 0
D 7/28/2015 82 22 22
D 8/27/2015 100 26 48
D 9/17/2015 19 17 65
D 10/8/2015 30 18 83
D 12/9/2015 96 51 134
D 1/6/2016 30 20 154
D 2/17/2016 32 36 190
D 3/19/2016 42 27 217
I got as far as the following:
spec_day = 14 # value I want to compare df$Cumulative_Time to
# applying function to calculate closest value to spec_day
tapply(df$Cumulative_Time, df$ID, function(x) which(abs(x - spec_day) == min(abs(x - spec_day))))
Question: how do I include this tapply function as a means to do the filtering of my data frame df? Am I approaching this problem the right way, or is there some simpler way to accomplish this that I'm not seeing? Any help would be appreciated--thanks!
Here's a way you can do it, note that I didn't use tapply:
spec_day <- 14
new_df <- do.call('rbind',
by(df, df$ID,
FUN = function(x) x[which.min(abs(x$Cumulative_Time - spec_day)), ]
))
new_df
ID Date Results TimeDiff Cumulative_Time
A A 8/1/2015 45 20 20
B B 7/24/2015 63 18 18
C C 9/2/2015 69 15 15
D D 7/28/2015 82 22 22
which.min (and its sibling which.max) is a very useful function.
Here's a more concise and faster alternative using data.table:
library(data.table)
setDT(df)[, .SD[which.min(abs(Cumulative_Time - 14))], by = ID]
# ID Date Results TimeDiff Cumulative_Time
#1: A 8/1/2015 45 20 20
#2: B 7/24/2015 63 18 18
#3: C 9/2/2015 69 15 15
#4: D 7/28/2015 82 22 22
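If you prefer staying in dplyr, a sketch of the same idea using slice_min() (available in dplyr >= 1.0.0), shown on a toy two-ID frame rather than the full data:

```r
library(dplyr)

# toy stand-in for the question's df
df <- data.frame(ID = c("A", "A", "B", "B"),
                 Cumulative_Time = c(0, 20, 18, 42))

df %>%
  group_by(ID) %>%
  slice_min(abs(Cumulative_Time - 14), n = 1, with_ties = FALSE) %>%
  ungroup()
# keeps the row with Cumulative_Time 20 for A and 18 for B
```

with_ties = FALSE guarantees exactly one row per ID even if two rows are equally close to 14.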
Maybe this is a very easy question. This is my data.frame:
> read.table("text.txt")
V1 V2
1 26 22516
2 28 17129
3 30 38470
4 32 12920
5 34 30835
6 36 36244
7 38 24482
8 40 67482
9 42 23121
10 44 51643
11 46 61064
12 48 37678
13 50 98817
14 52 31741
15 54 74672
16 56 85648
17 58 53813
18 60 135534
19 62 46621
20 64 89266
21 66 99818
22 68 60071
23 70 168558
24 72 67059
25 74 194730
26 76 278473
27 78 217860
This means that I have 22516 sequences of length 26, 17129 sequences of length 28, and so on. I would like to know the mean sequence length and its standard deviation. I know how to do it by creating a vector with 26 repeated 22516 times (and so on) and then computing the mean and SD. However, I think there is an easier method. Any ideas?
Thanks.
For the mean: c(V1 %*% V2) / sum(V2)
For the SD (population form; note that V1 %*% V2 returns a 1 x 1 matrix, so it must be dropped to a scalar with c() before subtracting it from V1):
sqrt(c((V1 - c(V1 %*% V2) / sum(V2))^2 %*% V2) / sum(V2))
I do not find mean(rep(V1,V2)) # 61.902 and sd(rep(V1,V2)) # 14.23891 that complex, but alternatively you might try:
weighted.mean(V1,V2) # 61.902
# recipe from http://www.ltcconline.net/greenl/courses/201/descstat/meansdgrouped.htm
sqrt((sum((V1^2)*V2)-(sum(V1*V2)^2)/sum(V2))/(sum(V2)-1)) # 14.23891
Step1: Set up data:
dat.df <- read.table(text="id V1 V2
1 26 22516
2 28 17129
3 30 38470
4 32 12920
5 34 30835
6 36 36244
7 38 24482
8 40 67482
9 42 23121
10 44 51643
11 46 61064
12 48 37678
13 50 98817
14 52 31741
15 54 74672
16 56 85648
17 58 53813
18 60 135534
19 62 46621
20 64 89266
21 66 99818
22 68 60071
23 70 168558
24 72 67059
25 74 194730
26 76 278473
27 78 217860",header=T)
Step2: Convert to data.table (only for simplicity and laziness in typing)
library(data.table)
dat <- data.table(dat.df)
Step3: Set up new columns with products, and use them to find mean
dat[,pr:=V1*V2]
dat[,v1sq:=as.numeric(V1*V1*V2)]
dat.Mean <- sum(dat$pr)/sum(dat$V2)
dat.SD <- sqrt( (sum(dat$v1sq)/sum(dat$V2)) - dat.Mean^2)
Hope this helps!!
MEAN = sum(V1 * V2) / sum(V2)
SD = sqrt(sum(V1 * V1 * V2) / sum(V2) - MEAN^2)
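As a quick sanity check on the grouped formulas, using a tiny made-up table: the weighted mean and the grouped sample-SD formula agree exactly with the naive rep() approach.

```r
V1 <- c(26, 28, 30)   # lengths
V2 <- c(2, 3, 5)      # counts

m <- weighted.mean(V1, V2)
# grouped sample SD: sqrt((sum(x^2 w) - sum(x w)^2 / sum(w)) / (sum(w) - 1))
s <- sqrt((sum(V1^2 * V2) - sum(V1 * V2)^2 / sum(V2)) / (sum(V2) - 1))

all.equal(m, mean(rep(V1, V2)))  # TRUE
all.equal(s, sd(rep(V1, V2)))    # TRUE
```

Note the sum(V2) - 1 denominator gives the sample SD (matching sd()); dividing by sum(V2) instead gives the population SD, which is what some of the answers above compute. The two differ negligibly for counts this large.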