Identifying sequences of repeated numbers in R - r

I have a long time series where I need to identify and flag sequences of repeated values. Here's some data:
DATETIME WDIR
1 40360.04 22
2 40360.08 23
3 40360.12 126
4 40360.17 126
5 40360.21 126
6 40360.25 126
7 40360.29 25
8 40360.33 26
9 40360.38 132
10 40360.42 132
11 40360.46 132
12 40360.50 30
13 40360.54 132
14 40360.58 35
So if I need to note when a value is repeated three or more times, I have a sequence of four '126' and a sequence of three '132' that need to be flagged.
I'm very new to R. I expect I use cbind to create a new column in this array with a "T" in the corresponding rows, but how to populate the column correctly is a mystery. Any pointers please? Thanks a bunch.

As Ramnath says, you can use rle.
rle(dat$WDIR)
Run Length Encoding
lengths: int [1:9] 1 1 4 1 1 3 1 1 1
values : int [1:9] 22 23 126 25 26 132 30 132 35
rle returns an object with two components, lengths and values. We can use the lengths component to build a new column that identifies which values are repeated three or more times in a row.
tmp <- rle(dat$WDIR)
rep(tmp$lengths >= 3,times = tmp$lengths)
[1] FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE
This will be our new column.
newCol <- rep(tmp$lengths >= 3,times = tmp$lengths)
cbind(dat,newCol)
DATETIME WDIR newCol
1 40360.04 22 FALSE
2 40360.08 23 FALSE
3 40360.12 126 TRUE
4 40360.17 126 TRUE
5 40360.21 126 TRUE
6 40360.25 126 TRUE
7 40360.29 25 FALSE
8 40360.33 26 FALSE
9 40360.38 132 TRUE
10 40360.42 132 TRUE
11 40360.46 132 TRUE
12 40360.50 30 FALSE
13 40360.54 132 FALSE
14 40360.58 35 FALSE
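The rle-and-rep steps above can be wrapped in a small helper. This is only a sketch: flag_runs and its min_len argument are hypothetical names, not part of base R, and the vector below simply retypes the WDIR column from the question.

```r
# Hypothetical helper: TRUE for every element that sits inside a run
# of at least `min_len` identical consecutive values.
flag_runs <- function(x, min_len = 3) {
  runs <- rle(x)
  # One TRUE/FALSE per run, expanded back to one value per element
  rep(runs$lengths >= min_len, times = runs$lengths)
}

# WDIR values retyped from the question
wdir <- c(22, 23, 126, 126, 126, 126, 25, 26, 132, 132, 132, 30, 132, 35)
flagged <- flag_runs(wdir)
which(flagged)  # positions 3-6 and 9-11; the lone 132 at position 13 is not flagged
```

Keeping the threshold as an argument makes it harder for the displayed check and the stored column to drift apart.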

Use rle to do the job! It computes the lengths of runs of successive equal values in a sequence. Here is some example code showing how you can use rle to flag the miscreants in your data. This will return all rows from the data frame whose WDIR is part of a run of 3 or more successive repeats.
runs = rle(mydf$WDIR)
# expand the per-run test to one flag per row; matching on the value with %in% would also return the isolated 132 in row 13
subset(mydf, rep(runs$lengths >= 3, times = runs$lengths))

Two options for you.
Assuming the data is loaded:
dat <- read.table(textConnection("
DATETIME WDIR
40360.04 22
40360.08 23
40360.12 126
40360.17 126
40360.21 126
40360.25 126
40360.29 25
40360.33 26
40360.38 132
40360.42 132
40360.46 132
40360.50 30
40360.54 132
40360.58 35"), header=TRUE)
Option 1: Sorting
dat <- dat[order(dat$WDIR),] # needed for the 'repeats' to be pasted into the correct rows in next step
dat$count <- rep(table(dat$WDIR),table(dat$WDIR))
dat$more4 <- dat$count >= 4
dat <- dat[order(dat$DATETIME),] # sort back to original order
dat
Option 2: Oneliner
dat$more4 <- dat$WDIR %in% names(which(table(dat$WDIR)>3))
dat
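Since you are a new user, I thought option 1 might be an easier step-by-step approach, although the rep(table(), table()) may not be intuitive initially. Note that table() counts every occurrence of a value, consecutive or not, so both options also flag the isolated 132 in row 13.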

Related

R: How to compare values in a column with later values in the same column

I am attempting to work with a large dataset in R where I need to create a column that compares the value in an existing column to all values that follow it (e.g. row 1 needs to compare rows 1-10,000, row 2 needs to compare rows 2-10,000, row 3 needs to compare rows 3-10,000, etc.), but I cannot figure out how to write the range.
I currently have a column of raw numeric values and a column of row values generated by:
samples$row = seq.int(nrow(samples))
I have attempted to generate the column with the following command:
samples$processed = min(samples$raw[samples$row:10000])
but get the error "numerical expression has 10000 elements: only the first used" and the generated column only has the value for row 1 repeated for each of the 10,000 rows.
How do I need to write this command so that the lower bound of the range is the row currently being calculated instead of 1?
Any help would be appreciated, as I have minimal programming experience.
If all you need is the min of the specific row and all following rows, then
rev(cummin(rev(samples$val)))
# [1] 24 24 24 24 24 24 24 24 24 24 24 24 165 165 165 165 410 410 410 882
If you have some other function that doesn't have a cumulative variant (and your use of min is just a placeholder), then one of:
mapply(function(a, b) min(samples$val[a:b]), seq.int(nrow(samples)), nrow(samples))
# [1] 24 24 24 24 24 24 24 24 24 24 24 24 165 165 165 165 410 410 410 882
sapply(seq.int(nrow(samples)), function(a) min(samples$val[a:nrow(samples)]))
The only reason to use mapply over sapply is if, for some reason, you want window-like operations instead of always going to the bottom of the frame. (Though if you wanted windows, I'd suggest either the zoo or slider packages.)
Data
set.seed(42)
samples <- data.frame(val = sample(1000, size=20))
samples
# val
# 1 561
# 2 997
# 3 321
# 4 153
# 5 74
# 6 228
# 7 146
# 8 634
# 9 49
# 10 128
# 11 303
# 12 24
# 13 839
# 14 356
# 15 601
# 16 165
# 17 622
# 18 532
# 19 410
# 20 882
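For a function without a cumulative variant, base R's Reduce with accumulate = TRUE and right = TRUE is another way to get "this row and everything after it" results, provided the function can be applied pairwise. A minimal sketch on a made-up vector (x is an assumption for illustration, not the samples data above):

```r
# Made-up input vector, for illustration only
x <- c(5, 3, 8, 1, 9, 4)

# Fold from the right, keeping every intermediate result:
# element i is min(x[i], x[i + 1], ..., x[n])
suffix_min <- Reduce(min, x, accumulate = TRUE, right = TRUE)
suffix_min

# Cross-check against the cummin approach
stopifnot(identical(suffix_min, rev(cummin(rev(x)))))
```

Swapping min for another two-argument function gives the same "from here to the end" behaviour without writing the index range by hand.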

reducing repetitive tasks in data.table in R

I notice that I am doing the same thing multiple times, just with slightly different values:
HCCtreshold <- 40000
claimsMonthly[, HCC12mnth := +(HCCtreshold < claim12month)][ HCC12mnth == 1, `:=` (aboveHCCth12mnth = (claim12month - HCCtreshold))][is.na(aboveHCCth12mnth),aboveHCCth12mnth := 0]
claimsMonthly[, HCC11mnth := +(HCCtreshold < claim11month)][ HCC11mnth == 1, `:=` (aboveHCCth11mnth = (claim11month - HCCtreshold))][is.na(aboveHCCth11mnth),aboveHCCth11mnth := 0]
claimsMonthly[, HCC10mnth := +(HCCtreshold < claim10month)][ HCC10mnth == 1, `:=` (aboveHCCth10mnth = (claim10month - HCCtreshold))][is.na(aboveHCCth10mnth),aboveHCCth10mnth := 0]
So I started with something like this:
k <- seq.default(from = 8, to = 12, by = 1)
claimsMonthly[paste0("HCC", k, "mnth") := lapply(k, function(x) (+(HCCtreshold < paste0("HCC", k, "mnth"))))]
I get an error:
Error: Check that is.data.table(DT) == TRUE. Otherwise, := and `:=`(...) are defined for use in j, once only and in particular ways. See help(":=").
I also tried:
for(k in 8:12){
claimsMonthly[, paste0("HCC", k, "mnth") := +(HCCtreshold < paste0("HCC", k, "mnth"))]
}
The columns are created correctly, but I get incorrect values inside them: I get a 1 everywhere.
I am not sure what I am doing wrong.
I can offer some suggestions and, with some fake data, try them out.
You can programmatically define names on the left-hand side of := if you wrap a vector in c(...), so for instance DT[, c(vec_of_names) := list(some, values)]. (Note the leading comma: := belongs in j, the second argument.)
You can programmatically retrieve values of variables with a vector of variable names and mget. While I generally think mget can indicate problematic code, I believe that here it is low-risk. (While mget and get normally retrieve variables from the calling environment, often .GlobalEnv, from within a data.table operation they retrieve columns just as easily.)
Instead of a double-tap of assignment with == 1 and then is.na(...), we can use some logical trickery and the data.table::fcoalesce function. (If you aren't familiar, fcoalesce operates like SQL's coalesce function, which is a vector-friendly way of finding the first non-NA value across its vector arguments.)
fcoalesce(c(1, 2, NA, NA), c(11, 12, 13, NA), c(21, 22, 23, 24))
# [1] 1 2 13 24
We can use fcoalesce(some + math * calc, 0) to do the math and, if NA, replace it with 0. (We use it on the above* variables below, and not necessarily on the HCC* logical variables. It can apply there too, if desired. If those HCC* variables are throw-away, though, it just doesn't matter.)
Fake data:
library(data.table)
set.seed(42)
hccthreshold <- 50
dat <- data.table( claim10month = sample(99, 10), claim11month = sample(99, 10), claim12month = sample(99, 10) )
dat$claim11month[5] <- NA
dat
# claim10month claim11month claim12month
# 1: 91 46 90
# 2: 92 71 14
# 3: 28 91 96
# 4: 80 25 91
# 5: 61 NA 8
# 6: 49 89 49
# 7: 69 97 37
# 8: 13 11 84
# 9: 60 95 41
# 10: 64 51 76
First, let's programmatically determine the column names we want to act on, and from them create the matching vectors of names for the new variables. (I'm a big fan of determining and adapting these variable names programmatically, so that if you get a partial data set your code still works. You might consider setting checks and alarms to catch something wrong. For instance, stopifnot(length(claimnames) == 12L), in case you are expecting to always have precisely 12 months.)
claimnames <- grep("^claim[0-9]+month", colnames(dat), value = TRUE)
hccnames <- gsub("^claim", "HCC", claimnames)
abovenames <- gsub("^claim", "aboveHCC", claimnames)
claimnames
# [1] "claim10month" "claim11month" "claim12month"
hccnames
# [1] "HCC10month" "HCC11month" "HCC12month"
abovenames
# [1] "aboveHCC10month" "aboveHCC11month" "aboveHCC12month"
And now, we can process the data.
dat[, c(hccnames) := lapply(mget(claimnames), `>`, hccthreshold) ]
# amount above the threshold; 0 when the claim is at or below the threshold, or NA
dat[, c(abovenames) := Map(function(hcc, clm) fcoalesce(hcc * (clm - hccthreshold), 0),
mget(hccnames), mget(claimnames)) ]
dat
# claim10month claim11month claim12month HCC10month HCC11month HCC12month aboveHCC10month aboveHCC11month aboveHCC12month
# 1: 91 46 90 TRUE FALSE TRUE 41 0 40
# 2: 92 71 14 TRUE TRUE FALSE 42 21 0
# 3: 28 91 96 FALSE TRUE TRUE 0 41 46
# 4: 80 25 91 TRUE FALSE TRUE 30 0 41
# 5: 61 NA 8 TRUE NA FALSE 11 0 0
# 6: 49 89 49 FALSE TRUE FALSE 0 39 0
# 7: 69 97 37 TRUE TRUE FALSE 19 47 0
# 8: 13 11 84 FALSE FALSE TRUE 0 0 34
# 9: 60 95 41 TRUE TRUE FALSE 10 45 0
# 10: 64 51 76 TRUE TRUE TRUE 14 1 26
I chose to keep the HCC* variables as logical instead of your +(...) integers, but it's directly translatable and up to you.
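If you prefer to stay close to the for loop from the question, the following is a minimal sketch of how it might be repaired. It assumes the claim columns are named claim<k>month as in the question; the three-row claimsMonthly table and the threshold of 50 are made up for illustration. The two changes are that the comparison uses the column's values via get() rather than the column-name string, and that it reads from the claim column rather than the new HCC column.

```r
library(data.table)

# Made-up stand-in for the real data, for illustration only
HCCtreshold <- 50
claimsMonthly <- data.table(
  claim10month = c(91, 28, 61),
  claim11month = c(46, 91, NA)
)

for (k in 10:11) {
  claim_col <- paste0("claim", k, "month")      # existing column to read
  hcc_col   <- paste0("HCC", k, "mnth")         # new 0/1 indicator column
  above_col <- paste0("aboveHCCth", k, "mnth")  # new amount-above-threshold column

  # get() fetches the column's values; a bare string would be compared as text
  claimsMonthly[, (hcc_col) := +(HCCtreshold < get(claim_col))]
  # 0 when the claim is at or below the threshold, or NA
  claimsMonthly[, (above_col) :=
    fcoalesce(get(hcc_col) * (get(claim_col) - HCCtreshold), 0)]
}
claimsMonthly
```

Wrapping the new column name in parentheses on the left of := tells data.table to use the value of the variable rather than its literal name.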

Summing values after every third position in data frame in R

I am new to R. I have a data frame like the following:
df=data.frame(Id=c("Entry_1","Entry_1","Entry_1","Entry_2","Entry_2","Entry_2","Entry_3","Entry_4","Entry_4","Entry_4","Entry_4"),
Start=c(20,20,20,37,37,37,68,10,10,10,10),
End=c(50,50,50,78,78,78,200,94,94,94,94),
Pos=c(14,34,21,50,18,70,101,35,2,56,67),
Hits=c(12,34,17,89,45,87,1,5,6,3,26))
Id Start End Pos Hits
Entry_1 20 50 14 12
Entry_1 20 50 34 34
Entry_1 20 50 21 17
Entry_2 37 78 50 89
Entry_2 37 78 18 45
Entry_2 37 78 70 87
Entry_3 68 200 101 1
Entry_4 10 94 35 5
Entry_4 10 94 2 6
Entry_4 10 94 56 3
Entry_4 10 94 67 26
For each entry I would like to iterate over the data.frame in 3 different modes. For example, for Entry_1, mode_1 = seq(20,50,3), mode_2 = seq(21,50,3) and mode_3 = seq(22,50,3). I would like to sum all the values in column "Hits" whose corresponding values in column "Pos" fall in mode_1, mode_2 or mode_3, and generate a data.frame like the following:
Id Mode_1 Mode_2 Mode_3
Entry_1 0 17 34
Entry_2 87 89 0
Entry_3 1 0 0
Entry_4 26 8 0
I tried the following code:
mode_1=0
mode_2=0
mode_3=0
mode_1_sum=0
mode_2_sum=0
mode_3_sum=0
for(i in dim(df)[1])
{
if(df$Pos[i] %in% seq(df$Start[i],df$End[i],3))
{
mode_1_sum=mode_1_sum+df$Hits[i]
print(mode_1_sum)
}
mode_1=mode_1_sum+counts
print(mode_1)
ifelse(df$Pos[i] %in% seq(df$Start[i]+1,df$End[i],3))
{
mode_2_sum=mode_2_sum+df$Hits[i]
print(mode_2_sum)
}
mode_2_sum=mode_2_sum+counts
print(mode_2)
ifelse(df$Pos[i] %in% seq(df$Start[i]+2,df$End[i],3))
{
mode_3_sum=mode_3_sum+df$Hits[i]
print(mode_3_sum)
}
mode_3_sum=mode_3_sum+counts
print(mode_3_sum)
}
But the above code only prints 26. Can anyone guide me on how to generate my desired output, please? I can provide more details if needed. Thanks in advance.
It's not an elegant solution, but it works.
m <- 3 # Number of modes you want
# <= and >= so that a Pos equal to Start or End counts as inside, matching seq(Start, End, 3)
foo <- ((df$Pos - df$Start)%%m + 1) * (df$Start <= df$Pos) * (df$End >= df$Pos)
tab <- matrix(0,nrow(df),m)
for(i in 1:m) tab[foo==i,i] <- df$Hits[foo==i]
aggregate(tab,list(df$Id),FUN=sum)
# Group.1 V1 V2 V3
# 1 Entry_1 0 17 34
# 2 Entry_2 87 89 0
# 3 Entry_3 1 0 0
# 4 Entry_4 26 8 0
-- EXPLANATION --
First, we find which values of df$Pos fall inside the interval from df$Start to df$End. Each comparison counts as 1 if TRUE and 0 if FALSE when used in arithmetic. Next, we take the difference between df$Pos and df$Start, take it mod 3 (which gives a vector of 0s, 1s and 2s), and then add 1 to get the right mode. We multiply these things together, so that the values that fall within the interval retain the right mode, and the values that fall outside the interval become 0.
Next, we create an empty matrix that will contain the values. Then, we use a for-loop to fill in the matrix. Finally, we aggregate the matrix.
I tried looking for a quicker solution, but the main problem I cannot work around is the varying intervals for each row.
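As a possible alternative to filling the matrix in a loop, the same mode arithmetic can be fed to xtabs to build the Id-by-mode table in one step. This is a sketch under the assumption that positions equal to Start or End count as inside the interval; the Mode column and the Mode_1..Mode_3 labels are names introduced here for illustration.

```r
df <- data.frame(
  Id    = c("Entry_1","Entry_1","Entry_1","Entry_2","Entry_2","Entry_2",
            "Entry_3","Entry_4","Entry_4","Entry_4","Entry_4"),
  Start = c(20,20,20,37,37,37,68,10,10,10,10),
  End   = c(50,50,50,78,78,78,200,94,94,94,94),
  Pos   = c(14,34,21,50,18,70,101,35,2,56,67),
  Hits  = c(12,34,17,89,45,87,1,5,6,3,26)
)

# Rows whose Pos lies inside [Start, End]
inside <- df$Pos >= df$Start & df$Pos <= df$End

# Mode 1, 2 or 3 depending on the offset from Start, as a labelled factor
df$Mode <- factor((df$Pos - df$Start) %% 3 + 1,
                  levels = 1:3, labels = paste0("Mode_", 1:3))

# Sum Hits for each Id/Mode combination, using only the rows inside the interval
res <- xtabs(Hits ~ Id + Mode, data = df[inside, ])
res
```

Because Mode is a factor with all three levels declared, modes with no hits still appear as 0 columns. An Id with no positions inside its interval would drop out of the table unless Id is also made a factor first.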

Filter rows based on values of multiple columns in R

Here is the data set; say its name is DS.
Abc Def Ghi
1 41 190 67
2 36 118 72
3 12 149 74
4 18 313 62
5 NA NA 56
6 28 NA 66
7 23 299 65
8 19 99 59
9 8 19 61
10 NA 194 69
How can I get a new dataset DSS where the value of column Abc is greater than 25 and the value of column Def is greater than 100? It should also ignore any row where the value of at least one column is NA.
I have tried a few options but wasn't successful. Your help is appreciated.
There are multiple ways of doing it. I have given 5 methods, and the first 4 methods are faster than the subset function.
R Code:
# Method 1:
DS_Filtered <- na.omit(DS[(DS$Abc > 25 & DS$Def > 100), ])
# Method 2: which() also ignores NA
DS_Filtered <- DS[ which( DS$Abc > 25 & DS$Def > 100) , ]
# Method 3:
DS_Filtered <- na.omit(DS[(DS$Abc > 25) & (DS$Def > 100), ])
# Method 4: using the dplyr package (filter drops rows where the condition is NA)
library(dplyr)
DS_Filtered <- filter(DS, Abc > 25, Def > 100)
DS_Filtered <- DS %>% filter(Abc > 25 & Def > 100)
# Method 5: subset() by default ignores NA
DS_Filtered <- subset(DS, Abc > 25 & Def > 100)
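To make the NA requirement explicit rather than relying on how each method happens to treat NA, complete.cases can be combined with the two conditions. A minimal base R sketch, with DS retyped from the question:

```r
# Data retyped from the question
DS <- data.frame(
  Abc = c(41, 36, 12, 18, NA, 28, 23, 19, 8, NA),
  Def = c(190, 118, 149, 313, NA, NA, 299, 99, 19, 194),
  Ghi = c(67, 72, 74, 62, 56, 66, 65, 59, 61, 69)
)

# complete.cases() is FALSE for any row containing an NA, and FALSE & NA is FALSE,
# so the combined index contains no NA and those rows are simply dropped
keep <- complete.cases(DS) & DS$Abc > 25 & DS$Def > 100
DSS <- DS[keep, ]
DSS
```

On this data only rows 1 and 2 satisfy both conditions; row 6 has Abc = 28 but is excluded because Def is NA.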

R efficiently add up tables in different order

At some point in my code, I get a list of tables that looks much like this:
[[1]]
cluster_size start end number p_value
13 2 12 13 131 4.209645e-233
12 1 12 12 100 6.166824e-185
22 11 12 22 132 6.916323e-143
23 12 12 23 133 1.176194e-139
13 1 13 13 31 3.464284e-38
13 68 13 117 34 3.275941e-37
23 78 23 117 2 4.503111e-32
....
[[2]]
cluster_size start end number p_value
13 2 12 13 131 4.209645e-233
12 1 12 12 100 6.166824e-185
22 11 12 22 132 6.916323e-143
23 12 12 23 133 1.176194e-139
13 1 13 13 31 3.464284e-38
....
While I don't show the full tables here, I know they are all the same size. What I want to do is make one table where I add up the p-values. The problem is that the $cluster_size, $start, $end and $number columns don't necessarily correspond to the same row when I look at the tables in different list elements, so I can't just do a simple sum.
The brute force way to do this is to: 1) make a blank table 2) copy in the appropriate $cluster_size, $start, $end, $number columns from the first table and pull the correct p-values from all the tables using a which() statement. Is there a more clever way of doing this? Or is this pretty much it?
Edit: I was asked for a dput file of the data. It's located here:
http://alrig.com/code/
In the sample case, the order of the rows happens to match. That will not always be the case.
Seems like you can do this in two steps:
1) Convert your list to a data.frame
2) Use any of the split-apply-combine approaches to summarize.
Assuming your data was named X, here's what you could do:
library(plyr)
#need to convert to data.frame since all of your list objects are of class matrix
XDF <- as.data.frame(do.call("rbind", X))
ddply(XDF, .(cluster_size, start, end, number), summarize, sump = sum(p_value))
#-----
cluster_size start end number sump
1 1 12 12 100 5.550142e-184
2 1 13 13 31 3.117856e-37
3 1 22 22 1 9.000000e+00
...
29 105 23 117 2 6.271469e-16
30 106 22 146 13 7.266746e-25
31 107 23 146 12 1.382328e-25
Lots of other aggregation techniques are covered here. I'd look at data.table package if your data is large.
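If you would rather avoid the plyr dependency, the same two steps can be sketched in base R with aggregate. The two small matrices below are made-up stand-ins for the list X (the p-values are invented so the sums are easy to check), with the rows deliberately in a different order in each element.

```r
cols <- c("cluster_size", "start", "end", "number", "p_value")

# Made-up stand-in for the real list of matrices; row order differs between elements
X <- list(
  matrix(c(2, 12, 13, 131, 0.10,
           1, 12, 12, 100, 0.20),
         ncol = 5, byrow = TRUE, dimnames = list(NULL, cols)),
  matrix(c(1, 12, 12, 100, 0.05,
           2, 12, 13, 131, 0.30),
         ncol = 5, byrow = TRUE, dimnames = list(NULL, cols))
)

# Step 1: stack all list elements into one data.frame
XDF <- as.data.frame(do.call(rbind, X))

# Step 2: sum p_value within each combination of the four key columns
summed <- aggregate(p_value ~ cluster_size + start + end + number,
                    data = XDF, FUN = sum)
summed
```

Because the grouping is done on the key columns rather than on row position, it does not matter in which order the rows appear in each list element.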

Resources