Creating new columns for consecutive TRUEs in R

I want to create new columns that contain TRUE when the number of consecutive wins reaches two, three, etc. So I would like rows 3, 6, 7 and 8 to be TRUE in a new column called "twoconswins", rows 7 and 8 to be TRUE in a new column called "threeconswins", and so on. What is the best way to do this?
> id date team teamscore opponent opponentscore home win
>9 9 2005-10-05 DET 5 STL 1 1 TRUE
>38 38 2005-10-09 DET 6 CAL 3 1 TRUE
>48 48 2005-10-10 DET 2 VAN 4 1 FALSE
>88 88 2005-10-17 DET 3 SJS 2 1 TRUE
>110 110 2005-10-21 DET 3 ANA 2 1 TRUE
>148 148 2005-10-27 DET 5 CHI 2 1 TRUE
>179 179 2005-11-01 DET 4 CHI 1 1 TRUE
>194 194 2005-11-03 DET 3 EDM 4 1 FALSE
>212 212 2005-11-05 DET 1 PHO 4 1 FALSE

I assumed row 1 should be the header, so that actually rows 2, 5, 6 and 7 should evaluate to TRUE for "twoconswins", and row 6 and 7 for "threeconswins".
You could do:
library(data.table)
df$twoconswins <- (df$win & shift(df$win, 1, NA)) == TRUE
df$threeconswins <- (df$win & shift(df$win, 1, NA) & shift(df$win, 2, NA)) == TRUE
I am thinking this could be more vectorized though, especially if 50 consecutive wins could be possible as well and you'd like to create columns for that as well.

If you would like to create the new columns automatically, in case a streak of, say, 500 consecutive wins ever occurs, you could do this:
df <- read.table(text =
'id date team teamscore opponent opponentscore home win
9 9 2005-10-05 DET 5 STL 1 1 TRUE
38 38 2005-10-09 DET 6 CAL 3 1 TRUE
48 48 2005-10-10 DET 2 VAN 4 1 FALSE
88 88 2005-10-17 DET 3 SJS 2 1 TRUE
110 110 2005-10-21 DET 3 ANA 2 1 TRUE
148 148 2005-10-27 DET 5 CHI 2 1 TRUE
179 179 2005-11-01 DET 4 CHI 1 1 TRUE
194 194 2005-11-03 DET 3 EDM 4 1 FALSE
212 212 2005-11-05 DET 1 PHO 4 1 FALSE',
header = TRUE)
rles <- rle(df$win)
# longest run of TRUE values
maxconwins <- max(rles$lengths[rles$values])
for(x in 1: maxconwins){
x <- seq(1,x)
partialstring <- paste("shift(df$win,", x, ",NA)", collapse = " & ")
fullstring <- paste0("df$nr", max(x), "conswins <- (", partialstring, ") == TRUE")
eval(parse(text = fullstring))
}
df[1:maxconwins,9:12][upper.tri(df[1:maxconwins,9:12], diag = TRUE)] <- NA
> df[,8:12]
win nr1conswins nr2conswins nr3conswins nr4conswins
9 TRUE NA NA NA NA
38 TRUE TRUE NA NA NA
48 FALSE TRUE TRUE NA NA
88 TRUE FALSE FALSE FALSE NA
110 TRUE TRUE FALSE FALSE FALSE
148 TRUE TRUE TRUE FALSE FALSE
179 TRUE TRUE TRUE TRUE FALSE
194 FALSE TRUE TRUE TRUE TRUE
212 FALSE FALSE FALSE FALSE FALSE
BTW, I only added the last line because (FALSE & TRUE & TRUE & NA) == TRUE evaluates to FALSE, while you would probably like these cells to be NA. I made sure of this by setting the upper triangle of the square submatrix to NA afterwards. For readability I hard-coded the column numbers 9 and 12 here, but you could compute those as well if you'd like.
UPDATE:
When using the Reduce() function as suggested by Frank, you could use this for loop instead of the one above:
for(x in 1: maxconwins){
x <- seq(1,x)
eval(parse(text = paste0("df$nr", max(x), "conswins <- (Reduce(`&`, shift(df$win, 1:", max(x), "))) == TRUE")))
}
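The same columns can also be built without eval(parse()) and without data.table. This is only a sketch in base R, not the approach used above: it computes the length of the current winning streak with rle() and sequence(), counting the current game as part of the streak (which matches the rows expected for "twoconswins" and "threeconswins"), so no NA padding is produced. The win vector is copied from the question; the streak variable is an illustrative name.

```r
# win column copied from the question's data
df <- data.frame(win = c(TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE))

runs <- rle(df$win)

# position inside each run, set to 0 for runs of FALSE
streak <- sequence(runs$lengths) * rep(runs$values, runs$lengths)

# longest winning streak in the data
maxconwins <- max(runs$lengths[runs$values])

# one logical column per streak length, named like the columns above
for (n in seq_len(maxconwins)) {
  df[[paste0("nr", n, "conswins")]] <- streak >= n
}

which(df$nr2conswins)  # 2 5 6 7
which(df$nr3conswins)  # 6 7
```

Assigning with df[[name]] <- value lets the column name be built from a string directly, which is why no code has to be pasted together and parsed.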

Related

reducing repetitive tasks in data.table in R

I notice that I am doing the same thing multiple times, just with slightly different values:
HCCtreshold <- 40000
claimsMonthly[, HCC12mnth := +(HCCtreshold < claim12month)][ HCC12mnth == 1, `:=` (aboveHCCth12mnth = (claim12month - HCCtreshold))][is.na(aboveHCCth12mnth),aboveHCCth12mnth := 0]
claimsMonthly[, HCC11mnth := +(HCCtreshold < claim11month)][ HCC11mnth == 1, `:=` (aboveHCCth11mnth = (claim11month - HCCtreshold))][is.na(aboveHCCth11mnth),aboveHCCth11mnth := 0]
claimsMonthly[, HCC10mnth := +(HCCtreshold < claim10month)][ HCC10mnth == 1, `:=` (aboveHCCth10mnth = (claim10month - HCCtreshold))][is.na(aboveHCCth10mnth),aboveHCCth10mnth := 0]
So I started with something like this:
k <- seq.default(from = 8, to = 12, by = 1)
claimsMonthly[paste0("HCC", k, "mnth") := lapply(k, function(x) (+(HCCtreshold < paste0("HCC", k, "mnth"))))]
I get an error:
Error: Check that is.data.table(DT) == TRUE. Otherwise, := and `:=`(...) are defined for use in j, once only and in particular ways. See help(":=").
I also tried:
for(k in 8:12){
claimsMonthly[, paste0("HCC", k, "mnth") := +(HCCtreshold < paste0("HCC", k, "mnth"))]
}
The columns are created correctly, but I get incorrect values inside them: I get a 1 everywhere.
I am not sure what I am doing wrong.
I can offer some suggestions and, with some fake data, try them out.
You can programmatically define names on the left-hand side of := if you wrap a vector in c(...), so for instance DT[ c(vec_of_names) := list(some, values)].
You can programmatically retrieve values of variables with a vector of variable names and mget. While I generally think mget can indicate problematic code, I believe that here it works with low risk. (While mget and get normally retrieve variables from the calling environment, often .GlobalEnv, from within a data.table operation they retrieve columns just as easily.)
Instead of a double-tap of assignment with == 1 and then is.na(...), we can use some logical trickery and the data.table::fcoalesce function. (If you aren't familiar, fcoalesce operates like SQL's coalesce function, which is a vector-friendly way of finding the first non-NA value in arguments of vectors.)
fcoalesce(c(1, 2, NA, NA), c(11, 12, 13, NA), c(21, 22, 23, 24))
# [1] 1 2 13 24
We can use fcoalesce(some + math * calc, 0) to do the math and, if NA, replace it with 0. (We use it on the above* variables below, and not necessarily on the HCC* logical variables. It can apply there too, if desired. If those HCC* variables are throw-away, though, it just doesn't matter.)
Fake data:
library(data.table)
set.seed(42)
hccthreshold <- 50
dat <- data.table( claim10month = sample(99, 10), claim11month = sample(99, 10), claim12month = sample(99, 10) )
dat$claim11month[5] <- NA
dat
# claim10month claim11month claim12month
# 1: 91 46 90
# 2: 92 71 14
# 3: 28 91 96
# 4: 80 25 91
# 5: 61 NA 8
# 6: 49 89 49
# 7: 69 97 37
# 8: 13 11 84
# 9: 60 95 41
# 10: 64 51 76
First, let's programmatically determine the column names we want to act on, and from them create the same vectors for the new variables. (I'm a big fan of determining and adapting these variable names programmatically, so that if you get a partial data set your code still works. You might consider setting checks and alarms to catch something wrong. For instance, stopifnot(length(claimnames) == 12L), in case you are expecting to always have precisely 12 months.)
claimnames <- grep("^claim[0-9]+month", colnames(dat), value = TRUE)
hccnames <- gsub("^claim", "HCC", claimnames)
abovenames <- gsub("^claim", "aboveHCC", claimnames)
claimnames
# [1] "claim10month" "claim11month" "claim12month"
hccnames
# [1] "HCC10month" "HCC11month" "HCC12month"
abovenames
# [1] "aboveHCC10month" "aboveHCC11month" "aboveHCC12month"
And now, we can process the data.
dat[, c(hccnames) := lapply(mget(claimnames), `>`, hccthreshold) ]
dat[, c(abovenames) := Map(function(hcc, clm) fcoalesce(clm - hcc * hccthreshold, 0),
mget(hccnames), mget(claimnames)) ]
dat
# claim10month claim11month claim12month HCC10month HCC11month HCC12month aboveHCC10month aboveHCC11month aboveHCC12month
# 1: 91 46 90 TRUE FALSE TRUE 41 46 40
# 2: 92 71 14 TRUE TRUE FALSE 42 21 14
# 3: 28 91 96 FALSE TRUE TRUE 28 41 46
# 4: 80 25 91 TRUE FALSE TRUE 30 25 41
# 5: 61 NA 8 TRUE NA FALSE 11 0 8
# 6: 49 89 49 FALSE TRUE FALSE 49 39 49
# 7: 69 97 37 TRUE TRUE FALSE 19 47 37
# 8: 13 11 84 FALSE FALSE TRUE 13 11 34
# 9: 60 95 41 TRUE TRUE FALSE 10 45 41
# 10: 64 51 76 TRUE TRUE TRUE 14 1 26
I chose to keep the HCC* variables as logical instead of your +(...) integers, but it's directly translatable and up to you.
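One thing to watch in the output above: clm - hcc * hccthreshold subtracts the threshold only where HCC is TRUE, so rows at or below the threshold keep the raw claim (for example 46 in row 1 of aboveHCC11month) instead of 0. If the intent is the question's original behaviour (the excess over the threshold, otherwise 0), one possible variant is sketched below. It assumes a small hand-written data set with made-up values, and it uses .SD with .SDcols instead of mget, which is a swapped-in technique rather than what the answer above does.

```r
library(data.table)

hccthreshold <- 50

# made-up values, including one NA
dat <- data.table(claim10month = c(91, 28, 61),
                  claim11month = c(46, 91, NA))

claimnames <- grep("^claim[0-9]+month$", colnames(dat), value = TRUE)
abovenames <- gsub("^claim", "aboveHCC", claimnames)

# pmax() keeps only the positive excess over the threshold;
# fcoalesce() then replaces any remaining NA with 0
dat[, (abovenames) := lapply(.SD, function(clm) fcoalesce(pmax(clm - hccthreshold, 0), 0)),
    .SDcols = claimnames]

dat$aboveHCC10month  # 41  0 11
dat$aboveHCC11month  #  0 41  0
```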

Weird behavior of ggplot2::geom_text()

I want ggplot() to label observations with residuals higher than 1.5 times the standard error of the regression. The data are these (from Frank 1984):
d <- data.frame(x=c(43,32,32,30,26,25,23,22,22,21,20,20,19,19,19,18,18,17,17,16,16,16,15,13,12,12,10,10,9,7,6,3), y=c(63.0,54.3,51.0,39.0,52.0,55.0,41.2,47.7,44.5,43.0,46.8,42.4,56.5,55.0,53.0,55.0,45.0,50.7,37.5,61.0,48.1,30.0,51.5,40.6,51.3,50.3,62.4,39.3,43.2,40.4,37.7,27.7))
The model is simple:
m <- lm(y~x,data=d)
Then the ggplot() is:
ggplot(d, aes(x=x, y=y)) + geom_point() + geom_text(label=ifelse(abs(resid(m))>(1.5*sigma(m)),rownames(d),""),
nudge_x = 1, nudge_y = 0, check_overlap = T, color="blue")
giving this plot
which is missing a label for the observation in the top left corner (obs #27). Compare:
abs(resid(m))>(1.5*sigma(m))
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE
which indicates correctly that 27 satisfies the condition. Why is it not labelled?
Your labels in geom_text aren't inside an aes like they should be, although I'm unsure why you still got partially working labels without it.
I'm including some intermediate steps to work through this more slowly; for me, that helps with debugging and investigating how things work. Feel free to condense.
The assignments of d and m are identical to the OP's. With the extra steps:
library(tidyverse)
d2 <- d %>%
mutate(row = row_number()) %>%
mutate(abs_resid = abs(resid(m)), sig = sigma(m)) %>%
mutate(is_outlier = abs_resid > 1.5 * sig) %>%
mutate(label = ifelse(is_outlier, row, ""))
head(d2)
#> x y row abs_resid sig is_outlier label
#> 1 43 63.0 1 4.8398378 7.934235 FALSE
#> 2 32 54.3 2 0.9561793 7.934235 FALSE
#> 3 32 51.0 3 2.3438207 7.934235 FALSE
#> 4 30 39.0 4 13.4681223 7.934235 TRUE 4
#> 5 26 52.0 5 1.2832746 7.934235 FALSE
#> 6 25 55.0 6 4.7211239 7.934235 FALSE
ggplot(d2, aes(x = x, y = y)) +
geom_point() +
geom_text(aes(label = label), nudge_x = 1, color = "blue")
Created on 2018-07-31 by the reprex package (v0.2.0).
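Another option, offered only as a sketch, is to hand geom_text() just the outlier rows through its data argument, so that no empty-string labels are drawn at all. The data and model are copied from the question; the row column and the outliers data frame are illustrative names, and the plotting part only runs if ggplot2 is installed.

```r
d <- data.frame(x=c(43,32,32,30,26,25,23,22,22,21,20,20,19,19,19,18,18,17,17,16,16,16,15,13,12,12,10,10,9,7,6,3), y=c(63.0,54.3,51.0,39.0,52.0,55.0,41.2,47.7,44.5,43.0,46.8,42.4,56.5,55.0,53.0,55.0,45.0,50.7,37.5,61.0,48.1,30.0,51.5,40.6,51.3,50.3,62.4,39.3,43.2,40.4,37.7,27.7))
m <- lm(y ~ x, data = d)

# keep the row number as an explicit column so it survives subsetting
d$row <- seq_len(nrow(d))

# residuals larger than 1.5 times the residual standard error
is_outlier <- abs(resid(m)) > 1.5 * sigma(m)
outliers <- d[is_outlier, ]
outliers$row  # 4 20 22 27 32, the same rows flagged in the question

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # the text layer gets its own data, so only outliers are labelled
  p <- ggplot(d, aes(x = x, y = y)) +
    geom_point() +
    geom_text(data = outliers, aes(label = row), nudge_x = 1, color = "blue")
  # print(p) draws the plot
}
```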

How come I get NA values when joining non-spatial object to geometry data/polygon?

I am trying to join a non-spatial object (Merged_Census2011) to a shapefile polygon (LDN_wards) by "ward.name". It seems to work fine until I look at the newly created object and see that all the data has turned into NAs. Here's how I proceeded.
#Join Merged_Census2011 data to LDN_wards shapefile
LDN_wards <- readOGR(dsn = "data", layer = "LDN_wards")
head(LDN_wards@data)
#Explore the object
plot(LDN_wards)
summary(LDN_wards)
names(Merged_Census2011)
names(LDN_wards)
names(LDN_wards) <- c("Code", "ward.name") #rename the LDN_wards name heading to ward.name so it can be matched later
#Join datasets
LDN_wards@data <- left_join(LDN_wards@data, Merged_Census2011)
head(LDN_wards@data)
And I get:
LDN_wards@data <- left_join(LDN_wards@data, Merged_Census2011)
Joining by: "ward.name"
Warning message:
In left_join_impl(x, y, by$x, by$y) :
joining factors with different levels, coercing to character vector
> head(LDN_wards@data)
Code ward.name ward.code.x electorate votescast ward.code.y per.owner per.white per.noquals per.degree per.couple
1 E05000001 Aldersgate <NA> NA NA <NA> NA NA NA NA NA
2 E05000002 Aldgate <NA> NA NA <NA> NA NA NA NA
My intuition is that this is because the two sets have a different number of rows. Could that be the issue? Is it not possible to join datasets with different numbers of rows (whereby the observations missing from one simply remain unmatched)?
I had compared the two datasets as follows:
#Compare the two datasets
nrow(LDN_wards)
nrow(Merged_Census2011)
LDN_wards$ward.name %in% Merged_Census2011$ward.name
summary(LDN_wards$ward.name %in% Merged_Census2011$ward.name)
> nrow(LDN_wards)
[1] 787
> nrow(Merged_Census2011)
[1] 668
> LDN_wards$ward.name %in% Merged_Census2011$ward.name
[1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[21] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[41] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE ETC...
> summary(LDN_wards$ward.name %in% Merged_Census2011$ward.name)
Mode FALSE TRUE NA's
logical 24 763 0
Could it be because of the FALSE=24? If it is, how do I delete those FALSE?
Apologies if this sounds obvious, I'm fairly new :)
Thanks for your help!
I have just tried again using the (newly discovered) inner_join function and it seemed to work. If I understood well, the inner_join function only merges the matching rows... So I thought it would be better. And indeed, I do not get NA values anymore. But strangely I get duplicated observations... So if anyone has a better suggestion, you're very welcome to share. See my proceedings below.
#Join datasets
LDN_wards@data <- inner_join(LDN_wards@data, Merged_Census2011)
head(LDN_wards@data, n=10)
> #Join datasets
> LDN_wards@data <- inner_join(LDN_wards@data, Merged_Census2011)
Joining by: c("ward.name", "ward.code.x", "electorate", "votescast","ward.code.y", "per.owner", "per.white", "per.noquals", "per.degree", "per.couple", "per.higher.managerial", "per.christian", "per.no.car", "per.limill", "per.goodhealth", "per.males", "per.aged60plus")
Warning message:
In inner_join_impl(x, y, by$x, by$y) :
joining character vector and factor, coercing into character vector
> head(LDN_wards@data, n=10)
Code ward.name ward.code.x electorate votescast ward.code.y per.owner per.white per.noquals per.degree per.couple
1 E05000007 Bridge E05000497 8677 5654 E05000497 69.8 71.9 19.9 29.9 55.3
2 E05000026 Abbey E05000026 8110 4712 E05000026 32.7 28.1 16.4 34.5 47.2
3 E05000026 Abbey E05000026 8110 4712 E05000455 48.5 73.4 10.1 55.4 52.4
4 E05000026 Abbey E05000455 7250 4808 E05000026 32.7 28.1 16.4 34.5 47.2
5 E05000026 Abbey E05000455 7250 4808 E05000455 48.5 73.4 10.1 55.4 52.4
6 E05000027 Alibon E05000027 6971 4127 E05000027 45.1 70.1 31.2 16.7 49.2
7 E05000028 Becontree E05000028 7535 4538 E05000028 46.7 58.8 28.0 20.6
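A likely contributor to the duplicated rows is that ward names are not unique (the output above shows "Abbey" under two different ward codes), so a join on the name pairs every "Abbey" on one side with every "Abbey" on the other. The sketch below uses made-up toy tables and base R only, and it assumes the census table carries a ward code comparable to Code (the column name ward.code is hypothetical). It shows the row multiplication and a code-based lookup with match() that keeps the original row order, which matters because the rows of the @data slot have to stay aligned with the polygons.

```r
# toy tables; codes and per.owner values borrowed from the output above
wards <- data.frame(Code = c("E05000026", "E05000455", "E05000027"),
                    ward.name = c("Abbey", "Abbey", "Alibon"),
                    stringsAsFactors = FALSE)
census <- data.frame(ward.code = c("E05000026", "E05000455", "E05000027"),
                     ward.name = c("Abbey", "Abbey", "Alibon"),
                     per.owner = c(32.7, 48.5, 45.1),
                     stringsAsFactors = FALSE)

# joining on the non-unique name: each "Abbey" matches both "Abbey" rows
by_name <- merge(wards, census, by = "ward.name", all.x = TRUE)
nrow(by_name)  # 5 rows instead of 3

# looking up by the unique code keeps one row per ward, in the original order
idx <- match(wards$Code, census$ward.code)
wards$per.owner <- census$per.owner[idx]
wards
```

Wards without a match simply get NA from the match() lookup, so nothing has to be deleted beforehand.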

Match objects with same IDs except for one

I have a dataframe with the following format:
df <- data.frame(DS.ID=c(123,214,543,325,123,214),
P.ID=c("AAC","JGK","DIF","ADL","AAE","JGK"),
OP.ID=c("xxab","xxac","xxad","xxae","xxab","xxac"))
DS.ID P.ID OP.ID
1 123 AAC xxab
2 214 JGK xxac
3 543 DIF xxad
4 325 ADL xxae
5 123 AAE xxab
6 214 JGK xxac
I'm trying to find instances where DS.ID is equal to another DS.ID, OP.ID is equal to another OP.ID, but the P.ID's are not equal. I know how to do it with a loop but I'd rather do a quicker method so it returns the DS.ID's/information of those that do not match. Either with a logical vector in another column or through the DS.ID's.
Using duplicated:
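# duplicated()'s second argument is `incomparables`, so both key columns
# have to be passed together as a data frame
df$match <- duplicated(df[c("DS.ID", "OP.ID")], fromLast = TRUE) |
duplicated(df[c("DS.ID", "OP.ID")])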
# df
# DS.ID P.ID OP.ID match
# 1 123 AAC xxab TRUE
# 2 214 JGK xxac TRUE
# 3 543 DIF xxad FALSE
# 4 325 ADL xxae FALSE
# 5 123 AAE xxab TRUE
# 6 214 JGK xxac TRUE
EDIT after OP clarification
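# rows whose DS.ID/OP.ID pair occurs more than once
dupli.2 <- duplicated(df[c("DS.ID", "OP.ID")], fromLast = TRUE) | duplicated(df[c("DS.ID", "OP.ID")])
# rows that are duplicated in all three columns
dupli.all <- duplicated(df) | duplicated(df, fromLast = TRUE)
# same DS.ID and OP.ID, but a different P.ID
dupli.2 & !dupli.all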
[1] TRUE FALSE FALSE FALSE TRUE FALSE
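For comparison, here is a sketch of a different route to the same logical vector: count the distinct P.ID values inside each DS.ID/OP.ID group with ave() and flag the groups that contain more than one. This is an alternative technique, not the duplicated() approach above; n_pid and mismatch are illustrative names, and the data frame is the one from the question.

```r
df <- data.frame(DS.ID = c(123, 214, 543, 325, 123, 214),
                 P.ID  = c("AAC", "JGK", "DIF", "ADL", "AAE", "JGK"),
                 OP.ID = c("xxab", "xxac", "xxad", "xxae", "xxab", "xxac"),
                 stringsAsFactors = FALSE)

# number of distinct P.ID values within each DS.ID/OP.ID group;
# row indices are passed so the result stays integer
n_pid <- ave(seq_len(nrow(df)), df$DS.ID, df$OP.ID,
             FUN = function(i) length(unique(df$P.ID[i])))

# TRUE where the IDs agree with another row but the P.ID differs
mismatch <- n_pid > 1
mismatch  # TRUE FALSE FALSE FALSE TRUE FALSE
```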

Identifying sequences of repeated numbers in R

I have a long time series where I need to identify and flag sequences of repeated values. Here's some data:
DATETIME WDIR
1 40360.04 22
2 40360.08 23
3 40360.12 126
4 40360.17 126
5 40360.21 126
6 40360.25 126
7 40360.29 25
8 40360.33 26
9 40360.38 132
10 40360.42 132
11 40360.46 132
12 40360.50 30
13 40360.54 132
14 40360.58 35
So if I need to note when a value is repeated three or more times, I have a sequence of four '126' and a sequence of three '132' that need to be flagged.
I'm very new to R. I expect I use cbind to create a new column in this array with a "T" in the corresponding rows, but how to populate the column correctly is a mystery. Any pointers please? Thanks a bunch.
As Ramnath says, you can use rle.
rle(dat$WDIR)
Run Length Encoding
lengths: int [1:9] 1 1 4 1 1 3 1 1 1
values : int [1:9] 22 23 126 25 26 132 30 132 35
rle returns an object with two components, lengths and values. We can use the lengths piece to build a new column that identifies which values are repeated three or more times in a row.
tmp <- rle(dat$WDIR)
rep(tmp$lengths >= 3,times = tmp$lengths)
[1] FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE
This will be our new column.
newCol <- rep(tmp$lengths >= 3,times = tmp$lengths)
cbind(dat,newCol)
DATETIME WDIR newCol
1 40360.04 22 FALSE
2 40360.08 23 FALSE
3 40360.12 126 TRUE
4 40360.17 126 TRUE
5 40360.21 126 TRUE
6 40360.25 126 TRUE
7 40360.29 25 FALSE
8 40360.33 26 FALSE
9 40360.38 132 TRUE
10 40360.42 132 TRUE
11 40360.46 132 TRUE
12 40360.50 30 FALSE
13 40360.54 132 FALSE
14 40360.58 35 FALSE
Use rle to do the job!! It is an amazing function that calculates the number of successive repetitions of numbers in a sequence. Here is some example code on how you can use rle to flag the miscreants in your data. This will return all rows from the data frame which have WDIR that are repeated 3 or more times successively.
runs = rle(mydf$WDIR)
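# expand the per-run test back to one value per row, so an isolated 132
# (row 13) is not returned just because 132 also occurs in a long run
subset(mydf, rep(runs$lengths >= 3, times = runs$lengths))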
Two options for you.
Assuming the data is loaded:
dat <- read.table(textConnection("
DATETIME WDIR
40360.04 22
40360.08 23
40360.12 126
40360.17 126
40360.21 126
40360.25 126
40360.29 25
40360.33 26
40360.38 132
40360.42 132
40360.46 132
40360.50 30
40360.54 132
40360.58 35"), header=T)
Option 1: Sorting
dat <- dat[order(dat$WDIR),] # needed for the 'repeats' to be pasted into the correct rows in next step
dat$count <- rep(table(dat$WDIR),table(dat$WDIR))
dat$more4 <- ifelse(dat$count < 4, F, T)
dat <- dat[order(dat$DATETIME),] # sort back to original order
dat
Option 2: Oneliner
dat$more4 <- ifelse(dat$WDIR %in% names(which(table(dat$WDIR)>3)),T,F)
dat
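I thought that, for a new user, option 1 might be an easier step-by-step approach, although rep(table(), table()) may not be intuitive at first. Note that both options count how often a value occurs in the whole column, not in consecutive runs, and they flag values that occur four or more times.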
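To make the difference between counting runs and counting total occurrences concrete, here is a small sketch on the WDIR values from the question (run_flag and count_flag are illustrative names). The isolated 132 in row 13 is flagged by the table()-based count but not by the rle()-based run test.

```r
wdir <- c(22, 23, 126, 126, 126, 126, 25, 26, 132, 132, 132, 30, 132, 35)

# run-based: TRUE only inside runs of three or more identical values
runs <- rle(wdir)
run_flag <- rep(runs$lengths >= 3, times = runs$lengths)

# count-based: TRUE wherever the value occurs three or more times overall
count_flag <- wdir %in% names(which(table(wdir) >= 3))

which(run_flag)                # 3 4 5 6 9 10 11
which(run_flag != count_flag)  # 13
```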
