Match objects with same IDs except for one

I have a dataframe with the following format:
df <- data.frame(DS.ID=c(123,214,543,325,123,214),
P.ID=c("AAC","JGK","DIF","ADL","AAE","JGK"),
OP.ID=c("xxab","xxac","xxad","xxae","xxab","xxac"))
DS.ID P.ID OP.ID
1 123 AAC xxab
2 214 JGK xxac
3 543 DIF xxad
4 325 ADL xxae
5 123 AAE xxab
6 214 JGK xxac
I'm trying to find instances where one row's DS.ID equals another row's DS.ID and their OP.IDs also match, but their P.IDs are not equal. I know how to do it with a loop, but I'd rather use a quicker method that returns the DS.IDs/information of the rows that do not match, either as a logical vector in another column or through the DS.IDs.

Using duplicated on the two ID columns together (note that duplicated(df$DS.ID, df$OP.ID) would silently pass OP.ID to duplicated's incomparables argument and ignore it):
df$match <- duplicated(df[c("DS.ID", "OP.ID")], fromLast = TRUE) |
  duplicated(df[c("DS.ID", "OP.ID")])
# df
# DS.ID P.ID OP.ID match
# 1 123 AAC xxab TRUE
# 2 214 JGK xxac TRUE
# 3 543 DIF xxad FALSE
# 4 325 ADL xxae FALSE
# 5 123 AAE xxab TRUE
# 6 214 JGK xxac TRUE
EDIT after OP clarification
dupli.2 <- duplicated(df[c("DS.ID", "OP.ID")], fromLast = TRUE) |
  duplicated(df[c("DS.ID", "OP.ID")])
dupli.all <- duplicated(df) | duplicated(df, fromLast = TRUE)
dupli.2 & !dupli.all
[1] TRUE FALSE FALSE FALSE TRUE FALSE
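If a dplyr solution is acceptable, grouping by the two IDs and keeping groups with more than one distinct P.ID expresses the requirement directly; a minimal sketch:
library(dplyr)
df %>%
  group_by(DS.ID, OP.ID) %>%
  filter(n_distinct(P.ID) > 1) %>%  # keep groups whose P.IDs disagree
  ungroup()
For the sample data this returns rows 1 and 5 (DS.ID 123, OP.ID xxab, with differing P.IDs AAC and AAE).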

Creating new columns for consecutive TRUEs in R

I want to create new columns that put TRUE if the number of consecutive wins is two, three, etc. So I would like rows 3, 6, 7, and 8 to be TRUE in a new column called "twoconswins" and rows 7 and 8 to be TRUE in a new column called "threeconswins", and so on. What is the best way of doing this?
     id       date team teamscore opponent opponentscore home   win
9     9 2005-10-05  DET         5      STL             1    1  TRUE
38   38 2005-10-09  DET         6      CAL             3    1  TRUE
48   48 2005-10-10  DET         2      VAN             4    1 FALSE
88   88 2005-10-17  DET         3      SJS             2    1  TRUE
110 110 2005-10-21  DET         3      ANA             2    1  TRUE
148 148 2005-10-27  DET         5      CHI             2    1  TRUE
179 179 2005-11-01  DET         4      CHI             1    1  TRUE
194 194 2005-11-03  DET         3      EDM             4    1 FALSE
212 212 2005-11-05  DET         1      PHO             4    1 FALSE
I assumed row 1 should be the header, so rows 2, 5, 6 and 7 should actually evaluate to TRUE for "twoconswins", and rows 6 and 7 for "threeconswins".
You could do:
library(data.table)
df$twoconswins <- (df$win & shift(df$win, 1, NA)) == TRUE
df$threeconswins <- (df$win & shift(df$win, 1, NA) & shift(df$win, 2, NA)) == TRUE
I am thinking this could be more vectorized though, especially if 50 consecutive wins could be possible as well and you'd like to create columns for that as well.
If you would like to create the new columns automatically as well, in case, say, 500 consecutive wins occur, you could do this:
df <- read.table(text =
'id date team teamscore opponent opponentscore home win
9 9 2005-10-05 DET 5 STL 1 1 TRUE
38 38 2005-10-09 DET 6 CAL 3 1 TRUE
48 48 2005-10-10 DET 2 VAN 4 1 FALSE
88 88 2005-10-17 DET 3 SJS 2 1 TRUE
110 110 2005-10-21 DET 3 ANA 2 1 TRUE
148 148 2005-10-27 DET 5 CHI 2 1 TRUE
179 179 2005-11-01 DET 4 CHI 1 1 TRUE
194 194 2005-11-03 DET 3 EDM 4 1 FALSE
212 212 2005-11-05 DET 1 PHO 4 1 FALSE',
header = TRUE)
rles <- data.frame(values = rle(df$win)$values,
                   lengths = rle(df$win)$lengths)
maxconwins <- max(rles$lengths[rles$values])  # length of the longest winning streak
for(k in 1:maxconwins){
  lags <- seq_len(k)
  partialstring <- paste("shift(df$win,", lags, ",NA)", collapse = " & ")
  fullstring <- paste0("df$nr", k, "conswins <- (", partialstring, ") == TRUE")
  eval(parse(text = fullstring))
}
df[1:maxconwins,9:12][upper.tri(df[1:maxconwins,9:12], diag = TRUE)] <- NA
> df[,8:12]
win nr1conswins nr2conswins nr3conswins nr4conswins
9 TRUE NA NA NA NA
38 TRUE TRUE NA NA NA
48 FALSE TRUE TRUE NA NA
88 TRUE FALSE FALSE FALSE NA
110 TRUE TRUE FALSE FALSE FALSE
148 TRUE TRUE TRUE FALSE FALSE
179 TRUE TRUE TRUE TRUE FALSE
194 FALSE TRUE TRUE TRUE TRUE
212 FALSE FALSE FALSE FALSE FALSE
BTW, I only added the last line because (FALSE & TRUE & TRUE & NA) == TRUE evaluates to FALSE, while you would probably like these cells to be NA. I made sure of this here by setting the upper triangle of the corresponding submatrix to NA afterwards. For readability I manually added the column numbers 9 and 12 here, but you could derive those with a function as well if you'd like.
UPDATE:
When using the Reduce() function as suggested by Frank, you could do this for loop instead of the above:
for(k in 1:maxconwins){
  eval(parse(text = paste0("df$nr", k, "conswins <- (Reduce(`&`, shift(df$win, 1:", k, "))) == TRUE")))
}
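A loop-free sketch of the same idea that avoids eval(parse()), using Reduce() with accumulate = TRUE (assuming the nr{k}conswins naming pattern used above):
library(data.table)
lags <- shift(df$win, 1:maxconwins)           # list: win lagged by 1, 2, ..., maxconwins games
cums <- Reduce(`&`, lags, accumulate = TRUE)  # element k: all of the previous k games were wins
df[paste0("nr", 1:maxconwins, "conswins")] <- cums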

Apply a rule to calculate the sum of specific columns

Hi, I have a data set like this:
Num C Pr Value Volume
111 aa Alen 111 222
111 aa Paul 100 200
222 vv Iva 444 555
222 vv John 333 444
I would like to filter the data according to Num and add a new row that takes the sum of the Value and Volume columns, keeps the information in the Num and C columns, and puts Total in the Pr column. It should look like this:
Num C Pr Value Volume
222 vv Total 777 999
Could you suggest how to do it? I would like it only for Num 222.
When I try to use the res command I end up with this result:
# Num C Pr Value Volume
1: 111 aa Alen 111 222
2: 111 aa Paul 100 200
3: 111 aa Total NA NA
4: 222 vv Iva 444 555
5: 222 vv John 333 444
6: 222 vv Total NA NA
What causes this?
The structure of my data is the following:
'data.frame': 4 obs. of 5 variables:
$ Num : Factor w/ 2 levels "111","222": 1 1 2 2
$ C : Factor w/ 2 levels "aa","vv": 1 1 2 2
$ Pr : Factor w/ 4 levels "Alen","Iva","John",..: 1 4 2 3
$ Value : Factor w/ 4 levels "100","111","333",..: 2 1 4 3
$ Volume: Factor w/ 4 levels "200","222","444",..: 2 1 4 3
(The NAs in your attempt come from Value and Volume being stored as factors, as the str() output above shows; convert them to numeric first, e.g. with as.numeric(as.character(x)).) We could use data.table. We convert the 'data.frame' to 'data.table' (setDT(df1)), group by the 'Num' and 'C' columns, specify the columns to sum in .SDcols, loop over those columns with lapply to get the sum, and create the 'Pr' column. We can then rbind the original dataset with the new summarised output ('DT1') and order the result by 'Num'.
library(data.table)#v1.9.5+
DT1 <- setDT(df1)[,lapply(.SD, sum) , by = .(Num,C),
.SDcols=Value:Volume][,Pr:='Total'][]
rbind(df1, DT1)[order(Num)]
# Num C Pr Value Volume
#1: 111 aa Alen 111 222
#2: 111 aa Paul 100 200
#3: 111 aa Total 211 422
#4: 222 vv Iva 444 555
#5: 222 vv John 333 444
#6: 222 vv Total 777 999
This can be done using base R methods as well. We get the sum of the 'Value' and 'Volume' columns grouped by 'Num' and 'C' using the formula method of aggregate, transform the output by creating the 'Pr' column, rbind with the original dataset, and order the output ('res') by 'Num'.
res <- rbind(df1,transform(aggregate(.~Num+C, df1[-3], FUN=sum), Pr='Total'))
res[order(res$Num),]
# Num C Pr Value Volume
#1 111 aa Alen 111 222
#2 111 aa Paul 100 200
#5 111 aa Total 211 422
#3 222 vv Iva 444 555
#4 222 vv John 333 444
#6 222 vv Total 777 999
EDIT: Noticed that the OP mentioned filter. If this is for a single 'Num', we subset the data and then do the aggregate and transform steps.
transform(aggregate(.~Num+C, subset(df1, Num==222)[-3], FUN=sum), Pr='Total')
# Num C Value Volume Pr
#1 222 vv 777 999 Total
Or we may not need aggregate. After subsetting the data, we convert 'Num' to a factor, loop through the columns of the output dataset ('df2'), take the sum if the column is numeric and the first element otherwise, and wrap the result in data.frame.
df2 <- transform(subset(df1, Num==222), Num=factor(Num))
data.frame(c(lapply(df2[-3], function(x) if(is.numeric(x))
sum(x) else x[1]), Pr='Total'))
# Num C Value Volume Pr
#1 222 vv 777 999 Total
data
df1 <- structure(list(Num = c(111L, 111L, 222L, 222L), C = c("aa", "aa",
"vv", "vv"), Pr = c("Alen", "Paul", "Iva", "John"), Value = c(111L,
100L, 444L, 333L), Volume = c(222L, 200L, 555L, 444L)), .Names = c("Num",
"C", "Pr", "Value", "Volume"), class = "data.frame",
row.names = c(NA, -4L))
Or using dplyr:
library(dplyr)
df1 %>%
filter(Num == 222) %>%
summarise(Value = sum(Value),
Volume = sum(Volume),
Pr = 'Total',
Num = Num[1],
C = C[1])
# Value Volume Pr Num C
# 1 777 999 Total 222 vv
where we first filter to keep only Num == 222, and then use summarise to obtain the sums and the values for Num and C. This assumes that:
You do not want to get the result for each unique Num (I select one here, you could select multiple). If you need this, use group_by.
There is only ever one C for every unique Num.
You can also use the dplyr package:
df %>%
filter(Num == 222) %>%
group_by(Num, C) %>%
summarise(
Pr = "Total"
, Value = sum(Value)
, Volume = sum(Volume)
) %>%
rbind(df, .)
# Num C Pr Value Volume
# 1 111 aa Alen 111 222
# 2 111 aa Paul 100 200
# 3 222 vv Iva 444 555
# 4 222 vv John 333 444
# 5 222 vv Total 777 999
If you want the total for each Num value, just comment out the filter line; a sketch follows.
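For reference, a sketch of that every-Num variant (drop the filter, bind the totals back, and order by Num; .groups = "drop" assumes dplyr 1.0 or later):
library(dplyr)
df %>%
  group_by(Num, C) %>%
  summarise(Pr = "Total",
            Value = sum(Value),
            Volume = sum(Volume),
            .groups = "drop") %>%
  bind_rows(df, .) %>%
  arrange(Num)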

Looping with filter (dplyr) through multiple columns with multiple values to match

I have the following data set, 'data':
A1 A2 A3 A4 A5
case1 1234 2234 223 4455 7334
case2 1254 2234 200
case3 1234 9234
case4 1274 5234 228 4655 2534
case5 1234 2634 273 7455
Each row represents a case and each column represents a code attribute for that case, where each case can have multiple codes or no codes at all. No case can have multiple instances of the same code.
Also important is that each code can be either 3 or 4 digits, and the 4-digit codes are children of the 3-digit codes. For instance, '123' would be a broader code than '1234'; the trailing '4' indicates a more specific type of '123' code.
My original attempt for pulling out cases that have the code '1234' was
dataf <- filter(data, A1:A5 == 1234)
but the statement attempted to evaluate with an AND rather than an OR, which is not what I'm looking for.
So currently I am filtering like so:
dataf <- filter(data, A1 == 1234 | A2 == 1234 | A3 == 1234 | A4 == 1234 | A5 == 1234)
which works fine for one specific code, but the amount of typing needed to filter on more codes would be prohibitive.
Ultimately, what I would like to do is query all columns A1:A5 for any cases that have codes matching any values in the ranges 123:150 or 1230:1500, as querying both ranges will return the cases that match any instances of both parent codes and child codes.
You could try something along the lines of:
x <- c(123:150, 1230:1500)
filter(df, rowSums(mutate_each(df, funs(. %in% x))) >= 1L)
# case A1 A2 A3 A4 A5
#1 case1 1234 2234 223 4455 7334
#2 case2 1254 2234 200 NA NA
#3 case3 1234 9234 NA NA NA
#4 case4 1274 5234 228 4655 2534
#5 case5 1234 2634 273 7455 NA
If you do that for the example you provided (using only 1234 as the value to be checked), it results in what you show above:
filter(df, rowSums(mutate_each(df, funs(. %in% 1234))) >= 1L)
# case A1 A2 A3 A4 A5
#1 case1 1234 2234 223 4455 7334
#2 case3 1234 9234 NA NA NA
#3 case5 1234 2634 273 7455 NA
The inner function produces a data.frame of logical values:
mutate_each(df, funs(. %in% x))
# case A1 A2 A3 A4 A5
#1 FALSE TRUE FALSE FALSE FALSE FALSE
#2 FALSE TRUE FALSE FALSE FALSE FALSE
#3 FALSE TRUE FALSE FALSE FALSE FALSE
#4 FALSE TRUE FALSE FALSE FALSE FALSE
#5 FALSE TRUE FALSE FALSE FALSE FALSE
Which is (almost) the same as when using sapply
sapply(df, function(.) . %in% x)
# case A1 A2 A3 A4 A5
#[1,] FALSE TRUE FALSE FALSE FALSE FALSE
#[2,] FALSE TRUE FALSE FALSE FALSE FALSE
#[3,] FALSE TRUE FALSE FALSE FALSE FALSE
#[4,] FALSE TRUE FALSE FALSE FALSE FALSE
#[5,] FALSE TRUE FALSE FALSE FALSE FALSE
And then you use rowSums to check for each row if at least one value is TRUE to subset the data accordingly.
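On current dplyr versions mutate_each() is deprecated; if_any() (dplyr >= 1.0) expresses the same row-wise OR directly. A sketch under the same assumptions (code columns A1:A5, target values in x):
library(dplyr)
x <- c(123:150, 1230:1500)
# keep rows where the predicate holds in at least one of A1:A5
filter(df, if_any(A1:A5, ~ .x %in% x))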
I think this is a data format problem, so first let's get the data into the right format:
df <- read.table(text = ' A1 A2 A3 A4 A5
case1 1234 2234 223 4455 7334
case2 1254 2234 200 NA NA
case3 1234 9234 NA NA NA
case4 1274 5234 228 4655 2534
case5 1234 2634 273 7455 NA')
library(dplyr)
library(tidyr)
newdf <- df %>%
mutate(case = rownames(df)) %>%
gather(Anum, code, -case)
> head(newdf)
# case Anum code
#1 case1 A1 1234
#2 case2 A1 1254
#3 case3 A1 1234
#4 case4 A1 1274
#5 case5 A1 1234
#6 case1 A2 2234
Now you can run filter on just the one column:
filtdf <- filter(newdf, code == 1234)
# case Anum code
#1 case1 A1 1234
#2 case3 A1 1234
#3 case5 A1 1234
To cover both of your code ranges (the parent range 123:150 and the child range 1230:1500), filter on each range:
filter(newdf, (code >= 123 & code <= 150) | (code >= 1230 & code <= 1500))
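As an aside, gather() is superseded in current tidyr; a sketch of the same reshaping with pivot_longer() (tidyr >= 1.0):
library(dplyr)
library(tidyr)
newdf <- df %>%
  mutate(case = rownames(df)) %>%
  pivot_longer(A1:A5, names_to = "Anum", values_to = "code")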

How to find NAs in groups and create new column for data frame

I have a data frame consisting of an "ID" column and a "Diff" column. The ID column is responsible for marking groups of corresponding Diff values.
An example looks like this:
structure(list(ID = c(566, 566, 789, 789, 789, 487, 487, 11,
11, 189, 189), Diff = c(100, 277, 529, 43, NA, 860, 780, 445,
NA, 578, 810)), .Names = c("ID", "Diff"), row.names = c(9L, 10L,
20L, 21L, 22L, 25L, 26L, 51L, 52L, 62L, 63L), class = "data.frame")
My goal is to search each group for NAs in the Diff column and create a new column that has either a TRUE or FALSE value for each row, depending on whether the corresponding group has an NA in Diff.
I tried
x <- aggregate(Diff ~ ID, data, is.na)
and
y <- aggregate(Diff ~ ID, data, function(x) any(is.na(x)))
The idea was to merge the result back in by ID. However, neither of the above produced a useful result. I know R can do it … and after searching for quite a while I ask you how :)
You can use plyr and its ddply function:
require(plyr)
ddply(data, .(ID), transform, na_diff = any(is.na(Diff)))
## ID Diff na_diff
## 1 11 445 TRUE
## 2 11 NA TRUE
## 3 189 578 FALSE
## 4 189 810 FALSE
## 5 487 860 FALSE
## 6 487 780 FALSE
## 7 566 100 FALSE
## 8 566 277 FALSE
## 9 789 529 TRUE
## 10 789 43 TRUE
## 11 789 NA TRUE
A very similar solution to @dickoa's, except in base R:
do.call(rbind, by(data, data$ID, function(x) transform(x, na_diff = any(is.na(Diff)))))
# ID Diff na_diff
# 11.51 11 445 TRUE
# 11.52 11 NA TRUE
# 189.62 189 578 FALSE
# 189.63 189 810 FALSE
# 487.25 487 860 FALSE
# 487.26 487 780 FALSE
# 566.9 566 100 FALSE
# 566.10 566 277 FALSE
# 789.20 789 529 TRUE
# 789.21 789 43 TRUE
# 789.22 789 NA TRUE
Similarly, you could avoid transform with:
data$na_diff<-with(data,by(Diff,ID,function(x) any(is.na(x)))[as.character(ID)])
(You have two viable strategies already, but here is another which may be conceptually easier to follow if you are relatively new to R and aren't familiar with the way plyr works.)
I often need to know how many NAs I have in different variables, so here is a convenience function I use as standard:
sna <- function(x){
sum(is.na(x))
}
From there, I sometimes use aggregate(), but sometimes I find ?summaryBy in the doBy package to be more convenient. Here's an example:
library(doBy)
z <- summaryBy(Diff~ID, data=my.data, FUN=sna)
z
ID Diff.sna
1 11 1
2 189 0
3 487 0
4 566 0
5 789 1
After this, you just need to use ?merge and convert the count of NAs to a logical to get your final data frame:
my.data <- merge(my.data, z, by="ID")
my.data$Diff.sna <- my.data$Diff.sna>0
my.data
ID Diff Diff.sna
1 11 445 TRUE
2 11 NA TRUE
3 189 578 FALSE
4 189 810 FALSE
5 487 860 FALSE
6 487 780 FALSE
7 566 100 FALSE
8 566 277 FALSE
9 789 529 TRUE
10 789 43 TRUE
11 789 NA TRUE
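For completeness, the same flag can be built without a merge step, either with a grouped mutate in dplyr or with base ave(); a minimal sketch:
library(dplyr)
# any(is.na(Diff)) is computed once per ID group and recycled to every row
my.data %>%
  group_by(ID) %>%
  mutate(Diff.sna = any(is.na(Diff))) %>%
  ungroup()
# base R equivalent: apply any() to is.na(Diff) within each ID group
my.data$Diff.sna <- ave(is.na(my.data$Diff), my.data$ID, FUN = any)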

Identifying sequences of repeated numbers in R

I have a long time series where I need to identify and flag sequences of repeated values. Here's some data:
DATETIME WDIR
1 40360.04 22
2 40360.08 23
3 40360.12 126
4 40360.17 126
5 40360.21 126
6 40360.25 126
7 40360.29 25
8 40360.33 26
9 40360.38 132
10 40360.42 132
11 40360.46 132
12 40360.50 30
13 40360.54 132
14 40360.58 35
So if I need to note when a value is repeated three or more times, I have a sequence of four '126' and a sequence of three '132' that need to be flagged.
I'm very new to R. I expect I use cbind to create a new column in this array with a "T" in the corresponding rows, but how to populate the column correctly is a mystery. Any pointers please? Thanks a bunch.
As Ramnath says, you can use rle.
rle(dat$WDIR)
Run Length Encoding
lengths: int [1:9] 1 1 4 1 1 3 1 1 1
values : int [1:9] 22 23 126 25 26 132 30 132 35
rle returns an object with two components, lengths and values. We can use the lengths component to build a new column that identifies which values are repeated three or more times.
tmp <- rle(dat$WDIR)
rep(tmp$lengths >= 3,times = tmp$lengths)
[1] FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE
This will be our new column:
newCol <- rep(tmp$lengths >= 3, times = tmp$lengths)
cbind(dat,newCol)
DATETIME WDIR newCol
1 40360.04 22 FALSE
2 40360.08 23 FALSE
3 40360.12 126 TRUE
4 40360.17 126 TRUE
5 40360.21 126 TRUE
6 40360.25 126 TRUE
7 40360.29 25 FALSE
8 40360.33 26 FALSE
9 40360.38 132 TRUE
10 40360.42 132 TRUE
11 40360.46 132 TRUE
12 40360.50 30 FALSE
13 40360.54 132 FALSE
14 40360.58 35 FALSE
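Equivalently, you can overwrite the run values with the per-run test and expand it back with inverse.rle(); a small sketch:
tmp <- rle(dat$WDIR)
tmp$values <- tmp$lengths >= 3  # replace each run's value with its flag
dat$newCol <- inverse.rle(tmp)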
Use rle to do the job!! It is an amazing function that calculates the number of successive repetitions of numbers in a sequence. Here is some example code on how you can use rle to flag the miscreants in your data. It returns all rows from the data frame whose WDIR value occurs in a run of 3 or more repeats:
runs = rle(mydf$WDIR)
subset(mydf, WDIR %in% runs$values[runs$lengths >= 3])
Be aware, though, that %in% matches on the value alone, so isolated occurrences of a value that also appears in a long run elsewhere (the single 132 in row 13 here) are flagged as well.
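A run-aware sketch of the same idea, building the row mask from the run lengths themselves rather than matching on values:
runs <- rle(mydf$WDIR)
# expand the per-run test (3+ repeats) to one flag per row, then subset
mydf[rep(runs$lengths >= 3, times = runs$lengths), ]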
Two options for you.
Assuming the data is loaded:
dat <- read.table(textConnection("
DATETIME WDIR
40360.04 22
40360.08 23
40360.12 126
40360.17 126
40360.21 126
40360.25 126
40360.29 25
40360.33 26
40360.38 132
40360.42 132
40360.46 132
40360.50 30
40360.54 132
40360.58 35"), header=T)
Option 1: Sorting
dat <- dat[order(dat$WDIR),] # needed for the 'repeats' to be pasted into the correct rows in next step
dat$count <- rep(table(dat$WDIR),table(dat$WDIR))
dat$more4 <- dat$count >= 4
dat <- dat[order(dat$DATETIME),] # sort back to original order
dat
Option 2: One-liner
dat$more4 <- dat$WDIR %in% names(which(table(dat$WDIR) > 3))
dat
I thought, being a new user, that option 1 might be an easier step-by-step approach, although the rep(table(), table()) may not be intuitive initially. Note that both options count the total occurrences of each value rather than consecutive repeats, so the lone 132 in row 13 is flagged too; use the rle-based approaches above if the repeats must be strictly consecutive.
