Understanding R's anyDuplicated

Quick question on understanding R's anyDuplicated: when passed a data frame (let's say x, y, z columns with 1k observations), will it check whether any of the rows has the exact same x, y, z values as another row in the same data frame? Thanks

I would use duplicated and combine its results from the front and from the back.
mydf <- data.frame(x = c(1:3,1,1), y = c(3:5,3,3))
mydf
# x y
# 1 1 3
# 2 2 4
# 3 3 5
# 4 1 3
# 5 1 3
There are three duplicated rows: 1, 4, and 5. But duplicated will only mark the later repeats, not the original occurrence as well.
duplicated(mydf)
#[1] FALSE FALSE FALSE TRUE TRUE
duplicated(mydf, fromLast = TRUE)
#[1] TRUE FALSE FALSE TRUE FALSE
Using fromLast = TRUE looks from the end toward the front, which marks the original occurrence as well. (By the way, I will ask the R core team to add a unified function to do both.)
myduplicates <- duplicated(mydf) | duplicated(mydf, fromLast = TRUE)
Saving the expression as a variable allows us to count and subset later.
sum(myduplicates)
#[1] 3
mydf[myduplicates,]
# x y
# 1 1 3
# 4 1 3
# 5 1 3
mydf[!myduplicates,]
# x y
# 2 2 4
# 3 3 5
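To answer the original question directly: anyDuplicated() does compare whole rows, and instead of returning a logical vector it returns the index of the first row that duplicates an earlier one (or 0 if there is none), which makes it a cheap existence check:
anyDuplicated(mydf)
# [1] 4
anyDuplicated(mydf) > 0   # TRUE means the data frame contains at least one duplicated row
# [1] TRUE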

How to add dummy variables to data with specific characteristic

My question is probably quite basic but I've been struggling with it so I'd be really grateful if someone could offer a solution.
I have data in the following format:
ORG_NAME  var_1_12  var_1_13  var_1_14
A         12        11        5
B         13        13        11
C         6         7         NA
D         NA        NA        5
I have data on organizations over 5 years, but over that time, some organizations have merged and others have disappeared. I'm planning on conducting a fixed-effects regression, so I need to add a dummy variable which is "0" when organizations have remained the same (in this case row A and row B), and "1" in the year before the merge, and after the merge. In this case, I know that orgs C and D merged, so I would like for the data to look like this:
ORG_NAME  var_1_12  dum_12  var_1_13  dum_13
A         12        0       5         0
B         13        0       11        0
C         6         1       NA        1
D         NA        1       5         1
How would I code this?
This approach (as would any, given your description) depends entirely on the merging companies being in consecutive rows.
mtx <- apply(is.na(dat[, -1]), MARGIN = 2,
             function(vec) zoo::rollapply(vec, 2, function(z) xor(z[1], z[2]),
                                          fill = FALSE))
mtx
# var_1_12 var_1_13 var_1_14
# [1,] FALSE FALSE FALSE
# [2,] FALSE FALSE TRUE
# [3,] TRUE TRUE TRUE
# [4,] FALSE FALSE FALSE
out <- rowSums(mtx) == ncol(mtx)
out
# [1] FALSE FALSE TRUE FALSE
out | c(FALSE, out[-length(out)])
# [1] FALSE FALSE TRUE TRUE
### and your 0/1 numbers, if logical isn't right for you
+(out | c(FALSE, out[-length(out)]))
# [1] 0 0 1 1
Brief walk-through:
is.na(dat[,-1]) returns a matrix of whether the values (excluding the first column) are NA; because it's a matrix, we use apply with MARGIN = 2 to call a function on each column;
zoo::rollapply does a rolling calculation on a portion ("window") of the vector at a time, in this case 2 wide: given 1:5, it first looks at c(1,2), then c(2,3), then c(3,4), etc. (see the tiny illustration after this list);
xor is an eXclusive OR, meaning it is TRUE when exactly one of its two arguments is TRUE and the other is FALSE;
mtx is a matrix indicating that a cell and the one below it meet the condition (one is NA, the other is not); we then check which of these rows are all TRUE, forming out;
since we need a 1 in both the last pre-merge row and the row after it, we vector-OR out with a shifted copy of itself (note the | in the code above) to produce your intended output.
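To make the 2-wide window concrete, here is a tiny toy illustration (my own example input, not part of the original answer); fill = NA pads the result back to the input's length, with the padding landing at the end under the same alignment seen in the matrix above:
zoo::rollapply(c(1, 2, 4, 8), 2, sum, fill = NA)
# [1]  3  6 12 NA    # windows: (1,2) (2,4) (4,8)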
If I understand correctly, you want to code rows with at least one NA as "1". If so, you just need one dummy variable for all the years, right? Something like this:
set.seed(4)
df <- data.frame(org=as.factor(LETTERS[1:5]),y1=sample(c(1:4,NA),5),y2=sample(c(3:6,NA),5),y3=sample(c(2:5,NA),5))
df$dummy <- as.numeric(apply(df, 1, function(x)any(is.na(x))))
which gives you
org y1 y2 y3 dummy
1 A 3 5 3 0
2 B NA 4 5 1
3 C 4 3 2 0
4 D 1 6 NA 1
5 E 2 NA 4 1
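As a side note, a vectorized alternative of my own (not from the answer above) gets the same dummy without the row-wise apply(), using rowSums() to count NAs per row:
df$dummy <- as.integer(rowSums(is.na(df[, c("y1", "y2", "y3")])) > 0)   # 1 if any year is NA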

Assigning vector elements a value associated with preceding matching value [duplicate]

This question already has answers here: Calculating cumulative sum for each row (6 answers) and Sum of previous rows in a column R (1 answer).
I have a vector of alternating TRUE and FALSE values:
dat <- c(T,F,F,T,F,F,F,T,F,T,F,F,F,F)
I'd like to number each instance of TRUE with a unique sequential number and to assign each FALSE value the number associated with the TRUE value preceding it.
Therefore, my desired output using the example dat above (which has 4 TRUE values) is:
1 1 1 2 2 2 2 3 3 4 4 4 4 4
What I tried:
I've tried the following (which works), but I know there must be a simpler solution!
whichT <- which(dat==T)
whichF <- which(dat==F)
l1 <- lapply(1:length(whichT),
             FUN = function(x)
               which(whichF > whichT[x] & whichF < whichT[(x + 1)]))
l1[[length(l1)]] <- which(whichF > whichT[length(whichT)])
replaceFs <- unlist(
  lapply(1:length(whichT),
         function(x) l1[[x]] <- rep(x, length(l1[[x]]))))
replaceTs <- 1:length(whichT)
dat2 <- dat
dat2[whichT] <- replaceTs
dat2[whichF] <- replaceFs
dat2
[1] 1 1 1 2 2 2 2 3 3 4 4 4 4 4
I need a simpler and quicker solution because my real data set is 181k rows long!
Base R solutions preferred, but any solution works
cumsum(dat) will do what you want. When used in arithmetic, TRUE gets converted to 1 and FALSE to 0, so taking the cumulative sum adds 1 every time you see a TRUE and adds nothing for a FALSE, which is exactly what you want.
dat <- c(T,F,F,T,F,F,F,T,F,T,F,F,F,F)
cumsum(dat)
# [1] 1 1 1 2 2 2 2 3 3 4 4 4 4 4
Instead of doing the indexing, this can be done easily with cumsum from base R. Here TRUE/FALSE gets coerced to 1/0, and as we take the cumulative sum, the running total increments by 1 wherever there is a 1.
cumsum(dat)
#[1] 1 1 1 2 2 2 2 3 3 4 4 4 4 4
cumsum() is the most straightforward way; however, you can also do:
Reduce("+", dat, accumulate = TRUE)
[1] 1 1 1 2 2 2 2 3 3 4 4 4 4 4
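Given the 181k-row concern: cumsum() is compiled, vectorized code and is effectively instantaneous at that scale, while the Reduce() form loops at the R level and is typically much slower. A rough sketch to see this yourself (illustrative only; timings vary by machine):
big <- sample(c(TRUE, FALSE), 181000, replace = TRUE)
system.time(cumsum(big))                            # vectorized: near-zero elapsed time
system.time(Reduce("+", big, accumulate = TRUE))    # R-level loop: noticeably slower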

Customer latency in R [duplicate]

This question already has answers here: Remove duplicated rows (10 answers).
New to R, doing customer latency.
In the dataset I have around 300,000 rows with 15 columns. Some relevant columns are "Account", "Account Open Date", "Shipment pick up date", etc.
Account numbers are repeated, and I just want the rows where an account number is recorded for the first time, not the subsequent rows.
For example, acc # 610829952 is in the first row as well as in the 5th row, 6th row, etc. I need to keep the first row alone, and I need to do this for all the account numbers.
I am not sure how to do this. Could someone please help me with this?
There is a function in R called duplicated(). It allows you to check whether a certain value, like your account number, has already been recorded.
First, check in the relevant account column which account numbers have appeared before, using duplicated(). You will get a TRUE/FALSE vector, with TRUE indicating that the corresponding value has already appeared. With that information, you can index your data.frame to retrieve only the rows you are interested in. I will assume your data looks like df below:
df <- data.frame(account = sample(1:5, 20, replace = TRUE),
                 segment = sample(LETTERS, 20, replace = TRUE))
# account segment
# 1 3 N
# 2 2 V
# 3 4 T
# 4 4 Y
# 5 4 M
# 6 4 E
# 7 5 H
# 8 3 A
# 9 3 J
# 10 3 Y
# 11 4 R
# 12 5 O
# 13 4 O
# 14 1 R
# 15 5 U
# 16 2 Q
# 17 5 F
# 18 2 J
# 19 4 E
# 20 2 H
inds <- duplicated(df$account)
# [1] FALSE FALSE FALSE TRUE TRUE TRUE FALSE TRUE TRUE TRUE
# [11] TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
df <- df[!inds, ]
# account segment
# 1 3 N
# 2 2 V
# 3 4 T
# 7 5 H
# 14 1 R
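If you happen to use the tidyverse, an equivalent one-liner (an alternative I'm adding, not part of the answer above) is dplyr::distinct(), which keeps the first row for each account:
library(dplyr)
distinct(df, account, .keep_all = TRUE)   # .keep_all = TRUE retains the other columns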

Smartest way to check if an observation in data.frame(x) exists also in data.frame(y) and populate a new column according with the result

Having two dataframes:
x <- data.frame(numbers=c('1','2','3','4','5','6','7','8','9'), coincidence="NA")
and
y <- data.frame(numbers=c('1','3','10'))
How can I check whether the observations in y (1, 3, and 10) also exist in x, and accordingly fill the column x["coincidence"] (for example with YES/NO or TRUE/FALSE)?
I would do the same in Excel with a formula combining IFERROR and VLOOKUP, but I don't know how to do the same with R.
Note:
I am open to changing the data.frames to tables or to using libraries. The data frame with the numbers to check (y) will never have more than 10-20 observations, while the other one (x) will never have more than 1K observations. Therefore, I could also iterate with an if, if necessary.
We can create the vector matching the desired output with a set membership test that outputs boolean TRUE and FALSE values where appropriate. The operator %in% is a binary operator that compares each value on the left-hand side to the set of values on the right:
x$coincidence <- x$numbers %in% y$numbers
# numbers coincidence
# 1 1 TRUE
# 2 2 FALSE
# 3 3 TRUE
# 4 4 FALSE
# 5 5 FALSE
# 6 6 FALSE
# 7 7 FALSE
# 8 8 FALSE
# 9 9 FALSE
Do the numbers have to be factors, as you've set them up? (They're not numeric, but character.) If not, it's easy:
x <- data.frame(numbers=c('1','2','3','4','5','6','7','8','9'), coincidence="NA", stringsAsFactors=FALSE)
y <- data.frame(numbers=c('1','3','10'), stringsAsFactors=FALSE)
x$coincidence[x$numbers %in% y$numbers] <- TRUE
> x
numbers coincidence
1 1 TRUE
2 2 NA
3 3 TRUE
4 4 NA
5 5 NA
6 6 NA
7 7 NA
8 8 NA
9 9 NA
If they need to be factors, then you'll need to either set common levels or use as.character().
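For instance, a minimal sketch of the as.character() route (using the x and y from above):
# coercing both sides to character makes the comparison immune to factor-level mismatches
x$coincidence <- as.character(x$numbers) %in% as.character(y$numbers)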

Group index from column labeling the last element in each group

I'm trying to subset a data frame. The data frame is to be broken into subsets, where the last element in each subset has a "TRUE" value in the "bool" column. Consider the following data frame:
df <- data.frame(c(3,1,3,4,1,1,4), rnorm(7))
df <- cbind(df, df[,1] != 1)
names(df) <- c("ind", "var", "bool")
df
# ind var bool
# 1 3 0.02343906 TRUE
# 2 1 0.94786193 FALSE
# 3 3 0.50632766 TRUE
# 4 4 0.24655548 TRUE
# 5 1 -1.58103304 FALSE
# 6 1 0.73999468 FALSE
# 7 4 0.10929906 TRUE
Row 1 should be a subset, rows 2 and 3 should be a subset, row 4 a subset and then rows 5 through 7 a subset. The code I have below works (I can subset on the new column), but I was wondering if there was a more "R" way of doing it.
index <- 1
for (i in 1:nrow(df)) {
  if (df$bool[i]) {
    df$index[i] <- index
    index <- index + 1
  } else {
    df$index[i] <- index
  }
}
df
# ind var bool index
# 1 3 0.02343906 TRUE 1
# 2 1 0.94786193 FALSE 2
# 3 3 0.50632766 TRUE 2
# 4 4 0.24655548 TRUE 3
# 5 1 -1.58103304 FALSE 4
# 6 1 0.73999468 FALSE 4
# 7 4 0.10929906 TRUE 4
The first thought I would have would be to use the cumulative sum (cumsum) on the bool column to get the group indices -- this will increase the index value by 1 every time the bool value is TRUE:
df$index <- cumsum(df$bool)
df
# ind var bool index
# 1 3 -1.0712125 TRUE 1
# 2 1 0.4994369 FALSE 1
# 3 3 2.1335274 TRUE 2
# 4 4 -1.5950432 TRUE 3
# 5 1 0.5919880 FALSE 3
# 6 1 2.7039831 FALSE 3
# 7 4 -1.3526646 TRUE 4
This is not quite right, because the FALSE observations before each group's closing TRUE are assigned to the previous group's index. We can fix that by adding 1 for all the observations where bool is FALSE:
df$index <- cumsum(df$bool) + !df$bool
df
# ind var bool index
# 1 3 -1.0712125 TRUE 1
# 2 1 0.4994369 FALSE 2
# 3 3 2.1335274 TRUE 2
# 4 4 -1.5950432 TRUE 3
# 5 1 0.5919880 FALSE 4
# 6 1 2.7039831 FALSE 4
# 7 4 -1.3526646 TRUE 4
Splitting the data frame into a list of subsets can now be achieved efficiently with subsets <- split(df, df$index).
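For example, continuing with the df above:
subsets <- split(df, df$index)
sapply(subsets, nrow)
# 1 2 3 4
# 1 2 1 3
subsets[["2"]]   # rows 2 and 3: the group closed by the TRUE in row 3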
