How does this odd/even extraction work in R?

> a <- sample(c(1:10), 20, replace = TRUE)
> a
[1] 6 3 6 2 6 9 3 9 9 8 2 10 7 9 1 5 3 10 5 5
> a[c(TRUE,FALSE)]
[1] 6 6 6 3 9 2 7 1 3 5
Why does a[c(TRUE,FALSE)] give me the ODD-indexed elements of my vector? c(TRUE, FALSE) has length 2, so I expected it to give me just the single index 1, which is TRUE.
Why does it work this way?

Logical subsets are recycled to match the length of the vector (numerical subsets are not recycled).
From help("["):
Arguments
i, j, …
...
For [-indexing only: i, j, … can be logical vectors,
indicating elements/slices to select. Such vectors are recycled if
necessary to match the corresponding extent. i, j, … can also be
negative integers, indicating elements/slices to leave out of the
selection.
When indexing arrays by [ a single argument i can be a matrix with
as many columns as there are dimensions of x; the result is then a
vector with elements corresponding to the sets of indices in each row
of i.
To illustrate, try:
cbind.data.frame(x = 1:10, odd = c(TRUE, FALSE), even = c(FALSE, TRUE))
# x odd even
# 1 1 TRUE FALSE
# 2 2 FALSE TRUE
# 3 3 TRUE FALSE
# 4 4 FALSE TRUE
# 5 5 TRUE FALSE
# 6 6 FALSE TRUE
# 7 7 TRUE FALSE
# 8 8 FALSE TRUE
# 9 9 TRUE FALSE
# 10 10 FALSE TRUE
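You can also see the recycling directly by comparing the logical subset with an explicit odd-index subset (a quick check on a small vector, not part of the original answer):

```r
a <- c(6, 3, 6, 2, 6, 9, 3, 9, 9, 8)

# c(TRUE, FALSE) is recycled to length(a), selecting positions 1, 3, 5, ...
identical(a[c(TRUE, FALSE)], a[seq(1, length(a), by = 2)])
# [1] TRUE
```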

a[TRUE] gives you all the elements and a[FALSE] gives none. For a[c(TRUE,FALSE)], the index vector of length(c(TRUE,FALSE)), which is 2, is recycled to length(a), which is 20, so it becomes TRUE, FALSE, TRUE, FALSE, ..., and you get just the odd indexes.

Related

List creation with incremental values triggered by logical vector in R

I am trying to create a vector of ordered integers of the same length as a vector of logicals. The vector starts at 1 and keeps that value until a TRUE is encountered, which triggers an increase at the next element, and so on until all logicals are parsed.
I've got this working in a for loop, but I need to repeat the process many times over different groups and am wondering if something more efficient is recommended.
dat <- c(TRUE, FALSE, TRUE, TRUE, FALSE,
TRUE, TRUE, TRUE, TRUE, FALSE,
TRUE, FALSE, TRUE, FALSE, TRUE,
FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)
counter <- 1
output <- c()
for (i in dat) {
output <- c(output, counter)
if(i == T)
counter <- counter + 1
}
output
[1] 1 2 2 3 4 4 5 6 7 8 8 9 9 10 10 11 11 11 12 13 14
Thanks.
Update
Per Stéphane's comment - c(1, 1+cumsum(dat))[1:length(dat)] works great, thanks!
We can use cumsum with an additional FALSE value, since we want to increment the counter after we see a TRUE value, not at the TRUE value itself.
cumsum(c(F, dat)) + 1
#[1] 1 2 2 3 4 4 5 6 7 8 8 9 9 10 10 11 11 11 12 13 14 15
Since we have added an extra FALSE value, we need to remove the last value, as we would otherwise get n+1 entries.
x <- cumsum(c(F, dat)) + 1
x[-length(x)]
#[1] 1 2 2 3 4 4 5 6 7 8 8 9 9 10 10 11 11 11 12 13 14
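Both vectorized forms give the same result as the original loop; a quick check, reusing dat from the question:

```r
dat <- c(TRUE, FALSE, TRUE, TRUE, FALSE,
         TRUE, TRUE, TRUE, TRUE, FALSE,
         TRUE, FALSE, TRUE, FALSE, TRUE,
         FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)

v1 <- c(1, 1 + cumsum(dat))[1:length(dat)]  # Stéphane's one-liner
x  <- cumsum(c(FALSE, dat)) + 1             # the answer above
v2 <- x[-length(x)]

identical(v1, v2)
# [1] TRUE
```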

Find all the duplicate records using duplicated() on R [duplicate]

This question already has answers here:
Finding ALL duplicate rows, including "elements with smaller subscripts"
(9 answers)
Closed 4 years ago.
I have one question in R.
I have the following example code for a question.
> exdata <- data.frame(a = rep(1:4, each = 3),
+ b = c(1, 1, 2, 4, 5, 3, 3, 2, 3, 9, 9, 9))
> exdata
a b
1 1 1
2 1 1
3 1 2
4 2 4
5 2 5
6 2 3
7 3 3
8 3 2
9 3 3
10 4 9
11 4 9
12 4 9
> exdata[duplicated(exdata), ]
a b
2 1 1
9 3 3
11 4 9
12 4 9
I tried to use the duplicated() function to find all the duplicate records in the exdata dataframe, but it only finds some of the duplicated records, so it is difficult to confirm at a glance whether duplicates exist.
I'm looking for a solution that returns the following results
a b
1 1 1
2 1 1
7 3 3
9 3 3
10 4 9
11 4 9
12 4 9
Can use the duplicated() function to find the right solution?
Or is there a way to use another function?
I would appreciate your help.
duplicated returns a logical vector the same length as its argument, with TRUE marking each element that has already occurred earlier. It has a method for data frames, duplicated.data.frame, that looks for duplicated rows (and so returns a logical vector of length nrow(exdata)). Subsetting with that logical vector returns exactly those rows that have occurred at least once before. It WON'T, however, return the first occurrence of those rows.
Look at the index vector you're using:
duplicated(exdata)
# [1] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE TRUE
But you can combine it with fromLast = TRUE to get all of the occurrences of these rows:
exdata[duplicated(exdata) | duplicated(exdata, fromLast = TRUE),]
# a b
# 1 1 1
# 2 1 1
# 7 3 3
# 9 3 3
# 10 4 9
# 11 4 9
# 12 4 9
Look at the logical vector for duplicated(exdata, fromLast = TRUE), and its combination with duplicated(exdata), to convince yourself:
duplicated(exdata, fromLast = TRUE)
# [1] TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE TRUE FALSE
duplicated(exdata) | duplicated(exdata, fromLast = TRUE)
# [1] TRUE TRUE FALSE FALSE FALSE FALSE TRUE FALSE TRUE TRUE TRUE TRUE
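An alternative (my addition, not from the original answer) is to count the rows in each (a, b) group with ave() and keep the rows whose group size is greater than one; this selects the same rows:

```r
exdata <- data.frame(a = rep(1:4, each = 3),
                     b = c(1, 1, 2, 4, 5, 3, 3, 2, 3, 9, 9, 9))

# ave() replaces each row's value with FUN applied to its (a, b) group,
# so FUN = length yields the group size for every row
keep <- ave(seq_len(nrow(exdata)), exdata$a, exdata$b, FUN = length) > 1
exdata[keep, ]
#    a b
# 1  1 1
# 2  1 1
# 7  3 3
# 9  3 3
# 10 4 9
# 11 4 9
# 12 4 9
```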

Customer latency in r [duplicate]

This question already has answers here:
Remove duplicated rows
(10 answers)
Closed 5 years ago.
New to R doing customer latency.
In the dataset i have around 300,000 rows with 15 columns. Some
relevant columns are "Account", "Account Open Date",
"Shipment pick up date" etc.
Account numbers are repeated, and I just want the rows where each account number is recorded for the first time, not the subsequent rows.
For example, acc # 610829952 is in the first row as well as in the 5th row, 6th row, etc. I need to keep the first row alone, and I need to do this for all the account numbers.
I am not sure how to do this. Could someone please help me with this?
There is a function in R called duplicated(). It allows you to check whether a certain value, like your account, has already been recorded.
First you check in the relevant column account which account numbers have already appeared before using duplicated(). You will get a TRUE / FALSE vector (TRUE indicating that the corresponding value has already appeared). With that information, you can index your data.frame to retrieve only the rows you are interested in. I will assume your data looks like the df below:
df <- data.frame(account = sample(1:5, 20, replace = TRUE),
                 segment = sample(LETTERS, 20, replace = TRUE))
# account segment
# 1 3 N
# 2 2 V
# 3 4 T
# 4 4 Y
# 5 4 M
# 6 4 E
# 7 5 H
# 8 3 A
# 9 3 J
# 10 3 Y
# 11 4 R
# 12 5 O
# 13 4 O
# 14 1 R
# 15 5 U
# 16 2 Q
# 17 5 F
# 18 2 J
# 19 4 E
# 20 2 H
inds <- duplicated(df$account)
# [1] FALSE FALSE FALSE TRUE TRUE TRUE FALSE TRUE TRUE TRUE
# [11] TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
df <- df[!inds, ]
# account segment
# 1 3 N
# 2 2 V
# 3 4 T
# 7 5 H
# 14 1 R
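As a sanity check (my addition, with a seed so it is reproducible): keeping the first row per account via !duplicated() agrees with looking up the first occurrence of each account via match():

```r
set.seed(42)
df <- data.frame(account = sample(1:5, 20, replace = TRUE),
                 segment = sample(LETTERS, 20, replace = TRUE))

first_rows <- df[!duplicated(df$account), ]
# unique() keeps first-appearance order, and match() returns the
# position of each account's first occurrence
via_match  <- df[match(unique(df$account), df$account), ]

identical(first_rows, via_match)
# [1] TRUE
```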

Why do I get different results indexing with data.table

Here is a simple example of trying to extract some rows from a data.table. With what appear to be identical logical vectors, I get different answers:
a <- data.table(a=1:10, b = 10:1)
a # so here is the data we are working with
a b
1: 1 10
2: 2 9
3: 3 8
4: 4 7
5: 5 6
6: 6 5
7: 7 4
8: 8 3
9: 9 2
10: 10 1
Let's extract just the first column, since I need to specify the column number dynamically as part of my processing:
col <- 1L # get column 1 ('a')
x <- a[[col]] > 5 # logical vector specifying condition
x
[1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
str(x)
logi [1:10] FALSE FALSE FALSE FALSE FALSE TRUE ...
Look at the structure of a[[col]] > 5:
a[[col]] > 5
[1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
str(a[[col]] > 5)
logi [1:10] FALSE FALSE FALSE FALSE FALSE TRUE ...
This looks very much like x, so why do these two different ways of indexing a give different results?
a[x] # using 'x' as the logical vector
a b
1: 6 5
2: 7 4
3: 8 3
4: 9 2
5: 10 1
a[a[[col]] > 5] # using the expression as the logical vector
Empty data.table (0 rows) of 2 cols: a,b
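The thread ends without an answer, but a likely explanation (my addition, not from the original post) is data.table's scoping: the i expression inside DT[...] is evaluated with the table's columns in scope first, and here the table and one of its columns are both named a:

```r
library(data.table)

a <- data.table(a = 1:10, b = 10:1)
col <- 1L

# Inside a[...], `a` resolves to the COLUMN a (the vector 1:10),
# not the table, so a[[col]] is (1:10)[[1L]] == 1, and 1 > 5 is a
# single FALSE: no rows are selected.
nrow(a[a[[col]] > 5])
# [1] 0

# Computing the logical vector outside the [ avoids the capture,
# because there `a` is the table in the calling environment.
x <- a[[col]] > 5
nrow(a[x])
# [1] 5
```

Renaming the table (or the column) so the two names do not collide gives the expected behaviour directly.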

Smartest way to check if an observation in data.frame(x) exists also in data.frame(y) and populate a new column according with the result

Having two dataframes:
x <- data.frame(numbers=c('1','2','3','4','5','6','7','8','9'), coincidence="NA")
and
y <- data.frame(numbers=c('1','3','10'))
How can I check whether the observations in y (1, 3 and 10) also exist in x, and fill the column x["coincidence"] accordingly (for example with YES|NO, TRUE|FALSE, ...)?
I would do the same in Excel with a formula combining IFERROR and VLOOKUP, but I don't know how to do the same with R.
Note:
I am open to changing the data.frames to tables or using libraries. The data frame with the numbers to check (y) will never have more than 10-20 observations, while the other one (x) will never have more than 1K observations. Therefore, I could also iterate with an if, if it's necessary.
We can create a vector matching the desired output with a set membership test that outputs boolean TRUE and FALSE values where appropriate. The %in% operator is a binary operator that checks, for each value on the left-hand side, whether it occurs in the set of values on the right:
x$coincidence <- x$numbers %in% y$numbers
# numbers coincidence
# 1 1 TRUE
# 2 2 FALSE
# 3 3 TRUE
# 4 4 FALSE
# 5 5 FALSE
# 6 6 FALSE
# 7 7 FALSE
# 8 8 FALSE
# 9 9 FALSE
Do numbers have to be factors, as you've set them up? (They're not numbers, but character.) If not, it's easy:
x <- data.frame(numbers=c('1','2','3','4','5','6','7','8','9'), coincidence="NA", stringsAsFactors=FALSE)
y <- data.frame(numbers=c('1','3','10'), stringsAsFactors=FALSE)
x$coincidence[x$numbers %in% y$numbers] <- TRUE
> x
numbers coincidence
1 1 TRUE
2 2 NA
3 3 TRUE
4 4 NA
5 5 NA
6 6 NA
7 7 NA
8 8 NA
9 9 NA
If they need to be factors, then you'll need to either set common levels or use as.character().
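If you want the Excel-style YES/NO labels mentioned in the question, wrap the %in% result in ifelse() (a small variation on the answers above; the column names are taken from the question):

```r
x <- data.frame(numbers = as.character(1:9), stringsAsFactors = FALSE)
y <- data.frame(numbers = c("1", "3", "10"), stringsAsFactors = FALSE)

# label each row according to whether its number occurs in y
x$coincidence <- ifelse(x$numbers %in% y$numbers, "YES", "NO")
x$coincidence
# [1] "YES" "NO"  "YES" "NO"  "NO"  "NO"  "NO"  "NO"  "NO"
```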
