Customer latency in R [duplicate]

This question already has answers here:
Remove duplicated rows
(10 answers)
Closed 5 years ago.
I am new to R and working on customer latency.
My dataset has around 300,000 rows with 15 columns. Some
relevant columns are "Account", "Account Open Date",
"Shipment pick up date" etc.
Account numbers are repeated, and I only want the row where each account number is recorded for the first time, not the subsequent rows.
For example, acc # 610829952 is in the first row as well as in the 5th row, 6th row etc. I need to keep the first row alone, and I need to do this for all the account numbers.
I am not sure how to do this. Could someone please help me with this?

R has a function called duplicated(). It lets you check whether a certain value, like your account number, has already been recorded.
First, apply duplicated() to the relevant column to find which account numbers have appeared before. You will get a TRUE / FALSE vector (TRUE indicating that the corresponding value has already appeared). With that information, you can index your data.frame to retrieve only the rows you are interested in. I will assume your data looks like df below:
df <- data.frame(account = sample(1:5, 20, replace = TRUE),
                 segment = sample(LETTERS, 20, replace = TRUE))
# account segment
# 1 3 N
# 2 2 V
# 3 4 T
# 4 4 Y
# 5 4 M
# 6 4 E
# 7 5 H
# 8 3 A
# 9 3 J
# 10 3 Y
# 11 4 R
# 12 5 O
# 13 4 O
# 14 1 R
# 15 5 U
# 16 2 Q
# 17 5 F
# 18 2 J
# 19 4 E
# 20 2 H
inds <- duplicated(df$account)
# [1] FALSE FALSE FALSE TRUE TRUE TRUE FALSE TRUE TRUE TRUE
# [11] TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
df <- df[!inds, ]
# account segment
# 1 3 N
# 2 2 V
# 3 4 T
# 7 5 H
# 14 1 R
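If you prefer the tidyverse, the same "first row per account" result can be had in one call with dplyr::distinct(). A minimal sketch, assuming the dplyr package is installed and using a small hand-made df in place of the sampled one above:

```r
library(dplyr)

# a small stand-in for the sampled data above
df <- data.frame(account = c(3, 2, 4, 4, 3),
                 segment = c("N", "V", "T", "Y", "A"))

# keep the first row for each account; .keep_all retains the other columns
distinct(df, account, .keep_all = TRUE)
#   account segment
# 1       3       N
# 2       2       V
# 3       4       T
```

Under the hood this is equivalent to the duplicated() indexing shown above, but it reads more declaratively in a pipeline.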

Related

How to add dummy variables to data with specific characteristic

My question is probably quite basic but I've been struggling with it so I'd be really grateful if someone could offer a solution.
I have data in the following format:
ORG_NAME  var_1_12  var_1_13  var_1_14
A         12        11        5
B         13        13        11
C         6         7         NA
D         NA        NA        5
I have data on organizations over 5 years, but over that time, some organizations have merged and others have disappeared. I'm planning on conducting a fixed-effects regression, so I need to add a dummy variable which is "0" when organizations have remained the same (in this case row A and row B), and "1" in the year before the merge, and after the merge. In this case, I know that orgs C and D merged, so I would like for the data to look like this:
ORG_NAME  var_1_12  dum_12  var_1_13  dum_13
A         12        0       5         0
B         13        0       11        0
C         6         1       NA        1
D         NA        1       5         1
How would I code this?
This approach (as is any, given your description) depends entirely on the merging companies being in consecutive rows.
mtx <- apply(is.na(dat[, -1]), MARGIN = 2,
             function(vec) zoo::rollapply(vec, 2, function(z) xor(z[1], z[2]),
                                          fill = FALSE))
mtx
# var_1_12 var_1_13 var_1_14
# [1,] FALSE FALSE FALSE
# [2,] FALSE FALSE TRUE
# [3,] TRUE TRUE TRUE
# [4,] FALSE FALSE FALSE
out <- rowSums(mtx) == ncol(mtx)
out
# [1] FALSE FALSE TRUE FALSE
out | c(FALSE, out[-length(out)])
# [1] FALSE FALSE TRUE TRUE
### and your 0/1 numbers, if logical isn't right for you
+(out | c(FALSE, out[-length(out)]))
# [1] 0 0 1 1
Brief walk-through:
is.na(dat[,-1]) returns a matrix of whether the values (except the first column) are NA; because it's a matrix, we use apply to call a function on each column (using MARGIN=2);
zoo::rollapply is a function that does rolling calculations on a portion ("window") of the vector at a time, in this case 2-wide. For example, if we have 1:5, then it first looks at c(1,2), then c(2,3), then c(3,4), etc.
xor is an eXclusive OR, meaning it is true when exactly one of its arguments is true and the other is false;
mtx is a matrix indicating that a cell and the one below it met the conditions (one is NA, the other is not). We then check to see which of these rows are all true, forming out.
since we need a 1 in both the row before the merge and the row of the merge, we vector-OR (|) out with a shifted copy of itself to produce your intended output
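Putting the steps above together into one self-contained sketch, assuming dat holds the asker's table (with the zoo package installed):

```r
library(zoo)

# the asker's data; NA marks a missing year
dat <- data.frame(ORG_NAME = c("A", "B", "C", "D"),
                  var_1_12 = c(12, 13, 6, NA),
                  var_1_13 = c(11, 13, 7, NA),
                  var_1_14 = c(5, 11, NA, 5))

# TRUE where a cell and the cell below it disagree on NA-ness
mtx <- apply(is.na(dat[, -1]), MARGIN = 2,
             function(vec) rollapply(vec, 2, function(z) xor(z[1], z[2]),
                                     fill = FALSE))

# rows where every column shows such a flip, OR-ed with a shifted copy
out <- rowSums(mtx) == ncol(mtx)
dat$dummy <- +(out | c(FALSE, out[-length(out)]))
dat$dummy
# [1] 0 0 1 1
```

Rows C and D are flagged because their NA patterns are complementary in every year column, which is exactly the merge signature described above.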
If I understand correctly, you want to code with "1" the rows with at least one NA. If so, you just need one dummy variable for all the years, right? Something like this:
set.seed(4)
df <- data.frame(org=as.factor(LETTERS[1:5]),y1=sample(c(1:4,NA),5),y2=sample(c(3:6,NA),5),y3=sample(c(2:5,NA),5))
df$dummy <- as.numeric(apply(df, 1, function(x)any(is.na(x))))
which gives you
org y1 y2 y3 dummy
1 A 3 5 3 0
2 B NA 4 5 1
3 C 4 3 2 0
4 D 1 6 NA 1
5 E 2 NA 4 1

Find all the duplicate records using duplicated() on R [duplicate]

This question already has answers here:
Finding ALL duplicate rows, including "elements with smaller subscripts"
(9 answers)
Closed 4 years ago.
I have a question in R, with the following example code.
> exdata <- data.frame(a = rep(1:4, each = 3),
+ b = c(1, 1, 2, 4, 5, 3, 3, 2, 3, 9, 9, 9))
> exdata
a b
1 1 1
2 1 1
3 1 2
4 2 4
5 2 5
6 2 3
7 3 3
8 3 2
9 3 3
10 4 9
11 4 9
12 4 9
> exdata[duplicated(exdata), ]
a b
2 1 1
9 3 3
11 4 9
12 4 9
I tried to use the duplicated() function to find all the duplicate records in the exdata data frame, but it only finds some of the duplicated records, so it is difficult to confirm at a glance whether duplicates exist.
I'm looking for a solution that returns the following results
a b
1 1 1
2 1 1
7 3 3
9 3 3
10 4 9
11 4 9
12 4 9
Can the duplicated() function be used to get this result?
Or is there another function I should use?
I would appreciate your help.
duplicated returns a logical vector with length equal to the length of its argument, with TRUE for each element that has already occurred earlier. It has a method for data frames, duplicated.data.frame, that looks for duplicated rows (and so returns a logical vector of length nrow(exdata)). Your extraction using that as a logical vector returns exactly those rows that have occurred once before. It WON'T, however, return the first occurrence of those rows.
Look at the index vector you're using:
duplicated(exdata)
# [1] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE TRUE
But you can combine it with fromLast = TRUE to get all of the occurrences of these rows:
exdata[duplicated(exdata) | duplicated(exdata, fromLast = TRUE),]
# a b
# 1 1 1
# 2 1 1
# 7 3 3
# 9 3 3
# 10 4 9
# 11 4 9
# 12 4 9
Look at the logical vector for duplicated(exdata, fromLast = TRUE), and its combination with duplicated(exdata), to convince yourself:
duplicated(exdata, fromLast = TRUE)
# [1] TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE TRUE FALSE
duplicated(exdata) | duplicated(exdata, fromLast = TRUE)
# [1] TRUE TRUE FALSE FALSE FALSE FALSE TRUE FALSE TRUE TRUE TRUE TRUE
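An equivalent base-R route, shown as a sketch, builds one grouping key per row with interaction() and keeps every row whose key occurs more than once; it produces the same selection as the two duplicated() calls combined:

```r
exdata <- data.frame(a = rep(1:4, each = 3),
                     b = c(1, 1, 2, 4, 5, 3, 3, 2, 3, 9, 9, 9))

# one key per row; a row is kept if its key appears anywhere among the duplicates
key <- interaction(exdata$a, exdata$b)
exdata[key %in% key[duplicated(key)], ]
#    a b
# 1  1 1
# 2  1 1
# 7  3 3
# 9  3 3
# 10 4 9
# 11 4 9
# 12 4 9
```

This scales to any number of columns fed to interaction(), at the cost of building the key factor.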

Replace n number of rows with condition in R? [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 5 years ago.
I have a df :
number=c(3,3,3,3,3,1,1,1,1,4,4,4,4,4,4)
data.frame(number)
but with thousands of rows.
How can I replace n of these rows, for example turning 3 into 1?
If you can explain the logic too, that would be great.
No special requirements: just replace a certain number of the 3s with 1, not all of them.
Either randomly or the first n of them.
Here are two versions for you. The first assumes you randomly want to convert n rows from 3 to 1. The second assumes that you want to choose the first n rows from 3 to 1.
To randomly select n of the rows where the value is currently 3, and then convert to 1:
> number=c(3,3,3,3,3,1,1,1,1,4,4,4,4,4,4)
>
>
> # to randomly change n rows (assume here that n = 4)
> set.seed(1)
> df <- data.frame(v1 = number)
> df$v1[sample(which(df$v1 == 3), 4)] <- 1
> df
v1
1 1
2 1
3 1
4 1
5 3
6 1
7 1
8 1
9 1
10 4
11 4
12 4
13 4
14 4
15 4
To change the first n rows (assume again that n = 4):
> df <- data.frame(v1 = number)
> df$v1[which(df$v1 == 3)[1:4]] <- 1
> df
v1
1 1
2 1
3 1
4 1
5 3
6 1
7 1
8 1
9 1
10 4
11 4
12 4
13 4
14 4
15 4
Since you wanted the logic for how this works:
Both answers rely on the which() command. which() gives you the positions where a logical vector is TRUE, so when we do which(df$v1 == 3), this gives us the positions of all the rows where df$v1 is 3:
> df$v1 == 3
[1] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> which(df$v1 == 3)
[1] 1 2 3 4 5
We then simply specify that we want to reassign df$v1 at those positions to 1. However, since you wanted to specify how many rows to do this for, we subset the result of our which() vector by using [1:n] to select the first n results, or sample(x, n) to randomly select n results.
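The same "first n" idea can also be written without modifying the data frame in place: base R's replace() returns an edited copy and leaves the original untouched, which is handy inside pipelines. A sketch using the same vector:

```r
number <- c(3, 3, 3, 3, 3, 1, 1, 1, 1, 4, 4, 4, 4, 4, 4)

# turn the first 4 threes into ones; `number` itself is not changed
replace(number, which(number == 3)[1:4], 1)
# [1] 1 1 1 1 3 1 1 1 1 4 4 4 4 4 4
```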
I am assuming you want to select n appearances of some value in a data.frame column.
For that you can sample, with or without replacement, all the values that match your requirements.
Below I show how to do that for 3 instances of 3's
number <- c(3, 3, 3, 3, 3, 1, 1, 1, 1, 4, 4, 4, 4, 4, 4)
foo <- data.frame(number)
indexes <- sample(which(foo$number == 3), size = 3, replace = FALSE)
foo$number[indexes] <- 1  # the replacement value; keep it numeric so the column type is preserved

Understanding R's anyDuplicated

Quick question about understanding R's anyDuplicated: when passed a data frame (say with x, y, z columns and 1k observations), will it check whether any row has the exact same x, y, z values as another row in the same data frame? Thanks
I would use duplicated and combine it from front to back.
mydf <- data.frame(x = c(1:3,1,1), y = c(3:5,3,3))
mydf
# x y
# 1 1 3
# 2 2 4
# 3 3 5
# 4 1 3
# 5 1 3
There are three duplicated rows: 1, 4, and 5. But duplicated() will only mark the later copies, not the original value as well.
duplicated(mydf)
#[1] FALSE FALSE FALSE TRUE TRUE
duplicated(mydf, fromLast = TRUE)
#[1] TRUE FALSE FALSE TRUE FALSE
Using fromLast = TRUE looks from the end to the front, which marks the original value too. By the way, I will ask the R core team to add a unified function that does both.
myduplicates <- duplicated(mydf) | duplicated(mydf, fromLast = TRUE)
Saving the expression as a variable allows us to count and subset later.
sum(myduplicates)
#[1] 3
mydf[myduplicates,]
# x y
# 1 1 3
# 4 1 3
# 5 1 3
mydf[!myduplicates,]
# x y
# 2 2 4
# 3 3 5
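To address the original question about anyDuplicated() directly: on a data frame it compares whole rows and returns the index of the first row that duplicates an earlier one, or 0 when every row is unique, which makes it a cheap existence check. A sketch using the mydf from above:

```r
mydf <- data.frame(x = c(1:3, 1, 1), y = c(3:5, 3, 3))

anyDuplicated(mydf)        # index of the first row repeating an earlier one
# [1] 4
anyDuplicated(mydf[1:3, ]) # 0 means no duplicated rows at all
# [1] 0
```

So for the 1k-observation case in the question, anyDuplicated(df) != 0 answers "does any row repeat?" without building the full logical vector yourself.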

Smartest way to check if an observation in data.frame(x) also exists in data.frame(y) and populate a new column according to the result

Having two dataframes:
x <- data.frame(numbers=c('1','2','3','4','5','6','7','8','9'), coincidence="NA")
and
y <- data.frame(numbers=c('1','3','10'))
How can I check whether the observations in y (1, 3 and 10) also exist in x, and fill the column x["coincidence"] accordingly (for example with YES|NO, TRUE|FALSE...)?
I would do the same in Excel with a formula combining IFERROR and VLOOKUP, but I don't know how to do the same with R.
Note:
I am open to changing the data.frames to tables or using libraries. The data frame with the numbers to check (y) will never have more than 10-20 observations, while the other one (x) will never have more than 1K observations. Therefore, I could also iterate with an if, if necessary.
We can create the vector matching the desired output with a membership test that outputs boolean TRUE and FALSE values where appropriate. The %in% operator is a binary operator that checks, for each value on the left-hand side, whether it appears in the set of values on the right:
x$coincidence <- x$numbers %in% y$numbers
# numbers coincidence
# 1 1 TRUE
# 2 2 FALSE
# 3 3 TRUE
# 4 4 FALSE
# 5 5 FALSE
# 6 6 FALSE
# 7 7 FALSE
# 8 8 FALSE
# 9 9 FALSE
Do numbers have to be factors, as you've set them up? (They're not numbers, but character.) If not, it's easy:
x <- data.frame(numbers=c('1','2','3','4','5','6','7','8','9'), coincidence="NA", stringsAsFactors=FALSE)
y <- data.frame(numbers=c('1','3','10'), stringsAsFactors=FALSE)
x$coincidence[x$numbers %in% y$numbers] <- TRUE
> x
numbers coincidence
1 1 TRUE
2 2 NA
3 3 TRUE
4 4 NA
5 5 NA
6 6 NA
7 7 NA
8 8 NA
9 9 NA
If they need to be factors, then you'll need to either set common levels or use as.character().
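A base-R alternative that mirrors Excel's VLOOKUP more closely is match(), which returns the position of each lookup value in the other table (or NA when absent); wrapping it in !is.na() yields the same TRUE/FALSE column. A sketch:

```r
x <- data.frame(numbers = c('1','2','3','4','5','6','7','8','9'))
y <- data.frame(numbers = c('1','3','10'))

# match() gives the position in y$numbers, or NA when the number is missing there
x$coincidence <- !is.na(match(x$numbers, y$numbers))
x$coincidence
# [1]  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
```

Keeping the raw match() result instead of the !is.na() wrapper also tells you *where* each hit is, which is useful if you later want to pull other columns across from y.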
