How to find the number of unique values in a vector for each value from another vector - r

I have two vectors:
x <- c(1,5,3,2,3, 4,1,2,3,4, 10,5,2,10,12)
y <- c(1,1,2,2,2, 3,3,1,4,4, 4,5,5,4,4)
How can I find the number of unique values in x for each value of y?
I know how to find the total (non-unique) count of x values for each value of y:
r <- aggregate(x ~ y, data = data.frame(x, y), FUN = length)
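Incidentally, the same aggregate() call can answer the question directly if you swap length for a function that counts distinct values (a sketch building on the question's own code, not taken from the answers below):
r <- aggregate(x ~ y, data = data.frame(x, y), FUN = function(v) length(unique(v)))
r
#   y x
# 1 1 3
# 2 2 2
# 3 3 2
# 4 4 4
# 5 5 2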

Using data.table, this is pretty easy:
require(data.table)
DT = data.table(x,y)
unique(DT, by=c("x", "y"))[, .N, by=y]
# y N
# 1: 1 3
# 2: 2 2
# 3: 3 2
# 4: 4 4
# 5: 5 2
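If your data.table version provides uniqueN(), a helper that counts distinct values (this note is an addition to the original answer), the same result can be computed in one step:
DT[, .(N = uniqueN(x)), by = y]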

You can do it with dplyr this way:
library(dplyr)
data.frame(x, y) %>%
  group_by(y) %>%
  summarize(nb = length(unique(x)))
Which gives:
y nb
1 1 3
2 2 2
3 3 2
4 4 4
5 5 2
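dplyr also ships n_distinct(), a drop-in replacement for length(unique(...)), so the summarize step can be written more concisely (an addition to the original answer):
data.frame(x, y) %>%
  group_by(y) %>%
  summarize(nb = n_distinct(x))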

You could do the following: table() counts the occurrences of each x within each y, the double negation !! turns those counts into logical values, and rowSums() then counts the distinct x values per y.
rowSums(!!table(y, x))
# 1 2 3 4 5
# 3 2 2 4 2
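A base R alternative that avoids building the full contingency table (an addition, not part of the original answers):
tapply(x, y, function(v) length(unique(v)))
# 1 2 3 4 5
# 3 2 2 4 2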

Related

How to vectorize the RHS of dplyr::case_when?

Suppose I have a dataframe that looks like this:
> data <- data.frame(x = c(1,1,2,2,3,4,5,6), y = c(1,2,3,4,5,6,7,8))
> data
x y
1 1 1
2 1 2
3 2 3
4 2 4
5 3 5
6 4 6
7 5 7
8 6 8
I want to use mutate() and case_when() to create a new id variable that identifies rows using the variable x, and gives rows with missing x a unique id. In other words, rows one and two should share an id, rows three and four should share an id, and rows 5-8 should each get their own unique id. Suppose I want to generate these id values with a function:
id_function <- function(x, n){
  # seeding with x makes the generated id deterministic for a given x
  set.seed(x)
  res <- character(n)
  for(i in seq(n)){
    res[i] <- paste0(sample(c(letters, LETTERS, 0:9), 32), collapse="")
  }
  res
}
id_function(1, 1)
[1] "4dMaHwQnrYGu0PTjgioXKOyW75NRZtcf"
I am trying to use this function on the RHS of a case_when expression like this:
data %>%
  mutate(my_id = id_function(1234, nrow(.)),
         my_id = dplyr::case_when(!is.na(x) ~ id_function(x, 1),
                                  TRUE ~ my_id))
But the RHS does not seem to be vectorized and I get the same value for all non-missing values of x:
x y my_id
1 1 1 4dMaHwQnrYGu0PTjgioXKOyW75NRZtcf
2 1 2 4dMaHwQnrYGu0PTjgioXKOyW75NRZtcf
3 2 3 4dMaHwQnrYGu0PTjgioXKOyW75NRZtcf
4 2 4 4dMaHwQnrYGu0PTjgioXKOyW75NRZtcf
5 NA 5 0vnws5giVNIzp86BHKuOZ9ch4dtL3Fqy
6 NA 6 IbKU6DjvW9ypitl7qc25Lr4sOwEfghdk
7 NA 7 8oqQMPx6IrkGhXv4KlUtYfcJ5Z1RCaDy
8 NA 8 BRsjumlCEGS6v4ANrw1bxLynOKkF90ao
I'm sure there's a way to vectorize the RHS, what am I doing wrong? Is there an easier approach to solving this problem?
I guess rowwise() would do the trick:
data %>%
  rowwise() %>%
  mutate(my_id = id_function(x, 1))
x y my_id
1 1 4dMaHwQnrYGu0PTjgioXKOyW75NRZtcf
1 2 4dMaHwQnrYGu0PTjgioXKOyW75NRZtcf
2 3 uof7FhqC3lOXkacp54MGZJLUR6siSKDb
2 4 uof7FhqC3lOXkacp54MGZJLUR6siSKDb
3 5 e5lMJNQEhtj4VY1KbCR9WUiPrpy7vfXo
4 6 3kYcgR7109DLbxatQIAKXFeovN8pnuUV
5 7 bQ4ok7OuDgscLUlpzKAivBj2T3m6wrWy
6 8 0jSn3Jcb2HDA5uhvG8g1ytsmRpl6CQWN
purrr map functions can be used for non-vectorized functions. The following will give you a similar result. map2 will take the two arguments expected by your id_function.
library(tidyverse)
library(tidyverse)
data %>%
  mutate(my_id = map2(x, 1, id_function))
Output
x y my_id
1 1 1 4dMaHwQnrYGu0PTjgioXKOyW75NRZtcf
2 1 2 4dMaHwQnrYGu0PTjgioXKOyW75NRZtcf
3 2 3 uof7FhqC3lOXkacp54MGZJLUR6siSKDb
4 2 4 uof7FhqC3lOXkacp54MGZJLUR6siSKDb
5 3 5 e5lMJNQEhtj4VY1KbCR9WUiPrpy7vfXo
6 4 6 3kYcgR7109DLbxatQIAKXFeovN8pnuUV
7 5 7 bQ4ok7OuDgscLUlpzKAivBj2T3m6wrWy
8 6 8 0jSn3Jcb2HDA5uhvG8g1ytsmRpl6CQWN
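Since map2() returns a list-column, you may prefer map2_chr(), which gives a plain character vector. This is an addition to the original answer and assumes, as in the data shown here, that x has no missing values (id_function() would error on NA):
data %>%
  mutate(my_id = map2_chr(x, 1, id_function))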

R - subset and include calculated column

Let's say I have this simple data frame:
df <- data.frame(x=c(1,3,3,1,3,1), y = c(2,2,2,2,2,2),z = c('a','b','c','d','e','f'))
> df
x y z
1 1 2 a
2 3 2 b
3 3 2 c
4 1 2 d
5 3 2 e
6 1 2 f
I would like to subset where x == 3, return only columns x and y, and include a calculated column x + y.
I can get the first two things done, but I can't get the calculated column to also appear.
df[df$x == 3, c("x", "y")]
How can I do that using base R only?
Staying in base, just do a rowSums before your subset.
df$xy <- rowSums(df[, c("x", "y")])
df[df$x == 3, c("x", "y", "xy")]
# x y xy
# 2 3 2 5
# 3 3 2 5
# 5 3 2 5
Personally, I prefer the dplyr approach that @akrun suggested in a comment on your question.
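A single base R expression is also possible with transform() and subset() (a sketch added here, not part of the original answer):
subset(transform(df, xy = x + y), x == 3, select = c(x, y, xy))
#   x y xy
# 2 3 2  5
# 3 3 2  5
# 5 3 2  5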
You can also do it like this (note that this replaces the z column rather than subsetting the rows):
df <- data.frame(x=c(1,3,3,1,3,1), y = c(2,2,2,2,2,2),z = c('a','b','c','d','e','f'))
df$z <- ifelse(df$x == 3, (df$x + df$y), df$y)
df
x y z
1 1 2 2
2 3 2 5
3 3 2 5
4 1 2 2
5 3 2 5
6 1 2 2

How to select unique points

I am a novice R programmer. I have the following series of points.
library(dplyr)
df <- data.frame(x = c(1, 2, 3, 4), y = c(6, 3, 7, 5))
df <- df %>% mutate(k = 1)
df <- df %>% full_join(df, by = 'k')
df <- subset(df, select = c('x.x', 'y.x', 'x.y', 'y.y'))
df
Is there way to select for "unique" points? (the order of the points do not matter)
EDIT:
x.x y.x x.y y.y
1 6 2 3
2 3 3 7
.
.
.
(I changed the 2 to 7 to clarify the problem)
With data.table (and working from the OP's initial df):
library(data.table)
setDT(df)
df[, r := .I ]
df[df, on=.(r > r), nomatch=0]
x y r i.x i.y
1: 2 3 1 1 6
2: 3 2 1 1 6
3: 4 5 1 1 6
4: 3 2 2 2 3
5: 4 5 2 2 3
6: 4 5 3 3 2
This is a "non-equi join" on row numbers. In x[i, on=.(r > r)] the left-hand r refers to the row in x and the right-hand one to a row of i. The columns named like i.* are taken from i.
Data.table joins, which are of the form x[i], use i to look up rows of x. The nomatch=0 option drops rows of i that find no matches.
In the tidyverse, you can save a bit of work by doing the self-join with tidyr::crossing. If you add row indices pre-join, reducing is a simple filter call:
library(tidyverse)
df %>% mutate(i = row_number()) %>%  # add row index column
  crossing(., .) %>%                 # Cartesian self-join
  filter(i < i1) %>%                 # reduce to lower indices
  select(-i, -i1)                    # remove extraneous columns
## x y x1 y1
## 1 1 6 2 3
## 2 1 6 3 7
## 3 1 6 4 5
## 4 2 3 3 7
## 5 2 3 4 5
## 6 3 7 4 5
or in all base R,
df$m <- 1
df$i <- seq(nrow(df))
df <- merge(df, df, by = 'm')
df[df$i.x < df$i.y, c(-1, -4, -7)]
## x.x y.x x.y y.y
## 2 1 6 2 3
## 3 1 6 3 7
## 4 1 6 4 5
## 7 2 3 3 7
## 8 2 3 4 5
## 12 3 7 4 5
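Another all-base-R option, added here and again starting from the OP's initial 4-row df, skips the join entirely and enumerates the index pairs with combn():
pairs <- t(combn(nrow(df), 2))  # all unordered pairs of row indices
out <- cbind(df[pairs[, 1], c("x", "y")], df[pairs[, 2], c("x", "y")])
names(out) <- c("x.x", "y.x", "x.y", "y.y")
out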
You can use the duplicated() function from base R to find the rows that are not duplicates, which means they are unique. When you call duplicated() you restrict it to the first two columns; this checks which lines are unique. In a second step you index your data frame with these rows, keeping all columns.
unique_lines <- !duplicated(df[, c(1, 2)])
df[unique_lines, ]

Identifying unique duplicates in vector in R

I am trying to identify duplicates based on a match of elements in two vectors. Using duplicated() gives a logical vector flagging all matches, but I would like an index that identifies which rows match each other. Using the following code as an example:
x <- c(1,6,4,6,4,4)
y <- c(3,2,5,2,5,5)
frame <- data.frame(x,y)
matches <- duplicated(frame) | duplicated(frame, fromLast = TRUE)
matches
[1] FALSE TRUE TRUE TRUE TRUE TRUE
Ultimately, I would like to create a vector that identifies elements 2 and 4 as matches, as well as elements 3, 5, and 6. Any thoughts are greatly appreciated.
Another data.table answer, using the group counter .GRP to assign every distinct element a label:
d <- data.table(frame)
d[, z := .GRP, by = list(x, y)]
d
# x y z
# 1: 1 3 1
# 2: 6 2 2
# 3: 4 5 3
# 4: 6 2 2
# 5: 4 5 3
# 6: 4 5 3
How about this with plyr::ddply()
library(plyr)
ddply(cbind(index = 1:nrow(frame), frame), .(x, y), summarise,
      count = length(index),
      elems = paste0(index, collapse = ","))
x y count elems
1 1 3 1 1
2 4 5 3 3,5,6
3 6 2 2 2,4
NB: the expression cbind(index = 1:nrow(frame), frame) just adds a row-index column to each row.
Using merge against the unique possibilities for each row, you can get a result:
labls <- data.frame(unique(frame),num=1:nrow(unique(frame)))
result <- merge(transform(frame,row = 1:nrow(frame)),labls,by=c("x","y"))
result[order(result$row),]
# x y row num
#1 1 3 1 1
#5 6 2 2 2
#2 4 5 3 3
#6 6 2 4 2
#3 4 5 5 3
#4 4 5 6 3
The result$num vector gives the groups.
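If you are already using dplyr (version 1.0.0 or later is assumed here, since it introduced cur_group_id()), a group counter gives the same labels as the data.table .GRP answer:
library(dplyr)
frame %>%
  group_by(x, y) %>%
  mutate(z = cur_group_id()) %>%
  ungroup()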

get rows of unique values by group

I have a data.table and want to pick those rows of the data.table where the values of a variable x are unique relative to another variable y.
It's possible to get the unique values of x, grouped by y in a separate dataset, like this
dt[,unique(x),by=y]
But I want to pick the rows in the original dataset where this is the case. I don't want a new data.table because I also need the other variables.
So, what do I have to add to my code to get the rows in dt for which the above is true?
dt <- data.table(y=rep(letters[1:2],each=3),x=c(1,2,2,3,2,1),z=1:6)
y x z
1: a 1 1
2: a 2 2
3: a 2 3
4: b 3 4
5: b 2 5
6: b 1 6
What I want:
y x z
1: a 1 1
2: a 2 2
3: b 3 4
4: b 2 5
5: b 1 6
The idiomatic data.table way is:
require(data.table)
unique(dt, by = c("y", "x"))
# y x z
# 1: a 1 1
# 2: a 2 2
# 3: b 3 4
# 4: b 2 5
# 5: b 1 6
data.table is a bit different in how to use duplicated. Here's the approach I've seen around here somewhere before:
dt <- data.table(y=rep(letters[1:2],each=3),x=c(1,2,2,3,2,1),z=1:6)
setkey(dt, "y", "x")
key(dt)
# [1] "y" "x"
!duplicated(dt)
# [1] TRUE TRUE FALSE TRUE TRUE TRUE
dt[!duplicated(dt)]
# y x z
# 1: a 1 1
# 2: a 2 2
# 3: b 1 6
# 4: b 2 5
# 5: b 3 4
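Working again from the original, unkeyed dt (and assuming a data.table version where duplicated() accepts a by argument), the same idea works without setting a key and keeps the original row order:
dt[!duplicated(dt, by = c("y", "x"))]
#    y x z
# 1: a 1 1
# 2: a 2 2
# 3: b 3 4
# 4: b 2 5
# 5: b 1 6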
The simpler data.table solution is to grab the first element of each group
> dt[, head(.SD, 1), by=.(y, x)]
y x z
1: a 1 1
2: a 2 2
3: b 3 4
4: b 2 5
5: b 1 6
Thanks to dplyr:
library(dplyr)
col1 = c(1,1,3,3,5,6,7,8,9)
col2 = c("cust1", 'cust1', 'cust3', 'cust4', 'cust5', 'cust5', 'cust5', 'cust5', 'cust6')
df1 = data.frame(col1, col2)
df1
distinct(select(df1, col1, col2))
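In more recent dplyr versions you can pass the columns to distinct() directly, and .keep_all = TRUE keeps the remaining columns, which matches the OP's wish not to lose the other variables (this note is an addition to the original answer):
distinct(df1, col1, col2, .keep_all = TRUE)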
