dplyr unique occurrence count on columns [duplicate] - r

This question already has answers here:
Counting unique / distinct values by group in a data frame
(12 answers)
How to count the number of unique values by group? [duplicate]
(1 answer)
Closed 5 years ago.
I want to get the number of unique values in one column, grouped by another column, using dplyr. Preferably the solution should be function-friendly, that is, I can put it in a function and it will work easily.
For example, take the following data frame:
test = data.frame(one=rep(letters[1:5],each=2), two=c(rep("c", 3), rep("d", 2), rep("e", 4), "f") )
   one two
1    a   c
2    a   c
3    b   c
4    b   d
5    c   d
6    c   e
7    d   e
8    d   e
9    e   e
10   e   f
I want the number of unique values in column two for each value of column one.
Desired output:
  one n
1   a 1
2   b 2
3   c 2
4   d 1
5   e 2
In column one, a has only 1 unique value ("c"), b has 2 unique values ("c" and "d"), c has 2 unique values ("d" and "e"), d has 1 unique value ("e"), and e has 2 unique values ("e" and "f").
I managed to get something working by calling group_by() twice and then summarize(). Is there a simpler way?
Hope this is understandable.
Thanks

We can group by 'one' and get the number of unique elements with n_distinct
library(dplyr)
test %>%
  group_by(one) %>%
  summarise(n = n_distinct(two))
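Since the question asks for something function-friendly, here is a minimal sketch of the same idea wrapped in a function. The name count_distinct_by and its argument names are made up for illustration; it assumes a dplyr version that supports the {{ }} (embrace) operator and the .groups argument of summarise().

```r
library(dplyr)

# Hypothetical helper: count distinct values of `value_col` within each `group_col`.
# Column names are passed unquoted and forwarded with {{ }}.
count_distinct_by <- function(data, group_col, value_col) {
  data %>%
    group_by({{ group_col }}) %>%
    summarise(n = n_distinct({{ value_col }}), .groups = "drop")
}

test <- data.frame(one = rep(letters[1:5], each = 2),
                   two = c(rep("c", 3), rep("d", 2), rep("e", 4), "f"))

result <- count_distinct_by(test, one, two)
result
```

Called as count_distinct_by(test, one, two), this should give one row per value of one, with n holding the distinct count of two.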

Related

Is there a good way to compare 2 data tables but compare the data from i to data of i+1 in second data table [duplicate]

This question already has answers here:
Remove duplicated rows
(10 answers)
Closed 2 years ago.
I have tried various functions including compare and all.equal but I am having difficulty finding a test to see if variables are the same.
For context, I have a data.frame which in some cases has a duplicate result. I have tried copying the data.frame so I can compare it with itself. I would like to remove the duplicates.
One approach I considered was to take row A from data frame 1 and subtract row B of data frame 2 from it. If the difference is zero, I planned to remove one of them.
Is there an approach I can use to do this without copying my data?
Any help would be great, I'm new to R coding.
Suppose I had a data.frame named data:
data
  Col1 Col2
A    1    3
B    2    7
C    2    7
D    2    8
E    4    9
F    5   12
I can use the duplicated function to identify duplicated rows and not select them:
data[!duplicated(data),]
  Col1 Col2
A    1    3
B    2    7
D    2    8
E    4    9
F    5   12
I can also perform the same action on a single column:
data[!duplicated(data$Col1),]
  Col1 Col2
A    1    3
B    2    7
E    4    9
F    5   12
Sample Data
data <- data.frame(Col1 = c(1,2,2,2,4,5), Col2 = c(3,7,7,8,9,12))
rownames(data) <- LETTERS[1:6]
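As a small addition to the duplicated() approach above, here is a sketch of two base R variants on the same sample data: unique() as a shorthand for dropping repeated rows, and the fromLast argument of duplicated() for keeping the last occurrence instead of the first. The variable names deduped and deduped_last are only for illustration.

```r
data <- data.frame(Col1 = c(1, 2, 2, 2, 4, 5), Col2 = c(3, 7, 7, 8, 9, 12))
rownames(data) <- LETTERS[1:6]

# unique() on a data frame drops repeated rows, keeping the first occurrence
deduped <- unique(data)

# fromLast = TRUE marks the earlier copies as duplicates,
# so the last occurrence of each repeated row is the one kept
deduped_last <- data[!duplicated(data, fromLast = TRUE), ]

deduped
deduped_last
```

Neither variant needs a copy of the data frame to compare against.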

add a new column conditional on another character column in R

I have been trying to write a procedure in R. I want to add a new column based on the categories in another column.
Here is an example:
Column  New Column
A       1
B       2
C       3
D       4
D       4
A       1
My question is how to add this new column with a particular value based on the (character) values of the first column.
This is very similar to what mutate and case_when do. The problem, as far as I can tell, is that this function only takes numeric values into consideration, whereas here I want to take characters (categories) and assign a specific value to the new column based on them.
Assuming you have a column of categories (not only letters), you can convert it to "ordered factors" to order the categories and then convert to integers.
x <- c("A", "B", "C", "D", "D", "A")
# make the data frame: as.ordered() sorts the categories, as.integer() returns their rank
v <- data.frame(x, as.integer(as.ordered(x)))
# name the columns
colnames(v) <- c("Column", "New Column")
v
# output
> v
  Column New Column
1      A          1
2      B          2
3      C          3
4      D          4
5      D          4
6      A          1
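On the case_when concern raised in the question: case_when conditions are ordinary logical expressions, so comparing a character column with == works. Below is a minimal sketch, assuming dplyr is installed; the particular letter-to-number mapping is only an illustration and can be any values you need, not just sort order.

```r
library(dplyr)

df <- data.frame(Column = c("A", "E", "E", "B", "A"))

df <- df %>%
  mutate(Newcolumn = case_when(
    Column == "A" ~ 1,
    Column == "B" ~ 2,
    Column == "C" ~ 3,
    Column == "D" ~ 4,
    Column == "E" ~ 5,
    TRUE ~ NA_real_   # anything not listed above becomes NA
  ))

df
```

All right-hand sides must have the same type, which is why the fallback uses NA_real_ rather than a plain NA.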
If I understand you correctly you want to create a new column that has numbers corresponding to letters, with 1 corresponding to the first letter of the alphabet A, 2 corresponding to B, 3 to C, and so on. If that premise is correct, then this code will work for you:
ILLUSTRATIVE DATA:
set.seed(12)
df <- data.frame(
  Column = sample(LETTERS[1:5], 10, replace = TRUE)
)
df
   Column
1       A
2       E
3       E
4       B
5       A
6       A
7       A
8       D
9       A
10      A
SOLUTION:
Assign the indices of LETTERS, which form an ordered sequence of integers starting with 1, to the letters in df$Column where they match the letters in LETTERS:
df$Newcolumn <- seq(LETTERS)[match(df$Column, LETTERS)]
RESULT:
df
   Column Newcolumn
1       A         1
2       E         5
3       E         5
4       B         2
5       A         1
6       A         1
7       A         1
8       D         4
9       A         1
10      A         1
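If the categories are not letters of the alphabet, the same matching idea can be written with a named lookup vector. This is a sketch with a made-up mapping (the names and values in lookup are assumptions for illustration only):

```r
x <- c("A", "B", "C", "D", "D", "A")

# Hypothetical mapping from category to value; edit to suit your categories
lookup <- c(A = 1, B = 2, C = 3, D = 4)

# Indexing a named vector by a character vector looks up each category by name;
# unname() drops the names so the column holds plain numbers
v <- data.frame(Column = x, NewColumn = unname(lookup[x]))
v
```

A category that is missing from lookup yields NA in the new column, which makes unmapped values easy to spot.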

Transform a column into variables in R [duplicate]

This question already has answers here:
Aggregating by unique identifier and concatenating related values into a string [duplicate]
(4 answers)
Closed 5 years ago.
My current dataset:
order  product
1      a
1      b
1      c
2      b
2      d
3      a
3      c
3      e
What I want:
product  order
a        1,3
b        1,2
c        1,3
d        2
e        3
I have tried cast and reshape, but they didn't work.
I recently spent way too much time trying to do something similar. What you need here, I believe, is a list-column. The code below will do that, but it turns the order number into a character value.
library(tidyverse)
df <- tibble(order=c(1,1,1,2,2,3,3,3), product=c('a','b','c','b','d','a','c','e')) %>%
  group_by(product) %>%
  # use the grouped column here; .$order would refer to the whole, ungrouped column
  summarise(order=toString(order)) %>%
  mutate(order=str_split(order, ', '))
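If a plain comma-separated string per product is enough (as in the desired output of the question) rather than a list-column, a minimal sketch with dplyr alone could look like this. The data frame name orders is made up for illustration.

```r
library(dplyr)

orders <- data.frame(order = c(1, 1, 1, 2, 2, 3, 3, 3),
                     product = c("a", "b", "c", "b", "d", "a", "c", "e"))

result <- orders %>%
  group_by(product) %>%
  # paste() with collapse joins all order numbers of a product into one string
  summarise(order = paste(order, collapse = ","), .groups = "drop")

result
```

The order column in result is character, so it is suited to display rather than further arithmetic.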

Counting the times a value in a vector is different per combination of 4 other vectors

This is what my dataframe looks like:
a <- c(1,1,4,4,5)
b <- c(1,2,3,3,5)
c <- c(1,4,4,4,5)
d <- c(2,2,4,4,5)
e <- c(1,5,3,2,5)
df <- data.frame(a,b,c,d,e)
I'd like to write something that returns all unique instances of vectors a,b,c,d that have a different value in vector e.
For example:
  a b c d e
1 1 1 1 2 1
2 1 2 4 2 5
3 4 3 4 4 3
4 4 3 4 4 2
5 5 5 5 5 5
Rows 3 and 4 are exactly the same till vector d (having a combination of 4344) so only one instance of those should be returned, but they have 2 different values in vector e. I would want to get a count on those - so the combination of 4344 has 2 different values in vector e.
The expected output would tell me how many times a certain combination such as 4344 had different values in vector e. So in this case it would be something like:
a b c d e
4 3 4 4 2
So far I have something like this:
library(tidyr)
library(dplyr)
df %>%
  unite(key_abcd, a, b, c, d) %>%
  count(key_abcd, e)
But this will count the times e has been repeated per combination of a,b,c,d. I would like to instead count the times e is different per combination of a,b,c,d.
NOTE: There are both repeated combinations of values in vectors a,b,c,d and repeated values in vector e. I would like to return only the count of unique values in e for unique combinations of a,b,c,d.
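You could try adding a little dplyr on, using n_distinct() to count the distinct values of e per combination:
library(tidyr)
library(dplyr)
df %>%
  unite(key_abcd, a, b, c, d) %>%
  group_by(key_abcd) %>%
  summarise(e = n_distinct(e)) %>%  # distinct values of e, not the row count
  filter(e > 1)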
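To keep a, b, c and d as separate columns, matching the layout of the expected output in the question, one option is to group by the four columns directly instead of uniting them into a key. A minimal sketch, assuming dplyr is available:

```r
library(dplyr)

df <- data.frame(a = c(1, 1, 4, 4, 5),
                 b = c(1, 2, 3, 3, 5),
                 c = c(1, 4, 4, 4, 5),
                 d = c(2, 2, 4, 4, 5),
                 e = c(1, 5, 3, 2, 5))

result <- df %>%
  group_by(a, b, c, d) %>%
  # number of distinct e values within each a/b/c/d combination
  summarise(e = n_distinct(e), .groups = "drop") %>%
  # keep only combinations where e takes more than one value
  filter(e > 1)

result
```

Because n_distinct() ignores repeats, a combination whose rows all share the same e value is filtered out even if it occurs many times.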

r remove rows from a data frame that contain a duplicate of either combination of 2 columns [duplicate]

This question already has answers here:
Remove duplicate column pairs, sort rows based on 2 columns [duplicate]
(3 answers)
Closed 7 years ago.
I am trying to remove rows from a data frame that repeat a combination of 2 columns in either order. For example, the following code:
vct <- c("A", "B", "C")
a <- b <- vct
combo <- expand.grid(a,b) #generate all possible combinations
combo <- combo[!combo[,1] == combo[,2],] #removes rows with matching column
generates this data frame:
  Var1 Var2
2    B    A
3    C    A
4    A    B
6    C    B
7    A    C
8    B    C
How can I remove rows that are duplicates of any combination of the 2 columns, so that, e.g., #4 A B is removed because #2 B A is already present? The resulting data frame would look like this:
  Var1 Var2
2    B    A
3    C    A
6    C    B
We can sort by row using apply with MARGIN=1, transpose (t) the output, use duplicated to get the logical index of duplicate rows, negate (!) to get the rows that are not duplicated, and subset the dataset.
combo[!duplicated(t(apply(combo, 1, sort))),]
#  Var1 Var2
#2    B    A
#3    C    A
#6    C    B
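For the two-column case, a vectorised alternative to apply() is to build an order-independent key with pmin() and pmax(). This is only a sketch under one assumption that differs from the question's code: the columns are kept as character (stringsAsFactors = FALSE), since pmin() and pmax() compare strings but not unordered factors. The names key_lo and key_hi are made up for illustration.

```r
vct <- c("A", "B", "C")

# All combinations, kept as character columns rather than factors
combo <- expand.grid(vct, vct, stringsAsFactors = FALSE)
combo <- combo[combo$Var1 != combo$Var2, ]   # drop rows where both columns match

# Order-independent key: the smaller and the larger value of each row
key_lo <- pmin(combo$Var1, combo$Var2)
key_hi <- pmax(combo$Var1, combo$Var2)

# Keep the first row seen for each unordered pair
result <- combo[!duplicated(data.frame(key_lo, key_hi)), ]
result
```

If only the unique unordered pairs are needed in the first place, combn(vct, 2) generates them directly without a de-duplication step.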
