Distinct in dplyr does not work (sometimes)


I have the following data frame which I have obtained from a count. I have used dput to make the data frame available and then edited the data frame so there is a duplicate of A.
df <- structure(list(Procedure = structure(c(4L, 1L, 2L, 3L), .Label = c("A", "A", "C", "D", "-1"),
class = "factor"), n = c(10717L, 4412L, 2058L, 1480L)),
class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -4L), .Names = c("Procedure", "n"))
print(df)
# A tibble: 4 x 2
Procedure n
<fct> <int>
1 D 10717
2 A 4412
3 A 2058
4 C 1480
Now I would like to take distinct on Procedure and only keep the first A.
df %>%
distinct(Procedure, .keep_all=TRUE)
# A tibble: 4 x 2
Procedure n
<fct> <int>
1 D 10717
2 A 4412
3 A 2058
4 C 1480
It does not work. Strange...

If we print the Procedure column, we can see that the level A is duplicated, which is problematic for the distinct function.
df$Procedure
[1] D A A C
Levels: A A C D -1
Warning message:
In print.factor(x) : duplicated level [2] in factor
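A quick way to confirm this, sketched with base R only (the names x, dup_index and codes are just for illustration): the two rows that print as A carry different integer codes, which is presumably why they are not treated as duplicates.

```r
# Same malformed factor as in the question: the label "A" appears twice.
x <- structure(c(4L, 1L, 2L, 3L),
               .Label = c("A", "A", "C", "D", "-1"),
               class = "factor")

# Index of the first duplicated level; 0 would mean all levels are unique.
dup_index <- anyDuplicated(levels(x))

# Underlying integer codes: the two "A" rows are stored as 1 and 2.
codes <- as.integer(x)
```

A non-zero dup_index is a hint that the factor should be rebuilt before grouping or de-duplicating on it.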
One way to fix this is to rebuild the factor with the factor function, which regenerates the levels from the values actually present and so drops the duplicated level. Another way is to convert the Procedure column to character.
df <- structure(list(Procedure = structure(c(4L, 1L, 2L, 3L), .Label = c("A", "A", "C", "D", "-1"),
class = "factor"), n = c(10717L, 4412L, 2058L, 1480L)),
class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -4L), .Names = c("Procedure", "n"))
library(tidyverse)
df %>%
mutate(Procedure = factor(Procedure)) %>%
distinct(Procedure, .keep_all=TRUE)
# # A tibble: 3 x 2
# Procedure n
# <fct> <int>
# 1 D 10717
# 2 A 4412
# 3 C 1480
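The character route mentioned above could look roughly like this. It is only a sketch: it uses a plain data.frame instead of the tibble from the question, and the names procedure and out are made up for the example.

```r
library(dplyr)

# Malformed factor from the question (duplicated level "A").
procedure <- structure(c(4L, 1L, 2L, 3L),
                       .Label = c("A", "A", "C", "D", "-1"),
                       class = "factor")
df <- data.frame(n = c(10717L, 4412L, 2058L, 1480L))
df$Procedure <- procedure

# After conversion the labels themselves are compared, not the factor codes,
# so the second "A" row is dropped and the first one is kept.
out <- df %>%
  mutate(Procedure = as.character(Procedure)) %>%
  distinct(Procedure, .keep_all = TRUE)
```

Unlike factor(), this leaves Procedure as a character column, so convert it back afterwards if a factor is needed downstream.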

You have a duplicated value in the labels argument, .Label = c("A", "A", "C", "D", "-1"), and that is the issue. Incidentally, your way of initializing the tibble looks rather unusual (I do not know exactly what your goal is).
Why not use
df <- tibble(
Procedure = c("D", "A", "A", "C"),
n = c(10717L, 4412L, 2058L, 1480L)
)

Related

Summation over multiple rows based on a condition in R

I have a dataset like the following
name city number
A    M    2
A    N    3
A    O    5
A    P    7
B    M    7
B    N    8
B    O    9
B    P    2
For each name category, I want to sum the number values for cities M and N and put the result in a new variable. The same goes for the O and P values.
The dataset should look like the following:
name city number
A    X    5
A    Y    12
B    X    15
B    Y    11
I'm new to R programming. I have tried to use group_by and mutate but was not successful.

We could modify the values in the column 'city' to 'X', 'Y', and do a group by sum
library(dplyr)
df1 %>%
group_by(name, city = case_when(city %in% c("M", "N") ~ 'X',
city %in% c("O", "P") ~ "Y")) %>%
summarise(number = sum(number), .groups = 'drop')
-output
# A tibble: 4 × 3
name city number
<chr> <chr> <int>
1 A X 5
2 A Y 12
3 B X 15
4 B Y 11
data
df1 <- structure(list(name = c("A", "A", "A", "A", "B", "B", "B", "B"
), city = c("M", "N", "O", "P", "M", "N", "O", "P"), number = c(2L,
3L, 5L, 7L, 7L, 8L, 9L, 2L)), row.names = c(NA, -8L), class = "data.frame")
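One thing to keep in mind with the recoding step above: case_when returns NA for any value that matches none of the conditions, so an unexpected city would end up in its own NA group. A small sketch (the city "Q" is hypothetical, it is not in the question's data):

```r
library(dplyr)

# "Q" is a made-up city that matches neither condition.
city <- c("M", "N", "O", "P", "Q")

recoded <- case_when(
  city %in% c("M", "N") ~ "X",
  city %in% c("O", "P") ~ "Y"
)
# The unmatched "Q" becomes NA rather than raising an error.
```

If that is not wanted, a final catch-all condition such as TRUE ~ city would keep unmatched cities unchanged.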

Creating new columns based on data in row separated by specific character in R

I've the following table
Owner Pet             Housing_Type
A     Cats;Dog;Rabbit 3
B     Dog;Rabbit      2
C     Cats            2
D     Cats;Rabbit     3
E     Cats;Fish       1
The code is as follows:
Data_Pets = structure(list(Owner = structure(1:5, .Label = c("A", "B", "C", "D",
"E"), class = "factor"), Pets = structure(c(2L, 5L, 1L,4L, 3L), .Label = c("Cats ",
"Cats;Dog;Rabbit", "Cats;Fish","Cats;Rabbit", "Dog;Rabbit"), class = "factor"),
House_Type = c(3L,2L, 2L, 3L, 1L)), class = "data.frame", row.names = c(NA, -5L))
Can anyone advise me how I can create new columns based on the data in Pet column by creating a new column for each animal separated by ; to look like the following table?
Owner Cats Dog Rabbit Fish Housing_Type
A     Y    Y   Y      N    3
B     N    Y   Y      N    2
C     N    Y   N      N    2
D     Y    N   Y      N    3
E     Y    N   N      Y    1
Thanks!

One approach is to define a helper function that tests each row for a specific animal, then bind the resulting columns to the original data frame (df below is the question's Data_Pets).
Note that some wrangling is needed to strip whitespace before identifying the unique animals to query.
f <- Vectorize(function(string, match) {
ifelse(grepl(match, string), "Y", "N")
}, c("match"))
df %>%
bind_cols(
f(df$Pets, unique(unlist(strsplit(trimws(as.character(df$Pets)), ";"))))
)
Owner Pets House_Type Cats Dog Rabbit Fish
1 A Cats;Dog;Rabbit 3 Y Y Y N
2 B Dog;Rabbit 2 N Y Y N
3 C Cats 2 Y N N N
4 D Cats;Rabbit 3 Y N Y N
5 E Cats;Fish 1 Y N N Y
Or, more generally, if you are not sure that the separator is ; or whitespace may be present, stringi is useful:
dplyr::bind_cols(
df,
f(df$Pets, unique(unlist(stringi::stri_extract_all_words(df$Pets))))
)
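Note that the helper above relies on grepl, which matches substrings, so an animal whose name is contained in another one would also match. A variant that compares whole tokens could be sketched in base R like this. The data frame is rebuilt with plain character columns for brevity, and tokens, animals and flags are names made up for the example.

```r
# Simplified copy of the question's data (character instead of factor columns).
Data_Pets <- data.frame(
  Owner = c("A", "B", "C", "D", "E"),
  Pets = c("Cats;Dog;Rabbit", "Dog;Rabbit", "Cats ", "Cats;Rabbit", "Cats;Fish"),
  House_Type = c(3L, 2L, 2L, 3L, 1L),
  stringsAsFactors = FALSE
)

# Split each entry on ";" and strip stray whitespace from every token.
tokens <- lapply(strsplit(Data_Pets$Pets, ";", fixed = TRUE), trimws)
animals <- unique(unlist(tokens))

# One column per animal: "Y" if the owner's token list contains it exactly.
flags <- sapply(animals, function(a) {
  ifelse(vapply(tokens, function(p) a %in% p, logical(1)), "Y", "N")
})

result <- cbind(Data_Pets["Owner"], flags, Data_Pets["House_Type"])
```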

You can use separate_rows and pivot_wider from tidyr library:
library(tidyr)
library(dplyr)
Data_Pets %>%
separate_rows(Pets , sep = ";") %>%
mutate(Pets = trimws(Pets)) %>%
mutate(temp = row_number()) %>%
pivot_wider(names_from = Pets, values_from = temp) %>%
mutate(across(c(Cats:Fish), function(x) if_else(is.na(x), "N", "Y"))) %>%
dplyr::relocate(House_Type, .after = Fish)
which will generate:
# Owner Cats Dog Rabbit Fish House_Type
# <fct> <chr> <chr> <chr> <chr> <int>
# 1 A Y Y Y N 3
# 2 B N Y Y N 2
# 3 C Y N N N 2
# 4 D Y N Y N 3
# 5 E Y N N Y 1
Data:
Data_Pets = structure(list(Owner = structure(1:5, .Label = c("A", "B", "C", "D",
"E"), class = "factor"), Pets = structure(c(2L, 5L, 1L,4L, 3L), .Label = c("Cats ",
"Cats;Dog;Rabbit", "Cats;Fish","Cats;Rabbit", "Dog;Rabbit"), class = "factor"),
House_Type = c(3L,2L, 2L, 3L, 1L)), class = "data.frame", row.names = c(NA, -5L))

How to collapse rows by identical values in a column

Good evening,
I have a two-column, tab-separated .txt file like the following:
number letter
1 a
1 b
2 a
2 b
3 b
I would like to collapse the rows that share the same value in the column "number" by building a comma-separated value in the corresponding column "letter".
In other words, this should be the output:
number letter
1 a,b
2 a,b
3 b
I have searched the web but did not find a working solution.
Thank you in advance,
Giuseppe

We can use aggregate in base R
aggregate(letter ~ number, df1, FUN = paste, collapse=",")
-output
# number letter
#1 1 a,b
#2 2 a,b
#3 3 b
Or with tidyverse
library(dplyr)
library(stringr)
df1 %>%
group_by(number) %>%
summarise(letter = str_c(letter, collapse=","))
data
df1 <- structure(list(number = c(1L, 1L, 2L, 2L, 3L), letter = c("a",
"b", "a", "b", "b")), class = "data.frame", row.names = c(NA,
-5L))
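Since the question starts from a tab-separated file, here is a sketch of the whole round trip. The file content is supplied inline through the text argument so the example runs on its own; with a real file you would pass its path to read.delim instead. The names txt and collapsed are made up for the example.

```r
# Inline stand-in for the tab-separated file described in the question.
txt <- "number\tletter\n1\ta\n1\tb\n2\ta\n2\tb\n3\tb"

# read.delim assumes a header row and tab as the separator by default.
df1 <- read.delim(text = txt, stringsAsFactors = FALSE)

# Collapse the letters within each number into one comma-separated string.
collapsed <- aggregate(letter ~ number, df1, FUN = paste, collapse = ",")
```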

We can also combine aggregate() with toString:
#Code
newdf <- aggregate(letter~.,df,toString)
Output:
number letter
1 1 a, b
2 2 a, b
3 3 b
Some data:
#Data
df <- structure(list(number = c(1L, 1L, 2L, 2L, 3L), letter = c("a",
"b", "a", "b", "b")), class = "data.frame", row.names = c(NA,
-5L))

Is there a way to capture the sequence of values based on their rank

Hi all, I have a data frame. I need to create additional columns that tell at which position each category appears in ColA. Please refer to the expected output below.
df
ColB ColA
X A>B>C
U B>C>A
Z C>A>B
Expected output
df1
ColB ColA A B C
X A>B>C 1 2 3
U B>C>A 3 1 2
Z C>A>B 2 3 1

We can first split ColA into separate rows, group_by ColB and give a unique row number to each entry, and then convert the data into wide format using pivot_wider.
library(dplyr)
library(tidyr)
df %>%
mutate(ColC = ColA) %>%
separate_rows(ColC, sep = ">") %>%
group_by(ColB) %>%
mutate(row = row_number()) %>%
pivot_wider(names_from = ColC, values_from = row)
# ColB ColA A B C
# <fct> <fct> <int> <int> <int>
#1 X A>B>C 1 2 3
#2 U B>C>A 3 1 2
#3 Z C>A>B 2 3 1
data
df <- structure(list(ColB = structure(c(2L, 1L, 3L), .Label = c("U",
"X", "Z"), class = "factor"), ColA = structure(1:3, .Label = c("A>B>C",
"B>C>A", "C>A>B"), class = "factor")), class = "data.frame", row.names = c(NA, -3L))

We can do this in base R
df[LETTERS[1:3]] <- t(sapply(regmatches(df$ColA, gregexpr("[A-Z]",
df$ColA)), match, x = LETTERS[1:3]))
df
# ColB ColA A B C
#1 X A>B>C 1 2 3
#2 U B>C>A 3 1 2
#3 Z C>A>B 2 3 1
data
df <- structure(list(ColB = structure(c(2L, 1L, 3L), .Label = c("U",
"X", "Z"), class = "factor"), ColA = structure(1:3, .Label = c("A>B>C",
"B>C>A", "C>A>B"), class = "factor")), class = "data.frame",
row.names = c(NA,
-3L))
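The one-liner above is fairly dense, so here is the same idea unpacked step by step. It is a sketch that uses strsplit instead of regmatches and assumes the categories are exactly A, B and C; parts and pos are names made up for the example.

```r
ColA <- c("A>B>C", "B>C>A", "C>A>B")

# Split every sequence into its ordered categories.
parts <- strsplit(ColA, ">", fixed = TRUE)

# For each row, look up where A, B and C sit within that row's sequence.
pos <- t(sapply(parts, function(p) match(LETTERS[1:3], p)))
colnames(pos) <- LETTERS[1:3]
```

Each row of pos holds the positions of A, B and C for the matching row of ColA, so it can be bound to the data frame with cbind.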

Count matching instances between two data frames [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 8 years ago.
I'm a newbie with R and can't find my answer/anything that works.
I've got two data frames that look like..
Teams
A
B
C
...
and
TCF
A
B
C
C
B
A
...
I need to count how many times each value in the first data frame's column occurs in the second data frame and write that count back to the first data frame. Thanks in advance!

You could use base R to do this:
sapply(unique(df1$Teams), function(x) sum(df2$TCF %in% x))
#A B C
#2 2 2
Or
setNames(table(match(df2$TCF, unique(df1$Teams))), unique(df1$Teams))
#A B C
#2 2 2
Or using data.table
library(data.table)
setkey(setDT(df1), Teams)
setkey(setDT(df2), TCF)
df2[J(unique(df1$Teams)),.N, by=.EACHI]
# TCF N
#1: A 2
#2: B 2
#3: C 2
data
df1 <- structure(list(Teams = c("A", "B", "C")), .Names = "Teams",
class = "data.frame", row.names = c(NA,-3L))
df2 <- structure(list(TCF = c("A", "B", "C", "C", "B", "A")), .Names = "TCF",
class = "data.frame", row.names = c(NA, -6L))

Would this option be easier on the eyes?
library(dplyr)
df2 %>% count(TCF) %>% filter(TCF %in% unique(df1$Teams))
# Source: local data frame [3 x 2]
# TCF n
# 1 A 2
# 2 B 2
# 3 C 2
Data
df1 <- structure(list(Teams = c("A", "B", "C")), .Names = "Teams", class = "data.frame", row.names = c(NA,
-3L))
df2 <- structure(list(TCF = structure(c(1L, 2L, 3L, 3L, 2L, 1L, 4L,
5L, 5L), .Label = c("A", "B", "C", "X", "Y"), class = "factor")), .Names = "TCF", row.names = c(NA,
-9L), class = "data.frame")
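Note that count followed by filter only returns teams that actually occur in df2. If the counts should be written back to the first data frame, including zeros for teams that never occur, a base R sketch could look like this (team "D" and the column name n are hypothetical, added only to show the zero case):

```r
# "D" is a made-up team with no matches in df2.
df1 <- data.frame(Teams = c("A", "B", "C", "D"), stringsAsFactors = FALSE)
df2 <- data.frame(TCF = c("A", "B", "C", "C", "B", "A"), stringsAsFactors = FALSE)

# Using df1$Teams as the factor levels makes table() report every team,
# in the same order as df1, with 0 for teams that do not appear.
df1$n <- as.integer(table(factor(df2$TCF, levels = df1$Teams)))
```

This assumes the values in Teams are unique; values of TCF that are not in Teams are simply ignored.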
