Find the rows whose value differs within the same name

I have a data frame that looks like this:
df <- data.frame(Name = c("A", "A","A","B", "B", "C", "D", "E", "E"),
Value = c(1, 1, 1, 2, 15, 3, 4, 5, 5))
Basically, A is 1, B is 2, C is 3, and so on.
However, as you can see, B has both 2 and 15. The 15 is a wrong value and should not be there.
I would like to find the rows whose Value does not match the others within the same Name.
The ideal output would look like this:
B 2
B 15

You can use tidyverse functions such as:
df %>%
group_by(Name, Value) %>%
unique()
giving:
Name Value
1 A 1
2 B 2
3 B 15
4 C 3
5 D 4
6 E 5
Then, to keep only the Names with more than one Value, extend the pipeline above with a second grouping step:
df %>%
group_by(Name, Value) %>%
unique() %>%
group_by(Name) %>%
filter(n() > 1)

Something like this? This searches for Names that are associated with more than one Value and outputs one copy of each {Name, Value} pair.
df <- data.frame(Name = c("A", "A","A","B", "B", "C", "D", "E", "E"),
Value = c(1, 1, 1, 2, 15, 3, 4, 5, 5))
res <- do.call(rbind, lapply(unique(df$Name), (function(i){
if (length(unique(df[df$Name == i,]$Value)) > 1 ) {
out <- df[df$Name == i,]
out[!duplicated(out$Value), ]
}
})))
res
The result is as expected:
Name Value
4 B 2
5 B 15

Filter(function(x)nrow(unique(x))!=1,split(df,df$Name))
$B
Name Value
4 B 2
5 B 15
Or:
Reduce(rbind,by(df,df$Name,function(x) if(nrow(unique(x))>1) x))
Name Value
4 B 2
5 B 15
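A further base R sketch, not taken from the answers above: ave() can count the distinct Value entries per Name, and only rows whose Name has more than one distinct Value are kept. The names n_values and mismatched are illustrative.

```r
df <- data.frame(Name = c("A", "A", "A", "B", "B", "C", "D", "E", "E"),
                 Value = c(1, 1, 1, 2, 15, 3, 4, 5, 5))

# Number of distinct Values within each Name, repeated for every row
n_values <- ave(df$Value, df$Name, FUN = function(v) length(unique(v)))

# Keep rows whose Name maps to more than one Value, then drop exact repeats
mismatched <- df[n_values > 1, ]
mismatched <- mismatched[!duplicated(mismatched), ]
mismatched
```

On the sample data this should leave only the two B rows.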

Related

Create cross-tabulation of most frequent value of string variable and sort by frequency

I have a sample dataset:
df <- data.frame(category = c("A", "A", "B", "C", "C", "D", "E", "C", "E", "A", "B", "C", "B", "B", "B", "D", "D", "D", "D", "B"), year = c(1, 2, 1, 2, 3, 2, 3, 1, 3, 2, 1, 1, 2, 1, 2, 3, 1, 2, 3, 1))
and would like to create a cross-tabulation of year and category such that only the 3 most frequent categories are in the table, sorted by total number of occurrences:
1 2 3
B 4 2 0
D 1 2 2
C 2 1 1
Using something like
df %>%
add_count(category) %>%
filter(n %in% tail(sort(unique(n)),3)) %>%
arrange(desc(n)) %>% {table(.$category, .$year)}
filters for the three most frequent categories but leaves the table unsorted:
1 2 3
B 4 2 0
C 2 1 1
D 1 2 2
This should give you what you want.
# Make a table
df.t <- table(df)
# Order by top occurrences (sum over margin 1)
df.t <- df.t[order(apply(df.t, 1, sum), decreasing=TRUE),]
# Keep top 3 results
df.t <- df.t[1:3,]
Output:
year
category 1 2 3
B 4 2 0
D 1 2 2
C 2 1 1
You'd want to arrange by the row sums after creating the table. If you want to stay (more) within the tidyverse, for example:
df |>
janitor::tabyl(category, year) |>
arrange(desc(rowSums(across(where(is.numeric))))) |>
head(3)
Here this is done with janitor::tabyl(), but you could also use dplyr::tally() and tidyr::pivot_wider() directly, or do df |> table() |> as.data.frame.matrix() like @Adamm.
It's not an elegant solution, but it works using base R:
result <- as.data.frame.matrix(table(df))
result$sum <- rowSums(result)
result <- result[order(-result$sum),]
result <- result[1:3,]
result$sum <- NULL
1 2 3
B 4 2 0
D 1 2 2
C 2 1 1
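One way to see why the dplyr attempt in the question comes out unsorted: table() orders its rows by factor level (alphabetically for character data), not by row order, so a preceding arrange() has no effect on it. A minimal base R sketch, assuming it is acceptable to set the factor levels explicitly (counts, top3 and tab are illustrative names):

```r
df <- data.frame(
  category = c("A", "A", "B", "C", "C", "D", "E", "C", "E", "A",
               "B", "C", "B", "B", "B", "D", "D", "D", "D", "B"),
  year = c(1, 2, 1, 2, 3, 2, 3, 1, 3, 2, 1, 1, 2, 1, 2, 3, 1, 2, 3, 1)
)

# Total occurrences per category, most frequent first
counts <- sort(table(df$category), decreasing = TRUE)
top3 <- names(counts)[1:3]

# Keep the top three and fix the level order so table() follows it
sub <- df[df$category %in% top3, ]
sub$category <- factor(sub$category, levels = top3)
tab <- table(sub$category, sub$year)
tab
```

Because the levels are set in frequency order, the rows of tab should come out as B, D, C without any reordering afterwards.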

How can I assign a value with case_when() from dplyr based on another column value?

Is there a way to assign the value of the column being created using an existing value from another column when using case_when() with mutate()?
The actual dataframe I'm dealing with is quite complicated so here is a trivial example of what I want:
library(dplyr)
df = tibble(Assay = c("A", "A", "B", "C", "D", "D"),
My_ID = c(3, 12, 36, 5, 13, 1),
Modifier = c(12, 6, 5, 9, 3, 6))
new_df = df %>%
mutate(Assay = case_when(
My_ID == 5 ~ "C/D",
My_ID == 12 ~ "Rm",
My_ID == 13 | My_ID == 3 ~ Modifier * 3,
TRUE ~ Assay)) %>%
select(-Modifier)
Expected new_df:
# A tibble: 6 x 2
Assay My_ID
<chr> <dbl>
1 36 3
2 Rm 12
3 B 36
4 C/D 5
5 9 13
6 D 1
I can successfully assign NA values to the column I am mutating when no cases match, but I haven't found a way to assign a value based on another column of the same data frame. I get this error:
Error: Problem with `mutate()` column `Assay`.
i `Assay = case_when(...)`.
x must be a character vector, not a double vector.
Is there a way to do this?
After some experimenting, I found that I was able to do this using paste(). As noted by a commenter, paste() works because the underlying issue is a type mismatch: the Assay column is a character vector, but Modifier * 3 is numeric (a double, as the error message says). paste() implicitly converts its arguments to character. paste0() will also fix the problem, but as.character() addresses the issue directly.
library(dplyr)
df = tibble(Assay = c("A", "A", "B", "C", "D", "D"),
My_ID = c(3, 12, 36, 5, 13, 1),
Modifier = c(12, 6, 5, 9, 3, 6))
new_df = df %>%
mutate(Assay = case_when(
My_ID == 5 ~ "C/D",
My_ID == 12 ~ "Rm",
My_ID == 13 | My_ID == 3 ~ as.character(Modifier * 3),
TRUE ~ Assay)) %>%
select(-Modifier)
This is the output:
print(new_df)
# A tibble: 6 x 2
Assay My_ID
<chr> <dbl>
1 36 3
2 Rm 12
3 B 36
4 C/D 5
5 9 13
6 D 1
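The same type issue can be illustrated without dplyr. The following is only a base R sketch of the idea, using plain vectors in place of the tibble columns (new_assay and hit are illustrative names): the numeric result has to be converted with as.character() before it can sit in a character vector alongside the other labels.

```r
Assay <- c("A", "A", "B", "C", "D", "D")
My_ID <- c(3, 12, 36, 5, 13, 1)
Modifier <- c(12, 6, 5, 9, 3, 6)

# Modifier * 3 is numeric, while Assay is character
class(Modifier * 3)
class(as.character(Modifier * 3))

# Start from the existing labels and overwrite the matching positions
new_assay <- Assay
new_assay[My_ID == 5] <- "C/D"
new_assay[My_ID == 12] <- "Rm"
hit <- My_ID %in% c(13, 3)
new_assay[hit] <- as.character(Modifier[hit] * 3)
new_assay
```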

How to find pattern match in the occurrence of two separate columns in R

I have a dataset where there are two columns: Names and Age. It is a very big dataset and it looks something like the table below:
Name: A, A, A, B, B, E, E, E, E, E
Age: 10, 10, 10, 15, 14, 20, 20, 20, 19
I want to find out how often Name and Age do not co-occur consistently, that is, how often the same name appears with more than one age. For instance, the B who is 15 years old and the B who is 14 years old could be different people.
If I understand the question, you're looking to see how many different ages each name has in the data.
One dplyr approach would be to identify those distinct combinations of age and name, and then count by name. This tells us A has only one age, while B and E each have two.
library(dplyr)
my_data %>%
distinct(name, age) %>%
count(name)
name n
1 A 1
2 B 2
3 E 2
If you want more info about what those combinations are, you could use add_count to keep all the combinations, plus the count by name.
my_data %>%
distinct(name, age) %>%
add_count(name)
name age n
1 A 10 1
2 B 15 2
3 B 14 2
4 E 20 2
5 E 19 2
Sample data
Please note, it is best practice to include in your question the code to generate a specific sample data object. This reduces redundant work for people who want to help you, and reduces ambiguity (e.g. in your example there aren't as many ages as names).
my_data <- data.frame(
name = c("A", "A", "A", "B", "B", "E", "E", "E", "E", "E"),
age = c(10, 10, 10, 15, 14, 20, 20, 20, 19, 20))
If you want to keep only the rows of names whose ages all agree, you can try
subset(
df,
ave(Age, Name, FUN = sd) == 0
)
which gives
Name Age
1 A 10
2 A 10
3 A 10
Or a summary like
> aggregate(cbind(n = Age) ~ Name, df, function(x) length(unique(x)))
Name n
1 A 1
2 B 2
3 E 2
Data
df <- data.frame(
Name = c("A", "A", "A", "B", "B", "E", "E", "E", "E"),
Age = c(10, 10, 10, 15, 14, 20, 20, 20, 19)
)
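One caveat with the sd()-based subset: sd() of a single value is NA, so a name that appears only once would be dropped even though it has no conflicting age. A hedged alternative is to count distinct ages instead. In the sketch below, the extra row for name "F" is a hypothetical addition to the sample data, made only to show the difference.

```r
df <- data.frame(
  Name = c("A", "A", "A", "B", "B", "E", "E", "E", "E", "F"),  # "F" is hypothetical
  Age  = c(10, 10, 10, 15, 14, 20, 20, 20, 19, 30)
)

# sd()-based filter: the single "F" row gets sd = NA and is dropped
sd_based <- subset(df, ave(Age, Name, FUN = sd) == 0)

# Distinct-count filter: keeps every name that has exactly one age
n_ages <- ave(df$Age, df$Name, FUN = function(a) length(unique(a)))
consistent <- df[n_ages == 1, ]
consistent
```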
An option with data.table
library(data.table)
setDT(df)[, .SD[sd(Age) == 0], Name]
This works:
tibble(
Name = c(rep("A", 3), rep("B", 2), rep("E", 5)),
Age = c(rep(10, 3), 15, 14, rep(20, 3), 19, 19)
) %>%
group_by(Name, Age) %>%
summarise(n())
gives:
# A tibble: 5 x 3
# Groups: Name [3]
Name Age `n()`
<chr> <dbl> <int>
1 A 10 3
2 B 14 1
3 B 15 1
4 E 19 2
5 E 20 3

Subset a data frame based on count of values of column x. Want only the top two in R

Here is the data frame:
p <- c(1, 3, 45, 1, 1, 54, 6, 6, 2)
x <- c("a", "b", "a", "a", "b", "c", "a", "b", "b")
df <- data.frame(p, x)
I want to subset the data frame so that the new data frame contains only the rows for the top two values of "x", ranked by how often each value occurs.
One of the simplest ways to achieve this is with the data.table package, which allows for fast and easy aggregation of your data.
Please note that I modified your initial data by appending the elements 10 and "c" to p and x, respectively. This way, you won't see an NA when taking the top two observations of each group.
The idea is to sort your dataset and then use .SD, the special data.table symbol that holds the subset of data for each group, which is a convenient way of subsetting/filtering/extracting observations.
Please, see the code below.
library(data.table)
p <- c(1, 3, 45, 1, 1, 54, 6, 6, 2, 10)
x <- c("a", "b", "a", "a", "b", "c", "a", "b", "b", "c")
df <- data.table(p, x)
# Sort by the group x and then by p in descending order
setorder( df, x, -p )
# Extract the first two rows by group "x"
top_two <- df[ , .SD[ 1:2 ], by = x ]
top_two
#> x p
#> 1: a 45
#> 2: a 6
#> 3: b 6
#> 4: b 3
#> 5: c 54
#> 6: c 10
Created on 2021-02-16 by the reprex package (v1.0.0)
Does this work for you?
Using dplyr:
library(dplyr)
df %>%
add_count(x) %>%
slice_max(n, n = 2)
p x n
1 1 a 4
2 3 b 4
3 45 a 4
4 1 a 4
5 1 b 4
6 6 a 4
7 6 b 4
8 2 b 4
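Note that slice_max(n, n = 2) returns all eight rows here only because "a" and "b" tie for the highest count, as far as I understand its tie handling; with untied counts it would likely keep just the most frequent group. A base R sketch that selects the two most frequent x values explicitly (top_x and top_rows are illustrative names):

```r
p <- c(1, 3, 45, 1, 1, 54, 6, 6, 2)
x <- c("a", "b", "a", "a", "b", "c", "a", "b", "b")
df <- data.frame(p, x)

# Count each value of x and take the names of the two largest counts
top_x <- names(sort(table(df$x), decreasing = TRUE))[1:2]

# Keep only the rows whose x is one of those two values
top_rows <- df[df$x %in% top_x, ]
top_rows
```

If several values tie for second place, this sketch picks whichever sorts first, so a different rule may be needed for such data.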

Assign value in one column if character in other column is x in R

I am trying to assign a specific integer/number in a column if the character in another column is x. I have 5 characters which repeat down the column, and in a new column I need to assign a number to each repeating character. Basically each of the 5 characters has a specific number that needs to go in the new column. Please help!
Here are two solutions to what I think is your task (a little difficult to judge as you do not provide any specific data).
Let's assume this is (like) your data:
df <- data.frame(col1 = sample(LETTERS[1:5], 10, replace = TRUE))
Solution 1: base R
df$new <- ifelse(df$col1 == "A", 1,
ifelse(df$col1 == "B", 2,
ifelse(df$col1 == "C", 3,
ifelse(df$col1 == "D", 4, 5))))
Solution 2: dplyr
library(dplyr)
# Add the new column with mutate(); col1 itself is left unchanged
df <- df %>%
mutate(new = case_when(col1 == "A" ~ 1,
col1 == "B" ~ 2,
col1 == "C" ~ 3,
col1 == "D" ~ 4,
TRUE ~ 5))
The results are identical:
df
col1 new
1 E 5
2 C 3
3 D 4
4 C 3
5 A 1
6 E 5
7 B 2
8 A 1
9 B 2
10 E 5
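When every character maps to a fixed number, a named lookup vector is a compact alternative to nested ifelse() calls. This is only a sketch: the vector name lookup is illustrative, and col1 is filled with the values from the printed result above so that the example is reproducible.

```r
df <- data.frame(col1 = c("E", "C", "D", "C", "A", "E", "B", "A", "B", "E"))

# One entry per character; the names are the keys
lookup <- c(A = 1, B = 2, C = 3, D = 4, E = 5)

# Index the lookup vector by name; as.character() guards against factors
df$new <- unname(lookup[as.character(df$col1)])
df
```

A character that is missing from lookup would yield NA here rather than falling through to 5, which makes unexpected values easier to spot.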
