Renaming factor variables if a condition is satisfied in separate column

I have a df that looks like this:
df <- data.frame(
A = sample(c("Dog", "Cat", "Cat", "Dog", "Fish", "Fish")),
B = sample(c("Brown", "Black", "Brown", "Black", "Brown", "Black")))
df
A B
1 Dog Brown
2 Cat Black
3 Cat Brown
4 Dog Black
5 Fish Brown
6 Fish Black
I want to rename (with dplyr, preferably) the factor level "Fish" to "Dog" in column A whenever column B is "Brown":
A B
1 Dog Brown
2 Cat Black
3 Cat Brown
4 Dog Black
5 Dog Brown #rename here
6 Fish Black

You could use replace:
library(dplyr)
df %>% mutate(A = replace(A, which(A == "Fish" & B == "Brown"), "Dog"))
# A B
#1 Cat Black
#2 Fish Black
#3 Dog Black
#4 Dog Brown
#5 Cat Brown
#6 Dog Brown
And here's a data.table version:
library(data.table)
setDT(df)[A == "Fish" & B == "Brown", A := "Dog"]
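As a rough base R sketch of the same conditional replacement, assuming A is stored as a factor and using a fixed row order instead of sample() so the result is predictable (the variable name hit is only illustrative):

```r
# Deterministic version of the example data (no sample(), so the row order is known)
df <- data.frame(
  A = c("Dog", "Cat", "Cat", "Dog", "Fish", "Fish"),
  B = c("Brown", "Black", "Brown", "Black", "Brown", "Black"),
  stringsAsFactors = TRUE
)

# Logical vector marking the rows where both conditions hold
hit <- df$A == "Fish" & df$B == "Brown"

# "Dog" is already a level of A, so assigning it to a factor is safe here
df$A[hit] <- "Dog"
df
```

Assigning a value that is not an existing level of the factor would produce NA with a warning, so in that situation it is probably simpler to convert A with as.character() first.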

Related

replace strings in a column with their equivalents in another column in another dataframe R

Consider two data frames:
df1 <- data.frame(a=LETTERS[1:6],
b=c("apple", "apple","dog", "red", "red","red"))
df2 <- data.frame(col1=c("apple", "golf", "dog", "red"),
col2=c("fruit", "sport","animal", "color"))
> df1
a b
1 A apple
2 B apple
3 C dog
4 D red
5 E red
6 F red
> df2
col1 col2
1 apple fruit
2 golf sport
3 dog animal
4 red color
I want to create
> output
a b
1 A fruit
2 B fruit
3 C animal
4 D color
5 E color
6 F color
I get the output I am looking for with a basic for loop, but is there a neater way to do this with dplyr pipes?
for(i in 1:nrow(df1)){
df1[i,2] <- df2[df2$col1==df1[i,2], 2]
}

Use a join
library(dplyr)
left_join(df1, df2, by = c("b" = "col1")) %>%
select(a, b = col2)
-output
a b
1 A fruit
2 B fruit
3 C animal
4 D color
5 E color
6 F color
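Or in base R with a named vector (match works similarly); as.character() guards against b being a factor, and unname() drops the lookup names from the result:
df1$b <- unname(setNames(df2$col2, df2$col1)[as.character(df1$b)])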
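The match variant mentioned above could look roughly like this. It is a sketch that rebuilds the question's data with character columns; idx is an illustrative name, and the check for unmatched keys is an added assumption about how missing lookups should be treated:

```r
df1 <- data.frame(a = LETTERS[1:6],
                  b = c("apple", "apple", "dog", "red", "red", "red"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(col1 = c("apple", "golf", "dog", "red"),
                  col2 = c("fruit", "sport", "animal", "color"),
                  stringsAsFactors = FALSE)

# Position of each df1$b value within df2$col1 (NA when there is no match)
idx <- match(df1$b, df2$col1)

# Assumption: a value without an equivalent in df2 should be reported, not silently set to NA
if (anyNA(idx)) warning("some values in df1$b have no equivalent in df2$col1")

# Pull the equivalents in the same order as df1
df1$b <- df2$col2[idx]
df1
```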

A solution with lapply and match:
df1$b <- unlist(lapply(df1$b, function(x) df2$col2[match(x, df2$col1)]))
df1
a b
1 A fruit
2 B fruit
3 C animal
4 D color
5 E color
6 F color

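Another base R option indexes df2 by the matched row positions (comparing the two columns with == would recycle the shorter one and give wrong rows):
df1$b <- df2[match(df1$b, df2$col1), 'col2']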

Using R/dplyr to filter columns?

I have a simple question: I have a dataset that I need to classify by certain parameters, and I was hoping for a solution in R.
Dummy case:
colour age animal
red 10 dog
yellow 5 cat
pink 6 cat
I want to classify this dataset e.g. by:
If (colour is 'red' OR 'pink') AND age is < 7 AND animal is 'cat', then category 1.
Otherwise, category 2.
Output would be:
colour age animal category
red 10 dog 2
yellow 5 cat 2
pink 6 cat 1
Is there a way to manipulate dplyr to achieve this? I'm a clinician not a bioinformatician so go easy!

I like the case_when function in dplyr to set up more complex selections with mutate.
library(tidyverse)
df <- data.frame(colour = c("red", "yellow", "pink", "red", "pink"),
age = c(10, 5, 6, 12, 10),
animal = c("dog", "cat", "cat", "hamster", "cat"))
df
#> colour age animal
#> 1 red 10 dog
#> 2 yellow 5 cat
#> 3 pink 6 cat
#> 4 red 12 hamster
#> 5 pink 10 cat
df <- mutate(df, category = case_when(
((colour == "red" | colour == "pink") & age < 7 & animal == "cat") ~ 1,
(colour == "yellow" | age != 5 & animal == "dog") ~ 2,
(colour == "pink" | animal == "cat") ~ 3,
(TRUE) ~ 4) )
df
#> colour age animal category
#> 1 red 10 dog 2
#> 2 yellow 5 cat 2
#> 3 pink 6 cat 1
#> 4 red 12 hamster 4
#> 5 pink 10 cat 3
Created on 2021-01-17 by the reprex package (v0.3.0)

You could also compute this directly in base R:
df$category <- with(df,!(colour %in% c('red', 'pink') & age < 7 & animal == 'cat')) + 1
df
# colour age animal category
#1 red 10 dog 2
#2 yellow 5 cat 2
#3 pink 6 cat 1
And in dplyr:
df %>%
mutate(category = as.integer(!(colour %in% c('red', 'pink') &
age < 7 & animal == 'cat')) + 1)
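One detail worth spelling out for the rule in the question: in R, & is evaluated before |, so the "red OR pink" part needs parentheses (or %in%, as used above). A small sketch on the three-row dummy data, with unparenthesised and grouped as illustrative names:

```r
df <- data.frame(colour = c("red", "yellow", "pink"),
                 age    = c(10, 5, 6),
                 animal = c("dog", "cat", "cat"),
                 stringsAsFactors = FALSE)

# Without parentheses this reads as: red OR (pink AND age < 7 AND cat)
unparenthesised <- df$colour == "red" | df$colour == "pink" & df$age < 7 & df$animal == "cat"

# With parentheses this reads as: (red OR pink) AND age < 7 AND cat
grouped <- (df$colour == "red" | df$colour == "pink") & df$age < 7 & df$animal == "cat"

# Category 1 where the grouped rule holds, otherwise category 2
df$category <- ifelse(grouped, 1L, 2L)
df
```

The unparenthesised version flags the red dog in row 1 as a match, which is not what the rule intends.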

How to group comma-separated variables in the same column?

Here is my dummy data:
#> id column
#> 1 blue, red, dog, cat
#> 2 red, blue, dog
#> 3 blue
#> 4 red
#> 5 dog, cat
#> 6 cat
#> 7 red, cat
#> 8 dog
#> 9 cat, red
#> 10 blue, cat
I want to tell R, for example, that dog and cat = animal and red and blue = colour. I basically want to count the number (and eventually the percentage) of animals, colours and both.
#> id column newcolumn
#> 1 blue, red, dog, cat both
#> 2 red, blue, dog both
#> 3 blue colour
#> 4 red colour
#> 5 dog, cat animal
#> 6 cat animal
#> 7 red, cat both
#> 8 dog animal
#> 9 cat, red both
#> 10 blue, cat both
So far I have only been able to total up the number of red, blue, dog and cat by doing the following:
column.string<-paste(df$column, collapse=",")
column.vector<-strsplit(column.string, ",")[[1]]
column.vector.clean<-gsub(" ", "", column.vector)
table(column.vector.clean)
I would be very grateful for help. Here is my sample dummy data:
df <- data.frame(id = c(1:10),
column = c("blue, red, dog, cat", "red, blue, dog", "blue", "red", "dog, cat", "cat", "red, cat", "dog", "cat, red", "blue, cat"))

You can define all possible animals and colours in vectors, then split column on commas and test each row's values against them:
animal <- c('dog', 'cat')
colour <- c('red', 'blue')
df$newcolumn <- sapply(strsplit(df$column, ',\\s*'), function(x) {
x <- x[x != "NA"]
if(!length(x)) return(NA)
if(all(x %in% animal)) 'animal'
else if(all(x %in% colour)) 'colour'
else 'both'
})
df
# id column newcolumn
#1 1 blue, red, dog, cat both
#2 2 red, blue, dog both
#3 3 blue colour
#4 4 red colour
#5 5 dog, cat animal
#6 6 cat animal
#7 7 red, cat both
#8 8 dog animal
#9 9 cat, red both
#10 10 blue, cat both
To calculate the proportions, you can then use prop.table with table:
prop.table(table(df$newcolumn, useNA = "ifany"))
#animal both colour
# 0.3 0.5 0.2
Using dplyr, we can separate the rows on comma, for each id create a newcolumn based on conditions and calculate the proportions.
library(dplyr)
df %>%
tidyr::separate_rows(column, sep = ',\\s*') %>%
group_by(id) %>%
summarise(newcolumn = case_when(all(column %in% animal) ~ 'animal',
all(column %in% colour) ~ 'colour',
TRUE ~ 'both'),
column = toString(column)) %>%
count(newcolumn) %>%
mutate(n = n/sum(n))
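If more categories are added later, a single lookup vector may scale better than one vector per category. The following is only a sketch in base R: lookup and classify are hypothetical names, and it assumes every term in column appears in the lookup (an unknown term would count as a second kind and be labelled 'both').

```r
df <- data.frame(id = 1:10,
                 column = c("blue, red, dog, cat", "red, blue, dog", "blue", "red",
                            "dog, cat", "cat", "red, cat", "dog", "cat, red", "blue, cat"),
                 stringsAsFactors = FALSE)

# Hypothetical lookup: term -> category
lookup <- c(dog = "animal", cat = "animal", red = "colour", blue = "colour")

# One category if all terms agree, otherwise "both"
classify <- function(terms) {
  kinds <- unique(unname(lookup[terms]))
  if (length(kinds) == 1) kinds else "both"
}

df$newcolumn <- vapply(strsplit(df$column, ",\\s*"), classify, character(1))

# Counts and percentages per category
counts <- table(df$newcolumn)
percentages <- 100 * prop.table(counts)
counts
percentages
```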

R variable number of string concatenations within group_by

Let's say I have the following table of houses (or anything) and their colors:
I'm trying to:
1. group_by(Group)
2. count rows (I assume with length(unique(ID)))
3. mutate or summarize into a new column with a count of each color in the group, as a string.
Result should be:
So I know step 3 could be done by manually entering every possible combination with something like
df <- df %>%
group_by(Group) %>%
mutate(
Summary = case_when(
all(
sum(count_green) > 0
) ~ paste(length(unique(ID)), " houses, ", count_green, " green")
)
)
but what if I have hundreds of possible combinations? Is there a way to paste into a string and append for each new color/count?

Here is one approach: count the frequency of each 'Group'/'Color' pair with add_count, unite that count with 'Color' into 'nColor', then, grouped by 'Group', build the 'Summary' column by concatenating the group size (n()) with the unique elements of 'nColor'.
library(dplyr)
library(tidyr)
library(stringr)
df %>%
add_count(Group, Color) %>%
unite(nColor, n, Color, sep= ' ', remove = FALSE) %>%
group_by(Group) %>%
mutate(
Summary = str_c(n(), ' houses, ', toString(unique(nColor)))) %>%
select(-nColor)
# Groups: Group [2]
# ID Group Color n Summary
# <int> <chr> <chr> <int> <chr>
#1 1 a Green 2 3 houses, 2 Green, 1 Orange
#2 2 a Green 2 3 houses, 2 Green, 1 Orange
#3 3 a Orange 1 3 houses, 2 Green, 1 Orange
#4 4 b Blue 2 3 houses, 2 Blue, 1 Yellow
#5 5 b Yellow 1 3 houses, 2 Blue, 1 Yellow
#6 6 b Blue 2 3 houses, 2 Blue, 1 Yellow
data
df <- structure(list(ID = 1:6, Group = c("a", "a", "a", "b", "b", "b"
), Color = c("Green", "Green", "Orange", "Blue", "Yellow", "Blue"
)), class = "data.frame", row.names = c(NA, -6L))

Here's an approach with map_chr from purrr and a lot of pasting.
library(dplyr)
library(purrr)
df %>%
group_by(Group) %>%
mutate(Summary = paste(n(),"houses,",
paste(map_chr(unique(as.character(Color)),
~paste(sum(Color == .x),.x)),
collapse = ", ")))
## A tibble: 6 x 4
## Groups: Group [2]
# ID Group Color Summary
# <int> <fct> <fct> <chr>
#1 1 a Green 3 houses, 2 Green, 1 Orange
#2 2 a Green 3 houses, 2 Green, 1 Orange
#3 3 a Orange 3 houses, 2 Green, 1 Orange
#4 4 b Blue 3 houses, 2 Blue, 1 Yellow
#5 5 b Yellow 3 houses, 2 Blue, 1 Yellow
#6 6 b Blue 3 houses, 2 Blue, 1 Yellow
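For comparison, the same summary string can be sketched in base R without any packages. The helper name summarise_group is hypothetical, and the data is the structure posted above with character columns:

```r
df <- data.frame(ID = 1:6,
                 Group = c("a", "a", "a", "b", "b", "b"),
                 Color = c("Green", "Green", "Orange", "Blue", "Yellow", "Blue"),
                 stringsAsFactors = FALSE)

# Build "<n> houses, <count> <color>, ..." for one group's colors
summarise_group <- function(colors) {
  # Fix the levels to first-appearance order so the string is not sorted alphabetically
  counts <- table(factor(colors, levels = unique(colors)))
  paste0(length(colors), " houses, ",
         paste(as.vector(counts), names(counts), collapse = ", "))
}

# One string per group, then mapped back onto every row of that group
per_group <- vapply(split(df$Color, df$Group), summarise_group, character(1))
df$Summary <- unname(per_group[df$Group])
df
```

Because the counts come from table(), this works for any number of distinct colors without listing the combinations by hand.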

What is the frequency of any cell of column to appear in any other column?

I have a question about comparing columns in a data frame.
Say I have some data that looks like this:
Unique <- c("apple", "orange", "melon", "car", "mouse", "headphones", "light")
a1 <- c("apple", "tomato", "banana", "dog", "cat", "headphones", "future")
a2 <- c("apple", "orange", "pear", "monkey", "dog", "cat", "river")
a3 <- c("tomato", "pineapple", "cherry", "car", "space", "mars", "rocket")
df <- data.frame(Unique, a1, a2, a3)
df
## Unique a1 a2 a3
## 1: apple apple apple tomato
## 2: orange tomato orange pineapple
## 3: melon banana pear cherry
## 4: car dog monkey car
## 5: mouse cat dog space
## 6: headphones headphones cat mars
## 7: light future river rocket
The question I am trying to answer is: what is the frequency of each cell of column "Unique" to appear in the entire data frame except in Unique column?
I would like an output that looks something like this:
apple 2
orange 1
melon 0
car 1
mouse 0
headphones 1
light 0
This is because, in the entire data frame except the "Unique" column, apple appears 2 times, orange appears 1 time, melon appears 0 times, and so on.
How would you go about getting this?
Also, how would we sort them by frequency, say from highest to lowest?
I have been trying to figure this out for a couple of days now, and I just can't crack it.
Any help would be extremely appreciated!
P.S. Also, in R, it seems that each "cell" in a data frame is not referred to as a cell. Am I correct? What are they called, elements?

We can unlist the columns other than the 'Unique', convert it to factor with levels specified as 'Unique' and get the table in base R
table(factor(unlist(df[-1]), levels = df$Unique))
# apple orange melon car mouse headphones light
# 2 1 0 1 0 1 0
Or using tidyverse
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = -Unique) %>%
mutate(value = factor(value, levels = unique(Unique))) %>%
filter(!is.na(value)) %>%
count(value, .drop = FALSE)
# A tibble: 7 x 2
# value n
#* <fct> <int>
#1 apple 2
#2 orange 1
#3 melon 0
#4 car 1
#5 mouse 0
#6 headphones 1
#7 light 0
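The question also asks about ordering by frequency, which the code above does not cover. One possible way, sketched in base R on the same data (counts and sorted are illustrative names), is to pass the table to sort():

```r
Unique <- c("apple", "orange", "melon", "car", "mouse", "headphones", "light")
a1 <- c("apple", "tomato", "banana", "dog", "cat", "headphones", "future")
a2 <- c("apple", "orange", "pear", "monkey", "dog", "cat", "river")
a3 <- c("tomato", "pineapple", "cherry", "car", "space", "mars", "rocket")
df <- data.frame(Unique, a1, a2, a3, stringsAsFactors = FALSE)

# Count how often each Unique value occurs in the other columns (zeros are kept)
counts <- table(factor(unlist(df[-1]), levels = df$Unique))

# Highest frequency first
sorted <- sort(counts, decreasing = TRUE)
sorted
```

In the tidyverse version, adding arrange(desc(n)) after count() should give the same ordering.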

Here is a solution based on the tidyverse.
Unique <- c("apple", "orange", "melon", "car", "mouse", "headphones", "light")
a1 <- c("apple", "tomato", "banana", "dog", "cat", "headphones", "future")
a2 <- c("apple", "orange", "pear", "monkey", "dog", "cat", "river")
a3 <- c("tomato", "pineapple", "cherry", "car", "space", "mars", "rocket")
df <- data.frame(Unique, a1, a2, a3,stringsAsFactors = FALSE)
df
library(tidyr)
library(dplyr)
df[,2:4] %>% pivot_longer(.,cols=c("a1","a2","a3")) %>%
group_by(value) %>% summarise(.,count = n()) %>%
right_join(.,df[1],by = c('value' = 'Unique')) %>%
mutate(count = ifelse(is.na(count),0,count))
...and the output.
# A tibble: 7 x 2
value count
<chr> <dbl>
1 apple 2
2 orange 1
3 melon 0
4 car 1
5 mouse 0
6 headphones 1
7 light 0

With library(data.table):
Transform your data.frame into a data.table
setDT(df)
Then melt the data.table with id.vars = "Unique". This is convenient because every value from the other columns ends up in a single value column, next to the Unique entry from its original row
## melt(df,id.vars = "Unique")
## Unique variable value
## 1: apple a1 apple
## 2: orange a1 tomato
## 3: melon a1 banana
## 4: car a1 dog
## 5: mouse a1 cat
## 6: headphones a1 headphones
## 7: light a1 future
## 8: apple a2 apple
## 9: orange a2 orange
## 10: melon a2 pear
## 11: car a2 monkey
## 12: mouse a2 dog
## 13: headphones a2 cat
## 14: light a2 river
## 15: apple a3 tomato
## 16: orange a3 pineapple
## 17: melon a3 cherry
## 18: car a3 car
## 19: mouse a3 space
## 20: headphones a3 mars
## 21: light a3 rocket
## Unique variable value
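Finally, for each value of Unique, count the melted rows where value equals Unique. Because melt keeps each value next to the Unique entry from its original row, this counts matches within the same row only; in this data every match happens to fall in the same row, so the result agrees with the other answers.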
melt(df,id.vars = "Unique")[,sum(Unique==value),Unique]
## Unique V1
## 1: apple 2
## 2: orange 1
## 3: melon 0
## 4: car 1
## 5: mouse 0
## 6: headphones 1
## 7: light 0
