How to use chained ifelse and grepl? - r

I have a database of thoroughbred names that is structured as follows:
HorseName <- c("Grey emperor", "Smokey grey", "Gaining greys", "chestnut", "Glowing Chestnuts", "Ruby red", "My fair lady", "Man of war")
Number <- 1:8
df <- data.frame(HorseName, Number)
I now wish to search for occurrences of colours within each horse's name. Specifically, I wish to select all the instances of 'grey' and 'chestnut', creating a new column that identifies these colours. Any other names can simply be 'other'. Unfortunately, the names are not consistent, with plurals included and varying case formats. How would I go about doing this in R?
My anticipated output would be:
df$Type <- c("Grey", "Grey", "Grey", "Chestnut", "Chestnut", "Other", "Other", "Other")
I am familiar with chained ifelse statements but unsure how to handle the plural occurrences and case sensitivity!

In case you are interested in other ways to do this, here's a tidyverse alternative which has the same end result as @amrrs's answer.
library(tidyverse) # attaches dplyr and stringr
df %>%
  mutate(Type = str_extract(str_to_lower(HorseName), "grey|chestnut")) %>%
  mutate(Type = str_to_title(if_else(is.na(Type), "other", Type)))
#> HorseName Number Type
#> 1 Grey emperor 1 Grey
#> 2 Smokey grey 2 Grey
#> 3 Gaining greys 3 Grey
#> 4 chestnut 4 Chestnut
#> 5 Glowing Chestnuts 5 Chestnut
#> 6 Ruby red 6 Other
#> 7 My fair lady 7 Other
#> 8 Man of war 8 Other

Converting all the input text df$HorseName to lower case before pattern matching with grepl (using lower-cased pattern) solves this problem.
df$Type <- ifelse(grepl('grey', tolower(df$HorseName)), 'Grey',
           ifelse(grepl('chestnut', tolower(df$HorseName)), 'Chestnut',
                  'others'))
df
HorseName Number Type
1 Grey emperor 1 Grey
2 Smokey grey 2 Grey
3 Gaining greys 3 Grey
4 chestnut 4 Chestnut
5 Glowing Chestnuts 5 Chestnut
6 Ruby red 6 others
7 My fair lady 7 others
8 Man of war 8 others
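As a variation on the answer above, grepl can also do the case folding itself through its ignore.case argument, so tolower is not strictly needed. A minimal sketch, assuming the question's example data and the label "Other" from the anticipated output:

```r
# Example data from the question
HorseName <- c("Grey emperor", "Smokey grey", "Gaining greys", "chestnut",
               "Glowing Chestnuts", "Ruby red", "My fair lady", "Man of war")
df <- data.frame(HorseName, Number = 1:8)

# ignore.case = TRUE matches "Grey", "grey", "GREY", ...; because grepl looks
# for the pattern anywhere in the string, plurals such as "greys" match too
df$Type <- ifelse(grepl("grey", df$HorseName, ignore.case = TRUE), "Grey",
           ifelse(grepl("chestnut", df$HorseName, ignore.case = TRUE), "Chestnut",
                  "Other"))
df
```

The order of the tests matters: a name containing both words would be labelled Grey, because that condition is checked first.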

Related

Renaming coat colors in R goes wrong with str_detect

I have a dataset of horses that I want to group by coat color. The dataset uses more than 140 colors; I would like to reduce these to a few coat colors and assign the rest to Other. For some horses the coat color has not been registered, i.e. those are unknown. Below is what the new colors should be. (To illustrate the problem I show an old coat color and a new one, but I simply want to change the coat colors, not create a new column with colors.)
Horse ID  Coatcolor(old)  Coatcolor
1         black           Black
2         bayspotted      Spotted
3         chestnut        Chestnut
4         grey            Grey
5         cream dun       Other
6                         Unknown
7         blue roan       Other
8         chestnutgrey    Grey
9         blackspotted    Spotted
10                        Unknown
Instead, I get the data below (second table), where Unknown and Other are switched.
Horse ID  Coatcolor
1         Black
2         Spotted
3         Chestnut
4         Grey
5         Unknown
6         Other
7         Unknown
8         Grey
9         Spotted
10        Other
I used the following code:
mydata <- data %>%
  mutate(Coatcolor = case_when(
    str_detect(Coatcolor, "spotted") ~ "Spotted",
    str_detect(Coatcolor, "grey") ~ "Grey",
    str_detect(Coatcolor, "chestnut") ~ "Chestnut",
    str_detect(Coatcolor, "black") ~ "Black",
    str_detect(Coatcolor, "") ~ "Unknown",
    TRUE ~ Coatcolor
  ))
mydata$Coatcolor[!mydata$Coatcolor %in% c("Spotted", "Grey", "Chestnut", "Black", "Unknown")] <- "Other"
So what am I doing wrong/missing? Thanks in advance.
You can use the recode function of the dplyr package. Assuming the missing spots are NAs, you can set all NAs to "Unknown" with replace_na from the tidyr package. How you detect them depends on the format of your missing data.
library(dplyr)
library(tidyr)

mydata <- tibble(
  id = 1:10,
  coatcol = letters[1:10]
)
mydata$coatcol[5] <- NA
mydata$coatcol[4] <- ""
mydata <- mydata %>%
  mutate(coatcol = na_if(coatcol, "")) %>% # convert empty strings to NA
  mutate(Coatcolor_old = replace_na(coatcol, "Unknown")) %>% # set all NA to "Unknown"
  mutate(Coatcolor_new = recode(
    Coatcolor_old,
    'spotted' = 'Spotted',
    'bayspotted' = 'Spotted',
    'old_name' = 'new_name',
    'a' = 'A' # etc.
  ))
mydata
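For the case_when approach in the question, one likely culprit is the str_detect(Coatcolor, "") line: an empty pattern does not test whether the string is empty, so it cannot single out the unregistered horses. A hedged sketch of one way around this, assuming the unregistered colors are stored as NA or as empty strings (the sample data below is reconstructed from the question's table, and the HorseID column name is illustrative only):

```r
library(dplyr)
library(stringr)

# Illustrative data: rows 6 and 10 stand in for unregistered coat colors
data <- tibble(
  HorseID = 1:10,
  Coatcolor = c("black", "bayspotted", "chestnut", "grey", "cream dun",
                NA, "blue roan", "chestnutgrey", "blackspotted", "")
)

mydata <- data %>%
  mutate(Coatcolor = case_when(
    is.na(Coatcolor) | Coatcolor == "" ~ "Unknown",  # test for missing values first
    str_detect(Coatcolor, "spotted")   ~ "Spotted",
    str_detect(Coatcolor, "grey")      ~ "Grey",
    str_detect(Coatcolor, "chestnut")  ~ "Chestnut",
    str_detect(Coatcolor, "black")     ~ "Black",
    TRUE                               ~ "Other"     # everything that matched nothing
  ))
mydata
```

Because case_when uses the first condition that is TRUE, "chestnutgrey" becomes Grey and "blackspotted" becomes Spotted, as in the desired table, and no separate %in% clean-up step is needed.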

How to use tidyr in RStudio to separate a column with numbers and characters?

So I am using tidyr in Rstudio and I am trying to separate the data in the 'player' column (attached below) into 4 individual columns: 'number', 'name','position' and 'school'. I tried using the separate() function, but can't get the number to separate and can't use a str_sub because some numbers are double digits. Does anyone know how to separate this column to the appropriate 4 columns?
A method using a series of separate calls.
# Example data
df <- data.frame(
  player = c('11Vita VeaDT | Washington',
             '16Clelin FerrellEDGE | Clemson',
             "17K'Lavon ChaissonEdge | LSU",
             '15Cody FordOT | Oklahoma',
             '20Derrius GuiceRB',
             '1Joe BurrowQB | LSU'))
The steps are:
1. separate school using |
2. separate number using the distinction of numbers and letters
3. separate position using capital and lowercase, but starting at the end
4. cleanup, trim off white space, or extra spaces around the text
library(dplyr)
library(tidyr)
df %>%
  separate(player, into = c('player', 'school'), sep = '\\|') %>%
  separate(player, into = c('number', 'player'), sep = '(?<=[0-9])(?=[A-Za-z])') %>%
  separate(player, into = c('name', 'position'), sep = '(?<=[a-z])(?=[A-Z])') %>%
  mutate_if(is.character, trimws)
# Results
number name position school
1 11 Vita Vea DT Washington
2 16 Clelin Ferrell EDGE Clemson
3 17 K'Lavon Chaisson Edge LSU
4 15 Cody Ford OT Oklahoma
5 20 Derrius Guice RB <NA>
6 1 Joe Burrow QB LSU
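The same four steps can also be sketched in base R with sub, without tidyr. This is only an illustration on the example strings above; it assumes every entry starts with digits and ends its name part with a lowercase letter directly followed by the position, which would not hold for every real roster.

```r
player <- c("11Vita VeaDT | Washington",
            "16Clelin FerrellEDGE | Clemson",
            "17K'Lavon ChaissonEdge | LSU",
            "15Cody FordOT | Oklahoma",
            "20Derrius GuiceRB",
            "1Joe BurrowQB | LSU")

# 1. school: everything after "|", or NA when there is no "|"
has_school <- grepl("|", player, fixed = TRUE)
school <- ifelse(has_school, trimws(sub("^.*\\|", "", player)), NA_character_)
rest <- trimws(sub("\\|.*$", "", player))

# 2. number: the leading run of digits
number <- as.integer(sub("^([0-9]+).*$", "\\1", rest))
rest <- sub("^[0-9]+", "", rest)

# 3. position: the trailing letters that begin at the last
#    lowercase-to-uppercase boundary
position <- sub("^.*[a-z]([A-Z][A-Za-z]*)$", "\\1", rest, perl = TRUE)

# 4. name: whatever precedes the position, with white space trimmed
name <- trimws(substr(rest, 1, nchar(rest) - nchar(position)))

result <- data.frame(number, name, position, school)
result
```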

R - Finding identical rows or rows that only differ by x columns

I'm trying to use R on a large CSV file that for this example can be said to represent a list of people and forms of transportation. If a person owns that mode of transportation, this is represented by an X in the corresponding cell. Example data of this is as per below:
Type,Peter,Paul,Mary,Don,Stan,Mike
Scooter,X,X,X,,X,
Car,,,,X,,X
Bike,,,,,,
Skateboard,X,X,X,X,X,X
Boat,,X,,,,
The below image makes it easier to see what it represents:
What I'm after is to learn which persons have identical modes of transportation, or, ideally, where the modes of transportation differ by no more than one.
The format is a bit weird but, assuming the csv file is named example.csv, I can read it into a data frame and transpose it as per below (it should be fairly obvious that I'm a complete R noob)
ex <- read.csv('example.csv')
ext <- as.data.frame(t(ex))
This post explained how to find duplicates and it seems to work
duplicated(ext) | duplicated(ext[nrow(ext):1, ])[nrow(ext):1]
which(duplicated(ext) | duplicated(ext[nrow(ext):1, ])[nrow(ext):1])
This returns the following indexes:
1 2 4 5 6 7
That does indeed correspond with what I consider to be duplicate rows. That is, Peter has the same modes of transportation as Mary and Stan (indexes 2, 4 and 6); Don and Mike likewise share the same modes of transportation, indexes 5 and 7.
Again, that seems to work OK, but if the number of modes of transportation and people is significant, it becomes really difficult to know not just which rows are duplicates, but which indexes actually matched: in this case, that indexes 2, 4 and 6 are identical and that 5 and 7 are identical.
Is there an easy way of getting that information so that one doesn't have to try and find the matches manually?
Also, given all of the above, is it possible to alter the code in any way so that it would consider rows to match if there was only a difference in X positions (for example a difference of one is acceptable so as long as the persons in the above example have no more than one mode of transportation that is different, it's still considered a match)?
Happy to elaborate further and very grateful for any and all help.
library(dplyr)
library(tidyr)
ex <- read.csv(text = "Type,Peter,Paul,Mary,Don,Stan,Mike
Scooter,X,X,X,,X,
Car,,,,X,,X
Bike,,,,,,
Skateboard,X,X,X,X,X,X
Boat,,X,,,,")
ext <- tidyr::pivot_longer(ex, -Type, names_to = "person")
# head(ext)
ext <- ext %>%
  group_by(person) %>%
  filter(value == "X") %>%
  summarise(Modalities = n(), Which = paste(Type, collapse = ", ")) %>%
  arrange(desc(Modalities), Which) %>%
  mutate(IdenticalGrp = rle(Which)$lengths %>% {rep(seq(length(.)), .)})
ext
#> # A tibble: 6 x 4
#> person Modalities Which IdenticalGrp
#> <chr> <int> <chr> <int>
#> 1 Paul 3 Scooter, Skateboard, Boat 1
#> 2 Don 2 Car, Skateboard 2
#> 3 Mike 2 Car, Skateboard 2
#> 4 Mary 2 Scooter, Skateboard 3
#> 5 Peter 2 Scooter, Skateboard 3
#> 6 Stan 2 Scooter, Skateboard 3
To get a membership list for any particular IdenticalGrp, you can just pull it like this.
ext %>% filter(IdenticalGrp == 3) %>% pull(person)
#> [1] "Mary" "Peter" "Stan"
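For the second part of the question (treating people as a match when they differ in at most one mode of transportation), one possible sketch is to turn the X marks into a 0/1 matrix and count the differences for every pair of people. The names owns, max_diff and pairs are made up for this illustration, and colClasses = "character" is added on the assumption that an all-blank column should be read as empty strings rather than NA.

```r
ex <- read.csv(text = "Type,Peter,Paul,Mary,Don,Stan,Mike
Scooter,X,X,X,,X,
Car,,,,X,,X
Bike,,,,,,
Skateboard,X,X,X,X,X,X
Boat,,X,,,,", colClasses = "character")

# One row per person, one 0/1 column per mode of transportation
owns <- 1 * t(ex[, -1] == "X")
colnames(owns) <- ex$Type

# On 0/1 data the Manhattan distance counts the modes in which two people differ
d <- as.matrix(dist(owns, method = "manhattan"))

# Keep each pair once (upper triangle) if it differs in at most max_diff modes
max_diff <- 1
idx <- which(d <= max_diff & upper.tri(d), arr.ind = TRUE)
pairs <- data.frame(
  person_a = rownames(d)[idx[, "row"]],
  person_b = colnames(d)[idx[, "col"]],
  differences = d[idx]
)
pairs
```

Setting max_diff to 0 gives only the identical pairs. Note that the full distance matrix grows with the square of the number of people, so this is meant for moderately sized data.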

How to combine multiple variable data to a single variable data?

After making my data frame and selecting the variables I want to look at, I face a dilemma. The Excel sheet that acts as my data source was used by different people recording the same type of data.
Mock Neg Neg1PCR Neg2PCR NegPBS red Red RedWine water Water white White
1 9 1 1 1 2 18 4 4 4 2 26
As you can see, because the data is written differently, the major groups (Redwine, Whitewine and Water) have been split into subgroups. How do I combine the subgroups into a combined group, e.g. red+Red+RedWine -> Total wine? I use the phyloseq package for this kind of dataset.
names <- c("red", "white", "water")
df2 <- setNames(data.frame(matrix(ncol = length(names), nrow = nrow(df))), names)
for (col in names) {
  # drop = FALSE keeps a data.frame even if only one column matches
  df2[, col] <- rowSums(df[, grep(col, tolower(names(df))), drop = FALSE])
}
Here,
grep(col, tolower(names(df)))
looks for all the column names that contain a string such as "red", ignoring case. You then just sum those columns into a new data.frame, df2, defined with the right dimensions.
I would just create a new data.frame, easiest to do with dplyr but also doable with base R:
with dplyr
newFrame <- oldFrame %>%
  mutate(Mock = Mock,
         Neg = Neg + Neg1PCR + Neg2PCR + NegPBS,
         Red = red + Red + RedWine,
         Water = water + Water,
         White = white + White)
with base R (not complete but you get the point)
newFrame <- data.frame(Red = oldFrame$Red + oldFrame$red + oldFrame$RedWine...)
One can use dplyr::starts_with and dplyr::select to combine columns. The ignore.case argument of dplyr::starts_with is TRUE by default, which helps with the data.frame the OP has posted.
library(dplyr)
names <- c("red", "white", "water")
cbind(df[1], t(mapply(function(x)rowSums(select(df, starts_with(x))), names)))
# Mock red white water
# 1 1 24 28 8
Data:
df <- read.table(text =
"Mock Neg Neg1PCR Neg2PCR NegPBS red Red RedWine water Water white White
1 9 1 1 1 2 18 4 4 4 2 26",
header = TRUE, stringsAsFactors = FALSE)
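Putting the ideas from the answers above together, a base R sketch could map each new group name to a case-insensitive prefix and sum the matching columns. The groups vector and the names totals and combined are hypothetical and would need adapting to the real column names.

```r
df <- read.table(text =
"Mock Neg Neg1PCR Neg2PCR NegPBS red Red RedWine water Water white White
1 9 1 1 1 2 18 4 4 4 2 26",
header = TRUE, stringsAsFactors = FALSE)

# Hypothetical mapping: new group name -> prefix matched without regard to case
groups <- c(Neg = "^neg", Red = "^red", Water = "^water", White = "^white")

totals <- as.data.frame(lapply(groups, function(prefix) {
  cols <- grepl(prefix, names(df), ignore.case = TRUE)
  rowSums(df[, cols, drop = FALSE])  # drop = FALSE keeps single-column matches working
}))

# Mock has no subgroups, so it is carried over unchanged
combined <- cbind(df["Mock"], totals)
combined
```

Anchoring the patterns with ^ restricts each match to the start of the column name; an unanchored pattern such as "red" would also pick up any column that merely contains those letters.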

dynamic dplyr column name calculation

I have the following code.
colName is passed in. I've been trying to get it to be evaluated as the value of colName but have not had much success. I've tried "eval", "setNames", etc. Using the "_", still has not provided success.
Essentially, if my colName = "MyCol", I want the dplyr chain to execute as if the last line read:
mutate(MyCol = ifelse(is.na(MyCol), "BLANK", MyCol))
makeSummaryTable <- function(colName, originalData){
  result <- originalData %>%
    group_by_(colName) %>%
    summarise(numObs = n()) %>%
    ungroup() %>%
    arrange(desc(numObs)) %>%
    rowwise() %>%
    mutate_(colName = ifelse(is.na(colName), "BLANK", colName))
  return(result)
}
Here's how to do it with dplyr 0.6.0 using the new tidyeval approach to non-standard evaluation. (I'm not sure if it's even possible to do with standard evaluation, at least in a straightforward manner):
library(dplyr)
makeSummaryTable <- function(colName, originalData){
  colName <- enquo(colName)
  originalData %>%
    count(!!colName) %>%
    arrange(desc(n)) %>%
    mutate(
      old_col = !!colName,
      !!quo_name(colName) := if_else(is.na(!!colName), "BLANK", !!colName)
    )
}
makeSummaryTable(hair_color, starwars)
#> # A tibble: 13 x 3
#> hair_color n old_col
#> <chr> <int> <chr>
#> 1 none 37 none
#> 2 brown 18 brown
#> 3 black 13 black
#> 4 BLANK 5 <NA>
#> 5 white 4 white
#> 6 blond 3 blond
#> 7 auburn 1 auburn
#> 8 auburn, grey 1 auburn, grey
#> 9 auburn, white 1 auburn, white
#> 10 blonde 1 blonde
#> 11 brown, grey 1 brown, grey
#> 12 grey 1 grey
#> 13 unknown 1 unknown
enquo turns the unquoted column name into some fancy object called a quosure. !! then unquotes the quosure so that it can get evaluated as if it had been typed directly in the function. For a more in-depth and accurate explanation, see Hadley's "Programming with dplyr".
EDIT: I realized that the original question was to name the new column with the user-supplied value of colName and not just colName, so I updated my answer. To accomplish that, the quosure needs to be turned into a string (or label) using quo_name. Then, it can be "unquoted" using !! just as a regular quosure would be. The only caveat is that since R can't make heads or tails of the expression mutate(!!foo = bar), tidyeval introduces the new definition operator := (which might be familiar to users from data.table, where it has a somewhat different use). Unlike the traditional assignment operator =, the := operator allows unquoting on both the right-hand and left-hand side.
(updated the answer to use a dataframe that has NA in one of its rows, to illustrate that the last mutate works. I also used count instead of group by + summarize, and I dropped the unnecessary rowwise.)
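In more recent versions of dplyr the same idea can be written with the embrace operator {{ }}, which combines enquo and !! in one step. The sketch below assumes dplyr 1.0 or later and uses a small made-up tibble (toy) instead of starwars; it keeps the numObs column name from the question.

```r
library(dplyr)

makeSummaryTable <- function(colName, originalData) {
  originalData %>%
    count({{ colName }}, name = "numObs") %>%
    arrange(desc(numObs)) %>%
    # "{{ colName }}" on the left of := names the result after the supplied column
    mutate("{{ colName }}" := if_else(is.na({{ colName }}), "BLANK", {{ colName }}))
}

# Made-up example data with missing values in the column of interest
toy <- tibble(MyCol = c("a", "a", NA, "b", "a", NA))
res <- makeSummaryTable(MyCol, toy)
res
```

As in the answer above, the column is passed unquoted (MyCol, not "MyCol"); passing a string would need a different approach, such as the .data pronoun.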
