How to rename observations with semi-consistent format? - r

I'm working with the following data frame:
Team Direction Side
Joe HB-L L
Eric HB-R R
Tim FB-L R
Mike HB L
I would like to eliminate the "HB" or "FB" preceding the "L" or "R" in the 'Direction' column. I would also like to eliminate the observations for which there is no "L" or "R" in the 'Direction' column. I would like it to look like this:
Team Direction Side
Joe L L
Eric R R
Tim L R
Then, I want to add a column that indicates if the 'Direction' and 'Side' columns are the same. If yes, I would like it to read 'NEAR', if not I would like it to read "FAR."
Team Direction Side Relation
Joe L L NEAR
Eric R R NEAR
Tim L R FAR

We can first filter the rows that have "-" in them, remove everything up to and including the "-", and then use if_else to assign 'NEAR' or 'FAR' to Relation.
library(dplyr)
library(stringr)
df %>%
filter(str_detect(Direction, '-')) %>%
mutate(Direction = str_remove(Direction, '.*-'),
Relation = if_else(Direction == Side, 'NEAR', 'FAR'))
# Team Direction Side Relation
#1 Joe L L NEAR
#2 Eric R R NEAR
#3 Tim L R FAR

We can do this in base R: subset the rows based on the presence of - in 'Direction', then use sub to remove the substring up to and including the -
df1 <- subset(df1, grepl('-', Direction))
df1$Direction <- sub(".*-", "", df1$Direction)
-output
df1
# Team Direction Side
#1 Joe L L
#2 Eric R R
#3 Tim L R
Then, we can use == to create a logical condition and use it to index the values 'FAR' and 'NEAR'
df1$Relation <- with(df1, c('FAR', 'NEAR')[(Direction == Side) + 1])
-output
df1
# Team Direction Side Relation
#1 Joe L L NEAR
#2 Eric R R NEAR
#3 Tim L R FAR
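The indexing step above works because adding 1 to a logical coerces it to 0/1, so FALSE selects the first element and TRUE the second. A minimal sketch of the intermediate values, using standalone vectors (direction and side are illustrative names, not the question's data frame):

```r
# Illustrative vectors mirroring the cleaned 'Direction' and 'Side' columns
direction <- c("L", "R", "L")
side <- c("L", "R", "R")

same <- direction == side           # TRUE TRUE FALSE
idx <- same + 1                     # 2 2 1 (FALSE -> 1, TRUE -> 2)
relation <- c("FAR", "NEAR")[idx]   # pick by position
relation
# [1] "NEAR" "NEAR" "FAR"
```

If either value is NA, the comparison is NA and so is the looked-up label, which may or may not be what you want.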
Or with tidyverse
library(dplyr)
library(stringr)
df1 %>%
filter(grepl('-', Direction)) %>%
mutate(Direction = str_replace(Direction, '.*-', ''),
Relation = case_when(Direction == Side ~ 'NEAR', TRUE ~ 'FAR'))
# Team Direction Side Relation
#1 Joe L L NEAR
#2 Eric R R NEAR
#3 Tim L R FAR
Or using data.table
library(data.table)
setDT(df1)[grepl('-', Direction)][, Direction := trimws(Direction,
whitespace = '.*-')][, Relation := fifelse(Direction == Side, 'NEAR', 'FAR')][]
# Team Direction Side Relation
#1: Joe L L NEAR
#2: Eric R R NEAR
#3: Tim L R FAR
data
df1 <- structure(list(Team = c("Joe", "Eric", "Tim", "Mike"), Direction = c("HB-L",
"HB-R", "FB-L", "HB"), Side = c("L", "R", "R", "L")),
class = "data.frame", row.names = c(NA,
-4L))

Related

Change cell contents if it contains a certain letter

I have a column that lists the race/ethnicity of individuals. I am trying to make it so that if the cell contains an 'H' then I only want H. Similarly, if the cell contains an 'N' then I want an N. Finally, if the cell has multiple races, not including H or N, then I want it to be M. Below is how it is listed currently and the desired output.
Current output
People | Race/Ethnicity
PersonA| HAB
PersonB| NHB
PersonC| AB
PersonD| ABW
PersonE| A
Desired output
PersonA| H
PersonB| N
PersonC| M
PersonD| M
PersonE| A
You can try the following dplyr approach, which combines grepl with dplyr::case_when: it first searches for N values; then, among those without an N, searches for H values; then, among those with neither an H nor an N, assigns M to those with more than one race and the original letter to those with only one race (assuming each race is represented by a single character).
A base R approach is below as well: no need for dependencies, but less elegant.
Data
df <- read.table(text = "person ethnicity
PersonA HAB
PersonB NHB
PersonC AB
PersonD ABW
PersonE A", header = TRUE)
dplyr (note order matters given your priority)
library(dplyr)
df %>% mutate(eth2 = case_when(
grepl("N", ethnicity) ~ "N",
grepl("H", ethnicity) ~ "H",
!grepl("H|N", ethnicity) & nchar(ethnicity) > 1 ~ "M",
TRUE ~ ethnicity
))
You could also do it "manually" in base r by indexing (note order matters given your priority):
df[grepl("H", df$ethnicity), "eth2"] <- "H"
df[grepl("N", df$ethnicity), "eth2"] <- "N"
df[!grepl("H|N", df$ethnicity) & nchar(df$ethnicity) > 1, "eth2"] <- "M"
df[nchar(df$ethnicity) %in% 1, "eth2"] <- df$ethnicity[nchar(df$ethnicity) %in% 1]
In both cases the output is:
# person ethnicity eth2
# 1 PersonA HAB H
# 2 PersonB NHB N
# 3 PersonC AB M
# 4 PersonD ABW M
# 5 PersonE A A
Note this is based on your comment about assigning superiority (that N anywhere supersedes those with both N and H, etc)
We could use str_extract. When the number of characters in the column is greater than 1, extract 'N' and 'H' separately and coalesce the extracted elements with 'M' (if neither matches we get 'M'; otherwise the result follows the order in which the inputs are passed to coalesce). In the other case, i.e. when the number of characters is 1, return the column value. Thus, 'N' supersedes 'H' no matter its position in the string.
library(dplyr)
library(stringr)
df1 %>%
mutate(output = case_when(nchar(`Race/Ethnicity`) > 1
~ coalesce(str_extract(`Race/Ethnicity`, 'N'),
str_extract(`Race/Ethnicity`, 'H'), "M"),
TRUE ~ `Race/Ethnicity`))
-output
People Race/Ethnicity output
1 PersonA HAB H
2 PersonB NHB N
3 PersonC AB M
4 PersonD ABW M
5 PersonE A A
data
df1 <- structure(list(People = c("PersonA", "PersonB", "PersonC", "PersonD",
"PersonE"), `Race/Ethnicity` = c("HAB", "NHB", "AB", "ABW", "A"
)), class = "data.frame", row.names = c(NA, -5L))

Delete column and next column based on text

I have data like this:
and I want to delete the column which contains "rico" and also delete all the following columns. I am looking to get this:
This is what I did, but it doesn't work:
mydata = data.frame(
X1 = c("john", "max", "jay", "douglas"),
X2 = c("alexia", "miguel", "vince", "gary"),
X3 = c("peter", "rico", "joe", "jenny"),
X4 = c("marc", "kelly", "max", "jones")
)
mydata[,grepl("rico", names(mydata))]
Some help would be appreciated
You can subset mydata with a range ending one before where grepl hits rico.
mydata[1:(grep("rico", mydata)-1)]
#mydata[1:(grep("rico", mydata)[1]-1)] #Alternative when there is more than one hit
# X1 X2
#1 john alexia
#2 max miguel
#3 jay vince
#4 douglas gary
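Calling grep on the data frame relies on each column being coerced to a single string, and 1:(n-1) counts backwards when the hit is in the first column (1:0 is c(1, 0), which still selects column 1). A more explicit sketch, under the assumption that an exact cell match on "rico" is what is wanted, checks each column and uses seq_len instead (first_hit and result are illustrative names):

```r
mydata <- data.frame(
  X1 = c("john", "max", "jay", "douglas"),
  X2 = c("alexia", "miguel", "vince", "gary"),
  X3 = c("peter", "rico", "joe", "jenny"),
  X4 = c("marc", "kelly", "max", "jones")
)

# Position of the first column holding an exact "rico" cell (NA if none)
has_rico <- vapply(mydata, function(col) any(col == "rico", na.rm = TRUE), logical(1))
first_hit <- unname(which(has_rico)[1])

# Keep everything when there is no hit; otherwise keep the columns before it
result <- if (is.na(first_hit)) mydata else mydata[seq_len(first_hit - 1)]
names(result)
# [1] "X1" "X2"
```

With seq_len, a hit in the first column gives a data frame with zero columns rather than accidentally keeping X1.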
You can use colSums -
mydata[cumsum(colSums(mydata == 'rico') > 0) == 0]
# X1 X2
#1 john alexia
#2 max miguel
#3 jay vince
#4 douglas gary
Using colSums we count the number of times 'rico' is present in each column and create a logical vector by comparing it with > 0; with cumsum we then select all the columns before the first occurrence of the word.
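To make the colSums/cumsum explanation concrete, here is the same one-liner split into its intermediate steps (hits, seen and keep are illustrative names introduced for this sketch):

```r
mydata <- data.frame(
  X1 = c("john", "max", "jay", "douglas"),
  X2 = c("alexia", "miguel", "vince", "gary"),
  X3 = c("peter", "rico", "joe", "jenny"),
  X4 = c("marc", "kelly", "max", "jones")
)

hits <- colSums(mydata == "rico")  # X1 = 0, X2 = 0, X3 = 1, X4 = 0
seen <- cumsum(hits > 0)           # 0 0 1 1: stays >= 1 from the first hit onward
keep <- seen == 0                  # TRUE TRUE FALSE FALSE
result <- mydata[keep]
names(result)
# [1] "X1" "X2"
```

Because cumsum never decreases, every column from the first match onward is dropped, even if later columns do not contain the word.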

Replacing integers in a dataframe column that's a list of integer vectors (not just single integers) with character strings in R

I have a dataframe with a column that's really a list of integer vectors (not just single integers).
# make example dataframe
starting_dataframe <-
data.frame(first_names = c("Megan",
"Abby",
"Alyssa",
"Alex",
"Heather"))
starting_dataframe$player_indices <-
list(as.integer(1),
as.integer(c(2, 5)),
as.integer(3),
as.integer(4),
as.integer(c(6, 7)))
I want to replace the integers with character strings according to a second concordance dataframe.
# make concordance dataframe
example_concord <-
data.frame(last_names = c("Rapinoe",
"Wambach",
"Naeher",
"Morgan",
"Dahlkemper",
"Mitts",
"O'Reilly"),
player_ids = as.integer(c(1,2,3,4,5,6,7)))
The desired result would look like this:
# make dataframe of desired result
desired_result <-
data.frame(first_names = c("Megan",
"Abby",
"Alyssa",
"Alex",
"Heather"))
desired_result$player_indices <-
list(c("Rapinoe"),
c("Wambach", "Dahlkemper"),
c("Naeher"),
c("Morgan"),
c("Mitts", "O'Reilly"))
I can't for the life of me figure out how to do it and failed to find a similar case here on stackoverflow. How do I do it? I wouldn't mind a dplyr-specific solution in particular.
I suggest creating a "lookup dictionary" of sorts, and lapply across each of the ids:
example_concord_idx <- setNames(as.character(example_concord$last_names),
example_concord$player_ids)
example_concord_idx
# 1 2 3 4 5 6
# "Rapinoe" "Wambach" "Naeher" "Morgan" "Dahlkemper" "Mitts"
# 7
# "O'Reilly"
starting_dataframe$result <-
lapply(starting_dataframe$player_indices,
function(a) example_concord_idx[a])
starting_dataframe
# first_names player_indices result
# 1 Megan 1 Rapinoe
# 2 Abby 2, 5 Wambach, Dahlkemper
# 3 Alyssa 3 Naeher
# 4 Alex 4 Morgan
# 5 Heather 6, 7 Mitts, O'Reilly
(Code golf?)
Map(`[`, list(example_concord_idx), starting_dataframe$player_indices)
For tidyverse enthusiasts, I adapted the second half of the accepted answer by r2evans to use map() and %>%:
require(tidyverse)
starting_dataframe <-
starting_dataframe %>%
mutate(
result = map(.x = player_indices, .f = function(a) example_concord_idx[a])
)
Definitely won't win code golf, though!
Another way is to unlist the list-column, and relist it after modifying its contents:
df1$player_indices <- relist(df2$last_names[unlist(df1$player_indices)], df1$player_indices)
df1
#> first_names player_indices
#> 1 Megan Rapinoe
#> 2 Abby Wambach, Dahlkemper
#> 3 Alyssa Naeher
#> 4 Alex Morgan
#> 5 Heather Mitts, O'Reilly
Data
## initial data.frame w/ list-column
df1 <- data.frame(first_names = c("Megan", "Abby", "Alyssa", "Alex", "Heather"), stringsAsFactors = FALSE)
df1$player_indices <- list(1, c(2,5), 3, 4, c(6,7))
## lookup data.frame
df2 <- data.frame(last_names = c("Rapinoe", "Wambach", "Naeher", "Morgan", "Dahlkemper",
"Mitts", "O'Reilly"), stringsAsFactors = FALSE)
NB: I set stringsAsFactors = FALSE to create character columns in the data.frames, but it works just as well with factor columns instead.

Replace values into previous row with a condition

I want to find the rows where the ID column doesn't start with 00 and append that ID value to the end of the Description column in the previous row.
Then I want to move the remaining values of that row into the previous row, starting at the Name column. How can I do that with R?
Here is source of dummy data: https://docs.google.com/spreadsheets/d/1SbmaM8hXck-z5nsNfDMbhwijvAGPkPPBgQ_eY4JAMC8/edit?usp=sharing
ID       Year  Description  Name  User       Factor_1  Factor_2  Factor_3
0011     2016  blue colour  AA    James      Xfac      NA        NA
is nice  XXX   XLM          Yfac  different  Yfac      NA        NA
0024     2017  red colour   DD    Mark       Zfac      NA        NA
is good  YYY   STM          Lfac  unique     Zfac      NA        NA
What I want to have:
ID    Year  Description          Name  User  Factor_1  Factor_2   Factor_3
0011  2016  blue colour is nice  XXX   XLM   Yfac      different  Yfac
0024  2017  red colour is good   YYY   STM   Lfac      unique     Zfac
There are two parts to this: pasting the descriptions together, and moving the other variables across as well, since you want "XXX" and "YYY" to end up in your "Name" column.
Also, in Vivek's answer all wrong lines are pasted with ALL "right" lines, which works in your example, but not if you have a few right lines in a row followed by a wrong one.
Working with booleans (TRUE/FALSE) sometimes works fine, but in this case I think you want to use an integer index, as that makes it easier to refer to "the previous line". That gives the following code:
rmlines <- which(!substr(df$ID,1,2)=="00")
df$Description[rmlines-1] <- paste(df$Description[rmlines-1], df[rmlines,1], sep=" ")
df[rmlines-1, 4:8] <- df[rmlines, 2:6]
df <- df[-rmlines,]
But there's one more problem to consider: what classes are your columns?
When I tried it out, I treated everything as a character, which means you can move columns around fine. In your data, some may be factors or something else, so you might want to change the classes. I think it's easiest to first change it all to character, and then change it (back) to the final class you want your columns to be.
# To change everything to character:
df <- as.data.frame(lapply(df, as.character), stringsAsFactors = FALSE)
# And to assign the right classes, you need to decide case-by-case:
df$Year <- as.integer(df$Year)
df$Factor_1 <- as.factor(df$Factor_1) # Optionally provide levels
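Since the source data lives in an external spreadsheet, here is a self-contained sketch of the index-based steps from this answer. The data frame is reconstructed by hand from the table in the question, which is an assumption about how the real data is laid out; all columns are kept as character. The length check is an addition: with no continuation rows, df[-rmlines, ] would otherwise drop every row, because negating an empty index selects nothing.

```r
# Data reconstructed from the question's table (assumed layout, all character)
df <- data.frame(
  ID = c("0011", "is nice", "0024", "is good"),
  Year = c("2016", "XXX", "2017", "YYY"),
  Description = c("blue colour", "XLM", "red colour", "STM"),
  Name = c("AA", "Yfac", "DD", "Lfac"),
  User = c("James", "different", "Mark", "unique"),
  Factor_1 = c("Xfac", "Yfac", "Zfac", "Zfac"),
  Factor_2 = NA_character_,
  Factor_3 = NA_character_,
  stringsAsFactors = FALSE
)

# Integer positions of the continuation rows (ID not starting with "00")
rmlines <- which(substr(df$ID, 1, 2) != "00")

if (length(rmlines) > 0) {
  # Append the stray ID text to the previous row's Description
  df$Description[rmlines - 1] <- paste(df$Description[rmlines - 1], df$ID[rmlines])
  # Shift columns 2:6 of the continuation row into columns 4:8 of the previous row
  df[rmlines - 1, 4:8] <- df[rmlines, 2:6]
  # Drop the continuation rows
  df <- df[-rmlines, ]
}
df
```

This sketch assumes the first row is never a continuation row; if it could be, rmlines - 1 would contain 0 and that case would need separate handling.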
Here's a solution with dplyr:
library(dplyr)
df %>%
bind_cols(df %>% rename_all(function(x) paste0(x, "_dummy"))) %>%
mutate(
Description = ifelse(substr(lead(ID), 1, 2) != "00",
paste(Description, lead(ID)), Description),
Name = lead(Year_dummy),
User = lead(Description_dummy),
Factor_1 = lead(Name_dummy),
Factor_2 = lead(User_dummy),
Factor_3 = lead(Factor_1_dummy)
) %>% select(-ends_with("dummy")) %>%
filter(substr(ID, 1, 2) == "00")
Output:
ID Year Description Name User Factor_1 Factor_2 Factor_3
1 0011 2016 blue colour is nice XXX XLM Yfac different Yfac
2 0024 2017 red colour is good YYY STM Lfac unique Zfac
In case you're dealing with a large number of columns, a combination of dplyr and base R could do it:
library(dplyr)
df_combo <- cbind(df, df)
df$Description <- ifelse(substr(lead(df$ID), 1, 2) != "00",
paste(df$Description, lead(df$ID)), df$Description)
for (i in (ncol(df) + 4):ncol(df_combo)) {
df_combo[[i]] <- lead(df_combo[[i - ncol(df) - 2]])
}
df_combo <- subset(df_combo, substr(ID, 1, 2) == "00")
df_descr <- subset(df, substr(ID, 1, 2) == "00")
df_final <- df_combo[, (ncol(df) + 1):ncol(df_combo)]
df_final$Description <- df_descr$Description
rm(df_descr, df_combo)
Output:
ID Year Description Name User Factor_1 Factor_2 Factor_3
1: 0011 2016 blue colour is nice XXX XLM Yfac different Yfac
2: 0024 2017 red colour is good YYY STM Lfac unique Zfac
Use -
bools <- !substr(df$ID,1,2)=="00"
values <- df[bools,1]
df <- df[!bools,]
df$Description <- paste(df[substr(df$ID,1,2)=="00","Description"],values,sep=" ")
df
Output
ID Year Description Name User Factor_1 Factor_2
1 0011 2016 blue colour is nice AA James Xfac NA
3 0024 2017 red colour is good DD Mark Zfac NA
Factor_3
1 NA
3 NA

Replacing vector values in R based on a list (hash)

I have a dataframe, one column of which is names. In a later phase of analysis, I will need to merge with other data by this name column, and there are a few names which vary by source. I'd like to clean up my names using a hash (map) of names->cleaned names. I've found several references to using R lists as hashes (e.g., this question on SE), but I can't figure out how to extract values for keys in a vector only as they occur. So for example,
> players=data.frame(names=c("Joe", "John", "Bob"), scores=c(9.8, 9.9, 8.8))
> xref = c("Bob"="Robert", "Fred Jr." = "Fred")
> players$names
[1] Joe John Bob
Levels: Bob Joe John
Whereas players$names gives a vector of names from the original frame, I need the same vector, only with any values that occur in xref replaced with their equivalent (lookup) values; my desired result is the vector Joe John Robert.
The closest I've come is:
> players$names %in% names(xref)
[1] FALSE FALSE TRUE
Which correctly indicates that only "Bob" in players$names exists in the "keys" (names) of xref, but I can't figure out how to extract the value for that name and combine it with the other names in the vector that don't belong to xref as needed.
note: in case it's not completely clear, I'm pretty new to R, so if I'm approaching this in the wrong fashion, I'm happy to be corrected, but my core issue is essentially as stated: I need to clean up some incoming data within R by replacing some incoming values with known replacements and keeping all other values; further, the map of original->replacement should be stored as data (like xref), not as code.
Updated answer: ifelse
ifelse is an even more straightforward solution, in the case that xref is a named vector and not a list.
players <- data.frame(names=c("Joe", "John", "Bob"), scores=c(9.8, 9.9, 8.8), stringsAsFactors = FALSE)
xref <- c("Bob" = "Robert", "Fred Jr." = "Fred")
players$clean <- ifelse(is.na(xref[players$names]), players$names, xref[players$names])
players
Result
names scores clean
1 Joe 9.8 Joe
2 John 9.9 John
3 Bob 8.8 Robert
Previous answer: sapply
If xref is a list, then sapply function can be used to do conditional look-ups
players <- data.frame(names=c("Joe", "John", "Bob"), scores=c(9.8, 9.9, 8.8))
xref <- list("Bob" = "Robert", "Fred Jr." = "Fred")
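# as.character() avoids indexing the list by factor codes; [[ ]] returns the value itself rather than a one-element list
players$clean <- sapply(as.character(players$names), function(x) if (x %in% names(xref)) xref[[x]] else x, USE.NAMES = FALSE)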
players
Result
> players
names scores clean
1 Joe 9.8 Joe
2 John 9.9 John
3 Bob 8.8 Robert
You can replace the factor levels with the desired text. Here's an example which loops through xref and does the replacement:
for (n in names(xref)) {
levels(players$names)[levels(players$names) == n ] <- xref[n]
}
players
## names scores
## 1 Joe 9.8
## 2 John 9.9
## 3 Robert 8.8
Another example of replacing the factor levels.
allnames = levels(players$names)
levels(players$names)[ !is.na(xref[allnames]) ] = na.omit(xref[allnames])
players
# names scores
# 1 Joe 9.8
# 2 John 9.9
# 3 Robert 8.8
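Both level-replacement examples assume players$names is a factor. Since R 4.0.0, data.frame() defaults to stringsAsFactors = FALSE, so the column is character and levels() returns NULL. A minimal sketch that requests a factor explicitly and replaces only the levels found in xref (lv and hit are illustrative names):

```r
xref <- c("Bob" = "Robert", "Fred Jr." = "Fred")
players <- data.frame(names = c("Joe", "John", "Bob"),
                      scores = c(9.8, 9.9, 8.8),
                      stringsAsFactors = TRUE)  # make 'names' a factor explicitly

lv <- levels(players$names)            # "Bob" "Joe" "John"
hit <- lv %in% names(xref)             # TRUE FALSE FALSE
levels(players$names)[hit] <- xref[lv[hit]]

as.character(players$names)
# [1] "Joe"    "John"   "Robert"
```

Only the level labels change, so this stays cheap even when the column is long.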
If you get into really big data sets, you might take a look at merge function or the data.table package. Here is a data.table example of a join.
library(data.table)
players=data.table(names=c("Joe", "John", "Bob"), scores=c(9.8, 9.9, 8.8), key="names")
nms = data.table(names=names(xref),names2=xref, key="names")
out = nms[players]
out[is.na(names2),names2:=names]
out
# names names2 scores
# 1: Bob Robert 8.8
# 2: Joe Joe 9.8
# 3: John John 9.9
Here is a similar example with the merge function.
players=data.frame(names=c("Joe", "John", "Bob"), scores=c(9.8, 9.9, 8.8))
nms = data.frame(names=names(xref),names2=xref,row.names=NULL)
merge(nms,players,all.y=TRUE)
# names names2 scores
# 1 Bob Robert 8.8
# 2 Joe <NA> 9.8
# 3 John <NA> 9.9
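The merge result leaves NA in names2 for names that have no entry in xref. One way to finish the lookup, sketched here as an assumption about the desired end result (the cleaned name where one exists, otherwise the original), is to fill those gaps from the names column:

```r
xref <- c("Bob" = "Robert", "Fred Jr." = "Fred")
players <- data.frame(names = c("Joe", "John", "Bob"),
                      scores = c(9.8, 9.9, 8.8),
                      stringsAsFactors = FALSE)
nms <- data.frame(names = names(xref), names2 = unname(xref),
                  stringsAsFactors = FALSE)

# Keep every player (all.y = TRUE); unmatched players get NA in names2
out <- merge(nms, players, all.y = TRUE)

# Fall back to the original name where no replacement exists
out$names2 <- ifelse(is.na(out$names2), out$names, out$names2)
out
```

Keeping both columns as character (stringsAsFactors = FALSE) avoids ifelse returning factor codes instead of names.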
