How can I efficiently check variables for a particular value in R and flag rows containing it? - r

I want to create a variable that flags whether one or more of multiple variables has a particular value.
week Mon Tues Weds Thurs Fri Sat
1 jon jon jon jon mary mary
2 jane jane jane jane jane jane
3 mary mary mary mary mary jane
I want to create a binary variable that flags, for each week, whether Mon, Weds, or Sat of that week is "jon" or "mary". Is there a way to do this without writing a long ifelse statement that checks each variable individually?
week Mon Tues Weds Thurs Fri Sat flag
1 jon jon jon jon mary mary 1
2 jane jane jane jane jane jane 0
3 mary mary mary mary mary jane 1
I tried
df %>%
rowwise() %>%
mutate(flag = +any(c_across(Mon, Weds, Sat)
%in% ("jon", "mary")) %>%
ungroup()
but I get an error
Error: Problem with `mutate()` input `flag`.
x unused arguments (Mon, Weds, Sat)
i Input `flag` is `+...`.
i The error occurred in row 1.

df %>%
mutate(flag = colSums(apply(cbind(Mon, Weds, Sat), 1, `%in%`, c("jon", "mary"))) > 0)
# week Mon Tues Weds Thurs Fri Sat flag
# 1 1 jon jon jon jon mary mary TRUE
# 2 2 jane jane jane jane jane jane FALSE
# 3 3 mary mary mary mary mary jane TRUE
I think the problem with across is that it's trying to do something to each column, not a summary of sorts of all of them. Let's try purrr::pmap_int instead:
library(purrr)
# pmap_int() returns an integer vector rather than a list column;
# c(...) collects the three values of the current row
df %>%
mutate(flag = pmap_int(list(Mon, Weds, Sat),
~ +any(c(...) %in% c("jon", "mary"))))
# week Mon Tues Weds Thurs Fri Sat flag
# 1 1 jon jon jon jon mary mary 1
# 2 2 jane jane jane jane jane jane 0
# 3 3 mary mary mary mary mary jane 1
A third (using your request for c_across):
df %>%
rowwise() %>%
mutate(flag = +any(c_across(c(Mon, Weds, Sat)) %in% c("jon", "mary"))) %>%
ungroup()
# # A tibble: 3 x 8
# week Mon Tues Weds Thurs Fri Sat flag
# <int> <chr> <chr> <chr> <chr> <chr> <chr> <int>
# 1 1 jon jon jon jon mary mary 1
# 2 2 jane jane jane jane jane jane 0
# 3 3 mary mary mary mary mary jane 1

Instead of using rowwise or looping over the rows, it is more efficient to loop over the columns with map and combine the results with reduce:
library(purrr)
library(dplyr)
df %>%
mutate(flag = map(select(., Mon, Weds, Sat), `%in%`, c("jon", "mary")) %>%
reduce(`|`) %>% `+`)
# week Mon Tues Weds Thurs Fri Sat flag
#1 1 jon jon jon jon mary mary 1
#2 2 jane jane jane jane jane jane 0
#3 3 mary mary mary mary mary jane 1
A corresponding option in base R is lapply/Reduce
df$flag <- +(Reduce(`|`, lapply(df[c('Mon', 'Weds', 'Sat')],
`%in%`, c("jon", "mary"))))
data
df <- structure(list(week = 1:3, Mon = c("jon", "jane", "mary"), Tues = c("jon",
"jane", "mary"), Weds = c("jon", "jane", "mary"), Thurs = c("jon",
"jane", "mary"), Fri = c("mary", "jane", "mary"), Sat = c("mary",
"jane", "jane")), class = "data.frame", row.names = c(NA, -3L
))
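To see what the lapply/Reduce version is doing, here is a minimal base R sketch with the intermediate results spelled out. The names targets, hits and any_hit are only illustrative, and the data frame is a cut-down copy of the example data with just the three columns being checked.

```r
# Cut-down copy of the example data (only the columns that are checked)
df <- data.frame(week = 1:3,
                 Mon  = c("jon", "jane", "mary"),
                 Weds = c("jon", "jane", "mary"),
                 Sat  = c("mary", "jane", "jane"),
                 stringsAsFactors = FALSE)

targets <- c("jon", "mary")

# One logical vector per column: does each cell hold a target value?
hits <- lapply(df[c("Mon", "Weds", "Sat")], `%in%`, targets)

# Combine the columns element-wise with OR, giving one value per row
any_hit <- Reduce(`|`, hits)

# Convert TRUE/FALSE to 1/0
flag <- as.integer(any_hit)
flag
# [1] 1 0 1
```

Each step works on whole columns, so the number of vectorised calls grows with the number of columns checked rather than with the number of rows.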

Here is another base R option using rowSums + Reduce
df$flag <- +(rowSums(
Reduce(
`+`,
lapply(
c("jon", "mary"),
`==`,
df[c("Mon", "Weds", "Sat")]
)
)
) > 0)
such that
week Mon Tues Weds Thurs Fri Sat flag
1 1 jon jon jon jon mary mary 1
2 2 jane jane jane jane jane jane 0
3 3 mary mary mary mary mary jane 1

Related

Separate column into two: before and after a certain word

I have the following data set
> data
firm_name
1: Light Ltd John Smith
2: Bolt Night Ltd Mary Poppins
3: Bright Yellow Sun Ltd Harry Potter
---
I want to separate it into two columns depending on the position of the "Ltd". So, the data would look like:
> data
firm_name name
1: Light Ltd John Smith
2: Bolt Night Ltd Mary Poppins
3: Bright Yellow Sun Ltd Harry Potter
---
I tried with the stringr package but did not find any particular solution.
thanks in advance
You can use separate from tidyr with a lookbehind regular expression for this.
library(tidyr)
df %>%
separate(col = firm_name, into = c("firm_name", "name"), sep = "(?<=Ltd)")
#> firm_name name
#> 1 Light Ltd John Smith
#> 2 Bolt Night Ltd Mary Poppins
#> 3 Bright Yellow Sun Ltd Harry Potter
data
df <- data.frame(firm_name = c("Light Ltd John Smith",
"Bolt Night Ltd Mary Poppins",
"Bright Yellow Sun Ltd Harry Potter"))
We can use base R with read.csv
read.csv(text = sub("(Ltd)", "\\1,", df$names),
header = FALSE, col.names = c('firm_name', 'name'))
# firm_name name
#1 Light Ltd John Smith
#2 Bolt Night Ltd Mary Poppins
#3 Bright Yellow Sun Ltd Harry Potter
data
df <- structure(list(names = c("Light Ltd John Smith",
"Bolt Night Ltd Mary Poppins",
"Bright Yellow Sun Ltd Harry Potter")), row.names = c(NA, -3L
), class = "data.frame")
Are you after something like this?
library(dplyr) # provides tibble() and %>%
df <-
tibble(
names = c("Light Ltd John Smith",
"Bolt Night Ltd Mary Poppins",
"Bright Yellow Sun Ltd Harry Potter")
)
df %>%
tidyr::separate(names, c("half_1", "half_2"), sep = "Ltd")
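One thing to watch with sep = "Ltd": a plain separator is consumed by the split, so "Ltd" disappears from the first half, whereas a lookbehind keeps it. A small base R sketch of the difference, using strsplit (the variable names consumed and kept are just for illustration):

```r
x <- c("Light Ltd John Smith", "Bolt Night Ltd Mary Poppins")

# Plain pattern: "Ltd" and the following whitespace are removed by the split
consumed <- strsplit(x, "Ltd\\s+")

# Lookbehind: only the whitespace after "Ltd" is removed, "Ltd" itself stays
kept <- strsplit(x, "(?<=Ltd)\\s+", perl = TRUE)

consumed[[1]]
# [1] "Light "     "John Smith"
kept[[1]]
# [1] "Light Ltd"  "John Smith"
```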
Does this work:
> df %>% mutate(name = gsub('([A-Za-z].*Ltd) (.*)','\\2', firm_name), firm_name = gsub('([A-Za-z].*Ltd) (.*)','\\1', firm_name))
# A tibble: 3 x 2
firm_name name
<chr> <chr>
1 Light Ltd John Smith
2 Bolt Night Ltd Mary Poppins
3 Bright Yellow Sun Ltd Harry Potter
>
Data used:
> df
# A tibble: 3 x 1
firm_name
<chr>
1 Light Ltd John Smith
2 Bolt Night Ltd Mary Poppins
3 Bright Yellow Sun Ltd Harry Potter
>
Using tidyr::extract :
tidyr::extract(df, names, c('firm_name', 'name'), regex = '(.*Ltd)\\s(.*)')
# A tibble: 3 x 2
# firm_name name
# <chr> <chr>
#1 Light Ltd John Smith
#2 Bolt Night Ltd Mary Poppins
#3 Bright Yellow Sun Ltd Harry Potter
Or in base R :
df$name <- sub('.*Ltd\\s', '', df$names)
df$firm_name <- sub('(.*Ltd).*', '\\1', df$names)
df$names <- NULL
Another base R option
setNames(
data.frame(
do.call(
rbind,
strsplit(df$names, "(?<=Ltd)\\s+", perl = TRUE)
)
),
c("firm_name", "name")
)
giving
firm_name name
1 Light Ltd John Smith
2 Bolt Night Ltd Mary Poppins
3 Bright Yellow Sun Ltd Harry Potter

Melting dataframe in R

I have the following R dataframe :
foo <- data.frame("Department" = c('IT', 'IT', 'Sales'),
"Name.boy" = c('John', 'Mark', 'Louis'),
"Age.boy" = c(21,23,44),
"Name.girl" = c('Jane', 'Charlotte', 'Denise'),
"Age.girl" = c(16,25,32))
which looks like the following :
Department Name.boy Age.boy Name.girl Age.girl
IT John 21 Jane 16
IT Mark 23 Charlotte 25
Sales Louis 44 Denise 32
How do I 'melt' the dataframe, so that for a given Department, I have three columns : Name, Age, and Sex ?
Department Name Age Sex
IT John 21 Boy
IT Jane 16 Girl
IT Mark 23 Boy
IT Charlotte 25 Girl
Sales Louis 44 Boy
Sales Denise 32 Girl
We can use pivot_longer from tidyr
library(tidyr)
pivot_longer(foo, cols = -Department, names_to = c(".value", "Sex"),
names_sep="\\.")
# A tibble: 6 x 4
# Department Sex Name Age
# <chr> <chr> <chr> <dbl>
#1 IT boy John 21
#2 IT girl Jane 16
#3 IT boy Mark 23
#4 IT girl Charlotte 25
#5 Sales boy Louis 44
#6 Sales girl Denise 32
Using reshape:
reshape(foo, direction="long", varying=2:5, timevar="Sex")
Department Sex Name Age id
1.boy IT boy John 21 1
2.boy IT boy Mark 23 2
3.boy Sales boy Louis 44 3
1.girl IT girl Jane 16 1
2.girl IT girl Charlotte 25 2
3.girl Sales girl Denise 32 3
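The reshape output above is stacked by Sex rather than interleaved by row as in the question. A minimal sketch of one way to get the requested layout, assuming the default id column that reshape adds is fine to sort on and then drop:

```r
foo <- data.frame(Department = c("IT", "IT", "Sales"),
                  Name.boy = c("John", "Mark", "Louis"),
                  Age.boy = c(21, 23, 44),
                  Name.girl = c("Jane", "Charlotte", "Denise"),
                  Age.girl = c(16, 25, 32),
                  stringsAsFactors = FALSE)

# Column names are split at the "." into a variable part (Name, Age)
# and a time part (boy, girl), which goes into the Sex column
long <- reshape(foo, direction = "long", varying = 2:5, timevar = "Sex")

# Interleave boy/girl per original row, then tidy up
long <- long[order(long$id, long$Sex), ]
rownames(long) <- NULL
long$id <- NULL
long <- long[c("Department", "Name", "Age", "Sex")]
long
```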

Replace multiple strings/values based on separate list

I have a data frame that looks similar to this:
EVENT ID GROUP YEAR X.1 X.2 X.3 Y.1 Y.2 Y.3
1 1 John Smith GROUP1 2015 1 John Smith 5 Adam Smith 12 Mike Smith 20 Sam Smith 7 Luke Smith 3 George Smith
Each row repeats for new logs, but the values in X.1 : Y.3 change often.
The IDs, including those in X.1 : Y.3, consist of a numeric value followed by the name, i.e., the string will be "1 John Smith" or "20 Sam Smith".
I have an issue where, in certain instances, the ID remains "1 John Smith" but in X.1 : Y.3 the number preceding "John Smith" may change, so for example it might be "14 John Smith". The names will always be correct; it's just the number that sometimes gets mixed up.
I have a list of 200+ IDs that are impacted by this mismatch. What is the most efficient way to replace the values in X.1 : Y.3 so that they match the correct ID in column ID?
I won't know which column "14 John Smith" shows up in; it could be X.1, Y.2, or Y.3 depending on the row.
I can use a replace function in a dplyr line of code, or gsub for each of the 200+ IDs and each affected column, but that seems very inefficient. Is there a quicker way than repeating something like the below x times?
df%>%mutate(X.1=replace(X.1, grepl('John Smith', X.1), "1 John Smith"))%>%as.data.frame()
Sometimes it helps to temporarily reshape the data. That way we can operate on all the X and Y values without iterating over them.
library(stringr)
library(tidyr)
## some data to work with
exd <- read.csv(text = "EVENT,ID,GROUP,YEAR,X.1,X.2,X.3,Y.1,Y.2,Y.3
1,1 John Smith,GROUP1,2015,19 John Smith,11 Adam Smith,9 Sam Smith,5 George Smith,13 Mike Smith,12 Luke Smith
2,2 John Smith,GROUP1,2015,1 George Smith,9 Luke Smith,19 Adam Smith,7 Sam Smith,17 Mike Smith,11 John Smith
3,3 John Smith,GROUP1,2015,5 George Smith,18 John Smith,12 Sam Smith,6 Luke Smith,2 Mike Smith,4 Adam Smith",
stringsAsFactors = FALSE)
## re-arrange to put X and Y columns into a single column
exd <- gather(exd, key = "var", value = "value", X.1, X.2, X.3, Y.1, Y.2, Y.3)
## find the X and Y values that contain the ID name
matches <- str_detect(exd$value, str_replace_all(exd$ID, "^\\d+ *", ""))
## replace X and Y values with the matching ID
exd[matches, "value"] <- exd$ID[matches]
## put it back in the original shape
exd <- spread(exd, key = "var", value = value)
exd
## EVENT ID GROUP YEAR X.1 X.2 X.3 Y.1 Y.2 Y.3
## 1 1 1 John Smith GROUP1 2015 1 John Smith 11 Adam Smith 9 Sam Smith 5 George Smith 13 Mike Smith 12 Luke Smith
## 2 2 2 John Smith GROUP1 2015 1 George Smith 9 Luke Smith 19 Adam Smith 7 Sam Smith 17 Mike Smith 2 John Smith
## 3 3 3 John Smith GROUP1 2015 5 George Smith 3 John Smith 12 Sam Smith 6 Luke Smith 2 Mike Smith 4 Adam Smith
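If reshaping is not convenient, the same idea can be sketched in base R by comparing the name part of each X/Y value with the name part of the row's ID. This is only a sketch on a cut-down two-column version of the data; strip_number is a hypothetical helper, and it uses an exact comparison of the names rather than str_detect.

```r
exd <- data.frame(ID  = c("1 John Smith", "2 John Smith"),
                  X.1 = c("19 John Smith", "1 George Smith"),
                  Y.1 = c("5 George Smith", "11 John Smith"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: drop the leading number and the spaces after it
strip_number <- function(x) sub("^[0-9]+ *", "", x)

id_name <- strip_number(exd$ID)
cols <- c("X.1", "Y.1")

# For each column, replace a value with the row's ID when the names match
exd[cols] <- lapply(exd[cols], function(col) {
  ifelse(strip_number(col) == id_name, exd$ID, col)
})
exd
```

Here lapply runs once per X/Y column, and each comparison is vectorised over the rows.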
Not sure if you're set on dplyr and piping, but I think this is a plyr solution that does what you need. Given this example dataset:
> df
EVENT ID GROUP YEAR X.1 X.2 X.3 Y.1 Y.2 Y.3
1 1 1 John Smith GROUP1 2015 19 John Smith 11 Adam Smith 9 Sam Smith 5 George Smith 13 Mike Smith 12 Luke Smith
2 2 2 John Smith GROUP1 2015 1 George Smith 9 Luke Smith 19 Adam Smith 7 Sam Smith 17 Mike Smith 11 John Smith
3 3 3 John Smith GROUP1 2015 5 George Smith 18 John Smith 12 Sam Smith 6 Luke Smith 2 Mike Smith 4 Adam Smith
This adply function goes row by row and replaces any matching X:Y column values with the one from the ID column:
library(plyr)
adply(df, .margins = 1, function(x) {
idcol <- as.character(x$ID)
searchname <- trimws(gsub('[[:digit:]]+', "", idcol))
sapply(x[5:10], function(y) {
ifelse(grepl(searchname, y), idcol, as.character(y))
})
})
Output:
EVENT ID GROUP YEAR X.1 X.2 X.3 Y.1 Y.2 Y.3
1 1 1 John Smith GROUP1 2015 1 John Smith 11 Adam Smith 9 Sam Smith 5 George Smith 13 Mike Smith 12 Luke Smith
2 2 2 John Smith GROUP1 2015 1 George Smith 9 Luke Smith 19 Adam Smith 7 Sam Smith 17 Mike Smith 2 John Smith
3 3 3 John Smith GROUP1 2015 5 George Smith 3 John Smith 12 Sam Smith 6 Luke Smith 2 Mike Smith 4 Adam Smith
Data:
names <- c("EVENT","ID",'GROUP','YEAR', paste(rep(c("X.", "Y."), each = 3), 1:3, sep = ""))
first <- c("John", "Sam", "Adam", "Mike", "Luke", "George")
set.seed(2017)
randvals <- t(sapply(1:3, function(x) paste(sample(1:20, size = 6),
paste(sample(first, replace = FALSE, size = 6), "Smith"))))
df <- cbind(data.frame(1:3, paste(1:3, "John Smith"), "GROUP1", 2015), randvals)
names(df) <- names
I think that the most efficient way to accomplish this is by building a loop. The reason is that you will have to repeat the function to replace the names for every name in your ID list. With a loop, you can automate this.
I will make some assumptions first:
The ID list can be read as a character vector.
You don't have any typos in the ID list or in your data.frame, including different lowercase and uppercase letters in the names.
Your ID list does not contain the numbers. If it does, you have to use gsub to erase them.
The example works with a data.frame (DF) with the same structure as the one in your question.
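ID <- c("John Smith", "Adam Smith", "George Smith")
# work on a character matrix so that grepl() tests every cell
# (grep() on a data frame would match whole columns instead)
vals <- as.matrix(DF[, 5:10])
for(i in seq_along(ID)) {
vals[grepl(ID[i], vals, fixed = TRUE)] <- ID[i]
}
DF[, 5:10] <- vals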
With each round this loop will:
Identify the positions in the columns X.1:Y.3 (columns 5 to 10 in your question) where the name "i" appears.
Then, it will change all those values to the one in the "i" position of the ID vector.
So, the first iteration will do: 1) Search for every position where the name "John Smith" appears in the data frame. 2) Replace all those "# John Smith" with "John Smith".
Note: If you simply want to delete the numbers, you can use gsub to replace them. Take into account that you probably want to erase the first space between the number and the name too. One way to do this is using gsub and a regular expression:
DF[, 5:10] <- lapply(DF[, 5:10], function(x) gsub("^[0-9]+ ", "", x))

How to unpack lists in a data.frame column? [duplicate]

I have the following data.frame:
id name altNames
1001 Joan character(0)
1002 Jane c("Janie", "Janet", "Jan")
1003 John Jon
1004 Bill Will
1005 Tom character(0)
The column altNames could be empty (i.e. character(0)), have just one name, or a list of names. What I want is a data.frame (or a list) where each entry from name and/or altNames appears just once along with the corresponding id, like this:
id name
1001 Joan
1002 Jane
1002 Janie
1002 Janet
1002 Jan
1003 John
1003 Jon
1004 Bill
1004 Will
1005 Tom
What's the most efficient way of doing it? Even better if dplyr is used.
Thanks
Edit: Here's the data:
df <- data_frame(
id = c("1001", "1002","1003", "1004", "1005"),
name = c("Joan", "Jane", "John", "Bill", "Tom"),
altNames = list(character(0), c("Janie", "Janet", "Jan"), "Jon", "Will", character(0))
)
Here's a possible data.table approach (dat is the data from the question, where it is called df)
library(data.table)
setDT(dat)[, .(name = c(name, unlist(altNames))), by = id]
# id name
# 1: 1001 Joan
# 2: 1002 Jane
# 3: 1002 Janie
# 4: 1002 Janet
# 5: 1002 Jan
# 6: 1003 John
# 7: 1003 Jon
# 8: 1004 Bill
# 9: 1004 Will
# 10: 1005 Tom
A base R version (using the df added by @rawr)
with(df, {
ns <- mapply(c, name, altNames)
data.frame(id = rep(id, times=lengths(ns)), name=unlist(ns), row.names=NULL)
})
# id name
#1 1001 Joan
#2 1002 Jane
#3 1002 Janie
#4 1002 Janet
#5 1002 Jan
#6 1003 John
#7 1003 Jon
#8 1004 Bill
#9 1004 Will
#10 1005 Tom
Here's a full dplyr + tidyr solution, the way I'd tackle it:
library(dplyr)
library(tidyr)
df <- data_frame(
id = c("1001", "1002","1003", "1004", "1005"),
name = c("Joan", "Jane", "John", "Bill", "Tom"),
altNames = list(character(0), c("Janie", "Janet", "Jan"), "Jon", "Will", character(0))
)
# Need some way to concatenate a list of vectors with a vector
# in a "rowwise" way
vector_c <- function(...) {
Map(c, ...)
}
df %>%
mutate(
names = vector_c(name, altNames),
altNames = NULL,
name = NULL
) %>%
unnest(names)
#> Source: local data frame [10 x 2]
#>
#> id names
#> 1 1001 Joan
#> 2 1002 Jane
#> 3 1002 Janie
#> 4 1002 Janet
#> 5 1002 Jan
#> 6 1003 John
#> 7 1003 Jon
#> 8 1004 Bill
#> 9 1004 Will
#> 10 1005 Tom
Most of the hard work is done by tidyr::unnest(): it's designed to take a data frame with a list-column and unnest it, repeating the other columns as needed.
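The "repeating the other columns as needed" part can be sketched in base R to show the mechanics. This is only an illustration on a three-row subset of the data, not how unnest() is implemented; the names primary, alternates and combined are made up for the example.

```r
ids <- c("1001", "1002", "1003")
primary <- c("Joan", "Jane", "John")
alternates <- list(character(0), c("Janie", "Janet"), "Jon")

# Row-wise concatenation; c("Joan", character(0)) is just "Joan",
# so empty entries need no special handling
combined <- unname(Map(c, primary, alternates))

# Number of names per row: 1, 3, 2
n <- lengths(combined)

# Repeat each id once per name and flatten the list
flat <- data.frame(id = rep(ids, times = n),
                   name = unlist(combined),
                   stringsAsFactors = FALSE)
flat
```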
Using tidyr, after cleaning the data with data.table:
First, fix the data:
library(data.table)
dat<-setDT(dat)
dat$altNames[sapply(dat$altNames, length) == 0] <- NA
Now unnest from tidyr and some dplyr:
library(dplyr)
library(tidyr)
dat %>% unnest(altNames) %>%
group_by(id) %>%
do(unique(c(.[["name"]],.[["altNames"]])))
id V1
1 1001 Joan
2 1001 NA
3 1002 Jane
4 1002 Janie
5 1002 Janet
6 1002 Jan
7 1003 John
8 1003 Jon
9 1004 Bill
10 1004 Will
11 1005 Tom
12 1005 NA
This leaves NAs in the result, but they are easily removed with %>% na.omit.
I think data.table is the winner on this one.
