Dynamically determine if a dataframe column exists and mutate if it does - r

I have code that pulls and processes data from a database based upon a client name. Some clients may have data that does not include a specific column name, e.g., last_name or first_name. For clients that do not use last_name or first_name, I don't care. For clients that do use either of those fields, I need to mutate() those columns with toupper() so that I can join on those standardized fields later in the ETL process.
Right now, I'm using a series of if() statements and some helper functions to look at the names of a dataframe and then mutate the relevant columns if they exist. I'm using if() statements because ifelse() is mostly vectorized and doesn't handle dataframes well.
library(dplyr)
set.seed(256)
b <- data.frame(id = sample(1:100, 5, FALSE),
                col_name = sample(1000:9999, 5, FALSE),
                another_col = sample(1000:9999, 5, FALSE))
d <- data.frame(id = sample(1:100, 5, FALSE),
                col_name = sample(1000:9999, 5, FALSE),
                last_name = sample(letters, 5, FALSE))

mutate_first_last <- function(df){
  mutate_first_name <- function(df){
    df %>%
      mutate(first_name = first_name %>% toupper())
  }
  mutate_last_name <- function(df){
    df %>%
      mutate(last_name = last_name %>% toupper())
  }
  n <- c("first_name", "last_name") %in% names(df)
  if (n[1] & n[2])   return(df %>% mutate_first_name() %>% mutate_last_name())
  if (n[1] & !n[2])  return(df %>% mutate_first_name())
  if (!n[1] & n[2])  return(df %>% mutate_last_name())
  if (!n[1] & !n[2]) return(df)
}
I get what I expect to get this way
> b %>% mutate_first_last()
id col_name another_col
1 48 8318 6207
2 39 7155 7170
3 16 4486 4321
4 55 2521 8024
5 15 1412 4875
> d %>% mutate_first_last()
id col_name last_name
1 64 7438 A
2 43 4551 Q
3 48 7401 K
4 78 3682 Z
5 87 2554 J
but is this the best way to handle this kind of task, dynamically checking whether a column name exists in a dataframe and mutating it if it does? It seems strange to need multiple if() statements in this function. Is there a more streamlined way to process these data?

You can use mutate_at with one_of, both from dplyr. This will mutate a column only if its name matches one of c("first_name", "last_name"). If there is no match, it generates a simple warning, which you can ignore or suppress.
library(dplyr)
d %>%
  mutate_at(vars(one_of(c("first_name", "last_name"))), toupper)
id col_name last_name
1 19 7461 V
2 52 9651 H
3 56 1901 P
4 13 7866 Z
5 25 9527 U
# example with no match
b %>%
  mutate_at(vars(one_of(c("first_name", "last_name"))), toupper)
id col_name another_col
1 34 9315 8686
2 26 5598 4124
3 17 3318 2182
4 32 1418 4369
5 49 4759 6680
Warning message:
Unknown variables: `first_name`, `last_name`
Here are a bunch of other ?select_helpers in dplyr (see the short example after this list) -
These functions allow you to select variables based on their names.
starts_with(): starts with a prefix
ends_with(): ends with a suffix
contains(): contains a literal string
matches(): matches a regular expression
num_range(): a numerical range like x01, x02, x03.
one_of(): variables in character vector.
everything(): all variables.
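For instance, matches() can achieve the same effect as one_of() here by selecting columns whose names match a regular expression; a small sketch using the d data frame from above:
d %>%
  mutate_at(vars(matches("^(first|last)_name$")), toupper)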

Update dplyr 1.0.0
In dplyr 1.0, the scoped variants of mutate such as _at or _all were replaced by across().
In addition, the best tidyselect helper for this case is any_of, as it operates on the variables that exist and ignores those that don't (without a warning message).
As a result, you can write the following:
# purrr syntax
d %>% mutate(across(any_of(c("first_name", "last_name")), ~toupper(.x)))
# function name syntax
d %>% mutate(across(any_of(c("first_name", "last_name")), toupper))
which both return the mutated column
id col_name last_name
1 19 4398 Q
2 72 1135 S
3 54 9767 V
4 60 4364 K
5 35 1564 X
while
b %>% mutate(across(any_of(c("first_name", "last_name")), toupper))
ignores the missing columns and thus returns (without a warning message):
id col_name another_col
1 42 7601 4482
2 22 1773 7072
3 47 2719 5884
4 1 9595 5945
5 81 8044 3927


understanding how to count the number of occurrences of a pattern in a string

my input:
library(tidyverse)
library(stringi)
tdf <- data.frame(
  "foo" = c(rep('|ReviewNG-BB.2|ReviewNG-BB.3', 5), rep('|ReviewNG-BB.2|NG-BB.3', 6), '|TI'),
  "bar" = c(rep('|AI|BB.2', 7), rep('|AI|ReviewNG-BB.2', 4), '|AI'),
  "xyz" = c(rep('|ICV|NG-AI', 7), rep('|ReviewNG-ICV|TI|BB.2', 4), '|ICV'),
  "gaz" = c(rep('|BB.3|ReviewNG-AI|NG-TI', 7), rep('|NG-BB.2|ICV|AI|TI', 4), '|BB.2'))
I am trying to count the number of occurrences of each label in my tdf. Every label has 4 "forms": the total count of occurrences, ReviewNG-label, NG-label, and the "pure" form |label or |label|. For example, the label AI has a total across all matches, plus ReviewNG-AI, NG-AI, and the pure form |AI or |AI|. So this is my code:
pt_t <- c("AI" )
sum(stringi::stri_count_fixed(tdf, regex(pt_t)))
pt_rng <- c("ReviewNG-AI")
sum(stringi::stri_count_fixed(tdf, regex(pt_rng)))
pt_ng<-c("NG-AI")
sum(stringi::stri_count_fixed(tdf, regex(pt_ng)))
pt<-c("|AI","|AI|")
sum(stringi::stri_count_fixed(tdf, regex(pt)))
And my output:
Warning in stringi::stri_count_fixed(tdf, regex(pt_t)) :
argument is not an atomic vector; coercing
[1] 30
Warning in stringi::stri_count_fixed(tdf, regex(pt_rng)) :
argument is not an atomic vector; coercing
[1] 7
Warning in stringi::stri_count_fixed(tdf, regex(pt_ng)) :
argument is not an atomic vector; coercing
[1] 14
Warning in stringi::stri_count_fixed(tdf, regex(pt)) :
argument is not an atomic vector; coercing
[1] 15
First of all, I don't exactly understand the warning message.
Now let's look at the counts: the total is OK, and ReviewNG-AI is still good. But the next ones are problematic:
for NG-AI I understand it double counts NG- plus ReviewNG-, but for the last "pure" count of |AI or |AI| I totally don't understand how it equals 15, when manually I count 16.
I also tried stringr from the tidyverse, but here the output is really erroneous:
sum(str_count(tdf, pt))
res <- tdf %>%
  summarise(across(everything(),
                   ~ sum(str_count(.x, paste(pt)))))
rowSums(res)
Your problem here is using a special character in a regex: | is reserved for "or" in regular expressions. If we want to search for a literal |, we need to escape it with \\|. So, for example:
library(dplyr)
library(stringr)
pt <- c("\\|AI", "\\|AI\\|")
Now, we want to count every occurrence of |AI and |AI|, so the search pattern looks like this:
paste(pt, collapse = "|")
#> [1] "\\|AI|\\|AI\\|"
So, putting it all together:
tdf %>%
  summarise(across(everything(),
                   ~ sum(str_count(.x, paste(pt, collapse = "|")))))
returns
foo bar xyz gaz
1 0 12 0 4
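As an aside, you can avoid escaping altogether by treating the pattern as a literal string with stringr::fixed(); since every "|AI|" also contains "|AI", counting the literal "|AI" alone should give the same counts as above (a small alternative sketch):
tdf %>%
  summarise(across(everything(),
                   ~ sum(str_count(.x, fixed("|AI")))))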
Maybe this kind of solution helps. Martin has already explained the why; here is a different strategy.
If all labels are separated by |, we can pivot_longer and count them, depending on your desired output:
library(dplyr)
library(tidyr)
tdf %>%
  pivot_longer(everything()) %>%
  mutate(value = sub('\\|', '', value)) %>%
  separate_rows(value, sep = "\\|") %>%
  group_by(name, value) %>%
  summarise(Labels = n())
name value Labels
<chr> <chr> <int>
1 bar AI 12
2 bar BB.2 7
3 bar ReviewNG-BB.2 4
4 foo NG-BB.3 6
5 foo ReviewNG-BB.2 11
6 foo ReviewNG-BB.3 5
7 foo TI 1
8 gaz AI 4
9 gaz BB.2 1
10 gaz BB.3 7
11 gaz ICV 4
12 gaz NG-BB.2 4
13 gaz NG-TI 7
14 gaz ReviewNG-AI 7
15 gaz TI 4
16 xyz BB.2 4
17 xyz ICV 8
18 xyz NG-AI 7
19 xyz ReviewNG-ICV 4
20 xyz TI 4
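Coming back to the four "forms" asked about in the question, the same long format can also be summarised per label; a follow-up sketch for the label AI, where exact matches avoid the double counting of NG- inside ReviewNG- mentioned in the question:
tdf %>%
  pivot_longer(everything()) %>%
  mutate(value = sub('\\|', '', value)) %>%
  separate_rows(value, sep = "\\|") %>%
  summarise(total    = sum(grepl("AI", value)),    # any form containing AI
            reviewng = sum(value == "ReviewNG-AI"),
            ng       = sum(value == "NG-AI"),
            pure     = sum(value == "AI"))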

Why is R not ordering my tibble in the correct order?

I've got a dataframe with two columns that I need to arrange in chronological order and then combine. R is strangely placing the integer 100 just after 10. I can't figure out how to stop this behavior.
Here is a reprex example.
library(tidyverse)
library(glue)
set.seed(123)
df <- tibble(x = 0:100,
             y = sample(0:100, 101, T))
df_i <- df %>%
  mutate(id = row_number(),
         z = glue('{x}.{y}'))
df_i %>%
  arrange(z)
# A tibble: 101 x 4
x y id z
<int> <int> <int> <glue>
1 0 30 1 0.30
2 1 78 2 1.78
3 10 24 11 10.24
4 100 22 101 100.22
5 11 89 12 11.89
6 12 90 13 12.90
7 13 68 14 13.68
8 14 90 15 14.90
9 15 56 16 15.56
10 16 91 17 16.91
# … with 91 more rows
You can see that the fourth row is not in the correct order. It looks like the x and y columns are not in order either. I feel like this is something trivial but it's causing some sneaky problems.
'z' is a glue object (and according to ?glue, "Format and interpolate a string", its output is a string), so it needs to be converted to numeric:
df_i %>%
  arrange(as.numeric(z))
If we check the glue source code, glue() calls glue_data(), which in turn calls as_glue(); checking as_glue(), we see it converts to character:
methods('as_glue')
getAnywhere('as_glue.default')
#function (x, ...)
#{
# as_glue(as.character(x))
#}
The behaviour is similar to sorting a character vector of numbers
sort(c('1', '2', '10', '20', '100'))
#[1] "1" "10" "100" "2" "20"

split dataframe with multiple delimiters in R

df1 <-
Gene GeneLocus
CPA1|1357 chr7:130020290-130027948:+
GUCY2D|3000 chr17:7905988-7923658:+
UBC|7316 chr12:125396194-125399577:-
C11orf95|65998 chr11:63527365-63536113:-
ANKMY2|57037 chr7:16639413-16685398:-
expected output
df2 <-
Gene.1 Gene.2 chr start end
CPA1 1357 7 130020290 130027948
GUCY2D 3000 17 7905988 7923658
UBC 7316 12 125396194 125399577
C11orf95 65998 11 63527365 63536113
ANKMY2 57037 7 16639413 16685398
I tried this way:
install.packages("splitstackshape")
library(splitstackshape)
df1 <- cSplit(df1,"Gene", sep="|", direction="wide", fixed=T)
df1 <- cSplit(df1,"GeneLocus",sep=":",direction="wide", fixed=T)
df1 <- cSplit(df1,"GeneLocus_2",sep="-",direction="wide", fixed=T)
df1 <- data.frame(df1)
df1$GeneLocus_1 <- gsub("chr", "", df1$GeneLocus_1)
I would like to know if there is an alternative, simpler way to do it.
Here you go... Just ignore the warning; it does not affect the output shown, though it does have the side effect of dropping the strand information (:+ or :-).
library(tidyr)
library(dplyr)
df1 %>%
  separate(Gene, c("Gene.1", "Gene.2")) %>%
  separate(GeneLocus, c("chr", "start", "end")) %>%
  mutate(chr = sub("chr", "", chr))
Output:
Gene.1 Gene.2 chr start end
1 CPA1 1357 7 130020290 130027948
2 GUCY2D 3000 17 7905988 7923658
3 UBC 7316 12 125396194 125399577
4 C11orf95 65998 11 63527365 63536113
5 ANKMY2 57037 7 16639413 16685398
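If you also want to keep the strand rather than drop it, a possible variation (the strand column name is my own) is to split GeneLocus on : first and then split the range on -:
df1 %>%
  separate(Gene, c("Gene.1", "Gene.2"), sep = "\\|") %>%
  separate(GeneLocus, c("chr", "range", "strand"), sep = ":") %>%
  separate(range, c("start", "end"), sep = "-") %>%
  mutate(chr = sub("chr", "", chr))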
I would suggest something like the following approach:
Make a single delimiter in your "GeneLocus" column (and strip out the unnecessary parts while you're at it).
Split both columns at once. Note that cSplit "balances" the columns being split according to the number of output columns detected. Thus, since the first column would only result in 2 columns when split, but the second would result in 4, you would need to drop columns 3 and 4 from the result.
library(splitstackshape)
GLPat <- "^chr(\\d+):(\\d+)-(\\d+):([+-])$"
cSplit(as.data.table(mydf)[, GeneLocus := gsub(GLPat, "\\1|\\2|\\3|\\4", GeneLocus)],
       names(mydf), "|")[, 3:4 := NULL, with = FALSE][]
# Gene_1 Gene_2 GeneLocus_1 GeneLocus_2 GeneLocus_3 GeneLocus_4
# 1: CPA1 1357 7 130020290 130027948 +
# 2: GUCY2D 3000 17 7905988 7923658 +
# 3: UBC 7316 12 125396194 125399577 -
# 4: C11orf95 65998 11 63527365 63536113 -
# 5: ANKMY2 57037 7 16639413 16685398 -
Alternatively, you can try col_flatten from my "SOfun" package, with which you can do:
library(SOfun)
Pat <- "^chr(\\d+):(\\d+)-(\\d+):([+-])$"
Fun <- function(invec) strsplit(gsub(Pat, "\\1|\\2|\\3|\\4", invec), "|", TRUE)
col_flatten(as.data.table(mydf)[, lapply(.SD, Fun)], names(mydf), drop = TRUE)
# Gene_1 Gene_2 GeneLocus_1 GeneLocus_2 GeneLocus_3 GeneLocus_4
# 1: CPA1 1357 7 130020290 130027948 +
# 2: GUCY2D 3000 17 7905988 7923658 +
# 3: UBC 7316 12 125396194 125399577 -
# 4: C11orf95 65998 11 63527365 63536113 -
# 5: ANKMY2 57037 7 16639413 16685398 -
SOfun is only on GitHub, so you can install it with:
source("http://news.mrdwab.com/install_github.R")
install_github("mrdwab/SOfun")

replacing for loops in a function with vector calculations to speed up R

Say I have some data in a data frame d1 that describes how frequently different sample individuals eat different foods, plus a final column describing whether or not those foods are cool to eat. The data are structured like this.
OTU.ID<- c('pizza','taco','pizza.taco','dirt')
s1<-c(5,20,14,70)
s2<-c(99,2,29,5)
s3<-c(44,44,33,22)
cool<-c(1,1,1,0)
d1<-data.frame(OTU.ID,s1,s2,s3,cool)
print(d1)
OTU.ID s1 s2 s3 cool
1 pizza 5 99 44 1
2 taco 20 2 44 1
3 pizza.taco 14 29 33 1
4 dirt 70 5 22 0
I have written a function that computes, for each sample s1:s3, the number of cool foods that were consumed and the total number of foods that were consumed. It runs as a for loop over each sample column of the data table (which is extremely slow).
cool.food.abundance <- function(food.table){
  samps <- colnames(food.table)
  # remove column names that are not sample names
  samps <- samps[!samps %in% c("OTU.ID", "cool")]
  # create output vectors for the for loop
  id <- c()
  cool.foods <- c()
  all.foods <- c()
  # run a loop that stores output ids and results as vectors
  for(i in 1:length(samps)){
    x  <- samps[i]
    y1 <- sum(food.table[samps[i]] * food.table$cool)
    y2 <- sum(food.table[samps[i]])
    id <- c(id, x)
    cool.foods <- c(cool.foods, y1)
    all.foods  <- c(all.foods, y2)
  }
  # save results as a data frame and return the data frame object
  results <- data.frame(id, cool.foods, all.foods)
  return(results)
}
So, if you run this function, you will get a new table of sample IDs, the number of cool foods that sample ate, and the total number of foods that sample ate.
cool.food.abundance(d1)
id cool.foods all.foods
1 s1 39 109
2 s2 130 135
3 s3 121 143
How can I replace this for loop with vectorised calculations to speed it up? I would really like the function to be able to operate on dataframes loaded with the fread function from the data.table package.
You can try
library(data.table) # v1.9.5+
dcast(melt(setDT(d1), id.var = c('OTU.ID', 'cool'))[, sum(value), .(cool, variable)],
      variable ~ c('notcool.foods', 'cool.foods')[cool + 1L],
      value.var = 'V1')[, all.foods := cool.foods + notcool.foods][, notcool.foods := NULL]
# variable cool.foods all.foods
#1: s1 39 109
#2: s2 130 135
#3: s3 121 143
Or, instead of using dcast, we can summarise the result (as in @jeremycg's post), as there are only two groups:
melt(setDT(d1), id.var = c('OTU.ID', 'cool'), variable.name = 'id')[,
  list(all.foods = sum(value), cool.foods = sum(value[cool == 1])), id]
# id all.foods cool.foods
#1: s1 109 39
#2: s2 135 130
#3: s3 143 121
Or you can use base R
nm1 <- paste0('s', 1:3)
res <- t(addmargins(rowsum(as.matrix(d1[nm1]), group=d1$cool),1)[-1,])
colnames(res) <- c('cool.foods', 'all.foods')
res
# cool.foods all.foods
#s1 39 109
#s2 130 135
#s3 121 143
Here's how I would do it, with reshape2 and dplyr:
library(reshape2)
library(dplyr)
d1 <- melt(d1, id = c("OTU.ID", "cool"))
d1 %>%
  group_by(variable) %>%
  summarise(all.foods = sum(value), cool.foods = sum(value[cool == 1]))
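For completeness, a minimal base R sketch of the same vectorised idea, using the original d1 from the question (sample column names assumed to be s1:s3 as in the example):
m <- as.matrix(d1[, c("s1", "s2", "s3")])
data.frame(id = colnames(m),
           cool.foods = colSums(m * d1$cool),  # weight each row by the cool flag
           all.foods  = colSums(m))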

filtering a dataset dependent on a value within a string

I am currently working with Google Analytics and R and have a query I hope someone can help me with.
I have exported my data from GA into R and have it in a dataframe ready for processing.
I want to create a for loop which goes through my data and sums a number of columns in my dataframe if one column contains a certain value.
For example, my dataframe looks like this
I have a list of IDs, which are the individual 3-digit numbers, that I can use in a for loop.
From my past experience with R, I have been able to filter the data so that I have
data[data$ID == 341,] -> datanew
and I have found some code which can check whether a certain string occurs within another string, producing a boolean:
grepl(value, chars)
Is there a way to link these together so that I have summing code similar to the below?
aggregate(cbind(users, conversion)~ID,data=datanew,FUN=sum) -> resultforID
Basically, take that data and, for every ID containing 341, add up the users and conversions.
I hope I have explained this the best way possible.
Thanks in advance
The data table has 3 columns: ID, users, Conversion, with the users and Conversion linked to the IDs. Some IDs are on their own, e.g. 341, others are 341|246, and some will have three numbers separated by the |.
# toy data
mydata = data.frame(ID = c("341|243", "341|243", "341|242", "341", "243",
                           "999", "111|341|222"),
                    Users = 10:16,
                    Conv = 5:11)
# ID Users Conv
# 1 341|243 10 5
# 2 341|243 11 6
# 3 341|242 12 7
# 4 341 13 8
# 5 243 14 9
# 6 999 15 10
# 7 111|341|222 16 11
# are you looking for something like the below?
# presume you just want to filter those IDs that have 341.
library(dplyr)
mydata[grep("341", mydata$ID), ] %>%
  group_by(ID) %>%
  summarise_each(funs(sum))
# ID Users Conv
# 1 111|341|222 16 11
# 2 341 13 8
# 3 341|242 12 7
# 4 341|243 21 11
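Note that summarise_each()/funs() have since been superseded; on dplyr 1.0+ a rough equivalent would be:
mydata[grep("341", mydata$ID), ] %>%
  group_by(ID) %>%
  summarise(across(c(Users, Conv), sum))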
If I understand your question correctly, you may want to look at cSplit from my "splitstackshape" package.
Using #KFB's sample data (which is hopefully representative of your actual data), try:
library(splitstackshape)
cSplit(mydata, "ID", "|", "long")[, lapply(.SD, sum), by = ID]
# ID Users Conv
# 1: 341 62 37
# 2: 243 35 20
# 3: 242 12 7
# 4: 999 15 10
# 5: 111 16 11
# 6: 222 16 11
Alternatively, from the Hadleyverse, you can use "dplyr" and "tidyr" together, like this:
library(dplyr)
library(tidyr)
mydata %>%
  transform(ID = strsplit(as.character(ID), "|", fixed = TRUE)) %>%
  unnest(ID) %>%
  group_by(ID) %>%
  summarise_each(funs(sum))
# Source: local data frame [6 x 3]
#
# ID Users Conv
# 1 111 16 11
# 2 222 16 11
# 3 242 12 7
# 4 243 35 20
# 5 341 62 37
# 6 999 15 10
I think this should work:
library(dplyr)
sumdf <- yourdf %>%
  group_by(ID) %>%
  summarise_each(funs(sum))
I'm not clear about the structure of your ID column, but if you need to just get the numbers you could try this:
library(tidyr)
newdf <- separate(yourdf, ID, c('id1', 'id2'), sep = '\\|') %>%
  filter(id1 == 341) # optional if you just want one ID
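One caveat: filtering on id1 == 341 only catches IDs where 341 comes first. If 341 can appear anywhere in the string (e.g. "111|341|222"), filtering with grepl() before summing catches those rows too (a sketch, with column names taken from the question's description):
yourdf %>%
  filter(grepl("341", ID)) %>%
  summarise(users = sum(users), Conversion = sum(Conversion))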
Here are two answers: the first uses subset and the second uses grep with a string.
initial run
x1 <- sample(1:4, 10, replace = TRUE)
x2 <- sample(10:40, 10)
x3 <- sample(10:40, 10)
dat <- as.data.frame(cbind(x1, x2, x3))
for(i in unique(dat$x1)) {
  dat1 <- subset(dat, subset = x1 == i)
  z <- aggregate(. ~ x1, data = dat1, FUN = sum)
  assign(paste0('x1', i), z)
}
with grep
x1 <- sample(letters[1:3], 10, replace = TRUE)
x2 <- sample(10:40, 10)
x3 <- sample(10:40, 10)
dat <- data.frame(x1, x2, x3)  # data.frame() keeps x2 and x3 numeric; cbind() would coerce everything to character and break sum()
for(i in unique(dat$x1)) {
  dat1 <- dat[grep(i, dat$x1), ]
  z <- aggregate(. ~ x1, data = dat1, FUN = sum)
  assign(paste0('x1', i), z) # this will assign separate objects as your aggregates with names based on the string
}
