Reinsert special character * into strings at predefined positions - r

(This question is based on a previous question Convert letters with duplicates to numbers)
I have a series of events and non-events in column aoi, with events expressed as capital letters and non-events expressed as "*":
df <- data.frame(
Partcpt = c("B","A","B","C","A","B"),
aoi = c("B*B*B","*A*C*A*C","*B*B","A*C","*A*","*")
)
I need to convert the letters to consecutive numbers unless they are duplicates, in which case the previous number should be repeated. This conversion is accomplished by this:
df$aoi_0 <- sapply(strsplit(df$aoi, split = ""), function(x) paste(match(x[x!="*"], unique(x[x!="*"])), collapse = ""))
df
Partcpt aoi aoi_0
1 B B*B*B 111
2 A *A*C*A*C 1212
3 B *B*B 11
4 C A*C 12
5 A *A* 1
6 B *
But now the information on the non-events is lost. How can I reinstate that information in the strings themselves, by re-inserting the "*" character where appropriate, like so:
df
Partcpt aoi aoi_0
1 B B*B*B 1*1*1
2 A *A*C*A*C *1*2*1*2
3 B *B*B *1*1
4 C A*C 1*2
5 A *A* *1*
6 B * *

You can modify the anonymous function with an ifelse() to return * if the input is * but otherwise to follow the logic of your previous code, i.e. match the input to the vector of unique values.
df$aoi_1 <- sapply(
strsplit(df$aoi, split = ""),
\(x) paste0(
ifelse(
x=="*",
"*",
match(x, unique(x[x!="*"]))
), collapse = ""
)
)
df
# Partcpt aoi aoi_0 aoi_1
# 1 B B*B*B 111 1*1*1
# 2 A *A*C*A*C 1212 *1*2*1*2
# 3 B *B*B 11 *1*1
# 4 C A*C 12 1*2
# 5 A *A* 1 *1*
# 6 B * *

Another possible solution is based on the following ideas:
Match every character of x, including *, against unique(x[x!="*"]).
Since * is not among those unique values, it yields no match.
Set nomatch = 0, so that each * is returned as 0.
Use gsub to replace each 0 with *.
df$aoi_0 <- sapply(strsplit(df$aoi, split = ""),
function(x) gsub("0", "*", paste(match(x, unique(x[x!="*"]), nomatch = 0),
collapse = "")))
df
#> Partcpt aoi aoi_0
#> 1 B B*B*B 1*1*1
#> 2 A *A*C*A*C *1*2*1*2
#> 3 B *B*B *1*1
#> 4 C A*C 1*2
#> 5 A *A* *1*
#> 6 B * *
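
To make the steps above concrete, here is a small walk-through on a single string, followed by a caveat. It is only a sketch: the names x, idx, step_out, y and via_gsub are made up for illustration. The caveat is that once a string contains ten or more distinct letters, the index 10 itself contains a "0", so the gsub step would turn it into "1*"; the ifelse() version does not have this problem.

```r
# Walk through the nomatch = 0 idea on one string
x <- strsplit("*A*C*A*C", split = "")[[1]]
idx <- match(x, unique(x[x != "*"]), nomatch = 0)
idx
# 0 1 0 2 0 1 0 2
step_out <- gsub("0", "*", paste(idx, collapse = ""))
step_out
# "*1*2*1*2"

# Caveat: ten distinct letters produce the index 10, whose "0" is also replaced
y <- LETTERS[1:10]
via_gsub <- gsub("0", "*", paste(match(y, unique(y), nomatch = 0), collapse = ""))
via_gsub
# "1234567891*"
```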

Related

Replace ( gsub) all rows in a column from values in another column?

Suppose I have a dataframe as such,
df = data.frame ( a = c(1,14,15,11) , b= c("xxxchrxxx","xxxchryy","zzchrzz","aachraa") )
a b
1 1 xxxchrxxx
2 14 xxxchryy
3 15 zzchrzz
4 11 aachraa
What I want is to replace chr in column b with chrx, where x is taken from column a:
a b
1 1 xxxchr1xxx
2 14 xxxchr14yy
3 15 zzchr15zz
4 11 aachr11aa
However, I can't get gsub to work, since it expects a single replacement string:
df$b = gsub ( "chr",paste0("chr",df$a), df$b)
any way to do this?
The reason is that gsub replacement takes only a vector with length 1. According to ?gsub
replacement - if a character vector of length 2 or more is supplied, the first element is used with a warning.
If it needs to have a vectorized replacement, use str_replace
library(stringr)
str_replace(df$b, "chr", paste0("chr", df$a))
#[1] "xxxchr1xxx" "xxxchr14yy" "zzchr15zz" "aachr11aa"
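
If you would rather stay in base R, one possible sketch is to call sub once per row with mapply, so that each element of b gets its own replacement. The variable name res is made up for illustration; like str_replace, sub only replaces the first occurrence in each string.

```r
df <- data.frame(a = c(1, 14, 15, 11),
                 b = c("xxxchrxxx", "xxxchryy", "zzchrzz", "aachraa"),
                 stringsAsFactors = FALSE)

# sub(pattern, replacement, x) is called row by row;
# USE.NAMES = FALSE stops mapply from naming the result after the pattern
res <- mapply(sub, "chr", paste0("chr", df$a), df$b, USE.NAMES = FALSE)
res
# "xxxchr1xxx" "xxxchr14yy" "zzchr15zz" "aachr11aa"
```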
If b holds nothing but "chr" (as in the data frame further below), a simple paste is enough:
df$b <- with(df, paste0(b, a))
EDIT:: With stringr:
stringr::str_replace_all(df$b,"chr",paste0("chr",df$a))
Continuing with paste0:
df$b<-paste0(df$b,df$a)
a b
1 1 chr1
2 14 chr14
3 15 chr15
4 11 chr11
df = data.frame ( a = c(1,14,15,11) , b= c("chr","chr","chr","chr") )
df$b <- paste0(df$b, df$a)
df
#> a b
#> 1 1 chr1
#> 2 14 chr14
#> 3 15 chr15
#> 4 11 chr11
Created on 2019-02-22 by the reprex package (v0.2.1)

Identify and replace minimum value from a numeric column present in all dataframes in a list of dataframes

I need a way to identify the minimum value in a particular column that is present in all dataframes in a list of dataframes, and replace it with some non-numeric character. For example:
df1 <- data.frame(x=c("a","b","c"), y=c(2,4,6))
df2 <- data.frame(x=c("a","b","c"), y=c(10,20,30))
myList <- list(df1, df2)
[[1]]
x y
1 a 2
2 b 4
3 c 6
[[2]]
x y
1 a 10
2 b 20
3 c 30
should become
[[1]]
x y
1 a *
2 b 4
3 c 6
[[2]]
x y
1 a *
2 b 20
3 c 30
What's the best way? It would be great to see both a base R solution and one using external packages (purrr).
Thanks!
Here is a base R option
lapply(myList, function(df) transform(df, y = replace(y, which.min(y), "*")))
#[[1]]
# x y
#1 a *
#2 b 4
#3 c 6
#
#[[2]]
# x y
#1 a *
#2 b 20
#3 c 30
Or the same in the tidyverse
library(tidyverse)
map(myList, ~.x %>% mutate(y = replace(y, which.min(y), "*")))
Alternatively, with a plain for loop:
for (i in seq_along(myList)) {
  currMin <- min(myList[[i]]$y)
  myList[[i]]$y[myList[[i]]$y == currMin] <- '*'
}
Please note that assigning '*' converts the column type to character.
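
A short sketch of two details that matter when choosing between the answers above: the type change, and what happens with ties. The vector y and the names first_only and all_ties are made up for illustration. which.min() returns only the first position of the minimum, whereas comparing against min(y) marks every tied position.

```r
y <- c(2, 4, 2, 6)  # the minimum occurs twice

first_only <- replace(y, which.min(y), "*")  # only the first minimum
all_ties   <- replace(y, y == min(y), "*")   # every tied minimum

first_only
# "*" "4" "2" "6"
all_ties
# "*" "4" "*" "6"
class(first_only)
# "character"
```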

How to correct/standardize variable names if their format is not consistent

I am writing a script that loads RData files containing the results of earlier experiments and parses the data frames saved in them. I've noticed that the names of variables are not consistent; for instance, sometimes symbol is called gene_name or gene_symbol. The order of variables also differs between data frames, so I can't just rename them all with colnames(df) <- c('a', 'b', ...)
I'm looking for a way to rename variables based on their name that won't give an error if that variable isn't found. The below is what I want to do, but (ideally) without needing dozens of conditional statements.
if ('gene_name' %in% colnames(df)) {
df <- df %>% dplyr::rename('symbol' = gene_name)
}
In the below example, I'd like to find an elegant way to rename the variable b to D that I can use safely on data frames that lack a variable b
x <- data.frame('a' = c(1,2,3), 'b' = c(4,5,6))
y <- data.frame('a' = c(1,2,3), 'c' = c(4,5,6))
dfs <- list(x,y)
dfs.fixed <- lapply(dfs, function(x) ?????)
Desired result:
dfs.fixed
[[1]]
a D
1 1 4
2 2 5
3 3 6
[[2]]
a c
1 1 4
2 2 5
3 3 6
Try this approach:
STEP 1
A function that replaces any of a set of column names with another string (both passed as parameters):
colnames_rep<-function(df,to_find,to_sub)
{
colnames(df)[which(colnames(df) %in% to_find)]<-to_sub
return(df)
}
STEP 2
Use lapply to apply the function over each data.frame:
lapply(dfs,colnames_rep,to_find=c("b"),to_sub="D")
[[1]]
a D
1 1 4
2 2 5
3 3 6
[[2]]
a c
1 1 4
2 2 5
3 3 6
Thanks to divibisan for the suggestion
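
If several inconsistent names have to be handled at once, the same idea can be sketched with a named lookup vector (old name = new name). The names lookup, rename_known and fixed are hypothetical, chosen only for this example; columns that are not in the lookup are left alone, so nothing fails when a name is missing.

```r
# Hypothetical lookup: names are the old column names, values the new ones
lookup <- c(gene_name = "symbol", gene_symbol = "symbol", b = "D")

rename_known <- function(df, lookup) {
  hit <- names(df) %in% names(lookup)            # columns we know how to rename
  names(df)[hit] <- unname(lookup[names(df)[hit]])
  df
}

x <- data.frame(a = 1:3, b = 4:6)
y <- data.frame(a = 1:3, c = 4:6)
fixed <- lapply(list(x, y), rename_known, lookup = lookup)
names(fixed[[1]])
# "a" "D"
names(fixed[[2]])
# "a" "c"
```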
We can use rename_at with map
library(dplyr)
library(purrr)
map(dfs, ~ .x %>%
  rename_at(vars(everything()), sub, pattern = "^b$", replacement = "D"))
#[[1]]
# a D
#1 1 4
#2 2 5
#3 3 6
#[[2]]
# a c
#1 1 4
#2 2 5
#3 3 6
Here's an approach that is similar in concept to Terru_theTerror's, but extends it by allowing regular expressions. It might be overkill, but ...
First, we define a simple "map" that maps to the desired name (first string in each vector of the list) from any string (remaining strings in each vector). The function that does the matching accepts an argument of fixed=FALSE, in which case the 2nd and remaining strings can be regular expressions, which gives more power and responsibility.
If using fixed=TRUE (the default), then the map might look like this:
colnamemap <- list(
c("symbol", "gene_name", "gene_symbol"),
c("D", "c", "quux"),
c("bbb", "b", "ccc")
)
where "gene_name" and "gene_symbol" will both be changed to "symbol", etc. If you want to use patterns (fixed=FALSE), however, you should be as specific as possible to preclude mis- or multiple-matches (across columns).
colnamemapptn <- list(
c("symbol", "^gene_(name|symbol)$"),
c("D", "^D$", "^c$", "^quux$"),
c("bbb", "^b$", "^ccc$")
)
The function that does the actual remapping:
fixfunc <- function(df, namemap, fixed = TRUE, ignore.case = FALSE) {
compare <- if (fixed) `%in%` else grepl
downcase <- if (ignore.case) tolower else c
newcn <- cn <- colnames(df)
newnames <- sapply(namemap, `[`, 1L)
matches <- sapply(namemap, function(nmap) {
apply(outer(downcase(nmap[-1]), downcase(cn), Vectorize(compare)), 2, any)
}) # dims: 1=cn; 2=map-to
for (j in seq_len(ncol(matches))) {
if (sum(matches[,j]) > 1) {
warning("rule ", sQuote(newnames[j]), " matches multiple columns: ",
paste(sQuote(cn[ matches[,j] ]), collapse=","))
matches[,j] <- FALSE
}
}
for (i in seq_len(nrow(matches))) {
rowmatches <- sum(matches[i,])
if (rowmatches == 1) {
newcn[i] <- newnames[ matches[i,] ]
} else if (rowmatches > 1) {
warning("column ", sQuote(cn[i]), " matches multiple rules: ",
paste(sQuote(newnames[ matches[i,]]), collapse=","))
matches[i,] <- FALSE
}
}
if (any(matches)) colnames(df) <- newcn
df
}
(You might extend it to ensure unique-ness, using make.names and/or make.unique. There's also ignore.case, not really tested here but easily done, I believe.)
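
As a rough illustration of the uniqueness point, this is what make.unique and make.names do to a vector of column names. The vector newcn here is made up for the example and is not tied to the function above.

```r
# Two columns ended up with the same name after remapping
newcn <- c("symbol", "symbol", "a")
uniq <- make.unique(newcn)
uniq
# "symbol" "symbol.1" "a"

# make.names turns strings into syntactically valid names
valid <- make.names("gene symbol")
valid
# "gene.symbol"
```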
I'm going to extend your sample data by including one that will match multiple patterns resulting in ambiguity:
x <- data.frame('a' = c(1,2,3), 'b' = c(4,5,6))
y <- data.frame('a' = c(1,2,3), 'c' = c(4,5,6))
z <- data.frame('cc' = 1:3, 'ccc' = 2:4)
dfs <- list(x,y,z)
where, once the strings are treated as patterns, the third data.frame has two columns that match the "c" entry of my second non-pattern vector. When there are multiple matches, I think the safer thing to do is warn about it and change none of them.
This is correct, fixed-strings only:
lapply(dfs, fixfunc, colnamemap, fixed=TRUE)
# [[1]]
# a bbb
# 1 1 4
# 2 2 5
# 3 3 6
# [[2]]
# a D
# 1 1 4
# 2 2 5
# 3 3 6
# [[3]]
# cc bbb
# 1 1 2
# 2 2 3
# 3 3 4
This incorrectly uses the strings as patterns, which causes one of them to warn about multiple matches:
lapply(dfs, fixfunc, colnamemap, fixed=FALSE)
# Warning in FUN(X[[i]], ...) :
# rule 'D' matches multiple columns: 'cc','ccc'
# [[1]]
# a bbb
# 1 1 4
# 2 2 5
# 3 3 6
# [[2]]
# a D
# 1 1 4
# 2 2 5
# 3 3 6
# [[3]]
# cc bbb
# 1 1 2
# 2 2 3
# 3 3 4
A better use of fixed=FALSE, with strict patterns instead:
lapply(dfs, fixfunc, colnamemapptn, fixed=FALSE)
# same output as the first call

Assign results of apply to multiple columns of data frame

I would like to process all rows in data frame df by applying function f to every row. As function f returns numeric vector with two elements I would like to assign individual elements to new columns in df.
Sample df, trivial function f returning two elements and my trial with using apply
df <- data.frame(a = 1:3, b = 3:5)
f <- function (a, b) {
c(a + b, a * b)
}
df[, c('apb', 'amb')] <- apply(df, 1, function(x) f(a = x[1], b = x[2]))
This does not work, because the results are assigned column by column:
> df
a b apb amb
1 1 3 4 8
2 2 4 3 8
3 3 5 6 15
You could also use Reduce instead of apply, as it is generally more efficient. You just need to slightly modify your function to use cbind instead of c:
f <- function (a, b) {
cbind(a + b, a * b) # modified to use `cbind` instead of `c`
}
df[c('apb', 'amb')] <- Reduce(f, df)
df
# a b apb amb
# 1 1 3 4 3
# 2 2 4 6 8
# 3 3 5 8 15
Note: This only works cleanly if you have exactly two columns (as in your example); if your data set has more columns, run it on a two-column subset.
You need to transpose the result of apply to get what you want:
df[, c('apb', 'amb')] <- t(apply(df, 1, function(x) f(a = x[1], b = x[2])))
> df
a b apb amb
1 1 3 4 3
2 2 4 6 8
3 3 5 8 15
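
The reason for the transpose can be seen from the shape of what apply returns: each call to f contributes one column, so three rows and two results give a 2 x 3 matrix. The sketch below checks this and also shows mapply as one possible alternative that passes the columns to f directly; res is just an illustrative name.

```r
df <- data.frame(a = 1:3, b = 3:5)
f <- function(a, b) c(a + b, a * b)

res <- apply(df, 1, function(x) f(a = x[1], b = x[2]))
dim(res)     # 2 3: one column per row of df
dim(t(res))  # 3 2: one row per row of df, as needed for assignment

# mapply also returns one column per call, so it needs the same transpose
df[c("apb", "amb")] <- t(mapply(f, df$a, df$b))
df
#   a b apb amb
# 1 1 3   4   3
# 2 2 4   6   8
# 3 3 5   8  15
```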

Rows With Blank Entries in R

I have a 721 x 26 dataframe. Some rows have entries that are blank. It's not NULL or NA but just empty like the following. How can I delete those rows that have these kind of entries?
1 Y N Y N 86.8
2 N N Y N 50.0
3 76.8
4 N N Y N 46.6
5 Y Y Y Y 30.0
The answer to this question depends on how paranoid you want to be about the sort of things that might be in 'blank'-appearing character strings. Here's a fairly careful approach that will match the zero-length blank string "" as well as any string composed of one or more [[:space:]] characters (i.e. "tab, newline, vertical tab, form feed, carriage return, space and possibly other locale-dependent characters", according to the ?regex help page).
## An example data.frame containing all sorts of 'blank' strings
df <- data.frame(A = c("a", "", "\n", " ", " \t\t", "b"),
B = c("b", "b", "\t", " ", "\t\t\t", "d"),
C = 1:6)
## Test each element to see if it is either zero-length or contains just
## space characters
pat <- "^[[:space:]]*$"
subdf <- df[!names(df) %in% "C"] # drops columns not involved in the test
matches <- data.frame(lapply(subdf, function(x) grepl(pat, x)))
## Subset df to remove rows fully composed of elements matching `pat`
df[!apply(matches, 1, all),]
# A B C
# 1 a b 1
# 2 b 2
# 6 b d 6
## OR, to remove rows with *any* blank entries
df[!apply(matches, 1, any),]
# A B C
# 1 a b 1
# 6 b d 6
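
The same test can be sketched a little more compactly by counting the blank entries per row. The helper is_blank and the vector n_blank are names invented for this example, and the columns to check ("A" and "B") are assumed to be known in advance.

```r
df <- data.frame(A = c("a", "", "\n", " ", " \t\t", "b"),
                 B = c("b", "b", "\t", " ", "\t\t\t", "d"),
                 C = 1:6, stringsAsFactors = FALSE)

# TRUE for zero-length strings and strings made only of whitespace
is_blank <- function(x) grepl("^[[:space:]]*$", x)

# Number of blank entries per row in the columns being checked
n_blank <- rowSums(sapply(df[c("A", "B")], is_blank))
n_blank
# 0 1 2 2 2 0

kept <- df[n_blank == 0, ]  # rows without any blank entry
kept
#   A B C
# 1 a b 1
# 6 b d 6
```

Using n_blank < 2 instead would drop only the rows in which both checked columns are blank.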

Resources