str_replace within mutate(across()) matching nth character from cur_column - r

A summary of my aim
I have the following dataframe structure:
my.df <-data.frame("col1_A.C"=c("AA","AC","CC"),
"col2_A.T"=c("TT","AT","TT"),
"col3_C.G"=c("GG","CG","CG"))
my.df
# col1_A.C col2_A.T col3_C.G
# 1 AA TT GG
# 2 AC AT CG
# 3 CC TT CG
For each column, I want to replace any character that matches the 3rd last character of the column name with the character "R".
Using the above dataframe I thus would like to obtain this:
my.df2 <- data.frame("col1_A.C"=c("RR","RC","CC"),
"col2_A.T"=c("TT","RT","TT"),
"col3_C.G"=c("GG","RG","RG"))
my.df2
# col1_A.C col2_A.T col3_C.G
# 1 RR TT GG
# 2 RC RT RG
# 3 CC TT RG
In the first column for instance the column name is col1_A.C, and A is the 3rd last character. All the A's were thus replaced with an R.
My code so far
To achieve this, I have produced the following code
my.df2 <- my.df %>% mutate(across(.cols=everything(),
.funs=str_replace_all(.,
substr(cur_column(),
nchar(cur_column()-2),
nchar(cur_column()-2)
),
"R")
)
)
Unfortunately, the resulting dataframe, my.df2, looks exactly like my.df: no character replacement occurred, and no error is returned either.
I have tested the str_replace_all() approach in the following way and it works on a vector. I imagine then there is something I am missing/not understanding in the way str_replace_all() is interpreted within the mutate(across()) function.
first.column <- c("AA","AT","AA")
first.column <- str_replace_all(first.column,
substr(colnames(my.df)[1],
nchar(colnames(my.df)[1])-2,
nchar(colnames(my.df)[1])-2
),
"R")
print(first.column)
# [1] "RR" "RT" "RR"
I have run out of ideas about what might not be working. My understanding of R and its functions is not very thorough, so I apologise if I have missed something simple. I have also searched for similar questions, but to no avail.

I think you just needed a tilde ~ (so that the call becomes a function), .fns instead of .funs, and the - 2 moved outside nchar().
my.df %>%
mutate(
across(
.cols = everything(),
.fns = ~ str_replace_all(
string = ..1,
pattern = str_sub(cur_column(), nchar(cur_column()) - 2, nchar(cur_column()) - 2),
replacement = "R"
)
)
)
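The position arithmetic can be checked on its own, outside mutate(). This is a minimal base R sketch, not part of the original answer; col_name, pos, target and misplaced are illustrative names. It shows that nchar(x) - 2 gives the position of the 3rd last character, while writing the - 2 inside nchar(), as in the question's snippet, tries to subtract a number from a string.

```r
col_name <- "col1_A.C"

# Correct: subtract 2 from the length to get the position of the 3rd last character
pos <- nchar(col_name) - 2
target <- substr(col_name, pos, pos)
target

# Misplaced parenthesis: "col1_A.C" - 2 is evaluated first and fails,
# so the error message is captured here instead of stopping the script
misplaced <- tryCatch(nchar(col_name - 2), error = function(e) conditionMessage(e))
misplaced
```

Inside across(), cur_column() returns the column name as a string, so the same arithmetic applies there.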

You can use Map:
my.df[] <- Map(function(x, y) gsub(y, 'R', x), my.df,
substring(names(my.df), nchar(names(my.df)) - 2,nchar(names(my.df)) - 2))
my.df
# col1_A.C col2_A.T col3_C.G
#1 RR TT GG
#2 RC RT RG
#3 CC TT RG
Using @thelatemail's chartr trick with imap_dfc from purrr:
purrr::imap_dfc(my.df, ~chartr(substr(.y, nchar(.y)-2, nchar(.y)-2), 'R', .x))
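The two ideas above can also be combined without purrr. The following is a hedged base R sketch, assuming the same my.df as in the question; pos and targets are illustrative names. chartr() translates characters literally, so the target character is never interpreted as a regular expression, which matters if a column name ever puts a character such as "." in that position.

```r
my.df <- data.frame("col1_A.C" = c("AA", "AC", "CC"),
                    "col2_A.T" = c("TT", "AT", "TT"),
                    "col3_C.G" = c("GG", "CG", "CG"),
                    stringsAsFactors = FALSE)

# 3rd last character of every column name, computed in one vectorised call
pos <- nchar(names(my.df)) - 2
targets <- substring(names(my.df), pos, pos)

# Pair each column with its own target character and translate it to "R"
my.df[] <- Map(function(column, target) chartr(target, "R", column), my.df, targets)
my.df
```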

The same can be achieved by first converting your data from wide to long format:
library(tidyverse)
my.df %>%
gather(colx, rowx) %>%
mutate(rowx = str_replace_all(rowx, substring(colx, nchar(colx) - 2, nchar(colx) -
2), "R")) %>%
group_by(colx) %>%
mutate(id = row_number()) %>%
pivot_wider(names_from = colx, values_from = rowx)

Related

R Subsetting text from a comma-separated column in a data frame

I have a data.frame with a column that looks like that:
diagnosis
F.31.2,A.43.2,R.45.2,F.43.1
I want to split this column into two columns, one containing all the values starting with F and one containing all the other values, resulting in a df that looks like this:
F other
F.31.2,F.43.1 A.43.2,R.45.2
Thanks in advance
Try the following tidyverse approach. You can separate the rows at each comma and then create a group according to the pattern, in order to reshape to wide format and obtain the expected result:
library(dplyr)
library(tidyr)
#Data
df <- data.frame(diagnosis='F.31.2,A.43.2,R.45.2,F.43.1',stringsAsFactors = FALSE)
#Code
new <- df %>% separate_rows(diagnosis,sep = ',') %>%
mutate(Group=ifelse(grepl('F',diagnosis),'F','Other')) %>%
pivot_wider(values_fn = toString,names_from=Group,values_from=diagnosis)
Output:
# A tibble: 1 x 2
F Other
<chr> <chr>
1 F.31.2, F.43.1 A.43.2, R.45.2
First, use strsplit at the commas. Then, using grep find indexes of F, and select/antiselect them by multiplying by 1 or -1 and paste them.
tmp <- el(strsplit(d$diagnosis, ","))
res <- lapply(c(1, -1), function(x) paste(tmp[grep("F", tmp)*x], collapse=","))
res <- setNames(as.data.frame(res), c("F", "other"))
res
# F other
# 1 F.31.2,F.43.1 A.43.2,R.45.2
Data:
d <- setNames(read.table(text="F.31.2,A.43.2,R.45.2,F.43.1"), "diagnosis")
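A variant of the same idea with a logical mask instead of signed indexes is sketched below. It is an illustrative alternative, not the answerer's code; codes, is_f and res are hypothetical names. A logical mask still behaves sensibly when no code starts with F, whereas multiplying an empty grep() result by -1 selects nothing rather than everything. startsWith() also only matches an F at the beginning of a code, rather than anywhere in it.

```r
diagnosis <- "F.31.2,A.43.2,R.45.2,F.43.1"

# Split on literal commas and take the single resulting vector
codes <- strsplit(diagnosis, ",", fixed = TRUE)[[1]]

# TRUE for codes that begin with "F", FALSE for everything else
is_f <- startsWith(codes, "F")

res <- data.frame(F = paste(codes[is_f], collapse = ","),
                  other = paste(codes[!is_f], collapse = ","),
                  stringsAsFactors = FALSE)
res
```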

Data cleaning in R: grouping by number and then by name

A small sample of my dataset looks something like this:
x <- c(1,2,3,4,1,7,1)
y <- c("A","b","a","F","A",".A.","B")
data <- cbind(x,y)
My goal is to first group data that have the same number together and then followed by the same name together (A,a,.A. are considered as the same name for my case).
In other words, the final output should look something like this:
xnew <- c(1,1,3,7,1,2,4)
ynew <- c("A","A","a",".A.","B","b","F")
datanew <- cbind(xnew,ynew)
Currently, I am only able to group by number in the column labelled x. I am unable to group by name yet. I would appreciate any help given.
Note: I need an automated solution as my raw dataset contains over 10,000 lines for the x and y columns.
Assuming what you have is a dataframe, data <- data.frame(x,y), and not the matrix that cbind generates, you could combine the different spellings into one level using fct_collapse and then arrange the data by this new column (z) and the x value.
library(dplyr)
library(forcats)
data %>%
mutate(z = fct_collapse(y,
"A" = c('A', '.A.', 'a'),
"B" = c('B', 'b'))) %>%
arrange(z, x) %>%
select(-z) -> result
result
# x y
#1 1 A
#2 1 A
#3 3 a
#4 7 .A.
#5 1 B
#6 2 b
#7 4 F
Or you can remove all punctuation from the y column, convert the values to upper or lower case, and then arrange.
data %>%
mutate(z = toupper(gsub("[[:punct:]]", "", y))) %>%
arrange(z, x) %>%
select(-z) -> result
result
library(dplyr)
data %>%
as.data.frame() %>%
group_by(x, y) %>%
summarise(records = n()) %>%
arrange(x, y)
According to your question it's just a matter of ordering data.
result <- data[order(data$x, data$y),]
Or, considering that you want to treat A, a and .A. as the same name:
result <- data[order(data$x, toupper(gsub("[^A-Za-z]","",data$y))),]
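The expected output in the question is sorted by name first and by number within each name, so the normalised name has to be the first sort key. Here is a minimal base R sketch under that reading, assuming the data is held in a data frame rather than the cbind matrix; key is an illustrative helper name.

```r
data <- data.frame(x = c(1, 2, 3, 4, 1, 7, 1),
                   y = c("A", "b", "a", "F", "A", ".A.", "B"),
                   stringsAsFactors = FALSE)

# Normalise names: drop everything that is not a letter, then upper-case
key <- toupper(gsub("[^A-Za-z]", "", data$y))

# Sort by normalised name first, then by number within each name
result <- data[order(key, data$x), ]
result
```

With the sample data this reproduces the xnew and ynew vectors given in the question.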

mutate_at with two sets of variables

I just asked a question about generating multiple columns at once with dplyr, and I'm a bonehead and oversimplified the problem and have another question. I'd like to find a dplyr method for dynamically generating columns based on other columns.
cols <- c("x", "y")
foo <- c("a", "b")
bar <- c("c", "d")
df <- data.frame(a = 1, b = 2, c = 10, d = 20)
df[cols] <- df[foo] * df[bar]
In my first iteration of the question, I included only one set of previously defined columns, so the following worked:
df %>%
mutate_at(vars(foo), list(new = ~ . * 5)) %>%
rename_at(vars(matches('new')), ~ c('x', 'y'))
However, as the first few lines of code suggest, I would like to instead multiply two existing columns together, and am unable to figure out how to do this. I have tried:
df %>%
mutate_at(c(vars(foo), vars(bar)),
function(x,y) {x * y})
which returns the error:
Error in (function (x, y) : argument "y" is missing, with no default
Is it possible to reference multiple sets of columns to be used on each other with mutate_at?
Since you want to work with two sets of columns, I think purrr::map2 is the function to use:
library(purrr)
library(dplyr)
map2(foo, bar, ~ df[[.x]] * df[[.y]]) %>%
set_names(cols) %>%
bind_cols(df, .)
#> a b c d x y
#> 1 1 2 10 20 10 40
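The same pairing can be written with base R's Map(), for cases where adding purrr is not desirable. This is a sketch using the question's objects; new_cols and result are illustrative names.

```r
cols <- c("x", "y")
foo <- c("a", "b")
bar <- c("c", "d")
df <- data.frame(a = 1, b = 2, c = 10, d = 20)

# Walk over foo and bar in parallel and multiply the matching columns
new_cols <- Map(function(f, b) df[[f]] * df[[b]], foo, bar)

# Name the products after cols and attach them to the original data
result <- cbind(df, setNames(new_cols, cols))
result
```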

Using the pipe in unique() function in r is not working

I am having some trouble using the pipe operator (%>%) with the unique function.
df = data.frame(
a = c(1,2,3,1),
b = 'a')
unique(df$a) # no problem here
df %>% unique(.$a) # not working here
# I got "Error: argument 'incomparables != FALSE' is not used (yet)"
Any idea?
As other answers mention, df %>% unique(.$a) is equivalent to df %>% unique(., .$a).
To force the dots to be explicit you can do:
df %>% {unique(.$a)}
# [1] 1 2 3
An alternative option from magrittr
df %$% unique(a)
# [1] 1 2 3
Or possibly stating the obvious:
df$a %>% unique()
# [1] 1 2 3
What is happening is that %>% takes the object on the left hand side and feeds it into the first argument of the function by default, and then will feed in other arguments as provided. Here is an example:
df = data.frame(
a = c(1,2,3,1),
b = 'a')
MyFun<-function(x,y=FALSE){
return(match.call())
}
> df %>% MyFun(.$a)
MyFun(x = ., y = .$a)
What is happening is that %>% is matching df to x and .$a to y.
So for unique your code is being interpreted as:
unique(x=df, incomparables=.$a)
which explains the error. For your case you need to pull out a before you run unique. If you want to keep with %>% you can use df %>% .$a %>% unique() but obviously there are lots of other ways to do that.
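The argument matching described above can be reproduced without any pipe. The sketch below is illustrative only (msg and vals are hypothetical names): it calls unique() with the data frame as the first argument and the column as the second, which is what the piped expression expands to, and captures the resulting error message rather than letting it stop the script.

```r
df <- data.frame(a = c(1, 2, 3, 1), b = "a")

# What df %>% unique(.$a) expands to: the column lands in `incomparables`
msg <- tryCatch(unique(df, df$a), error = function(e) conditionMessage(e))
msg

# Extracting the column first gives unique() a plain vector
vals <- unique(df$a)
vals
```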

Non-standard evaluation (NSE) in dplyr's filter_ & pulling data from MySQL

I'd like to pull some data from a sql server with a dynamic filter. I'm using the great R package dplyr in the following way:
#Create the filter
filter_criteria = ~ column1 %in% some_vector
#Connect to the database
connection <- src_mysql(dbname = "mydbname",
user = "myusername",
password = "mypwd",
host = "myhost")
#Get data
data <- connection %>%
tbl("mytable") %>% #Specify which table
filter_(.dots = filter_criteria) %>% #non standard evaluation filter
collect() #Pull data
This piece of code works fine but now I'd like to loop it somehow on all the columns of my table, thus I'd like to write the filter as:
#Dynamic filter
i <- 2 #With a loop on this i for instance
which_column <- paste0("column",i)
filter_criteria <- ~ which_column %in% some_vector
And then reapply the first code with the updated filter.
Unfortunately this approach doesn't give the expected results. In fact it does not give any error but doesn't even pull any result into R.
In particular, I looked a bit into the SQL query generated by the two pieces of code and there is one important difference.
While the first, working, code generates a query of the form:
SELECT ... FROM ... WHERE
`column1` IN ....
(` sign in the column name), the second one generates a query of the form:
SELECT ... FROM ... WHERE
'column1' IN ....
(' sign in the column name)
Does anyone have any suggestion on how to formulate the filtering condition to make it work?
It's not really related to SQL. This example in R does not work either:
df <- data.frame(
v1 = sample(5, 10, replace = TRUE),
v2 = sample(5,10, replace = TRUE)
)
df %>% filter_(~ "v1" == 1)
It does not work because you need to pass to filter_ the expression ~ v1 == 1 — not the expression ~ "v1" == 1.
To solve the problem, simply use the quoting function quo and the unquoting operator !!:
library(dplyr)
which_column = quo(v1)
df %>% filter(!!which_column == 1)
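The difference between comparing the column name and comparing the column itself can be seen in plain base R. This is a hedged sketch with made-up data, not the poster's table; string_test, column_test and result are illustrative names.

```r
df <- data.frame(v1 = c(1, 3, 1, 5), v2 = c(2, 4, 5, 1))
which_column <- "v1"

# Comparing the name: 1 is coerced to "1", and "v1" is not equal to "1"
string_test <- which_column == 1
string_test

# Comparing the column the name refers to: one TRUE/FALSE per row
column_test <- df[[which_column]] == 1
result <- df[column_test, , drop = FALSE]
result
```

Because which_column is an ordinary string here, the same lines could sit inside a loop over column names.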
An alternative solution: with dplyr version 0.5.0 (and probably earlier versions too), it is possible to pass a composed string as the .dots argument, which I find more readable than the lazyeval::interp solution:
df <- data.frame(
v1 = sample(5, 10, replace = TRUE),
v2 = sample(5,10, replace = TRUE)
)
which_col <- "v1"
which_val <- 1
df %>% filter_(.dots= paste0(which_col, "== ", which_val))
v1 v2
1 1 1
2 1 2
3 1 4
UPDATE for dplyr 0.6 and later:
packageVersion("dplyr")
# [1] ‘0.5.0.9004’
df %>% filter(UQ(rlang::sym(which_col))==which_val)
#OR
df %>% filter((!!rlang::sym(which_col))==which_val)
(Similar to @Matthew's response for dplyr 0.6, but I assume that which_col is a string variable.)
2nd UPDATE: Edwin Thoen created a nice cheatsheet for tidy evaluation: https://edwinth.github.io/blog/dplyr-recipes/
Here's a slightly less verbose solution and one which uses the typical behavior of the extract function, '[' in selecting a column by character value rather than converting it to a language element:
df %>% filter(., '['(., which_column)==1 )
set.seed(123)
df <- data.frame(
v1 = sample(5, 10, replace = TRUE),
v2 = sample(5,10, replace = TRUE)
)
which_column <- "v1"
df %>% filter(., '['(., which_column)==1)
# v1 v2
#1 1 5

Resources