Apply codon translation function to all elements of data.frame - r

I have a data.frame that looks like
df <- data.frame(P1 = c("ATG","GTA","GGG","GGG"), P2 = c("TGG","GAT","GGG","GCG"))
I want to convert each DNA codon to an amino-acid using the below function (but any translate option is viable), and output an identical data.frame but with single letter amino-acids rather than codons:
library(Biostrings)
library(seqinr)
translate_R <- function(x)
{
translate(s2c(as.character(x)))
}
It works for individual elements of the data.frame
> translate_R(df[1,1])
[1] "M"
But trying to apply this to the whole data.frame isn't working. What am I missing? I don't understand why there is an error, as googling how to do this suggests it should work. Missing something fundamental I guess.
> df[] <- lapply(df, translate_R)
Error in seq.default(from = frame + 1, to = frame + l, by = 3) :
wrong sign in 'by' argument
In addition: Warning message:
In s2c(as.character(x)) :
Error in seq.default(from = frame + 1, to = frame + l, by = 3) :
wrong sign in 'by' argument

Your translate_R function is expecting a single value, but it's getting a vector. You can fix this by passing in individual values.
In other words, iterate over columns of df with an outer apply and then over values in each column with an inner apply.
Here's how to do it with base R:
data.frame(lapply(df, function(x) sapply(x, translate_R)))
And here's a tidyverse version with map:
library(tidyverse)
df %>% mutate(across(everything(), ~map_chr(., translate_R)))
In both cases, the output is:
P1 P2
1 M W
2 V D
3 G G
4 G A
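To see why the scalar function fails under a plain lapply, here is a minimal base R sketch of the same nested iteration. It does not call seqinr or Biostrings: codon_table and translate_one are hypothetical stand-ins that cover only the six codons in the example data, so treat it as an illustration of the pattern rather than a full translator.

```r
# Hypothetical lookup table covering only the codons used in this example;
# a real table would list all 64 codons.
codon_table <- c(ATG = "M", GTA = "V", GGG = "G",
                 TGG = "W", GAT = "D", GCG = "A")

# Translates exactly one codon and stops on anything else.
translate_one <- function(codon) {
  codon <- as.character(codon)
  stopifnot(length(codon) == 1, codon %in% names(codon_table))
  unname(codon_table[codon])
}

df <- data.frame(P1 = c("ATG", "GTA", "GGG", "GGG"),
                 P2 = c("TGG", "GAT", "GGG", "GCG"),
                 stringsAsFactors = FALSE)

# lapply() hands each whole column (length 4) to the function, which is why a
# scalar function fails; the inner vapply() walks the column one value at a time.
aa <- df
aa[] <- lapply(df, function(column) {
  vapply(column, translate_one, character(1), USE.NAMES = FALSE)
})
aa
```

Assigning into aa[] keeps the data.frame shape and column names, and vapply guarantees each cell comes back as a single character value.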

Another potential tidyverse solution is to use the "rowwise" tidyverse function:
library(tidyverse)
library(Biostrings)
library(seqinr)
translate_R <- function(x) {
translate(s2c(as.character(x)))
}
df <- data.frame(P1 = c("ATG","GTA","GGG","GGG"), P2 = c("TGG","GAT","GGG","GCG"))
df %>%
rowwise() %>%
mutate(across(everything(), ~ translate_R(.x)))
#> # A tibble: 4 x 2
#> # Rowwise:
#> P1 P2
#> <chr> <chr>
#> 1 M W
#> 2 V D
#> 3 G G
#> 4 G A
Created on 2021-07-21 by the reprex package (v2.0.0)

Related

Is there a way to put all columns from my database as integer with a simple code? [duplicate]

What is the most efficient way to convert multiple columns in a data frame from character to numeric format?
I have a dataframe called DF with all character variables.
I would like to do something like
for (i in names(DF)){
DF$i <- as.numeric(DF$i)
}
Thank you
You could try
DF <- data.frame("a" = as.character(0:5),
"b" = paste(0:5, ".1", sep = ""),
"c" = letters[1:6],
stringsAsFactors = FALSE)
# Check columns classes
sapply(DF, class)
# a b c
# "character" "character" "character"
cols.num <- c("a","b")
DF[cols.num] <- sapply(DF[cols.num],as.numeric)
sapply(DF, class)
# a b c
# "numeric" "numeric" "character"
If you're already using the tidyverse, there are a few solutions depending on the exact situation.
The basic case, if you know every character column holds only numbers and no NAs:
library(dplyr)
# solution
dataset %>% mutate_if(is.character,as.numeric)
Test cases
df <- data.frame(
x1 = c('1','2','3'),
x2 = c('4','5','6'),
x3 = c('1','a','x'), # vector with alpha characters
x4 = c('1',NA,'6'), # numeric and NA
x5 = c('1',NA,'x'), # alpha and NA
stringsAsFactors = F)
# display starting structure
df %>% str()
Convert all character vectors to numeric (could fail if not numeric)
df %>%
select(-x3) %>% # this removes the alpha column, in case all your character columns need to be converted to numeric
mutate_if(is.character,as.numeric) %>%
str()
Check if each column can be converted. This can be an anonymous function. It returns FALSE if there is a non-numeric or non-NA character somewhere. It also checks if it's a character vector to ignore factors. na.omit removes original NAs before creating "bad" NAs.
is_all_numeric <- function(x) {
!any(is.na(suppressWarnings(as.numeric(na.omit(x))))) & is.character(x)
}
df %>%
mutate_if(is_all_numeric,as.numeric) %>%
str()
If you want to convert specific named columns, then mutate_at is better.
df %>% mutate_at('x1', as.numeric) %>% str()
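The same "convert only what parses cleanly" idea can be sketched in base R without dplyr. The helper name parses_as_numeric and the three-column test frame are assumptions made for this illustration; the logic mirrors is_all_numeric above.

```r
df <- data.frame(
  x1 = c("1", "2", "3"),
  x3 = c("1", "a", "x"),  # contains alpha characters
  x4 = c("1", NA, "6"),   # numeric strings plus an original NA
  stringsAsFactors = FALSE
)

# TRUE only for character columns whose non-NA entries all parse as numbers.
parses_as_numeric <- function(x) {
  is.character(x) && !anyNA(suppressWarnings(as.numeric(x[!is.na(x)])))
}

# Logical vector marking the columns that are safe to convert.
convertible <- vapply(df, parses_as_numeric, logical(1))
df[convertible] <- lapply(df[convertible], as.numeric)
sapply(df, class)
```

Here x1 and x4 become numeric (the original NA in x4 is preserved), while x3 is left as character because "a" and "x" would turn into NA.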
You can use index of columns:
dataset[,1:9] <- sapply(dataset[,1:9],as.numeric)
I used this code to convert all columns to numeric except the first one:
library(dplyr)
# check structure, row and column number with: glimpse(df)
# convert to numeric e.g. from 2nd column to 10th column
df <- df %>%
mutate_at(c(2:10), as.numeric)
Using the across() function from dplyr 1.0
df <- df %>% mutate(across(everything(), ~as.numeric(.)))
You could use convert from the hablar package:
library(dplyr)
library(hablar)
# Sample df (stolen from the solution by Luca Braglia)
df <- tibble("a" = as.character(0:5),
"b" = paste(0:5, ".1", sep = ""),
"c" = letters[1:6])
# insert variable names in num()
df %>% convert(num(a, b))
Which gives you:
# A tibble: 6 x 3
a b c
<dbl> <dbl> <chr>
1 0. 0.100 a
2 1. 1.10 b
3 2. 2.10 c
4 3. 3.10 d
5 4. 4.10 e
6 5. 5.10 f
Or if you are lazy, let retype() from hablar guess the right data type:
df %>% retype()
which gives you:
# A tibble: 6 x 3
a b c
<int> <dbl> <chr>
1 0 0.100 a
2 1 1.10 b
3 2 2.10 c
4 3 3.10 d
5 4 4.10 e
6 5 5.10 f
type.convert()
Convert a data object to logical, integer, numeric, complex, character or factor as appropriate.
Add the as.is argument type.convert(df,as.is = T) to prevent character vectors from becoming factors when there is a non-numeric in the data set.
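A small sketch of type.convert() applied to a whole data frame, assuming a recent R version in which type.convert() accepts data frames. The three-column frame is made up for illustration.

```r
df <- data.frame(a = c("1", "2", "3"),
                 b = c("1.5", "2.5", "3.5"),
                 c = c("x", "y", "z"),
                 stringsAsFactors = FALSE)

# as.is = TRUE keeps non-numeric columns as character instead of factor.
converted <- type.convert(df, as.is = TRUE)
sapply(converted, class)
```

Each column is converted independently, so whole-number strings become integer, decimal strings become numeric, and anything else stays character.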
Slight adjustment to answers from ARobertson and Kenneth Wilson that worked for me.
Running R 3.6.0, with library(tidyverse) and library(dplyr) in my environment:
library(tidyverse)
library(dplyr)
> df %<>% mutate_if(is.character, as.numeric)
Error in df %<>% mutate_if(is.character, as.numeric) :
could not find function "%<>%"
I did some quick research and found this note in Hadley's "The tidyverse style guide".
The magrittr package provides the %<>% operator as a shortcut for modifying an object in place. Avoid this operator.
# Good
x <- x %>%
abs() %>%
sort()
# Bad
x %<>%
abs() %>%
sort()
Solution
Based on that style guide:
df_clean <- df %>% mutate_if(is.character, as.numeric)
Working example
> df_clean <- df %>% mutate_if(is.character, as.numeric)
Warning messages:
1: NAs introduced by coercion
2: NAs introduced by coercion
3: NAs introduced by coercion
4: NAs introduced by coercion
5: NAs introduced by coercion
6: NAs introduced by coercion
7: NAs introduced by coercion
8: NAs introduced by coercion
9: NAs introduced by coercion
10: NAs introduced by coercion
> df_clean
# A tibble: 3,599 x 17
stack datetime volume BQT90 DBT90 DRT90 DLT90 FBT90 RT90 HTML90 RFT90 RLPP90 RAT90 SRVR90 SSL90 TCP90 group
<dbl> <dttm> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
I think I figured it out. Here's what I did (perhaps not the most elegant solution; suggestions on how to improve this are very much welcome):
#names of columns in data frame
cols <- names(DF)
# character variables
cols.char <- c("fx_code","date")
#numeric variables
cols.num <- cols[!cols %in% cols.char]
DF.char <- DF[cols.char]
DF.num <- as.data.frame(lapply(DF[cols.num],as.numeric))
DF2 <- cbind(DF.char, DF.num)
I realize this is an old thread but wanted to post a solution similar to your request for a function (just ran into the similar issue myself trying to format an entire table to percentage labels).
Assume you have a df with 5 character columns you want to convert. First, I create a table containing the names of the columns I want to manipulate:
col_to_convert <- data.frame(row = 1:5
,col = c("col1","col2","col3","col4","col5"))
for (i in 1:max(col_to_convert$row))
{
colname <- col_to_convert$col[i]
colnum <- which(colnames(df) == colname)
for (j in 1:nrow(df))
{
df[j,colnum] <- as.numeric(df[j,colnum])
}
}
This is not ideal for large tables as it goes cell by cell, but it would get the job done.
like this?
DF <- data.frame("a" = as.character(0:5),
"b" = paste(0:5, ".1", sep = ""),
"c" = paste(10:15),
stringsAsFactors = FALSE)
DF <- apply(DF, 2, as.numeric)
If there are "real" characters in the data frame like 'a', 'b', 'c', I would recommend the answer from davsjob.
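One thing to keep in mind with apply() is that it returns a matrix rather than a data frame. The sketch below, using a two-column version of the example data, contrasts it with an lapply() assignment that keeps the data frame structure.

```r
DF <- data.frame(a = as.character(0:5),
                 b = paste(0:5, ".1", sep = ""),
                 stringsAsFactors = FALSE)

# apply() coerces the data frame to a matrix, so the result is a numeric matrix.
m <- apply(DF, 2, as.numeric)
is.matrix(m)

# Assigning into DF[] converts column by column and keeps the data frame.
DF[] <- lapply(DF, as.numeric)
is.data.frame(DF)
```

If later code expects DF$a or other data frame behaviour, the lapply() form avoids a separate as.data.frame() call.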
Use data.table set function
library(data.table)
setDT(DF)
for (j in YourColumns)
set(DF, j=j, value = as.numeric(DF[[j]]))
If you need to keep as data.frame then just use setDF(DF)
Try this to change numeric column to character:
df[,1:11] <- sapply(df[,1:11],as.character)
DF[,6:11] <- sapply(DF[,6:11], as.numeric)
or
DF[,6:11] <- sapply(DF[,6:11], as.character)
for (i in names(DF)){
DF[[i]] <- as.numeric(DF[[i]])
}
I solved this using double brackets [[]]
Since we can index a data frame column by its name, a simple change can be made:
for (i in names(DF)){ DF[i] <- as.data.frame(as.numeric(as.matrix(DF[i]))) }
A<- read.csv("Environment_Temperature_change_E_All_Data_NOFLAG.csv",header = F)
Now, convert to character
A<- type.convert(A,as.is=T)
Convert some columns to numeric from character
A[,c(1,3,5,c(8:66))]<- as.numeric(as.character(unlist(A[,c(1,3,5,c(8:66))])))

Keeping the name of the vector element in the output

My foo function below works great, however it omits the name associated with its vector output.
For the example below, I expect foo to return:
scale
[1] 2
But it simply returns:
[1] 2
Is there a fix for this?
library(tidyverse)
library(rlang)
txt = "
study scale
1 1
1 1
2 2
3 2
3 2
3 2
4 1
"
h <- read.table(text = txt,h=T)
foo <- function(data, ...){
dot_cols <- rlang::ensyms(...)
str_cols <- purrr::map_chr(dot_cols, rlang::as_string)
min(sapply(str_cols, function(i) length(unique(data[[i]]))))
}
foo(h, study, scale)
We may use which.min to get the index and then use it to subset the element. Also, loop over a named vector
foo <- function(data, ...){
dot_cols <- rlang::ensyms(...)
str_cols <- purrr::map_chr(dot_cols, rlang::as_string)
v1 <- sapply(setNames(str_cols, str_cols),
function(i) length(unique(data[[i]])))
v1[which.min(v1)]
}
-testing
> foo(h, study, scale)
scale
2
You can skip the rlang stuff by using summarise and passing ... to across
library(dplyr)
foo <- function(data, ...){
data %>%
summarise(across(c(...), n_distinct)) %>%
unlist() %>%
.[which.min(.)]
}
foo(h, study, scale)
#> scale
#> 2
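For comparison, here is a base R sketch of the same idea that takes column names as a character vector instead of bare symbols. The name foo_base and the cols argument are assumptions made for this illustration. Because vapply() over a data frame carries the column names through, the name survives without any extra work.

```r
h <- data.frame(study = c(1, 1, 2, 3, 3, 3, 4),
                scale = c(1, 1, 2, 2, 2, 2, 1))

foo_base <- function(data, cols) {
  # Named integer vector: number of distinct values per requested column.
  counts <- vapply(data[cols], function(x) length(unique(x)), integer(1))
  # Subsetting by index (rather than calling min()) keeps the element's name.
  counts[which.min(counts)]
}

res <- foo_base(h, c("study", "scale"))
res
```

The key point is the same as in the answers above: min() drops names, whereas indexing with which.min() keeps them.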

Extract values from list of arbitrary depth

I have a messy, highly nested, list:
m <- list('form' = list('elements' = list('name' = 'Bob', 'code' = 12), 'name' = 'Mary', 'code' = 15))
> m
$form
$form$elements
$form$elements$name
[1] "Bob"
$form$elements$code
[1] 12
$form$name
[1] "Mary"
$form$code
[1] 15
How can I extract from the object m the name and code, regardless as to how nested name and code appears within a list?
Expected output:
# A tibble: 2 x 2
name code
<chr> <dbl>
1 Bob 12
2 Mary 15
1) rrapply. Flatten m using rrapply giving r and then separate the name and code fields of unlist(r) using tapply, remove the dimensions using c, convert to data.frame and set the order of the columns.
Note that this is not hard coded to name and code and would work with other fields and numbers of fields.
library(rrapply)
r <- rrapply(m, f = c, how = "flatten")
nms <- names(r)
as.data.frame(c(tapply(unname(r), nms, unlist)))[unique(nms)]
giving:
name code
1 Bob 12
2 Mary 15
An alternative to the final two lines of code above would be:
out <- unstack(stack(r))
out[] <- lapply(out, type.convert)
If there could be other fields in m in addition to name and code that we want ignored then use this in place of the statement that defines r above:
cond <- function(x, .xname) .xname %in% c("name", "code")
r <- rrapply(m, cond, c, how = "flatten")
2) Base R. A base R solution is the following, which unlists m and then uses tapply as in (1), grouping by the suffixes of names(r). Like (1), this is a general approach that is not hard coded to name and code. Note that tools comes with R so it is part of Base R.
r <- unlist(m)
nms <- tools::file_ext(names(r))
as.data.frame(c(tapply(unname(r), nms, unlist)))[unique(nms)]
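To make the intermediate values of the base R approach easier to follow, here is a step-by-step sketch. It uses sub() to take the last component of each flattened name and split() to group the values; both are choices made for this illustration rather than part of the answer above.

```r
m <- list(form = list(elements = list(name = "Bob", code = 12),
                      name = "Mary", code = 15))

# unlist() flattens every level and joins the names with dots, e.g.
# "form.elements.name"; mixed types are coerced to character.
flat <- unlist(m)

# Keep only the last name component ("name" or "code").
keys <- sub(".*\\.", "", names(flat))

# Group the values by key, build a data frame, and restore the column order.
out <- as.data.frame(split(unname(flat), keys), stringsAsFactors = FALSE)
out <- out[c("name", "code")]

# Turn the numeric-looking strings back into numbers.
out[] <- lapply(out, type.convert, as.is = TRUE)
out
```

This relies on each record contributing exactly one name and one code, in matching order; a list where some records lack a field would need extra handling.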
This could help with formatting the list into a data frame and then reshaping it:
library(tidyverse)
#Process
y1 <- as.data.frame(lapply(m,unlist),stringsAsFactors = F)
y1$id <- rownames(y1)
rownames(y1)<-NULL
#Dplyr mutation
y1 %>% mutate(Var=ifelse(grepl('name',id),'name',
ifelse(grepl('code',id),'code',NA))) %>%
select(-id) %>% group_by(Var) %>%
mutate(i=1:n())%>% pivot_wider(names_from = Var,values_from = form) %>%
select(-i) %>% mutate(code=as.numeric(code))
Output:
# A tibble: 2 x 2
name code
<chr> <dbl>
1 Bob 12
2 Mary 15

create a tibble where one column is a function

Let's say I have a tibble which looks like this:
library(tidyverse)
tib <- tibble (a = 1:3, b = 4:6, d = -1:1)
I want to add a column to this tibble where each entry is a function with parameters a,b and d (like f(x) = ax^2 + bx +d).
This would mean that, for example, in the first row I would like to add the function f(x) = 1 x^2 + 4 x - 1, and so on.
I tried the following:
tib2 <- tib %>%
mutate(fun = list(function(x) {a*x^2+b*x+d}))
This does not work since the functions do not know what a, b and d are.
I managed to build a work-around solution using the function mapply
lf <- mapply(function(a,b,d){return(function(x){a*x^2 + b*x + d})}, tib$a, tib$b, tib$d)
tib3 <- tib %>%
add_column(lf)
I was wondering if anyone knows a more elegant way of doing this within the tidyverse. It feels like there is a way using the map function from the purrr package, but I did not manage to get it working.
Thank you
When you used mutate in your example, you were giving it a list with one element (function). So this one function was recycled for all the other rows. Also, inside the definition of the function, it doesn't have any visibility of a, b or d.
You can instead use pmap so that each row has its own function.
tib2 <- tib %>%
mutate(
fun = pmap(
list(a, b, d),
~function(x) ..1 * x^2 + ..2 * x + ..3))
tib2
#> # A tibble: 3 x 4
#> a b d fun
#> <int> <int> <int> <list>
#> 1 1 4 -1 <fun>
#> 2 2 5 0 <fun>
#> 3 3 6 1 <fun>
tib2$fun[[1]](1)
#> [1] 4
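The reason each row needs its own function is that a closure captures the values visible where it was created. A base R sketch of the same idea uses a function factory; make_quadratic is a hypothetical helper name, and a plain data.frame stands in for the tibble.

```r
# Factory: returns a new function that remembers its own a, b and d.
make_quadratic <- function(a, b, d) {
  force(a); force(b); force(d)  # evaluate now, so each closure keeps its own values
  function(x) a * x^2 + b * x + d
}

tib <- data.frame(a = 1:3, b = 4:6, d = -1:1)

# Map() calls the factory once per row and returns a list of three functions.
funs <- Map(make_quadratic, tib$a, tib$b, tib$d)

# Store the functions as a list column.
tib$fun <- funs

tib$fun[[1]](1)  # 1*1 + 4*1 - 1
tib$fun[[3]](2)  # 3*4 + 6*2 + 1
```

This is essentially what the mapply workaround in the question does; pmap performs the same per-row construction inside mutate.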

