I have a list of 78 data frames (list_of_df) that all have the same first column containing annotated Ensembl transcript IDs. However, the IDs carry a version suffix such as ".1" (e.g. "ENST00000448914.1"), and I would like to remove it so I can match them against plain ENST IDs.
I have tried to use lapply with a sapply inside like this:
lapply(list_of_df, function(x)
cbind(x,sapply(x$target_id, function(y) unlist(strsplit(y,split=".",fixed=T))[1])) )
but it takes forever. Does anyone have a better idea of how to do it?
We can loop through the list of data.frames and use sub to remove the "." followed by digits in the first column.
lapply(list_of_df, function(x) {
x[,1] <-sub('\\.\\d+', '', x[,1])
x })
#[[1]]
# target_id value
#1 ENST000049 39
#2 ENST010393 42
#[[2]]
# target_id value
#1 ENST123434 423
#2 ENST00838 23
NOTE: Even if the OP's first column is factor, this should work.
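For example, a quick check with a throwaway factor (just to illustrate; sub coerces a factor to character before substituting):
f <- factor(c("ENST000049.1", "ENST010393.14"))
sub('\\.\\d+', '', f)
# [1] "ENST000049" "ENST010393"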
data
list_of_df <- list(data.frame(target_id= c("ENST000049.1",
"ENST010393.14"), value= c(39, 42), stringsAsFactors=FALSE),
data.frame(target_id=c("ENST123434.42", "ENST00838.22"),
value= c(423, 23), stringsAsFactors=FALSE))
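Once the suffix is stripped, matching against plain ENST IDs is straightforward, e.g. (a small sketch using the example data above; pure_ids is a hypothetical reference vector):
cleaned <- lapply(list_of_df, function(x) {
  x[,1] <- sub('\\.\\d+', '', x[,1])
  x })
pure_ids <- c("ENST000049", "ENST123434")  # hypothetical plain ENST IDs to match against
lapply(cleaned, function(x) x[x$target_id %in% pure_ids, ])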
You could simplify your code by splitting the whole column at once and keeping the first token of each split:
lapply(list_of_df, function(x) {
  x[,1] <- sapply(strsplit(x[,1], split=".", fixed=TRUE), `[`, 1)
  x })
If your columns have factor as class, you can wrap x[,1] in as.character:
lapply(list_of_df, function(x) {
  x[,1] <- sapply(strsplit(as.character(x[,1]), split=".", fixed=TRUE), `[`, 1)
  x })
You could also make use of the stringi package:
library(stringi)
lapply(list_of_df, function(x) {
  x[,1] <- unlist(stri_split_fixed(x[,1], ".", n=1, tokens_only=TRUE))
  x })
Related
I'm trying to find an elegant function to order data.frames held in a list object. I already know that lapply(df, function(x) x[with(x, order(var)), ]) works fine, but that seems way too complicated. I'm trying to use the "[" function, which works fine if I input the row numbers manually. But I'd like to use the row numbers generated by the order function, obviously.
df <- list(
data.frame(name = c("John", "Paul", "George", "Ringo"), height = c(60, 58, 65, 55)),
data.frame(name = c("Frank", "Tony", "Arthur", "Edward"), height = c(55, 65, 60, 50))
)
lapply(df, "[", c("height", "name"))
lapply(df, "[", c(3:1), )
order <- lapply(df, with, order(name))
order
lapply(df, with, order(name), "[")
lapply(df, with, "[", order(name), )
lapply(df, "[", with, order(name), )
Map("[", order , , df)
If you use library(data.table) rather than data.frame then the solution is quite elegant.
We can convert your object to data.table using lapply(df, setDT). Note this step won't be required if you create your objects as data.table in the first place, which would be a more typical workflow.
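For reference, that conversion step in code:
library(data.table)
lapply(df, setDT)   # setDT() converts each element to a data.table by reference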
Then the ordering can be done using
lapply(df, setkey, 'name')
# [[1]]
# name height
# 1: George 65
# 2: John 60
# 3: Paul 58
# 4: Ringo 55
#
# [[2]]
# name height
# 1: Arthur 60
# 2: Edward 50
# 3: Frank 55
# 4: Tony 65
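Note that setkey() sorts each data.table by reference, so the list elements themselves are reordered; the printed output above is simply the return value of the lapply() call.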
Other packages would simplify this. If you want to use base, you can extend what you were attempting with Map():
order <- lapply(df, with, order(name))
Map("[", df, order, TRUE)
#pseudo translation: df[order, TRUE]
## where TRUE will repeat and select all columns of df
The simplest way with base R is to go ahead with the approach you dismissed as too complicated. You could tidy it up with a premade function:
lapply(df, function(x) x[order(x[["name"]]), ])
#or
fx_reorder = function (x, col){
x[order(x[[col]]), ]
}
lapply(df, fx_reorder, "name")
## or to accept multiple columns
fx_reorder2 = function(x, cols) {
if (missing(cols)) cols = names(x)
x[do.call("order", x[cols]), ]
}
lapply(df, fx_reorder2)
lapply(df, fx_reorder2, "name")
lapply(df, fx_reorder2, 1:2)
We can use arrange with map
library(dplyr)
library(purrr)
map(df, ~ .x %>%
arrange(name))
Are you familiar with the dplyr package? You can put the arrange function inside lapply.
lapply(df, arrange, -height)
If you're not familiar with dplyr I would look into it. It sounds like your problem might be solvable with bind_rows and group_by.
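For instance, a rough sketch of that idea (assuming you ultimately want one combined, ordered table rather than a list):
library(dplyr)
bind_rows(df, .id = "source") %>%   # stack the list into one data frame, tagging each row's origin
  group_by(source) %>%
  arrange(name, .by_group = TRUE)   # order by name within each original data frame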
In R a vector cannot contain different types: everything must be, e.g., integer, or everything must be character, etc. This sometimes gives me headaches, e.g. when I want to add a margin row to a data.frame and need some columns to stay numeric and others to stay character.
Below a reproducible example:
# dummy data.frame
set.seed(42)
test <- data.frame("name"=sample(letters[1:4], 10, replace=TRUE),
"val1" = runif(10,2,5),
"val2"=rnorm(10,10,5),
"Status"=sample(c("In progres", "Done"), 10, replace=TRUE),
stringsAsFactors = FALSE)
# check that e.g. "val1" is indeed numeric
is.numeric(test$val1)
# TRUE
# create column sums for my margin.
tmpSums <- colSums(test[,c(2:3)])
# Are the sums numeric?
is.numeric(tmpSums[1])
#TRUE
# So add the margin
test2 <- rbind(test, c("All", tmpSums, "Mixed"))
# is it numeric
is.numeric(test2$val1)
#FALSE
# DAMN. Because the vector `c("All", tmpSums, "Mixed")` contains strings,
# the whole vector is coerced to character. And when doing the rbind
# the original data.frame columns are coerced to a new type as well.
# My current workaround is to convert back to numeric,
# but this seems convoluted, back and forth.
valColoumns <- grepl("val", names(test2))
test2[,valColoumns] <- apply(test2[,valColoumns],2, function(x) as.numeric(x))
is.numeric(test2$val1)
# finally. It works.
There must be an easier/better way?
Use a list object in your rbind, like:
test2 <- rbind(test, c("All", unname(as.list(tmpSums)), "Mixed"))
Here the second argument to rbind is a list, stripped of its names so they don't conflict with the data.frame's column names (mismatched names would make rbind fail):
c("All", unname(as.list(tmpSums)), "Mixed")
#[[1]]
#[1] "All"
#
#[[2]]
#[1] 37.70092
#
#[[3]]
#[1] 91.82716
#
#[[4]]
#[1] "Mixed"
Here is an option using data.table. We convert the 'data.frame' to 'data.table' (setDT(test)), get the sum of the numeric columns using lapply, concatenate (c) the result with the filler values for the other columns, place both tables in a list, and use rbindlist:
library(data.table)
rAll <- setDT(test)[, c(name="All", lapply(.SD, sum),
Status="Mixed"), .SDcols= val1:val2]
rbindlist(list(test, rAll))
If we need to make it a bit more automatic,
i1 <- sapply(test, is.numeric)                  # which columns are numeric
v1 <- setNames(list("All", "Mixed"),
      setdiff(names(test), names(test)[i1]))    # filler values, named after the non-numeric columns
rAll <- setDT(test)[, c(v1, lapply(.SD, sum)),
      .SDcols=i1][, names(test), with=FALSE]    # sums plus fillers, columns reordered to match 'test'
rbindlist(list(test, rAll))
I have a vector of strings like:
vector=c("a","hb","cd")
and I also have a matrix with a column in which each element is a set of strings separated by the "|" separator, like:
1 "ab|hb"
2 "ab|hbc|cd"
For each string in the vector, I want to find which row of the matrix contains it as a complete token.
For the above vector, the result is:
NA, 1, 2
You can use strsplit for splitting strings:
x <- strsplit("ab|hbc|cd", split="|", fixed=T)
and then check if values of vector appear in the data, e.g.
sapply(vector, function(x) x %in% strsplit("a|ab|cd|efg|bh",
split="|", fixed=T)[[1]])
Warning: strsplit outputs data as a list, so in the example above I extract only the first element of the list with [[1]]; however, you can deal with it in other ways if you choose.
EDIT: answering to your question on data as a vector:
data <- c("ab|cd|ef", "aaa|b", "ab", "wf", "fg|hb|a", "cd|cd|df")
sapply(sapply(data, function(x) strsplit(x, split="|", fixed=T)[[1]]),
function(y) sapply(vector, function(z) z %in% y))
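If you need the row indices the question asks for (NA, 1, 2) rather than a logical matrix, one possible way to build on strsplit (a sketch, treating the matrix column as a plain character vector):
vector <- c("a", "hb", "cd")
mat_col <- c("ab|hb", "ab|hbc|cd")
tokens <- strsplit(mat_col, split="|", fixed=TRUE)   # one vector of tokens per row
sapply(vector, function(v) {
  hits <- which(sapply(tokens, function(tok) v %in% tok))
  if (length(hits)) hits[1] else NA                  # first matching row, or NA
})
#  a hb cd
# NA  1  2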
Here's an approach using regular expressions:
# Example data
vector <- c("a","hb","cd")
mat <- matrix(c("ab|hb", "ab|hbc|cd"), nrow = 2)
sapply(paste0("\\b", vector, "\\b"), function(x)
if(length(tmp <- grep(x, mat[ , 1]))) tmp else NA,
USE.NAMES = FALSE)
# [1] NA 1 2
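The \\b word boundaries matter here: without them, grep("a", mat[, 1]) would also match the "a" inside "ab", so the boundaries ensure only complete tokens count.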
I have a list of 50 data.frame objects and each data.frame object contains 20 rows. I need to exclude a row or a vector at each iteration from each of the data.frame object.
The single iteration may look something like this:
to_exclude <- 0 # 0 will be replaced by the induction variable
training_temp <- lapply(training_data, function(x) {
# Exclude the vector numbered to_exclude
})
Regards
df <- data.frame(x=1:10,y=1:10)
thelist <- list(df,df,df,df)
lapply(thelist, function(x) x[-c(1) ,] )
This will always remove the first row. Is this what you want, or do you want to remove rows based on a value?
This will always exclude the first column:
lapply(thelist, function(x) x[, -c(1) ] )
# because there are only two columns in this example you would probably
# want to add drop = FALSE, e.g.
# lapply(thelist, function(x) x[, -c(1), drop=FALSE ] )
so from your loop value:
remove_this_one <- 10
lapply(thelist, function(x) x[ -c(remove_this_one) ,] )
# remove row 10
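Tying it back to the question's loop, the induction variable just takes the place of remove_this_one (a sketch; I'm reusing thelist from above in place of your training_data):
for (to_exclude in seq_len(nrow(thelist[[1]]))) {
  training_temp <- lapply(thelist, function(x) x[-to_exclude, , drop = FALSE])
  # ... fit / evaluate on training_temp here ...
}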
I am hoping to change one of the column names (the 14th column) in each of many files, but I cannot figure out how to go about it. I have tried multiple kinds of apply, but that approach isn't working and I don't know where to start looking for another approach. Here is my code so far:
File.names<-(tk_choose.files(default="", caption="Files", multi=TRUE, filters=NULL, index=1))
Num.Files<-NROW(File.names)
test<-sapply(1:Num.Files,function(x){readLines(File.names[x])})
lapply(1:Num.Files, function(x){data<-read.table(header=TRUE, text=test)})
#This is the issue
names(data)[14]<-'column14'
names(data)
As I mentioned I tried varying types of apply but to no avail. Is there a different way of going about this? Any suggestions would be welcome.
You have to call names inside another lapply. E.g.:
l <- list(x=c(a=1, b=1), y=c(a=1, b=1))
l2 <- lapply(l, function(x) {
names(x)[2] <- "d"
return(x)
})
l2
#$x
#a d
#1 1
#
#$y
#a d
#1 1
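Applied to the question's workflow it might look roughly like this (a sketch; it assumes File.names from the question and that each file can be read with read.table):
data_list <- lapply(File.names, function(f) read.table(f, header = TRUE))
data_list <- lapply(data_list, function(d) {
  names(d)[14] <- "column14"   # rename the 14th column
  d
})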
Split the names out first, then alter, then assign as a group. Like,
new.names <- names( data )
new.names[[14]] <- "column14"
names( data ) <- new.names