I have a table like this:
> head(X)
column1 column2
sequence1 ATCGATCGATCG
sequence2 GCCATGCCATTG
I need an output in a fasta file, looking like this:
sequence1
ATCGATCGATCG
sequence2
GCCATGCCATTG
So, basically, I need every entry of the 2nd column to become a new row, interleaved with the entries of the 1st column. The old 2nd column can then be discarded.
The way I would normally do this is by replacing the whitespace (or tab) with \n in Notepad++, but I fear my files will be too big for that.
Is there a way to do this in R?
I had the same question but found a really easy way to convert a data frame to a fasta file using the package "seqRFLP".
Do the following:
Install and load seqRFLP
install.packages("seqRFLP")
library("seqRFLP")
Your sequences need to be in a data frame with the sequence headers in column 1 and the sequences in column 2 (it doesn't matter whether they are nucleotide or amino acid sequences).
Here is a sample data frame:
names <- c("seq1","seq2","seq3","seq4")
sequences <- c("EPTFYQNPQFSVTLDKR", "SLLEDPCYIGLR", "YEVLESVQNYDTGVAK", "VLGALDLGDNYR")
df <- data.frame(names,sequences)
Then convert the data frame to .fasta format using the function dataframe2fas:
df.fasta = dataframe2fas(df, file="df.fasta")
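To spot-check the result, a quick sketch (readLines is base R; the exact formatting depends on the package, but the file should contain ">" headers followed by the sequences):
readLines("df.fasta")
# ">seq1" "EPTFYQNPQFSVTLDKR" ">seq2" "SLLEDPCYIGLR" ...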
A base R alternative is to interleave the two columns yourself, one row at a time, and then write the result out:
D <- do.call(rbind, lapply(seq(nrow(X)), function(i) t(X[i, ])))
D
# 1
# column1 "sequence1"
# column2 "ATCGATCGATCG"
# column1 "sequence2"
# column2 "GCCATGCCATTG"
Then, to write the result out (the call below prints to the console because no file is given), you could use
write.table(D, row.names = FALSE, col.names = FALSE, quote = FALSE)
# sequence1
# ATCGATCGATCG
# sequence2
# GCCATGCCATTG
so that the row names, column names, and quotes will be gone.
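To actually write the result to disk rather than the console, a sketch (the file name X.fasta is just an example):
write.table(D, file = "X.fasta", row.names = FALSE, col.names = FALSE, quote = FALSE)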
When I do this, I tend to use something like:
Xfasta <- character(nrow(X) * 2)
Xfasta[c(TRUE, FALSE)] <- paste0(">", X$column1)
Xfasta[c(FALSE, TRUE)] <- X$column2
This creates an empty character vector with length twice the number of rows in your table; it then puts the values from column1 in every second position starting at position 1, and the values from column2 in every second position starting at position 2.
Then write it out using writeLines:
writeLines(Xfasta, "filename.fasta")
In this answer, I added a ">" to the headers since this is standard for fasta format and is required by some tools that take fasta input. If you don't care about adding the ">", then:
Xfasta <- character(nrow(X) * 2)
Xfasta[c(TRUE, FALSE)] <- X$column1
Xfasta[c(FALSE, TRUE)] <- X$column2
If you didn't read your file in with options that stop characters being read as factors (e.g. stringsAsFactors = FALSE), then you might need to wrap the columns in as.character(), e.g. as.character(X$column1), instead.
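For completeness, a minimal sketch of reading such a table into X so that the columns stay character (the file name table.txt and the tab separator are assumptions):
X <- read.table("table.txt", header = TRUE, sep = "\t", stringsAsFactors = FALSE)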
There are also a few tools available for this conversion; I think the Galaxy browser has an option for it.
This is regarding an issue that I'm facing while exporting data to a .csv file. I'm working with a large dataframe in R. It contains a few cells with more than 32767 characters, which is apparently the maximum that a cell can accommodate. When I export the data to a .csv file, the content of these cells spills over into the next row. The dataframe, however, looks completely fine once the .csv file is imported back into RStudio. Is there a way to limit the number of characters in each cell to 32767 while exporting the data?
Split long strings into n columns with x characters each, for example:
# example data
d <- data.frame(x = c("longstring", "anotherlongstring"))
d
# x
# 1 longstring
# 2 anotherlongstring
x = 6  # width of each chunk
n = 3  # maximum number of chunks per string
res <- read.fwf(file = textConnection(as.character(d$x)), widths = rep(x, n))
res
# V1 V2 V3
# 1 longst ring <NA>
# 2 anothe rlongs tring
I found the solution to my question in the answers posted here.
Let's say df is the dataframe that I want to export to a .csv file. The following code does the job:
df <- as.data.frame(substr(as.matrix(df), 1, 32767))
write.csv(df, <file>, row.names = FALSE)
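As a quick sanity check after truncating, a sketch (nchar() is applied element-wise to the character matrix):
max(nchar(as.matrix(df)), na.rm = TRUE)
# should now be at most 32767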
I have a list of csv files in a specific path and I am hoping to read all of them into one dataframe. This is the code I use:
d <- list.files(path, pattern = "*.csv", full.names = TRUE) %>%
  map_dfr(read_csv)
The trouble is that some of these columns (for example the column array_values) are strings that end up being converted into numbers. I tried all sorts of ways to convert the variables, but I can't get it to work unless I use much more complicated code in which I load the files one by one, convert them, and then append them to the larger dataframe. I would love to learn if there is a simple way to add this to the code here.
Thanks!
Camille's point is valid: complete code should at least include the packages you used along with the code snippet you provided.
Having said that, if your CSVs all have the same columns in the same order (and the columns are meant to have the same types), and your problem is that a column is parsed as character in some files and as numeric in others (or something similar), you can add the col_types argument to read_csv() to make sure each column is read the same way in every file. It's difficult to tell from your question what exactly is wrong.
library(tidyverse)
> list.files(pattern="t.*csv")
[1] "test1.csv" "test2.csv"
> d <- list.files(pattern="t.*csv") %>% map_dfr(read_csv, col_types="dc")
> d
# A tibble: 6 × 2
col1 col2
<dbl> <chr>
1 3 a
2 4 b
3 4.5 a
4 13 a
5 4 goat
6 4.5 a
You can find the column types in the "Column Specification with readr" section of the readr documentation.
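Applied to the question's code, a sketch using the long-form column specification (array_values is the column named in the question; the remaining columns are left to be guessed):
library(tidyverse)
d <- list.files(path, pattern = "*.csv", full.names = TRUE) %>%
  map_dfr(read_csv, col_types = cols(array_values = col_character()))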
Assuming that you have multiple datasets and are having trouble with some columns that are not consistent across all the datasets, you can also achieve this as below:
library(dplyr)
library(readr)  # read_csv()
path <- 'your path'
filename <- list.files(path, pattern = "*.csv", recursive = TRUE)
filename <- stringr::str_subset(filename,pattern="test")
# Custom function to read each file
read_csv_files <- function(x){
  df <- read_csv(file = paste(path, x, sep = "/"))
  df$yourcol <- as.character(df$yourcol) # change the datatype of the respective columns as per your needs
  fName <- x                             # keep track of which file each row came from
  df <- cbind(fName, df)
  return(df)
}
bind_data <- lapply(filename, read_csv_files) %>%
  bind_rows()
Hope this helps. Happy to help further if required.
I have a massive csv file but I only use a small subset of its columns in my analysis. To save time and memory, I would like to load just the necessary columns. I tried using the colClasses argument of read.csv as suggested here, but I couldn't make it work.
Let me describe the issue with a MWE. Suppose my data (the csv file) is created by the following:
df <- data.frame(a = c('3', '4'), b = c(5, 6))
write.csv(x = df, file = 'df.csv', row.names = F)
In the csv, column a is saved as text, while b is saved as numeric. I would like to load only column a for my analysis. My idea is to just get the column types to form a colClasses vector. To do this, I load just the first row of the data (which is fast; in practice I have 1M+ rows), retrieve the column types, and create a vector to be passed to colClasses:
df <- read.csv(file = 'df.csv', nrows = 1) # read just first row
cols <- colnames(df) # column names
coltypes <- sapply(df, class) # column types
wanted_cols <- c('a') # column names needed for analysis
cc <- rep('NULL', length(cols)) # initialize colClasses vector
cc[cols %in% wanted_cols] <- coltypes[cols %in% wanted_cols] # put the needed types into cc
data <- read.csv(file = 'df.csv', colClasses = cc) # load all rows but just needed columns
However, when R loads the data through read.csv (the first line above), it sees only integers in column a and automatically assigns it an integer type. When I feed this type back into the colClasses argument, the full load fails because a is stored as a quoted string in the csv. I get:
Error in scan(file = file, what = what, sep = sep, quote = quote, dec =
dec, :
scan() expected 'a real', got '"3"'
Another problem that arises from loading just the first row to get the column types is that I may not be giving R enough information. If some column's first element is 1, it looks like R treats it as a logical type, while it could in fact be a lot of other types.
Is there a way to make this work? Or is there a different technique that would enable me to load certain columns based on their names?
Found another solution: use fread(file, select = column_names) from data.table. You can pass column indices or names to the select argument to load only those columns.
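Applied to the MWE above, a sketch (the file and column names come from the question; adding colClasses keeps a as character rather than letting fread guess):
library(data.table)
data <- fread("df.csv", select = "a", colClasses = c(a = "character"))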
Here is the situation: I have many csv files, say 20 of them. Each of them has different column names, so I built a map for them.
map
# variable location
# A 1
# B 1
# C 2
I was trying to read them all at once, so I have code like this:
Table <- rbindlist(
  apply(map, 1, function(x) {
    fil <- paste0(x[2], ".csv")
    sel <- x[1]
    fread(file = fil, select = sel)
  })
)
When it is done, I get a data.table with all the data in a single column. If I use rbind instead, I get a large matrix of the wanted elements, but it can't be converted into the data.table form I need. How can I make this work? Please advise, thanks.
The issue is with the columns that are of factor class in the dataset 'map': when we use apply, the data frame is converted to a matrix and the factor columns get coerced to integer values, which causes the mismatch. One option is to convert them to character class. This can be done more compactly with Map:
rbindlist(Map(fread, file = paste0(map$location, ".csv"),
select = as.character(map$variable)))
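If you would rather keep the question's apply approach, here is a sketch of the "convert to character class" option mentioned above (it assumes map has the variable and location columns shown in the question):
map[] <- lapply(map, as.character)  # drop the factor class from every column
Table <- rbindlist(
  apply(map, 1, function(x) {
    fread(file = paste0(x[2], ".csv"), select = x[1])
  })
)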
I work with surveys and would like to export a large number of tables (drawn from data frames) into an .xlsx or .csv file. I use the xlsx package to do this. This package requires me to stipulate which column in the Excel file is the first column of the table. Because I want to paste multiple tables into the .csv file, I need to be able to stipulate that the first column for table n is the width of table (n-1) plus x spaces. To do this I planned on creating values like the following.
Each dt# is made by converting a table into a data frame:
table1 <- table(df$y, df$x)
dt1 <- as.data.frame.matrix(table1)
Here I create the values for the starting column numbers:
startcol1 = 1
startcol2 = NCOL(dt1) + 3
startcol3 = NCOL(dt2) + startcol2 + 3
startcol4 = NCOL(dt3) + startcol3 + 3
And so on. I will probably need to produce somewhere between 50-100 tables. Is there a way in R to make this an iterative process so I can create the 50 values of starting columns without having to write 50+ lines of code with each one building on the previous?
I found material on Stack Overflow and other blogs about writing for loops or using apply-type functions in R, but it all seemed to deal with manipulating a vector as opposed to adding values to the workspace. Thanks
You can use a structure similar to this:
Your list of files to read:
file_list = list.files("~/test/",pattern="*csv",full.names=TRUE)
For each file, read and process the data frame, and capture how many columns there are in the frame you are reading/processing:
columnsInEachFile = sapply(file_list,
function(x)
{
df = read.csv(x, ...) # with your appropriate arguments
# do any necessary processing you require per file
return(ncol(df))
}
)
The cumulative sum of the number of columns, plus 1, gives the start columns of the second and subsequent data frames when your processed data are placed next to each other (the first data frame starts at column 1):
columnsToStartDataFrames = cumsum(columnsInEachFile)+1
columnsToStartDataFrames = columnsToStartDataFrames[-length(columnsToStartDataFrames)] # last value is not the start of a data frame but the end
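If you also want the gap of 3 empty columns between tables that the question uses, a sketch following the recurrence start(k+1) = start(k) + ncol(k) + 3:
columnsToStartDataFrames = cumsum(c(1, head(columnsInEachFile, -1) + 3))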
Assuming tab.lst is a list containing the tables, you can do:
cumsum(c(1, sapply(head(tab.lst, -1), ncol)))
Basically, what I'm doing here is looping through all the tables except the last one (since its start column is determined by the tables before it) and getting each table's width with ncol. Then I take the cumulative sum of that vector, starting from 1, to get all the start positions.
And here is how I created the tables (one table for each 2-column combination in df):
df <- replicate(5, sample(1:10), simplify = FALSE) # a list of 5 columns, used like a data frame below
names(df) <- tail(letters, 5) # name the cols
name.combs <- combn(names(df), 2) # get all 2 col combinations
tab.lst <- lapply(                                # make tables for each 2-col combination
  split(name.combs, col(name.combs)),             # loop through every column in name.combs
  function(x) table(df[[x[[1]]]], df[[x[[2]]]])   # ... and make a table
)