trying to read specific columns from multiple csv with fread and apply - r

Here is the situation: I have many CSV files, say 20 of them. Each of them has different column names, so I built a map for them.
map
# variable location
# A 1
# B 1
# C 2
I was trying to read them all once, so I have a code like this:
library(data.table)
Table <- rbindlist(
  apply(map, 1, function(x) {
    fil <- paste0(x[2], ".csv")
    sel <- x[1]
    fread(file = fil, select = sel)
  })
)
When it is done, I get a data.table with a single column containing all the data stacked together. If I use rbind instead, I get a large matrix of the wanted elements, but it can't be converted into the data.table form I need. How can I make this work? Please advise, thanks.

The issue is with the columns that are of factor class in the dataset 'map'. When we use apply, the data frame is converted to a matrix and the factor columns get coerced in the process, which causes the mismatch. One option is to convert them to character class. This can be done more compactly with Map:
rbindlist(Map(fread, file = paste0(map$location, ".csv"),
              select = as.character(map$variable)))
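If you prefer to keep the apply() call, converting the pieces to character inside the anonymous function also works. A minimal sketch, assuming map has the variable and location columns shown above and the CSV files sit in the working directory:
library(data.table)
Table <- rbindlist(
  apply(map, 1, function(x) {
    # force character so the file name and the selected column name are not mangled
    fread(file = paste0(as.character(x[["location"]]), ".csv"),
          select = as.character(x[["variable"]]))
  })
)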

Related

R- for loop to select two columns in a data frame, with only the second column changing

I'm having issues trying to write a for loop in R. I have a data frame of 16 columns and 94 rows, and I want to loop through it, selecting column 1 plus column 2 into one data frame, then column 1 plus column 3, and so on, so that I end up with 16 data frames containing 2 columns each, all written to individual .csv files.
TwoB <- read.csv("data.csv", header = FALSE)
nX <- ncol(TwoB)
list <- lapply(1:nX, function(x) NULL)
for (i in 1:ncol(TwoB)) {
  list[[i]] <- subset(TwoB,
                      select = c(1, i + 1))
}
Which produces an error:
Error in `[.data.frame`(x, r, vars, drop = drop):
undefined columns selected
I'm not really sure how to code this and clearly haven't quite grasped loops yet so any help would be appreciated!
The error is easily explained: you loop over all 16 columns, so on the last iteration you try to select column 16 + 1, a column index that does not exist.
You could simply loop over nX - 1 instead, as sketched below, but I think what you are trying to achieve can be done more elegantly.
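For completeness, a minimal sketch of that quick fix, keeping the asker's subset() call and writing each pair of columns straight to a CSV (the output file names are just placeholders):
TwoB <- read.csv("data.csv", header = FALSE)
nX <- ncol(TwoB)
# pair column 1 with each of columns 2..nX
two_col_list <- lapply(1:(nX - 1), function(i) subset(TwoB, select = c(1, i + 1)))
for (i in seq_along(two_col_list)) {
  write.csv(two_col_list[[i]], paste0("col1_col", i + 1, ".csv"), row.names = FALSE)
}
A tidier data.table alternative: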
TwoB <- read.csv("data.csv", header = FALSE)
library("data.table")
setDT(TwoB)
nX <- ncol(TwoB)
# example to directly write your files
lapply(2:nX, function(col_index) {
  fwrite(TwoB[, c(1, ..col_index)], file = paste0("col1_col", col_index, ".csv"))
})
# example to store the new data.tables in a list
list_of_two_column_tables <- lapply(2:nX, function(col_index) {
  TwoB[, c(1, ..col_index)]
})

Keeping specific columns in read_excel

I am importing an Excel file into R. I only want to keep columns A and C, not B (the columns are A, B, C in order), but the following code keeps column B too. How can I get rid of column B without subsetting in another line of code?
df <- read_excel("df.xlsm", "futsales", range = cell_cols(c("A","C")), na = " ")
Going through the documentation for the read_excel function, you have to give a range like:
df <- read_excel("df.xlsm", "futsales", range = cell_cols("A:C"), na = " ")
It looks like you can't specify multiple ranges in the range parameter of read_excel. However, you can use the map function from purrr to apply read_excel to a vector of ranges. In your use case, map_dfc will bind the column A and C selections back together into a single output dataset.
library(readxl)
library(purrr)
path <- readxl_example("datasets.xlsx")
ranges <- list("A", "C")
ranges %>%
  purrr::map_dfc(
    ~ read_excel(
      path,
      range = cell_cols(.))
  )
I just did this to successfully read in 5 columns of an Excel file with 27 columns, so here is how you can do it for a file whose name you have stored in x, retrieving only the first and third columns, assuming that column A is text and column C is numeric:
library(tibble)
library(readxl)
df.temp <- as.tibble(read_excel(x,
                                col_names = TRUE,
                                col_types = c("text", "skip", "numeric")))
Another option, along the lines of what #MauritsEvers said, would be:
df <- read_excel("df.xlsm", "futsales")[,c(1,3)]
You are reading in all the data and, at the same time, making df with all the rows (that's what the [ , does) and only the first ("A") and third ("C") columns (that's what the , c(1, 3)] does).

replace '?' in a dataset with 0

I am working in R and have a dataset comprising 700 rows and 10 columns, with some of the values being '?'. I want to replace the '?' values with 0.
I am not sure if the is.na() function would work here, as the values are not NA. If I convert my dataset into a matrix and, after searching for '?', replace it with 0, would that help?
I tried this code:
datafile <- sapply(datafile, function(y){if (y=='?') 0 else y})
After this I saved the file as a text file, but the '?' didn't go away.
You don't even need to convert to a matrix. As Ben Bolker said, your best option is to use na.strings when reading in the file.
If the data frame is not coming from a file, you can directly do:
df[df=="?"] <- 0
You have to remember, though, that any column containing character values might have been converted to a factor. If that's the case, you have to convert those factors back to character. Ben gives you a brute-force option; here's a gentler approach:
# check which variables are factors
isfactor <- sapply(df, is.factor)
# convert them to character
# I use lapply because it returns a list, and I use the
# list-like selection of "elements" (variables) to replace
# the variables
df[isfactor] <- lapply(df[isfactor], as.character)
So if you put everything together, you get:
df <- data.frame(
a = c(1,5,3,'?',4),
b = c(3,'?','?',3,2)
)
isfactor <- sapply(df, is.factor)
df[isfactor] <- lapply(df[isfactor], as.character)
df[df=="?"] <- 0
df
It depends on whether you have other NA values in your data set. If not, almost certainly the easiest way to do this is to use the na.strings= argument to read.(table|csv|csv2|delim), i.e. read your data with something like dd <- read.csv(..., na.strings = c("?", "NA")). Then
dd[is.na(dd)] <- 0
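Put together, a minimal sketch (the file name is a placeholder; adjust the read call to match your data):
dd <- read.csv("mydata.csv", na.strings = c("?", "NA"))  # '?' cells come in as NA
dd[is.na(dd)] <- 0                                       # every NA becomes 0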
If for some reason you don't have control of this part of the process (e.g. someone handed you a .rda file and you don't have the original CSV), then it's a bit more tedious -- you need
which.qmark <- which(x=="?")
x <- suppressWarnings(as.numeric(as.character(x)))
x[which.qmark] <- 0
(This version also works if you have both ? and other NA values in your data)

Convert table into fasta in R

I have a table like this:
> head(X)
    column1      column2
1 sequence1 ATCGATCGATCG
2 sequence2 GCCATGCCATTG
I need an output in a fasta file, looking like this:
sequence1
ATCGATCGATCG
sequence2
GCCATGCCATTG
So, basically I need all entries of the 2nd column to become new rows, interspersing the first column. The old 2nd column can then be discarded.
The way I would normally do that is by replacing whitespace (or a tab) with \n in Notepad++, but I fear my files will be too big for that.
Is there a way to do that in R?
I had the same question but found a really easy way to convert a data frame to a fasta file using the "seqRFLP" package.
Do the following:
Install and load seqRFLP
install.packages("seqRFLP")
library("seqRFLP")
Your sequences need to be in a data frame with sequence headers in column 1 and sequences in column 2 [doesn't matter if it's nucleotide or amino acid]
Here is a sample data frame
names <- c("seq1", "seq2", "seq3", "seq4")
sequences <- c("EPTFYQNPQFSVTLDKR", "SLLEDPCYIGLR", "YEVLESVQNYDTGVAK", "VLGALDLGDNYR")
df <- data.frame(names, sequences)
Then convert the data frame to .fasta format using the function: 'dataframe2fas'
df.fasta = dataframe2fas(df, file="df.fasta")
D <- do.call(rbind, lapply(seq(nrow(X)), function(i) t(X[i, ])))
D
# 1
# column1 "sequence1"
# column2 "ATCGATCGATCG"
# column1 "sequence2"
# column2 "GCCATGCCATTG"
Then, when you write it to a file (supply a file = argument; without one, write.table() prints to the console as shown below), you could use
write.table(D, row.names = FALSE, col.names = FALSE, quote = FALSE)
# sequence1
# ATCGATCGATCG
# sequence2
# GCCATGCCATTG
so that the row names, column names, and quotes will be gone.
When I do this, I tend to use something like:
Xfasta <- character(nrow(X) * 2)
Xfasta[c(TRUE, FALSE)] <- paste0(">", X$column1)
Xfasta[c(FALSE, TRUE)] <- X$column2
This creates an empty character vector with length twice the number of rows in your table; it then puts the values from column1 in every second position starting at 1, and the values of column2 in every second position starting at 2.
then write using writeLines:
writeLines(Xfasta, "filename.fasta")
In this answer, I added a ">" to the headers since this is standard for fasta format and is required by some tools that take fasta input. If you don't care about adding the ">", then:
Xfasta <- character(nrow(X) * 2)
Xfasta[c(TRUE, FALSE)] <- X$column1
Xfasta[c(FALSE, TRUE)] <- X$column2
If you didn't read your file in with options that stop character columns being converted to factors, then you might need to use Xfasta[c(TRUE, FALSE)] <- as.character(X$column1) instead.
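Putting the pieces together, a minimal end-to-end sketch (assuming X is read from a tab-separated text file with the two columns shown above; adjust the read step to your actual data):
X <- read.table("sequences.txt", header = TRUE, stringsAsFactors = FALSE)
Xfasta <- character(nrow(X) * 2)
Xfasta[c(TRUE, FALSE)] <- paste0(">", X$column1)  # headers on the odd lines
Xfasta[c(FALSE, TRUE)] <- X$column2               # sequences on the even lines
writeLines(Xfasta, "filename.fasta")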
There are also a few tools available for this conversion; I think the Galaxy browser has an option for it.

Creating functions in R with iterative code

I work with surveys and would like to export a large number of tables (drawn from data frames) into an .xlsx or .csv file. I use the xlsx package to do this. This package requires me to stipulate which column in the Excel file is the first column of the table. Because I want to paste multiple tables into the same file, I need to be able to stipulate that the first column for table n is the width of table (n-1) plus some number of spacer columns. To do this I planned on creating values like the following.
Each dt is made by converting a table into a data frame:
table1 <- table(df$y, df$x)
dt1 <- as.data.frame.matrix(table1)
Here I make the values for the starting column numbers:
startcol1 = 1
startcol2 = NCOL(dt1) + 3
startcol3 = NCOL(dt2) + startcol2 + 3
startcol4 = NCOL(dt3) + 3 + startcol2 + startcol3
And so on. I will probably need to produce somewhere between 50 and 100 tables. Is there a way in R to make this an iterative process, so I can create the 50+ starting-column values without having to write 50+ lines of code, each one building on the previous?
I found material on Stack Overflow and other blogs about writing for loops or using apply-type functions in R, but this all seemed to deal with manipulating a vector as opposed to adding values to the workspace. Thanks.
You can use a structure similar to this:
Your list of files to read:
file_list = list.files("~/test/", pattern = "\\.csv$", full.names = TRUE)
For each file, read and process the data frame and capture how many columns there are in the frame you are reading/processing:
columnsInEachFile = sapply(file_list,
                           function(x) {
                             df = read.csv(x, ...)  # with your appropriate arguments
                             # do any necessary processing you require per file
                             return(ncol(df))
                           })
The cumulative sum of the column counts, plus 1, indicates the start columns when the processed data frames are placed next to each other:
columnsToStartDataFrames = cumsum(columnsInEachFile)+1
columnsToStartDataFrames = columnsToStartDataFrames[-length(columnsToStartDataFrames)] # last value is not the start of a data frame but the end
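If the goal is to lay those tables out side by side in one sheet, the start columns can be passed to the xlsx package's addDataFrame(), which takes a startColumn argument. A hedged sketch: it re-reads the files into a list (reuse your processed frames instead if you kept them) and prepends 1 for the first table's start column:
library(xlsx)
df_list <- lapply(file_list, read.csv)
start_cols <- c(1, head(cumsum(sapply(df_list, ncol)) + 1, -1))
wb <- createWorkbook(type = "xlsx")
sheet <- createSheet(wb, sheetName = "tables")
for (i in seq_along(df_list)) {
  addDataFrame(df_list[[i]], sheet, startColumn = start_cols[i], row.names = FALSE)
}
saveWorkbook(wb, "all_tables.xlsx")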
Assuming tab.lst is a list containing tables, then you can do:
cumsum(c(1, sapply(head(tab.lst, -1), ncol)))
Basically, what I'm doing here is I'm looping through all the tables but the last one (since that one's start col is determined by the second to last), and getting each table's width with ncol. Then I'm doing the cumulative sum over that vector to get all the start positions.
And here is how I created the tables (tables based on all possible combinations of columns in df):
df <- replicate(5, sample(1:10), simplify = FALSE)  # a list of 5 columns (used like a data frame)
names(df) <- tail(letters, 5)                       # name the cols
name.combs <- combn(names(df), 2)                   # get all 2-col combinations
tab.lst <- lapply(                                  # make tables for each 2-col combination
  split(name.combs, col(name.combs)),               # loop through every column in name.combs
  function(x) table(df[[x[[1]]]], df[[x[[2]]]])     # ... and make a table
)
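If you also want a gap of 3 blank columns between consecutive tables, as in the question's hand-written startcol values, just add 3 to each width before taking the cumulative sum (assuming tab.lst as above):
start_cols <- cumsum(c(1, sapply(head(tab.lst, -1), ncol) + 3))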
