Merging Datasets with Different Attributes - r

I have a group of .xls files containing data for different periods of the year. I would like to merge them so that I have all the data in one file. I tried the following code:
#create files list
setwd("~/2010")
file.list <- list.files( pattern = ".*\\.xls$", full.names = TRUE )
When I continue, I get some warnings, but I don't think they are relevant. See below:
#read files
> l <- lapply( file.list, readxl::read_excel )
There were 50 or more warnings (use warnings() to see the first 50)
> warnings()
Warning messages:
1: In read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, ... :
Expecting numeric in F1944 / R1944C6: got '-'
2: In read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, ... :
Expecting numeric in H1944 / R1944C8: got '-'
Then, I run the following line and the problems with the attributes pop up:
> dt <- data.table::rbindlist( l, use.names = TRUE, fill = TRUE )
Error in data.table::rbindlist(l, use.names = TRUE, fill = TRUE) :
Class attribute on column 15 of item 4 does not match with column 15 of item 1.
Can someone help me to fix this? Many thanks in advance

If you are going to bind together two datasets, the classes of the columns must match. Yours apparently do not. So you somehow need to address these mismatches.
Because you did not supply a col_types argument to readxl::read_excel, it is guessing the column types. I assume you expect the columns to have the same class in all of the data frames (otherwise, why bind them?), in which case you could pass a col_types argument so that readxl::read_excel doesn't have to guess.
The warning messages here are useful: I think they are saying that a column was guessed to be numeric but the parser then encountered a "-". Maybe this led to the column being assigned class "character". Perhaps "-" appears in the raw data to indicate a missing value; if so, passing na = c("", "-") to readxl::read_excel could resolve the issue.
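For instance, a minimal sketch along those lines (the col_spec vector is purely illustrative; it must contain one entry per column in your sheets, so adjust it to match your actual data):
library(readxl)
# illustrative only: one type per column of the sheets, in order
col_spec <- c("text", rep("numeric", 15))
l <- lapply(file.list, read_excel,
            col_types = col_spec,
            na = c("", "-"))   # treat blanks and "-" as missing
dt <- data.table::rbindlist(l, use.names = TRUE, fill = TRUE)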

Related

I can't seem to access the data in my file?

library(tidyverse)
y <- read_tsv("assignment_data.tsv")
x <- 1
When I check R console I get the following:
> y <- read_tsv("assignment_data.tsv", header=TRUE)
Error in read_tsv("assignment_data.tsv", header = TRUE) :
unused argument (header = TRUE)
>
> x <- 1
>
However, I can only access x in the global environment and I can't visualize the data in the file I tried to import.
Regarding your error:
Error in read_tsv("assignment_data.tsv", header = TRUE) :
unused argument (header = TRUE)
If you use
?read_tsv
you will find that header is not one of its arguments. Instead, you are looking for col_names.
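For example (a sketch; col_names = TRUE is already the default, so here it only makes the intent explicit):
library(readr)
y <- read_tsv("assignment_data.tsv", col_names = TRUE)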
Edit:
We found out the problem lay within the .tsv itself: the number of column names did not match the number of columns implied by the data.
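If you run into that again, one way to spot it (a sketch using readr's count_fields(), with the file name from your question) is to tabulate how many fields each line actually contains and compare that with the header:
library(readr)
# number of tab-separated fields on each line of the file
table(count_fields("assignment_data.tsv", tokenizer_tsv()))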

Write.table using looped variable in a for loop

I have a very stupid question. It has already been asked, but none of the solutions provided seem to work for me.
I am looping over a list containing different data frames, to perform an analysis and save an output file named differently for each input data frame. The name would be something like originalname_output.txt.
I wrote this piece of code which seems to work fine (does all the analysis in the correct ways), but gives an error when coming to the write.table part.
library(qqman)
library(QuASAR)
list_QuASAR <- list(Fw, Rv, tot) # all of them are data frames
for (i in list_QuASAR){
  output <- fitQuasarMpra(i[,2], i[,3], i[,4])
  print(sum(output$padj_quasar < 0.1))
  qq(output$pval3, col = "black", cex = 1)
  write.table(output, paste0("quasar_output/", i, "_output.txt"), col.names = T, sep = "\t")
}
fitQuasarMpra is a function of a package called QuASAR. Of course the subdirectory called quasar_output already exists.
The error I am getting is:
Error in file(file, ifelse(append, "a", "w")) :
invalid 'description' argument
In addition: Warning message:
In if (file == "") file <- stdout() else if (is.character(file)) { :
the condition has length > 1 and only the first element will be used
I know it's a trivial problem, but I am currently stuck. I may consider switching to lapply, but then I may encounter the same problem, and I wanted to solve this first.
Many thanks for your help.
You're trying to use a data frame object (i) as part of a file name; i.e. the data frame itself, not its name. You could try iterating over a named list instead:
list_QuASAR <- list(Fw = Fw, Rv = Rv, tot = tot)
for (i in names(list_QuASAR)){
  output <- fitQuasarMpra(list_QuASAR[[i]][,2], list_QuASAR[[i]][,3], list_QuASAR[[i]][,4])
  print(sum(output$padj_quasar < 0.1))
  qq(output$pval3, col = "black", cex = 1)
  write.table(output, paste0("quasar_output/", i, "_output.txt"), col.names = T, sep = "\t")
}
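If you do end up switching to lapply/Map as you mentioned, a roughly equivalent sketch (run_one is just a hypothetical helper name; it assumes qqman and QuASAR are loaded and the quasar_output directory exists, as above):
run_one <- function(name, df) {
  output <- fitQuasarMpra(df[,2], df[,3], df[,4])
  print(sum(output$padj_quasar < 0.1))
  qq(output$pval3, col = "black", cex = 1)
  write.table(output, paste0("quasar_output/", name, "_output.txt"),
              col.names = TRUE, sep = "\t")
}
invisible(Map(run_one, names(list_QuASAR), list_QuASAR))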

R unable to detect that I have more than one column in loaded files

What I want to do is take every file in the subdirectory that I am in and essentially just shift the column header names over one left.
I try to accomplish this by using fread in a for loop:
library(data.table)
## I need to write this script to reorder the column headers, which are now apparently out of whack
## I just need to shift them over one
filelist <- list.files(pattern = ".*.txt")
for (i in 1:length(filelist)){
  assign(filelist[[i]], fread(filelist[[i]], fill = TRUE))
  names(filelist[[i]]) <- c("RowID", "rsID", "PosID", "Link", "Link.1", "Direction", "Spearman_rho", "-log10(p)")
}
However, I keep getting the following or a variant of the following error message:
Error in names(filelist[[i]]) <- c("RowID", "rsID", "PosID", "Link", "Link.1", :
'names' attribute [8] must be the same length as the vector [1]
Which is confusing to me because, as you can clearly see above, RStudio is able to load the files with the correct number of columns. However, the error message seems to imply that there is only one column. I have tried different functions, such as colnames, and I have even tried to define the separator as quotation marks (my files were previously generated by another R script that quote-separated the entries), with no luck. In fact, if I try to define the separator that way:
for (i in 1:length(filelist)){
  assign(filelist[[i]], fread(filelist[[i]], sep = "\"", fill = TRUE))
  names(filelist[[i]]) <- c("RowID", "rsID", "PosID", "Link", "Link.1", "Direction", "Spearman_rho", "-log10(p)")
}
I get the following error:
Error in fread(filelist[[i]], sep = "\"", fill = TRUE) :
sep == quote ('"') is not allowed
Any help would be appreciated.
I think the problem is that, despite the name, list.files returns a character vector, not a list, so using [[ isn't quite right. Then, with assign, you create objects that have the same names as the files (not good practice; it would be better to use a list). Then you try to modify the names of the object you created, but using only the character string of the object's name. To use an object whose name is stored in a character string, you need get (which is part of why using a list is better than creating a bunch of objects).
To be more explicit, let's say that filelist = c("data1.txt", "data2.txt"). Then, when i = 1, this code: assign(filelist[[i]], fread(filelist[[i]], fill = TRUE)) creates a data table called data1.txt. But your next line, names(filelist[[i]]) <- ... doesn't modify your data table, it modifies the first element of filelist, which is the string "data1.txt", and that string indeed has length 1.
I recommend reading your files into a list instead of using assign to create objects.
filelist <- list.files(pattern = ".*.txt")
datalist <- lapply(filelist, fread, fill = TRUE)
names(datalist) <- filelist
For changing the names, you can use data.table::setnames instead; it modifies each data.table by reference, so no assignment is needed inside the loop:
for(dt in datalist) setnames(dt, c("RowID", "rsID", "PosID", "Link", "Link.1","Direction", "Spearman_rho", "-log10(p)"))
However, fread has a col.names argument, so you can just do it in the read step directly:
my_names <- c("RowID", "rsID", "PosID", "Link", "Link.1","Direction", "Spearman_rho", "-log10(p)")
datalist <- lapply(filelist, fread, fill = TRUE, col.names = my_names)
I would also suggest not using "-log10(p)" as a column name: nonstandard column names (with parentheses and -) are usually more trouble than they are worth.
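For example, a sketch of the same read step with a syntactic name (neg_log10_p is just one possible replacement; pick whatever suits you):
my_names <- c("RowID", "rsID", "PosID", "Link", "Link.1",
              "Direction", "Spearman_rho", "neg_log10_p")
datalist <- lapply(filelist, fread, fill = TRUE, col.names = my_names)
# a syntactic name can be referenced directly, e.g. datalist[[1]]$neg_log10_p,
# whereas "-log10(p)" needs backticks in every expression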
Could you run the following code to have a closer look at what you are putting into filelist?
i <- 1
assign(filelist[[i]], fread(filelist[[i]], fill = TRUE))
print(filelist[[i]])
I suspect you may need to use the code below instead of the assign statement
filelist[[i]] <- fread(filelist[[i]], fill = TRUE)

read_excel 'expecting numeric' ..... and value is numeric

I didn't find an answer to this question, so hopefully this is the place to get some help on this.
I am reading in many Excel files contained in .zip files. Each .zip that I have contains about 40 Excel files that I want to read. I am trying to create a list of data frames, but I encounter an error on reading some files, depending on the file content.
This is the read statement, inside a for loop:
library(readxl)
df[[i]] <- read_excel(xls_lst[i],
                      skip = 4,
                      col_names = FALSE,
                      na = "n/a",
                      col_types = data_types)
data_types has these values:
> data_types
[1] "text" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
which is correct for this file.
The read_excel statement works well on some files, but returns a warning message on others:
In read_xlsx_(path, sheet, col_names = col_names, col_types = col_types,... :
[54, 7]: expecting numeric: got '9999.990000'
Well, the value '9999.990000' looks like a numeric to me.
When I open the Excel file that creates this warning, it indeed shows these values, and also shows that the column is formatted as text in Excel.
When I change the column formatting to numeric and re-save the Excel sheet, the data is read in correctly.
However, I have several hundred of these files to read ... how can read_excel ignore the column format indicated by Excel and instead use the col_types definition that I supply in the calling statement?
Thanks,
I tried to build a toy example.
My xlsx file contains:
3 1
3 3
4 4
5 5
7 '999
6 3
Reading it in your way:
data_types <- c("numeric", "numeric")
a <- read_excel("aa.xlsx",
                col_names = FALSE,
                na = "n/a",
                col_types = data_types)
Warning message:
In read_xlsx_(path, sheet, col_names = col_names, col_types = col_types, :
[5, 2]: expecting numeric: got '999'
Reading everything in as text:
data_types <- c("text", "text")
dat <- read_excel("aa.xlsx",
                  col_names = FALSE,
                  na = "n/a",
                  col_types = data_types)
And using type.convert:
dat[] <- lapply(dat, type.convert)
works at least for this simple example.
Edit: there was a mistake in the code.
Edit in response to comment:
Another toy example demonstrating how you could apply type.convert to your data:
# list of data frames
l <- list()
l[[1]] <- data.frame(matrix(rep(as.character(1:5), 2), ncol = 2), stringsAsFactors = FALSE)
l <- rep(l, 3)
# looping over your list to encode columns correctly:
for (i in 1:length(l)){
  l[[i]][] <- lapply(l[[i]], type.convert)
}
There might be better solutions. But I think this should work.
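One caveat worth knowing: type.convert() has traditionally converted character columns to factors by default (and recent R versions warn if as.is is not specified). If you want character columns to stay character, a small tweak of the line inside the loop:
# keep character columns as character rather than turning them into factors
l[[i]][] <- lapply(l[[i]], type.convert, as.is = TRUE)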

write multiple custom files with d_ply

This question is almost the same as a previous question, but differs enough that the answers to that question don't work here. Like @chase in the previous question, I want to write out multiple files, one for each split of a data frame, in the following format (custom FASTA).
#same df as last question
df <- data.frame(
var1 = sample(1:10, 6, replace = TRUE)
, var2 = sample(LETTERS[1:2], 6, replace = TRUE)
, theday = c(1,1,2,2,3,3)
)
#how I want the data to look
write(paste(">", df$var1,"_", df$var2, "\n", df$theday, sep=""), file="test.txt")
#whole df output looks like this:
#test.txt
>1_A
1
>8_A
1
>4_A
2
>9_A
2
>2_A
3
>1_A
3
However, instead of getting the output from the entire data frame, I want to generate individual files for each subset of the data. Using d_ply as follows:
d_ply(df, .(theday), function(x)
  write(paste(">", df$var1, "_", df$var2, "\n", df$theday, sep = ""),
        file = paste(x$theday, ".fasta", sep = "")))
I get the following output error:
Error in file(file, ifelse(append, "a", "w")) :
invalid 'description' argument
In addition: Warning messages:
1: In if (file == "") file <- stdout() else if (substring(file, 1L, :
the condition has length > 1 and only the first element will be used
2: In if (substring(file, 1L, 1L) == "|") { :
the condition has length > 1 and only the first element will be used
Any suggestions on how to get around this?
Thanks,
zachcp
There were two problems with your code.
First, in constructing the file name, you passed the vector x$theday to paste(). Since x$theday is taken from a column of a data.frame, it often has more than one element. The error you saw was write() complaining when you passed several file names to its file= argument. Using instead unique(x$theday) ensures that you will only ever paste together a single file name rather than possibly more than one.
Second, you didn't get far enough to see it, but you probably want to write the contents of x (the current subset of the data.frame), rather than the entire contents of df to each file.
Here is the corrected code, which appears to work just fine.
d_ply(df, .(theday), function(x) {
  write(paste(">", x$var1, "_", x$var2, "\n", x$theday, sep = ""),
        file = paste(unique(x$theday), ".fasta", sep = ""))
})
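If you'd rather not depend on plyr, a roughly equivalent base-R sketch using split() produces the same files:
# one .fasta file per value of theday, same record format as above
invisible(lapply(split(df, df$theday), function(x) {
  write(paste(">", x$var1, "_", x$var2, "\n", x$theday, sep = ""),
        file = paste0(unique(x$theday), ".fasta"))
}))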
