Dropping extra column in Tab delim file - r

I tried to merge different tab delim files into single file using the following R command.
If you observe, I even save the file using write.table command. Now i need to read the same files for further analysis. The biggest problem I am facing is that there is an extra column without any column name created automatically. If you observe that there is a column (Red colour) created automatically when I use the write.table function.
I want to get rid of that column as it hampers all further calculations.
combine=function(file) {
split_list <- unlist(strsplit(file,split=","))
setwd("D:/combine")
dataset <- do.call("cbind",lapply(split_list,FUN=function(files) { read.table(files,header=TRUE, sep="\t") } ) )
names(dataset)[1]=paste("Probe_ID")
drop=c("ProbeID")
dataset=dataset[,!(names(dataset)%in%drop)]
dataset$X=NULL
write.table(dataset,file="D:/output/illumina.txt",sep="\t",col.names=NA)
return ("illumina.txt")
}

Use the argument row.names=FALSE in write.table.

As #James says -- or use row.names=1 in read.table() to indicate that the first column designates the row identifiers of the table when reading the table back into R.

Related

Loading multiple csv files at once and keep the column names unchanged

I have a folder containing different csv files. Below is the picture showing the csv files. I would like to import all of them at once and name them in one go. Also, I would like to keep the column names unchanged.
Here is what I tried:
#Loading the data
filenames <- list.files(path="C:/Users/Juste/Desktop/Customs Data",
pattern="Imports 201+.*csv")
filelist <- lapply(filenames, read.csv)
#assigning names to data.frames
names(filelist) <- paste0("Imports_201",2:length(filelist))
#note the invisible function keeps lapply from spitting out the data.frames to the console
invisible(lapply(names(filelist), function(x) assign(x,filelist[[x]],envir=.GlobalEnv)))
When I tried this, it only imports the first five csv files, it leaves out “Imports 2017_anonymised”. Also the column names change the format. For example, column “Best country” becomes “Best.country”. How can I import all of the csv files and keep the column names unchanged?
You could try map() from the purrr package and read_csv() from the readr package (note that it is written with an underscore). This way your column names don't get changed.
library(purrr)
library(readr)
map(filenames, read_csv)
or if you automatically want to concatenate the dataframes use
map_df(filenames, read_csv)
Sorry, I can't add comments because I don't currently have enough reputation on here to do so. However, I think your regex might be a little off for the import. Try pattern = "^Imports\\s+201\\d_anonymised\\.csv$".
Regarding the "."s in the column names, I believe that by default, R's core data import commands add these where there are spaces. Otherwise you'll need to use backticks each time you want to refer to a column with space in its name. You could try setting check.names = F in you read.csv() function, as this is what calls make.names() to sanitize the column names upon data import. type ?make.names to see what it's doing.

R / openxlsx / Finding the first non-empty cell in Excel file

I'm trying to write data to an existing Excel file from R, while preserving the formatting. I'm able to do so following the answer to this question (Write from R into template in excel while preserving formatting), except that my file includes empty columns at the beginning, and so I cannot just begin to write data at cell A1.
As a solution I was hoping to be able to find the first non-empty cell, then start writing from there. If I run read.xlsx(file="myfile.xlsx") using the openxlsx package, the empty columns and rows are automatically removed, and only the data is left, so this doesn't work for me.
So I thought I would first load the worksheet using wb <- loadWorkbook("file.xlsx") so I have access to getStyles(wb) (which works). However, the subsequent command getTables returns character(0), and wb$tables returns NULL. I can't figure out why this is? Am I right in that these variables would tell me the first non-empty cell?
I've tried manually removing the empty columns and rows preceding the data, straight in the Excel file, but that doesn't change things. Am I on the right path here or is there a different solution?
As suggested by Stéphane Laurent, the package tidyxl offers the perfect solution here.
For instance, I can now search the Excel file for a character value, like my variable names of interest ("Item", "Score", and "Mean", which correspond to the names() of the data.frame I want to write to my Excel file):
require(tidyxl)
colnames <- c("Item","Score","Mean")
excelfile <- "FormattedSheet.xlsx"
x <- xlsx_cells(excelfile)
# Find all cells with character values: return their address (i.e., Cell) and character (i.e., Value)
chars <- x[x$data_type == "character", c("address", "character")]
starting.positions <- unlist(
chars[which(chars$character %in% colnames), "address"]
)
# returns: c(C6, D6, E6)

Dynamically define tables to write to csv in r

I am trying to write multiple dataframes to multiple .csv files dynmanically. I have found online how to do the latter part, but not the former (dynamically define the dataframe).
# create separate dataframes from each 12 month interval of closed age
for (i in 1:max_age) {assign(paste("closed",i*12,sep=""),
mc_masterc[mc_masterc[,7]==i*12,])
write.csv(paste("closed",i*12,sep=""),paste("closed",i*12,".csv",sep=""),
row.names=FALSE)
}
In the code above, the problem is with the first part of the write.csv statement. It will create the .csv file dynamically, but not with the actual content from the table I am trying to specify. What should the first argument of the write.csv statement be? Thank you.
The first argument of write.csv needs to be an R object, not a string. If you don't need the objects in memory you can do it like so:
for (i in 1:max_age) {
df <- mc_masterc[mc_masterc[,7]==i*12,])
write.csv(df,paste("closed",i*12,".csv",sep=""),
row.names=FALSE)
}
and if you need them in memory, you can either do that separately, or use get to return an object based on a string. Seperate:
for (i in 1:max_age) {
df <- mc_masterc[mc_masterc[,7]==i*12,])
assign(paste("closed",i*12,sep=""),df)
write.csv(df,paste("closed",i*12,".csv",sep=""),
row.names=FALSE)
}
With get:
for (i in 1:max_age) {
assign(paste("closed",i*12,sep=""), mc_masterc[mc_masterc[,7]==i*12,])
write.csv(get(paste("closed",i*12,sep="")),paste("closed",i*12,".csv",sep=""),
row.names=FALSE)
}

Importing a .csv with headers that show up as first row of data

I'm importing a csv file into R. I read a post here that said in order to get R to treat the first row of data as headers I needed to include the call header=TRUE.
I'm using the import function for RStudio and there is a Code Preview section in the bottom right. The default is:
library(readr)
existing_data <- read_csv("C:/Users/rruch/OneDrive/existing_data.csv")
View(existing_data)
I've tried placing header=TRUE in the following places:
read_csv(header=TRUE, "C:/Users...)
existing_data.csv", header=TRUE
after 2/existing_data.csv")
Would anyone be able to point me in the right direction?
You should use col_names instead of header. Try this:
library(readr)
existing_data <- read_csv("C:/Users/rruch/OneDrive/existing_data.csv", col_names = TRUE)
There are two different functions to read csv files (actually far more than two): read.csv from utils package and read_csv from readr package. The first one gets header argument and the second one col_names.
You could also try fread function from data.table package. It may be the fastest of all.
Good luck!
It looks like there is one variable name that is correctly identified as a variable name (notice your first column). I would guess that your first row only contains the variable "Existing Product List", and that your other variable names are actually contained in the second row. Open the file in Excel or LibreOffice Calc to confirm.
If it is indeed the case that all of the variable names you've listed (including "Existing Product List") are in the first row, then you're in the same boat as me. In my case, the first row contains all of my variables, however they appear as both variable names and the first row of observations. Turns out the encoding is messed up (which could also be your problem), so my solution was simply to remove the first row.
library(readr)
mydat = read_csv("my-file-path-&-name.csv")
mydat = mydat[-1, ]

Inconsistency between 'read.csv' and 'write.csv' in R

The R function read.csv works as the following as stated in the manual: "If there is a header and the first row contains one fewer field than the number of columns, the first column in the input is used for the row names." That's good. However, when it comes to the function write.csv, I cannot find a way to write the csv file in a similar way. So, if I have a file.txt as below:
Column_1,Column_2
Row_1,2,3
Row_2,4,5
Then when I read it using a = read.csv('file.txt'), the row and column names are Row_x and Column_x as expected. However, when I write the matrix a to a csv file again, then what I get as a result from write.csv('file2.txt', quote=F) is as below:
,Column_1,Column_2
Row_1,2,3
Row_2,4,5
So, there is a comma in the beginning of this file. And if I would read this file again using a2 = read.csv('file2.txt'), then resulting a2 will not be the same as the previous matrix a. The row names of the matrix a2 will not be Row_x. That's, I do not want a comma in the beginning of the file. How can I get rid of this comma while using write.csv?
The two functions that you have mentioned, read.cvs and write.csv are just a specific form of the more generic functions read.table and write.table.
When I copy your example data into a .csv and try to read it with read.csv, R throws a warning and says that the header line was incomplete. Thus it resorted to special behaviour to fix the error. Because we had an incomplete file, it completed the file by adding an empty element at the top left. R understands that this is a header row, and thus the data appears okay in R, but when we write to a csv, it doesn't understand what is header and what is not. Thus the empty element only appearing in the header row created by R shows up as a regular element. Which you would expect. Basically it made our table into a 3x3 because it can't have a weird number of elements.
You want the extra comma there, because it allows programs to read the column names in the right place. In order to read the file in again you can do the following, assuming test.csv is your data. You can fix this by manually adding the column and row names in R, including the missing element to put everything in place.
To fix the wonky row names, you're going to want to add an extra option specifying which row is the row names (row.names = your_column_number) when you read it back in with the comma correctly in place.
y <- read.csv(file = "foo.csv") #this throws a warning because your input is incorrect
write.csv(y, "foo_out.csv")
x <- read.csv(file = "foo.csv", header = T, row.names = 1) #this will read the first column as the row names.
Play around with read/write.csv, but it might be worth while to move into the more generic functions read.table and write.table. They offer expanded functionality.
To read a csv in the generic function
y <- read.table(file = "foo.csv", sep = ",", header = TRUE)
thus you can specify the delimiter and easily read in excel spreadsheets (separated by tab or "\t") or space delimited files ( " " ).
Hope that helps.

Resources