Different results for the same dataset in cluster analysis with RStudio?

I just started using R and have a question about cluster analysis.
I use the agnes function to run a cluster analysis on my dataset, but I noticed that the cluster results and the pltrees differ depending on whether I load the data from a .txt file or a .csv file.
Maybe it is easier to explain my problem with images.
My dataset in .txt format:
I used the following code to read the data into R:
data01 <- read.table("D:/CLUSTER_ANALYSIS/NumericData3_IN.txt", header = T)
and everything seems fine.
I apply the cluster analysis:
complete1 <- agnes(data01, stand = FALSE, method = 'complete')
plot(complete1, which.plots=2, main='Complete-Linkage')
And here is the pltree:
I repeated the same steps with a .csv file containing exactly the same dataset. Here is the dataset in .csv format:
Again, the cluster analysis for the .csv file:
data02 <- read.csv("D:/CLUSTER_ANALYSIS/NumericData3.csv", header = T)
complete2 <- agnes(data02, stand = FALSE, method = 'complete')
plot(complete2, which.plots=2, main='Complete-Linkage')
And the pltree is completely different.
So: the decimal separator in the .txt file is a comma, while in the .csv file it is a dot. Which of these results is correct? Does R expect a comma or a dot as the decimal separator for numeric data?

The R manual for read.table (and read.csv) shows the default decimal separator: it is a dot for both of the functions you used, so the comma decimals in your .txt file were not read as the numbers you intended, which is why the two results differ. You can set the separator to whatever you like with the dec parameter, e.g.:
data01 <- read.table("D:/CLUSTER_ANALYSIS/NumericData3_IN.txt", header = T, dec=",")
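A quick way to confirm the fix is to check the column types after reading; with the wrong dec setting, the measurement columns come in as character or factor instead of numeric:
data01 <- read.table("D:/CLUSTER_ANALYSIS/NumericData3_IN.txt", header = TRUE, dec = ",")
str(data01)  # every measurement column should now show up as num
For semicolon-separated files that use comma decimals, read.csv2 bundles both defaults (sep = ";", dec = ",").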

Related

Read specific non-adjacent columns from Excel file [duplicate]

So I have a bunch of Excel files I want to loop through, reading specific, discontinuous columns into a data frame. Using readxl works for the basic stuff, like this:
library(readxl)
library(plyr)
wb <- list.files(pattern = "\\.xls")  # regex, not a glob: match ".xls" (and ".xlsx") in file names
dflist <- list()
for (i in wb) {
  dflist[[i]] <- data.frame(read_excel(i, sheet = "SheetName", skip = 3, col_names = TRUE))
}
# now combine them into one data frame
data <- ldply(dflist, data.frame, .id = NULL)
This works (barely), but the problem is that my Excel files have about 114 columns and I only want specific ones. I also do not want to let R guess the col_types, because it gets some of them wrong (e.g. for a string column, if the first value starts with a number, it tries to interpret the whole column as numeric and crashes). So my question is: how do I specify specific, discontinuous columns to read? The range argument uses the cellranger package, which does not allow reading discontinuous columns. Any alternative?
.xlsx >>> you can use library openxlsx
The read.xlsx function from library openxlsx has an optional parameter cols that takes a numeric vector of indices specifying which columns to read.
One caveat: it seems to read all columns as characters if at least one column contains characters.
openxlsx::read.xlsx("test.xlsx", cols = c(2,3,6))
.xls >>> you can use library XLConnect
A potential problem is that library XLConnect requires library rJava, which might be tricky to install on some systems. If you can get it running, the keep and drop parameters of readWorksheet() accept both column names and indices, and the colTypes parameter deals with column types. This works for me:
options(java.home = "C:\\Program Files\\Java\\jdk1.8.0_74\\") #path to jdk
library(rJava)
library(XLConnect)
workbook <- loadWorkbook("test.xls")
readWorksheet(workbook, sheet = "Sheet0", keep = c(1,2,5))
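Since colTypes was mentioned, a hedged example of forcing the column types (the type names here are illustrative; adjust them to the columns you actually keep):
# hypothetical types for the three kept columns
readWorksheet(workbook, sheet = "Sheet0", keep = c(1, 2, 5),
              colTypes = c("numeric", "character", "numeric"))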
Edit:
Library readxl works well for both .xls and .xlsx if you want to read a range (rectangle) from your Excel file. E.g.
readxl::read_xls("test.xls", range = "B3:D8")
readxl::read_xls("test.xls", sheet = "Sheet1", range = cell_cols("B:E"))
readxl::read_xlsx("test.xlsx", sheet = 2, range = cell_cols(2:5))
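To tie this back to the loop in the question, a minimal sketch for a folder of .xlsx files using openxlsx (the sheet name and column indices are the question's placeholders; skip = 3 in read_excel becomes startRow = 4 here):
library(openxlsx)
wb <- list.files(pattern = "\\.xlsx$")
dflist <- lapply(wb, function(f) {
  # read only columns 2, 3 and 6, starting below the 3 skipped rows
  read.xlsx(f, sheet = "SheetName", startRow = 4, cols = c(2, 3, 6))
})
data <- do.call(rbind, dflist)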

Convert XML Doc to CSV using R - Or search items in XML file using R

I have a large, unorganized XML file that I need to search to determine whether certain ID numbers are in it. I would like to use R for this, but because of the file's format I am having trouble converting it to a data frame, or even a list, so that I can extract it to a csv. I figure the search would be easy if the data were in csv format. So I need help understanding how to convert and extract it properly, or how to search the document for values using R. Below is the code I have used to try to convert the document, but my various attempts all produce errors.
## Method 1. I tried to convert to a data frame, but the columns are not all the same length.
require(XML)
require(plyr)
file<-"EJ.XML"
doc <- xmlParse(file,useInternalNodes = TRUE)
xL <- xmlToList(doc)
data <- ldply(xL, data.frame)
datanew <- read.table(data, header = FALSE, fill = TRUE)
## Method 2. I tried to convert it to a list; the file is extracted, but only two words from the file are listed.
data<- xmlParse("EJ.XML")
print(data)
head(data)
xml_data<- xmlToList(data)
class(data)
topxml <- xmlRoot(data)
topxml <- xmlSApply(topxml,function(x) xmlSApply(x, xmlValue))
xml_df <- data.frame(t(topxml), row.names = NULL)
write.csv(xml_df, file = "MyData.csv",row.names=FALSE)
I am going to do some research on how to search within R as well, but I assume the file needs to be in a data frame or a list either way. Any help is appreciated! Attached is a screenshot of the data. I am interested in matching entity ID numbers against a list I have in an Excel document.
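Not a full conversion, but if the goal is only to check whether certain IDs occur, an XPath query may already be enough. A minimal sketch with the XML package, assuming, hypothetically, that the IDs sit in nodes named entity_id; substitute the real node or attribute name from your file:
require(XML)
doc <- xmlParse("EJ.XML", useInternalNodes = TRUE)
# pull the text of every <entity_id> node (hypothetical node name)
ids <- xpathSApply(doc, "//entity_id", xmlValue)
# hypothetical IDs to look up, e.g. taken from your Excel list
wanted <- c("12345", "67890")
wanted %in% ids  # TRUE wherever the ID appears in the XML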

Export SpectraObjects to csv in ChemoSpec

I am using ChemoSpec to analyse FTIR spectra in R.
I was able to import several csv files using files2SpectraObject, applied some of the data pre-processing procedures, such as normalization and binning, and generated new SpectraObjects with the results.
Is it possible to export data back to csv format from the generated SpectraObjects?
So far I tried this
write.table(ftirbin, "E:/ftirbin.txt", sep="\t")
and got this:
Error in as.data.frame.default(x[[i]], optional = TRUE, stringsAsFactors = stringsAsFactors) :
cannot coerce class ""Spectra"" to a data.frame
Thanks in advance!
G
If you look at ?Spectra you'll see how a Spectra object is stored. The intensity values are in your_object$data, and the frequency values are in your_object$freq. So you can't export the whole object (it's not a data frame, but rather a list), but you can export the pieces. To export the frequencies in the first column, and the samples in the following columns, you can do this (example uses a built in data set, SrE.IR):
tmp <- cbind(SrE.IR$freq, t(SrE.IR$data))
colnames(tmp) <- c("freq", SrE.IR$names)
tmp <- as.data.frame(tmp) # it was a matrix
Then you can write it out using write.csv or write.table (check the arguments to avoid row numbers).
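For example (the output file name is illustrative):
write.csv(tmp, "SrE_IR.csv", row.names = FALSE)  # row.names = FALSE drops the row-number column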

How to write binary data to a csv file in R

I am trying to write binary data to a csv file and then read that file back with 'read.csv2', 'read.table' or 'fread' to get a data frame. The script is as follows:
library(iotools)
library(data.table)
# make a data frame
n <- data.frame(x = 1:100000, y = rnorm(100000), z = rnorm(100000), w = "1dfsfsfsf")
# file name variable
file_output <- "test.csv"
# if the file already exists, remove it
if (file.exists(file_output)) file.remove(file_output)
# open a file connection in binary write mode
zz <- file(file_output, "wb")
# binary vector with the column names (the "" pads an empty second row)
rnames <- as.output(rbind(colnames(n), ""), sep = ";", nsep = "\t")
# binary vector with the data frame
r <- as.output(n, sep = ";", nsep = "\t")
# write the column names to the file
writeBin(rnames, zz)
# write the data to the file
writeBin(r, zz)
# close the file connection
close(zz)
# test readings
check <- read.table(file_output, header = TRUE, sep = ";", dec = ".",
                    stringsAsFactors = FALSE, blank.lines.skip = TRUE)
str(check)
class(check)
check <- fread(file_output, dec = ".", data.table = FALSE, stringsAsFactors = FALSE)
str(check)
class(check)
check <- read.csv2(file_output, dec = ".")
str(check)
class(check)
The output from the file is attached:
My questions are:
How can I remove the blank line from the file without reading it back into R?
The padding row was added on purpose so that the column names are pasted as a binary vector in the same shape as the data frame; otherwise the column names were written as a one-column vector. Maybe it is possible to remove the blank line before 'writeBin()'?
How can I make all numeric values in the file be read back as numeric rather than as character?
I use the binary transfer on purpose because it is much faster than 'write.csv2'. For instance, if you run
system.time(write.table.raw(n, "test.csv", sep = ";", col.names = TRUE))
the elapsed time is roughly a quarter of what 'write.table' takes.
I could not comment on your question because of my reputation, but I hope this helps. Two things come to mind:
Use the fill argument of read.table: if TRUE, and the rows have unequal length, blank fields are implicitly added (see ?read.table).
You already mentioned blank.lines.skip = TRUE: if TRUE, blank lines in the input are ignored.
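For the blank line itself: it comes from the padding row in rbind(colnames(n), ""). A sketch of a fix, assuming as.output() emits one line per matrix row, is to build the header from a one-row matrix; and colClasses lets read.table coerce the columns explicitly instead of guessing:
# header as a one-row matrix: no empty second row, so no blank line in the file
rnames <- as.output(rbind(colnames(n)), sep = ";", nsep = "\t")
# force the column types on the way back in (order matches x, y, z, w)
check <- read.table(file_output, header = TRUE, sep = ";", dec = ".",
                    colClasses = c("integer", "numeric", "numeric", "character"))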
