Problem with computing venn.diagram with two data sets of character values - r

As in title, when I try to draw a venn diagram with VennDiagram package, the resulting plot looks like this: Resulting graph.
My input is two tables read from txt files with read.delim (although I also tried read.table) put into a list for venn.diagram purpose. The datasets are 1325 and 675 rows long with short peptide sequences as character values (eg. REVDPDGRRTL), so I don't understand the resulting graph.
Here's what in theory should work:
library("VennDiagram")
#reading files
hid <- read.table("data/file1.txt", sep = "\t")
lid <- read.table("data/file2.txt", sep = "\t")
#creating a list
vid <- list(High = hid, Low = lid)
#graph
venn.diagram(vid, fill = c("#EFC000FF", "#0073C2FF"), filename = "#venn.png")
I also tried transforming the sets to vectors/lists and plotting like that but problem stays the same.
It surely lays on datasets/list side, because the graph is correct when I put example values like this
venn.diagram(list(Low = c("REVDPDGRRTL", "IYEDEDVKEA", "GVYDGREHTV"), High = c("IYEDEDVKEA", "GVYDGREHTV")),
fill = c("#EFC000FF", "#0073C2FF"), filename = "#venn.png")
I'm sure it's some rookie mistake but I can't think of a solution.
Any help is highly appreciated,
Thank you

Related

Venn diagram in R completely blank

I am trying to create a Venn diagram for common differentially expressed genes across 3 data sets. I created a list that contains the differentially expressed genes, then I used the venn.diagram() function with the following arguments: x (which is my list of gene names in the three data sets) , filename,category.names and output. However, the Venn diagram is turning out completely blank, no category names nor numbers inside intersections.
My code looks like this:
venn.diagram(up, filename = 'venn_up.png', category.names = c('up_PC3', 'up_LAPC4', 'up_22Rv1'), output = TRUE)
Has anyone faced a similar problem? Thanks all!
Without reproducible dataset it is hard, so I created one:
genes <- paste("gene",1:1000,sep="")
x <- list(
up_PC3 = sample(genes,300),
up_LAPC4 = sample(genes,525),
up_22Rv1 = sample(genes,440)
)
You can use the following code to run a Venn diagram:
library(VennDiagram)
venn.diagram(x, filename = "venn_up.png", category.names = c('up_PC3', 'up_LAPC4', 'up_22Rv1'))
Than check at the right folder of your working directory for the output:

Importing data in R from Excel with information cotained in header

As title says, I am trying to import data from Excel to R, where part of the information is contained in the header.
I a very simplified way, the Excel I have looks like this:
GROUP;1234
MONTH;"Jan"
PERSON;SEX;AGE;INCOME
John;m;26;20000
Michael;m;24;40000
Phillip;m;25;15000
Laura;f;27;72000
Total;;;147000
After reading in to R, it should be a "clean" dataset that looks like this.
GROUP;MONTH;PERSON;SEX;AGE;INCOME
1234;Jan;John;m;26;20000
1234;Jan;Michael;m;24;40000
1234;Jan;Phillip;m;25;15000
1234;Jan;Laura;f;27;72000
I have several files that look like this. The number of persons however varies in each file. The last line contains a summary that should be skipped. There might be empty lines between the list and summary line.
Any help is higly apreciated.Thank you very much.
Excel files can be read using readxl::read_excel()
One of the parameters is skip, using which you can skip certain number of rows defined by you.
For your data, you need to skip the first two lines that contain GROUP and MONTH.
You will get the data in following format.
PERSON;SEX;AGE;INCOME;
John;m;26;20000
Michael;m;24;40000
Phillip;m;25;15000
Laura;f;27;72000
After this, you can manually add the columns GROUP and MONTH
Thank you very much for your help. The hint from #Aurèle brought the missing puzzle piece. The solution I have now come up with is as follows:
group <- read_excel("TEST1.xlsx", col_names = c("C1","GROUP") ,n_max = 1)
group <- group[,2]
month <- read_excel("TEST1.xlsx", col_names = c("C1","MONTH") ,skip = 1, n_max = 1)
month <- month[,2]
data <- read_excel("TEST1.xlsx", col_names = c("NAME","SEX","AGE","INCOME") , skip = 4)
data <- data[data$AGE != NA,]
data <- cbind(data,group,month)
data

Different results for same dataset in Cluster Analyses with R Studio?

I just start using R and I have a question regarding cluster analysis in R.
I apply agnes function to apply cluster analysis for my dataset. But I realized that cluster results and the pltrees are different when I used the .txt file and .csv file.
Maybe it would be better to explain my problem with the images:
My dataset in .txt format;
I used the following code to see the data in R;
data01 <- read.table("D:/CLUSTER_ANALYSIS/NumericData3_IN.txt", header = T)
and everything is fine, it seems like;
I apply the cluster anaylsis,
complete1 <- agnes(data01, stand = FALSE, method = 'complete')
plot(complete1, which.plots=2, main='Complete-Linkage')
And here is the pltree:
I made the same steps with .csv file, which includes exactly the same dataset. Here is the dataset in .csv format:
Again the cluster analysis for .csv file:
data02 <- read.csv("D:/CLUSTER_ANALYSIS/NumericData3.csv", header = T)
complete2 <- agnes(data02, stand = FALSE, method = 'complete')
plot(complete2, which.plots=2, main='Complete-Linkage')
And the pltree is completely different,
So, DECIMAL SEPARATOR for the txt is COMMA and for csv file it is DOT. Which of these results are correct? Is the decimal separator for numeric dataset comma or dot in R?
From the R manual on read.table (and read.csv) you can see the default separators. They are dot for each of your used functions. You can also set them to whatever you like with the "dec" parameter. Eg:
data01 <- read.table("D:/CLUSTER_ANALYSIS/NumericData3_IN.txt", header = T, dec=",")

CSV with multiple datasets/different-number-of-columns

Similar to How can you read a CSV file in R with different number of columns, I have some complex CSV-files. Mine are from SAP BusinessObjects and hold challenges different to those of the quoted question. I want to automate the capture of an arbitrary number of datasets held in one CSV file. There are many CSV-files, but let's start with one of them.
Given: One CSV file containing several flat tables.
Wanted: Several dataframes or other structure holding all data (S4?)
The method so far:
get line numbers of header data by counting number of columns
get headers by reading every line index held in vector described above
read data by calculating skip and nrows between data sets in index described by header lines as above.
give the read data column names from read header
I need help getting me on the right track to avoid loops/making the code more readable/compact when reading headers and datasets.
These CSVs are formatted as normal CSVs, only that they contain an more or less arbitrary amount of subtables. For each dataset I export, the structure is different. In the current example I will suppose there are five tables included in the CSV.
In order to give you an idea, here is some fictous sample data with line numbers. Separator and quote has been stripped:
1: n, Name, Species, Description, Classification
2: 90, Mickey, Mouse, Big ears, rat
3: 45, Minnie, Mouse, Big bow, rat
...
16835: Code, Species
16836: RT, rat
...
22673: n, Code, Country
22674: 1, RT, Murica
...
33211: Activity, Code, Descriptor
32212: running, RU, senseless activity
...
34749: Last update
34750: 2017/05/09 02:09:14
There are a number of ways going about reading each data set. What I have come up with so far:
filepath <- file.path(paste0(Sys.getenv("USERPROFILE"), "\\SAMPLE.CSV)
# Make a vector with column number per line
fieldVector <- utils::count.fields(filepath, sep = ",", quote = "\"")
# Make a vector with unique number of fields in file
nFields <- base::unique(fieldVector)
# Make a vector with indices for position of new dataset
iHeaders <- base::match(nFields, fieldVector)
With this, I can do things like:
header <- utils::read.csv2(filepath, header = FALSE, sep = ",", quote = "\"", skip = iHeaders[4], nrows = iHeaders[5]-iHeaders[4]-1)
data <- utils::read.csv2(filepath, header = FALSE, sep = ",", quote = "\"", skip = iHeaders[4] + 1, nrows = iHeaders[5]-iHeaders[4]-1)
names(data) <- header
As in the intro of this post, I have made a couple of functions which makes it easier to get headers for each dataset:
Headers <- GetHeaders(filepath, iHeaders)
colnames(data) <- Headers[[4]]
I have two functions now - one is GetHeader, which captures one line from the file with utils::read.csv2 while ensuring safe headernames (no æøå % etc).
The other returns a list of string vectors holding all headers:
GetHeaders <- function(filepath, linenums) {
# init an empty list of length(linenums)
l.headers <- vector(mode = "list", length = length(linenums))
for(i in seq_along(linenums)) {
# read.csv2(filepath, skip = linenums[i]-1, nrows = 1)
l.headers[[i]] <- GetHeader(filepath, linenums[i])
}
l.headers
}
What I struggle with is how to read in all possible datasets in one go. Specifically the last set is a bit hard to wrap my head around if I should write a common function, where I only know the line number of header, and not the number of lines in the following data.
Also, what is the best data structure for such a structure as described? The data in the subtables are all relevant to each other (can be used to normalize parts of the data). I understand that I must do manual work for each read CSV, but as I have to read TONS of these files, some common functions to structure them in a predictable manner at each pass would be excellent.
Before you answer, please keep in mind that, no, using a different export format is not an option.
Thank you so much for any pointers. I am a beginner in R and haven't completely wrapped my head around all possible solutions in this particular domain.

Plot data from several large data files in ggplot

I have several data files (numeric) with around 150000 rows and 25 columns. Before I was using gnuplot (where script lines are proportional plot objects) to to plot the data but as I have to do now some additional analysis with it I moved to R and ggplot2.
How to organize the data, thought? Is one big data.frame with an additional column to mark from which file the data is coming from really the only option? Or is there some way around that?
Edit: To be a bit more precise, I'll give as an example in what form I have the data now:
filelst=c("filea.dat", "fileb.dat", "filec.dat")
dat=c()
for(i in 1:length(filelst)) {
dat[[i]]=read.table(file[i])
}
Assuming you have filenames ending with ".dat", here's a mockup example of the strategies proposed by Chase,
require(plyr)
# list the files
lf = list.files(pattern = "\.dat")
str(lf)
# 1. read the files into a data.frame
d = ldply(lf, read.table, header = TRUE, skip = 1) # or whatever options to read
str(d) # should contain all the data, and and ID column called L1
# use the data, e.g. plot
pdf("all.pdf")
d_ply(d, "L1", plot, t="l")
dev.off()
# or using ggplot2
ggplot(d, aes(x, y, colour=L1)) + geom_line()
# 2. read the files into a list
ld = lapply(lf, read.table, header = TRUE, skip = 1) # or whatever options to read
names(ld) = gsub("\.dat", "", lf) # strip the file extension
str(ld)
# use the data, e.g. plot
pdf("all2.pdf")
lapply(names(l), function(ii) plot(l[[ii]], main=ii), t="l")
dev.off()
# 3. is not fun
Your question is a little vague. If I followed along properly, I think you have three main options:
Do as you suggest and then use any one of the "split-apply-combine" functions that exist in R to conduct your analyses by group. These functions may include by, aggregate, ave, package(plyr), package(data.table) and many others.
Store your data object as separate elements in a list(). Then use lapply() and friends to work on them.
Keep everything separate in different data objects and work on them individually. This is probably the most inefficient way to go about doing things, unless you have memory constraints et al.

Resources