I am trying to import a CSV file into R using the read.table command. I keep getting the error message "more columns than column names", even though I have set the strip.white to TRUE. The program that makes the csv files adds a large number of comma characters to the end of each line, which I think is the source of the extra columns.
read.table("filename.csv", sep=",", fill=T, header=TRUE, strip.white = T,
as.is=T,row.names = NULL, quote = "")
How can I get R to strip away the extraneous columns of commas from the header line and from the rest of the CSV file as it reads it into the R console?
Also, numerous cells in the csv file do not contain any data. Is it possible to get R to fill in these empty cells with "NA"?
The first two lines of the csv file:
Document_Name,Sequence_Name,Track_Name,Type,Name,Sequence,Minimum,Min_(with_gaps),Maximum,Max_(with_gaps),Length,Length_(with_gaps),#_Intervals,Direction,Average_Quality,Coverage,modified_by,Polymorphism_Type,Strand-Bias,Strand-Bias_>50%_P-value,Strand-Bias_>65%_P-value,Variant_Frequency,Variant_Nucleotide(s),Variant_P-Value_(approximate),,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Chr2_FT,Chr2,Chr2.bed,CDS,10000_ARHGAP15,GAAAGAATCATTAACAGTTAGAAGTTGATG-AAGTTTCAATAACAAGTGGGCACTGAGAGAAAG,55916421,56019336,55916483,56019399,63,64,1,forward,,,User,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
You can use colClasses with "NULL" entries to blank out the columns created by the trailing commas (fill=TRUE is still needed):
read.table(text="1,2,3,4,5,6,7,8,,,,,,,,,,,,,,,,,,
9,9,9,9,9,9,9,9,,,,,,,,,,,,,,,,,", sep=",", fill=TRUE, colClasses=c(rep("numeric", 8), rep("NULL", 30)) )
#------------------
V1 V2 V3 V4 V5 V6 V7 V8
1 1 2 3 4 5 6 7 8
2 9 9 9 9 9 9 9 9
Warning message:
In read.table(text = "1,2,3,4,5,6,7,8,,,,,,,,,,,,,,,,,,\n9,9,9,9,9,9,9,9,,,,,,,,,,,,,,,,,", :
cols = 26 != length(data) = 38
I needed to add back in the missing linefeed at the end of the first line. (Yet another reason why you should edit questions rather than putting data examples in the comments.) There was an octothorpe in the header which required the comment.char be set to "":
read.table(text="Document_Name,Sequence_Name,Track_Name,Type,Name,Sequence,Minimum,Min_(with_gaps),Maximum,Max_(with_gaps),Length,Length_(with_gaps),#_Intervals,Direction,Average_Quality,Coverage,modified_by,Polymorphism_Type,Strand-Bias,Strand-Bias_>50%_P-value,Strand-Bias_>65%_P-value,Variant_Frequency,Variant_Nucleotide(s),Variant_P-Value_(approximate),,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,\nChr2_FT,Chr2,Chr2.bed,CDS,10000_ARHGAP15,GAAAGAATCATTAACAGTTAGAAGTTGATG-AAGTTTCAATAACAAGTGGGCACTGAGAGAAAG,55916421,56019336,55916483,56019399,63,64,1,forward,,,User,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", header=TRUE, colClasses=c(rep("character", 24), rep("NULL", 41)), comment.char="", sep=",")
Document_Name Sequence_Name Track_Name Type Name
1 Chr2_FT Chr2 Chr2.bed CDS 10000_ARHGAP15
Sequence Minimum Min_.with_gaps. Maximum
1 GAAAGAATCATTAACAGTTAGAAGTTGATG-AAGTTTCAATAACAAGTGGGCACTGAGAGAAAG 55916421 56019336 55916483
Max_.with_gaps. Length Length_.with_gaps. X._Intervals Direction Average_Quality Coverage modified_by
1 56019399 63 64 1 forward User
Polymorphism_Type Strand.Bias Strand.Bias_.50._P.value Strand.Bias_.65._P.value Variant_Frequency
1
Variant_Nucleotide.s. Variant_P.Value_.approximate.
1
If you declare the numeric columns as "numeric" in colClasses, empty fields in those columns are read as NA automatically. For character columns, set na.strings="" to get the same result, as in the call below. You could also edit the header to take out the illegal characters in the column names. (I didn't think I needed to be the one to do that though.)
read.table(text="Document_Name,Sequence_Name,Track_Name,Type,Name,Sequence,Minimum,Min_(with_gaps),Maximum,Max_(with_gaps),Length,Length_(with_gaps),#_Intervals,Direction,Average_Quality,Coverage,modified_by,Polymorphism_Type,Strand-Bias,Strand-Bias_>50%_P-value,Strand-Bias_>65%_P-value,Variant_Frequency,Variant_Nucleotide(s),Variant_P-Value_(approximate),,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Chr2_FT,Chr2,Chr2.bed,CDS,10000_ARHGAP15,GAAAGAATCATTAACAGTTAGAAGTTGATG-AAGTTTCAATAACAAGTGGGCACTGAGAGAAAG,55916421,56019336,55916483,56019399,63,64,1,forward,,,User,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", header=TRUE, colClasses=c(rep("character", 24), rep("NULL", 41)), comment.char="", sep=",", na.strings="")
#------------------------------------------------------
Document_Name Sequence_Name Track_Name Type Name
1 Chr2_FT Chr2 Chr2.bed CDS 10000_ARHGAP15
Sequence Minimum Min_.with_gaps. Maximum
1 GAAAGAATCATTAACAGTTAGAAGTTGATG-AAGTTTCAATAACAAGTGGGCACTGAGAGAAAG 55916421 56019336 55916483
Max_.with_gaps. Length Length_.with_gaps. X._Intervals Direction Average_Quality Coverage modified_by
1 56019399 63 64 1 forward <NA> <NA> User
Polymorphism_Type Strand.Bias Strand.Bias_.50._P.value Strand.Bias_.65._P.value Variant_Frequency
1 <NA> <NA> <NA> <NA> <NA>
Variant_Nucleotide.s. Variant_P.Value_.approximate.
1 <NA> <NA>
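The header clean-up and NA handling mentioned above can also be done without counting the trailing commas by hand. What follows is only a sketch under assumptions: csv_text is a made-up miniature of the real file, clean_names is a hypothetical helper, and splitting the header with strsplit assumes that no column name contains a quoted comma.

```r
# Made-up miniature of the real file: 3 named columns, empty trailing ones
csv_text <- "id,#_Intervals,Coverage,,,,\nA,1,,,,,\nB,2,0.5,,,,"

# Split the header line ourselves to find the last non-empty column name
header <- strsplit(strsplit(csv_text, "\n", fixed = TRUE)[[1]][1],
                   ",", fixed = TRUE)[[1]]
n_real <- max(which(nzchar(header)))

# comment.char = "" keeps the '#' in the header; na.strings = "" turns
# empty cells into NA
dat <- read.csv(text = csv_text, comment.char = "", na.strings = "",
                stringsAsFactors = FALSE)

# Keep only the named columns, dropping those created by trailing commas
dat <- dat[, seq_len(n_real), drop = FALSE]

# Hypothetical helper: collapse runs of non-alphanumerics to "_" and trim
clean_names <- function(x) gsub("^_+|_+$", "", gsub("[^A-Za-z0-9]+", "_", x))
names(dat) <- clean_names(header[seq_len(n_real)])

print(dat)
```

For a file on disk, the same idea would use readLines("filename.csv", n = 1) for the header and the file name in place of text = csv_text.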
I have been fiddling with the first two lines of your file, and the problem appears to be the # in one of your column names. read.table treats # as a comment character by default, so it reads in your header, ignores everything after # and returns 13 columns.
You will be able to read in your file with read.table using the argument comment.char="".
Incidentally, this is yet another reason why those who ask questions should include examples of the files/datasets they are working with.
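A quick way to see the effect of the comment character is count.fields, which reports how many fields R finds on each line. This is a small sketch using a made-up two-line file, not the original data.

```r
# Made-up two-line file whose header contains a '#'
tf <- tempfile(fileext = ".csv")
writeLines(c("a,#b,c", "1,2,3"), tf)

# Default comment.char = "#": the header line is cut off at the '#'
n_default <- count.fields(tf, sep = ",")

# comment.char = "": '#' is ordinary text, so both lines have 3 fields
n_fixed <- count.fields(tf, sep = ",", comment.char = "")

print(n_default)
print(n_fixed)
unlink(tf)
```

If the first count is smaller than the others, the header is being truncated, which is what triggers "more columns than column names".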
> titletool <- read.csv("TotalCSVData.csv", header=FALSE, sep=",")
> class(titletool)
[1] "data.frame"
> titletool[1,1]
[1] Experiment name : CONTROL DB AD_1
> t <- titletool[1,1]
> t
[1] Experiment name : CONTROL DB AD_1
> class(t)
[1] "character"
Now I want to create an object (a vector) named "Experiment name : CONTROL DB AD_1" or, even better if possible, just "CONTROL DB AD_1".
Thank you
Use assign:
varname <- "Experiment name : CONTROL DB AD_1"
assign(varname, 3.14158)
get("Experiment name : CONTROL DB AD_1")
[1] 3.14158
And you can use a regular expression and sub or gsub to remove some text from a string:
cleanVarname <- sub("Experiment name : ", "", varname)
assign(cleanVarname, 42)
get("CONTROL DB AD_1")
[1] 42
But let me warn you this is an unusual thing to do.
Here be dragons.
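One common alternative to assign, offered here as a suggestion rather than as part of the answer above, is to keep such values in a named list, where any string is a valid element name. In this minimal sketch, results is a hypothetical container name.

```r
varname <- "Experiment name : CONTROL DB AD_1"

# Strip the fixed prefix, anchored at the start of the string
cleanVarname <- sub("^Experiment name : ", "", varname)

# Store the value under that name in a list instead of the workspace
results <- list()
results[[cleanVarname]] <- 42

# Retrieve it by name; no backticks or get() needed
results[["CONTROL DB AD_1"]]
```

The list can then be passed to lapply or sapply as a whole, which is harder to do with variables created one at a time by assign.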
If I understand correctly, you have a bunch of CSV files, each with multiple experiments in them, named in the pattern "Experiment ...". You now want to read each of these "experiments" into R in an efficient way.
Here's a not-so-pretty (but not-so-ugly either) function that might get you started in the right direction.
The function reads in the CSV, identifies the line numbers where each new experiment starts, grabs the names of the experiments, then loops to fill a list with the separate data frames. It doesn't bother making "R-friendly" names, and I've left the output in a list because, as Andrie pointed out, "R has great tools for working with lists."
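read.funkyfile = function(funkyfile, expression, ...) {
  # Read the whole file as raw lines
  temp = readLines(funkyfile)
  # Line numbers where each experiment starts, plus a sentinel past the end
  temp.loc = grep(expression, temp)
  temp.loc = c(temp.loc, length(temp) + 1)
  # Experiment names: the matching lines with punctuation stripped
  temp.nam = gsub("[[:punct:]]", "",
                  grep(expression, temp, value = TRUE))
  temp.out = vector("list", length(temp.nam))
  names(temp.out) = temp.nam
  for (i in seq_along(temp.nam)) {
    # Parse the lines between this experiment's title and the next one;
    # text= avoids leaving an open textConnection behind
    temp.out[[i]] = read.csv(
      text = temp[seq(from = temp.loc[i] + 1, to = temp.loc[i + 1] - 1)],
      ...)
  }
  temp.out
}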
Here is an example CSV file. Copy and paste it into a text editor and save it as "funkyfile1.csv" in the current working directory. (Or, read it in from Dropbox: http://dl.dropbox.com/u/2556524/testing/funkyfile1.csv)
"Experiment Name: Here Be",,
1,2,3
4,5,6
7,8,9
"Experiment Name: The Dragons",,
10,11,12
13,14,15
16,17,18
Here is a second CSV. Again, copy-paste and save it as "funkyfile2.csv" in your current working directory. (Or, read it in from Dropbox: http://dl.dropbox.com/u/2556524/testing/funkyfile2.csv)
"Promises: I vow to",,
"H1","H2","H3"
19,20,21
22,23,24
25,26,27
"Promises: Slay the dragon",,
"H1","H2","H3"
28,29,30
31,32,33
34,35,36
Notice that funkyfile1 has no column names, while funkyfile2 does. That's what the ... argument in the function is for: to specify header=TRUE or header=FALSE. Also the "expression" identifying each new set of data is "Promises" in funkyfile2.
Now, use the function:
read.funkyfile("funkyfile1.csv", "Experiment", header=FALSE)
# read.funkyfile("http://dl.dropbox.com/u/2556524/testing/funkyfile1.csv",
# "Experiment", header=FALSE) # Uncomment to load remotely
# $`Experiment Name Here Be`
# V1 V2 V3
# 1 1 2 3
# 2 4 5 6
# 3 7 8 9
#
# $`Experiment Name The Dragons`
# V1 V2 V3
# 1 10 11 12
# 2 13 14 15
# 3 16 17 18
read.funkyfile("funkyfile2.csv", "Promises", header=TRUE)
# read.funkyfile("http://dl.dropbox.com/u/2556524/testing/funkyfile2.csv",
# "Promises", header=TRUE) # Uncomment to load remotely
# $`Promises I vow to`
# H1 H2 H3
# 1 19 20 21
# 2 22 23 24
# 3 25 26 27
#
# $`Promises Slay the dragon`
# H1 H2 H3
# 1 28 29 30
# 2 31 32 33
# 3 34 35 36
Go get those dragons.
Update
If your data are all in the same format, you can use the lapply solution mentioned by Andrie along with this function. Just make a list of the CSVs that you want to load, as below. Note that, as the function is currently written, all the files need to share the same "expression" and other arguments.
temp = list("http://dl.dropbox.com/u/2556524/testing/funkyfile1.csv",
"http://dl.dropbox.com/u/2556524/testing/funkyfile3.csv")
lapply(temp, read.funkyfile, "Experiment", header=FALSE)
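If the files do not share the same "expression" or header setting, one possible workaround is to walk the files and their settings in parallel with Map. The sketch below makes several assumptions: it restates read.funkyfile in compact form so it runs on its own, and it uses two shortened throwaway files written to tempfile() paths instead of the Dropbox URLs.

```r
# Compact restatement of read.funkyfile so this sketch is self-contained
read.funkyfile <- function(funkyfile, expression, ...) {
  temp <- readLines(funkyfile)
  starts <- grep(expression, temp)
  ends <- c(starts[-1] - 1, length(temp))
  out <- lapply(seq_along(starts), function(i) {
    read.csv(text = temp[(starts[i] + 1):ends[i]], ...)
  })
  names(out) <- gsub("[[:punct:]]", "", temp[starts])
  out
}

# Two throwaway files mirroring funkyfile1 and funkyfile2 (shortened)
f1 <- tempfile(fileext = ".csv")
f2 <- tempfile(fileext = ".csv")
writeLines(c('"Experiment Name: Here Be",,', "1,2,3", "4,5,6"), f1)
writeLines(c('"Promises: I vow to",,', '"H1","H2","H3"', "19,20,21"), f2)

# Map() makes one call per file, pairing each file with its own settings
res <- Map(function(f, e, h) read.funkyfile(f, e, header = h),
           c(f1, f2), c("Experiment", "Promises"), c(FALSE, TRUE))
names(res) <- c("funkyfile1", "funkyfile2")

str(res, max.level = 2)
unlink(c(f1, f2))
```

The result is a list of lists: one element per file, each holding one data frame per experiment.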