read.csv appends/modifies column headings with date values

I'm trying to read a CSV file into R that has date values in some of the column headings.
As an example, the data file looks something like this:
ID Type 1/1/2001 2/1/2001 3/1/2001 4/1/2001
A Supply 25 35 45 55
B Demand 26 35 41 22
C Supply 25 35 44 85
D Supply 24 39 45 75
D Demand 26 35 41 22
...and my read.csv logic looks like this:
dat10 <- read.csv("c:/data.csv", header=TRUE, sep=",", as.is=TRUE)
The read.csv works fine, except that it modifies the names of the date columns as follows:
X1.1.2001 X2.1.2001 X3.1.2001 X4.1.2001
Is there a way to prevent this, or an easy way to correct it afterwards?

Set check.names=FALSE. But be aware that 1/1/2001 et al. are syntactically invalid names, so they may cause you some headaches.
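For example, a minimal sketch of that suggestion (using the path from the question):
dat10 <- read.csv("c:/data.csv", header=TRUE, sep=",", as.is=TRUE,
                  check.names=FALSE)
names(dat10)
# [1] "ID"       "Type"     "1/1/2001" "2/1/2001" "3/1/2001" "4/1/2001"
# an invalid name must be backtick-quoted when you use it later:
dat10$`1/1/2001`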

You can always change the column names using the colnames function. For example,
# drop the leading "X" added by make.names(), then turn the dots back into slashes
colnames(dat10) = gsub("\\.", "/", sub("^X", "", colnames(dat10)))
However, having slashes in your column names isn't a particularly good idea. You can always change them just before you print out the table or when you create a graph.


How to modify number of characters in a vector?

So I have this dataset where the respondent's age was an open-ended question, and the responses sometimes look as follows:
Age:
23
45
36 years
27
33yo
...
I would like to keep the numeric data without introducing (and then having to filter out) NAs, and I wondered if there is a way of turning it into this:
Age:
23
45
36
27
33
...
...by restricting the number of characters in the vector and converting them later to numeric.
I believe there is a simple line for this. I just somehow couldn't find it.
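No answer was recorded here; one common approach (a sketch, not from the original thread) is to strip every non-digit character and then convert:
age <- c("23", "45", "36 years", "27", "33yo")   # toy version of the column
as.numeric(gsub("[^0-9]", "", age))
# [1] 23 45 36 27 33
# readr::parse_number(age) is another option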

order() using data frame column name containing spaces produces unexpected results

I was trying to order a data frame in R according to a column named 'credit card usage'. The name of the data frame is mydata. The following command without a comma gives an error
newdata = mydata[order('credit card usage')]
But the following command with a comma works absolutely fine
newdata = mydata[order('credit card usage'),]
I need to understand why we need the comma. Can someone please explain in simple language what's going on behind the scenes?
Also the following command
mydata[order('credit card usage'),]
gives only the first row and not the whole dataframe. Why?
Why mydata[order('credit card usage'),] returns only the first row is the tricky one.
Inside order(), 'credit card usage' is just a character string; it does not refer to the values of the column.
order('credit card usage') therefore sorts a character vector of length one and returns its index, which is 1. Hence:
mydata[order('credit card usage'),] reduces to
mydata[1,]
=> which is the 1st row of mydata.
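You can see that first step directly in the console:
order('credit card usage')
# [1] 1   (ordering a character vector of length one)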
MKR's answer explains why the OP obtained the results described in the post. Here, we'll explain how to deal with the column named credit card usage to correctly sort the entire data frame.
Generally speaking, it's not advisable to use column names containing spaces in R, because it often leads to unexpected results, as the OP experienced.
To use a column in a data frame whose name contains spaces, one can use the [[ form of the extract operator. We'll illustrate with some sample data...
set.seed(95014123)
mydata <- data.frame(matrix(round(runif(100)*1000,0),nrow=50))
names(mydata) <- c("credit card usage","value")
head(mydata)
head(mydata[order(mydata[["credit card usage"]]),])
...and the output:
> set.seed(95014123)
> mydata <- data.frame(matrix(round(runif(100)*1000,0),nrow=50))
> names(mydata) <- c("credit card usage","value")
> head(mydata)
credit card usage value
1 795 217
2 816 613
3 342 323
4 126 751
5 618 780
6 625 529
> head(mydata[order(mydata[["credit card usage"]]),])
credit card usage value
47 25 109
44 81 534
18 91 985
31 99 931
19 109 190
4 126 751
>
One can replace the spaces with underscores via the gsub() function, which will enable one to use the $ form of the extract operator in subsequent functions.
# replace spaces with underscores
names(mydata) <- gsub(" ","_",names(mydata))
head(mydata[order(mydata$credit_card_usage),])
...and the output:
> # replace spaces with underscores
> names(mydata) <- gsub(" ","_",names(mydata))
> head(mydata[order(mydata$credit_card_usage),])
credit_card_usage value
47 25 109
44 81 534
18 91 985
31 99 931
19 109 190
4 126 751
>
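As an aside (not part of the original answers): backticks also let you use the space-containing name directly with $, so before the rename above this would have worked too:
head(mydata[order(mydata$`credit card usage`), ])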

Variant locations sometimes replaced by ID in subsetted large VCF file?

I have a large VCF file from which I want to extract certain columns and information, and have this matched to the variant location. I thought I had this working, but for some variants I am given the ID instead of the corresponding variant location.
My code looks like this:
# see what fields are in this vcf file
scanVcfHeader("file.vcf")
# define parameters for how to filter the vcf file
AN.adj.param <- ScanVcfParam(info="AN_Adj")
# load ALL allele counts (AN) from vcf file
raw.AN.adj. <- readVcf("file.vcf", "hg19", param=AN.adj.param)
# extract ALL allele counts (AN) and corresponding chr locations with allele tags from the vcf file - as a DataFrame/S4 class
sclass.AN.adj <- info(raw.AN.adj.)
The result looks like this:
AN_adj
1:13475_A/T 91
1:14321_G/A 73
rs12345 87
1:15372_A/G 60
1:16174_G/A 41
1:16174_T/C 62
1:16576_G/A 87
rs987654 56
I would like the result to look like this:
AN_adj
1:13475_A/T 91
1:14321_G/A 73
1:14873_C/T 87
1:15372_A/G 60
1:16174_G/A 41
1:16174_T/C 62
1:16576_G/A 87
1:18654_A/T 56
Any ideas on what is going on here and how to fix it?
I would also be happy if there were a way to append the variant location using the CHROM and POS fields, but from my research, data from these fields cannot be requested separately, as they are the essential fields used to create the GRanges of variant locations.
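No answer was recorded for this one, but the likely cause: readVcf() names each variant after the VCF ID column whenever an ID (e.g. an rs number) is present, and only falls back to a chrom:pos_ref/alt string when the ID is missing. One hedged sketch of a fix is to rebuild the names yourself from the GRanges that readVcf() creates (this assumes biallelic sites, taking only the first ALT allele):
# rebuild chrom:pos_ref/alt names from the variant ranges
gr <- rowRanges(raw.AN.adj.)
loc <- paste0(as.character(seqnames(gr)), ":", start(gr), "_",
              as.character(ref(raw.AN.adj.)), "/",
              sapply(alt(raw.AN.adj.), function(a) as.character(a)[1]))
rownames(sclass.AN.adj) <- loc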

R write dataframe column to csv having leading zeroes

I have a table that stores prefixes of different lengths.
A snippet of the table (ClusterTable):
ClusterTable[ClusterTable$FeatureIndex == "Prefix2",
             c('FeatureIndex', 'FeatureValue')]
FeatureIndex FeatureValue
80 Prefix2 80
81 Prefix2 81
30 Prefix2 30
70 Prefix2 70
51 Prefix2 51
84 Prefix2 84
01 Prefix2 01
63 Prefix2 63
28 Prefix2 28
26 Prefix2 26
65 Prefix2 65
75 Prefix2 75
and I write to a csv file using the following:
write.csv(ClusterTable, file = "My_Clusters.csv")
The FeatureValue 01 loses its leading zero.
I first tried converting the column to character:
ClusterTable$FeatureValue <- as.character(ClusterTable$FeatureValue)
and I also tried appending it to an empty string to convert it before writing to the file:
ClusterTable$FeatureValue <- paste("",ClusterTable$FeatureValue)
Also, this table has prefixes of various lengths, so I can't use a simple format specifier of fixed length, i.e. the table also has the values 001 (Prefix3), 0001 (Prefix4), etc.
Thanks
EDIT: As of testing again on 8/5/2021, this doesn't work anymore. :(
I know this is an old question, but I happened upon a solution for keeping the lead zeroes when opening .csv output in excel. Before writing your .csv in R, add an apostrophe at the front of each value like so:
vector <- sapply(vector, function(x) paste0("'", x))
When you open the output in excel, the apostrophe will tell excel to keep all the characters and not drop lead zeroes. At this point you can format the column as "text" and then do a find and replace to remove the apostrophes (maybe make a macro for this).
If you just need it for the visuals, you only need to add one line before you write the csv file, as such:
ClusterTable <- read.table(text=" FeatureIndex FeatureValue
80 Prefix2 80
81 Prefix2 81
30 Prefix2 30
70 Prefix2 70
51 Prefix2 51
84 Prefix2 84
01 Prefix2 01
63 Prefix2 63
28 Prefix2 28
26 Prefix2 26
65 Prefix2 65
75 Prefix2 75",
colClasses=c("character","character"))
ClusterTable$FeatureValue <- paste0(ClusterTable$FeatureValue,"\t")
write.csv(ClusterTable,file="My_Clusters.csv")
It adds a character to the end of the value, but it's hidden in Excel.
Save the file as a csv file, but with a txt extension. Then read it using read.table with sep=",":
write.csv(ClusterTable, file="My_Clusters.txt")
read.table(file="My_Clusters.txt", sep=",", header=TRUE)
If you're trying to open the .csv with Excel, I recommend writing to xlsx instead. First you'll have to pad the data, though.
library(openxlsx)
library(dplyr)
library(stringr)  # str_pad() comes from stringr, not dplyr
ClusterTable <- ClusterTable %>%
  mutate(FeatureValue = as.character(FeatureValue),
         FeatureValue = str_pad(FeatureValue, 2, 'left', '0'))
write.xlsx(ClusterTable, "Filename.xlsx")
This is pretty much the route you can take when exporting from R. It depends on the type of data and the number of records (size of data) you are exporting:
If you have many rows (thousands), txt is the best route. You can export to csv if you know you don't have leading or trailing zeros in the data; otherwise use txt or xlsx format, since exporting to csv will most likely remove the zeros.
If you don't deal with many rows, then the xlsx libraries are the better option.
xlsx libraries may depend on Java, so make sure you use a library that does not require it.
xlsx libraries can be problematic or slow when dealing with many rows, so txt or csv may still be the better route.
For your specific problem, since it seems you don't deal with a large number of rows, you can use:
library(openxlsx)
# read data from an Excel file or Workbook object into a data.frame
df <- read.xlsx('name-of-your-excel-file.xlsx')
# for writing a data.frame or list of data.frames to an xlsx file
write.xlsx(df, 'name-of-your-excel-file.xlsx')
You have to modify your column using format():
format(your_data$your_column, trim = FALSE)
That way the leading zeros are kept when you export to .csv.
When dealing with leading zeros you need to be cautious if exporting to Excel. Excel has a tendency to outsmart itself and automatically trim leading zeros. Your code is fine otherwise, and opening the file in any other text editor should show the zeros.
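A quick way to convince yourself of that (a toy check, not from the original answers): write a small character column and inspect the raw file. write.csv() quotes character values by default, so the zeros survive in the file itself:
df <- data.frame(FeatureValue = c("01", "001", "75"), stringsAsFactors = FALSE)
write.csv(df, "check.csv", row.names = FALSE)
readLines("check.csv")
# [1] "\"FeatureValue\"" "\"01\""  "\"001\"" "\"75\""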

Data.table fread position error [duplicate]

Is this a bug in data.table::fread (version 1.9.2) or misplaced user expectation/error?
Consider this trivial example where I have a table of values, TAB separated, with possibly missing values. If the values are missing in the first column, fread gets upset, but if the missing values are elsewhere I get the data.table I expect:
# Data with missing value in first column, third row and last column, second row:
12 876 19
23 39
15 20
fread("12 876 19
23 39
15 20")
#Error in fread("12\t876\t19\n23\t39\t\n\t15\t20") :
# Not positioned correctly after testing format of header row. ch=' '
# Data with missing values last column, rows two and three:
"12 876 19
23 39
15 20 "
fread( "12 876 19
23 39
15 20 " )
# V1 V2 V3
#1: 12 876 19
#2: 23 39 NA
#3: 15 20 NA
# Returns as expected.
Is this a bug or is it not possible to have missing values in the first column (or do I have malformed data somehow?).
I believe this is the same bug that I reported here.
The most recent version that I know will work with this type of input is Rev. 1180. You could check out and build that version by appending @1180 (the revision) to the end of the svn checkout command:
svn checkout svn://svn.r-forge.r-project.org/svnroot/datatable@1180
If you're not familiar with checking out and building packages, see here
But a lot of great features, bug fixes and enhancements have been implemented since Rev. 1180. (The development version at the time of this writing is Rev. 1272.) So a better solution is to replace the R/fread.R and src/fread.c files with the versions from Rev. 1180 or older, and then re-build the package.
You can find those files online, without checking them out, here:
fread.R:
http://r-forge.r-project.org/scm/viewvc.php/*checkout*/pkg/R/fread.R?revision=988&root=datatable
fread.c:
http://r-forge.r-project.org/scm/viewvc.php/*checkout*/pkg/src/fread.c?revision=1159&root=datatable
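After swapping those two files in, the rebuild is the standard source-package workflow (a sketch; pkg/ is the package directory implied by the viewvc URLs above):
R CMD INSTALL pkg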
Once you've rebuilt the package, you'll be able to read your tsv file.
> fread("12\t876\t19\n23\t39\t\n\t15\t20")
V1 V2 V3
1: 12 876 19
2: 23 39 NA
3: NA 15 20
The downside to doing this is that the old version of fread() does not pass a newer test -- you won't be able to read fields that have quotes in the middle.
> fread('A,B,C\n1.2,Foo"Bar,"a"b\"c"d"\nfo"o,bar,"b,az""\n')
Error in fread("A,B,C\n1.2,Foo\"Bar,\"a\"b\"c\"d\"\nfo\"o,bar,\"b,az\"\"\n") :
Not positioned correctly after testing format of header row. ch=','
With newer versions of fread, you would get this
> fread('A,B,C\n1.2,Foo"Bar,"a"b\"c"d"\nfo"o,bar,"b,az""\n')
A B C
1: 1.2 Foo"Bar a"b"c"d
2: fo"o bar b,az"
So, for now, which version "works" depends on whether you're more likely to have missing values in the first column, or quotes in fields. For me, it's the former, so I'm still using the old code.
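If rebuilding an old revision is more trouble than it's worth, one workaround (not from the original answer) is to fall back to base R for the affected files, since read.delim() copes with the empty first field:
read.delim(text = "12\t876\t19\n23\t39\t\n\t15\t20", header = FALSE)
#   V1  V2 V3
# 1 12 876 19
# 2 23  39 NA
# 3 NA  15 20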
