Variant locations sometimes replaced by ID in subsetted large VCF file?

I have a large VCF file from which I want to extract certain columns and information, matched to each variant's location. I thought I had this working, but for some variants I am given the ID instead of the corresponding variant location.
My code looks like this:
# load the package that provides readVcf() and friends
library(VariantAnnotation)
# see what fields are in this vcf file
scanVcfHeader("file.vcf")
# define parameters on how to filter the vcf file
AN.adj.param <- ScanVcfParam(info="AN_Adj")
# load ALL allele counts (AN) from vcf file
raw.AN.adj. <- readVcf("file.vcf", "hg19", param=AN.adj.param)
# extract ALL allele counts (AN) and corresponding chr location with allele tags from vcf file - in dataframe/s4 class
sclass.AN.adj <- info(raw.AN.adj.)
The result looks like this:
AN_adj
1:13475_A/T 91
1:14321_G/A 73
rs12345 87
1:15372_A/G 60
1:16174_G/A 41
1:16174_T/C 62
1:16576_G/A 87
rs987654 56
I would like the result to look like this:
AN_adj
1:13475_A/T 91
1:14321_G/A 73
1:14873_C/T 87
1:15372_A/G 60
1:16174_G/A 41
1:16174_T/C 62
1:16576_G/A 87
1:18654_A/T 56
Any ideas on what is going on here and how to fix it?
I would also be happy with a way to append the variant location using the CHROM and POS fields, but from my research these fields cannot be requested through ScanVcfParam, as they are essential fields used to create the GRanges of variant locations.
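For what it's worth, readVcf() takes the row names from the VCF's ID field whenever one is present (the rsIDs above) and only falls back to the CHROM:POS_REF/ALT form when ID is missing. Since the positions are still carried in rowRanges(), one workaround (a sketch, reusing the objects from the question) is to rebuild the location labels yourself:
# rebuild "CHROM:POS_REF/ALT" labels for every variant
rr <- rowRanges(raw.AN.adj.)
# collapse multi-allelic ALT entries into a single comma-separated string
alt_chr <- sapply(alt(raw.AN.adj.), function(a) paste(as.character(a), collapse=","))
loc <- paste0(as.character(seqnames(rr)), ":", start(rr), "_", as.character(ref(raw.AN.adj.)), "/", alt_chr)
# replace the mixed ID/location row names with uniform location labels
rownames(sclass.AN.adj) <- loc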

Related

R: recognizing and importing multiple tables from a single Excel file

I have tried to read all the posts like this one, but I did not succeed.
I need to extract tables of different layouts from a single sheet in Excel, for each sheet of the file.
Any help or ideas that can be provided would be greatly appreciated.
A sample of the data file and its structure can be found Here
I would use readxl. The code below reads just one sheet, but it is easy enough to adapt to read multiple or different sheets.
First we just want to read the sheet. Obviously you should change the path to reflect where you saved your file:
library(readxl)
sheet = read_excel("~/Downloads/try.xlsx", col_names = LETTERS[1:12])
If you didn't know you had 12 columns, then using read_excel without specifying the column names would give you enough information to find that out. The different tables in the sheet are separated by one or two blank rows. You can find the blank rows by testing each row to see if all of the cells in that row are NA using the apply function.
blanks = which(apply(sheet, 1, function(row)all(is.na(row))))
blanks
> blanks
[1] 7 8 17 26 35 41 50 59 65 74 80 86 95 98
So you could extract the first table by taking rows 1--6 (7 - 1), the second table by taking rows 9--16 and so on.
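If you would rather automate that step than slice each table out by hand, here is a sketch (reusing the sheet and blanks objects from above) that splits the non-blank rows into one data frame per table:
# group the non-blank rows by how many blank rows precede them,
# then split the sheet into one data frame per group
keep <- setdiff(seq_len(nrow(sheet)), blanks)
group <- cumsum(seq_len(nrow(sheet)) %in% blanks)[keep]
tables <- split(sheet[keep, ], group)
length(tables)   # one element per table in the sheet
Consecutive blank rows (like 7 and 8 above) simply skip a group number; the split itself is unaffected.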

Loop through a list of identifiers to load corresponding excel files?

I can import a column of unique identifiers into R as an object. There is an Excel file with a matching name for each identifier (around 500). I am trying to write a loop to go through all of these unique IDs and load the corresponding Excel file.
What I tried is:
for (i in 1:nrow(pi)){
  read_excel()
}
where "pi" is the object that should hold the vector of unique IDs. I'm not sure how to complete this; alternative methods of solving this problem are welcome.
Update:
Just to clarify, since I don't think I provided adequate examples: I have an Excel column consisting of about 500 unique IDs, each a series of 11 or so numbers. For each ID, I have an Excel file with a matching name, and all the Excel files are in the same folder. For each unique ID, I would like to open the file with the matching name and retrieve specific cells, i.e. the bottom and top values in a particular column, or the maximum or mean of another column.
Since the original post didn't provide data, I will illustrate one technique where we use a vector of id numbers to generate file names and read multiple spreadsheets containing basic Pokémon stats for generations 1-8.
To make the example completely reproducible, I maintain a zip file with this data on GitHub which we can download and load into R.
We will use the sprintf() function to create the file names, because sprintf() lets us not only add the directory information needed to locate the files, but also format the numbers with leading zeroes, which is required to generate the right file names.
Instead of a for() loop we will use lapply() along with an anonymous function to create the file names and read them as Excel files with readxl::read_excel().
download.file("https://raw.githubusercontent.com/lgreski/pokemonData/master/PokemonXLSX.zip",
              "PokemonXLSX.zip",
              method="curl", mode="wb")
unzip("PokemonXLSX.zip",exdir="./pokemonData")
library(readxl)
# create a set of numbers to be used to generate the file names
generationIds <- 1:8
spreadsheets <- lapply(generationIds, function(x) {
  # use generation number to create individual file name
  aFile <- sprintf("./pokemonData/gen%02i.xlsx", x)
  data <- read_excel(aFile)
})
At this point the object spreadsheets is a list with eight elements, one corresponding to each generation of Pokémon (i.e. one element per spreadsheet).
We can combine the eight files with rbind(), and then print the last few rows of the resulting data frame.
pokemon <- do.call(rbind,spreadsheets)
tail(pokemon)
...and the result:
> tail(pokemon)
# A tibble: 6 x 13
ID Name Form Type1 Type2 Total HP Attack Defense Sp..Atk Sp..Def
<dbl> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 895 Regi… NA Drag… NA 580 200 100 50 100 50
2 896 Glas… NA Ice NA 580 100 145 130 65 110
3 897 Spec… NA Ghost NA 580 100 65 60 145 80
4 898 Caly… NA Psyc… Grass 500 100 80 80 80 80
5 898 Caly… Ice … Psyc… Ice 680 100 165 150 85 130
6 898 Caly… Shad… Psyc… Ghost 680 100 85 80 165 100
# … with 2 more variables: Speed <dbl>, Generation <dbl>
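Since the original question also asked for specific cells from each file (first and last values of a column, a maximum, a mean), here is a sketch of that step. The HP and Attack columns are from the Pokémon example; substitute your own column names:
# build a one-row summary per spreadsheet, then stack them
summaries <- lapply(spreadsheets, function(df) {
  data.frame(first_HP    = head(df$HP, 1),
             last_HP     = tail(df$HP, 1),
             max_Attack  = max(df$Attack, na.rm = TRUE),
             mean_Attack = mean(df$Attack, na.rm = TRUE))
})
do.call(rbind, summaries)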
Spotlight: accessing the files from disk
To isolate the downloaded files, we use the exdir= argument on unzip() to write the unzipped files to a subdirectory of the R working directory.
We can access files in this subdirectory by adding ./pokemonData/ to their file names. The . in this syntax references the current directory.
We can illustrate how the filenames are created with the following code.
theFiles <- lapply(generationIds, function(x) {
  # use generation number to create individual file name
  aFile <- sprintf("./pokemonData/gen%02i.xlsx", x)
  message(paste("current file is: ", aFile))
  aFile
})
...and the output:
> theFiles <- lapply(generationIds,function(x) {
+ # use generation number to create individual file name
+ aFile <- sprintf("./pokemonData/gen%02i.xlsx",x)
+ message(paste("current file is: ",aFile))
+ aFile
+ })
current file is: ./pokemonData/gen01.xlsx
current file is: ./pokemonData/gen02.xlsx
current file is: ./pokemonData/gen03.xlsx
current file is: ./pokemonData/gen04.xlsx
current file is: ./pokemonData/gen05.xlsx
current file is: ./pokemonData/gen06.xlsx
current file is: ./pokemonData/gen07.xlsx
current file is: ./pokemonData/gen08.xlsx
One can identify the R working directory from within RStudio with the getwd() function. On my MacBook Pro, I get the following result.
> getwd()
[1] "/Users/lgreski/gitrepos/datascience"
>

R write dataframe column to csv having leading zeroes

I have a table that stores prefixes of different lengths.
A snippet of the table (ClusterTable):
ClusterTable[ClusterTable$FeatureIndex == "Prefix2", c('FeatureIndex', 'FeatureValue')]
FeatureIndex FeatureValue
80 Prefix2 80
81 Prefix2 81
30 Prefix2 30
70 Prefix2 70
51 Prefix2 51
84 Prefix2 84
01 Prefix2 01
63 Prefix2 63
28 Prefix2 28
26 Prefix2 26
65 Prefix2 65
75 Prefix2 75
and I write to a csv file using the following:
write.csv(ClusterTable, file = "My_Clusters.csv")
The FeatureValue 01 loses its leading zero.
I tried first converting the column to characters
ClusterTable$FeatureValue <- as.character(ClusterTable$FeatureValue)
and I also tried appending it to an empty string to convert it to a string before writing to the file.
ClusterTable$FeatureValue <- paste("",ClusterTable$FeatureValue)
Also, I have prefixes of various lengths in this table, so I can't use a simple format specifier of a fixed length; i.e. the table also has the values 001 (Prefix3), 0001 (Prefix4), etc.
Thanks
EDIT: As of testing again on 8/5/2021, this doesn't work anymore. :(
I know this is an old question, but I happened upon a solution for keeping the leading zeroes when opening .csv output in Excel. Before writing your .csv in R, add an apostrophe at the front of each value like so:
vector <- sapply(vector, function(x) paste0("'", x))
When you open the output in excel, the apostrophe will tell excel to keep all the characters and not drop lead zeroes. At this point you can format the column as "text" and then do a find and replace to remove the apostrophes (maybe make a macro for this).
If you just need it for the visual, you only need to add one line before you write the csv file, as such:
ClusterTable <- read.table(text=" FeatureIndex FeatureValue
80 Prefix2 80
81 Prefix2 81
30 Prefix2 30
70 Prefix2 70
51 Prefix2 51
84 Prefix2 84
01 Prefix2 01
63 Prefix2 63
28 Prefix2 28
26 Prefix2 26
65 Prefix2 65
75 Prefix2 75",
colClasses=c("character","character"))
ClusterTable$FeatureValue <- paste0(ClusterTable$FeatureValue,"\t")
write.csv(ClusterTable,file="My_Clusters.csv")
It adds a character to the end of the value, but it's hidden in Excel.
Save the file as a csv file, but with a txt extension. Then read it using read.table with sep=",":
write.csv(ClusterTable,file="My_Clusters.txt")
read.table(file="My_Clusters.txt", sep=",")
If you're trying to open the .csv with Excel, I recommend writing to Excel instead. First you'll have to pad the data, though.
library(openxlsx)
library(dplyr)
library(stringr)   # provides str_pad()
ClusterTable <- ClusterTable %>%
  mutate(FeatureValue = as.character(FeatureValue),
         FeatureValue = str_pad(FeatureValue, 2, 'left', '0'))
write.xlsx(ClusterTable, "Filename.xlsx")
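Since the question mentions prefixes of varying lengths (01, 001, 0001, ...), a fixed width of 2 won't cover every row. Here is a sketch that derives the pad width from the prefix name itself, assuming FeatureIndex values like "Prefix2" and "Prefix3" encode the intended width:
# take the digits after "Prefix" as the target width for each row
width <- as.integer(sub("Prefix", "", ClusterTable$FeatureIndex))
# sprintf's "*" consumes the width argument per element, restoring the zeroes
ClusterTable$FeatureValue <- sprintf("%0*d", width, as.integer(ClusterTable$FeatureValue))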
These are pretty much the routes you can take when exporting from R; which is best depends on the type of data and the number of records (size of data) you are exporting:
if you have many rows (thousands), txt is the best route; only export to csv if you know you don't have leading or trailing zeros in the data, otherwise use txt or xlsx format. Exporting to csv will most likely remove the zeros.
if you don't deal with many rows, then xlsx libraries are better
xlsx libraries may depend on Java, so make sure you use a library that does not require it
xlsx libraries are either problematic or slow when dealing with many rows, so txt or csv can still be a better route
For your specific problem, it seems you don't deal with a large number of rows, so you can use:
library(openxlsx)
# read data from an Excel file or Workbook object into a data.frame
df <- read.xlsx('name-of-your-excel-file.xlsx')
# for writing a data.frame or list of data.frames to an xlsx file
write.xlsx(df, 'name-of-your-excel-file.xlsx')
You have to modify your column using format:
format(your_data$your_column, trim = F)
That way the leading zeros will be kept when you export to .csv.
When dealing with leading zeros you need to be cautious if exporting to Excel. Excel has a tendency to outsmart itself and automatically trim leading zeros. Your code is fine otherwise, and opening the file in any other text editor should show the zeros.

Collecting data in one row from different csv files by name

It's hard to explain what exactly I want to achieve with my script but let me try.
I have 20 different csv files, so I loaded them into R:
tbl = list.files(pattern="*.csv")
list_of_data = lapply(tbl, read.csv)
then with your help I combined them into one and removed all of the duplicates:
data_rd <- subset(transform(all_data, X = sub("\\..*", "", X)),
!duplicated(X))
I now have one master table which includes all of the names (Accession):
Accession
AT1G19570
AT5G38480
AT1G07370
AT4G23670
AT5G10450
AT4G09000
AT1G22300
AT1G16080
AT1G78300
AT2G29570
Now I would like to find each accession in the other csv files and put the data for that accession in the same row. There are about 20 csv files, each with about 20 columns, so in some cases it might give me 400 columns. It doesn't matter how long it takes; it has to be done. Is it even possible to do with R?
Example:
First csv Second csv Third csv
Accession Size Length Weight Size Length Weight Size Length Weight
AT1G19570 12 23 43 22 77 666 656 565 33
AT5G38480
AT1G07370 33 22 33 34 22
AT4G23670
AT5G10450
AT4G09000 12 45 32
AT1G22300
AT1G16080
AT1G78300 44 22 222
AT2G29570
It looks like a hard task to do. It probably has to be done with a loop. Any ideas?
This is a merge loop. Here is rough R code that will inefficiently grow with each merge.
Begin as before:
tbl = list.files(pattern="*.csv")
list_of_data = lapply(tbl, read.csv)
tbl = list_of_data[[1]]
for(i in 2:length(list_of_data))
{
  tbl = merge(tbl, list_of_data[[i]], by="Accession", all=T)
}
The matching column names (those not used as the key) will be renamed with a suffix (.x, .y, and so on), and the all=T argument ensures that whenever a new Accession key is merged, a new row is made and the missing cells are filled with NA.
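The same merge can be written more compactly with Reduce(), which folds merge() over the list without an explicit loop:
# equivalent to the loop above: successively merge every data frame by Accession
tbl = Reduce(function(x, y) merge(x, y, by="Accession", all=T), list_of_data)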

read.csv appends/modifies column headings with date values

I'm trying to read a csv file into R that has date values in some of the column headings.
As an example, the data file looks something like this:
ID Type 1/1/2001 2/1/2001 3/1/2001 4/1/2001
A Supply 25 35 45 55
B Demand 26 35 41 22
C Supply 25 35 44 85
D Supply 24 39 45 75
D Demand 26 35 41 22
...and my read.csv logic looks like this
dat10 <- read.csv("c:/data.csv", header=TRUE, sep=",", as.is=TRUE)
The read.csv works fine, except it modifies the names of the columns with dates as follows:
X1.1.2001 X2.1.2001 X3.1.2001 X4.1.2001
Is there a way to prevent this, or a easy way to correct afterwards?
Set check.names=FALSE. But be aware that 1/1/2001 et al are syntactically invalid names, therefore they may cause you some headaches.
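For example, reusing the read.csv call from the question:
dat10 <- read.csv("c:/data.csv", header=TRUE, sep=",", as.is=TRUE, check.names=FALSE)
names(dat10)        # "ID" "Type" "1/1/2001" "2/1/2001" ...
dat10$`1/1/2001`    # non-syntactic names must be quoted with backticks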
You can always change the column names using the colnames function. For example,
colnames(dat10) = sub("^X", "", gsub("\\.", "/", colnames(dat10)))  # drop the added "X" and restore the slashes
However, having slashes in your column names isn't a particularly good idea. You can always change them just before you print out the table or when you create a graph.
