I have tried to follow other posts like this, but without success.
I need to extract tables with different layouts from a single sheet in Excel, and to do this for each sheet of the file.
Any help or ideas that can be provided would be greatly appreciated.
A sample of the data file and its structure can be found here.
I would use readxl. The code below reads just one sheet, but it is easy enough to adapt to read multiple or different sheets.
First we just want to read the sheet. Obviously you should change the path to reflect where you saved your file:
library(readxl)
sheet = read_excel("~/Downloads/try.xlsx", col_names = LETTERS[1:12])
If you didn't know you had 12 columns, running read_excel without specifying the column names would give you enough information to find that out. The different tables in the sheet are separated by one or two blank rows. You can find the blank rows by testing each row to see whether all of its cells are NA, using apply.
blanks = which(apply(sheet, 1, function(row)all(is.na(row))))
> blanks
 [1]  7  8 17 26 35 41 50 59 65 74 80 86 95 98
So you could extract the first table by taking rows 1 to 6 (everything before the first blank row at 7), the second table by taking rows 9 to 16, and so on.
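If you want to automate that slicing, here is a minimal sketch (it assumes the sheet and blanks objects from above; the breaks and tables names are just illustrative):

breaks <- c(0, blanks, nrow(sheet) + 1)
tables <- lapply(seq_len(length(breaks) - 1), function(i) {
  first <- breaks[i] + 1
  last  <- breaks[i + 1] - 1
  if (first > last) return(NULL)    # two consecutive blank rows: nothing to extract here
  sheet[first:last, , drop = FALSE]
})
tables <- Filter(Negate(is.null), tables)    # one data frame per table in the sheet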
I want to create a topic model from data provided by JSTOR (e.g. https://www.jstor.org/dfr/about/sample-datasets). However, because of copyright, they do not allow full-text access. Instead, I can request a list of unigrams together with their frequencies in the document (supplied as plain .txt), e.g.:
his 295
old 181
he 165
age 152
p 110
from 79
life 74
de 71
petrarch 58
book 51
courtier 47
This should be easy to convert to a bag-of-words vector. However, I have only found examples of Gensim LDA models being built from full text. Would it be possible to pass these vectors to it instead?
Yes. You only need to convert each (word, frequency) pair to (word_id, frequency) and pass a list of such tuples as the corpus to any gensim model. To map words to ids, first count how many distinct words appear in the whole corpus; if there are V of them, each word can be represented by an integer id (gensim's own Dictionary assigns ids from 0 to V-1).
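A minimal sketch of that conversion, assuming one unigram-count .txt file per document (the file names and the num_topics value are placeholders):

from gensim.corpora import Dictionary
from gensim.models import LdaModel

def read_counts(path):
    # parse lines like "petrarch 58" into (word, count) pairs
    with open(path) as fh:
        for line in fh:
            word, count = line.split()
            yield word, int(count)

docs = ["doc1.txt", "doc2.txt"]    # your per-document unigram files
counts = [dict(read_counts(p)) for p in docs]

# build the word -> id mapping from the vocabularies alone
dictionary = Dictionary([list(c.keys()) for c in counts])

# each document becomes a list of (word_id, frequency) tuples -- the same
# bag-of-words format that doc2bow would produce from full text
corpus = [[(dictionary.token2id[w], n) for w, n in c.items()] for c in counts]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=10)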
I have a large VCF file from which I want to extract certain columns and information, and have this matched to the variant location. I thought I had this working, but for some variants I am given the ID instead of the corresponding variant location.
My code looks like this:
library(VariantAnnotation)
# see what fields are in this vcf file
scanVcfHeader("file.vcf")
# define parameters on how to filter the vcf file
AN.adj.param <- ScanVcfParam(info="AN_Adj")
# load ALL allele counts (AN) from vcf file
raw.AN.adj. <- readVcf("file.vcf", "hg19", param=AN.adj.param)
# extract ALL allele counts (AN) and corresponding chr location with allele tags from vcf file - in dataframe/s4 class
sclass.AN.adj <- info(raw.AN.adj.)
The result looks like this:
AN_adj
1:13475_A/T 91
1:14321_G/A 73
rs12345 87
1:15372_A/G 60
1:16174_G/A 41
1:16174_T/C 62
1:16576_G/A 87
rs987654 56
I would like the result to look like this:
AN_adj
1:13475_A/T 91
1:14321_G/A 73
1:14873_C/T 87
1:15372_A/G 60
1:16174_G/A 41
1:16174_T/C 62
1:16576_G/A 87
1:18654_A/T 56
Any ideas on what is going on here and how to fix it?
I would also be happy with a way to append the variant location using the CHROM and POS fields, but from my research, data from these fields cannot be requested on their own, as they are essential fields used to create the GRanges of variant locations.
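For what it's worth, here is a minimal, unverified sketch of one way around that (it assumes biallelic sites, as in the output above): CHROM and POS cannot be requested through ScanVcfParam(), but they are still carried in the GRanges returned by rowRanges(), so a location key can be rebuilt from there and swapped in for the rsID row names.

rr <- rowRanges(raw.AN.adj.)    # GRanges of every variant that was read
# rebuild a chr:pos_ref/alt key for each variant, whether or not it has an rsID
loc <- paste0(seqnames(rr), ":", start(rr), "_",
              as.character(ref(raw.AN.adj.)), "/",
              as.character(unlist(alt(raw.AN.adj.))))
sclass.AN.adj <- info(raw.AN.adj.)
rownames(sclass.AN.adj) <- loc    # rows keyed by location instead of rsID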
I apologize in advance for the lack of reproducibility here. I am doing an analysis on a very large (for me) dataset from the CMS Open Payments database.
There are four files I downloaded from that website, read into R using readr, manipulated a bit to make them smaller (column removal), and then stuck together using rbind. I would like to write my pared-down file out to an external hard drive so that I don't have to read in all the data and redo the paring each time I want to work on it. (Obviously, it's all scripted, but it takes about 45 minutes, so I'd like to avoid it if possible.)
So I wrote out the data and read it back in, but now I am getting different results. Below is about as close as I can get to a good example. The data is named sa_all. There is a column in the table for the source, which can only take on two values: gen or res. It is a column that is actually added as part of the analysis, not one that comes in the data.
table(sa_all$src)
gen res
14837291 822559
So I save the sa_all dataframe into a CSV file.
write.csv(sa_all, 'D:\\Open_Payments\\data\\written_files\\sa_all.csv',
row.names = FALSE)
Then I open it:
sa_all2 <- read_csv('D:\\Open_Payments\\data\\written_files\\sa_all.csv')
table(sa_all2$src)
g gen res
1 14837289 822559
I did receive the following parsing warnings.
Warning: 4 parsing failures.
row col expected actual
5454739 pmt_nature embedded null
7849361 src delimiter or quote 2
7849361 src embedded null
7849361 NA 28 columns 54 columns
Since I manually add the src column and it can only take on two values, I don't see how this could cause any parsing errors.
Has anyone had any similar problems using readr? Thank you.
Just to follow up on the comment:
write_csv(sa_all, 'D:\\Open_Payments\\data\\written_files\\sa_all.csv')
sa_all2a <- read_csv('D:\\Open_Payments\\data\\written_files\\sa_all.csv')
Warning: 83 parsing failures.
row col expected actual
1535657 drug2 embedded null
1535657 NA 28 columns 25 columns
1535748 drug1 embedded null
1535748 year an integer No
1535748 NA 28 columns 27 columns
Even more parsing errors, and it looks like some columns are getting shuffled entirely:
table(sa_all2a$src)
100000000278 Allergan Inc. gen GlaxoSmithKline, LLC.
1 1 14837267 1
No res
1 822559
There are columns for manufacturer names and it looks like those are leaking into the src column when I use the write_csv function.
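In case it helps, a minimal sketch of a workaround (the path is a placeholder): a binary round-trip with saveRDS()/readRDS() avoids re-parsing entirely, so embedded nulls, quotes, or stray delimiters in the text fields cannot shift columns the way a CSV re-read can.

saveRDS(sa_all, "D:/Open_Payments/data/written_files/sa_all.rds")
sa_all2 <- readRDS("D:/Open_Payments/data/written_files/sa_all.rds")
table(sa_all2$src)    # should match the original gen/res counts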
I have a table that stores prefixes of different lengths.
A snippet of the table (ClusterTable):
ClusterTable[ClusterTable$FeatureIndex == "Prefix2",
             c('FeatureIndex', 'FeatureValue')]
FeatureIndex FeatureValue
80 Prefix2 80
81 Prefix2 81
30 Prefix2 30
70 Prefix2 70
51 Prefix2 51
84 Prefix2 84
01 Prefix2 01
63 Prefix2 63
28 Prefix2 28
26 Prefix2 26
65 Prefix2 65
75 Prefix2 75
and I write it to a csv file using the following:
write.csv(ClusterTable, file = "My_Clusters.csv")
The FeatureValue 01 loses its leading zero.
I first tried converting the column to character:
ClusterTable$FeatureValue <- as.character(ClusterTable$FeatureValue)
and I also tried appending it to an empty string, to coerce it to a string before writing to the file:
ClusterTable$FeatureValue <- paste("",ClusterTable$FeatureValue)
Also, the table has prefixes of various lengths, so I can't use a simple fixed-width format specifier; i.e. the table also has the values 001 (Prefix3), 0001 (Prefix4), etc.
Thanks
EDIT: As of testing again on 8/5/2021, this doesn't work anymore. :(
I know this is an old question, but I happened upon a solution for keeping the leading zeroes when opening .csv output in Excel. Before writing your .csv in R, add an apostrophe at the front of each value, like so:
vector <- sapply(vector, function(x) paste0("'", x))
When you open the output in Excel, the apostrophe tells Excel to keep all the characters and not drop leading zeroes. At this point you can format the column as "text" and then do a find-and-replace to remove the apostrophes (maybe make a macro for this).
If you just need it for the visuals, you only need to add one line before you write the csv file, like this:
ClusterTable <- read.table(text=" FeatureIndex FeatureValue
80 Prefix2 80
81 Prefix2 81
30 Prefix2 30
70 Prefix2 70
51 Prefix2 51
84 Prefix2 84
01 Prefix2 01
63 Prefix2 63
28 Prefix2 28
26 Prefix2 26
65 Prefix2 65
75 Prefix2 75",
colClasses=c("character","character"))
ClusterTable$FeatureValue <- paste0(ClusterTable$FeatureValue,"\t")
write.csv(ClusterTable,file="My_Clusters.csv")
It adds a character to the end of the value, but it's hidden in Excel.
Save the file as a csv file, but with a txt extension, then read it back using read.table with sep=",":
write.csv(ClusterTable, file = "My_Clusters.txt")
read.table(file = "My_Clusters.txt", sep = ",", header = TRUE, colClasses = "character")
If you're trying to open the .csv with Excel, I recommend writing to Excel instead. First you'll have to pad the data, though.
library(openxlsx)
library(dplyr)
library(stringr)
ClusterTable <- ClusterTable %>%
  mutate(FeatureValue = as.character(FeatureValue),
         FeatureValue = str_pad(FeatureValue, 2, 'left', '0'))
write.xlsx(ClusterTable, "Filename.xlsx")
These are pretty much the routes you can take when exporting from R; which one is best depends on the type of data and the number of records (the size of the data) you are exporting:
if you have many rows (e.g. thousands), txt is the best route; you can export to csv if you know you don't have leading or trailing zeros in the data, otherwise use txt or xlsx, since exporting to csv will most likely remove the zeros
if you don't deal with many rows, then the xlsx libraries are better
xlsx libraries may depend on Java, so make sure you use one that does not require it
xlsx libraries are either problematic or slow when dealing with many rows, so txt or csv can still be a better route
for your specific problem, it seems you don't deal with a large number of rows, so you can use:
library(openxlsx)
# read data from an Excel file or Workbook object into a data.frame
df <- read.xlsx('name-of-your-excel-file.xlsx')
# for writing a data.frame or list of data.frames to an xlsx file
write.xlsx(df, 'name-of-your-excel-file.xlsx')
You have to modify your column using format:
format(your_data$your_column, trim = FALSE)
So when you export to .csv, the leading zeros will be kept.
When dealing with leading zeros you need to be cautious if exporting to Excel. Excel has a tendency to outsmart itself and automatically trim leading zeros. Your code is fine otherwise, and opening the file in any other text editor should show the zeros.
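For example, a quick way to confirm that without leaving R (assuming the My_Clusters.csv file written above):

readLines("My_Clusters.csv", n = 5)    # inspect the raw text instead of opening it in Excel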