It is known that Excel sheets can display a maximum of about 1 million rows. Is there any row limit for CSV data, i.e. does Excel allow more than 1 million rows in CSV format?
One more question about this 1 million limitation: can Excel hold more than 1 million data rows, even though it only displays a maximum of 1 million?
CSV files have no limit on the number of rows you can add to them. Excel won't hold more than the 1 million lines of data if you import a CSV file that has more lines.
Excel will actually ask you whether you want to proceed when you import more than 1 million data rows. It suggests importing the remaining data by running the Text Import Wizard again - you will need to set the appropriate line offset.
As far as I remember, Excel (versions >= 2007) is limited to 2^20 = 1,048,576 rows.
CSV is not bound by this limit, since it is just an ordinary text file, so you have to be careful when transferring data between the two formats.
Another option, if it is a text file such as a CSV file, is to import it with the Excel Text Import Wizard, which lets you specify from which row number to which row number to import. See: This link
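An alternative to repeatedly running the import wizard with different offsets is to split the CSV into pieces that each fit under the row limit before opening them in Excel. A rough R sketch, assuming a file named big.csv with a header row that should be repeated in every piece (the file name and the header assumption are both hypothetical):

chunk_size <- 1048575                      # 2^20 rows minus 1 for the repeated header
con <- file("big.csv", open = "r")
header <- readLines(con, n = 1)
part <- 1
repeat {
  rows <- readLines(con, n = chunk_size)
  if (length(rows) == 0) break
  writeLines(c(header, rows), sprintf("big_part%02d.csv", part))
  part <- part + 1
}
close(con)

Each big_part*.csv then stays below Excel's row limit and can be opened directly.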
Related
I would need to analyze a dataframe in R (bash or even Python if you have suggestions, but I don't know how to use Python well). The dataframe has approximately 6 billion rows and 8 columns (control1, treaty1, control2, treaty2, control3, treaty3, control4, treaty4).
Since it is a file of almost 300 GB and 6 billion lines, I cannot open it in R.
I would need to read the file line by line and remove the lines that contain even a single 0.
How could I do this?
If I also needed to divide each value inside a column by a number, and put the result in a new dataframe otherwise equal to the starting one, how could I do that?
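Not a definitive answer, but one way to avoid loading the whole 300 GB file is to stream it through R in chunks using a file connection, keeping only the lines you want. A minimal sketch, assuming a tab-separated file called data.txt with a header row (the file name, separator and chunk size are all assumptions, and the pattern only catches fields that are exactly "0", not "0.0"):

con_in  <- file("data.txt", open = "r")
con_out <- file("filtered.txt", open = "w")
writeLines(readLines(con_in, n = 1), con_out)   # copy the header
repeat {
  lines <- readLines(con_in, n = 1e6)           # read 1 million lines at a time
  if (length(lines) == 0) break
  keep <- !grepl("(^|\t)0(\t|$)", lines)        # drop lines containing a 0 field
  writeLines(lines[keep], con_out)
}
close(con_in); close(con_out)

The same loop could parse each chunk with read.table(text = ...), divide a column by a constant, and write the modified chunk back out with write.table().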
I have several Excel files (*.xlsx) and I want to import them into R, but each file has 6 to 7 tables in a single sheet, separated by chunks of text, as in the picture.
I know how to import several Excel files using a loop, but my issue is that I cannot figure out how to select each of the tables distributed along each sheet, avoiding the rows with text, and bind them. Also, each table in each Excel file starts in a different cell, so I cannot just define a coordinate (a specific cell) from which to import the tables. Every Excel file has a different number of rows. I'll appreciate any help.
For instance, the above picture is about Maryland (a US state), and I want to transform that into what is presented in the following picture:
This is a toy file for anyone able to help me: LINK
Thanks!
Based on the image of the data you showed, it seems that all rows where the second column has an NA can be removed? In that case, subsetting in base R is pretty straightforward:
test <- test[!is.na(test[,2]),]
Quick explanation:
test[ ,2] --> evaluate all rows in column 2
is.na(test[ ,2]) --> return TRUE if cell is NA
!is.na(test[ ,2]) --> return FALSE if cell is NA
test[!is.na(test[,2]),] --> all rows of test dataframe where cell in col 2 is not NA
Again, based on the data you showed this should work. But it is hard to work out without true sample data.
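For the multi-file part of the question, a rough sketch of the same idea, assuming the readxl package, a folder named xlsx/ holding the files, data on the first sheet of each file, and the same number of columns in every file (all assumptions):

library(readxl)

files <- list.files("xlsx", pattern = "\\.xlsx$", full.names = TRUE)

read_one <- function(path) {
  raw <- as.data.frame(read_excel(path, sheet = 1, col_names = FALSE))
  raw[!is.na(raw[, 2]), ]   # keep only rows that look like data
}

combined <- do.call(rbind, lapply(files, read_one))

Reading with col_names = FALSE treats everything as ordinary rows; depending on the files, you may still need an extra filter for repeated sub-table header rows.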
I have data from Excel that should be grouped data (the part I highlight in the picture). The problem is, when I import it into R, it won't consider those data as grouped. How can I fix the problem?
As I can see from the image you provided, they are 3 separate columns, so when you import the Excel file into RStudio, it will treat them as 3 different columns. However, if you want to unite the 3 columns into 1, there are solutions for that as well.
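One hedged example, using tidyr::unite and made-up column names (col1, col2, col3) standing in for the three imported columns:

library(tidyr)

df <- data.frame(col1 = c("A", "B"), col2 = c("x", "y"), col3 = c(1, 2))
df_united <- unite(df, col = "combined", col1, col2, col3, sep = " ")
df_united
#   combined
# 1    A x 1
# 2    B y 2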
Sorry if this is difficult to understand - I don't have enough karma to add a picture, so I will do the best I can to describe this! I am using the XLConnect package within R to read from and write to Excel spreadsheets.
I am working on a project in which I am trying to take columns of data out of many workbooks and concatenate them into rows of a new workbook based on which workbook they came from (each workbook is data from a consecutive business day). The snag is that the data I seek is only a small part (10 rows x 3 columns) of each workbook/worksheet and is not always located in the same place within the worksheet, due to sloppiness on the part of the person who originally created the spreadsheets (e.g. I can't just start at cell A2, because the dataset that starts at A2 in one workbook might start at B12 or C3 in another workbook).
I am wondering if it is possible to search for a cell based on its contents (e.g. a cell containing the title "Table of Arb Prices") and return either the index or reference formula to be able to access that cell.
I am also wondering whether, once I reference that cell based on its contents, there is a way to adjust that reference to get to where I know another cell is relative to it. For example, if a cell with known contents is always located 2 rows above and 3 columns to the left of the cell where I wish to start collecting data, is it possible for me to take that first reference and increment it by 2 rows and 3 columns to get the reference for the cell I want?
Thanks for any help and please advise me if you need further information to be able to understand my questions!
You can just read the entire worksheet in as a matrix with something like
library(XLConnect)
demoExcelFile <- system.file("demoFiles/mtcars.xlsx", package = "XLConnect")
mm <- as.matrix(readWorksheetFromFile(demoExcelFile, sheet=1))
class(mm) <- "character"  # convert all values to character
Then you can search for values and get the row/column:
which(mm=="3.435", arr.ind=T)
# row col
# [1,] 23 6
Then you can offset those indices and extract values from the matrix however you like. In the end, when you know where you want to read from, you can convert to a cleaner data frame with
read.table(text=apply(mm[25:27, 6:8],1,paste, collapse="\t"), sep="\t")
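For the offset part of the question, here is a sketch continuing from the matrix above; the anchor text "Table of Arb Prices" comes from the question (it is not in the demo file), and the 2-row/3-column offsets and the 10 x 3 block size are taken from the question as assumptions:

loc <- which(mm == "Table of Arb Prices", arr.ind = TRUE)
r0 <- loc[1, "row"] + 2                 # data starts 2 rows below the anchor cell
c0 <- loc[1, "col"] + 3                 # and 3 columns to its right
block <- mm[r0:(r0 + 9), c0:(c0 + 2)]   # the 10 x 3 block of interest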
Hopefully that gives you a general idea of something you can try. It's hard to be more specific without knowing exactly what your input data looks like.
My R application reads input data from large txt files. It does not read the entire file in one shot. Users specify the names of the genes (3 or 4 at a time) and, based on the user input, the app goes to the appropriate rows and reads the data.
File format: 32,000 rows (one gene per row; the first two columns contain info such as the gene name, etc.) and 35,000 columns with numerical data (decimal numbers).
I used read.table(filename, skip = 10000) etc. to go to the right row, then read the 35,000 columns of data. Then I do this again for the 2nd gene and 3rd gene (up to 4 genes max) and then process the numerical results.
The file reading operations take about 1.5 to 2.0 minutes. I am experimenting with reading the entire file and then taking the data for the desired genes.
Is there any way to accelerate this? I can rewrite the gene data in another format (one-time processing) if that will accelerate reading operations in the future.
You can use the colClasses argument to read.table to speed things up, if you know the exact format of your files. For 2 character columns and 34,998 (?) numeric columns, you would use
colClasses = c(rep("character",2), rep("numeric",34998))
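As a hedged illustration of how that hint combines with the skip/nrows approach already used in the question (the file name and tab separator are assumptions):

cc <- c(rep("character", 2), rep("numeric", 34998))
gene_row <- read.table("genes.txt", sep = "\t", skip = 10000, nrows = 1,
                       colClasses = cc)

If the one-time reformatting mentioned in the question is acceptable, reading the whole table once with these colClasses and saving the result with saveRDS() usually makes later lookups much faster than re-parsing the text file each time.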
This would be more efficient if you used a database interface. There are several available via the RODBC package, but a particularly well-integrated-with-R option would be the sqldf package, which by default uses SQLite. You would then be able to use the indexing capacity of the database to look up the correct rows and read all the columns in one operation.