I am generating a .xls file, in which there is a column containing formulas like these:
IF(A1=1; 'xxx'; IF(A1=2; 'yyy'; IF(A1=3; 'zzz' ...
You know, it would be a SWITCH statement if it weren't an Excel formula.
The problem is that, depending on how many IFs I use, the time it takes to generate the .xls file grows exponentially.
The file size is not much different.
I have 18 cases, which means 18 IFs, and that takes an unacceptable amount of time.
Why is that so? Is there anything I might be doing wrong?
Here is a sample code:
for ($k = 1; $k < 16; $k++) {
    $cellID = "A" . ($row + $k);
    $codes_if = '=IF(' . $cellID . '="1",4579,'
        . 'IF(' . $cellID . '="2",7978,'
        ... // some more IFs
        . '""))))))))))))))))';
    $actSheet->setCellValue("B" . ($row + $k), $codes_if);
}
A formula like this, with multiple nested IFs, will be inefficient anyway.
Consider replacing the nested IFs with VLOOKUP instead, e.g.
=VLOOKUP(A1,E1:F3,2,TRUE)
where column E contains the lookup values 1, 2, 3, ... and column F contains the return values xxx, yyy, zzz, etc.
That is the equivalent of a switch statement in MS Excel.
You will find that VLOOKUP is a lot more efficient than your nested IFs.
To reduce the time it takes to save the file further, be aware that PHPExcel calculates all formulae before saving by default. You can change that behaviour by calling
$objWriter->setPreCalculateFormulas(false);
before calling the save() method.
I have some very large CSV files (~183 million rows by 8 columns) that I want to load into a database using R. I use duckdb for this and its built-in function duckdb_read_csv, which is supposed to auto-detect the data type of each column. If I run the following code:
con <- dbConnect(duckdb::duckdb(), dbdir = "testdata.duckdb", read_only = FALSE)
duckdb_read_csv(con, "d15072021", "mydata.csv", header = TRUE)
It produces this error:
Error: rapi_execute: Failed to run query
Error: Invalid Input Error: Could not convert string '2' to BOOL between line 12492801 and 12493825 in column 9. Parser options: DELIMITER=',', QUOTE='"', ESCAPE='"' (default), HEADER=1, SAMPLE_SIZE=10240, IGNORE_ERRORS=0, ALL_VARCHAR=0
I've looked at the rows in question and I can't find any irregularities in column 9. Unfortunately, I cannot post the dataset because it's confidential. But the entire column is filled with either FALSE or TRUE.
If I set the parameter nrow.check to something larger than 12493825 it doesn't produce the same error but takes very long and simply converts the column to VARCHAR instead of a logical. Setting nrow.check to -1 (meaning it checks every row for a pattern) crashes R and my PC completely.
The weird thing: This isn't consistent. Earlier I imported the dataset whilst keeping the default value for nrow.check at 500 and it read the file with no issue (though still converting column 9 to VARCHAR). I have to read a lot of files that are the same pattern so I need to have a reliable way of reading them. Anyone know how duckdb_read_csv actually works and why I might get this error?
Note that reading the files into memory and then into a database isn't an option because I run out of memory instantly.
The sniffer works by sampling nrow.check rows to figure out each column's data type, so the result can differ between runs if you get unlucky. Increasing nrow.check reduces the chance of failure, mainly because the sniffer looks at more rows.
If increasing the number of rows is not an option for performance reasons, you can of course define the schema of the CSV file up front. But then you must know the schema beforehand.
As an example of how to define the schema and turn off the sniffer:
SELECT * FROM read_csv('test.csv', COLUMNS=STRUCT_PACK(a := 'INTEGER', b := 'INTEGER'), auto_detect='false')
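If you want to drive the same query from R, a minimal sketch (assuming the DBI and duckdb packages are installed, and with hypothetical column names a and b standing in for your real ones) could look like this:

library(DBI)
library(duckdb)

con <- dbConnect(duckdb::duckdb(), dbdir = "testdata.duckdb")

# Declare the column types up front so the sniffer is never consulted
dbExecute(con, "
  CREATE TABLE d15072021 AS
  SELECT * FROM read_csv('mydata.csv',
                         COLUMNS = STRUCT_PACK(a := 'INTEGER', b := 'BOOLEAN'),
                         header = TRUE,
                         auto_detect = FALSE)
")

dbDisconnect(con, shutdown = TRUE)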
I have variables that are named team.1, team.2, team.3, and so forth.
First of all, I would like to know how to go through each of these and assign a data frame to each one. So team.1 would have data from one team, then team.2 would have data from a second team. I am trying to do this for about 30 teams, so instead of typing the code out 30 times, is there a way to cycle through each with a counter or something similar?
I have tried things like
vars = list(sprintf("team.x%s", 1:33))
to create my variables, but then I have no luck assigning anything to them.
Along those same lines, I would like to be able to run a function I made for cleaning and sorting the individual data sets on all of them at once.
For this, I have tried a for loop
for (j in 1:33) {
  assign(paste("team.", j, sep = ""), cleaning1(paste("team.", j, sep = ""), j))
}
where cleaning1 is my function, which takes two arguments:
cleaning1(team.1, 1)
This produces the error message
Error in who[, -1] : incorrect number of dimensions
So I am hoping the loop would step through my data sets, call my function on each one, and reassign each data set with the newly cleaned data.
Is something like this possible? I am a complete newbie, so the more basic, the better.
Edit:
cleaning1:
cleaning1 = function(who, year) {
  who[, -1]
  who$SeasonEnd = rep(year, nrow(who))
  who = who[-nrow(who), ]
  who = tbl_df(who)
  for (i in 1:nrow(who)) {
    if (str_sub(who$Team[i], -1) == "*") {
      who$Playoffs[i] = 1
    } else {
      who$Playoffs[i] = 0
    }
  }
  who$Team = gsub("[[:punct:]]", '', who$Team)
  who = who[c(27:28, 2:26)]
  return(who)
}
This works just fine when I run it on the data sets I have compiled myself.
To run it though, I have to go through and reassign each data set, like this:
team.1 = cleaning1(team.1, 1)
team.2 = cleaning1(team.2, 2)
So, I'm trying to find a way to automate that part of it.
I think your problem would be better solved by using a list of data frames instead of many variables containing one data frame each.
You do not say where you get your data from, so I am not sure how you would create the list. But assuming you have your data frames already stored in the variables team.1 etc., you could generate the list with
team.list <- list(team.1, team.2, ...,team.33)
where the dots stand for the variables that I did not write explicitly (you will have to do that). This is tedious, of course, and could be simplified as follows
team.list <- do.call(list,mget(paste0("team.",1:33)))
The paste0 command creates the variable names as strings, mget converts them to the actual objects, and do.call applies the list command to these objects.
Now that you have all your data in a list, it is much easier to apply a function to all of them. I am not quite sure how the year argument should be used, but from your example I assume that it just runs from 1 to 33 (let me know if this is not true and I'll change the code). So the following should work:
team.list.cleaned <- mapply(cleaning1,team.list,1:33)
It will go through all elements of team.list and 1:33 and apply the function cleaning1 with the elements as its arguments. The result will again be a list containing the output of each call, i.e.,
list( cleaning1(team.list[[1]],1), cleaning1(team.list[[2]],2), ...)
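Since each call returns a data frame, you may want to stop mapply from trying to simplify the result, so that you keep a plain list of cleaned data frames:

team.list.cleaned <- mapply(cleaning1, team.list, 1:33, SIMPLIFY = FALSE)
# or, equivalently
team.list.cleaned <- Map(cleaning1, team.list, 1:33)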
Since you are new to R, I strongly recommend that you read the help on the apply family of commands (apply, lapply, tapply, mapply). They are very useful, and once you get used to them you will use them all the time...
There is probably also a simple way to directly generate the list of data frames using lapply. As an example: if the data frames are read in from files and you have the file names stored in a character vector file.names, then something along the lines of
team.list <- lapply(file.names,read.table)
might work.
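As a rough sketch of that idea combined with the cleaning step (assuming file.names is ordered so that file i belongs to team i, and that read.table parses each file correctly):

team.list <- lapply(file.names, read.table)
team.list.cleaned <- Map(cleaning1, team.list, seq_along(team.list))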
I have one loop
for (chr in paste('chr', c(seq(1, 22), 'X', 'Y'), sep = '')) {
  .
  .
  write.table(exp, file = "list.txt", col.names = FALSE, row.names = TRUE, sep = "\t")
}
Right now the loop gives me only one .txt file containing the data from the last iteration (chromosome Y).
What I want is one .txt file with the data from all the chromosomes/all the iterations, not 24 different .txt files (one per chromosome).
What you are not yet recognizing is that append is set to FALSE by default, and you need to explicitly change it to TRUE if you want to record each exp object. (Incidentally, exp is an unfortunate name for a data object, because it is shared with a common math function.) If you set append = TRUE, and the exp object is a full record of one chromosome each time write.table is called (hopefully with a chr identifier column so you can keep track of where the data came from), then you should succeed.
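A minimal runnable sketch of that pattern (the data frame built inside the loop is a hypothetical stand-in for your real per-chromosome exp object):

chroms <- paste('chr', c(seq(1, 22), 'X', 'Y'), sep = '')

for (i in seq_along(chroms)) {
  # Hypothetical per-chromosome data, standing in for the real `exp`
  exp <- data.frame(chr = chroms[i], value = rnorm(3))
  write.table(exp, file = "list.txt",
              col.names = FALSE, row.names = TRUE, sep = "\t",
              append = (i > 1))   # overwrite on the first pass, append afterwards
}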
We have an automatic process that opens a template excel file, writes rows of data, and returns the file to the user. This process is usually fast, however I was recently asked to add a summary page with some Excel formulas to one of the templates, and now the process takes forever.
It successfully runs with about 5 records after a few minutes; however, this week's record set is almost 400 rows, and the longest I've let it run is about half an hour before cancelling it. Without the formulas, it only takes a few seconds to run.
Are there any known issues with writing rows to an Excel file that contains formulas? Or is there a way to tell Excel not to evaluate formulas until the file is opened by a user?
The formulas on the summary Sheet are these:
' Returns count of cells in column where data = Y
=COUNTIF(Sheet1!J15:Sheet1!J10000, "Y")
=COUNTIF(Sheet1!F15:Sheet1!F10000, "Y")
' Return sum of column where data is a number greater than 0
' Column contains formula calculating the difference in months between two dates
=SUMIF(Sheet1!I15:Sheet1!I10000,">0",Sheet1!I15:Sheet1!I10000)
' Returns a count of distinct values in a column
=SUMPRODUCT((Sheet1!D15:Sheet1!D10000<>"")/COUNTIF(Sheet1!D15:Sheet1!D10000,Sheet1!D15:Sheet1!D10000&""))
And the code that writes to excel looks something like this:
Dim xls As New Excel.Application()
Dim xlsBooks As Excel.Workbooks, xlsBook As Excel.Workbook
Dim xlsSheets As Excel.Sheets, xlsSheet As Excel.Worksheet
Dim xlsCells As Excel.Range

xls.Visible = False
xls.DisplayAlerts = False
xlsBooks = xls.Workbooks
xlsBooks.Open(templateFile)
xlsBook = xlsBooks.Item(1)

' Loop through Excel sheets. Some templates have multiple sheets.
For Each drSheet As DataRow In dtSheets.Rows
    xlsSheets = xlsBook.Worksheets
    xlsSheet = CType(xlsSheets.Item(drSheet("SheetName")), Excel.Worksheet)
    xlsCells = xlsSheet.Cells

    ' Loop through the column list from the database. Each template requires different columns.
    For Each drDataCols As DataRow In dtDataCols.Rows
        ' Loop through rows to get data
        For Each drData As DataRow In dtData.Rows
            xlsCells(drSheet("StartRow") + dtData.Rows.IndexOf(drData), drDataCols("DataColumn")) = drData("Col" + drDataCols("DataColumn").ToString).ToString
        Next
    Next
Next

xlsSheet.SaveAs(newFile)
xlsBook.Close()
xls.Quit()
Every time you write to a cell, Excel recalculates the open workbooks and refreshes the screen. Both of these things are slow, so you need to set Application.ScreenUpdating = False and Application.Calculation = xlCalculationManual.
Also, there is a high overhead associated with each write to a cell, so it is much faster to accumulate the data in an array and then write the array to the range with a single call to the Excel object model.
With automatic calculation mode, recalculation occurs after every data input or change. I had the same problem; it was solved by setting manual calculation mode. (Reference: MSDN link.)
xls.Calculation = Excel.XlCalculation.xlCalculationManual
Also, this property can only be set after a Workbook has been opened or it will throw a run-time error.
One way that has saved me over the years is to add
Application.ScreenUpdating = False
directly before I execute a potentially lengthy method, and then
Application.ScreenUpdating = True
directly after, or at least at some later point in the code. This forces Excel not to redraw anything on the visible screen until the operation is complete. Quite often I've found that to be where lengthy operations stem from.
I have a file with 15 million lines (will not fit in memory). I also have a small vector of line numbers - the lines that I want to extract.
How can I read out just those lines in one pass?
I was hoping for a C function that does it in one pass.
The trick is to use a connection and to open it before calling read.table:
con <- file('filename')
open(con)
read.table(con, skip = 5, nrows = 1)   # 6th line
read.table(con, skip = 20, nrows = 1)  # 27th line
...
close(con)
You may also try scan, it is faster and gives more control.
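For example, a minimal sketch of the same idea with scan (skip is counted from the connection's current position):

con <- file('filename', open = 'r')
scan(con, what = character(), sep = '\n', skip = 5, nlines = 1, quiet = TRUE)   # 6th line
scan(con, what = character(), sep = '\n', skip = 20, nlines = 1, quiet = TRUE)  # 27th line
close(con)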
If it's a binary file
Some discussion is here:
Reading in only part of a Stata .DTA file in R
If it's a CSV or other text file
If they are contiguous and at the top of the file, just use the nrows argument to read.csv or any of the read.table family. If not, you can combine the nrows and skip arguments to repeatedly call read.csv (reading in a new row or group of contiguous rows with each call) and then rbind the results together, as sketched below.
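A rough sketch of that skip/nrows pattern (line.numbers is a hypothetical sorted vector of 1-based data-row numbers, and the file is assumed to have a header row):

line.numbers <- c(10, 5000, 123456)

pieces <- lapply(line.numbers, function(n) {
  # skip the header line plus the n - 1 data rows before the one we want
  read.csv("big.csv", header = FALSE, skip = n, nrows = 1)
})
result <- do.call(rbind, pieces)

Note that each call re-reads the file from the top, so this is only attractive when the number of wanted lines is small.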
If your file has fixed line lengths then you can use seek to jump to any character position. So just jump to (N - 1) * line_length for each line N you want, and read one line.
However, from the R docs:
Use of seek on Windows is discouraged. We have found so many errors in the Windows implementation of file positioning that users are advised to use it only at their own risk, and asked not to waste the R developers' time with bug reports on Windows' deficiencies.
You can also use fseek from the C standard library in C, but I don't know if the above warning also applies!
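In R, a hedged sketch of the fixed-line-length approach (line.length must be the exact number of bytes per line including the newline, which you have to verify for your file; the file name and line numbers are placeholders):

line.length  <- 81
line.numbers <- c(10, 5000, 123456)

con <- file("fixed_width.txt", open = "r")
wanted <- vapply(line.numbers, function(n) {
  seek(con, where = (n - 1) * line.length)   # jump to the start of line n
  readLines(con, n = 1)
}, character(1))
close(con)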
Before I was able to get an R solution, I did it in Ruby:
#!/usr/bin/env ruby

NUM_SEQS = 14024829
linenumbers = (1..10).collect { (rand * NUM_SEQS).to_i }

File.open("./data/uniprot_2011_02.tab") do |f|
  while line = f.gets
    print line if linenumbers.include? f.lineno
  end
end
It runs fast (as fast as my storage can read the file).
I compiled a solution based on the discussions here.
scan(filename, what = list(NULL), sep = '\n', blank.lines.skip = FALSE)
This will only show you the number of lines, but it reads nothing into memory. If you really want to skip the blank lines, just set the last argument to TRUE.