I am trying to write code that replaces cells in an existing Excel sheet with values from a DataFrame. The code works, but the problem is that it requires the pythoncom library, which is Windows-only. So when I deployed it to Streamlit Cloud, which runs on Linux, an error arose.
The code looks like this:
import pythoncom
import xlwings as xw

pythoncom.CoInitialize()
with xw.App() as app:
    wb = xw.Book(path)
    wb.sheets(sheet_name).range(kolom_cluster + str(header + 1)).options(index=False, chunksize=30_000).value = df["cluster"]
    wb.save(path)
    wb.close()
Therefore, I am wondering whether there is an alternative way to do the same thing (writing DataFrame values to an existing Excel file) without needing pythoncom.
Thank you very much for your kind attention.
This is my first question on Stack Overflow, though I have been using the site for quite a while and have searched it for a solution for some time. I am hoping you can help me solve the problem.
Thank you very much.
PS. I have tried a script like this, using openpyxl:
writer = pd.ExcelWriter(
    path,
    engine='openpyxl',
    mode='a',                   # append data instead of overwriting the whole book by default
    if_sheet_exists='overlay'   # write the data over the old one instead of raising a ValueError
)
df["cluster"].to_excel(
    writer,
    sheet_name=sheet_name,
    startrow=header + 1,         # upper-left corner where to start writing data
    startcol=ord(kolom_cluster), # note that it starts from 0, not 1 as in Excel
    index=False,                 # don't write the index
    header=False                 # don't write headers
)
writer.save()
writer.close()
But it returns an error.
I think we could use pandas.ExcelWriter with openpyxl as the engine in this case. I hope the code below is self-explanatory:
writer = pd.ExcelWriter(
    "path to the file of interest",
    engine='openpyxl',
    mode='a',                  # append data instead of overwriting the whole book by default
    if_sheet_exists='overlay'  # write the data over the old one instead of raising a ValueError
)
df.to_excel(
    writer,
    sheet_name="worksheet of interest",
    startrow=0,    # upper-left corner where to start writing data
    startcol=0,    # note that it starts from 0, not 1 as in Excel
    index=False,   # don't write the index
    header=False   # don't write headers
)
writer.save()
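One note: writer.save() was deprecated in pandas 1.5 and removed in pandas 2.0, so on a recent pandas version close the writer instead (or use it as a context manager, as in the test below):
writer.close()  # on pandas >= 2.0 this is what writes the file to disk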
Update
Here's a short test to see if it works:
import pandas as pd
from pathlib import Path
from openpyxl import Workbook

df = pd.DataFrame(1, [1, 2, 3], [*'abc'])

f = Path('test_openpyxl.xlsx')
if not f.exists():
    wb = Workbook()
    wb.worksheets[0].title = "Data"
    wb.save(f)
    wb.close()

with pd.ExcelWriter(
    f,
    mode='a',
    if_sheet_exists='overlay'
) as writer:
    assert writer.engine == 'openpyxl'
    for n, (i, j) in enumerate(zip([0, 0, 3, 3], [0, 3, 0, 3]), 1):
        (n * df).to_excel(
            writer,
            sheet_name="Data",
            startrow=i,
            startcol=j,
            index=False,
            header=False
        )
My environment:
Python 3.9.7
pandas 1.4.3
openpyxl 3.0.10
Excel version 2108 (Office 365)
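To map this back to the original question (writing only the cluster column into an existing sheet at a given column letter), a minimal sketch along the same lines could look like the following. It assumes the variables path, sheet_name, header and kolom_cluster from the question, and converts the column letter with openpyxl's column_index_from_string; note that ord(kolom_cluster), as used in the PS, returns the ASCII code (65 for 'A'), not a 0-based column index.

import pandas as pd
from openpyxl.utils import column_index_from_string

# path, sheet_name, header, kolom_cluster and df are assumed to come from the question
with pd.ExcelWriter(
    path,
    engine='openpyxl',
    mode='a',                  # keep the existing workbook
    if_sheet_exists='overlay'  # write over the existing sheet
) as writer:
    df[["cluster"]].to_excel(
        writer,
        sheet_name=sheet_name,
        startrow=header,  # 0-based, i.e. Excel row header+1; adjust to your layout
        startcol=column_index_from_string(kolom_cluster) - 1,  # 'A' -> 0
        index=False,
        header=False
    )

Since this only relies on pandas and openpyxl, it should also run on Streamlit Cloud's Linux environment.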
Related
I was trying to read an Excel spreadsheet into an R data frame. However, some of the columns contain formulas or are linked to other external spreadsheets. Whenever I read the spreadsheet into R, many cells become NA. Is there a good way to fix this so that I can get the original values of those cells?
The R script I used for the import is the following:
options(java.parameters = "-Xmx8g")
library(XLConnect)
# Step 1 import the "raw" tab
path_cost = "..."
wb = loadWorkbook(...)
raw = readWorksheet(wb, sheet = '...', header = TRUE, useCachedValues = FALSE)
UPDATE: read_excel from the readxl package looks like a better solution. It's very fast (0.14 sec for the 1400 x 6 file I mentioned in the comments) and it evaluates formulas before import. It doesn't use Java, so there is no need to set any Java options.
# sheet can be a string (name of sheet) or integer (position of sheet)
raw = read_excel(file, sheet=sheet)
For more information and examples, see the short vignette.
ORIGINAL ANSWER: Try read.xlsx from the xlsx package. The help file implies that by default it evaluates formulas before importing (see the keepFormulas parameter). I checked this on a small test file and it worked for me. Formula results were imported correctly, including formulas that depend on other sheets in the same workbook and formulas that depend on other workbooks in the same directory.
One caveat: If an externally linked sheet has changed since the last time you updated the links on the file you're reading into R, then any values read into R that depend on external links will be the old values, not the latest ones.
The code in your case would be:
library(xlsx)
options(java.parameters = "-Xmx8g") # xlsx also uses java
# Replace file and sheetName with appropriate values for your file
# keepFormulas=FALSE and header=TRUE are the defaults. I added them only for illustration.
raw = read.xlsx(file, sheetName=sheetName, header=TRUE, keepFormulas=FALSE)
Does anyone know how to read a text file in SparkR version 1.4.0?
Are there any Spark packages available for that?
Spark 1.6+
You can use text input format to read text file as a DataFrame:
read.df(sqlContext=sqlContext, source="text", path="README.md")
Spark <= 1.5
The short answer is: you don't. SparkR 1.4 has been almost completely stripped of its low-level API, leaving only a limited subset of Data Frame operations.
As you can read on an old SparkR webpage:
As of April 2015, SparkR has been officially merged into Apache Spark and is shipping in an upcoming release (1.4). (...) Initial support for Spark in R will be focussed on high level operations instead of low level ETL.
Probably the closest thing is to load text files using spark-csv:
> df <- read.df(sqlContext, "README.md", source = "com.databricks.spark.csv")
> showDF(limit(df, 5))
+--------------------+
| C0|
+--------------------+
| # Apache Spark|
|Spark is a fast a...|
|high-level APIs i...|
|supports general ...|
|rich set of highe...|
+--------------------+
Since typical RDD operations like map, flatMap, reduce or filter are gone as well, it is probably what you want anyway.
Now, the low-level API is still there underneath, so you can always do something like the example below, but I doubt it is a good idea. The SparkR developers most likely had a good reason to make it private. To quote the ::: man page:
It is typically a design mistake to use ‘:::’ in your code since
the corresponding object has probably been kept internal for a
good reason. Consider contacting the package maintainer if you
feel the need to access the object for anything but mere
inspection.
Even if you're willing to ignore good coding practices, it is most likely not worth the time. The pre-1.4 low-level API is embarrassingly slow and clumsy, and without all the goodness of the Catalyst optimizer the same most likely applies to the internal 1.4 API.
> rdd <- SparkR:::textFile(sc, 'README.md')
> counts <- SparkR:::map(rdd, nchar)
> SparkR:::take(counts, 3)
[[1]]
[1] 14
[[2]]
[1] 0
[[3]]
[1] 78
Note that spark-csv, unlike textFile, ignores empty lines.
Please follow the link:
http://ampcamp.berkeley.edu/5/exercises/sparkr.html
We can simply use:
textFile <- textFile(sc, "/home/cloudera/SparkR-pkg/README.md")
While checking the SparkR code, I found that Context.R has a textFile method, so ideally a SparkContext should have a textFile API to create an RDD, but that is missing from the documentation.
# Create an RDD from a text file.
#
# This function reads a text file from HDFS, a local file system (available on all
# nodes), or any Hadoop-supported file system URI, and creates an
# RDD of strings from it.
#
# @param sc SparkContext to use
# @param path Path of file to read. A vector of multiple paths is allowed.
# @param minPartitions Minimum number of partitions to be created. If NULL, the default
#   value is chosen based on available parallelism.
# @return RDD where each item is of type \code{character}
# @export
# @examples
#\dontrun{
#  sc <- sparkR.init()
#  lines <- textFile(sc, "myfile.txt")
#}
textFile <- function(sc, path, minPartitions = NULL) {
  # Allow the user to have a more flexible definition of the text file path
  path <- suppressWarnings(normalizePath(path))
  # Convert a string vector of paths to a string containing comma separated paths
  path <- paste(path, collapse = ",")
  jrdd <- callJMethod(sc, "textFile", path, getMinPartitions(sc, minPartitions))
  # jrdd is of type JavaRDD[String]
  RDD(jrdd, "string")
}
Follow the link:
https://github.com/apache/spark/blob/master/R/pkg/R/context.R
For the test case:
https://github.com/apache/spark/blob/master/R/pkg/inst/tests/test_rdd.R
In fact, you could use the databricks/spark-csv package to handle TSV files too.
For example,
data <- read.df(sqlContext, "<path_to_tsv_file>", source = "com.databricks.spark.csv", delimiter = "\t")
A host of options is provided here -
databricks-spark-csv#features
Thank you in advance for your help. I am using R to analyse some data that was initially created in Matlab. I am using the package R.matlab, and it is fantastic for a single file, but I am struggling to import multiple files.
The working script for a single file is as follows...
install.packages("R.matlab")
library(R.matlab)
x<-("folder_of_files")
path <- system.file("/home/ashley/Desktop/Save/2D Stream", package="R.matlab")
pathname <- file.path(x, "Test0000.mat")
data1 <- readMat(pathname)
And this works fantastically. The format of my files is 'Name_0000.mat', where the name is constant between files and the 4 digits increase, but not necessarily by 1.
My attempt to load multiple files at once was along these lines...
for (i in 1:length(temp))
data1<-list()
{data1[[i]] <- readMat((get(paste(temp[i]))))}
I also tried multiple other ways that included and excluded path and pathname from the loop, all of which give me the same error:
Error in get(paste(temp[i])) :
object 'Test0825.mat' not found
where 0825 is the number of my final file. If you change the length of the loop, the error always names just the final file.
I think the issue is that when it pastes the name, it looks for an object of that name, which does not yet exist, so I need to have the pasted text in quotation marks, but I don't know how to do that.
Sorry this was such a long post. Many thanks.
I have a text file of 4.5 million rows and 90 columns to import into R. Using read.table I get the "cannot allocate vector of size ..." error message, so I am trying to import using the ff package before subsetting the data to extract the observations that interest me (see my previous question for more details: Add selection criteria to read.table).
So, I use the following code to import:
test<-read.csv2.ffdf("FD_INDCVIZC_2010.txt", header=T)
but this returns the following error message:
Error in read.table.ffdf(FUN = "read.csv2", ...) :
only ffdf objects can be used for appending (and skipping the first.row chunk)
What am I doing wrong?
Here are the first 5 rows of the text file:
CANTVILLE.NUMMI.AEMMR.AGED.AGER20.AGEREV.AGEREVQ.ANAI.ANEMR.APAF.ARM.ASCEN.BAIN.BATI.CATIRIS.CATL.CATPC.CHAU.CHFL.CHOS.CLIM.CMBL.COUPLE.CS1.CUIS.DEPT.DEROU.DIPL.DNAI.EAU.EGOUL.ELEC.EMPL.ETUD.GARL.HLML.ILETUD.ILT.IMMI.INAI.INATC.INFAM.INPER.INPERF.IPO ...
1 1601;1;8;052;54;051;050;1956;03;1;ZZZZZ;2;Z;Z;Z;1;0;Z;4;Z;Z;6;1;1;Z;16;Z;03;16;Z;Z;Z;21;2;2;2;Z;1;2;1;1;1;4;4;4,02306147485403;ZZZZZZZZZ;1;1;1;4;M;22;32;AZ;AZ;00;04;2;2;0;1;2;4;1;00;Z;54;2;ZZ;1;32;2;10;2;11;111;11;11;1;2;ZZZZZZ;1;2;1;4;41;2;Z
2 1601;1;8;012;14;011;010;1996;03;3;ZZZZZ;2;Z;Z;Z;1;0;Z;4;Z;Z;6;2;8;Z;16;Z;ZZ;16;Z;Z;Z;ZZ;1;2;2;2;Z;2;1;1;1;4;4;4,02306147485403;ZZZZZZZZZ;3;3;3;1;M;11;11;ZZ;ZZ;00;04;2;2;0;1;2;4;1;14;Z;54;2;ZZ;1;32;Z;10;2;23;230;11;11;Z;Z;ZZZZZZ;1;2;1;4;41;2;Z
3 1601;1;8;006;05;005;005;2002;03;3;ZZZZZ;2;Z;Z;Z;1;0;Z;4;Z;Z;6;2;8;Z;16;Z;ZZ;16;Z;Z;Z;ZZ;1;2;2;2;Z;2;1;1;1;4;4;4,02306147485403;ZZZZZZZZZ;3;3;3;1;M;11;11;ZZ;ZZ;00;04;2;2;0;1;2;4;1;14;Z;54;2;ZZ;1;32;Z;10;2;23;230;11;11;Z;Z;ZZZZZZ;1;2;1;4;41;2;Z
4 1601;1;8;047;54;046;045;1961;03;2;ZZZZZ;2;Z;Z;Z;1;0;Z;4;Z;Z;6;1;6;Z;16;Z;14;974;Z;Z;Z;16;2;2;2;Z;2;2;4;1;1;4;4;4,02306147485403;ZZZZZZZZZ;2;2;2;1;M;22;32;MN;GU;14;04;2;2;0;1;2;4;1;14;Z;54;2;ZZ;2;32;1;10;2;11;111;11;11;1;4;ZZZZZZ;1;2;1;4;41;2;Z
5 1601;2;9;053;54;052;050;1958;02;1;ZZZZZ;2;Z;Z;Z;1;0;Z;2;Z;Z;2;1;2;Z;16;Z;12;87;Z;Z;Z;22;2;1;2;Z;1;2;3;1;1;2;2;4,21707670353782;ZZZZZZZZZ;1;1;1;2;M;21;40;GZ;GU;00;07;0;0;0;0;0;2;1;00;Z;54;2;ZZ;1;30;2;10;3;11;111;ZZ;ZZ;1;1;ZZZZZZ;2;2;1;4;42;1;Z
I encountered a similar problem when reading CSV files into ff objects. On using
read.csv2.ffdf(file = "FD_INDCVIZC_2010.txt")
instead of the implicit call
read.csv2.ffdf("FD_INDCVIZC_2010.txt")
I got rid of the error. Explicitly passing values to the arguments by name seems to be specific to the ff functions.
You could try the following code:
read.csv2.ffdf("FD_INDCVIZC_2010.txt",
sep = "\t",
VERBOSE = TRUE,
first.rows = 100000,
next.rows = 200000,
header=T)
I am assuming that since it's a txt file, it's a tab-delimited file.
Sorry, I came across the question just now. Using the VERBOSE option, you can actually see how much time each block of your data takes to be read. Hope this helps.
If possible, try to filter the data at the OS level, that is, before it is loaded into R. The simplest way to do this in R is to use a combination of the pipe and grep commands:
textpipe <- pipe('grep XXXX file.name')
mutable <- read.table(textpipe)
You can use grep, awk, sed and basically all the machinery of Unix command-line tools to add the necessary selection criteria and edit the CSV files before they are imported into R. This works very fast, and with this procedure you can strip unnecessary data before R begins to read it from the pipe.
This works well under Linux and Mac; you may need to install Cygwin to make it work under Windows, or use some other Windows-specific utilities.
Perhaps you could try the following code:
read.table.ffdf(x = NULL, file = 'your/file/path', sep = ';')
I have some data gathered by an Android phone, stored as an SQLite file. I would like to play around with this data (analysing it) using either MATLAB or Octave.
I was wondering what commands you would use to import this data into MATLAB, say into a vector or matrix. Do I need any special toolboxes or packages, like the Database package, to access the SQL format?
There is the mksqlite tool.
I've used it personally; I had some issues getting the correct version for my version of MATLAB, but after that, no problems. You can even query the database file directly to reduce the amount of data you import into MATLAB.
Although mksqlite looks nice, it is not available for Octave and may not be suitable as a long-term solution. Exporting the tables to CSV files is an option, but importing them (into Octave) can be quite slow for larger data sets because of the string parsing involved.
As an alternative, I ended up writing a small Python script to convert my SQLite table into a MAT file, which is fast to load into either MATLAB or Octave. MAT files are platform-neutral binary files, and the method works for both numeric and string columns.
import sqlite3
import scipy.io

conn = sqlite3.connect('my_data.db')
csr = conn.cursor()
res = csr.execute('SELECT * FROM MY_TABLE')
db_parms = list(map(lambda x: x[0], res.description))
# Remove those variables in db_parms you do not want to export

X = {}
for prm in db_parms:
    csr.execute('SELECT "%s" FROM MY_TABLE' % (prm))
    v = csr.fetchall()
    # v is now a list of 1-tuples
    X[prm] = list(*zip(*v))

scipy.io.savemat('my_data.mat', X)
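As a quick sanity check (not part of the original script), the exported file can be loaded back on the Python side:

import scipy.io

# Load the exported file back and list the variable names it contains.
check = scipy.io.loadmat('my_data.mat')
print([k for k in check if not k.startswith('__')])

In MATLAB or Octave the same file can then be read with load('my_data.mat').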