I'm processing data that arrives monthly as a manually updated Excel file.
One quirk of the data is that some cancelled records are indicated either by the person keying in the data highlighting the cells in red, or by changing the font to strikethrough. Unfortunately I have no control over the data entry, so I have had to regularly search the file by hand for red cells or strikethrough fonts and clean them up manually (either deleting them, or adding a column with the status Cancelled, depending on the usage).
Does anyone have a suggestion on the best data cleaning practice for this? Is there an automated approach, or do I simply have to resign myself to documenting the steps and executing them regularly?
For info, my preferred tool is R, so if there is a way to clean it from within R, that would be best. I'm open to other approaches.
There are a few R packages for working with Excel, but filtering data based on formatting is going to involve using rcom and Excel's COM interface. I'm not terribly familiar with either.
The route I would go is to write a VBA macro which filters the data, wrap that macro in a VBS script, and call that script from the command line (or via R's system or shell functions).
The reason I would go that route is that both VBA and VBS are very easy to pick up if you have any familiarity at all with programming. COM, on the other hand, isn't something that people become comfortable with very quickly.
VBA (Visual Basic for Applications) is what gives you access to the Excel formatting. VBS (Visual Basic Scripting Edition) is what you need to run the macro from the command line rather than from within Excel.
You can do this in Excel. Start recording a macro, put a colour-based filter on the column you want (to select only the red cells), delete the rows, and then stop recording.
This should give you a macro. You might need to make some small changes; you can search for the specific commands for more details.
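Since the question is open to approaches outside R, here is a minimal sketch in Python using the openpyxl package, which exposes each cell's font and fill. It assumes row 1 is a header, that "red" means a solid fill of exactly FF0000, and that a row counts as cancelled if any of its cells is struck through or red. The function names is_cancelled and read_with_status are made up for this illustration, and the colour in particular should be checked against a real file.

```python
RED = "FF0000"  # assumed shade; check the actual fill colour used in your files


def is_cancelled(cell, red=RED):
    """Return True if the cell is struck through or has a solid red fill."""
    if cell.font is not None and cell.font.strike:
        return True
    fill = cell.fill
    if fill is None or fill.fill_type != "solid":
        return False
    color = fill.fgColor
    # Theme and indexed colours carry no usable RGB string, so skip them.
    if getattr(color, "type", None) != "rgb" or not isinstance(color.rgb, str):
        return False
    # openpyxl reports ARGB such as "FFFF0000"; compare the last six hex digits.
    return color.rgb[-6:].upper() == red


def read_with_status(path, sheet=None):
    """Read a worksheet into rows of values plus a Cancelled/Active status."""
    # Imported here so is_cancelled() can be used without openpyxl installed.
    from openpyxl import load_workbook

    wb = load_workbook(path)
    ws = wb[sheet] if sheet else wb.active
    out = []
    for row in ws.iter_rows(min_row=2):  # assumes row 1 is a header
        status = "Cancelled" if any(is_cancelled(c) for c in row) else "Active"
        out.append([c.value for c in row] + [status])
    return out
```

The rows returned by read_with_status could be written to a csv file and read into R from there, so the monthly clean-up becomes a single scripted step.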
I am in the midst of writing some scripts to perform data analysis on large Excel sheets faster than by hand. However, my company has a strict quality review system where the program used needs to be validated and secure (i.e. no one can edit it, there is proof of what code was run, etc.). So essentially I would like my code to be runnable by my coworkers without them being able to edit the script. I am also interested in inserting prompts that they can fill in (e.g. "Which column would you like to analyze?").
Is all of this possible? I have read a few things online about file permissions but I know that these can easily be changed by the user. I also read about obfuscators but am entirely unfamiliar with their use.
One thought I have is to use R Markdown as a way of displaying which lines were run for which results. However, I believe that document could be edited as well? This would also leave the issue of the script itself being editable.
I'm trying to find a way to automate a series of processes that use several different programmes (on either Ubuntu or Windows).
For each programme, I've boiled the process down to either a macro or a script in each programme, so I feel confident that the entire process can be almost entirely automated. I just can't figure out what I can do to create a unifying tool.
The process is as follows:
I have a simple text file with data, which I tidy with a jEdit macro. The tidied data goes into an OpenCalc template to create a graph. The data is then imported into a programme called TXM which (after many clicks) generates a column of data. That column is exported to a csv file, which is imported into an R session, whereupon a script is executed.
I have to repeat this process (and a few more similar ones) dozens of times a day, and it's driving me nuts.
My research into how to automate the import-treatment-export process has shown a few glimmers of hope, but I haven't been able to make any real progress.
I tried Autoexpect, but couldn't make it work on Ubuntu. TCL, I think, only works for internet applications, and I haven't been able to make Fabric work either.
I'm willing to spend a lot of time learning and developing a tool to achieve this, but at the moment I'm not even sure what terms to search for.
I've figured it out for Windows: I created a .bat file in a text editor which, when clicked, prompts the user for names, etc. and rewrites another text file. It then executes that .txt file as a script in R with the command
R.exe CMD BATCH c:\user\desktop\mymacro.txt
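The same idea as the .bat file, one driver that calls each programme's command line in turn, also works on Ubuntu if the driver is written in a cross-platform scripting language. A minimal sketch in Python follows. It assumes every step (the jEdit macro, the spreadsheet template, TXM, the R script) can be started from a command line, which the thread does not confirm for each tool, and the name run_pipeline is made up for illustration.

```python
import subprocess


def run_pipeline(steps):
    """Run each (name, command) step in order; stop at the first failure."""
    completed = []
    for name, command in steps:
        # command is a list such as ["Rscript", "analysis.R"]; no shell is involved.
        result = subprocess.run(command, capture_output=True, text=True)
        if result.returncode != 0:
            raise RuntimeError(f"step '{name}' failed: {result.stderr.strip()}")
        completed.append(name)
    return completed
```

A call might look like run_pipeline([("analyse", ["Rscript", "analysis.R", "data.csv"])]), where analysis.R and data.csv are placeholder file names. Stopping at the first non-zero exit code means a failed export is noticed before the R script runs on stale data.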
I would like to access the history of what has been typed in the source panel in RStudio.
I'm interested in the way we learn and type code. Three things I would like to analyse are: i) the way a single person types code, ii) how different people type code, iii) the way a beginner improves at typing.
Grabbing the history of commands is quite satisfying as a first attempt, but I would like to reach a finer granularity and thus access the successive changes within a single line, in a way.
So, to be clear, I'm looking neither for the history of commands nor for a diff between different versions of an .R file.
What I would like to access is really the successive alterations to the source panel that are visible when you repeatedly press Ctrl+Z. I do not know if there is a more accurate word for what I describe, but again what I'm interested in is how bits of code are added/moved/deleted/corrected/improved in the source panel but not necessarily passed to the Console, and thus absent from the history of commands.
This must be saved somewhere, somehow, by RStudio, as it is accessible by the latter. It may be saved in a quite hidden/private/memory/process/... way, and I have only a very vague idea of how a GUI works. I do not know if it would be easily accessible and then programmatically analysed, typically by saving a file from it. Timestamps would be the cherry on top, but I would be happy without.
Do you have any idea how to access this history?
RStudio's source panel is essentially a view to an Ace Editor. As such you'd need to access the editor session's editSession and use getDocument or getWordRange along with the undo of the editSession's undoManager instance.
I don't think you'll be doing that from within RStudio without hacking on the RStudio code, unless the RStudio Addin API is made to pass through editor events in the future.
It might be easier to write a session recorder that logs changes as they are made rather than try to mess with the undo history. I imagine you could write an Addin that calls some JavaScript to communicate over the existing RStudio port using the Ace Editor's events (e.g. onChange).
As @GegznaV said, RStudio saves code history to a ".Rhistory" file. It's in the "Documents" folder on my computer. It's probably the same folder if you're using Windows. If you do not know the location, you can find the file by searching.
RStudio also lets you save the history to a file manually: there is a "Save" button in the History panel. So you can also get a timestamp. You can ask different users to save their code history after they have finished writing code. It may be indirectly useful to your research.
Possible Duplicate:
Workflow for statistical analysis and report writing
I have not been programming with R for long, but I am running into a project organization question that I was hoping somebody could give me some tips on. I am finding that a lot of the analysis I do is ad hoc: that is, I run something, think about the results, tweak it and run some more. This is conceptually different from a language like C++, where you think about the entire thing you want to run before coding. It is a huge benefit of interpreted languages. However, the issue that comes up is that I end up with a lot of .RData files that I save so I don't have to source my script every time. Does anyone have any good ideas about how to organize my project so I can return to it a month later and have a good idea of what each file is associated with?
This is sort of a documentation question I guess. Should I document my entire project at each leg and be vigorous about cleaning up files that will no longer be necessary but were a byproduct of the research? This is my current system but it is a bit cumbersome. Does anyone else have any other suggestions?
Per the comment below: One of the key things that I am trying to avoid is the proliferation of .R analysis files and .RData sets that go along with them.
Some thoughts on research project organisation here:
http://software-carpentry.org/4_0/data/mgmt/
the take-home message being:
Use Version Control for your programs
Use sensible directory names
Use Version Control for your metadata
Really, Version Control is a good thing.
My analysis is a knitr document, with some external .R files which are called from it.
All data is in a database, but during my analysis the processed data are saved as .RData files. Only when I delete the .RData files are they recreated from the database the next time I run the analysis. It works like a cache: it saves database access and data processing time when I rerun (parts of) my analysis.
Using a knitr (Sweave, etc.) document for the analysis enables you to easily write a documented workflow with the results included. And knitr caches the results of the analysis, so small changes usually do not result in a full rerun of all R code, but only of a small section. This saves quite some running time for a bigger analysis.
(Ah, and as said before: use version control. Another tip: working with knitr and version control is very easy with RStudio.)
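The "delete the file to force a rebuild" caching described in this answer can be sketched in a few lines. The sketch below is in Python with pickle files standing in for .RData files; the same logic in R would check file.exists() before calling load() or recomputing and calling save(). The function name cached and the compute argument are made up for illustration.

```python
import os
import pickle


def cached(path, compute):
    """Return the object stored at path, computing and saving it if absent."""
    if os.path.exists(path):
        with open(path, "rb") as fh:
            return pickle.load(fh)
    value = compute()  # e.g. the slow database query plus processing
    with open(path, "wb") as fh:
        pickle.dump(value, fh)
    return value
```

One consequence of this design is that the cache files are disposable by construction, so they can be kept out of version control and deleted whenever there is doubt about which script produced them.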
Let me first say that I assiduously avoid hand-cleaning data in favor of regular expressions and the like. However, occasionally it is inevitable.
I use something like the Load-Clean-Func-Do workflow normally, so this obviously fits into the cleaning phase. However, any hand-editing breaks the ability to run the stuff before the hand-cleaning if it needs updating.
I can think of at least three ways to handle this:
1. Put the by-hand changes as early in the workflow as possible, so that everything after that remains runnable.
2. Write out regexes or assignment operations for every single change.
3. Use a tool that generates (2) for you after you close the spreadsheet where you've made the changes.
The problem with 2 is that it can be extremely unwieldy. The problem with 3 is that I'm unaware of any such tool existing for R. Stata has an extremely good implementation of this.
So the questions are:
Which results in the most replicable code with the least-frustrating code writing?
Does a tool as in (3) exist?
I agree that hand-cleaning is generally a rather bad idea. However, sometimes it is unavoidable. I'd suggest one of the two, or both:
Keep a separate "data fixing" file containing three variables: "case_id", "variable_name" and "value". Use it to store information about which values in the original data need to be replaced. You may add further variables to hold extra information about the cleaning (e.g. why the value of variable "variable_name" needs to be replaced with "value" for case "case_id"). Then have a short piece of R code which loads your original data and cleans it using the information in the "fixing" file.
Perhaps you should start using a version control system like git or subversion (there are other programs too). Every hand-made change to the data could be recorded in the system as a separate commit. By the end of the day, you will be able to easily check the log for what changes you made to the data and when. Moreover, you will be able to generate patch files that transform the original data files into the cleaned ones. It is also beneficial to have your R code files version-controlled.
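The first suggestion, a separate fixing file keyed by "case_id" and "variable_name", can be sketched as follows. The sketch is in Python using only the standard library; the same logic in R is a loop or merge over the fixing table. The example data, the extra "reason" column and the function name apply_fixes are all made up for illustration.

```python
import csv
import io


def apply_fixes(rows, fixes, id_field="case_id"):
    """Return copies of rows with the corrections from fixes applied."""
    by_id = {}
    for fix in fixes:
        by_id.setdefault(fix["case_id"], {})[fix["variable_name"]] = fix["value"]
    cleaned = []
    for row in rows:
        patched = dict(row)  # leave the original data untouched
        for name, value in by_id.get(row[id_field], {}).items():
            if name not in patched:
                raise KeyError(f"unknown variable {name!r} for case {row[id_field]!r}")
            patched[name] = value
        cleaned.append(patched)
    return cleaned


# Made-up example data standing in for the original file and the fixing file.
original = io.StringIO("case_id,age,city\n1,34,Lodnon\n2,290,Paris\n")
fixing = io.StringIO(
    "case_id,variable_name,value,reason\n"
    "1,city,London,typo\n"
    "2,age,29,extra digit\n"
)
rows = list(csv.DictReader(original))
fixes = list(csv.DictReader(fixing))
cleaned = apply_fixes(rows, fixes)
```

Because the original file is never edited, everything upstream of the cleaning step stays runnable, and the fixing file itself doubles as a log of every manual correction.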