I have a very annoying issue.
I have some oxygen measurements saved in an .xlsx table (created directly by the device software). Opened with Excel, this is part of my file.
In the first picture, we can notice that sometimes the software skips a second (11:13:00, then 11:13:02).
In the second picture, just notice the continuity of time from 11:19:01 to 11:19:09.
I read my Excel table into R with the readxl package, using this code:
oxy <- read_excel("./Metabolism/20180502 DAPH 20.xlsx" , 1)
And before any manipulation, when I check my table in R (RStudio), I see this:
In the first case, R kept the time continuity by adding 11:13:01 and shifting the next rows.
Then, later, the reverse situation: the continuity of time was respected in Excel, but R skipped a second and, again, shifted the next rows.
In the end, there is the same number of rows. I guess it is a problem with the way R and Excel round the time, but these little errors prevent me from using the date/time to merge two tables, and the calculations afterwards are wrong.
Is there something I can do to tell R to read the data exactly the same way Excel saved them?
Thank you very much!
Index both with a sequential integer counter, each starting at the same point, and use that for merging like with like. If you want the Excel version to be 'definitive', convert the index back to time with a lookup based on your Excel version.
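A minimal sketch of that idea in R, assuming the second table is called other_tab (a placeholder) and both tables cover the same measurement period row for row:

library(readxl)

oxy <- read_excel("./Metabolism/20180502 DAPH 20.xlsx", 1)

# Add the same sequential counter to both tables
oxy$idx       <- seq_len(nrow(oxy))
other_tab$idx <- seq_len(nrow(other_tab))

# Merge on the integer index instead of the (slightly shifted) timestamps
merged <- merge(oxy, other_tab, by = "idx")

# If the Excel timestamps are the 'definitive' ones, keep the time column that
# came from oxy and drop the one that came from other_tab.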
Tearing my hair out on this one. Took me hours just to get rJava up and running (because Mac OS X El Capitan was not wanting to play nice with Java) in order to load Excel-specific data importing packages etc. But in the end this hasn't helped my problem, and I'm just about at my wits' end. Please help.
Basic situation is this:
I have simple Excel data of time durations, spanning a couple of years. The two columns I'm importing are the time (duration) and the year (2016, 2017, etc.).
In Excel the data is formatted as [h]:mm:ss so it displays correctly (the data relates to the number of hours worked in a month, so typically something like 80:xx:xx to 120:xx:xx). I'm aware that, despite the cells being formatted as above and only showing the relevant number of hours, Excel has in reality appended an (irrelevant, arbitrary) date to this hours data. I have searched and searched and found no way around this limitation in the way Excel handles dates/times/durations.
I import this data into R via the "Import data -> import from Excel data set" menu item in the R Commander GUI, not the console.
However, when importing the data into R, it displays as a single number, e.g. approx. 110 hrs is converted to 4.xxxxx, not hh:mm:ss. So when running analyses and generating plots etc., instead of the actual (meaningful) 110:xx:xx type data being displayed, a completely meaningless 4.xxxxxx is displayed.
If I change the formatting of the Excel cells to display the date as well as the time, rather than using the [h]:mm:ss cell format, R interprets the data as something equally useless, like 1901/02/04 05:23 am.
I have installed and loaded a variety of packages such as xlsx, XLConnect, lubridate etc., but it hasn't made any difference to how R interprets the Excel data on import, from the GUI at least.
Please tell me how I can either
a) edit the raw data into a format that R will understand as a time duration (and nothing but a time duration) in hh:mm:ss format, or
b) format the current data from within R after importing, so that it displays correctly rather than as a useless number or an arbitrary date/time?
[Please note: I can use the console, when given the commands etc needed to be executed. But I need to find a solution that ultimately will allow the data to be imported and/or manipulated from within the GUI, not from typing a bunch of commands into the console, as the end user (not me) has zero programming ability and cannot use a console, and will only ever be using R via the GUI.]
Is your code importing the data from Excel as seconds?
library(lubridate)

duration <- lubridate::as.duration(400000)

as.numeric(duration, "hours")   # 111.1111
as.numeric(duration, "days")    # 4.62963
seconds_to_period(400000)       # "4d 15H 6M 40S"
Recently I obtained a lot of RTF files that contain economic data for analysis I need to do. Unfortunately, this is the only form in which the Bureau of Statistics of my country provides time series data covering a long period. If it were a one-time need to find a particular indicator for 10 years or so, I would be fine locating the values by hand in Word/Notepad/TextEdit (on Mac). But my problem is that I have 15 files with data that I need to combine somehow into one dataset for my work. Before I even start, though, I don't have a clue whether it is possible to read those files into an appropriate format (a data.frame). I wanted to ask expert opinion on how to approach this task. An example of one of the files can be downloaded from here:
https://www.dropbox.com/s/863ikx6poid8unc/Export_for_SO.rtf?dl=0
All values are in Russian. The dataset represents exports of a particular product (first column) across countries (second column) in US dollars, for two periods.
Thank you.
Use the code found at https://datascienceplus.com/how-to-import-multiple-csv-files-simultaneously-in-r-and-create-a-data-frame/ and replace read_csv with read_rtf.
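A hedged sketch of that pattern, assuming read_rtf comes from the striprtf package mentioned in the answer below, and that all 15 files sit in one folder (the folder name is a placeholder); you will still need to parse each file's lines into columns before stacking them:

library(striprtf)

files <- list.files("./rtf_data", pattern = "\\.rtf$", full.names = TRUE)

# read_rtf() returns the plain-text lines of one file; read them all into a named list
raw_list <- lapply(files, read_rtf)
names(raw_list) <- basename(files)

# Each element is a character vector that still has to be split into columns and
# converted to numeric before do.call(rbind, ...) can build one data.frame.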
You may want to manually convert your files to another format using an office suite or a text editor. You should be able to save as in another format.
While in R, you may want to give striprtf a try. I'm guessing you will still have to clean your data a bit afterward.
You can install the package like this:
install.packages("striprtf")
I have data in Excel, and after reading it into R it displays as
lob2 lob3
1.86E+12 7.58E+12
I want it as
lob2 lob3
1857529190776.75 7587529190776.75
This difference causes me to get different results in my analysis later on.
How is the data stored in Excel (does it think it is a number, a string, a date, etc.)?
How are you getting the data from Excel to R? If you save the data as a .csv file and then read it into R, look at the intermediate file: Excel is known to abbreviate values when saving, and R would then see character strings instead of numbers. You need to find a way to tell Excel to export the data in the correct format with the correct precision.
If you are using a package (there is more than one), then look into the details of that package for how to grab the numbers correctly (you may need to make changes in Excel so that it knows they are numbers).
Lastly, what does the str function say about your R object? It could be that R is storing the proper numbers and only displaying the short version, as mentioned in the comments. Or it could be that R received strings that did not convert nicely to numbers and is storing them as characters or factors. The str function will let you see how your data is stored in R, and therefore how to convert or display it correctly.
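A hedged illustration of both possibilities, with made-up values matching the question:

x <- c(lob2 = 1857529190776.75, lob3 = 7587529190776.75)

str(x)                 # "num [1:2] ..." means the full values are stored; only the display is short
sprintf("%.2f", x)     # "1857529190776.75" "7587529190776.75"
options(scipen = 999)  # discourage scientific notation in default printing

# If str() had reported chr or a factor instead, convert first:
# as.numeric(as.character(x))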
I have a 3GB csv file. It is too large to load into R on my computer. Instead I would like to load a sample of the rows (say, 1000) without loading the full dataset.
Is this possible? I cannot seem to find an answer anywhere.
If you don't want to pay thousands of dollars to Revolution R so that you can load/analyze your data in one go, sooner or later you will need to figure out a way to sample your data.
And that step is easier to do outside R.
(1) Linux Shell:
Assuming your data is in a consistent format and each row is one record, you can do:
sort -R data | head -n 1000 >data.sample
This will randomly sort all the rows and put the first 1000 rows into a separate file, data.sample.
(2) If the data is too large to fit into memory:
Another solution is to store the data in a database. For example, I have many tables stored in a MySQL database in a nice tabular format, and I can take a sample by doing:
select * from tablename order by rand() limit 1000
You can easily communicate between MySQL and R using RMySQL, and you can index your columns to guarantee query speed. You can also check the mean or standard deviation of the whole dataset against your sample, taking advantage of the power of the database.
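A hedged sketch of running that query from R (the connection details and column name are placeholders, and the table must already have been loaded into MySQL):

library(DBI)
library(RMySQL)

con <- dbConnect(MySQL(), dbname = "mydb", host = "localhost",
                 username = "me", password = "secret")

sample_df <- dbGetQuery(con, "SELECT * FROM tablename ORDER BY RAND() LIMIT 1000")

# Compare the sample against the full table without pulling it into R
dbGetQuery(con, "SELECT AVG(some_column), STDDEV(some_column) FROM tablename")

dbDisconnect(con)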
These are the two most commonly used ways based on my experience for dealing with 'big' data.
I have a dataset in SPSS that has 100K+ rows and over 100 columns. I want to filter both the rows and columns at the same time into a new SPSS dataset.
I can accomplish this very easily using the subset command in R. For example:
new_data <- subset(old_data, subset = ColumnA > 10, select = c(ColumnA, ColumnC, ColumnZZ))
Even easier would be:
new_data <- old_data[old_data$ColumnA > 10, c(1, 4, 89)]
where I am passing the column indices instead.
What is the equivalent in SPSS?
I love R, but the read/write and data management speed of SPSS is significantly better.
I am not sure what exactly you are referring to when you write that the "read/write and data management speed of SPSS is significantly better" than R's. Your question itself demonstrates how flexible R is at data management! And a dataset of 100k rows and 100 columns is by no means a large one.
But, to answer your question, perhaps you are looking for something like this. I'm providing a "programmatic" solution, rather than the GUI one, because you're asking the question on Stack Overflow, where the focus is more on the programming side of things. I'm using a sample data file that can be found here: http://www.ats.ucla.edu/stat/spss/examples/chp/p004.sav
Save that file to your SPSS working directory, open up your SPSS syntax editor, and type the following:
GET FILE='p004.sav'.
SELECT IF (lactatio <= 3).
SAVE OUTFILE= 'mynewdatafile.sav'
/KEEP currentm previous lactatio.
GET FILE='mynewdatafile.sav'.
More likely, though, you'll have to go through something like this:
FILE HANDLE directoryPath /NAME='C:\path\to\working\directory\' .
FILE HANDLE myFile /NAME='directoryPath/p004.sav' .
GET FILE='myFile'.
SELECT IF (lactatio <= 3).
SAVE OUTFILE= 'directoryPath/mynewdatafile.sav'
/KEEP currentm previous lactatio.
FILE HANDLE myFile /NAME='directoryPath/mynewdatafile.sav'.
GET FILE='myFile'.
You should now have a new file created that has just three columns, and where no value in the "lactatio" column is greater than 3.
So, the basic steps are:
Load the data you want to work with.
Subset for all columns from all the cases you're interested in.
Save a new file with only the variables you're interested in.
Load that new file before you proceed.
With R, the basic steps are:
Load the data you want to work with.
Create an object with your subset of rows and columns (which you know how to do).
Hmm.... I don't know about you, but I know which method I prefer ;)
If you're using the right tools with R, you can also read in just the specific subset you are interested in, without first loading the whole dataset, if speed really is an issue.
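For instance, a hedged sketch with the haven package (assuming a version recent enough to support col_select; the variable names are those from the sample file above):

library(haven)

new_data <- read_sav("p004.sav", col_select = c(currentm, previous, lactatio))
new_data <- new_data[new_data$lactatio <= 3, ]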
In SPSS you can't combine the two actions in one command, but it's easy enough to do it in two:
dataset copy old_data. /* delete this if you don't need to keep both old and new data.
select if ColumnA>10.
add files /file=* /keep=ColumnA ColumnC ColumnZZ.