Reading dates as dates and not characters - r

The data I have been working with reads everything just fine, **except** for the date column. It always reads it as characters instead.
This would be fine except that, when I have lots of dates (like over 400 of them), then you can see something like this on a scatterplot:
Scatter Plot
In essence, I have two questions.
First, apart from using as.Date, which is fine when I only need a temporary fix, how do I make R read the date column as actual dates permanently? That is, can I make that column come in as dates at import time, when I am using read.csv or read_excel?
Second, when graphing, as in the plot included here, how can I show only some of the labels so that the axis is not so cramped? I still want all the data, but I do not want all those labels.
I was hoping to add the data file, but I don't know how to attach Excel/CSV files on this website, and my data set is quite long (n = 491). It has 9 columns, one of which is the date column. The others are numeric or character. I can add a few rows to help out.
Some of the data set
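A minimal sketch of one way to approach both questions, using base R only. The column names (`date`, `value`), the ISO `YYYY-MM-DD` date layout, the monthly label spacing, and the helper `plot_sparse_labels` are assumptions for illustration, not details taken from the asker's file.

```r
# Hypothetical stand-in for the asker's file: 60 daily rows written to a temp CSV.
path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(date  = seq(as.Date("2021-01-01"), by = "day", length.out = 60),
             value = seq_len(60)),
  path, row.names = FALSE
)

# colClasses makes read.csv parse the column as Date at import time.
# This relies on as.Date() understanding the text, so it suits YYYY-MM-DD input.
dat <- read.csv(path, colClasses = c(date = "Date"))
class(dat$date)

# Plot every point, but label only one position per month.
plot_sparse_labels <- function(d) {
  label_at <- seq(min(d$date), max(d$date), by = "month")
  plot(d$date, d$value, xaxt = "n", xlab = "Date", ylab = "Value")  # no default x axis
  axis.Date(1, at = label_at, format = "%b %Y")
  invisible(label_at)
}

if (interactive()) plot_sparse_labels(dat)
```

If the dates are stored in another layout (for example day/month/year), one option is to read the column as character and convert it once with `as.Date(x, format = ...)`. For Excel files, `readxl::read_excel()` accepts a `col_types` argument that can mark a column as `"date"`, and with ggplot2 the label density can be controlled through `scale_x_date(date_breaks = ..., date_labels = ...)`.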

Related

R Code: csv file data incorrectly breaking across lines

I have some csv data that I'm trying to read in, where lines are breaking across rows weirdly.
An example of the file (the files are the same but the date varies) is here: http://nemweb.com.au/Reports/Archive/DispatchIS_Reports/PUBLIC_DISPATCHIS_20211118.zip
The csv is non-rectangular because there are 4 different types of data included, each with its own heading rows. I can't skip a fixed number of lines because the length of the data varies by date.
The data that I want is the third dataset (sometimes the second), and it has approximately twice as many headers as the data above it. So I use read.csv() without a header, and ideally it should pull in all the data and fill the shorter rows above with NAs.
But for some reason read.csv() seems to decide that there are 28 columns of data (corresponding to the data headers on row 2), which splits the data lower down across three rows. So instead of the data headers being on one row, they are split across three, and so are all the rows of data below them.
I tried reading it in with the column names explicitly defined, but it still splits the rows weirdly.
I can't figure out what's going on. If I open the csv file in Excel it looks perfectly normal.
If I use readr::read_lines() there are no errant carriage returns or new lines as far as I can tell.
Hoping someone might have some guidance, otherwise I'll have to figure out a kind of nasty read_lines approach.
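One possible line-based approach, sketched under assumptions: count the fields on each line, keep only the lines belonging to the widest block, and parse just those. The miniature file below is invented for illustration and is not the layout of the linked report; real files may need an additional filter on a record-type column.

```r
# Hypothetical miniature of a file holding two blocks of different widths.
lines <- c(
  "I,A,x,y",
  "D,A,1,2",
  "I,B,p,q,r,s,t",
  "D,B,1,2,3,4,5",
  "D,B,6,7,8,9,10"
)
path <- tempfile(fileext = ".csv")
writeLines(lines, path)

# Number of comma-separated fields on every line. Blank lines are not skipped
# and "#" is not treated as a comment, so positions line up with readLines().
n_fields <- count.fields(path, sep = ",", quote = "\"",
                         blank.lines.skip = FALSE, comment.char = "")

# Keep only the lines of the widest block and parse them as their own CSV.
raw   <- readLines(path)
wide  <- raw[n_fields == max(n_fields)]
block <- read.csv(text = wide, header = TRUE)
block
```

The reason for working line by line is that `read.csv()` infers the column count from the first few lines of the file, so a wider block further down may be wrapped onto extra rows.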

Plotting POSIXct in ggplot manually scaling x-axis

I am trying to plot up this windspeed data, with years displaying on the x-axis. The data frame was set up as
wsAvg<-data.frame(date=as.POSIXct(ws07$date[1224:1559]),u.1=(ws07$u[1224:1559]),stringsAsFactors = FALSE)
wsAvg<-rbind(wsAvg,c(date=as.POSIXct(ws08$date[1032:1367]),(ws08$u[1032:1367])))
And below using ggplot to plot my windspeed data frame.
ggplot(wsAvg, aes(x = date, y = as.numeric(u.1))) +
  geom_point(size = 3, pch = 2) +
  geom_smooth(method = "lm", colour = "black", se = FALSE)
  # + scale_x_datetime(limits = as.POSIXct(c('2006-09-01', '2016-10-01')), breaks = date_breaks("1 year"), labels = date_format("%Y"))
Without the scale_x_datetime() in my command, I get those dates. When I add the scale_x_datetime() call to manually scale my x-axis to display only years, all my data lines up on 2007. Does anyone know why this is?
It is very difficult to provide the answer to your question, since we don't have a clear picture of any of your data. With that being said, let's look at the information you did provide and see where the likely source of the problem is for your question.
The issue is clearly related to the formatting/data located in your "date" column. It's best to look at this stepwise and test at each step to see what can go wrong here:
Your raw data: There is likely nothing wrong with your base data, but we don't know the format of the "date" vector coming from ws07$date[1224:1559] and ws08$date[1032:1367]. Your raw data originates from two data frames, so just confirm that the raw data from these two vectors is formatted identically, but more importantly, is it already formatted as a date? What is class(ws08$date)? Also, what does the data look like if you took a sample of that dataset? (e.g. ws07$date[sample(1224:1559, 20)]).
Conversion to POSIXct: The first code you show includes as.POSIXct(), but does not include the argument for format=. You may or may not need to specify this, but I would recommend consulting the documentation to be sure you're using the function correctly. You can try converting a small subset of the data just using as.POSIXct(ws07$date[1224:1250]) or something like that. Does it give you the dates formatted correctly? If not, try specifying the format= arg until it "works" as you intended.
Initial plot and second plot: The data is spread out in the first plot, likely kind of how you expected. What about the month/day combinations in the first plot - are they correct? If they are correct, it may indicate the year is being read wrong, since apparently all dates are clustered around May and June of 2007. Comparing the first and second plots, there's no obvious issue with scale_x_datetime() here. Those two plots are consistent with data that has x values = dates ranging from May-June of 2007.
Bottom line: Hard to discern exactly where it's going wrong for you, but likely it's (1) in the conversion to date using as.POSIXct from your ws07 and ws08 datasets, or (2) the format of ws07$date or ws08$date being imported/converted incorrectly. The solution is to use the format= argument in the date conversion/import function you are using to ensure that the format is correct and years/months/dates are imported accordingly.
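The checks described above can be sketched on a tiny example. The raw strings and the `%m/%d/%Y %H:%M` layout are assumptions chosen for illustration, since the real contents of `ws07$date` are not shown in the question.

```r
# Hypothetical raw values; the real layout of ws07$date is unknown.
raw_dates <- c("05/14/2007 13:00", "06/01/2008 09:30")
class(raw_dates)  # check the class before converting

# A format string that does not match the text yields NA rather than an error.
wrong <- as.POSIXct(raw_dates, format = "%Y-%m-%d %H:%M", tz = "UTC")

# A format string that matches the text parses both values.
right <- as.POSIXct(raw_dates, format = "%m/%d/%Y %H:%M", tz = "UTC")

anyNA(wrong)         # TRUE signals a format mismatch
format(right, "%Y")  # inspect the parsed years
```

Looking at the parsed years on a small sample like this is a quick way to confirm that the year component is being read as intended before plotting.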
Here is the code that worked for me. Instead of using the c() function when binding data from the other datasets, I had to use data.frame() to add the other years to the wsAvg data frame.
wsAvg<-data.frame(date=as.POSIXct(ws07$date[1224:1559]),u.1=(ws07$u[1224:1559]),stringsAsFactors = FALSE)
wsAvg<-rbind(wsAvg,data.frame(date=as.POSIXct(ws08$date[1032:1367]),u.1=(ws08$u[1032:1367])))
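A self-contained version of the same idea on toy data. The two small data frames below are hypothetical stand-ins for `ws07` and `ws08`. The likely reason this works is that `c()` combines its arguments into a single vector, while `data.frame()` keeps the dates and the wind speeds as separate, correctly typed columns that `rbind()` can match by name.

```r
# Hypothetical stand-ins for the ws07 and ws08 data frames.
ws07 <- data.frame(date = c("2007-05-01 00:00:00", "2007-05-02 00:00:00"),
                   u    = c(3.1, 4.2), stringsAsFactors = FALSE)
ws08 <- data.frame(date = c("2008-05-01 00:00:00", "2008-05-02 00:00:00"),
                   u    = c(2.7, 5.0), stringsAsFactors = FALSE)

# Build the first year, then append the second year as a data frame
# with the same column names and types.
wsAvg <- data.frame(date = as.POSIXct(ws07$date, tz = "UTC"), u.1 = ws07$u)
wsAvg <- rbind(wsAvg,
               data.frame(date = as.POSIXct(ws08$date, tz = "UTC"), u.1 = ws08$u))

str(wsAvg)                          # date stays POSIXct, u.1 stays numeric
unique(format(wsAvg$date, "%Y"))    # both years are present
```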

Looking for advice on creating Tidy data from the start

I have a data set that will be growing. It is categorical observations (i.e., 1=yes, 2=no) by date and hour. Is the following an acceptable method of formatting for import to R or is there a better way?
I would use a template like this:
Using one column for the date makes it much easier to read/import into R. Also, the YYYY-MM-DD is the default format in R for date columns. Trying to write date and hour together in one column could be done but seems like it could be tedious and not as easy to see what is going on in the data. As was mentioned in the comments above, each observation should be on a separate row. Once you save the data as a csv, it will be easily imported into R.
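As a rough illustration of that layout, here is a sketch with one row per observation and separate `date` and `hour` columns. The column names and the sample values are assumptions, not the template from the original answer.

```r
# Hypothetical tidy layout: one observation per row, date in YYYY-MM-DD.
csv_text <- "date,hour,observation
2020-01-01,0,1
2020-01-01,1,2
2020-01-02,0,1"

# Parse the date column as Date while importing.
obs <- read.csv(text = csv_text, colClasses = c(date = "Date"))

# Turn the 1/2 codes into a labelled factor (1 = yes, 2 = no, as in the question).
obs$observation <- factor(obs$observation, levels = c(1, 2), labels = c("yes", "no"))

str(obs)
```

New observations can then be appended as extra rows without changing the column structure.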
Good luck.

Plotting Time Series

I'm working on 16 world indices over three years and I want to make a plot from these 16 indices.
all<-read.table("C.../16indices.txt")
dimnames(all)[[2]]<-c("Date","BEL 20","CAC 40","AEX","DAX","FTSE 100","IBEXx 35","ATX","SMI","FTSE MIB","RTX","HSI","NIKKEI 225","S&P 500","NASDAQ","Dow Jones","BOVESPA")
attach(all)
Problems
My dates are written in the form "2009-01-05". I want only "2009" to appear, otherwise I would have too many jumps.
For example the prices from the BOVESPA go from 40.000,15 to 60.000,137. How do I get nice y-labels? For instance 40.000, 45.000,...,60.000.
How do I get 16 of these plots in one nice figure/plot?
I'm not used to working with R. I tried something like this but that didn't work...
plot(all[1,],all[,2])
The biggest problem is that there is no sample data. Here is advice based on guesswork:
I tried something like this but that didn't work... plot(all[1,],all[,2])
You need to format your date values as R Date class. If they are in YYYY-MM-DD format it will be as simple as:
all$Date <- as.Date(all$Date)
To your specific questions:
1) My dates are written in the form "2009-01-05". I want only "2009" to appear, otherwise I would have too many jumps.
You will need to suppress axis plotting in the plot call and then need to add an axis() call.
2) For example the prices from the BOVESPA go from 40.000,15 to 60.000,137. How do I get nice y-labels? For instance 40.000, 45.000,...,60.000.
You appear to be in a European locale, which means your initial read.table call probably mangled the numeric input. Read the documentation for read.csv2, which properly handles the swapped meanings of the decimal point and comma in numeric data. You should also use colClasses.
3) How do I get 16 of these plots in one nice figure/plot?
You should probably calculate ratios from an initial starting point for each series so there can be a common scale for display.
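The three suggestions above can be sketched together in base R. Everything in the sample below is hypothetical: the two series, their values, and the helper names `to_number` and `plot_rebased`. The text conversion shown here is one manual option for values that carry both "." as a thousands separator and "," as the decimal mark; it is not the read.csv2 route itself.

```r
# Hypothetical two-index sample; values use "." for thousands and "," for decimals.
raw <- data.frame(
  Date    = c("2009-01-05", "2010-01-04", "2011-01-03"),
  BOVESPA = c("40.000,15", "55.000,50", "60.000,137"),
  DAX     = c("4.983,99", "6.048,30", "6.989,74"),
  stringsAsFactors = FALSE
)

# Convert European-formatted text to numbers: drop ".", then turn "," into ".".
to_number <- function(x) {
  as.numeric(gsub(",", ".", gsub(".", "", x, fixed = TRUE), fixed = TRUE))
}

idx <- data.frame(Date    = as.Date(raw$Date),
                  BOVESPA = to_number(raw$BOVESPA),
                  DAX     = to_number(raw$DAX))

# Rebase every series to 100 at its first observation so all share one scale.
rebased <- idx
rebased[-1] <- lapply(idx[-1], function(p) 100 * p / p[1])

# Year-only x labels: suppress the default axis, then add yearly ticks.
plot_rebased <- function(d) {
  matplot(d$Date, d[-1], type = "l", lty = 1, xaxt = "n",
          xlab = "Year", ylab = "Index (first observation = 100)")
  axis.Date(1, at = seq(min(d$Date), max(d$Date), by = "year"), format = "%Y")
  legend("topleft", legend = names(d)[-1], col = seq_along(d[-1]), lty = 1)
}

# Readable y labels in the "40.000" style for a raw price axis.
y_ticks  <- seq(40000, 60000, by = 5000)
y_labels <- format(y_ticks, big.mark = ".", decimal.mark = ",")

if (interactive()) plot_rebased(rebased)
```

For a plot of raw prices, the labels could be drawn with `axis(2, at = y_ticks, labels = y_labels)` after plotting with `yaxt = "n"`. If separate panels are preferred over one shared scale, `par(mfrow = c(4, 4))` lays out 16 base plots in a grid.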

Trouble getting my data into wide form with the reshape package

I am currently analysing a rather large dataset (22k+ records) and am having some trouble getting the data into a wide format (with one row corresponding to each observation, and columns representing variables).
The data came in two CSV files, one giving demographics and the other giving participants' probability ratings for a number of questions. Both of these CSV files were in long format.
I have used the reshape (and reshape2 for speed) packages to attempt to solve my problem. The specific issue I am having is the following.
I have the participants' probability ratings in the following form (after one successful reshape).
dtf <- read.csv("http://dl.dropbox.com/u/8566396/foobar.csv")
Now, the format I would like my data to be in is as follows:
User ID, Qid1, ..., Qid255, Time, with the probability for each question in that question's column.
I have tried a loop and apply to put the values into a new data frame, and many variations of melt and cast. I have also tried the base reshape function, but all to no avail.
In the past, I've always edited my CSV files directly, but this is not an option with the size of this file (my laziness when it comes to data manipulation within R has come back to haunt me).
Any advice or solution you can give to avoid me having to do this by hand would be greatly appreciated.
Your dataset has 6 rows, 3 of which have the column "variable" equal to "probability" and 3 of which have that column equal to "time". You want to have probability be the value of each, and time be added onto the right.
I think there's a difficulty in making this work for you because what you want to do isn't clear. You have values for each UID-Time-X### cell, and values for each UID-Prob-X### cell. Therefore, you have to discard information to get it into your preferred format (UID-Time-X### with probabilities as the values). It seems to me like you're treating time as an ID variable, but it's storing values like a content variable.
To avoid discarding any data, your output would have to look something like:
UID Time1 Time2 Time3 Prob1 Prob2 Prob3
Which is simply reshaped wide.
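A sketch of that wide layout using the base `reshape()` function. The long table below (columns `uid`, `qid`, `variable`, `value`) is a hypothetical stand-in, since the linked file's exact structure is not reproduced here.

```r
# Hypothetical long data: one row per user, question, and measured variable.
long <- data.frame(
  uid      = rep(c(1, 2), each = 4),
  qid      = rep(c(1, 2), times = 4),
  variable = rep(rep(c("Prob", "Time"), each = 2), times = 2),
  value    = c(0.2, 0.8, 12, 15, 0.5, 0.4, 9, 20),
  stringsAsFactors = FALSE
)

# One output column per variable/question pair, e.g. Prob1, Time2.
long$key <- paste0(long$variable, long$qid)

wide <- reshape(long[, c("uid", "key", "value")],
                idvar = "uid", timevar = "key", direction = "wide")

# reshape() prefixes the new columns with "value."; strip that prefix.
names(wide) <- sub("^value\\.", "", names(wide))
wide
```

Because both the probability and the time are kept for every question, no information is discarded in this layout.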
