R + ggplot: Order irregular Time Strings for Plot - r

I have a data frame with two columns. The first is a numerical value, the other is a string describing a time. The time format looks like yyyy-mm-dd--hh-mm-ss-?????? (e.g. 2015-03-04--12-11-35-669696); I don't know what the last 6 digits mean. For example:
y time
1 4.548 2014-08-11--09-07-44-202586
2 4.548 2014-08-11--09-07-54-442586
3 4.548 2014-08-11--09-08-04-522586
4 4.478 2014-08-11--09-08-14-762586
5 4.431 2014-08-11--09-08-24-522586
6 4.446 2014-08-11--09-08-34-922586
7 4.492 2014-08-11--09-08-44-522586
8 4.508 2014-08-11--09-08-54-442586
9 4.486 2014-08-11--09-09-04-202586
10 4.497 2014-08-11--09-09-14-442586
11 4.461 2014-08-11--09-09-24-202586
I want to plot them with
ggplot(df, aes(x=time, y=y)) + geom_line()
But I have the problem that ggplot doesn't know how to deal with data of class character, and in particular with my time format.
I tried to use AsciiToInt from the package {sfsmisc} to convert the strings to numerical values, but it returns a list of integers for each string (one number for each character, of course).
I can also sort my time strings with mixedsort from the package {gtools}, but I don't know how to apply it to the plot (also keeping in mind the distance between the time points).
Another problem is that I don't want every time string to appear as a tick on the x-axis, because I have around 20k rows. Maybe I can solve that problem like in this question, but I cannot check that as long as the first problem occurs.
Can you help me plot such data with the time as a numeric-like value on the x-axis?

I loaded your data from a CSV file into a data frame called timedat. First I convert the time column to POSIXct. To make a cleaner graph for test purposes I omit the seconds field; if you want to include the seconds, use the commented-out line instead.
library(ggplot2)
timedat <- read.csv("~/Work/Timedat.csv")
> timedat
X y time
1 1 4.548 2014-08-11--09-07-44-202586
2 2 4.548 2014-08-11--09-07-54-442586
3 3 4.548 2014-08-11--09-08-04-522586
4 4 4.478 2014-08-11--09-08-14-762586
5 5 4.431 2014-08-11--09-08-24-522586
6 6 4.446 2014-08-11--09-08-34-922586
7 7 4.492 2014-08-11--09-08-44-522586
8 8 4.508 2014-08-11--09-08-54-442586
9 9 4.486 2014-08-11--09-09-04-202586
10 10 4.497 2014-08-11--09-09-14-442586
11 11 4.461 2014-08-11--09-09-24-202586
> str(timedat)
'data.frame': 11 obs. of 3 variables:
$ X : int 1 2 3 4 5 6 7 8 9 10 ...
$ y : num 4.55 4.55 4.55 4.48 4.43 ...
$ time: Factor w/ 11 levels "2014-08-11--09-07-44-202586",..: 1 2 3 4 5 6 7 8 9 10 ...
# To keep the seconds, use this line instead of the one below it:
#timedat$time<-as.POSIXct(as.character(timedat$time),format = "%Y-%m-%d--%H-%M-%S")
timedat$time<-as.POSIXct(as.character(timedat$time),format = "%Y-%m-%d--%H-%M")
qplot(data=timedat,y=y,x=time)+theme_bw()
This produces the following plot with the dates nicely ordered.
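The six trailing digits may well be microseconds, though nothing in the question or the answer confirms it. Under that assumption, a minimal sketch that keeps the full timestamp is to turn the last hyphen into a decimal point and parse the seconds with %OS. The helper name parse_time_string, the UTC time zone and the one-minute tick spacing are choices made for this illustration only.

```r
# Assumption: the last six digits are microseconds (not confirmed by the source).
# parse_time_string is a hypothetical helper introduced for this sketch.
parse_time_string <- function(x, tz = "UTC") {
  # "2014-08-11--09-07-44-202586" -> "2014-08-11--09-07-44.202586"
  with_fraction <- sub("-([0-9]{6})$", ".\\1", x)
  # On input, %OS reads seconds including a fractional part
  as.POSIXct(with_fraction, format = "%Y-%m-%d--%H-%M-%OS", tz = tz)
}

# First four rows of the example data
timedat <- data.frame(
  y = c(4.548, 4.548, 4.548, 4.478),
  time = c("2014-08-11--09-07-44-202586", "2014-08-11--09-07-54-442586",
           "2014-08-11--09-08-04-522586", "2014-08-11--09-08-14-762586"),
  stringsAsFactors = FALSE
)
timedat$time <- parse_time_string(timedat$time)

# A POSIXct x variable gets a continuous date-time axis, so the spacing
# between observations is kept and ticks are not drawn once per row.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(timedat, aes(x = time, y = y)) +
    geom_line() +
    scale_x_datetime(date_breaks = "1 min", date_labels = "%H:%M")
}
```

With around 20k rows, a wider date_breaks value (for example "1 hour") would probably be the more readable choice.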

Related

creating a dataframe of means of 5 randomly sampled observations

I'm currently reading "Practical Statistics for Data Scientists" and following along in R as they demonstrate some code. There is one chunk of code I'm particularly struggling to follow the logic of and was hoping someone could help. The code in question is creating a dataframe with 1000 rows where each observation is the mean of 5 randomly drawn income values from the dataframe loans_income. However, I'm getting confused about the logic of the code as it is fairly complicated with a tapply() function and nested rep() statements.
The code to create the dataframe in question is as follows:
samp_mean_5 <- data.frame(income = tapply(sample(loans_income$income,1000*5),
rep(1:1000,rep(5,1000)),
FUN = mean),
type='mean_of_5')
In particular, I'm confused about the nested rep() statements and the 1000*5 portion of the sample() function. Any help understanding the logic of the code would be greatly appreciated!
For reference, the original dataset loans_income simply has a single column of 50,000 income values.
You have 50,000 income values in a single vector. Let's break your code down:
tapply(sample(loans_income$income,1000*5),
rep(1:1000,rep(5,1000)),
FUN = mean)
I will replace 1000 with 10 and income with random numbers, so it's easier to explain. I also call set.seed(1) so the results can be reproduced.
sample(loans_income$income,1000*5)
We draw 50 random incomes from your vector without replacement. They are (temporarily) put into a vector of length 50, so the output looks like this:
> sample(runif(50000),10*5)
[1] 0.73283101 0.60329970 0.29871173 0.12637654 0.48434952 0.01058067 0.32337850
[8] 0.46873561 0.72334215 0.88515494 0.44036341 0.81386225 0.38118213 0.80978822
[15] 0.38291273 0.79795343 0.23622492 0.21318431 0.59325586 0.78340477 0.25623138
[22] 0.64621658 0.80041393 0.68511759 0.21880083 0.77455662 0.05307712 0.60320912
[29] 0.13191926 0.20816298 0.71600799 0.70328349 0.44408218 0.32696205 0.67845445
[36] 0.64438336 0.13241312 0.86589561 0.01109727 0.52627095 0.39207860 0.54643661
[43] 0.57137320 0.52743012 0.96631114 0.47151170 0.84099503 0.16511902 0.07546454
[50] 0.85970500
rep(1:1000,rep(5,1000))
Now we are creating an indexing vector of length 50:
> rep(1:10,rep(5,10))
[1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4 5 5 5 5 5 6 6 6
[29] 6 6 7 7 7 7 7 8 8 8 8 8 9 9 9 9 9 10 10 10 10 10
Those indices "group" the samples from step 1. So basically this vector tells R that the first 5 entries of your "sample vector" belong together (index 1), the next 5 entries belong together (index 2) and so on.
FUN = mean
Just apply the mean-function on the data.
tapply
So tapply takes the sampled data (sample-part) and groups them by the second argument (the rep()-part) and applies the mean-function on each group.
If you are familiar with data.frames and the dplyr package, take a look at this (only the first 10 rows are displayed):
set.seed(1)
df <- data.frame(income=sample(runif(5000),10*5), index=rep(1:10,rep(5,10)))
income index
1 0.42585569 1
2 0.16931091 1
3 0.48127444 1
4 0.68357403 1
5 0.99374923 1
6 0.53227877 2
7 0.07109499 2
8 0.20754511 2
9 0.35839481 2
10 0.95615917 2
I attached an index to the random numbers (your income). Now we calculate the mean per group:
library(dplyr)  # provides %>%, group_by() and summarise()
df %>%
  group_by(index) %>%
  summarise(mean=mean(income))
which gives us
# A tibble: 10 x 2
index mean
<int> <dbl>
1 1 0.551
2 2 0.425
3 3 0.827
4 4 0.391
5 5 0.590
6 6 0.373
7 7 0.514
8 8 0.451
9 9 0.566
10 10 0.435
Compare it to
set.seed(1)
tapply(sample(runif(5000),10*5),
rep(1:10,rep(5,10)),
mean)
which yields basically the same result:
1 2 3 4 5 6 7 8 9
0.5507529 0.4250946 0.8273149 0.3905850 0.5902823 0.3730092 0.5143829 0.4512932 0.5658460
10
0.4352546
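To make the grouping step concrete, here is a small sketch in base R. It checks that the nested rep() call is just a longer way of writing rep(1:10, each = 5), and that the same group means fall out of reshaping the draws into a 5-row matrix. The variable names and the runif() stand-in for loans_income$income are assumptions made for illustration.

```r
set.seed(1)
income <- runif(50000)            # stand-in for loans_income$income
draws  <- sample(income, 10 * 5)  # 50 draws without replacement

# Two ways to build the grouping index 1 1 1 1 1 2 2 2 2 2 ...
groups_nested <- rep(1:10, rep(5, 10))
groups_each   <- rep(1:10, each = 5)
same_index <- identical(groups_nested, groups_each)

# Mean of each block of 5 consecutive draws
means_tapply <- tapply(draws, groups_each, mean)

# Same computation: fill a 5 x 10 matrix column by column, then average columns
means_matrix <- colMeans(matrix(draws, nrow = 5))
same_means <- isTRUE(all.equal(as.vector(means_tapply), means_matrix))
```

Both same_index and same_means should come out TRUE, which is one way to convince yourself that the index vector only labels consecutive blocks of five.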

Getting a difference between time(n+1)-time(n) in a dataframe in r

I have a dataframe where the columns represent monthly data and the rows represent different simulations. The data I am working with accumulate over time, so I want to take the difference between consecutive months to get the true value for each month. There are no headers in my data frame.
For example:
View(df)=
1 3 4 6 19 23 24 25 26 ...
1 2 3 4 5 6 7 8 9 ...
0 0 2 3 5 7 14 14 14 ...
My plan was to use the diff() function or something like it, but I am having trouble using it on a dataframe.
I have tried:
df1<-diff(df, lag = 1, differences = 1)
but only get zeros.
I am grateful for any advice.
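See ?apply. Note that for a data frame
apply(df,2,diff)
takes differences down each column, i.e. between successive rows, and sapply(df,diff) does the same because a data frame is a list of column vectors. Since your months are the columns, difference along each row instead and transpose the result back: t(apply(df,1,diff)).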
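Because the question stores months in columns, the month-to-month differences run along each row. A minimal sketch on the visible part of the example data (the values hidden behind "..." are left out, and the names cumulative and monthly are made up for this illustration):

```r
# Cumulative values: one row per simulation, one column per month
cumulative <- data.frame(rbind(
  c(1, 3, 4, 6, 19, 23, 24, 25, 26),
  c(1, 2, 3, 4, 5, 6, 7, 8, 9),
  c(0, 0, 2, 3, 5, 7, 14, 14, 14)
))

# apply(..., 1, diff) returns one column per original row, so transpose back
monthly <- t(apply(cumulative, 1, diff))

# Optionally keep the first month as its own value in front of the differences
monthly_full <- cbind(cumulative[[1]], monthly)
```

Each row of monthly has one column fewer than the input, since diff() drops the first month.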

ggplot2 time series with an ordered factor on the x-axis

I'd be extremely grateful for your assistance with the following issue.
I wish to create a representative time series for different subjects who have undertaken a test at discrete intervals. The data frame is called Hayling.Impulsivity. Here is a sample of the data in wide format:
Subject Baseline 2-weeks 6-weeks 3-months
1 1 15 23 5 NA
2 2 15 27 3 4
3 3 5 7 0 19
4 4 1 5 2 6
5 5 3 7 18 27
6 6 0 2 19 2
I then made Subject a factor:
Hayling.Impulsivity$Subject<-factor(Hayling.Impulsivity$Subject)
I then melted the data frame into long format using the reshape2 package:
Long.H.I.<-melt(Hayling.Impulsivity, id.vars="Subject", variable.name="Follow Up", value.name="Hayling AB Error Score")
I then ordered the measurement variables:
Long.H.I.$"Follow Up"<-factor(Long.H.I.$"Follow Up", levels=c("Baseline", "2-weeks", "6-weeks", "3-months"), ordered=TRUE)
Here's the structure of this data frame:
'data.frame': 52 obs. of 3 variables:
$ Subject : Factor w/ 13 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Follow Up : Ord.factor w/ 4 levels "Baseline"<"2-weeks"<..: 1 1 1 1 1 1 1 1 1 1 ...
$ Hayling AB Error Score: num 15 15 5 1 3 0 3 0 0 33 ...
Now I try to construct the time series in ggplot:
ggplot(Long.H.I., aes("Follow Up", "Hayling AB Error Score", group=Subject, colour=Subject))+geom_line()
But all I get is an empty plot. I'm not permitted to post an image to show you but the x and y axes are labelled only with "Follow Up" and "Hayling AB Error Score" respectively. There are no actual scales / values / categories on either axis and no points have been plotted.
Where have I gone wrong?
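The problem is the quoted names: inside aes(), a quoted string such as "Follow Up" is treated as a constant rather than as a column, and the spaces in your column names cause trouble even if you use aes_string. You could replace the spaces with underscores and then label the x and y axes explicitly. Code could look like: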
library(reshape2)
library(ggplot2)
Hayling.Impulsivity$Subject<-factor(Hayling.Impulsivity$Subject)
Long.H.I.<-melt(Hayling.Impulsivity, id.vars="Subject",
                variable.name="Follow_Up", value.name="Hayling_AB_Error_Score")
# The levels must match the column names exactly ("6-weeks", lower case)
Long.H.I.$Follow_Up <-factor(Long.H.I.$Follow_Up,
                levels=c("Baseline","2-weeks","6-weeks","3-months"), ordered=TRUE)
ggplot(Long.H.I., aes(Follow_Up, Hayling_AB_Error_Score, group=Subject, colour=Subject))+
  geom_line() +
  labs(x="Follow Up", y="Hayling AB Error Score")
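If you would rather keep the original column names, another option is to refer to them with backticks, which is how R quotes names that contain spaces. The sketch below builds a small long-format data frame by hand from the first two subjects in the question (the object name long_scores is made up for this illustration) and only draws the plot if ggplot2 is installed.

```r
# Long-format data with non-syntactic column names kept as they are
long_scores <- data.frame(
  Subject = factor(rep(1:2, times = 3)),
  `Follow Up` = factor(rep(c("Baseline", "2-weeks", "6-weeks"), each = 2),
                       levels = c("Baseline", "2-weeks", "6-weeks"),
                       ordered = TRUE),
  `Hayling AB Error Score` = c(15, 15, 23, 27, 5, 3),
  check.names = FALSE  # stop data.frame() from turning spaces into dots
)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # Backticks make aes() look the columns up instead of using string constants
  p <- ggplot(long_scores,
              aes(`Follow Up`, `Hayling AB Error Score`,
                  group = Subject, colour = Subject)) +
    geom_line()
}
```

The axis titles then default to the column names, so no separate labs() call is needed.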

Plot empty groups in boxplot

I want to plot a lot of boxplots in one particular style to compare them.
But when a group is empty, the group isn't plotted.
Let's say I have a data frame:
a b
1 1 5
2 1 4
3 1 6
4 1 4
5 2 9
6 2 8
7 2 9
8 3 NaN
9 3 NaN
10 3 NaN
11 4 2
12 4 8
and I use boxplot to plot it:
boxplot(b ~ a , df)
Then I get the plot without group 3
(which I can't show because I don't have "10 reputation").
I found some solutions for removing empty groups via Google, but my problem is the other way around.
I also found the workaround at=c(1,2,4), but since I generate the R script with Python and different groups can be empty, I would prefer that the groups aren't dropped at all.
Oh I don't think I have the time to grapple with additional packages.
Therefore I would be thankful for solutions without them.
You can get the group on the x-axis by
boxplot(b ~ a , df, na.action=na.pass)
Or
boxplot(b~factor(a), df)
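To see what each call does without drawing anything, you can pass plot = FALSE and look at the group names that boxplot() returns. This is only a sketch on the example data from the question; the object names are made up for illustration.

```r
df <- data.frame(
  a = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4),
  b = c(5, 4, 6, 4, 9, 8, 9, NaN, NaN, NaN, 2, 8)
)

# plot = FALSE returns the computed statistics instead of drawing
default_call <- boxplot(b ~ a, df, plot = FALSE)
pass_call    <- boxplot(b ~ a, df, na.action = na.pass, plot = FALSE)
factor_call  <- boxplot(b ~ factor(a), df, plot = FALSE)

# Group labels that would appear on the x-axis for each variant
default_call$names
pass_call$names
factor_call$names
```

The first call should list only the non-empty groups, while the other two should keep a slot for group 3, either because the NaN rows are no longer removed or because the factor remembers all of its levels.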

Reverting to Factor Codes R

Let's say I have a data.frame that looks like this:
df.test <- data.frame(1:26, 1:26)
colnames(df.test) <- c("a","b")
and I apply a factor:
df.test$a <- factor(df.test$a, levels=c(1:26), labels=letters)
Now I would like to convert it back to the integer codes:
as.numeric(df.test[1]) ## fails with an error
But this works:
as.numeric(df.test$a)
Why is that?
Actually, Joshua's links are not applicable here because the task is not converting from a factor whose levels have a numeric interpretation. Your original effort that produced an error was almost correct. It was missing only a comma before the 1:
df.test <- data.frame(1:26, 1:26)
colnames(df.test) <- c("a","b")
df.test$a <- factor(df.test$a, levels=c(1:26), labels=letters)
as.numeric(df.test[,1])
# [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
# [19] 19 20 21 22 23 24 25 26
Or you could have used "[[":
> as.numeric(df.test[[1]])
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
[19] 19 20 21 22 23 24 25 26
as.numeric will convert a factor to numeric:
as.numeric(df.test$a)
Accessing a column by name gives you a factor vector, which can be converted to numeric.
However, a data frame is a list (of columns), and when you use the single bracket operator and a single number on a list, you get a list of length one. The same applies for data frames, so df.test[1] gets you column one as a new data frame, which cannot be coerced by as.numeric(). I did not know this!
> str(df.test$a)
Factor w/ 26 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
> str(df.test[1])
'data.frame': 26 obs. of 1 variable:
$ a: Factor w/ 26 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
To respond to your edit: Keep in mind that a factor has two parts: 1) the labels, and 2) the underlying integer codes. The two answers I linked to in my comment were to convert the labels to numeric. If you just want to get the internal codes, use as.integer(df.test$a) as demonstrated in the examples section of ?factor. aL3xa answered your question about why as.numeric(df.test[1]) throws an error.
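The two distinctions made in these answers, codes versus labels and [ versus [[, can be checked side by side. A small sketch (the factor f with the labels "10" and "20" is an extra example made up for this illustration):

```r
df.test <- data.frame(a = factor(1:26, levels = 1:26, labels = letters),
                      b = 1:26)

# [ with one index keeps the data frame; [[ and $ extract the column itself
single_bracket_is_df <- is.data.frame(df.test[1])
double_bracket_is_factor <- is.factor(df.test[[1]])

# Internal integer codes of the factor
codes <- as.integer(df.test$a)

# For a factor whose labels look like numbers, codes and labels differ
f <- factor(c("10", "20", "10"))
f_codes  <- as.integer(f)                # positions in levels(f)
f_values <- as.numeric(as.character(f))  # the labels read as numbers
```

Here codes recovers 1 to 26, while f_codes and f_values differ, which is exactly the case the linked answers were about.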
