X axis not ordering discrete value after melt of DF - r

I am fairly new to R and I have a problem with the usage of ggplot2 together with the melt function. In the specific case I am trying to create a multiline plot which represents certain time gaps and their evolution during a race.
Say the data frame is the following (DF_TimeGap)
Lap Ath1 Ath2 Ath3 Ath4 Ath5
1 1 0 0 0 -1 1
2 2 0 0 14 0 28
3 3 0 -1 3 0 18
4 4 0 0 1 0 3
5 5 0 -8 1 -9 3
6 6 0 -22 0 -23 1
7 7 0 0 1 -19 3
8 8 0 -1 13 -2 13
9 9 0 -1 1 0 -1
10 10 0 5 7 8 10
I then melt it with
library(reshape2)
DFMelt_TimeGap = melt(DF_TimeGap, id.var="Lap")
names(DFMelt_TimeGap)[2] = "Rider"
names(DFMelt_TimeGap)[3] = "Gap"
and it looks like (I'll just report the first two for space reasons)
Lap Rider Gap
1 1 Ath1 0
2 2 Ath1 0
3 3 Ath1 0
4 4 Ath1 0
5 5 Ath1 0
6 6 Ath1 0
7 7 Ath1 0
8 8 Ath1 0
9 9 Ath1 0
10 10 Ath1 0
11 1 Ath2 0
12 2 Ath2 0
13 3 Ath2 -1
14 4 Ath2 0
15 5 Ath2 -8
16 6 Ath2 -22
17 7 Ath2 0
18 8 Ath2 -1
19 9 Ath2 -1
20 10 Ath2 5
...
When I then try to draw the multiline plot with
ggplot(DFMelt_TimeGap, aes(x = Lap, y = Gap, col= Rider, group = Rider)) +
geom_point()+
geom_line()+
xlab("Lap")+ ylab("Gap (s)")
what I obtain is the following graph (forget about the colour labels, I am avoiding unnecessary code),
which would be fine if not for the fact that the ordering on the x axis is
1 10 2 3 4 5 6 7 8 9
Is anyone aware of how to fix this sort of issue?
Thanks to everyone who is so keen to contribute

Somewhere along the way, Lap is being turned into a character. My guess is that in your real data Lap is already a character (or worse, a factor) before the melt. In your ggplot call the x axis is then mapped to a character column, which is ordered alphabetically by default.
You can verify this via str(DFMelt_TimeGap).
The best fix is to make sure that Lap is numeric to start with, so DF_TimeGap$Lap <- as.numeric(as.character(DF_TimeGap$Lap)) should fix it.
I used as.numeric(as.character(.)) in case your Lap was originally stored as a factor; calling as.numeric() directly on a factor returns the internal level codes rather than the lap numbers.
This will result in a numeric scale for your plot. You may like to add scale_x_continuous(breaks = 1:10) to have a break at each Lap number.
If you want to stick with the factor/character variable, you have to set the order of the levels in DFMelt_TimeGap manually, which you could do via DFMelt_TimeGap$Lap <- factor(DFMelt_TimeGap$Lap, 1:10)
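To make the cause concrete, here is a minimal base R sketch of what happens when lap numbers are stored as text. The vector names lap_chr and lap_fct are made up for illustration. It assumes, as the answer guesses, that Lap arrives as a character or factor column.

```r
# Lap numbers stored as text (assumed to mirror the real data)
lap_chr <- as.character(1:10)

# Character values sort alphabetically, which is the order seen on the x axis
sorted_chr <- sort(lap_chr)

# A factor built from text inherits that alphabetical level order
lap_fct <- factor(lap_chr)

# Pitfall: as.numeric() on a factor returns level codes, not lap numbers
codes <- as.numeric(lap_fct)

# Going through as.character() first recovers the real lap numbers
lap_num <- as.numeric(as.character(lap_fct))

# Alternative: keep a factor, but with levels in numeric order
lap_ordered <- factor(lap_chr, levels = 1:10)

print(sorted_chr)
print(levels(lap_ordered))
```

Either lap_num (continuous axis) or lap_ordered (discrete axis) should give the expected 1 to 10 ordering when mapped to x.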

Related

Data Frame- Add number of occurrences with a condition in R

I'm struggling to figure out how to do the following. I want to count how many days of high sales occurred immediately before a price change. For example, I have a price change on day 10, and the high sales indicator flags any day with sales greater than or equal to 10. I need my algorithm to count the number of consecutive high-sales days leading up to the change.
In this case it should return 5 (days 5 to 9).
For example purposes, the dataframe is called df. Code:
#trying to create a while loop that will check if lag(high_sales) is 1, if yes it will count until
#there's a lag(high_sales) ==0
#loop is just my dummy variable that will take me out of the while loop
count_sales<-0
loop<-0
df<- df %>% mutate(consec_high_days= ifelse(price_change > 0, while(loop==0){
if(lag(High_sales_ind)==1){
count_sales<-count_sales +1}
else{loop<-0}
count_sales},0))
day  price  price_change  sales  High_sales_ind
1    5      0             12     1
2    5      0             6      0
3    5      0             5      0
4    5      0             4      0
5    5      0             10     1
6    5      0             10     1
7    5      0             10     1
8    5      0             12     1
9    5      0             14     1
10   7      2             3      0
11   7      0             2      0
This is my error message:
Warning: Problem with mutate() column consec_high_days.
i consec_high_days = ifelse(...).
i the condition has length > 1 and only the first element will be used
Warning: Problem with mutate() column consec_high_days.
i consec_high_days = ifelse(...).
i 'x' is NULL so the result will be NULL
Error: Problem with mutate() column consec_high_days.
i consec_high_days = ifelse(...).
x replacement has length zero
Any help would be greatly appreciated.
This is a very inelegant brute-force answer, though hopefully someone better than me can provide a more elegant answer - but to get the desired dataset, you can try:
df <- read.table(text = "day price price_change sales High_sales_ind
1 5 0 12 1
2 5 0 6 0
3 5 0 5 0
4 5 0 4 0
5 5 0 10 1
6 5 0 10 1
7 5 0 10 1
8 5 0 12 1
9 5 0 14 1
10 7 2 3 0
11 7 0 2 0", header = TRUE)
library(dplyr)
# position of each row within its run of identical high-sales status
df$seq <- sequence(rle(as.character(df$sales >= 10))$lengths)
# run length reached on the previous day
df <- df %>% mutate(lseq = lag(seq))
# number of consecutive days before the price change, and the row where that run ends
keepz <- df[df$price_change != 0, "lseq"]
end <- as.numeric(rownames(df[df$price_change != 0,]))-1
# the run starts keepz - 1 rows before it ends (assumes a single price change)
df_want <- df[(end - keepz + 1):end,-c(6:7)]
Output:
# day price price_change sales High_sales_ind
# 5 5 5 0 10 1
# 6 6 5 0 10 1
# 7 7 5 0 10 1
# 8 8 5 0 12 1
# 9 9 5 0 14 1
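The attempt in the question fails for two reasons: while() is not vectorised and returns NULL, so it cannot supply values to ifelse(), and lag(High_sales_ind) is a whole column, so if() sees a condition of length greater than one. A minimal base R sketch of the counting itself is given below. The helper name count_high_before is made up for illustration, and the code assumes a price change is marked by price_change > 0, as in the question's code.

```r
# Example data copied from the question
df <- data.frame(
  day            = 1:11,
  price          = c(5, 5, 5, 5, 5, 5, 5, 5, 5, 7, 7),
  price_change   = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0),
  sales          = c(12, 6, 5, 4, 10, 10, 10, 12, 14, 3, 2),
  High_sales_ind = c(1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0)
)

# Hypothetical helper: walk backwards from row i - 1 and count
# consecutive rows whose indicator equals 1
count_high_before <- function(ind, i) {
  n <- 0
  j <- i - 1
  while (j >= 1 && ind[j] == 1) {
    n <- n + 1
    j <- j - 1
  }
  n
}

# Rows with a price change get the count; all other rows stay at 0
df$consec_high_days <- 0
changes <- which(df$price_change > 0)
df$consec_high_days[changes] <- vapply(
  changes,
  function(i) count_high_before(df$High_sales_ind, i),
  numeric(1)
)

print(df[changes, c("day", "consec_high_days")])
```

Because the loop runs once per price-change row, this also handles several price changes in the same data frame, and returns 0 when the day before a change was not a high-sales day.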

How to plot two sets of data with two different color schemes on the same graph using ggplot?

Note: as I'm writing this I can't figure out how to insert images, I'll work on it after posting, but if you run the code below, you should be able to see the graphs I'm talking about....sorry!
Essentially, I have these two graphs and I want them to be on the same plot (overlayed on top of one another), but I need them to use different color schemes or I won't be able to tell them apart very easily.
I've looked everywhere on this site and while there are a lot of similar questions, none of them have worked quite in the way that I need them to. The closest ones I've linked below, just know that I've read them and they did not solve my issues:
Distinct color palettes for two different groups in ggplot2
R ggplot two color palette on the same plot
The first graph uses this data (shortened to 50 lines, actually goes to about 1000), RuleCount repeats 1-14 over and over, TrainingPass goes up until about 60
RuleCount TrainingPass m4Accuracy
1 1 -1 0.000000000
2 2 -1 0.000000000
3 3 -1 0.004225352
4 4 -1 0.014225352
5 5 -1 0.022816901
6 6 -1 0.182957746
7 7 -1 0.194507042
8 8 -1 0.207183099
9 9 -1 0.239859155
10 10 -1 0.362394366
11 11 -1 0.430704225
12 12 -1 0.567887324
13 13 -1 0.582535211
14 14 -1 0.602676056
15 1 0 0.000000000
16 2 0 0.000281690
17 3 0 0.006901408
18 4 0 0.018732394
19 5 0 0.031267606
20 6 0 0.202676056
21 7 0 0.215633803
22 8 0 0.231830986
23 9 0 0.262253521
24 10 0 0.373661972
25 11 0 0.440281690
26 12 0 0.573802817
27 13 0 0.588169014
28 14 0 0.608873239
29 1 1 0.000985915
30 2 1 0.014788732
31 3 1 0.032957746
32 4 1 0.071408451
33 5 1 0.113943662
34 6 1 0.276760563
35 7 1 0.290281690
36 8 1 0.303943662
37 9 1 0.335633803
38 10 1 0.438028169
39 11 1 0.501971831
40 12 1 0.625070423
41 13 1 0.637323944
42 14 1 0.658169014
43 1 2 0.000985915
44 2 2 0.015915493
45 3 2 0.030704225
46 4 2 0.076619718
47 5 2 0.119436620
48 6 2 0.280563380
49 7 2 0.294507042
50 8 2 0.308732394
I graphed it using this code:
ggplot(df_m4, aes(x=RuleCount, y=m4Accuracy, group = TrainingPass, color = TrainingPass)) +
geom_line()+
scale_color_gradient(low = "green", high = "blue")
Resulting in this graph:
[Figure: m4 Accuracy]
The second graph is essentially the same data and code, except rather than getting a bunch of slightly varying lines on the graph, each of the lines ends up being the same line
data:
RuleCount TrainingPass Accuracy
1 1 -1 0.000422535
2 2 -1 0.000422535
3 3 -1 0.002676056
4 4 -1 0.005915493
5 5 -1 0.007746479
6 6 -1 0.053239437
7 7 -1 0.059718310
8 8 -1 0.068309859
9 9 -1 0.099859155
10 10 -1 0.197042254
11 11 -1 0.256197183
12 12 -1 0.421971831
13 13 -1 0.440422535
14 14 -1 0.468028169
15 1 0 0.000422535
16 2 0 0.000422535
17 3 0 0.002676056
18 4 0 0.005915493
19 5 0 0.007746479
20 6 0 0.053239437
21 7 0 0.059718310
22 8 0 0.068309859
23 9 0 0.099859155
24 10 0 0.197042254
25 11 0 0.256197183
26 12 0 0.421971831
27 13 0 0.440422535
28 14 0 0.468028169
29 1 1 0.000422535
30 2 1 0.000422535
31 3 1 0.002676056
32 4 1 0.005915493
33 5 1 0.007746479
34 6 1 0.053239437
35 7 1 0.059718310
36 8 1 0.068309859
37 9 1 0.099859155
38 10 1 0.197042254
39 11 1 0.256197183
40 12 1 0.421971831
41 13 1 0.440422535
42 14 1 0.468028169
43 1 2 0.000422535
44 2 2 0.000422535
45 3 2 0.002676056
46 4 2 0.005915493
47 5 2 0.007746479
48 6 2 0.053239437
49 7 2 0.059718310
50 8 2 0.068309859
code:
ggplot(df_rules_only, aes(x=RuleCount, y=Accuracy, group = TrainingPass, color = TrainingPass)) +
geom_line() +
scale_color_gradient(low = "green", high = "blue")
Resulting in this graph:
[Figure: rules only Accuracy]
I understand how to get the data on to the same graph. By combining my two data frames and using the code below, I can add the 'rules_only' data to the 'm4' graph:
ggplot(df_Training, aes(x=ruleCount, y=m4Accuracy, group = training_pass, color = training_pass)) +
geom_line() +
scale_color_gradient(low = "green", high = "blue")+
geom_line(aes(x=ruleCount, y=rulesOnlyAccuracy))
Resulting in this graph:
[Figure: both_data_sets]
The problem is that the new data blends right in with the old because it has the same color scheme.
At first I tried keeping them in the same data frame and just adding "color = 'orange'" to the last line of the previous code, but that gives me the error: "Error: Discrete value supplied to continuous scale"
Next I split them up into the two data frames you see above and tried to graph them this way:
ggplot(df_m4, aes(x=RuleCount, y=m4Accuracy, group = TrainingPass, color = TrainingPass)) +
geom_line() +
scale_color_gradient(low = "green", high = "blue")+
geom_line(df_rules_only, aes(x=RuleCount, y=Accuracy, color = "orange"))
but I get the error: "Error: mapping must be created by aes()"
Those last two attempts were kind of shots in the dark since I couldn't find anything else to try, but I'm pretty certain R doesn't work that way.
I'd really prefer for answers to use ggplot since other graphs never look quite as good. Just really feel like I've been going about this all wrong and could really use some help! Thank you in advance :)
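A very complicated question with a very simple answer. I wanted to move this out of the comments, where @aosmith helped me out. The fix is to set the colour of the second layer outside aes(), as a constant, so that it is not passed through the continuous colour scale. The code below makes my second group of data a different color: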
ggplot(df_Training, aes(x=ruleCount, y=m4Accuracy, group = training_pass, color = training_pass)) +
geom_line() +
geom_line(aes(x=ruleCount, y=rulesOnlyAccuracy), color = "orange")
Just have to work on adding a second legend now!
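For the two-data-frame attempt, the error "mapping must be created by aes()" most likely arises because the first positional argument of geom_line() is mapping, so a data frame passed without a name is read as the mapping. A minimal sketch follows, using only the first three RuleCount values of the first two training passes from the tables above. It names the data argument and keeps the constant colour outside aes().

```r
library(ggplot2)

# Small subsets of the data shown in the question
df_m4 <- data.frame(
  RuleCount    = rep(1:3, times = 2),
  TrainingPass = rep(c(-1, 0), each = 3),
  m4Accuracy   = c(0, 0, 0.004225352, 0, 0.000281690, 0.006901408)
)
df_rules_only <- data.frame(
  RuleCount    = rep(1:3, times = 2),
  TrainingPass = rep(c(-1, 0), each = 3),
  Accuracy     = c(0.000422535, 0.000422535, 0.002676056,
                   0.000422535, 0.000422535, 0.002676056)
)

p <- ggplot(df_m4, aes(x = RuleCount, y = m4Accuracy,
                       group = TrainingPass, color = TrainingPass)) +
  geom_line() +
  scale_color_gradient(low = "green", high = "blue") +
  # data must be named; x and group are inherited from the main mapping,
  # and the constant colour sits outside aes() so it bypasses the gradient
  geom_line(data = df_rules_only, aes(y = Accuracy), color = "orange")

print(p)
```

Since the orange colour is a fixed parameter rather than a mapped aesthetic, it does not appear in the legend.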

transform values in data frame, generate new values as 100 minus current value

I'm currently working on a script which will eventually plot the accumulation of losses from cell divisions. Firstly I generate a matrix of values and then I add the number of times 0 occurs in each column - a 0 represents a loss.
However, I am now thinking that a nice plot would be a degradation curve. So, given the following example;
>losses_plot_data <- melt(full_losses_data, id=c("Divisions", "Accuracy"), value.name = "Losses", variable.name = "Size")
> full_losses_data
Divisions Accuracy 20 15 10 5 2
1 0 0 0 0 3 25
2 0 0 0 1 10 39
3 0 0 1 3 17 48
4 0 0 1 5 23 55
5 0 1 3 8 29 60
6 0 1 4 11 34 64
7 0 2 5 13 38 67
8 0 3 7 16 42 70
9 0 4 9 19 45 72
10 0 5 11 22 48 74
Is there a way I can easily turn this table into being 100 minus the numbers shown in the table? If I can plot that data instead of my current data, I would have a lovely curve of degradation from 100% down to however many cells have been lost.
Assuming you do not want to do that for the first column:
fld <- full_losses_data
fld[, 2:ncol(fld)] <- 100 - fld[, -1]
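If Accuracy should also be left alone (the melt call in the question treats both Divisions and Accuracy as id columns), the columns to transform can be picked by name instead. The sketch below rebuilds the first three rows of the table by hand. It assumes the header lines up as Divisions, Accuracy, 20, 15, 10, 5, 2, and the names id_cols and value_cols are made up for illustration.

```r
# First three rows of full_losses_data, typed in from the question
full_losses_data <- data.frame(
  Divisions = 1:3,
  Accuracy  = c(0, 0, 0),
  `20`      = c(0, 0, 0),
  `15`      = c(0, 0, 1),
  `10`      = c(0, 1, 3),
  `5`       = c(3, 10, 17),
  `2`       = c(25, 39, 48),
  check.names = FALSE  # keep the numeric column names as they are
)

fld <- full_losses_data

# Columns that identify a row and must not be transformed
id_cols <- c("Divisions", "Accuracy")
value_cols <- setdiff(names(fld), id_cols)

# Replace each loss count with 100 minus that count
fld[value_cols] <- 100 - fld[value_cols]

print(fld)
```

The transformed data frame can then be passed to melt() in place of full_losses_data.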

summing a range of columns in data frame

I am having trouble summing select columns within a data frame, a basic problem that I've seen numerous similar, but not identical questions/answers for on StackOverflow.
With this perhaps overly complex data frame:
site<-c(223,257,223,223,257,298,223,298,298,211)
moisture<-c(7,7,7,7,7,8,7,8,8,5)
shade<-c(83,18,83,83,18,76,83,76,76,51)
sampleID<-c(158,163,222,107,106,166,188,186,262,114)
bluestm<-c(3,4,6,3,0,0,1,1,1,0)
foxtail<-c(0,2,0,4,0,1,1,0,3,0)
crabgr<-c(0,0,2,0,33,0,2,1,2,0)
johnson<-c(0,0,0,7,0,8,1,0,1,0)
sedge1<-c(2,0,3,0,0,9,1,0,4,0)
sedge2<-c(0,0,1,0,1,0,0,1,1,1)
redoak<-c(9,1,0,5,0,4,0,0,5,0)
blkoak<-c(0,22,0,23,0,23,22,17,0,0)
my.data<-data.frame(site,moisture,shade,sampleID,bluestm,foxtail,crabgr,johnson,sedge1,sedge2,redoak,blkoak)
I want to sum the counts of each plant species (bluestem, foxtail, etc. - columns 5-12 in this example) within each site, by summing rows that have the same site number. I also want to keep the information about moisture and shade (these are consistent within a site, but may also be the same between sites), and I want a new column that holds the number of rows summed.
the result would look like this
site,moisture,shade,NumSamples,bluestm,foxtail,crabgr,johnson,sedge1,sedge2,redoak,blkoak
211,5,51,1,0,0,0,0,0,1,0,0
223,7,83,4,13,5,4,8,6,1,14,45
257,7,18,2,4,2,33,0,0,1,1,22
298,8,76,3,2,4,3,9,13,2,9,40
The problem I am having is that my real data sets (and I have several of them) have from 50 to 300 plant species, and I want to refer to a range of columns (in this case, [5:12]) instead of my.data$foxtail, my.data$sedge1, etc., which is going to be very difficult with 300 species.
I know I can start off by deleting the column I don't need (sampleID)
my.data$sampleID <- NULL
but then how do I get the sums? I've messed with the aggregate command and with ddply, and have seen lots of examples which call particular column names, but just haven't gotten anything to work. I recognize this is a variant of a commonly asked and simple type of question, but I've spent hours without resolving it on my own. So, apologies for my stupidity!
This works ok:
x <- aggregate(my.data[,5:12], by=list(site=my.data$site, moisture=my.data$moisture, shade=my.data$shade), FUN=sum, na.rm=T)
library(dplyr)
my.data %>%
group_by(site) %>%
tally %>%
left_join(x)
site n moisture shade bluestm foxtail crabgr johnson sedge1 sedge2 redoak blkoak
1 211 1 5 51 0 0 0 0 0 1 0 0
2 223 4 7 83 13 5 4 8 6 1 14 45
3 257 2 7 18 4 2 33 0 0 1 1 22
4 298 3 8 76 2 4 3 9 13 2 9 40
Or to do it all in dplyr
my.data %>%
group_by(site) %>%
tally %>%
left_join(my.data) %>%
group_by(site,moisture,shade,n) %>%
summarise_each(funs(sum=sum)) %>%
select(-sampleID)
site moisture shade n bluestm foxtail crabgr johnson sedge1 sedge2 redoak blkoak
1 211 5 51 1 0 0 0 0 0 1 0 0
2 223 7 83 4 13 5 4 8 6 1 14 45
3 257 7 18 2 4 2 33 0 0 1 1 22
4 298 8 76 3 2 4 3 9 13 2 9 40
Try the following using base R (this assumes sampleID has already been dropped, so that the species sit in columns 4-11):
outdf<-data.frame(site=numeric(),moisture=numeric(),shade=numeric(),bluestm=numeric(),foxtail=numeric(),crabgr=numeric(),johnson=numeric(),sedge1=numeric(),sedge2=numeric(),redoak=numeric(),blkoak=numeric())
my.data$basic = with(my.data, paste(site, moisture, shade))
for(b in unique(my.data$basic)) {
outdf[nrow(outdf)+1,1:3] = unlist(strsplit(b,' '))
for(i in 4:11)
outdf[nrow(outdf),i]= sum(my.data[my.data$basic==b,i])
}
outdf
site moisture shade bluestm foxtail crabgr johnson sedge1 sedge2 redoak blkoak
1 223 7 83 13 5 4 8 6 1 14 45
2 257 7 18 4 2 33 0 0 1 1 22
3 298 8 76 2 4 3 9 13 2 9 40
4 211 5 51 0 0 0 0 0 1 0 0
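summarise_each() and funs() have since been deprecated in dplyr. A sketch of the same idea with across() (dplyr 1.0 or later) is given below. It selects the species by a name range, so it should scale to a few hundred columns as long as they are contiguous. The column name NumSamples is taken from the desired output in the question.

```r
library(dplyr)

# Data copied from the question
my.data <- data.frame(
  site     = c(223, 257, 223, 223, 257, 298, 223, 298, 298, 211),
  moisture = c(7, 7, 7, 7, 7, 8, 7, 8, 8, 5),
  shade    = c(83, 18, 83, 83, 18, 76, 83, 76, 76, 51),
  sampleID = c(158, 163, 222, 107, 106, 166, 188, 186, 262, 114),
  bluestm  = c(3, 4, 6, 3, 0, 0, 1, 1, 1, 0),
  foxtail  = c(0, 2, 0, 4, 0, 1, 1, 0, 3, 0),
  crabgr   = c(0, 0, 2, 0, 33, 0, 2, 1, 2, 0),
  johnson  = c(0, 0, 0, 7, 0, 8, 1, 0, 1, 0),
  sedge1   = c(2, 0, 3, 0, 0, 9, 1, 0, 4, 0),
  sedge2   = c(0, 0, 1, 0, 1, 0, 0, 1, 1, 1),
  redoak   = c(9, 1, 0, 5, 0, 4, 0, 0, 5, 0),
  blkoak   = c(0, 22, 0, 23, 0, 23, 22, 17, 0, 0)
)

result <- my.data %>%
  group_by(site, moisture, shade) %>%
  summarise(
    NumSamples = n(),               # rows summed per group
    across(bluestm:blkoak, sum),    # every species column in the range
    .groups = "drop"
  )

print(result)
```

sampleID is dropped automatically because it is neither a grouping column nor inside the bluestm:blkoak range.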

R tvm financial package

I'm trying to estimate the present value of a stream of payments using the tvm function in the financial package.
y <- tvm(pv=NA,i=2.5,n=1:10,pmt=-c(5,5,5,5,5,8,8,8,8,8))
The result that I obtain is:
y
Time Value of Money model
I% #N PV FV PMT Days #Adv P/YR C/YR
1 2.5 1 4.99 0 -5 30 0 12 12
2 2.5 2 9.97 0 -5 30 0 12 12
3 2.5 3 14.94 0 -5 30 0 12 12
4 2.5 4 19.90 0 -5 30 0 12 12
5 2.5 5 24.84 0 -5 30 0 12 12
6 2.5 6 47.65 0 -8 30 0 12 12
7 2.5 7 55.54 0 -8 30 0 12 12
8 2.5 8 63.40 0 -8 30 0 12 12
9 2.5 9 71.26 0 -8 30 0 12 12
10 2.5 10 79.09 0 -8 30 0 12 12
There is a jump in the PV from row 5 to row 6 (when the payment changes to 8) that appears to be incorrect. This affects the result in y[10,3], which is the result that I'm interested in obtaining.
The NPV formula in Excel produces similar results when the payments are the same throughout the whole stream. However, when the vector of payments is variable, the results from tvm and from NPV differ. I need to obtain the same result that the NPV formula provides in Excel.
What should I do to make this work?
The cf formula helps but it is not always consistent with Excel.
I solved my problem using the following function:
npv<-function(a,b,c) sum(a/(1+b)^c)
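A slightly more explicit sketch of that function is shown below, under the assumption that a is the vector of payments, b the rate per period and c the period index of each payment. The argument names and the default for periods are mine. The rate 0.025 / 12 is inferred from the tvm output (I% of 2.5 with P/YR of 12). The tvm table itself looks consistent with each row being valued as a separate level annuity at that row's payment, which would explain the jump at row 6, though that reading is a guess from the numbers rather than from the package documentation.

```r
# Present value of a payment stream, one payment at the end of each period.
# Like Excel's NPV, the first payment is discounted by one full period.
npv <- function(payments, rate, periods = seq_along(payments)) {
  sum(payments / (1 + rate)^periods)
}

payments <- c(5, 5, 5, 5, 5, 8, 8, 8, 8, 8)

# Assumed per-period rate: 2.5% nominal per year, 12 periods per year
rate <- 0.025 / 12

# Present value of the whole variable stream
pv_all <- npv(payments, rate)

# Present value of the first five equal payments only
pv_first5 <- npv(payments[1:5], rate)

print(pv_all)
print(pv_first5)
```

For the first five equal payments this agrees with row 5 of the tvm output to within rounding, while the value for the full stream comes out lower than the 79.09 reported in row 10.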

Resources