The following are the first 15 rows of my data:
> head(df,15)
frame.group class lane veh.count mean.speed
1 [22,319] 2 5 9 23.40345
2 [22,319] 2 4 9 24.10870
3 [22,319] 2 1 11 14.70857
4 [22,319] 2 3 8 20.88783
5 [22,319] 2 2 6 16.75327
6 (319,616] 2 5 15 22.21671
7 (319,616] 2 2 16 23.55468
8 (319,616] 2 3 12 22.84703
9 (319,616] 2 4 14 17.55428
10 (319,616] 2 1 13 16.45327
11 (319,616] 1 1 1 42.80160
12 (319,616] 1 2 1 42.34750
13 (616,913] 2 5 18 30.86468
14 (319,616] 3 3 2 26.78177
15 (616,913] 2 4 14 32.34548
'frame.group' contains time intervals, 'class' is the vehicle class (1 = motorcycles, 2 = cars, 3 = trucks) and 'lane' contains lane numbers. I want to create 3 scatter plots, one per class, with frame.group on the x-axis and mean.speed on the y-axis. Within the plot for one vehicle class, e.g. cars, I want 5 panels, one for each lane. I tried the following:
cars <- subset(df, class==2)
by(cars, lane, FUN = plot(frame.group, mean.speed))
There are two problems:
1) R does not produce the 5 plots I expected, one for each lane.
2) Only one plot is drawn, and it is a box plot, probably because I used intervals instead of numbers on the x-axis.
How can I fix the above issues? Please help.
Each time a new plot command is issued, R replaces the existing plot with the new plot. You can create a grid of plots with par(mfrow=c(1,5)), which gives 1 row of 5 plots (other values give other numbers of rows and columns). You get a boxplot because plot() dispatches to its factor method when the x variable is a factor, as the interval column frame.group appears to be. If you want a scatterplot instead, you can call plot.default directly.
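A minimal base-graphics sketch of that suggestion, assuming a data frame shaped like the `cars` subset in the question (the toy data below copies the first ten rows shown; the loop and axis labelling are my own choices, not part of the original answer). The factor is converted to its integer codes so that plot.default draws points, and the interval labels are put back with axis():

```r
# Toy stand-in for the 'cars' subset (class == 2), first ten rows of the question
cars <- data.frame(
  frame.group = factor(rep(c("[22,319]", "(319,616]"), each = 5),
                       levels = c("[22,319]", "(319,616]")),
  lane = c(5, 4, 1, 3, 2, 5, 2, 3, 4, 1),
  mean.speed = c(23.40345, 24.10870, 14.70857, 20.88783, 16.75327,
                 22.21671, 23.55468, 22.84703, 17.55428, 16.45327)
)

lanes <- sort(unique(cars$lane))
n_groups <- nlevels(cars$frame.group)

# One row of panels, one panel per lane; smaller margins so 5 panels fit
op <- par(mfrow = c(1, length(lanes)), mar = c(4, 4, 2, 1))
for (ln in lanes) {
  d <- cars[cars$lane == ln, ]
  # Integer codes of the factor give a scatterplot instead of a boxplot
  plot.default(as.integer(d$frame.group), d$mean.speed,
               xlim = c(0.5, n_groups + 0.5), xaxt = "n",
               xlab = "frame.group", ylab = "mean.speed",
               main = paste("Lane", ln))
  # Put the interval labels back on the x-axis
  axis(1, at = seq_len(n_groups), labels = levels(cars$frame.group))
}
par(op)  # restore the previous layout
```

The same loop can be repeated for each vehicle class by subsetting on class first.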
It is easier to do all this with the ggplot2 library instead of the base graphics, and the resulting plot will look much nicer:
library(ggplot2)
ggplot(cars,aes(x=frame.group,y=mean.speed))+geom_point()+facet_wrap(~lane)
See the ggplot2 documentation for more details: http://docs.ggplot2.org/current/
I have a dataset of customer ages and I want to make a frequency distribution with age bins 9 years wide.
Ages=c(83,51,66,61,82,65,54,56,92,60,65,87,68,64,51,
70,75,66,74,68,44,55,78,69,98,67,82,77,79,62,38,88,76,99,
84,47,60,42,66,74,91,71,83,80,68,65,51,56,73,55)
My desired outcome is similar to the table below; the variable names can differ (as you wish).
Could I use binCounts for this? If yes, could you help me with the code, as I am not sure what bx and idxs mean here?
binCounts(x, idxs = NULL, bx, right = FALSE) ??
Age Count
38-46 3
47-55 7
56-64 7
65-73 14
74-82 10
83-91 6
92-100 3
Much Appreciated!
I don't know binCounts, or even the package it is in, but here is a base R solution:
data.frame(table(cut(Ages,0:7*9+37)))
Var1 Freq
1 (37,46] 3
2 (46,55] 7
3 (55,64] 7
4 (64,73] 14
5 (73,82] 10
6 (82,91] 6
7 (91,100] 3
To exactly duplicate your results:
lowerlimit=c(37,46,55,64,73,82,91,100)#last break is 100 so the last label reads 92-100
Labels=paste(head(lowerlimit,-1)+1,lowerlimit[-1],sep="-")#I add one to have 38 47 etc
group=cut(Ages,lowerlimit,Labels)#Determine which group the ages belong to
tab=table(group)#Form a frequency table
as.data.frame(tab)# transform the table into a dataframe
group Freq
1 38-46 3
2 47-55 7
3 56-64 7
4 65-73 14
5 74-82 10
6 83-91 6
7 92-100 3
All this can be combined as:
data.frame(table(cut(Ages,s<-0:7*9+37,paste(head(s+1,-1),s[-1],sep="-"))))
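In case the bin start or width changes, here is a minimal sketch of the same idea wrapped in a function. `bin_ages` and its arguments are hypothetical names of mine, not from any package. It uses left-closed breaks (right = FALSE), which, if I read the matrixStats documentation correctly, is also how binCounts treats its bx bin boundaries; for whole-number ages [38,47) is the same bin as 38-46.

```r
Ages <- c(83,51,66,61,82,65,54,56,92,60,65,87,68,64,51,
          70,75,66,74,68,44,55,78,69,98,67,82,77,79,62,38,88,76,99,
          84,47,60,42,66,74,91,71,83,80,68,65,51,56,73,55)

# Hypothetical helper: count whole-number values in bins of a fixed width
bin_ages <- function(x, from, width) {
  # Enough bins to cover max(x)
  n_bins <- ceiling((max(x) - from + 1) / width)
  breaks <- from + width * (0:n_bins)
  # Labels such as "38-46": lower break, upper break minus one
  labels <- paste(head(breaks, -1), breaks[-1] - 1, sep = "-")
  grp <- cut(x, breaks = breaks, labels = labels, right = FALSE)
  as.data.frame(table(Age = grp), responseName = "Count")
}

res <- bin_ages(Ages, from = 38, width = 9)
print(res)
```

Values below `from` fall outside every bin and are not counted, so `from` should be at most min(x).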
I have the data below. I have computed linkage disequilibrium (LD) between the chromosome positions in my data, and now I would like to draw a heatmap, or some other plot, that shows the relation between columns S1 and S2.
Input:
S1 S2 R^2
1 10 73576307 11 308290 9.648065e-05
2 10 73576307 11 309127 5.023692e-04
3 11 308290 11 309127 3.927666e-01
4 10 73576307 1 158813819 1.227192e-04
5 11 308290 1 158813819 1.404745e-03
Thanks for your help in advance.
Say I have data that look like this:
level start end
1 1 133.631 825.141
2 2 133.631 155.953
3 3 146.844 155.953
4 2 293.754 302.196
5 3 293.754 302.196
6 4 293.754 301.428
7 2 326.253 343.436
8 3 326.253 343.436
9 4 333.827 343.436
10 2 578.066 611.766
11 3 578.066 611.766
12 4 578.066 587.876
13 4 598.052 611.766
14 2 811.228 825.141
15 3 811.228 825.141
or this:
level start end
1 1 3.60353 1112.62000
2 2 3.60353 20.35330
3 3 3.60353 8.77526
4 2 72.03720 143.60700
5 3 73.50530 101.13200
6 4 73.50530 81.64660
7 4 92.19030 101.13200
8 3 121.28500 143.60700
9 4 121.28500 128.25900
10 2 167.19700 185.04800
11 3 167.19700 183.44600
12 4 167.19700 182.84600
13 2 398.12300 418.64300
14 3 398.12300 418.64300
15 2 445.83600 454.54500
16 2 776.59400 798.34800
17 3 776.59400 796.64700
18 4 776.59400 795.91300
19 2 906.68800 915.89700
20 3 906.68800 915.89700
21 2 1099.44000 1112.62000
22 3 1099.44000 1112.62000
23 4 1100.14000 1112.62000
They produce the following graphs:
As you can see there are several time intervals at different levels. The level-1 interval always spans the entire duration of the time of interest. Levels 2+ have time intervals that are shorter.
What I would like to do is select the maximum number of non-overlapping time intervals covering each period, such that together they contain the maximum amount of total time. I have marked in pink which ones those would be.
For small dataframes it is possible to brute force this, but obviously there should be some more logical way of doing this. I'm interested in hearing some ideas about what I should try.
EDIT:
I think one thing that could help here is the column 'level'. The results come from Kleinberg's burst detection algorithm (package 'bursts'). You will note that the levels are hierarchically organized. Levels of the same number cannot overlap. However levels successively increasing e.g. 2,3,4 in successive rows can overlap.
In essence, I think the problem could be shortened to this. Take the levels produced, but remove level 1. This would be the vector for the 2nd example:
2 3 2 3 4 4 3 4 2 3 4 2 3 2 2 3 4 2 3 2 3 4
Then, look at the 2s: if there is at most one 3 between successive 2s, then that 2 is the longest interval. But if there are two or more 3s between successive 2s, then those 3s should be counted instead. Do this iteratively for each level. I think that should work...?
e.g.
library(magrittr) #provides %>%
vec<-df$level %>% as.vector() %>% .[-1]
vec
#[1] 2 3 2 3 4 4 3 4 2 3 4 2 3 2 2 3 4 2 3 2 3 4
max(vec) #4
vec3<-vec #need to find two or more 4's between 3s
vec3[vec3==3]<-NA
names(vec3)<-cumsum(is.na(vec3))
vec3
# 0 1 1 2 2 2 3 3 3 4 4 4 5 5 5 6 6 6 7 7 8 8
# 2 NA 2 NA 4 4 NA 4 2 NA 4 2 NA 2 2 NA 4 2 NA 2 NA 4
vec3.res<-which(table(vec3,names(vec3))["4",]>1)
which(names(vec3)==names(vec3.res) & vec3==4) #5 6
The above identifies rows 5 and 6 (which equate to rows 6 and 7 in original df) as having two fours that lie between 3's. Perhaps something using this sort of approach might work?
OK here is a stab using your second data set to test. This might not be correct in all cases!!
library(data.table)
dat <- fread("data.csv")
dat[,use:="maybe"]
make.pass <- function(dat,low,high,the.level,use) {
  check <- dat[(use!="no" & level > the.level)]
  check[,contained.by.above:=(low<=start & end<=high)]
  check[,consecutive.contained.by.above:=
          (contained.by.above &
           !is.na(shift(contained.by.above,1)) &
           shift(contained.by.above,1)),by=level]
  if(!any(check[,consecutive.contained.by.above])) {
    #Cause a side effect where we've learned we don't care:
    dat[check[(contained.by.above),rownum],use:="no"]
    print(check)
    return("yes")
  } else {
    return("no")
  }
}
dat[,rownum:=.I]
dat[level==1,use:=make.pass(dat,start,end,level,use),by=rownum]
dat
dat[use=="maybe" & level==2,use:=make.pass(dat,start,end,level,use),by=rownum]
dat
dat[use=="maybe" & level==3,use:=make.pass(dat,start,end,level,use),by=rownum]
dat
#Finally correct for last level
dat[use=="maybe" & level==4,use:="yes"]
I wrote these last steps out so you can trace them in your own interactive session and see what's happening (see the print to get an idea), but you can remove the print and also condense the last steps into something like lapply(1:dat[,max(level)-1], function(the.level) dat[use=="maybe" & level==the.level,use:=make.pass......]). In response to your comment: if there are an arbitrary number of levels, you will definitely want to use this formalism, and follow it with a final call to dat[use=="maybe" & level==max(level),use:="yes"].
Output:
> dat
level start end use rownum
1: 1 3.60353 1112.62000 no 1
2: 2 3.60353 20.35330 yes 2
3: 3 3.60353 8.77526 no 3
4: 2 72.03720 143.60700 no 4
5: 3 73.50530 101.13200 no 5
6: 4 73.50530 81.64660 yes 6
7: 4 92.19030 101.13200 yes 7
8: 3 121.28500 143.60700 yes 8
9: 4 121.28500 128.25900 no 9
10: 2 167.19700 185.04800 yes 10
11: 3 167.19700 183.44600 no 11
12: 4 167.19700 182.84600 no 12
13: 2 398.12300 418.64300 yes 13
14: 3 398.12300 418.64300 no 14
15: 2 445.83600 454.54500 yes 15
16: 2 776.59400 798.34800 yes 16
17: 3 776.59400 796.64700 no 17
18: 4 776.59400 795.91300 no 18
19: 2 906.68800 915.89700 yes 19
20: 3 906.68800 915.89700 no 20
21: 2 1099.44000 1112.62000 yes 21
22: 3 1099.44000 1112.62000 no 22
23: 4 1100.14000 1112.62000 no 23
level start end use rownum
On the off chance this is correct, the algorithm can roughly be described as follows:
Mark all the intervals as possible.
Start with a given level. Pick a particular interval (by=rownum) say called X. With X in mind, subset a copy of the data to all higher-level intervals.
Mark any of these that are contained in X as "contained in X".
If consecutive intervals at the same level are contained in X, X is no good b/c it wastes intervals. In this case label X's "use" variable as "no" so we'll never think about X again. [Note: if it's possible that non-consecutive intervals are contained in X, or that containing multiple intervals across levels could ruin X's viability, then this logic might need to be changed to count contained intervals instead of finding consecutive ones. I didn't think about this at all, but it's just occurring to me now, so use at your own risk.]
On the other hand, if X passed the test, then we've already established it's good. Mark it as a "yes." But importantly, we also have to mark any single interval contained in X as "no," or else when we iterate the step it will forget that it was contained inside a good interval and mark itself as "yes" as well. This is the side effect step.
Now, iterate, ignoring any results that we've already determined.
Finally any "maybe"s leftover at the highest level are automatically in.
Let me know what you think of this--this is a rough draft and some aspects might not be correct.
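The steps above can also be read recursively: an interval is kept unless the intervals nested inside it yield two or more picks, in which case those picks replace it. Below is a base R sketch under that reading. It is not the data.table code above; `pick_intervals` is a hypothetical helper, and it assumes a single level-1 row and that the intervals nested directly inside an interval sit exactly one level below it.

```r
# Second example data set from the question
dat <- data.frame(
  level = c(1, 2, 3, 2, 3, 4, 4, 3, 4, 2, 3, 4, 2, 3, 2, 2, 3, 4, 2, 3, 2, 3, 4),
  start = c(3.60353, 3.60353, 3.60353, 72.0372, 73.5053, 73.5053, 92.1903,
            121.285, 121.285, 167.197, 167.197, 167.197, 398.123, 398.123,
            445.836, 776.594, 776.594, 776.594, 906.688, 906.688, 1099.44,
            1099.44, 1100.14),
  end   = c(1112.62, 20.3533, 8.77526, 143.607, 101.132, 81.6466, 101.132,
            143.607, 128.259, 185.048, 183.446, 182.846, 418.643, 418.643,
            454.545, 798.348, 796.647, 795.913, 915.897, 915.897, 1112.62,
            1112.62, 1112.62)
)

# Hypothetical helper: returns the row numbers picked for the subtree under row i
pick_intervals <- function(i, dat) {
  # Rows one level deeper that lie inside interval i
  kids <- which(dat$level == dat$level[i] + 1 &
                dat$start >= dat$start[i] & dat$end <= dat$end[i])
  # Picks coming from the nested intervals (NULL when there are none)
  picked <- unlist(lapply(kids, pick_intervals, dat = dat))
  # Two or more nested picks beat the single enclosing interval
  if (length(picked) >= 2) picked else i
}

chosen <- pick_intervals(1, dat)  # start from the level-1 row
dat$use <- ifelse(seq_len(nrow(dat)) %in% chosen, "yes", "no")
print(dat[dat$use == "yes", ])
```

On this data set the sketch picks rows 2, 6, 7, 8, 10, 13, 15, 16, 19 and 21, the same rows marked "yes" in the output above. I have not checked it against the first data set, so the same caveats apply.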
I want to plot a lot of boxplots in one particular style to compare them.
But when a group is empty, the group isn't plotted.
Let's say I have a dataframe:
a b
1 1 5
2 1 4
3 1 6
4 1 4
5 2 9
6 2 8
7 2 9
8 3 NaN
9 3 NaN
10 3 NaN
11 4 2
12 4 8
and I use boxplot to plot it:
boxplot(b ~ a , df)
then I get the plot without group 3
(which I can't show because I don't have "10 reputation").
I found some solutions for removing empty groups via Google, but my problem is the other way around.
I also found a solution using at=c(1,2,4), but as I generate the R script with Python and different groups can be empty, I would prefer that the groups aren't dropped at all.
Also, I don't think I have the time to grapple with additional packages,
so I would be thankful for solutions without them.
You can keep the empty group on the x-axis with
boxplot(b ~ a , df, na.action=na.pass)
Or
boxplot(b~factor(a), df)
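A small sketch to check both suggestions without drawing anything, using the data frame from the question. As far as I understand it, na.pass keeps the NaN rows so group 3 survives as an empty group, while factor(a) keeps level 3 even after the NaN rows are dropped; the variable names below are mine.

```r
df <- data.frame(
  a = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4),
  b = c(5, 4, 6, 4, 9, 8, 9, NaN, NaN, NaN, 2, 8)
)

# plot = FALSE returns the boxplot statistics instead of drawing
default_call <- boxplot(b ~ a, df, plot = FALSE)
with_na_pass <- boxplot(b ~ a, df, na.action = na.pass, plot = FALSE)
with_factor  <- boxplot(b ~ factor(a), df, plot = FALSE)

print(default_call$names)  # group labels in the original call
print(with_na_pass$names)  # group labels with na.action = na.pass
print(with_factor$names)   # group labels with factor(a)
print(with_factor$n)       # number of non-missing values per group
```

Dropping plot = FALSE draws the plots, with an empty slot where group 3 would be.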
I need help with an R plot, using a data format I have not worked with before. Please help if you know.
NUMBER FREQUENCY
10 1
11 1
12 3
10 45
11 2
12 3
I need a bar plot with the numbers on the x-axis (continuous, not histogram bins) and the frequency on the y-axis, with the frequencies for each number combined,
like
10 46
11 3
12 6
It seems simple enough, but I have 10,000 rows and large numbers in the real data, so I am looking for a good solution in R that avoids doing it manually.
What about:
##tapply splits dd$FREQUENCY by dd$NUMBER and sums each group
barplot(tapply(dd$FREQUENCY, dd$NUMBER, sum))
to get one bar per NUMBER, with the summed FREQUENCY as its height.
To read in your data:
dd = read.table(textConnection("NUMBER FREQUENCY
10 1
11 1
12 3
10 45
11 2
12 3"), header=TRUE)
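A short sketch that puts the two pieces in running order and shows the summed values before plotting. The aggregate() call and the plot(type = "h") variant are my additions, assuming a truly numeric x-axis is wanted rather than one bar per category:

```r
dd <- read.table(text = "NUMBER FREQUENCY
10 1
11 1
12 3
10 45
11 2
12 3", header = TRUE)

# Sum FREQUENCY within each NUMBER (named vector, names are the numbers)
totals <- tapply(dd$FREQUENCY, dd$NUMBER, sum)
print(totals)

# Same sums as a data frame, which keeps NUMBER numeric
totals_df <- aggregate(FREQUENCY ~ NUMBER, data = dd, FUN = sum)

# Categorical bars, as in the answer
barplot(totals, xlab = "NUMBER", ylab = "FREQUENCY")

# Vertical lines on a continuous x-axis
plot(totals_df$NUMBER, totals_df$FREQUENCY, type = "h",
     xlab = "NUMBER", ylab = "FREQUENCY")
```

With barplot() the bars are evenly spaced regardless of gaps between the numbers; the type = "h" plot places each line at its numeric position.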