I have a dataframe "df" that has 120 rows and 2 columns containing numbers as shown...
V1 V2
10001 177417
227418 267719
317720 471368
I want to be able to lay these along the X-axis of a plot with a line connecting the values from V1 to V2 in each row.
One option would be to use seq(V1,V2) for each row and then concatenate the results to create a full series. However, with the amount of data involved, the object size runs to >10GB, so this is not a viable option. The Y-axis position here is not important.
Any ideas?
First create a plot object, then enter the rest of the rows using the segments function:
plot(x=c(1,1), y=df[1,], xlim = c(1,nrow(df)), ylim=range(df), type='l')
segments(x0=2:nrow(df), x1=2:nrow(df), y0=df[-1,1], y1=df[-1,2])
Here is how it looks on a random cumulative set:
df <- apply(as.data.frame(cbind(rnorm(1000),rnorm(1000))),2,cumsum)
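If the intervals are meant to lie along the X-axis, as the question describes, a minimal sketch could pass V1 and V2 straight to segments() as start and end coordinates, so no seq() expansion is needed. The sample values below are the three rows shown in the question; the fixed height y = 1 and the axis label are arbitrary assumptions.

```r
# Hypothetical sample in the shape of the question's data frame
df <- data.frame(V1 = c(10001, 227418, 317720),
                 V2 = c(177417, 267719, 471368))

# Empty canvas spanning all interval end points; the y position is arbitrary
plot(NA, xlim = range(df$V1, df$V2), ylim = c(0, 2),
     xlab = "Position", ylab = "", yaxt = "n")

# One horizontal segment per row, running from V1 to V2 at y = 1
segments(x0 = df$V1, y0 = 1, x1 = df$V2, y1 = 1, lwd = 3)
```

Only the two end points per row are stored, so memory use grows with the number of rows rather than with the length of the intervals.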
Seems like quite an easy problem to solve, but I can't seem to get my head around it in R.
I have a dataset with the following columns:
'Biomass' where each row is a value of biomass for a particular species
'Count' where each row is the number of individual animals of that species counted
I need to create a histogram of biomasses, but if I use hist(DF$Biomass) I will get a histogram of the biomasses of the animals where each value is one animal.
I need to include the count, so that I have (for example) the weight frequencies of elephant x 2, giraffe x 56 etc..
you're not making my life easy :)
Is this what you want?
DF <- data.frame(Biomass=c(200,200,1500),Count = c(36,20,2))
DF2 <- aggregate(Count ~ Biomass,DF,sum) # sum different occurrences for each Biomass value
barplot(DF2$Count,names.arg =DF2$Biomass) # presents them with a barplot, which is more appropriate than a histogram in the R sense here.
If I understood you right that is what you need :)
biomass<-c(1,5,7,6,3)
count<-c(1,2,1,3,4)
new<-NULL
for (i in 1:length(biomass))
{
new<-c(new, rep(biomass[i], count[i]))
}
new
hist(new)
So finally just type:
new<-NULL
for (i in 1:length(DF$Biomass))
{
new<-c(new, rep(DF$Biomass[i], DF$Count[i]))
}
hist(new)
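The loop above can also be written without growing a vector step by step, because rep() accepts a vector of repeat counts. A minimal sketch, reusing the sample DF from the earlier answer (the name expanded and the plot labels are only illustrative):

```r
# Sample data taken from the earlier answer
DF <- data.frame(Biomass = c(200, 200, 1500), Count = c(36, 20, 2))

# Repeat each biomass value as many times as its count
expanded <- rep(DF$Biomass, times = DF$Count)

hist(expanded, xlab = "Biomass", main = "Biomass weighted by count")
```

This gives the same vector as the loop but avoids reallocating new on every iteration.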
I have a set of transaction values that range from 0 to 15,000 USD. I've plotted a histogram specifying breaks of $250 bin values, which is helpful. What I would like to do is go back into the data frame and create my own bin values there. The bins would specify the range that the transactions fall into, such as: 0-250, 251-499, 500-749, 750...by 250 all the way up to 15,000.
I looked at the post "Generate bins from a data frame" regarding 'cut' and 'findInterval', but they don't really meet my expectations. I get unwieldy factor labels that look okay for low bin ranges, but once I get above $x,000 the labels switch to scientific notation (1.27e+04, 1.3e04).
What I'd like is:
Tran ID Amount Bin
135 $249.22 0-250
138 $1,022.01 1000-1249
155 $10,350.11 10,249-10,500
Is this possible with 'cut' or 'findInterval' or is there a better implementation?
cut is the way to go for this problem. If you do not like the output with the brackets, you can use some data manipulation to get it to look the way you'd like.
bins <- seq(0, 15000, by=250)
Amount2 <- as.numeric(gsub("\\$|,", "", df$Amount))
labels <- gsub("(?<!^)(\\d{3})$", ",\\1", bins, perl=T)
rangelabels <- paste(head(labels,-1), tail(labels,-1), sep="-")
df$Bin <- cut(Amount2, bins, rangelabels)
We first create a sequence of break-points from 0 to 15,000 by 250. Next we strip the dollar signs and commas from the Amount column, convert it to numeric, and save the result in the variable Amount2. We then format the output labels by inserting a comma before the last three digits of any break-point with more than three digits. We will use those labels in the final Bin column.
The variable rangelabels joins each pair of consecutive break-points with a hyphen. The main function is next, cut(Amount2, bins, rangelabels). The first argument, Amount2, is the numeric vector being cut. The second argument, bins, supplies the breaks for the intervals. The last argument, rangelabels, is the vector of names for the output, resulting in:
df
TranID Amount Bin
1 135 $249.22 0-250
2 138 $1,022.01 1,000-1,250
3 155 $10,350.11 10,250-10,500
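As a variation on the same idea, the labels can be built with format() instead of a regular expression. The sketch below assumes a data frame shaped like the one in the question; the names amount_num, labs and range_labels are illustrative only. It also sets include.lowest = TRUE, since cut() uses intervals that are open on the left by default and an amount of exactly 0 would otherwise become NA.

```r
# Hypothetical sample in the shape of the question's table
df <- data.frame(TranID = c(135, 138, 155),
                 Amount = c("$249.22", "$1,022.01", "$10,350.11"),
                 stringsAsFactors = FALSE)

bins <- seq(0, 15000, by = 250)

# Strip "$" and "," so the amounts can be treated as numbers
amount_num <- as.numeric(gsub("[$,]", "", df$Amount))

# format() adds the thousands separator and avoids scientific notation
labs <- format(bins, big.mark = ",", trim = TRUE, scientific = FALSE)
range_labels <- paste(head(labs, -1), tail(labs, -1), sep = "-")

# include.lowest = TRUE keeps an amount of exactly 0 in the first bin
df$Bin <- cut(amount_num, breaks = bins, labels = range_labels,
              include.lowest = TRUE)
```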
My table has 40 rows and 4 columns. Of those 4 columns, the first belongs to one group and the remaining three constitute the other group.
I am using the following commands to calculate Jaccard's index:
x <- read.csv(file name,header=T, sep= )
jac <- vegdist(x,method="jaccard")
From this output (jac), how can I find the p-value for the two groups?
And how can I plot a notched box plot of these two groups?
When I use boxes(as.matrix(jac)~x$first column,notch=TRUE)
it shows 40 box plots. Why is that?
Just picking up R and I have the following question:
Say I have the following data.frame:
v1 v2 v3
3 16 a
44 457 d
5 23 d
34 122 c
12 222 a
...and so on
I would like to create a histogram or barchart for this in R, but instead of having the x-axis be one of the numeric values, I would like a count by v3. (2 a, 1 c, 2 d...etc.)
If I do hist(dataFrame$v3), I get the error that 'x' must be numeric.
Why can't it count the instances of each different string like it can for the other columns?
What would be the simplest code for this?
OK. First of all, you should know exactly what a histogram is. It is not a plot of counts. It is a visualization for continuous variables that estimates the underlying probability density function. So do not try to use hist on categorical data. (That's why hist tells you that the value you pass must be numeric.)
If you just want counts of discrete values, that's just a basic bar plot. You can calculate counts of values in R for discrete data using table and then plot that with the basic barplot() command.
barplot(table(dataFrame$v3))
If you want to require a minimum number of observations, try
tbl<-table(dataFrame$v3)
atleast <- function(i) {function(x) x>=i}
barplot(Filter(atleast(10), tbl))
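For a self-contained version, the sketch below rebuilds the five rows shown in the question and filters the table by plain logical indexing, which is an alternative to Filter(). The threshold of 2 is an arbitrary choice so that the tiny sample still leaves something to plot.

```r
# The five rows shown in the question
dataFrame <- data.frame(v1 = c(3, 44, 5, 34, 12),
                        v2 = c(16, 457, 23, 122, 222),
                        v3 = c("a", "d", "d", "c", "a"))

# Count how often each value of v3 occurs
tbl <- table(dataFrame$v3)

# Keep only the categories seen at least twice
frequent <- tbl[tbl >= 2]

barplot(frequent, xlab = "v3", ylab = "Count")
```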
I would like to write a function with graphical output of a regression on the original data and one on the modified data. Plotting the original-data regression should be optional. Moreover, both graphs should have legends. And here is my problem:
If I choose the option orig.plot=FALSE, everything works fine.
But when I choose the other option, orig.plot=TRUE, the position of my legends is not satisfactory.
# Generation of the data set
set.seed(444)
nr.outlier<- 10
x<-seq(0,60,length=150);
y<-rnorm(150,0,10);
yy<-x+y;
d<-cbind(x,yy)
# Manipulation of data:
ss1<-sample(1:nr.outlier,1) # sample size 1
sri1<-sample(c(1:round(0.2*length(x))),ss1) # sample row index 1
sb1<-c(yy[quantile(yy,0.95)<yy])# sample base 1
d[sri1,2]<-sample(sb1,ss1,replace=T) # manipulation of part 1
ss2<-nr.outlier-ss1 # sample size 2
sri2<-sample(c(round(0.8*length(x)+1):length(x)),ss2) # sample row index 2
sb2<-c(yy[quantile(yy,0.05)>yy])# sample base 2
d[sri2,2]<-sample(sb2,ss2,replace=T) # manipulation of part 2
tlm2<-function(x,y,alpha=0.95,orig.plot=FALSE,orig.ret=FALSE){
m1<-lm(y~x)
res<-abs(m1$res)
topres<-sort(res,decreasing=TRUE)[1:round((1-alpha)*length(x))] # the round((1-alpha)*n) largest absolute residuals
topind<-rownames(as.data.frame(topres)) # indices of the top residuals
x2<-x[-as.numeric(topind)] #
y2<-y[-as.numeric(topind)] # removal of the identified observations
m2<-lm(y2~x2)
r2_m1<-summary(m1)$'r.squared'
r2_m2<-summary(m2)$'r.squared'
if(orig.plot==TRUE){
par(mfrow=c(2,1))
plot(x,y,xlim=range(x),ylim=c(min(d[,2])-30,max(d[,2]+30)),main="Model based on original data")
abline(m1$coef);legend("topleft",legend=bquote(italic(R)^2==.(r2_m1)),bty="n")
}
plot(x2,y2,xlim=range(x),ylim=c(min(d[,2])-30,max(d[,2]+30)),main="Model based on trimmed data")
abline(m2$coef);
legend("topleft",legend=bquote(atop(italic(R)^2==.(r2_m2),alpha==.(alpha))),bty="n")
return(if(orig.ret==TRUE){list(m1=m1,m2=m2)} else{m2})
}
tlm2(d[,1],d[,2])
tlm2(d[,1],d[,2],orig.plot=T)
Can anyone give me a hint?
Thank you in advance!
The problem is that atop is essentially a division (i.e. x/y) without plotting the division bar. It also centers the denominator. The solution is to use expression instead of bquote. However, to mix expressions and variables, you need to use substitute and eval:
Here's what your legend call should look like:
legend("topleft",
       legend = c(eval(substitute(expression(italic(R)^2 == my.r2), list(my.r2 = r2_m2))),
                  eval(substitute(expression(alpha == my.alpha), list(my.alpha = alpha)))),
       bty = "n")
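An alternative sketch keeps bquote() for filling in the values and wraps the resulting calls in as.expression(), so that legend() receives an expression vector with one entry per line. The data, the rounding of the R-squared value, and the names fit, r2 and labels below are assumptions for illustration and are not taken from the question's function.

```r
set.seed(1)  # hypothetical data, for illustration only
x <- seq(0, 60, length.out = 150)
y <- x + rnorm(150, 0, 10)
fit <- lm(y ~ x)

r2 <- round(summary(fit)$r.squared, 3)
alpha <- 0.95

# bquote() substitutes the values; as.expression() turns the list of
# calls into an expression vector, one legend line per element
labels <- as.expression(list(
  bquote(italic(R)^2 == .(r2)),
  bquote(alpha == .(alpha))
))

plot(x, y, main = "Legend with one entry per line")
abline(fit)
legend("topleft", legend = labels, bty = "n")
```

Because each entry is a separate legend line, the entries are left-aligned rather than centered as they are with atop.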