How to make a box plot in R with an interrupted Y axis

This is a sample of my data; it's a tab-delimited file with a header.
X1 X2 X3 X4
1.3 0.5 0.1 1
NA 0.3 0.4 3
NA 0.2 0.3 0.3
NA 0.1 3 0.2
NA 27 5 56
NA NA 10 0.01
I would like to get a boxplot from this data. The problem is that I want to interrupt the plot at 10 and 50 on the Y axis: I want a bigger plot size before 10 and a smaller plot size after that. I don't know how to plot with 2 gaps in the Y axis. I tried axis.break and gap.boxplot, but my R programming skills are very limited, so I was unable to use either of these methods properly. I'd be grateful for any hints on how to accomplish this.

I'm not really clear on what you want, and what you mean by "bigger plot size before 10 and a smaller plot size after that". Do you mean different scales? That is a bad idea, I think, and I don't believe it would be straightforward.
Here's how to break the axis twice (I'm guessing on the regions to exclude):
library(plotrix)
library(reshape2)
a <- read.table(textConnection("X1 X2 X3 X4
1.3 0.5 0.1 1
NA 0.3 0.4 3
NA 0.2 0.3 0.3
NA 0.1 3 0.2
NA 27 5 56
NA NA 10 0.01"),sep=" ",header=T)
am <- melt(a) # from reshape2 - gathers every value into one column, with the variable names in a second, categorical column
gap.boxplot(am$value ~ am$variable, # plot the values against the variable
    gap=list(top=c(30,50),bottom=c(10,24)), # regions of the Y axis to exclude
    axis.labels=TRUE) # should label the whole Y axis, though it doesn't seem to work well
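Since your real data is a tab-delimited file with a header, you would read it directly rather than via textConnection; a minimal sketch, with a hypothetical filename:
a <- read.table("mydata.txt", sep="\t", header=TRUE) # "mydata.txt" is a placeholder for your file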

Related

How to reference multiple dataframe columns to calculate a new column of weighted averages in R

I am currently calculating the weighted-average column for my dataframe by referencing each column name manually. Is there a way to shorten the code by multiplying sets of columns,
e.g.:
df[,c("A","B","C")] and df[,c("PerA","PerB","PerC")], to obtain the weighted average, like SUMPRODUCT in Excel? This matters especially when many input columns go into the weighted-average column.
df$WtAvg = df$A*df$PerA + df$B*df$PerB + df$C*df$PerC
Without transforming your dataframe, and assuming that the first half of the columns holds the values and the second half the weights, you can use the weighted.mean function inside apply:
df$WtAvg = apply(df, 1, function(x){weighted.mean(x[1:(ncol(df)/2)],
                                    x[(ncol(df)/2+1):ncol(df)])})
And you get the following output:
> df
A B C PerA PerB PerC WtAvg
1 1 2 3 0.1 0.2 0.7 2.6
2 4 5 6 0.5 0.3 0.2 4.7
3 7 8 9 0.6 0.1 0.3 7.7
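A vectorized sketch under the same assumption (first half of the columns are values, second half are weights), mirroring Excel's SUMPRODUCT; it avoids the row-wise apply, which can be slow on large dataframes:
half <- ncol(df)/2 # assumes df holds only the value and weight columns
vals <- df[, 1:half]
wts <- df[, (half+1):ncol(df)]
df$WtAvg <- rowSums(vals*wts)/rowSums(wts) # row-wise weighted mean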

How to match two columns with nearest time points?

I have the following dataframe. It is a time series in which each observation has values for days 1-4. An additional column shows the time (in hours) at which the test was made.
dt
Name values Days Test
a 0.2 1 20
a 0.3 2 20
a 0.6 3 20
a 0.2 4 20
b 0.3 1 44
b 0.4 2 44
b 0.8 3 44
b 0.7 4 44
c 0.2 1 24
c 0.7 2 24
I have to make a time-series plot in which each line represents a subject.
First I plotted values against days, with subjects as colors. This gave me a line for each subject, and I am happy with it.
However, I also have to incorporate when the test was taken into the line plot. I could show it separately at the top or bottom of the plot, but not exactly on the line.
Could someone please help me? Thanks in advance!
Use the directlabels package to add the times:
library(ggplot2)
library(directlabels)
ggplot(DF, aes(Days, values, color = Name)) +
  geom_line() +
  geom_dl(aes(label = Test), method = "last.points")
Note
The input DF in reproducible form is:
Lines <- "
Name values Days Test
a 0.2 1 20
a 0.3 2 20
a 0.6 3 20
a 0.2 4 20
b 0.3 1 44
b 0.4 2 44
b 0.8 3 44
b 0.7 4 44
c 0.2 1 24
c 0.7 2 24"
DF <- read.table(text = Lines, header = TRUE)
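If directlabels is not available, a plain-ggplot2 sketch of the same idea, labelling each line at its last observation with geom_text (the hjust offset is just a guess to nudge the labels off the points):
library(ggplot2)
library(dplyr)
last_pts <- DF %>% group_by(Name) %>% slice_max(Days, n = 1) # last point of each subject's line
ggplot(DF, aes(Days, values, color = Name)) +
  geom_line() +
  geom_text(data = last_pts, aes(label = Test), hjust = -0.3, show.legend = FALSE)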

R - conditional cumsum using multiple columns

I'm new to Stack Overflow, so I hope I'm posting my question in the right format. I have a test dataset with three columns, where rank is the ranking of a cell, Esvalue is the value of a cell, and zoneID is an area identifier (note: in the real dataset I have up to 40,000 zoneIDs).
rank<-seq(0.1,1,0.1)
Esvalue<-seq(10,1)
zoneID<-rep(seq.int(1,2),times=5)
rank Esvalue zoneID
0.1 10 1
0.2 9 2
0.3 8 1
0.4 7 2
0.5 6 1
0.6 5 2
0.7 4 1
0.8 3 2
0.9 2 1
1.0 1 2
I want to calculate the following:
% ES value <- For each rank, including all lower ranks, the cumulative % share of the total ES value relative to the ES value of all zones
cumsum(df$Esvalue)/sum(df$Esvalue)
% ES value zone <- For each rank, including all lower ranks, the cumulative % share of each zoneID's Esvalue relative to that zone's total Esvalue. I tried this using mutate from dplyr, but so far it only gives me the cumulative sum, not the share. In the end this will generate a variable for each zoneID:
df %>%
  mutate(cA=cumsum(ifelse(!is.na(zoneID) & zoneID==1,Esvalue,0))) %>%
  mutate(cB=cumsum(ifelse(!is.na(zoneID) & zoneID==2,Esvalue,0)))
I want to combine these two variables by:
1) calculating the absolute difference between the zone share and the overall share for every zoneID;
2) for each rank, taking the mean of those absolute differences over all zoneIDs.
In the end the final output should look like:
rank Esvalue zoneID mean_abs_diff
0.1 10 1 0.16666667
0.2 9 2 0.01333333
0.3 8 1 0.12000000
0.4 7 2 0.02000000
0.5 6 1 0.08000000
0.6 5 2 0.02000000
0.7 4 1 0.04666667
0.8 3 2 0.01333333
0.9 2 1 0.02000000
1.0 1 2 0.00000000
Now I created the last column using some intermediate steps in Excel, but my final dataset will be far too big for Excel to handle. Any advice on how to proceed would be appreciated.
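A base-R sketch of one way to get there, assuming (judging from your expected output) that mean_abs_diff is, at each rank, the mean over zones of the absolute difference between the zone's cumulative share and the overall cumulative share:
df <- data.frame(rank=seq(0.1,1,0.1), Esvalue=10:1, zoneID=rep(1:2,times=5))
# overall cumulative share of the total ES value
share_total <- cumsum(df$Esvalue)/sum(df$Esvalue)
# per-zone cumulative share: each zone's running sum divided by that zone's total
share_zone <- sapply(unique(df$zoneID), function(z)
  cumsum(ifelse(df$zoneID==z, df$Esvalue, 0))/sum(df$Esvalue[df$zoneID==z]))
# mean over zones of the absolute difference from the overall share
df$mean_abs_diff <- rowMeans(abs(share_zone - share_total))
This reproduces the mean_abs_diff column shown above for the sample data. With 40,000 zoneIDs, though, building one column per zone gets expensive; a grouped approach (dplyr or data.table) with the same arithmetic would scale better.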

R: Improvement of loop to create distance matrix from data frame

I am creating a distance matrix using the data from a data frame in R.
My data frame has the temperature of 2244 locations:
plot temperature
A 12
B 12.5
C 15
... ...
I would like to create a matrix that shows the temperature difference between each pair of locations:
. A B C
A 0 0.5 3
B 0.5 0 0.5
C 3 2.5 0
This is what I have come up with in R:
temp_data # my data frame with two columns: location and temperature
temp_dist <- matrix(data=NA, nrow=nrow(temp_data), ncol=nrow(temp_data))
temp_dist <- as.data.frame(temp_dist)
names(temp_dist) <- as.factor(temp_data[,1]) # the locations are numbers in my data
rownames(temp_dist) <- as.factor(temp_data[,1])
for (i in 1:2244) {
  for (j in 1:2244) {
    temp_dist[i,j] <- abs(temp_data[i,2] - temp_data[j,2])
  }
}
I have tried the code with a small sample, using for (i in 1:10), and it works fine.
My problem is that the computer has now been running for two full days and hasn't finished.
Is there a way of doing this quicker? I am aware that nested loops are slow, and since I am filling a matrix of more than 5 million cells it makes sense that it takes so long, but I am hoping there is a function that gets the same result more quickly, as I have to do the same for precipitation and other variables.
I have also read about dist, but I am unsure whether I can use it with the data frame I have.
Any help would be much appreciated. Many thanks.
Are you perhaps just looking for the following?
out <- dist(temp_data$temperature, upper=TRUE, diag=TRUE)
out
# 1 2 3
# 1 0.0 0.5 3.0
# 2 0.5 0.0 2.5
# 3 3.0 2.5 0.0
If you want different row/column names, it seems you have to convert this to a matrix first:
out_mat <- as.matrix(out)
dimnames(out_mat) <- list(temp_data$plot, temp_data$plot)
out_mat
# A B C
# A 0.0 0.5 3.0
# B 0.5 0.0 2.5
# C 3.0 2.5 0.0
Or just as an alternative from the toolbox:
m <- with(temp_data, abs(outer(temperature, temperature, "-")))
dimnames(m) <- list(temp_data$plot, temp_data$plot)
m
#     A   B   C
# A 0.0 0.5 3.0
# B 0.5 0.0 2.5
# C 3.0 2.5 0.0

Exclude smaller values than a threshold in R

I have data in a tab-delimited text file like this:
FID HV HH VOLUME
1 -2.1 -0.1 0
2 -4.3 -0.2 200
3 -1.4 1.2 20
4 -1.2 0.6 30
5 -3.7 0.8 10
These tables mostly have more than 6000 rows and many more columns.
I need to extract the rows where the value of the column VOLUME is smaller than a threshold, e.g. 20.
I tried to do it with the following command:
x <- -which(names(x)["VOLUME"] > 20)
but it did not work.
Is there any method to do it? Any help is appreciated.
Say your data frame is called sample:
subset(sample, VOLUME<20)
Assuming x is your data, try this:
x <- x[which(x$VOLUME <= 20),]
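The which() is not strictly needed; equivalent sketches in base R and dplyr, using the strict threshold from the question:
x_small <- x[x$VOLUME < 20, ] # base R; note rows with NA in VOLUME come back as NA rows
library(dplyr)
x_small <- filter(x, VOLUME < 20) # dplyr; like subset(), this silently drops NA rows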
