Grouping a Data Column in R

I have a data frame of 48 samples of zinc concentration readings. I am trying to group the data as Low, Normal and High (0-30 low, 31-80 normal, above 80 high) and then plot a scatter plot with a different pch for each group.
Here are the first 5 entries:
sample concentration
1 1 71.1
2 2 131.7
3 3 13.9
4 4 31.7
5 5 6.4
THANKS

In general, please try to include sample data by using dput(head(data)) and pasting the output into your question.
What you want is called binning (grouping is a very different operation in R terms).
The standard function here is cut():
numericVector <- runif(100, min = 1, max = 100)  # sample vector
cut(numericVector, breaks = c(0, 30, 80, Inf), right = TRUE, include.lowest = FALSE, labels = c("0-30", "31-80", "above 80"))
Please check the function documentation to adapt the parameters to your specific case.
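To go from the binned factor to the scatter plot you describe, here is a minimal sketch using the five rows from your excerpt (the data frame name df is an assumption):

```r
# The five rows from the question
df <- data.frame(sample = 1:5,
                 concentration = c(71.1, 131.7, 13.9, 31.7, 6.4))

# Bin the concentrations: (0,30] low, (30,80] normal, (80,Inf) high
df$group <- cut(df$concentration,
                breaks = c(0, 30, 80, Inf),
                labels = c("Low", "Normal", "High"))

# Scatter plot with a different pch per group
plot(df$sample, df$concentration,
     pch = as.integer(df$group),
     xlab = "Sample", ylab = "Zinc concentration")
legend("topright", legend = levels(df$group), pch = 1:3)
```

as.integer() on the factor maps each group to 1, 2 or 3, which doubles as the pch code.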

Related

How to sort vector into bins in R?

I have a vector that consists of numbers that can take on any value between 1 and 100.
I want to sort that vector into bins of a certain size.
My logic:
1.) Divide the range (in this case, 1:100) into the number of bins you want (let's say 10 for this example)
Result: (1,10.9], (10.9,20.8], (20.8,30.7], (30.7,40.6], (40.6,50.5], (50.5,60.4], (60.4,70.3], (70.3,80.2], (80.2,90.1], (90.1,100]
2.) Then sort my vector
I found a handy function that almost does all this in one fell swoop: cut(). Here is my code:
> table(cut(vector, breaks = 10))
(0.959,10.9] (10.9,20.8] (20.8,30.7] (30.7,40.5] (40.5,50.4] (50.4,60.3] (60.3,70.1] (70.1,80] (80,89.9] (89.9,99.8]
175 171 117 103 82 67 54 46 39 31
Unfortunately, the intervals are different from the bins we calculated from the possible range (1:100). So I tried fixing this by adding that range into the vector:
> table(cut(c(1,100,vector), breaks = 10))
(0.901,10.9] (10.9,20.8] (20.8,30.7] (30.7,40.6] (40.6,50.5] (50.5,60.4] (60.4,70.3] (70.3,80.2] (80.2,90.1] (90.1,100]
176 171 117 104 82 66 54 48 38 31
This almost worked perfectly, except that the left-most interval starts from 0.901 for some reason.
My questions:
1.) Is there a way to do this (using cut or another function/package) without having to insert artificial data points to get the specified bin ranges?
2.) If not, why does the lower bin start from 0.901 and not 1?
Based on your response to @Allan Cameron, I understand that you want to divide your vector into 10 bins of the same size. But when you define only this number of breaks in the cut() function, the sizes of the intervals calculated by the function differ across the groups. As @akrun said, this happens because of the way the function computes the breaks when it is given just a number of breaks.
I do not know if there is a way to avoid this within the function. But I think it will be easier if you define the bins yourself, as @Gregor Thomas suggested. Here is an example of how I would approach it:
vec <- sample(1:100, size = 500, replace = TRUE)
# Here I suppose that you want to divide the data into
# intervals of the same length
breaks <- seq(min(vec), max(vec), length.out = 11)  # n bins need n + 1 breaks
cut(vec, breaks = breaks, include.lowest = TRUE)
Another option would be the cut_interval() function from the ggplot2 package, which cuts the vector into n groups of the same length.
library(ggplot2)
cut_interval(vec, n = 10)
why does the lower bin start from 0.901 and not 1?
The answer is the first bit of the Details section of the ?cut help page:
When breaks is specified as a single number, the range of the data is divided into breaks pieces of equal length, and then the outer limits are moved away by 0.1% of the range to ensure that the extreme values both fall within the break intervals.
That 0.1% adjustment is the reason your lower bound is 0.901. The upper limit is moved as well (to 100.099), but the default labels round to 3 significant digits, so it is displayed as plain 100.
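The arithmetic behind that number can be checked directly (assuming the padded vector spans exactly 1 to 100):

```r
rx <- range(c(1, 100))  # data range: 1 to 100
dx <- diff(rx) / 1000   # 0.1% of the range: 99 / 1000 = 0.099
rx[1] - dx              # lower outer limit: 0.901
rx[2] + dx              # upper outer limit: 100.099, which the default
                        # 3-significant-digit labels display as "100"
```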
If you'd like to use other breaks, you can specify exact breaks however you want. Perhaps this:
my_breaks = seq(1, 100, length.out = 11) ## for n bins, you need n+1 breaks
my_breaks
# [1] 1.0 10.9 20.8 30.7 40.6 50.5 60.4 70.3 80.2 90.1 100.0
cut(vector, breaks = my_breaks, include.lowest = TRUE)
But I actually think Allan's suggestion of 0:10 * 10 might be what you really want. I wouldn't dismiss it too quickly:
table(cut(1:100, breaks = 0:10*10))
# (0,10] (10,20] (20,30] (30,40] (40,50] (50,60] (60,70] (70,80] (80,90] (90,100]
# 10 10 10 10 10 10 10 10 10 10

I want bins of 5 mm every time, in an R script that analyzes different normal distributions

Right off the bat, I'm a newbie to R and maybe I'm misunderstanding the concept of what my code does vs what I want it to do. Here is the code I've written so far.
his <- hist(x, breaks = seq(floor(min(x)), ceiling(max(x)) + ifelse(ceiling(max(x)) %% 5 != 0, 5, 0), 5))
Here is some sample data:
Autonr X
1 -12
2 -6
3 -17
4 8
5 -11
6 -10
7 10
8 -22
I'm not able to upload one of the histograms that did work, but it should show bins of 5, no matter how large the spread of the data. The amount of bins should therefore be flexible.
The idea of the code above is to make sure that the outer ranges of my data always fall within neatly defined 5 mm bins. Maybe I'm overlooking something, but I can't seem to understand why this does not always work. In some cases it does, but with other datasets it doesn't.
I get: some 'x' not counted; maybe 'breaks' do not span range of 'x'.
Any help would be greatly appreciated, as I don't want to have to tinker with my breaks and bins every time I get a new dataset to run through this.
Rather than passing a vector of breaks, you can supply a single value: in this case, the calculation of how many bins are needed given the range of the data and a binwidth of 5.
# Generate uniformly distributed dummy data between 0 and 53
set.seed(5)
x <- runif(1000, 0, 53)
# Plot histogram with binwidths of 5.
hist(x, breaks = ceiling(diff(range(x)) / 5 ))
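One caveat worth knowing (from ?hist): a single breaks value is treated only as a suggestion, since the actual breakpoints are picked by pretty(). If the bins must be exactly 5 wide for every dataset, a sketch with an explicit break vector:

```r
set.seed(5)
x <- runif(1000, 0, 53)

# Extend the range outward to the nearest multiples of 5,
# then step by exactly 5
breaks <- seq(floor(min(x) / 5) * 5,
              ceiling(max(x) / 5) * 5,
              by = 5)
hist(x, breaks = breaks)
```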
For the sake of completeness, here is another approach, which uses breaks_width() from the scales package. scales is part of Hadley Wickham's ggplot2 ecosystem.
# create sample data
set.seed(5)
x <- runif(1000, 17, 53)
# plot histogram
hist(x, breaks = scales::breaks_width(5)(range(x)))
Explanation
scales::breaks_width(5) creates a function which is then called with the vector of minimum and maximum values of x as returned by range(x). So,
scales::breaks_width(5)(range(x))
returns a vector of breaks with the required bin size of 5:
[1] 15 20 25 30 35 40 45 50 55
The nice thing is that we have full control of the bins. E.g., we can ask for weird things like a bin size of 3 which is offset by 2:
scales::breaks_width(3, 2)(range(x))
[1] 17 20 23 26 29 32 35 38 41 44 47 50 53 56

How to interpolate a single point where line crosses a baseline between two points [duplicate]

This question already has answers here:
get x-value given y-value: general root finding for linear / non-linear interpolation function
(2 answers)
Closed 3 years ago.
I am new to R, but I am trying to figure out an automated way to determine the x-coordinate at which a given line between two points crosses the baseline (75 in this case; see the dotted line in the image linked below). Once that x value is found, I would like to add it to the vector of all x values, and add the corresponding y value (which would always be the baseline value) to the vector of y values. Basically, I want a function to look between all points of the input coordinates, check whether any straight line between two points crosses the baseline, and if so, add the coordinates of that crossing to the output x and y vectors. Any help would be most appreciated, especially in terms of automating this across all x,y coordinates.
https://i.stack.imgur.com/UPehz.jpg
baseline = 75
X <- c(1,2,3,4,5,6,7,8,9,10)
y <- c(75,53,37,25,95,35,50,75,75,75)
Edit: added creation of combined data frame with original data + crossing points.
Adapted from another answer related to two intersecting series with uniform X spacing.
baseline = 75
X <- c(1,2,3,4,5,6,7,8,9,10)
Y1 <- rep(baseline, 10)
Y2 <- c(75,53,37,25,95,35,50,75,75,75)
# Find points where Y1 is above Y2.
above <- Y1 > Y2
# The series intersect wherever `above` flips from TRUE to FALSE or back
intersect.points <- which(diff(above) != 0)
# Find the slopes of each line segment.
Y2.slopes <- (Y2[intersect.points + 1] - Y2[intersect.points]) /
  (X[intersect.points + 1] - X[intersect.points])
Y1.slopes <- rep(0, length(Y2.slopes))
# Find the intersection for each segment (X is 1:n, so indices double as X values)
X.points <- intersect.points + ((Y2[intersect.points] - Y1[intersect.points]) / (Y1.slopes - Y2.slopes))
Y.points <- Y1[intersect.points] + (Y1.slopes * (X.points - intersect.points))
# Plot.
plot(Y1,type='l')
lines(Y2,type='l',col='red')
points(X.points,Y.points,col='blue')
library(dplyr)
combined <- bind_rows( # combine rows from...
tibble(X, Y2), # table of original, plus
tibble(X = X.points,
Y2 = Y.points)) %>% # table of interpolations
distinct() %>% # and drop any repeated rows
arrange(X) # and sort by X
> combined
# A tibble: 12 x 2
X Y2
<dbl> <dbl>
1 1 75
2 2 53
3 3 37
4 4 25
5 4.71 75
6 5 95
7 5.33 75
8 6 35
9 7 50
10 8 75
11 9 75
12 10 75
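For completeness, the same crossing points can be found with base R's approxfun() and uniroot(), the general root-finding approach from the linked duplicate; a sketch on the data above:

```r
baseline <- 75
X  <- 1:10
Y2 <- c(75, 53, 37, 25, 95, 35, 50, 75, 75, 75)

# Interpolate the series, shifted so baseline crossings become roots
f <- approxfun(X, Y2 - baseline)

# Intervals whose endpoints lie on opposite sides of the baseline
idx <- which(diff(Y2 > baseline) != 0)

# One root per straddling interval
crossings <- sapply(idx, function(i)
  uniroot(f, lower = X[i], upper = X[i + 1])$root)
crossings  # approximately 4.714 and 5.333, matching the tibble above
```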

R radarchart: free axis to enhance records display?

I am trying to display my data using radarchart {fmsb}. The values of my records are highly variable. Therefore, low values are not visible on final plot.
Is there a way to "free" the axis for each record, to visualize the data independently of their scales?
Dummy example:
df<-data.frame(n = c(100, 0,0.3,60,0.3),
j = c(100,0, 0.001, 70,7),
v = c(100,0, 0.001, 79, 3),
z = c(100,0, 0.001, 80, 99))
n j v z
1 100.0 100.0 100.000 100.000 # max
2 0.0 0.0 0.000 0.000 # min
3 0.3 0.001 0.001 0.001 # small values -> no visible on final chart!!
4 60.0 0.001 79.000 80.000
5 0.3 0.0 3.000 99.000
Create radarchart
require(fmsb)
radarchart(df, axistype=0, pty=32, axislabcol="grey",# na.itp=FALSE,
seg = 5, centerzero = T)
Result: the row with small values (row #3) is not visible!
How can I make all records (rows) visible, i.e. how do I "free" the axis for each record? Thank you a lot.
If you want to be sure to see all 4 dimensions whatever the differences, you'll need a logarithmic scale.
Since, by design, a radar chart cannot show negative values, our choice of base is restricted by the range of the values and by the number of segments (axis ticks).
If we want an integer base the minimum we can choose is:
seg0 <- 5 # your initial choice, could be changed
base <- ceiling(
max(apply(df[-c(1,2),],MARGIN = 1,max) / apply(df[-c(1,2),],MARGIN = 1,min))
^(1/(seg0-1))
)
Here we have a base 5.
Let's normalize and transform our data.
First we normalize the data by setting the maximum of every series to 1; then we apply our logarithmic transformation, which sets the maximum of each series to seg0 (n for the black series, z for the others) and puts the minimum among all series between 1 and 2 (here the v value of the black series).
df_normalized <- as.data.frame(df[-c(1,2),]/apply(df[-c(1,2),],MARGIN = 1,max))
df_transformed <- rbind(rep(seg0,4),rep(0,4),log(df_normalized,base) + seg0)
radarchart(df_transformed, axistype=0, pty=32, axislabcol="grey",# na.itp=FALSE,
seg = seg0, centerzero = T,maxmin=T)
If we look at the green series we see:
j and v have same order of magnitude
n is about 5^2 = 25 times smaller than j (5 is the value of the base, ^2 because of 2 segments)
v is about 5^2 = 25 times (again) smaller than z
If we look at the black series we see that n is about 5^3.5 times bigger than the other dimensions.
If we look at the red series we see that the order of magnitude is the same among all dimensions.
Maybe a workaround for your problem: if you transform your data before running radarchart (e.g. logarithm, square root, ...), then you can also visualize small values.
Here an example using a cubic root transformation:
library(specmine)
df.c<-data.frame(cubic_root_transform(df)) # transform dataset
radarchart(df.c, axistype=0, pty=32, axislabcol="grey",# na.itp=FALSE,
seg = 5, centerzero = T)
and the result will look like this:
EDIT:
If you want to zoom the small values even more you can do that with a higher order of the root.
e.g.
t<-5 # for fifth order root
df.t <- data.frame(apply(df, 2, function(x) x^(1/t))) # transform dataset
radarchart(df.t, axistype=0, pty=32, axislabcol="grey",# na.itp=FALSE,
seg = 5, centerzero = T)
You can adjust the "zoom" as you want by changing the value of t
So you should find a visualization that is suitable for you.
Here is an example using 10-th root transformation:
library(specmine)
df.c<-data.frame((df)^(1/10)) # transform dataset
radarchart(df.c, axistype=0, pty=32, axislabcol="grey",# na.itp=FALSE,
seg = 5, centerzero = T)
and the result will look like this:
You can try the n-th root to find the one that is best for you. As n grows, the root of a number near zero grows faster.

Why does R add a column to my graph?

I'm trying to learn R by graphing 3 series from some data collected in a physics lab. I have 3 csv files:
r.csv:
"Voltage","Current"
2,133
4,266
6,399
8,532
10,666
And there are two more. I load them like this:
r <- read.csv("~/Desktop/r.csv")
rs <- read.csv("~/Desktop/rs.csv")
rp <- read.csv("~/Desktop/rp.csv")
Then, I combine them into a matrix:
data <- cbind(r, rs, rp)
When I display data, I get this extra column on the left, which I thought was just there when the matrix was displayed:
data
Voltage Current Voltage Current Voltage Current
1 2 133 2 270 2 67.4
2 4 266 4 535 4 134.3
3 6 399 6 803 6 200.0
4 8 532 8 1070 8 267.0
5 10 666 10 1338 10 334.0
But the graph shows an extra series:
matplot(data, type = c("b"), pch=5, col = 1:5, xlab="Voltage (V)", ylab="Current (A)")
Also, why is the x axis all wrong? The data points are at x = 2, 4, 6, 8, and 10, but the graph shows them at 1, 2, 3, 4, and 5. What am I missing here?
help(matplot) for starters:
matplot(x, y, type = "p", lty = 1:5, lwd = 1, lend = par("lend"),
pch = NULL,....
says:
x,y: vectors or matrices of data for plotting. The number of rows
should match. If one of them are missing, the other is taken
as ‘y’ and an ‘x’ vector of ‘1:n’ is used. Missing values
(‘NA’s) are allowed.
so you've only given one of x and y; matplot has therefore treated your whole matrix as y and used a vector of 1:5 on the X axis. matplot isn't psychic enough to realise your matrix is a mix of X and Y numbers.
The solution is something like:
matplot(data[,1], data[,-c(1,3,5)],....etc...)
which gets the X coordinates from the first voltage column, and then creates the Y matrix by dropping all the voltage columns using negative indexing, leaving just the current columns. data[,c(2,4,6)] would probably work just as well.
Note this only works because your three series have all the same X-values. If this isn't the case and you want to plot multiple lines then I reckon you should look into the ggplot2 package...
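Putting it all together, a self-contained sketch (the rs and rp values are read off the printed matrix in the question):

```r
# Reconstruct the three series shown in the question
r  <- data.frame(Voltage = c(2, 4, 6, 8, 10),
                 Current = c(133, 266, 399, 532, 666))
rs <- data.frame(Voltage = c(2, 4, 6, 8, 10),
                 Current = c(270, 535, 803, 1070, 1338))
rp <- data.frame(Voltage = c(2, 4, 6, 8, 10),
                 Current = c(67.4, 134.3, 200.0, 267.0, 334.0))

data <- cbind(r, rs, rp)

# x = the shared voltage column, y = only the current columns
matplot(data[, 1], data[, c(2, 4, 6)],
        type = "b", pch = 5, col = 1:3,
        xlab = "Voltage (V)", ylab = "Current (A)")
```

Three colours (col = 1:3) now match the three current series, instead of the five needed earlier to cover the spurious voltage columns.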
