Find edges of intervals in dataframe column and use them for geom_rect xmin-xmax in ggplot - r

I have a data frame consisting of two columns:
positionx <- c(1:10)
pvalue <- c(0.1, 0.04, 0.03, 0.02, 0.001, 0.2, 0.5, 0.6, 0.001, 0.002)
df <- data.frame(cbind(positionx, pvalue))
df
positionx pvalue
1 1 0.100
2 2 0.040
3 3 0.030
4 4 0.020
5 5 0.001
6 6 0.200
7 7 0.500
8 8 0.600
9 9 0.001
10 10 0.002
I would like to find in which intervals of values of positionx my pvalue is below a certain threshold, let's say 0.05.
Using which I can find the indices of the rows, and from those I could go back to the values of positionx.
which(df[,2]<0.05)
[1] 2 3 4 5 9 10
However, what I would like are the edges of the intervals; by that I mean a result like 2-5 and 9-10.
I also tried to use the findInterval function, as below:
int <- c(-10, 0.05, 10)
separation <- findInterval(pvalue,int)
separation
[1] 2 1 1 1 1 2 2 2 1 1
df_sep <- data.frame(cbind(df, separation))
df_sep
positionx pvalue separation
1 1 0.100 2
2 2 0.040 1
3 3 0.030 1
4 4 0.020 1
5 5 0.001 1
6 6 0.200 2
7 7 0.500 2
8 8 0.600 2
9 9 0.001 1
10 10 0.002 1
However, I am stuck again with a column of numbers, while what I want are the edges of the intervals that contain 1 in the separation column.
Is there a way to do that?
This is a simplified example; in reality I have many plots, and for each plot one data frame of this type (just much longer, and with p-values not as easy to judge at a glance).
The reason I think I need the edges of my intervals is that I would like to colour the background of my ggplot according to the p-value. I know I can use geom_rect for this, but I think I need the edges of the intervals in order to build the coloured rectangles.
Is there a way to do this in an automated way instead of manually?

This seems like a great use case for run length encoding.
Example below:
library(ggplot2)
# Data from question
positionx <- c(1:10)
pvalue <- c(0.1, 0.04, 0.03, 0.02, 0.001, 0.2, 0.5, 0.6, 0.001, 0.002)
df <- data.frame(cbind(positionx, pvalue))
# Sort data (just to be sure)
df <- df[order(df$positionx),]
# Do run-length encoding magic
threshold <- 0.05
runs <- rle(df$pvalue < threshold)
ends <- cumsum(runs$lengths)
starts <- ends - runs$lengths + 1
df2 <- data.frame(
  xmin = df$positionx[starts],
  xmax = df$positionx[ends],
  type = runs$values
)
# Keep only the runs that satisfied the threshold criterion
df2 <- df2[df2$type == TRUE, ]
ggplot(df2, aes(xmin = xmin, xmax = xmax, ymin = 0, ymax = 1)) +
  geom_rect()
Created on 2020-05-22 by the reprex package (v0.3.0)
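To layer these rectangles behind the actual p-value series, as the question asks, a minimal follow-on sketch (reusing df and df2 from above; the fill colour and alpha are arbitrary choices):
# Draw the shading first so the data sit on top of it;
# inherit.aes = FALSE stops geom_rect from looking for x and y in df2
ggplot(df, aes(x = positionx, y = pvalue)) +
  geom_rect(data = df2,
            aes(xmin = xmin, xmax = xmax, ymin = -Inf, ymax = Inf),
            inherit.aes = FALSE, fill = "red", alpha = 0.2) +
  geom_line() +
  geom_point()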

Related

How do I create a conditional variable based on another variable in R?

I'm back to using R after using SAS for a few years, and I'm relearning everything again.
I have a dataset with variable Lot_Size, which contains continuous data from 0.1980028 - 1.2000000 acres. I'd like to categorize this variable based on these demarcations:
0 - 1/3 acre = 0
1/3 - 2/3 acre = 1
2/3 - 1 acre = 2
1+ acre = 3
Into a new variable LS_cat.
I've explored the mutate command but I keep getting errors. Anyone have any ideas?
UPDATE
Thanks for responding - both solutions worked perfectly. Since this was a learning experience for me, I'll add to the question.
I actually misunderstood the question posed to me - if I were to make dummy variables for each category previously noted, how would I do that? For example, if Lot_Size is 0 - 1/3 of an acre, I want the variable ls_1_3 to be 1; if it's not, I'd like it to be 0. Would I use the ifelse() command?
Use case_when().
library(tidyverse)
set.seed(123)
my_df <- tibble(
  lot_size = runif(n = 10, min = 0.1980028, max = 1.2)
)
my_df |> mutate(
  ls_cat = case_when(lot_size < 1 / 3 ~ 0,
                     lot_size < 2 / 3 ~ 1,
                     lot_size < 1 ~ 2,
                     TRUE ~ 3)
)
#> # A tibble: 10 x 2
#>    lot_size ls_cat
#>       <dbl>  <dbl>
#>  1    0.486      1
#>  2    0.988      2
#>  3    0.608      1
#>  4    1.08       3
#>  5    1.14       3
#>  6    0.244      0
#>  7    0.727      2
#>  8    1.09       3
#>  9    0.751      2
#> 10    0.656      1
case_when() is usually a sound solution when there are more than two options (if_else() if there are just two), but in this case there's a simpler math(s) solution:
my_df <- tibble(lot_size = seq(0, 1.2, by = 0.1))
my_df$ls_cat <- ceiling((my_df$lot_size * 3) - 0.99)
Multiplying by 3 turns each third of an acre into a unit step, and ceiling(x - 0.99) then yields 0, 1, 2, 3 (note that values within 0.01/3 below a boundary get rounded up, so this only approximately matches the cuts above). Though, this may be less instructive on R programming.
For your follow-on question, ifelse() works well, e.g.
Base:
my_df$ls_1_3 <- ifelse(my_df$lot_size < 1/3, 1, 0)
Or tidyverse:
my_df <- my_df %>%
  mutate(ls_1_3 = if_else(lot_size < 1/3, 1, 0))
NB: if_else() is a stricter version of ifelse(). Both work equally well here, but if_else() checks that both outputs have the same type, so it is better at catching possible errors.
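To build indicator columns for all four categories at once, a small sketch (assuming the mutate() result above was assigned back to my_df so that ls_cat exists; the names ls_0 to ls_3 are just illustrative):
# One 0/1 indicator column per category
for (k in 0:3) {
  my_df[[paste0("ls_", k)]] <- as.integer(my_df$ls_cat == k)
}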
We can use findInterval:
Lot_Size <- seq(0.2, 1.2, len=10)
Lot_Size
# [1] 0.2000000 0.3111111 0.4222222 0.5333333 0.6444444 0.7555556 0.8666667 0.9777778 1.0888889 1.2000000
findInterval(Lot_Size, c(0, 1/3, 2/3, 1, Inf), rightmost.closed = TRUE) - 1L
# [1] 0 0 1 1 1 2 2 2 3 3
In this case it returns the index within the break vector, which we then convert to your 0-based categories with the trailing - 1L (an integer 1).
cut it.
dat <- transform(dat, Lot_Size_cat =
  cut(Lot_Size, breaks = c(0, 1/3, 2/3, 1, Inf), labels = 0:3,
      include.lowest = TRUE))
dat
# X1 Lot_Size Lot_Size_cat
# 1 0.77436849 1.0509024 3
# 2 0.19722419 0.2819626 0
# 3 0.97801384 0.8002238 2
# 4 0.20132735 0.9272001 2
# 5 0.36124443 0.6396998 1
# 6 0.74261194 1.0990851 3
# 7 0.97872844 1.1648617 3
# 8 0.49811371 0.7221819 2
# 9 0.01331584 1.1915689 3
# 10 0.25994613 0.4076475 1
Data:
set.seed(666)
n <- 10
dat <- data.frame(X1 = runif(n),
                  Lot_Size = sample(seq(0.1980028, 1.2, 1e-7), n, replace = TRUE))
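If you need the category as a number rather than a factor, remember the usual factor-to-numeric detour (a sketch; Lot_Size_num is just an illustrative name):
# cut() returns a factor; go via character to recover the 0-3 labels
dat$Lot_Size_num <- as.integer(as.character(dat$Lot_Size_cat))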

Counting values within levels

I have a set of levels in R that I generate with cut, e.g. say fractional values between 0 and 1, broken down into 0.1 bins:
> frac <- cut(c(0, 1), breaks=10)
> levels(frac)
[1] "(-0.001,0.1]" "(0.1,0.2]" "(0.2,0.3]" "(0.3,0.4]" "(0.4,0.5]"
[6] "(0.5,0.6]" "(0.6,0.7]" "(0.7,0.8]" "(0.8,0.9]" "(0.9,1]"
Given a vector v containing continuous values between [0.0, 1.0], how do I count the frequency of elements in v that fall within each level in levels(frac)?
I could customize the number of breaks and/or the interval from which I am making levels, so I'm looking for a way to do this with standard R commands, so that I can build a two-column data frame: one column for the levels as factors, and a second column for the fraction (or percentage) of the elements of v that fall within each level.
Note: The following does not work:
> table(frac)
frac
(-0.001,0.1] (0.1,0.2] (0.2,0.3] (0.3,0.4] (0.4,0.5] (0.5,0.6]
1 0 0 0 0 0
(0.6,0.7] (0.7,0.8] (0.8,0.9] (0.9,1]
0 0 0 1
If I use cut on v directly, then I do not get the same levels when I run cut on different vectors, because the range of values (their minimum and maximum) will differ between arbitrary vectors, so while I may have the same number of breaks, the level intervals will not be the same.
My goal is to take different vectors and bin them to the same set of levels. Hopefully this helps clarify my question. Thanks for any assistance.
Amend frac to actually represent your desired intervals, and then use the table function:
x = runif(100) # For example.
frac = cut(x, breaks = seq(0, 1, 0.1))
table(frac)
Result:
frac
(0,0.1] (0.1,0.2] (0.2,0.3] (0.3,0.4] (0.4,0.5] (0.5,0.6] (0.6,0.7] (0.7,0.8]
14 9 8 10 8 12 7 7
(0.8,0.9] (0.9,1]
16 9
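To get the two-column data frame described in the question (levels plus the fraction of elements in each level), a short follow-on sketch using the same frac:
# Levels as a column plus each level's share of the total
tab <- table(frac)
data.frame(level = names(tab),
           fraction = as.numeric(tab) / length(x))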
Introduce extremes c(0, 1) to v then use the same cut:
library(dplyr)
#dummy data
set.seed(1)
v <- round(runif(7), 2)
#result
data.frame(v,
           vFrac = cut(c(0, 1, v), breaks = 10)[-c(1, 2)]) %>%
  group_by(vFrac) %>%
  mutate(vFreq = n())
# Source: local data frame [10 x 3]
# Groups: vFrac [8]
#
# v vFrac vFreq
# <dbl> <fctr> <int>
# 1 0.27 (0.2,0.3] 1
# 2 0.37 (0.3,0.4] 1
# 3 0.57 (0.5,0.6] 1
# 4 0.91 (0.9,1] 2
# 5 0.20 (0.1,0.2] 1
# 6 0.90 (0.8,0.9] 1
# 7 0.94 (0.9,1] 2
hist() with plot = FALSE also returns the counts without drawing anything:
frac = seq(0, 1, by = 0.1)
ranges = paste(head(frac, -1), frac[-1], sep = " - ")
freq = hist(v, breaks = frac, include.lowest = TRUE, plot = FALSE)
data.frame(range = ranges, frequency = freq$counts)
Use findInterval instead of cut:
v <- data.frame(v = runif(100, 0, 1))
library(plyr)
v$x <- findInterval(v$v, seq(0, 1, by = 0.1)) * 0.1
ddply(v, .(x), summarize, n = length(x))
frac = seq(0, 1, 0.1)
set.seed(42); v = rnorm(10, 0.5, 0.2)
sapply(1:(length(frac) - 1), function(i) sum(frac[i] < v & frac[i + 1] >= v))
# [1] 0 0 0 1 3 2 1 1 1 1

Categorizing Data frame with R

I have the following sample code that makes one data frame containing information for more than one ID.
I want to sort the rows into defined categories.
I want to see the percentage change at a specific, given time (e.g. here t = 10) with respect to
its baseline value, and return the value of the matching category in the output.
I have explained the detailed steps of my calculation below.
a=c(100,105,126,130,150,100,90,76,51,40)
t=c(0,5,10,20,30)
t=rep(t,2)
ID=c(1,1,1,1,1,2,2,2,2,2)
data=data.frame(ID,t,a)
My desired calculation:
1) For each ID, the "a" value at t = 0 is the baseline.
2) Computation:
e.g. at the given t = 10 (which I have to provide), take the corresponding a value:
%Change (answer) = (a value - baseline) / baseline
3) Compare the answer against the following defined categories:
#category
1 - if answer > 0.25
2 - if -0.30 < answer < 0.25
3 - if -1.0 < answer < -0.30
4 - if answer == -1.0
4) Return the value of the category.
Desired Output
ID My_Answer
1 1
2 3
Can anybody help me with this? I understand the flow of my computation, but I am not aware of an efficient way of doing it, as I have so many IDs in that data frame.
Thank you
It's easier to do math with columns than with rows. So the first step is to move baseline numbers into their own columns, then use cut to define these groups:
library(dplyr)
library(tidyr)
foo <- data %>%
  filter(t == 0) %>%
  left_join(data %>%
              filter(t != 0),
            by = "ID") %>%
  mutate(percentchange = (a.y - a.x) / a.x,
         My_Answer = cut(percentchange, breaks = c(-1, -0.3, 0.25, Inf),
                         right = FALSE, include.lowest = TRUE,
                         labels = c("g3", "g2", "g1")),
         My_Answer = as.character(My_Answer),
         My_Answer = ifelse(percentchange == -1, "g4", My_Answer))
foo
ID t.x a.x t.y a.y percentchange My_Answer
1 1 0 100 5 105 0.05 g2
2 1 0 100 10 126 0.26 g1
3 1 0 100 20 130 0.30 g1
4 1 0 100 30 150 0.50 g1
5 2 0 100 5 90 -0.10 g2
6 2 0 100 10 76 -0.24 g2
7 2 0 100 20 51 -0.49 g3
8 2 0 100 30 40 -0.60 g3
You can see that this lets us calculate My_Answer for all values at once. If you want to find the values for t == 10, you can just pull out those rows (selecting the columns of interest at the same time):
foo %>%
  filter(t.y == 10) %>%
  select(ID, t = t.y, My_Answer)
ID t My_Answer
1 1 10 g1
2 2 10 g2
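For comparison, a base-R sketch of the same baseline computation (assuming each ID has exactly one t == 0 row; the names baseline and a0 are just illustrative):
# Merge each ID's baseline a value back onto the non-baseline rows
baseline <- data[data$t == 0, c("ID", "a")]
names(baseline)[2] <- "a0"
m <- merge(data[data$t != 0, ], baseline, by = "ID")
m$percentchange <- (m$a - m$a0) / m$a0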

get custom table frequency in R

Suppose I have the following vector.
test <- c(0.3,1.0,0.8,0.3,0.6,0.4,0.3,0.5,0.6,0.4,0.5,0.6,0.1,0.6,0.2,0.7,0.0,0.7,0.3,0.3,0.4,0.9,0.9,0.9,0.3,0.6,0.3,0.1)
Is there a way to get a custom, non-uniform frequency table such as the following?
Frequency between 0 and 0.1
Frequency between 0.2 and 0.4
Frequency between 0.5 and 0.8
Frequency between 0.9 and 1
Thanks
There are a few extra unnecessary groups in here, but you can ignore those or subset them out:
table(cut(test, breaks = c(0,0.1,0.2,0.4,0.5,0.8,0.9,1)))
I'm not aware of a dedicated function, but you could write your own:
test <- c(0.3,1.0,0.8,0.3,0.6,0.4,0.3,0.5,0.6,0.4,0.5,0.6,0.1,0.6,0.2,0.7,0.0,0.7,0.3,0.3,0.4,0.9,0.9,0.9,0.3,0.6,0.3,0.1)
mapply(function(start, end) { sum(test >= start & test <= end) },
       c(0, 0.2, 0.5, 0.9),  # starts
       c(0.1, 0.4, 0.8, 1))  # ends
# [1] 3 11 10 4
The use of mapply is purely to vectorise over the starts and ends which you supply. Note test is hard-coded into this function and the endpoints are inclusive, so adjust as necessary, etc.
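To avoid hard-coding test, one could wrap this in a small helper (a sketch; the name count_between is just illustrative, and the endpoints stay inclusive):
# Generalise the mapply call over any vector and set of interval endpoints
count_between <- function(x, starts, ends) {
  mapply(function(s, e) sum(x >= s & x <= e), starts, ends)
}
count_between(test, c(0, 0.2, 0.5, 0.9), c(0.1, 0.4, 0.8, 1))
# [1] 3 11 10 4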
Something like this maybe:
labs <- c("0 and 0.1", "0.2 and 0.4", "0.5 and 0.8", "0.9 and 1")
table(cut(test, c(0, .2, .5, .9, 1.1), right = FALSE, labels = labs))
## 0 and 0.1 0.2 and 0.4 0.5 and 0.8 0.9 and 1
## 3 11 10 4
Assuming that you really want to bin these as tenths, and there are no missing intervals, findInterval is made for the task.
Here, 1.0 is in a group by itself:
table(findInterval(test, c(0,.2, .5, .9, 1)))
## 1 2 3 4 5
## 3 11 10 3 1
With this statement, 1.0 is in the last interval, with .9:
table(findInterval(test, c(0,.2, .5, .9, 1), rightmost.closed=T))
## 1 2 3 4
## 3 11 10 4

How to sum up the numbers in sequence only when the time interval between them is very small

This is a very simple thing I am trying to do: I want to add up consecutive numbers if their times are close to each other; if not, I would like to keep the number as-is. The time limit that dictates how close together they are is set manually (here it is 0.03). I then want to store these numbers for further manipulation. Running across the vector: if 1.23 and 1.24 fit the criterion, I add 1 + 2; then I compare 1.24 and 1.25, and since they satisfy the condition too, it becomes 1 + 2 + 1, and so on. Once the times are no longer close, I store the sum and move on. The output vector would then be smaller in size. This is the output I want:
output = (1 + 2 + 1 + 5, 3 + 4, 11 + 13, 25 + 1, 11, 7)
output = (9, 7, 24, 26, 11, 7)
This is what I have so far:
v1 <- c(1,2,1,5,3,4,11,13, 25, 1)
t1 <- c(1.23, 1.24, 1.25, 1.28, 2.28, 2.29, 2.90, 2.91, 3.11, 3.12)
i <- 1
j <- 2
sums <- NULL
tot <- NULL
while (j <= length(v1)) {
  if (t1[j] - t1[i] < 0.03) {
    sums[i] <- v1[i] + v1[j]
  }
  if (t1[j] - t1[i] > 0.03) {
    tot[i] <- v1[i]
  }
  i <- i + 1
  j <- j + 1
}
The following should work:
v1 <- c(1, 2, 1, 5, 3, 4, 11, 13, 25, 1)
t1 <- c(1.23, 1.24, 1.25, 1.28, 2.28, 2.29, 2.90, 2.91, 3.11, 3.12)
threshold <- 0.02
# A new group starts whenever the gap to the previous time exceeds the threshold
fac <- c(1, cumsum(diff(t1) > threshold) + 1)
# Sum v1 within each group
as.vector(tapply(v1, fac, sum))
Which gives :
# [1] 4 5 7 24 26
If you want to compute things on this output, as suggested in your comment, you should store it in a data frame. For example :
df <- data.frame(v1, t1)
df$fac <- c(1, cumsum(diff(t1) > threshold) + 1)
library(plyr)
df2 <- ddply(df, "fac", summarize, v1=sum(v1), t1=min(t1))
df2$time <- cut(df2$t1, breaks=1:4)
Which would give :
R> df2
fac v1 t1 time
1 1 4 1.23 (1,2]
2 2 5 1.28 (1,2]
3 3 7 2.28 (2,3]
4 4 24 2.90 (2,3]
5 5 26 3.11 (3,4]
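If you prefer dplyr over plyr, an equivalent sketch of the same summary (assuming df with the fac column from above; dplyr::summarize is qualified to sidestep masking by plyr if both are loaded):
library(dplyr)
df2 <- df %>%
  group_by(fac) %>%
  dplyr::summarize(v1 = sum(v1), t1 = min(t1))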
I would suggest using clustering:
#Cluster according to distance
hr <- hclust(dist(t1))
#plot a dendrogram
plot(hr)
# cut at desired distance
hc <- cutree(hr, h=0.02)
#highlight in dendrogram
rect.hclust(hr, h=0.02)
aggregate(v1, list(hc), sum)
# Group.1 x
# 1 1 3
# 2 2 1
# 3 3 5
# 4 4 7
# 5 5 24
# 6 6 26
Note that this does not give exactly the same result as the recursive approach outlined in your question, but it seems more sensible this way. You could control the clustering using different cutoff values.
#Use 0.03 for cutoff instead:
aggregate(v1, list(cutree(hr, h=0.03)), sum)
# Group.1 x
# 1 1 4
# 2 2 5
# 3 3 7
# 4 4 24
# 5 5 26