R: grouping numbers into bins

I am looking to find the smallest number in a column of a data frame that is larger than a number in another array.
Example
DistrDF
Bin Freq CumSum
0.1 0.05 0.05
0.2 0.07 0.12
0.3 0.20 0.32
0.4 0.10 0.42
0.5 0.00 0.42
0.6 0.15 0.57
0.7 0.00 0.57
0.8 0.30 0.87
0.9 0.11 0.98
1.0 0.02 1.0
Then I have an array of, say, 10 random numbers between 0 and 1 (i.e. each random number will fall into one of the bins in the DistrDF)
RandNums
0.13
0.50
0.11
0.10
0.70
0.05
0.12
0.80
0.88
0.40
I would like to use these two tables to create a third table, which indicates the bin into which each of the random numbers falls, as below:
ResultDF
0.30 (because 0.13 < 0.32 and 0.13 > 0.12)
0.60 (because 0.50 < 0.57 and 0.50 > 0.42)
...
0.40 (because 0.40 < 0.42 and 0.40 > 0.32)
Does anyone have any ideas? I feel like an aggregate or something might be in order, but I'm not sure.

The cut function does what you want:
DistrDF <- DistrDF[DistrDF$Freq > 0,] # Remove empty bins
DistrDF$Bin[cut(x$RandNums, c(0, DistrDF$CumSum))] # x: the data frame holding the RandNums column
# [1] 0.3 0.6 0.2 0.2 0.8 0.1 0.2 0.8 0.9 0.4
You can manipulate the include.lowest and right parameters to change how you handle points that fall on the border of bins.
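To see what the right parameter does to a value that sits exactly on a border, here is a minimal sketch using the non-empty bins from the question. The names bins, cum and breaks are chosen here for illustration, and labels = FALSE is used so that cut returns plain integer positions.

```r
# Non-empty bins and their cumulative sums, copied from the question
bins   <- c(0.1, 0.2, 0.3, 0.4, 0.6, 0.8, 0.9, 1.0)
cum    <- c(0.05, 0.12, 0.32, 0.42, 0.57, 0.87, 0.98, 1.00)
breaks <- c(0, cum)

x <- c(0.12, 0.13)  # 0.12 lies exactly on a bin border

# right = TRUE (the default): intervals are (a, b], so 0.12 stays in the bin ending at 0.12
closed_right <- bins[cut(x, breaks, labels = FALSE, right = TRUE)]

# right = FALSE: intervals are [a, b), so 0.12 moves up into the next bin
closed_left <- bins[cut(x, breaks, labels = FALSE, right = FALSE)]

print(closed_right)
print(closed_left)
```

With the default, the border value is assigned to the lower bin; with right = FALSE it goes to the upper one, while 0.13 lands in the same bin either way.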

Related

Grouped boxplot in R - simplest way

I have been struggling with creating a very simple grouped boxplot. My data looks as follows
> data
Wörter Sätze Text
P.01 0.15 0.24 0.34
P.02 0.10 0.15 0.08
P.03 0.05 0.18 0.16
P.04 0.55 0.60 0.44
P.05 0.00 0.06 0.26
P.06 0.20 0.65 0.68
P.07 0.15 0.31 0.47
P.08 0.35 0.87 0.69
P.09 0.35 0.75 0.76
N.01 0.40 0.78 0.59
N.02 0.55 0.95 0.76
N.03 0.65 0.96 0.83
N.04 0.60 0.90 0.77
N.05 0.50 0.95 0.82
If I simply execute boxplot(data) I obtain almost what I want. One plot with three boxes, each for one of the variables in my data.
[Image: boxplot of the three variables, captioned "Boxplot, almost"]
What I want is to separate these into two boxes per variable (one for the P-indexed, one for the N-indexed observations) for a total of six boxes.
I began by introducing a new variable
data$Gruppe <- c(rep("P",9), rep("N",5))
> data
Wörter Sätze Text Gruppe
P.01 0.15 0.24 0.34 P
P.02 0.10 0.15 0.08 P
P.03 0.05 0.18 0.16 P
P.04 0.55 0.60 0.44 P
P.05 0.00 0.06 0.26 P
P.06 0.20 0.65 0.68 P
P.07 0.15 0.31 0.47 P
P.08 0.35 0.87 0.69 P
P.09 0.35 0.75 0.76 P
N.01 0.40 0.78 0.59 N
N.02 0.55 0.95 0.76 N
N.03 0.65 0.96 0.83 N
N.04 0.60 0.90 0.77 N
N.05 0.50 0.95 0.82 N
Now that the data contains a non-numerical variable I cannot simply execute the boxplot() function as before. What would be a minimal alteration to make here to obtain the six plots that I want? (colour coding for the two groups would be nice)
I have encountered some solutions for grouped boxplots; however, the data that others start from tends to be organised differently from my (very simple) data.
Many thanks!
As @teunbrand already mentioned in the comments, you could use pivot_longer to reshape your data into long format, keeping Gruppe as an identifier column. Mapping Gruppe to the fill aesthetic then gives two boxplots per variable, six in total, like this:
library(tidyr)
library(dplyr)
library(ggplot2)
data$Gruppe <- c(rep("P",9), rep("N",5))
data %>%
  pivot_longer(cols = -Gruppe) %>%
  ggplot(aes(x = name, y = value, fill = Gruppe)) +
  geom_boxplot()
Created on 2023-01-10 with reprex v2.0.2
Data used:
data <- read.table(text = " Wörter Sätze Text
P.01 0.15 0.24 0.34
P.02 0.10 0.15 0.08
P.03 0.05 0.18 0.16
P.04 0.55 0.60 0.44
P.05 0.00 0.06 0.26
P.06 0.20 0.65 0.68
P.07 0.15 0.31 0.47
P.08 0.35 0.87 0.69
P.09 0.35 0.75 0.76
N.01 0.40 0.78 0.59
N.02 0.55 0.95 0.76
N.03 0.65 0.96 0.83
N.04 0.60 0.90 0.77
N.05 0.50 0.95 0.82", header = TRUE)
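If you would rather stay in base R, something along the following lines might also work. This is only a sketch: the column names are transliterated to ASCII (Woerter, Saetze) purely to keep the example portable across locales, and the helper names gruppe, long and bp are made up here.

```r
# Same values as in the question; column names transliterated to ASCII
# (an assumption made only for portability of this sketch)
data <- data.frame(
  Woerter = c(0.15, 0.10, 0.05, 0.55, 0.00, 0.20, 0.15, 0.35, 0.35,
              0.40, 0.55, 0.65, 0.60, 0.50),
  Saetze  = c(0.24, 0.15, 0.18, 0.60, 0.06, 0.65, 0.31, 0.87, 0.75,
              0.78, 0.95, 0.96, 0.90, 0.95),
  Text    = c(0.34, 0.08, 0.16, 0.44, 0.26, 0.68, 0.47, 0.69, 0.76,
              0.59, 0.76, 0.83, 0.77, 0.82)
)
gruppe <- rep(c("P", "N"), times = c(9, 5))

# stack() reshapes the wide columns into `values` and `ind` (the variable name)
long <- stack(data)
long$Gruppe <- rep(gruppe, times = ncol(data))

# One box per Gruppe within each variable: 2 x 3 = 6 boxes.
# Groups are ordered N, P within each variable, so the two colours alternate.
bp <- boxplot(values ~ Gruppe + ind, data = long,
              col = c("tomato", "steelblue"), las = 2, xlab = "")
```

The formula interface does the grouping, so no extra packages are needed; the ggplot2 version above is easier to style further.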

Is there a way to modify specific cells in a data.frame with an apply-statement?

I have a data set
V1 V2 V3 V4
1 0.2 0.1 0.0 0.8
2 0.3 0.4 0.3 0.0
3 0.1 0.3 0.2 0.0
4 0.2 0.1 0.4 0.1
5 0.2 0.1 0.1 0.1
in which each variable has one cell to which I would like to add a fraction (10 %) of the other values in the same column.
This indicates the row in each variable that should receive the bonus:
bonus<-c(2,3,1,4)
And the desired output is this:
V1 V2 V3 V4
1 0.18 0.09 0.10 0.72
2 0.37 0.36 0.27 0.00
3 0.09 0.37 0.18 0.00
4 0.18 0.09 0.36 0.19
5 0.18 0.09 0.09 0.09
I do this with a for-loop:
for (i in seq_len(ncol(tab))) {
  tab[bonus[i], i] <- tab[bonus[i], i] + sum(0.1 * tab[-bonus[i], i])
  tab[-bonus[i], i] <- tab[-bonus[i], i] - (0.1 * tab[-bonus[i], i])
}
The first line inside the braces adds 0.1 times the sum of the other values to the cell whose row index is given in bonus; the second line subtracts 10 % from every cell except that one.
But I need to do this with a lot of columns in a lot of matrices and am struggling with including the information from the external vector bonus into a loop-less function.
Is there a way to vectorise this and then apply it across the datasets to make it faster?
Thanks very much!
( Example data:
tab<-data.frame(V1=c(0.2,0.3,0.1,0.2,0.2),
V2=c(0.1,0.4,0.3,0.1,0.1),
V3=c(0.00,0.3,0.2,0.4,0.1),
V4=c(0.8,0.0,0.0,0.1,0.1))
)
Try this:
mapply(
  function(vec, bon) {
    more <- vec/10
    vec + ifelse(seq_along(vec) %in% bon, sum(more[-bon]), -more)
  }, asplit(tab, 2), bonus)
# V1 V2 V3 V4
# [1,] 0.18 0.09 0.10 0.72
# [2,] 0.37 0.36 0.27 0.00
# [3,] 0.09 0.37 0.18 0.00
# [4,] 0.18 0.09 0.36 0.19
# [5,] 0.18 0.09 0.09 0.09
Sometimes I try to separate the change out of the function (such as when you want to troubleshoot the magnitude of change or some other summary statistic about it before updating the original table); if that appeals, then this can be shifted slightly:
changes <- mapply(
  function(vec, bon) {
    more <- vec/10
    ifelse(seq_along(vec) %in% bon, sum(more[-bon]), -more)
  }, asplit(tab, 2), bonus)
changes
# V1 V2 V3 V4
# [1,] -0.02 -0.01 0.10 -0.08
# [2,] 0.07 -0.04 -0.03 0.00
# [3,] -0.01 0.07 -0.02 0.00
# [4,] -0.02 -0.01 -0.04 0.09
# [5,] -0.02 -0.01 -0.01 -0.01
tab + changes
# V1 V2 V3 V4
# 1 0.18 0.09 0.10 0.72
# 2 0.37 0.36 0.27 0.00
# 3 0.09 0.37 0.18 0.00
# 4 0.18 0.09 0.36 0.19
# 5 0.18 0.09 0.09 0.09
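Another option, sketched here under the assumption that all columns are numeric, is to avoid the per-column function entirely and address the bonus cells with a two-column index matrix. The names m, idx, give and res are made up for this illustration.

```r
tab <- data.frame(V1 = c(0.2, 0.3, 0.1, 0.2, 0.2),
                  V2 = c(0.1, 0.4, 0.3, 0.1, 0.1),
                  V3 = c(0.0, 0.3, 0.2, 0.4, 0.1),
                  V4 = c(0.8, 0.0, 0.0, 0.1, 0.1))
bonus <- c(2, 3, 1, 4)

m   <- as.matrix(tab)
idx <- cbind(bonus, seq_along(bonus))  # (row, column) pairs of the bonus cells

give <- 0.1 * m   # what each cell would give up
res  <- m - give  # every cell loses 10 % ...

# ... except the bonus cells, which keep their value and receive
# the 10 % given up by the other cells in their column
res[idx] <- m[idx] + colSums(give) - give[idx]

print(res)
```

Indexing a matrix with a two-column matrix picks one cell per row of the index, which is what lets the external bonus vector be used without a loop.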

How to create a new column in a data frame depending on multiple criteria from multiple columns from the same data frame

I have a data frame df1 with four variables. One refers to sunlight, the second one refers to the moon-phase light (light due to the moon's phase), the third one to the moon-position light (light from the moon depending on if it is in the sky or not) and the fourth refers to the clarity of the sky (opposite to cloudiness).
I call them SL, MPhL, MPL and SC respectively. I want to create a new column called "global light" (GL) that depends only on SL during the day and on the other three columns ("MPhL", "MPL" and "SC") during the night. At night (when SL == 0), the light in a specific area should equal the product of the columns "MPhL", "MPL" and "SC". If any of them is 0, the light at night is also 0.
Since I work with a matrix of hundreds of thousands of rows, what would be the best way to do it? As an example of what I have:
SL<- c(0.82,0.00,0.24,0.00,0.98,0.24,0.00,0.00)
MPhL<- c(0.95,0.85,0.65,0.35,0.15,0.00,0.87,0.74)
MPL<- c(0.00,0.50,0.10,0.89,0.33,0.58,0.00,0.46)
SC<- c(0.00,0.50,0.10,0.89,0.33,0.58,0.00,0.46)
df<-data.frame(SL,MPhL,MPL,SC)
df
SL MPhL MPL SC
1 0.82 0.95 0.00 0.00
2 0.00 0.85 0.50 0.50
3 0.24 0.65 0.10 0.10
4 0.00 0.35 0.89 0.89
5 0.98 0.15 0.33 0.33
6 0.24 0.00 0.58 0.58
7 0.00 0.87 0.00 0.00
8 0.00 0.74 0.46 0.46
What I would like to get is this:
df
SL MPhL MPL SC GL
1 0.82 0.95 0.00 0.00 0.82 # When "SL">0, GL= SL
2 0.00 0.85 0.50 0.50 0.21 # When "SL" is 0, GL = MPhL*MPL*SC
3 0.24 0.65 0.10 0.10 0.24
4 0.00 0.35 0.89 0.89 0.28
5 0.98 0.15 0.33 0.33 0.98
6 0.24 0.00 0.58 0.58 0.24
7 0.00 0.87 0.00 0.00 0.00
8 0.00 0.74 0.46 0.46 0.16
The simplest way would be to use the ifelse function:
GL <- ifelse(SL == 0, MPhL * MPL * SC, SL)
If you want to work in a more structured environment, I can recommend the dplyr package:
library(dplyr)
tibble(SL = SL, MPhL = MPhL, MPL = MPL, SC = SC) %>%
  mutate(GL = if_else(SL == 0, MPhL * MPL * SC, SL))
# A tibble: 8 x 5
SL MPhL MPL SC GL
<dbl> <dbl> <dbl> <dbl> <dbl>
1 0.82 0.95 0.00 0.00 0.820000
2 0.00 0.85 0.50 0.50 0.212500
3 0.24 0.65 0.10 0.10 0.240000
4 0.00 0.35 0.89 0.89 0.277235
5 0.98 0.15 0.33 0.33 0.980000
6 0.24 0.00 0.58 0.58 0.240000
7 0.00 0.87 0.00 0.00 0.000000
8 0.00 0.74 0.46 0.46 0.156584
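Both snippets above work on the free-standing vectors. If the data already lives in a data frame, a base R sketch along these lines attaches GL directly and only touches the night rows; the helper name night is chosen here for illustration.

```r
df <- data.frame(
  SL   = c(0.82, 0.00, 0.24, 0.00, 0.98, 0.24, 0.00, 0.00),
  MPhL = c(0.95, 0.85, 0.65, 0.35, 0.15, 0.00, 0.87, 0.74),
  MPL  = c(0.00, 0.50, 0.10, 0.89, 0.33, 0.58, 0.00, 0.46),
  SC   = c(0.00, 0.50, 0.10, 0.89, 0.33, 0.58, 0.00, 0.46)
)

# Start from the daytime rule (GL = SL), then overwrite only the night rows
night <- df$SL == 0
df$GL <- df$SL
df$GL[night] <- with(df[night, ], MPhL * MPL * SC)

print(df)
```

Logical subsetting like this computes the product only where it is needed, which may help a little with hundreds of thousands of rows.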

How to loop to assign NA for different column with different threshold

I have a data set with several columns. For each column I want to find a threshold value such that the NA count falls between 1010 and 1020. My attempt is shown below, after this example of the data.
X1 X2 X3
1.51 0.00 0.00
0.31 3.90 0.00
0.64 13.64 0.00
0.26 9.66 0.00
0.36 0.04 0.00
0.51 0.03 0.00
0.30 0.08 0.02
0.01 0.20 0.04
0.02 0.03 0.00
0.00 0.47 0.00
0.00 1.44 5.54
0.00 2.68 0.74
0.03 0.68 5.49
1.72 0.08 1.54
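threshold <- seq(0.5, 5, by = 0.1)
for (j in threshold) {
  for (i in 1:3) {
    # Replace values at or below the current threshold with NA
    data[, i] <- ifelse(data[, i] > j, data[, i], NA)
    # Stop once the NA count is in the target range (2 to 4 in this small example)
    if (sum(is.na(data[, i])) %in% 2:4) break
  }
}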
Ok, here's how I'd do it.
threshold <- rep(NA,50)
for (i in 3:50) {
  # Find the number of current NAs
  nNA <- sum(is.na(pred[, i]))
  # Find the 1015th smallest value (minus the number of NAs you already have)
  threshold[i] <- sort(pred[, i])[1015 - nNA]
  # which() skips existing NAs, which would otherwise break the assignment on a data frame
  pred[which(pred[, i] < threshold[i]), i] <- NA
}
Edit: Changed to fit all new requirements.
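To make the logic easier to follow, here is the same idea on a single toy column (the first seven X1 values from the question) with a hypothetical target rank k = 3 instead of 1015.

```r
# Toy column: first seven X1 values from the question, no NAs yet
x <- c(1.51, 0.31, 0.64, 0.26, 0.36, 0.51, 0.30)
k <- 3  # hypothetical target rank, standing in for 1015

n_na      <- sum(is.na(x))
threshold <- sort(x)[k - n_na]  # sort() drops NAs by default
x[which(x < threshold)] <- NA   # only values strictly below the threshold

print(threshold)
print(sum(is.na(x)))
```

Because the comparison is strict, the column ends up with one NA fewer than k (here 2), which is still inside a window such as 1010 to 1020 around 1015. Columns with many tied values (for example a long run of zeros) can produce fewer NAs than expected, so it is worth checking the counts afterwards.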

All combinations of values between 0-1 sum to 1 in R

Simple question: I'm trying to get all combinations of 3 weights (each between 0.1 and 0.9) that sum to 1.
Example:
c(0.20,0.20,0.60)
c(0.35,0.15,0.50)
.................
with weights differing by 0.05
I have tried this:
library(gregmisc)
permutations(n = 9, r = 3, v = seq(0.1,0.9,0.05))
combn(seq(0.1,0.9,0.05),c(3))
However, I need the 3 numbers (weights) to sum to 1. How can I do this?
x <- expand.grid(seq(0.1,1,0.05),
                 seq(0.1,1,0.05),
                 seq(0.1,1,0.05))
x <- x[rowSums(x)==1,]
Edit: Use this instead to avoid floating point errors:
x <- x[abs(rowSums(x)-1) < .Machine$double.eps ^ 0.5,]
#if order doesn't matter
unique(apply(x,1,sort), MARGIN=2)
# 15 33 51 69 87 105 123 141 393 411 429 447 465 483 771 789 807 825 843 1149 1167 1185 1527 1545
#[1,] 0.1 0.10 0.1 0.10 0.1 0.10 0.1 0.10 0.15 0.15 0.15 0.15 0.15 0.15 0.2 0.20 0.2 0.20 0.2 0.25 0.25 0.25 0.3 0.30
#[2,] 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.15 0.20 0.25 0.30 0.35 0.40 0.2 0.25 0.3 0.35 0.4 0.25 0.30 0.35 0.3 0.35
#[3,] 0.8 0.75 0.7 0.65 0.6 0.55 0.5 0.45 0.70 0.65 0.60 0.55 0.50 0.45 0.6 0.55 0.5 0.45 0.4 0.50 0.45 0.40 0.4 0.35
This will run into performance and memory problems if the possible number of combinations gets huge.
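One way to shrink the grid and sidestep the floating point comparison at the same time, sketched here as an assumption rather than a tested recommendation, is to work in integer multiples of 0.05 and let the third weight be implied by the first two. The names steps, g, weights and unordered are made up for this example.

```r
# Work in integer units of 0.05 so the sum test is exact
steps <- 2:18                     # 0.10 ... 0.90 in units of 0.05
g <- expand.grid(a = steps, b = steps)
g$c <- 20L - g$a - g$b            # third weight is implied: a + b + c = 20 units
g <- g[g$c >= 2L & g$c <= 18L, ]  # keep rows whose third weight is valid

weights <- g / 20                 # back to the 0-1 scale

# If order doesn't matter, keep only non-decreasing triples
unordered <- weights[g$a <= g$b & g$b <= g$c, ]

print(nrow(weights))
print(nrow(unordered))
```

The grid has two dimensions instead of three, so it grows with the square rather than the cube of the number of steps.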
This was an easier-to-read solution for me:
x_grid <- data.frame(expand.grid(seq(0.1,1,0.05),
                                 seq(0.1,1,0.05),
                                 seq(0.1,1,0.05)))
# Compare with a tolerance rather than ==, to avoid floating point errors
x_combinations <- x_grid[abs(rowSums(x_grid) - 1) < 1e-8, ]
