I need to split a sorted vector of unknown length in R into "top 10%,..., bottom 10%"
So, for example, if I have vector <- order(c(1:98928)), I want to split it into 10 different vectors, each one representing approximately 10% of the total length.
I've tried using split <- split(vector, 1:10), but since I don't know the length of the vector, I get this message whenever the length is not a multiple of 10:
data length is not a multiple of split variable
Even when the length is a multiple of 10 and the function works, split() does not keep the order of my original vector. This is what split() gives:
split(c(1:10) , 1:2)
$`1`
[1] 1 3 5 7 9
$`2`
[1] 2 4 6 8 10
And this is what I want:
$`1`
[1] 1 2 3 4 5
$`2`
[1] 6 7 8 9 10
I'm new to R and I've been trying lots of things without success. Does anyone know how to do this?
Problem statement
Break a sorted vector x every 10% into 10 chunks.
Note that there are two interpretations of this:
Cutting by vector index:
split(x, floor(10 * seq.int(0, length(x) - 1) / length(x)))
Cutting by vector values (say, quantiles):
split(x, cut(x, quantile(x, probs = 0:10 / 10, names = FALSE), include.lowest = TRUE))
In the following, I demonstrate both using this data:
set.seed(0); x <- sort(round(rnorm(23),1))
In particular, the example data are normally distributed rather than uniformly distributed, so cutting by index and cutting by value give substantially different results.
Result
cutting by index
#$`0`
#[1] -1.5 -1.2 -1.1
#
#$`1`
#[1] -0.9 -0.9
#
#$`2`
#[1] -0.8 -0.4
#
#$`3`
#[1] -0.3 -0.3 -0.3
#
#$`4`
#[1] -0.3 -0.2
#
#$`5`
#[1] 0.0 0.1
#
#$`6`
#[1] 0.3 0.4 0.4
#
#$`7`
#[1] 0.4 0.8
#
#$`8`
#[1] 1.3 1.3
#
#$`9`
#[1] 1.3 2.4
cutting by quantile
#$`[-1.5,-1.06]`
#[1] -1.5 -1.2 -1.1
#
#$`(-1.06,-0.86]`
#[1] -0.9 -0.9
#
#$`(-0.86,-0.34]`
#[1] -0.8 -0.4
#
#$`(-0.34,-0.3]`
#[1] -0.3 -0.3 -0.3 -0.3
#
#$`(-0.3,-0.2]`
#[1] -0.2
#
#$`(-0.2,0.14]`
#[1] 0.0 0.1
#
#$`(0.14,0.4]`
#[1] 0.3 0.4 0.4 0.4
#
#$`(0.4,0.64]`
#numeric(0)
#
#$`(0.64,1.3]`
#[1] 0.8 1.3 1.3 1.3
#
#$`(1.3,2.4]`
#[1] 2.4
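For the vector from the question, the index-based split can be wrapped in a small helper. This is only a sketch: split_by_index is a hypothetical name, not an existing R function, and it assumes the vector is already sorted and has at least n elements.

```r
# Hypothetical helper: split x into n contiguous chunks of near-equal size,
# using the same floor() grouping as the index-based split above.
split_by_index <- function(x, n = 10) {
  grp <- floor(n * (seq_along(x) - 1) / length(x))
  split(x, grp)
}

chunks <- split_by_index(1:98928)
lengths(chunks)  # chunk sizes differ by at most one element
```

Because the groups are contiguous, unlist(chunks) gives the elements back in their original order.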
If you have your vector as a column (named vec) in a data frame, you can simply do something like this:
df$new_vec <- cut(df$vec, breaks = quantile(df$vec, seq(0, 1, by = 0.1)),
labels = 1:10, include.lowest = TRUE)
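A self-contained sketch of this decile labelling on made-up data. The data frame df and its column vec are generated here purely for illustration, and the approach assumes the quantile breaks are distinct (with many tied values, cut() stops because the breaks are not unique).

```r
set.seed(1)
df <- data.frame(vec = rnorm(200))  # illustrative data only

# Decile boundaries: 0%, 10%, ..., 100%
breaks <- quantile(df$vec, probs = seq(0, 1, by = 0.1))

# Label each value with its decile (1 = lowest 10%, 10 = highest 10%)
df$decile <- cut(df$vec, breaks = breaks, labels = 1:10, include.lowest = TRUE)
table(df$decile)
```

split(df$vec, df$decile) then returns the ten groups as a list, if separate vectors are needed.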
x <- 1:98
y <- split(x, ((seq(length(x))-1)*10)%/%length(x)+1)
Explanation:
seq(length(x)) = 1..98
seq(length(x))-1 = 0..97
(seq(length(x))-1)*10 = (0, 10, ..., 970)
# each number about 10% of values, totally 98
((seq(length(x))-1)*10)%/%length(x) = (0, ..., 0, 1, ..., 1, ..., 9, ..., 9)
# each number about 10% of values, totally 98
((seq(length(x))-1)*10)%/%length(x)+1 = (1, ..., 1, 2, ..., 2, ..., 10, ..., 10)
# splits first ~10% of numbers to 1, next ~10% of numbers to 2 etc.
split(x, ((seq(length(x))-1)*10)%/%length(x)+1)
If the vector is sorted, you can create a grouping variable with the same length as the vector and split on it. In a real case this requires a little more effort, since the length of the vector may not be a multiple of 10, but for your toy example you can do:
n = 2
split(x, rep(1:n, each = length(x)/n))
# $`1`
# [1] 1 2 3 4 5
# $`2`
# [1] 6 7 8 9 10
A real case example, where the vector's length is not a multiple of the number of groups:
vec = 1:13
n = 3
split(vec, sort(seq_along(vec)%%n))
# $`0`
# [1] 1 2 3 4
# $`1`
# [1] 5 6 7 8 9
# $`2`
# [1] 10 11 12 13
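Another way to build contiguous groups when the length is not a multiple of the number of groups is to apply cut() to the positions rather than to the values. A minimal sketch, reusing vec and n from above:

```r
vec <- 1:13
n <- 3

# cut() on the positions 1..length(vec) gives n contiguous groups,
# labelled 1..n because labels = FALSE returns integer codes
grp <- cut(seq_along(vec), breaks = n, labels = FALSE)
chunks <- split(vec, grp)
lengths(chunks)
```

The group sizes are only roughly equal, and here the extra element lands in the first group rather than in the middle one.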
You can use sum() to find the positions at which to cut a section out of the vector. Compare the vector with a percentile value using a logical operator such as <= and sum the result: sum() counts each TRUE as 1 and each FALSE as 0, so the total is the number of elements at or below that value. The vector must be sorted first for these counts to work as positions.
# A vector with numbers from 1 to 100
data <- seq(1,100)
# 25th percentile value and 75th percentile value
ps1 <- quantile(data,probs=c(0.25))
ps2 <- quantile(data,probs=c(0.75))
# Positions to split
position1 <- sum(data<=ps1)
position2 <- sum(data<=ps2)
# Split with positions in a sorted data
sort(data)[position1:position2]
The result is
25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75
In the same way, you can divide a sorted vector into 10 equal parts by specifying the percentiles:
# A vector with numbers from 1 to 100
data <- seq(1,100)
# Upper percentile value of each of the 10 parts
subvectors <- quantile(data,probs=c(0.10,0.20,0.30,0.40,0.50,0.60,0.70,0.80,0.90,1))
sorted <- sort(data)
start <- 1
for (i in seq_along(subvectors)) {
  # Position of the last element at or below the i-th percentile value
  end <- sum(sorted <= subvectors[i])
  # Print this part, then start the next part right after it
  if (end >= start) print(sorted[start:end])
  start <- end + 1
}
Related
I would like to create a function in R that rounds numeric dataframes (or columns in a dataframe) depending on the number. If the number is less than 1, round to 1 decimal, but if it is greater than 1, round to 0 decimals.
This is what I have
data <- data.frame(x = c(1.111, 0.809, 5.55555, 0.567), y = c(0.235, 0.777, 4.55555555, 393.55))
round0 <- function(x) format(round(x, digits=0), nsmall = 0, trim = TRUE)
round0(data)
x y
1 1 0
2 1 1
3 6 5
4 1 394
# What I want
x y
1 1 0.2
2 0.8 1
3 6 5
4 1 394
> round0 <- function(x) ifelse(x<1,round(x,1),round(x))
> sapply(data,round0)
x y
[1,] 1.0 0.2
[2,] 0.8 0.8
[3,] 6.0 5.0
[4,] 0.6 394.0
You can use :
round0 <- function(x) ifelse(x < 1, format(round(x, 1), nsmall = 1), round(x))
data[] <- lapply(data, round0)
data
# x y
#1 1 0.2
#2 0.8 0.8
#3 6 5
#4 0.6 394
Note that this is for display purposes only: the columns are now of class character. If you want to perform any calculation on them, you need to convert them back to numeric first.
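A short sketch of that round trip, keeping the original numeric data untouched. The names shown and back are only illustrative.

```r
data <- data.frame(x = c(1.111, 0.809, 5.55555, 0.567),
                   y = c(0.235, 0.777, 4.55555555, 393.55))
round0 <- function(x) ifelse(x < 1, format(round(x, 1), nsmall = 1), round(x))

# Display copy: every column becomes character
shown <- data
shown[] <- lapply(data, round0)
sapply(shown, class)

# Convert back to numeric before doing any arithmetic
back <- as.data.frame(lapply(shown, as.numeric))
colMeans(back)
```

Keeping data as it is and formatting a copy avoids the conversion entirely when only the printed table needs rounding.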
I have a data frame that looks like this:
library(dplyr)
size_df <- tibble(size_chr = c("XS", "S", "M", "L", "XL", "1XL", "2XL", "3XL", "4XL", "5XL", "6XL"),
size_min = c(0,36,39,42,45,48,52,56,60,64,66),
size_max = c(36,39,42,45,48,52,56,60,64,66,70))
For any given number less than 70, I want to find the two sizes that it lies between, and the distance between them both (normalised to between 0 and 1)
For example:
input <- 37.2
# S 0.6
# M 0.4
input <- 48
# XL 1
input <- 68
# 5XL 0.5
# 6XL 0.5
This is the perfect case for findInterval(). We'll create a vector of the breaks between categories and use those to calculate scaling factors.
size_breaks <- c(size_df[["size_min"]], max(size_df[["size_max"]]))
size_breaks
# [1] 0 36 39 42 45 48 52 56 60 64 66 70
size_spans <- diff(size_breaks)
size_scales <- 1 / size_spans
size_scales
# [1] 0.02777778 0.33333333 0.33333333 0.33333333 0.33333333 0.25000000 0.25000000
# [8] 0.25000000 0.25000000 0.50000000 0.25000000
findInterval() will give us the index of the lower bound. The upper bound is just that index + 1.
neighbor_distances <- function(x) {
lower <- findInterval(x, size_breaks)
neighbors <- c(lower, lower + 1)
distances <- abs(x - size_breaks[neighbors]) * size_scales[lower]
tibble(
size_chr = size_df[["size_chr"]][neighbors],
distance = distances
)
}
It works well for your first example.
neighbor_distances(37.2)
# # A tibble: 2 x 2
# size_chr distance
# <chr> <dbl>
# 1 S 0.4
# 2 M 0.600
The second example gives two rows instead of just one, but that can be handled with extra logic in the function. I left that logic out to keep things simple.
neighbor_distances(48)
# # A tibble: 2 x 2
# size_chr distance
# <chr> <dbl>
# 1 1XL 0
# 2 2XL 1
It gives a different answer for your third example, but I don't know why you expect a number to be compared to a size category smaller than the lower bound.
neighbor_distances(68)
# # A tibble: 2 x 2
# size_chr distance
# <chr> <dbl>
# 1 6XL 0.5
# 2 NA 0.5
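As a quick check of the indices behind the three results above, findInterval() can be called directly on the inputs from the question. The vectors are redefined here so the snippet runs on its own.

```r
size_chr <- c("XS", "S", "M", "L", "XL", "1XL", "2XL", "3XL", "4XL", "5XL", "6XL")
size_breaks <- c(0, 36, 39, 42, 45, 48, 52, 56, 60, 64, 66, 70)

# Index of the largest break that is <= each input
lower <- findInterval(c(37.2, 48, 68), size_breaks)
lower
# [1]  2  6 11

# Size whose range starts at that break
size_chr[lower]
# [1] "S"   "1XL" "6XL"
```

The last index is 11, so lower + 1 points past the end of size_chr, which is where the NA in the third result comes from.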
INDS = c(max(1, tail(which(size_df$size_min < input), 1)),
min(NROW(size_df), 1 + head(which(size_df$size_max > input), 1)))
size_df$size_chr[INDS]
#[1] "S" "M"
DIST = c(abs(size_df$size_min[INDS[1]] - input),
abs(size_df$size_max[INDS[2]] - input))
DIST/sum(DIST)
#[1] 0.2 0.8
I have a lot of matrices with different numbers of rows, and I want to sum the odd-numbered rows and the even-numbered rows element-wise, as below:
o <- matrix(rep(c(1,2,3,4,5,6),6),ncol = 6)
o2 <- matrix(rep(c(1,2,3,4,5,6),12),ncol = 6)
#I want to sum the odd-number rows and even number rows element respectively
i=1
kg <- NULL
while(i <= 2){
op<-unlist(Map(sum,o[i,],o[i+2,],o[i+4,]))
kg <- c(kg,op)
i=i+1
}
i=1
kg2 <- NULL
while(i <= 2){
op2<-unlist(Map(sum,o2[i,],o2[i+2,],o2[i+4,],o2[i+6],o2[i+8],o2[i+10]))
kg2 <- c(kg2,op2)
i=i+1
}
kg
kg2 #the result should be a vector sequence like kg and kg2
> kg2
[1] 18 18 18 18 18 18 24 24 24 24 24 24
This is what I can do now. But my matrices have many different numbers of rows. Is there a method to do this quickly?
And how can I generate a string like "o2[i,],o2[i+2,],o2[i+4,],o2[i+6],o2[i+8],o2[i+10])" automatically according to the input number? Thank you for your help :)
Perhaps something like this?
o <- matrix(rep(c(1,2,3,4,5,6),6),ncol = 6)
o2 <- matrix(rep(c(1,2,3,4,5,6),12),ncol = 6)
even <- function(x) 2 * seq(1, nrow(x) / 2);
odd <- function(x) 2 * seq(1, nrow(x) / 2) - 1;
colSums(o[even(o), ]);
#[1] 12 12 12 12 12 12
colSums(o[odd(o), ]);
#[1] 9 9 9 9 9 9
colSums(o2[even(o2), ]);
#[1] 24 24 24 24 24 24
colSums(o2[odd(o2), ]);
#[1] 18 18 18 18 18 18
Explanation: even/odd return even/odd row indices of a matrix/data.frame; we can then use colSums to sum entries by column.
Update
To sum entries from rows 3, 6, 9, 12 (or any other sequence) you just need to define a corresponding function, e.g.
another_seq <- function(x) 3 * seq(1, nrow(x) / 3)
colSums(o2[another_seq(o2), ]);
#[1] 18 18 18 18 18 18
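The three index functions above can be folded into one helper that takes a start row and a step. This is a sketch: sum_every is a hypothetical name, and it builds the row sequence with seq(start, nrow(m), by = step), which also copes with an odd number of rows.

```r
# Hypothetical helper: column sums over rows start, start + step, ...
sum_every <- function(m, start = 1, step = 2) {
  rows <- seq(start, nrow(m), by = step)
  # drop = FALSE keeps a matrix even if only one row is selected
  colSums(m[rows, , drop = FALSE])
}

o2 <- matrix(rep(c(1, 2, 3, 4, 5, 6), 12), ncol = 6)
sum_every(o2, start = 1, step = 2)  # odd rows
sum_every(o2, start = 2, step = 2)  # even rows
sum_every(o2, start = 3, step = 3)  # rows 3, 6, 9, 12
```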
In the OP's loop, if we want to change the Map to make it more automatic
unlist(do.call(Map, c(f = sum, as.data.frame(t(o2[seq(i, i+10, by = 2),])))))
Using the full code
o <- matrix(rep(c(1,2,3,4,5,6),6),ncol = 6)
o2 <- matrix(rep(c(1,2,3,4,5,6),12),ncol = 6)
#I want to sum the odd-number rows and even number rows
i=1
kg <- NULL
while(i <= 2){
#op<-unlist(Map(sum,o[i,],o[i+2,],o[i+4,]))
op <- unlist(do.call(Map, c(f = sum,
as.data.frame(t(o[seq(i, i+4, by = 2),]))))) # change here
kg <- c(kg,op)
i=i+1
}
i=1
kg2 <- NULL
while(i <= 2){
#op2<-unlist(Map(sum,o2[i,],o2[i+2,],o2[i+4,],o2[i+6],o2[i+8],o2[i+10]))
op2 <- unlist(do.call(Map, c(f = sum,
as.data.frame(t(o2[seq(i, i+10, by = 2),]))))) # change here
kg2 <- c(kg2,op2)
i=i+1
}
kg
#[1] 9 9 9 9 9 9 12 12 12 12 12 12
kg2
#[1] 18 18 18 18 18 18 24 24 24 24 24 24
In the OP's code, if we analyze the individual arguments of Map with just two arguments i.e. the first and 3rd row of 'o'
i <- 1
Map(function(x, y) c(x, y), o[i,], o[i+2,])
#[[1]]
#[1] 1 3
#[[2]]
#[1] 1 3
#[[3]]
#[1] 1 3
#[[4]]
#[1] 1 3
#[[5]]
#[1] 1 3
#[[6]]
#[1] 1 3
Here, each element of the list is the column values concatenated (c). If we need to get a similar structure, by subsetting the odd rows, we transpose the subset of rows, convert it to data.frame, so that each individual block is a column (that corresponds to the original rows subsetted)
do.call(Map, c(f=c, as.data.frame(t(o[c(i, i+2),]))))
#[[1]]
#V1 V2
# 1 3
#[[2]]
#V1 V2
# 1 3
#[[3]]
#V1 V2
# 1 3
#[[4]]
#V1 V2
# 1 3
#[[5]]
#V1 V2
# 1 3
#[[6]]
#V1 V2
# 1 3
Keeping it as a matrix will not solve it, as it takes the whole matrix as a single cell (a matrix is a vector with a dimension attribute)
do.call(Map, c(f=c, o[c(i, i+2),]))
#[[1]]
#[1] 1 3 1 3 1 3 1 3 1 3 1 3
while using Map directly will loop through each element of the matrix (vector) instead of each column
Map(c, o[c(i, i+2),]) # check the output
Another option would be to split the object by col and then do the sum
onew <- o[seq(i, i+4, by = 2),]
Map(sum, split(onew, col(onew)))
The above approach loops over the elements, but we can also use a vectorized approach (as in @Maurits Evers's post). Instead of seq, here we rely on the recycling of a logical vector to subset the rows and then apply colSums
i1 <- c(TRUE, FALSE)
colSums(cbind(o[i1,], o[!i1,]))
#[1] 9 9 9 9 9 9 12 12 12 12 12 12
colSums(cbind(o2[i1,], o2[!i1,]))
#[1] 18 18 18 18 18 18 24 24 24 24 24 24
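Base R's rowsum() offers one more vectorized route: it sums the rows of a matrix within groups, so grouping by row parity gives both results at once. A minimal sketch, where parity is just an illustrative variable name:

```r
o2 <- matrix(rep(c(1, 2, 3, 4, 5, 6), 12), ncol = 6)

# 1 for odd-numbered rows, 0 for even-numbered rows
parity <- seq_len(nrow(o2)) %% 2

# One output row per group, named after the group value
sums <- rowsum(o2, group = parity)
sums["1", ]  # sums over odd rows
sums["0", ]  # sums over even rows
```

Changing the modulus (for example %% 3) groups every third row together without touching the rest of the code.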
So what I would like to do is to remove each element that is the last value based on the ID. I have a fairly large data set, so it would not be optimal to do it by hand.
Here are some sample values.
ID <- c(1, 1, 1, 231313, 231313, 45, 45, 89, 89, 89, 89)
distance <- c(0.3, 1.2, 0.8, 0.5, 6789, 0.1, 349495, 0.1, 0.3, 0.9, 49494)
data <- data.frame(ID = ID, distance = distance)
To make it clearer, I would like to replace 0.8, 6789, 349495 and 49494 with "" since they are the last value for each unique ID.
The results would look like this:
ID distance
1 0.3
1 1.2
1
231313 0.5
231313
45 0.1
45
89 0.1
89 0.3
89 0.9
89
Generally not a good idea to use "" as 'missing' (e.g., it coerces numeric values to character strings). Base R functionality finds those that are not duplicated (starting from the end), and updates the corresponding column
> data[!duplicated(data$ID, fromLast=TRUE), "distance"] = NA
> data
ID distance
1 1 0.3
2 1 1.2
3 1 NA
4 231313 0.5
5 231313 NA
6 45 0.1
7 45 NA
8 89 0.1
9 89 0.3
10 89 0.9
11 89 NA
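To see which rows the duplicated() call picks out, the logical index can be inspected on its own. A small sketch on the sample vectors from the question; is_last and cleaned are illustrative names.

```r
ID <- c(1, 1, 1, 231313, 231313, 45, 45, 89, 89, 89, 89)
distance <- c(0.3, 1.2, 0.8, 0.5, 6789, 0.1, 349495, 0.1, 0.3, 0.9, 49494)

# Scanning from the end, the first time an ID is seen is its last occurrence
is_last <- !duplicated(ID, fromLast = TRUE)
which(is_last)
# [1]  3  5  7 11

# Replace those positions with NA without modifying the original vector
cleaned <- replace(distance, is_last, NA)
```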
Instead of inserting '' and changing the column class from numeric to character, we can replace the last value for each 'ID' with NA. Using data.table, we convert the 'data.frame' to a 'data.table' (setDT(data)), get the row index (.I) of the last row (.N) for each 'ID', and assign NA to the 'distance' of those rows.
library(data.table)
i1 <- setDT(data)[, .I[.N], ID]$V1
data[i1, distance:= NA_real_]
data
# ID distance
# 1: 1 0.3
# 2: 1 1.2
# 3: 1 NA
# 4: 231313 0.5
# 5: 231313 NA
# 6: 45 0.1
# 7: 45 NA
# 8: 89 0.1
# 9: 89 0.3
#10: 89 0.9
#11: 89 NA
Or we can do this in a single step
setDT(data)[, distance := replace(distance, .N, NA_real_), by = ID]
Suppose I want to generate bins for the range 1 to 20
round(seq(1,20,length.out=5))
the output is
1 6 10 15 20
I want to form a data.frame as
[,1] [,2]
[1,] 1 6
[2,] 7 10
[3,] 11 15
[4,] 16 20
so the start will be 1,7, 11, 16, and ends are 6, 10, 15, 20, respectively.
Any solution for this?
x = round(seq(1,20,length.out=5))
df = data.frame(a = c(x[1], head(x[-1],-1) + 1), b = x[-1])
df
# a b
#1 1 6
#2 7 10
#3 11 15
#4 16 20
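The same construction works for any increasing vector of integer break points, so it can be wrapped in a function. This is a sketch: make_bins is a hypothetical name, and it assumes the breaks are sorted whole numbers.

```r
# Hypothetical helper: turn break points into non-overlapping [start, end] bins
make_bins <- function(x) {
  data.frame(
    start = c(x[1], head(x[-1], -1) + 1),  # first break, then previous end + 1
    end   = x[-1]                          # every break except the first
  )
}

bins <- make_bins(round(seq(1, 20, length.out = 5)))
bins
```

Each bin starts one past the end of the previous one, so no integer falls into two bins.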
I am not sure if you are looking for the following solution. If you are, you can use cut and sub function as in my earlier post:
mydata<-round(seq(1,20,length.out=5))
mydata<-as.data.frame(mydata)
names(mydata)<-"V" #name the column as V
mydata$V1<-cut(mydata$V,5) #break the data into five intervals and name that as col V1
mydata$lower<-with(mydata,as.numeric( sub("\\((.+),.*", "\\1", V1))) #extract lower value
mydata$upper<-with(mydata,as.numeric( sub("[^,]*,([^]]*)\\]", "\\1",V1))) # extract upper value
myfinaldata<-mydata[,c("lower","upper")] #create data frame of lower and upper values
> myfinaldata
lower upper
1 0.981 4.79
2 4.790 8.60
3 8.600 12.40
4 12.400 16.20
5 16.200 20.00
Note: Although these look like overlapping intervals, they are not. By default cut() builds intervals that are open on the left and closed on the right, so the first row covers all data > 0.981 and <= 4.79, whereas the second row covers data > 4.79 and <= 8.60.
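To check which side of each interval is closed, a value sitting exactly on a break can be passed to cut(). The breaks below are made up for illustration.

```r
v <- c(1, 5, 10)

# Default (right = TRUE): intervals are (0,5] and (5,10], so 5 falls in the first
default_codes <- as.integer(cut(v, breaks = c(0, 5, 10)))
default_codes
# [1] 1 1 2

# right = FALSE: intervals are [0,5) and [5,10), so 5 moves to the second
# and 10 falls outside every interval (NA)
left_closed_codes <- as.integer(cut(v, breaks = c(0, 5, 10), right = FALSE))
left_closed_codes
# [1]  1  2 NA
```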