Shorten vectors by taking block means - r

I have a simple question, but I can't find the right solution. I have a list (let's call it "list") consisting of about 2000 ordinary vectors (list[[1]], list[[2]], etc.). Each of those vectors contains 50399 numbers. I want to shorten each vector so that it consists of 840 numbers in the end.
The first number should be the mean of the first 60 numbers of the original vector (mean(list[[i]][1:60])), the second number the mean of the next 60 numbers, and so on. That covers 839 means (839 × 60 = 50340 numbers), so the last (840th) number should be the mean of the remaining 59 (not 60) numbers of the original vector.
That should work for each vector (list[[i]]) in "list"!
Does anyone have an idea how to do that?

You can work it out like this:
set.seed(1)
(list <- replicate(3, sample(1:10, 10, T), simplify = FALSE))
# [[1]]
# [1] 3 4 6 10 3 9 10 7 7 1
#
# [[2]]
# [1] 3 2 7 4 8 5 8 10 4 8
#
# [[3]]
# [1] 10 3 7 2 3 4 1 4 9 4
n <- 5 # crunch vectors of 10 into 5 means
lapply(list, function(x) sapply(split(x, ceiling(seq_along(x)/(length(x)/n))), mean))
# [[1]]
# 1 2 3 4 5
# 3.5 8.0 6.0 8.5 4.0
#
# [[2]]
# 1 2 3 4 5
# 2.5 5.5 6.5 9.0 6.0
#
# [[3]]
# 1 2 3 4 5
# 6.5 4.5 3.5 2.5 6.5
i.e., in your case:
list <- replicate(2000, sample(1:10, 50399, T), simplify = FALSE)
res <- lapply(list, function(x) sapply(split(x, ceiling(seq_along(x)/(length(x)/840))), mean))
sapply(res, length) # check
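One detail worth noting: with length(x)/n as the divisor, the single 59-element chunk lands at the start rather than the end. If the means must cover exactly the first 60 values, the next 60, and so on, with the leftover 59 at the end as specified, build the group index from a fixed block size instead. A sketch with one stand-in vector:

```r
x <- rnorm(50399)                # stand-in for one list[[i]]
g <- ceiling(seq_along(x) / 60)  # 1 for positions 1-60, 2 for 61-120, ..., 840 for the last 59
means <- tapply(x, g, mean)      # one mean per block
length(means)                    # 840
table(tabulate(g))               # 839 blocks of size 60 and one of size 59
```

Applied to the whole list: `res <- lapply(list, function(x) tapply(x, ceiling(seq_along(x)/60), mean))`.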

Quantile cuts despite duplicates

I have a dataset with > 900,000 rows with many duplicates:
> sum(duplicated(df$colB))
[1] 904515
So when I try to quantile cut into ten equally large subsets, I get an error
> df$colC <- cut(df$colB, quantile(df$colB,c(0:10)/10), labels=FALSE,
+ include.lowest=TRUE)
Error in cut.default(df$colB, quantile(df$colB, :
'breaks' are not unique
Using unique(quantile(df$colB, c(0:10)/10)) doesn't give equally sized subsets. There must be an easy way to make quantile cuts that also takes the number of rows into account, in addition to the values in colB. A loop would probably take forever, as I have a high number of rows. Any ideas?
Dummy dataset:
set.seed(10)
B <- round(runif(100, 0, 0.4), digits = 2) # gives 63 duplicates
df <- data.frame(colB = B)
There might be a neater solution than this, but this will do it:
df$colC <- ceiling((1:nrow(df))*10/nrow(df))[rank(df$colB, ties.method = 'first')]
table(df$colC)
#>
#> 1 2 3 4 5 6 7 8 9 10
#> 10 10 10 10 10 10 10 10 10 10
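If a package dependency is acceptable, dplyr's ntile does the same rank-based bucketing (ties broken by order of appearance), so duplicate values never trigger an error. A sketch, assuming dplyr is installed:

```r
library(dplyr)

set.seed(10)
df <- data.frame(colB = round(runif(100, 0, 0.4), digits = 2))
df$colC <- ntile(df$colB, 10)  # ten buckets, as equal in size as possible
table(df$colC)                 # here: exactly 10 rows per bucket
```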
It might be hard to picture, but there must be a run of identical values in df$colB that is long enough for quantile to return the same value for two (or more) of the requested breakpoints.
A contrived example:
set.seed(42)
vec <- c(rep(10, 20), sample(100, size = 80))
brks <- quantile(vec, (0:10)/10)
brks
# 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
# 2.0 10.0 10.0 14.7 25.6 36.5 47.4 58.9 72.4 88.1 100.0
The cut function requires that there be no repeated values in its breaks= argument. It can be informative to look at the quantiles of your data to confirm this.
One way around this is to use .bincode, which does not enforce unique breaks.
cut(vec, brks, include.lowest = TRUE)
# Error in cut.default(vec, brks, include.lowest = TRUE) :
# 'breaks' are not unique
.bincode(vec, brks, include.lowest = TRUE)
# [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 7 8 4 9 4
# [26] 10 6 4 8 10 6 4 5 1 6 5 5 1 5 9 7 6 10 5 6 4 4 9 1 9
# [51] 8 10 1 7 10 9 8 1 8 1 7 9 7 4 8 7 6 1 6 9 5 8 6 10 6
# [76] 9 1 5 3 10 6 5 9 4 5 7 10 7 8 9 4 5 7 3 8 4 10 7 8 10
(Note that there is no "2" in the return values with this data, because brks[2] is the same as brks[3], so appears to be ignored.)
One side-effect of this is that you don't get the factor labels by default, though they can be useful.
labels <- sprintf("(%0.01f-%0.01f]", brks[-11], brks[-1])
substr(labels[1], 1, 1) <- "["
labels
# [1] "[2.0-10.0]"  "(10.0-10.0]" "(10.0-14.7]" "(14.7-25.6]"
# [5] "(25.6-36.5]" "(36.5-47.4]" "(47.4-58.9]" "(58.9-72.4]"
# [9] "(72.4-88.1]" "(88.1-100.0]"
head(labels[ .bincode(vec, brks, include.lowest = TRUE) ])
# [1] "[2.0-10.0]" "[2.0-10.0]" "[2.0-10.0]" "[2.0-10.0]" "[2.0-10.0]" "[2.0-10.0]"
(The %0.01f format is the part you may want to customize for your own labels.)

Create a matrix which contains one column from each of the objects in a list

If we have this list with 100 objects:
List
[[1]]
x y z
379.0 0.6 1.0
369.0 0.3 2.0
[[2]]
x y z
359.0 6.2 4.0
379.0 4.6 1.0
[[3]]
x y z
379.0 6.0 1.0
379.0 0.5 1.0
.
.
[[100]]
How could I obtain a matrix using the "y" values of each object in the list (all of them have the same length), obtaining something like:
X:
y1 y2 y3 ... y100
0.6 6.2 6.0
0.3 4.6 0.5
Try below:
# example list
l <- list(cars[1:3, ], cars[4:6, ])
l
# [[1]]
# speed dist
# 1 4 2
# 2 4 10
# 3 7 4
#
# [[2]]
# speed dist
# 4 7 22
# 5 8 16
# 6 9 10
Subset the 2nd column using the "[" function, then bind the columns:
do.call(cbind, lapply(l, "[", 2))
# dist dist
# 1 2 22
# 2 10 16
# 3 4 10
Or, a similar approach: instead of binding columns, convert to a data.frame:
as.data.frame(lapply(l, "[", 2))
# dist dist.1
# 1 2 22
# 2 10 16
# 3 4 10
Note: We could subset using column names, too. In my example data the 2nd column name is "dist", so replacing the number 2 above with "dist" works the same.
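Since the question asks for a matrix rather than a data.frame, note that sapply simplifies to a matrix when every extracted column has the same length:

```r
l <- list(cars[1:3, ], cars[4:6, ])
m <- sapply(l, `[[`, 2)  # extract the 2nd column ("dist") of each element
m
#      [,1] [,2]
# [1,]    2   22
# [2,]   10   16
# [3,]    4   10
colnames(m) <- paste0("y", seq_along(l))  # optional y1, y2, ... labels as in the question
```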

Replicate certain values in vector determined by other vector

I have a vector of values (say 1:10) and want to repeat certain values in it 2 or more times, as determined by another vector (say c(3,4,6,8)). In this example, the result when repeating 2 times would be c(1,2,3,3,4,4,5,6,6,7,8,8,9,10).
This should work for an arbitrary-length range vector (like 200:600), with a second vector that is contained in the first. Is there a handy way to achieve this?
Akrun's is a more compact method, but this also will work
# get rep vector
reps <- rep(1L, 10L)
reps[c(3,4,6,8)] <- 2L
rep(1:10, reps)
[1] 1 2 3 3 4 4 5 6 6 7 8 8 9 10
The insight here is that rep will take an integer vector in the second argument the same length as the first argument that indicates the number of repetitions for each element of the first argument.
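For reference, that vectorized times= behaviour looks like this in isolation:

```r
# each element of times gives the repeat count for the matching element of x
rep(c("a", "b", "c"), times = c(1, 2, 3))
# [1] "a" "b" "b" "c" "c" "c"
```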
Note that this solution relies on the assumption that c(3,4,6,8) is the index, or position, of the elements to be repeated. Under that scenario, d-b's comment provides a one-liner (with x <- 1:10):
rep(x, (seq_along(x) %in% c(3,4,6,8)) + 1)
If instead, c(3,4,6,8) indicates the values that are to be repeated, then docendo-discimus's super-compact code,
rep(x, (x %in% c(3,4,6,8)) * (n-1) +1)
where n may be adjusted to change the number of repetitions. If you need to call this a couple times, this could be rolled up into a function like
myReps <- function(x, y, n) rep(x, (x %in% y) * (n-1) +1)
and called as
myReps(1:10, c(3,4,6,8), 2)
in the current scenario.
We can try
v1 <- 1:10
v2 <- c(3, 4, 6, 8)
i1 <- v1 %in% v2
sort(c(v1[!i1], rep(v1[i1], each = 2)))
#[1] 1 2 3 3 4 4 5 6 6 7 8 8 9 10
Update
For the arbitrary vector,
f1 <- function(vec1, vec2, n){
  i1 <- vec1 %in% vec2
  vec3 <- seq_along(vec1)
  c(vec1[!i1], rep(vec1[i1], each = n))[order(c(vec3[!i1],
                                                rep(vec3[i1], each = n)))]
}
set.seed(24)
v1N <- sample(10)
v2 <- c(3,4,6,8)
v1N
#[1] 3 10 6 4 7 5 2 9 8 1
f1(v1N, v2, 2)
#[1] 3 3 10 6 6 4 4 7 5 2 9 8 8 1
f1(v1N, v2, 3)
#[1] 3 3 3 10 6 6 6 4 4 4 7 5 2 9 8 8 8 1
Here's another approach using sapply
#DATA
x = 1:10
r = c(3,4,6,8)
n = 2 #Two repetitions of selected values
#Assuming 'r' is the index of values in x to be repeated
unlist(sapply(seq_along(x), function(i) if(i %in% r){rep(x[i], n)}else{rep(x[i],1)}))
#[1] 1 2 3 3 4 4 5 6 6 7 8 8 9 10
#Assuming 'r' is the values in 'x' to be repeated
unlist(sapply(x, function(i) if(i %in% r){rep(i, n)}else{rep(i, 1)}))
#[1] 1 2 3 3 4 4 5 6 6 7 8 8 9 10
I haven't tested these thoroughly, but they could be possible alternatives. Note that for an unsorted input the order of the output will differ considerably with this approach.
sort(c(x, rep(x[x %in% r], n-1))) #assuming 'r' is values
#[1] 1 2 3 3 4 4 5 6 6 7 8 8 9 10
sort(c(x, rep(x[r], n-1))) #assuming 'r' is index
#[1] 1 2 3 3 4 4 5 6 6 7 8 8 9 10
I suggest this solution just to emphasize the cool usage of the append function in base R:
ff <- function(vec, v, n) {
  for(i in seq_along(v)) vec <- append(vec, rep(v[i], n - 1), after = which(vec == v[i]))
  vec
}
Examples:
set.seed(1)
ff(vec = sample(10), v = c(3,4,6,8), n = 2)
#[1] 3 3 4 4 5 7 2 8 8 9 6 6 10 1
ff(vec = sample(10), v = c(2,5,9), n = 4)
#[1] 3 2 2 2 2 6 10 5 5 5 5 7 8 4 1 9 9 9 9

R data.table with variable number of columns

For each student in a data set, a certain set of scores may have been collected. We want to calculate the mean for each student, but using only the scores in the columns that were germane to that student.
The columns required in a calculation are different for each row. I've figured out how to write this in R using the usual tools, but I am trying to rewrite it with data.table, partly for fun, but also in anticipation of this small project's success, which might lead to the need to make these calculations for lots and lots of rows.
Here is a small working example of "choose a specific column set for each row problem."
set.seed(123234)
## Suppose these are 10 students in various grades
dat <- data.frame(id = 1:10, grade = rep(3:7, 2),
                  A = sample(c(1:5, 9), 10, replace = TRUE),
                  B = sample(c(1:5, 9), 10, replace = TRUE),
                  C = sample(c(1:5, 9), 10, replace = TRUE),
                  D = sample(c(1:5, 9), 10, replace = TRUE))
## 9 is a marker for missing value, there might also be
## NAs in real data, and those are supposed to be regarded
## differently in some exercises
## Students in various grades are administered different
## tests. A data structure gives the grade to test linkage.
## The letters are column names in dat
lookup <- list("3" = c("A", "B"),
               "4" = c("A", "C"),
               "5" = c("B", "C", "D"),
               "6" = c("A", "B", "C", "D"),
               "7" = c("C", "D"),
               "8" = c("C"))
## wrapper around that lookup because I kept getting confused
getLookup <- function(grade){
  lookup[[as.character(grade)]]
}
## Function that receives one row (named vector)
## from data frame and chooses columns and makes calculation
getMean <- function(arow, lookup){
  scores <- arow[getLookup(arow["grade"])]
  mean(scores[scores != 9], na.rm = TRUE)
}
stuscores <- apply(dat, 1, function(x) getMean(x, lookup))
result <- data.frame(dat, stuscores)
result
## If the data is 1000s of thousands of rows,
## I will wish I could use data.table to do that.
## Client will want students sorted by state, district, classroom,
## etc.
## However, am stumped on how to specify the adjustable
## column-name chooser
library(data.table)
DT <- data.table(dat)
## How to write call to getMean correctly?
## Want to do this for each participant (no grouping)
setkey(DT, id)
The desired output is the student average for the appropriate columns, like so:
> result
   id grade A B C D stuscores
1   1     3 9 9 1 4       NaN
2   2     4 5 4 1 5       3.0
3   3     5 1 3 5 9       4.0
4   4     6 5 2 4 5       4.0
5   5     7 9 1 1 3       2.0
6   6     3 3 3 4 3       3.0
7   7     4 9 2 9 2       NaN
8   8     5 3 9 2 9       2.0
9   9     6 2 3 2 5       3.0
10 10     7 3 2 4 1       2.5
Then what? I've written a lot of mistakes so far...
I did not find any data.table examples in which the columns used in the calculation for each row were themselves a variable, so I thank you for your advice.
I was not asking anybody to write code for me, I'm asking for advice on how to get started with this problem.
First of all, when creating a reproducible example using functions such as sample (which draw new random numbers on every run), you should use set.seed.
Second of all, instead of looping over each row, you can loop over the lookup list, which will always be smaller than the data (often significantly smaller), and combine it with rowMeans. You could also do this in base R, but you asked for a data.table solution, so here goes (for the purposes of this solution I've converted all 9s to NAs, but you can generalize this to your specific case too).
So using set.seed(123), your function gives
apply(dat, 1, function(x) getMean(x, lookup))
# [1] 2.000000 5.000000 4.666667 4.500000 2.500000 1.000000 4.000000 2.333333 2.500000 1.500000
And here's a possible data.table application which loops only over the lookup list (for loops over lists are very efficient in R, by the way)
## convert all 9 values to NAs
is.na(dat) <- dat == 9L
## convert your original data to `data.table`;
## there is no need for an additional copy of the data even if the data is huge
setDT(dat)
## loop only over the list
for(i in names(lookup)) {
  dat[grade == i, res := rowMeans(as.matrix(.SD[, lookup[[i]], with = FALSE]), na.rm = TRUE)]
}
dat
#     id grade  A  B  C  D      res
#  1:  1     3  2 NA NA NA 2.000000
#  2:  2     4  5  3  5 NA 5.000000
#  3:  3     5  3  5  4  5 4.666667
#  4:  4     6 NA  4 NA  5 4.500000
#  5:  5     7 NA  1  4  1 2.500000
#  6:  6     3  1 NA  5  3 1.000000
#  7:  7     4  4  2  4  5 4.000000
#  8:  8     5 NA  1  4  2 2.333333
#  9:  9     6  4  2  2  2 2.500000
# 10: 10     7  3 NA  1  2 1.500000
Possibly, this could be improved utilizing set, but I can't think of a good way currently.
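One grouped alternative that avoids the explicit loop is to let `by = grade` drive the column choice via `.BY`. This is only a sketch on a small stand-alone version of the data (names `dat2`/`lookup2` are mine, not from the question); `.SD` is converted to a matrix so its columns can be picked by name:

```r
library(data.table)

dat2 <- data.table(id = 1:4, grade = c(3, 4, 3, 4),
                   A = c(1, 2, NA, 4), B = c(5, 6, 7, 8), C = c(9, 1, 2, 3))
lookup2 <- list("3" = c("A", "B"), "4" = c("A", "C"))

## for each grade group, average only that grade's columns
dat2[, res := rowMeans(as.matrix(.SD)[, lookup2[[as.character(.BY$grade)]],
                                      drop = FALSE],
                       na.rm = TRUE),
     by = grade]
dat2$res
# [1] 3.0 1.5 7.0 3.5
```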
P.S.
As suggested by @Arun, please take a look at the vignettes he wrote here in order to get familiar with the := operator, .SD, with = FALSE, etc.
Here's another data.table approach using melt.data.table (needs data.table 1.9.5+) and then joins between data.tables:
DT_m <- setkey(melt.data.table(DT, c("id", "grade"), value.name = "score"),
               grade, variable)
lookup_dt <- data.table(grade = rep(as.integer(names(lookup)), lengths(lookup)),
                        variable = unlist(lookup), key = "grade,variable")
score_summary <- setkey(DT_m[lookup_dt, nomatch = 0L,
                             .(res = mean(score[score != 9], na.rm = TRUE)),
                             by = id], id)
setkey(DT, id)[score_summary, res := res]
#     id grade A B C D res
#  1:  1     3 9 9 1 4 NaN
#  2:  2     4 5 4 1 5 3.0
#  3:  3     5 1 3 5 9 4.0
#  4:  4     6 5 2 4 5 4.0
#  5:  5     7 9 1 1 3 2.0
#  6:  6     3 3 3 4 3 3.0
#  7:  7     4 9 2 9 2 NaN
#  8:  8     5 3 9 2 9 2.0
#  9:  9     6 2 3 2 5 3.0
# 10: 10     7 3 2 4 1 2.5
It's more verbose, but just over twice as fast:
microbenchmark(da_method(), nk_method(), times = 1000)
#Unit: milliseconds
# expr min lq mean median uq max neval
# da_method() 17.465893 17.845689 19.249615 18.079206 18.337346 181.76369 1000
# nk_method() 7.047405 7.282276 7.757005 7.489351 7.667614 20.30658 1000

mapply and two lists

I'm trying to use mapply to combine two lists (A and B). Each element is a dataframe.
I'm trying to rbind the dataframes in A to the corresponding dataframes in B. The following returns what I would like in combo1:
num <- 10
A <- list()
B <- list()
for (j in 1:num){
  A[[j]] <- as.data.frame(matrix(seq(1:9), 3, 3))
  B[[j]] <- as.data.frame(matrix(seq(10:18), 3, 3)) # NB: seq(10:18) is 1:9, so B equals A here
}
combo1 <- list()
for (i in 1:num){
  combo1[[i]] <- rbind(A[[i]], B[[i]])
}
I'm trying to use mapply to do the same, but I can't get it to work:
combo2<-list()
combo2<-mapply("rbind", A, B)
I was hoping someone could please help me
You were very close!
## Make this a more _minimal_ reproducible example
A <- A[1:2]
B <- B[1:2]
## Override default attempt to reduce results to a vector, matrix, or other array
mapply("rbind", A, B, SIMPLIFY=FALSE)
# [[1]]
# V1 V2 V3
# 1 1 4 7
# 2 2 5 8
# 3 3 6 9
# 4 1 4 7
# 5 2 5 8
# 6 3 6 9
#
# [[2]]
# V1 V2 V3
# 1 1 4 7
# 2 2 5 8
# 3 3 6 9
# 4 1 4 7
# 5 2 5 8
# 6 3 6 9
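Equivalently, base R's Map is literally mapply with SIMPLIFY = FALSE baked in, so it always returns a list. A small self-contained sketch (A2/B2 are stand-ins for the question's A and B):

```r
# small stand-in lists of data frames
A2 <- list(data.frame(x = 1:2), data.frame(x = 3:4))
B2 <- list(data.frame(x = 5:6), data.frame(x = 7:8))

combo2 <- Map(rbind, A2, B2)  # element-wise rbind; the result stays a list
combo2[[1]]$x
# [1] 1 2 5 6
```

With the question's lists, `Map(rbind, A, B)` gives the same result as the combo1 loop.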
