I am working on a project where I need to enter a number of "T score" tables into R. These are tables used to convert raw test scores into standardized values. They generally follow a specific pattern, but not one that is simple. For instance, one pattern is:
34,36,39,42,44,47,50,52,55,58,60,63,66,68,
71,74,76,79,82,84,87,90,92,95,98,100,103,106
I'd prefer to use a simple function to fill these in, rather than typing them by hand. I know that the seq() function can create a simple sequence, like:
R> seq(1,10,2)
[1] 1 3 5 7 9
Is there any way to create more complex sequences based on specific patterns? For instance, the above data could be done as:
c(34,seq(36:106,c(3,3,2)) # The pattern goes 36,39,42,44,47,50,52 (+3,+3,+2)
...however, this results in an error. I thought there would be a function that should do this, but all my Google-fu has just brought me back to the original seq().
This could be done using the cumsum (cumulative sum) function and rep:
> 31 + cumsum(rep(c(3, 2, 3), 9))
[1] 34 36 39 42 44 47 50 52 55 58 60 63 66 68 71 74 76 79 82
[20] 84 87 90 92 95 98 100 103
To make sure the sequence stops at the right place, generate one extra cycle and keep only the first 28 values:
> (31 + cumsum(rep(c(3, 2, 3), 10)))[1:28]
[1] 34 36 39 42 44 47 50 52 55 58 60 63 66 68 71 74 76 79 82
[20] 84 87 90 92 95 98 100 103 106
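A variation on the same idea, as a minimal sketch: rep() accepts a length.out argument, so the increment pattern can be recycled to exactly the number of steps needed, with no subsetting afterwards. The names first, steps and tscores are placeholders chosen here, not part of the answer above.

```r
first <- 34                                  # first value in the table
steps <- rep(c(2, 3, 3), length.out = 27)    # 28 values need 27 increments
tscores <- cumsum(c(first, steps))           # running total gives the table
tscores
```

Starting from the first table value rather than from 31 means the increment pattern reads +2, +3, +3 instead of +3, +2, +3.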
Here is a custom function that should work in most cases. It uses the cumulative sum (cumsum()) of a sequence, and integer division to calculate the length of the desired sequence.
cseq <- function(from, to, by){
times <- (to-from) %/% sum(by)
x <- cumsum(c(from, rep(by, times+1)))
x[x<=to]
}
Try it:
> cseq(36, 106, c(3,3,2))
[1] 36 39 42 44 47 50 52 55 58 60 63 66 68 71 74 76 79 82 84 87 90 92 95 98
[25] 100 103 106
> cseq(36, 109, c(3,3,2))
[1] 36 39 42 44 47 50 52 55 58 60 63 66 68 71 74 76 79 82 84 87 90 92 95 98
[25] 100 103 106 108
Here is a non-iterative solution, in case you need a specific element of the sequence without generating the ones before it:
f <- function(x){
d <- (x) %/% 3
r <- x %% 3
31 + d*8 + c(0,3,5)[r+1]
}
> f(1:10)
[1] 34 36 39 42 44 47 50 52 55 58
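The constants 31, 8 and c(0,3,5) in f() are specific to this table. As a rough sketch of how the same closed form could be parameterised, the hypothetical helper below (pattern_elem, with arguments start and by, all names invented here) computes element k from the first value and the repeating increments.

```r
# Hypothetical generalisation of f(): element k of the sequence
# start, start + by[1], start + by[1] + by[2], ... with 'by' recycled.
pattern_elem <- function(k, start, by) {
  offsets <- c(0, cumsum(by))           # offsets within one cycle
  cycle   <- (k - 1) %/% length(by)     # number of completed cycles
  pos     <- (k - 1) %% length(by)      # position inside the current cycle
  start + cycle * sum(by) + offsets[pos + 1]
}

pattern_elem(1:10, start = 34, by = c(2, 3, 3))
pattern_elem(28, start = 34, by = c(2, 3, 3))   # last entry of the table
```

Because it is vectorised over k, it can return a single element or a whole range of elements.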
I noticed a strange behavior of data.table that I don't understand:
library(data.table)
df <- as.data.table(matrix(ncol = 100, nrow = 3, data = sample(letters, 300, replace = TRUE)))
If I want to swap the first two columns, I can do:
df[,c(2,1,3:100L)]
which works fine. But if I do:
df[,c(2,1,3:ncol(df))]
[1] 2 1 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
[33] 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
[65] 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96
[97] 97 98 99 100
I get the column numbers back instead of the columns. I don't understand it, because ncol(df) is 100 and is an integer. Why does it do that?
You need to use with=FALSE as follows:
df[,c(2,1,3:ncol(df)),with=FALSE]
From ?data.table, under the Arguments for with
When j is a character vector of column names, a numeric vector of column positions to select or of the form startcol:endcol, and the value returned is always a data.table. with=FALSE is not necessary anymore to select columns dynamically. Note that x[, cols] is equivalent to x[, ..cols] and to x[, cols, with=FALSE] and to x[, .SD, .SDcols=cols].
Since c(2,1,3:100L) is a literal numeric vector of column positions, with=FALSE is not required and the columns are returned automatically. c(2,1,3:ncol(df)), on the other hand, contains a function call, so it is evaluated as an ordinary expression and its value is returned as a vector.
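Two of the equivalent forms mentioned in the quoted documentation can be sketched as follows. This assumes a data.table version recent enough to support the .. prefix; dt and cols are illustrative names, and a 4-column table stands in for the 100-column one.

```r
library(data.table)

dt <- as.data.table(matrix(1:12, nrow = 3))   # columns V1..V4
cols <- c(2, 1, 3:ncol(dt))                   # positions computed outside [ ]

a <- dt[, cols, with = FALSE]   # explicit with = FALSE
b <- dt[, ..cols]               # .. prefix looks up 'cols' in the calling scope
names(a)
names(b)
```

Storing the positions in a variable first keeps the computation out of j, so neither form depends on how j itself is parsed.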
Here is a snippet of my code:
m <- as.data.frame.matrix(matrix(c(20, 32, 52, 84, 98, 101), ncol = 2, nrow = 3))
ages <- as.numeric()
for(i in 1:nrow(m)){
ages <- c(ages, c(m$V1[i]:m$V2[i]))
}
Essentially, the first column is the starting age, and the second column is the ending age. I'm trying to append every single age from start to end, for every individual, into a single vector. Unfortunately, this is very slow since I have around a million observations, and I'm looking for a way to optimize it.
We could use mapply to create a sequence between the two columns for each row:
unlist(mapply(`:`, m$V1, m$V2))
#[1] 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37..
#[29] 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65..
#[57] 76 77 78 79 80 81 82 83 84 32 33 34 35 36 37 38 39 40..
#[85] 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68..
#[113] 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96..
#[141] 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 ..
#[169] 88 89 90 91 92 93 94 95 96 97 98 99 100 101
Here is an option using pmap
library(purrr)
library(dplyr)
set_names(m, c('from', 'to')) %>%
pmap(., seq) %>%
unlist
Or using Map from base R
unlist(do.call(Map, c(f = `:`, m)))
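A fully vectorised alternative, as a sketch: repeat each start value once per age in its range and add a within-row counter from sequence(). It avoids calling `:` once per row, which may help with around a million rows, although no benchmark was run here. The name lens is a placeholder.

```r
m <- data.frame(V1 = c(20, 32, 52), V2 = c(84, 98, 101))

lens <- m$V2 - m$V1 + 1                        # number of ages per row
ages <- rep(m$V1, lens) + sequence(lens) - 1   # start + 0, start + 1, ...
head(ages)
length(ages)
```

The result is a numeric vector holding the same values, in the same order, as the mapply version.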
I've got some sort of index, like:
index <- 1:100
I've also got a list of "exclusion intervals" / ranges
exclude <- data.frame(start = c(5,50, 90), end = c(10,55, 95))
start end
1 5 10
2 50 55
3 90 95
I'm looking for an efficient way (in R) to remove all the indexes that belong in the ranges in the exclude data frame
so the desired output would be:
1,2,3,4, 11,12,...,48,49, 56,57,...,88,89, 96,97,98,99,100
I could do this iteratively: go over every exclusion interval (using ddply) and iteratively remove indexes that fall in each interval. But is there a more efficient way (or function) that does this?
I'm using library(intervals) to calculate my intervals, but I could not find a built-in function that does this.
Another approach that looks valid could be:
starts = findInterval(index, exclude[["start"]])
ends = findInterval(index, exclude[["end"]] + 1L)  ## 1 needs to be added to remove upper
                                                   ## bounds from the 'index' too
index[starts != (ends + 1L)]  ## a value above a lower bound and
                              ## below an upper is inside that interval
The main advantage here is that no vectors including all intervals' elements need to be created and, also, that it handles any set of values inside a particular interval; e.g.:
set.seed(101); x = round(runif(15, 1, 100), 3)
x
# [1] 37.848 5.339 71.259 66.111 25.736 30.705 58.902 34.013 62.579 55.037 88.100 70.981 73.465 93.232 46.057
x[findInterval(x, exclude[["start"]]) != (findInterval(x, exclude[["end"]]) + 1L)]
# [1] 37.848 71.259 66.111 25.736 30.705 58.902 34.013 62.579 55.037 88.100 70.981 73.465 46.057
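For a small number of intervals, a direct closed-interval test can serve as a cross-check of the findInterval result. This is only a sketch: it builds a length(index) by nrow(exclude) logical matrix, so it gives up the memory advantage described above. The name inside is a placeholder.

```r
index <- 1:100
exclude <- data.frame(start = c(5, 50, 90), end = c(10, 55, 95))

# inside[i] is TRUE when index[i] lies in at least one [start, end] interval
inside <- rowSums(outer(index, exclude$start, ">=") &
                  outer(index, exclude$end, "<=")) > 0
index[!inside]
```

Both bounds are treated as part of the interval here, so 5, 10, 50, 55, 90 and 95 are all dropped.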
We can use Map to get the sequence for the corresponding elements in 'start' 'end' columns, unlist to create a vector and use setdiff to get the values of 'index' that are not in the vector.
setdiff(index,unlist(with(exclude, Map(`:`, start, end))))
#[1] 1 2 3 4 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
#[20] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44
#[39] 45 46 47 48 49 56 57 58 59 60 61 62 63 64 65 66 67 68 69
#[58] 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88
#[77] 89 96 97 98 99 100
Or we can use rep and then use setdiff.
i1 <- with(exclude, end-start) +1L
setdiff(index,with(exclude, rep(start, i1)+ sequence(i1)-1))
NOTE: Both the methods return the index position that needs to be excluded. In the above case, the original vector ('index') is a sequence so I used setdiff. If it contains random elements, use the position vector appropriately, i.e.
index[-unlist(with(exclude, Map(`:`, start, end)))]
or
index[setdiff(seq_along(index), unlist(with(exclude,
Map(`:`, start, end))))]
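If the goal is to drop elements by value rather than by position, %in% is one option. A minimal sketch with a made-up, unordered index vector (the values and the name excluded_values are illustrative only):

```r
index <- c(7, 50, 3, 92, 10, 200)   # not a sequence, not sorted
exclude <- data.frame(start = c(5, 50, 90), end = c(10, 55, 95))

excluded_values <- unlist(with(exclude, Map(`:`, start, end)))
index[!index %in% excluded_values]  # keeps only values outside every range
```

Unlike negative indexing, this does not care where a value sits in the vector, only whether it falls in one of the ranges.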
Another approach
> index[-do.call(c, lapply(1:nrow(exclude), function(x) exclude$start[x]:exclude$end[x]))]
[1] 1 2 3 4 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
[25] 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 56 57 58 59 60
[49] 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84
[73] 85 86 87 88 89 96 97 98 99 100
I am trying to write a function in R which will print every 3rd number in [1,100]. This is what I have tried, but it produces every number instead of every third one:
x <- c(100)
question.1 <- function (x){
out <- seq(x)
return(out)
}
question.1(x)
Am I missing something? Any help you can offer would be greatly appreciated!
Use the modulo operator (%%) to obtain every nth value from 1:100 like this:
nth <- function(x,n){
x[x%%n==0]
}
For example:
x <- 1:100
nth(x,7)
[1] 7 14 21 28 35 42 49 56 63 70 77 84 91 98
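Note that x %% n == 0 tests the values of x, which coincides with "every nth element" only because x is 1:100. For an arbitrary vector, a position-based variant could look like the sketch below (nth_pos is a name invented here):

```r
# Select every nth element by position rather than by value
nth_pos <- function(x, n) x[seq_along(x) %% n == 0]

nth_pos(letters, 5)
nth_pos(1:100, 7)
```

For 1:100 both versions return the same result, since each value equals its position.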
Use indexing with [ and the recycling of short vectors:
seq(100)[c(FALSE, FALSE, TRUE)]
## [1] 3 6 9 12 15 18 21 24 27 30 33 36 39 42 45 48 51 54 57 60 63 66 69 72 75 78 81 84 87 90 93 96 99
Excellent answers have already been posted; this is just one further simple alternative:
start <- 1 # defines the initial number you want to select
step <- 3 # difference between subsequent numbers
seq(start, 100, by=step)
#[1] 1 4 7 10 13 16 19 22 25 28 31 34 37 40 43 46 49 52 55 58 61 64 67 70 73 76 79 82 85 88 91 94 97 100
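With start <- 1 the sequence is 1, 4, 7, ..., whereas the other answers return 3, 6, 9, .... If the latter is what is wanted, one possible fix of the function from the question is to start at the step size. This is a sketch only; the step argument is an addition that was not in the original function.

```r
# question.1 from the question, with a 'by' step added (hypothetical fix)
question.1 <- function(x, step = 3) {
  seq(step, x, by = step)
}

question.1(100)
```

The call returns the 33 multiples of 3 up to 99.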
Just wrapping Matthew's solution in a general nth function:
nth <- function(x, n) x[c(rep(FALSE, n-1), TRUE)]
nth(1:100, 5)
# [1] 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100
Or even using an operator-style:
`%nth%` <- function(x, n) x[c(rep(FALSE, n-1), TRUE)]
seq(100) %nth% 5
The max.print option (see getOption("max.print")) can be used to limit the number of values that are printed by a single function call. For example:
options(max.print=20)
print(cars)
prints only the first 10 rows of the 2 columns. However, max.print doesn't work very well with lists. Especially if they are deeply nested, the number of lines printed to the console can still be practically unbounded.
Is there any way to specify a harder cutoff of the amount that can be printed to the screen? For example by specifying the amount of lines after which the printing can be interrupted? Something that also protects against printing huge recursive objects?
Based in part on this question, I would suggest just building a wrapper for print that uses capture.output to regulate what is printed:
print2 <- function(x, nlines=10,...)
cat(head(capture.output(print(x,...)), nlines), sep="\n")
For example:
> print2(list(1:10000,1:10000))
[[1]]
[1] 1 2 3 4 5 6 7 8 9 10 11 12
[13] 13 14 15 16 17 18 19 20 21 22 23 24
[25] 25 26 27 28 29 30 31 32 33 34 35 36
[37] 37 38 39 40 41 42 43 44 45 46 47 48
[49] 49 50 51 52 53 54 55 56 57 58 59 60
[61] 61 62 63 64 65 66 67 68 69 70 71 72
[73] 73 74 75 76 77 78 79 80 81 82 83 84
[85] 85 86 87 88 89 90 91 92 93 94 95 96
[97] 97 98 99 100 101 102 103 104 105 106 107 108
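As a possible extension of the same wrapper, the sketch below also reports how many lines were cut off. The name print_head and the wording of the notice are invented here. Note that capture.output() still formats the whole object (subject to max.print) before head() trims it, so this limits what reaches the screen rather than the work done.

```r
# Print at most 'nlines' lines of an object's printed form,
# followed by a short notice if anything was omitted.
print_head <- function(x, nlines = 10, ...) {
  out <- capture.output(print(x, ...))
  cat(head(out, nlines), sep = "\n")
  if (length(out) > nlines) {
    cat("... [", length(out) - nlines, " more lines omitted]\n", sep = "")
  }
  invisible(x)
}

print_head(list(1:10000, 1:10000), nlines = 5)
```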