I'm working with a data table of 2,598,893 rows and 13 columns, and I'm trying to create a new column whose values are calculated from another column.
So I wrote a function and applied it in a for loop over those millions of rows, and it takes forever. After waiting several minutes I couldn't tell whether it was still running or the system had hung.
I tried it on just 10 rows, and the loop and the function run quickly. But when I extend it to the remaining rows, it takes forever again.
str(eco)
'data.frame': 2598893 obs. of 13 variables:
I made a function like this:
check <- function(x) {
  # return 1 if x is at most 15, otherwise 0
  if (x <= 15) {
    return(1)
  } else {
    return(0)
  }
}
And applied it in a loop like this:
for (x in 1:nrow(eco)) { eco[x, 13] <- check(eco[x, 4]) }
And it just keeps running and running.
How can I speed this up? Or is this simply a limit of R that I have to live with?
You should probably try to vectorize your operations (NB: for loops can often be avoided in R). In addition, you could check out the data.table package to further improve efficiency:
library(data.table)
set.seed(1)
## create data.table
eco <- as.data.table(matrix(sample(1:100, 13 * 2598893, replace = TRUE), ncol = 13))
## update column
system.time(
set(eco, j = 13L, value = 1 * (eco[[4]] <= 15))
)
#> user system elapsed
#> 0.018 0.016 0.033
eco
#> V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13
#> 1: 68 74 55 62 82 51 42 18 16 12 50 73 0
#> 2: 39 97 53 61 21 25 79 71 85 19 54 30 0
#> 3: 1 89 62 42 5 90 33 77 31 1 59 26 0
#> 4: 34 22 27 4 36 74 65 45 46 67 74 34 1
#> 5: 87 57 88 4 42 26 9 13 64 32 16 15 1
#> ---
#> 2598889: 91 59 78 28 98 98 13 87 88 46 66 85 0
#> 2598890: 82 60 87 60 49 25 10 9 97 78 61 91 0
#> 2598891: 19 2 100 75 66 88 12 46 94 32 69 56 0
#> 2598892: 18 47 22 87 23 79 56 99 13 29 15 46 0
#> 2598893: 47 30 8 8 9 80 49 78 20 43 86 11 1
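For comparison, the same update in plain vectorized base R (a sketch, assuming eco is the question's data.frame with the input in column 4 and the result in column 13):
eco[[13]] <- as.integer(eco[[4]] <= 15)  # one comparison over the whole column; no row-by-row loop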
I'm struggling with something that might turn out to be super easy.
What I'd like is some short, efficient code to create a dataframe where each column is V1, V1 * 2, V1 * 3, and so on, until a set number of columns is reached.
For example, if my V1 is this:
V1=rep(10000,1000)
I'd like a code to automatically generate additional columns such as V2 and V3
V2=V1*2
V3=V1*3
and bind them together in a dataframe to give
d=data.frame(V1,V2,V3)
d
Should this be done with a loop? I've tried a bunch of things, but I'm not the best at looping and at the moment I feel rather stuck.
Ideally I'd like my vector V1 to be:
V1=rep(10000,10973)
and to form a dataframe with 17 columns.
Thanks!
Use sapply to create multiple columns. Here, I am creating 17 columns where the vector 1 to 10 is multiplied by 1, 2, ..., 17 in turn. Use as.data.frame to convert the result to a data.frame.
sapply(1:17, function(x) x * 1:10) |>
as.data.frame()
output
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17
1 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
2 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34
3 3 6 9 12 15 18 21 24 27 30 33 36 39 42 45 48 51
4 4 8 12 16 20 24 28 32 36 40 44 48 52 56 60 64 68
5 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85
6 6 12 18 24 30 36 42 48 54 60 66 72 78 84 90 96 102
7 7 14 21 28 35 42 49 56 63 70 77 84 91 98 105 112 119
8 8 16 24 32 40 48 56 64 72 80 88 96 104 112 120 128 136
9 9 18 27 36 45 54 63 72 81 90 99 108 117 126 135 144 153
10 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170
In your case, you would need:
sapply(1:17, function(x) x * rep(10000, 10973)) |>
as.data.frame()
We could use outer:
as.data.frame(outer(1:10, 1:17))
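In the question's case, the same idea with the V1 from the question would be:
as.data.frame(outer(rep(10000, 10973), 1:17))  # 10973 rows; column j is V1 * j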
I created two matrices with random integers as entries; the dimensions of the matrices don't matter. I then want to compute the distance matrix with the Manhattan method and coerce it to a matrix. The matrix should be symmetric, but when I coerce it to a matrix, the output is a non-symmetric distance matrix.
From that matrix (which should be the output) I want to compute a clustering.
Where is my mistake?
Code:
a <- c(sample.int(30,6))
b <- c(sample.int(30,6))
c <- c(sample.int(30,6))
d <- c(sample.int(30,6))
e <- c(sample.int(30,6))
f <- c(sample.int(30,6))
V2 <- rbind(a,b,c,d,e,f)
V1 <- rbind(a,b,c,d,e,f)
d1MNR <- matrix(dist(V1, V2, method="manhattan")) #### Is non-symmetric
d1MR <- matrix(dist(V1,V2,upper=TRUE, diag=TRUE ,method="manhattan")) #### Should be symmetric, but is not
d1MR ### Generate output
hclust <- hclust(dist(d1MR), method = "single") ### Clustering
You can make a symmetric distance matrix from V1 or a symmetric matrix from V2, but the only way to make a symmetric matrix from both of them together is to combine them: V12 <- rbind(V1, V2). The dist() function returns a dist object that hclust() can use directly; you do not need to convert it to a matrix. In your example V1 and V2 are identical, so we need them to be different:
set.seed(42)
V1 <- matrix(sample.int(30, 36, replace=TRUE), 6)
V2 <- matrix(sample.int(30, 36, replace=TRUE), 6)
V12 <- rbind(V1, V2)
rownames(V12) <- paste(rep(c("V1", "V2"), each=6), 1:6, sep=":")
colnames(V12) <- letters[1:6]
V12
# a b c d e f
# V1:1 17 18 4 18 4 28
# V1:2 5 26 25 15 5 8
# V1:3 1 17 5 3 13 3
# V1:4 25 15 14 9 5 26
# V1:5 10 24 20 25 20 1
# V1:6 4 7 26 27 2 10
# V2:1 24 8 28 3 18 22
# V2:2 30 4 5 24 6 21
# V2:3 11 4 4 23 6 2
# V2:4 15 22 2 17 2 23
# V2:5 22 18 24 21 20 6
# V2:6 26 13 18 26 3 26
d1MNR <- dist(V12, method="manhattan")
hclust <- hclust(d1MNR, method = "single")
plot(hclust)
If you want to look at a symmetrical distance matrix:
print(d1MNR, upper=TRUE, diag=TRUE)
# V1:1 V1:2 V1:3 V1:4 V1:5 V1:6 V2:1 V2:2 V2:3 V2:4 V2:5 V2:6
# V1:1 0 65 67 33 79 75 76 43 53 16 66 39
# V1:2 65 0 58 66 44 38 79 90 64 57 49 72
# V1:3 67 58 0 72 62 76 79 88 52 67 69 98
# V1:4 33 66 72 0 86 78 45 46 74 43 63 26
# V1:5 79 44 62 86 0 58 83 90 54 73 31 72
# V1:6 75 38 76 78 58 0 75 68 48 73 59 54
# V2:1 76 79 79 45 83 75 0 67 93 80 52 59
# V2:2 43 90 88 46 90 68 67 0 40 49 73 36
# V2:3 53 64 52 74 54 48 93 40 0 55 65 68
# V2:4 16 57 67 43 73 73 80 49 55 0 72 49
# V2:5 66 49 69 63 31 59 52 73 65 72 0 57
# V2:6 39 72 98 26 72 54 59 36 68 49 57 0
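If you need the full symmetric matrix as an actual matrix object (hclust() itself does not need it), as.matrix() on the dist object expands it (a quick sketch):
d1M <- as.matrix(d1MNR)  # full symmetric matrix with a zero diagonal
isSymmetric(d1M)         # TRUE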
I have a dataframe with 12 columns. For example, as below:
values <- matrix(1:120, nrow=10)
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
1 11 21 31 41 51 61 71 81 91 101 111
2 12 22 32 42 52 62 72 82 92 102 112
3 13 23 33 43 53 63 73 83 93 103 113
4 14 24 34 44 54 64 74 84 94 104 114
5 15 25 35 45 55 65 75 85 95 105 115
6 16 26 36 46 56 66 76 86 96 106 116
7 17 27 37 47 57 67 77 87 97 107 117
8 18 28 38 48 58 68 78 88 98 108 118
9 19 29 39 49 59 69 79 89 99 109 119
10 20 30 40 50 60 70 80 90 100 110 120
Now, let's say I want to run a linear regression model lm() on a set of 3 columns at a time (thus running lm() 4 times, each time passing one set of 3 columns as the dataframe for that run of lm()).
I was trying:
tapply(as.list(values), gl(ncol(values)/3, 3), lm())
and it said:
Error in terms.formula(formula, data = data):
argument is not a valid model
How do I solve this problem and run a linear model with lm() on a large dataframe, passing a specified number of columns as the input dataset for each run?
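A minimal sketch of one way to do this, assuming the first column in each group of three is the response and the other two are predictors (the grouping and formula here are assumptions; adjust them to your model). Note that with this toy data the columns are perfectly collinear, so expect NA coefficients, but the mechanics are the same:
values <- as.data.frame(matrix(1:120, nrow = 10))
col.groups <- split(seq_len(ncol(values)), rep(1:4, each = 3))  # 4 groups of 3 column indices
fits <- lapply(col.groups, function(cols) {
  d <- setNames(values[, cols], c("y", "x1", "x2"))  # rename so one formula fits every group
  lm(y ~ x1 + x2, data = d)
})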
?sort states that the partial argument may be NULL or a vector of indices for partial sorting.
I tried:
x <- c(1,3,5,2,4,6,7,9,8,10)
sort(x)
## [1] 1 2 3 4 5 6 7 8 9 10
sort(x, partial=5)
## [1] 1 3 4 2 5 6 7 9 8 10
sort(x, partial=2)
## [1] 1 2 5 3 4 6 7 9 8 10
sort(x, partial=4)
## [1] 1 2 3 4 5 6 7 9 8 10
I am not sure what partial means when sorting a vector.
As ?sort states,
If partial is not NULL, it is taken to contain indices of elements of the result
which are to be placed in their correct positions in the sorted array by partial sorting.
In other words, the following assertion is always true:
stopifnot(sort(x, partial=pt_idx)[pt_idx] == sort(x)[pt_idx])
for any x and pt_idx, e.g.
x <- sample(100) # input vector
pt_idx <- sample(1:100, 5) # indices for partial arg
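Putting those pieces together as a runnable check (definitions first, then the assertion; the seed is only for reproducibility):
set.seed(123)
x <- sample(100)               # input vector
pt_idx <- sample(1:100, 5)     # indices for the partial argument
stopifnot(sort(x, partial = pt_idx)[pt_idx] == sort(x)[pt_idx])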
This behavior is different from the one described in the Wikipedia article on partial sorting: in the case of R's sort(), we are not necessarily computing the k smallest elements.
For example, if
print(x)
## [1] 91 85 63 80 71 69 20 39 78 67 32 56 27 79 9 66 88 23 61 75 68 81 21 90 36 84 11 3 42 43
## [31] 17 97 57 76 55 62 24 82 28 72 25 60 14 93 2 100 98 51 29 5 59 87 44 37 16 34 48 4 49 77
## [61] 13 95 31 15 70 18 52 58 73 1 45 40 8 30 89 99 41 7 94 47 96 12 35 19 38 6 74 50 86 65
## [91] 54 46 33 22 26 92 53 10 64 83
and
pt_idx
## [1] 5 54 58 95 8
then
sort(x, partial=pt_idx)
## [1] 1 3 2 4 5 6 7 8 11 12 9 10 13 15 14 16 17 18 23 30 31 27 21 32 36 34 35 19 20 37
## [31] 38 33 29 22 26 25 24 28 39 41 40 42 43 48 46 44 45 47 51 50 52 49 53 54 57 56 55 58 59 60
## [61] 62 64 63 61 65 66 70 72 73 69 68 71 67 79 78 82 75 81 80 77 76 74 89 85 88 87 83 84 86 90
## [91] 92 93 91 94 95 96 97 99 100 98
Here x[5], x[54], ..., x[8] are placed in their correct positions - and we cannot say anything else about the remaining elements. HTH.
EDIT: Partial sorting may of course reduce the sorting time, e.g. if you are only interested in some of the order statistics.
require(microbenchmark)
x <- rnorm(100000)
microbenchmark(sort(x, partial=1:10)[1:10], sort(x)[1:10])
## Unit: milliseconds
## expr min lq median uq max neval
## sort(x, partial = 1:10)[1:10] 2.342806 2.366383 2.393426 3.631734 44.00128 100
## sort(x)[1:10] 16.556525 16.645339 16.745489 17.911789 18.13621 100
Regarding the statement "Here x[5], x[54], ..., x[8] are placed in their correct positions": I don't think that's correct. It should be: in the result, i.e. the sorted x, result[5], result[54], ..., result[8] will hold the correct values from x.
A quote from the R manual:
If partial is not NULL, it is taken to contain indices of elements of
the result which are to be placed in their correct positions in the
sorted array by partial sorting. For each of the result values in a
specified position, any values smaller than that one are guaranteed to
have a smaller index in the sorted array and any values which are
greater are guaranteed to have a bigger index in the sorted array.
I have the following vector in R. Think of it as a vector of numbers.
x = c(1,2,3,4,...100)
I want to randomize this vector "locally" based on an input number, the "locality factor". For example, if the locality factor is 3, then the first 3 elements are taken and randomized, followed by the next 3 elements, and so on. Is there an efficient way to do this? I know that if I use sample, it will jumble up the whole vector.
Thanks in advance
Arun didn't like how inefficient my other answer was, so here's something very fast just for him ;)
It requires just one call each to runif() and order(), and doesn't use sample() at all.
x <- 1:100
k <- 3
n <- length(x)
x[order(rep(seq_len(ceiling(n/k)), each=k, length.out=n) + runif(n))]
# [1] 3 1 2 6 5 4 8 9 7 11 12 10 13 14 15 18 16 17
# [19] 20 19 21 23 22 24 27 25 26 29 28 30 33 31 32 36 34 35
# [37] 37 38 39 40 41 42 43 44 45 47 48 46 51 49 50 52 54 53
# [55] 55 57 56 58 60 59 62 63 61 66 64 65 68 67 69 71 70 72
# [73] 75 74 73 76 77 78 81 80 79 84 82 83 86 85 87 89 88 90
# [91] 93 92 91 94 96 95 97 98 99 100
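The trick: rep() labels each element with its block number, and adding runif(n) (values strictly between 0 and 1) randomizes the order within a block without ever crossing block boundaries. A quick check that no element leaves its block (a sketch; it works here because x is 1:100, so values double as original positions):
y <- x[order(rep(seq_len(ceiling(n/k)), each = k, length.out = n) + runif(n))]
stopifnot(all((y - 1) %/% k == (seq_along(y) - 1) %/% k))  # same block before and after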
General solution:
Edit: As #MatthewLundberg comments, the issue I pointed out with "repeating numbers in x" can be easily overcome by working on seq_along(x), which would mean the resulting values will be indices. So, it'd be like so:
k <- 3
x <- c(2,2,1, 1,3,4, 4,6,5, 3)
x.s <- seq_along(x)
y <- sample(x.s)
x[unlist(split(y, (match(y, x.s)-1) %/% k), use.names = FALSE)]
# [1] 2 2 1 3 4 1 4 5 6 3
Old answer:
The bottleneck here is the number of calls to the function sample. As long as your numbers don't repeat, I think you can do this with just one call to sample, in this manner:
k <- 3
x <- 1:20
y <- sample(x)
unlist(split(y, (match(y,x)-1) %/% k), use.names = FALSE)
# [1] 1 3 2 5 6 4 8 9 7 12 10 11 13 14 15 17 16 18 19 20
To put everything together in a function (I like the name scramble from #Roland's):
scramble <- function(x, k=3) {
x.s <- seq_along(x)
y.s <- sample(x.s)
idx <- unlist(split(y.s, (match(y.s, x.s)-1) %/% k), use.names = FALSE)
x[idx]
}
scramble(x, 3)
# [1] 2 1 2 3 4 1 5 4 6 3
scramble(x, 3)
# [1] 1 2 2 1 4 3 6 5 4 3
To reduce the answer even more (and make it faster), following #flodel's comment: since x.s is just seq_along(x), match(y.s, x.s) is simply y.s, so the match() call can be dropped:
scramble <- function(x, k=3L) {
x.s <- seq_along(x)
y.s <- sample(x.s)
x[unlist(split(x.s[y.s], (y.s-1) %/% k), use.names = FALSE)]
}
For the record, the boot package (shipped with base R) includes a function permutation.array(), which permutes indices within the groups given by strata and so does just this:
x <- 1:100
k <- 3
ii <- boot:::permutation.array(n = length(x),
R = 2,
strata = (seq_along(x) - 1) %/% k)[1,]
x[ii]
# [1] 2 1 3 6 5 4 9 7 8 12 11 10 15 13 14 16 18 17
# [19] 21 19 20 23 22 24 26 27 25 28 29 30 33 31 32 36 35 34
# [37] 38 39 37 41 40 42 43 44 45 46 47 48 51 50 49 53 52 54
# [55] 57 55 56 59 60 58 63 61 62 65 66 64 67 69 68 72 71 70
# [73] 75 73 74 76 77 78 79 80 81 82 83 84 86 87 85 89 88 90
# [91] 93 91 92 94 95 96 97 98 99 100
This will drop elements at the end (with a warning):
locality <- 3
x <- 1:100
c(apply(matrix(x, nrow=locality, ncol=length(x) %/% locality), 2, sample))
## [1] 1 2 3 4 6 5 8 9 7 12 10 11 13 15 14 16 18 17 19 20 21 22 24 23 26 25 27 28 30 29 32 33 31 35 34 36 38 39 37
## [40] 42 40 41 43 44 45 47 48 46 51 49 50 54 52 53 55 57 56 58 59 60 62 61 63 64 65 66 67 69 68 71 72 70 74 75 73 78 77 76
## [79] 80 81 79 83 82 84 87 85 86 88 89 90 92 93 91 96 94 95 99 98 97
v <- 1:16
scramble <- function(vec,n) {
res <- tapply(vec,(seq_along(vec)+n-1)%/%n,
FUN=function(x) x[sample.int(length(x), size=length(x))])
unname(unlist(res))
}
set.seed(42)
scramble(v,3)
#[1] 3 2 1 6 5 4 9 7 8 12 10 11 15 13 14 16
scramble(v,4)
#[1] 2 3 1 4 5 8 6 7 10 12 9 11 14 15 16 13
I like Matthew's approach way better, but here's the way I did the problem:
x <- 1:100
fact <- 3
y <- ceiling(length(x)/fact)
unlist(lapply(split(x, rep(1:y, each = fact)[1:length(x)]), function(x){
if (length(x)==1) return(x)
sample(x)
}), use.names = FALSE)
## [1] 3 1 2 6 4 5 8 9 7 11 10 12 13 15 14 17 16 18
## [19] 20 21 19 24 23 22 26 27 25 29 30 28 31 32 33 35 34 36
## [37] 39 37 38 41 42 40 45 43 44 47 46 48 51 49 50 52 53 54
## [55] 57 56 55 59 60 58 63 62 61 64 66 65 67 68 69 70 71 72
## [73] 75 73 74 77 76 78 80 79 81 82 84 83 85 86 87 90 89 88
## [91] 92 91 93 96 94 95 98 99 97 100