I have a data.table in R and I want to create a new column that finds the interval for every price of the respective year/month.
Reproducible example:
library(data.table)
set.seed(100)
DT <- data.table(year=2000:2009, month=1:10, price=runif(5*26^2)*100)
intervals <- list(year=2000:2009, month=1:10, interval = sort(round(runif(9)*100)))
intervals <- replicate(10, (sample(10:100,100, replace=T)))
intervals <- t(apply(intervals, 1, sort))
intervals.dt <- data.table(intervals)
intervals.dt[, c("year", "month") := list(rep(2000:2009, each=10), 1:10)]
setkey(intervals.dt, year, month)
setkey(DT, year, month)
So far I have tried:
merging the DT and intervals.dt data.tables by month/year,
pasting all the V* columns into a single intervalsstring column (not very elegant, I admit), and finally
splitting that string back into a vector so I can use it in findInterval(), but the solution does not work for every row (!)
So, after:
DT <- merge(DT, intervals.dt)
DT <- DT[, intervalsstring := paste(V1, V2, V3, V4, V5, V6, V7, V8, V9, V10)]
DT <- DT[, c("V1", "V2", "V3", "V4", "V5", "V6", "V7", "V8", "V9", "V10") := NULL]
DT[, interval := findInterval(price, strsplit(intervalsstring, " ")[[1]])]
I get
> DT
year month price intervalsstring interval
1: 2000 1 30.776611 12 21 36 46 48 51 63 72 91 95 2
2: 2000 1 62.499648 12 21 36 46 48 51 63 72 91 95 6
3: 2000 1 53.581115 12 21 36 46 48 51 63 72 91 95 6
4: 2000 1 48.830599 12 21 36 46 48 51 63 72 91 95 5
5: 2000 1 33.066053 12 21 36 46 48 51 63 72 91 95 2
---
3376: 2009 10 33.635924 12 40 45 48 50 65 75 90 96 97 2
3377: 2009 10 38.993769 12 40 45 48 50 65 75 90 96 97 3
3378: 2009 10 75.065820 12 40 45 48 50 65 75 90 96 97 8
3379: 2009 10 6.277403 12 40 45 48 50 65 75 90 96 97 0
3380: 2009 10 64.189162 12 40 45 48 50 65 75 90 96 97 7
which is correct for the first rows, but not for the last (or other) rows.
For example, for row 3380 the price ~64.19 should fall in the 5th interval, not the 7th. I guess my mistake is that in my last command findInterval() uses only the first row of intervalsstring.
Thank you!
You have to use the argument by = year to apply the function to each subset:
DT[, interval := findInterval(price, intervals[as.character(year), ]), by = year]
year price interval
1: 2000 30.776611 4
2: 2001 25.767250 1
3: 2002 55.232243 4
4: 2003 5.638315 0
5: 2004 46.854928 2
---
3376: 2005 97.497761 10
3377: 2006 50.141227 5
3378: 2007 50.186270 7
3379: 2008 99.229338 10
3380: 2009 64.189162 8
Update (based on edited question):
DT[ , interval := findInterval(price,
unlist(intervals.dt[J(year[1], month[1]), 1:10])),
by = c("year", "month")]
year month price V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 interval
1: 2000 1 30.776611 12 21 36 46 48 51 63 72 91 95 2
2: 2000 1 62.499648 12 21 36 46 48 51 63 72 91 95 6
3: 2000 1 53.581115 12 21 36 46 48 51 63 72 91 95 6
4: 2000 1 48.830599 12 21 36 46 48 51 63 72 91 95 5
5: 2000 1 33.066053 12 21 36 46 48 51 63 72 91 95 2
---
3376: 2009 10 33.635924 12 40 45 48 50 65 75 90 96 97 2
3377: 2009 10 38.993769 12 40 45 48 50 65 75 90 96 97 3
3378: 2009 10 75.065820 12 40 45 48 50 65 75 90 96 97 8
3379: 2009 10 6.277403 12 40 45 48 50 65 75 90 96 97 0
3380: 2009 10 64.189162 12 40 45 48 50 65 75 90 96 97 7
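If you are not using data.table, the same per-group logic can be sketched in plain base R. This is a toy example with made-up groups and breakpoints (not the asker's real data): for each group, look up that group's breakpoints and pass them to findInterval() as a plain numeric vector.

```r
# Toy data: two groups with different breakpoints (illustrative values)
price  <- c(5, 25, 5, 25)
grp    <- c("a", "a", "b", "b")
breaks <- list(a = c(10, 20), b = c(3, 30))

interval <- numeric(length(price))
for (g in unique(grp)) {
  i <- grp == g
  # findInterval() needs this group's breakpoints as a sorted numeric vector
  interval[i] <- findInterval(price[i], breaks[[g]])
}
interval
# [1] 0 2 1 1
```

This is exactly what the by-group data.table call above does internally: each group sees only its own breakpoints.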
I'm struggling with something that might turn out to be super easy.
What I'd like is some short and efficient code to create a data frame where each column is made up of V1, V1 * 2, V1 * 3... and so on until a set number of columns is reached.
For example, if my V1 is this:
V1=rep(10000,1000)
I'd like a code to automatically generate additional columns such as V2 and V3
V2=V1*2
V3=V1*3
and bind them together in a dataframe to give
d=data.frame(V1,V2,V3)
d
Should this be done with a loop? I've tried a bunch of things, but I'm not the best at looping and at the moment I feel rather stuck.
Ideally I'd like my vector V1 to be:
V1=rep(10000,10973)
and to form a dataframe with 17 columns.
Thanks!
Use sapply to create multiple columns. Here, I am creating 17 columns where the vector 1:10 is multiplied by 1, 2, ..., 17. Use as.data.frame to convert the result to a data.frame object.
sapply(1:17, function(x) x * 1:10) |>
as.data.frame()
output
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17
1 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
2 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34
3 3 6 9 12 15 18 21 24 27 30 33 36 39 42 45 48 51
4 4 8 12 16 20 24 28 32 36 40 44 48 52 56 60 64 68
5 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85
6 6 12 18 24 30 36 42 48 54 60 66 72 78 84 90 96 102
7 7 14 21 28 35 42 49 56 63 70 77 84 91 98 105 112 119
8 8 16 24 32 40 48 56 64 72 80 88 96 104 112 120 128 136
9 9 18 27 36 45 54 63 72 81 90 99 108 117 126 135 144 153
10 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170
In your case, you would need:
sapply(1:17, function(x) x * rep(10000, 10973)) |>
as.data.frame()
We could use outer:
as.data.frame(outer(1:10, 1:17))
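Applied to the dimensions asked for in the question, the same outer() call would look like this (a sketch; the constant 10000 and the sizes 10973 and 17 come from the question):

```r
# 10973 rows of 10000, multiplied by 1..17 across 17 columns
d <- as.data.frame(outer(rep(10000, 10973), 1:17))
dim(d)
# [1] 10973    17
d[1, 1:3]
#      V1    V2    V3
# 1 10000 20000 30000
```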
I created two matrices with random integer entries; the dimensions of the matrices don't matter. I then want to calculate the distance matrix with the Manhattan method and frame it as a matrix. The matrix should be symmetric, but when I frame it as a matrix, the output is a non-symmetric distance matrix.
From that matrix (which should be the output) I want to calculate a cluster.
Where is my mistake?
Code:
a <- c(sample.int(30,6))
b <- c(sample.int(30,6))
c <- c(sample.int(30,6))
d <- c(sample.int(30,6))
e <- c(sample.int(30,6))
f <- c(sample.int(30,6))
V2 <- rbind(a,b,c,d,e,f)
V1 <- rbind(a,b,c,d,e,f)
d1MNR <- matrix(dist(V1, V2, method="manhattan")) #### Is non-symmetric
d1MR <- matrix(dist(V1,V2,upper=TRUE, diag=TRUE ,method="manhattan")) #### Should be symmetric, but is not
d1MR ### Generate output
hclust <- hclust(dist(d1MR), method = "single") ### Clustering
You can make a symmetric distance matrix from V1, or one from V2, but the only way to make a symmetric matrix from both of them together is to combine them: V12 <- rbind(V1, V2). Also note that dist() returns a dist object that hclust() can use directly, so you do not need to convert it to a matrix. In your example V1 and V2 are identical; we need them to be different:
set.seed(42)
V1 <- matrix(sample.int(30, 36, replace=TRUE), 6)
V2 <- matrix(sample.int(30, 36, replace=TRUE), 6)
V12 <- rbind(V1, V2)
rownames(V12) <- paste(rep(c("V1", "V2"), each=6), 1:6, sep=":")
colnames(V12) <- letters[1:6]
V12
# a b c d e f
# V1:1 17 18 4 18 4 28
# V1:2 5 26 25 15 5 8
# V1:3 1 17 5 3 13 3
# V1:4 25 15 14 9 5 26
# V1:5 10 24 20 25 20 1
# V1:6 4 7 26 27 2 10
# V2:1 24 8 28 3 18 22
# V2:2 30 4 5 24 6 21
# V2:3 11 4 4 23 6 2
# V2:4 15 22 2 17 2 23
# V2:5 22 18 24 21 20 6
# V2:6 26 13 18 26 3 26
d1MNR <- dist(V12, method="manhattan")
hclust <- hclust(d1MNR, method = "single")
plot(hclust)
If you want to look at a symmetrical distance matrix:
print(d1MNR, upper=TRUE, diag=TRUE)
# V1:1 V1:2 V1:3 V1:4 V1:5 V1:6 V2:1 V2:2 V2:3 V2:4 V2:5 V2:6
# V1:1 0 65 67 33 79 75 76 43 53 16 66 39
# V1:2 65 0 58 66 44 38 79 90 64 57 49 72
# V1:3 67 58 0 72 62 76 79 88 52 67 69 98
# V1:4 33 66 72 0 86 78 45 46 74 43 63 26
# V1:5 79 44 62 86 0 58 83 90 54 73 31 72
# V1:6 75 38 76 78 58 0 75 68 48 73 59 54
# V2:1 76 79 79 45 83 75 0 67 93 80 52 59
# V2:2 43 90 88 46 90 68 67 0 40 49 73 36
# V2:3 53 64 52 74 54 48 93 40 0 55 65 68
# V2:4 16 57 67 43 73 73 80 49 55 0 72 49
# V2:5 66 49 69 63 31 59 52 73 65 72 0 57
# V2:6 39 72 98 26 72 54 59 36 68 49 57 0
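If you do want the distances as an actual matrix object, convert the dist with as.matrix() rather than wrapping it in matrix(): matrix(dist(...)) just flattens the lower triangle into a single column, which is why the result did not look symmetric. A minimal sketch, reusing the setup above:

```r
set.seed(42)
V12 <- rbind(matrix(sample.int(30, 36, replace = TRUE), 6),
             matrix(sample.int(30, 36, replace = TRUE), 6))
d1MNR <- dist(V12, method = "manhattan")

m <- as.matrix(d1MNR)  # expands the dist object to a full n x n matrix
isSymmetric(m)         # TRUE
dim(m)                 # 12 12
```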
I'm working with a data table. I have a 13 * 2598893 data table, and I'm trying to make a new column whose values are calculated from another column.
So I made a function and applied it in a 'for' loop over those millions of rows. And it takes forever! I waited for some minutes, and I could not distinguish it from a system freeze.
I tried it on just 10 rows, and the loop and function work fine and fast. But when I extend it to the other rows, it takes forever again.
str(eco)
'data.frame': 2598893 obs. of 13 variables:
I made a function like this:
check <- function(x) {
  if (x <= 15) {
    return(1)
  } else {
    return(0)
  }
}
And applied it in a loop like this:
for(x in c(1:nrow(eco))){eco[x,13]<-check(eco[x,4])}
And it continues and continues to work.
How can I shorten this work? Or is this just the limit of R that I should endure?
You should probably vectorize your operations (NB: for loops can often be avoided in R). In addition, you could check out the data.table package to further improve efficiency:
library(data.table)
set.seed(1)
## create data.table
eco <- as.data.table(matrix(sample(1:100, 13 * 2598893, replace = TRUE), ncol = 13))
## update column
system.time(
set(eco, j = 13L, value = 1 * (eco[[4]] <= 15))
)
#> user system elapsed
#> 0.018 0.016 0.033
eco
#> V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13
#> 1: 68 74 55 62 82 51 42 18 16 12 50 73 0
#> 2: 39 97 53 61 21 25 79 71 85 19 54 30 0
#> 3: 1 89 62 42 5 90 33 77 31 1 59 26 0
#> 4: 34 22 27 4 36 74 65 45 46 67 74 34 1
#> 5: 87 57 88 4 42 26 9 13 64 32 16 15 1
#> ---
#> 2598889: 91 59 78 28 98 98 13 87 88 46 66 85 0
#> 2598890: 82 60 87 60 49 25 10 9 97 78 61 91 0
#> 2598891: 19 2 100 75 66 88 12 46 94 32 69 56 0
#> 2598892: 18 47 22 87 23 79 56 99 13 29 15 46 0
#> 2598893: 47 30 8 8 9 80 49 78 20 43 86 11 1
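The same vectorized update also works in plain base R, without any package. A minimal sketch with a tiny stand-in for eco (the column values here are made up):

```r
# Compare the whole column at once; no row-by-row loop needed
eco <- data.frame(V4 = c(3, 20, 15, 99))
eco$V13 <- as.integer(eco$V4 <= 15)  # 1 if V4 <= 15, else 0
eco$V13
# [1] 1 0 1 0
```

The comparison `eco$V4 <= 15` produces a logical vector for all rows in one pass, which is why it is orders of magnitude faster than calling check() once per row.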
I have a data table with over 90000 observations and 1201 variables. All columns except the last one store numeric values, the last column is the column with names of source files (over 100). Here is a small sample of the data table:
library(data.table)
DT <- data.table(V1=sample(0:100,20,replace=TRUE),
V2=sample(0:100,20,replace=TRUE), V3=sample(0:100,20,replace=TRUE),
V4=sample(0:100,20,replace=TRUE), V5=sample(0:100,20,replace=TRUE),
V6=sample(0:100,20,replace=TRUE), V7=sample(0:100,20,replace=TRUE),
file=rep(c("A","B","C","D"), each = 5))
What I want to do is to calculate a median of ALL values in each group (file). So e.g. for group A the median would be calculated from rows 1,2,3,4,5 at once. In the next step, I would like to assign the medians to each of the rows depending on a group (expected output below).
The question seems simple, and I have googled many similar questions about calculating a median/mean by group (aggregate being one of the most popular solutions). However, in all those cases only one column is taken into account for the median calculation. Here there are 7 columns (or in my original data 1,200), and median does not accept that: I should provide a numeric vector.
Therefore I have experimented with unlist, aggregate, the dplyr package, and tapply, without any luck...
Due to the amount of data and groups (i.e. files), the code should be quite automatic and efficient... I would really appreciate your help!
Just a small example of the code, which has obviously failed:
DT_median <- setDT(DT)[, DT_med := median(DT[,1:7]), by = file]
The expected result should look like this:
V1 V2 V3 V4 V5 V6 V7 file DT_med
42 78 9 0 60 46 65 A 37.5
36 36 46 45 5 96 64 A 37.5
83 31 92 100 15 2 9 A 37.5
36 16 49 82 32 4 46 A 37.5
29 17 39 6 62 52 97 A 37.5
37 70 17 90 8 10 93 B 47
72 62 68 83 96 77 20 B 47
10 47 29 2 93 16 30 B 47
69 87 7 47 96 17 8 B 47
23 70 72 27 10 86 49 B 47
78 51 13 33 56 6 39 C 51
28 92 100 5 75 33 17 C 51
71 82 9 20 34 83 22 C 51
62 40 84 87 37 45 34 C 51
55 80 55 94 66 96 12 C 51
93 1 99 97 7 77 6 D 41
53 55 71 12 19 25 28 D 41
27 25 28 89 41 22 60 D 41
91 25 25 57 21 98 27 D 41
2 63 17 53 99 65 95 D 41
As we want to calculate the median from all the values, grouped by 'file', unlist the Subset of Data.table (.SD), get the median, and assign (:=) the output to create the new column 'DT_med':
library(data.table)
DT[, DT_med := median(unlist(.SD), na.rm = TRUE), by = file]
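Without data.table, the same grouped all-columns median can be sketched in base R (df here is a small random stand-in for DT, so the median values will differ from the expected output above):

```r
set.seed(1)
df <- data.frame(matrix(sample(0:100, 140, replace = TRUE), ncol = 7))
df$file <- rep(c("A", "B", "C", "D"), each = 5)

# One median per group, computed over all seven columns at once,
# then mapped back onto every row of that group
meds <- sapply(split(df[1:7], df$file), function(g) median(unlist(g)))
df$DT_med <- meds[df$file]
```

split() hands each group's 5x7 block to median() as one pool of 35 values, which is the "median of ALL values" the question asks for.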
I have the following vector in R. Think of them as a vector of numbers.
x = c(1,2,3,4,...100)
I want to randomize this vector "locally" based on some input number the "locality factor". For example if the locality factor is 3, then the first 3 elements are taken and randomized followed by the next 3 elements and so on. Is there an efficient way to do this? I know if I use sample, it would jumble up the whole array.
Thanks in advance
Arun didn't like how inefficient my other answer was, so here's something very fast just for him ;)
It requires just one call each to runif() and order(), and doesn't use sample() at all.
x <- 1:100
k <- 3
n <- length(x)
x[order(rep(seq_len(ceiling(n/k)), each=k, length.out=n) + runif(n))]
# [1] 3 1 2 6 5 4 8 9 7 11 12 10 13 14 15 18 16 17
# [19] 20 19 21 23 22 24 27 25 26 29 28 30 33 31 32 36 34 35
# [37] 37 38 39 40 41 42 43 44 45 47 48 46 51 49 50 52 54 53
# [55] 55 57 56 58 60 59 62 63 61 66 64 65 68 67 69 71 70 72
# [73] 75 74 73 76 77 78 81 80 79 84 82 83 86 85 87 89 88 90
# [91] 93 92 91 94 96 95 97 98 99 100
General solution:
Edit: As #MatthewLundberg comments, the issue I pointed out with "repeating numbers in x" can be easily overcome by working on seq_along(x), which would mean the resulting values will be indices. So, it'd be like so:
k <- 3
x <- c(2,2,1, 1,3,4, 4,6,5, 3)
x.s <- seq_along(x)
y <- sample(x.s)
x[unlist(split(y, (match(y, x.s)-1) %/% k), use.names = FALSE)]
# [1] 2 2 1 3 4 1 4 5 6 3
Old answer:
The bottleneck here is the number of calls to the function sample. And as long as your numbers don't repeat, I think you can do this with just one call to sample, in this manner:
k <- 3
x <- 1:20
y <- sample(x)
unlist(split(y, (match(y,x)-1) %/% k), use.names = FALSE)
# [1] 1 3 2 5 6 4 8 9 7 12 10 11 13 14 15 17 16 18 19 20
To put everything together in a function (I like the name scramble from #Roland's):
scramble <- function(x, k=3) {
x.s <- seq_along(x)
y.s <- sample(x.s)
idx <- unlist(split(y.s, (match(y.s, x.s)-1) %/% k), use.names = FALSE)
x[idx]
}
scramble(x, 3)
# [1] 2 1 2 3 4 1 5 4 6 3
scramble(x, 3)
# [1] 1 2 2 1 4 3 6 5 4 3
To reduce the answer (and get it faster) even more, following #flodel's comment:
scramble <- function(x, k=3L) {
x.s <- seq_along(x)
y.s <- sample(x.s)
x[unlist(split(x.s[y.s], (y.s-1) %/% k), use.names = FALSE)]
}
For the record, the boot package (shipped with base R) includes a function permutation.array() that is used for just this purpose:
x <- 1:100
k <- 3
ii <- boot:::permutation.array(n = length(x),
R = 2,
strata = (seq_along(x) - 1) %/% k)[1,]
x[ii]
# [1] 2 1 3 6 5 4 9 7 8 12 11 10 15 13 14 16 18 17
# [19] 21 19 20 23 22 24 26 27 25 28 29 30 33 31 32 36 35 34
# [37] 38 39 37 41 40 42 43 44 45 46 47 48 51 50 49 53 52 54
# [55] 57 55 56 59 60 58 63 61 62 65 66 64 67 69 68 72 71 70
# [73] 75 73 74 76 77 78 79 80 81 82 83 84 86 87 85 89 88 90
# [91] 93 91 92 94 95 96 97 98 99 100
This will drop elements at the end (with a warning):
locality <- 3
x <- 1:100
c(apply(matrix(x, nrow=locality, ncol=length(x) %/% locality), 2, sample))
## [1] 1 2 3 4 6 5 8 9 7 12 10 11 13 15 14 16 18 17 19 20 21 22 24 23 26 25 27 28 30 29 32 33 31 35 34 36 38 39 37
## [40] 42 40 41 43 44 45 47 48 46 51 49 50 54 52 53 55 57 56 58 59 60 62 61 63 64 65 66 67 69 68 71 72 70 74 75 73 78 77 76
## [79] 80 81 79 83 82 84 87 85 86 88 89 90 92 93 91 96 94 95 99 98 97
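A small base-R variant (a sketch) that keeps the leftover elements instead of dropping them: split on a block index and shuffle within each block. Note the use of b[sample.int(length(b))] rather than sample(b): on a length-one numeric block, sample(b) would sample from 1:b instead of returning b.

```r
set.seed(7)
locality <- 3
x <- 1:10  # 10 is not a multiple of 3; the last block holds a single element
res <- unlist(lapply(split(x, (seq_along(x) - 1) %/% locality),
                     function(b) b[sample.int(length(b))]),
              use.names = FALSE)
res
```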
v <- 1:16
scramble <- function(vec,n) {
res <- tapply(vec,(seq_along(vec)+n-1)%/%n,
FUN=function(x) x[sample.int(length(x), size=length(x))])
unname(unlist(res))
}
set.seed(42)
scramble(v,3)
#[1] 3 2 1 6 5 4 9 7 8 12 10 11 15 13 14 16
scramble(v,4)
#[1] 2 3 1 4 5 8 6 7 10 12 9 11 14 15 16 13
I like Matthew's approach way better, but here's how I did the problem:
x <- 1:100
fact <- 3
y <- ceiling(length(x)/fact)
unlist(lapply(split(x, rep(1:y, each =fact)[1:length(x)]), function(x){
if (length(x)==1) return(x)
sample(x)
}), use.names = FALSE)
## [1] 3 1 2 6 4 5 8 9 7 11 10 12 13 15 14 17 16 18
## [19] 20 21 19 24 23 22 26 27 25 29 30 28 31 32 33 35 34 36
## [37] 39 37 38 41 42 40 45 43 44 47 46 48 51 49 50 52 53 54
## [55] 57 56 55 59 60 58 63 62 61 64 66 65 67 68 69 70 71 72
## [73] 75 73 74 77 76 78 80 79 81 82 84 83 85 86 87 90 89 88
## [91] 92 91 93 96 94 95 98 99 97 100
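Whichever variant you pick, the defining property of a "local" shuffle is easy to check: every block of locality-factor consecutive positions must contain the same values as the original, just reordered. A quick sanity check, using Matthew's order()/runif() one-liner from above:

```r
set.seed(1)
x <- 1:100
k <- 3
# shuffle within consecutive blocks of size k
y <- x[order(rep(seq_len(ceiling(length(x)/k)), each = k,
                 length.out = length(x)) + runif(length(x)))]

blocks <- (seq_along(x) - 1) %/% k
all(mapply(setequal, split(x, blocks), split(y, blocks)))
# [1] TRUE
```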