Placing a value between specific numbers in a cycle in R

So let's say I have
x = 1,4,2
i = 2
j = 4
k = 3
So i = 2 and j = 4; the point is I need to place k (3) between the numbers i and j in x, so the result would be x = 1,4,3,2. I need it to work in a cycle, because the values of i, j, k always change, and so does the length of x whenever a new number from k is placed in x. The new x after step one is
x = 1,4,3,2, and let's say the new values are:
i = 4
j = 3
k = 5, so again in the cycle it should place 5 in x between 4 and 3, so the final x = 1,4,5,3,2.
Is there a way I could do this?

When i is always the number immediately before j, you could use the append() function, i.e.:
x = c(1,4,2)
i = 4
k = 3
x <- append(x, k, match(i, x))
x
[1] 1 4 3 2
i = 4
k = 5
x <- append(x, k, match(i, x))
x
[1] 1 4 5 3 2
Putting this in a function:
insert <- function(x, k, i){
  append(x, k, match(i, x))
}
Note that you did not specify what should happen if you had more than one 4 in your vector, e.g. x <- c(1,4,2,4,2): where exactly do you want to place the 3? After the first 4 or the second 4?
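As a minimal sketch of the cycle itself (the vectors i_vals and k_vals below are hypothetical stand-ins for wherever your successive i and k values come from):
insert <- function(x, k, i) append(x, k, match(i, x))

x      <- c(1, 4, 2)
i_vals <- c(4, 4)   # hypothetical: the i used at each step of the cycle
k_vals <- c(3, 5)   # hypothetical: the k inserted at each step of the cycle

for (step in seq_along(k_vals)) {
  x <- insert(x, k_vals[step], i_vals[step])
}
x
# [1] 1 4 5 3 2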

You can try this function:
insert_after <- function(x, i, k) {
  ind <- match(i, x)
  new_inds <- sort(c(seq_along(x), ind))
  new_x <- x[new_inds]
  new_x[duplicated(new_inds)] <- k
  new_x
}
x = c(1,4,2)
x <- insert_after(x, 4, 3)
x
#[1] 1 4 3 2
x <- insert_after(x, 4, 5)
x
#[1] 1 4 5 3 2
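Regarding the caveat above about repeated values: match() always picks the first occurrence of i. A sketch assuming you instead want to insert after the last occurrence (the function name is my own):
insert_after_last <- function(x, i, k) {
  ind <- tail(which(x == i), 1)  # position of the last occurrence of i
  append(x, k, ind)
}

x <- c(1, 4, 2, 4, 2)
insert_after_last(x, 4, 3)
# [1] 1 4 2 4 3 2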

Create a loop with customized numbers in R

I want to create an optimal script with 3 parameters in R. These parameters are: n = the length of the loop, x = any first number, y = any second number. For example, with x = 13, y = 10 and n = 2 the expected result is 13 10 13 10.
We can use rep() to repeat x and y n times.
x <- 13
y <- 10
n <- 2
rep(c(x, y), n)
#[1] 13 10 13 10
Using a for loop:
vector <- integer(2 * n)
for (i in seq_len(n)) {
  vector[c(2 * i - 1, 2 * i)] <- c(x, y)
}
vector
#[1] 13 10 13 10
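For reuse, a minimal sketch wrapping the rep() approach in a function of the three parameters (the name repeat_pair is my own):
repeat_pair <- function(n, x, y) rep(c(x, y), n)

repeat_pair(2, 13, 10)
# [1] 13 10 13 10
repeat_pair(3, 1, 99)
# [1]  1 99  1 99  1 99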

Randomise across columns for half a dataset

I have a data set for MMA bouts.
The structure currently is
Fighter 1, Fighter 2, Winner
x y x
x y x
x y x
x y x
x y x
My problem is that Fighter 1 always equals Winner, so my model will be trained to predict that Fighter 1 always wins.
I need to be able to randomly swap Fighter 1 and Fighter 2 for half the data set in order to have the winner represented equally.
Ideally I would have this:
Fighter 1, Fighter 2, Winner
x y x
y x x
x y y
y x x
x y y
Is there a way to randomise across columns without messing up the order of the rows?
I'm assuming your xs and ys are arbitrary and just placeholders. I'll further assume that you need the Winner column to stay the same; you just need the winner to not always be in the first column.
Sample data:
set.seed(42)
x <- data.frame(
  F1 = sample(letters, size = 5),
  F2 = sample(LETTERS, size = 5),
  stringsAsFactors = FALSE
)
x$W <- x$F1
x
# F1 F2 W
# 1 x N x
# 2 z S z
# 3 g D g
# 4 t P t
# 5 o W o
Choose some rows to change, randomly:
(ind <- sample(nrow(x), size = ceiling(nrow(x)/2)))
# [1] 3 5 4
This means that we expect rows 3-5 to change.
Now the random changes:
within(x, { tmp <- F1[ind]; F1[ind] = F2[ind]; F2[ind] = tmp; rm(tmp); })
# F1 F2 W
# 1 x N x
# 2 z S z
# 3 D g g
# 4 P t t
# 5 W o o
Rows 1-2 still show F1 as the Winner, and rows 3-5 show F2 as the Winner.
I also found that this code worked:
matches_clean[, c("fighter1", "fighter2")] <- lapply(matches_clean[, c("fighter1", "fighter2")], as.character)
changeInd <- !!((match(matches_clean$fighter1, levels(as.factor(matches_clean$fighter1))) -
match(matches_clean$fighter2, levels(as.factor(matches_clean$fighter2)))) %% 2)
matches_clean[changeInd, c("fighter1", "fighter2")] <- matches_clean[changeInd, c("fighter2", "fighter1")]
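For reuse, the sampling approach above could be wrapped into a small helper. This is only a sketch: the default column names, the optional seed and the function name are my own assumptions, and the Winner column is left untouched as above.
swap_half <- function(df, col1 = "fighter1", col2 = "fighter2", seed = NULL) {
  if (!is.null(seed)) set.seed(seed)                    # reproducible swaps if wanted
  ind <- sample(nrow(df), size = ceiling(nrow(df) / 2)) # half of the rows, chosen at random
  tmp <- df[[col1]][ind]
  df[[col1]][ind] <- df[[col2]][ind]
  df[[col2]][ind] <- tmp
  df
}

# matches_clean <- swap_half(matches_clean, seed = 42)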

For loops to create symmetric matrices

I want to reduce time and memory usage (I previously used outer for this, but it consumes more memory than I have) by reducing the iterations needed to create a symmetric matrix, that is, one where sol[i, j] is the same as sol[j, i].
My code so far:
# Prepare input
subss <- list(a = c(1, 2, 4), b = c(1, 2, 3), c = c(4, 5))
A <- matrix(runif(25), ncol = 5, nrow = 5)
# Pre-allocate memory
sol <- matrix(nrow = length(subss), ncol = length(subss),
              dimnames = list(names(subss), names(subss)))
x <- 0
for (i in seq_along(subss)) {
  # Omit for the subsets I already calculated?
  for (j in seq_along(subss)) {
    x <- x + 1
    message(x)
    # The function I use here might result in an NA
    sol[i, j] <- mean(A[subss[[i]], subss[[j]]])
    sol[j, i] <- sol[i, j] # Will overwrite when it shouldn't
  }
}
This will use 9 iterations; how can I avoid the redundant ones and do just 6 iterations?
I need to calculate the symmetric values, so this question doesn't apply. Also this other one doesn't work either because there might be many combinations and at some point it can't allocate the vector in memory.
A for loop will usually be slower than outer. Try byte-compiling the loop or implementing it in Rcpp (a byte-compilation sketch follows the sparse-matrix output below).
subss <- list(a = c(1, 2, 4), b = c(1, 2, 3), c = c(4, 5))
set.seed(42)
A <- matrix(runif(25), ncol = 5, nrow = 5)
#all combinations of indices
ij <- combn(seq_along(subss), 2)
#add all i = j
ij <- matrix(c(ij, rep(seq_along(subss), each = 2)), nrow = 2)
#preallocate
res <- numeric(ncol(ij))
#only one loop
for (k in seq_len(ncol(ij))) {
  message(k)
  res[k] <- mean(A[subss[[ij[1, k]]], subss[[ij[2, k]]]])
}
#1
#2
#3
#4
#5
#6
#create symmetric sparse matrix
library(Matrix)
sol <- sparseMatrix(i = ij[1, ], j = ij[2, ],
                    x = res, dims = rep(length(subss), 2),
                    symmetric = TRUE, index1 = TRUE)
#3 x 3 sparse Matrix of class "dsCMatrix"
#
#[1,] 0.7764715 0.6696987 0.7304413
#[2,] 0.6696987 0.6266553 0.6778936
#[3,] 0.7304413 0.6778936 0.5161089
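As a minimal sketch of the byte-compiling suggestion above (the helper name fill_res is my own; note that R >= 3.4 JIT-compiles functions by default, so the gain may be modest):
library(compiler)

# Wrap the single loop above in a function so it can be byte-compiled.
fill_res <- function(ij, subss, A) {
  res <- numeric(ncol(ij))
  for (k in seq_len(ncol(ij))) {
    res[k] <- mean(A[subss[[ij[1, k]]], subss[[ij[2, k]]]])
  }
  res
}

fill_res_cmp <- cmpfun(fill_res)  # byte-compiled version of the loop
res <- fill_res_cmp(ij, subss, A)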
I found a way with plain for loops:
x <- 0
for (i in seq_along(subss)) {
  for (j in seq_len(i)) { # or for (j in 1:i) as proposed below
    x <- x + 1
    message(x)
    sol[i, j] <- mean(A[subss[[i]], subss[[j]]])
    sol[j, i] <- sol[i, j]
  }
}
sol2 <- matrix(nrow = length(subss), ncol = length(subss)) # pre-allocate like sol
for (i in 1:length(subss)) {
  for (j in 1:i) {
    message(i, ' ', j, ' - ', mean(A[subss[[i]], subss[[j]]])) # Check iterations and value
    sol2[i, j] <- sol2[j, i] <- mean(A[subss[[i]], subss[[j]]])
  }
}
I checked your script's values and they aren't symmetric:
1 1 - 0.635455905252861
1 2 - 0.638608284398086
1 3 - 0.488700995299344
2 1 - 0.568414432255344
2 2 - 0.602851431118324
2 3 - 0.516099992596234
3 1 - 0.595461705311512
3 2 - 0.656920690399905
3 3 - 0.460815121419728
My values (same as @Llopis):
1 2 - 0.638608284398086
1 3 - 0.488700995299344
2 2 - 0.602851431118324
2 3 - 0.516099992596234
3 2 - 0.656920690399905
3 3 - 0.460815121419728

Computing the number of bits that are set to 1 for matching rows in terms of Hamming distance between two data frames

I have two data frames, df1 and df2, with the same number of columns (but not rows). For each row in df2, I was able to find the best (and second best) matching rows from df1 in terms of Hamming distance in my previous post. In that post, we used the following example data:
set.seed(0)
df1 <- as.data.frame(matrix(sample(1:10), ncol = 2)) ## 5 rows 2 cols
df2 <- as.data.frame(matrix(sample(1:6), ncol = 2)) ## 3 rows 2 cols
I now need to compute the number of bits equal to 1 for:
each row in df2
the best matching rows in df1
the second matching rows in df1
The number of bits equal to 1 of an integer a may be computed as
sum(as.integer(intToBits(a)))
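For example, for a = 13 (binary 1101, which has three 1 bits):
a <- 13L
sum(as.integer(intToBits(a)))
# [1] 3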
And I have applied this to @ZheyuanLi's original function, so I have got item 1>. However, I'm unable to apply the same logic to get items 2> and 3> by simple modification of @ZheyuanLi's function.
Below are the functions from @ZheyuanLi, with my modifications:
hmd <- function(x, y) {
  rawx <- intToBits(x)
  rawy <- intToBits(y)
  nx <- length(rawx)
  ny <- length(rawy)
  if (nx == ny) {
    ## quick return
    return (sum(as.logical(xor(rawx, rawy))))
  } else if (nx < ny) {
    ## pivoting
    tmp <- rawx; rawx <- rawy; rawy <- tmp
    tmp <- nx; nx <- ny; ny <- tmp
  }
  if (nx %% ny) stop("unconformable length!") else {
    nc <- nx / ny ## number of cycles
    return(unname(tapply(as.logical(xor(rawx, rawy)), rep(1:nc, each = ny), sum)))
  }
}
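A quick sanity check of hmd() on small integers (values chosen only for illustration):
hmd(5L, 3L)
# [1] 2        # 5 = 101 and 3 = 011 differ in two bits
hmd(c(1L, 3L, 8L), 1L)
# [1] 0 1 2    # distance of each element of the vector to the scalar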
foo <- function(df1, df2, p = 2) {
  ## check p
  if (p > nrow(df2)) p <- nrow(df2)
  ## transpose for CPU cache friendly code
  xt <- t(as.matrix(df1))
  yt <- t(as.matrix(df2))
  ## after transpose, we compute hamming distance column by column
  ## a for loop is decent; no performance gain from apply family
  n <- ncol(yt)
  id <- integer(n * p)
  d <- numeric(n * p)
  sb <- integer(n)
  k <- 1:p
  for (i in 1:n) {
    set.bits <- sum(as.integer(intToBits(yt[, i])))
    distance <- hmd(xt, yt[, i])
    minp <- order(distance)[1:p]
    id[k] <- minp
    d[k] <- distance[minp]
    sb[i] <- set.bits
    k <- k + p
  }
  ## recode "id", "d" and "sb" into data frame and return
  id <- as.data.frame(matrix(id, ncol = p, byrow = TRUE))
  colnames(id) <- paste0("min.", 1:p)
  d <- as.data.frame(matrix(d, ncol = p, byrow = TRUE))
  colnames(d) <- paste0("mindist.", 1:p)
  sb <- as.data.frame(matrix(sb, ncol = 1)) ## no need for byrow as you have only 1 column
  colnames(sb) <- "set.bits.1"
  list(id = id, d = d, sb = sb)
}
Running these gives:
> foo(df1, df2)
$id
min.1 min.2 ## row id for best/second best match in df1
1 1 4
2 2 3
3 5 2
$d
mindist.1 mindist.2 ## the 2 smallest Hamming distances
1 2 2
2 1 3
3 1 3
$sb
set.bits.1 ## number of bits equal to 1 for each row of df2
1 3
2 2
3 4
OK, after reading through while re-editing your question (many times!), I think I know what you want. Essentially we need to change nothing in hmd(). Your required items 1>, 2>, 3> can all be computed after the for loop in foo().
To get item 1>, which you called sb, we could use tapply(). However, your computation of sb inside the for loop is fine, so I will not change it. In the following, I will demonstrate the basic procedure to get items 2> and 3>.
The id vector inside foo() stores all matching rows in df1:
id <- c(1, 4, 2, 3, 5, 2)
so we can simply extract those rows of df1 (actually, columns of xt) to compute the number of bits equal to 1. As you can see, there is a lot of duplication in id, so we only need to compute on unique(id):
id0 <- sort(unique(id))
## [1] 1 2 3 4 5
We now extract those subset columns of xt:
sub_xt <- xt[, id0]
## [,1] [,2] [,3] [,4] [,5]
## V1 9 3 10 5 6
## V2 2 4 8 7 1
To compute the number of bits equal to 1 for each column of sub_xt, we again use tapply() and a vectorized approach.
rawbits <- as.integer(intToBits(as.numeric(sub_xt))) ## convert sub_xt to binary
sbxt0 <- unname(tapply(X = rawbits,
                       INDEX = rep(1:length(id0), each = length(rawbits) / length(id0)),
                       FUN = sum))
## [1] 3 3 3 5 3
Now we need to map sbxt0 back onto the full id vector to get sbxt:
sbxt <- sbxt0[match(id, id0)]
## [1] 3 5 3 3 3 3
Then we can convert sbxt to a data frame sb1:
sb1 <- as.data.frame(matrix(sbxt, ncol = p, byrow = TRUE))
colnames(sb1) <- paste(paste0("min.", 1:p), "set.bits.1", sep = ".")
## min.1.set.bits.1 min.2.set.bits.1
## 1 3 5
## 2 3 3
## 3 3 3
Finally we can assemble everything:
foo <- function(df1, df2, p = 2) {
  ## check p
  if (p > nrow(df2)) p <- nrow(df2)
  ## transpose for CPU cache friendly code
  xt <- t(as.matrix(df1))
  yt <- t(as.matrix(df2))
  ## after transpose, we compute hamming distance column by column
  ## a for loop is decent; no performance gain from apply family
  n <- ncol(yt)
  id <- integer(n * p)
  d <- numeric(n * p)
  sb2 <- integer(n)
  k <- 1:p
  for (i in 1:n) {
    set.bits <- sum(as.integer(intToBits(yt[, i])))
    distance <- hmd(xt, yt[, i])
    minp <- order(distance)[1:p]
    id[k] <- minp
    d[k] <- distance[minp]
    sb2[i] <- set.bits
    k <- k + p
  }
  ## compute "sb1"
  id0 <- sort(unique(id))
  sub_xt <- xt[, id0]
  rawbits <- as.integer(intToBits(as.numeric(sub_xt))) ## convert sub_xt to binary
  sbxt0 <- unname(tapply(X = rawbits,
                         INDEX = rep(1:length(id0), each = length(rawbits) / length(id0)),
                         FUN = sum))
  sbxt <- sbxt0[match(id, id0)]
  sb1 <- as.data.frame(matrix(sbxt, ncol = p, byrow = TRUE))
  colnames(sb1) <- paste(paste0("min.", 1:p), "set.bits.1", sep = ".")
  ## recode "id", "d" and "sb2" into data frame and return
  id <- as.data.frame(matrix(id, ncol = p, byrow = TRUE))
  colnames(id) <- paste0("min.", 1:p)
  d <- as.data.frame(matrix(d, ncol = p, byrow = TRUE))
  colnames(d) <- paste0("mindist.", 1:p)
  sb2 <- as.data.frame(matrix(sb2, ncol = 1)) ## no need for byrow as you have only 1 column
  colnames(sb2) <- "set.bits.1"
  list(id = id, d = d, sb1 = sb1, sb2 = sb2)
}
Now, running foo(df1, df2) gives:
> foo(df1,df2)
$id
min.1 min.2
1 1 4
2 2 3
3 5 2
$d
mindist.1 mindist.2
1 2 2
2 1 3
3 1 3
$sb1
min.1.set.bits.1 min.2.set.bits.1
1 3 5
2 3 3
3 3 3
$sb2
set.bits.1
1 3
2 2
3 4
Note that I have renamed the sb you used to sb2.

R - Looping through datasets and change column names

I'm trying to loop through a bunch of datasets and change columns in R.
I have a bunch of datasets, say a,b,c,etc, and all of them have three columns, say X, Y, Z.
I would like to change their names to be a_X, a_Y, a_Z for dataset a, and b_X, b_Y, b_Z for dataset b, and so on.
Here's my code:
name.list = c("a","b","c")
for (i in name.list) {
  names(i) = c(paste(i,"_X",sep = ""), paste(i,"_Y",sep = ""), paste(i,"_Z",sep = ""))
}
However, the code above doesn't work since i is in text format.
I've considered the assign function, but it doesn't seem to fit either.
I would appreciate any ideas.
Something like this (using Map so that each data frame is paired with its name):
list2env(Map(function(dat, nn) {
  colnames(dat) <- paste(nn, colnames(dat), sep = '_')
  dat
}, mget(name.list), name.list), .GlobalEnv)
for (i in name.list) {
  assign(i, setNames(get(i), paste(i, names(get(i)), sep = "_")))
}
> a
a_X a_Y a_Z
1 1 3 A
2 2 4 B
> b
b_X b_Y b_Z
1 1 3 A
2 2 4 B
> c
c_X c_Y c_Z
1 1 3 A
2 2 4 B
Here's some free data:
a <- data.frame(X = 1, Y = 2, Z = 3)
b <- data.frame(X = 4, Y = 5, Z = 6)
c <- data.frame(X = 7, Y = 8, Z = 9)
And here's a method that uses mget and a custom function foo
name.list <- c("a", "b", "c")
foo <- function(x, i) setNames(x, paste(name.list[i], names(x), sep = "_"))
list2env(Map(foo, mget(name.list), seq_along(name.list)), .GlobalEnv)
a
# a_X a_Y a_Z
# 1 1 2 3
b
# b_X b_Y b_Z
# 1 4 5 6
c
# c_X c_Y c_Z
# 1 7 8 9
You could also avoid get or mget by putting a, b, and c into their own environment (or even a list). You also wouldn't need the name.list vector if you go this route, because it's the same as ls(e).
e <- new.env()
e$a <- a; e$b <- b; e$c <- c
bar <- function(x, y) setNames(x, paste(y, names(x), sep = "_"))
list2env(Map(bar, mget(ls(e), envir = e), ls(e)), .GlobalEnv) # mget(ls(e)) keeps the data and the names in the same order
Another perk of doing it this way is that you still have the untouched data frames in the environment e. Nothing was overwritten (check a versus e$a).
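A quick check of that last point, assuming e was populated from the fresh free data above (before any renaming):
names(a)
# [1] "a_X" "a_Y" "a_Z"
names(e$a)
# [1] "X" "Y" "Z"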
