All,
Consider a simple problem:
set.seed(1) # if generating sample data, it's helpful to set a seed
idx <- rep(1:4,each=4)
c1 <- rnorm(16)
c2 <- rnorm(16)
tmp <- data.frame(idx,c1,c2)
for(i in 2:4){
rows <- which(idx==i)
tmp$delt[rows] <- (tmp$c2[min(rows)-1] - tmp$c1[min(rows)])/tmp$c2[min(rows)-1]
}
tmp
I would like to know if there is an efficient way to generate the delt column using an apply-class function. This example works well enough, but will likely get bogged down when implemented on a large data set.
Cheers
Here is a solution using ave
FUN <- function(i) {
i1 <- i[1]
if (i1 > 1) 1 - tmp$c1[i1] / tmp$c2[i1 - 1] else NA
}
tmp$delt <- ave(1:nrow(tmp), tmp$idx, FUN = FUN)
You can merge the table with itself.
Especially if the data is large, data.table will be quite fast.
# put your data into a data.table, keying by idx
library(data.table)
tmpDT <- data.table(idx, c1, c2, key = "idx")
# merge to itself and calculate, using tail() and head()
# by = .EACHI evaluates j once per row of the inner table (current data.table versions need it)
tmpDT[ tmpDT[, list(c2prev = tail(c2, 1)), by = list(idx = idx + 1L)]
, delt := (c2prev - head(c1, 1)) / c2prev, by = .EACHI ]
Here's a base method:
dal <- c(FALSE, as.logical(diff(idx)))
dal_s <- c(as.logical(diff(idx)), FALSE)
d <- data.frame(idx=2:4, delt=1-tmp$c1[dal]/tmp$c2[dal_s])
merge(tmp, d, all=TRUE)
Note that (x - y)/x = 1 - y/x. You could use the former expression above if necessary.
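As a rough cross-check of the answers above, here is a minimal loop-free sketch in base R. It assumes the rows are already sorted so that each idx group is contiguous; the helper names first, last and per_group are illustrative only and do not come from the original posts.

```r
set.seed(1)
idx <- rep(1:4, each = 4)
tmp <- data.frame(idx, c1 = rnorm(16), c2 = rnorm(16))

# Assumption: groups are contiguous and ordered by idx.
first <- !duplicated(tmp$idx)                   # first row of each group
last  <- !duplicated(tmp$idx, fromLast = TRUE)  # last row of each group

# One value per group: 1 - (first c1 of this group) / (last c2 of previous group).
# The first group has no predecessor, so it gets NA.
per_group <- c(NA, 1 - tmp$c1[first][-1] / head(tmp$c2[last], -1))
names(per_group) <- unique(tmp$idx)

# Spread the per-group value back over every row of the group.
tmp$delt <- unname(per_group[as.character(tmp$idx)])
```

Everything here is plain vector indexing, so no per-group function call is needed.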
I have to get the rows that two datasets have in common without using functions such as merge or packages like dplyr. Basically, I can only use for loops and if statements.
I've come up with this solution:
#since the two data frames are really big, I've reduced them using:
tab1 <- tab1[seq(800,1000),]
tab2 <- tab2[seq(800,1000),]
rname1 <- rownames(tab1)
rname2 <- rownames(tab2)
vecres <- c()
#since I need the results from only the first 3 columns of datasets:
for (i in rname1) {
a <- tab1[i,c(1,2,3)]
for (j in rname2) {
b <- tab2[j,c(1,2,3)]
cond <- a == b
singlecond <- all(cond)
if (singlecond) {vecres[i] <- c(a[i,c(1,2,3)])}
}
}
I don't know how to go on or where I'm making mistakes... please help!
You can try the code below
tab1[do.call(paste, tab1[1:3]) %in% do.call(paste, tab2[1:3]), ]
If you really want for loops, you can try
vecres <- c()
for (i in rname1) {
a <- tab1[i, c(1, 2, 3)]
for (j in rname2) {
b <- tab2[j, c(1, 2, 3)]
cond <- a == b
singlecond <- all(cond)
if (singlecond) {
vecres <- c(vecres, i)
}
}
}
tab1[vecres,]
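To make the paste/%in% one-liner easier to follow, here is a small sketch on made-up data (the toy tab1 and tab2 below are assumptions, not the asker's tables). Each row's first three columns are collapsed into one string key, and rows of tab1 are kept when their key also occurs in tab2. The separator "\r" is an arbitrary choice to make accidental key collisions less likely.

```r
# Hypothetical toy data standing in for the real tables.
tab1 <- data.frame(a = c(1, 2, 3), b = c("x", "y", "z"),
                   c = c(TRUE, FALSE, TRUE), extra = 1:3)
tab2 <- data.frame(a = c(3, 1, 9), b = c("z", "x", "q"),
                   c = c(TRUE, TRUE, FALSE), extra = 7:9)

# One string key per row, built from the first three columns only.
key1 <- do.call(paste, c(tab1[1:3], sep = "\r"))
key2 <- do.call(paste, c(tab2[1:3], sep = "\r"))

# Keep the rows of tab1 whose key appears anywhere in tab2.
common <- tab1[key1 %in% key2, ]
```

Note that this compares the printed form of the values, so it can behave differently from == on floating-point columns.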
I thought that the following problem must have been answered or a function must exist to do it, but I was unable to find an answer.
I have a nested loop that takes a row from one 3-col. data frame and copies it next to each of the other rows, to form a 6-col. data frame (with all possible combinations). This works fine, but with a medium sized data set (800 rows), the loops take forever to complete the task.
I will demonstrate on a sample data set:
Sdat <- data.frame(
x = c(10,20,30,40),
y = c(15,25,35,45),
ID =c(1,2,3,4)
)
compar <- data.frame(matrix(nrow=0, ncol=6)) # to contain all combinations
names(compar) <- c("x","y", "ID", "x","y", "ID")
N <- nrow(Sdat) # how many different points we have
for (i in 1:N)
{
for (j in 1:N)
{
Temp1 <- Sdat[i,] # data from 1st point
Temp2 <- Sdat[j,] # data from 2nd point
C <- cbind(Temp1, Temp2)
compar <- rbind(C,compar)
}
}
These loops provide exactly the output that I need for further analysis. Any suggestion for vectorizing this section?
You can do:
ind <- seq_len(nrow(Sdat))
grid <- expand.grid(ind, ind)
compar <- cbind(Sdat[grid[, 1], ], Sdat[grid[, 2], ])
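For reference, a self-contained sketch of the same idea on the sample data. The "_2" suffix on the second set of columns is an added assumption to keep the column names unique. The row order differs from the loop version, which prepends each new row.

```r
Sdat <- data.frame(x = c(10, 20, 30, 40),
                   y = c(15, 25, 35, 45),
                   ID = 1:4)

ind  <- seq_len(nrow(Sdat))
grid <- expand.grid(first = ind, second = ind)  # all index pairs; 'first' varies fastest

left  <- Sdat[grid$first, ]                     # data for the 1st point of each pair
right <- Sdat[grid$second, ]                    # data for the 2nd point of each pair
names(right) <- paste0(names(right), "_2")      # hypothetical suffix for unique names

compar <- cbind(left, right)
rownames(compar) <- NULL
```

With 800 rows this builds a 640,000-row data frame in one indexing step instead of 640,000 calls to rbind.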
A naive solution using rep (assuming you are happy with a data frame output):
compar <- data.frame(x = rep(Sdat$x, each = N),
y = rep(Sdat$y, each = N),
id = rep(Sdat$ID, each = N),
x1 = rep(Sdat$x, N),
y1 = rep(Sdat$y, N),
id_1 = rep(Sdat$ID, N))
EDIT: I found out that the Matrix package does everything I need. Super fast and flexible. Specifically, the related functions are
Data <- sparseMatrix(i=Data[,1], j=Data[,2], x=Data[,3])
or simply
Data <- Matrix(data=Data, sparse=TRUE)
Once you have your matrix in this Matrix class, everything should work smoothly like a regular matrix (for the most part, anyway).
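As a sketch of how the Matrix approach could produce the pairwise cosine similarities, the snippet below builds a sparse matrix from long-format triplets and uses tcrossprod for all row dot products. The data frame long and its column names are made up for illustration. The final step creates a dense n x n result, which for 19000 rows would need several gigabytes, so treat this as an outline, not a tuned solution.

```r
library(Matrix)

# Hypothetical long-format data: row index, column index, value.
long <- data.frame(row = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                   col = c(1, 2, 3, 1, 2, 4, 1, 4, 5),
                   val = c(1, 2, 2, 1, 1, 1, 1, 3, 1))

M <- sparseMatrix(i = long$row, j = long$col, x = long$val)

dots  <- tcrossprod(M)        # M %*% t(M): dot product of every pair of rows
norms <- sqrt(diag(dots))     # Euclidean length of each row

# Divide each dot product by the product of the two row lengths.
cos_sim <- as.matrix(dots) / outer(norms, norms)
```

If only the dot products are needed, dots can stay in its sparse form.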
======================================================
I have a dataset in "Long format" right now, meaning that it has 3 columns: row name, column name, and value. All of the "missing" row-column pairs are equal to zero.
I need to come up with an efficient way to calculate the cosine similarity (or even just the regular dot product) between all possible pairs of rows. The full data matrix is 19000 x 62000, which is why I need to work with the Long format instead.
I came up with the following method, but it's WAY too slow. Any tips on maximizing efficiency, or any suggestions of a better method overall, would be GREATLY appreciated. Thanks!
Data <- matrix(c(1,1,1,2,2,2,3,3,3,1,2,3,1,2,4,1,4,5,1,2,2,1,1,1,1,3,1),
ncol = 3, byrow = FALSE)
Data <- data.frame(Data)
cosine.sparse <- function(data) {
a <- Sys.time()
colnames(data) <- c('V1', 'V2', 'V3')
nvars <- length(unique(data[,2]))
nrows <- length(unique(data[,1]))
sim <- matrix(nrow=nrows, ncol=nrows)
for (i in 1:nrows) {
data.i <- data[data$V1==i,]
length.i.sq <- sum(data.i$V3^2)
for (j in i:nrows) {
data.j <- data[data$V1==j,]
length.j.sq <- sum(data.j$V3^2)
common.vars <- intersect(data.i$V2, data.j$V2)
row1 <- data.i[data.i$V2 %in% common.vars,3]
row2 <- data.j[data.j$V2 %in% common.vars,3]
cos.sim <- sum(row1*row2)/sqrt(length.i.sq*length.j.sq)
sim[i,j] <- sim[j,i] <- cos.sim
}
if (i %% 500 == 0) {cat(i, " rows have been calculated.")}
}
b <- Sys.time()
time.elapsed <- b - a
print(time.elapsed)
return(sim)
}
cosine.sparse(Data)
I have two datasets with 24k and 15k rows. I used nested for loops to overwrite some data... however, the operation takes forever to compute.
Does anyone have a suggestion for optimizing the code to speed up the process?
My code:
for(i in 1:length(data$kolicina)){
for(j in 1:length(df$kolicina)){
if(data$LIXcode[i] == df$LIXcode[j]){
data$kolicina[i] <- df$kolicina[j]
}
}
}
The full code with the input looks like this:
df <- data[grepl("Trennscheiben", data$a_naziv) & data$SestavKolicina > 1,]
for(i in 1:length(df$kolicina)){
df$kolicina[i] <- df$kolicina[i] / 10
}
for(i in 1:length(data$kolicina)){
for(j in 1:length(df$kolicina)){
if(data$LIXcode[i] == df$LIXcode[j]){
data$kolicina[i] <- df$kolicina[j]
}
}
}
The data:
LIXcode        a_naziv                  RacunCenaNaEM  kolicina
LIX2017396957  MINI HVLP Spritzpistole  20,16          1
LIX2017396957  MINI HVLP Spritzpistole  20,16          1
LIX2017396963  Trennscheiben Ø115 Ø12   12,53          30
LIX2017396963  Trennscheiben Ø115 Ø12   12,53          1
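I haven't tried this on my own machine, but this should work:
fun <- function(x, y) {
pos <- match(x$LIXcode, y$LIXcode)  # row of y holding each code of x (NA if absent)
hit <- !is.na(pos)                  # rows of x whose code occurs in y
# note: match() uses the first matching row of y, while the loop keeps the last one
x$kolicina[hit] <- y$kolicina[pos[hit]]
x                                   # return the updated copy
}
data <- fun(data, df)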
R can make all of these comparisons at once with vectorized operations, so no explicit loop is needed.
As far as I understand, the question concerns table "dt1" with key column "a" and any number of value columns and any number of observations. And then we have a "dt2" that has some sort of mapping - which means that column "a" has unique values and some column "b" has values that need to be written into "dt1" where columns "a" match.
I would suggest joining tables:
require(data.table)
dt1 <- data.table(a = sample(1:10, 1000, replace = T),
b = sample(letters, 1000, replace = T))
dt2 <- data.table(a = 1:10,
b = letters[1:10])
output <- merge(dt1, dt2, by = "a", all.x = T)
Also you can try:
dt1[, new_value := dt2$b[match(a, dt2$a)]]
Both of these solutions are vectorized, therefore almost instant.
Base solution (no data.table syntax, although I'd highly recommend you to learn it):
dt1$new_value <- dt2$b[match(dt1$a, dt2$a)]
And that's if I understood the question correctly...
Here's a working solution to accommodate for expected output:
dt1[a %in% dt2$a, b:=dt2$b[match(a, dt2$a)]]
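Applied to the shape of the data in the question, the base match approach might look like the sketch below. The lookup table is invented for illustration (it stands in for df after dividing by 10) and is assumed to hold each LIXcode at most once.

```r
# Rows modelled on the question's sample data.
data <- data.frame(
  LIXcode  = c("LIX2017396957", "LIX2017396957", "LIX2017396963", "LIX2017396963"),
  kolicina = c(1, 1, 30, 1)
)

# Hypothetical lookup table with one row per code.
lookup <- data.frame(LIXcode = "LIX2017396963", kolicina = 3)

pos <- match(data$LIXcode, lookup$LIXcode)  # position in lookup, NA when the code is absent
hit <- !is.na(pos)

# Overwrite only the rows that have a match; all other rows keep their value.
data$kolicina[hit] <- lookup$kolicina[pos[hit]]
```

If the lookup table can contain a code several times, it would need deduplicating first, since match only reports the first occurrence.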
What I am trying to do is generate all possible permutations of 1 and 0 given a particular sample size. For instance with a sample of n=8 I would like the m = 2^8 = 256 possible permutations, i.e:
I've written a function in R to do this, but after n=11 it takes a very long time to run. I would prefer a solution in R, but if it's in another programming language I can probably figure it out. Thanks!
PermBinary <- function(n){
n.perms <- 2^n
array <- matrix(0,nrow=n,ncol=n.perms)
# array <- big.matrix(n, n.perms, type='integer', init=-5)
for(i in 1:n){
div.length <- ncol(array)/(2^i)
div.num <- ncol(array)/div.length
end <- 0
while(end!=ncol(array)){
end <- end +1
start <- end + div.length
end <- start + div.length -1
array[i,start:end] <- 1
}
}
return(array)
}
expand.grid is probably the best vehicle to get what you want.
For example if you wanted a sample size of 3 we could do something like
expand.grid(0:1, 0:1, 0:1)
For a sample size of 4
expand.grid(0:1, 0:1, 0:1, 0:1)
So what we want to do is find a way to automate that call.
If we had a list of the inputs we want to give to expand.grid we could use do.call to construct the call for us. For example
vals <- 0:1
tmp <- list(vals, vals, vals)
do.call(expand.grid, tmp)
So now the challenge is to automatically make the "tmp" list above in a fashion that we can dictate how many copies of "vals" we want. There are lots of ways to do this but one way is to use replicate. Since we want a list we'll need to tell it to not simplify the result or else we will get a matrix/array as the result.
vals <- 0:1
tmp <- replicate(4, vals, simplify = FALSE)
do.call(expand.grid, tmp)
Alternatively we can use rep on a list input (which I believe is faster because it doesn't have as much overhead as replicate but I haven't tested it)
tmp <- rep(list(vals), 4)
do.call(expand.grid, tmp)
Now wrap that up into a function to get:
binarypermutations <- function(n, vals = 0:1){
tmp <- rep(list(vals), n)
do.call(expand.grid, tmp)
}
Then call it with the sample size, like so: binarypermutations(5).
This gives a data.frame of dimensions 2^n x n as a result - transpose and convert to a different data type if you'd like.
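If a plain numeric matrix is preferred over a data.frame, one possible alternative (not from the answer above) is to read off the binary digits of 0 to 2^n - 1 with integer arithmetic. The function name binary_matrix is made up for this sketch.

```r
binary_matrix <- function(n) {
  k <- 0:(2^n - 1)                       # one integer per combination
  # Column b + 1 holds bit b of k (least significant bit first).
  outer(k, 0:(n - 1), function(k, b) (k %/% 2^b) %% 2)
}

m <- binary_matrix(3)
```

The rows come out in the same order as expand.grid, with the first column varying fastest.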
The answer above may be better since it uses base - my first thought was to use data.table's CJ function:
library(data.table)
do.call(CJ, replicate(8, c(0, 1), FALSE))
It will be slightly faster (~15%) than expand.grid, so it will only be more valuable for extreme cases.