Generate a double-entry table with different data - r

I am using R/exams and I want to generate a double-entry (two-way) table, each time with a different set of values (I have prepared 5 different sets: datos1, datos2, datos3, datos4, datos5), but it does not work, neither with a counter nor with "sample".
How could I do it? Thanks
datos1 = c(0, 2, 4, 2, 4, 8, 13, 6, 12)
datos2 = c(11, 2, 4, 2, 4, 8, 3, 6, 12)
datos3 = c(12, 2, 14, 2, 4, 28, 3, 6, 12)
datos4 = c(13, 2, 4, 2, 4, 8, 3, 6, 12)
datos5 = c(1, 2, 4, 22, 4, 8, 3, 6, 12)
w9 <- sample(c(datos1, datos2, datos3, datos4, datos5), 1)
tabla = cbind(expand.grid(list(Y = c("3", "5", "6"),
                               X = c("6", "8", "9"))),
              count = w9)
ftable(xtabs(count ~ Y + X, tabla))

Thanks for the clarification, Amelia :)
c(datos1, datos2, datos3, datos4, datos5) combines them into one long vector, and sample then draws 1 single value from that vector. Instead you want to put them in a list and then sample from it. This keeps the sets separate, so sample picks one of the list elements at random. See below:
w9 <- sample(list(datos1, datos2, datos3, datos4, datos5), 1)
# to extract it from the list back into the same format as datos1, datos2, etc.
w9 <- unlist(w9)
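Equivalently, you can sample an index and extract with [[, which skips the unlist step (a minimal variant of the same idea, not from the original answer):
datos <- list(datos1, datos2, datos3, datos4, datos5)
w9 <- datos[[sample(length(datos), 1)]]  # picks one set at random, already a plain vector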
Edit: Covariance
Following the Example section of the Wikipedia article on covariance. I'm assuming yours is the same style as this?
# package for data exploration
# install.packages("data.table") # if not installed, run this line
library(data.table)
dt <- as.data.table(tabla)
# X & Y stored as factors, converting to numbers
dt[, X := as.integer(as.character(X))]
dt[, Y := as.integer(as.character(Y))]
# weighted means of X and Y
dt[, mux := weighted.mean(X, count)]
dt[, muy := weighted.mean(Y, count)]
# need probabilities of each combo
dt[, p := count/sum(count)]
# covariance
cv <- dt[, sum(p*(X-mux)*(Y-muy))]
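As a sanity check (my addition, not part of the original answer), expanding the table by its counts and taking the mean cross-product should reproduce cv, because p weights each (X, Y) combination by its relative frequency:
# expand the table: each (X, Y) pair repeated count times
x <- rep(dt$X, dt$count)
y <- rep(dt$Y, dt$count)
all.equal(cv, mean((x - mean(x)) * (y - mean(y))))  # should be TRUE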

Related

Finding values in a range in R and summing the number of values

I have a question. I have the following data:
c(1, 2, 4, 5, 1, 8, 9)
I set l = 2 and u = 6, and I want to find all the values greater than l and at most u.
How can I do this?
In base R, we can use comparison operators to create a logical vector and use it to subset the original vector:
x[x > 2 & x <= 6]
#[1] 3 5 6
Or use a for loop: initialize an empty vector, loop through the elements of 'x', and if a value is greater than 2 and at most 6, concatenate it to the vector:
v1 <- c()
for(i in x) {
  if(i > 2 & i <= 6) v1 <- c(v1, i)
}
v1
#[1] 3 5 6
data
x <- c(3, 5, 6, 8, 1, 2, 1)
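The title also asks for the number of values in the range; the same logical vector can be summed directly, since each TRUE counts as 1:
sum(x > 2 & x <= 6)
#[1] 3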

Is there an R function for returning sorted indexes of any values of a vector?

I'm not fluent in R data.table, and any help will be greatly appreciated in resolving the following problem!
I have a big data.table (~1,000,000 rows) with columns of numeric values, and I want to output a data.table of the same dimensions holding the sorted index position of each value within its row.
A short example:
Input:
dt = data.frame(ack = 1:7)
dt$A1 = c( 1, 6, 9, 10, 3, 5, NA)
dt$A2 = c( 25, 12, 30, 10, 50, 1, 30)
dt$A3 = c( 100, 63, 91, 110, 1, 4, 10)
dt$A4 = c( 51, 65, 2, 1, 0, 200, 1)
First row: 1 (1) <= 25 (2) <= 51 (3) <= 100 (4), so the row-wise sorted index positions for (1, 25, 100, 51) are (1, 2, 4, 3), and the output should start:
dt$PosA1 = c(1, ...
dt$PosA2 = c(2, ...
dt$PosA3 = c(4, ...
dt$PosA4 = c(3, ...
Third row: 2 (1) <= 9 (2) <= 30 (3) <= 91 (4), which must output:
dt$PosA1 = c( 1,1,2,...)
dt$PosA2 = c( 2,2,3,...)
dt$PosA3 = c( 4,3,4,...)
dt$PosA4 = c( 3,4,1,...)
The output is a data.table of the same dimensions as the input, filled row-wise with the sorted index positions:
dt$PosA1 = c( 1, 1, 2, 2, 3, 1, NA)
dt$PosA2 = c( 2, 2, 3, 3, 4, 2, 3)
dt$PosA3 = c( 4, 3, 4, 4, 2, 2, 2)
dt$PosA4 = c( 3, 4, 1, 1, 1, 4, 1)
I am thinking of something like this, perhaps:
library(data.table)
setDT(dt)
# pseudocode
dt[, `:=`(PosA1 = rowPosition(.SD, 1, na.rm=TRUE),
          PosA2 = rowPosition(.SD, 2, na.rm=TRUE),
          PosA3 = rowPosition(.SD, 3, na.rm=TRUE),
          PosA4 = rowPosition(.SD, 4, na.rm=TRUE)),
   .SDcols=c("A1", "A2", "A3", "A4")]
I'm not sure of the syntax, and I am missing a rowPosition function. Does any function exist to do that? (I named it rowPosition here.)
A little help writing an efficient one, or another approach to the problem, would be great!
Regards.
You can convert to long form and use rank. Or, since you're using data.table, frank:
library(data.table)
setDT(dt)
melt(dt, id="ack")[, f := frank(value, na.last="keep", ties.method="dense"), by=ack][,
dcast(.SD, ack ~ variable, value.var="f")]
   ack A1 A2 A3 A4
1:   1  1  2  4  3
2:   2  1  2  3  4
3:   3  2  3  4  1
4:   4  2  2  3  1
5:   5  3  4  2  1
6:   6  3  1  2  4
7:   7 NA  3  2  1
melt switches to long form, while dcast converts back to wide form.
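To see what frank ranks within, here is the head of the intermediate long form for the example dt above (derived from the data shown earlier):
head(melt(dt, id="ack"))
#    ack variable value
# 1:   1       A1     1
# 2:   2       A1     6
# 3:   3       A1     9
# 4:   4       A1    10
# 5:   5       A1     3
# 6:   6       A1     5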
Since you are looking for speed, you might want to consider using Rcpp. An Rcpp rank that takes care of NAs and ties can be found in nrussell's adapted version of René Richter's code.
nr <- 811e3
nc <- 16
DT <- as.data.table(matrix(sample(c(1:200, NA), nr*nc, replace=TRUE), nrow=nr))[,
ack := .I]
# assuming you have saved nrussell's code as rcpp/avg_rank.cpp
library(Rcpp)
system.time(sourceCpp("rcpp/avg_rank.cpp"))
# user system elapsed
# 0.00 0.13 6.21
nruss_rcpp <- function() {
  DT[, as.list(avg_rank(unlist(.SD))), by=ack]
}
data.table.frank <- function() {
  melt(DT, id="ack")[, f := frank(value, na.last="keep", ties.method="dense"), by=ack][,
    dcast(.SD, ack ~ variable, value.var="f")]
}
library(microbenchmark)
microbenchmark(nruss_rcpp(), data.table.frank(), times=3L)
timings:
Unit: seconds
               expr       min        lq     mean   median        uq       max neval cld
       nruss_rcpp()  10.33032  10.33251  10.3697  10.3347  10.38939  10.44408     3 a
 data.table.frank() 610.44869 612.82685 613.9362 615.2050 615.68001 616.15501     3 b
Edit: addressing comments
1) Set names for the rank columns while updating by reference:
DT[, (paste0("Rank", 1L:nc)) := as.list(avg_rank(unlist(.SD))), by=ack]
2) Keep NAs as they are.
Option A) Set them back to NA in R after getting the output from avg_rank:
for (j in 1:nc) {
  DT[is.na(get(paste0("V", j))), (paste0("Rank", j)) := NA_real_]
}
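The same reset can be done with data.table::set, which avoids the overhead of repeated [.data.table calls inside a loop (my variant, assuming the same V1..Vnc / Rank1..Ranknc column naming as above):
for (j in 1:nc) {
  set(DT, i = which(is.na(DT[[paste0("V", j)]])),
      j = paste0("Rank", j), value = NA_real_)
}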
Option B) Amend the avg_rank code in Rcpp as follows:
Rcpp::NumericVector avg_rank(Rcpp::NumericVector x)
{
    R_xlen_t sz = x.size();
    Rcpp::IntegerVector w = Rcpp::seq(0, sz - 1);
    std::sort(w.begin(), w.end(), Comparator(x));
    Rcpp::NumericVector r = Rcpp::no_init_vector(sz);
    for (R_xlen_t n, i = 0; i < sz; i += n) {
        n = 1;
        while (i + n < sz && x[w[i]] == x[w[i + n]]) ++n;
        for (R_xlen_t k = 0; k < n; k++) {
            if (Rcpp::traits::is_na<REALSXP>(x[w[i + k]])) {  // additional code
                r[w[i + k]] = NA_REAL;                        // additional code
            } else {
                r[w[i + k]] = i + (n + 1) / 2.;
            }
        }
    }
    return r;
}
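Once the amended file is recompiled with sourceCpp, the NA handling can be checked against base R's rank (this check is my addition; it assumes the amended avg_rank is meant to reproduce average-tie ranks with NAs kept):
library(Rcpp)
sourceCpp("rcpp/avg_rank.cpp")
x <- c(3, 1, 4, 1, NA)
avg_rank(x)                                         # amended Rcpp version
rank(x, na.last = "keep", ties.method = "average")  # base R reference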

Check rows from all columns that match a specific value

If I have a data.table:
d <- data.table("ID"    = c(1, 2, 2, 4, 6, 6),
                "TYPE"  = c(1, 1, 2, 2, 3, 3),
                "CLASS" = c(1, 2, 3, 4, 5, 6))
I know I can remove values greater than a specific value like this:
r <- d[!(d$TYPE > 2), ]
However, if I want to apply this to all of the columns in the table instead of just TYPE (basically drop any rows that have a value > 2 anywhere), how would I generalize the above statement (avoiding a for loop if possible)?
I know I can do d > 2, which gives a boolean index matrix, but if I put that into the above line of code it gives me an error:
d[!d>2, ]
results in an invalid matrix type error.
Note
It was brought up that this question is similar to Return an entire row if the value in any specific set of columns meets a certain criteria.
However, that question works with a data.frame while I am working with a data.table, and the notation is different, so it is not a duplicate.
Using apply with any:
d[!apply(d>2,1,any)]
   ID TYPE CLASS
1:  1    1     1
2:  2    1     2
Or using rowSums:
d[rowSums(d>2)==0,]
   ID TYPE CLASS
1:  1    1     1
2:  2    1     2
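If the ID column should be exempt from the check (as in the benchmark below), .SDcols can restrict which columns get tested; a minimal sketch:
d[d[, rowSums(.SD > 2) == 0, .SDcols = -"ID"]]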
I was wondering what the fastest approach would be for a varying number of rows and columns.
So, here is a benchmark.
It excludes the ID column from the check, which is not exactly in line with the OP's question but is a sensible decision, IMHO.
library(data.table)
library(bench)
bm <- press(
  n_row = c(1E1, 1E3, 1E5),
  n_col = c(2, 10, 50),
  {
    set.seed(1L)
    d <- data.table(
      ID = seq_len(n_row),
      matrix(sample(10, n_row*n_col, TRUE), ncol = n_col)
    )
    mark(
      m1 = d[d[, !apply(.SD > 2, 1, any), .SDcols = -"ID"]],
      m2 = d[!d[, apply(.SD > 2, 1, any), .SDcols = -"ID"]],
      m3 = d[!d[, which(apply(.SD > 2, 1, any)), .SDcols = -"ID"]],
      m4 = d[d[, rowSums(.SD > 2) == 0, .SDcols = -"ID"]],
      m5 = d[!d[, Reduce(any, lapply(.SD, `>`, y = 2)), by = 1:nrow(d), .SDcols = -"ID"]$V1]
    )
  }
)
ggplot2::autoplot(bm)
Apparently, the rowSums() approach is almost always the fastest method.

Variable sample upper value in R

I have the following matrix:
m <- matrix(c(2, 4, 3, 5, 1, 5, 7, 9, 3, 7), nrow=5, ncol=2)
colnames(m) = c("Y","Z")
m <- data.frame(m)
I am trying to create a random number in each row where the upper limit is a value taken from that row (in this case, each row's value of Z).
I currently have:
samp <- function(x){
  sample(0:1, 1, replace = TRUE)}
m$randoms <- apply(m, 1, samp)
which works well, applying the sample function independently to each row, but I always get an error when I try to make the range inside sample depend on the data. I thought I could do something like this:
samp <- function(x){
  sample(0:m$Z, 1, replace = TRUE)}
m$randoms <- apply(m, 1, samp)
but I guess that was wishful thinking.
Ultimately I want the result:
Y Z randoms
2 5 4
4 7 7
3 9 3
5 3 1
1 7 6
Any ideas?
The following will sample from 0 to m$Z for each row, and store the result in randoms:
m$randoms <- sapply(m$Z + 1, sample, 1) - 1
Explanation:
The sapply takes each value in m$Z separately (let's call it z) and calls sample(z + 1, 1) on it.
Note that sample(z + 1, 1) samples 1 random integer from the range 1:(z+1). Since you want a number from 0 to z rather than 1 to z + 1, we subtract 1 at the end.
Also, just pointing out: there is no need for replace=TRUE here, because you are only sampling one value anyway, so it does not matter whether it gets replaced or not.
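An equivalent formulation that avoids the off-by-one adjustment is to hand sample the vector 0:z directly (my own variant, not from the original answer; it assumes every Z is at least 1, because sample treats a single number as a population size):
m$randoms <- vapply(m$Z, function(z) sample(0:z, 1), integer(1))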
Based on @mathematical.coffee's suggestion and my edited example, this is the slick final result:
m <- matrix(c(2, 4, 3, 5, 1, 5, 7, 9, 3, 7), nrow=5, ncol=2)
colnames(m) = c("Y","Z")
m <- data.frame(m)
m$randoms <- sapply(m$Z + 1, sample, 1) - 1

Applying function to consecutive subvectors of equal size

I am looking for a nice and fast way of applying an arbitrary function that operates on vectors, such as sum, to consecutive subvectors of K elements each.
Here is one simple example, which should illustrate very clearly what I want:
v <- c(1, 2, 3, 4, 5, 6, 7, 8)
v2 <- myapply(v, sum, group_size=3) # v2 should be equal to c(6, 15, 15)
The function should process the given vector in groups of group_size elements, applying a function to each group (treating it as a vector in its own right). In this example, v2 is obtained as follows: (1 + 2 + 3) = 6, (4 + 5 + 6) = 15, (7 + 8) = 15. Here K does not divide N = length(v) exactly, so the last group has fewer than K elements.
If there is a nicer/faster solution that only works when N is a multiple of K, I would also appreciate it.
Try this:
library(zoo)
rollapply(v, 3, by = 3, sum, partial = TRUE, align = "left")
## [1] 6 15 15
or
apply(matrix(c(v, rep(NA, (3 - length(v) %% 3) %% 3)), 3), 2, sum, na.rm = TRUE)
## [1] 6 15 15
(the outer %% 3 prevents padding a whole extra group when the length is already a multiple of 3). Also, in the case of sum, the last one could be shortened to
colSums(matrix(c(v, rep(0, (3 - length(v) %% 3) %% 3)), 3))
As @Chase said in a comment, you can create your own grouping variable and then use that. Wrapping that process into a function looks like:
myapply <- function(v, fun, group_size = 1) {
  unname(tapply(v, (seq_along(v) - 1) %/% group_size, fun))
}
which gives your results
> myapply(v, sum, group_size=3)
[1] 6 15 15
Note this does not require the length of v to be a multiple of the group_size.
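The same grouping index also works with split (a base-R sketch of my own, not from the original answers), which makes the group structure explicit:
unname(sapply(split(v, (seq_along(v) - 1) %/% 3), sum))
## [1] 6 15 15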
You could try this as well. It works nicely even if you want overlapping intervals, as controlled by by, and, as a bonus, it returns the intervals over which each value is computed:
library(gtools)
v2 <- running(v, fun=sum, width=3, align="left", allow.fewer=TRUE, by=3)
v2
1:3 4:6 7:8
  6  15  15
