I'm trying to simulate a randomization process. I think I'll have to use a while loop, and I'm unfamiliar with how to best structure what I'm trying to accomplish in my R code.
Let's say I have 3 classes, a, b, and c, that I want to be distributed in a 3:2:1 ratio, respectively. A vector containing a minimally 'balanced' set of these classes in this ratio would look something like this:
class_1<-"a"
class_2<-"b"
class_3<-"c"
ratio_a<-3
ratio_b<-2
ratio_c<-1
min_set<-c(rep(class_1,ratio_a),rep(class_2,ratio_b),rep(class_3,ratio_c))
This minimum set would look something like this:
> min_set
[1] "a" "a" "a" "b" "b" "c"
Let's then say I want k copies of this minimally balanced set. I could create that like this:
block_1<-matrix(0,k,length(min_set))
for(i in 1:k)
block_1[i,]<-min_set
This would create a new matrix with k rows, each holding my min_set vector.
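As a side note, the same matrix can probably be built without a loop, because matrix() recycles its input. This is only a sketch: the value of k is a made-up example, and byrow = TRUE is what makes each row a copy of min_set.

```r
# Minimal sketch: build k stacked copies of min_set without a for loop.
min_set <- c(rep("a", 3), rep("b", 2), rep("c", 1))
k <- 4  # hypothetical number of copies, chosen only for illustration

# min_set (length 6) is recycled to fill k * 6 cells, row by row
block_1 <- matrix(min_set, nrow = k, ncol = length(min_set), byrow = TRUE)

# Class counts across the whole block keep the 3:2:1 ratio
class_counts <- table(block_1)
print(block_1)
print(class_counts)
```

Starting from a character matrix also avoids the silent numeric-to-character coercion that happens when rows of a matrix of zeros are overwritten with letters.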
Let's now say I want to sample n items from block_1 without replacement (a treatment allocation would be determined by the class (a, b, c) of the sample). This can be done as:
sample(as.vector(block_1),n,replace=F)
From here, I can enumerate all sampling outcome permutations of the min_set as (thanks to amonk):
myList <- permn(min_set)
all_out <- data.table(matrix(unlist(myList),byrow = T,ncol = 6))
all_out is a data.table with rows representing each permutation of the min_set. Here's where I'd like help.
Let's create a second block
#Create inactive urn
block_2<-vector('numeric',length=dim(block_1)[1]*dim(block_1)[2])
I would like to sample from block_1 until I have sampled one permutation of min_set (one of the rows from all_out). My code would look something like this (not currently working):
while (block[2]!='any row of all_out'){
for (i in 1:(dim(block_1)[1]*dim(block_1)[2]))
block_2[i]<-sample(as.vector(block_1),i,replace=F)
}
Once I have achieved the min_set in block_2, I'd like to return the min_set back to block_1 from block_2, keeping p-6 samples (i.e. those not part of the min_set) in block_2.
Repeat until a prespecified number of allocations are made.
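One possible reading of the procedure described above, sketched in base R. This is an interpretation, not a confirmed solution: it assumes "achieving the min_set" means that block_2 holds at least 3 a's, 2 b's and 1 c in any order, so no table of permutations is needed. The values of k and n_alloc and the names need, have and alloc are made up for illustration.

```r
# Sketch of a two-urn allocation loop (assumed interpretation, see lead-in).
set.seed(42)
min_set <- c("a", "a", "a", "b", "b", "c")
need <- c(a = 3L, b = 2L, c = 1L)  # composition of one min_set
k <- 2          # hypothetical: number of min_sets initially in the active urn
n_alloc <- 30   # hypothetical: prespecified number of allocations

block_1 <- rep(min_set, k)   # active urn
block_2 <- character(0)      # inactive urn
alloc <- character(n_alloc)  # allocation sequence

for (i in seq_len(n_alloc)) {
  # draw one item from the active urn and move it to the inactive urn
  j <- sample.int(length(block_1), 1)
  alloc[i] <- block_1[j]
  block_2 <- c(block_2, block_1[j])
  block_1 <- block_1[-j]

  # count classes currently held in the inactive urn
  have <- tabulate(match(block_2, names(need)), nbins = length(need))
  if (all(have >= need)) {
    # a full min_set is present: return it to the active urn,
    # leaving any surplus items in the inactive urn
    for (cl in names(need)) {
      drop <- which(block_2 == cl)[seq_len(need[[cl]])]
      block_2 <- block_2[-drop]
    }
    block_1 <- c(block_1, min_set)
  }
}
print(table(alloc))
```

A for loop over the number of allocations replaces the while loop here, since the stopping rule is a fixed count; the "return the min_set" step is just a check inside each iteration.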
So for a given set of characters:
>min_set
[1] "a" "a" "a" "b" "b" "c"
all the permutations are generated (respecting the proportions of characters per string):
library(combinat)
library(data.table)
myList <- permn(min_set)
myDT <- data.table(matrix(unlist(myList),byrow = T,ncol = 6))
> myDT
V1 V2 V3 V4 V5 V6
1: a a a b b c
2: a a a b c b
3: a a a c b b
4: a a c a b b
5: a c a a b b
---
716: a c a a b b
717: a a c a b b
718: a a a c b b
719: a a a b c b
720: a a a b b c
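Note that the 720 rows above include repeats (row 716 is the same as row 5, for example), because the three a's and two b's are treated as distinct items. The base-R sketch below does not use permn; it enumerates index orderings with expand.grid purely to illustrate the counts, and the names grid, idx, perms and distinct are illustrative.

```r
# Sketch: count all orderings of min_set versus the distinct ones.
min_set <- c("a", "a", "a", "b", "b", "c")
n <- length(min_set)

# every way to pick n positions, then keep only those with no repeated position
grid <- as.matrix(expand.grid(rep(list(seq_len(n)), n)))
idx <- grid[apply(grid, 1, function(r) !anyDuplicated(r)), , drop = FALSE]

# map positions back to class labels, one ordering per row
perms <- matrix(min_set[idx], ncol = n)
print(nrow(perms))     # 720 = 6!

# drop repeated rows
distinct <- unique(perms)
print(nrow(distinct))  # 60 = 6! / (3! * 2! * 1!)
```

With the data.table from the answer, unique(myDT) should reduce the table to the same 60 distinct rows.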
a <- rnorm(10)
b <- sample(a,18,replace = T)
For each element of a, I want to randomly assign a value from vector b, so that I will have a combination for all elements of vector "a". It will be something like:
combinations <- data.table(first=a ,second=sample(b,length(a)))
What I actually want is a little different from the data.table combinations. I want to get a set of combinations where none of the rows has repeated values.
Edit: combinations$first[i] and combinations$second[i] may be equal in the code above. What I want is to make it impossible to have a case where combinations$first[i] and combinations$second[i] are equal.
Note: I will do this for a large vector, so it needs to be fast.
You can sample by group as follows
library(data.table)
set.seed(0L)
a <- LETTERS[1L:10L]
output <- data.table(first=a)[, .(second=sample(setdiff(a, first), .N)), by=.(first)]
If random row ordering is needed, you can run output[sample(.N)].
output:
first second
1: A J
2: B D
3: C E
4: D G
5: E J
6: F B
7: G J
8: H J
9: I F
10: J F
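For comparison, the same idea (for each value, sample from everything except that value) can be sketched in base R. This is an illustration rather than a faster alternative, and it assumes character values as in the answer; with a numeric vector, sample() on a length-one candidate set would behave differently.

```r
# Sketch: pair each element with a randomly chosen different element.
set.seed(0)
a <- LETTERS[1:10]

# for each x, draw one value from the elements of a that are not x
second <- vapply(a, function(x) sample(setdiff(a, x), 1),
                 character(1), USE.NAMES = FALSE)

combinations <- data.frame(first = a, second = second)
print(combinations)
```

Because the candidate set excludes x itself, no row can have first equal to second.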
Three text files are in the same directory ("data001.txt", "data002.txt", "data003.txt"). I wrote a loop to read each data file and generate three data tables:
for(i in files) {
x <- read.delim(i, header = F, sep = "\t", na = "*")
setnames(x, 2, i)
assign(i,x)
}
So let's say each individual table looks something like this:
var1 var2 var3
row1 2 1 3
I've used rbind to combine all of the tables...
combined <- do.call(rbind, mget(ls(pattern="^data")))
and get something like this:
var1 var2 var3
row1 2 1 3
var1 var2 var3
row1 3 2 4
var1 var2 var3
row1 1 3 5
leaving me with superfluous column names. At the moment I can get around this by just deleting that specific row containing the column names, but it's a bit clunky.
colnames(combined) = combined[1, ] # make the first row the column names
combined <- combined[-1, ] # delete the now-unnecessary first row
toKeep <- seq(1, nrow(combined), 2) # rows to keep, i.e. every odd row (the data rows)
combined <- combined[ toKeep ,] # keep them, dropping the repeated header rows in between
This does give me what I want...
var1 var2 var3
row1 2 1 3
row1 3 2 4
row1 1 3 5
But I feel like a better way would simply be to extract the values of "row1" as a vector or as a list or whatever, and combine them all together into one data table. I feel like there is a quick and easy way to do this but I haven't been able to find anything yet. I've had a look here and here and here.
One possibility is to take the second row (that I want), and convert it into a matrix (then transpose it to make it a row instead of column!?) and rbind:
data001.txt <- as.matrix(data001.txt[2,])
data001.txt <- t(data001.txt)
combined <- rbind(data001.txt, data002.txt)
This gives me more or less what I want except without the column name headers (e.g. var1, var2, var3).
v1 v2 v3
2 1 3
3 2 4
Any ideas? Would this second method work well if there is some way to add the column names? I feel like it's less clunky than the first method. Thanks for any input :)
edit - solved in answer below.
Figured it out. It requires converting to a data matrix and using setnames from the data.table package. Say you have a range of text data files like the one that follows, and you want to extract just the seventh column of each file (the one with the numbers, not letters; it becomes the sixth column once the first column is used for row names), and combine them together in their own data table including the row names:
chemical1 a b c d e 1 g h i j k l m
chemical2 a b c d e 2 g h i j k l m
chemical3 a b c d e 3 g h i j k l m
chemical4 a b c d e 4 g h i j k l m
chemical5 a b c d e 5 g h i j k l m
Set row.names = 1 and header = F when reading:
library(data.table) # for setnames()
setwd("directory")
files <- list.files(pattern = "data") # take all files with 'data' in their name
for(i in files) {
x <- read.delim(i, row.names = 1, header = F, sep = "\t", na = "*")
setnames(x, 6, i) # the data you want is in column six of x. Sets the data file name as the column name.
x <- as.matrix(x[6]) # just take the sixth column with the numeric data (drop everything else)
x <- t(x) # transpose (if you want..)
assign(i,x)
}
combined <- do.call(rbind, mget(ls(pattern="^data"))) # combine the data matrices into one table
write.table(combined, file="filename.csv", sep=",", row.names=T, col.names = NA)
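A variant of the same approach, sketched in base R, collects the columns in a list with lapply() instead of creating one object per file with assign() and mget(). The temporary directory, the file contents and the names tmp, rows and combined are invented here so that the example runs on its own.

```r
# Sketch: read several files, keep one numeric column from each, bind them.
# Assumed setup: three small tab-separated files in a temporary directory.
tmp <- tempfile("demo")
dir.create(tmp)
for (k in 1:3) {
  lines <- sprintf("chemical%d\ta\tb\tc\td\te\t%d\tg", 1:5, 1:5 * k)
  writeLines(lines, file.path(tmp, sprintf("data%03d.txt", k)))
}

files <- list.files(tmp, pattern = "^data", full.names = TRUE)

rows <- lapply(files, function(f) {
  x <- read.delim(f, row.names = 1, header = FALSE, sep = "\t", na.strings = "*")
  # seventh column of the file is the sixth column of x (first one became row names)
  setNames(x[[6]], rownames(x))
})

# one row per file, one column per chemical
combined <- do.call(rbind, rows)
rownames(combined) <- basename(files)
print(combined)
```

Keeping the per-file results in a list means nothing depends on what else happens to be in the workspace, which the ls(pattern = "^data") lookup does.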
I want to count the number of specific repetitions in my dataframe. Here is a reproducible example
df <- data.frame(Date= c('5/5', '5/5', '5/5', '5/6', '5/7'),
First = c('a','b','c','a','c'),
Second = c('A','B','C','D','A'),
Third = c('q','w','e','w','q'),
Fourth = c('z','x','c','v','z'))
This gives:
Date First Second Third Fourth
1 5/5 a A q z
2 5/5 b B w x
3 5/5 c C e c
4 5/6 a D w v
5 5/7 c A q z
I read a big file that holds 400,000 instances and I want to know different statistics about specific attributes. For example, here I'd like to know how many times a happens on 5/5. I tried using sum(df$Date == '5/5' & df$First == 'a', na.rm=TRUE), which gave me the right result here (1), but when I run it on the big data set, the numbers are not accurate.
Any idea why?
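One way to cross-check such a count is to tabulate both columns at once and read off the cell of interest. The sketch below uses the example data frame. The second half shows one possible reason for mismatches on real data, stray whitespace around values, which is only a guess since the big file is not shown; the vector messy is invented for illustration.

```r
# Sketch: cross-check a conditional count with a two-way table.
df <- data.frame(Date  = c("5/5", "5/5", "5/5", "5/6", "5/7"),
                 First = c("a", "b", "c", "a", "c"))

n_direct <- sum(df$Date == "5/5" & df$First == "a", na.rm = TRUE)

counts <- table(df$Date, df$First)  # rows: Date, columns: First
n_table <- counts["5/5", "a"]
print(counts)

# Hypothetical messy values: exact comparison misses padded strings
messy <- c("5/5", " 5/5", "5/5 ")
n_exact   <- sum(messy == "5/5")
n_trimmed <- sum(trimws(messy) == "5/5")
print(c(exact = n_exact, trimmed = n_trimmed))
```

Looking at unique(df$Date) and unique(df$First) on the full data set would show whether variants such as padded or differently cased values are present.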
I am trying to use a data.table within a function, and I am trying to understand why my code is failing. I have a data.table as follows:
DT <- data.table(my_name=c("A","B","C","D","E","F"),my_id=c(2,2,3,3,4,4))
> DT
my_name my_id
1: A 2
2: B 2
3: C 3
4: D 3
5: E 4
6: F 4
I am trying to create all pairs of "my_name" with different values of "my_id", which for DT would be:
Var1 Var2
A C
A D
A E
A F
B C
B D
B E
B F
C E
C F
D E
D F
I have a function to return all pairs of "my_name" for a given pair of values of "my_id" which works as expected.
get_pairs <- function(id1,id2,tdt) {
return(expand.grid(tdt[my_id==id1,my_name],tdt[my_id==id2,my_name]))
}
> get_pairs(2,3,DT)
Var1 Var2
1 A C
2 B C
3 A D
4 B D
Now, I want to execute this function for all pairs of ids, which I try to do by finding all pairs of ids and then using mapply with the get_pairs function.
> combn(unique(DT$my_id),2)
[,1] [,2] [,3]
[1,] 2 2 3
[2,] 3 4 4
tid1 <- combn(unique(DT$my_id),2)[1,]
tid2 <- combn(unique(DT$my_id),2)[2,]
mapply(get_pairs, tid1, tid2, DT)
Error in expand.grid(tdt[my_id == id1, my_name], tdt[my_id == id2, my_name]) :
object 'my_id' not found
Again, if I try to do the same thing without an mapply, it works.
get_pairs(tid1[1],tid2[1],DT)
Var1 Var2
1 A C
2 B C
3 A D
4 B D
Why does this function fail only when used within an mapply? I think this has something to do with the scope of data.table names, but I'm not sure.
Alternatively, is there a different/more efficient way to accomplish this task? I have a large data.table with a third id "sample" and I need to get all of these pairs for each sample (e.g. operating on DT[sample=="sample_id",] ). I am new to the data.table package, and I may not be using it in the most efficient way.
The function debugonce() is extremely useful in these scenarios.
debugonce(mapply)
mapply(get_pairs, tid1, tid2, DT)
# Hit enter twice
# from within BROWSER
debugonce(FUN)
# Hit enter twice
# you'll be inside your function, and then type DT
DT
# [1] "A" "B" "C" "D" "E" "F"
Q # (to quit debugging mode)
which is wrong. Basically, mapply() takes the first element of each input argument and passes it to your function, then the second, and so on. In this case you've provided a data.table, which is also a list. So, instead of passing the entire data.table, it's passing each element of the list (its columns).
So, you can get around this by doing:
mapply(get_pairs, tid1, tid2, list(DT))
But mapply() simplifies the result by default, and therefore you'd get a matrix back. You'll have to use SIMPLIFY = FALSE.
mapply(get_pairs, tid1, tid2, list(DT), SIMPLIFY = FALSE)
Or simply use Map:
Map(get_pairs, tid1, tid2, list(DT))
Use rbindlist() to bind the results.
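The same behaviour can be reproduced with a plain data.frame, which is also a list of columns. The sketch below is a base-R analogue, not the data.table code from the question: get_pairs_df is a hypothetical stand-in for get_pairs, and do.call(rbind, ...) stands in for rbindlist().

```r
# Sketch: wrap the table in list() so Map() passes it whole on every call.
df <- data.frame(my_name = c("A", "B", "C", "D", "E", "F"),
                 my_id   = c(2, 2, 3, 3, 4, 4))

# hypothetical base-R stand-in for get_pairs()
get_pairs_df <- function(id1, id2, d) {
  expand.grid(d$my_name[d$my_id == id1],
              d$my_name[d$my_id == id2],
              stringsAsFactors = FALSE)
}

ids <- combn(unique(df$my_id), 2)  # one column per pair of ids

# list(df) has length 1, so it is recycled and each call receives the full data.frame
res <- Map(get_pairs_df, ids[1, ], ids[2, ], list(df))
pairs <- do.call(rbind, res)
print(pairs)
```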
HTH
Enumerate all possible pairs
u_name <- unique(DT$my_name)
all_pairs <- CJ(u_name,u_name)[V1 < V2]
Enumerate observed pairs
obs_pairs <- unique(
DT[,{un <- unique(my_name); CJ(un,un)[V1 < V2]}, by=my_id][, !"my_id"]
)
Take the difference
all_pairs[!J(obs_pairs)]
CJ is like expand.grid except that it creates a data.table with all of its columns as its key. A data.table X must be keyed for a join X[J(Y)] or a not-join X[!J(Y)] (like the last line) to work. The J is optional, but makes it more obvious that we're doing a join.
Simplifications. @CathG pointed out that there is a cleaner way of constructing obs_pairs if you always have two sorted "names" for each "id" (as in the example data): use as.list(un) in place of CJ(un,un)[V1 < V2].
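For small inputs, the target pairs can also be obtained without any per-id loop by comparing every id with every other id. This base-R sketch is an alternative illustration, not the join-based method above; it builds an n-by-n logical matrix, so it is unlikely to suit a large table. The names lower, idx and pairs are illustrative.

```r
# Sketch: all pairs of names whose ids differ, each pair listed once.
df <- data.frame(my_name = c("A", "B", "C", "D", "E", "F"),
                 my_id   = c(2, 2, 3, 3, 4, 4))

# lower[i, j] is TRUE when row i has a smaller id than row j
lower <- outer(df$my_id, df$my_id, "<")
idx <- which(lower, arr.ind = TRUE)  # matrix with columns "row" and "col"

pairs <- data.frame(Var1 = df$my_name[idx[, "row"]],
                    Var2 = df$my_name[idx[, "col"]])
pairs <- pairs[order(pairs$Var1, pairs$Var2), ]
print(pairs)
```

Using a strict "<" on the ids keeps each pair once and drops pairs that share an id.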
"Why does this function fail only when used within an mapply? I think this has something to do with the scope of data.table names, but I'm not sure."
The reason the function is failing has nothing to do with scoping in this case. mapply vectorizes the function: it takes each element of each parameter in turn and passes it to the function. In your case, the elements of the data.table are its columns, so mapply is passing the column my_name instead of the complete data.table.
If you want to pass the complete data.table to mapply, you should use the MoreArgs parameter. Then your function will work:
res <- mapply(get_pairs, tid1, tid2, MoreArgs = list(tdt=DT), SIMPLIFY = FALSE)
do.call("rbind", res)
Var1 Var2
1 A C
2 B C
3 A D
4 B D
5 A E
6 B E
7 A F
8 B F
9 C E
10 D E
11 C F
12 D F
I'm processing a data.frame of products called "all" whose first variable all$V1 is a product family. There are several rows per product family, i.e. length(levels(all$V1)) < length(all$V1).
I want to traverse the data.frame and process by product family "p". I'm new to R, so I haven't fully grasped when I can do something vectorially, or when to loop. At the moment, I can traverse and get subsets by:
for (i in levels (all$V1)){
p = all[which(all[,'V1'] == i), ];
calculateStuff(p);
}
Is this the way to do this, or is there a groovy vectorial way of doing this with apply or something? There are only a few thousand rows, so the performance gain is probably negligible, but I'd like to develop good habits for larger data sets.
Data:
all = data.frame(V1=c("a","b","c","d","d","c","c","b","a","a","a","d"))
'all' can be split by V1:
> ll = split(all, all$V1)
> ll
$a
V1
1 a
9 a
10 a
11 a
$b
V1
2 b
8 b
$c
V1
3 c
6 c
7 c
$d
V1
4 d
5 d
12 d
sapply can be used to analyze each component of list 'll'. The following finds the number of rows in each component (each of which represents a product family):
calculateStuff <- function(p){
nrow(p)
}
> sapply(ll, calculateStuff)
a b c d
4 2 3 3
There is unlikely to be much of a performance gain with the below, but it is at least more compact and returns the results of calculateStuff as a convenient list:
lapply(levels(all$V1), function(i) calculateStuff(all[all$V1 == i, ]) )
As @SimonG points out in his comment, depending on exactly what your calculateStuff function is, aggregate may also be useful to you if you want your results in the form of a data frame.
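To make the aggregate suggestion concrete, here is a small sketch. The price column is invented for illustration (the original data has only V1), and summing it per family is just a stand-in for whatever calculateStuff really does.

```r
# Sketch: per-family results via split() + vapply(), and via aggregate().
all <- data.frame(V1 = c("a", "b", "c", "d", "d", "c", "c", "b", "a", "a", "a", "d"),
                  price = seq_len(12))  # hypothetical numeric column

calculateStuff <- function(p) nrow(p)

# named integer vector, one entry per product family
counts <- vapply(split(all, all$V1), calculateStuff, integer(1))
print(counts)

# data frame with one row per product family
totals <- aggregate(price ~ V1, data = all, FUN = sum)
print(totals)
```

vapply() is used instead of sapply() so that the type of the result is checked; split() works here whether V1 is a factor or a character column.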