I have a task to do in R. I need to make 10,000 samples of a vector of 12 elements, each of them between 1 and 7. I did this using:
dataSet <- t(replicate(10000, sample(1:7, 12, replace = TRUE)))
Now I need to count the rows of this dataSet that contain all the values from 1:7.
How can I do that, and is there a better way to represent the data than this?
One way would be as follows (you need to use set.seed() in order to make this reproducible):
indx <- 1:7
# for each row, check whether every value in 1:7 appears at least once
sum(apply(dataSet, 1, function(x) all(indx %in% x)))
## 2336
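For comparison, here is a sketch of an equivalent check (not part of the original answer): since every sampled value already lies in 1:7, a row contains all seven values exactly when it has seven distinct values.
# same count as above: rows of dataSet with 7 distinct values
sum(apply(dataSet, 1, function(x) length(unique(x)) == 7))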
I am new to R and trying to carry out a fairly simple task. I have a dataset composed of 20 obs of 19 variables, and I want to generate three non-overlapping groups of 5 obs. I am using the slice_sample function from the dplyr package, but how do I repeat the sampling while excluding the obs already picked in the first round?
library( "dplyr")
set.seed(123)
NF_1 <- slice_sample(NF, n = 5)
You can use the sample function from base R.
All you have to do is sample the rows with replace = FALSE, which means you won't have any overlap. You can also define the number of samples.
n_groups <- 3
observations_per_group <- 5
size <- n_groups * observations_per_group
selected_samples <- sample(seq_len(nrow(NF)), size = size, replace = FALSE)
# Now index those selected rows
NF_1 <- NF[selected_samples, ]
Now, according to your comment, if you want to generate N dataframes, each with a number of samples, and also label them accordingly, you can use lapply (a function that "applies" another function to a set of values). The "l" in "lapply" means that it returns a list. There are other types of apply functions; you can read more about them (and I highly recommend that you do!) in R's help pages, e.g. ?lapply.
This code should solve your problem, or at least give you a good idea of where to go.
n_groups <- 3
observations_per_group <- 5
size <- observations_per_group * n_groups
# First we'll get the row samples.
selected_samples <- sample(
  seq_len(nrow(NF)),
  size = size,
  replace = FALSE
)
# Now we split them between the number of groups
split_samples <- split(
  selected_samples,
  rep(1:n_groups, observations_per_group)
)
# For each group (1 to n_groups) we'll define a dataframe with samples
# and store them sequentially in a list.
my_dataframes <- lapply(1:n_groups, function(x) {
  # our subset df will be the original df with the vector of samples
  # for the group at position "x" (1, 2, 3, ..., n_groups);
  # note the double brackets to extract the vector from the list
  subset_df <- NF[split_samples[[x]], ]
  return(subset_df)
})
# now, if you need to access the results, you can simply do:
first_df <- my_dataframes[[1]] # use double brackets to access list elements
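As a small follow-up (a suggestion, not part of the original answer), you can also name the list elements so each group can be retrieved by a label rather than a position:
# hypothetical labels; adjust to whatever naming scheme you need
names(my_dataframes) <- paste0("group_", seq_len(n_groups))
second_df <- my_dataframes[["group_2"]]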
Hi, I'm trying to create 10 sub-training sets (from a training set of 75%) in a loop, extracting randomly from a dataframe (DB). I'm using
smp_size <- floor((0.75 * nrow(DB)) / 10)
train_ind <- sample(seq_len(nrow(DB)), size = smp_size)
training <- matrix(ncol = ncol(DB), nrow = smp_size)
for (i in 1:10){
  training[i] <- DB[train_ind, ]
}
What's wrong?
To partition your dataset into 10 equally sized subsets, you may use the following:
# Randomly order the rows in your training set:
DB <- DB[order(runif(nrow(DB))), ]
# You will create a sequence 1,2,..,10,1,2,...,10,1,2.. you will use to subset
inds <- rep(1:10, nrow(DB)/10)
# split() will store the subsets (created by inds) in a list
subsets <- split(DB, inds)
Note, however, that split() will only give you equally sized subsets. Therefore, it might (and probably will) happen that some of the observations are not included in any of the subsets.
If you wish to use all observations, causing some subsets to be larger than others, use inds <- rep(1:10, length.out = nrow(DB)) instead.
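Spelled out, that variant looks like this (a minimal sketch, assuming DB is the data frame from the question):
# 1, 2, ..., 10 recycled until every row of DB has a group label
inds <- rep(1:10, length.out = nrow(DB))
subsets <- split(DB, inds)
# some subsets will be one row larger than others when nrow(DB) %% 10 != 0
sapply(subsets, nrow)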
I am trying to force some list objects (e.g. 4 tables of frequency counts) into a matrix by doing rbind. However, they have uneven columns (i.e. some range from 2 to 5, while others range from 1 to 5). What I want is a display such that if a table does not begin with a column of 1, it shows NA in that row of the resulting rbind matrix. I tried the approach below, but the values repeat themselves in the row rather than showing NAs where a value does not exist.
I considered rbind.fill, but it requires the tables to be data frames. I could write some loops, but in the spirit of R, I wonder if there is another approach I could use?
# Example
a <- sample(0:5, 100, replace = TRUE)
b <- sample(2:5, 100, replace = TRUE)
c <- sample(1:4, 100, replace = TRUE)
d <- sample(1:3, 100, replace = TRUE)
list <- list(a, b, c, d)
table(list[4])
count(list[1])
matrix <- matrix(ncol = 5)
lapply(list, table)
do.call("rbind", lapply(list, table))
When I have a similar problem, I include all the values I want in the vector and then subtract one from the result:
table(c(1:5, a)) - 1
This could be made into a function:
table2 <- function(x, values, ...){
  table(c(x, values), ...) - 1
}
Of course, this will give zeros rather than NAs.
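For example, applied to one of the vectors from the question (a sketch using the table2() helper defined above), the value 1 never occurs in b, so it shows up with a count of 0 instead of being dropped:
set.seed(1)  # reproducibility only
b <- sample(2:5, 100, replace = TRUE)
table2(b, values = 1:5)
# the "1" cell has count 0; convert zeros to NA afterwards if you need NAs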
I have 2 relatively large data frames in R. I'm attempting to merge / find all combos, as efficiently as possible. The resulting df turns out to be huge (its length is dim(myDF1)[1] * dim(myDF2)[1]), so I'm attempting to implement a solution using ff. I'm also open to using other solutions, such as the bigmemory package, to work around these memory issues. I have virtually no experience with either of these packages.
Working example - assume I'm working with some data frame that looks similar to USArrests:
library('ff')
library('ffbase')
myNames <- USArrests
myNames$States <- rownames(myNames)
rownames(myNames) <- NULL
Now, I will fabricate 2 data frames, which represent some particular sets of observations from myNames. I'm going to try to reference them by their rownames later.
myDF1 <- as.ffdf(as.data.frame(matrix(as.integer(rownames(myNames))[floor(runif(3*1e5, 1, 50))], ncol = 3)))
myDF2 <- as.ffdf(as.data.frame(matrix(as.integer(rownames(myNames))[floor(runif(2*1e5, 1, 50))], ncol = 2)))
# unique combos:
myDF1 <- unique(myDF1)
myDF2 <- unique(myDF2)
For example, my first set of states in myDF1 is myNames[unlist(myDF1[1, ]), ]. Then I will find all combos of myDF1 and myDF2 using ikey:
# create keys:
myDF1$key <- ikey(myDF1)
myDF2$key <- ikey(myDF2)
startTime <- Sys.time()
# Create some huge vectors:
myVector1 <- ffrep.int(myDF1$key, dim(myDF2)[1])
myVector2 <- ffrep.int(myDF2$key, dim(myDF1)[1])
# This takes about 25 seconds on my machine:
print(Sys.time() - startTime)
# Sort one DF (to later combine with the other):
myVector2 <- ffsorted(myVector2)
# Sorting takes an additional 2.5 minutes:
print(Sys.time() - startTime)
1) Is there a faster way to sort this?
# finally, find all combinations:
myDF <- as.ffdf(myVector1, myVector2)
# Very fast:
print(Sys.time() - startTime)
2) Is there an alternative to this type of combination (without using RAM)?
Finally, I'd like to be able to reference any of the original data by row / column. Specifically, I'd like to get different types of rowSums. For example:
# Here are the row numbers (from myNames) for the top 6 sets of States:
this <- cbind(myDF1[myDF[1:6,1], -4], myDF2[myDF[1:6,2], -3])
this
# Then, the original data for the first set of States is:
myNames[unlist(this[1,]),]
# Suppose I want to get the sum of the Urban Population for every row, such as the first:
sum(myNames[unlist(this[1,]),]$UrbanPop)
3) Ultimately, I'd like a vector with the above rowSum, so I can perform some type of subset on myDF. Any advice on how to most efficiently accomplish this?
Thanks!
It's pretty much unclear to me what you intend to do with the rowSum and your 3) element, but if you want an efficient and RAM-friendly way to get all combinations of 2 ff vectors, you can use expand.ffgrid from ffbase.
The following will generate your ffdf with dimensions of 160 million rows x 2 columns in a few seconds.
require(ffbase)
x <- expand.ffgrid(myDF1$key, myDF2$key)
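For intuition (not part of the original answer): expand.ffgrid is the ff-backed counterpart of base R's expand.grid, which builds every combination of its inputs in RAM; the ff version writes the result to disk instead.
# in-RAM analogue on toy keys, just to illustrate what "all combinations" means
expand.grid(key1 = 1:3, key2 = 1:2)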
Here is my problem. I have a dataset with 200k rows.
Each row corresponds to a test conducted on a subject.
Subjects have unequal number of tests.
Each test is dated.
I want to assign an index to each test. E.g. the first test of subject 1 would be 1, the second test of subject 1 would be 2, the first test of subject 2 would be 1, etc.
My strategy is to get a list of unique Subject IDs, use lapply to subset the dataset into a list of dataframes using the unique Subject IDs, with each Subject having his/her own dataframe with the tests conducted. Ideally I would then be able to sort each dataframe of each subject and assign an index for each test.
However, doing this over a 200k x 32 dataframe made my laptop (i5, Sandy Bridge, 4GB ram) run out of memory quite quickly.
I have 2 questions:
Is there a better way to do this?
If there is not, my only option to overcome the memory limit is to break my unique SubjectID list into smaller sets, like 1000 SubjectIDs per list, lapply it through the dataset, and at the end of everything join the lists together. In that case, how do I create a function to break my SubjectID list by supplying, say, an integer that denotes the number of partitions? E.g. BreakPartition(Dataset, 5) would break the dataset into 5 equal partitions.
Here is code to generate some dummy data:
UniqueSubjectID <- sapply(1:500, function(i) paste(letters[sample(1:26, 5, replace = TRUE)], collapse =""))
UniqueSubjectID <- subset(UniqueSubjectID, !duplicated(UniqueSubjectID))
Dataset <- data.frame(SubID = sample(sapply(1:500, function(i) paste(letters[sample(1:26, 5, replace = TRUE)], collapse ="")),5000, replace = TRUE))
Dates <- sample(c(dates = format(seq(ISOdate(2010,1,1), by='day', length=365), format='%d.%m.%Y')), 5000, replace = TRUE)
Dataset <- cbind(Dataset, Dates)
I would guess that the splitting/lapply is what is using up the memory. You should consider a more vectorized approach. Starting with a slightly modified version of your example code:
n <- 200000
UniqueSubjectID <- replicate(500, paste(letters[sample(26, 5, replace=TRUE)], collapse =""))
UniqueSubjectID <- unique(UniqueSubjectID)
Dataset <- data.frame(SubID = sample(UniqueSubjectID , n, replace = TRUE))
Dataset$Dates <- sample(c(dates = format(seq(ISOdate(2010,1,1), by='day', length=365), format='%d.%m.%Y')), n, replace = TRUE)
And assuming that what you want is an index counting the tests by date order by subject, you could do the following.
# order by subject, then by test date
Dataset <- Dataset[order(Dataset$SubID, Dataset$Dates), ]
# run lengths give the number of tests per subject (rows are now grouped)
ids.rle <- rle(as.character(Dataset$SubID))
# build 1:n within each run and flatten into a single index column
Dataset$SubIndex <- unlist(sapply(ids.rle$lengths, function(n) 1:n))
Now the 'SubIndex' column in 'Dataset' contains a by-subject numbered index of the tests. This takes a very small amount of memory and runs in a few seconds on my 4GB Core 2 Duo laptop.
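Equivalently (a sketch, not from the original answer), base R's sequence() builds 1:n for each run length and concatenates the results, so the sapply/unlist step can be written in one call:
# same result as unlist(sapply(ids.rle$lengths, function(n) 1:n))
Dataset$SubIndex <- sequence(ids.rle$lengths)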
This sounds like a job for the plyr package. I would add the index in this way:
require(plyr)
system.time(new_dat <- ddply(Dataset, .(SubID), function(dum) {
  dum = dum[order(dum$SubID, dum$Dates), ]
  mutate(dum, index = 1:nrow(dum))
}))
}))
This splits the dataset up into chunks per SubID, and adds an index. The new object has all the SubID grouped together, and sorted in time. Your example took about 2 seconds on my machine, and used almost no memory. I'm not sure how ddply scales to your data size and characteristics, but you could try. I this does not work fast enough, definitely take a look at the data.table package. A blog post of mine which compares (among others) ddply and data.table could serve as some inspiration.