In R, I would like to know how I can find the index/indices of the value(s) sampled, for example when using the function sample.
In Matlab, it appears this is quite easily done by requesting the output argument idx from the function datasample. Explicitly, taken from Matlab's documentation page for datasample:
[y,idx] = datasample(data,k,...) returns an index vector indicating
which values datasample sampled from data.
I would like to know if such a thing can be accomplished in R, and how.
Example:
set.seed(12)
sample(c(0.3,78,45,0.8,0.3,0.8,77), size=1, replace=TRUE)
0.3
How can I tell which of the two 0.3's was sampled?
We can create a named vector, using the positions as names, and then sample from it. The names of the result are the sampled indices:
v1 <- c(LETTERS[1:10], LETTERS[1])
names(v1) <- seq_along(v1)
v2 <- sample(v1, 20, replace=TRUE)
as.integer(names(v2))
#[1] 10 11 4 2 1 4 6 9 1 1 2 9 2 2 2 3 4 7 3 6
Using the OP's data
v1 <- c(0.3,78,45,0.8,0.3,0.8,77)
names(v1) <- seq_along(v1)
set.seed(12)
sample(v1, size=1, replace=TRUE)
# 1
#0.3
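An alternative to the named vector, shown here only as a minimal sketch: sample the positions rather than the values, then subset. This is not taken from the answer above; the names x, idx and y are illustrative, and the approach assumes that drawing from seq_along(x) with equal probabilities is what you want.

```r
# Sample positions instead of values, so the index is known directly.
x <- c(0.3, 78, 45, 0.8, 0.3, 0.8, 77)

set.seed(12)
idx <- sample(seq_along(x), size = 1, replace = TRUE)  # sampled position(s)
y <- x[idx]                                            # corresponding value(s)

idx
y
```

Here idx plays the role of Matlab's second output argument, and x[idx] is the sample itself.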
I'd like to use a uniform distribution to randomly assign the value 1 or 2 in five groups (i.e. generate five sets of random uniform draws), with each group containing 10 samples.
I tried:
for(i in 1:5){
rf <- runif(10)
result[rf<=0.5]=1
result[rf>0.5]=2
}
However this will replace the previously assigned values when the loop goes on.
The code produces only 10 results:
1 2 1 2 2 1 1 1 2 1
But I want a total of 50 randomized values:
1 2 1 2 ...... 2 1 1
How can I do this? Thank you.
Since you are drawing random numbers from the same distribution every time, it is simpler to generate all 50 numbers at once and assign the values with the ifelse function.
Try this:
a <- ifelse(runif(50) <= 0.5, 1, 2)
dim(a) <- c(10,5) # optional: reshape the result into a 10 x 5 matrix
To add to Gregor Thomas' advice, sample... You can also convert the stream into a matrix of 5 columns (groups) of 10.
nums <- sample(1:2, 50, replace = TRUE)
groups <- matrix(nums, ncol = 5)
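If you do want to keep the loop, the overwriting can be avoided by preallocating the result and writing each group into its own column. The following is only a sketch of that idea; n_groups and n_per_group are illustrative names that do not appear in the question.

```r
# Keep the loop, but store each group in its own column.
set.seed(1)
n_groups <- 5
n_per_group <- 10

# Preallocate a 10 x 5 matrix so no iteration overwrites another.
result <- matrix(NA_integer_, nrow = n_per_group, ncol = n_groups)

for (i in seq_len(n_groups)) {
  rf <- runif(n_per_group)
  result[, i] <- ifelse(rf <= 0.5, 1L, 2L)  # fill column i only
}

dim(result)
```

as.vector(result) flattens the matrix back into a single vector of 50 values if that is the shape you need.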
I want to create a subset using another subset as a condition. I can't show my actual data, but I can show an example that deals with the core of my problem.
For example, I have 10 subjects with 10 observations each. A simple data frame resembling my data can be created like this:
ID <- rep(1:10, each = 10)
x <- rnorm(100)
y <- rnorm(100)
df <- data.frame(ID,x,y)
Which creates:
ID x y
1 1 0.08146318 0.26682668
2 1 -0.18236757 -1.01868755
3 1 -0.96322876 0.09565239
4 1 -0.64841436 0.09202456
5 1 -1.15244873 -0.38668929
6 1 0.28748521 -0.80816416
7 1 -0.64243912 0.69403155
8 1 0.84882350 -1.48618271
9 1 -1.56619331 -1.30379070
10 1 -0.29069417 1.47436411
11 2 -0.77974847 1.25704185
12 2 -1.54139896 1.25146126
13 2 -0.76082748 0.22607239
14 2 -0.07839719 1.94448322
15 2 -1.53020374 -2.08779769
etc.
Some of these subjects were positive for an event (for example subjects 3, 5 and 7), so I have created a subset for them using:
event_pos <- subset(df, ID %in% c("3","5","7"))
Now, I also want to create a subset for the subjects who were negative for an event. I could use something like this:
event_neg <- subset(df, ID %in% c("1","2","4","6","8","9","10"))
The problem is, my data set is too large to specify all the individuals of the negative group. Is there a way to use my subset event_pos to get all the subjects with negative events in one subset?
TL;DR
Can I get a subset_2 by removing the subset_1 from the data frame?
You can use:
ind_list <- c("3","5","7")
event_neg <- subset(df, (ID %in% ind_list) == FALSE)
or
event_neg <- subset(df, !(ID %in% ind_list))
Hope that helps.
Gottaviannoni
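Since the question asks about reusing event_pos itself, the same negation can be driven by the IDs already present in that subset, so the positive IDs are only typed once. A minimal sketch, assuming ID identifies subjects uniquely; the data are regenerated here just to make the example self-contained.

```r
# Build the negative subset from the IDs found in the positive subset.
set.seed(1)
df <- data.frame(ID = rep(1:10, each = 10), x = rnorm(100), y = rnorm(100))

event_pos <- subset(df, ID %in% c(3, 5, 7))
event_neg <- subset(df, !(ID %in% event_pos$ID))  # everyone not in event_pos

unique(event_neg$ID)
```

Together the two subsets partition df, so nrow(event_pos) + nrow(event_neg) equals nrow(df).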
From ?dplyr::bind_cols:
This is an efficient implementation of the common pattern of do.call(rbind, dfs) or do.call(cbind, dfs) for binding many data frames into one
However, with example data:
tmp_df1 <- data.frame(a = 1)
tmp_df2 <- data.frame(b = c(-2, 2))
tmp_df3 <- data.frame(c = runif(10))
The command do.call(cbind, list(tmp_df1, tmp_df2, tmp_df3)) produces:
a b c
1 1 -2 0.8473307
2 1 2 0.8031552
3 1 -2 0.3057430
4 1 2 0.6344999
5 1 -2 0.7870753
6 1 2 0.9453199
7 1 -2 0.6642231
8 1 2 0.9708049
9 1 -2 0.7189576
10 1 2 0.9217087
That is, rows of tmp_df1 and tmp_df2 are recycled to match the number of rows in tmp_df3.
In dplyr:
> bind_cols(tmp_df1, tmp_df2, tmp_df3)
Error in eval(substitute(expr), envir, enclos) :
incompatible number of rows (2, expecting 1)
The reason why I want to do something like this is because I am in a situation similar to below:
df_normal_param <- data.frame(mu = rnorm(10), sigma = runif(10))
df_normal_sample_list <- lapply(1:10, function(i)
with(df_normal_param,
data.frame(sam = rnorm(100, mu[i], sigma[i]))))
and I wish to attach the arguments used to create each entry of df_normal_sample_list to the outputs, e.g.
df_normal_sample_list <- lapply(1:10, function(i)
cbind(df_normal_param[i,], df_normal_sample_list[[i]]))
You argue in a comment that this behavior is safe; I strongly disagree. It seems safe for this very particular case, but it is likely to cause you problems somewhere down the road. That is why I believe the answer to your stated question ("Is there a way to get dplyr's bind_cols to expand number of rows like in cbind?") is a simple no, and they probably built it that way intentionally.
Instead, I would suggest that you be more explicit in your approach and add the columns you want right as you build the data. For example, you could include that step in your call (here using apply to clarify what is going where):
df <- data.frame(mu = rnorm(3), sigma = runif(3))
df_normal_sample_list <- apply(df, 1, function(x){
data.frame(
mu = x["mu"]
, sigma = x["sigma"]
, sam = rnorm(3, x["mu"], x["sigma"])
)
})
Returns
[[1]]
mu sigma sam
1 -0.6982395 0.1690402 -0.592286
2 -0.6982395 0.1690402 -0.516948
3 -0.6982395 0.1690402 -0.804366
[[2]]
mu sigma sam
1 -1.698747 0.2597186 -1.830950
2 -1.698747 0.2597186 -2.087393
3 -1.698747 0.2597186 -1.961376
[[3]]
mu sigma sam
1 0.9913492 0.3069877 0.9629801
2 0.9913492 0.3069877 1.2279697
3 0.9913492 0.3069877 1.1222780
Then, instead of binding the columns and then the rows, you can just bind the rows at the end (also with dplyr):
bind_rows(df_normal_sample_list)
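For comparison, the same "attach the parameters while building each piece" idea can be sketched in base R with Map and do.call(rbind, ...). This is a swapped-in variant rather than the code above; params, samples and combined are illustrative names, and 4 draws per parameter pair is an arbitrary choice.

```r
# One data frame per (mu, sigma) pair, with the parameters stored alongside.
set.seed(42)
params <- data.frame(mu = rnorm(3), sigma = runif(3))

samples <- Map(function(mu, sigma) {
  # mu and sigma have length 1, so data.frame() repeats them for every draw
  data.frame(mu = mu, sigma = sigma, sam = rnorm(4, mu, sigma))
}, params$mu, params$sigma)

# Stack the pieces row-wise into a single data frame.
combined <- do.call(rbind, samples)
dim(combined)
```

The only recycling that happens here is of length-one values inside a single data.frame() call, which keeps the expansion explicit and local.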
I've got a simple question that's stumping me. I'm trying to use a loop to count how many values of a vector fall in a bin (0,.01), (.01,.02), etc. For example (the loop does not work):
set.seed(12345)
x<- rnorm(100, 0, .05)
vec <- rep(NA, 11)
for(i in .01:.11){
vec[i] <- sum(x> i & x < (i +.01))
}
I would like this to ultimately produce a vector of the count between each break, such that the output for the above is:
5,9,10...
I think this may have something to do with the indexing/decimals. Thanks for any and all help.
Your example contains negative numbers, so I assume you only want to do this for the positive values. You should use cut to divide your vector into the given bins by setting the breaks parameter. Then, using table, you can compute the frequencies of the x values falling within each interval.
## filter x
x <- x[x>=0.01] ## EDIT here : was x <- abs(x)
res <- table(cut(x,breaks=seq(round(min(x),2),round(max(x),2),0.01)))
## prettier output coerce to data.frame
as.data.frame(res)
# Var1 Freq
# 1 (0.01,0.02] 5
# 2 (0.02,0.03] 9
# 3 (0.03,0.04] 10
# 4 (0.04,0.05] 10
# 5 (0.05,0.06] 4
# 6 (0.06,0.07] 0
# 7 (0.07,0.08] 5
# 8 (0.08,0.09] 2
# 9 (0.09,0.1] 5
# 10 (0.1,0.11] 4
# 11 (0.11,0.12] 1
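As for why the original loop fails: .01:.11 steps by 1, so it contains only the single value 0.01, and vec[0.01] is treated as index 0, which assigns nothing. If you still want a loop, one way is to iterate over integer positions and look up the bin edges. The sketch below assumes right-closed bins (lower, lower + 0.01], matching the default of cut; lower is an illustrative name.

```r
# Loop version: iterate over positions, not over the decimal bin edges.
set.seed(12345)
x <- rnorm(100, 0, 0.05)

lower <- seq(0.01, 0.11, by = 0.01)  # lower edge of each bin
vec <- integer(length(lower))

for (i in seq_along(lower)) {
  # count values in the half-open interval (lower[i], lower[i] + 0.01]
  vec[i] <- sum(x > lower[i] & x <= lower[i] + 0.01)
}

vec
```

The cut/table approach remains the more idiomatic choice; the loop is shown only to make the indexing problem visible.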
I would like to use a loop to create 15 crosstables between one data.frame (var1), which consists of 15 variables, and another variable (var2); see the data, which can be downloaded here.
The code is now able to give results, but I would like to know how I can rename the variable "mytable" so that I get mytable1, mytable2, etc.
Code:
library(vcd) # for Cramer's V
var1 <- read.csv("~/example.csv", dec=",")
var2 <- sample(1:43)
i <- 1
while(i <= ncol(var1)) {
mytable[[i]] <- table(var2,var1[,i])
assocstats(mytable[[i]])
print(mytable[[i]])
i <- i + 1
}
As suggested in the comments, using names like mytable1, mytable2, etc. for a list of objects is actively discouraged when using R. Collecting all in a list is more useful and cleaner.
One way to do what you want would be this:
library(vcd) # for Cramer's V
data(mtcars)
var1 <- mtcars[ , c(2, 8:11)] ##OP's CSV no longer available
var2 <- sample(1:5, 32, TRUE)
mytable <- myassoc <- list() ##store output in a list
##a `for` loop looks simpler than `while`
for(i in 1:ncol(var1)){
mytable[[i]] <- table(var2, var1[ , i])
myassoc[[i]] <- assocstats(mytable[[i]])
}
So now to access "mytable2" and "myassoc2" you would simply do:
> mytable[[2]]
var2 0 1
1 4 2
2 6 6
3 1 1
4 2 3
5 5 2
> myassoc[[2]]
X^2 df P(> X^2)
Likelihood Ratio 1.7079 4 0.78928
Pearson 1.6786 4 0.79460
Phi-Coefficient : NA
Contingency Coeff.: 0.223
Cramer's V : 0.229
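If the goal of mytable1, mytable2, etc. is mainly to have readable labels, a named list gives the same convenience. The sketch below uses lapply over the columns, so each table inherits its column name; it is a variant of the loop above rather than the original code, and it leaves out the vcd step to stay dependency-free.

```r
# Named list of crosstables: one element per column of var1.
var1 <- mtcars[, c("cyl", "vs", "am", "gear", "carb")]  # same columns as 2, 8:11

set.seed(1)
var2 <- sample(1:5, nrow(var1), replace = TRUE)

# lapply over a data frame iterates over its columns and keeps their names.
mytable <- lapply(var1, function(col) table(var2, col))

names(mytable)
mytable[["vs"]]  # access a table by variable name instead of by number
```

The same pattern would apply to the association statistics, e.g. lapply(mytable, assocstats) once vcd is loaded.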