what is this function doing? replication [closed] - r

rep_sample_n <- function(tbl, size, replace = FALSE, reps = 1)
{
  rep_tbl = replicate(reps, tbl[sample(1:nrow(tbl), size, replace = replace), ],
                      simplify = FALSE) %>%
    bind_rows() %>%
    mutate(replicate = rep(1:reps, each = size)) %>%
    select(replicate, everything()) %>%
    group_by(replicate)
  return(rep_tbl)
}
Hey, can anyone help me there? What is this function doing? Is the first line setting the variables of the function? And then what is this "replicate" doing? Thanks!

This function replicates your data by repeated random sampling. Let's say we have a dataset of 10 observations. To come up with additional datasets that resemble your current one, you can draw random samples from it.
You can check out the Wikipedia page on statistical replication if you're curious.
Let's take a simple data frame:
df <- data.frame(x = 1:10, y = 1:10)
df
x y
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
7 7 7
8 8 8
9 9 9
10 10 10
If we want to take a random sample of this, we can use the function rep_sample_n, which takes 2 required arguments, tbl and size, and 2 optional arguments, replace = FALSE and reps = 1.
Here is an example of taking 4 randomly selected rows from our data.
rep_sample_n(df, 4)
# A tibble: 4 x 3
# Groups: replicate [1]
replicate x y
<int> <int> <int>
1 1 1 1
2 1 3 3
3 1 4 4
4 1 10 10
Now if we try to randomly sample 15 observations from a 10-observation dataset, it will throw an error. With replace = FALSE, each time a row is chosen it is removed from the pool for the next draw, so at most 10 rows can be drawn. In the example above, it chose the 1st observation, then went to choose a second row (because we asked for 4) with only observations 2 through 10 left, and chose the 3rd, then the 4th, and then the 10th. If we set replace = TRUE, it will choose each observation from the full dataset every time.
Notice how in this example the 5th observation was chosen twice. That wouldn't happen with replace = FALSE.
rep_sample_n(df, 4, replace = TRUE)
# A tibble: 4 x 3
# Groups: replicate [1]
replicate x y
<int> <int> <int>
1 1 5 5
2 1 3 3
3 1 2 2
4 1 5 5
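To see the difference between the two settings in isolation, here is a minimal base R sketch that uses sample() directly, which is the call rep_sample_n relies on. The variable names (pool, too_many, with_repl) are made up for illustration.

```r
pool <- 1:10

# Without replacement, asking for more draws than the pool holds is an error.
# tryCatch() turns the error into its message so the script keeps running.
too_many <- tryCatch(
  sample(pool, 15),
  error = function(e) conditionMessage(e)
)

# With replacement, every draw comes from the full pool, so 15 draws are fine
# and repeated values are expected.
with_repl <- sample(pool, 15, replace = TRUE)

print(too_many)
print(with_repl)
```

Since 15 draws come from only 10 distinct values, at least one value has to repeat when replace = TRUE.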
Lastly and most importantly, we have the reps argument, which is really the basis for this function. It allows you to randomly sample your dataset multiple times and then combine all those samples together.
Below, we sample our original dataset of 10 observations by selecting 4 of them, and we repeat that 5 times. This gives 5 different samples of 4 observations each, combined into one 20-observation data frame, with each of the 5 samples tagged with a replicate number. The replicate column shows which 4 observations go with which sample.
rep_sample_n(df, 4, reps = 5)
# A tibble: 20 x 3
# Groups: replicate [5]
replicate x y
<int> <int> <int>
1 1 8 8
2 1 4 4
3 1 3 3
4 1 1 1
5 2 4 4
6 2 5 5
7 2 8 8
8 2 3 3
9 3 6 6
10 3 1 1
11 3 3 3
12 3 2 2
13 4 5 5
14 4 7 7
15 4 10 10
16 4 3 3
17 5 7 7
18 5 10 10
19 5 3 3
20 5 9 9
I hope this provided some clarity

This function takes a data frame as input (and several input preferences). It takes a random sample of size rows from the table, with or without replacement as set by the replace input. It repeats that random sampling reps times.
Then, it binds all the samples together into a single data frame, adding a new column called "replicate" indicating which repetition of the sampling produced each row.
Finally, it "groups" the resulting table, preparing it for future group-wise operations with dplyr.
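The three steps described above can be sketched in base R, one step per statement, without the pipe. This is only an illustration of the same idea, not the original function; the names samples and combined are made up here, and the final grouping step is left out because it is specific to dplyr.

```r
df <- data.frame(x = 1:10, y = 1:10)
reps <- 3
size <- 4

# Step 1: replicate() evaluates the sampling expression `reps` times.
# simplify = FALSE keeps the results as a list of data frames.
samples <- replicate(reps, df[sample(nrow(df), size), ], simplify = FALSE)

# Step 2: stack the list of samples into a single data frame.
combined <- do.call(rbind, samples)

# Step 3: label each row with the repetition that produced it.
combined$replicate <- rep(1:reps, each = size)

print(combined)
```

Printing samples before the rbind step shows that replicate() simply returns a list with one sampled data frame per repetition.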
For general questions about specific functions, like "What is this "replicate" doing?", you should look at the function's help page: type ?replicate or help("replicate") to get there. It includes a description of the function and examples of how to use it. If you read the description, run the examples, and are still confused, feel free to come back with a specific question and example illustrating what you are confused by.
Similarly, for "Is the first line setting the variables of the function?", the arguments to function() are the inputs to the function. If you have basic questions about R like "How do functions work", have a look at An Introduction to R, or one of the other sources in the R Tag Wiki.

Related

Create new column with shared ID to randomly link two rows in R

I am using R and working with this sample dataframe.
library(tibble)
library(stats)
set.seed(111)
conditions <- factor(c("1","2","3"))
df_sim <-
  tibble::tibble(StudentID = 1:10,
                 Condition = sample(conditions,
                                    size = 10,
                                    replace = TRUE),
                 XP = stats::rpois(n = 10,
                                   lambda = 15))
This creates the following tibble.
StudentID Condition XP
        1         2  8
        2         3 11
        3         3 16
        4         3 12
        5         1 22
        6         3 16
        7         1 18
        8         3  8
        9         2 14
       10         1 17
I am trying to create a new column in my dataframe called DyadID. The purpose of this column is to create a variable that is uniquely shared by two students in the dataframe. In other words, two students (e.g. Student 1 and Student 9) would share the same value (e.g. 4) in the DyadID column.
However, I only want observations linked together if they share the same Condition value. Condition contains three unique values (1, 2, 3). I want condition 1 observations linked with other condition 1 observations, 2 with 2, and 3 with 3.
Importantly, I'd like the students to be linked together randomly.
Ideally, I would like to stay within the tidyverse as that is what I am most familiar with. However, if that's not possible or ideal, any solution would be appreciated.
Here is a possible outcome I am hoping to achieve.
StudentID Condition XP DyadID
        1         2  8      4
        2         3 11      1
        3         3 16      2
        4         3 12      1
        5         1 22      3
        6         3 16     NA
        7         1 18      3
        8         3  8      2
        9         2 14      4
       10         1 17     NA
Note that two students did not receive a pairing, because there was an odd number in condition 1 and condition 3. If there is an odd number, the DyadID can be NA.
Thank you for your help with this!
Using match to get a unique id according to Condition and sample for randomness.
library(dplyr)
df_sim <- df_sim %>% mutate(dyad_id = match(Condition,sample(unique(Condition))))
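Note that the one-liner above assigns one ID per Condition value, so every student in a condition shares the same number. If the goal is random pairs within each condition, a sketch along the following lines may be closer. It assumes dplyr is available; the helper columns slot, pair, and key are made-up names, and the exact pairing depends on the random seed and R version.

```r
library(dplyr)

set.seed(111)
df_sim <- tibble::tibble(
  StudentID = 1:10,
  Condition = factor(sample(c("1", "2", "3"), size = 10, replace = TRUE)),
  XP = rpois(n = 10, lambda = 15)
)

df_dyads <- df_sim %>%
  group_by(Condition) %>%
  mutate(
    slot = sample(n()),            # random position within the condition
    pair = (slot + 1L) %/% 2L,     # positions 1-2 -> pair 1, 3-4 -> pair 2, ...
    # with an odd group size, the student in the last position stays unpaired
    pair = if_else(n() %% 2 == 1 & slot == n(), NA_integer_, pair)
  ) %>%
  ungroup() %>%
  mutate(
    # build an ID that is unique across conditions, keeping NA for unpaired students
    key = if_else(is.na(pair), NA_character_, paste(Condition, pair)),
    DyadID = match(key, unique(key[!is.na(key)]))
  ) %>%
  select(-slot, -pair, -key)

print(df_dyads)
```

Each non-missing DyadID then appears on exactly two rows, and both rows have the same Condition.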

How to make a normally distributed variable depend on entries and time in R?

I'm trying to generate a dataset of cross sectional time series to estimate uses of different models.
In this dataset, I have an ID variable and a time variable. I'm trying to add a normally distributed variable that depends on both identifiers. In other words, how do I create a variable that recognizes both ID and time in R?
If my question appears uncertain, feel free to ask any questions.
Thanks in advance.
df2 <- read.table(
text =
"Year,ID,H,
1,1,N(2.3),
2,1,N(2.3),
3,1,N(2.3),
1,2,N(0.1),
2,2,N(0.1),
3,2,N(0.1),
", sep = ",", header = TRUE)
Assuming that the data in the dataframe df looks like
ID Time
 1    1
 1    2
 1    3
 1    4
 2    1
 2    2
 2    3
 2    4
 3    1
 3    2
 3    3
 3    4
you can generate a variable y that depends on ID and time as the sum of two normally distributed random variables (which is itself normally distributed), one whose parameters depend on ID and one whose parameters depend on time:
set.seed(42)
df = data.frame(
  ID = rep(1:3, each=4),     # 3 IDs, 4 time points each, matching the table above
  time = rep(1:4, times=3)
)
df$y = rnorm(nrow(df), mean=df$ID, sd=1+0.1*df$ID) +
  rnorm(nrow(df), mean=df$time, sd=0.05*df$time)
# Output:
ID time y
1 1 1 3.438611
2 1 2 2.350953
3 1 3 4.379443
4 1 4 5.823339
5 2 1 3.470909
6 2 2 3.607005
7 2 3 6.447756
8 2 4 6.150432
9 3 1 6.608619
10 3 2 4.740341
11 3 3 7.670543
12 3 4 10.215574
Note that the underlying normal distributions depend on both ID and time. That is in contrast to your example table above where it looks like it solely depends on ID -- namely resulting in a single normal distribution per ID that is independent of the time variable.

Equivalent to first./last. SAS processing in R

I did find a thread on this (R equivalent of .first or .last sas operator) but it did not fully answer my question.
I come from a SAS background and a common operation is, for example, when you have your patient ID with several different values, and you want to keep only the row with the minimum/maximum value for another variable for each ID. For example, I might have data with dates of a certain medical problem for each ID, and I want a dataset with just the first/last problem date for each patient.
Here's a simple example that gets me what I want, but I want to know if there's a better way to do it. I sort by ID and then by count, and I want to keep just the row with the largest count for each ID.
testdata<-data.frame(id=c(1,1,1,2,3,3,4,3,4,4,4),
count=c(5,9,2,6,16,12,0,11,8,8,7))
library(dplyr)
testdata2<-arrange(testdata,id,count)
testdata3<-cbind(testdata2,!duplicated(testdata2$id,fromLast=TRUE))
testdata4<-subset(testdata3,testdata3[,3]=='TRUE')[,-3]
> testdata4
id count
3 1 9
4 2 6
7 3 16
11 4 8
Is there a more compact way to do this?
Thank you.
do.call(rbind.data.frame,
c(by(testdata, testdata$id, function(d) d[c(1L,nrow(d)),]), stringsAsFactors=FALSE))
# id count
# 1.1 1 5
# 1.3 1 2
# 2.4 2 6
# 2.4.1 2 6
# 3.5 3 16
# 3.8 3 11
# 4.7 4 0
# 4.11 4 7
Breaking it down:
d[c(1L,nrow(d)),] returns the first and last row from the dataframe. (I'm assuming the frame has already been ordered appropriately.)
by(testdata, testdata$id, function breaks the larger frame into smaller frames by $id, and passes each smaller frame to the anonymous function. This returns a by-list of each return value.
do.call(rbind.data.frame, grabs the list and row-binds them back together into a single frame. Since the default is to use factors, I added stringsAsFactors=FALSE.
If you want to use dplyr, you can do:
library(dplyr)
group_by(testdata, id) %>%
slice(c(1,n())) %>%
ungroup()
# # A tibble: 8 × 2
# id count
# <dbl> <dbl>
# 1 1 5
# 2 1 2
# 3 2 6
# 4 2 6
# 5 3 16
# 6 3 11
# 7 4 0
# 8 4 7
where n() is a special function within dplyr pipes that returns the number of rows in that (optionally-grouped) frame.
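The answers above return the first and last row per id. The question itself asked for only the row with the largest count per id, which can be sketched as follows, assuming dplyr 1.0.0 or later for slice_max(). The name max_per_id is made up for illustration.

```r
library(dplyr)

testdata <- data.frame(id    = c(1, 1, 1, 2, 3, 3, 4, 3, 4, 4, 4),
                       count = c(5, 9, 2, 6, 16, 12, 0, 11, 8, 8, 7))

# Keep one row per id: the one with the largest count.
# with_ties = FALSE returns a single row even when the maximum is tied (id 4).
max_per_id <- testdata %>%
  group_by(id) %>%
  slice_max(count, n = 1, with_ties = FALSE) %>%
  ungroup()

print(max_per_id)
# id 1 -> 9, id 2 -> 6, id 3 -> 16, id 4 -> 8
```

Swapping slice_max() for slice_min() gives the row with the smallest count instead, which corresponds to first. processing on data sorted in ascending order.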

How to subtract the mean of each variable from the mean of a specific variable

I would like to subtract the mean of each variable from the mean of a variable named 'birds' and create a new data frame that will contain the results. In my real data frame I have hundreds of variables, so I would like to do it automatically. Any idea how to do so?
I tried with this line of code without the mean function and it works (on the same data frame) :
setNames(as.data.frame(cbind(g, mean(dat$birds)-mean(dat))), c(names(dat), paste0(names(dat),'_new')))
but I don't understand how to use mean as part of the code,I tried:
setNames(as.data.frame(cbind(g, mean(dat$birds)-mean(dat))), c(names(dat), paste0(names(dat),'_new')))
Here is my toy data frame.
dat <- read.table(text = " birds wolfs snakes
3 9 7
3 8 4
1 2 8
1 2 3
1 8 3
6 1 2
6 7 1
6 1 5
5 9 7
3 8 7
4 2 7
1 2 3
7 6 3
6 1 1
6 3 9
6 1 1 ",header = TRUE)
I hope I understood your question correctly.
This should create a new object, in this case just a named vector, where the mean of the "birds" column is subtracted from the means of the other columns. This should work for a data frame of any size.
birds_mean = mean(dat$birds)  # named birds_mean to avoid masking the mean() function
dat2 = colMeans(dat[2:dim(dat)[2]]) - birds_mean
In the future, please provide reproducible example (in your code, object 'g' is not defined) and an example of the expected output, so that it would be clear what you are trying to achieve.
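If the result should be a data frame with the difference taken in the direction the question describes (mean of birds minus the mean of each column), a sketch along these lines might work. The names col_means, diffs, and result are made up here, and the _new suffix follows the question's own naming.

```r
dat <- data.frame(
  birds  = c(3, 3, 1, 1, 1, 6, 6, 6, 5, 3, 4, 1, 7, 6, 6, 6),
  wolfs  = c(9, 8, 2, 2, 8, 1, 7, 1, 9, 8, 2, 2, 6, 1, 3, 1),
  snakes = c(7, 4, 8, 3, 3, 2, 1, 5, 7, 7, 7, 3, 3, 1, 9, 1)
)

# Mean of every column, as a named numeric vector.
col_means <- colMeans(dat)

# Mean of birds minus the mean of each column (the birds entry is 0 by construction).
diffs <- col_means[["birds"]] - col_means

# One-row data frame with the question's "_new" suffix on each column name.
result <- as.data.frame(as.list(diffs))
names(result) <- paste0(names(dat), "_new")

print(result)
```

Because colMeans() works on all columns at once, the same code applies unchanged to a data frame with hundreds of numeric columns.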

Reshape data into long format, repeating range of ids for every variable [closed]

I want to reshape my data into a long format, but I would like to repeat the entire range of ids for each variable in my data set, even for those ids on which the variable takes no value. At the moment I can get narrow data, but only with the ids for which each variable has a corresponding entry.
Suppose my data has 15 variables and 20 possible ids. I want to create a narrow form of this data that is 15*20 rows long (the range of ids, repeated for each variable), where each repeated range of ids shows the values taken by one variable: variable1 for id1, id2, id3, etc. until the end of the range of ids is reached, then variable2 for id1, id2, id3, etc.
I am unsure of how to do this in R. I am currently using the reshape package.
You can use the rep function:
v1 <- 1:5
v2 <- 1:6
rep(v1, each = 6)
# 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 4 4 4 4 4 4 5 5 5 5 5 5
rep(v2, 5)
#1 2 3 4 5 6 1 2 3 4 5 6 1 2 3 4 5 6 1 2 3 4 5 6 1 2 3 4 5 6
Yeah, this is hard to work with, but you're looking for the melt function I think...
library(reshape2)
melt(yourdata, id.vars = 'ID COLUMN')
This will return a 300 x 3 data set that looks like:
ID COLUMN variable value
1 col2 7
1 col3 8
.... .... ....
20 col14 99
20 col15 100
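If the long data is missing some id and variable combinations, one way to get every id repeated for every variable is to build the full grid and merge the long data onto it. This is a base R sketch on made-up toy data (the names long, full_grid, and complete are hypothetical), not the asker's actual data set.

```r
# Toy long data: variable "b" has no entry for ids 2 and 3, "a" has none for id 3.
long <- data.frame(
  id       = c(1L, 2L, 1L),
  variable = c("a", "a", "b"),
  value    = c(10, 20, 30),
  stringsAsFactors = FALSE
)

# Every combination of id and variable, whether or not a value exists.
full_grid <- expand.grid(id = 1:3, variable = c("a", "b"),
                         stringsAsFactors = FALSE)

# A left join onto the grid: missing combinations get NA in the value column.
complete <- merge(full_grid, long, by = c("id", "variable"), all.x = TRUE)

# Order so that the whole id range is listed for one variable, then the next.
complete <- complete[order(complete$variable, complete$id), ]

print(complete)
```

With 20 ids and 15 variables, the same approach would give the 15*20 rows described in the question.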
