Stuck copying rows between two data frames - R

I have decided to learn R and am working through the book Introduction to Scientific Programming and Simulation Using R (http://www.ms.unimelb.edu.au/spuRs/).
I am currently stuck on chapter 7, question 3. The question is:
Consider the following very simple genetic model. A population consists of
equal numbers of two sexes: male and female. At each generation men and
women are paired at random, and each pair produces exactly two offspring,
one male and one female. We are interested in the distribution of height
from one generation to the next. Suppose that the height of both children
is just the average of the height of their parents, how will the distribution
of height change across generations?
Represent the heights of the current generation as a dataframe with two
variables, m and f, for the two sexes. The command rnorm(100, 160, 20)
will generate a vector of length 100, according to the normal distribution
with mean 160 and standard deviation 20 (see Section 16.5.1). We use it to
randomly generate the population at generation 1:
pop <- data.frame(m = rnorm(100, 160, 20), f = rnorm(100, 160, 20))
The command sample(x, size = length(x)) will return a random sample
of size size taken from the vector x (without replacement). (It will also
sample with replacement, if the optional argument replace is set to TRUE.)
The following function takes the dataframe pop and randomly permutes the
ordering of the men. Men and women are then paired according to rows,
and heights for the next generation are calculated by taking the mean of
each row. The function returns a dataframe with the same structure, giving
the heights of the next generation.
next.gen <- function(pop) {
  pop$m <- sample(pop$m)
  pop$m <- apply(pop, 1, mean)
  pop$f <- pop$m
  return(pop)
}
Use the function next.gen to generate nine generations, then use the lattice
function histogram to plot the distribution of male heights in each
generation, as in Figure 7.7. The phenomenon you see is called regression
to the mean.
Hint: construct a dataframe with variables height and generation, where
each row represents a single man.
I have constructed a blank data frame:
generations <- data.frame(gen="", height="")
For now I am trying to get just the first generation height information into it, so I run:
next.gen(pop)
generations$height <- pop$m
and I get the following error:
Error in `$<-.data.frame`(`*tmp*`, "height", value = c(165.208323681597, :
replacement has 100 rows, data has 1
I understand that I'm trying to squeeze the 100 values of pop$m into the single row of generations$height, and that is what causes the problem, but I do not know how to fix it. I thought a blank data frame would be flexible enough to add rows as they are copied over from the pop data frame?
I then tried to run this code:
generations <- pop$m
I get 100 values, but I think that just turns my generations data frame into a vector; running
generations
just lists the copied values as a vector.
I think I am approaching the first step wrong. Is my data frame definition correct? Why can't I copy row information from one data frame into an empty one and have the empty data frame grow as needed?
Thank you

I'm unsure of the exact output you are looking for, but here is an approach that should be simple enough to follow. Note: there are plenty of workable approaches.
pop <- data.frame(m = rnorm(100, 160, 20), f = rnorm(100, 160, 20))
next.gen <- function(pop) {
  pop$m <- sample(pop$m)
  pop$m <- apply(pop, 1, mean)
  pop$f <- pop$m
  return(pop)
}
# the code: apply next.gen iteratively so each generation builds
# on the previous one, keeping only the male heights
test <- list()
for (i in 1:9) {
  pop <- next.gen(pop)
  test[[i]] <- pop["m"]
  test[[i]]$generation <- paste0("g", i)
}
library(data.table)
test2 <- rbindlist(test)
# result
            m generation
  1: 174.6558         g1
  2: 143.2617         g1
  3: 185.2829         g1
  4: 168.9719         g1
  5: 151.6948         g1
 ---
896: 159.6091         g9
897: 161.4546         g9
898: 171.8679         g9
899: 138.4982         g9
900: 152.7390         g9
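The exercise also asks you to plot the distribution of male heights per generation with lattice's histogram function. With test2 in the long format built above, a minimal plotting call (assuming the lattice package is installed) would be:
library(lattice)
histogram(~ m | generation, data = test2)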

Try:
generations <- data.frame(gen = "", height = "", stringsAsFactors = FALSE)
# note: assigning c("", pop$m[i]) stores the heights as character strings
for (i in seq_along(pop$m)) generations[i, ] <- c("", pop$m[i])
generations
  gen           height
1      136.70042632318
2     153.985392293761
3     122.077485676327
4     166.582538529591
5     170.751368839498
6       190.8894492681
...
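For what it's worth, the error in the question happens because generations was created with a single row, and a 100-element vector cannot be assigned into a one-row data frame. You can avoid growing the frame row by row entirely by building it in one call; a minimal sketch using the question's own objects:
# advance one generation, then build the whole data frame at once
# (heights stay numeric this way)
pop <- next.gen(pop)
generations <- data.frame(gen = "g1", height = pop$m)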

Related

Distribution of mean*standard deviation of a sample from a Gaussian

I'm trying to assess the feasibility of an instrumental variable in my project, using a variable I haven't seen before. The variable is essentially an interaction between the mean and standard deviation of a sample drawn from a Gaussian, and I'm trying to see what this distribution might look like. Below is what I'm trying to do; any help is much appreciated.
Generate a set of 1000 individuals with a variable x following the Gaussian distribution. Draw 50 random samples of 5 individuals from this distribution with replacement. Calculate the mean and standard deviation of x for each sample. Create an interaction variable named y by multiplying the mean and standard deviation of x for each sample. Plot the distribution of y.
Beginner's version
There might be more efficient ways to code this, but this is easy to follow, I guess:
stat_pop <- rnorm(1000, mean = 0, sd = 1)
N = 50
# As Ben suggested, we create a data.frame filled with NA values
samples <- data.frame(mean = rep(NA, N), sd = rep(NA, N))
# Now we use a loop to populate the data.frame
for (i in 1:N) {
  # draw 5 individuals from the population (without replacement within a draw)
  # I assume you want the population restored between each draw of 5;
  # if you want replacement while drawing each of the 5,
  # it should be obvious how to adapt the following code
  smpl <- sample(stat_pop, size = 5, replace = FALSE)
  # the data.frame has two columns; in each row i, we put the mean and sd
  samples[i, ] <- c(mean(smpl), sd(smpl))
}
# $ is used to get a certain column of the data.frame by the column name.
# Here, we create a new column y based on the existing two columns.
samples$y <- samples$mean * samples$sd
# plot a histogram
hist(samples$y)
Most functions here use positional arguments, i.e., you are not required to name every parameter. E.g., rnorm(1000, mean = 0, sd = 1) is the same as rnorm(1000, 0, 1) and even the same as rnorm(1000), since 0 and 1 are the default values.
Somewhat more efficient version
In R, explicit loops are often slower than vectorized alternatives, so it is worth avoiding them where a vectorized solution exists. For your question the difference is not noticeable, but for large data sets performance should be kept in mind. The following might be a bit harder to follow:
stat_pop <- rnorm(1000, mean = 0, sd = 1)
N = 50
n = 5
# again, I set replace = FALSE here; if you meant to replace each individual
# (so the same individual can be drawn more than once in each "draw 5"),
# set replace = TRUE
# replicate repeats the "draw 5" action N times
smpls <- replicate(N, sample(stat_pop, n, replace = FALSE))
# we transpose the output and turn it into a data.frame to make it
# more convenient to work with
samples <- data.frame(t(smpls))
samples$mean <- rowMeans(samples)
samples$sd <- apply(samples[, c(1:n)], 1, sd)
samples$y <- samples$mean * samples$sd
hist(samples$y)
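If N were much larger, the apply(samples[, 1:n], 1, sd) call would become the slow spot. One option, assuming the matrixStats package is available (not part of the original answer), is its vectorized rowSds():
# optional: row-wise sd without apply, via the matrixStats package
library(matrixStats)
samples$sd <- rowSds(as.matrix(samples[, 1:n]))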
General note
Usually, you should do some research on the problem before posting here. Then you either find out how it works by yourself, or you can provide an example of what you tried. To this end, you can simply google each of the steps you outlined (e.g., google "generate random normal distribution R" in order to find out about the function rnorm()).
Run ?rnorm to get help on the function in RStudio.

How can I automate creation of a list of vectors containing simulated data from a known distribution, using a "for loop" in R?

First Stack Exchange post, so please bear with me. I'm trying to automate the creation of a list made up of many empty vectors of various known lengths. The empty vectors will then be filled with simulated data. How can I automate creation of this list using a for loop in R?
In this simplified example, fish have been caught by casting a net 4 times, and their abundance is given in the vector "abundance" (from counting the total number of fish in each net). We don't have individual fish weights, just the mean weight of all fish in each net, so I need to simulate the weights from a lognormal distribution. I'm looking to fill those empty vectors for each net, each with a length equal to the number of fish caught in that net, with weights simulated from a lognormal distribution with a known mean and standard deviation.
A simplified example of my code:
abundance <- c(5, 10, 9, 20)
net1 <- rep(NA, abundance[1])
net2 <- rep(NA, abundance[2])
net3 <- rep(NA, abundance[3])
net4 <- rep(NA, abundance[4])
simulated_weights <- list(net1, net2, net3, net4)
# meanlog vector for each net
weight_per_net
# sdlog vector for each net
sd_per_net
for (i in 1:4) {
  # note: rlnorm's argument is sdlog, not sd
  simulated_weights[[i]] <- rlnorm(n = abundance[i], meanlog = weight_per_net[i], sdlog = sd_per_net[i])
  print(simulated_weights[[i]])
}
Could anyone please help me automate this so that I don't have to write out each net vector (e.g. net1) by hand and then write out all the net names in the list() function? There are far more than 4 nets, so it would be extremely time-consuming and inefficient to do it this way. I've tried several things from other posts, like paste0(), other for loops, and as.list(c()), all to no avail.
Thanks!
HM
It turns out you don't need the net1, net2, etc. variables at all. You can just do:
abundance <- c(5, 10, 9, 20)
simulated_weights <- lapply(abundance, function(x) rep(NA, x))
The lapply function will return the list you need by calling the function once for each value of abundance.
We could also create 'simulated_weights' with split and rep:
simulated_weights <- split(rep(rep(NA, length(abundance)), abundance),
                           rep(seq_along(abundance), abundance))
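Since the NA placeholders are immediately overwritten anyway, you could also skip the preallocation and simulate directly. A minimal sketch, assuming weight_per_net and sd_per_net are numeric vectors the same length as abundance (as in the question's placeholders):
# Map loops over the three vectors in parallel, one net per element
simulated_weights <- Map(function(n, m, s) rlnorm(n, meanlog = m, sdlog = s),
                         abundance, weight_per_net, sd_per_net)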

Generating n new datasets by randomly sampling existing data, and then applying a function to new datasets

For a paper I'm writing I have subsetted a larger dataset into 3 groups, because I thought the strength of correlations between 2 variables in those groups would differ (they did). I want to see if subsetting my data into random groupings would also significantly affect the strength of correlations (i.e., whether what I'm seeing is just an effect of subsetting, or if those groupings are actually significant).
To this end, I am trying to generate n new data frames by randomly sampling 150 rows from an existing dataset, and then want to calculate correlation coefficients for two variables in those n new data frames, saving the correlation coefficient and significance in a new file.
But, HOW?
I can do it manually, e.g., with dplyr, something like
newdata <- sample_n(Random_sample_data, 150)
output <- cor.test(newdata$x, newdata$y, method="kendall")
I'd obviously like not to type this out 1000 or 100000 times, and I have been trying things with loops and lapply (see below), but they've not worked (undoubtedly due to something really obvious that I'm missing!).
Here I have tried to assign each row to a different group, with 10 groups in total, and then to do correlations between x and y by those groups:
Random_sample_data <- select(Range_corrected, x, y)
cat <- sample(1:10, 1229, replace = TRUE)
Random_sample_cats <- cbind(Random_sample_data, cat)
correlation <- function(c) {
  c <- cor.test(x, y, method = "kendall")
  return(c)
}
b <- daply(Random_sample_cats, .(cat), correlation)
Error message:
Error in cor.test(x, y, method = "kendall") :
object 'x' not found
Once you have the code for what you want to do once, you can put it in replicate to do it n times. Here's a reproducible example on built-in data:
library(dplyr)  # for sample_n
result = replicate(n = 10, expr = {
  newdata <- sample_n(mtcars, 10)
  output <- cor.test(newdata$wt, newdata$qsec, method = "kendall")
})
replicate will save the result of the last line of what you did (output <- ...) for each replication. It will attempt to simplify the result, in this case cor.test returns a list of length 8, so replicate will simplify the results to a matrix with 8 rows and 10 columns (1 column per replication).
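(If you would rather keep each full htest object instead, pass simplify = FALSE to replicate and it will return a list, one element per replication.)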
You may want to clean up the results a little bit so that, e.g., you only save the p-value. Here, we store only the p-value, so the result is a vector with one p-value per replication, not a matrix:
result = replicate(n = 10, expr = {
  newdata <- sample_n(mtcars, 10)
  cor.test(newdata$wt, newdata$qsec, method = "kendall")$p.value
})
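As for the error in the question: cor.test(x, y) looks for objects named x and y in the calling environment, not inside the data frame handed over by daply, which is why 'x' is not found. A minimal base-R sketch of a per-group version, assuming Random_sample_cats as built in the question:
# split by the random group label, then pull the columns explicitly
p_values <- sapply(split(Random_sample_cats, Random_sample_cats$cat),
                   function(d) cor.test(d$x, d$y, method = "kendall")$p.value)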

How can I repeat these two lines of code 100+ times?

I'm still new to the programming world and am looking for some guidance on a model I am building for individual animal growth over time.
The goal for the code I'm working with is to
i) Generate random starting sizes of animals from a given distribution
ii) Give each of these individuals a starting growth rate from a given distribution
iii) Calculate new size of individual after 1 year
iv) Assign a new growth rate from above distribution
v) Calculate the new size of individual after another year.
So far I have the code below, and what I want to do is repeat the last two lines of code x times without having to run the code by hand over and over.
# Generate starting lengths
lengths <- seq(from = 4.4, to = 5.4, by = 0.1)
# Generate starting ks (growth rate)
ks <- seq(from = 0.0358, to = 0.0437, by = 0.0001)
# Create individuals
create.inds <- function(id = NaN, length0 = NaN, k1 = NaN) {
  inds <- data.frame(id = id, length0 = length0, k1 = k1)
  inds
}
# Generate individuals (n.initial was not defined in the question;
# 100 matches the sample() calls below)
n.initial <- 100
inds <- create.inds(id = 1:n.initial,
                    length0 = sample(lengths, 100, replace = TRUE),
                    k1 = sample(ks, 100, replace = TRUE))
# Calculate new lengths based on the last and 2nd-last columns and insert into the next column
inds[, ncol(inds) + 1] <- 326 * (1 - exp(-(inds[, ncol(inds)]))) +
  (inds[, ncol(inds) - 1] * exp(-(inds[, ncol(inds)])))
# Calculate new ks and insert into the last column
inds[, ncol(inds) + 1] <- sample(ks, 100, replace = TRUE)
Any and all assistance would be appreciated, also if you think there is a better way to write this please let me know.
I think what you are asking for is a simple loop:
for (i in 1:100) { # replace 100 with the number of times you want this to execute
  inds[, ncol(inds) + 1] <- 326 * (1 - exp(-(inds[, ncol(inds)]))) +
    (inds[, ncol(inds) - 1] * exp(-(inds[, ncol(inds)])))
  # Calculate new ks and insert into the last column
  inds[, ncol(inds) + 1] <- sample(ks, 100, replace = TRUE)
}
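An optional refinement, not part of the original answer: name each new column as you go, and use nrow(inds) instead of the hard-coded 100, so the loop still works if the number of individuals changes. A sketch under those assumptions:
n.years <- 100  # assumed number of years to simulate
for (i in 1:n.years) {
  # new length from the last k column and the last length column
  inds[[paste0("length", i)]] <- 326 * (1 - exp(-inds[[ncol(inds)]])) +
    inds[[ncol(inds) - 1]] * exp(-inds[[ncol(inds)]])
  # fresh growth rate for the next year
  inds[[paste0("k", i + 1)]] <- sample(ks, nrow(inds), replace = TRUE)
}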

How to import a distance matrix for clustering in R

I have got a text file containing 200 models, all compared to each other, with a molecular distance for each pair of models compared. It looks like this:
1 2 1.2323
1 3 6.4862
1 4 4.4789
1 5 3.6476
.
.
All the way down to 200, where the first number is the first model, the second number is the second model, and the third number is the corresponding molecular distance when these two models are compared.
I can't think of a way to import this into R and create a nice 200x200 matrix to perform some clustering analyses on. I am still new to Stack Overflow and R, but thanks in advance!
Since you don't have the distance between model1 and itself, you would need to insert that yourself, using the answer from this question:
(you can ignore the wrong numbering of the models compared to your input data, it doesn't serve a purpose, really)
# Create some dummy data that has the same shape as your data
# (using 200 models, to match your real data):
df <- expand.grid(model1 = 1:200, model2 = 2:200)
df$distance <- runif(n = 199*200, min = 1, max = 10)
head(df)
#   model1 model2 distance
# 1      1      2 7.958746
# 2      2      2 1.083700
# 3      3      2 9.211113
# 4      4      2 5.544380
# 5      5      2 5.498215
# 6      6      2 1.520450
inds <- seq(0, 200*199, by = 200)
val <- c(df$distance, rep(0, length(inds)))
inds <- c(seq_along(df$distance), inds + 0.5)
val <- val[order(inds)]
Once that's in place, you can use matrix() with the ncol and nrow arguments to "reshape" your vector of distances in the appropriate way:
matrix(val, ncol = 200, nrow = 200)
Edit:
When your data only contains the distance in one direction, so only between e.g. model1 and model5 and not model5 and model1, you will have to fill in the values in the upper-triangular part of a matrix, like they do here. Forget about the data I generated in the first part of this answer. Also, forget about adding the ones to your distance column.
dist_mat <- diag(0, 200)  # a 200 x 200 matrix of zeros; a model's distance to itself is 0
dist_mat[upper.tri(dist_mat)] <- your_data$distance  # assumes your_data is ordered to fill column by column
To copy the upper-triangular entries to below the diagonal, use:
dist_mat[lower.tri(dist_mat)] <- t(dist_mat)[lower.tri(dist_mat)]
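To actually run the clustering the question asks about, one common route, sketched here as a suggestion rather than the only option, is to convert the matrix to a dist object and hand it to hclust:
# hierarchical clustering straight from the reconstructed matrix
d <- as.dist(dist_mat)
hc <- hclust(d, method = "average")  # the linkage method is a free choice
plot(hc)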
As I cannot tell from your question what format your file is in, I will assume the most general case, i.e., a plain whitespace-separated text file.
Then you should look at functions for reading files, such as read.table or fread.
Example code (header = FALSE because your sample data has no header row):
dt <- read.table(file, sep = "", header = FALSE)
I suggest using the data.table package. Then:
setDT(dt)
dt[, id := paste0(as.character(V1), "-", as.character(V2))]
This creates a new variable out of the first and the second model and serves as a unique id.
What I would do then is remove this id and scale the numerical input.
After scaling, run your clustering algorithms.
Merge the result with the id to analyse your results.
Is that what you are looking for?
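A minimal sketch of the steps just described, assuming dt as read above (kmeans and centers = 3 are arbitrary illustration choices, not from the original answer):
# scale the numeric distance column, cluster, then re-attach the ids
scaled <- scale(dt$V3)
km <- kmeans(scaled, centers = 3)
result <- data.table(id = dt$id, cluster = km$cluster)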
