R loop to execute the same code

I'm trying to figure out how to repeat the same code 30 times without typing it out each time... any help will be much appreciated.
SRS_1 <- sample(1:nrow(MyData_points), size=.10*nrow(MyData_points))
data_sample_1 <- MyData_points[SRS_1,]
fpc.srs <- rep(6399875, 639987)
design_SRS_1 <- svydesign(id=~1, strata=NULL, data=data_sample_1, fpc=fpc.srs)
ONStotal_SRS1 <- svytotal(~data_sample_1$V4, design=design_SRS_1)
ONSmean_SRS1 <- svymean(~data_sample_1$V4, design=design_SRS_1)
CI_SRS_1 <- confint(svytotal(~data_sample_1$V4, design=design_SRS_1))
The first line draws a simple random sample of 10% of the rows in the data. The second extracts that sample from the data. The third builds the fpc, which repeats the population size for the 10% of data points that were sampled. Then, to estimate the population, I create a survey design for the sample without replacement, including the fpc. Finally, the last three lines calculate a population total estimate, a mean, and a confidence interval based on that sample.
What changes is that I must repeat this for 30 different simple random samples from the data, so the resulting estimates, means, and confidence intervals will be obtained from 30 different samples. They should be close to one another, but not equal.
How can I improve this code so that I can run it 30 times and print a table with (ONStotal_SRS1, ONSmean_SRS1, CI_SRS_1)?

Usually I would use either rbindlist from the data.table package or bind_rows from dplyr in combination with an lapply to build the table a row at a time and then bind the rows together. Here is an example using bind_rows with the mtcars data set:
library(dplyr)
combined_data <- bind_rows(lapply(1:30, function(...) {
  # Take a sample
  SRS_1 <- sample(1:nrow(mtcars), size = .10 * nrow(mtcars))
  data_sample_1 <- mtcars[SRS_1, ]
  # Compute some things from the sample
  m_disp <- mean(data_sample_1$disp)
  m_hp <- mean(data_sample_1$hp)
  # Make a one-row data.frame that will be returned by the function
  data.frame(m_disp, m_hp)
}))
Which gives this data.frame:
> str(combined_data)
'data.frame': 30 obs. of 2 variables:
$ m_disp: num 235 272 410 115 249 ...
$ m_hp : num 147 159 195 113 154 ...
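The same pattern should carry over to the survey code in the question. Here is a rough, untested sketch assuming MyData_points with a V4 column and the population size used for the fpc above:
library(survey)
library(dplyr)
combined_SRS <- bind_rows(lapply(1:30, function(i) {
  # Draw a 10% simple random sample
  SRS <- sample(1:nrow(MyData_points), size = .10 * nrow(MyData_points))
  data_sample <- MyData_points[SRS, ]
  # Design without replacement; the fpc (population size) is repeated once per sampled row
  design_SRS <- svydesign(id = ~1, strata = NULL, data = data_sample,
                          fpc = rep(6399875, nrow(data_sample)))
  total <- svytotal(~V4, design = design_SRS)
  ci <- confint(total)
  # One row per sample: total, mean and confidence interval
  data.frame(sample = i,
             total = coef(total),
             mean = coef(svymean(~V4, design = design_SRS)),
             ci_lower = ci[1],
             ci_upper = ci[2])
}))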

Related

Composing a data.frame from loop-generated sequences

I have a data set which is made up of observations of the weights of fish, the julian dates they were captured on, and their names. I am seeking to assess what the average growth rate of these fish is according to the day of the year (julian date). I believe the best method to do this is to compose a data.frame with two fields: "Julian Date" and "Growth". The idea is this: for a fish which is observed on January 1 (1) at weight 100 and a fish observed again on April 10 (101) at weight 200, the growth rate would be 100g/100days, or 1g/day. I would represent this in a data.frame as 100 rows in which the "Julian Date" column is composed of the Julian date sequence (1:100) and the "Growth" column is composed of the average growth rate (1g/day) over all days.
I have attempted to compose a for loop which passes through each fish, calculates the average growth rate, then creates a list in which each index contains the sequence of Julian dates and the growth rate (repeated a number of times equal to the length of the Julian date sequence). I would then use do.call(rbind, ...) on that list to compose my data.frame.
growth_list <- list() # initialize empty list
p <- 1 # initialize increment count
# Looks at every other fish ID beginning at 1 (all even-number observations are the same fish at a later observation)
for (i in seq(1, length(df$FISH_ID), by = 2)){
  rate <- (df$growth[i+1] - df$growth[i]) / (as.double(df$date[i+1]) - as.double(df$date[i]))
  growth_list[[p]] <- list(c(seq(as.numeric(df$date[i]), as.numeric(df$date[i+1]))),
                           rep(rate, length(seq(from = as.numeric(df$date[i]),
                                                to = as.numeric(df$date[i+1])))))
  p <- p + 1 # increase to change index of list item in next iteration
}
# Converts list of vectors (the rows which fulfill above criteria) into a data.frame
growth_df <- do.call(rbind, growth_list)
My expected results can be illustrated here: https://imgur.com/YXKLkpK
My actual results are illustrated here: https://imgur.com/Zg4vuVd
As you can see, the actual results appear to be a data.frame with two columns specifying the type of the object, as well as the length of the original list item. That is, row 1 of this dataset contained 169 days between observations, and therefore contained 169 julian dates and 169 repetitions of the growth rate.
Instead of list(), use data.frame() with named columns to build a list of data frames to be row-bound at the end:
growth_list <- vector(mode = "list", length = length(df$FISH_ID)/2)
p <- 1  # list index, incremented each iteration
for (i in seq(1, length(df$FISH_ID), by = 2)) {
  rate <- with(df, (growth[i+1] - growth[i]) / (as.double(date[i+1]) - as.double(date[i])))
  date_seq <- seq(as.numeric(df$date[i]), as.numeric(df$date[i+1]))
  growth_list[[p]] <- data.frame(Julian_Date = date_seq,
                                 Growth_Rate = rep(rate, length(date_seq)))
  p <- p + 1
}
growth_df <- do.call(rbind, growth_list)
Welcome to Stack Overflow!
A couple of things about your code:
I recommend using the apply family of functions instead of the for loop. You can set parameters in apply to perform row-wise operations, which makes your code run faster. The apply family also creates the list for you, reducing the code you have to write to create and populate it (a sketch is shown below, after these points).
It is common to supply a small snippet of your initial data for others to work with. The way we describe our data is sometimes not representative of the actual data, and this convention helps avoid miscommunication. If you can, please provide a dummy dataset for us to use.
Have you tried using as.data.frame(growth_list), or data.frame(growth_list)?
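As a rough illustration of the lapply suggestion above, here is a minimal sketch, assuming df has the FISH_ID, growth, and date columns used in the question:
pair_starts <- seq(1, length(df$FISH_ID), by = 2)  # first observation of each fish
growth_list <- lapply(pair_starts, function(i) {
  # daily growth rate between the two observations of this fish
  rate <- (df$growth[i+1] - df$growth[i]) /
    (as.numeric(df$date[i+1]) - as.numeric(df$date[i]))
  date_seq <- seq(as.numeric(df$date[i]), as.numeric(df$date[i+1]))
  data.frame(Julian_Date = date_seq, Growth_Rate = rate)
})
growth_df <- do.call(rbind, growth_list)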
Another option is to use an if else statement within your for loop that performs the rbind function. This would look something like this:
# make a row-wise for loop
for (x in 1:nrow(i)) {
  # insert your desired calculations here; turning the current row into its own
  # data.frame may make the calculations easier to perform:
  dataCurrent <- data.frame(i[x, ])
  # finish with something like this to turn your calculations for each row
  # into an output data.frame of your choice
  outFish <- cbind(date, length, rate)
  # build your final data.frame as follows
  if (exists("finalFishOut") == FALSE) {
    finalFishOut <- outFish
  } else {
    finalFishOut <- rbind(finalFishOut, outFish)
  }
}
Please update with a snippet of data and I'll update this answer with your exact solution.
Here is a solution using dplyr and plyr with some toy data. There are 20 fish, each with a random start and end time, plus random weights at each time. It finds the growth rate over time, then creates a new data frame for each fish with one row per day elapsed and the daily average growth rate, and outputs a single data frame containing all fish.
library(plyr)   # load plyr before dplyr so dplyr's verbs are not masked
library(dplyr)
df <- data.frame(fish = rep(seq(1:20), 2),
                 weight = sample(c(50:100), 40, TRUE),
                 time = sample(c(1:100), 40, TRUE))
df1 <- df %>% group_by(fish) %>% arrange(time) %>%
mutate(diff.weight=weight-lag(weight),
diff.time=time-lag(time)) %>%
mutate(rate=diff.weight/diff.time) %>%
filter(!is.na(rate)) %>%
ddply(.,.(fish),function(x){
data.frame(time=seq(1:x$diff.time),rate=x$rate)
})
head(df1)
fish time rate
1 1 1 -0.7105263
2 1 2 -0.7105263
3 1 3 -0.7105263
4 1 4 -0.7105263
5 1 5 -0.7105263
6 1 6 -0.7105263
tail(df1)
fish time rate
696 20 47 -0.2307692
697 20 48 -0.2307692
698 20 49 -0.2307692
699 20 50 -0.2307692
700 20 51 -0.2307692
701 20 52 -0.2307692

Performing a Specific Function for One Column For The First 12 Rows?

This is easy, but for some reason I'm having trouble with it. I have a set of Data like this:
File Trait Temp Value Rep
PB Mortality 16 52.2 54
PB Mortality 17 21.9 91
PB Mortality 18 15.3 50
...
And it goes on like that for 36 rows. What I need to do is divide the Value column by 100 in only the first 12 rows. I did:
NewData <- Data[1:12,4]/100
to try and create a new data frame without changing the old data. When I do this it divides the fourth column, but it saves only that column (rows 1-12) by itself as a Values entry in the Global Environment, not as data together with the rest of the rows/columns of the original set. Overall, I'm trying to fit NewData into an nls function, so I need to keep the modified data with the rest of the data rather than as a separate value. Is there a way for me to modify the first 12 rows without R saving the result as a value?
Consider copying the data frame and then updating the column at the selected rows:
NewData <- Data
NewData$Value[1:12] <- NewData$Value[1:12]/100
# NewData[1:12,4] <- NewData[1:12,4]/100   # ALTERNATE EQUIVALENT
library(dplyr)
newdata <- Data[1:12,] %>% mutate(newV = Value/100)
newdata$Value = newdata$newV
newdata = newdata %>% select(-newV)
then you can do
full_data = rbind(newdata, Data[13:36,])

How to Bootstrap Resample Count Data in R

I have a vector of counts which I want to resample with replacement in R:
X350277 128
X193233 301
X514940 3715
X535375 760
X953855 50
X357046 236
X196664 460
X589071 898
X583656 670
X583117 1614
(Note the second column is counts, the first column is the object the counts represent)
From reading various documentation it seems easy to resample data where each row or column represents a single observation. But how do I do this when each row represents multiple observations summed together (as in a table of counts)?
You can use weighted sampling (as user20650 also mentioned in the comments):
sample_weights <- dat$count/sum(dat$count)
mysample <- dat[sample(1:nrow(dat),1000,replace=T,prob=sample_weights),]
A less efficient approach - which might have its uses depending on what you want to do - is to turn your data to 'long' again:
dat_large <- dat[rep(1:nrow(dat),dat$count),]
#then sampling is easy
mysample <- dat_large[sample(1:nrow(dat_large),1000,replace=T),]

permuting data and random simulation for chisq test on R

I am new to R and I am trying to compare a table of observed values with one of expected values and calculate chisq. As a part of my assignment, I need to compare the expected values table with a set of 999 tables that I created using random permutations from the observed values. I need to calculate the chisq value for each table (nsim=999) and then plot a histogram of all chisq values along with the actual chisq from observed data. Here is the data and codes I am using:
> survival=table(titanic[,c("CLASS","SURVIVED")])
> survival
SURVIVED
CLASS no yes
1st 122 203
2nd 167 118
3rd 528 178
crew 673 212
> expected=expected(survival) #library(epitools)
> expected
SURVIVED
CLASS no yes
1st 220.0136 104.98637
2nd 192.9350 92.06497
3rd 477.9373 228.06270
crew 599.1140 285.88596
> nsim=999
> random=rep(survival,nsim)
and now I am stuck!
The simplest way to generate permutations is to use the sample command on your "SURVIVED" column:
sample(titanic[,"SURVIVED"])
This will shuffle the yes/no labels for that column; you can then repeat it 999 times:
replicate(999, {
  permSurvival <- sample(titanic[,"SURVIVED"])
  # Code to measure chi square test goes here
})
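One way to fill in the chi-squared part and draw the requested histogram (a sketch, assuming the titanic data frame from the question):
# observed statistic from the real table
obs_chisq <- chisq.test(table(titanic$CLASS, titanic$SURVIVED))$statistic
# 999 permutation statistics: shuffle SURVIVED, rebuild the table, recompute chisq
perm_chisq <- replicate(999, {
  permSurvived <- sample(titanic[, "SURVIVED"])
  chisq.test(table(titanic$CLASS, permSurvived))$statistic
})
# histogram of the permutation distribution, with the observed value marked
hist(perm_chisq, xlab = "Chi-squared statistic", main = "Permutation distribution")
abline(v = obs_chisq, col = "red", lwd = 2)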

Generate a set of random unique integers from an interval

I am trying to build some machine learning models, so I need training data and validation data. Suppose I have N examples: I want to select x of them at random from a data frame.
For example, suppose I have 100 examples and I need 10 random numbers. Is there a way to efficiently generate 10 random INTEGER numbers so I can extract the training data out of my sample data?
I tried using a while loop that slowly replaces the repeated numbers, but the running time is not ideal, so I am looking for a more efficient way to do it.
Can anyone help, please?
sample (or sample.int) does this:
sample.int(100, 10)
# [1] 58 83 54 68 53 4 71 11 75 90
will generate ten random numbers from the range 1–100 without repeats. If repeats are acceptable, pass replace = TRUE, which samples with replacement:
sample.int(20, 10, replace = TRUE)
# [1] 10 2 11 13 9 9 3 13 3 17
More generally, sample samples n observations from a vector of arbitrary values.
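For example (the exact values drawn will vary from run to run):
# sample two values directly from a character vector
sample(c("train", "test", "validate"), 2)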
If I understand correctly, you are trying to create a hold-out sampling. This is usually done using probabilities. So if you have n.rows samples and want a fraction of training.fraction to be used for training, you may do something like this:
select.training <- runif(n=n.rows) < training.fraction
data.training <- my.data[select.training, ]
data.testing <- my.data[!select.training, ]
If you want to specify an EXACT number of training cases, you may do something like:
indices.training <- sample(x=seq(n.rows), size=training.size, replace=FALSE) #replace=FALSE makes sure the indices are unique
data.training <- my.data[indices.training, ]
data.testing <- my.data[-indices.training, ] #note that index negation means "take everything except for those"
from the raster package:
raster::sampleInt(242, 10, replace = FALSE)
## 95 230 148 183 38 98 137 110 188 39
Base sample.int, on the other hand, may fail if the limits are too large:
sample.int(1e+12, 10)

Resources