Random Sample with Replacement Loop

I have an R script that allows me to select a sample size and take fifty individual random samples with replacement. Below is an example of this code:
## Loads data.table and converts the data to a data.table
library(data.table)
df = as.data.table(data)
## Select sample size
sample.size = 5
## Creates Sample 1 (Size 5)
Sample.1<-df[,
Dollars[sample(.N, size=sample.size, replace=TRUE)], by = Num]
Sample.1$Sample <- c("01")
In the script above, I first create a data table and then select my sample size, which in this case is 5. This produces just one sample. Due to my lack of experience with R, I repeat this code 49 more times. The last piece of code looks like this:
## Creates Sample 50 (Size 5)
Sample.50<-df[,
Dollars[sample(.N, size=sample.size, replace=TRUE)], by = Num]
Sample.50$Sample <- c("50")
The sample output would look something like this (Sample Range 1 - 50):
Num Dollars Sample
1 85000 01
1 4900 01
1 18000 01
1 6900 01
1 11000 01
1 8800 50
1 3800 50
1 10400 50
1 2200 50
1 29000 50
It should be noted that the variable 'Num' was created for grouping purposes and has little to no influence on my overall question (which is posted below).
Instead of repeating this code fifty times, to get me fifty individual samples (with a size of 5), is there a loop I can create to help me limit my code? I have been recently asked to create ten thousand random samples, each of a size of 5. I obviously cannot repeat this code ten thousand times so I need some sort of loop.
A sample of my final output should look something like this (Sample Range 1 - 10,000):
Num Dollars Sample
1 85000 01
1 4900 01
1 18000 01
1 6900 01
1 11000 01
1 9900 10000
1 8300 10000
1 10700 10000
1 6800 10000
1 31000 10000
Thank you all in advance for your help, it's greatly appreciated.
Here is some sample data if needed:
Num Dollars
1 31002
1 13728
1 23526
1 80068
1 86244
1 9330
1 27169
1 13694
1 4781
1 9742
1 20060
1 35230
1 15546
1 7618
1 21604
1 8738
1 5299
1 12081
1 7652
1 16779

A very simple method would be to use a for loop and store the results in a list:
lst <- list()
for(i in seq_len(3)){
lst[[i]] <- df[sample(seq_len(nrow(df)), 5, replace = TRUE),]
lst[[i]]["Sample"] <- i
}
> lst
[[1]]
Num Dollars Sample
20 1 16779 1
1 1 31002 1
12 1 35230 1
14 1 7618 1
14.1 1 7618 1
[[2]]
Num Dollars Sample
9 1 4781 2
13 1 15546 2
12 1 35230 2
17 1 5299 2
12.1 1 35230 2
[[3]]
Num Dollars Sample
1 1 31002 3
7 1 27169 3
17 1 5299 3
5 1 86244 3
6 1 9330 3
Then, to create a single data.frame, use do.call to rbind the list elements together:
do.call(rbind, lst)
Num Dollars Sample
20 1 16779 1
1 1 31002 1
12 1 35230 1
14 1 7618 1
14.1 1 7618 1
9 1 4781 2
13 1 15546 2
121 1 35230 2
17 1 5299 2
12.1 1 35230 2
11 1 31002 3
7 1 27169 3
171 1 5299 3
5 1 86244 3
6 1 9330 3
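The loop-and-rbind pattern above can also be wrapped in a small function, which makes it easy to go from 50 to 10,000 samples. This is only a sketch: `draw_samples` is a hypothetical helper name, it assumes `df` is a plain data.frame, and the seed is there just so the run is reproducible.

```r
# Hypothetical helper (not from the original answer): draws n_samples samples
# of sample_size rows each, with replacement, and labels each sample.
draw_samples <- function(df, n_samples, sample_size) {
  samples <- lapply(seq_len(n_samples), function(i) {
    rows <- sample(seq_len(nrow(df)), sample_size, replace = TRUE)
    out <- df[rows, , drop = FALSE]
    out$Sample <- i
    out
  })
  result <- do.call(rbind, samples)
  rownames(result) <- NULL  # drop the "14.1"-style row names left by resampling
  result
}

# Toy data: the first five rows of the question's sample data
df <- data.frame(Num = 1, Dollars = c(31002, 13728, 23526, 80068, 86244))
set.seed(1)  # assumed seed, only for reproducibility
res <- draw_samples(df, n_samples = 50, sample_size = 5)
head(res)
```

Using lapply avoids growing the list inside a for loop, but the result is the same kind of long data.frame with a Sample column.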

It's worth noting that if you're sampling with replacement, then drawing 50 (or 10,000) samples of size 5 is equivalent to drawing one sample of size 250 (or 50,000) and cutting it into consecutive groups of 5. Thus I would do it like this (you'll see I stole a line from @beginneR's answer):
library(data.table)
df = as.data.table(data)
## Select sample size
sample.size = 5
n.samples = 10000
# Sample and assign groups
draws <- df[sample(seq_len(nrow(df)), sample.size * n.samples, replace = TRUE), ]
draws[, Sample := rep(1:n.samples, each = sample.size)]
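If the grouping by Num from the question matters, the same one-draw idea can be applied within each group. The following is a sketch under assumptions: the second Num group in the toy data is made up for illustration, and the seed is arbitrary.

```r
library(data.table)

# Toy data; the rows with Num == 2 are an assumption added for illustration
df <- data.table(
  Num     = c(1, 1, 1, 1, 2, 2, 2),
  Dollars = c(31002, 13728, 23526, 80068, 9330, 27169, 13694)
)

sample.size <- 5
n.samples   <- 50
set.seed(1)  # assumed seed, only for reproducibility

# One big draw per Num group, then label consecutive blocks of sample.size rows
draws <- df[, .(
  Dollars = Dollars[sample(.N, sample.size * n.samples, replace = TRUE)],
  Sample  = rep(seq_len(n.samples), each = sample.size)
), by = Num]
```

Indexing with `Dollars[sample(.N, ...)]`, as in the question, rather than `sample(Dollars, ...)` avoids the surprise that `sample()` treats a single number x as 1:x when a group has only one row.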

Related

Create a new dataframe in R resulting from comparison of differently ordered columns from two other databases with different lengths

I have two data frames, CDD26_FF (5593 rows) and CDD_HI (5508 rows), with the structure (columns) shown below. CDDs are "consecutive dry days", and the two tables show species exposure to CDD in the far future (FF) and in the historical period (HI).
I want to focus only on the "Biom" and "Species_name" columns.
As you can see, the two tables share "Species_name" and "Biom" values (a Biom is an area of the world with the same climatic conditions). "Biom" values go from 0 to 15. However, a "Species_name" does not always appear in both tables (e.g. Abrocomo_ben). Furthermore, the two tables do not always contain the same combinations of "Species_name" and "Biom" (a combination is simply a population of that species belonging to that Biom).
CDD26_FF:
CDD26_FF AreaCell Area_total Biom Species_name AreaCellSuAreaTotal
1 1 13 10 Abrocomo_ben 0.076923
1 1 8 1 Abrocomo_cin 0.125000
1 1 30 10 Abrocomo_cin 0.033333
1 2 10 1 Abrothrix_an 0.200000
1 1 44 10 Abrothrix_an 0.022727
1 3 6 2 Abrothrix_je 0.500000
1 1 7 12 Abrothrix_lo 0.142857
CDD_HI:
CDD_HI AreaCell Area_total Biom Species_name AreaCellSuAreaTot_HI
1 1 8 1 Abrocomo_cin 0.125000
1 5 30 10 Abrocomo_cin 0.166666
1 1 5 2 Abrocomo_cin 0.200000
1 1 10 1 Abrothrix_an 0.100000
1 1 44 10 Abrothrix_an 0.022727
1 6 18 1 Abrothrix_je 0.333333
1 1 23 4 Abrothrix_lo 0.130434
I want to highlight rows that match on both "Species_name" and "Biom": in the example these are lines 2, 3, 4, 5 from CDD26_FF, matching lines 1, 2, 4, 5 from CDD_HI, respectively. I want to store these lines in a new table, keeping not only the "Species_name" and "Biom" columns (as the "compare()" function seems to do) but also all the other columns.
More precisely, I then want to calculate the ratio "AreaCellSuAreaTotal" / "AreaCellSuAreaTot_HI" for the highlighted lines.
How can I do that?
Aside from "compare()", I tried a "for" loop, but the lengths of the tables differ, so I tried a 3-nested for loop, still without results. I also tried "compareDF()" and "semi_join()". No results until now. Thank you for your help.

You could use an inner join (provided by dplyr). An inner join returns only the rows whose key values are present in both tables/data.frames (in this case: rows with matching "Biom" and "Species_name").
Subsequently it's easy to calculate some ratio using mutate:
library(dplyr)
cdd26_f %>%
inner_join(cdd_hi, by=c("Biom", "Species_name")) %>%
mutate(ratio = AreaCellSuAreaTotal/AreaCellSuAreaTot_HI) %>%
select(Biom, Species_name, ratio)
returns
# A tibble: 4 x 3
Biom Species_name ratio
<dbl> <chr> <dbl>
1 1 Abrocomo_cin 1
2 10 Abrocomo_cin 0.200
3 1 Abrothrix_an 2
4 10 Abrothrix_an 1
Note: Remove the select-part, if you need all columns or manipulate it for other columns.
Data
cdd26_f <- readr::read_table2("CDD26_FF AreaCell Area_total Biom Species_name AreaCellSuAreaTotal
1 1 13 10 Abrocomo_ben 0.076923
1 1 8 1 Abrocomo_cin 0.125000
1 1 30 10 Abrocomo_cin 0.033333
1 2 10 1 Abrothrix_an 0.200000
1 1 44 10 Abrothrix_an 0.022727
1 3 6 2 Abrothrix_je 0.500000
1 1 7 12 Abrothrix_lo 0.142857")
cdd_hi <- readr::read_table2("CDD_HI AreaCell Area_total Biom Species_name AreaCellSuAreaTot_HI
1 1 8 1 Abrocomo_cin 0.125000
1 5 30 10 Abrocomo_cin 0.166666
1 1 5 2 Abrocomo_cin 0.200000
1 1 10 1 Abrothrix_an 0.100000
1 1 44 10 Abrothrix_an 0.022727
1 6 18 1 Abrothrix_je 0.333333
1 1 23 4 Abrothrix_lo 0.130434")
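For anyone who prefers not to load dplyr, roughly the same inner join can be sketched with base R's merge(), which by default keeps only rows whose key columns match in both inputs. The data frames below are rebuilt by hand from the example rows and trimmed to the relevant columns, which is an assumption made to keep the sketch short.

```r
# Example rows from the question, trimmed to the key columns and the two ratios
cdd26_f <- data.frame(
  Biom = c(10, 1, 10, 1, 10, 2, 12),
  Species_name = c("Abrocomo_ben", "Abrocomo_cin", "Abrocomo_cin",
                   "Abrothrix_an", "Abrothrix_an", "Abrothrix_je",
                   "Abrothrix_lo"),
  AreaCellSuAreaTotal = c(0.076923, 0.125, 0.033333, 0.2,
                          0.022727, 0.5, 0.142857)
)
cdd_hi <- data.frame(
  Biom = c(1, 10, 2, 1, 10, 1, 4),
  Species_name = c("Abrocomo_cin", "Abrocomo_cin", "Abrocomo_cin",
                   "Abrothrix_an", "Abrothrix_an", "Abrothrix_je",
                   "Abrothrix_lo"),
  AreaCellSuAreaTot_HI = c(0.125, 0.166666, 0.2, 0.1,
                           0.022727, 0.333333, 0.130434)
)

# merge() defaults to an inner join on the columns named in `by`
matched <- merge(cdd26_f, cdd_hi, by = c("Biom", "Species_name"))
matched$ratio <- matched$AreaCellSuAreaTotal / matched$AreaCellSuAreaTot_HI
matched
```

With the full tables, columns that exist in both inputs but are not join keys (AreaCell, Area_total) get .x and .y suffixes in the result, in both merge() and inner_join().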

lag and summarize time series data

I have spent a significant amount of time searching for an answer with little luck. I have some time series data and need to collapse it into a rolling mean over every nth row. It looks like this is possible in zoo, maybe Hmisc, and I am sure other packages. I need to average rows 1,2,3, then 3,4,5, then 5,6,7, and so on. My data looks like this and has thousands of observations:
id time x.1 x.2 y.1 y.2
10 1 22 19 0 -.5
10 2 27 44 -1 0
10 3 19 13 0 -1.5
10 4 7 22 .5 1
10 5 -15 5 .33 2
10 6 3 17 1 .33
10 7 6 -2 0 0
10 8 44 25 0 0
10 9 27 12 1 -.5
10 10 2 11 2 1
I would like it to look like this when complete:
id time x.1 x.2 y.1 y.2
10 1 22.66 25.33 -.33 -.66
10 2 3.66 13.33 .27 .50
The time var 1 would actually be times 1,2,3 averaged and 2 would be times 3,4,5 averaged, but at this point the time var is not important to keep. I need to group by id, as it does change eventually. The only way I could figure out how to do this was to use Lag() to make one new column shifted by 1 and another shifted by 2, then take the average across columns. After that you have to delete every other row:
1 NA NA
2 1 NA
3 2 1
4 3 2
5 4 3
Use the 1,2,3 and 3,4,5 rows and remove 2,3,4, and so on. Doing this for each var would be outrageous, especially as I gather new data.
Any ideas? Help would be much appreciated.

Something like this, maybe?
library(zoo)  # provides rollmean()
# sample data
id <- c(10,10,10,10,10,10)
time <- c(1,2,3,4,5,6)
x1 <- c(22,27,19,7,-15,3)
x2 <- c(19,44,13,22,5,17)
df <- data.frame(id,time,x1,x2)
# rolling mean over 3 rows for every column except time
means <- data.frame(rollmean(df[,c(1,3:NCOL(df))], 3))
# keep every other window (rows 1-3, 3-5, ...)
means <- means[c(TRUE,FALSE),]
means$time <- seq_len(NROW(means))
row.names(means) <- seq_len(NROW(means))
> means
id x1 x2 time
1 10 22.666667 25.33333 1
2 10 3.666667 13.33333 2
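The snippet above treats the whole table as one series. Since the question asks for grouping by id, here is a base R sketch (swapping zoo for plain indexing) that computes the window means separately per id. `window_means` is a hypothetical helper, and the toy data reuses the first seven rows of the question's x.1 and x.2 columns.

```r
df <- data.frame(
  id   = c(10, 10, 10, 10, 10, 10, 10),
  time = 1:7,
  x.1  = c(22, 27, 19, 7, -15, 3, 6),
  x.2  = c(19, 44, 13, 22, 5, 17, -2)
)

# Hypothetical helper: mean of each value column over windows of `width` rows
# whose start advances by `step` rows (rows 1-3, 3-5, 5-7 for width 3, step 2)
window_means <- function(d, cols, width = 3, step = 2) {
  if (nrow(d) < width) return(NULL)  # too few rows for a single window
  starts <- seq(1, nrow(d) - width + 1, by = step)
  out <- do.call(rbind, lapply(starts, function(s) {
    colMeans(d[s:(s + width - 1), cols, drop = FALSE])
  }))
  data.frame(id = d$id[1], time = seq_along(starts), out)
}

value_cols <- c("x.1", "x.2")
means <- do.call(rbind, lapply(split(df, df$id), window_means, cols = value_cols))
rownames(means) <- NULL
means
```

Each id is processed on its own, so a window never mixes rows from two different ids. It assumes the rows are already sorted by time within each id.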

Capturing all items in basket (R/arulesSequences)

I am having an issue with the arulesSequences package in R. I was able to read baskets into the program, and create a data.frame, however it fails to recognize any other items beyond the first column. Below is a sample of my data set, which follows the form demonstrated here: Data Mining Algorithms in R/Sequence Mining/SPADE.
[sequenceID] [eventID] [SIZE] items
2 1 1 OB/Gyn
15 1 1 Internal_Medicine
15 2 1 Internal_Medicine
15 3 1 Internal_Medicine
56 1 2 Internal_Medicine Neurology
84 1 1 Oncology
151 1 2 Hematology Hematology
151 2 1 Hematology/Oncology
151 3 1 Hematology/Oncology
185 1 2 Gastroenterology Gastroenterology
The dataset was exported from SAS as a [.CSV] then converted to a tab-delimited [.TXT] file in Excel. Headers were removed for import into R, but I placed them in brackets above for clarity in this example. All spaces were replaced with an underscore ("_"), and item names were simplified as much as possible. Each item is listed in a separate column. The following command was used to import the file:
baskets <- read_baskets(con = "...filepath/spade.txt", sep = "[ \t]+",info=c("sequenceID", "eventID", "SIZE"))
I am presented with no errors, so I continue with the following command:
as(baskets, "data.frame")
Here, it returns the data.frame as requested, however it fails to capture the items beyond the first column:
items sequenceID eventID SIZE
{OB/Gyn} 2 1 1
{Internal_Medicine} 15 1 1
{Internal_Medicine} 15 2 1
{Internal_Medicine} 15 3 1
{Internal_Medicine} 56 1 2
{Oncology} 84 1 1
{Hematology} 151 1 2
{Hematology/Oncology} 151 2 1
{Hematology/Oncology} 151 3 1
{Gastroenterology} 185 1 2
Line 5 should look like:
{Internal_Medicine, Neurology} 56 1 2
I have tried importing the file directly as a [.CSV], but the data.frame results in a similar format to my above attempt using tabs, except it places a comma in front of the first item:
{,Internal_Medicine} 56 1 2
Any troubleshooting suggestions would be greatly appreciated. It seems like this package is picky when it comes to formatting.

Line 5 should look like:
{Internal_Medicine, Neurology} 56 1 2
It works for me with the package versions below. Check out:
library(arulesSequences)
packageVersion("arulesSequences")
# [1] ‘0.2.16’
packageVersion("arules")
# [1] ‘1.5.0’
txt <- readLines(n=10)
2 1 1 OB/Gyn
15 1 1 Internal_Medicine
15 2 1 Internal_Medicine
15 3 1 Internal_Medicine
56 1 2 Internal_Medicine Neurology
84 1 1 Oncology
151 1 2 Hematology Hematology
151 2 1 Hematology/Oncology
151 3 1 Hematology/Oncology
185 1 2 Gastroenterology Gastroenterology
writeLines(txt, tf<-tempfile())
baskets <- read_baskets(con = tf, sep = "[ \t]+",info=c("sequenceID", "eventID", "SIZE"))
as(baskets, "data.frame")
# items sequenceID eventID SIZE
# 1 {OB/Gyn} 2 1 1
# 2 {Internal_Medicine} 15 1 1
# 3 {Internal_Medicine} 15 2 1
# 4 {Internal_Medicine} 15 3 1
# 5 {Internal_Medicine,Neurology} 56 1 2 # <----------
# 6 {Oncology} 84 1 1
# 7 {Hematology} 151 1 2
# 8 {Hematology/Oncology} 151 2 1
# 9 {Hematology/Oncology} 151 3 1
# 10 {Gastroenterology} 185 1 2

specify different subsets or intervals of a variable in the by argument of data.table

Using the following reaction time data (simplified for demonstrative purposes):
>dt
subject trialnum blockcode values.trialtype latency correct
1 1 1 practice cueswitch 3020 1
2 1 1 test cuerep 4284 1
3 1 21 test cueswitch 2094 1
4 1 34 test cuerep 3443 1
5 1 50 test taskswitch 3313 1
6 2 1 practice cueswitch 3020 1
7 2 1 test cuerep 1109 1
8 2 21 test cueswitch 3470 1
9 2 34 test cuerep 2753 1
10 2 50 test taskswitch 3321 1
I have been using data.table to obtain reaction time variables for consecutive subsets of trials (specified by trialnum, which ranges from 1 to 170 in the full dataset):
dt1=dt[blockcode=="test" & correct==1, list(
RT1=.SD[trialnum>=1 & trialnum<=30 & values.trialtype=="cuerep", mean(latency)],
RT2=.SD[trialnum>=31 & trialnum<=60 & values.trialtype=="cuerep", mean(latency)]
), by="subject"]
The output is
subject RT1 RT2
1: 1 4284 3443
2: 2 1109 2753
However, it becomes tedious creating a variable for each subset when there are more than 2 or 3 subsets. How can I specify those subsets more efficiently?

Use findInterval or cut to subset your trialnum column.
An example
# set the key to use binary search
setkey(dt, blockcode,correct,values.trialtype)
# the subset you want
dt1 <- dt[.('test',1,'cuerep')]
# use cut to define subsets
dt2 <- dt1[,list(latency = mean(latency)),
by=list(subject, trialset = cut(trialnum,seq(0,180,by=30)))]
dt2
# subject trialset latency
# 1: 1 (0,30] 4284
# 2: 1 (30,60] 3443
# 3: 2 (0,30] 1109
# 4: 2 (30,60] 2753
# If you want separate columns, it is as simple as using `dcast`
library(reshape2)
dcast(dt2,subject~trialset, value.var = 'latency')
# subject (0,30] (30,60]
# 1 1 4284 3443
# 2 2 1109 2753
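The answer mentions findInterval but only demonstrates cut. A minimal base R sketch of the findInterval variant follows, assuming the trial sets start at trial 1, 31, 61 and so on. The data frame is retyped from the question's example, and aggregate() stands in for the data.table grouping.

```r
dt <- data.frame(
  subject  = rep(1:2, each = 5),
  trialnum = rep(c(1, 1, 21, 34, 50), 2),
  blockcode = rep(c("practice", "test", "test", "test", "test"), 2),
  values.trialtype = rep(c("cueswitch", "cuerep", "cueswitch",
                           "cuerep", "taskswitch"), 2),
  latency = c(3020, 4284, 2094, 3443, 3313, 3020, 1109, 3470, 2753, 3321),
  correct = 1
)

# The subset of interest: correct test trials of type cuerep
keep <- subset(dt, blockcode == "test" & correct == 1 &
                     values.trialtype == "cuerep")

# Lower bounds of each trial set: 1, 31, 61, ..., 151 (assumed block size 30)
breaks <- seq(1, 151, by = 30)
keep$trialset <- findInterval(keep$trialnum, breaks)  # 1 for 1-30, 2 for 31-60, ...

rt <- aggregate(latency ~ subject + trialset, data = keep, FUN = mean)
rt <- rt[order(rt$subject, rt$trialset), ]
rt
```

findInterval returns an integer index rather than a labelled factor like cut, which can be handier when building column names such as RT1, RT2.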

Replace values greater than a specific value within a loop in R

I am trying to figure out a way to loop through my data frame and replace any values greater than 200 with a decimal value.
Here is my code:
for (i in data$AGE) if (i > 199) i <- i*.01-2
Here is a head() sample of my data frame:
AGE LOC RACE SEX WORKREL PROD1 ICD10 INJ_ST DTH_YEAR DTH_MONTH DTH_DAY ACC_YEAR ACC_MONTH ACC_DAY
1 26 5 1 1 0 1290 V865 UT 2003 1 1 2002 12 31
2 20 1 7 2 0 1899 X47 HI 2003 1 1 2003 1 1
3 202 1 2 2 0 1598 W75 FL 2003 1 1 2003 1 1
4 86 5 1 2 0 1807 W18 FL 2003 1 1 2002 12 14
5 203 1 2 1 0 1598 W75 GA 2003 1 1 2003 1 1
6 79 0 1 2 2 921 X49 MA 2003 1 1 NA NA NA
So basically, if the value of AGE is greater than 200, then I want to multiply that value by .01 and then subtract 2.
The reason is that any value of 200 or greater is an age in months.
I'm not a Stats or R genius so my humble thanks in advance for all advice.

data$AGE[data$AGE> 200] <- data$AGE[data$AGE > 200] * 0.01 - 2

You can do this reasonably elegantly with within and replace:
data <- within(data, AGE <- replace(AGE, AGE > 200, AGE[AGE>200] * 0.01-2))
Or using data.table for memory efficiency and syntax elegance
library(data.table)
DT <- as.data.table(data)
# make sure that AGE is numeric not integer
DT[,AGE:= as.numeric(AGE)]
DT[AGE>200, AGE := AGE *0.01 -2]
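As a side note on why the loop in the question has no effect: in `for (i in data$AGE)`, `i` is a copy of each element, so assigning to `i` never touches the data frame. A small sketch using the AGE values from the head() output, with ifelse as one more vectorized option:

```r
# AGE values taken from the head() sample in the question
data <- data.frame(AGE = c(26, 20, 202, 86, 203, 79))

# The original loop: `i` is a copy, so data$AGE is left untouched
for (i in data$AGE) if (i > 199) i <- i * 0.01 - 2
unchanged <- identical(data$AGE, c(26, 20, 202, 86, 203, 79))

# Vectorized replacement: transform only the values above 200
data$AGE <- ifelse(data$AGE > 200, data$AGE * 0.01 - 2, data$AGE)
data$AGE
```

Note that the question's loop tests `i > 199` while the text and the answers use `> 200`. The two differ only for a value of exactly 200, so pick whichever cutoff matches the coding scheme of the data.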