Generating the dataframe objects that are produced during jackknife sampling - r

This post has been edited to more accurately describe the situation. I am utilising a form of jackknife sampling for my work. The jackknifed data will be used for calibration of a model, and the unused data will be used for validation.
Rather than perform the analysis immediately, I want to save the jackknifed samples as dataframes, as well as the data which was removed for each sample...
It's hard to explain, so I will use an example to illustrate:
The aim in the example is to create four replicates. Each replicate should consist of 2 datasets: 1 of length 9 (the calibration one) and 1 of length 3 (the validation one).
df <-
  data.frame(value1 = 1:(3*4),
             value2 = seq(from = 1000, by = 50, length.out = 3*4),
             tosplit = rep(1:4, each = 3))
df  # df represents the dataframe in its entirety
dfs <- split(df, df$tosplit)  # df is now split into 4 equal parts of 3 rows each
#####
> #Replicate 1
> r1_3parts <- do.call("rbind", dfs[1:3])
> r1_1parts <- do.call("rbind", dfs[4])
>
> r1_3parts
value1 value2 tosplit
1.1 1 1000 1
1.2 2 1050 1
1.3 3 1100 1
2.4 4 1150 2
2.5 5 1200 2
2.6 6 1250 2
3.7 7 1300 3
3.8 8 1350 3
3.9 9 1400 3
> r1_1parts
value1 value2 tosplit
4.10 10 1450 4
4.11 11 1500 4
4.12 12 1550 4
>
> #Replicate 2
> r2_3parts <- do.call("rbind", dfs[2:4])
> r2_1parts <- do.call("rbind", dfs[1])
>
> r2_3parts
value1 value2 tosplit
2.4 4 1150 2
2.5 5 1200 2
2.6 6 1250 2
3.7 7 1300 3
3.8 8 1350 3
3.9 9 1400 3
4.10 10 1450 4
4.11 11 1500 4
4.12 12 1550 4
> r2_1parts
value1 value2 tosplit
1.1 1 1000 1
1.2 2 1050 1
1.3 3 1100 1
>
> #Replicate 3
> r3_3parts <- do.call("rbind", dfs[c(3:4, 1)])
> r3_1parts <- do.call("rbind", dfs[2])
>
> r3_3parts
value1 value2 tosplit
3.7 7 1300 3
3.8 8 1350 3
3.9 9 1400 3
4.10 10 1450 4
4.11 11 1500 4
4.12 12 1550 4
1.1 1 1000 1
1.2 2 1050 1
1.3 3 1100 1
> r3_1parts
value1 value2 tosplit
2.4 4 1150 2
2.5 5 1200 2
2.6 6 1250 2
>
>
> #Replicate 4
> r4_3parts <- do.call("rbind", dfs[c(4, 1:2)])
> r4_1parts <- do.call("rbind", dfs[3])
>
> r4_3parts
value1 value2 tosplit
4.10 10 1450 4
4.11 11 1500 4
4.12 12 1550 4
1.1 1 1000 1
1.2 2 1050 1
1.3 3 1100 1
2.4 4 1150 2
2.5 5 1200 2
2.6 6 1250 2
> r4_1parts
value1 value2 tosplit
3.7 7 1300 3
3.8 8 1350 3
3.9 9 1400 3
>
This doesn't appear to be an option in the packages I can find - they default to just computing the statistics for you. What I want is to see the sample datasets themselves, and also to specify their relative sizes. Is this possible in an existing package, or if not, is there a suitable way to generate them in a more automated fashion?

Without a random component, this doesn't really strike me as a bootstrap. It seems you are pursuing leave-one-group-out splitting: each group takes a turn as the held-out set.
The data frame can be split with a fairly simple function.
df <-
  data.frame(value1 = 1:(3*4),
             value2 = seq(from = 1000, by = 50, length.out = 3*4),
             tosplit = rep(1:4, each = 3))

split_into_two <- function(data, split_var, split_val){
  split <- data[[split_var]] %in% split_val
  split(data, split)
}
split_into_two(df, "tosplit", 1:3)
To get the four splits you describe, we can use lapply (the index sets below follow the order of your four replicates):
lapply(list(1:3, 2:4, c(3:4, 1), c(4, 1:2)),
       function(x) split_into_two(df, "tosplit", x))
This saves a great deal of copy-paste.
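If you want the replicates generated automatically from the grouping column rather than hard-coding the index sets, a minimal base-R sketch (assuming tosplit defines the groups, as above) is:
groups <- unique(df$tosplit)
replicates <- lapply(groups, function(g) {
  list(calibration = df[df$tosplit != g, ],  # the 9-row calibration set
       validation  = df[df$tosplit == g, ])  # the 3-row held-out set
})
names(replicates) <- paste0("replicate_", groups)
str(replicates, max.level = 2)
Each element of replicates then holds one calibration/validation pair, ready to be saved or passed to your model.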

Related

Randomly sampling 1 ID from each pair in column

Say I have something like the following..
df <- data.frame(ID = c("2330", "2331", "2333", "2334", "2336", "2337",
                        "4430", "4431", "4510", "4511"),
                 length = c(8.4, 6, 3, 9, 3, 4, 1, 7, 4, 2))
> df
ID length
1 2330 8.4
2 2331 6.0
3 2333 3.0
4 2334 9.0
5 2336 3.0
6 2337 4.0
7 4430 1.0
8 4431 7.0
9 4510 4.0
10 4511 2.0
IDs that are in a pair are +/- 1 of each other. (2330, 2331), (2333, 2334), (2336, 2337), (4430, 4431), & (4510, 4511) are the pairs in my example. I would like to randomly sample 1 ID from each pair to get a dataframe that looks like the following...
> df
ID length
1 2330 8.4
2 2334 9.0
3 2336 3.0
4 4430 1.0
5 4510 4.0
How would I accomplish this with base R? Thank you.
We may create a grouping column with gl for every two adjacent rows and then use slice_sample with n = 1.
library(dplyr)
df %>%
  group_by(grp = as.integer(gl(n(), 2, n()))) %>%
  slice_sample(n = 1) %>%
  ungroup() %>%
  select(-grp)
-output
# A tibble: 5 × 2
ID length
<chr> <dbl>
1 2330 8.4
2 2333 3
3 2337 4
4 4430 1
5 4510 4
Or using base R
do.call(rbind, lapply(split(df, gl(nrow(df), 2, nrow(df)), drop = TRUE),
                      function(x) x[sample(nrow(x), 1), ]))
-output
ID length
1 2330 8.4
2 2333 3.0
3 2337 4.0
4 4430 1.0
5 4510 4.0
Or with aggregate in base R
aggregate(. ~ grp, transform(df, grp = cumsum(c(TRUE,
  diff(as.numeric(ID)) != 1))), FUN = sample, 1)[-1]
  ID length
1 2331 8.4
2 2334 3
3 2337 3
4 4431 7
5 4510 2
(Note that aggregate applies sample to each column independently within a group, so the ID and length in an output row may not come from the same original row - the first row above pairs 2331 with 2330's length.)
Or with tapply
df[with(df, tapply(seq_along(ID),
                   rep(seq_along(ID), each = 2, length.out = nrow(df)),
                   FUN = sample, 1)), ]
ID length
1 2330 8.4
4 2334 9.0
5 2336 3.0
7 4430 1.0
10 4511 2.0
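All of these pair rows by position, which assumes the data frame is sorted so that paired IDs sit next to each other. A base-R sketch that instead groups by the actual ID difference (reusing the cumsum trick from the aggregate answer, but keeping rows intact) could be:
#group rows wherever the ID sequence breaks (difference != 1),
#then pick one whole row per group, preserving row integrity
grp <- cumsum(c(TRUE, diff(as.numeric(df$ID)) != 1))
picked <- vapply(split(seq_len(nrow(df)), grp),
                 function(i) i[sample(length(i), 1)],
                 integer(1))
df[picked, ]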

Dynamic subsetting depending on values in R

I have a data frame with the following structure:
Id Flag value1 value2
123 1 10 3.4
124 1 5 1.2
125 0 19 8.4
126 1 8 1.2
127 0 17 6.5
128 2 1 -6.5
I need to separate the data frame into 'n' subsets based only on the name of a column, where 'n' is the number of distinct values in that column. I would expect the following:
dataframe1
Id Flag value1 value2
123 1 10 3.4
124 1 5 1.2
126 1 8 1.2
dataframe2
Id Flag value1 value2
125 0 19 8.4
127 0 17 6.5
dataframe3
Id Flag value1 value2
128 2 1 -6.5
Since this is going inside a function, I only know the name of the column and the distinct values it can take. I've tried:
dataFrame$column==value
but I would need to do this for every value, and the number of values varies depending on the column.
Thanks in advance
Here, split is your friend.
splitbycol <- function(df, colname) {
  split(df, df[[colname]])
}
splitbycol(df, "Flag")
## $`0`
## Id Flag value1 value2
## 3 125 0 19 8.4
## 5 127 0 17 6.5
##
## $`1`
## Id Flag value1 value2
## 1 123 1 10 3.4
## 2 124 1 5 1.2
## 4 126 1 8 1.2
##
## $`2`
## Id Flag value1 value2
## 6 128 2 1 -6.5
Then, if you'd like to make each of the data frames a separate "variable", call e.g.
subdf <- splitbycol(df, "Flag")
for (i in seq_along(subdf))
  assign(paste0("df", i), subdf[[i]])
df1
## Id Flag value1 value2
## 3 125 0 19 8.4
## 5 127 0 17 6.5
Another approach, avoiding the for loop:
> List <- split(df, df$Flag)                          # split
> names(List) <- paste0("dataframe", seq_along(List)) # name the pieces (seq_along is safer than 1:length)
> list2env(List, envir = .GlobalEnv)                  # turn each list element into a data frame variable
> dataframe1
# Id Flag value1 value2
#3 125 0 19 8.4
#5 127 0 17 6.5
> dataframe2
# Id Flag value1 value2
#1 123 1 10 3.4
#2 124 1 5 1.2
#4 126 1 8 1.2
> dataframe3
# Id Flag value1 value2
# 6 128 2 1 -6.5
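That said, you often don't need separate variables at all; keeping the list and indexing it is usually cleaner:
List <- split(df, df$Flag)
List[["1"]]         # the subset where Flag == 1
lapply(List, nrow)  # operate on every subset at once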

sequential subtraction in r

I would highly appreciate if somebody could help me out with this. This looks simple but I have no clue how to go about it.
I am trying to work out the percentage change in one row with respect to the previous one. For example: my data frame looks like this:
day value
1 21
2 23.4
3 10.7
4 5.6
5 3.2
6 35.2
7 12.9
8 67.8
. .
. .
. .
365 27.2
What I am trying to do is to calculate the percentage change in each row with respect to previous row. For example:
day value
1 21
2 ((day2-day1)/day1)*100
3 ((day3-day2)/day2)*100
4 ((day4-day3)/day3)*100
5 ((day5-day4)/day4)*100
6 ((day6-day5)/day5)*100
7 ((day7-day6)/day6)*100
8 ((day8-day7)/day7)*100
. .
. .
. .
365 ((day365-day364)/day364)*100
and then print out only those days where there was a percentage increase of >50% from the previous row.
Many thanks
You are looking for diff(). See its help page by typing ?diff. Here are the indices of days that fulfill your criterion:
> value <- c(21,23.4,10.7,5.6,3.2,35.2,12.9,67.8)
> which(diff(value)/head(value,-1)>0.5)+1
[1] 6 8
Use diff, dividing by the previous values (all but the last element):
pct <- 100 * diff(value) / value[-length(value)]
Here's one way:
dat <- data.frame(day = 1:10, value = 1:10)
dat2 <- transform(dat, value2 = c(value[1], diff(value) / head(value, -1) * 100))
day value value2
1 1 1 1.00000
2 2 2 100.00000
3 3 3 50.00000
4 4 4 33.33333
5 5 5 25.00000
6 6 6 20.00000
7 7 7 16.66667
8 8 8 14.28571
9 9 9 12.50000
10 10 10 11.11111
dat2[dat2$value2 > 50, ]
day value value2
2 2 2 100
You're looking for the diff function:
x<-c(3,1,4,1,5)
diff(x)
[1] -2 3 -3 4
Here is another way:
#dummy data
df <- read.table(text="day value
1 21
2 23.4
3 10.7
4 5.6
5 3.2
6 35.2
7 12.9
8 67.8", header=TRUE)
#get index for >50% change
x <- sapply(2:nrow(df), function(i)
  ((df$value[i] - df$value[i-1]) / df$value[i-1]) > 0.5)
#output
df[c(FALSE,x),]
# day value
#6 6 35.2
#8 8 67.8
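The sapply loop can also be vectorized with diff, giving the same subset:
#percentage change relative to the previous row, aligned with df's rows
pct <- c(NA, 100 * diff(df$value) / head(df$value, -1))
df[which(pct > 50), ]
# day value
#6 6 35.2
#8 8 67.8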

Creating quantiles

I have a data set of individuals with their socioeconomic scores, ranging from -6.3 to 3.5. Now I want to assign each individual to their quantiles based on their socioeconomic score.
I have a dataset named Healthdata with two columns: Healthdata$SSE, and Healthdata$ID.
Eventually, I would like to get a data frame matched by their SSE quantiles.
How can I do this in R?
Here's one approach:
# an example data set
set.seed(1)
Healthdata <- data.frame(SSE = rnorm(8), ID = gl(2, 4))
transform(Healthdata, quint = ave(SSE, ID, FUN = function(x) {
  quintiles <- quantile(x, seq(0, 1, .2))
  cuts <- cut(x, quintiles, include.lowest = TRUE)
  quintVal <- quintiles[match(cuts, levels(cuts)) + 1]
  return(quintVal)
}))
# SSE ID quint
# 1 -0.6264538 1 -0.4644344
# 2 0.1836433 1 0.7482983
# 3 -0.8356286 1 -0.7101237
# 4 1.5952808 1 1.5952808
# 5 0.3295078 2 0.3610920
# 6 -0.8204684 2 -0.1304827
# 7 0.4874291 2 0.5877873
# 8 0.7383247 2 0.7383247
A simple illustration of how it works:
values <- 1:10
# [1] 1 2 3 4 5 6 7 8 9 10
quintiles <- quantile(values, seq(0, 1, .2))
# 0% 20% 40% 60% 80% 100%
# 1.0 2.8 4.6 6.4 8.2 10.0
cuts <- cut(values, quintiles, include.lowest = TRUE)
# [1] [1,2.8] [1,2.8] (2.8,4.6] (2.8,4.6]
# [5] (4.6,6.4] (4.6,6.4] (6.4,8.2] (6.4,8.2]
# [9] (8.2,10] (8.2,10]
# 5 Levels: [1,2.8] (2.8,4.6] ... (8.2,10]
quintVal <- quintiles[match(cuts, levels(cuts)) + 1]
# 20% 20% 40% 40% 60% 60% 80% 80% 100% 100%
# 2.8 2.8 4.6 4.6 6.4 6.4 8.2 8.2 10.0 10.0
So let's start with a sample data set based on your description:
set.seed(315)
Healthdata <- data.frame(SSE = sample(-6.3:3.5, 21, replace = TRUE), ID = 1:21)
Which gives something like this:
> Healthdata[1:15,]
SSE ID
1 -0.3 1
2 -6.3 2
3 -1.3 3
4 -3.3 4
5 -5.3 5
6 -4.3 6
7 -4.3 7
8 0.7 8
9 -4.3 9
10 -4.3 10
11 -3.3 11
12 0.7 12
13 -2.3 13
14 -3.3 14
15 0.7 15
I understand that you want a new variable which identifies the quantile group (here quartiles, since quantile() defaults to quartile breaks) of the individual's socioeconomic status. I would do something like this:
transform(Healthdata, Q = cut(Healthdata$SSE,
                              breaks = quantile(Healthdata$SSE),
                              labels = c(1, 2, 3, 4),
                              include.lowest = TRUE))
To return:
SSE ID Q
1 -1.3 1 2
2 -6.3 2 1
3 -4.3 3 1
4 0.7 4 3
5 1.7 5 3
6 1.7 6 3
7 -5.3 7 1
8 1.7 8 3
9 2.7 9 4
10 -3.3 10 2
11 -1.3 11 2
12 -3.3 12 2
13 1.7 13 3
14 0.7 14 3
15 -4.3 15 1
If you want to see the upper and lower bounds for the quantile ranges, omit the labels = c(1, 2, 3, 4) to return this instead:
SSE ID Q
1 -1.3 1 (-4.3,-1.3]
2 -6.3 2 [-6.3,-4.3]
3 -4.3 3 [-6.3,-4.3]
4 0.7 4 (-1.3,1.7]
5 1.7 5 (-1.3,1.7]
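If you don't need the interval boundaries, a compact alternative (assuming the dplyr package is acceptable) is ntile, which assigns equal-sized rank-based groups directly; note this can differ slightly from quantile-based cut points when there are ties:
library(dplyr)
#1 = lowest quarter of SSE, 4 = highest
Healthdata$Q <- ntile(Healthdata$SSE, 4)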

calculate multiple columns mean in R and generate a new table

I have a data set in .csv. It contains multiple columns for example.
Group Wk1 Wk2 Wk3 Wk4 Wk5 Wk6
A 1 2 3 4 5 6
B 7 8 9 1 2 3
C 4 5 6 7 8 9
D 1 2 3 4 5 6
Then I want the mean of Wk1 & Wk2 combined, Wk3 alone, Wk4 & Wk5 combined, and Wk6 alone.
How can I do that?
The result might look like:
Group 1 2 3 4
mean 3.75 5.25 4.5 6
And how can I save it into a new table?
Thanks in advance.
You can melt your data.frame, create your groups using some basic indexing, and use aggregate:
library(reshape2)
X <- melt(mydf, id.vars="Group")
Match <- c(Wk1 = 1, Wk2 = 1, Wk3 = 2, Wk4 = 3, Wk5 = 3, Wk6 = 4)
aggregate(value ~ Match[X$variable], X, mean)
# Match[X$variable] value
# 1 1 3.75
# 2 2 5.25
# 3 3 4.50
# 4 4 6.00
tapply is also an appropriate candidate here:
tapply(X$value, Match[X$variable], mean)
# 1 2 3 4
# 3.75 5.25 4.50 6.00
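A dependency-free sketch (reusing the Match vector above; valid here because every column has the same number of rows, so the mean of the column means equals the pooled mean):
#base R: pool the per-week column means within each group
wk_means <- colMeans(mydf[-1])   # drop the Group column
tapply(wk_means, Match, mean)
#    1    2    3    4
# 3.75 5.25 4.50 6.00
Wrapping the result in as.data.frame() (or passing it to write.csv) gives you a new table to save.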
