R: Creating multiple resampled datasets based on multiple factors

I need to create multiple (several thousand) resampled datasets from a large database. I have three categorical variables: Site (S), Transect (T), and Quadrat (Q). The response variable is Value (V), which is the result of a particular S, T, and Q combination. Quadrats lie along each transect at each site. I pasted an abbreviated dataset below.
S T Q V
A 1 1 8
A 1 2 5
A 1 3 0
A 2 1 0
A 2 2 15
A 2 3 0
A 3 1 0
A 3 2 25
A 3 3 0
B 1 1 0
B 1 2 1
B 1 3 0
B 2 1 33
B 2 2 1
B 2 3 2
B 3 1 0
B 3 2 207
B 3 3 0
C 1 1 0
C 1 2 1
C 1 3 0
C 2 1 45
C 2 2 33
C 2 3 0
C 3 1 0
C 3 2 1
C 3 3 0
The idea would be that, for a given site, the resampled dataset would contain ## quads from each of transects 1 to n, where ## is the number of quadrats (Q) per transect (T) per site (S). I am not trying to resample the dataset based on S, T, & Q. I would like to be able to resample a user-defined number of rows, based on the conditions I define. For example, if I chose to resample based on 2 quadrats (Q) per transect (T) per site (S), I envision the resampled dataset looking like the example below.
S T Q V
A 1 1 8
A 1 3 0
A 2 1 0
A 2 2 15
A 3 2 25
A 3 3 0
B 1 2 1
B 1 3 0
B 2 2 1
B 2 3 2
B 3 1 0
B 3 2 207
C 1 1 0
C 1 3 0
C 2 1 45
C 2 3 0
C 3 2 1
C 3 3 0
Please let me know if that doesn't make sense and I'll revise until it does. Thanks for any assistance!

Consider by to slice dataframes by Site and Transect factors and then sample random rows:
set.seed(444)
quads <- 2
# BUILD LIST OF SUBSETTED RANDOM SAMPLED DATAFRAMES
df_list <- by(df, df[c("S", "T")], FUN=function(df) df[sample(nrow(df), quads),])
# STACK ALL DATAFRAMES INTO ONE FINAL DF
sample_df <- do.call(rbind, df_list)
# SORT DATAFRAME BY S AND T
sample_df <- with(sample_df, sample_df[order(S, T),])
# RESET ROW NAMES
row.names(sample_df) <- NULL
sample_df
# S T Q V
# 1 A 1 1 8
# 2 A 1 3 0
# 3 A 2 2 15
# 4 A 2 1 0
# 5 A 3 1 0
# 6 A 3 3 0
# 7 B 1 2 1
# 8 B 1 1 0
# 9 B 2 3 2
# 10 B 2 1 33
# 11 B 3 1 0
# 12 B 3 2 207
# 13 C 1 1 0
# 14 C 1 2 1
# 15 C 2 1 45
# 16 C 2 3 0
# 17 C 3 3 0
# 18 C 3 2 1
Data
txt = '
S T Q V
A 1 1 8
A 1 2 5
A 1 3 0
A 2 1 0
A 2 2 15
A 2 3 0
A 3 1 0
A 3 2 25
A 3 3 0
B 1 1 0
B 1 2 1
B 1 3 0
B 2 1 33
B 2 2 1
B 2 3 2
B 3 1 0
B 3 2 207
B 3 3 0
C 1 1 0
C 1 2 1
C 1 3 0
C 2 1 45
C 2 2 33
C 2 3 0
C 3 1 0
C 3 2 1
C 3 3 0'
df = read.table(text=txt, header=TRUE)
To build randomly generated dataframes, simply extend out quads and run it through lapply:
max_quads <- 3
quads <- replicate(1000, sample(1:max_quads, 1))
df_list <- lapply(quads, function(q) {
by_list <- by(df, df[c("S", "T")], FUN=function(df) df[sample(nrow(df), q),])
sample_df <- do.call(rbind, by_list)
sample_df <- with(sample_df, sample_df[order(S, T),])
row.names(sample_df) <- NULL
return(sample_df)
})
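If the number of quadrats per transect should stay fixed across all resamples (as in the question's 2-per-transect example), one possible variation wraps the sampling step in a function and calls it through replicate. This is only a sketch: resample_quads, n_rep, and the toy data are names made up for illustration, and sampling is without replacement, so every S x T group is assumed to contain at least quads rows.

```r
# Toy data with the same layout as the question (3 sites x 3 transects x 3 quadrats);
# V holds placeholder values, not the asker's measurements
df <- expand.grid(Q = 1:3, T = 1:3, S = c("A", "B", "C"))[, c("S", "T", "Q")]
df$V <- seq_len(nrow(df))

# Hypothetical helper: draw `quads` rows without replacement from every S x T group
resample_quads <- function(data, quads) {
  groups <- split(data, list(data$S, data$T), drop = TRUE)
  out <- do.call(rbind, lapply(groups, function(g) g[sample(nrow(g), quads), ]))
  out <- out[order(out$S, out$T, out$Q), ]
  row.names(out) <- NULL
  out
}

set.seed(444)
n_rep <- 1000
# simplify = FALSE keeps the result as a list of data frames
samples <- replicate(n_rep, resample_quads(df, quads = 2), simplify = FALSE)
```

Each element of samples is one resampled data frame, so samples[[1]] can be inspected directly, or the whole list can be passed to lapply for further summaries.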

Related

Shift (Complete) Specific Rows Left in R

I pulled a data.frame from the internet and need to shift specific rows (5 of 168) completely to the left by one column. I thought it best to append a column to the front of the data.frame and move the rows over, but I have been unsuccessful. For example, I need something like this:
a b c d e >>> a b c d e
0 1 2 3 4 0 1 2 3 4
0 0 1 2 3 0 1 2 3 NA
0 1 2 3 4 0 1 2 3 4
If you know which rows you want to shift, you can replace the first value(s) of these rows with NA, and then use hacksaw::shift_row_values.
library(hacksaw)
data[2, "a"] <- NA
data %>%
shift_row_values(at = 2)
a b c d e
1 0 1 2 3 4
2 0 1 2 3 NA
3 0 1 2 3 4
Data:
data <- read.table(header = T, text = "
a b c d e
0 1 2 3 4
0 0 1 2 3
0 1 2 3 4 ")
Another possible solution, based on base R:
rows <- 2:3
df[rows,] <- cbind(df[rows, -1], NA)
df
#> a b c d e
#> 1 0 1 2 3 4
#> 2 0 1 2 3 NA
#> 3 1 2 3 4 NA
You can replace a row with an offset part plus NA like this:
dat[2,] <- c(dat[2, 2:5], NA)
Data:
dat <- read.table(text="
a b c d e
0 1 2 3 4
0 0 1 2 3
0 1 2 3 4",
header=TRUE)
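For several rows at once (the question mentions 5 of 168), the same idea can be wrapped in a small function. The sketch below is one possible base R version; shift_rows_left is a made-up helper name, and it assumes every shifted row should lose its first value and receive NA in the last column.

```r
dat <- read.table(text = "
a b c d e
0 1 2 3 4
0 0 1 2 3
0 1 2 3 4",
header = TRUE)

# Hypothetical helper: shift the selected rows one column to the left
shift_rows_left <- function(data, rows) {
  k <- ncol(data)
  # Copy column j + 1 into column j, working left to right so that each
  # source column is still unmodified when it is read
  for (j in seq_len(k - 1)) {
    data[rows, j] <- data[rows, j + 1]
  }
  data[rows, k] <- NA   # the last column has nothing to receive
  data
}

out <- shift_rows_left(dat, rows = 2)
out
```

Passing a vector such as rows = c(2, 17, 40) would shift all of those rows in one call; rows that are not listed are left as they are.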

Looping through specified columns in a Matrix and replacing their values by subtracting the value from 4

I am new(ish) to R and I am still unsure about loops.
Suppose I have a large matrix object in R whose columns hold values from 0 to 4, and I would like to invert these values for specified columns.
I would use the code:
b[, "AX1"] <- 4 - b[, "AX1"]
Where b is a Matrix extracted from a larger list object and AX1 would be a column in the matrix.
I would then replace the changed Matrix back into its list using the code:
DF1$geno[[1]]$data <- b
How would I loop this code through a list of column names (AX1, AX10, AX42, ...) for about 30 of the 5000 columns in the matrix, given a list with the 30 column names to be inverted?
The simplest way you can do it (assuming that you always transform it the way x = 4 - x) is to expand your approach to the list of columns:
# Create an example dataset
set.seed(68859457)
(
dat <- matrix(
data = sample(x = 0:4, size = 100, replace = TRUE),
nrow = 10,
dimnames = list(1:10, paste('AX', 1:10, sep = ''))
)
)
# AX1 AX2 AX3 AX4 AX5 AX6 AX7 AX8 AX9 AX10
# 1 2 1 2 3 2 2 3 1 0 4
# 2 4 3 4 4 0 1 3 1 3 4
# 3 3 0 3 4 2 2 4 1 2 1
# 4 2 2 0 2 4 2 2 1 1 0
# 5 4 4 4 3 3 1 0 3 2 2
# 6 2 1 1 0 3 3 4 4 1 0
# 7 2 3 1 3 3 1 0 1 0 4
# 8 2 2 1 1 0 3 1 3 2 1
# 9 3 1 4 1 2 1 0 0 4 1
# 10 4 3 2 4 1 0 2 0 3 2
# Create a list of columns you want to modify
set.seed(68859457)
(
cols_to_invert <- sort(sample(x = colnames(dat), size = 5))
)
# [1] "AX3" "AX4" "AX5" "AX6" "AX9"
# Use the list of columns created above to modify matrix in place
dat[, cols_to_invert] <- 4 - dat[, cols_to_invert]
# See the result
dat
# AX1 AX2 AX3 AX4 AX5 AX6 AX7 AX8 AX9 AX10
# 1 2 1 2 1 2 2 3 1 4 4
# 2 4 3 0 0 4 3 3 1 1 4
# 3 3 0 1 0 2 2 4 1 2 1
# 4 2 2 4 2 0 2 2 1 3 0
# 5 4 4 0 1 1 3 0 3 2 2
# 6 2 1 3 4 1 1 4 4 3 0
# 7 2 3 3 1 1 3 0 1 4 4
# 8 2 2 3 3 4 1 1 3 2 1
# 9 3 1 0 3 2 3 0 0 0 1
# 10 4 3 2 0 3 4 2 0 1 2
It is difficult to tell without knowing the exact structure of the data, but based on your explanation and attempt, maybe this will help.
cols <- c('AX1', 'AX10', 'AX42')
DF1$geno <- lapply(DF1$geno, function(x) {
x$data[, cols] <- 4 - x$data[, cols]
x
})
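A self-contained way to check this kind of per-element update is to run it on a small made-up object. In the sketch below, the structure of DF1 (a list whose geno element holds lists that each carry a data matrix) is only an assumption based on the question's DF1$geno[[1]]$data, and make_geno and the column names are invented for illustration. Only the selected columns are assigned, so the other columns and the matrix dimensions stay intact.

```r
# Hypothetical stand-in for the asker's object
set.seed(1)
make_geno <- function() {
  list(data = matrix(sample(0:4, 20, replace = TRUE), nrow = 4,
                     dimnames = list(NULL, paste0("AX", 1:5))))
}
DF1 <- list(geno = list(make_geno(), make_geno()))
before <- DF1$geno[[1]]$data   # copy kept for comparison

cols <- c("AX1", "AX4")        # columns to invert (names assumed)

DF1$geno <- lapply(DF1$geno, function(x) {
  # Assign into the selected columns only; the rest of the matrix is untouched
  x$data[, cols] <- 4 - x$data[, cols]
  x
})

DF1$geno[[1]]$data
```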

identifying rows having common values in two columns

How can I identify rows whose values in two columns (here: treatment, replicate) also occur together in at least one other row?
set.seed(0)
x <- rep(1:10, 4)
y <- sample(c(rep(1:10, 2)+rnorm(20)/5, rep(6:15, 2) + rnorm(20)/5))
treatment <- sample(gl(8, 5, 40, labels=letters[1:8]))
replicate <- sample(gl(8, 5, 40))
d <- data.frame(x=x, y=y, treatment=treatment, replicate=replicate)
table(d$treatment, d$replicate)
# 1 2 3 4 5 6 7 8
# a 1 0 0 1 1 2 0 0
# b 1 1 0 0 1 2 0 0
# c 0 0 0 0 2 0 1 2
# d 2 0 1 1 0 0 1 0
# e 0 2 1 1 0 0 0 1
# f 0 1 1 0 1 1 1 0
# g 0 1 0 2 0 0 1 1
# h 1 0 2 0 0 0 1 1
From the above output, my guess is that the output should contain 16 rows. Any idea how to achieve this?
Update:
library(dplyr)
d %>% group_by(treatment, replicate) %>% filter(n() > 1)
# A tibble: 16 x 4
x y treatment replicate
<int> <dbl> <fctr> <fctr>
1 2 7.050445 h 3
2 5 1.840198 b 6
3 8 9.160838 d 1
4 9 4.254486 h 3
5 2 8.870106 g 4
6 4 7.821616 a 6
7 6 9.752492 e 2
8 7 9.988579 c 5
9 9 10.480931 c 8
10 1 2.770469 c 8
11 2 7.913338 e 2
12 3 13.743080 d 1
13 9 5.692010 b 6
14 10 11.100722 a 6
15 3 12.198432 g 4
16 5 5.955146 c 5
I have identified one approach where the results seem to satisfy the condition. Any other better solutions?
You can use duplicated as a condition:
dups <- d[which(duplicated(d[,c("treatment", "replicate")]) |
duplicated(d[ ,c("treatment", "replicate")], fromLast = TRUE)),]
>dups
x y treatment replicate
2 2 7.050445 h 3
5 5 1.840198 b 6
8 8 9.160838 d 1
9 9 4.254486 h 3
12 2 8.870106 g 4
14 4 7.821616 a 6
16 6 9.752492 e 2
17 7 9.988579 c 5
19 9 10.480931 c 8
21 1 2.770469 c 8
22 2 7.913338 e 2
23 3 13.743080 d 1
29 9 5.692010 b 6
30 10 11.100722 a 6
33 3 12.198432 g 4
35 5 5.955146 c 5
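Another base R option, offered here only as a sketch, is to count the size of each treatment/replicate group with ave and keep the rows whose group has more than one member. The small data frame below is made up for illustration and is not the asker's data; group_size is an illustrative name.

```r
# Toy data: only rows 1 and 2 share the same treatment and replicate
d <- data.frame(
  x = 1:6,
  treatment = c("a", "a", "b", "b", "c", "a"),
  replicate = c(1, 1, 1, 2, 3, 2)
)

# Size of each treatment x replicate group, repeated for every row
group_size <- ave(seq_len(nrow(d)), d$treatment, d$replicate, FUN = length)

dups <- d[group_size > 1, ]
dups
```

On this toy data the selection is the same as the one produced by the two duplicated calls; the group sizes are also available afterwards if a threshold other than 1 is needed.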

expand data.frame to long format and increment value

I would like to convert my data from a short format to a long format and I imagine there is a simple way to do it (possibly with reshape2, plyr, dplyr, etc?).
For example, I have:
foo <- data.frame(id = 1:5,
y = c(0, 1, 0, 1, 0),
time = c(2, 3, 4, 2, 3))
id y time
1 0 2
2 1 3
3 0 4
4 1 2
5 0 3
I would like to expand/copy each row n times, where n is that row's value in the "time" column. However, I would also like the variable "time" to be incremented from 1 to n. That is, I would like to produce:
id y time
1 0 1
1 0 2
2 1 1
2 1 2
2 1 3
3 0 1
3 0 2
3 0 3
3 0 4
4 1 1
4 1 2
5 0 1
5 0 2
5 0 3
As a bonus, I would also like to do a sort of incrementing of the variable "y" where, for those ids with y = 1, y is set to 0 until the largest value of "time". That is, I would like to produce:
id y time
1 0 1
1 0 2
2 0 1
2 0 2
2 1 3
3 0 1
3 0 2
3 0 3
3 0 4
4 0 1
4 1 2
5 0 1
5 0 2
5 0 3
This seems like something that dplyr might already do, but I just don't know where to look. Regardless, any solution that avoids loops is helpful.
You can create a new data frame with the proper id and time columns for the long format, then merge that with the original. This leaves NA for the unmatched values, which can then be substituted with 0:
merge(foo,
with(foo,
data.frame(id=rep(id,time), time=sequence(time))
),
all.y=TRUE
)
## id time y
## 1 1 1 NA
## 2 1 2 0
## 3 2 1 NA
## 4 2 2 NA
## 5 2 3 1
## 6 3 1 NA
## 7 3 2 NA
## 8 3 3 NA
## 9 3 4 0
## 10 4 1 NA
## 11 4 2 1
## 12 5 1 NA
## 13 5 2 NA
## 14 5 3 0
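The substitution step mentioned above is not shown in the output, so here is a minimal sketch of it. The name long is chosen for illustration, and the explicit ordering is only there to make the row order independent of how merge sorts.

```r
foo <- data.frame(id = 1:5,
                  y = c(0, 1, 0, 1, 0),
                  time = c(2, 3, 4, 2, 3))

# Long-format skeleton merged with the original; unmatched rows get NA in y
long <- merge(foo,
              with(foo, data.frame(id = rep(id, time), time = sequence(time))),
              all.y = TRUE)
long <- long[order(long$id, long$time), ]

# Replace the NAs introduced by the merge with 0
long$y[is.na(long$y)] <- 0
long
```

After the replacement, y is 1 only in the last row of ids 2 and 4, which is the layout requested in the bonus part of the question.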
A similar merge works for the first expansion. Merge foo without the time column with the same created data frame as above:
merge(foo[c('id','y')],
with(foo,
data.frame(id=rep(id,time), time=sequence(time))
)
)
## id y time
## 1 1 0 1
## 2 1 0 2
## 3 2 1 1
## 4 2 1 2
## 5 2 1 3
## 6 3 0 1
## 7 3 0 2
## 8 3 0 3
## 9 3 0 4
## 10 4 1 1
## 11 4 1 2
## 12 5 0 1
## 13 5 0 2
## 14 5 0 3
It's not necessary to specify all (or all.y) in the latter expression because there are multiple time values for each matching id value, and these are expanded. In the prior case, the time values were matched from both data frames, and without specifying all (or all.y) you would get your original data back.
The initial expansion can be achieved with:
newdat <- transform(
foo[rep(rownames(foo),foo$time),],
time = sequence(foo$time)
)
# id y time
#1 1 0 1
#1.1 1 0 2
#2 2 1 1
#2.1 2 1 2
#2.2 2 1 3
# etc
To get the complete solution, including the bonus part, then do:
newdat$y[-cumsum(foo$time)] <- 0
# id y time
#1 1 0 1
#1.1 1 0 2
#2 2 0 1
#2.1 2 0 2
#2.2 2 1 3
#etc
If you were really excitable, you could do it all in one step using within:
within(
foo[rep(rownames(foo),foo$time),],
{
time <- sequence(foo$time)
y[-cumsum(foo$time)] <- 0
}
)
If you're willing to go with "data.table", you can try:
library(data.table)
fooDT <- as.data.table(foo)
fooDT[, list(time = sequence(time)), by = list(id, y)]
# id y time
# 1: 1 0 1
# 2: 1 0 2
# 3: 2 1 1
# 4: 2 1 2
# 5: 2 1 3
# 6: 3 0 1
# 7: 3 0 2
# 8: 3 0 3
# 9: 3 0 4
# 10: 4 1 1
# 11: 4 1 2
# 12: 5 0 1
# 13: 5 0 2
# 14: 5 0 3
And, for the bonus question:
fooDT[, list(time = sequence(time)),
by = list(id, y)][, y := {y[1:(.N-1)] <- 0; y},
by = id][]
# id y time
# 1: 1 0 1
# 2: 1 0 2
# 3: 2 0 1
# 4: 2 0 2
# 5: 2 1 3
# 6: 3 0 1
# 7: 3 0 2
# 8: 3 0 3
# 9: 3 0 4
# 10: 4 0 1
# 11: 4 1 2
# 12: 5 0 1
# 13: 5 0 2
# 14: 5 0 3
For the bonus question, alternatively:
fooDT[, list(time=seq_len(time)), by=list(id,y)][y == 1,
y := c(rep.int(0, .N-1L), 1), by=id][]
With dplyr (and magrittr for nice legibility):
library(magrittr)
library(dplyr)
foo[rep(1:nrow(foo), foo$time), ] %>%
group_by(id) %>%
mutate(y = y * (!duplicated(y, fromLast = TRUE)),
time = 1:n())
Hope it helps

Conditionally delete columns in R

I know how to delete columns in R, but I am not sure how to delete them based on the following set of conditions.
Suppose a data frame such as:
DF <- data.frame(L = c(2,4,5,1,NA,4,5,6,4,3), J= c(3,4,5,6,NA,3,6,4,3,6), K= c(0,1,1,0,NA,1,1,1,1,1),D = c(1,1,1,1,NA,1,1,1,1,1))
DF
L J K D
1 2 3 0 1
2 4 4 1 1
3 5 5 1 1
4 1 6 0 1
5 NA NA NA NA
6 4 3 1 1
7 5 6 1 1
8 6 4 1 1
9 4 3 1 1
10 3 6 1 1
The data frame has to be set up in this fashion. Column K corresponds to column L, and column D corresponds to column J. Because the values in column D are all equal to one, I would like to delete column D and the corresponding column J, yielding a data frame that looks like:
DF
L K
1 2 0
2 4 1
3 5 1
4 1 0
5 NA NA
6 4 1
7 5 1
8 6 1
9 4 1
10 3 1
I know there has got to be a simple command to do so; I just can't think of one. And if it makes any difference, the NAs must be retained.
Additional helpful information: in my real data frame there are a total of 20 columns, so there are 10 columns like L and J and another 10 like K and D. I need a function that can recognize the correspondence between these two groups and delete columns accordingly if necessary.
Thank you in advance!
Okay, assuming the column-number based correspondence, here is an example:
> n <- 10
>
> # sample data
> d <- data.frame(lapply(1:n, function(x)sample(n)), lapply(1:n, function(x)sample(2, n, T, c(0.1, 0.9))-1))
> names(d) <- c(LETTERS[1:n], letters[1:n])
> head(d)
A B C D E F G H I J a b c d e f g h i j
1 5 5 2 7 4 3 4 3 5 8 0 1 1 1 1 1 1 1 1 1
2 9 8 4 6 7 8 8 2 10 5 1 1 1 1 1 1 1 1 1 1
3 6 6 10 3 5 6 2 1 8 6 1 1 1 1 1 1 1 1 1 1
4 1 7 5 5 1 10 10 4 2 4 1 1 1 1 1 1 1 1 1 1
5 10 9 6 2 9 5 6 9 9 9 1 1 0 1 1 1 1 1 1 1
6 2 1 1 4 6 1 5 8 4 10 1 1 1 1 1 1 1 1 1 1
>
> # find the columns that should be kept.
> idx <- which(colMeans(d[(n+1):(2*n)], na.rm = TRUE) != 1)
>
> # filter the data
> d[, c(idx, idx+n)]
A B C D F a b c d f
1 5 5 2 7 3 0 1 1 1 1
2 9 8 4 6 8 1 1 1 1 1
3 6 6 10 3 6 1 1 1 1 1
4 1 7 5 5 10 1 1 1 1 1
5 10 9 6 2 5 1 1 0 1 1
6 2 1 1 4 1 1 1 1 1 1
7 8 4 7 10 2 1 1 1 1 0
8 7 3 9 9 4 1 0 1 0 1
9 3 10 3 1 9 1 1 0 1 1
10 4 2 8 8 7 1 0 1 1 1
I basically agree with koshke (whose SO work is excellent), but would suggest that the test to use is colSums(d[(n+1):(2*n)], na.rm=TRUE) == NROW(d) , since a paired 0 and 2 or -1 and 3 could throw off the colMeans test.
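One thing to keep in mind with the asker's data is the all-NA row: with na.rm = TRUE the sum of column D is 9 while the row count is 10, so a comparison against NROW(d) would not flag it. A possible alternative, sketched below under the assumption that the value columns come first and the indicator columns second in matching order, tests each indicator column directly while ignoring NAs. The names ind, all_ones, keep, and result are illustrative.

```r
DF <- data.frame(L = c(2, 4, 5, 1, NA, 4, 5, 6, 4, 3),
                 J = c(3, 4, 5, 6, NA, 3, 6, 4, 3, 6),
                 K = c(0, 1, 1, 0, NA, 1, 1, 1, 1, 1),
                 D = c(1, 1, 1, 1, NA, 1, 1, 1, 1, 1))

n <- ncol(DF) / 2              # assumed layout: n value columns, then n indicators
ind <- DF[(n + 1):(2 * n)]

# TRUE where every non-missing indicator value equals 1
all_ones <- vapply(ind, function(col) all(col == 1, na.rm = TRUE), logical(1))

keep <- which(!all_ones)       # positions of the pairs to retain
result <- DF[, c(keep, keep + n), drop = FALSE]
result
```

On this data only the L/K pair survives and the NA row is retained. Note that an indicator column consisting entirely of NA would also count as all ones under this test, which may or may not be what is wanted.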

Resources