Subsampling without replacement in R [duplicate]

This question already has an answer here:
Sample without replacement, or duplicates, in R
(1 answer)
Closed 1 year ago.
There's a very similar question to mine on stack, but that doesn't directly answer my question.
I have abundance data for 250 species across 1000 sites. Species are columns, sites are rows. My abundances data look something like the data in the linked post above.
0 0 3 0 0 201 0 0 0 82
0 23 5 0 0 0 0 0 0 0
9 0 0 0 0 12 0 0 0 913
0 7 91 0 8 0 0 92 9 0
131 12 0 410 0 0 0 3 0 0
If I wanted to sample 50 individuals from each site, without replacement, how can I do this?
Focusing on code for single sites for now.
This code:
samples <- sample(1:ncol(abundances), 50, rep=FALSE, prob=abundances[1,])
doesn't work unless I change to rep=TRUE. However, I need sampling WITHOUT replacement.
I don't want to use sample(abundances[1,], 50, rep=FALSE) because then instead of sampling individuals, it samples species and will report the whole value in that row (i.e. species 6 may occur 201 times at site 1, it'll report 201, rather than 1 individual from that species, resulting in >50 individuals in final subsample).
I essentially want an output identical to what user Dinre answered in post above, but without it being for bootstrapping. I just want to sample without replacement. This process will ultimately be integrated into a for loop for a subsample from each site.

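Here is a way to sample vector elements from each row so that each sampled row sums approximately to a chosen integer, the size. In the code below, n <- 5 is passed as the function argument size. The call to runif adds an element of randomness to the sampling function. Because the rescaled values are rounded, a row total can differ slightly from size (row 4 of the output below sums to 4, not 5), and because abundances are rescaled rather than individuals drawn, a sampled count can exceed the original abundance when size is large relative to the site total.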
fun <- function(x, size){
  x <- x*runif(length(x))
  y <- size*x/sum(x)
  round(y)
}
set.seed(2021)
n <- 5
t(apply(df1, 1, fun, size = n))
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
#[1,] 0 0 0 0 0 3 0 0 0 2
#[2,] 0 4 1 0 0 0 0 0 0 0
#[3,] 0 0 0 0 0 0 0 0 0 5
#[4,] 0 0 3 0 0 0 0 1 0 0
#[5,] 0 0 0 5 0 0 0 0 0 0
Data
Here is the question's data in dput format.
df1 <-
structure(c(0L, 0L, 9L, 0L, 131L, 0L, 23L, 0L, 7L, 12L, 3L, 5L,
0L, 91L, 0L, 0L, 0L, 0L, 0L, 410L, 0L, 0L, 0L, 8L, 0L, 201L,
0L, 12L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 92L, 3L, 0L,
0L, 0L, 9L, 0L, 82L, 0L, 913L, 0L, 0L), .Dim = c(5L, 10L), .Dimnames = list(
NULL, c("V1", "V2", "V3", "V4", "V5", "V6", "V7", "V8", "V9", "V10")))
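
For comparison, here is a minimal sketch of drawing individuals strictly without replacement, which is what the question asks for. The helper name subsample_site and the two-row abundances matrix are assumptions made for illustration; they are not part of the answer above. The idea is to expand each site into one entry per individual, sample from that, and count the result per species.

```r
# Hypothetical helper (not from the original answer): draw `size` individuals
# from one site without replacement and return per-species counts.
subsample_site <- function(x, size) {
  # One entry per individual, labelled by its species (column) index
  individuals <- rep(seq_along(x), times = x)
  # Sampling without replacement needs at least `size` individuals
  stopifnot(size <= length(individuals))
  # Sample positions rather than values, so a site with a single
  # individual is not misread by sample() as the range 1:n
  picked <- individuals[sample.int(length(individuals), size)]
  # Count how many sampled individuals belong to each species
  tabulate(picked, nbins = length(x))
}

# First and third rows of the question's example data
abundances <- rbind(
  c(0, 0, 3, 0, 0, 201, 0, 0, 0, 82),
  c(9, 0, 0, 0, 0, 12, 0, 0, 0, 913)
)

set.seed(1)
sub <- t(apply(abundances, 1, subsample_site, size = 50))
rowSums(sub)  # every site contributes exactly 50 individuals
```

Sites with fewer individuals than size (the second row of the question's data holds only 28) cannot be subsampled this way, so the sketch stops with an error rather than silently returning fewer individuals.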

Related

R: Group by one column, and return the first row that has a value greater than 0 in any of the other columns and then return all rows after this row

I'm new to R programming and hope someone could help me with the situation below:
I have a dataframe shown in the picture (Original Dataframe), I would like to return the first record grouped by the [ID] column that has a value >= 1 in any of the four columns (A, B, C, or D) and all the records after based off the [Date] column (the desired dataframe should look like the Output Dataframe shown in the picture). Basically, remove all the records highlighted in yellow. I would appreciate greatly if you can provide the R code to achieve this.
structure(list(ID = c(101L, 101L, 101L, 101L, 101L, 101L, 103L,
103L, 103L, 103L), Date = c(43338L, 43306L, 43232L, 43268L, 43183L,
43144L, 43310L, 43246L, 43264L, 43209L), A = c(0L, 0L, 0L, 0L,
0L, 0L, 0L, 1L, 0L, 0L), B = c(0L, 2L, 0L, 0L, 0L, 0L, 0L, 1L,
0L, 0L), C = c(0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), D = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)), .Names = c("ID", "Date",
"A", "B", "C", "D"), row.names = c(NA, -10L), class = c("data.table",
"data.frame"))
Here is a solution, starting from data that look like this:
ID Date A B C D
1 101 26.08.2018 0 0 0 0
2 101 25.07.2018 0 2 0 0
3 101 12.05.2018 0 0 1 0
4 101 17.06.2018 0 0 0 0
5 101 24.03.2018 0 0 0 0
6 101 13.02.2018 0 0 0 0
7 103 29.07.2018 0 0 0 0
8 103 26.05.2018 1 1 0 0
9 103 13.06.2018 0 0 0 0
10 103 19.04.2018 0 0 0 0
data$Check <- rowSums(data[3:6])              # row total of columns A to D
data$Date <- as.Date(data$Date, "%d.%m.%Y")
data <- data[order(data$ID, data$Date), ]     # sort by ID, then by date
id <- unique(data$ID)
for (i in seq_along(id)) {
  data_sample <- data[data$ID == id[i], ]
  # keep everything from the first row with a non-zero total onwards
  data_sample <- data_sample[min(which(data_sample$Check > 0)):nrow(data_sample), ]
  if (i == 1) {
    final <- data_sample
  } else {
    final <- rbind(final, data_sample)
  }
}
final <- final[, -7]                          # drop the helper column Check
ID Date A B C D
3 101 2018-05-12 0 0 1 0
4 101 2018-06-17 0 0 0 0
2 101 2018-07-25 0 2 0 0
1 101 2018-08-26 0 0 0 0
8 103 2018-05-26 1 1 0 0
9 103 2018-06-13 0 0 0 0
7 103 2018-07-29 0 0 0 0
Here's a tidyverse solution. The filter condition deserves some explanation:
First, we sort by ID and Date and group_by ID.
Then, for each ID (since we're grouped by ID), we apply the filter condition:
- Test, for each row, whether any of the variables are > 0.
- Get the row numbers of all rows (in the group) where this is the case.
- Find the lowest one (since rows are sorted by Date, this will be the earliest).
- Get the value of Date for that row.
- Then keep the rows where Date is >= this value.
Since we're still grouping by ID, all these calculations will happen separately for each group:
library(dplyr)
df %>%
arrange(ID, Date) %>%
group_by(ID) %>%
filter(Date >= Date[min(which(A > 0 | B > 0 | C > 0 | D > 0))])
# A tibble: 7 x 6
# Groups: ID [2]
ID Date A B C D
<int> <int> <int> <int> <int> <int>
1 101 43232 0 0 1 0
2 101 43268 0 0 0 0
3 101 43306 0 2 0 0
4 101 43338 0 0 0 0
5 103 43246 1 1 0 0
6 103 43264 0 0 0 0
7 103 43310 0 0 0 0
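
The same idea can be sketched in base R with a running maximum per group. This is only one possible variant, not taken from the answers above: the names hit, keep and result are illustrative, and the data frame is rebuilt here from the question's dput output. One difference worth noting is that a group in which no row has a positive value makes min(which(...)) return Inf with a warning, whereas the running-maximum flag simply drops such a group.

```r
# The question's data, rebuilt as a plain data frame
df <- data.frame(
  ID   = c(101, 101, 101, 101, 101, 101, 103, 103, 103, 103),
  Date = c(43338, 43306, 43232, 43268, 43183, 43144, 43310, 43246, 43264, 43209),
  A    = c(0, 0, 0, 0, 0, 0, 0, 1, 0, 0),
  B    = c(0, 2, 0, 0, 0, 0, 0, 1, 0, 0),
  C    = c(0, 0, 1, 0, 0, 0, 0, 0, 0, 0),
  D    = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
)

# Sort so that "first" means "earliest date within each ID"
df <- df[order(df$ID, df$Date), ]

# TRUE for rows where at least one of A-D is positive
hit <- rowSums(df[c("A", "B", "C", "D")] > 0) > 0

# Running maximum within each ID: 0 before the first hit, 1 from then on
keep <- ave(as.integer(hit), df$ID, FUN = cummax) == 1

result <- df[keep, ]
result
```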

Delete the columns which contain only NA or 0 or both values [duplicate]

This question already has an answer here:
Remove the columns with the colsums=0
(1 answer)
Closed 7 years ago.
What should I do if I want to remove the columns which contain only 0, only NA, or both?
mat <- structure(c(0L, 0L, 1L, 0L, 2L, 0L, 0L, 0L, NA, 0L, 0L, 0L, 0L,
0L, 2L, 0L, 0L, NA, 0L, 0L, 0L, 1L, 0L, 0L, NA, 0L, NA, 0L, 0L,
0L, 0L, NA, 0L, 2L, 0L, 0L), .Dim = c(6L, 6L), .Dimnames = list(
c("A05363", "A05370", "A05380", "A05397", "A05400", "A05426"), c("X1.110590170", "X1.110888172", "X1.110906406", "X1.110993854", "X1.110996710", "X1.111144756")))
My output should be like this:
X1.110590170 X1.110906406 X1.110993854 X1.111144756
A05363 0 0 0 0
A05370 0 0 0 NA
A05380 1 2 0 0
A05397 0 0 1 2
A05400 2 0 0 0
A05426 0 NA 0 0
You can filter out the columns using the apply function along the columns.
You will simply have to use the all function to make sure that all values in the column satisfy the logic: is.na(x) | x == 0.
filter_cols <- apply(mat, 2, function(x) !all(is.na(x) | x == 0))
mat[,filter_cols]
#' X1.110590170 X1.110906406 X1.110993854 X1.111144756
#' A05363 0 0 0 0
#' A05370 0 0 0 NA
#' A05380 1 2 0 0
#' A05397 0 0 1 2
#' A05400 2 0 0 0
#' A05426 0 NA 0 0
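
A vectorised variant of the same test, sketched under the assumption that the data are a numeric matrix: count the non-zero, non-NA entries per column with colSums and keep the columns where that count is positive. The small matrix below reuses the first three columns of the question's data with made-up column names (keep1, drop, keep2).

```r
# First three columns of the question's matrix, with illustrative names
mat <- matrix(
  c(0, 0, 1, 0, 2, 0,
    0, 0, NA, 0, 0, 0,
    0, 0, 2, 0, 0, NA),
  nrow = 6,
  dimnames = list(NULL, c("keep1", "drop", "keep2"))
)

# mat != 0 is NA where mat is NA; na.rm = TRUE ignores those cells,
# so the sum counts only entries that are neither 0 nor NA
keep <- colSums(mat != 0, na.rm = TRUE) > 0

# drop = FALSE keeps the result a matrix even if one column survives
res <- mat[, keep, drop = FALSE]
res
```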

Count Number of Pairwise Differences of a Matrix in R

I have the following matrix:
0 1 0 0 0 1 0 0 # Row A
0 1 0 0 0 0 1 0 # Row B
0 1 0 0 0 0 0 0 # Row C
0 0 1 0 0 0 0 0 # Row D
I want to make a new matrix that shows the pairwise difference between each row (e.g., between rows A and B, there are 2 columns that are different, so the entry in the matrix corresponding to A and B is 2). Like this:
A B C D
A - 2 1 3
B - - 1 3
C - - - 2
D - - - -
The matrix isn't absolutely necessary. It's just an intermediary step for what I really want to do: count the number of pairwise differences between each row in the original matrix like so...
(2+1+3+1+3+2) = 12
You could try combn
v1 <- combn(1:nrow(m1), 2, FUN=function(x) sum(m1[x[1],]!= m1[x[2],]))
v1
#[1] 2 1 3 1 3 2
sum(v1)
#[1] 12
If you need a matrix output
m2 <- outer(1:nrow(m1), 1:nrow(m1), FUN=Vectorize(function(x,y)
sum(m1[x,]!=m1[y,])))
dimnames(m2) <- rep(list(LETTERS[1:4]),2)
m2[lower.tri(m2)] <- 0
m2
# A B C D
#A 0 2 1 3
#B 0 0 1 3
#C 0 0 0 2
#D 0 0 0 0
data
m1 <- structure(c(0L, 0L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L,
0L, 0L, 0L), .Dim = c(4L, 8L))
I think that this function could help you to count the differences
count.diff <- function(mat) {
  Nrow <- nrow(mat)
  count <- 0
  for (i in 1:(Nrow-1)) count <- count + sum(t(t(mat[-(1:i),])!=mat[i,]))
  count
}
mat <- matrix(rbinom(n=24,size=1,prob=0.7), ncol=4)
mat
count.diff(mat)
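
For a matrix that contains only 0s and 1s, as in the question, another option is dist() with the Manhattan metric: the sum of absolute differences between two 0/1 rows equals the number of positions where they differ. A short sketch, assuming binary data (the row labels A to D are added here for readability):

```r
# The question's matrix, with rows labelled A-D
m1 <- rbind(
  A = c(0, 1, 0, 0, 0, 1, 0, 0),
  B = c(0, 1, 0, 0, 0, 0, 1, 0),
  C = c(0, 1, 0, 0, 0, 0, 0, 0),
  D = c(0, 0, 1, 0, 0, 0, 0, 0)
)

# Manhattan distance between 0/1 rows = number of differing columns
d <- dist(m1, method = "manhattan")

total <- sum(d)   # total number of pairwise differences
as.matrix(d)      # full symmetric matrix of pairwise counts
```

This shortcut does not carry over to non-binary data, where the Manhattan distance measures the size of the differences rather than their number.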

R: Combine rows in same data.frame [duplicate]

This question already has answers here:
How to sum a variable by group
(18 answers)
Closed 6 years ago.
I have a simple R problem, but I just can't find the answer.
I have a dataframe like this:
A 1 0 0 0 0 0
B 0 1 0 0 0 0
B 0 0 1 0 0 1
B 0 0 0 0 1 0
C 1 0 0 0 0 0
C 0 0 0 1 1 0
And I want it to be just like this:
A 1 0 0 0 0 0
B 0 1 1 0 1 1
C 1 0 0 1 1 0
Thank you very much!
Regards Lisanne
Here's one possibility using tapply:
cbind(unique(dat[1]), do.call(rbind, tapply(dat[-1], dat[[1]], colSums)))
# V1 V2 V3 V4 V5 V6 V7
# 1 A 1 0 0 0 0 0
# 2 B 0 1 1 0 1 1
# 5 C 1 0 0 1 1 0
where dat is the name of your data frame.
dat <- structure(list(V1 = structure(c(1L, 2L, 2L, 2L, 3L, 3L), .Label = c("A",
"B", "C"), class = "factor"), V2 = c(1L, 0L, 0L, 0L, 1L, 0L),
V3 = c(0L, 1L, 0L, 0L, 0L, 0L), V4 = c(0L, 0L, 1L, 0L, 0L,
0L), V5 = c(0L, 0L, 0L, 0L, 0L, 1L), V6 = c(0L, 0L, 0L, 1L,
0L, 1L), V7 = c(0L, 0L, 1L, 0L, 0L, 0L)), .Names = c("V1",
"V2", "V3", "V4", "V5", "V6", "V7"), class = "data.frame", row.names = c(NA,
-6L))
You could...
aggregate(.~ V1 , data =dat, sum)
or
library(plyr)
ddply(dat, .(V1), function(x) colSums(x[,2:7]) )
If you're working with a data.frame where there are duplicates but you only want the presence or absence of a 1 to be noted, then after these functions you might want to cap the summed values at 1, with something like res[-1] <- lapply(res[-1], pmin, 1), where res holds the aggregated result.
A possibility not mentioned is the aggregate function. I think this is quite 'readable'.
aggregate(cbind(data$X1, data$X2, data$X3, data$X4),
by = list(category = data$group), FUN = sum)
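
Base R also has rowsum(), which sums the rows of a matrix or data frame by a grouping vector. A minimal sketch using the question's data rebuilt by hand (the column names V1 to V7 follow the dput output above; res is an illustrative name):

```r
dat <- data.frame(
  V1 = c("A", "B", "B", "B", "C", "C"),
  V2 = c(1, 0, 0, 0, 1, 0),
  V3 = c(0, 1, 0, 0, 0, 0),
  V4 = c(0, 0, 1, 0, 0, 0),
  V5 = c(0, 0, 0, 0, 0, 1),
  V6 = c(0, 0, 0, 1, 0, 1),
  V7 = c(0, 0, 1, 0, 0, 0)
)

# Sum every numeric column within each level of V1;
# the group labels become the row names of the result
res <- rowsum(dat[-1], group = dat$V1)
res
```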

R: use a row as a grouping vector for row sums

If I have a data set laid out like:
Cohort Food1 Food2 Food 3 Food 4
--------------------------------
Group 1 1 2 3
A 1 1 0 1
B 0 0 1 0
C 1 1 0 1
D 0 0 0 1
I want to sum each row, where I can define food groups into different categories. So I would like to use the Group row as the defining vector.
Which would mean that food1 and food2 are in group 1, food3 is in group 2, food 4 is in group 3.
Ideal output something like:
Cohort Group1 Group2 Group3
A 2 0 1
B 0 1 0
C 2 0 1
D 0 0 1
I tried using rowsum()-based functions but had no luck. Do I need to use ddply() instead?
Example data from comment:
dat <-
structure(list(species = c("group", "princeps", "bougainvillei",
"hombroni", "lindsayi", "concretus", "galatea", "ellioti", "carolinae",
"hydrocharis"), locust = c(1L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L,
0L), grasshopper = c(1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L),
snake = c(2L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L), fish = c(2L,
1L, 0L, 1L, 1L, 0L, 1L, 0L, 1L, 0L), frog = c(2L, 0L, 0L,
0L, 0L, 0L, 0L, 1L, 0L, 0L), toad = c(2L, 0L, 0L, 0L, 0L,
1L, 0L, 0L, 0L, 0L), fruit = c(3L, 0L, 0L, 0L, 0L, 1L, 1L,
0L, 0L, 0L), seed = c(3L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L,
0L)), .Names = c("species", "locust", "grasshopper", "snake",
"fish", "frog", "toad", "fruit", "seed"), class = "data.frame", row.names = c(NA,
-10L))
There are most likely more direct approaches, but here is one you can try:
First, create a copy of your data minus the second header row.
dat2 <- dat[-1, ]
melt() and dcast() and so on from the "reshape2" package don't work nicely with duplicated column names, so let's make the column names more "reshape2 appropriate".
Seq <- ave(as.vector(unlist(dat[1, -1])),
as.vector(unlist(dat[1, -1])),
FUN = seq_along)
names(dat2)[-1] <- paste("group", dat[1, 2:ncol(dat)],
".", Seq, sep = "")
melt() the dataset
library(reshape2)
m.dat2 <- melt(dat2, id.vars="species")
Use the colsplit() function to split the columns correctly.
m.dat2 <- cbind(m.dat2[-2],
colsplit(m.dat2$variable, "\\.",
c("group", "time")))
head(m.dat2)
# species value group time
# 1 princeps 0 group1 1
# 2 bougainvillei 0 group1 1
# 3 hombroni 1 group1 1
# 4 lindsayi 0 group1 1
# 5 concretus 0 group1 1
# 6 galatea 0 group1 1
Proceed with dcast() as usual
dcast(m.dat2, species ~ group, sum)
# species group1 group2 group3
# 1 bougainvillei 0 0 0
# 2 carolinae 1 1 0
# 3 concretus 0 2 2
# 4 ellioti 0 1 0
# 5 galatea 1 1 1
# 6 hombroni 2 1 0
# 7 hydrocharis 0 0 0
# 8 lindsayi 0 1 0
# 9 princeps 0 1 0
Note: Edited because original answer was incorrect.
Update: An easier way in base R
This problem is much more easily solved if you start by transposing your data.
dat3 <- t(dat[-1, -1])
dat3 <- as.data.frame(dat3)
names(dat3) <- dat[[1]][-1]
t(do.call(rbind, lapply(split(dat3, as.numeric(dat[1, -1])), colSums)))
# 1 2 3
# princeps 0 1 0
# bougainvillei 0 0 0
# hombroni 2 1 0
# lindsayi 0 1 0
# concretus 0 2 2
# galatea 1 1 1
# ellioti 0 1 0
# carolinae 1 1 0
# hydrocharis 0 0 0
You can do this using base R fairly easily. Here's an example.
First, figure out which animals belong in which group:
groupings <- as.data.frame(table(as.numeric(dat[1,2:9]),names(dat)[2:9]))
attach(groupings)
grp1 <- groupings[Freq==1 & Var1==1,2]
grp2 <- groupings[Freq==1 & Var1==2,2]
grp3 <- groupings[Freq==1 & Var1==3,2]
detach(groupings)
Then, use the groups to do a rowSums() on the correct columns.
dat <- cbind(dat,rowSums(dat[as.character(grp1)]))
dat <- cbind(dat,rowSums(dat[as.character(grp2)]))
dat <- cbind(dat,rowSums(dat[as.character(grp3)]))
Delete the initial row and the intermediate columns:
dat <- dat[-1,-c(2:9)]
Then sort alphabetically by species, reset the row names, and rename the columns:
dat <- dat[order(dat$species), ]
row.names(dat) <- NULL
names(dat) <- c("species","group_1","group_2","group_3")
And you ultimately get:
species group_1 group_2 group_3
bougainvillei 0 0 0
carolinae 1 1 0
concretus 0 2 2
ellioti 0 1 0
galatea 1 1 1
hombroni 2 1 0
hydrocharis 0 0 0
lindsayi 0 1 0
princeps 0 1 0
EDITED: Changed sort order to alphabetical, like other answer.
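
Since the question mentions rowsum(), here is a compact sketch of how it can be applied: rowsum() groups rows, so the values are transposed first, summed by group, and transposed back. The data frame below is the small Cohort/Food example from the question, typed in by hand; the names groups, values, sums and result are illustrative.

```r
# The question's small example; the first row holds the group codes
dat <- data.frame(
  Cohort = c("Group", "A", "B", "C", "D"),
  Food1  = c(1, 1, 0, 1, 0),
  Food2  = c(1, 1, 0, 1, 0),
  Food3  = c(2, 0, 1, 0, 0),
  Food4  = c(3, 1, 0, 1, 1)
)

groups <- unlist(dat[1, -1])   # group code of each food column
values <- dat[-1, -1]          # the actual data, without the group row

# Transpose so foods are rows, sum the rows by group, transpose back
sums <- t(rowsum(t(values), group = groups))
colnames(sums) <- paste0("Group", colnames(sums))

result <- data.frame(Cohort = dat$Cohort[-1], sums, row.names = NULL)
result
```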
