Randomize data between two columns in R


I have searched for an answer or a solution to this task with no success as of yet, so I do apologize if this is redundant.
I want to randomize the data between two columns. This is to simulate species misidentification in vegetation field data, so I want to assign some sort of probability of misidentification between the two columns as well. I would imagine that there is some way to do this using sample or the "permute" package.
I will select some readily available data for an example.
library (vegan)
data (dune)
If you type head(dune), you can see that this is a data frame with sites as rows and species as columns. For convenience's sake, we can presume that some field tech has the potential to misidentify Poa pratensis and Poa trivialis.
poa = data.frame(Poaprat=dune$Poaprat,Poatriv=dune$Poatriv)
head(poa)
Poaprat Poatriv
1 4 2
2 4 7
3 5 6
4 4 5
5 2 6
6 3 4
What would be the best way to randomize the values between these two columns (transferring between each other and/or adding to one when both are present)? The resulting data may look like:
Poaprat Poatriv
1 6 0
2 4 7
3 5 6
4 5 4
5 0 7
6 4 3
P.S.
For the cringing ecologist out there: please realize, I have made this example in the interest of time and that I know relative cover values are not additive. I apologize for needing to do that.
*** Edit: For more clarity, the type of data being randomized would be percent cover estimates (so values between 0% and 100%). The data in this quick example are relative cover estimates, not counts.

You'll still need to replace the actual columns with the new ones, and there may be a more elegant way to do this (it's late in EDT land). You'll also have to decide what to use instead of the uniform distribution that sample() draws from by default (i.e. how you'll replace or weight sample()), but you get your swaps and adds with:
library(vegan)
library(purrr)
data(dune)
poa <- data.frame(
Poaprat=dune$Poaprat,
Poatriv=dune$Poatriv
)
map2_df(poa$Poaprat, poa$Poatriv, function(x, y) {
for (i in 1:length(x)) {
what <- sample(c("left", "right", "swap"), 1)
switch(
what,
left={
x[i] <- x[i] + y[i]
y[i] <- 0
},
right={
y[i] <- x[i] + y[i]
x[i] <- 0
},
swap={
tmp <- y[i]
y[i] <- x[i]
x[i] <- tmp
}
)
}
data.frame(Poaprat=x, Poatriv=y)
})

Here is my approach:
Let's define a function that will take a number of specimens (n) and a probability (p) that it could be labeled incorrectly. This function will sample a 1 with probability p and a 0 with 1-p. The sum of this random sampling will give how many of the n specimens were incorrect.
mislabel = function(x, p){
N_mis = sample(c(1,0), x, replace = T, prob = c(p, 1-p))
sum(N_mis)
}
Once the function is defined, apply it to each column and store the results in two new columns:
p_miss = 0.3
poa$Poaprat_mislabeled = sapply(poa$Poaprat, mislabel, p_miss)
poa$Poatriv_mislabeled = sapply(poa$Poatriv, mislabel, p_miss)
The final number of specimens tagged for each species can be calculated by subtracting the mislabeled specimens of that species and adding the mislabeled specimens of the other species.
poa$Poaprat_final = poa$Poaprat - poa$Poaprat_mislabeled + poa$Poatriv_mislabeled
poa$Poatriv_final = poa$Poatriv - poa$Poatriv_mislabeled + poa$Poaprat_mislabeled
Result:
> head(poa)
Poaprat Poatriv Poaprat_mislabeled Poatriv_mislabeled Poaprat_final Poatriv_final
1 4 2 0 0 4 2
2 4 7 1 2 5 6
3 5 6 0 3 8 3
4 4 5 1 2 5 4
5 2 6 0 3 5 3
6 3 4 1 2 4 3
Complete procedure:
mislabel = function(x, p){
N_mis = sample(c(1,0), x, replace = T, prob = c(p, 1-p))
sum(N_mis)
}
p_miss = 0.3
poa$Poaprat_mislabeled = sapply(poa$Poaprat, mislabel, p_miss)
poa$Poatriv_mislabeled = sapply(poa$Poatriv, mislabel, p_miss)
poa$Poaprat_final = poa$Poaprat - poa$Poaprat_mislabeled + poa$Poatriv_mislabeled
poa$Poatriv_final = poa$Poatriv - poa$Poatriv_mislabeled + poa$Poaprat_mislabeled
The p_miss variable is the probability of mislabeling a specimen, and here it is the same for both species. You could also use a different value for each species to simulate an asymmetric case in which one of them is easier to mislabel than the other.
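As a side note, summing x draws of 0/1 with probability p is a binomial draw, so the same idea can be sketched with rbinom(), which is vectorized over its size argument. The block below is only a sketch under assumptions: poa_demo hard-codes the six rows shown in the question, the two probabilities (p_prat_as_triv, p_triv_as_prat) are made-up names and values chosen to illustrate the asymmetric case, and the values are treated as integer counts, as in the answer above.

```r
set.seed(42)  # only so this sketch is reproducible

# First six rows from the question, hard-coded so the sketch is self-contained
poa_demo <- data.frame(Poaprat = c(4, 4, 5, 4, 2, 3),
                       Poatriv = c(2, 7, 6, 5, 6, 4))

# Hypothetical, asymmetric mislabeling probabilities (assumed values)
p_prat_as_triv <- 0.3  # chance a Poa pratensis specimen is called Poa trivialis
p_triv_as_prat <- 0.1  # chance a Poa trivialis specimen is called Poa pratensis

# One binomial draw per site: how many specimens of each species were mislabeled
mis_prat <- rbinom(nrow(poa_demo), size = poa_demo$Poaprat, prob = p_prat_as_triv)
mis_triv <- rbinom(nrow(poa_demo), size = poa_demo$Poatriv, prob = p_triv_as_prat)

# Move the mislabeled specimens to the other species
poa_demo$Poaprat_final <- poa_demo$Poaprat - mis_prat + mis_triv
poa_demo$Poatriv_final <- poa_demo$Poatriv - mis_triv + mis_prat
poa_demo
```

Because specimens are only moved between the two columns, the per-site total of the pair stays the same, which is a quick sanity check on the result.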

I just wanted to check in since accepting the answer from hrbrmstr. Given a little bit of time today, I went ahead and made a function that does this task with some degree of flexibility. It allows for inclusion of multiple species pairs, different probabilities between different species pairs (asymmetry in different direction), and includes explicitly the probability of the value staying the same.
misID = function(X, species,probs = c(0.1,0.1,0,0.8)){
library(purrr)
X2 = X
if (!is.matrix(species)){
species = matrix(species,ncol=2,byrow=T) #store a plain vector of names as a two-column matrix of pairs
}
if (!is.matrix(probs) == T){
probs=matrix(probs,ncol=4,byrow=T)
}
if (nrow(probs) == 1){
probs = matrix(rep(probs[1,],nrow(species)),ncol=4,byrow=T)
}
for (i in 1:nrow(species)){
Spp = data.frame(X[species[i,1]],X[species[i,2]])
mis = map2_df(Spp[1],Spp[2],function(x,y) {
for(n in 1:length(x)) {
what = sample(c('left', 'right', 'swap','same'), size=1,prob=probs[i,])
switch(
what,
left = {
x[n] = x[n] + y[n]
y[n] = 0
},
right = {
y[n] = x[n] + y[n]
x[n] = 0
},
swap = {
tmp = y[n]
y[n] = x[n]
x[n] = tmp
},
same = {
x[n] = x[n]
y[n] = y[n]
}
)
}
misSpp = data.frame(x,y)
colnames(misSpp) =c(names(Spp[1]),names(Spp[2]))
return(misSpp)
})
X2[names(mis[1])] = mis[1]
X2[names(mis[2])] = mis[2]
}
return(X2)
}
There are probably a number of minor inefficiencies in here, but by and large it does what I need it to do. Sorry that there are no comments, but I did figure out how to handle getting the shuffled data into the data frame easily.
Thanks for pointing out the "purrr" package for me and also the switch function.
Example:
library(vegan)
library(labdsv)
data(dune)
#First convert relative abundances to my best guess at the % values in Van der Maarel (1979)
code = c(1,2,3,4,5,6,7,8,9)
value = c(0.1,1,2.5,4.25,5.5,20,40,60.5,90)
veg = vegtrans(dune,code,value)
specpairs = matrix(c("Poaprat","Poatriv","Trifprat","Trifrepe"),ncol=2,byrow=T) #create matrix of species pairs
probmat = matrix(c(0.3,0,0,0.7,0,0.5,0,0.5),ncol=4,byrow=T) #create matrix of misclassification probabilities
veg2 = misID(veg,specpairs,probs = probmat)
print(veg2)
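For a single species pair, the same left/right/swap/same logic can also be sketched without purrr or an explicit loop, by drawing one action per row and then picking values with ifelse(). This is only an illustration: shuffle_pair and its argument names are hypothetical and not part of the function above, and the default probabilities simply mirror the defaults of misID.

```r
# Hypothetical helper: apply one randomly drawn action to each row of a species pair
shuffle_pair <- function(x, y,
                         probs = c(left = 0.1, right = 0.1, swap = 0, same = 0.8)) {
  # One action per row, drawn with the given probabilities
  what <- sample(names(probs), size = length(x), replace = TRUE, prob = probs)
  total <- x + y
  new_x <- ifelse(what == "left", total,
           ifelse(what == "right", 0,
           ifelse(what == "swap", y, x)))
  new_y <- ifelse(what == "right", total,
           ifelse(what == "left", 0,
           ifelse(what == "swap", x, y)))
  data.frame(x = new_x, y = new_y, action = what)
}

set.seed(1)  # only so this sketch is reproducible
out <- shuffle_pair(c(4, 4, 5), c(2, 7, 6))

# Forcing "left" moves everything into the first column
all_left <- shuffle_pair(c(4, 4, 5), c(2, 7, 6),
                         probs = c(left = 1, right = 0, swap = 0, same = 0))
```

Every action only moves or exchanges values within the pair, so x + y per row is unchanged whichever action is drawn.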

Related

Create multiple confusion matrices in R using loops

I am trying to create multiple confusion matrices from one dataframe, with each matrix generated based off a different condition in the dataframe.
So for the dataframe below, I want a confusion matrix for when Value = 1, Value = 2, Value =3
observed predicted Value
1 1 1
0 1 1
1 0 2
0 0 2
1 1 3
0 0 3
and see the results like:
Value Sensitivity Specificity PPV NPV
1 .96 .71 .84 .95
2 .89 .63 .30 .45
3 .88 .95 .28 .80
This is what I tried with a reproducible example. I am trying to write a loop that looks at every row, determines if Age = 1, and then pulls the values from the predicted and observed columns to generate a confusion matrix. Then I manually pull out the values from the confusion matrix to write out sen, spec, ppv, and npv and tried to combine all the matrices together. And then the loop starts again with Age = 2.
data(scat)
df<-scat %>% transmute(observed=ifelse(Site=="YOLA","case", "control"), predicted=ifelse(Location=="edge","case", "control"),Age)
x<-1 #evaluate at ages 1 through 5
for (i in dim(df)[1]) { #for every row in df
while(x<6) { #loop stops at Age=5
if(x=df$Age) {
q<-confusionMatrix(data = df$predicted, reference = df$observed, positive = "case")
sensitivity = q$table[1,1]/(q$table[1,1]+q$table[2,1])
specificity = q$table[2,2]/(q$table[2,2]+q$table[1,2])
ppv = q$table[1,1]/(q$table[1,1]+q$table[1,2])
npv = q$table[2,2]/(q$table[2,2]+q$table[2,1])
matrix(c(sensitivity, specificity, ppv, npv),ncol=4,byrow=TRUE)
}
}
x <- x + 1 #confusion matrix at next Age value
}
final<- rbind(matrix) #combine all the matrices together
However, this loop is completely non-functional. I'm not sure where the error is.

Your code can be simplified and the desired output achieved like this:
library(caret)
library(dplyr)
data(scat)
df <- scat %>%
transmute(observed = factor(ifelse(Site == "YOLA","case", "control")),
predicted = factor(ifelse(Location == "edge","case", "control")),
Age)
final <- t(sapply(sort(unique(df$Age)), function(i) {
q <- confusionMatrix(data = df$predicted[df$Age == i],
reference = df$observed[df$Age == i],
positive = "case")$table
c(sensitivity = q[1, 1] / (q[1, 1] + q[2, 1]),
specificity = q[2, 2] / (q[2, 2] + q[1, 2]),
ppv = q[1, 1] / (q[1, 1] + q[1, 2]),
npv = q[2, 2] / (q[2, 2] + q[2, 1]))
}))
Resulting in
final
#> sensitivity specificity ppv npv
#> [1,] 0.0 0.5625000 0.00000000 0.8181818
#> [2,] 0.0 1.0000000 NaN 0.8000000
#> [3,] 0.2 0.5882353 0.06666667 0.8333333
#> [4,] 0.0 0.6923077 0.00000000 0.6923077
#> [5,] 0.5 0.6400000 0.25000000 0.8421053
However, it's nice to know why your own code didn't work, so here are a few issues that might be useful to consider:
You need factor columns rather than character columns for confusionMatrix
You were incrementing through the rows of df, but you need one iteration for each unique age, not each row in your data frame.
Your line to increment x happens outside of the while loop, so x never increments and the loop never terminates, so the console just hangs.
You are doing if(x = df$Age), but you need a == to test equality.
It doesn't make sense to compare x to df$Age anyway, because x is length 1 and df$Age is a long vector.
You have unnecessary repetition by doing q$table each time. You can just make q equal to q$table to make your code more readable and less error-prone.
You call matrix at the end of the loop, but you don't store it anywhere, so the whole loop doesn't actually do anything.
You are trying to rbind an object called matrix in the last line, but no such object holding your results exists; the only thing with that name is the base R function matrix().
Your lack of spaces between math operators, commas and variables make the code less readable and harder to debug. I'm not just saying this as a stylistic point; it is a major source of errors I see frequently here on SO.
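If you want to see the per-group idea without caret, here is a rough base R sketch on the small 0/1 table from the question. It assumes 1 is the positive class; toy and metrics_for are made-up names for illustration, and the four counts are computed directly instead of being read from a confusionMatrix object.

```r
# Toy data copied from the question (1 = positive, 0 = negative)
toy <- data.frame(observed  = c(1, 0, 1, 0, 1, 0),
                  predicted = c(1, 1, 0, 0, 1, 0),
                  Value     = c(1, 1, 2, 2, 3, 3))

# Hypothetical helper: the four metrics for one subset of rows
metrics_for <- function(d) {
  tp <- sum(d$predicted == 1 & d$observed == 1)  # true positives
  fp <- sum(d$predicted == 1 & d$observed == 0)  # false positives
  fn <- sum(d$predicted == 0 & d$observed == 1)  # false negatives
  tn <- sum(d$predicted == 0 & d$observed == 0)  # true negatives
  c(sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp),
    ppv         = tp / (tp + fp),
    npv         = tn / (tn + fn))
}

# One row of metrics per level of Value
per_value <- t(sapply(split(toy, toy$Value), metrics_for))
per_value
```

With only two rows per group, some denominators are zero and the corresponding entries come out as NaN, just like the NaN in the caret output above.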

Building a life table in R with for loops

I'm new to R and programming in general, and I'm struggling with a for-loop for building the lx function in a life table.
I have the age function x, the death function qx (the probability that someone aged exactly x will die before reaching age x+1), and the surviving function px = 1 - qx.
I want to write a function that returns a vector with all the lx values from first to last age in my table. The function is simple...
I've defined cohort = 1000000. The first age in my table is x = 5, so, considering x = 5...
l_(x) = cohort
And, from now on, l_(x+n) = l_(x+n-1)*p_(x+n-1)
I've searched about for-loops, and I can only get my code working for lx[1] and lx[2], and I get nothing for lx[n] if n > 2.
I wrote that function:
living_x <- function(px, cohort){
result <- vector("double", length(px))
l_x <- vector("double", length(px))
for (i in 1:length(px)){
if (i == 1){
l_x[i] = cohort
}
else l_x[i] = l_x[i-1]*px[i-1]
result[i] = l_x
print(result)
}
}
When I run it, I get several outputs (more than length(px)) and "There were 50 or more warnings (use warnings() to see the first 50)".
When I run warnings(), I get "In result[i] <- l_x : number of items to replace is not a multiple of replacement length" for every number.
Also, everything else I try gives me different errors or only calculates lx for lx[1] and lx[2]. I know there's something really wrong with my code, but I still couldn't identify it. I'd be glad if someone could give me a hint to find out what to change.
Thank you!

Here's an approach using dplyr from the tidyverse packages, to use px to calculate lx. This can be done similarly in base R using excerpt$lx_calc = 100000 * cumprod(c(1, head(1 - excerpt$qx_given, -1))).
lx is provided in the babynames package, so we can check our work:
library(tidyverse)
library(babynames)
# Get excerpt with age, qx, and lx.
excerpt <- lifetables %>%
filter(year == 2010, sex == "F") %>%
select(x, qx_given = qx, lx_given = lx)
excerpt
# A tibble: 120 x 3
x qx_given lx_given
<dbl> <dbl> <dbl>
1 0 0.00495 100000
2 1 0.00035 99505
3 2 0.00022 99471
4 3 0.00016 99449
5 4 0.00012 99433
6 5 0.00011 99421
7 6 0.00011 99410
8 7 0.0001 99399
9 8 0.0001 99389
10 9 0.00009 99379
# ... with 110 more rows
Using that data to estimate lx_calc:
est_lx <- excerpt %>%
mutate(px = 1 - qx_given,
cuml_px = cumprod(lag(px, default = 1)),
lx_calc = cuml_px * 100000)
And finally, comparing visually the given lx with the one calculated based on px. They match exactly.
est_lx %>%
gather(version, val, c(lx_given, lx_calc)) %>%
ggplot(aes(x, val, color = version)) + geom_line()

After thinking about it for a few more minutes, I managed to do it in a very simple way.
lx = numeric(length(px))
lx[1] = 10**6 # the cohort only needs to be set once, before the loop
for (i in seq_along(px)[-1]){
lx[i] = lx[i-1]*px[i-1]
}
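For anyone who hits the same warning: in the original living_x, the line result[i] = l_x tries to put the whole l_x vector into a single element of result, which is what "number of items to replace is not a multiple of replacement length" is complaining about, and the function never returns anything. A minimal corrected sketch is below; px_demo is made-up data, and the cumprod() line is only there as a cross-check of the loop.

```r
living_x <- function(px, cohort) {
  l_x <- numeric(length(px))
  l_x[1] <- cohort                      # everyone is alive at the first age
  for (i in seq_along(px)[-1]) {        # i = 2, 3, ..., length(px)
    l_x[i] <- l_x[i - 1] * px[i - 1]    # survivors of the previous age
  }
  l_x                                   # return the full vector once, after the loop
}

px_demo <- c(0.99, 0.98, 0.97, 0.96)    # made-up survival probabilities
lx_loop <- living_x(px_demo, cohort = 1e6)

# Same recursion written as a cumulative product
lx_vec <- 1e6 * cumprod(c(1, head(px_demo, -1)))
```

Both versions follow the recursion stated in the question, l_(x+n) = l_(x+n-1) * p_(x+n-1), starting from the cohort size.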

Variable length formula construction

I am trying to apply the Simpson's Diversity Index across a number of different datasets with a variable number of species ('nuse') captured. As such I am trying to construct code which can cope with this automatically without needing to manually construct a formula each time I do it. Example dataset for a manual formula is below:
diverse <- data.frame(nuse1=c(0,20,40,20), nuse2=c(5,5,3,20), nuse3=c(0,2,8,20), nuse4=c(5,8,2,20), total=c(10,35,53,80))
simp <- function(x) {
total <- x[,"total"]
nuse1 <- x[,"nuse1"]
nuse2 <- x[,"nuse2"]
nuse3 <- x[,"nuse3"]
nuse4 <- x[,"nuse4"]
div <- round(((1-(((nuse1*(nuse1 - 1)) + (nuse2*(nuse2 - 1)) + (nuse3*(nuse3 - 1)) + (nuse4*(nuse4 - 1)))/(total*(total - 1))))),digits=4)
return(div)
}
diverse$Simpson <- simp(diverse)
diverse
As you can see this works fine. However, how would I be able to create a function which could automatically adjust to, for example, 9 species (so up to nuse9)?
I have experimented with the paste function and as.formula, as indicated in "Formula with dynamic number of variables"; however, it is the expanded form of (nuse1 * (nuse1 - 1)) that I'm struggling with. Does anyone have any suggestions, please? Thanks.

How about something like:
diverse <- data.frame(nuse1=c(0,20,40,20), nuse2=c(5,5,3,20), nuse3=c(0,2,8,20), nuse4=c(5,8,2,20), total=c(10,35,53,80))
simp <- function(x, species) {
spcs <- grep(species, colnames(x)) # which column names have "nuse"
total <- rowSums(x[,spcs]) # sum by row
div <- round(1 - rowSums(apply(x[,spcs], 2, function(s) s*(s-1))) / (total*(total - 1)), digits = 4)
return(div)
}
diverse$Simpson2 <- simp(diverse, species = "nuse")
diverse
# nuse1 nuse2 nuse3 nuse4 total Simpson2
# 1 0 5 0 5 10 0.5556
# 2 20 5 2 8 35 0.6151
# 3 40 3 8 2 53 0.4107
# 4 20 20 20 20 80 0.7595
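All it does is find the columns whose names match the pattern you pass in ("nuse" here, or any other species prefix in your dataset). Note that grep matches the pattern anywhere in the name, so use "^nuse" if you only want names that start with it. The function builds the "total" value itself and does not require a total column in the dataset.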
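If you would like to be strict about the prefix, a variant along these lines should also work. It is only a sketch: simp_prefix is a made-up name, and it selects columns with startsWith() and multiplies the count matrix directly instead of calling apply().

```r
diverse <- data.frame(nuse1 = c(0, 20, 40, 20), nuse2 = c(5, 5, 3, 20),
                      nuse3 = c(0, 2, 8, 20),  nuse4 = c(5, 8, 2, 20),
                      total = c(10, 35, 53, 80))

# Hypothetical variant: only columns whose names START with the prefix are used
simp_prefix <- function(x, prefix = "nuse") {
  counts <- as.matrix(x[, startsWith(colnames(x), prefix), drop = FALSE])
  total  <- rowSums(counts)                        # total individuals per row
  round(1 - rowSums(counts * (counts - 1)) / (total * (total - 1)), digits = 4)
}

simpson_check <- unname(simp_prefix(diverse))
simpson_check
```

On the example data this gives the same four values as the Simpson2 column above (0.5556, 0.6151, 0.4107, 0.7595), and it works unchanged for any number of nuse columns.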

Running 'prop.test' multiple times in R

I have some data showing a long list of regions, the population of each region and the number of people in each region with a certain disease. I'm trying to show the confidence intervals for each proportion (but I'm not testing whether the proportions are statistically different).
One approach is to manually calculate the standard errors and confidence intervals but I'd like to use a built-in tool like prop.test, because it has some useful options. However, when I use prop.test with vectors, it runs a chi-square test across all the proportions.
I've solved this with a while loop (see dummy data below), but I sense there must be a better and simpler way to approach this problem. Would apply work here, and how? Thanks!
dat <- data.frame(1:5, c(10, 50, 20, 30, 35))
names(dat) <- c("X", "N")
dat$Prop <- dat$X / dat$N
ConfLower = 0
x = 1
while (x < 6) {
a <- prop.test(dat$X[x], dat$N[x])$conf.int[1]
ConfLower <- c(ConfLower, a)
x <- x + 1
}
ConfUpper = 0
x = 1
while (x < 6) {
a <- prop.test(dat$X[x], dat$N[x])$conf.int[2]
ConfUpper <- c(ConfUpper, a)
x <- x + 1
}
dat$ConfLower <- ConfLower[2:6]
dat$ConfUpper <- ConfUpper[2:6]

Here's an attempt using Map, essentially stolen from a previous answer here:
https://stackoverflow.com/a/15059327/496803
res <- Map(prop.test,dat$X,dat$N)
dat[c("lower","upper")] <- t(sapply(res,"[[","conf.int"))
# X N Prop lower upper
#1 1 10 0.1000000 0.005242302 0.4588460
#2 2 50 0.0400000 0.006958623 0.1485882
#3 3 20 0.1500000 0.039566272 0.3886251
#4 4 30 0.1333333 0.043597084 0.3164238
#5 5 35 0.1428571 0.053814457 0.3104216
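Since the question asked about the apply family specifically, here is a rough equivalent with vapply(), which calls prop.test once per row and checks that each call returns exactly two numbers. The column names ConfLower and ConfUpper are taken from the question; everything else is just one way to write it.

```r
dat <- data.frame(X = 1:5, N = c(10, 50, 20, 30, 35))
dat$Prop <- dat$X / dat$N

# One prop.test per row; as.numeric() drops the conf.level attribute
ci <- t(vapply(seq_len(nrow(dat)), function(i) {
  as.numeric(prop.test(dat$X[i], dat$N[i])$conf.int)
}, numeric(2)))

dat$ConfLower <- ci[, 1]
dat$ConfUpper <- ci[, 2]
dat
```

Other prop.test options such as conf.level or correct can be passed inside the anonymous function if you need them.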

Trying to create and loop through matrix of unbalanced data in R

I am trying to conduct a hierarchical Bayesian analysis but am having a little trouble with the R and WinBUGS code. I don't have balanced data and am struggling with the coding. I have temperature data collected daily with iButtons (temperature recording devices) in transects and am trying to generate a model that relates this to remote sensing data. Unfortunately, each transect has a different number of iButtons, so creating a 3D matrix of button(i), in transect(j), repeatedly "sampled" on day(t) is a problem for me.
Ultimately, my model will be something like:
Level 1
Temp[ijk] ~ N(theta[ijk], tau)
theta[ijk] = b0 + b1*x1 + . . . + bn*xn
Level 2
b0 = a00 + a01*y1 + . . . an*yn
b1 = a10 + a11*y1 ...
Level 3 (maybe?) - random level 2 intercepts
Normally I would do something like this:
Wide <- reshape(Data1, idvar = c("iButton","block"), timevar = "julian", direction = "wide")
J <- length(unique(Data$block))
I <- length(unique(Data$iButton))
Ti <- length(unique(Data$julian))
Temp <- array(NA, dim = c(I, Ti, J))
for(t in 1:Ti) {
sel.rows <- Wide$block == t
Temp[,,t] <- as.matrix(Wide)[sel.rows, 3:Ti]
}
Then I could have a 3D matrix that I could loop through in WinBUGS or OpenBUGS as such:
for(i in 1:J) { # Loop over transects/blocks
for(j in 1:I) { # Loop over buttons
for(t in 1:Ti) { # Loop over days
Temp[i,j,t] ~ dnorm(theta[i,j,t])
theta[i,j,t] <- alpha.lam[i] + blam1*radiation[i,j] + blam2*cwd[i,j] + blam3*swd[i,j]
}}}
Anyway, don't worry about the details of the code above, it's just thrown together as an example from other analyses. My main question is how to do this type of analysis when I don't have a balanced design with equal numbers of iButtons per transect? Any help would be greatly appreciated. I'm clearly new to R and WinBUGS and don't have much previous computer coding experience.
Thanks!
oh and here is what the data look like in long (stacked) format:
> Data[1:15, 1:4]
iButton julian block aveT
1 1 1 1 -4.5000000
2 1 2 1 -5.7500000
3 1 3 1 -3.5833333
4 1 4 1 -4.6666667
5 1 5 1 -2.5833333
6 1 6 1 -3.0833333
7 1 7 1 -1.5833333
8 1 8 1 -8.3333333
9 1 9 1 -5.0000000
10 1 10 1 -2.4166667
11 1 11 1 -1.7500000
12 1 12 1 -3.2500000
13 1 13 1 -3.4166667
14 1 14 1 -2.0833333
15 1 15 1 -1.7500000

Create a vector or array of lengths and use subindexing.
Using your example:
J <- length(unique(Data$block))
I <- tapply(Data$iButton, Data$block, function(x) length(unique(x))) # buttons per block
Ti <- tapply(Data$julian, list(Data$block, Data$iButton), function(x) length(unique(x))) # days per [block, button]
for(i in 1:J) { # Loop over transects/blocks
for(j in 1:I[i]) { # Loop over buttons
for(t in 1:Ti[i, j]) { # Loop over days
Temp[i,j,t] ~ dnorm(theta[i,j,t])
theta[i,j,t] <- alpha.lam[i] + blam1*radiation[i,j] + blam2*cwd[i,j] + blam3*swd[i,j]
}}}
I think it would work, but I haven't tested it since there is no data.
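To make the indexing concrete on the R side, here is a small sketch that builds an NA-padded Temp array indexed [block, button, day] from long-format data, together with the vector of buttons per block. Data_demo is invented for illustration (two buttons in block 1, one in block 2), and blk, btn and day are hypothetical helper columns that renumber blocks, buttons within a block, and days as 1, 2, and so on.

```r
# Invented long-format data: block 1 has two buttons, block 2 has one
Data_demo <- data.frame(
  block   = c(1, 1, 1, 1, 2, 2),
  iButton = c(1, 1, 2, 2, 3, 3),
  julian  = c(1, 2, 1, 2, 1, 2),
  aveT    = c(-4.5, -5.75, -3.6, -4.7, -2.6, -3.1)
)

# Consecutive indices for blocks and days
Data_demo$blk <- match(Data_demo$block, sort(unique(Data_demo$block)))
Data_demo$day <- match(Data_demo$julian, sort(unique(Data_demo$julian)))
# Button index within its own block (1..I[block])
Data_demo$btn <- ave(Data_demo$iButton, Data_demo$blk,
                     FUN = function(b) match(b, unique(b)))

J  <- length(unique(Data_demo$blk))                     # number of blocks
I  <- as.vector(tapply(Data_demo$btn, Data_demo$blk, max))  # buttons per block
Ti <- length(unique(Data_demo$day))                     # number of days

# Pad to the largest block; cells with no button stay NA
Temp <- array(NA_real_, dim = c(J, max(I), Ti))
Temp[cbind(Data_demo$blk, Data_demo$btn, Data_demo$day)] <- Data_demo$aveT
```

The BUGS loop over buttons can then run for(j in 1:I[i]) so that the padded NA cells are never visited.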

Can you try using a list instead?
This allows a variable length for each item in the list where each index would correspond to the transect.
So something like this:
theta <- list()
for(i in unique(Data$block)) {
ibuttons <- unique(Data$iButton[Data$block==i])
days <- unique(Data$julian[Data$block==i])
theta[[i]] <- matrix(NA, length(ibuttons), length(days)) # Empty matrix with NA's
for(j in 1:length(ibuttons)) {
for(t in 1:length(days)) {
theta[[i]][j,t] <- fn(i, ibuttons[j], days[t])
}
}
}
