permute dataframe but must have unique rows


Say I have a dataframe like this:
d <- data.frame(time = c(1,3,5,6,11,15,15,18,18,20), side = c("L", "R", "R", "L", "L", "L", "L", "R","R","R"), id = c(1,2,1,2,4,3,4,2,1,1), stringsAsFactors = F)
d
time side id
1 1 L 1
2 3 R 2
3 5 R 1
4 6 L 2
5 11 L 4
6 15 L 3
7 15 L 4
8 18 R 2
9 18 R 1
10 20 R 1
I wish to permute the id variable and keep the other two constant. However, importantly, in my final permutations I do not want to have the same id on the same side at the same time. For instance, there are two times/sides where this might occur. In the original data at time 15 and 18 there are two unique ids at the same side (left for time 15 and right for time 18). If I permute using sample there is a chance that the same id shows up at the same time/side combination.
For example,
set.seed(11)
data.frame(time=d$time, side=d$side, id=sample(d$id))
time side id
1 1 L 1
2 3 R 1
3 5 R 4
4 6 L 1
5 11 L 4
6 15 L 2
7 15 L 3
8 18 R 2
9 18 R 2
10 20 R 1
Here, id=2 appears on two rows at time 18 on side "R". This is not allowed in the permutation I need.
One solution would be to brute force this - e.g. say I needed 100 permutations, I could generate 500 and discard those that fail the criteria. However, in my real data I have hundreds of rows and just using sample almost always leads to a failure. I wonder if there is a better algorithm for doing this? Perhaps a birth-death algorithm?

Setup:
library(tidyverse)
d <- data.frame(time = c(1,3,5,6,11,15,15,18,18,20), side = c("L", "R", "R", "L", "L", "L", "L", "R","R","R"), id = c(1,2,1,2,4,3,4,2,1,1), stringsAsFactors = F)
d <- rownames_to_column(d)
I want the rownames to put it back in order at the end.
You need a function that takes a vector (like your id vector) and returns a sample of size n with the constraint that the values have to be different, as in the following (which assumes the sampling you want can actually take place, i.e. you haven't run out of items to sample). For convenience this also returns the "leftovers" that weren't sampled:
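samp_uniq_n <- function(vec, n) {
  x <- vec
  out <- rep(NA, n)
  for (i in seq_len(n)) {
    # Make sure sampling is even possible.
    if (length(x) == 0) stop("not enough distinct values left to sample")
    # Picking a random element of x draws each value with probability
    # proportional to its frequency, and avoids sample() treating a
    # single number as the range 1:number.
    out[i] <- x[sample.int(length(x), 1)]
    x <- x[x != out[i]]
    vec <- vec[-min(which(vec == out[i]))]
  }
  return(list(out = out, vec = vec))
}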
Now, we need to split the data into a list of rows that have the same time and side and start the sampling with the largest such:
id <- d$id
d_split <- d %>% select(-id) %>% split(., list(d$time, d$side), drop = TRUE)
d_split_desc <- d_split[order(-sapply(d_split, nrow))]
Then we can do the sampling itself:
for(i in seq_along(d_split_desc)) {
samp <- samp_uniq_n(id, nrow(d_split_desc[[i]]))
this_id <- samp$out
d_split_desc[[i]]$id <- this_id
id <- samp$vec
}
Finally, some cleanup:
d_permute <- do.call(rbind, d_split_desc) %>%
arrange(as.numeric(rowname)) %>%
select(-rowname)
Putting all this in a big function is an annoyance I'll leave to anyone who is interested.
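For anyone who does want a single function, here is a minimal base R sketch of the same idea. The function name permute_unique and its max_tries argument are made up for this illustration, and it assumes the columns are called time, side and id as in the question. It fills the largest time/side groups first, never reuses an id within a group, and starts over if it paints itself into a corner. It does not claim to sample uniformly from all valid permutations.

```r
d <- data.frame(time = c(1,3,5,6,11,15,15,18,18,20),
                side = c("L","R","R","L","L","L","L","R","R","R"),
                id   = c(1,2,1,2,4,3,4,2,1,1),
                stringsAsFactors = FALSE)

# Hypothetical helper: permute d$id so that no time/side group repeats an id.
permute_unique <- function(d, max_tries = 100) {
  # Row indices of each time/side group, largest groups first
  groups <- split(seq_len(nrow(d)), list(d$time, d$side), drop = TRUE)
  groups <- groups[order(-lengths(groups))]
  for (attempt in seq_len(max_tries)) {
    pool <- d$id      # ids not handed out yet
    new_id <- d$id    # overwritten row by row below
    failed <- FALSE
    for (rows in groups) {
      used <- c()     # ids already placed in this group
      for (r in rows) {
        cand <- which(!(pool %in% used))
        if (length(cand) == 0) { failed <- TRUE; break }
        pick <- cand[sample.int(length(cand), 1)]
        new_id[r] <- pool[pick]
        used <- c(used, pool[pick])
        pool <- pool[-pick]
      }
      if (failed) break
    }
    if (!failed) {
      d$id <- new_id
      return(d)
    }
  }
  stop("no valid permutation found in ", max_tries, " tries")
}

set.seed(11)
d_perm <- permute_unique(d)
```

Because rows are filled in place, the original row order is kept and no rowname bookkeeping is needed.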

Related

R error in '[<-.data.frame 'C' temp',… replacement has # items, need #

I have a function that I've executed many times but is now throwing an error, which I do not understand. I'm trying to apply the function over a list.
I did not write the function and I have limited experience with functions. So, I'm not sure how to troubleshoot the code.
function:
myfun<-function(Year, SampleID, Species, Abundance, resamps) {
library(vegan)
counter<-1
simbaseline<-data.frame(array(NA,dim=c(length(unique(Year)),5)))
names(simbaseline)<-c("Year", "Jaccard","Horn","Bray","Pearson")
simnext<-data.frame(array(NA,dim=c(length(unique(Year)),5)))
names(simnext)<-c("Year", "Jaccard","Horn","Bray","Pearson")
simhind<-data.frame(array(NA,dim=c(length(unique(Year)),5)))
names(simhind)<-c("Year", "Jaccard","Horn","Bray","Pearson")
counter2<-1
# getting vector with number of samples per year
nsamples<-c()
for(y in unique(Year)){
nsamples<-c(nsamples, length(unique(SampleID[Year==y])))
}
t<-1
minsample<-min(nsamples)
for(repeats in 1:resamps){
raref<-data.frame(array(NA,dim=c(1,3)))
names(raref)<-c("Year","Species","Abundance")
for(y in unique(Year)){
#getting samples for this year
samps<-unique(SampleID[Year==y])
# re-sampling samples to equalize number of samples
sam<-as.character(sample(samps,minsample,replace=T))
# getting data that belongs to bootstraped samples
rarefyear<-data.frame(SampleID[which(SampleID %in% sam & Year == y)],
Species[which(SampleID %in% sam & Year == y)],
Abundance[which(SampleID %in% sam & Year == y)])
names(rarefyear)<-c("SampleID", "Species", "Abundance")
# calculating pooled abundances of eahc species to store
spabun<-tapply(as.numeric(rarefyear[,3]),rarefyear[,2],sum)
spar<-data.frame(rep(y, length(spabun)),names(spabun),spabun, row.names=NULL)
names(spar)<-c("Year","Species","Abundance")
raref<-rbind(raref,spar)
counter<-counter+1
}
# calculating year by species table of abundance
rareftabtemp<-with(raref,tapply(Abundance,list(Year,Species),function(x)x))
rareftabtemp[is.na(rareftabtemp)]<-0
Pearsoncor<-cor(t(log(rareftabtemp+1)), method="pearson")
# calculating between year similarities (NOT DISTANCE!) with Jaccard, Morisita-Horn, Bray and Pearson correlations
Jacsim<-as.matrix(1-vegdist(rareftabtemp, method="jaccard"))
Hornsim<-as.matrix(1-vegdist(rareftabtemp, method="horn"))
Braysim<-as.matrix(1-vegdist(rareftabtemp, method="bray"))
n<-length(unique(Year))
simbaseline[counter2:(counter2+n-2),]<-
cbind(unique(Year)[2:n],Jacsim[2:n],Hornsim[2:n],Braysim[2:n],Pearsoncor[2:n])
simnext[counter2:(counter2+n-2),]<-
cbind(unique(Year)[2:n],Jacsim[row(Jacsim)-col(Jacsim)==1],
Hornsim[row(Hornsim)-col(Hornsim)==1],
Braysim[row(Braysim)-col(Braysim)==1],
Pearsoncor[row(Pearsoncor)-col(Pearsoncor)==1])
# added hindcasting
simhind[counter2:(counter2+n-2),]<-
cbind(unique(Year)[1:(n-1)],
Jacsim[row(Jacsim) %in% 1:(max(row(Jacsim))-1) &
col(Jacsim)==max(col(Jacsim))],
Hornsim[row(Hornsim)%in%1:(max(row(Hornsim))-1) &
col(Hornsim)==max(col(Hornsim))],
Braysim[row(Braysim)%in%1:(max(row(Braysim))-1) &
col(Braysim)==max(col(Braysim))],
Pearsoncor[row(Pearsoncor)%in%1:(max(row(Pearsoncor))-1) &
col(Pearsoncor)==max(col(Pearsoncor))])
counter2<-counter2+n
}
baselinesim<-data.frame(unique(Year)[2:n],
tapply(simbaseline$Jaccard,simbaseline$Year,mean),
tapply(simbaseline$Horn,simbaseline$Year,mean),
tapply(simbaseline$Bray,simbaseline$Year,mean),
tapply(simbaseline$Pearson,simbaseline$Year,mean))
names(baselinesim)<-c("Year", "Jaccard","Horn","Bray","Pearson")
nextsim<-data.frame(unique(Year)[2:n],
tapply(simnext$Jaccard,simnext$Year,mean),
tapply(simnext$Horn,simnext$Year,mean),
tapply(simnext$Bray,simnext$Year,mean),
tapply(simnext$Pearson,simnext$Year,mean))
names(nextsim)<-c("Year", "Jaccard","Horn","Bray","Pearson")
hindcastsim<-data.frame(unique(Year)[1:(n-1)],
tapply(simhind$Jaccard,simhind$Year,mean),
tapply(simhind$Horn,simhind$Year,mean),
tapply(simhind$Bray,simhind$Year,mean),
tapply(simhind$Pearson,simhind$Year,mean))
names(hindcastsim)<-c("Year", "Jaccard","Horn","Bray","Pearson")
a<-list(baselinesim,nextsim,hindcastsim)
return(a)
}
error:
Error in [<-.data.frame(*tmp*, counter2:(counter2 + n - 2), , value = c(NA, : replacement has 2 items, need 5
Traceback
6.
stop(sprintf(ngettext(m, "replacement has %d item, need %d",
"replacement has %d items, need %d"), m, n * p), domain = NA)
5.
`[<-.data.frame`(`*tmp*`, counter2:(counter2 + n - 2), , value = structure(c(NA,
2009), .Dim = 2:1))
4.
`[<-`(`*tmp*`, counter2:(counter2 + n - 2), , value = structure(c(NA,
2009), .Dim = 2:1))
3.
myfun(x$Year, x$Bay, x$Species, x$Abundance, 20)
2.
FUN(X[[i]], ...)
1.
lapply(summer.split, function(x) myfun(x$Year, x$Bay,
x$Species, x$Abundance, 20))
Again, the function worked before.
Someone seems to have asked a similar question before and was answered by @Marat Talipov, but I'm not experienced enough to make sense of what the solution was.
The answer was:
This error pops up when you're unlucky and i <- runif(n) < 1/2 consists only of FALSE, i.e. no permutations happen. You need to add a check in the swap function to fix this problem.
R error in '[<-.data.frame'... replacement has # items, need #
A subset of my data can be found here:
https://fil.email/sI4Kyhaj
The data was split by "Bay" to generate the list
Note that the function may not throw an error on a different machine because it seems to occur periodically.

The problem is located in the piece of code below (and in the similar indexed reassignments to data frames elsewhere in your code):
simbaseline[counter2:(counter2+n-2),]<-
cbind(unique(Year)[2:n],Jacsim[2:n],Hornsim[2:n],Braysim[2:n],Pearsoncor[2:n])
What happened is that the number of rows you are trying to assign on the left-hand side is not equal to the number on the right-hand side. To avoid this, you can use an intermediate data frame filled with e.g. NAs and then assign that temporary data frame to your target data frame. Please see the simulated code below, with explanations of how it can be done:
# simulation
df <- data.frame(i = 1:10, l = letters[1:10], stringsAsFactors = FALSE)
head(df)
# i l
# 1 1 a
# 2 2 b
# 3 3 c
# 4 4 d
# 5 5 e
# 6 6 f
# with error
df[1:5, ] <- cbind(1:3, c("a", "b", "c"))
# Error in `[<-.data.frame`(`*tmp*`, 1:5, , value = c("1", "2", "3", "a", :
# replacement has 6 items, need 10
# without error
dftemp_in <- cbind(1:3, c("a", "b", "c"))
dftemp_out <- df[1:5, ]
dftemp_out[] <- NA
dftemp_out[seq(nrow(dftemp_in)), ] <- dftemp_in
df[1:5, ] <- dftemp_out
df
# i l
# 1 1 a
# 2 2 b
# 3 3 c
# 4 <NA> <NA>
# 5 <NA> <NA>
# 6 6 f
# 7 7 g
# 8 8 h
# 9 9 i
# 10 10 j
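One likely trigger for the size mismatch, judging only from the traceback in the question (value = c(NA, 2009) with .Dim = 2:1), is a list element that contains a single year, so that n is 1. In that case 2:n counts down instead of being empty. A small sketch of that reading, with illustrative variable names:

```r
years <- 2009                      # a group containing a single year
n <- length(unique(years))         # 1
rhs_idx <- 2:n                     # counts down to c(2, 1), not empty
picked <- unique(years)[rhs_idx]   # NA 2009, the two values in the traceback
safe_idx <- seq_len(n)[-1]         # integer(0) when n is 1
```

If that is what happens in your data, checking length(unique(x$Year)) for each element of the list before calling the function would show which groups are affected.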

Using a custom summary function for factors within multiple columns

I conducted a survey with a large number of items, each of which has distinct categorical response options stored as factors. I need to summarize these columns in an efficient manner, preferably with functionality like that provided by forcats::fct_count(). I also need to know how many non-NA responses were provided for each variable, since different items were shown to different respondents. I wrote a function to make a tidy little summary data frame, but am struggling to efficiently run this function along each column and then combine the results into a single object (ala ddply).
I've tried sapply(), gather()-ing the data to long format and then running ddply(), but the problem of the distinct levels for each variable seems to keep getting in the way. See below for a reproducible example of the data set and my summarizing function. I could run the function for each variable (as shown below), but I know there's gotta be a more efficient way to do this that doesn't involve creating a ton of individual summary data-frame objects. Thanks for any help you can provide.
data <- data.frame(
ID = c(1:50),
X = as.factor(sample(c("yes", "no", NA), 50, replace = TRUE)),
Y = as.factor(sample(c("a", "b", "c", NA), 50, replace = TRUE)),
Z = as.factor(sample(c("d", "e", "f", "g", "h", NA), 50, replace = TRUE))
)
library(tidyverse)
library(forcats)
factorsummaries.f <- function(x) {
x <- na.omit(x)
counts <- fct_count(fct_drop(x), sort = T)
counts$f <- as.character(counts$f)
total <- data.frame(f = "sum", n = as.numeric(sum(counts$n)))
return(bind_rows(counts, total))
}
factorsummaries.f(data$X)
factorsummaries.f(data$Y)

Perhaps you are looking for purrr::map_dfr
map_dfr(data[,2:ncol(data)], factorsummaries.f, .id = "colname")
#output
colname f n
<chr> <chr> <dbl>
1 X no 18
2 X yes 17
3 X sum 35
4 Y a 14
5 Y c 13
6 Y b 12
7 Y sum 39
8 Z g 10
9 Z d 9
10 Z h 8
11 Z f 6
12 Z e 5
13 Z sum 38
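If you would rather not depend on purrr, the same per-column summary and row-binding can be sketched in base R. This is an illustration only: summarise_factor is a made-up stand-in for factorsummaries.f that uses table() instead of fct_count(), and the seed is added just to make the example reproducible.

```r
set.seed(42)
data <- data.frame(
  ID = 1:50,
  X = factor(sample(c("yes", "no", NA), 50, replace = TRUE)),
  Y = factor(sample(c("a", "b", "c", NA), 50, replace = TRUE))
)

# Hypothetical base R counterpart of factorsummaries.f
summarise_factor <- function(x) {
  counts <- sort(table(droplevels(x[!is.na(x)])), decreasing = TRUE)
  data.frame(f = c(names(counts), "sum"),
             n = c(as.vector(counts), sum(counts)),
             stringsAsFactors = FALSE)
}

# Summarise every column except ID, then stack the pieces with a column label
pieces <- lapply(data[-1], summarise_factor)
result <- do.call(rbind, Map(function(df, nm) cbind(colname = nm, df),
                             pieces, names(pieces)))
rownames(result) <- NULL
result
```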

R : Select either or, but not both

I am absolutely new to coding so please forgive me if this should be very easy to solve or to find - maybe it's so simple that nobody has bothered explaining so far or I just haven't been searching with the right keywords.
I have a column in my dataset that contains the letters f, n, i in all possible combinations. Now I want to find only those rows that contain either f or n, but not both of them. So that could be f, or fi, or n, or ni.
Then I want to compare those two sets of rows to each other in a boxplot. So ideally I would have two boxes: one with all the data points belonging to group f, including fi, and one with all the data points belonging to group n, including ni.
Example of my dataset:
df <- data.frame(D = c("f", "f", "fi", "n", "ni", "ni", "fn", "fn"), y = c(1, 0.8, 1.1, 2.1, 0.9, 8.8, 1.7, 5.4))
D y
1 f 1.0
2 f 0.8
3 fi 1.1
4 n 2.1
5 ni 0.9
6 ni 8.8
7 fn 1.7
8 fn 5.4
Now what I want to get is this subset:
D y
1 f 1.0
2 f 0.8
3 fi 1.1
4 n 2.1
5 ni 0.9
6 ni 8.8
and then somehow have 1,2,3 and 4,5,6 in a group each, to plot in a boxplot.
So far I have only succeeded in getting a subset that has only entries with either f or n, but not fi, ni etc, which is not what I want, with this code:
df2<-df[df$D==c("f","n"),]
and in creating a subset that has all different groups with f and n:
df2 <- df[grepl("f", df$D) | grepl("n", bat.df$D),]
I read about the "exclusive or" operator xor but when I try to use that like this:
df2 <- bat.df[xor(match("n", df$D), match("f", df$D)),]
it just gives me a dataframe full of NAs. But even if that did work, I guess I would only be able to make a boxplot with four groups, f, n, fi and ni, where I want only two groups. So how can I get that code to work, and how do I go on from there?
I hope this is not too terrible for a first question! I am kind of bleary eyed after spending far too much time on this. Any help, about my problem, on where to look for the answer or on how to improve the question is very much appreciated!

I think your last example is pretty close. xor only works with things that return logical like TRUE and FALSE, but match actually returns the integer position. So just use grepl with xor:
xor(grepl("f", df$D), grepl("n", df$D))
Or you could get fancy (Reduce is part of base R, so no extra package needs to be loaded):
Reduce(xor, lapply(c("f", "n"), grepl, df$D))
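To go from that logical vector to the two-box plot, one possible continuation is sketched below. It uses the example data from the question; the column name group is just an illustrative choice.

```r
df <- data.frame(D = c("f", "f", "fi", "n", "ni", "ni", "fn", "fn"),
                 y = c(1, 0.8, 1.1, 2.1, 0.9, 8.8, 1.7, 5.4),
                 stringsAsFactors = FALSE)

# Keep rows that contain f or n, but not both
keep <- xor(grepl("f", df$D), grepl("n", df$D))
df2 <- df[keep, ]

# Collapse f/fi into "f" and n/ni into "n"
df2$group <- ifelse(grepl("f", df2$D), "f", "n")

# One box per group
boxplot(y ~ group, data = df2)
```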

We all cut our teeth on R at some point, so I'll try to construct an example for you that fits the question. How about:
# simulate a data.frame with "all possible combinations" of singles and pairs
df <- data.frame(txt = as.character(outer(c("i", "f", "n"), c("", "i", "f", "n"), paste0)),
stringsAsFactors = FALSE)
# create an empty factor variable to contain the result
df$has_only <- factor(rep(NA, nrow(df)), levels = 1:2, labels = c("f", "n"))
# replace with codes if contains either f or n, not both(f, n)
df$has_only[which(grepl("f", df$txt) & !grepl("f.*n|n.*f", df$txt))] <- "f"
df$has_only[which(grepl("n", df$txt) & !grepl("f.*n|n.*f", df$txt))] <- "n"
df
## txt has_only
## 1 i <NA>
## 2 f f
## 3 n n
## 4 ii <NA>
## 5 fi f
## 6 ni n
## 7 if f
## 8 ff f
## 9 nf <NA>
## 10 in n
## 11 fn <NA>
## 12 nn n
plot(df$has_only)
Note that this is a bar plot, not a box plot, since a box plot would only plot the range of continuous values, and you have not specified what are the continuous values or what they would look like. But if you did have such a variable, say df$myvalue, then you could produce a box plot with:
# simulate some continuous data
set.seed(50)
df$myvalue <- runif(nrow(df))
boxplot(myvalue ~ has_only, data = df)

Using R to Assign Treatments to Groups

We have seven exposures and 24 groups. We would like to randomly assign five of the seven exposures to groups while also ensuring that we end up with a consistent count for each exposure, meaning that each exposure ends up being exposed about the same number of times. I have written some code that does this but I cannot control how many times each exposure is shown. For example:
exposures <- c("A", "B", "C", "D", "E", "F", "G")
groups <- c(1:24)
table <- c()
for (i in 1:24){
draw <- sample(exposures, size=5, replace=F)
table <- rbind(table, draw)
}
table(table)
So the counts end up somewhat close but is there something I can do to ensure a minimum for each exposure? Thanks!
EDIT Also, we need each exposure to appear only once per group.

It's easier to think of it in terms of the two exposures that aren't used, rather than the five that are. Let's limit the number of times an exposure can be excluded:
draw_exc <- function(exposures,nexp,ng,max_excluded = 10){
nexc <- length(exposures)-nexp
exp_rem <- exposures
exc <- matrix(,ng,nexc)
for (i in 1:ng){
pool <- combn(exp_rem,nexc)
draw <- pool[,sample(1:ncol(pool), 1)]
exc[i,] <- draw
tab <- table(exc)
exp_rem <- setdiff(exp_rem, names(tab[tab > max_excluded]) )
}
exc
}
Here's an illustration:
set.seed(1)
exc <- draw_exc(exposures,5,24,10)
assignment <- apply(exc,1,function(x) setdiff(exposures,x))
table(exc)
# exc
# A B C D E F G
# 7 4 6 6 8 10 7
table(assignment)
# assignment
# A B C D E F G
# 17 20 18 18 16 14 17
So, with 24 groups, the maximum number of exclusions equals 24 minus the minimum number of appearances. This loop is not efficient, but it seems to get the job done.
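If the counts need to be as even as possible, a different and less random approach than the function above is to give each group the five exposures that have been used least so far, breaking ties at random. The sketch below is only an illustration; the names counts and assignment_even are made up here. With 24 groups of five drawn from seven exposures there are 120 slots, so every exposure ends up used 17 or 18 times.

```r
set.seed(1)
exposures <- c("A", "B", "C", "D", "E", "F", "G")
n_groups <- 24
n_per_group <- 5

counts <- setNames(integer(length(exposures)), exposures)
assignment_even <- matrix(NA_character_, nrow = n_groups, ncol = n_per_group)

for (g in seq_len(n_groups)) {
  # Order by how often each exposure was used so far, ties broken at random
  ord <- order(counts, runif(length(counts)))
  pick <- exposures[ord[seq_len(n_per_group)]]
  assignment_even[g, ] <- sample(pick)   # shuffle the order within the group
  counts[pick] <- counts[pick] + 1L
}

table(assignment_even)
```

Each row holds five distinct exposures, so the requirement that an exposure appears only once per group is met by construction.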

R associative memory doesn't work as expected

I am trying to use associative memory and ddply to add a column to a data frame. For example:
First, I have defined an association and a function that uses it to calculate the product of two elements of a row (property damage and multiplier) to get the actual damage in dollars. Here, "B" means billions, "m" or "M" means millions, etc.
validMultiplierLetter <- c("B", "h", "H", "k", "K", "m", "M")
Multiplier <- c(1000000000, 100, 100, 1000, 1000, 1000000, 1000000)
names(Multiplier) <- validMultiplierLetter
The function ploss (property loss) is:
ploss <- function(pd,pm) {
if (pm %in% validMultiplierLetter) pd*Multiplier[pm]
else 0
}
Here is a sample data frame with columns pd (property damage) and pm (multiplier), and the ddply code to create a pl (property loss) column, which is the product of property damage and the associated value of the multiplier. Invalid multipliers (e.g., "+") are equivalent to 0.
library(plyr)
tdf <- data.frame(pd = c(5, 10, 15, 20, 25), pm = c("B", "m", "K", "+", "h"))
tldf <- ddply(tdf, .(pd, pm), transform, pl = ploss(pd,pm))
I get the following output when I execute the code above - you can see that the right multiplier was not used for the rows.
> tldf
pd pm pl
1 5 B 500
2 10 m 10000
3 15 K 15000
4 20 + 0
5 25 h 2500
Strangely though, when you pass a constant, the multiplier works correctly. But when you pass a variable (whose value is the same as the constant), for some reason you get an incorrect result.
> Multiplier["B"]
B
1e+09
> tdf$pm[1]
[1] B
Levels: + B h K m
> Multiplier[tdf$pm[1]]
h
100
Any explanation of why this happens and how to fix it is greatly appreciated. Thanks.

The problem is that tdf$pm is a factor. When presented with a factor, [ will use the factor's underlying integer codes rather than its character labels:
x <- 10:15
names(x) <- LETTERS[1:6]
x
## A B C D E F
## 10 11 12 13 14 15
x[c('A','F')] # Lookup by name
## A F
## 10 15
x[factor(c('A','F'))] # Lookup by integer
## A B
## 10 11
This is fixed by using as.character around the factor, so that a character vector is presented to [:
x[as.character(factor(c('A','F')))]
## A F
## 10 15
For your problem, you can coerce to character in the transform function:
ddply(tdf, .(pd, pm), transform, pl = ploss(pd,as.character(pm)))
## pd pm pl
## 1 5 B 5.0e+09
## 2 10 m 1.0e+07
## 3 15 K 1.5e+04
## 4 20 + 0.0e+00
## 5 25 h 2.5e+03
In addition, you could vectorize your ploss function in the obvious way and do the job directly with transform:
ploss <- function(pd,pm) {
ifelse(pm %in% validMultiplierLetter, pd*Multiplier[pm], 0)
}
transform(tdf, pl=ploss(pd, as.character(pm)))
## pd pm pl
## 1 5 B 5.0e+09
## 2 10 m 1.0e+07
## 3 15 K 1.5e+04
## 4 20 + 0.0e+00
## 5 25 h 2.5e+03
And of course, the as.character coercion could be within the function ploss, so it isn't required in the transform call:
ploss <- function(pd,pm) {
ifelse(pm %in% validMultiplierLetter, pd*Multiplier[as.character(pm)], 0)
}

The problem I see is that, if you're using the default R options (the default before R 4.0.0), tdf$pm is a factor, not a character. You can check this with class(tdf$pm). What's happening here is that "B" is really a mask for 2 (following the order in the printout: Levels: + B h K m), so pm has the value 2 as far as [ is concerned, and Multiplier[2] is 100 as you've assigned.
When you call data.frame (or read.table) you need to add the argument stringsAsFactors = FALSE, or change the corresponding global option with the options function.
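As a minimal sketch of that suggestion, using the data from the question and plain vectorised code in place of ddply (so it is not the original ploss call):

```r
validMultiplierLetter <- c("B", "h", "H", "k", "K", "m", "M")
Multiplier <- c(1e9, 100, 100, 1000, 1000, 1e6, 1e6)
names(Multiplier) <- validMultiplierLetter

# pm stays a character column, so [ looks the multiplier up by name
tdf <- data.frame(pd = c(5, 10, 15, 20, 25),
                  pm = c("B", "m", "K", "+", "h"),
                  stringsAsFactors = FALSE)

# Invalid multipliers such as "+" give NA in the lookup and are replaced by 0
tdf$pl <- ifelse(tdf$pm %in% validMultiplierLetter,
                 tdf$pd * Multiplier[tdf$pm], 0)
tdf
```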
