Randomize a data.frame based on a column while keeping proportions - R

I have a data.frame that looks like this (my real data.frame is bigger but the structure is similar):
df <- data.frame(ID=c(rep('A', 5), rep('B', 5), rep('C',5)), Score=c(1,1,0,0,0,1,1,1,0,0,1,1,1,0,0))
And I would like to obtain several randomized data.frames (e.g. 100) where column Score is randomized and column ID remains the same, but I need to keep the same number of zeros and ones in `df$Score`.
I've tried with:
df1 <- transform(df, Score=ave(Score, ID, FUN=function(b) sample(b, replace=T)))
but the proportions of 0s and 1s are not always kept.
Thanks

If you want to keep the 0-1 proportion within each ID, set replace=FALSE (which is the default):
df1 <- transform(df, Score=ave(Score, ID, FUN=function(b) sample(b, replace=FALSE)))
If you want to keep the overall 0-1 proportion, you can simply do this:
df1 <- data.frame(ID=df$ID, Score=sample(df$Score))
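To get the 100 randomized data.frames the question asks for, you could wrap either variant in lapply(); a minimal sketch, assuming the within-ID version is the one you want:
# build a list of 100 randomized data.frames, each preserving
# the 0/1 counts within every ID
set.seed(1)  # only for reproducibility
random_dfs <- lapply(1:100, function(i) {
  transform(df, Score = ave(Score, ID, FUN = sample))
})
# sanity check: per-ID totals match the original
identical(tapply(random_dfs[[1]]$Score, random_dfs[[1]]$ID, sum),
          tapply(df$Score, df$ID, sum))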

Related

Locate % of times that the second highest value appears for each column in R data frame

I have a dataframe in R as follows:
set.seed(123)
df <- as.data.frame(matrix(rnorm(20*5,mean = 0,sd=1),20,5))
I want to find the percentage of times that the highest value of each row appears in each column, which I can do as follows:
A <- table(names(df)[max.col(df)])/nrow(df)
Then the percentage of times that the second highest value of each row appears in each column can be found as follows:
df2 <- as.data.frame(t(apply(df, 1, function(r) {
  r[which.max(r)] <- 0.001
  return(r)
})))
B <- table(names(df2)[max.col(df2)])/nrow(df2)
How can I calculate the following in R?
C <- The percentage of times that the first and the second highest values
appear in the first two columns of `df` simultaneously
I would do it like this:
# compute reverse rank
df.rank <- ncol(df) - t(apply(df, 1, rank)) + 1
A <- colMeans(df.rank == 1)
B <- colMeans(df.rank == 2)
C <- mean(apply(df.rank[, 1:2], 1, prod)==2)
First I compute the reverse rank, which is analogous to using decreasing=TRUE with sort() or order(). A and B are then rather straightforward. Please note that your original approach omits zeros for columns in which no (second) maximum value appears, which may cause problems in later usage.
For C, I take only the first two columns of the rank matrix and compute their product for every row. Since ranks are positive integers, the product equals 2 only when the two columns hold ranks 1 and 2, i.e. the two largest values of the row.
Also, if ties might appear in your data set you should consider selecting the appropriate ties.method argument for rank.
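For instance, ties.method = "first" breaks ties by position, so every row still yields a unique rank 1 and rank 2, whereas the default "average" can produce fractional ranks that never equal 1 or 2. A small illustration:
x <- c(3, 1, 3, 2)
rank(x)                         # default "average": 3.5 1.0 3.5 2.0
rank(x, ties.method = "first")  # 3 1 4 2 -- ties broken by position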

Randomly assign an integer within each group of IDs in a data frame - R

I am trying to assign a random integer within each group of existing IDs. The integers must meet the following conditions: unique within a group, non-repeating, and the highest integer for a group of IDs must not be greater than the number of rows with that ID.
I have tried doing this in a for loop, and it works for the first group of IDs, but does not repeat for the next set. I looked at several existing Stack Overflow questions and other sites which address this to some degree, but still have not been able to get it right. Links below:
Randomly Assign Integers in R within groups without replacement
http://r.789695.n4.nabble.com/Random-numbers-for-a-group-td964301.html
random selection within groups
I need it to be dynamic in the sense that week by week there could be more ID's or fewer ID's. The actual DF has several other columns, but for ease of reproduction they were left out as they are not used.
Example below:
#Desired Output
Groups <- c("A","A","A","A","B","B","B","B","B","B","B","B","C","C","C","C","C","C")
Desired_Integer <- c(1,4,2,3,6,3,1,2,8,5,7,4,5,6,1,4,3,2)
Example <- data.frame(Groups,Desired_Integer)
#Attempted For Loop for Example (assuming Example is a DF with one column, Groups for the For Loop)
Groups <- c("A","A","A","A","B","B","B","B","B","B","B","B","C","C","C","C","C","C")
Example <- as.data.frame(Groups)
for (i in Example$Groups) {
  Example$Desired_Integer <- sample.int(length(which(Example$Groups == i)))
}
Thank you in advance for your help!
You can do this with the base function ave:
dd <- data.frame(Groups = rep(c("A","B","C"), c(4,8,6)))
rand_seq_for <- function(x) sample.int(length(x))
dd$rand_int <- ave(1:nrow(dd), dd$Groups, FUN=rand_seq_for)
or using dplyr, you can do
library(dplyr)
dd %>% group_by(Groups) %>% mutate(rand_int=sample.int(n()))
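A quick sanity check on the dd built above: each group's integers should sort into a complete 1..n sequence, where n is the group size.
# list the assigned integers per group, sorted
tapply(dd$rand_int, dd$Groups, sort)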

Remove Duplicates, but Keep the Most Complete Iteration

I'm trying to figure out how to remove duplicates based on three variables (id, key, and num). I would like to remove the duplicate with the fewest columns filled. If an equal number are filled, either can be removed.
For example,
Original <- data.frame(id = c(1,2,2,3,3,4,5,5),
                       key = c(1,2,2,3,3,4,5,5),
                       num = c(1,1,1,1,1,1,1,1),
                       v4 = c(1,NA,5,5,NA,5,NA,7),
                       v5 = c(1,NA,5,5,NA,5,NA,7))
The output would be the following:
Finished <- data.frame(id = c(1,2,3,4,5),
                       key = c(1,2,3,4,5),
                       num = c(1,1,1,1,1),
                       v4 = c(1,5,5,5,7),
                       v5 = c(1,5,5,5,7))
My real dataset is bigger, a mix of mostly numeric but some character variables, and I couldn't determine the best way to go about doing this. I've previously used a program that would do something similar within its duplicates command, called check.all.
So far, my thoughts have been to use grepl and determine where "anything" is present
Present <- apply(Original, 2, function(x) grepl("[[:alnum:]]", x))
Then, using the resultant dataframe, I ask for rowSums and cbind it to the original.
CompleteNess <- rowSums(Present)
cbind(Original, CompleteNess)
This is the point where I'm unsure of my next steps... I have a variable which tells me how many columns are filled in each row (CompleteNess); however, I'm unsure of how to handle the duplicates.
Simply, I'm looking for: when id, key, and num are duplicated, keep the row with the highest value of CompleteNess.
If anybody can think of a better way to do this or get me through the last little bit I would greatly appreciate it. Thanks All!
Here is a solution. It is not very pretty but it should work for your application:
# Order by the degree of completeness
Original <- Original[order(CompleteNess), ]
# Starting from the bottom, select the non-duplicated rows
# based on the first 3 columns
Original[!duplicated(Original[, 1:3], fromLast = TRUE), ]
This does rearrange your original data frame so beware if there is additional processing later on.
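Putting the pieces together with the CompleteNess score computed in the question, a minimal end-to-end sketch:
# count filled cells per row, as in the question
Present <- apply(Original, 2, function(x) grepl("[[:alnum:]]", x))
CompleteNess <- rowSums(Present)
# sort so the most complete duplicate comes last, then keep that one
Original <- Original[order(CompleteNess), ]
Finished <- Original[!duplicated(Original[, 1:3], fromLast = TRUE), ]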
You can aggregate your data and select the row with max score:
Original <- data.frame(id = c(1,2,2,3,3,4,5,5),
                       key = c(1,2,2,3,3,4,5,5),
                       num = c(1,1,1,1,1,1,1,1),
                       v4 = c(1,NA,5,5,NA,5,NA,7),
                       v5 = c(1,NA,5,5,NA,5,NA,7))
Present <- apply(Original, 2, function(x) grepl("[[:alnum:]]", x))
#get the score
Original$present <- rowSums(Present)
#create a column to aggregate on
Original$id.key.num <- paste(Original$id, Original$key, Original$num, sep = "-")
library("plyr")
#aggregate here
Final <- ddply(Original, .(id.key.num), summarize,
               Max = max(present))
And if you want to keep the other columns, just do this:
Final <- ddply(Original, .(id.key.num), summarize,
               Max = max(present),
               v4 = v4[which.max(present)],
               v5 = v5[which.max(present)])
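With many columns, listing each one gets tedious; a sketch of an alternative that returns the entire most-complete row per group instead of naming columns:
# keep the whole row with the highest completeness score in each group
Final <- ddply(Original, .(id.key.num), function(d) d[which.max(d$present), ])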

What is the best way to perform basic calculations (% of total) across dataframes in a list?

Consider a list of dataframes called listDF. Each of the dataframes has the same columns:
"Date" "Location" "V1" "V2" where V1 is a column filled with real numbers
I would like to calculate the % of total of, say, V1 for each Date/Location combination. That is, sum V1 across all dataframes for each specific Date/Location pair and then calculate the share each V1 observation represents of that total.
What I've tried:
I stack the dataframes because I don't know how to do the sweeping without looping through the Dataframe/Date/Location combinations, which is clearly inefficient.
library(plyr)
aggregate <- rbind.fill(listDF)
ptt <- ddply(aggregate,.(Date,Location),transform, share= V1/sum(V1))
The last line causes RStudio to crash and ask me to start a new session. FWIW, the average dataframe has 50k rows and the list has about 1M rows total. Should I be using prop.table?
In an ideal world, I would have the percent to total (ptt) as a column in each dataframe, instead of in a single stacked dataframe which I would have to split after.
*Side question: is there a way to choose which subset of list elements to use for any given ptt? I've assumed using all dataframes in my initial question but would love to choose based on criteria of, say, V2.
Thanks for your help.
If each data frame in the list has the same columns, it would be easier to work with a single data frame that has an extra variable indicating the original data frame. Then you can easily perform calculations grouped by data frame.
sample data
# two data frames
d1 <- data.frame(x = rep(LETTERS[1:2], each = 5), y = rnorm(10))
d2 <- data.frame(x = rep(LETTERS[1:2], each = 7), y = rnorm(14))
# put data frames in a list
L <- list(d1, d2)
We can use dplyr::bind_rows() to "unlist" L into a single data frame. The .id option instructs bind_rows() to create an explicit variable identifying the original data frame:
library(dplyr)
d <- bind_rows(L, .id = "dat")
Now you can do any summary grouped by the variable you created:
d %>%
  group_by(dat) %>%
  summarise(mean_y = mean(y))
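Applied to the question's own columns, the % of total is a grouped mutate, and the .id column answers the side question about subsetting. A sketch, assuming the stacked frame was built as d_all <- bind_rows(listDF, .id = "dat") and carries the Date/Location/V1 columns described above:
# share of V1 within each Date/Location pair, across all data frames
ptt <- d_all %>%
  group_by(Date, Location) %>%
  mutate(share = V1 / sum(V1)) %>%
  ungroup()
# side question: restrict to particular source data frames via .id
ptt_sub <- ptt %>% filter(dat %in% c("1", "3"))
# and split back into one data frame per original list element if needed
ptt_list <- split(ptt, ptt$dat)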

R: Add columns to a data frame on the fly

New to R and programming in general over here. I have several binary matrices of presence/absence data for species (columns) and plots (rows). I'm trying to use them in several dissimilarity indices which require that they all have the same dimensions. Although there are always 10 plots, there is a variable number of columns based on which species were observed at that particular time. My attempt to add the 'missing' columns to each matrix so I can perform the analyses went as follows:
df1 <- read.csv('file1.csv', header=TRUE)
df2 <- read.csv('file2.csv', header=TRUE)
newCol <- unique(append(colnames(df1),colnames(df2)))
diff1 <- setdiff(newCol,colnames(df1))
diff2 <- setdiff(newCol,colnames(df2))
for (i in 1:length(diff1)) {
  df1[paste(diff1[i])]
}
for (i in 1:length(diff2)) {
  df2[paste(diff2[i])]
}
No errors are thrown, but df1 and df2 both remain unchanged. I suspect my issue is with my use of paste, but I couldn't find any other way to add columns to a data frame on the fly like that. When added, the new columns should have 0s in the matrix as well, but I think that's the default, so I didn't add anything to specify it.
Thanks all.
Using your code, you can generate the columns without the for loop as follows:
df1[, diff1] <- 0 #I guess you want `0` to fill those columns
df2[, diff2] <- 0
identical(sort(colnames(df1)), sort(colnames(df2)))
#[1] TRUE
Or if you want to combine the datasets into one, you could use rbindlist from data.table with fill=TRUE:
library(data.table)
rbindlist(list(df1, df2), fill=TRUE)
data
set.seed(22)
df1 <- as.data.frame(matrix(sample(0:1, 10*6, replace=TRUE), ncol=6,
  dimnames=list(NULL, sample(paste0("Species", 1:10), 6, replace=FALSE))))
set.seed(35)
df2 <- as.data.frame(matrix(sample(0:1, 10*8, replace=TRUE), ncol=8,
  dimnames=list(NULL, sample(paste0("Species", 1:10), 8, replace=FALSE))))
