I have a large R matrix with 1,000 rows and 4 attributes, each attribute having 4 levels, such that:
Row    A B C D
1      1 3 4 2
2      2 1 3 4
3      1 2 4 3
...
1000   3 4 1 2
I want to create a new table with pre-specified proportions such that level 1 of attribute A appears 25% of the time, level 2 50%, level 3 10%, and level 4 15% of the time. The table can be smaller than 1,000 rows, and rows have to be unique.
proportions <- c(0.25,0.5,0.1,0.15)
I know it's kind of a basic question, but I have been racking my brain for two hours and haven't found anything on Stack Overflow or elsewhere on the internet.
UPDATE
I want to keep the same combinations within rows. So I want to create a new table with the given proportions, but using the table (and thus the combinations) that I already have.
You can create your set with the proportions you want and then "reshuffle":
A <- c(rep(1,250), rep(2,500), rep(3,100), rep(4,150))
B <- sample(A, 1000)
EDIT:
It's not entirely clear what the OP wants.
If you want the exact same table but randomized, you can try:
df_new <- df[sample(1:nrow(df), nrow(df)),]
To get exactly the same proportions, you are restricted to sample sizes for which every new count (size times proportion) is a whole number.
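For example, with the proportions in the question, every count comes out as a whole number only when the new size is a multiple of 20:

proportions <- c(0.25, 0.5, 0.1, 0.15)
proportions * 20   # the smallest size that works
# [1]  5 10  2  3
proportions * 50   # 50 is not a multiple of 20, so one count is fractional
# [1] 12.5 25.0  5.0  7.5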
In order to get the proportions of each unique row, you can try:
# simulating the table
a <- c(rep(1, 250), rep(2, 500), rep(3, 100), rep(4, 150))
b <- sample(a, 4000, replace = TRUE)
df <- as.data.frame(matrix(b, ncol = 4))
names(df) <- c('a', 'b', 'c', 'd')

# counting how often each unique row (combination) occurs
z <- aggregate(row.names(df), list(df$a, df$b, df$c, df$d), length)
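If the goal (per the UPDATE) is to keep the existing row combinations while hitting the target proportions on attribute A, one possible sketch is to deduplicate the original table, split it by the level of A, and sample whole rows without replacement. Here orig stands for the OP's 1,000-row table with a column named A, and new_n is an assumed target size chosen so that new_n * proportions are whole numbers and do not exceed the number of unique rows available for each level:

proportions <- c(0.25, 0.5, 0.1, 0.15)
new_n <- 200                              # assumed target size (a multiple of 20)
counts <- new_n * proportions             # rows needed per level of A: 50 100 20 30

uniq <- unique(orig)                      # rows have to be unique
picked <- lapply(1:4, function(lev) {
  pool <- uniq[uniq$A == lev, ]           # unique rows with this level of A
  pool[sample(nrow(pool), counts[lev]), ] # sample whole rows without replacement
})
new_table <- do.call(rbind, picked)
prop.table(table(new_table$A))            # should match proportions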
I have a dataframe like this:
mydf <- data.frame(A = c(40,9,55,1,2), B = c(12,1345,112,45,789))
mydf
A B
1 40 12
2 9 1345
3 55 112
4 1 45
5 2 789
I want to retain only 95% of the observations and throw out 5% of the data that have extreme values. First, I calculate how many observations that is:
th <- length(mydf$A) * 0.95
And then I want to remove all the rows above th (or retain the rows below th, if you prefer). I need to sort mydf in ascending order so that I remove only the extreme values. I tried several approaches:
mydf[order(mydf["A"], mydf["B"]),]
mydf[order(mydf$A,mydf$B),]
mydf[with(mydf, order(A,B)), ]
plyr::arrange(mydf,A,B)
but nothing works: mydf is not sorted in ascending order by the two columns at the same time. I looked at Sort (order) data frame rows by multiple columns, but the most common solutions do not work and I don't get why.
However, if I consider only one column at a time (e.g., A), those ordering methods work, but then I don't get how to throw out the extreme values, because this:
mydf <- mydf[(order(mydf$A) < th),]
removes the second row, which has a value of 9, while my intent is to subset mydf retaining only the values below the threshold (intended here as a number of observations, not a value).
I can imagine it is something very simple and basic that I am missing... And probably there are nicer tidyverse approaches.
I think you want rank here, but it doesn't work on multiple columns. To work around that, note that rank(.) is equivalent to order(order(.)):
rank(mydf$A)
# [1] 4 3 5 1 2
order(order(mydf$A))
# [1] 4 3 5 1 2
With that, we can order on both (all) columns, then order again, then compare the resulting ranks with your th value.
mydf[order(do.call(order, mydf)) < th,]
# A B
# 1 40 12
# 2 9 1345
# 4 1 45
# 5 2 789
This approach benefits from preserving the natural sort of the rows.
If you would prefer to stick with a single call to order, then you can reorder them and use head:
head(mydf[order(mydf$A, mydf$B),], th)
# A B
# 4 1 45
# 5 2 789
# 2 9 1345
# 1 40 12
though this does not preserve the original order of rows (which may or may not be important to you).
Possible approach
An alternative to your approach would be to use a dplyr ranking function such as cume_dist() or percent_rank(). These can accept a dataframe as input and return ranks / percentiles based on all columns.
library(dplyr)

set.seed(13)
dat_all <- data.frame(
  A = sample(1:60, 100, replace = TRUE),
  B = sample(1:1500, 100, replace = TRUE)
)
nrow(dat_all)
# 100
dat_95 <- dat_all[cume_dist(dat_all) <= .95, ]
nrow(dat_95)
# 95
General cautions about quantiles
More generally, keep in mind that defining quantiles is slippery, as there are multiple possible approaches. You'll want to think about what makes the most sense given your goal. As an example, from the dplyr docs:
cume_dist(x) counts the total number of values less than or equal to x_i, and divides it by the number of observations.
percent_rank(x) counts the total number of values less than x_i, and divides it by the number of observations minus 1.
Some implications of this are that the lowest value is always 1 / nrow() for cume_dist() but 0 for percent_rank(), while the highest value is always 1 for both methods. This means different cases might be excluded depending on the method. It also means the code I provided will always remove the highest-ranking row, which may or may not match your expectations. (e.g., in a vector with just 5 elements, is the highest value "above the 95th percentile"? It depends on how you define it.)
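A quick illustration of that difference on a toy vector (the values are shown as comments):

library(dplyr)

x <- c(10, 20, 30, 40, 50)
cume_dist(x)     # 0.2 0.4 0.6 0.8 1.0      -- the minimum gets 1/n, the maximum gets 1
percent_rank(x)  # 0.00 0.25 0.50 0.75 1.00 -- the minimum gets 0, the maximum gets 1
x[cume_dist(x) <= .95]   # the top-ranked element is always dropped: 10 20 30 40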
Please forgive me if this question has been asked before!
So I have a dataframe (df) of individuals sampled from various populations, with each individual given a population name and a corresponding population number, as follows:
Individual Population Popnum
ALM16-014 AimeesMdw 1
ALM16-024 AimeesMdw 1
ALM16-026 AimeesMdw 1
ALM16-003 AMKRanch 2
ALM16-022 AMKRanch 2
ALM16-075 BearPawLake 3
ALM16-076 BearPawLake 3
ALM16-089 BearPawLake 3
There are a total of 12 named populations (they do not all have the same number of individuals), with Popnum 1-12 in this file. What I need to do is randomly delete one or more populations (preferably using the 'Popnum' column) from the dataframe, repeat this 100 times, and save each result as a separate dataframe (i.e. df1, df2, df3, etc.). The end result is 100 dfs, each having one population removed at random. The next step is to repeat this 100 times removing two random populations, then 3 random populations, and so on.
Any help would be greatly appreciated!!
You can write a function which takes a dataframe and n, i.e. the number of Popnum values to remove, as input.
remove_n_Popnum <- function(data, n) {
  subset(data, !Popnum %in% sample(unique(Popnum), n))
}
To remove one Popnum you can do:
remove_n_Popnum(df, 1)
# Individual Population Popnum
#1 ALM16-014 AimeesMdw 1
#2 ALM16-024 AimeesMdw 1
#3 ALM16-026 AimeesMdw 1
#4 ALM16-003 AMKRanch 2
#5 ALM16-022 AMKRanch 2
To do this 100 times you can use replicate
list_data <- replicate(100, remove_n_Popnum(df, 1), simplify = FALSE)
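If you really do need 100 separate objects df1, df2, ... in the workspace (keeping them in a list is usually easier to work with), one possible way is to name the list and push it into the global environment:

names(list_data) <- paste0("df", seq_along(list_data))
list2env(list_data, envir = .GlobalEnv)   # creates df1 ... df100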
To pass different values of n to remove_n_Popnum you can use lapply:
nested_list_data <- lapply(seq_along(unique(df$Popnum)[-1]),
function(x) replicate(100, remove_n_Popnum(df, x), simplify = FALSE))
where seq_along(unique(df$Popnum)[-1]) generates a sequence whose length is 1 less than the number of unique Popnum values:
seq_along(unique(df$Popnum)[-1])
#[1] 1 2
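The result is a nested list: nested_list_data[[k]] holds the 100 replicates with k populations removed. For example, to inspect one particular draw:

one_draw <- nested_list_data[[2]][[5]]   # 5th replicate with 2 populations removed
unique(one_draw$Popnum)                  # the populations kept in that draw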
I have a data set with 70 column variables, each a 0-1 dummy variable, and 3500 observations. I am looking to see how often observations with a 'success' in one variable are matched with another variable. In other words, if obs 1 has a success dummy in variable one, how often does it also have a success in variable 2, and so on for all the variables. I have found how to create a matrix table showing interactions when only two columns are involved; however, I can't find anything involving many columns. Ideally I'd like to present this in an interaction matrix with 70 variables across and 70 down. Here is an idea of the data set:
Dat A B C D
XX 1 1 1 1
XY 0 1 0 1
XZ 0 0 1 1
The output I'm hoping for would be:
Out  A  B  C  D
A    0  1  1  1
B       0  1  2
C          0  2
D             0
showing the number of times that (A,B) is a pairing, (B,C) is a pairing, and so on.
I have tried using the table() command as well as as.matrix, but it seems these require data organized as two columns and cannot understand the data when it refers to many column variables. I am fairly new to R, so I apologize if my question isn't clear or is possibly quite simple.
Any help is appreciated. Thanks
Here's how to create a correlation matrix of arbitrary size. First create a reproducible example of your dataset...
dat <- matrix(sample(0:1, size = 700, replace = TRUE), ncol = 70)
dat <- data.frame(dat)
Then calculate the correlation...
dat <- cor(dat)
And then plot the correlation visually...
library(corrplot)
corrplot(dat, method = "square")
You can also plot the correlation using numbers instead of colors...
corrplot(dat, method = "number")
Obviously you'll want to finesse these charts before using them in a publication. corrplot offers tons of options for chart appearance.
You can try:
# count, for every pair of dummy columns (the first column is Dat), the rows where both are 1
res <- apply(combn(2:ncol(df), 2), 2, function(x, y) sum(rowSums(y[, x]) == 2), df)

# arrange the counts in an upper-triangular matrix; note that combn produces the pairs
# in a different order than upper.tri(), so index with the pairs directly
m <- diag(x = 0, ncol(df) - 1)
m[t(combn(ncol(df) - 1, 2))] <- res
m[lower.tri(m)] <- NA
dimnames(m) <- list(colnames(df)[-1], colnames(df)[-1])
m
A B C D
A 0 1 1 1
B NA 0 1 2
C NA NA 0 2
D NA NA NA 0
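A possible alternative sketch (not part of the answer above): since the columns are 0/1, the matrix cross-product gives all pairwise co-occurrence counts in one step. This assumes the first column of df is the Dat identifier and the rest are the dummies:

m2 <- crossprod(as.matrix(df[-1]))   # t(X) %*% X: co-occurrence counts; diagonal = column totals
diag(m2) <- 0                        # zero the diagonal to match the layout above
m2[lower.tri(m2)] <- NA              # keep only the upper triangle
m2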
Let's say I have a vector of integers 1:6
w=1:6
I am attempting to obtain a matrix of 90 rows and 6 columns that contains the multinomial combinations from these 6 integers taken as 3 groups of size 2.
6!/(2!*2!*2!)=90
So, columns 1 and 2 of the matrix would represent group 1, columns 3 and 4 would represent group 2 and columns 5 and 6 would represent group 3. Something like:
1 2 3 4 5 6
1 2 3 5 4 6
1 2 3 6 4 5
1 2 4 5 3 6
1 2 4 6 3 5
...
Ultimately, I would want to expand this to other multinomial combinations of limited size (because the numbers get large rather quickly) but I am having trouble getting things to work. I've found several functions that do binomial combinations (only 2 groups) but I could not locate any functions that do this when the number of groups is greater than 2.
I've tried two approaches to this:
1. Building up the matrix from nothing using for loops, and attempting things with the reshape package (thinking there might be something there for this with melt())
2. Working backwards from the permutation matrix (720 rows) by attempting to retain unique rows within groups and/or remove duplicated rows within groups
Neither worked for me.
The permutation matrix can be obtained with
library(gtools)
dat=permutations(6, 6, set=TRUE, repeats.allowed=FALSE)
I think working backwards from the full permutation matrix is a bit excessive, but I'm trying anything at this point.
Is there a package with a prebuilt function for this? Anyone have any ideas how I should proceed?
Here is how you can implement your "working backwards" approach:
gps <- list(1:2, 3:4, 5:6)                          # column indices of the three groups
get.col <- function(x, j) x[, j]                    # extract a group's columns
is.ordered <- function(x) !colSums(diff(t(x)) < 0)  # TRUE for rows whose columns are in increasing order
is.valid <- Reduce(`&`, Map(is.ordered, Map(get.col, list(dat), gps)))  # combine the per-group checks
dat <- dat[is.valid, ]                              # keep only rows sorted within every group
nrow(dat)
# [1] 90
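If generating all 720 permutations ever becomes too expensive, a possible direct construction (a sketch along the same lines, not part of the answer above) is to pick group 1 with combn, pick group 2 from what remains, and let group 3 be the leftover pair, giving choose(6, 2) * choose(4, 2) = 90 rows without any filtering:

w <- 1:6
g1 <- combn(w, 2)                      # choose(6, 2) = 15 candidates for group 1
rows <- list()
for (i in seq_len(ncol(g1))) {
  rest <- setdiff(w, g1[, i])
  g2 <- combn(rest, 2)                 # choose(4, 2) = 6 candidates for group 2
  for (j in seq_len(ncol(g2))) {
    rows[[length(rows) + 1]] <- c(g1[, i], g2[, j], setdiff(rest, g2[, j]))
  }
}
res <- do.call(rbind, rows)
nrow(res)
# [1] 90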
I have the following code, which does what I want. But I would like to know if there is a simpler/nicer way of getting there?
The overall aim is that I am building a separate summary table for the overall data, so the average that comes out of this will go into that summary.
Test <- data.frame(
  ID = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  Thing = c("Apple", "Apple", "Pear", "Pear", "Apple", "Apple", "Kiwi", "Apple", "Pear"),
  Day = c("Mon", "Tue", "Wed")
)
library(reshape2)  # for dcast

countfruit <- function(data) {
  df <- as.data.frame(table(data$ID, data$Thing))
  df <- dcast(df, Var1 ~ Var2)
  colnames(df) <- c("ID", "Apple", "Kiwi", "Pear")

  # fixing the counts to apply a 1 if there is any count there:
  df$Apple[df$Apple > 0] <- 1
  df$Kiwi[df$Kiwi > 0] <- 1
  df$Pear[df$Pear > 0] <- 1

  # making a new column in the summary table of how many fruits for each person
  df$number <- rowSums(df[2:4])
  return(mean(df$number))
}
result <- countfruit(Test)
I think you are overcomplicating the problem. Here is a smaller version keeping the same rationale.
df <- table(data$ID, data$Thing)
mean(rowSums(df > 0))  ## mean number of non-zero columns (distinct Things) per ID
EDIT one linear solution:
with(Test, mean(rowSums(table(ID, Thing) > 0)))
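For the Test data above this gives (2 + 2 + 3) / 3 distinct fruits per ID:

with(Test, mean(rowSums(table(ID, Thing) > 0)))
# [1] 2.333333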
It looks like you are trying to count how many nonzero entries there are in each row. If so, either use as.logical, which will convert any nonzero number to TRUE (aka 1), or just count the number of zeros in a row and subtract that from the number of pertinent columns.
For example, if I followed your code correctly, your dataframe is
  Var1 Apple Kiwi Pear
1    1     2    0    1
2    2     2    0    1
3    3     1    1    1
So, (ncol(df) - 1) - sum(df[1, -1] == 0) gives you the count for the first row (sum() counts the zeros among the fruit columns).
Alternatively, use as.logical to convert all nonzero values to TRUE aka 1 and calculate the rowSums over the columns of interest.
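A short sketch of that idea, assuming df is the counts table printed above (column Var1 followed by the fruit columns); comparing with > 0 does the same job as converting to logical:

df$number <- rowSums(df[, -1] > 0)   # number of distinct fruits per ID
mean(df$number)                      # the average that goes into the summary table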