From randomly to not randomly selecting columns - r

I have this piece of R script and I want to adjust it a little bit.
Here's the script I have; mydata is an imported .csv file with n columns:
library(orddom)  # Ordinal Dominance Statistics
R = 6
delta = numeric(R)
for (i in 1:R) {
  a <- data.matrix(sample(mydata, 2, replace = FALSE))
  drops <- c(colnames(a))
  b <- data.matrix(mydata[, !(names(mydata) %in% drops)])
  a1 <- na.omit(t(matrix(a, 1)))
  b1 <- na.omit(t(matrix(b, 1)))
  colnames(a1) <- c("Group 1")
  colnames(b1) <- c("Group 2")
  delta[i] <- abs(as.numeric(orddom(a1, b1, alpha = 0.05, paired = FALSE)[13, 1]))
}
The problem is that the columns of mydata are sampled randomly for matrix a, which leads to several identical delta values: every time the loop starts again, there is a chance that the same set of columns gets selected.
Now I want the columns to be selected systematically rather than randomly. I want all possible column combinations, where order does not matter (columns 1, 2 and 3 form the same combination as columns 2, 1 and 3, and so on), avoiding combinations of a column with itself, and without repetition.
Is there a way to exclude column combinations that have already been selected before?
Then I would like to calculate delta for every combination and store it in a vector.

You can try the following:
# get all column pairs outside the loop
combos <- combn(length(mydata), 2)
R <- ncol(combos)
delta <- numeric(R)
# in the loop, replace the first line with
a <- mydata[, combos[, i]]
# the rest should be OK
There are some improvements you could make in the code, but they are not relevant to what you are asking.
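Putting it together, a minimal sketch of the full loop under the same assumptions as the question's code (in particular, that element [13,1] of the orddom output is the statistic wanted; dropping the selected pair with negative indexing replaces the names-based drops step):
library(orddom)
combos <- combn(ncol(mydata), 2)   # each column of combos is one unordered pair of column indices
R <- ncol(combos)
delta <- numeric(R)
for (i in 1:R) {
  a <- data.matrix(mydata[, combos[, i]])    # the selected pair of columns
  b <- data.matrix(mydata[, -combos[, i]])   # all remaining columns
  a1 <- na.omit(t(matrix(a, 1)))
  b1 <- na.omit(t(matrix(b, 1)))
  colnames(a1) <- c("Group 1")
  colnames(b1) <- c("Group 2")
  delta[i] <- abs(as.numeric(orddom(a1, b1, alpha = 0.05, paired = FALSE)[13, 1]))
}
Because combn enumerates each unordered pair exactly once, no combination can repeat.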

Related

Locate % of times that the second highest value appears for each column in R data frame

I have a dataframe in R as follows:
set.seed(123)
df <- as.data.frame(matrix(rnorm(20*5,mean = 0,sd=1),20,5))
I want to find the percentage of times that the highest value of each row appears in each column, which I can do as follows:
A <- table(names(df)[max.col(df)])/nrow(df)
Then the percentage of times that the second highest value of each row appears in each column can be found as follows:
df2 <- as.data.frame(t(apply(df, 1, function(r) {
  r[which.max(r)] <- 0.001
  return(r)
})))
B <- table(names(df2)[max.col(df2)])/nrow(df2)
How can I calculate in R the following?
C<- The percentage of times that the first and the second highest values
appear in the first two columns of `df` simultaneously
I would do it like this:
# compute reverse rank
df.rank <- ncol(df) - t(apply(df, 1, rank)) + 1
A <- colMeans(df.rank == 1)
B <- colMeans(df.rank == 2)
C <- mean(apply(df.rank[, 1:2], 1, prod)==2)
First I compute the reverse rank, which is analogous to using decreasing=TRUE with sort() or order(). A and B are then rather straightforward. Please note that your original approach omits zeros for columns in which the (second) highest value never appears, which may cause problems in later usage.
For C, I take only the first two columns of the rank matrix and compute their product for every row. If the two largest values are in the first two columns, the product has to be 2 (the ranks 1 and 2 multiplied).
Also, if ties might appear in your data set, you should consider selecting the appropriate ties.method argument for rank.
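For example, a minimal sketch with ties broken by first occurrence, so every row gets a strict 1..ncol(df) ordering (whether this tie-breaking rule is appropriate depends on your data):
# extra arguments to apply() are passed on to rank()
df.rank <- ncol(df) - t(apply(df, 1, rank, ties.method = "first")) + 1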

Resampling with entries with same code (ID)

In R, I'm trying to resample my dataset.
The matrix A contains integer codes in its first column and the characteristics of each row in the remaining columns, as follows:
A <- as.matrix(cbind(floor(runif(1000, 1,101)), matrix(rexp(20000, rate=.1), ncol=20) ))
Some codes are repeated in the first column.
I want to randomly resample codes from the first column and create a new matrix or dataframe that, for each entry in the resampled code vector, contains the corresponding rows of A. If several rows share a resampled code, all of them should be included; and if the same code is resampled twice, all rows in A with that code should appear twice.
---EDIT---
The resampling is done with replacement. So far what I did is:
res <- sample(unique(A[,1]), size = length(unique(A[,1])), replace = TRUE)
A.new <- A[which(A[,1] %in% res), ]
However, assume that two rows in A have the same code (say 2) and that res selects 2 four times. A.new will then contain those two rows only once each (because %in% merely tests membership), instead of having them repeated four times.
We can do it like this:
A.new <- lapply(res, function(x) A[A[,1] == x, , drop = FALSE])
A.new <- do.call(rbind, A.new)
The first line makes a list of matrices: each value of res creates a list item that is the subset of rows of A whose first column equals that value of res. If res contains the same number more than once, a matrix will be created for each occurrence of that value. (lapply is used rather than sapply, and drop = FALSE is set, so the result is always a list of matrices and never gets simplified.)
The second line uses rbind to condense this list into a single matrix.
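An equivalent sketch that collects row indices first and subsets A only once, which avoids building intermediate matrices:
# for each resampled code, find the matching row numbers, then subset in one go
idx <- unlist(lapply(res, function(x) which(A[, 1] == x)))
A.new <- A[idx, ]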

Forcing Rbind with uneven columns in R

I am trying to force some list objects (e.g. 4 tables of frequency counts) into a matrix by doing rbind. However, they have uneven columns (i.e. some range from 2 to 5, while others range from 1 to 5). What I want is a display such that, if a table does not begin with a column of 1, NA appears in that position of the resulting rbind matrix. I tried the approach below, but the values repeat themselves across the row rather than showing NA where a count does not exist.
I considered rbind.fill, but it requires the tables to be data frames. I could write some loops, but in the spirit of R I wonder if there is another approach I could use?
# Example
a <- sample(0:5, 100, replace = TRUE)
b <- sample(2:5, 100, replace = TRUE)
c <- sample(1:4, 100, replace = TRUE)
d <- sample(1:3, 100, replace = TRUE)
list <- list(a, b, c, d)
table(list[4])
count(list[1])   # count() is from the plyr package
matrix <- matrix(ncol = 5)
lapply(list, table)
do.call("rbind", lapply(list, table))
When I have a similar problem, I include all the values I want in the vector and then subtract one from the resulting counts:
table(c(1:5, a)) - 1
This could be made into a function
table2 <- function(x, values, ...){
  table(c(x, values), ...) - 1
}
Of course, this will give zeros rather than NAs.
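A short sketch of how the final rbind might then look, assuming the desired categories are 0 through 5 (note that converting zeros to NA afterwards also hides genuinely zero counts):
tabs <- lapply(list, table2, values = 0:5)   # every table now covers the same 6 categories
m <- do.call(rbind, tabs)
m[m == 0] <- NA   # only if NAs really are preferred over zeros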

Converting cross-sectional data into an adjacency matrix in R

I am trying to convert cross-sectional data into an adjacency matrix, as I want to analyze how often certain variables are present together with social network analysis.
In case empirical examples would help with the logic, it's basically analogous to presenting 4 people with a choice of three objects; they can choose from 0 to 3 of the objects. I'd like to analyze how commonly different objects were chosen together and visualize this as a network of preferences.
The data is set up as cross-sectional data, below:
ID1 <- c(1,0,0)
ID2 <- c(1,0,1)
ID3 <- c(1,1,1)
ID4 <- c(0,0,0)
IDs <- c("1","2","3","4")
df <- data.frame(rbind(ID1, ID2, ID3, ID4))
df <- cbind(IDs, df)
colnames(df) <- c("ID", "Var1", "Var2", "Var3")
I'd like to create a weighted adjacency matrix for Var1, Var2 and Var3, with each cell containing the total number of times the two variables occur together among the observations.
So the basic procedure I was thinking about is to create a separate matrix for each row (each ID number) with a 1 or 0 for each cell indicating whether or not both variables are present for the ID. And then add these matrices together, so the final matrix gives the total number of joint appearances.
I've been looking around and haven't quite gotten it right. I thought of using outer, but it would need to work for each column in sequence. This answer was pretty close, but I wasn't exactly sure how the values were being added together; I ended up with a list of matrices, but the values didn't correspond to the initial data:
Convert categorical data in data frame to weighted adjacency matrix. And this answer was also close, although it seemed to deal with a different type of data; it gave me an adjacency matrix based on the IDs:
http://r.789695.n4.nabble.com/Conversion-to-Adjacency-Matrix-td794102.html
Here is some very messy code that manually creates such a matrix for one observation, just so you get a sense of what I'm going for (using a vector representing just the first ID observation):
ID1 <- c(1,0,0)
var1 <- ID1[[1]]
var2 <- ID1[[2]]
var3 <- ID1[[3]]
onetwo <- var1 * var2
onethree <- var1 * var3
twothree <- var2 * var3
oneone <- var1 * var1
twotwo <- var2 * var2
threethree <- var3 * var3
rows1 <- rbind(oneone, onetwo, onethree)
rows2 <- rbind(onetwo, twotwo, twothree)
rows3 <- rbind(onethree, twothree, threethree)
df2 <- cbind(rows1, rows2, rows3)
This obviously is not ideal; my actual dataset has 198 observations and 33 variables, so even with looping or the use of apply functions it would be very inefficient.
I can't tell if I'm making this more difficult than it needs to be, or if I'm trying to force my data to do something it wasn't meant to do. But if anyone has run into this sort of task before, please let me know. Is there a way to create the desired adjacency matrix directly? Should I transfer this into an edge list first, and is there a good way to do that? Is there code that would make the first step (creating a matrix for each row of the dataframe) more efficient?
Thanks for your help,
I'm not sure if I understand the question, but is this what you want?
nc = 33
nr = 198
m3 <- matrix(sample(0:1, nc*nr, replace = TRUE), nrow = nr)
df3 <- data.frame(m3)
m3b <- matrix(0, nrow = nc, ncol = nc)
for (i in seq(1, nc)) {
  for (j in seq(1, nc)) {
    t3 <- table(df3[,i], df3[,j])
    m3b[i,j] = t3[2,2]   # t3[2,2] contains the count of df3[,i] = df3[,j] = 1
    # or
    # t3 = sum(df3[,i] == df3[,j] & df3[,i] == 1)
    # m3b[i,j] = t3
  }
}
Or, if you want the sum of the products, which gives the same result when everything is 1 or 0:
m3c <- matrix(0, nrow = nc, ncol = nc)
for (i in seq(1, nc)) {
  for (j in seq(1, nc)) {
    sv = 0
    for (k in seq(1, nr)) {
      vi = df3[k,i]
      vj = df3[k,j]
      sv = sv + vi*vj
    }
    m3c[i,j] = sv
  }
}
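Since the data are all 0/1, the same co-occurrence matrix can also be computed in one step with a matrix product; a minimal sketch using the df from the question (dropping its ID column first):
M <- as.matrix(df[, -1])   # rows = observations, columns = Var1..Var3
adj <- crossprod(M)        # t(M) %*% M: entry [i,j] counts rows where both variables are 1
The diagonal then holds how often each variable appears at all, and the off-diagonal entries are the joint appearance counts.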

Add the counter value of nested for loop to each row while rbinding in r

I would like to be able to create a new dataframe with 6 columns from an existing dataframe with 4 columns. The two extra columns should hold the values of the counters (i and j) while the loop is working.
My draft code is as follows:
a is binary,
b is categorical,
c is a number (in this case 1 to 200),
d is a number (in this example 1 to 5, in real life 1 to 2500).
#### make an example of mydata
a<- c(0,0,0,0,0,0,0,0,0,0,1,1,0,1)
b<- c("a","b","a","b","b","c","a","e","c","a","a","b","d","f")
c<- c(20,30,40,40,54,76,23,23,78,23,34,1,88,1)
d<- c(1,1,1,2,2,2,3,3,4,5,5,5,5,5)
mydata<-data.frame(a,b,c,d)
## this just generates random numbers used to
## randomly select rows to bind together later
set.seed(1)
choose.test <- data.frame(matrix(NA, nrow = 20, ncol = 30))
for (i in 1:20) {
  # random selection of sites WITH replacement
  choose.test[,i] <- sample(5, 20, replace = TRUE, prob = NULL)
}
# this is the bit I am having trouble with
data <- NULL
for (j in 1:10) {
  for (i in choose.test[,j]) {
    data <- rbind(data, mydata[mydata[,4] == i,])
    data[,5] <- j
    data[,6] <- i
  }
}
It would also be acceptable to create separate dataframes at each loop iteration (in the second loop, using i as a counter), and I am open to other, better suggestions as I am new to R. I also tried using assign to do this, with no luck.
At each iteration I need to rbind together all the rows whose value in column 4 equals a random number between 1 and 5 (in this example; in real life it will be between 1 and 2500 sites). These random numbers are stored in a data frame called choose.test, where the random numbers in each column are used only once before the next iteration moves on to the next column.
Without the data[,5]<-j and data[,6]<-i lines it does almost what I want, but I would really like a 5th and 6th column identifying which iterations of the i and j loops each row came from, so I can analyse the data at each iteration (I am bootstrapping with this data). Clearly the code above does not work as intended: it just assigns the final counter values to all rows in columns 5 and 6.
Many thanks,
Ben
The following code fixed my problem:
data <- NULL
for (j in 1:10) {
  for (i in choose.test[,j]) {
    data <- rbind(data, cbind(mydata[mydata[,4] == i,], i = i, j = j))
  }
}
Credit goes to MrFlick for providing a useful comment!
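A sketch of the same idea that avoids growing data inside the loop (repeated rbind in a loop re-copies the whole object on every iteration), under the same setup as above:
# build one data frame per (j, i) pair, then bind everything once at the end
pieces <- lapply(1:10, function(j) {
  do.call(rbind, lapply(choose.test[, j], function(i)
    cbind(mydata[mydata[, 4] == i, ], i = i, j = j)))
})
data <- do.call(rbind, pieces)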
