Filling matrix with data from two columns of dataframe - r

I am desperate, and I am even ready to lose some more rep points, but I have to ask this.
(Yes, I read some threads about it).
I created a data frame with only the 2 columns I want to put into the matrix (I didn't know how to pick just 2 columns from the whole data):
tbl_corel <- tbl_end[,c("diff", "abund_mean")]
In the next step I created an empty matrix:
## Creating an empty matrix to check the correlation between diff and abund_mean
mat_corel <- matrix(0, ncol = 2)
colnames(mat_corel) <- c("diff", "abund_mean")
I tried to use that function to fill the matrix with the data:
mat_corel <- matrix(tbl_corel), nrow = 676,ncol = 2)
Of course I had to check manually how many rows I have in my data frame...
It doesn't work.
Tried that function as well:
mat_corel[ as.matrix(tbl_corel) ] <- 1
It doesn't work. I'd be so grateful for the help.
  diff abund_mean
1    0 3444804.80
2    0  847887.02
3    0   93654.19
4    0  721692.76
5    0  382711.04
6    1  428656.66

If you want to create a matrix from your two-column data frame, there is a simpler and more direct way: just convert your data frame to a matrix directly:
mat_corel <- as.matrix(tbl_corel)
But if you just want to compute a correlation coefficient, you can do it directly from your data frame :
cor(tbl_end$diff, tbl_end$abund_mean)
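As a minimal sketch of both suggestions, the snippet below rebuilds the six rows shown in the question. The `id` column is a hypothetical stand-in for the other columns of `tbl_end`, which the question does not show. `nrow()` is used so the row count never has to be checked by hand.

```r
# Six rows copied from the question; `id` is a made-up extra column
tbl_end <- data.frame(
  id         = 1:6,
  diff       = c(0, 0, 0, 0, 0, 1),
  abund_mean = c(3444804.80, 847887.02, 93654.19,
                 721692.76, 382711.04, 428656.66)
)

# Pick the two columns by name and convert them to a numeric matrix
mat_corel <- as.matrix(tbl_end[, c("diff", "abund_mean")])
n_rows <- nrow(mat_corel)   # no need to count rows manually

# Correlation straight from the data frame columns ...
r <- cor(tbl_end$diff, tbl_end$abund_mean)

# ... or as a 2 x 2 correlation matrix from the matrix
r_mat <- cor(mat_corel)
```

The off-diagonal entry of `r_mat` is the same coefficient as `r`.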

Related

Adding edge attribute to a network constructed from a co-occurrence matrix in R

I have a list of binary coded observations regarding whether one entity is present or absent at one given time, e.g.,
date a b c d e f g
07-07-2021 0 1 1 0 0 0 0
07-08-2021 1 0 0 0 1 1 1
07-10-2021 0 0 1 1 1 1 0
07-11-2021 1 1 1 0 0 1 1
I have created a network object from a co-occurrence matrix calculated by using crossprod(). I would like to add the observation date to the network as an edge attribute, but I'm not sure how to do that. I'm wondering how I can achieve my goal using R package(s). Thank you!
This is a very inelegant solution and I'm sure someone smarter than me could do better. Call it the brute force approach. The basic idea is that rather than using crossprod() to get a single adjacency matrix, create separate adjacency matrices for every date. You can do this by turning the initial data into a matrix that replicates each row by the size of the data, and then multiplying by the transpose of itself. Then turn each adjacency matrix into an edgelist and add the date as an attribute of every edge. Then combine all the edgelists into one. Create an igraph object from the edgelist. Then add the dates as an edge attribute (as far as I know, igraph requires that you do these last two as separate steps). I told you it was inelegant.
library(igraph)
dates <- paste("day",1:4) # I simplified the dates
data <-matrix(c(0,1,1,0,0,0,0,1,0,0,0,1,1,1,0,0,1,1,1,1,0,1,1,1,0,0,1,1),
ncol = 7, nrow = 4, byrow =T) # your data
colnames(data) <- letters[1:7]
rownames(data) <- dates
data <- as.data.frame(t(data)) # turn the data on its side
edgelists <- mapply(function(x, dates){
m <- matrix(x,nrow = length(x), ncol = length(x)) #turn each ROW of original data (now each COLUMN) into a matrix
rownames(m) <- colnames(m) <- rownames(data) # it will help to keep track of the names
n <- as.data.frame(as_edgelist(graph_from_adjacency_matrix(m*t(m)))) #create adjacency matrix and then turn it back into an edgelist
n$date <- dates # assign date
return(n)
},
x = data,
dates = as.list(dates),
SIMPLIFY =F)
el <- do.call("rbind", edgelists) # combine all edgelists into one
ig <- graph_from_edgelist(as.matrix(el[,1:2])) # make igraph object
E(ig)$date <- el$date # add the date as edge attribute
plot(ig, edge.label = E(ig)$date) # check result
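A different way to get the same kind of dated edge list, sketched here in base R, is to list every pair of entities present on each date with `combn()`. This is not the approach used in the answer: it produces one undirected edge per pair and no self-loops. The object names `obs` and `edge_list` are made up for this illustration. The igraph step is guarded so the sketch still runs when igraph is not installed; `graph_from_data_frame()` treats columns after the first two as edge attributes.

```r
# Observations from the question: rows are dates, columns are entities
obs <- matrix(c(0, 1, 1, 0, 0, 0, 0,
                1, 0, 0, 0, 1, 1, 1,
                0, 0, 1, 1, 1, 1, 0,
                1, 1, 1, 0, 0, 1, 1),
              nrow = 4, byrow = TRUE,
              dimnames = list(c("07-07-2021", "07-08-2021",
                                "07-10-2021", "07-11-2021"),
                              letters[1:7]))

# For each date, enumerate all pairs of entities that are present
edge_list <- do.call(rbind, lapply(rownames(obs), function(d) {
  present <- colnames(obs)[obs[d, ] == 1]
  if (length(present) < 2) return(NULL)   # no pair possible on this date
  pairs <- t(combn(present, 2))           # one row per co-occurring pair
  data.frame(from = pairs[, 1], to = pairs[, 2], date = d,
             stringsAsFactors = FALSE)
}))

# Optional: build the graph; the `date` column becomes an edge attribute
if (requireNamespace("igraph", quietly = TRUE)) {
  g <- igraph::graph_from_data_frame(edge_list, directed = FALSE)
}
```

Pairs that co-occur on several dates appear as parallel edges, one per date.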

Problem with data frame transformation using dplyr package

Problem
Let's consider two data frames :
One containing only 1's and 0's and second one with data :
set.seed(20)
df<-data.frame(sample(0:1,5,T),sample(0:1,5,T),sample(0:1,5,T))
#zero_one data frame
sample.0.1..5..T. sample.0.1..5..T..1 sample.0.1..5..T..2
1 0 1 0
2 1 0 0
3 1 1 1
4 0 0 0
5 1 0 1
df1<-data.frame(append(rnorm(4),10),append(runif(4),-5),append(rexp(4),20))
#with data
append.rnorm.4...10. append.runif.4....5. append.rexp.4...20.
1 0.08609139 0.2374272 0.3341095
2 -0.63778176 0.2297862 0.7537732
3 0.22642990 0.9447793 1.3011998
4 -0.05418293 0.8448115 1.2097271
5 10.00000000 -5.0000000 20.0000000
What I want to do is replace the values in the second data frame at the positions where the first data frame is 0 with the mean of the values, in the same column, at the positions where the first data frame is 1.
Example
In the first column I want to replace 0.08609139 and -0.05418293 (the values for which the first column of the first data frame is 0) with mean(c(-0.63778176, 0.22642990, 10.00000000)) (the values for which the first column of the first data frame is 1).
I want to do it using mutate_all() function from dplyr package.
My work so far
df1<-df1 %>% mutate_all(
function(x) ifelse(df[x]==0, mean(x[df==1],na.rm=T,x)))
I know that the condition df[x] is meaningless, but I have no idea what I should put there. Could you please help me with that?
You could follow @deschen's suggestion and multiply the two data frames together.
Here is another approach to consider using mapply. For each column, identify the positions (indices) in df where value is zero.
Then, substitute the corresponding df1 column of those positions with the mean of other values in the column. y[-idx] should be all values in the df1 column that exclude those positions.
Note that my set.seed is different - when I used yours of 20 I got different values, and a column with all zeroes. Please let me know if you are able to reproduce.
set.seed(12)
df<-data.frame(sample(0:1,5,T),sample(0:1,5,T),sample(0:1,5,T))
df1<-data.frame(append(rnorm(4),10),append(runif(4),-5),append(rexp(4),20))
my_fun <- function(x, y) {
idx <- which(x == 0)
y[idx] <- mean(y[-idx])
return(y)
}
mapply(my_fun, df, df1)
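Note that `mapply()` returns a matrix here. If a data frame is wanted, one possible variation is to use `Map()` and assign back into the data frame, as sketched below on small made-up data whose result is easy to check by hand. The names `mask`, `vals` and `replace_zeros` are hypothetical, not from the question.

```r
# Made-up data: `mask` plays the role of df, `vals` the role of df1
mask <- data.frame(a = c(0, 1, 1, 0, 1), b = c(1, 0, 0, 0, 1))
vals <- data.frame(a = c(2, 4, 6, 8, 20), b = c(1, 2, 3, 4, 9))

# Replace values where the mask is 0 with the mean of values where it is 1
replace_zeros <- function(m, v) {
  v[m == 0] <- mean(v[m == 1])
  v
}

# Map() pairs the columns up one by one; out[] keeps the data frame shape
out <- vals
out[] <- Map(replace_zeros, mask, vals)
```

In column `a` the mean of 4, 6 and 20 is 10, so the two masked values become 10; in column `b` the mean of 1 and 9 is 5.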

Repeat function in R

I have a matrix 320X64 and I want to modify the 64 variables so that the first 8 are equal to 0 and the last 56 equal to 1.
I tried the repeat function :
pen.vect<-(rep(0,8),rep(1,56))
penalty.factor<-pen.vect
but it's not working
Thank you :)
You can convert between matrices and data frames easily. Working with a data frame lets you do this more easily with bracket notation:
bm <- as.data.frame(B) # assuming your matrix is called "B"
bm[,1:8] <- 0
bm[,9:64] <- 1 # the last 56 columns are 9 to 64
B2 <- as.matrix(bm)
Here's a full, working example with dummy data:
B = matrix(c(2:65), nrow=320, ncol=64) # Create a matrix with dummy data
bm <- as.data.frame(B) # Change it to a data frame
bm[,1:8] <- 0 # Change each row in the first 8 columns to 0
bm[,9:64] <- 1 # Change the remaining 56 columns (9 to 64) to 1
B2 <- as.matrix(bm) # Change the data back to a matrix
Also, take a look at this post for how to properly post an R question. I'm honestly shocked your question hasn't been deleted or flagged yet. R on SO can be brutal.
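The line in the question fails because the two `rep()` calls are not wrapped in `c()`. A minimal sketch of the corrected vector follows, together with the same column assignment done directly on a matrix, which also supports bracket indexing. The dummy matrix `B` is an assumption, as in the example above.

```r
# Combine the two pieces with c(): 8 zeros followed by 56 ones
pen.vect <- c(rep(0, 8), rep(1, 56))

# Equivalent one-call form using the `times` argument
pen.vect2 <- rep(c(0, 1), times = c(8, 56))

# Matrices can be modified in place; no data frame round trip is needed
B <- matrix(2:65, nrow = 320, ncol = 64)  # dummy data
B[, 1:8]  <- 0   # first 8 columns
B[, 9:64] <- 1   # last 56 columns
```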

Data handling: 2 independent factors, which decide the position of a numeric value in a new data frame

I am new to Stackoverflow and to R, so I hope you can be a bit patient and excuse any formatting mistakes.
I am trying to write an R-script, which allows me to automatically analyze the raw data of a qPCR machine.
I was quite successful in cleaning up the data, but at some point I run into trouble. My goal is to consolidate the data into a comprehensive table.
The initial data frame (DF) looks something like this:
Sample Detector Value
1 A 1
1 B 2
2 A 3
3 A 2
3 B 3
3 C 1
My goal is to have a dataframe with the Sample-names as row names and Detector as column names.
A B C
1 1 2 NA
2 3 NA NA
3 2 3 1
My approach
First I took out the names of samples and detectors and saved them in vectors as factors.
detectors = summary(DF$Detector)
detectors = names(detectors)
samples = summary(DF$Sample)
samples = names(samples)
Then I subsetted the detectors into a new dataframe based on the name of the detector in the dataframe.
for (i in 1:length(detectors)){
assign(detectors[i], DF[which(DF$Detector == detectors[i]),])
}
Then I initialize an empty dataframe with the right column and row names:
result = data.frame(matrix(NA, nrow = length(samples), ncol = length(detectors)))
colnames(result) = detectors
rownames(result) = samples
So now the problem. I have to get the values from the detector subsets into the result data frame. Here it is important that each value finds its way to the right position in the data frame. The issue is that the subsets do not all have the same number of values, since some samples lack some detectors.
I tried to do the following: iterate through the detector subsets, compare the row name (= sample name) with the sample in the subset, and if they are the same write the value into the new data frame. If they are not the same, it should write an NA.
for (i in 1:length(detectors)){
for (j in 1:length(get(detectors[i])$Sample)){
result[j,i] = ifelse(get(detectors[i])$Sample[j] == rownames(result[j,]), get(detectors[i])$Ct.Mean[j], NA)
}
}
The trouble is that this stops the iteration through the detector$Sample column and switches to the next detector. My understanding is that the samples being compared get out of sync, so every following ifelse yields NA.
I tried to circumvent this by replacing the "no" argument of ifelse(test, yes, no) with j=j+1 to get it back in sync, but unfortunately this didn't work.
I hope I could make my problem understandable to you!
Looking forward to hearing any suggestions or comments (also on how to improve my code in general ;)
We can use acast from library(reshape2) to convert from 'long' to 'wide' format.
acast(DF, Sample~Detector, value.var='Value') #returns a matrix output
# A B C
#1 1 2 NA
#2 3 NA NA
#3 2 3 1
If we need a data.frame output, use dcast.
Or use spread from library(tidyr), which will also have the 'Sample' as an additional column.
library(tidyr)
spread(DF, Detector, Value)
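If neither reshape2 nor tidyr is available, a base R sketch with `tapply()` gives the same wide layout. It assumes each Sample/Detector pair occurs at most once, so taking the first value per cell loses nothing; combinations that never occur come back as NA.

```r
# Example data from the question
DF <- data.frame(
  Sample   = c(1, 1, 2, 3, 3, 3),
  Detector = c("A", "B", "A", "A", "B", "C"),
  Value    = c(1, 2, 3, 2, 3, 1)
)

# Cross-tabulate Value by Sample (rows) and Detector (columns);
# empty cells are filled with NA
wide <- tapply(DF$Value, list(DF$Sample, DF$Detector),
               FUN = function(v) v[1])

# Convert the matrix to a data frame with samples as row names
result <- as.data.frame(wide)
```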

R chi-squared statistic for two different distribution

I have two .dat files (random1.dat and random2.dat) which were generated from a random uniform distribution (with different seeds):
http://www.filedropper.com/random1_1: random1.dat
http://www.filedropper.com/random2 : random2.dat
I would like to use R to run a chi-squared test to find out whether the two distributions are statistically the same.
To do that I tried:
x1 <- read.table("random1.dat")
x2 <- read.table("random2.dat")
chisq.test(x1,x2)
but I receive an error message:
'x' and 'y' need to have the same length
Now the problem is that these two files both have 1000 rows, so I don't understand the error. Another question: if I want to automate this process (iterate it), for instance 100 times with 100 different files, can I do something like:
DO i=1,100
x1 -> read.table("random'(i)'.dat")
x2 -> read.table("fixedfile.dat")
chisq.test(x1,x2)
save results from the chisq analys
END DO
Thanks so much for Your help.
ADDED:
@eipi10,
I tried the first method you gave here and it works well for the data you generate here. Then, when I tried it on my data (I put into a single file a 2-column matrix of 1000 rows drawn from two uniform distributions with different seeds), something did not work correctly:
I load the file with: dat = read.table("random2col.dat");
I use the command: csq = lapply(dat[,-1], function(x) chisq.test(cbind(dat[,1],x))) and a warning message appear;
finally I use: unlist(lapply(csq, function(x) x$p.value)) BUT the output is something like:
[...] 1 1 1 1 1 1 1 1 1 1 1 1 1
[963] 1 1 1 1 1.....1 1 1 1
[1000] 1
I don't think you need to use a loop. You can use lapply instead. Also, you're entering x1 and x2 as separate columns of data. When you do this, chisq.test computes a contingency table from these two columns, which wouldn't be meaningful for columns of real numbers. Instead, you need to feed chisq.test a single matrix or data frame whose columns are x1 and x2. But even then, the chisq.test is expecting count data, which isn't what you have here (although the "expected" frequency doesn't necessarily have to be an integer). In any case, here's some code that will make the test run the way you seem to be hoping:
# Simulate data: 5 columns of data, each from the uniform distribution
dat = data.frame(replicate(5, runif(20)))
# Chi-Square test of each column against column 1.
# Note use of cbind to combine the two columns into a single data frame,
# rather than entering each column as separate arguments.
csq = lapply(dat[,-1], function(x) chisq.test(cbind(dat[,1],x)))
# Look at Chi-square stats and p-Values for each test
sapply(csq, function(x) x$statistic)
sapply(csq, function(x) x$p.value)
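One caveat when the data frame has only two columns, as in the question's follow-up: `dat[,-1]` then drops to a plain numeric vector, so `lapply` iterates over every single value instead of over one column, which is consistent with the 1000 outputs reported there. A sketch of the fix with `drop = FALSE` follows; the simulated data stands in for the asker's file, and the warning about small expected values is silenced here only to keep the output short.

```r
set.seed(1)
# Stand-in for a two-column file of 1000 uniform values per column
dat <- data.frame(V1 = runif(1000), V2 = runif(1000))

# With one remaining column, dat[, -1] is a vector of 1000 numbers
n_elements <- length(dat[, -1])
# drop = FALSE keeps it as a one-column data frame
n_columns <- ncol(dat[, -1, drop = FALSE])

# lapply now runs one test (V2 against V1) instead of 1000
csq <- lapply(dat[, -1, drop = FALSE], function(x) {
  suppressWarnings(chisq.test(cbind(dat[, 1], x)))
})
p_values <- sapply(csq, function(x) x$p.value)
```

The caveat from the answer still applies: `chisq.test` expects count data, so the p-value for continuous uniform values should not be over-interpreted.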
On the other hand, if you were intending your data to be two streams of values that would then be converted into a contingency table, here's an example of that:
# Simulate data of 5 factor variables, each with 10 different levels
dat = data.frame(replicate(5, sample(c(1:10), 1000, replace=TRUE)))
# Chi-Square test of each column against column 1. Here the two columns of data are
# entered as separate arguments, so that chisq.test will convert them to a two-way
# contingency table before doing the test.
csq = lapply(dat[,-1], function(x) chisq.test(dat[,1],x))
# Look at Chi-square stats and p-Values for each test
sapply(csq, function(x) x$statistic)
sapply(csq, function(x) x$p.value)
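For the part of the question about repeating the test over many files, here is a rough sketch of the `lapply` idea. The file names follow the question's pattern (`random<i>.dat`, `fixedfile.dat`) but are written to a temporary directory with simulated data, and only 3 files are used instead of 100; all of that is an assumption made so the example runs on its own. It mirrors the matrix form of the test from the first example, with the same caveat about count data.

```r
set.seed(42)
# Create a temporary folder with simulated stand-in files
tmp <- tempfile("chisq_demo_")
dir.create(tmp)
write.table(data.frame(runif(50)), file.path(tmp, "fixedfile.dat"),
            row.names = FALSE, col.names = FALSE)
for (i in 1:3) {
  write.table(data.frame(runif(50)),
              file.path(tmp, sprintf("random%d.dat", i)),
              row.names = FALSE, col.names = FALSE)
}

# Read the fixed file once, then test every random file against it
fixed <- read.table(file.path(tmp, "fixedfile.dat"))[[1]]
files <- file.path(tmp, sprintf("random%d.dat", 1:3))
results <- lapply(files, function(f) {
  x <- read.table(f)[[1]]
  suppressWarnings(chisq.test(cbind(fixed, x)))
})
names(results) <- basename(files)

# Collect the statistics and p-values in one table
summary_tbl <- data.frame(
  file      = names(results),
  statistic = sapply(results, function(r) unname(r$statistic)),
  p_value   = sapply(results, function(r) r$p.value)
)
```

`summary_tbl` can then be saved with `write.csv()` if the results need to be kept.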

Resources