I have a matrix with the following values: a <- c(4,6,7,78,3,2,5,6,7,8)
I would like to create a second matrix b which lists the changes in a's value at each step.
So the solution would be: b <- c(2,1,71,-75,-1,3,1,1,1)
Is there a function for this in R and if there isn't what is the easiest way to proceed?
a <- c(4,6,7,78,3,2,5,6,7,8) #this is not a matrix in R btw
diff(a)
#[1] 2 1 71 -75 -1 3 1 1 1
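For reference, here is a small sketch of what diff() computes, written out by hand. The names by_hand and m are only illustrative; lag is a documented argument of base R's diff().

```r
a <- c(4, 6, 7, 78, 3, 2, 5, 6, 7, 8)

# diff(a) subtracts each element from its successor
by_hand <- a[-1] - a[-length(a)]
stopifnot(all(diff(a) == by_hand))

# lag = 2 compares elements two positions apart
diff(a, lag = 2)

# On an actual matrix, diff() takes differences between successive rows,
# separately for each column
m <- matrix(a, ncol = 2)
diff(m)
```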
I have a data set with 70 column variables, each a 0-1 dummy variable, and 3500 observations. I want to see how often observations with a 'success' in one variable also have a success in another variable. In other words, if obs 1 has a success dummy in variable one, how often does it also have a success in variable 2, and so on for all the variables. I have found how to create a matrix table showing interactions when only two columns are involved, but I can't find anything involving many columns. Ideally I'd like to present this in an interaction matrix with 70 variables across and 70 down. Here is an idea of the data set:
Dat A B C D
XX 1 1 1 1
XY 0 1 0 1
XZ 0 0 1 1
The output I'm hoping for would be:
Out A B C D
A   0 1 1 1
B     0 1 2
C       0 2
D         0
This shows the number of times that (A,B) is a pairing, (B,C) is a pairing, and so on.
I have tried using the table() command as well as as.matrix(), but it seems these require data organized as two columns and cannot handle data spread over many column variables. I am fairly new to R, so I apologize if my question isn't clear or is possibly quite simple.
Any help is appreciated. Thanks
Here's how to create a correlation matrix of indefinite size. First create a reproducible example of your dataset...
dat <- matrix(sample(0:1, size = 700, replace = TRUE), ncol = 70)
dat <- data.frame(dat)
Then calculate the correlation...
dat <- cor(dat)
And then plot the correlation visually...
library(corrplot)
corrplot(dat, method = "square")
You can also plot the correlation using numbers instead of colors...
corrplot(dat, method = "number")
Obviously you'll want to finesse these charts before using them in a publication. corrplot offers tons of options for chart appearance.
You can try:
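# count, for every pair of dummy columns, the rows where both are 1
res <- apply(combn(2:ncol(df), 2), 2, function(x, y) sum(rowSums(y[, x]) == 2), df)
m <- diag(x=0, ncol(df)-1)
# combn() lists pairs as (A,B), (A,C), (A,D), (B,C), ..., which matches the
# column-major order of the lower triangle, so fill that and transpose
m[lower.tri(m)] <- res
m <- t(m)
m[lower.tri(m)] <- NA
dimnames(m) <- list(colnames(df)[-1], colnames(df)[-1])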
A B C D
A 0 1 1 1
B NA 0 1 2
C NA NA 0 2
D NA NA NA 0
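As an alternative sketch, the same pair counts can be obtained with crossprod() on the 0/1 columns, which avoids enumerating the pairs explicitly. The data frame below just retypes the small example from the question; the name X is illustrative.

```r
# Small example from the question: first column is a label, the rest are 0/1 dummies
df <- data.frame(Dat = c("XX", "XY", "XZ"),
                 A = c(1, 0, 0),
                 B = c(1, 1, 0),
                 C = c(1, 0, 1),
                 D = c(1, 1, 1))

X <- as.matrix(df[, -1])   # drop the label column
m <- crossprod(X)          # t(X) %*% X: entry (i, j) counts rows where both i and j are 1
diag(m) <- 0               # the question wants 0 on the diagonal
m[lower.tri(m)] <- NA      # keep only the upper triangle
m
```

For 70 columns and 3500 rows this is a single matrix product, so it should scale comfortably.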
I am trying to run a summation on each row of a dataframe. Let's say I want to take the sum of 100n^2, from n=1 to n=4.
> df <- data.frame(n = seq(1:4),a = rep(100))
> df
n a
1 1 100
2 2 100
3 3 100
4 4 100
Simpler example:
Let's make fun our example summation function. I can pull 100 out because I can just multiply it in later.
fun <- function(x) {
  i <- seq(1, x, 1)
  sum(i^2)
}
I then want to apply this function to each row of the dataframe, where df$n provides the upper bound of the summation.
The desired outcome would be as follows, in df$b:
> df
n a b
1 1 100 1
2 2 100 5
3 3 100 14
4 4 100 30
To achieve these results I've tried the apply function
apply(df$n,1,fun)
and also with df converted into a matrix
mat <- as.matrix(df)
apply(mat[1,],1,fun)
Both return an error:
Error in seq.default(1, x, 1) : 'to' must be of length 1
I understand this error, in that I understand why seq requires a 'to' value of length 1. I don't know how to go forward.
I have also tried the same while reading the dataframe as a matrix.
Maybe less simple example:
In my case I only need to multiply the results above, df$b, by 100 (or df$a) to get my final answer for each row. In other cases, though, the second value might be more entrenched, for example a^i. How would I call on both variables, a and n?
Underlying question:
My underlying goal is to apply a summation to each row of a dataframe (or a matrix). The above questions stem from my attempt to do so using seq(), as I saw advised in an answer on this site. I will gladly accept an answer that obviates the above questions with a different way to run a summation.
seq() does not accept vectors for its from and to arguments, so we can loop over the values of df$n instead:
df$b <- sapply(df$n, fun)
df$b
#[1] 1 5 14 30
Or we can use Vectorize:
Vectorize(fun)(df$n)
#[1] 1 5 14 30
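For the "less simple" case where both a and n are needed inside the sum, one option is mapply(), which walks over several vectors in parallel. This is only a sketch: fun2 is a hypothetical helper, not something from the question.

```r
df <- data.frame(n = 1:4, a = 100)

# Hypothetical helper: sum of a * i^2 for i = 1..n
fun2 <- function(n, a) {
  i <- seq_len(n)
  sum(a * i^2)
}

# mapply() passes df$n[k] and df$a[k] together to fun2 for each row k
df$b <- mapply(fun2, df$n, df$a)
df$b
```

The same pattern should work for a term such as a^i: only the body of the helper changes, for example sum(a^i).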
I'm working with an expression matrix obtained by single-cell RNA sequencing, and I have a question about some R code a colleague sent me...
sort(unique(1 + slot(as(data_matrix, "dgTMatrix"), "i")))
# there are no more details in the code...
In theory, this code is meant to delete non-expressed genes (those that are zero in all samples, I think...), but I can't work out how it does that. Can anyone give me a tip?
Well, I think I have understood this code... let me try to explain it! (Please correct me if I'm wrong.)
Our data is stored as a sparse matrix (i.e. more economical in terms of memory), and as() coerces it to a specific format for this kind of matrix, the triplet format for sparse matrices: three columns holding the row index i, the column index j and the value x of each non-zero entry.
y <- matrix_counts # sparse matrix
AAACCTGAGAACAACT-1 AAACCTGTCGGAAATA-1 AAACGGGAGAGCTGCA-1
ENSG00000243485 1 . .
ENSG00000237613 . . 2
y2 <- as(y, "dgTMatrix") #triplet format for sparse matrix
i j x
1 9 1 1 #in row(9) and column(1) we have the value 1
2 50 1 2
3 60 1 1
4 62 1 2
5 78 1 1
6 87 1 1
After that, it takes only the column "i" (slot(data, "i")), because we only need the row indices (to know which rows contain non-zero values), and removes duplicates (unique) to finally obtain a vector of row indices, which is then used to filter the raw data:
y3 <- unique(1 + slot(as(exprs(gbm), "dgTMatrix"), "i"))
[1] 9 50 60 62 78 87
data <- data_raw[y3,]
I am still a bit confused by sort and 1 +, but I think this is the basic idea. To summarize: we take the row indices of the rows (genes) that have at least one non-zero value and use them to filter the raw data. An unusual way to delete non-expressed genes, interesting!
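On the sort and 1 + parts: in the Matrix package the i slot of a triplet matrix stores zero-based row indices, so adding 1 converts them to R's one-based indexing, and sort simply returns the kept rows in their original order. A small sketch on a made-up 3 x 3 matrix (the values and gene names are invented; "TsparseMatrix" is used as the coercion target, which should yield a dgTMatrix for numeric data):

```r
library(Matrix)

# Toy count matrix: the second row (gene) is zero in every sample
m <- Matrix(c(0, 0, 3,
              0, 0, 0,
              1, 0, 2),
            nrow = 3, byrow = TRUE, sparse = TRUE,
            dimnames = list(c("geneA", "geneB", "geneC"), NULL))

trip <- as(m, "TsparseMatrix")   # triplet representation: slots i, j, x
trip@i                           # zero-based row index of each non-zero entry

keep <- sort(unique(1 + trip@i)) # one-based, de-duplicated, in row order
filtered <- m[keep, , drop = FALSE]
rownames(filtered)
```

Here geneB is dropped because it never appears in the i slot.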
I have a vector v <- c(6,8,5,5,8) of which I can obtain the unique values using
> u <- unique(v)
> u
[1] 6 8 5
Now I need an index i = [2,3,1,1,3] that returns the original vector v when indexed into u.
> u[i]
[1] 6 8 5 5 8
I know such an index can be generated automatically in Matlab, the ci index, but does not seem to be part of the standard repertoire in R. Is anyone aware of a function that can do this?
The background is that I have several vectors with anonymized IDs that are long character strings:
ids
"PTefkd43fmkl28en==3rnl4"
"cmdREW3rFDS32fDSdd;32FF"
"PTefkd43fmkl28en==3rnl4"
"PTefkd43fmkl28en==3rnl4"
"cmdREW3rFDS32fDSdd;32FF"
To reduce the file size and simplify the code, I want to transform them into integers of the sort
ids
1
2
1
1
2
and found that the index of the unique vector does just this. Since there are many rows, I am hesitant to write a function that loops over each element of the unique vector and wonder whether there is a more efficient way — or a completely different way to transform the character strings into matching integers.
Try with match
df1$ids <- with(df1, match(ids, unique(ids)) )
df1$ids
#[1] 1 2 1 1 2
Or we can convert to a factor and coerce to integer
with(df1,as.integer(factor(ids, levels=unique(ids))))
#[1] 1 2 1 1 2
Using u and v: judging by the index i expected in the OP's post, 'u' must have been sorted
u <- sort(unique(v))
match(v, u)
#[1] 2 3 1 1 3
Or using findInterval. Make sure that 'u' is sorted.
findInterval(v,u)
#[1] 2 3 1 1 3
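As a quick check, here is a sketch showing that match() gives a valid index whether or not u is sorted. Only the index values differ; in both cases indexing u with it reproduces v. The names u_unsorted, i_unsorted, u_sorted and i_sorted are illustrative.

```r
v <- c(6, 8, 5, 5, 8)

# Unsorted unique values, in order of first appearance
u_unsorted <- unique(v)
i_unsorted <- match(v, u_unsorted)
stopifnot(identical(u_unsorted[i_unsorted], v))

# Sorted unique values give the index quoted in the question
u_sorted <- sort(unique(v))
i_sorted <- match(v, u_sorted)
stopifnot(identical(u_sorted[i_sorted], v))
```

Note that findInterval() relies on u being sorted, whereas match() does not.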
I'm an enthusiastic R newbie that needs some help! :)
I have a data frame that looks like this:
id<-c(100,200,300,400)
a<-c(1,1,0,1)
b<-c(1,0,1,0)
c<-c(0,0,1,1)
y=data.frame(id=id,a=a,b=b,c=c)
Here id is a unique identifier (e.g. a person) and a, b and c are dummy variables for whether the person has this feature or not (as always, 1=TRUE).
I want R to create a matrix or data frame where I have the variables a, b and c both as the names of the columns and of the rows. For the values of the matrix R will have to calculate the number of identifiers that have this feature, or the combination of features.
So for example, IDs 100, 200 and 400 have feature a then in the diagonal of the matrix where a and a cross, R will input 3. Only ID 100 has both features a and b, hence R will input 1 where a and b cross, and so forth.
The resulting data frame will have to look like this:
l<-c("","a","b","c")
m<-c("a",3,1,1)
n<-c("b",1,2,1)
o<-c("c",1,1,2)
result<-matrix(c(l,m,n,o),nrow=4,ncol=4)
As my data set has 10 variables and hundreds of observations, I will have to automate the whole process.
Your help will be greatly appreciated.
Thanks a lot!
With base R:
crossprod(as.matrix(y[,-1]))
# a b c
# a 3 1 1
# b 1 2 1
# c 1 1 2
This is called an adjacency matrix. You can do this pretty easily with the qdap package:
library(qdap)
adjmat(y[,-1])$adjacency
## a b c
## a 3 1 1
## b 1 2 1
## c 1 1 2
It throws a warning because you're feeding it a dataframe. That's not a big deal and can be ignored. Also notice that I dropped the first column (the IDs) with negative indexing, y[, -1].
Note that because you started out with a Boolean matrix you could have gotten there with:
Y <- as.matrix(y[,-1])
t(Y) %*% Y