I'm an absolute beginner in coding and R, and this is my third week using it for a project (for biologists: I'm trying to find the sum of risk alleles for a PRS). I need help with this part:
```
df
x y z
1 t c a
2 a t a
3 g g t
so when code applied:
x y z
1 t 0 0
2 a 0 1
3 g 1 0
```
I'm trying to make it so that if the value in y or z matches the value of x in the same row, it changes to 1, and otherwise to 0.
I started with:
```
for(i in 1:ncol(df)){
df[, i]<-df[df$x == df[,i], df[ ,i]<- 1]
}
```
But got all NA values
In reality, I have 100 columns I have to compare with x in the data frame. Any help is appreciated
An alternative way to do this is by using ifelse() in base R.
df$y <- ifelse(df$y == df$x, 1, 0)
df$z <- ifelse(df$z == df$x, 1, 0)
df
# x y z
#1 t 0 0
#2 a 0 1
#3 g 1 0
Edit to extend this step to all columns efficiently
For example:
df1
# x y z w
#1 t c a t
#2 a t a a
#3 g g t m
To apply column editing efficiently, a better approach is to use a function applied to all targeted columns in the data frame. Here is a simple function to do the work:
edit_col <- function(any_col) ifelse(any_col == df1$x, 1, 0)
This function takes a single column, compares its elements with the elements of df1$x, and returns 1 where they match and 0 otherwise. To apply it to all targeted columns, you can use apply(). Because x is not a targeted column in your case, exclude it by indexing with [, -1], since it is the first column in df1.
# Here number 2 indicates columns. Use number 1 for rows.
df1[, -1] <- apply(df1[,-1], 2, edit_col)
df1
# x y z w
#1 t 0 0 1
#2 a 0 1 1
#3 g 1 0 0
Of course, you can also define a function that edits the data frame so you don't need to call apply() manually.
Here is an example of such a function:
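edit_df <- function(any_df){
  # Compare one column with any_df$x: 1 where they match, 0 otherwise.
  edit_col <- function(any_col) ifelse(any_col == any_df$x, 1, 0)
  # Create a vector containing all names of the targeted columns.
  target_col_names <- setdiff(colnames(any_df), "x")
  # lapply() also works when there is only one targeted column,
  # where any_df[, target_col_names] would drop to a vector and apply() would fail.
  any_df[target_col_names] <- lapply(any_df[target_col_names], edit_col)
  return(any_df)
}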
Then use the function:
edit_df(df1)
# x y z w
#1 t 0 0 1
#2 a 0 1 1
#3 g 1 0 0
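With 100 columns to compare, the whole comparison can also be done in one vectorized step. The following is a minimal base R sketch, assuming the reference column is named x and every other column is a target. The names `target`, `matches` and `risk_sum` are illustrative, and treating the row sum of matches as the risk allele count is an assumption about the goal stated in the question.

```r
df1 <- data.frame(
  x = c("t", "a", "g"),
  y = c("c", "t", "g"),
  z = c("a", "a", "t"),
  w = c("t", "a", "m"),
  stringsAsFactors = FALSE
)

# All columns except the reference column x
target <- setdiff(names(df1), "x")

# A matrix compared with a vector of length nrow is recycled down each column,
# so every target column is compared element-wise with x
matches <- as.matrix(df1[target]) == df1$x

# Replace the target columns with 1/0 indicators
df1[target] <- matches * 1

# Assumed goal: number of matching columns per row
risk_sum <- rowSums(matches)
```

Because the comparison happens on a single matrix, no explicit loop over the columns is needed.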
A tidyverse approach
library(dplyr)
df <-
tibble(
x = c("t","a","g"),
y = c("c","t","g"),
z = c("a","a","t")
)
df %>%
mutate(
across(
.cols = c(y,z),
.fns = ~if_else(. == x,1,0)
)
)
# A tibble: 3 x 3
x y z
<chr> <dbl> <dbl>
1 t 0 0
2 a 0 1
3 g 1 0
I have two matrices with the same dimensions and they both have the same stock names as colnames, but in a different order!
I would like to sort the matrix "A" by the colnames of the matrix "B".
So the A colnames and the according value should be in the same order as the colnames of B.
How can I do this?
Example:
Kind Regards
Your example in R terms would be
A <- matrix(c(1, 4, 2), nrow = 1)
colnames(A) <- c("B", "D", "E")
A
# B D E
# [1,] 1 4 2
B <- matrix(c(2, 5, 1), nrow = 1)
colnames(B) <- c("E", "B", "D")
B
# E B D
# [1,] 2 5 1
Then we may simply subset the columns of A in the same order as they are in B:
A[, colnames(B)]
# E B D
# 2 1 4
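Note that with a one-row matrix the subset above is simplified to a named vector. A small sketch of a variant that keeps the matrix shape and first checks that both matrices share the same column names (the name `A_sorted` is only illustrative):

```r
A <- matrix(c(1, 4, 2), nrow = 1, dimnames = list(NULL, c("B", "D", "E")))
B <- matrix(c(2, 5, 1), nrow = 1, dimnames = list(NULL, c("E", "B", "D")))

# Fail early if the two matrices do not contain the same set of names
stopifnot(setequal(colnames(A), colnames(B)))

# drop = FALSE keeps the result a matrix even when there is only one row
A_sorted <- A[, colnames(B), drop = FALSE]
A_sorted
```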
I'm trying to create a function where I can pass a function as a variable to perform on a variable number of columns, after removing zeros. I'm not too comfortable with ellipses yet, and I'm guessing this is where the problem is arising. The function is using all the values in the specified rows, summarizing them based on the selected function, and then mutating that one value. I'd like to maintain the function across the row (e.g. rowMeans)
Example:
# Setup dataframe
a <- 1:5
b <- c(0, 4, 3, 0, 1)
c <- c(5:1)
d <- c(2, 0, 1, 0, 4)
df <- data.frame(a, b, c, d)
FUNexcludeZero <- function(function_name, ...){
# Match function name
FUN <- match.fun(function_name)
# get all the values - I'm sure this is the problem, need to somehow turn it back into a df?
vals <- unlist(list(...))
# Remove 0's and perform function
valsNo0 <- vals[vals != 0]
compiledVals <- FUN(valsNo0)
return(compiledVals)
}
df %>%
mutate(foo = FUNexcludeZero(function_name = 'sd', a, b))
a b c d foo
1 1 0 5 2 1.457738
2 2 4 4 0 1.457738
3 3 3 3 1 1.457738
4 4 0 2 0 1.457738
5 5 1 1 4 1.457738
df %>%
mutate(foo = FUNexcludeZero(function_name = 'min', a, b))
a b c d foo
1 1 0 5 2 1
2 2 4 4 0 1
3 3 3 3 1 1
4 4 0 2 0 1
5 5 1 1 4 1
# Try row-function (same error occurs with rowMeans)
df %>%
mutate(foo = FUNexcludeZero(function_name = 'pmin', a, b))
Error in mutate_impl(.data, dots) :
Column `foo` must be length 5 (the number of rows) or one, not 8
For function_name = 'sd' the column should be c(NA, 1.41, 0, NA, 2.828) and the min and pmin should be c(1, 2, 3, 4, 1). I'm 100% sure the error has something to do with the list/unlist, but any other way I try it I end up with an error.
I am not sure if this is exactly what you want. You need to perform a row-wise operation on the two vectors, so I used the apply() function. This should work for any number of equal-length vectors.
# Setup dataframe
a <- 1:5
b <- c(0, 4, 3, 0, 1)
c <- c(5:1)
d <- c(2, 0, 1, 0, 4)
#df <- data.frame(a, b, c, d) #not used
FUNexcludeZero <- function(function_name, ...){
# Match function name
FUN <- match.fun(function_name)
#combine the vectors into a matrix
df<-cbind(...)
#remove 0 from rows and apply function to the rows
compiledVals <- apply(df, 1, function(x) {
  x <- x[x != 0]
  FUN(x)
})
return(compiledVals)
}
FUNexcludeZero(function_name = 'sd', a, b)
#[1] NA 1.414214 0.000000 NA 2.828427
FUNexcludeZero(function_name = 'min', a, b)
#[1] 1 2 3 4 1
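To see why the original version returned one repeated value, it may help to compare the pooled computation with the row-wise one side by side. This is only an illustrative sketch using the same a and b as above; the names `pooled`, `pooled_sd` and `row_sd` are made up for the example.

```r
a <- 1:5
b <- c(0, 4, 3, 0, 1)

# Pooled: unlist() flattens both vectors into one, so after dropping zeros
# there are 8 values and sd() returns a single number
pooled <- unlist(list(a, b))
pooled <- pooled[pooled != 0]
pooled_sd <- sd(pooled)

# Row-wise: bind the vectors as columns and apply the function to each row,
# dropping zeros within that row only
row_sd <- apply(cbind(a, b), 1, function(x) sd(x[x != 0]))
```

The pooled vector has length 8, which is where the "not 8" in the pmin error message comes from, while the row-wise result has one value per row.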
I am studying social network analysis and will be using Ucinet to draw network graphs. For this, I have to convert the csv file to an edge list format. Converting the adjacency matrix to the edge list was successful. However, it is difficult to convert an incidence matrix to the edge list format.
The csv file ('some.csv') I have contains an incidence matrix like this:
A B C D
a 1 0 3 1
b 0 0 0 2
c 3 2 0 1
The code that converted the adjacency matrix to the edge list was as follows:
x<-read.csv("C:/.../something.csv", header=T, row.names=1)
net<-as.network(x, matrix.type='adjacency', ignore.eval=FALSE, names.eval='dd', loops=FALSE)
el<-edgelist(net, attrname='dd')
write.csv(el, file='C:/.../result.csv')
For the incidence matrix, I have only succeeded in loading the file. I tried to follow the method above, but I get an error.
y<-read.csv("C:/.../some.csv", header=T, row.names=1)
net2<-network(y, matrix.type='incidence', ignore.eval=FALSE, names.eval='co', loops=FALSE)
Error in network.incidence(x, g, ignore.eval, names.eval, na.rm, edge.check) :
Supplied incidence matrix has empty head/tail lists. (Did you get the directedness right?)
I want to see the result in this way:
a A 1
a C 3
a D 1
b D 2
c A 3
c B 2
c D 1
I tried to set the values as the error message suggested, but I could not get the result I wanted.
Thank you for any assistance with this.
Here's your data:
inc_mat <- matrix(
c(1, 0, 3, 1,
0, 0, 0, 2,
3, 2, 0, 1),
nrow = 3, ncol = 4, byrow = TRUE
)
rownames(inc_mat) <- letters[1:3]
colnames(inc_mat) <- LETTERS[1:4]
inc_mat
#> A B C D
#> a 1 0 3 1
#> b 0 0 0 2
#> c 3 2 0 1
Here's a generalized function that does the trick:
as_edgelist.weighted_incidence_matrix <- function(x, drop_rownames = TRUE) {
melted <- do.call(cbind, lapply(list(row(x), col(x), x), as.vector)) # 3 col matrix of row index, col index, and `x`'s values
filtered <- melted[melted[, 3] != 0, , drop = FALSE] # drop rows where column 3 is 0; drop = FALSE keeps a matrix even if only one row remains
# data frame where first 2 columns are...
df <- data.frame(mode1 = rownames(x)[filtered[, 1]], # `x`'s rownames, indexed by the first column in `filtered`
mode2 = colnames(x)[filtered[, 2]], # `x`'s colnames, indexed by the second column in `filtered`
weight = filtered[, 3], # the third column in `filtered`
stringsAsFactors = FALSE)
out <- df[order(df$mode1), ] # sort by first column
if (!drop_rownames) {
return(out)
}
`rownames<-`(out, NULL)
}
Take it for a spin:
el <- as_edgelist.weighted_incidence_matrix(inc_mat)
el
#> mode1 mode2 weight
#> 1 a A 1
#> 2 a C 3
#> 3 a D 1
#> 4 b D 2
#> 5 c A 3
#> 6 c B 2
#> 7 c D 1
Here are the results you wanted:
control_df <- data.frame(
mode1 = c("a", "a", "a", "b", "c", "c", "c"),
mode2 = c("A", "C", "D", "D", "A", "B", "D"),
weight = c(1, 3, 1, 2, 3, 2, 1),
stringsAsFactors = FALSE
)
control_df
#> mode1 mode2 weight
#> 1 a A 1
#> 2 a C 3
#> 3 a D 1
#> 4 b D 2
#> 5 c A 3
#> 6 c B 2
#> 7 c D 1
Do they match?
identical(control_df, el)
#> [1] TRUE
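A shorter variant of the same idea can be sketched with which(arr.ind = TRUE), which returns the row and column indices of the non-zero cells directly. This is an alternative under the same assumptions (a numeric matrix with row and column names), not a replacement for the function above; `idx` and `el2` are illustrative names.

```r
inc_mat <- matrix(
  c(1, 0, 3, 1,
    0, 0, 0, 2,
    3, 2, 0, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(letters[1:3], LETTERS[1:4])
)

# Two-column matrix ("row", "col") with the positions of the non-zero cells
idx <- which(inc_mat != 0, arr.ind = TRUE)

el2 <- data.frame(
  mode1  = rownames(inc_mat)[idx[, "row"]],
  mode2  = colnames(inc_mat)[idx[, "col"]],
  weight = inc_mat[idx],   # matrix indexing picks one cell per index row
  stringsAsFactors = FALSE
)

# Sort by row name, then column name, and reset the row names
el2 <- el2[order(el2$mode1, el2$mode2), ]
rownames(el2) <- NULL
el2
```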
This might not be the most efficient way, but it produces the expected result:
y <- matrix( c(1,0,3,0,0,2,3,0,0,1,2,1), nrow=3)
colnames(y) <- c("e.A","e.B","e.C","e.D")
dt <- data.frame(rnames=c("a","b","c"))
dt <- cbind(dt, y)
# rnames e.A e.B e.C e.D
#1 a 1 0 3 1
#2 b 0 0 0 2
#3 c 3 2 0 1
# use reshape () function to convert dataframe into the long format
M <- reshape(dt, direction="long", idvar = "rnames", varying = c("e.A","e.B","e.C","e.D"))
M <- M[M$e >0,]
M
# rnames time e
# a.A a A 1
# c.A c A 3
# c.B c B 2
# a.C a C 3
# a.D a D 1
# b.D b D 2
# c.D c D 1
# If M needs to be sorted by the column rnames:
M[order(M$rnames), ]
# rnames time e
# a.A a A 1
# a.C a C 3
# a.D a D 1
# b.D b D 2
# c.A c A 3
# c.B c B 2
# c.D c D 1
I'm trying to create a random data set in R that has metric, binomial and categorical variables. However, in the end when I check the class of my categorical variables R says they are numeric, but I need them to be factors for my further analysis. Does anybody have an idea what I'm doing wrong here?
that's my code:
set.seed(3456)
R.dat <- function(n = 5000,metr=1,bin=1,cat=3) {
j <- metr
X <- (matrix(0,n,j))
for (i in 1:n) {
X[i,] <- rnorm(j, mean = 0, sd = 1)
}
BIN <- matrix(0,n,bin)
for (i in 1:bin) {
BIN[,i] <- rbinom(n,1, 0.5)
}
CAT <- matrix(0,n,cat)
for (i in 1:cat) {
CAT[,i] <- factor(sample(1:4, n, TRUE))
}
X <- as.data.frame(cbind(X,BIN, CAT))
return(X)
}
Dat <- R.dat(n=5000,metr=1,bin=1, cat=3)
summary(Dat)
If I just sample like this:
x <- factor(sample(1:4, n, TRUE))
class(x)
it says x is a factor, so I don't get why it doesn't do the same when I use it in the function and loop. Any help is much appreciated, thanks in advance!
When you do this:
CAT <- matrix(0,n,cat)
for (i in 1:cat) {
CAT[,i] <- factor(sample(1:4, n, TRUE))
}
you create a numeric matrix CAT, and then you assign a new value to a subset of that matrix. When you do that assignment, the new value is coerced to the type of CAT, which is numeric.
Also, when you cbind the matrices X, BIN and CAT at the end, you coerce all of them to a common type. This would again mess up your variable types, even assuming everything was working correctly up to this point.
The rest of your code can also be simplified considerably. In particular, you don't need looping to reassign values to matrices; you can call the matrix constructor function directly on a vector of values.
Try this instead:
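R.dat <- function(n=5000, metr=1, bin=1, cat=3)
{
  X <- matrix(rnorm(n * metr), nrow=n)
  B <- matrix(rbinom(n * bin, 1, 0.5), nrow=n)
  # Character matrix; its columns become factors in data.frame() below
  F <- matrix(as.character(sample(1:4, n * cat, TRUE)), nrow=n)
  # stringsAsFactors=TRUE is needed in R >= 4.0.0, where the default is FALSE
  data.frame(X=X, B=B, F=F, stringsAsFactors=TRUE)
}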
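The coercion described above can be reproduced in isolation. The sketch below first shows a factor losing its class when assigned into a numeric matrix, then builds the data frame from a list of factor columns so that no matrix is involved. The names `cats` and `dat` and the small n are chosen only for illustration.

```r
set.seed(3456)
n <- 6

# Assigning a factor into a numeric matrix stores only its integer codes
f <- factor(sample(1:4, n, TRUE))
m <- matrix(0, n, 1)
m[, 1] <- f
class(m[, 1])

# Alternative: keep each categorical variable as its own factor in a list
cats <- replicate(3, factor(sample(1:4, n, TRUE), levels = 1:4),
                  simplify = FALSE)
names(cats) <- paste0("CAT", 1:3)

# data.frame() keeps each column's own type, so the factors survive
dat <- data.frame(X = rnorm(n), BIN = rbinom(n, 1, 0.5), cats)
sapply(dat, class)
```

Fixing the levels with levels = 1:4 keeps all four categories in every column even if one of them happens not to be sampled.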
You don't need a loop. If you switch to data.table, you can add the columns by reference.
library(data.table)
n <- 10
bin <- 1
DT <- data.table(X=replicate(n, rnorm(bin, mean=0, sd = 1)),
BIN = rbinom(n,1, 0.5),
CAT = factor(sample(1:4, n, TRUE)))
## If needed, you can add more columns by reference
cols <- paste0("BIN", 1:3)
DT[, (cols) := lapply(rep(n, 3), rbinom, 1, .5)]
cols <- paste0("CAT", 1:3)
DT[, (cols) := lapply(1:3, function(i) factor(sample(1:4, n, TRUE)))]
DT
lapply(DT, class)
DT
X BIN CAT BIN1 BIN2 BIN3 CAT1 CAT2 CAT3
1: 1.2934720 1 2 0 0 0 1 1 2
2: -0.1183180 1 2 0 0 1 3 3 1
3: 0.3648810 1 2 1 1 1 3 2 3
4: -0.2149963 1 2 1 1 0 2 3 2
5: 0.3204577 1 1 0 1 1 2 2 4
6: -0.5941640 0 4 1 0 0 2 3 1
7: -1.8852835 1 4 1 0 0 2 1 1
8: -0.8329852 0 2 0 0 1 1 1 2
9: -0.1353628 0 4 0 1 1 1 4 1
10: -0.2943969 1 4 0 1 0 4 3 3
> lapply(DT, class)
$X
[1] "numeric"
$BIN
[1] "integer"
$CAT
[1] "factor"
$BIN1
[1] "integer"
$BIN2
[1] "integer"
$BIN3
[1] "integer"
$CAT1
[1] "factor"
$CAT2
[1] "factor"
$CAT3
[1] "factor"
A matrix cannot hold a factor vector, so the factor is coerced to numbers when it is assigned into the matrix.
Just convert the matrix into a data frame before the loop:
CAT <- matrix(0,n,cat)
CAT <- as.data.frame(CAT)
This will do the trick.