How to extract column index of a dataframe with the variable name? - r

I would like to extract the column index of a variable in a data frame using the variable name.
Here is the df, for example:
> df
  Mean Var Max
a  1.0 0.5 3.0
b  1.5 0.4 4.0
c  0.7 0.3 2.5
d  0.3 0.1 0.5
I want to "reverse" this:
> variable.names(df[2])
[1] "Var"
with something like that:
> variable.names(df$Var)
NULL
But I would like to get "2" instead of "NULL".
Here is my entire problem:
my_fct <- function(data, v_cont, v_cat){
  for (i in 1:nlevels(as.factor(v_cat))){
    sub <- subset(data, v_cat == levels(as.factor(v_cat))[i])
    sub_stat <- c(levels(as.factor(v_cat))[i],
                  mean(sub[, COLINDEX(v_cat)], na.rm = TRUE))
    mat_stat <- rbind(mat_stat, sub_stat)
  }
}
sub[, COLINDEX(v_cat)] is what I need to solve. How do I select the initial variable in my freshly created matrix?
Note: v_cat and v_cont have the following form: df$variable1, df$variable2
Thanks for helping.

The situation is not entirely clear, but based on the function provided, it can be rewritten to take the column names as strings and subset with [[ instead of passing df$variable1 or df$variable2:
my_fct <- function(data, v_cont, v_cat){
  mat_stat <- NULL
  for (i in 1:nlevels(as.factor(data[[v_cat]]))){
    sub <- subset(data, data[[v_cat]] ==
                    levels(as.factor(data[[v_cat]]))[i])
    # mean of the continuous column within the current level of v_cat
    sub_stat <- c(levels(as.factor(data[[v_cat]]))[i],
                  mean(sub[, v_cont], na.rm = TRUE))
    mat_stat <- rbind(mat_stat, sub_stat)
  }
  return(mat_stat)
}
-testing
my_fct(df, "variable1", "variable2")
With the OP's original function, if the inputs are df$variable1 and df$variable2, an option is to use deparse(substitute()) to capture the argument, extract the column name with sub, and use that as the column name:
my_fct <- function(data, v_cont, v_cat){
  # column name of the continuous variable, e.g. "variable1" from df$variable1
  nm1 <- sub(".*\\$", "", deparse(substitute(v_cont)))
  mat_stat <- NULL
  for (i in 1:nlevels(as.factor(v_cat))){
    sub <- subset(data, v_cat == levels(as.factor(v_cat))[i])
    sub_stat <- c(levels(as.factor(v_cat))[i],
                  mean(sub[, nm1], na.rm = TRUE))
    mat_stat <- rbind(mat_stat, sub_stat)
  }
  return(mat_stat)
}
-testing
my_fct(df, df$variable1, df$variable2)
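If the goal is only the mean of a continuous column within each level of a categorical column, the loop can probably be avoided altogether. The following is a minimal sketch, not the OP's function: the helper name group_means and the example data dat are hypothetical, and it assumes the columns are passed by name as strings.

```r
# Hypothetical example data; the column names are assumptions for illustration
dat <- data.frame(
  variable1 = c(1, 2, 3, 4, NA, 6),
  variable2 = c("x", "x", "y", "y", "z", "z")
)

# Mean of the column named v_cont within each level of the column named v_cat
group_means <- function(data, v_cont, v_cat) {
  tapply(data[[v_cont]], data[[v_cat]], mean, na.rm = TRUE)
}

res <- group_means(dat, "variable1", "variable2")
print(res)
```

The result is a named vector with one entry per level, so no column index is needed at any point.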

Similar to LMc's (+1) solution, we could use grep:
df <- structure(list(Mean = c(1, 1.5, 0.7, 0.3), Var = c(0.5, 0.4,
0.3, 0.1), Max = c(3, 4, 2.5, 0.5)), class = "data.frame", row.names = c("a",
"b", "c", "d"))
grep("Var", colnames(df))
output:
[1] 2

Use match:
match("Var", colnames(df))

This should do it, using which:
df <- data.frame(Mean=c(1,1.5,0.7,0.3),Var=c(0.5,0.4,0.3,0.1),Max=c(3,4,2.5,0.5))
df
  Mean Var Max
1  1.0 0.5 3.0
2  1.5 0.4 4.0
3  0.7 0.3 2.5
4  0.3 0.1 0.5
which(colnames(df)=="Var")
Output:
[1] 2
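The three lookups above behave differently once a column name is a substring of another or is missing entirely. A small sketch of the differences follows; the extra column Variance is a hypothetical addition made only to show the partial-match case.

```r
# Same data as above plus a hypothetical "Variance" column
df <- data.frame(Mean = c(1, 1.5, 0.7, 0.3),
                 Var = c(0.5, 0.4, 0.3, 0.1),
                 Max = c(3, 4, 2.5, 0.5),
                 Variance = c(0.25, 0.16, 0.09, 0.01))

idx_match <- match("Var", colnames(df))      # exact match, first hit only
idx_which <- which(colnames(df) == "Var")    # exact match, all hits
idx_grep  <- grep("Var", colnames(df))       # regex match, also hits "Variance"
idx_exact <- grep("^Var$", colnames(df))     # anchored regex, exact again

# Behaviour for a name that does not exist
missing_match <- match("Nope", colnames(df))     # NA
missing_which <- which(colnames(df) == "Nope")   # integer(0)

print(list(match = idx_match, which = idx_which,
           grep = idx_grep, grep_anchored = idx_exact))
```

match is usually the safest choice when a single exact name is expected, since a missing name gives NA rather than a zero-length result.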

Related

Replace integers in a data frame column with other integers in R?

I want to replace the values of a data frame column that contains only 4 distinct numbers with specific numbers, as shown below:
tt <- rep(c(1,2,3,4), each = 10)
df <- data.frame(tt)
I want to replace 1 with 10, 2 with 200, 3 with 458, and 4 with -0.1.
You could use recode from dplyr. Note that the old values are written as character strings, and the new values are numeric (doubles), like the original column:
library(tidyverse)
df %>%
  mutate(tt = recode(tt, '1' = 10, '2' = 200, '3' = 458, '4' = -0.1))
tt
1 10.0
2 10.0
3 200.0
4 200.0
5 458.0
6 458.0
7 -0.1
8 -0.1
To correct the error in the code in the question and provide for a shorter example we use the input in the Note at the end. Here are several alternatives. nos defined in (1) is used in some of the others too. No packages are used.
1) indexing To get the result since the input is 1 to 4 we can use indexing. This is probably the simplest solution given that the original values of tt are in 1:4.
nos <- c(10, 200, 458, -0.1)
transform(df, tt = nos[tt])
## tt
## 1 10.0
## 2 10.0
## 3 200.0
## 4 200.0
## 5 458.0
## 6 458.0
## 7 -0.1
## 8 -0.1
1a) If the input is not necessarily in 1:4 then we could use this generalization
transform(df, tt = nos[match(tt, 1:4)])
2) arithmetic Another approach is to use arithmetic:
transform(df, tt = 10 * (tt == 1) +
200 * (tt == 2) +
458 * (tt == 3) +
-0.1 * (tt == 4))
3) outer/matrix multiplication This would also work:
transform(df, tt = c(outer(tt, 1:4, `==`) %*% nos))
3a) This is the same except we use model.matrix instead of outer.
transform(df, tt = c(model.matrix(~ factor(tt) + 0, df) %*% nos))
4) factor The levels of the factor are 1:4 and the corresponding labels are defined by nos. Extract the labels using format and then convert them to numeric.
transform(df, tt = as.numeric(format(factor(tt, levels = 1:4, labels = nos))))
4a) or as a pipeline
transform(df, tt = tt |>
factor(levels = 1:4, labels = nos) |>
format() |>
as.numeric())
5) loop We can use a simple loop. Nulling out i at the end is so that it is not made into a column.
within(df, { for(i in 1:4) tt[tt == i] <- nos[i]; i <- NULL })
6) Reduce This is somewhat similar to (5) but implements the loop using Reduce.
fun <- function(tt, i) replace(tt, tt == i, nos[i])
transform(df, tt = Reduce(fun, init = tt, 1:4))
Note
df <- data.frame(tt = c(1, 1, 2, 2, 3, 3, 4, 4))
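As a quick cross-check of alternatives (1) and (1a), the sketch below compares them with a named lookup vector, which is one more way to express the same mapping when the old values are not necessarily 1:4. The names lookup, by_index, by_match, and by_name are introduced here for illustration only.

```r
# Input from the Note above
df <- data.frame(tt = c(1, 1, 2, 2, 3, 3, 4, 4))
nos <- c(10, 200, 458, -0.1)

# Named lookup vector: names are the old values, elements are the new values
lookup <- setNames(nos, 1:4)
by_name <- unname(lookup[as.character(df$tt)])

by_index <- nos[df$tt]               # alternative (1)
by_match <- nos[match(df$tt, 1:4)]   # alternative (1a)

# All three give the same replacement vector
stopifnot(identical(by_index, by_match), identical(by_index, by_name))
print(by_name)
```

The lookup by name relies on as.character() producing exactly the strings used as names, which holds for whole numbers like these but would need care for arbitrary decimals.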

Delete matrix rows based on a threshold coverage with another matrix in r

I have a matrix composed of sites and species. Some species have a certain trait value but not all of them.
I want to keep only the site-species matrix rows that contain enough trait information, in my case more than 60%.
So far, I have the following for-loop but I would like to have a faster version of this code. How can I optimize this and skip the for-loop part?
# site-species matrix
A <- matrix(c(0, 0.2, 0.2, 0.6, 0.3, 0.3, 0, 0.4), byrow = T, nrow = 2)
colnames(A) <- paste0("sp_", seq(ncol(A)))
rownames(A) <- paste0("site_", seq(nrow(A)))
# trait information
B <- data.frame(sp = paste0("sp_", seq(1:ncol(A))),
value = c(NA, NA, 2, 3))
# For-loop to get the coverage percentage for each row
pcover <- c()
for(i in 1:nrow(A)){
non_null_A <- A[i, ][A[i, ] > 0]
B_match <- match(names(non_null_A), B[, "sp"])
B_value <- B[B_match, "value"]
pcover <- rbind(pcover,
sum(!is.na(B_value)) / length(B_value) * 100)
}
A
A[pcover > 60, , drop = FALSE] # in this case, the second site is removed
The idea is that you have two conditions working together:
is A positive?
is B$value not NA?
We compute these tests up front and use only vectorized code:
Apos <- A[,B$sp] > 0 # or just A>0 here but I assumed from your code that you'd needed this
pcover <- 100* colSums(t(Apos) & !is.na(B$value)) /rowSums(Apos)
pcover
# site_1 site_2
# 66.66667 33.33333
A[pcover > 60, , drop = FALSE]
# sp_1 sp_2 sp_3 sp_4
# site_1 0 0.2 0.2 0.6
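The same count can also be written as a matrix product, which may be easier to read than the transpose trick. This is a sketch under the assumption that every column name of A appears in B$sp; the names present, has_trait, and kept are introduced here and are not from the answer above.

```r
# Data from the question
A <- matrix(c(0, 0.2, 0.2, 0.6, 0.3, 0.3, 0, 0.4), byrow = TRUE, nrow = 2)
colnames(A) <- paste0("sp_", seq_len(ncol(A)))
rownames(A) <- paste0("site_", seq_len(nrow(A)))
B <- data.frame(sp = paste0("sp_", seq_len(ncol(A))), value = c(NA, NA, 2, 3))

# 1 where the species is present at the site, 0 otherwise
present <- (A > 0) * 1
# 1 where the species (in the column order of A) has a trait value, 0 otherwise
has_trait <- as.numeric(!is.na(B$value[match(colnames(A), B$sp)]))

# Present species with a trait value, divided by all present species
pcover <- 100 * drop(present %*% has_trait) / rowSums(present)
kept <- A[pcover > 60, , drop = FALSE]
print(pcover)
```

A site with no species present would give 0/0 (NaN) here, so such rows would need to be handled separately before filtering.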

paste list based on a logical operator combined with identical matching IDs

I am new to R and will try to explain my problem as well as I can.
I am working with a data frame that has 15571 observations and 18976 variables. The column names and the row names are gene names, and most of them have an identical name match. The entries are all numeric and are correlation values. This is what it looks like:
          [GENE128] [GENE271] [GENE2983]
[GENE231]      0.71      0.98       0.32
[GENE128]      0.23      0.61       0.90
[GENE271]      0.87      0.95       0.63
What I am trying to do is write code that lists all the genes in the df that satisfy the logical condition x > 0.8 AND for which the gene names (column name and row name) are identical. With the example above, only "GENE271" would be TRUE.
Is there a way to do this?
Your example data as a data frame:
vec = c( 0.71,0.98,0.32,0.23,0.61,0.90,0.87,0.95,0.63)
mt = matrix(vec, 3, 3, byrow = T)
coln = c('GENE128', 'GENE271', 'GENE2983')
rown = c('GENE231', 'GENE128', 'GENE271')
df = data.frame(mt)
colnames(df) = coln
rownames(df) = rown
Use the row names and column names to build a new data frame in long format, vectorizing the values:
ndf = data.frame(coln = as.vector(sapply(coln, function(x) rep(x, nrow(df)))),
                 rown = rep(rown, ncol(df)),
                 data = as.vector(as.matrix(df)),
                 stringsAsFactors = F)
idx_true = sapply(1:nrow(ndf), function(x) ndf[x, 1] == ndf[x, 2])
subs_ndf = ndf[idx_true, ]
subs_ndf[which(ndf[idx_true, 'data'] > 0.8 ), ]
output
coln rown data
6 GENE271 GENE271 0.95
I'm sure someone has a better, faster way. This way will be slow but it should work....
test <- data.frame(GENE128 = c(0.71,0.23,0.87), GENE271 = c(0.98,0.61,0.95),
GENE2983 = c(0.32,0.90,0.63))
row.names(test) <- c('GENE231', 'GENE128', 'GENE271')
gene.equal <- function(x, limit = 0.8){
df <- c()
for(i in 1:nrow(x)){
row <- x[i,]
indexes <- which(row.names(row) == colnames(x))
if(length(indexes) > 0 && row[,indexes] > limit){
row[,indexes] <- 'TRUE'
}
df <- rbind(df, row)
}
df
}
new.df <- gene.equal(x = test)
I made 'TRUE' as text because otherwise it'll convert it to '1.00' if you use TRUE (no quotes).
The following statements provide the desired result in 2 steps (df is your data frame).
> df <- df[which(row.names(df) %in% colnames(df) & df >= 0.8),]
> df
GENE128 GENE271 GENE2983
GENE271 0.87 0.95 0.63
NA NA NA NA
NA.1 NA NA NA
> na.omit(df)
GENE128 GENE271 GENE2983
GENE271 0.87 0.95 0.63
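I have to use na.omit(df) to get rid of the NA rows. Note that this only checks whether a row name appears somewhere among the column names; it does not check that the value above the threshold sits in the column with the same name as the row.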
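For a table of this size, a vectorized lookup of only the cells whose row name equals their column name is likely to be faster than a loop or a long-format copy. The following is a sketch, assuming the data fit in memory as a numeric matrix; the names cors, common, self_cor, and hits are introduced here for illustration.

```r
# Example data from the question
cors <- data.frame(GENE128 = c(0.71, 0.23, 0.87),
                   GENE271 = c(0.98, 0.61, 0.95),
                   GENE2983 = c(0.32, 0.90, 0.63),
                   row.names = c("GENE231", "GENE128", "GENE271"))
m <- as.matrix(cors)

# Gene names that occur both as a row name and as a column name
common <- intersect(rownames(m), colnames(m))

# Two-column character matrix indexing picks the cell (gene, gene) for each name
self_cor <- m[cbind(common, common)]

# Genes whose own row/column cell exceeds the threshold
hits <- common[self_cor > 0.8]
print(hits)
```

Only the matching cells are ever read, so the full set of pairwise comparisons is never built.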

R How to calculate a proportion of some value by column and by row in a data frame

Sample dataframe:
df <- data.frame(c('ab','cd','..'),c('ab','..','ab'),c('..','cd','cd'))
I'm trying to get the proportion of ab's for each column and row, but ignoring ..'s from the total in the numerator and denominator.
Proportion of ab's = total number of ab's excluding ../ number of any symbol except ..
For example for column 1 (values are ab,cd,and ..), the proportion of ab's is 0.5
What I have so far:
fun <- function(x) {
length(which(x == 'ab'))/length(which(x != '..'))
}
byColumn<- sapply(df[,1:ncol(df)],fun)
byRow <- sapply(df[1:nrow(df),],fun)
Expected result:
byColumn <- c(0.5,1.0,0.0)
byRow <- c(1.0,0.0,0.5)
Actual result:
byColumn <- c(0.5,1.0,0.0)
byRow <- c(0.5,1.0,0.0)
But byRow isn't working... it seems to be the same output as byColumn?
I would define the function as follows (you can play around with the settings)
Propfunc <- function(x, dim = "col", equal = "ab", ignore = ".."){
  if(dim == "col") return(unname(colSums(x == equal)/colSums(x != ignore)))
  # unname() so that the row result prints without row names, like the column result
  if(dim == "row") return(unname(rowSums(x == equal)/rowSums(x != ignore)))
  else stop("Unknown dim")
}
Propfunc(df)
## [1] 0.5 1.0 0.0
Propfunc(df, dim = "row")
## [1] 1.0 0.0 0.5
Propfunc(df, dim = "blabla")
## Error in Propfunc(df, dim = "blabla") : Unknown dim
You can keep your function. For byRow, use the same code that works for byColumn, but on the transposed data frame:
byColumn <- sapply(df[, 1:ncol(df)], fun)
byRow <- sapply(as.data.frame(t(df))[, 1:nrow(df)], fun)
Output:
# By column
col1 col2 col3
0.5 1.0 0.0
# By row
V1 V2 V3
1.0 0.0 0.5
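The reason the original byRow repeats the column result is that a data frame is a list of columns, so sapply iterates over columns even after the rows have been indexed with df[1:nrow(df), ]. A minimal sketch using apply, where the second argument selects the margin (1 for rows, 2 for columns), is shown below; the function body uses sum() instead of length(which()), which is an equivalent rewrite rather than the OP's exact code.

```r
# Sample data from the question
df <- data.frame(c('ab', 'cd', '..'), c('ab', '..', 'ab'), c('..', 'cd', 'cd'))

# Proportion of 'ab' among all entries that are not '..'
fun <- function(x) sum(x == 'ab') / sum(x != '..')

byColumn <- unname(apply(df, 2, fun))  # margin 2: one value per column
byRow    <- unname(apply(df, 1, fun))  # margin 1: one value per row

print(byColumn)
print(byRow)
```

apply converts the data frame to a character matrix first, which is harmless here because every column is already character.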

generate a new dataset based on individual information

I have a dataset like this:
df <- data.frame(ID=1:10, baseline = c(1.8,2.4,3.2,2.3,2.1,2.2,3,2.8,2,2.9))
I want to create a new column called "response", this column should be created based on the following equation:
individual response=individual baseline+0.5*sin(2*3.14*(t-7.5)/24)
in this equation, t is generated based on this vector
t=rep(seq(0,24,by=0.1))
so for each ID, there should be 241 responses generated. How could I generate the new dataset containing ID, baseline, time, and response?
Another approach:
t <- rep(seq(0, 24, by = 0.1), each = nrow(df))
vals <- 0.5 * sin(2 * 3.14 * (t - 7.5) / 24)
new_df <- cbind(df, t, response = df$baseline + vals)
Try
library(reshape2)
res <- melt(apply(df[,2, drop=FALSE], 1,
function(x) x+0.5*sin(2*3.14*(t-7.5)/24)))
indx <- rep(1:nrow(df), each=241)
df1 <- cbind(df[indx,], time= rep(t, nrow(df)), response=res[,3])
row.names(df1) <- NULL
dim(df1)
#[1] 2410 4
head(df1,3)
# ID baseline time response
#1 1 1.8 0.0 1.337870
#2 1 1.8 0.1 1.333034
#3 1 1.8 0.2 1.328518
Or
t <- seq(0,24, by=0.1)
indx <- rep(1:nrow(df), each=length(t))
df2 <- within(df[indx,], {response<-baseline+0.5*sin(2*3.14*(t-7.5)/24)
time <- t})
row.names(df2) <- NULL
all.equal(df1, df2)
#[1] TRUE
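The same long data set can also be built in base R without apply or reshape2 by repeating each ID once per time point. This is a sketch using the formula from the question; the names times and long are introduced here for illustration.

```r
df <- data.frame(ID = 1:10,
                 baseline = c(1.8, 2.4, 3.2, 2.3, 2.1, 2.2, 3, 2.8, 2, 2.9))
times <- seq(0, 24, by = 0.1)   # 241 time points

# One row per (ID, time) pair: each ID block holds all time points in order
long <- data.frame(
  ID       = rep(df$ID, each = length(times)),
  baseline = rep(df$baseline, each = length(times)),
  time     = rep(times, times = nrow(df))
)

# Equation from the question, applied row by row through vectorization
long$response <- long$baseline + 0.5 * sin(2 * 3.14 * (long$time - 7.5) / 24)

print(dim(long))
print(head(long, 3))
```

Using a separate name such as times also avoids masking the base function t().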
