Operating between columns and classifying values per group in R

I am trying to obtain percentages, grouping the values by one variable.
For this I used sapply to obtain the percentage of each column with respect to another one, but I don't know how to group these values by type (another variable).
x <- data.frame("A" = c(0,0,1,1,1,1,1), "B" = c(0,1,0,1,0,1,1), "C" = c(1,0,1,1,0,0,1),
                "type" = c("x","x","x","y","y","y","x"), "yes" = c(0,0,1,1,0,1,1))
x
A B C type yes
1 0 0 1 x 0
2 0 1 0 x 0
3 1 0 1 x 1
4 1 1 1 y 1
5 1 0 0 y 0
6 1 1 0 y 1
7 1 1 1 x 1
I need to obtain the following value (a percentage): (number of rows with A==1 & yes==1) / (number of rows with A==1), and for this I use the following code:
result <- as.data.frame(sapply(x[,1:3],
                               function(i) (sum(i & x$yes)/sum(i))*100))
result
sapply(x[, 1:3], function(i) (sum(i & x$yes)/sum(i)) * 100)
A 80
B 75
C 75
Now I need to perform the same operation, but taking the variable "type" into account; that is, obtain the same percentage but broken down by type. So my expected table is:
type sapply(x[, 1:3], function(i) (sum(i & x$yes)/sum(i)) * 100)
A x 40
A y 40
B x 25
B y 50
C x 50
C y 25
In the example you can see that, for each letter, the percentages sum to the value obtained in the first result; they are simply broken down by type.
Thanks a lot.

You can do the following using data.table:
Code
setDT(df)
cols = c('A', 'B', 'C')
mat = df[yes == 1, lapply(.SD, function(x){
  100 * sum(x)/df[, lapply(.SD, sum), .SDcols = cols][[substitute(x)]]
  # Here, the numerator is sum(x) restricted to rows where yes == 1 (the i filter),
  # for x == columns A, B, C.
  # The denominator is the overall sum(x) for x == columns A, B, C.
  # The reason why we need substitute(x) is that df[, lapply(.SD, sum)]
  # generates a list of column sums, i.e. list(A = sum(A), B = sum(B), ...).
  # Hence, for each x among the column names we subset that list with [[substitute(x)]].
  # Ultimately, the operation equals sum(x[yes == 1])/sum(x) for A, B, C.
}), .(type), .SDcols = cols]
# '.(type)' simply means that we apply this for each type group,
# i.e. once for x and once for y, for each of the A, B, C columns.
# The dot is just shorthand for 'list()'.
# .SDcols selects the subset of columns that the lapply statement is applied to.
Result
> mat
type A B C
1: x 40 25 50
2: y 40 50 25
Long format (your example)
> melt(mat)
type variable value
1: x A 40
2: y A 40
3: x B 25
4: y B 50
5: x C 50
6: y C 25
Data
df <- data.frame("A" = c(0,0,1,1,1,1,1), "B" = c(0,1,0,1,0,1,1), "C" = c(1,0,1,1,0,0,1),
                 "type" = c("x","x","x","y","y","y","x"), "yes" = c(0,0,1,1,0,1,1))

Related

What exactly does the logical parameter on the `subset` function in R do?

I am learning R with the book Learning R by Richard Cotton, Chapter 5: Lists and Data Frames, and I don't understand the example given. I have this data frame and the following script:
(a_data_frame <- data.frame(
  x = letters[1:5],
  y = rnorm(5),
  z = runif(5) > 0.5
))
x y z
1 a 0.6395739 FALSE
2 b -1.1645383 FALSE
3 c -1.3616093 FALSE
4 d 0.5658254 FALSE
5 e 0.4345538 FALSE
subset(a_data_frame, y > 0 | z, x) # what exactly does y > 0 | z mean?
I read the book, and it says:
subset takes up to three arguments: a data frame to subset, a
logical vector of conditions for rows to include, and a vector of
column names to keep
There is no more information about the second (logical) parameter.
It's a tricky example. In subset(a_data_frame, y > 0 | z, x), the second argument is the condition y > 0 | z: a row is kept if y > 0 or if the value in column z is TRUE.
y > 0 is evaluated on the values generated by rnorm(5); your values differ from the book's because they are generated randomly. The "|" (or) means a row is also selected when the condition on z is TRUE; in your case all z values are FALSE, so you can't see the effect. As a didactic example, if we use z = rnorm(5) instead of runif(5) > 0.5, you can see better how the function works.
(a_data_frame <- data.frame(
x = letters[1:5],
y = rnorm(5),
z = rnorm(5)
))
x y z
1 a -0.91016367 2.04917552
2 b 0.01591093 0.03070526
3 c 0.19146220 -0.42056236
4 d 1.07171934 1.31511485
5 e 1.14760483 -0.09855757
So if we ask for y < 0 or z < 0, the output will be rows a, c, e:
> subset(a_data_frame, y < 0 | z < 0, x)
x
1 a
3 c
5 e
> subset(a_data_frame, y < 0 & z<0, x)
[1] x
<0 rows> (or 0-length row.names) # there are no rows with y < 0 and z < 0
> subset(a_data_frame, y < 0 & z, x) # True for row 2.
x
2 b
> subset(a_data_frame, y < 0 | z, x) # true for row 2 and row 4.
x
2 b
4 d
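For intuition, subset(data, condition, select) behaves roughly like evaluating the condition inside the data frame, dropping rows where it is FALSE or NA, and then keeping the selected columns. A minimal sketch of that equivalence (not from the book, just for illustration):
keep <- with(a_data_frame, y > 0 | z)                 # evaluate the condition inside the data frame
a_data_frame[!is.na(keep) & keep, "x", drop = FALSE]  # keep matching rows, select column x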

How to assign 1s and 0s to columns depending on whether a variable in the row matches or not, in R

I'm an absolute beginner in coding and R, and this is my third week doing it for a project (for biologists: I'm trying to find the sum of risk alleles for PRS), but I need help with this part.
df
x y z
1 t c a
2 a t a
3 g g t
so when the code is applied:
x y z
1 t 0 0
2 a 0 1
3 g 1 0
I'm trying to make it so that if the value in y or z matches x, it changes to 1, and if it doesn't, to 0.
I started with:
for(i in 1:ncol(df)){
  df[, i] <- df[df$x == df[,i], df[ ,i] <- 1]
}
But I got all NA values.
In reality, I have 100 columns that I have to compare with x in the data frame. Any help is appreciated.
An alternative way to do this is by using ifelse() in base R.
df$y <- ifelse(df$y == df$x, 1, 0)
df$z <- ifelse(df$z == df$x, 1, 0)
df
# x y z
#1 t 0 0
#2 a 0 1
#3 g 1 0
Edit to extend this step to all columns efficiently
For example:
df1
# x y z w
#1 t c a t
#2 a t a a
#3 g g t m
To apply column editing efficiently, a better approach is to use a function applied to all targeted columns in the data frame. Here is a simple function to do the work:
edit_col <- function(any_col) any_col <- ifelse(any_col == df1$x, 1, 0)
This function takes a single column, compares its elements with the elements of df1$x, and edits the column accordingly. To apply it to all targeted columns, you can use apply(). Because in your case x is not a targeted column, you need to exclude it by indexing with [, -1], since it is the first column of df1.
# Here number 2 indicates columns. Use number 1 for rows.
df1[, -1] <- apply(df1[,-1], 2, edit_col)
df1
# x y z w
#1 t 0 0 1
#2 a 0 1 1
#3 g 1 0 0
Of course you can also define a function that edits the data frame, so you don't need to call apply() manually.
Here is an example of such function
edit_df <- function(any_df){
  edit_col <- function(any_col) any_col <- ifelse(any_col == any_df$x, 1, 0)
  # Create a vector containing the names of all the targeted columns.
  target_col_names <- setdiff(colnames(any_df), "x")
  any_df[, target_col_names] <- apply(any_df[, target_col_names], 2, edit_col)
  return(any_df)
}
Then use the function:
edit_df(df1)
# x y z w
#1 t 0 0 1
#2 a 0 1 1
#3 g 1 0 0
A tidyverse approach
library(dplyr)
df <-
  tibble(
    x = c("t","a","g"),
    y = c("c","t","g"),
    z = c("a","a","t")
  )
df %>%
  mutate(
    across(
      .cols = c(y,z),
      .fns = ~if_else(. == x, 1, 0)
    )
  )
# A tibble: 3 x 3
x y z
<chr> <dbl> <dbl>
1 t 0 0
2 a 0 1
3 g 1 0
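If all the comparison columns hold single characters, as in the question, a compact vectorized alternative (a sketch, not one of the answers above; shown on the question's original df, assuming plain character columns, the default in R >= 4.0, otherwise wrap them in as.character()) is to compare the whole block against x at once:
# df[-1] == df$x compares every non-x column with x row by row and returns a
# logical matrix; the unary + turns TRUE/FALSE into 1/0.
df[-1] <- +(df[-1] == df$x)
df
#   x y z
# 1 t 0 0
# 2 a 0 1
# 3 g 1 0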

Loop within a loop with column names in R

I have the following data:
id A B C
1 1 1 0
2 1 1 1
3 0 1 1
I would like to create a function that computes the following three counts between two columns:
the number of individuals i) with A and B, ii) with A but not B, iii) with B but not A. Similarly, I would like a loop that computes these three numbers for A and C, and for B and C. Is there a smart way to do so? A loop within a loop? So far, I have tried the following:
for(ii in colnames(df)){
  for(jj in (ii+1):df){
    print(ii, jj)
  }
}
Perhaps something like this:
# function to return your metrics
foo = function(x, y) {
  c(
    "x and y" = sum(x & y),
    "x not y" = sum(x & !y),
    "y not x" = sum(!x & y)
  )
}
# generate combinations of columns
col_combos = combn(names(df)[-1], 2)
result = apply(col_combos, 2, function(x) foo(df[[x[1]]], df[[x[2]]]))
colnames(result) = apply(col_combos, 2, toString)
result
# A, B A, C B, C
# x and y 2 1 2
# x not y 0 1 1
# y not x 1 1 0
Using this data:
df = read.table(text = 'id A B C
1 1 1 0
2 1 1 1
3 0 1 1 ', header = TRUE)
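Since the question explicitly asks about a loop within a loop, here is a hedged sketch of the nested-loop version of the same idea (the combn()/apply() answer above is the more compact route):
cols <- names(df)[-1]                      # drop the id column
for (i in seq_along(cols)) {
  for (j in seq_along(cols)) {
    if (j <= i) next                       # visit each unordered pair once
    x <- df[[cols[i]]]
    y <- df[[cols[j]]]
    cat(cols[i], "and", cols[j], "->",
        "both:", sum(x & y),
        "only", cols[i], ":", sum(x & !y),
        "only", cols[j], ":", sum(!x & y), "\n")
  }
}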

Dispatch values in list column to separate columns

I have a data.table with a list column "c":
df <- data.table(a = 1:3, c = list(1L, 1:2, 1:3))
df
a c
1: 1 1
2: 2 1,2
3: 3 1,2,3
I want to create separate columns for the values in "c".
I create a set of new columns F_1, F_2, F_3:
mmax <- max(df$a)
flux <- paste("F", 1:mmax, sep = "_")
df[, (flux) := 0]
df
a c F_1 F_2 F_3
1: 1 1 0 0 0
2: 2 1,2 0 0 0
3: 3 1,2,3 0 0 0
I want to dispatch values in "c" to columns F_1, F_2, F_3 like this:
df
a c F_1 F_2 F_3
1: 1 1 1 0 0
2: 2 1,2 1 2 0
3: 3 1,2,3 1 2 3
What I have tried:
comp_vect <- function(vec, mmax){
  vec <- vec %>% unlist()
  n <- length(vec)
  answr <- c(vec, rep(0, l = mmax - n))
}
df[ , ..flux := mapply(comp_vect, c, mmax)]
The expected data.table is :
> df
a c F_1 F_2 F_3
1: 1 1 1 0 0
2: 2 1,2 1 2 0
3: 3 1,2,3 1 2 3
I followed a radically different approach: I rbinded the list column and then dcasted it, obtaining the desired result. The last part is setting the names.
library(data.table)
df <- data.table(a = 1:3, d = list(1L, c(1L, 2L), c(1L, 2L, 3L)))
df2 <- df[, rbind(d), by = a][, dcast(.SD, a ~ V1, fill = 0)]
setnames(df2, 2:4, flux)[]
a F_1 F_2 F_3
1: 1 1 0 0
2: 2 1 2 0
3: 3 1 2 3
where flux is the vector of names that you defined in your question.
Please notice that I avoided using the column name c, as it may be confused with the function c().
Solution:
for(idx in seq(max(sapply(df$c, length)))){ # maximum number of values over all elements of the list
  set(x = df,
      i = NULL,
      j = paste0("F_", idx), # column name
      value = sapply(df$c, function(x){
        if(is.na(x[idx])){
          return(0) # 0 instead of NA
        } else {
          return(x[idx])
        }
      })
  )
}
Explanation:
We can extract the values from a list like this:
sapply(df$c, function(ll) return(ll[1])) # first value
[1] 1 1 1
sapply(df$c, function(ll) return(ll[2])) # second value
[1] NA 2 2
sapply(df$c, function(ll) return(ll[3])) # third value
[1] NA NA 3
We see that if there is no value, we get an NA.
We need an iterator to extract all values at the position idx. For that, we'll find the number of values in each element of df$c (the list) and keep the maximum.
max(sapply(df$c, length))
[1] 3
If we want zeros instead of NAs, we need a function inside the sapply to convert them:
vec <- c(NA, 5, 1, NA)
> sapply(vec, function(x) if(is.na(x)) return(0) else return(x))
[1] 0 5 1 0
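A shorter route (a sketch, not part of either answer above; it assumes your data.table version's transpose() supports the fill argument) is to transpose the list column directly, padding short elements with 0:
# assumes the question's df and the flux names (F_1, F_2, F_3) are already defined
df[, (flux) := transpose(c, fill = 0)]
df
#    a     c F_1 F_2 F_3
# 1: 1     1   1   0   0
# 2: 2   1,2   1   2   0
# 3: 3 1,2,3   1   2   3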

Returning rows above and below specific rows in an R data frame

Consider any dataframe
col1 col2 col3 col4
row.name11 A 23 x y
row.name12 A 29 x y
row.name13 B 17 x y
row.name14 A 77 x y
I have a list of row names which I want to return from this data frame. Let's say I have row.name12 and row.name13 in a list. I can easily return these rows from the data frame, but I also want to return the 4 rows above and the 4 rows below them; that is, I want to return row.name8 through row.name17. I think it is similar to grep -A -B in the shell.
Possible solution: is there any way to return the row number for a row name? If I have the row number, I can easily subtract 4 and add 4 from it and return those rows.
Note: the row names here are just examples. Row names could be anything, like RED, BLUE, BLACK, etc.
Try this:
extract.with.context <- function(x, rows, after = 0, before = 0) {
  match.idx <- which(rownames(x) %in% rows)
  span <- seq(from = -before, to = after)
  extend.idx <- c(outer(match.idx, span, `+`))
  extend.idx <- Filter(function(i) i > 0 & i <= nrow(x), extend.idx)
  extend.idx <- sort(unique(extend.idx))
  return(x[extend.idx, , drop = FALSE])
}
dat <- data.frame(x = 1:26, row.names = letters)
extract.with.context(dat, c("a", "b", "j", "y"), after = 3, before = 1)
# x
# a 1
# b 2
# c 3
# d 4
# e 5
# i 9
# j 10
# k 11
# l 12
# m 13
# x 24
# y 25
# z 26
Perhaps a combination of which() and %in% would help you:
dat[which(rownames(dat) %in% c("row.name13")) + c(-1, 1), ]
# col1 col2 col3 col4
# row.name12 A 29 x y
# row.name14 A 77 x y
In the above, we are trying to identify which row names in "dat" are "row.name13" (using which()), and the + c(-1, 1) tells R to return the row before and the row after. If you wanted to include the row, you could do something like + c(-1:1).
To get the range of rows, switch the comma to a colon:
dat[which(rownames(dat) %in% c("row.name13")) + c(-1:1), ]
# col1 col2 col3 col4
# row.name12 A 29 x y
# row.name13 B 17 x y
# row.name14 A 77 x y
Update
Matching a list is a little bit trickier, but without thinking about it too much, here is a possibility:
myRows <- c("row.name12", "row.name13")
rowRanges <- lapply(which(rownames(dat) %in% myRows), function(x) x + c(-1:1))
# [[1]]
# [1] 1 2 3
#
# [[2]]
# [1] 2 3 4
#
lapply(rowRanges, function(x) dat[x, ])
# [[1]]
# col1 col2 col3 col4
# row.name11 A 23 x y
# row.name12 A 29 x y
# row.name13 B 17 x y
#
# [[2]]
# col1 col2 col3 col4
# row.name12 A 29 x y
# row.name13 B 17 x y
# row.name14 A 77 x y
This outputs a list of data.frames which might be handy since you might have duplicated rows (as there are in this example).
Update 2: Using grep if it is more appropriate
Here is a variation of your question, one which would be less convenient to solve using the which()...%in% approach.
set.seed(1)
dat1 <- data.frame(ID = 1:25, V1 = sample(100, 25, replace = TRUE))
rownames(dat1) <- paste("rowname", sample(apply(combn(LETTERS[1:4], 2),
                                                2, paste, collapse = ""),
                                          25, replace = TRUE),
                        sprintf("%02d", 1:25), sep = ".")
head(dat1)
# ID V1
# rowname.AD.01 1 27
# rowname.AB.02 2 38
# rowname.AD.03 3 58
# rowname.CD.04 4 91
# rowname.AD.05 5 21
# rowname.AD.06 6 90
Now, imagine you wanted to identify the rows with AB and AC, but you don't have a list of the numeric suffixes.
Here's a little function that can be used in such a scenario. It borrows a little from @Spacedman to make sure that the rows returned are within the range of the data (as per @flodel's suggestion).
getMyRows <- function(data, matches, range) {
  rowMatches = lapply(unlist(lapply(matches, function(x)
    grep(x, rownames(data)))), function(y) y + range)
  rowMatches = lapply(rowMatches, function(x) x[x > 0 & x <= nrow(data)])
  lapply(rowMatches, function(x) data[x, ])
}
You can use it as follows (but I won't print the results here). First, specify the dataset, then the pattern(s) you want matched, then the range (in this example, three rows before and four rows after).
getMyRows(dat1, c("AB", "AC"), -3:4)
Applying it to the earlier example of matching row.name12 and row.name13, you can use it as follows: getMyRows(dat, c(12, 13), -1:1).
You can also modify the function to make it more general (for example, to specify matching with a column instead of row names).
Create some sample data:
> dat=data.frame(col1=letters,col2=sample(26),col3=sample(letters))
> dat
col1 col2 col3
1 a 26 x
2 b 12 i
3 c 15 v
...
Set our target vector (note I chose an edge case and overlapping cases), and find matching rows:
> target=c("a","e","g","s")
> match = which(dat$col1 %in% target)
Create sequences from -2 to +2 of the matches (adjust for your needs) and merge:
> getThese = unique(as.vector(mapply(seq,match-2,match+2)))
> getThese
[1] -1 0 1 2 3 4 5 6 7 8 9 17 18 19 20 21
Fix the edge cases:
> getThese = getThese[getThese > 0 & getThese <= nrow(dat)]
> dat[getThese,]
col1 col2 col3
1 a 26 x
2 b 12 i
3 c 15 v
4 d 22 d
5 e 2 j
6 f 9 l
7 g 1 w
8 h 21 n
9 i 17 p
17 q 18 a
18 r 10 m
19 s 24 o
20 t 13 e
21 u 3 k
>
Remember our targets were a, e, g and s. You've now got those plus two rows above and two rows below for each, with no duplicates.
If you are using row names, just create 'match' from those. I was using a column.
I'd write a bunch more tests using the testthat package if this were my problem.
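For what it's worth, such a test could look like the sketch below (the wrapper function and the expectations are illustrative, not part of the answer):
library(testthat)
# hypothetical wrapper around the steps above, so they can be tested
context_rows <- function(dat, target, before = 2, after = 2) {
  match <- which(dat$col1 %in% target)
  idx <- unique(as.vector(mapply(seq, match - before, match + after)))
  idx[idx > 0 & idx <= nrow(dat)]
}
test_that("context window stays inside the data and includes the targets", {
  dat <- data.frame(col1 = letters, col2 = sample(26), col3 = sample(letters))
  idx <- context_rows(dat, c("a", "e", "g", "s"))
  expect_true(all(idx >= 1 & idx <= nrow(dat)))
  expect_true(all(which(dat$col1 %in% c("a", "e", "g", "s")) %in% idx))
})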
Another option is to use filter(). In case stats::filter is masked, e.g. by dplyr::filter, you have to call stats::filter explicitly.
dat <- data.frame(x = seq_along(letters), row.names = letters)
i <- rownames(dat) %in% c("a", "b", "j", "y") # get the matches
nAfter <- 3
nBefore <- 1
fi <- seq(-nBefore, nAfter)
n <- max(abs(fi)) # half-width of the filter window
fi <- seq(-n, n) %in% fi
dat[head(tail(filter(c(rep(FALSE, n), i, rep(FALSE, n)), fi), -n), -n) > 0, , drop = FALSE]
# x
#a 1
#b 2
#c 3
#d 4
#e 5
#i 9
#j 10
#k 11
#l 12
#m 13
#x 24
#y 25
#z 26
I would simply proceed as follows:
dat[(grep("row.name12", row.names(dat)) - 4):(grep("row.name13", row.names(dat)) + 4), ]
grep("row.name12", row.names(dat)) gives you the number of the row named "row.name12", so
(grep("row.name12", row.names(dat)) - 4):(grep("row.name13", row.names(dat)) + 4)
gives you a series of row numbers ranging from the 4th row before the row named "row.name12" to the 4th row after the one named "row.name13".
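If you have a whole vector of row names to look up, the same idea generalizes (a sketch building on the grep() approach above; the last line clips the window at the edges of the data frame):
targets <- c("row.name12", "row.name13")
idx <- unlist(lapply(targets, function(nm) grep(nm, row.names(dat))))
rng <- sort(unique(unlist(lapply(idx, function(i) (i - 4):(i + 4)))))
dat[rng[rng >= 1 & rng <= nrow(dat)], ]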

Resources