Recode dataframe based on one column - in reverse - r

I asked this question a while ago (Recode dataframe based on one column) and the answer worked perfectly. Now, however, I want to do almost the reverse. Namely, I have a (700k * 2000) matrix of 0/1/2 or NA. In a separate dataframe I have two columns (Ref and Obs). A 0 corresponds to two instances of Ref, a 1 to one instance of Ref and one instance of Obs, and a 2 to two instances of Obs. To clarify, here is a data snippet:
Genotype File ---
Ref Obs
A G
T C
G C
Ref <- c("A", "T", "G")
Obs <- c("G", "C", "C")
Current Data---
Sample.1 Sample.2 .... Sample.2000
0 1 2
0 0 0
0 NA 1
mat <- matrix(nrow=3, ncol=3)
mat[,1] <- c(0,0,0)
mat[,2] <- c(1,0,NA)
mat[,3] <- c(2,0,1)
Desired Data format---
Sample.1 Sample.1 Sample.2 Sample.2 Sample.2000 Sample.2000
A A A G G G
T T T T T T
G G 0 0 G C
I think that's right. The desired data format has two columns (space separated) for each sample. 0 in this format (plink ped file for the bioinformaticians out there) is missing data.

MAJOR ASSUMPTION: your data is in 3 element frames, i.e. you want to apply your mapping to the first 3 rows, then the next 3, and so on, which I think makes sense given DNA frames. If you want a rolling 3 element window this will not work (but code can be modified to make it work). This will work for an arbitrary number of columns, and arbitrary number of 3 row groups:
# Make up a matrix with your properties (4 cols, 6 rows)
col <- 4L
frame <- 3L
mat <- matrix(sample(c(0:2, NA_integer_), 2 * frame * col, replace=T), ncol=col)
# Mapping data
Ref <- c("A", "T", "G")
Obs <- c("G", "C", "C")
map.base <- cbind(Ref, Obs)
num.to.let <- matrix(c(1, 1, 1, 2, 2, 2), byrow=T, ncol=2) # how many from each of ref obs
# Function to map 0,1,2,NA to Ref/Obs
re_map <- function(mat.small) { # 3 row matrices, with col columns
t(
mapply( # iterate through each row in matrix
function(vals, map, num.to.let) {
vals.2 <- unlist(lapply(vals, function(x) map[num.to.let[x + 1L, ]]))
ifelse(is.na(vals.2), 0, vals.2)
},
vals=split(mat.small, row(mat.small)), # a row
map=split(map.base, row(map.base)), # the mapping for that row
MoreArgs=list(num.to.let=num.to.let) # general conversion of number to Obs/Ref
) )
}
# Split input data frame into 3 row matrices (assumes frame size 3),
# and apply mapping function to each group
mat.split <- split.data.frame(mat, sort(rep(1:(nrow(mat) / frame), frame)))
mat.res <- do.call(rbind, lapply(mat.split, re_map))
colnames(mat.res) <- paste0("Sample.", rep(1:ncol(mat), each=2))
print(mat.res, quote=FALSE)
# Sample.1 Sample.1 Sample.2 Sample.2 Sample.3 Sample.3 Sample.4 Sample.4
# 1 G G A G G G G G
# 2 C C 0 0 T C T C
# 3 0 0 G C G G G G
# 1 A A A A A G A A
# 2 C C C C T C C C
# 3 C C G G 0 0 0 0

I am not sure, but this could be what you need.
First, some simple data:
geno <- data.frame(Ref = c("A", "T", "G"), Obs = c("G", "C", "C"))
data <- data.frame(s1 = c(0,0,0),s2 = c(1, 0, NA))
then a couple of functions:
f <- function(i , x, geno){
x <- x[i]
if(!is.na(x)){
if (x == 0) {y <- geno[i , c(1,1)]}
if (x == 1) {y <- geno[i, c(1,2)]}
if (x == 2) {y <- geno[i, c(2,2)]}
}
else y <- c(0,0)
names(y) <- c("s1", "s2")
y
}
g <- function(x, geno){
Reduce(rbind, lapply(1:length(x), FUN = f , x = x, geno = geno))
}
The way f() is defined may not be the most elegant, but it does the job.
Then simply run it as a double for loop in lapply fashion:
as.data.frame(Reduce(cbind, lapply(data , g , geno = geno )))
hope it helps

Here's one way based on the sample data in your answer:
# create index
idx <- lapply(data, function(x) cbind((x > 1) + 1, (x > 0) + 1))
# list of matrices
lst <- lapply(idx, function(x) {
tmp <- apply(x, 2, function(y) geno[cbind(seq_along(y), y)])
replace(tmp, is.na(tmp), 0)
})
# one data frame
as.data.frame(lst)
# s1.1 s1.2 s2.1 s2.2
# 1 A A A G
# 2 T T T T
# 3 G G 0 0
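For a matrix as large as the one in the question, it may be worth avoiding the per-column lapply altogether. The following is only a sketch of a fully vectorised variant of the same idea, assuming the genotype matrix and the result both fit in memory; the names first, second and ped are made up for illustration.

```r
Ref <- c("A", "T", "G")
Obs <- c("G", "C", "C")
mat <- cbind(c(0, 0, 0), c(1, 0, NA), c(2, 0, 1))

# row(mat) gives the row index of every cell, so Ref[row(mat)] lines up
# each cell with the reference allele of its own row
first  <- ifelse(mat > 1, Obs[row(mat)], Ref[row(mat)])  # allele 1 is Obs only for 2
second <- ifelse(mat > 0, Obs[row(mat)], Ref[row(mat)])  # allele 2 is Obs for 1 or 2

# interleave the two alleles: columns 1, 3, 5, ... and 2, 4, 6, ...
ped <- matrix("0", nrow(mat), 2 * ncol(mat))
ped[, seq(1, by = 2, length.out = ncol(mat))] <- first
ped[, seq(2, by = 2, length.out = ncol(mat))] <- second
ped[is.na(ped)] <- "0"  # NA genotypes become the missing code 0
print(ped, quote = FALSE)
```

On the three-sample snippet from the question this reproduces the desired rows (A A A G G G, T T T T T T, G G 0 0 G C).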

Related

Perform same operation to multiple variables, assigning result

I'm working on a project where I have to apply the same transformation to multiple variables. For example
a <- a + 1
b <- b + 1
d <- d + 1
e <- e + 1
I can obviously perform the operations in sequence using
for (i in c(a, b, d, e)) i <- i + 1
However, I can't actually assign the result to each variable this way, since i is a copy of each variable, not a reference.
Is there a way to do this? Obviously, it'd be easier if the variables were merged in a data.frame or something, but that's not possible.
Usually, if you find yourself doing the same thing to multiple objects, they should be stored (and thought of) as a single object with sub-components. You say that storing these as a data.frame is not possible, so you can use a list instead. This allows you to use lapply/sapply to apply a function to each element of the list in one step.
a <- c(1, 2, 3)
b <- c(1, 4)
c <- 5
d <- rnorm(10)
e <- runif(5)
lstt <- list(a = a, b = b, c = c, d = d, e = e)
lstt$a
# [1] 1 2 3
lstt <- lapply(lstt, '+', 1)
lstt$a
# [1] 2 3 4
The question states that the variables to increment cannot be in a larger structure, but the comments then say that this is not so after all, so we will assume they are in a list L.
L <- list(a = 1, b = 2, d = 3, e = 4) # test data
for(nm in names(L)) L[[nm]] <- L[[nm]] + 1
# or
L <- lapply(L, `+`, 1)
# or
L <- lapply(L, function(x) x + 1)
Scalars
If they are all scalars then they can be put in an ordinary vector:
v <- c(a = 1, b = 2, d = 3, e = 4)
v <- v + 1
Vectors
If they are all vectors of the same length, they can be put in a data frame, or, if they are also of the same type, in a matrix. In either case we can simply add 1 to the whole object.
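A minimal sketch of the equal-length case, with made-up values; the names DF and M are only illustrative:

```r
a <- c(1, 2, 3); b <- c(4, 5, 6); d <- c(7, 8, 9)  # test data

DF <- data.frame(a, b, d)  # same length, possibly different types
DF <- DF + 1

M <- cbind(a, b, d)        # same length and same type
M <- M + 1

DF$a
# [1] 2 3 4
M[, "d"]
# [1]  8  9 10
```

Note that this updates the columns of DF and M, not the free variables a, b and d themselves.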
Environment
If the variables do have to be free in an environment then if nms is a vector of the variable names then we can iterate over the names and use those names to subscript the environment env. If the names follow some pattern we may be able to use nms <- ls(pattern = "...", envir = env) or if they are the only variables in that environment we can use nms <- ls(env).
a <- b <- d <- e <- 1 # test data
env <- .GlobalEnv # can change this if not being done in global envir
nms <- c("a", "b", "d", "e")
for(nm in nms) env[[nm]] <- env[[nm]] + 1
a;b;d;e # check
## [1] 2
## [1] 2
## [1] 2
## [1] 2

Building a Tree with Node Pairs

I have a data.table of node pairs where Parent is higher up the tree than Child.
I need to extract all the individual chains from these rules e.g. if I have in format parent>child: (a>b, b>c, b>e, c>d), the chains are (a>b>c>d, a>b>e).
I've made an example with some dummy data showing what I want to do. Any suggestions on how to do this would be great? It feels like it should be straightforward but I'm struggling to think how to start. Thank you :)
library(data.table)
library(data.tree)
# example input and expected output
input <- data.table(Parent = c("a", "b", "c",
"e", "b"),
Child = c("b", "c", "d",
"b", "f"))
output <- data.table(Tree = c(rep(1,4), rep(2,3), rep(3,3), rep(4,4)),
List = c("a", "b", "c", "d",
"e", "b", "f",
"a", "b", "f",
"e", "b", "c", "d"),
Hierarchy = c(1:4, 1:3, 1:3, 1:4))
# attempt with data.tree, only builds the node pairs.
# ignore world part, was following: https://cran.r-project.org/web/packages/data.tree/vignettes/data.tree.html#tree-creation
input[, pathString := paste("world", Parent, Child, sep = "/")]
data.tree::as.Node(input)
# attempt to re-structure
input[, Tree := .I]
dt1 <- input[, .(List = c(Parent, Child),
Hierarchy = 1:2), by=Tree]
Here is another possible solution - a little messy as well though
Output
output(input)
# tree_nums elems hierarchy
# 1: 1 a 1
# 2: 1 b 2
# 3: 1 c 3
# 4: 1 d 4
# 5: 2 e 1
# 6: 2 b 2
# 7: 2 c 3
# 8: 2 d 4
# 9: 3 a 1
# 10: 3 b 2
# 11: 3 f 3
# 12: 4 e 1
# 13: 4 b 2
# 14: 4 f 3
#
Function
output <- function (input) {
# init
helper <- do.call(paste0, input)
elements <- unique(unlist(input))
res <- integer(length(elements))
ind <- elements %in% input$Child
# first generation
parents <- elements[!ind]
res[!ind] <- 1L
# later generations
val <- 1L
parents <- parents
trees <- setNames(as.list(seq_along(parents)), parents)
while (any(res == 0L)) {
val <- val + 1L
children <- unique(input$Child[input$Parent %in% parents])
res[elements %in% children] <- val
# create the tree
nextHelper <- expand.grid(parents, children)
nextHelper$conc <- do.call(paste0, nextHelper)
nextHelper <- nextHelper[nextHelper$conc %in% helper,]
df_1 <- do.call(rbind, strsplit(names(trees),''))
df_2 <- base::merge(df_1, nextHelper[,-3L], by.x = ncol(df_1), by.y = 'Var1', all.x = TRUE)
n1 <- ncol(df_2)
if (n1 > 2L) df_2 <- df_2[,c(2:(n1-1),1L,n1)]
df_2$Var2 <- as.character(df_2$Var2)
df_2$Var2[is.na(df_2$Var2)] <- ''
treeNames <- do.call(paste0, df_2)
trees <- setNames(as.list(seq_along(treeNames)), treeNames)
parents <- children
}
elems <- strsplit(names(trees),'')
tree_nums <- rep(as.integer(trees), lengths(elems))
elems <- unlist(elems)
output <- data.table::data.table(tree_nums,elems)
out <- data.table::data.table(elements, res)
output$hierarchy <- out$res[match(output$elems, out$elements)]
output
}
I have a solution after a bit of a slog, but would prefer something more efficient if it exists.
library(stringi)
# convert to string
setkey(input, Parent)
sep <- ">>"
split_regex <- "(?<=%1$s)[^(%1$s)]*$"
trees <- sprintf("%s%s%s", input$Parent, sep, input$Child)
# get the base nodes, the children
children <- stri_extract_first_regex(trees, sprintf(split_regex, sep),
simplify = TRUE)
# find that which are parents
grid <- input[J(unique(children)), ][!is.na(Child), ]
update <- unique(grid$Parent)
N <- nrow(grid)
while(N > 0){
# add the children on for the ones at the base of the chains, might mean
# making more tree splits
all_trees <- unique(unlist(lapply(update, function(x){
pos <- children == x
y <- grid[Parent %in% x, Child]
trees <- c(trees[!pos], CJ(trees[pos], y)[, sprintf("%s%s%s", V1, sep, V2)])
trees
})))
# I have some trees embedded now, so remove these ones
trim <- sapply(seq_along(all_trees), function(i){
any(stri_detect_fixed(all_trees[-i], all_trees[i]))
})
trees <- all_trees[!trim]
# update operations on expanded trees until no children remain with a dependency
children <- stri_extract_first_regex(trees, sprintf(split_regex, sep, sep),
simplify = TRUE)
grid <- input[J(unique(children)), ][!is.na(Child), ]
update <- unique(grid$Parent)
N <- nrow(grid)
}
# re-structure to appropriate format
output <- data.table(pattern = trees)
output[, Tree := 1:.N]
output[, split := stri_split_regex(pattern, sep)]
output <- output[, .(List = split[[1]],
Hierarchy = 1:length(split[[1]])), by=Tree]
output[]
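For comparison, here is a rough base-R sketch that enumerates the chains by depth-first search. It assumes the parent/child pairs contain no cycles; enumerate_chains, walk and long are hypothetical names, not part of data.table or data.tree.

```r
# Hypothetical helper: list every root-to-leaf chain.
# Assumes the node pairs form no cycles, otherwise the recursion never ends.
enumerate_chains <- function(parent, child) {
  roots <- setdiff(unique(parent), child)  # nodes that never appear as a child
  walk <- function(path) {
    kids <- child[parent == path[length(path)]]
    if (length(kids) == 0L) return(list(path))  # leaf reached, chain complete
    do.call(c, lapply(kids, function(k) walk(c(path, k))))
  }
  do.call(c, lapply(roots, walk))
}

chains <- enumerate_chains(c("a", "b", "c", "e", "b"),
                           c("b", "c", "d", "b", "f"))

# long format: one row per node, numbered within its chain
long <- data.frame(Tree = rep(seq_along(chains), lengths(chains)),
                   List = unlist(chains),
                   Hierarchy = sequence(lengths(chains)))
```

With the example input this yields four chains (a>b>c>d, a>b>f, e>b>c>d, e>b>f), though not necessarily numbered in the same order as the expected output in the question.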

How to run a function with multiple arguments of varying length in a loop in R

I need to run this function like 6000 times with all of its iterations. I have 6 arguments in total for the function. The first 3 of them go hand in hand and number 75. The next argument has 9 values. And the last 2 arguments have 3 values.
#require dplyr
#data is history as list
matchloop <- function(data, data2, x, a, b, c) {
#history as list
split <- data
#history for reference
fh <- FullHistory
#start counter
n<-1
#end counter
m<-a
tempdf0.3 <- fh
#set condition for loop
while(nrow(tempdf0.3) > 1 && m <= (nrow(data2))*b) {
#put history into a variable
tempdf0.0 <- split
#put fh into a variable
tempdf0.5 <- fh
#put test path into variable from row n to m
tempdf0.1 <- as.data.frame(data2[n:m,], stringsAsFactors = FALSE)
#change column name of test path
colnames(tempdf0.1) <- "directions"
#put row n to m of history into variable
tempdf0.2 <- lapply(tempdf0.0, function(df) df[n:m,])
#put output into output
tempdf0.3 <- orderedDistancespos(tempdf0.2, tempdf0.1,
"allPaths","directions")
#add to output routeID based on reference from fh-the test path ID
tempdf0.3 <- mutate(tempdf0.3, routeID = (subset(tempdf0.5, routeID
!= x)$routeID))
#reduce output based on the matched threshold
tempdf0.3 <- subset(tempdf0.3, dists >= a*c)
#create new history based on the IDs remaining in output
split <- split[as.character(tempdf0.3$routeID)]
#create new history for reference based on the IDs remaining in output
fh <- subset(fh, routeID %in% tempdf0.3$routeID)
#increase loop counter
n <- n+a
#increase loop counter
m <- n+(a-1)
}
#show output
mylist <- list(tempdf0.3, nrow(tempdf0.3))
return(mylist)
}
I tried putting the 3 arguments with 75 elements in them to their own lists and use mapply. This works. But even at this level I still have to run the code 81 times to cover all the variables because as far as I understand mapply recycles based on the length of the longest argument.
mapply(matchloop, mylist2,mylist3,mylist4, MoreArgs = list(a=a, b=b, c=c))
data is a list of dataframes
data2 is a dataframe
x, a, b, c are all numerical.
Right now I'm trying to streamline my output so that it's in just 1 line. So if possible I would like 1 single csv output with all 6000+ lines.
You can combine mapply and apply to cycle through all possible combinations of the a, b and c variables. To create all possible combinations you can use expand.grid. Finally, you can concatenate the list of rows into a data.frame with the help of do.call and rbind as follows:
matchloop_stub <- function(data, data2, x, a, b, c) {
# stub
c(d = sum(data), d2 = sum(data2), x = sum(x), a = a, b = b, c = c, r = a + b + c)
}
set.seed(123)
mylist2 <- replicate(75, data.frame(rnorm(1)))
mylist3 <- replicate(75, data.frame(rnorm(2)))
mylist4 <- replicate(75, data.frame(rnorm(3)))
a <- 1:9
b <- 1:3
c <- 1:3
abc <- expand.grid(a, b, c)
names(abc) <- c("a", "b", "c")
xs <- apply(abc, 1, function(x) (mapply(matchloop_stub, mylist2, mylist3, mylist4, x[1], x[2], x[3], SIMPLIFY = FALSE)))
df <- do.call(rbind, do.call(rbind, xs))
write.csv(df, file = "temp.csv")
res <- read.csv("temp.csv")
nrow(res)
# [1] 6075
head(res)
# X d d2 x a b c r
# 1 1 -0.5604756 0.7407984 -1.362065 1 1 1 3
# 2 2 -0.5604756 0.7407984 -1.362065 2 1 1 4
# 3 3 -0.5604756 0.7407984 -1.362065 3 1 1 5
# 4 4 -0.5604756 0.7407984 -1.362065 4 1 1 6
# 5 5 -0.5604756 0.7407984 -1.362065 5 1 1 7
# 6 6 -0.5604756 0.7407984 -1.362065 6 1 1 8
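On the recycling point raised in the question, a toy illustration (made-up values, not the asker's data) of how mapply pairs up its arguments, and how expand.grid differs:

```r
# mapply walks its vector arguments in parallel, recycling the shorter ones
res <- mapply(function(x, y) x + y, 1:4, 1:2)
res
# [1] 2 4 4 6

# expand.grid instead builds every combination, one row per combination
grid <- expand.grid(a = 1:9, b = 1:3, c = 1:3)
nrow(grid)
# [1] 81
nrow(grid) * 75
# [1] 6075
```

This is why the 75 parallel triples are handled by mapply, while the 81 combinations of a, b and c need an outer loop over the rows of the grid.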

Vectorise a function with a supplied variable

I have a function I've been trying to vectorise going from if(){} to ifelse(). It works fine when all the arguments to the function are contained within the data set it is working on, but if I supply an argument as a string, then the vectorisation stops and the first result is used for the whole data set.
Here's an example
# data
dat <- data.frame(var1 = rep(c(0,1), 4),
var2 = c(rep("a", 4), rep("b", 4))
)
# function
my_fun <- function(x, y){
z <- ifelse(y == "a", fun_a(x), fun_b(x))
return(z)
}
fun_a <- function(x){
z <- ifelse(x == 0, "zero", x)
return(z)
}
fun_b <- function(x){
z <- ifelse(x == 1, "ONE", x)
return(z)
}
dat$var3 <- my_fun(dat$var1, dat$var2)
This returns what I expect, a vector with a row-wise value based on var1 and var2
> dat
var1 var2 var3
1 0 a zero
2 1 a 1
3 0 a zero
4 1 a 1
5 0 b 0
6 1 b ONE
7 0 b 0
8 1 b ONE
However, I want to use this functions on different data sets where var2 is not included. I realise that an easy way around would be to add var2 as an extra column in the data set, but I don't really want to do that.
This is what happens when I supply var2 as a string:
other_dat <- data.frame(var1 = rep(c(0,1), 4))
other_dat$var3 <- my_fun(other_dat$var1, y = "a")
other_dat
var1 var3
1 0 zero
2 1 zero
3 0 zero
4 1 zero
5 0 zero
6 1 zero
7 0 zero
8 1 zero
How can I vectorise this function so that it accepts a string argument and returns the result I desire?
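You can vectorise y, i.e. make y the same length as x, so that ifelse picks a result for every value rather than only the first. ifelse returns a result shaped like its test, so a length-one y == "a" yields a length-one result, which is then recycled down the whole column. Revised code: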
# data
dat <- data.frame(var1 = rep(c(0,1), 4),
var2 = c(rep("a", 4), rep("b", 4))
)
# function
my_fun <- function(x, y){
if(length(y) == 1) {
y <- rep(y, length(x))
}
z <- ifelse(y == "a", fun_a(x), fun_b(x))
return(z)
}
fun_a <- function(x){
z <- ifelse(x == 0, "zero", x)
return(z)
}
fun_b <- function(x){
z <- ifelse(x == 1, "ONE", x)
return(z)
}
dat$var3 <- my_fun(dat$var1, "a")
other_dat <- data.frame(var1 = rep(c(0,1), 4))
other_dat$var3 <- my_fun(other_dat$var1, y = "a")
other_dat
Hope this helps.
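The behaviour can be seen in isolation with a small sketch (toy values; the names short and full are only illustrative):

```r
x <- c(0, 1, 0, 1)

# a length-1 test gives a length-1 result, whatever the length of yes/no
short <- ifelse("a" == "a", ifelse(x == 0, "zero", x), x)
length(short)
# [1] 1

# repeating the scalar to the length of x restores element-wise behaviour
y <- rep("a", length(x))
full <- ifelse(y == "a", ifelse(x == 0, "zero", x), x)
full
# [1] "zero" "1"    "zero" "1"
```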

R - Looping through datasets and change column names

I'm trying to loop through a bunch of datasets and change columns in R.
I have a bunch of datasets, say a,b,c,etc, and all of them have three columns, say X, Y, Z.
I would like to change their names to be a_X, a_Y, a_Z for dataset a, and b_X, b_Y, b_Z for dataset b, and so on.
Here's my code:
name.list = c("a","b","c")
for(i in name.list){
names(i) = c(paste(i,"_X",sep = ""),paste(i,"_Y",sep = ""),paste(i,"_Z",sep = ""));
}
However, the code above doesn't work since i is in text format.
I've considered the assign function, but it doesn't seem to fit either.
I would appreciate any ideas.
Something like this :
list2env(Map(function(dat, nn){
colnames(dat) <- paste(nn,colnames(dat),sep='_')
dat
}, mget(name.list), name.list),.GlobalEnv)
for ( i in name.list) {
assign(i, setNames( get(i), paste(i, names(get(i)), sep="_")))
}
> a
a_X a_Y a_Z
1 1 3 A
2 2 4 B
> b
b_X b_Y b_Z
1 1 3 A
2 2 4 B
> c
c_X c_Y c_Z
1 1 3 A
2 2 4 B
Here's some free data:
a <- data.frame(X = 1, Y = 2, Z = 3)
b <- data.frame(X = 4, Y = 5, Z = 6)
c <- data.frame(X = 7, Y = 8, Z = 9)
And here's a method that uses mget and a custom function foo
name.list <- c("a", "b", "c")
foo <- function(x, i) setNames(x, paste(name.list[i], names(x), sep = "_"))
list2env(Map(foo, mget(name.list), seq_along(name.list)), .GlobalEnv)
a
# a_X a_Y a_Z
# 1 1 2 3
b
# b_X b_Y b_Z
# 1 4 5 6
c
# c_X c_Y c_Z
# 1 7 8 9
You could also avoid get or mget by putting a, b, and c into their own environment (or even a list). You also wouldn't need the name.list vector if you go this route, because it's the same as ls(e)
e <- new.env()
e$a <- a; e$b <- b; e$c <- c
bar <- function(x, y) setNames(x, paste(y, names(x), sep = "_"))
# index by ls(e): as.list() on an environment is not sorted, ls() is
list2env(Map(bar, as.list(e)[ls(e)], ls(e)), .GlobalEnv)
Another perk of doing it this way is that you still have the untouched data frames in the environment e. Nothing was overwritten (check a versus e$a).
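If the data frames can live in a named list rather than as free variables, the renaming needs no environment manipulation at all. A small sketch with made-up data; dfs is a hypothetical name:

```r
dfs <- list(a = data.frame(X = 1, Y = 2, Z = 3),
            b = data.frame(X = 4, Y = 5, Z = 6))

# prefix every column with the name of the list element it belongs to
dfs <- Map(function(df, nm) setNames(df, paste(nm, names(df), sep = "_")),
           dfs, names(dfs))

names(dfs$b)
# [1] "b_X" "b_Y" "b_Z"
```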
