I have a function called newBamAD and a matrix x. The function matches the letters in the REF and ALT columns and grabs the corresponding counts for the REF and ALT bases in x. How do I make this function return 0 in the ref or alt position instead of NA? How do I replace NA with zero here?
x <- as.matrix(read.csv(text="start,A,T,G,C,REF,ALT,TYPE
chr20:5363934,95,29,14,59,C,T,snp
chr5:8529759,,,,,G,C,snp
chr14:9620689,65,49,41,96,T,G,snp
chr18:547375,94,1,51,67,G,C,snp
chr8:5952145,27,80,25,96,T,T,snp
chr14:8694382,68,94,26,30,A,A,snp
chr16:2530921,49,15,79,72,A,T,snp:2530921
chr16:2530921,49,15,79,72,A,G,snp:2530921
chr16:2530921,49,15,79,72,A,T,snp:2530921flat
chr16:2530331,9,2,,,A,T,snp:2530331
chr16:2530331,9,2,,,A,G,snp:2530331
chr16:2530331,9,2,,,A,T,snp:2530331flat
chr16:2533924,42,13,19,52,G,T,snp:flat
chr16:2543344,4,13,13,42,G,T,snp:2543344flat
chr16:2543344,42,23,13,42,G,A,snp:2543344
chr14:4214117,73,49,18,77,G,A,snp
chr4:7799768,36,28,1,16,C,A,snp
chr3:9141263,27,41,93,90,A,A,snp", stringsAsFactors=FALSE))
newBamAD <- function (x,base.types=c("A","C","G","T")) {
# the version above
rownames(x) <- 1:nrow(x)
ref <- x[cbind(1:nrow(x), x[, 'REF'])]
alt <- x[cbind(1:nrow(x), x[, 'ALT'])]
which.flat <- grep('flat$', x[, 'TYPE'])
alt[which.flat] <- sapply(which.flat, function (i,base.types) {
sum(as.numeric(x[i, c( base.types[!( base.types %in% x[i, 'REF'])] )] ) ,na.rm=TRUE) },base.types)
cbind(x[,c("start","REF","ALT","TYPE")],bam.AD=paste(ref, alt, sep=','))
# cbind(x, bam.AD=paste(ref, alt, sep=','))
}
You could take the advice of thelatemail and switch to a data frame, then replace the NAs first:
df <- as.data.frame(x)
types <- c("A", "T", "G", "C")
df[types][is.na(df[types])] <- 0
head(newBamAD(df))
# start REF ALT TYPE bam.AD
# 1 chr20:5363934 C T snp 59,29
# 2 chr5:8529759 G C snp 0, 0
# 3 chr14:9620689 T G snp 49,41
# 4 chr18:547375 G C snp 51,67
# 5 chr8:5952145 T T snp 80,80
# 6 chr14:8694382 A A snp 68,68
We can use gsub to do that
gsub('NA', 0, newBamAD(x)[,5])
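Another option, sketched here under the assumption that x stays a character matrix as in the question, is to zero-fill the four count columns before looking anything up. The toy matrix m and the name counts are made up for illustration, and match() is used to turn the REF/ALT letters into column positions.

```r
# Toy character matrix shaped like x (values copied from the first two rows)
m <- cbind(A = c("95", NA), T = c("29", NA), G = c("14", NA), C = c("59", NA),
           REF = c("C", "G"), ALT = c("T", "C"))

# Replace missing counts with "0" in the count columns only
counts <- c("A", "T", "G", "C")
m[, counts][is.na(m[, counts])] <- "0"

# Look up the REF and ALT counts row by row via (row, column) index pairs
rows <- seq_len(nrow(m))
ref <- m[cbind(rows, match(m[, "REF"], colnames(m)))]
alt <- m[cbind(rows, match(m[, "ALT"], colnames(m)))]
bam.AD <- paste(ref, alt, sep = ",")
bam.AD
```

Because the NAs are gone before the lookup, the pasted result never contains the string "NA", so no gsub clean-up is needed afterwards.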
I have a list of data frames, and want to perform a function on each column in the data frame.
I've been googling for a while, but the issue I have is this:
df.1 <- data.frame(data=cbind(rnorm(5, 0), rnorm(5, 2), rnorm(5, 5)))
df.2 <- data.frame(data=cbind(rnorm(5, 0), rnorm(5, 2), rnorm(5, 5)))
names(df.1) <- c("a", "b", "c")
names(df.2) <- c("a", "b", "c")
ls.1<- list(df.1,df.2)
res <- lapply(ls.1, function(x){
x$d <- x$b + x$c
return(x)
})
Returns a new list "res" with a group of unnamed dataframes in them (res[[1]], res[[2]] etc).
[[1]]
a b c d
1 2.2378686 3.640607 4.793172 8.433780
2 -0.4411046 3.690850 5.290814 8.981664
3 -1.1490879 3.081092 4.982820 8.063912
4 -0.3024211 1.929033 4.743569 6.672602
5 1.3658726 3.395564 2.800131 6.195695
[[2]]
a b c d
1 0.3452530 3.264709 7.384127 10.648836
2 -1.2031949 3.118633 4.840496 7.959129
3 0.6177369 1.119107 4.938917 6.058024
4 -1.0470713 1.942357 5.747748 7.690106
5 0.8732836 2.704501 5.805754 8.510254
I'm interested in adding columns to the original data frames (df.1, df.2). How would I do this?
You can name your list elements, or use tibble::lst which will do it for you:
ls.1<- list(df.1 = df.1,df.2 = df.2)
ls.2<- tibble::lst(df.1, df.2)
res1 <- lapply(ls.1, function(x){
x$d <- x$b + x$c
return(x)
})
res2 <- lapply(ls.2, function(x){
x$d <- x$b + x$c
return(x)
})
# $df.1
# a b c d
# 1 0.6782608 4.0774244 2.845351 6.922776
# 2 2.3620601 1.9395314 5.438832 7.378364
# 3 -0.5913838 2.0579972 4.312360 6.370357
# 4 0.5532147 0.8581389 5.867889 6.726027
# 5 -0.3251044 1.9838598 4.321008 6.304867
#
# $df.2
# a b c d
# 1 1.9918131 3.195105 5.715858 8.910963
# 2 0.2525537 2.507358 5.040691 7.548050
# 3 0.5038298 3.112855 5.265974 8.378830
# 4 0.4873384 3.377182 5.685714 9.062896
# 5 -0.6539881 0.157948 5.407508 5.565456
To overwrite the original data.frames you can use list2env on the output.
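A minimal sketch of that list2env step, assuming the list is named after the data frames. The small integer data frames here are stand-ins for the random ones above, and environment() is used so the objects are written back into whatever environment the code runs in (at top level that is the global environment).

```r
# Stand-in data frames (deterministic, for illustration only)
df.1 <- data.frame(a = 1:3, b = 4:6, c = 7:9)
df.2 <- data.frame(a = 3:1, b = 6:4, c = 9:7)

# The list must be named, because list2env uses the names as object names
ls.1 <- list(df.1 = df.1, df.2 = df.2)
res <- lapply(ls.1, function(x) {
  x$d <- x$b + x$c
  x
})

# Overwrite df.1 and df.2 with the modified copies
invisible(list2env(res, envir = environment()))
names(df.1)
```

Note that this silently replaces any existing objects with the same names, so it is worth checking names(res) before calling it.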
In order to add columns, you will have to either overwrite your ls.1 with res or perhaps manually assign result to your original data.frames, e.g. df.1 <- res[[1]]. But there are a hundred ways to skin a cat (pun intended) and there may be other better approaches.
I have a number of data frames. Each with the same format.
Like this:
A B C
1 -0.02299388 0.71404158 0.8492423
2 -1.43027866 -1.96420767 -1.2886368
3 -1.01827712 -0.94141194 -2.0234436
I would like to change the name of the third column (C) so that it includes part of the name of the variable associated with the data frame.
For the variable df_elephant the data frame should look like this:
A B C.elephant
1 -0.02299388 0.71404158 0.8492423
2 -1.43027866 -1.96420767 -1.2886368
3 -1.01827712 -0.94141194 -2.0234436
I have a function which will change the column name:
rename_columns <- function(x) {
colnames(x)[colnames(x)=='C'] <-
paste( 'C',
strsplit (deparse (substitute(x)), '_')[[1]][2], sep='.' )
return(x)
}
This works with my data frames. However, I would like to provide a list of data frames so that I do not have to call the function multiple times by hand. If I use lapply like so:
lapply( list (df_elephant, df_horse), rename_columns )
The function renames the data frames with an NA rather than portion of the variable name.
[[1]]
A B C.NA
1 -0.02299388 0.71404158 0.8492423
2 -1.43027866 -1.96420767 -1.2886368
3 -1.01827712 -0.94141194 -2.02344361
[[2]]
A B C.NA
1 0.45387054 0.02279488 1.6746280
2 -1.47271378 0.68660595 -0.2505752
3 1.26475917 -1.51739927 -1.3050531
Is there some way that I can provide a list of data frames to my function and produce the desired result?
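Inside lapply, your function no longer sees the variable name df_elephant: deparse(substitute(x)) returns the expression lapply uses internally (something like X[[i]]), which contains no underscore, so the second piece of the split is NA. Work with the names of the list instead of the name of the argument.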
# Generating random data
n = 3
item1 = data.frame(A = runif(n), B = runif(n), C = runif(n))
item2 = data.frame(A = runif(n), B = runif(n), C = runif(n))
myList = list(df_elephant = item1, df_horse = item2)
# 1- Why your code doesn't work: ---------------
names(myList) # This returns the names that you actually want to use: [1] "df_elephant" "df_horse"
lapply(myList, names) # This returns the data frames' column names, which do not contain the animal name
# 2- How to make it work: ---------------
lapply(seq_along(myList), # Iterate over a vector of indices
function(i){
dfName = names(myList)[i] # Get the list name
dfName.animal = unlist(strsplit(dfName, "_"))[2] # Split on underscore and take the second element
df = myList[[i]] # Copy the actual Data frame
colnames(df)[colnames(df) == "C"] = paste("C", dfName.animal, sep = ".") # Change column names
return(df) # Return the new df
})
# [[1]]
# A B C.elephant
# 1 0.8289368 0.06589051 0.2929881
# 2 0.2362753 0.55689663 0.4854670
# 3 0.7264990 0.68069346 0.2940342
#
# [[2]]
# A B C.horse
# 1 0.08032856 0.4137106 0.6378605
# 2 0.35671556 0.8112511 0.4321704
# 3 0.07306260 0.6850093 0.2510791
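To see where the NA in the question comes from, here is a small sketch comparing a direct call with a call through lapply. The helper show_name is hypothetical and only mirrors the deparse(substitute(x)) part of rename_columns.

```r
# Hypothetical helper: returns the text of the expression passed as x
show_name <- function(x) deparse(substitute(x))

df_elephant <- data.frame(C = 1)

# Called directly, the argument expression is the variable name
direct <- show_name(df_elephant)

# Called through lapply, the argument expression is lapply's internal one
inside <- unlist(lapply(list(df_elephant), show_name))

strsplit(direct, "_")[[1]][2]  # the animal name
strsplit(inside, "_")[[1]][2]  # NA, because there is no underscore to split on
```

This is why passing the names explicitly (via names(myList), or as a second argument with Map) is more reliable than recovering them from the call.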
You can also try the following. It is similar to akrun's answer and also uses Map in the end:
# Your data
d <- read.table("clipboard")
# create a list with names A and B
d_list <- list(A=d, B=d)
# function
foo <- function(x, y){
gr <- which(colnames(x) == "C") # get index of colnames C
tmp <- colnames(x) #new colnames vector
tmp[gr] <- paste(tmp[gr], y, sep=".") # replace the old with the new colnames.
setNames(x, tmp) # set the new names
}
# Result
Map(foo, d_list, names(d_list))
$A
A B C.A
1 -0.02299388 0.7140416 0.8492423
2 -1.43027866 -1.9642077 -1.2886368
3 -1.01827712 -0.9414119 -2.0234436
$B
A B C.B
1 -0.02299388 0.7140416 0.8492423
2 -1.43027866 -1.9642077 -1.2886368
3 -1.01827712 -0.9414119 -2.0234436
We can try Map. Collect the datasets in a list (here mget returns the values of the named objects as a list), then use Map to append to the third column's name a suffix taken from the corresponding element of the vector of names.
Map(function(x, y) {names(x)[3] <- paste(names(x)[3], sub(".*_", "", y), sep="."); x},
mget(c("df_elephant", "df_horse")), c("df_elephant", "df_horse"))
#$df_elephant
# A B C.elephant
#1 -0.02299388 0.7140416 0.8492423
#2 -1.43027866 -1.9642077 -1.2886368
#3 -1.01827712 -0.9414119 -2.0234436
#$df_horse
# A B C.horse
#1 0.4538705 0.02279488 1.6746280
#2 -1.4727138 0.68660595 -0.2505752
#3 1.2647592 -1.51739927 -1.3050531
df<-data.frame(w=c("r","q"), x=c("a","b"))
y=c(1,2)
How do I combine df and y into a new data frame that has all combinations of rows from df with elements from y? In this example, the output should be
data.frame(w=c("r","r","q","q"), x=c("a","a","b","b"),y=c(1,2,1,2))
w x y
1 r a 1
2 r a 2
3 q b 1
4 q b 2
This should do what you're trying to do, and without too much work.
dl <- unclass(df)
dl$y <- y
merge(df, expand.grid(dl))
# w x y
# 1 q b 1
# 2 q b 2
# 3 r a 1
# 4 r a 2
data.frame(lapply(df, rep, each = length(y)), y = y)
This should work:
library(combinat)
df<-data.frame(w=c("r","q"), x=c("a","b"))
y=c("one", "two") #for generality
indices <- permn(seq_along(y))
combined <- NULL
for(i in indices){
current <- cbind(df, y=y[unlist(i)])
if(is.null(combined)){
combined <- current
} else {
combined <- rbind(combined, current)
}
}
print(combined)
Here is the output:
w x y
1 r a one
2 q b two
3 r a two
4 q b one
... or to make it shorter (and less obvious):
combined <- do.call(rbind, lapply(indices, function(i){cbind(df, y=y[unlist(i)])}))
First, convert class of columns from factor to character:
df <- data.frame(lapply(df, as.character), stringsAsFactors=FALSE)
Then, use expand.grid to get a index matrix for all combinations of rows of df and elements of y:
ind.mat = expand.grid(1:length(y), 1:nrow(df))
Finally, loop through the rows of ind.mat to get the result:
data.frame(t(apply(ind.mat, 1, function(x){c(as.character(df[x[2], ]), y[x[1]])})))
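For completeness, a base R sketch of the same "all combinations of rows of df with elements of y" idea using merge with by = NULL, which the merge documentation describes as producing the Cartesian product. The name out is made up here, and the row order differs from the output requested in the question, so a sort may be needed afterwards.

```r
df <- data.frame(w = c("r", "q"), x = c("a", "b"), stringsAsFactors = FALSE)
y <- c(1, 2)

# by = NULL asks merge for every pairing of a row of df with a row of the second frame
out <- merge(df, data.frame(y = y), by = NULL)

# Optional: order the rows for easier comparison with the desired output
out <- out[order(out$w, out$y), ]
out
```

Unlike the expand.grid approach, this keeps each row of df intact, so w and x are never recombined with each other.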
Is there a way to have a dataframe refer to itself?
I find myself spending a lot of time writing things like y$Category1[is.na(y$Category1)]<-NULL which are hard to read and feel like a lot of slow repetitive typing. I wondered if there was something along the lines of:
y$Category1[is.na(self)] <- NULL I could use instead.
Thanks
What a great question. Unfortunately, as @user295691 pointed out in the comments, the issue is that the vector has to be referenced twice: once as the object being indexed and once as the subject of the condition. It does appear impossible to avoid the double reference.
numericVector[cond(numericVector)] <- newVal
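For comparison, a plain helper that returns a modified copy keeps the condition out of the call site, although the column is still named on both sides of the assignment. The function name na_to is hypothetical and not part of the functions described in this answer.

```r
# Hypothetical helper: returns v with its NA entries replaced by val
na_to <- function(v, val) {
  v[is.na(v)] <- val
  v
}

y <- data.frame(Category1 = c("a", NA, "c"), stringsAsFactors = FALSE)
y$Category1 <- na_to(y$Category1, "missing")
y$Category1
```

This version has no side effects, so it also works unchanged inside lapply or similar functions.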
What I think we can do is have a nice and neat function so that instead of
# this
y$Category1[is.na(y$Category1)] <- list(NULL)
# we can have this:
NAtoNULL(y$Category1)
For example, the following functions wrap selfAssign() (below):
NAtoNULL(obj) # Replaces NA values in obj with NULL.
NAtoVal(obj, val) # Replaces NA values in obj with val.
selfReplace(obj, toReplace, val) # Replaces toReplace values in obj with val
# and selfAssign can be called directly, but I'm not sure there would be a good reason to
selfAssign(obj, ind, val) # equivalent to obj[ind] <- val
Example:
# sample df
df <- structure(list(subj=c("A",NA,"C","D","E",NA,"G"),temp=c(111L,112L,NA,114L,115L,116L,NA),size=c(0.7133,NA,0.7457,NA,0.0487,NA,0.8481)),.Names=c("subj","temp","size"),row.names=c(NA,-7L),class="data.frame")
df
subj temp size
1 A 111 0.7133
2 <NA> 112 NA
3 C NA 0.7457
4 D 114 NA
5 E 115 0.0487
6 <NA> 116 NA
7 G NA 0.8481
# Make some replacements
NAtoNULL(df$size) # Replace all NAs in df$size with NULLs
NAtoVal(df$temp, 0) # Replace all NAs in df$temp with 0s
NAtoVal(df$subj, c("B", "E")) # Replace all NAs in df$subj with alternating "B" and "E"
# the modified df is now:
df
subj temp size
1 A 111 0.7133
2 B 112 NULL
3 C 0 0.7457
4 D 114 NULL
5 E 115 0.0487
6 E 116 NULL
7 G 0 0.8481
# replace the 0's in temp for NA
selfReplace(df$temp, 0, NA)
# replace NULL's in size for 1's
selfReplace(df$size, NULL, 1)
# replace all "E"'s in subj with alternate c("E", "F")
selfReplace(df$subj, c("E"), c("E", "F"))
df
subj temp size
1 A 111 0.7133
2 B 112 1
3 C NA 0.7457
4 D 114 1
5 E 115 0.0487
6 F 116 1
7 G NA 0.8481
Right now this works for vectors, but will fail with *apply. I would love to get it working fully, especially with applying plyr. The key would be to modify
FUNCTIONS
The code for the functions is below.
An important point: this does not (yet!) work with *apply / plyr.
I believe it can be made to work by modifying the value of n and adjusting sys.parent(.) in match.call(), but it still needs some fiddling.
Any suggestions or modifications would be greatly appreciated.
selfAssign <- function(self, ind, val, n=1, silent=FALSE) {
## assigns val to self[ind] in environment parent.frame(n)
## self should be a vector. Currently will not work for matrices or data frames
## GRAB THE CORRECT MATCH CALL
#--------------------------------------
# if nested function, match.call appropriately
if (class(match.call()) == "call") {
mc <- (match.call(call=sys.call(sys.parent(1))))
} else {
mc <- match.call()
}
# needed in case self is complex (ie df$name)
mc2 <- paste(as.expression(mc[[2]]))
## CLEAN UP ARGUMENT VALUES
#--------------------------------------
# replace logical indices with numeric indices
if (is.logical(ind))
ind <- which(ind)
# if no indices will be selected, stop here
if(identical(ind, integer(0)) || is.null(ind)) {
if(!silent) warning("No indices selected")
return()
}
# if val is a string, we need to wrap it in quotes
if (is.character(val))
val <- paste('"', val, '"', sep="")
# val cannot directly be NULL, must be list(NULL)
if(is.null(val))
val <- "list(NULL)"
## CREATE EXPRESSIONS AND EVAL THEM
#--------------------------------------
# create expressions to evaluate
ret <- paste0("'[['(", mc2, ", ", ind, ") <- ", val)
# evaluate in parent.frame(n)
eval(parse(text=ret), envir=parent.frame(n))
}
NAtoNULL <- function(obj, n=1) {
selfAssign(match.call()[[2]], is.na(obj), NULL, n=n+1)
}
NAtoVal <- function(obj, val, n=1) {
selfAssign(match.call()[[2]], is.na(obj), val, n=n+1)
}
selfReplace <- function(obj, toReplace, val, n=1) {
## replaces occurrences of toReplace within obj with val
# determine ind based on value & length of toReplace
# TODO: this will not work properly for data frames, but neither will selfAssign, yet.
if (is.null(toReplace)) {
ind <- sapply(obj, function(x) is.null(x[[1]]))
} else if (is.na(toReplace)) {
ind <- is.na(obj)
} else {
if (length(obj) > 1) { # note, this wont work for data frames
ind <- obj %in% toReplace
} else {
ind <- obj == toReplace
}
}
selfAssign(match.call()[[2]], ind, val, n=n+1)
}
## THIS SHOULD GO INSIDE NAtoNULL, NAtoVal etc.
# todo: modify for use with *apply
if(substr(paste(as.expression(x1)), 1, 10) == "FUN(obj = ") {
# PASS. This should identify when the call is coming from *apply.
# in such a case, need to increase n by 1 for apply & lapply. Increase n by 2 for sapply
# I'm not sure the increase required for plyr functions
}
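The same "modify the caller's object" idea can be sketched more compactly with substitute() and bquote() instead of pasting and parsing strings. This is an alternative illustration, not the implementation above: the function name na_to_val is hypothetical, it only handles a scalar replacement value, and it shares the limitation that it must be called directly (not through *apply).

```r
# Hypothetical sketch: replace NAs in the caller's object in place
na_to_val <- function(obj, val) {
  target <- substitute(obj)  # the expression the caller wrote, e.g. df$temp
  # Build the call  target[is.na(target)] <- val  and run it in the caller's frame
  eval(bquote(.(target)[is.na(.(target))] <- .(val)), envir = parent.frame())
  invisible(NULL)
}

df <- data.frame(temp = c(111L, NA, 113L))
na_to_val(df$temp, 0L)
df$temp
```

Building a call object rather than a string avoids the need to quote character values by hand before evaluation.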
I regularly have situations where I need to replace missing values from a data.frame with values from some other data.frame that is at a different level of aggregation. So, for example, if I have a data.frame full of county data I might replace NA values with state values stored in another data.frame. After writing the same merge... ifelse(is.na()) yada yada a few dozen times I decided to break down and write a function to do this.
Here's what I cooked up, along with an example of how I use it:
fillNaDf <- function(naDf, fillDf, mergeCols, fillCols){
mergedDf <- merge(naDf, fillDf, by=mergeCols)
for (col in fillCols){
colWithNas <- mergedDf[[paste(col, "x", sep=".")]]
colWithOutNas <- mergedDf[[paste(col, "y", sep=".")]]
k <- which( is.na( colWithNas ) )
colWithNas[k] <- colWithOutNas[k]
mergedDf[col] <- colWithNas
mergedDf[[paste(col, "x", sep=".")]] <- NULL
mergedDf[[paste(col, "y", sep=".")]] <- NULL
}
return(mergedDf)
}
## test case
fillDf <- data.frame(a = c(1,2,1,2), b = c(3,3,4,4) ,f = c(100,200, 300, 400), g = c(11, 12, 13, 14))
naDf <- data.frame( a = sample(c(1,2), 100, rep=TRUE), b = sample(c(3,4), 100, rep=TRUE), f = sample(c(0,NA), 100, rep=TRUE), g = sample(c(0,NA), 200, rep=TRUE) )
fillNaDf(naDf, fillDf, mergeCols=c("a","b"), fillCols=c("f","g") )
So after I got this running I had this odd feeling that someone has probably solved this problem before me and in a much more elegant way. Is there a better/easier/faster solution to this problem? Also, is there a way that eliminates the loop in the middle of my function? That loop is there because I am often replacing NAs in more than one column. And, yes, the function assumes the columns we're filling from are named the same as the columns we are filling to, and the same applies to the merge columns.
Any guidance or refactoring would be helpful.
EDIT on Dec 2 I realized I had logic flaws in my example which I fixed.
What a great question.
Here's a data.table solution:
# Convert data.frames to data.tables (i.e. data.frames with extra powers;)
library(data.table)
fillDT <- data.table(fillDf, key=c("a", "b"))
naDT <- data.table(naDf, key=c("a", "b"))
# Merge data.tables, based on their keys (columns a & b)
outDT <- naDT[fillDT]
# a b f g f.1 g.1
# [1,] 1 3 NA 0 100 11
# [2,] 1 3 NA NA 100 11
# [3,] 1 3 NA 0 100 11
# [4,] 1 3 0 0 100 11
# [5,] 1 3 0 NA 100 11
# First 5 rows of 200 printed.
# In outDT[i, j], on the following two lines
# -- i is a Boolean vector indicating which rows will be operated on
# -- j is an expression saying "(sub)assign from right column (e.g. f.1) to
# left column (e.g. f)"
outDT[is.na(f), f:=f.1]
outDT[is.na(g), g:=g.1]
# Just keep the four columns ultimately needed
outDT <- outDT[,list(a,b,g,f)]
# a b g f
# [1,] 1 3 0 0
# [2,] 1 3 11 0
# [3,] 1 3 0 0
# [4,] 1 3 11 0
# [5,] 1 3 11 0
# First 5 rows of 200 printed.
Here's a slightly more concise/robust version of your approach. You could replace the for-loop with a call to lapply, but I find the loop easier to read.
This function assumes any columns not in mergeCols are fair game to have their NAs filled. I'm not really sure this helps, but I'll take my chances with the voters.
fillNaDf.ju <- function(naDf, fillDf, mergeCols) {
mergedDf <- merge(fillDf, naDf, by=mergeCols, suffixes=c(".fill",""))
dataCols <- setdiff(names(naDf),mergeCols)
# loop over all columns we didn't merge by
for(col in dataCols) {
rows <- is.na(mergedDf[,col])
# skip this column if it doesn't contain any NAs
if(!any(rows)) next
rows <- which(rows)
# replace NAs with values from fillDf
mergedDf[rows,col] <- mergedDf[rows,paste(col,"fill",sep=".")]
}
# don't return ".fill" columns
mergedDf[,names(naDf)]
}
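As a rough sketch of the lapply variant mentioned above, assuming the merged data frame already carries ".fill" columns as produced by the merge call in fillNaDf.ju. The helper name fill_cols and the tiny merged data frame are made up for illustration.

```r
# Hypothetical helper: fill NAs in each column of cols from its ".fill" twin
fill_cols <- function(merged, cols) {
  merged[cols] <- lapply(cols, function(col) {
    v <- merged[[col]]
    fill <- merged[[paste(col, "fill", sep = ".")]]
    ifelse(is.na(v), fill, v)
  })
  merged
}

merged <- data.frame(a = 1:3, f = c(NA, 0, NA), f.fill = c(100, 200, 300))
filled <- fill_cols(merged, "f")
filled$f
```

Whether this reads better than the for-loop is a matter of taste; the loop has the advantage of skipping columns without NAs.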
My preference would be to pull out the code from merge that does the matching and do it myself, so that I could keep the ordering of the original data frame intact, both row-wise and column-wise. I also use matrix indexing to avoid any loops, though to do so I create a new data frame with the revised fillCols and replace the columns of the original with it. I thought I could fill it in directly, but apparently you can't use matrix indexing to replace parts of a data.frame, so I wouldn't be surprised if a loop over the names would be faster in some situations.
With matrix indexing:
fillNaDf <- function(naDf, fillDf, mergeCols, fillCols) {
fillB <- do.call(paste, c(fillDf[, mergeCols, drop = FALSE], sep="\r"))
naB <- do.call(paste, c(naDf[, mergeCols, drop = FALSE], sep="\r"))
na.ind <- is.na(naDf[,fillCols])
fill.ind <- cbind(match(naB, fillB)[row(na.ind)[na.ind]], col(na.ind)[na.ind])
naX <- naDf[,fillCols]
fillX <- fillDf[,fillCols]
naX[na.ind] <- fillX[fill.ind]
naDf[,colnames(naX)] <- naX
naDf
}
With a loop:
fillNaDf2 <- function(naDf, fillDf, mergeCols, fillCols) {
fillB <- do.call(paste, c(fillDf[, mergeCols, drop = FALSE], sep="\r"))
naB <- do.call(paste, c(naDf[, mergeCols, drop = FALSE], sep="\r"))
m <- match(naB, fillB)
for(col in fillCols) {
fix <- which(is.na(naDf[,col]))
naDf[fix, col] <- fillDf[m[fix],col]
}
naDf
}
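A small worked example of the matching step used in both versions, on made-up data with a single fill column. The helper name key is hypothetical; it just pastes the merge columns into one string per row so that match() can align the two data frames.

```r
fillDf <- data.frame(a = c(1, 2, 1, 2), b = c(3, 3, 4, 4),
                     f = c(100, 200, 300, 400))
naDf <- data.frame(a = c(2, 1, 1), b = c(4, 3, 4), f = c(NA, 0, NA))

# Hypothetical helper: one key string per row built from the merge columns
key <- function(d, cols) do.call(paste, c(d[, cols, drop = FALSE], sep = "\r"))

# For each row of naDf, the row of fillDf with the same (a, b) pair
m <- match(key(naDf, c("a", "b")), key(fillDf, c("a", "b")))

# Fill only the missing entries, leaving row order untouched
fix <- which(is.na(naDf$f))
naDf$f[fix] <- fillDf$f[m[fix]]
naDf
```

Because nothing is merged or sorted, the rows of naDf come back in their original order, which is the point of doing the matching by hand.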