Replace NaN values in a list with zero (0) - r

Hi, I have a problem with NaN. I am working with a large dataset with many variables, and they contain NaN values. The data looks like this:
z = list(a = c(1, 2, 3, NaN, 5, 8, 0, NaN), b = c(NaN, 2, 3, NaN, 5, 8, NaN, NaN))
I used these commands to force the list into a data frame, but I got this:
z=as.data.frame(z)
> is.list(z)
[1] TRUE
> is.data.frame(z)
[1] TRUE
> replace(z,is.nan(z),0)
Error in is.nan(z) : default method not implemented for type 'list'
I forced z to a data frame but it wasn't enough; maybe there is a way to change NaN in a list. Thanks for your help. This data is only an example; my original data has 36000 observations and 40 variables.

This is a perfect use case for rapply.
> rapply( z, f=function(x) ifelse(is.nan(x),0,x), how="replace" )
$a
[1] 1 2 3 0 5 8 0 0
$b
[1] 0 2 3 0 5 8 0 0
lapply would work too, but rapply deals properly with nested lists in this situation.
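For a flat list like this one, an lapply version of the same idea is a minimal sketch (z_flat and z_nested are just illustrative names; z is the list from the question):
z_flat <- lapply(z, function(x) ifelse(is.nan(x), 0, x))
# With a nested list (a made-up example), rapply recurses into the sub-list for you:
z_nested <- list(a = c(1, NaN), sub = list(b = c(NaN, 2)))
rapply(z_nested, function(x) ifelse(is.nan(x), 0, x), how = "replace")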

As you don't seem to mind having your data in a data frame, you can do something highly vectorised too. However, this will only work if each list element is of equal length. Since you describe 36000 observations across 40 variables, I am guessing this is the case:
z <- as.data.frame(z)
dim <- dim(z)
y <- unlist(z)
y[ is.nan(y) ] <- 0
x <- matrix(y, nrow = dim[1], ncol = dim[2])
# [,1] [,2]
# [1,] 1 0
# [2,] 2 2
# [3,] 3 3
# [4,] 0 0
# [5,] 5 5
# [6,] 8 8
# [7,] 0 0
# [8,] 0 0
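If you then want the result back as a data frame with the original column names, a small follow-up step (using x and z from above; z_clean is just an illustrative name) could be:
z_clean <- as.data.frame(x)
names(z_clean) <- names(z)   # restore the original column names a and b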

Following OP's edit: Given your edited title, this should do it.
unstack(within(stack(z), values[is.nan(values)] <- 0))
# a b
# 1 1 0
# 2 2 2
# 3 3 3
# 4 0 0
# 5 5 5
# 6 8 8
# 7 0 0
# 8 0 0
unstack automatically gives you a data.frame if the resulting columns are of equal length (unlike in the old solution shown below, where they are not).
Old solution (for continuity).
Try this:
unstack(na.omit(stack(z)))
# $a
# [1] 1 2 3 5 8 0
# $b
# [1] 2 3 5 8
Note 1: It seems from your post that you want to replace NaN with 0. The output of stack(z) can be saved to a variable, the NaN values replaced with 0, and the result then unstacked.
Note 2: Since na.omit removes NA as well as NaN, I am assuming that your data contains no NA (judging from your example above).
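For completeness, a minimal sketch of the stack/replace/unstack route from Note 1, which only zeroes NaN and leaves any genuine NA untouched (s is just an intermediate name):
s <- stack(z)
s$values[is.nan(s$values)] <- 0
unstack(s)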

z = do.call(data.table, rapply(z, function(x) ifelse(is.nan(x),0,x), how="replace"))
Use this if you start from a data.table and want to one-line the replacement.
But keep in mind that keys need to be redefined after that:
> key(x1)
[1] "date"
> x1 = do.call(data.table, rapply(x1, function(x) ifelse(is.na(x), 0, x), how="replace"))
> key(x1)
NULL
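If keeping the key matters, an alternative is to modify the data.table in place instead of rebuilding it. This is only a sketch, assuming x1 is a data.table keyed on date as above, with the NaN values sitting in numeric columns:
library(data.table)
for (j in names(x1)) {
  if (is.numeric(x1[[j]])) {                 # skip non-numeric columns such as the date key
    idx <- which(is.nan(x1[[j]]))
    if (length(idx)) set(x1, i = idx, j = j, value = 0)
  }
}
key(x1)   # should still be "date", since set() modifies x1 by reference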

Related

Remove negative values from nested list

I have the following list:
[[1]]
[1] 0 -1 -2 -3 -4
[[2]]
[1] 2 1 0 -1 -2
[[3]]
[1] 2 3 4 5 6
I want to remove negative values from above list.
I am trying the following code in R:
x[ x > 0 ]
But it does not remove the negative values.
We can use lapply over each list element and select only those values which are greater than or equal to 0.
lapply(lst, function(x) x[x >= 0])
#$l1
#[1] 0
#$l2
#[1] 2 1 0
#$l3
#[1] 2 3 4 5 6
Data
lst = list(l1 = c(0,-1,-2,-3,-4),l2 = c(2,1,0,-1,-2), l3 = c(2,3,4,5,6))
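Subsetting the list directly with x[x > 0] doesn't help here because the comparison doesn't operate inside each vector; you have to subset within each element, as above. An equivalent base-R sketch using Filter (same lst as in the Data section):
lapply(lst, function(v) Filter(function(x) x >= 0, v))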

conditionally look up column names to populate new column in r

I have a data.frame that looks like this:
A C G T
1 6 0 14 0
2 0 0 20 0
3 14 0 6 0
4 14 0 6 0
5 6 0 14 0
(actually, I have 1800 of these with varying numbers of rows...)
Just to explain what you are looking at:
Each row is one SNP, so it can either be one base (A,C,G,T) or another base (A,C,G,T)
SNP1’s major allele is “G”, which appears in 14 individuals; the minor allele is “A”, which appears in 6 out of the 20 individuals in the dataset.
The 14 individuals that show G at SNP1 are the same that show A at SNP3, so there are two possibilities for the combination of bases along the 5 rows: one would be GGAAG and one would be AGGGA.
These can (theoretically) be built from the colnames of all the cells containing either 6 or 14 in the corresponding row, resulting in something like this:
A C G T 14 6
1 6 0 14 0 G A
2 0 0 20 0 G G
3 14 0 6 0 A G
4 14 0 6 0 A G
5 6 0 14 0 G A
Is there an elegant way to achieve something like this?
I have a piece of code from the answer to a somewhat related question that will return positions of a specific value within a matrix.
mat <- matrix(c(1:3), nrow = 4, ncol = 4)
[,1] [,2] [,3] [,4]
[1,] 1 2 3 1
[2,] 2 3 1 2
[3,] 3 1 2 3
[4,] 1 2 3 1
find <- function(mat, value) {
  nr <- nrow(mat)
  val_match <- which(mat == value)
  out <- matrix(NA, nrow = length(val_match), ncol = 2)
  out[, 2] <- floor(val_match / nr) + 1
  out[, 1] <- val_match %% nr
  return(out)
}
find(mat, 2)
[,1] [,2]
[1,] 2 1
[2,] 1 2
[3,] 0 3
[4,] 3 3
[5,] 2 4
I think I can figure out how to adjust this so that it returns the colname from the original data.frame, but it requires the value it is looking for as input. There are potentially several such values in one data snippet (14 and 6 in the example above), and they differ for each snippet of my data.
In some of them, there are no duplicates at all.
In addition, if one of the values hits 20, then the corresponding colname is automatically the one to choose (as seen in row 2 of the example above).
EDIT
I have tried the code suggested by thelatemail, and it works fine on some of the data snippets, but not on all of them.
This one, for example, produces results that I don't fully understand:
subset looks like this:
A C G T
1 0 0 3 1
2 0 9 0 3
3 3 0 0 2
4 0 3 0 2
5 2 0 0 3
6 0 2 0 3
sel <- subset > 0
ord <- order(row(subset)[sel], -subset[sel])
haplo1 <- split(names(subset)[col(subset)[sel]][ord], row(subset)[sel][ord])
This produces
$`1`
[1] "G" "T"
$`2`
[1] "C" "T"
$`3`
[1] "A" "T"
$`4`
[1] "C" "T"
$`5`
[1] "T" "A"
$`6`
[1] "T" "C"
Since there is a 3 in every row, I don't understand why these are not all in one of these possibilities (which would result in GTACTT and TCTTAC instead).
I have also realized that I have a lot of missing alleles, where only one or two individuals were found to have a base at these loci.
Can a column for "missing" be included somehow? I tried to just tack it on, which gave me an error about non-matching row numbers.
In order to get my minimum function to work, I had to convert zeros to NA. For some reason, na.rm=TRUE doesn't work with which.min.
See if this is helpful for you:
A <- c(6, 0, 14, 14, 6)
C <- c(0, 0, 0, 0, 0)
G <- c(14, 20, 6, 6, 14)
T <- c(0, 0, 0, 0, 0)
mymatrix <- cbind(A, C, G, T)
mymatrix[mymatrix == 0] <- NA   # convert zeros to NA so which.max/which.min skip them
mymatrix
major_allele <- colnames(mymatrix)[apply(mymatrix, 1, which.max)]; head(major_allele)
minor_allele <- colnames(mymatrix)[apply(mymatrix, 1, which.min)]; head(minor_allele)
myds <- as.data.frame(cbind(mymatrix, major_allele, minor_allele))
myds
> myds
A C G T major_allele minor_allele
1 6 <NA> 14 <NA> G A
2 <NA> <NA> 20 <NA> G G
3 14 <NA> 6 <NA> A G
4 14 <NA> 6 <NA> A G
5 6 <NA> 14 <NA> G A
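One caveat: cbind() on a numeric matrix plus character vectors coerces everything to character before as.data.frame(), so the count columns end up as text. If you would rather keep them numeric, a small variation on the last step (same objects as above) is:
myds <- data.frame(mymatrix, major_allele, minor_allele)
str(myds)   # A, C, G, T stay numeric; the allele columns are character (factors in older R)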
Here's an attempt that will work for however many hits there are in each row. It returns a list object, which is probably appropriate for differing lengths of results per row.
sel <- dat > 0
ord <- order(row(dat)[sel], -dat[sel])
split(names(dat)[col(dat)[sel]][ord], row(dat)[sel][ord] )
#List of 5
# $ 1: chr [1:2] "G" "A"
# $ 2: chr "G"
# $ 3: chr [1:2] "A" "G"
# $ 4: chr [1:2] "A" "G"
# $ 5: chr [1:2] "G" "A"
Where dat was:
dat <- read.table(text="
A C G T
1 6 0 14 0
2 0 0 20 0
3 14 0 6 0
4 14 0 6 0
5 6 0 14 0
", header=TRUE)

Conditional replacement of column values in a dataframe using R

Let's make a dummy dataset
ll = data.frame(rbind(c(2,3,5), c(3,4,6), c(9,4,9)))
colnames(ll)<-c("b", "c", "a")
> ll
b c a
1 2 3 5
2 3 4 6
3 9 4 9
P = data.frame(cbind(c(3,5), c(4,6), c(8,7)))
colnames(P)<-c("a", "b", "c")
> P
a b c
1 3 4 8
2 5 6 7
I want to create a new data frame where the values in each column of ll are turned into 0 when they are less than the corresponding values of a, b, and c in the first row of P; in other words, I'd like to see
> new_ll
b c a
1 0 0 5
2 0 0 6
3 9 0 9
so I tried it this way
nn=c("a", "b", "c")
new_ll = sapply(nn, function(i)
ll[,paste0(i)][ll[,paste0(i)] < P[,paste0(i)][1]] <- 0)
But it doesn't work for some reason! I must be making a silly mistake in my script!! Any idea?
> new_ll
a b c
0 0 0
You can find the values in ll that are smaller than the first row of P with an apply:
t(apply(ll, 1, function(x) x<P[1,][colnames(ll)]))
[,1] [,2] [,3]
[1,] TRUE TRUE FALSE
[2,] TRUE TRUE FALSE
[3,] FALSE TRUE FALSE
Here, the first row of P is ordered to match ll, then the elements are compared.
Credit to Ananda Mahto for recognizing that apply is not required:
ll < c(P[1, names(ll)])
b c a
[1,] TRUE TRUE FALSE
[2,] TRUE TRUE FALSE
[3,] FALSE TRUE FALSE
The TRUE values show where you want to substitute with 0:
ll[ ll < c(P[1, names(ll)]) ] <- 0
ll
b c a
1 0 0 5
2 0 0 6
3 9 0 9
To fix your code, you want something like this:
do.call(cbind, lapply(names(ll), function(i) {
  ll[, i][ll[, i] < P[, i][1]] <- 0
  return(ll[i])
}))
b c a
1 0 0 5
2 0 0 6
3 9 0 9
What's changed? First, sapply is changed to lapply, and the function now returns the modified column for each iteration; in your original version the last expression was the assignment itself, whose value is just 0, which is why you got back a vector of zeros. Second, the names are presented in the correct order for the expected results. Third, the results are put together with cbind to get the final data frame. As a bonus, the redundant calls to paste0 have been removed.
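To see why the original sapply call gave zeros, note that the value of an assignment expression is the assigned value, and a function returns its last evaluated expression (a tiny illustration; f is just a throwaway name):
f <- function(x) { x[1] <- 0 }   # last expression is the assignment itself
print(f(1:5))                    # 0, not the modified vector
# [1] 0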
You could also try mapply, which applies a function to corresponding elements of its arguments. Here, ll and P are both data.frames, so the function is applied column by column, with recycling. The column names of P are matched to those of ll (similar to @Matthew Lundberg's answer), each column of ll is compared against the corresponding value in the first row of P (which gets recycled), and the result is a logical index. The elements that meet the condition are then assigned 0.
indx <- mapply(`<`, ll, P[1,][names(ll)])
new_ll <- ll
new_ll[indx] <- 0
new_ll
# b c a
#1 0 0 5
#2 0 0 6
#3 9 0 9
In case you know that ll and P are numeric, you can also do it as
llm <- as.matrix(ll)
pv <- as.numeric(P[1, colnames(llm)])
llm[sweep(llm, 2, pv, `<`)] <- 0
data.frame(llm)
# b c a
# 1 0 0 5
# 2 0 0 6
# 3 9 0 9

R Turning a list into a matrix when the list contains objects of "different size"

I've seen a couple of questions about turning matrices into lists (not really clear why you would want that), but I've been unable to find the reverse operation.
Basically, following
# ind.dum = data frame with 29 observations and 2635 variables
library(zoo)   # rollapply() comes from the zoo package
tmp <- vector("list", ncol(ind.dum))
for (i in 1:ncol(ind.dum))
  tmp[[i]] <- which(rollapply(ind.dum[, i], 4, identical, c(1, 0, 0, 0), by.column = TRUE))
I got a list of 2635 objects, most of which contain 1 value, but some up to 7. I need to convert this to a matrix with 2635 rows and as many columns as necessary to fit every value in a separate cell (with 0 for the rest).
I tried all the coercion approaches I know (as.data.frame, as.matrix, ...) and also the option of defining a new matrix with the maximum dimensions, but nothing works.
m<-matrix(0,nrow=2635,ncol=7)
tmp_m<-structure(tmp,dim=dim(m))
Error in structure(tmp, dim = dim(m)) : dims [product 18445] do not match the length of object [2635]
I'm sure there's a quick fix for this, so I'm hoping someone can help me with it. Btw, the values in the tmp list's objects are numeric, although some are integer(0), i.e. when the pattern c(1,0,0,0) was not found in the corresponding column of the original ind.dum matrix.
Not sure if there is a way to use unlist without losing the information about which values belong originally to the same row...
Desired Output
A matrix or dataframe with 2635 rows and 7 columns and looking like this
12 0 0 0 0 0 0
8 14 0 0 0 0 0
0 0 0 0 0 0 0
1 4 8 12 0 0 0
...
The values basically refer to years in which a specific pattern started. I need to be able to use that information to tie this problem to an earlier problem described before (see this link).
Try this for example:
n <- max(lengths(ll))   # pad every element with zeros up to the longest one
do.call(rbind, lapply(ll, function(x) c(x, rep(0, n - length(x)))))
Here's a fast alternative that does what it sounds like you are describing:
First, sample data always helps:
LL <- list(1:3, numeric(0), c(1:3,1), 1:7)
LL
# [[1]]
# [1] 1 2 3
#
# [[2]]
# numeric(0)
#
# [[3]]
# [1] 1 2 3 1
#
# [[4]]
# [1] 1 2 3 4 5 6 7
Second, we'll make use of a little trick referred to as matrix indexing to fill an empty matrix with the values from your list.
## We need to know how many columns are needed for each list item
Ncol <- vapply(LL, length, 1L)
## M is our empty matrix, pre-filled with zeroes
M <- matrix(0, nrow = length(LL), ncol = max(Ncol))
## IJ is the row/column combination where values need to be inserted
IJ <- cbind(rep(seq_along(Ncol), times = Ncol), sequence(Ncol))
## Extract and insert!
M[IJ] <- unlist(LL, use.names = FALSE)
## View the result
M
# [,1] [,2] [,3] [,4] [,5] [,6] [,7]
# [1,] 1 2 3 0 0 0 0
# [2,] 0 0 0 0 0 0 0
# [3,] 1 2 3 1 0 0 0
# [4,] 1 2 3 4 5 6 7
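Applied to the tmp list from the question (a sketch; it assumes tmp has already been built as above), the same three steps give the 2635-row matrix, with integer(0) entries simply left as all-zero rows:
Ncol <- vapply(tmp, length, 1L)
M <- matrix(0, nrow = length(tmp), ncol = max(Ncol))
M[cbind(rep(seq_along(Ncol), times = Ncol), sequence(Ncol))] <- unlist(tmp, use.names = FALSE)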
I have a solution; not sure if it is good enough or whether there are bugs.
LL <- list(1:3, numeric(0), c(1:3, 1), 1:7)
m <- plyr::rbind.fill.matrix(lapply(LL, matrix, nrow = 1))   # pads shorter rows with NA
replace(m, is.na(m), 0)

Generate vectors using R

I would like to ask if anyone knows a simple way to solve this kind of problem:
I need to generate all combinations of A numbers taken from the set {0, 1, 2, ..., B}, with their sum equal to C.
i.e. if A=2, B=3, C=2:
Solution in this case:
(1,1); (0,2); (2,0)
So the vectors have length 2 (A), the sum of all their elements is 2 (C), and the possible values for each element come from the set {0,1,2,3} (the maximum is B).
A functional version since I already started before SO updated:
A=2
B=3
C=2
myfun <- function(a = A, b = B, c = C) {
  out <- do.call(expand.grid, lapply(1:a, function(x) 0:b))
  return(out[rowSums(out) == c, ])
}
> myfun()
Var1 Var2
3 2 0
6 1 1
9 0 2
z <- expand.grid(0:3, 0:3)
z[rowSums(z)==2, ]
Var1 Var2
3 2 0
6 1 1
9 0 2
If you wanted to do the expand grid programmatically this would work:
z <- expand.grid(rep(list(0:B), A))
You need to expand as a list so that the items remain separate. rep(0:3, 3) would not return 3 separate sequences. So for A=3:
> z <- expand.grid(rep(list(0:3), 3))
> z[rowSums(z)==2, ]
Var1 Var2 Var3
3 2 0 0
6 1 1 0
9 0 2 0
18 1 0 1
21 0 1 1
33 0 0 2
Using the nifty partitions() package, and more interesting values of A, B, and C:
library(partitions)
A <- 2
B <- 5
C <- 7
comps <- t(compositions(C, A))
ii <- apply(comps, 1, FUN=function(X) all(X %in% 0:B))
comps[ii, ]
# [,1] [,2]
# [1,] 5 2
# [2,] 4 3
# [3,] 3 4
# [4,] 2 5
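As a quick sanity check against the original example (A = 2, B = 3, C = 2), the same two steps should reproduce (2,0), (1,1) and (0,2):
comps <- t(compositions(2, 2))
comps[apply(comps, 1, FUN = function(X) all(X %in% 0:3)), ]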
