I have a data.frame that looks like this:
A C G T
1 6 0 14 0
2 0 0 20 0
3 14 0 6 0
4 14 0 6 0
5 6 0 14 0
(Actually, I have 1800 of these, with varying numbers of rows.)
Just to explain what you are looking at:
Each row is one SNP, so it can either be one base (A,C,G,T) or another base (A,C,G,T)
SNP1’s major allele is “G”, which appears in 14 individuals; the minor allele is “A”, which appears in 6 out of the 20 individuals in the dataset.
The 14 individuals that show G at SNP1 are the same ones that show A at SNP3, so there are two possibilities for the combination of bases along the 5 rows: one would be GGAAG and one would be AGGGA.
These can (theoretically) be built from the colnames of all the cells containing either 6 or 14 in the corresponding row, resulting in something like this:
A C G T 14 6
1 6 0 14 0 G A
2 0 0 20 0 G G
3 14 0 6 0 A G
4 14 0 6 0 A G
5 6 0 14 0 G A
Is there an elegant way to achieve something like this?
I have a piece of code from the answer to a somewhat related question that will return positions of a specific value within a matrix.
mat <- matrix(c(1:3), nrow = 4, ncol = 4)
[,1] [,2] [,3] [,4]
[1,] 1 2 3 1
[2,] 2 3 1 2
[3,] 3 1 2 3
[4,] 1 2 3 1
find <- function(mat, value) {
nr <- nrow(mat)
val_match <- which(mat == value)
out <- matrix(NA, nrow= length(val_match), ncol= 2)
out[,2] <- floor(val_match / nr) + 1
out[,1] <- val_match %% nr
return(out)
}
find(mat, 2)
[,1] [,2]
[1,] 2 1
[2,] 1 2
[3,] 0 3
[4,] 3 3
[5,] 2 4
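Note the 0 in the third row of that output: the modulo arithmetic breaks down for matches in the last row of the matrix (the match at linear index 8 is really row 4, column 2). A minimal sketch of an alternative, assuming base R's which() with arr.ind = TRUE is acceptable here; the column names are hypothetical and only added for illustration:

```r
# Same example matrix as above (values 1:3 recycled column-wise).
mat <- matrix(rep(1:3, length.out = 16), nrow = 4, ncol = 4)

# arr.ind = TRUE makes which() return (row, col) pairs instead of linear indices.
pos <- which(mat == 2, arr.ind = TRUE)
pos

# With column names present, the column index maps straight to a name.
colnames(mat) <- c("A", "C", "G", "T")   # hypothetical names for illustration
colnames(mat)[pos[, "col"]]
```

This avoids hand-written index arithmetic, but it still needs the value to search for as input.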
I think I can figure out how to adjust this so that it returns the colname from the original data.frame, but it requires the value it is looking for as input. There are potentially several of those in one data snippet (as seen in the example above, 14 and 6), and they are different for each snippet of my data.
In some of them, there are no duplicates at all.
In addition, if one of the values hits 20, then the corresponding colname is automatically the one to choose (as seen in row 2 on the example above).
EDIT
I have tried the code suggested by thelatemail, and it works fine on some of the data, but not on all of them.
This one, for example, produces results that I don't fully understand:
subset looks like this:
A C G T
1 0 0 3 1
2 0 9 0 3
3 3 0 0 2
4 0 3 0 2
5 2 0 0 3
6 0 2 0 3
sel <- subset > 0
ord <- order(row(subset)[sel], -subset[sel])
haplo1 <- split(names(subset)[col(subset)[sel]][ord], row(subset)[sel][ord])
This produces
1
[1] "G" "T"
2
[1] "C" "T"
3
[1] "A" "T"
4
[1] "C" "T"
5
[1] "T" "A"
6
[1] "T" "C"
Since there is a 3 in every row, I don't understand why these are not all in one of these possibilities (which would result in GTACTT and TCTTAC instead).
I have also realized that I have a lot of missing alleles, where only one or two individuals were found to have a base at this locus.
Can a column with "missing" be included somehow? I tried to just tack it on, which gave me an error about non-corresponding row numbers.
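On the six-row subset above: the split() call orders the bases by count within each row, so it does not follow one particular count across rows. In row 2 the count 9 outranks the 3, which is why "C" comes first there. A rough sketch of picking the base by a target count instead, assuming (as the question does) that equal counts identify the same individuals and that each row has exactly two non-zero bases; base_with_count is a made-up helper name:

```r
snp <- read.table(text = "
  A C G T
1 0 0 3 1
2 0 9 0 3
3 3 0 0 2
4 0 3 0 2
5 2 0 0 3
6 0 2 0 3
", header = TRUE)
m <- as.matrix(snp)

# Base whose count equals `value` in each row (first match; NA if absent).
base_with_count <- function(m, value) {
  apply(m, 1, function(r) names(r)[match(value, r)])
}
hap1 <- base_with_count(m, 3)

# The other non-zero base in each row (assumes exactly two bases per row).
hap2 <- sapply(seq_len(nrow(m)), function(i) {
  setdiff(colnames(m)[m[i, ] > 0], hap1[i])
})

paste(hap1, collapse = "")  # "GTACTT"
paste(hap2, collapse = "")  # "TCTTAC"
```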
To get my minimum function to work, I had to convert zeros to NA. which.min has no na.rm argument, but it skips NA values on its own, so the zeros have to become NA first.
See if this is helpful for you:
A <- c(6,0,14,14,6)
C <- c(0,0,0,0,0)
G <- c(14,20,6,6,14)
T <- c(0,0,0,0,0)
mymatrix <- as.matrix(cbind(A,C,G,T))
mymatrix[mymatrix == 0] <- NA  # treat zero counts as missing so which.min skips them
mymatrix
major_allele <- colnames(mymatrix)[apply(mymatrix,1,which.max)] ; head(major_allele)
minor_allele <- colnames(mymatrix)[apply(mymatrix,1,which.min)] ; head(minor_allele)
myds<-as.data.frame(cbind(mymatrix,major_allele,minor_allele))
myds
> myds
A C G T major_allele minor_allele
1 6 <NA> 14 <NA> G A
2 <NA> <NA> 20 <NA> G G
3 14 <NA> 6 <NA> A G
4 14 <NA> 6 <NA> A G
5 6 <NA> 14 <NA> G A
Here's an attempt that will work for however many hits there are in each row. It returns a list object, which is probably appropriate for differing lengths of results per row.
sel <- dat > 0
ord <- order(row(dat)[sel], -dat[sel])
split(names(dat)[col(dat)[sel]][ord], row(dat)[sel][ord] )
#List of 5
# $ 1: chr [1:2] "G" "A"
# $ 2: chr "G"
# $ 3: chr [1:2] "A" "G"
# $ 4: chr [1:2] "A" "G"
# $ 5: chr [1:2] "G" "A"
Where dat was:
dat <- read.table(text="
A C G T
1 6 0 14 0
2 0 0 20 0
3 14 0 6 0
4 14 0 6 0
5 6 0 14 0
", header=TRUE)
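If fixed-length haplotype strings are wanted rather than a list, something along these lines might work. It is only a sketch: it assumes the larger count in every row belongs to the same group of individuals, and it reuses the major base as the minor one when a row has a single non-zero count (as in row 2).

```r
dat <- read.table(text = "
  A C G T
1 6 0 14 0
2 0 0 20 0
3 14 0 6 0
4 14 0 6 0
5 6 0 14 0
", header = TRUE)
m <- as.matrix(dat)

# Major allele: column holding the largest count in each row.
major <- colnames(m)[max.col(m, ties.method = "first")]

# Minor allele: smallest non-zero count in each row; with only one
# non-zero count this is the same base as the major allele.
minor <- apply(m, 1, function(r) {
  nz <- r[r > 0]
  names(nz)[which.min(nz)]
})

paste(major, collapse = "")  # "GGAAG"
paste(minor, collapse = "")  # "AGGGA"
```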
Let's make a dummy dataset
ll = data.frame(rbind(c(2,3,5), c(3,4,6), c(9,4,9)))
colnames(ll)<-c("b", "c", "a")
> ll
b c a
1 2 3 5
2 3 4 6
3 9 4 9
P = data.frame(cbind(c(3,5), c(4,6), c(8,7)))
colnames(P)<-c("a", "b", "c")
> P
a b c
1 3 4 8
2 5 6 7
I want to create a new dataframe where the values in each column of ll are turned into 0 when they are less than the corresponding values of a, b and c in the first row of P; in other words, I'd like to see
> new_ll
b c a
1 0 0 5
2 0 0 6
3 9 0 9
so I tried it this way
nn=c("a", "b", "c")
new_ll = sapply(nn, function(i)
ll[,paste0(i)][ll[,paste0(i)] < P[,paste0(i)][1]] <- 0)
But it doesn't work for some reason! I must be making a silly mistake in my script! Any idea?
> new_ll
a b c
0 0 0
You can find the values in ll that are smaller than the first row of P with an apply:
t(apply(ll, 1, function(x) x<P[1,][colnames(ll)]))
[,1] [,2] [,3]
[1,] TRUE TRUE FALSE
[2,] TRUE TRUE FALSE
[3,] FALSE TRUE FALSE
Here, the first row of P is ordered to match ll, then the elements are compared.
Credit to Ananda Mahto for recognizing that apply is not required:
ll < c(P[1, names(ll)])
b c a
[1,] TRUE TRUE FALSE
[2,] TRUE TRUE FALSE
[3,] FALSE TRUE FALSE
The TRUE values show where you want to substitute with 0:
ll[ ll < c(P[1, names(ll)]) ] <- 0
ll
b c a
1 0 0 5
2 0 0 6
3 9 0 9
To fix your code, you want something like this:
do.call(cbind, lapply(names(ll), function(i) {
ll[,i][ll[,i] < P[,i][1]] <- 0
return(ll[i])}))
b c a
1 0 0 5
2 0 0 6
3 9 0 9
What's changed? First, sapply is changed to lapply and the function returns the modified column (as a one-column data frame) on each iteration. Second, the names are presented in the correct order for the expected results. Third, the results are put together with cbind to get the final data frame. As a bonus, the redundant calls to paste0 have been removed.
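A small sketch of why the original sapply attempt returned only zeros: in R the value of an assignment expression is the assigned value, and the change is made to a copy that is local to the function. The names x, f and g are made up for illustration.

```r
x <- c(1, 5, 9)

# The last expression is an assignment, so the function's value is the
# right-hand side (0), not the modified vector.
f <- function(v) v[v < 4] <- 0
res <- f(x)
res   # 0
x     # still 1 5 9: only the local copy `v` was changed

# Returning the modified object explicitly gives the intended result.
g <- function(v) { v[v < 4] <- 0; v }
g(x)  # 0 5 9
```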
You could also try mapply, which applies a function to each pair of corresponding elements. Here ll and P are both data.frames, so the function is applied column by column. I matched the column names of P to those of ll (similar to @Matthew Lundberg's answer) and checked which elements in each column of ll are less than the value in the corresponding column of P (the single row of P is recycled). This returns a logical index, and the elements that match the condition are then set to 0.
indx <- mapply(`<`, ll, P[1,][names(ll)])
new_ll <- ll
new_ll[indx] <- 0
new_ll
# b c a
#1 0 0 5
#2 0 0 6
#3 9 0 9
If you know that ll and P are numeric, you can also do it as
llm <- as.matrix(ll)
pv <- as.numeric(P[1, colnames(llm)])
llm[sweep(llm, 2, pv, `<`)] <- 0  # strict "<", matching "less than" in the question
data.frame(llm)
# b c a
# 1 0 0 5
# 2 0 0 6
# 3 9 0 9
Convert a table with missing values to a data frame of the same structure as the original table?
None of the following methods work, as they either change the structure or do not handle missing values:
t1 <- with( mtcars, table( gear,cyl, exclude = NULL ) ) # the table
data.frame(t1)
as.data.frame(t1)
as.data.frame.table(t1)
as.data.frame.matrix(t1)
The following code works but I was hoping for a solution involving less writing:
library(reshape2)
dcast( data.frame(t1), value.var = "Freq", formula = gear ~ cyl )
The solutions to this SO question do not work with missing values:
How to convert a table to a data frame
maybe I'm just too lazy. :/
library(data.table); library(tidyr)
t1 <- with( mtcars, table( gear,cyl, exclude = NULL ) )
as.data.table(t1) %>% spread(cyl, N)
# gear 4 6 8 NA
# 1 3 1 2 12 0
# 2 4 8 4 0 0
# 3 5 2 1 2 0
# 4 <NA> 0 0 0 0
The problem is that the NA that you see in the result of "t1" is actually an NA value in the dimnames, so you can't directly use as.data.frame.matrix.
I guess if you do this a lot and want to save typing, the best recourse might be to write a function like the following:
dimFixDF <- function(intable) {
as.data.frame.matrix(`dimnames<-`(intable, lapply(dimnames(intable),
function(x) {
ifelse(is.na(x), "<NA>", x)
})))
}
When you use it, it replaces the NA values in the dimnames with the character representation, and then it converts it to a data.frame with as.data.frame.matrix.
dimFixDF(t1)
# 4 6 8 <NA>
# 3 1 2 12 0
# 4 8 4 0 0
# 5 2 1 2 0
# <NA> 0 0 0 0
If you're happy with a matrix, you can do this in base R by setting the object's class to 'matrix':
t1 <- with(mtcars, table(gear, cyl, exclude = NULL))
class(t1) <- 'matrix'
t1
# cyl
# gear 4 6 8 <NA>
# 3 1 2 12 0
# 4 8 4 0 0
# 5 2 1 2 0
# <NA> 0 0 0 0
str(t1)
# int [1:4, 1:4] 1 8 2 0 2 4 1 0 12 0 ...
# - attr(*, "dimnames")=List of 2
# ..$ gear: chr [1:4] "3" "4" "5" NA
# ..$ cyl : chr [1:4] "4" "6" "8" NA
Hi, I have a problem with NaN. I am working with a large dataset with many variables that contain NaN. The data look like this:
z=list(a=c(1,2,3,NaN,5,8,0,NaN),b=c(NaN,2,3,NaN,5,8,NaN,NaN))
I used these commands to coerce the list to a data frame, but I got this:
z=as.data.frame(z)
> is.list(z)
[1] TRUE
> is.data.frame(z)
[1] TRUE
> replace(z,is.nan(z),0)
Error in is.nan(z) : default method not implemented for type 'list'
I coerced z to a data frame but it wasn't enough; maybe there is a way to replace NaN in a list. Thanks for your help. This data is only an example; my original data has 36000 observations and 40 variables.
This is a perfect use case for rapply.
> rapply( z, f=function(x) ifelse(is.nan(x),0,x), how="replace" )
$a
[1] 1 2 3 0 5 8 0 0
$b
[1] 0 2 3 0 5 8 0 0
lapply would work too, but rapply deals properly with nested lists in this situation.
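For the data frame case specifically, here is a minimal sketch of the lapply route that keeps the result a data.frame. The original call failed because is.nan() has no method for lists or data frames, so it has to be applied column by column. The name z2 is only for illustration.

```r
z <- list(a = c(1, 2, 3, NaN, 5, 8, 0, NaN),
          b = c(NaN, 2, 3, NaN, 5, 8, NaN, NaN))
z2 <- as.data.frame(z)

# Assigning into z2[] replaces the columns but keeps the data.frame structure.
z2[] <- lapply(z2, function(col) replace(col, is.nan(col), 0))
z2
```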
As you don't seem to mind having your data in a dataframe, you can do something highly vectorised too. However, this will only work if each list element is of equal length. I am guessing in your data (36000/40 = 900) that this is the case:
z <- as.data.frame(z)
d <- dim(z)   # avoid masking the dim() function
y <- unlist(z)
y[ is.nan(y) ] <- 0
x <- matrix( y , nrow = d[1], ncol = d[2] )
# [,1] [,2]
# [1,] 1 0
# [2,] 2 2
# [3,] 3 3
# [4,] 0 0
# [5,] 5 5
# [6,] 8 8
# [7,] 0 0
# [8,] 0 0
Following OP's edit: given your edited title, this should do it.
unstack(within(stack(z), values[is.nan(values)] <- 0))
# a b
# 1 1 0
# 2 2 2
# 3 3 3
# 4 0 0
# 5 5 5
# 6 8 8
# 7 0 0
# 8 0 0
unstack automatically gives you a data.frame if the resulting output is of equal length (unlike the first example, shown below).
Old solution (for continuity).
Try this:
unstack(na.omit(stack(z)))
# $a
# [1] 1 2 3 5 8 0
# $b
# [1] 2 3 5 8
Note 1: It seems from your post that you want to replace NaN with 0. The output of stack(z) can be saved to a variable, the NaN values replaced with 0, and the result passed to unstack.
Note 2: Also, since na.omit removes NA as well as NaN, I also assume that your data contains no NA (from your data above).
z = do.call(data.table, rapply(z, function(x) ifelse(is.nan(x),0,x), how="replace"))
This is useful if you initially have a data.table and want to do the replacement in one line.
But keep in mind that keys need to be redefined after that:
> key(x1)
[1] "date"
> x1 = do.call(data.table, rapply(x1, function(x) ifelse(is.na(x), 0, x), how="replace"))
> key(x1)
NULL
I am new to R. My data consist of an X matrix of 0's and 1's with associated y values.
I need to remove the observations belonging to columns that have fewer than 10 ones: I sum the columns of x and store the names of such columns in a vector, then drop the rows (and their y's) that have a one in those columns, and then remove the columns themselves, because they will contain only zeros.
I am getting this error and I don't know how to fix it or improve the code:
Error in -Col[i] : invalid argument to unary operator
Here is the code
a0=rep(1,40)
a=rep(0:1,20)
b=c(rep(1,20),rep(0,20))
c0=c(rep(0,12),rep(1,28))
c1=c(rep(1,5),rep(0,35))
c2=c(rep(1,8),rep(0,32))
c3=c(rep(1,23),rep(0,17))
x=matrix(cbind(a0,a,b,c0,c1,c2,c3),nrow=40,ncol=7)
nam <- paste("V",1:7,sep="")
colnames(x)<-nam
dat <- cbind(y=rnorm(40,50,7),x)
#===================================
toSum <- apply(dat,2,sum)
Col <- Val <- NULL
for(i in 1:length(toSum)){
if(toSum[i]<10){
Col <- c(Col,colnames(dat)[i])
Val <- c(Val,toSum[i])}
}
for(i in 1:length(Col)){
indx <- dat[,Col[i]]==0
datnw <- dat[indx,]
datnw2 <- datnw[,-Col[i]]
}
Can someone help, please? I am not sure if there is a way to get the positions of the columns in the Col vector. I have around 1500 columns in my original data.
Thanks
This should do the trick
datnw2 <- dat[, -which(toSum<10)]
This allows you to avoid the loop
head(datnw2)
y V1 V2 V3 V4 V7
[1,] 60.88166 1 0 1 0 1
[2,] 54.35388 1 1 1 0 1
[3,] 39.78881 1 0 1 0 1
[4,] 44.20074 1 1 1 0 1
[5,] 42.27351 1 0 1 0 1
[6,] 43.52390 1 1 1 0 1
Edit: Some pointers
toSum<10 will give a logical vector to you, the length of this vector is the same as length(toSum)
which(toSum<10) will give you the positions of those elements meeting the condition
Since you want to select those columns of dat for which toSum<10 is FALSE, you have to leave the others out with dat[, -which(toSum<10)]. This means: choose all columns but 6 and 7, which are the ones meeting the condition toSum<10
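One caveat with negative indexing through which(): when no column meets the condition, which() returns an empty vector and the negative index then selects nothing at all. A small sketch with a made-up matrix m showing the difference from logical negation:

```r
# Toy matrix where no column sum falls below the threshold.
m <- cbind(V1 = c(1, 1, 1), V2 = c(1, 0, 1))
sums <- colSums(m)

# -which(...) is -integer(0), i.e. an empty index: every column is dropped.
bad <- m[, -which(sums < 1), drop = FALSE]
ncol(bad)   # 0

# Negating the logical vector keeps all columns, as intended.
good <- m[, !(sums < 1), drop = FALSE]
ncol(good)  # 2
```

With the example data in this thread two columns do meet the condition, so both forms give the same result there.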
Using your example data, if you want to find which rows (i.e. observations) have fewer than 10 1s
rs <- rowSums(dat[, -1]) < 10
If you want to know which columns (i.e. variables) have less than 10 "presences" then
cs <- colSums(dat[, -1]) < 10
R> cs
V1 V2 V3 V4 V5 V6 V7
FALSE FALSE FALSE FALSE TRUE TRUE FALSE
Both rs and cs are logical vectors that can be used as indices to remove rows/columns.
To get rid of the columns, note that cs was computed without the y column, so prepend TRUE to keep y and line the index up with the columns of dat:
dat2 <- dat[, c(TRUE, !cs)]
R> head(dat2)
y V1 V2 V3 V4 V7
[1,] 47.61253 1 0 1 0 1
[2,] 60.51697 1 1 1 0 1
[3,] 53.69815 1 0 1 0 1
[4,] 53.79534 1 1 1 0 1
[5,] 49.04329 1 0 1 0 1
[6,] 42.04286 1 1 1 0 1
Next it seems that you are concerned that some rows will now be all zero? Is that what you are trying to do with the final step? That doesn't appear to be the case here, so perhaps the way of removing the columns I show has solved that problem too?
R> rowSums(dat2[,-1])
[1] 3 4 3 4 3 4 3 4 3 4 3 4 4 5 4 5 4 5 4 5 3 4 3 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3
[39] 2 3
I would like to ask whether anyone knows a simple way to solve this kind of problem:
I need to generate all combinations of A numbers taken from the set {0,1,2,...,B} whose sum equals C.
E.g. if A=2, B=3, C=2:
Solution in this case:
(1,1);(0,2);(2,0)
So the vectors have length 2 (A), the sum of their items is 2 (C), and the possible values for each element come from the set {0,1,2,3} (the maximum is B).
A functional version since I already started before SO updated:
A=2
B=3
C=2
myfun <- function(a=A, b=B, c=C) {
out <- do.call(expand.grid, lapply(1:a, function(x) 0:b))
return(out[rowSums(out)==c,])
}
> myfun()
Var1 Var2
3 2 0
6 1 1
9 0 2
z <- expand.grid(0:3,0:3)
z[rowSums(z)==2, ]
Var1 Var2
3 2 0
6 1 1
9 0 2
If you wanted to do the expand grid programmatically this would work:
z <- expand.grid( rep( list(0:B), A) )
You need to wrap the sequence in a list so that the items remain separate; rep(0:3, 3) would return one long vector rather than 3 separate sequences. So for A=3:
> z <- expand.grid(rep(list(0:3), 3))
> z[rowSums(z)==2, ]
Var1 Var2 Var3
3 2 0 0
6 1 1 0
9 0 2 0
18 1 0 1
21 0 1 1
33 0 0 2
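The full grid has (B+1)^A rows, which grows quickly. One possible way to shrink it, sketched here under the assumption that A is at least 2, is to enumerate only the first A-1 elements and derive the last one from the required sum. bounded_sums is a made-up name, not a function from any package.

```r
bounded_sums <- function(a, b, c) {
  stopifnot(a >= 2)
  # Enumerate the first a-1 elements only.
  head_part <- expand.grid(rep(list(0:b), a - 1))
  # The last element is forced by the target sum.
  last <- c - unname(rowSums(head_part))
  # Keep rows where the forced element lies in 0..b.
  keep <- last >= 0 & last <= b
  out <- cbind(head_part[keep, , drop = FALSE], last[keep])
  names(out) <- paste0("Var", seq_len(a))
  out
}

res <- bounded_sums(2, 3, 2)
res
```

For A=2, B=3, C=2 this gives the same three vectors as above, (0,2), (1,1) and (2,0), while building a grid of 4 rows instead of 16.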
Using the nifty partitions package, and more interesting values of A, B, and C:
library(partitions)
A <- 2
B <- 5
C <- 7
comps <- t(compositions(C, A))
ii <- apply(comps, 1, FUN=function(X) all(X %in% 0:B))
comps[ii, ]
# [,1] [,2]
# [1,] 5 2
# [2,] 4 3
# [3,] 3 4
# [4,] 2 5