R: Creating a data frame from a list with missing values

I have a list here that looks like this:
head(h)
[[1]]
[1] "gene=dnaA" "locus_tag=CD630_00010" "location=1..1320"
[[2]]
character(0)
[[3]]
[1] "locus_tag=CD630_05950" "location=719777..720313"
[[4]]
[1] "gene=dnrA" "locus_tag=CD630_00010" "location=50..1320"
I'm having trouble manipulating this list into a data.frame with three columns. For the rows with missing gene info, I want to list the gene as "gene=unnamed", and I want to drop the empty elements entirely, ending up with a matrix as shown:
     [,1]        [,2]                    [,3]
[1,] "gene=dnaA" "locus_tag=CD630_00010" "location=1..1320"
[2,] "gene=thrA" "locus_tag=CD630_05950" "location=719777..720313"
[3,] "gene=dnrA" "locus_tag=CD630_00010" "location=50..1320"
This is what I have right now, but I get an error about missing values in the gene column. Any suggestions?
h <- data.frame(h[lapply(h,length)>0])
h <- t(h)
rownames(h) <- NULL

# Data
l <- list(c("gene=dnaA", "locus_tag=CD630_00010", "location=1..1320"),
          character(0),
          c("locus_tag=CD630_05950", "location=719777..720313"),
          c("gene=dnrA", "locus_tag=CD630_00010", "location=50..1320"))
# Manipulation
n <- sapply(l, length)
seq.max <- seq_len(max(n))
df <- t(sapply(l, "[", i = seq.max))   # pad shorter elements with NA
df <- t(apply(df, 1, function(x) {
  c(x[is.na(x)], x[!is.na(x)])         # move NAs to the front, i.e. the gene column
}))
df <- df[rowSums(!is.na(df)) > 0, ]    # drop the all-NA (empty) rows
df[is.na(df)] <- "gene=unnamed"
Output:
     [,1]           [,2]                    [,3]
[1,] "gene=dnaA"    "locus_tag=CD630_00010" "location=1..1320"
[2,] "gene=unnamed" "locus_tag=CD630_05950" "location=719777..720313"
[3,] "gene=dnrA"    "locus_tag=CD630_00010" "location=50..1320"

There are a number of methods for binding list elements of unequal lengths: see bind_rows from dplyr, rbind.fill from plyr, or rbindlist from data.table. Here is a base R approach:
## Sample data
h <- list(letters[1:3],
          character(0),
          letters[4:5])
out <- do.call(rbind, lapply(h, `length<-`, 3)) # fix lengths and make matrix
out <- out[rowSums(!is.na(out))>0, ] # remove empty rows
out[is.na(out)] <- "gene=unnamed" # rename NA
data.frame(out)
#   X1 X2           X3
# 1  a  b            c
# 2  d  e gene=unnamed
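The same padding can be done with data.table::rbindlist, one of the packages mentioned above. A minimal sketch, assuming data.table is installed; the V1..V3 names are supplied only so that fill = TRUE has column names to match on:
library(data.table)
lst <- lapply(h, function(x) setNames(as.list(x), paste0("V", seq_along(x))))
out <- as.data.frame(rbindlist(lst[lengths(lst) > 0], fill = TRUE))
out[is.na(out)] <- "gene=unnamed"
out
#   V1 V2           V3
# 1  a  b            c
# 2  d  e gene=unnamed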

Related

List that contains some dataframes with NaN values. How to remove them?

I want to make a single dataframe out of a list of dataframes that each contain a single row. I tried binding them together so I could then remove the NaN values, but apparently you can only bind if there are no NaN values!
What I got:
[[1]]
[1] NaN 9.840158e+17
[[2]]
[1] NaN 9.838244e+17
[[3]]
[1] 6.842105e-01 9.837743e+17
[[4]]
[1] 2.527174e+00 9.837643e+17
[[5]]
[1] 1.168269e+00 9.836988e+17
[[6]]
[1] NaN 9.83663e+17
What I want:
      impact           id
6.842105e-01 9.837743e+17
2.527174e+00 9.837643e+17
1.168269e+00 9.836988e+17
What I tried:
bind_rows(data, .id = "id")
Error: Argument 1 must have names
For reproduction of the problem (as lst1, which the answers below refer to):
lst1 <- list(c(NaN, 984015782521835008), c(NaN, 983824424532144000),
             c(0.684210526315789, 983774270886236032),
             c(2.52717391304348, 983764328443796992),
             c(1.16826923076923, 983698762760704000),
             c(NaN, 983662953894435968))
Any tips on how to solve this?
We remove the list elements containing NaN and then rbind the remaining elements:
out <- do.call(rbind, lst1[!sapply(lst1, function(x) any(is.nan(x)))])
colnames(out) <- c("impact", "id")
out
# impact id
#[1,] 0.6842105 9.837743e+17
#[2,] 2.5271739 9.837643e+17
#[3,] 1.1682692 9.836988e+17
Or another option is to rbind the elements and then use na.omit
na.omit(do.call(rbind, lst1))
Or with Filter
do.call(rbind, Filter(function(x) !any(is.nan(x)), lst1))
Or using discard (from purrr)
library(purrr)
discard(lst1, ~ any(is.nan(.x))) %>%
do.call(rbind, .)
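The "Argument 1 must have names" error from the question also goes away once the vectors are named, since bind_rows treats named vectors as rows. A sketch, assuming the column names impact and id are the ones you want:
library(dplyr)
lst1 %>%
  lapply(setNames, c("impact", "id")) %>%
  bind_rows() %>%
  filter(!is.nan(impact)) %>%
  as.data.frame()
#      impact           id
# 1 0.6842105 9.837743e+17
# 2 2.5271739 9.837643e+17
# 3 1.1682692 9.836988e+17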
When you want to remove rows containing NAs from a data frame, you can use na.omit or na.exclude (the two differ only in the na.action attribute they store, which matters for functions like residuals):
# Data
DF <- data.frame(x = c(1, 2, 3), y = c(0, 10, NA))
na.omit(DF)
#   x  y
# 1 1  0
# 2 2 10
na.exclude(DF)
#   x  y
# 1 1  0
# 2 2 10

max.col with the value not the index

If I have a matrix:
mod_xgb_softprob$pred[1:3,1:3]
[,1] [,2] [,3]
[1,] 6.781361e-04 6.781361e-04 6.781422e-04
[2,] 2.022457e-07 2.022457e-07 4.051039e-07
[3,] 6.714367e-04 6.714367e-04 6.714399e-04
Generated by:
> dput(mod_xgb_softprob$pred[1:3,1:3])
structure(c(0.00067813612986356, 2.02245701075299e-07, 0.000671436660923064,
0.00067813612986356, 2.02245701075299e-07, 0.000671436660923064,
0.000678142241667956, 4.05103861567113e-07, 0.000671439862344414
), .Dim = c(3L, 3L))
I can transform it into a data frame and get the column with the highest value:
x <- mymatrix %>% as.data.frame %>% mutate(max_prob = max.col(., ties.method = "last"))
Looks like this:
> x
V1 V2 V3 max_prob
1 6.781361e-04 6.781361e-04 6.781422e-04 3
2 2.022457e-07 2.022457e-07 4.051039e-07 3
3 6.714367e-04 6.714367e-04 6.714399e-04 3
If I wanted max_prob to be the actual value not the column index, how would I do that?
If you don't mind base R you can use apply. For example:
> x <- matrix(rnorm(9), ncol = 3)
> apply(x, 1, max)
[1] 0.246652 1.063506 2.148525
gives the row-wise maxima of x, i.e. the maximum across the columns of each row.
Besides the apply method from @Mariane and matrix indexing from @lmo's comment, you can also use matrixStats::rowMaxs:
matrixStats::rowMaxs(mymatrix)
# [1] 6.781422e-04 4.051039e-07 6.714399e-04
If you have a data frame, you can use do.call(pmax, ...) to calculate the parallel maxima of the input columns:
mymatrix %>% as.data.frame %>% mutate(max_val = do.call(pmax, .))
# V1 V2 V3 max_val
#1 6.781361e-04 6.781361e-04 6.781422e-04 6.781422e-04
#2 2.022457e-07 2.022457e-07 4.051039e-07 4.051039e-07
#3 6.714367e-04 6.714367e-04 6.714399e-04 6.714399e-04
Another option uses max.col, seq_along and a little arithmetic: element (i, j) of a matrix sits at linear position (j - 1) * nrow(m) + i in column-major storage, so the row maxima can be picked out with a single linear index. If m is your matrix, the following works as well:
mc <- max.col(m, ties.method = 'last')
m[(mc - 1) * nrow(m) + seq_along(mc)]
The result:
[1] 6.781422e-04 4.051039e-07 6.714399e-04
With cbind you can then bind this result to the matrix again:
> cbind(m, m[(mc - 1) * nrow(m) + seq_along(mc)])
[,1] [,2] [,3] [,4]
[1,] 6.781361e-04 6.781361e-04 6.781422e-04 6.781422e-04
[2,] 2.022457e-07 2.022457e-07 4.051039e-07 4.051039e-07
[3,] 6.714367e-04 6.714367e-04 6.714399e-04 6.714399e-04
This is a variation on @h3rm4n's answer, but you can use a special kind of matrix subsetting as well:
> m[cbind(1:nrow(m), max.col(m))]
[1] 6.781422e-04 4.051039e-07 6.714399e-04
An index of the form cbind(i, j) extracts the element at row i, column j for each row of the index matrix. Note that max.col defaults to ties.method = "random"; pass ties.method = "last" to reproduce the OP's choice when ties occur.
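A tiny base R illustration of this indexing form (m2 is just a throwaway example matrix):
m2 <- matrix(1:9, nrow = 3)     # column-major: m2[1, 2] is 4, m2[3, 1] is 3
m2[cbind(c(1, 3), c(2, 1))]     # pick elements (1,2) and (3,1) in one call
# [1] 4 3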

merging values horizontally in dataframe

I am trying to merge two subsets of a dataframe together, but neither merge nor cbind seem to do exactly what I want. So far I have this:
library(psych)
df1<-NULL
df1$a<-c(1,2,3,4,5)
df1$b<-c(4,5,2,6,1)
df1$c<-c(0,9,0,6,3)
df1$gender<-c(0,0,0,1,1)
df1<-as.data.frame(df1)
male<-subset(df1,gender<1)
male<-male[,-c(4)]
female<-subset(df1,gender>=1)
female<-female[,-c(4)]
merge(corr.test(male)$r, corr.test(female)$r)
My end goal is something like this in every cell:
a b c
a 1/1 -0.6546537/-1 0/-1
....
You can concatenate the entries in both matrices, then just fix the dimensions of the new vector to be the same as the corr.test output using dim<-, aka dim(...) <-.
## Concatenate the entries
strs <- sprintf("%s/%s", round(corr.test(male)$r,2),
round(corr.test(female)$r, 2))
## Set the dimensions
dim(strs) <- c(3,3)
## Or (to have the value returned at the same time)
`dim<-`(strs, c(3, 3))
# [,1] [,2] [,3]
# [1,] "1/1" "-0.65/-1" "0/-1"
# [2,] "-0.65/-1" "1/1" "0.76/1"
# [3,] "0/-1" "0.76/1" "1/1"
Another trick, if you want the row and column names from the corr.test output and don't want to worry about dimensions:
## Get one result
ctest <- corr.test(male)$r
## Concatenate
strs <- sprintf("%s/%s", round(ctest,2),
round(corr.test(female)$r, 2))
## Overwrite the matrix with the strings
ctest[] <- strs
ctest
# a b c
# a "1/1" "-0.65/-1" "0/-1"
# b "-0.65/-1" "1/1" "0.76/1"
# c "0/-1" "0.76/1" "1/1"

Triplicates in R

I have a set of 80 samples, with 2 variables, each measured in triplicate:
sample var1a var1b var1c var2a var2b var2c
1 -169.784 -155.414 -146.555 -175.295 -159.534 -132.511
2 -180.577 -180.792 -178.192 -177.294 -171.809 -166.147
3 -178.605 -184.183 -177.672 -167.321 -168.572 -165.335
and so on. How do I apply functions like mean, sd and se to each row, separately for var1 and var2? Also, the dataset contains NAs. Thanks for bothering with such basic questions!
What is your expected result when there are NAs? apply(df[-1], 1, mean) (or whatever function) will work, but it will return NA for any row that contains one. If replacing NA with 0 is acceptable, you could do df[is.na(df)] <- 0 first and then run the apply call to get the results.
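If simply ignoring the NAs within each row is acceptable instead, note that mean and sd accept na.rm = TRUE, which apply passes through to the function:
apply(df[-1], 1, mean, na.rm = TRUE)   # row means, NAs dropped per row
apply(df[-1], 1, sd, na.rm = TRUE)     # row standard deviations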
One approach could be to reshape your data set. Another is to just apply a function over the rows of a subset of the data frame.
So, for var2X you have:
apply(dat[5:7], 1, function(x) { m <- mean(x); s <- sd(x); c(m, s) })
[,1] [,2] [,3]
[1,] -155.78000 -171.750000 -167.076000
[2,] 21.63763 5.573734 1.632348
and for var1X:
apply(dat[2:4], 1, function(x) { m <- mean(x); s <- sd(x); c(m, s) })
[,1] [,2] [,3]
[1,] -157.25100 -179.853667 -180.153333
[2,] 11.72295 1.443055 3.520835
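Since the question also asks for the standard error, for which base R has no built-in, here is a small sketch using a hypothetical helper row_stats that computes all three per row while dropping NAs:
row_stats <- function(x) {
  x <- x[!is.na(x)]   # drop NAs before summarising
  c(mean = mean(x), sd = sd(x), se = sd(x) / sqrt(length(x)))
}
t(apply(dat[2:4], 1, row_stats))   # var1; use dat[5:7] for var2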

Finding all possible combinations of vector intersections?

I have a set of four vectors that look like this:
[1] PRI2CO HEISCO PRI2CO DIALGU DIALGU ALSEBL
Levels: ALSEBL DIALGU HEISCO PRI2CO
[1] PRI2CO TET2PA ALSEBL PRI2CO ALSEBL TET2PA
[7] HEISCO TET2PA
Levels: ALSEBL HEISCO PRI2CO TET2PA
I would like to generate a vector that contains all values that match between every possible combination of the four vectors. For the two above, it would contain ALSEBL, HEISCO, and PRI2CO. I've been doing every combination by hand so far, but it's tedious and I figure there has to be a better way. I tried writing a loop for it, but I'm pretty new to R and it hasn't worked yet. Here's what I've been doing:
trees.species.P234<-intersect(intersect(trees.species.P2,trees.species.P3),trees.species.P4)
> trees.species.P234
[1] "PRI2CO " "ALSEBL "
I was thinking a for loop that involved a factorial might do it, but I can't get it to work.
Here you go, using the same vectors as proposed by gadzooks:
v1 <- c("PRI2CO","HEISCO","PRI2CO","DIALGU","DIALGU","ALSEBL")
v2 <- c("PRI2CO", "TET2PA","ALSEBL","PRI2CO","ALSEBL","TET2PA","HEISCO","TET2PA")
v3 <- c("PRI2CO","HEISCO","PRI2CO","DIALGU","DIALGU","ALSEBL")
v4 <- c("PRI2CO", "TET2PA","ALSEBL","PRI2CO","ALSEBL","TET2PA","HEISCO","TET2PA")
veclist <- list(v1,v2,v3,v4)
combos <- Reduce(c,lapply(2:length(veclist),
function(x) combn(1:length(veclist),x,simplify=FALSE) ))
lapply(combos, function(x) Reduce(intersect,veclist[x]) )
#[[1]]
#[1] "PRI2CO" "HEISCO" "ALSEBL"
#
#[[2]]
#[1] "PRI2CO" "HEISCO" "DIALGU" "ALSEBL"
#
#[[3]]
#[1] "PRI2CO" "HEISCO" "ALSEBL"
#etc etc
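To keep track of which combination produced which intersection, a small extension of the same idea (the "&"-joined names are just an illustrative labelling scheme):
res <- lapply(combos, function(x) Reduce(intersect, veclist[x]))
names(res) <- sapply(combos, paste, collapse = "&")
res[["1&2"]]
# [1] "PRI2CO" "HEISCO" "ALSEBL"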
First you have to list all the combinations. For that, use the combn function:
> combn(1:4,2)
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 1 1 1 2 2 3
[2,] 2 3 4 3 4 4
Now we can use the apply function to find the intersection for each pair of vectors. But before that, let's create a list of vectors. For easy reproducibility I created this list:
cmb <- combn(1:4, 2)
l <- list(c("a","b"), c("b","c"), c("c","d"), c("d","e"))
Result <- apply(cmb, 2, function(x) intersect(l[[x[1]]], l[[x[2]]]))
This result will be a list; if you want it as a vector you can use do.call:
do.call("c",Result)
[1] "b" "c" "d"
For unique components
unique(do.call("c",Result))
This can be used for large lists as well.
v1 <- c("PRI2CO","HEISCO","PRI2CO","DIALGU","DIALGU","ALSEBL")
v2 <- c("PRI2CO", "TET2PA","ALSEBL","PRI2CO","ALSEBL","TET2PA","HEISCO","TET2PA")
v3 <- c("PRI2CO","HEISCO","PRI2CO","DIALGU","DIALGU","ALSEBL")
v4 <- c("PRI2CO", "TET2PA","ALSEBL","PRI2CO","ALSEBL","TET2PA","HEISCO","TET2PA")
vall <- unique(c(v1,v2,v3,v4))
for (x in vall) {
  if ((x %in% v1) & (x %in% v2) & (x %in% v3) & (x %in% v4)) {
    print(x)
  }
}
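And if you only need the species common to all four vectors (the trees.species.P234 case from the question), Reduce collapses the nested intersect calls into one line:
Reduce(intersect, list(v1, v2, v3, v4))
# [1] "PRI2CO" "HEISCO" "ALSEBL"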
