Remove column values with NA in R

I have a data frame, called gen, which is represented below
A B C D E
1 NA 4.35 35.3 3.36 4.87
2 45.2 .463 34.3 NA 34.4
3 NA 34.5 35.6 .457 46.3
I would like to remove the columns where there are NA's. (I know na.omit does it for rows, but I can't seem to find one for columns). The final result would read:
B C E
1 4.35 35.3 4.87
2 .463 34.3 34.4
3 34.5 35.6 46.3
Thanks!

gen <- gen[sapply(gen, function(x) all(!is.na(x)))]

dfrm[ , sapply(dfrm, function(x){ !any(is.na(x)) } ) ]
You might want to use this variant instead:
dfrm[ , sapply(dfrm, function(x){ all(is.finite(x)) } ) ]
Selection based on is.na does not identify or remove Inf or -Inf values in a vector; is.finite is FALSE for NA, NaN, Inf and -Inf alike.

Just use this:
gen[colSums(is.na(gen)) == 0]
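To see the column filter at work, here is a minimal sketch that rebuilds the gen data frame from the question and keeps only the columns without NA. The names keep and clean are just illustrative, and drop = FALSE is an extra precaution that is not in the answers above.

```r
# Rebuild the example data frame from the question
gen <- data.frame(
  A = c(NA, 45.2, NA),
  B = c(4.35, 0.463, 34.5),
  C = c(35.3, 34.3, 35.6),
  D = c(3.36, NA, 0.457),
  E = c(4.87, 34.4, 46.3)
)

# TRUE for every column that contains no NA
keep <- colSums(is.na(gen)) == 0

# drop = FALSE keeps a data frame even if only one column survives
clean <- gen[, keep, drop = FALSE]
names(clean)
# [1] "B" "C" "E"
```

Columns A and D each contain at least one NA, so only B, C and E remain.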

Related

How to calculate the cumulative mean of a block of the data frame up to each row in R

I have a simplified dataframe that looks like this:
df
volume ask1 ask2 bid1 bid2
0 38 NA 38 37.9
100 38.1 38.2 37.8 38.2
0 38.4 38.5 38.2 38.3
0 38.4 38.5 38.2 NA
200 38.3 38.1 38 38.4
250 38.4 38.2 NA 38.6
I want to have another column, which contains the mean of df[1:i, 2:5] on the ith row.
I can do this with a for loop:
df[, "midpoint"] <- NA
for (i in 1:nrow(df)) {
df$midpoint[i] <- mean(as.matrix(df[c(1:i), c(2:5)]), na.rm = TRUE)
}
But as my dataframe is actually large, the for loop takes a long time.
I have tried sapply but failed:
df[, "midpoint"] <- sapply(df, function(i) mean(as.matrix(df[c(1:i), c(2:5)]), na.rm = TRUE))
Could anyone give me some suggestions?
With sapply you can do :
mat <- as.matrix(df[, 2:5])
df$midpoint <- sapply(seq(nrow(df)), function(i) mean(mat[1:i, ], na.rm = TRUE))
You can also take the mean of the row means, which would be faster but introduces a small error, because rows with different numbers of non-NA values get equal weight.
library(dplyr)
df %>%
mutate(res = rowMeans(select(., 2:5), na.rm = TRUE),
res = cummean(res))
# volume ask1 ask2 bid1 bid2 midpoint res
#1 0 38.0 NA 38.0 37.9 37.96667 37.96667
#2 100 38.1 38.2 37.8 38.2 38.02857 38.02083
#3 0 38.4 38.5 38.2 38.3 38.14545 38.13056
#4 0 38.4 38.5 38.2 NA 38.19286 38.18958
#5 200 38.3 38.1 38.0 38.4 38.19444 38.19167
#6 250 38.4 38.2 NA 38.6 38.22381 38.22639
Here midpoint is the actual answer from the for loop or sapply code and res is the answer from the above calculation.
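You were close with your sapply command, but you need to iterate over the row indices and average rows 1 to x, not just row x.
Try
sapply(1:nrow(df), function(x) mean(as.matrix(df)[1:x, 2:5], na.rm = TRUE))
A vectorized alternative divides the cumulative sum of the non-NA values by the cumulative count of non-NA values:
cumsum(rowSums(df[2:5], na.rm = TRUE)) / cumsum(rowSums(!is.na(df[2:5])))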
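For a large data frame, a fully vectorized version may be worth trying. The sketch below rebuilds the example data, computes per-row sums and per-row counts of non-NA values, and divides their running totals. The helper names row_sum, row_n and loop_version are illustrative only.

```r
# Example data from the question
df <- data.frame(
  volume = c(0, 100, 0, 0, 200, 250),
  ask1   = c(38, 38.1, 38.4, 38.4, 38.3, 38.4),
  ask2   = c(NA, 38.2, 38.5, 38.5, 38.1, 38.2),
  bid1   = c(38, 37.8, 38.2, 38.2, 38, NA),
  bid2   = c(37.9, 38.2, 38.3, NA, 38.4, 38.6)
)

mat <- as.matrix(df[, 2:5])

# Sum and count of the non-NA values in each row
row_sum <- rowSums(mat, na.rm = TRUE)
row_n   <- rowSums(!is.na(mat))

# Running total divided by running count = mean of all values in rows 1..i
df$midpoint <- cumsum(row_sum) / cumsum(row_n)

# Reference result from the slower row-by-row approach
loop_version <- sapply(seq_len(nrow(df)),
                       function(i) mean(mat[1:i, ], na.rm = TRUE))
all.equal(df$midpoint, loop_version)
```

Taking na.rm inside rowSums before the cumsum matters: cumsum itself propagates NA to every later element, so summing first keeps a single missing cell from wiping out the rest of a column.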

na.omit seems to be removing negative values in my data frame?

I am trying to figure out different temperature ranges for specific locations (CB, HK, etc.) in my data frame,
it looks like this:
'head(join)'
OTU_num location date otus Depth DO Temperature pH Secchi.Depth
1 Otu0001 CB 03JUN09 21 0.0 7.60 21.0 3.68 NA
2 Otu0001 CB 03JUN09 21 0.5 8.27 16.4 3.68 NA
3 Otu0001 CB 03JUN09 21 1.0 7.65 14.9 3.68 NA
4 Otu0001 CB 03JUN09 21 1.5 5.26 12.2 3.25 NA
5 Otu0001 CB 03JUN09 21 2.0 4.01 10.1 3.25 NA
I am calculating the range using:
ranges <- join %>%
group_by(location) %>%
na.omit %>%
mutate(min=min(Temperature), max=max(Temperature), subtract=min-max) %>%
arrange(subtract)
Some of the temperature values are NA, so I used na.omit. However, it appears to be taking out the negative values, so the ranges I get are wrong.
location min max subtract
MA 0.1 27.3 -27.2
I double checked using the range function for one of the locations (there are a lot and I did not want to use range for each location)
MA <- subset(join, location=="MA")
range(MA$Temperature, na.rm = TRUE)
[1] -2.2 27.6
Why are the values different? Any help is appreciated!!!
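I think you should use join %>% filter(!is.na(Temperature)), so only rows that have NA temperatures will be removed. na.omit drops every row that has an NA in any column (Secchi.Depth, for example), which is probably why rows with valid negative temperatures disappeared.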
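The difference is easy to reproduce on a small made-up data frame. The values below are hypothetical toy data, chosen only to mimic the numbers in the question, and the sketch uses base R subsetting rather than dplyr so that it runs without extra packages.

```r
# Hypothetical toy data: NA appears in Temperature AND in Secchi.Depth
join <- data.frame(
  location     = rep("MA", 5),
  Temperature  = c(-2.2, 0.1, 27.3, 27.6, NA),
  Secchi.Depth = c(NA, 1.5, 2.0, NA, 1.0)
)

# na.omit drops any row with an NA in ANY column, so the rows holding
# -2.2 and 27.6 are lost because their Secchi.Depth is missing
range(na.omit(join)$Temperature)
# [1]  0.1 27.3

# Filtering on Temperature alone keeps those rows
temp_only <- join[!is.na(join$Temperature), ]
range(temp_only$Temperature)
# [1] -2.2 27.6
```

The negative value is not removed because it is negative; it is removed because another column in the same row is NA.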

R generate bins from a data frame respecting blanks

I need to generate bins from a data.frame based on the values of one column. I have tried the function "cut".
For example: I want to create bins of air temperature values in the column "AirTDay" in a data frame:
AirTDay (oC)
8.16
10.88
5.28
19.82
23.62
13.14
28.84
32.21
17.44
31.21
I need the bin intervals to include all values in a range of 2 degrees centigrade from that initial value (i.e. 8-9.99, 10-11.99, 12-13.99...), to be labelled with the average value of the range (i.e. 9.5, 10.5, 12.5...), and to respect blank cells, returning "NA" in the bins column.
The output should look as:
Air_T (oC) TBins
8.16 8.5
10.88 10.5
5.28 NA
NA
19.82 20.5
23.62 24.5
13.14 14.5
NA
NA
28.84 28.5
32.21 32.5
17.44 18.5
31.21 32.5
I've gotten as far as:
setwd('C:/Users/xxx')
temp_data <- read.csv("temperature.csv", sep = ",", header = TRUE)
TAir <- temp_data$AirTDay
Tmin <- round(min(TAir, na.rm = FALSE), digits = 0) # is start at minimum value
Tmax <- round(max(TAir, na.rm = FALSE), digits = 0)
int <- 2 # bin ranges 2 degrees
mean_int <- int/2
int_range <- seq(Tmin, Tmax + int, int) # generate bin sequence
bin_label <- seq(Tmin + mean_int, Tmax + mean_int, int) # generate labels
temp_data$TBins <- cut(TAir, breaks = int_range, ordered_result = FALSE, labels = bin_label)
The output table looks correct, but for some reason it shows an additional sequential column, shifts the column names, and collapses all values, eliminating blank cells. Something like this:
Air_T (oC) TBins
1 8.16 8.5
2 10.88 10.5
3 5.28 NA
4 19.82 20.5
5 23.62 24.5
6 13.14 14.5
7 28.84 28.5
8 32.21 32.5
9 17.44 18.5
10 31.21 32.5
Any ideas on where I am failing and how to solve it?
v <- ceiling(max(dat$V1, na.rm = TRUE))
breaks <- seq(8, v, 2)
labels <- seq(8.5, length.out = length(breaks) - 1, by = 2)
transform(dat, Tbins = cut(V1, breaks, labels))
V1 Tbins
1 8.16 8.5
2 10.88 10.5
3 5.28 <NA>
4 NA <NA>
5 19.82 18.5
6 23.62 22.5
7 13.14 12.5
8 NA <NA>
9 NA <NA>
10 28.84 28.5
11 32.21 <NA>
12 17.44 16.5
13 31.21 30.5
This result follows the logic given: we have
paste(seq(8,v,2),seq(9.99,v,by=2),sep="-")
[1] "8-9.99" "10-11.99" "12-13.99" "14-15.99" "16-17.99" "18-19.99" "20-21.99"
[8] "22-23.99" "24-25.99" "26-27.99" "28-29.99" "30-31.99"
From this we can tell that 19.82 lies between 18 and 20 and is therefore given the value 18.5, just as 10.88 lies in 10-11.99 and is assigned the value 10.5.
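A variation on the same idea, as a sketch: the intervals 8-9.99, 10-11.99, ... are closed on the left, which corresponds to right = FALSE in cut, and the last break can be rounded up to a multiple of the bin width so that the maximum (32.21) also gets a bin. The lower bound of 8 and the "lower edge + 0.5" labels follow the answer above; the names lower, width and upper are illustrative.

```r
# Values from the question, with NA where the cells were blank
dat <- data.frame(V1 = c(8.16, 10.88, 5.28, NA, 19.82, 23.62, 13.14,
                         NA, NA, 28.84, 32.21, 17.44, 31.21))

lower <- 8   # assumed first bin edge, as in the answer above
width <- 2   # bin width in degrees

# Round the maximum up to the next multiple of the width: 32.21 -> 34
upper  <- ceiling(max(dat$V1, na.rm = TRUE) / width) * width
breaks <- seq(lower, upper, by = width)

# One label per bin: lower edge + 0.5, as in the answer above
labels <- head(breaks, -1) + 0.5

# right = FALSE gives left-closed bins [8, 10), [10, 12), ...
bins <- cut(dat$V1, breaks = breaks, labels = labels, right = FALSE)

# cut returns a factor; convert the labels back to numbers
dat$TBins <- as.numeric(as.character(bins))
dat
```

NA inputs stay NA in TBins, and values below the first break (5.28 here) also become NA because they fall outside every bin.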

nested ifelse in R so close to working

I'm working with the following four columns of raw weight measurement data and a very nearly-functioning nested ifelse statement that results in the 'kg' vector.
Id G4_R_2_4 G4_R_2_5 G4_R_2_5_option2 kg
219 13237 16.0 NA NA 16.0
220 139129 8.50 55.70 47.20 8.50
221 139215 28.9 NA NA 28.9
222 139216 NA 46.70 8.50 46.70
223 139264 12.40 NA NA 12.40
224 139281 13.60 NA NA 13.60
225 139366 16.10 NA NA 16.10
226 139376 61.80 NA NA 61.80
227 140103 NA 48.60 9.10 48.60
The goal is to merge the three 'G4' columns into kg based on the following conditions:
1) If G4_R_2_4 is not NA, print its value
2) If G4_R_2_4 is NA, print the lesser of the values appearing in G4_R_2_5 and G4_R_2_5_option2 (sorry for lame variable names)
I've been working with the following statement (big dataset called 'child'):
> child$kg <- ifelse(child$G4_R_2_4 == 'NA' & child$G4_R_2_5 < child$G4_R_2_5_option2,
child$G4_R_2_5, ifelse(child$G4_R_2_4 == 'NA' & child$G4_R_2_5 > child$G4_R_2_5_option2,
child$G4_R_2_5_option2, child$G4_R_2_4))
Which results in the 'kg' vector I have now. It seems to satisfy the G4_R_2_4 condition (is/is not NA) but always prints the value from G4_R_2_5 for the NA cases. How do I get it to incorporate the greater than/less than condition?
It's not clear from your example, but I think the problem is that you're handling NAs incorrectly and/or using the wrong type for the data frame's columns. Try rewriting your code like this:
#if your columns are of character type (warnings are ok)
child$G4_R_2_4<-as.numeric(child$G4_R_2_4)
child$G4_R_2_5<-as.numeric(child$G4_R_2_5)
child$G4_R_2_5_option2<-as.numeric(child$G4_R_2_5_option2)
#correct NA handling
child$kg<-ifelse(is.na(child$G4_R_2_4) & child$G4_R_2_5 <
child$G4_R_2_5_option2, child$G4_R_2_5, ifelse(is.na(child$G4_R_2_4) &
child$G4_R_2_5 > child$G4_R_2_5_option2, child$G4_R_2_5_option2, child$G4_R_2_4))
Here is an alternative version that might be interesting, assuming that the values are stored in numerical form (else the column entries should be converted into numerical values, as suggested in the other answers):
get_kg <- function(x){
if(!is.na(x[2])) return (x[2])
return (min(x[3], x[4], na.rm = T))}
child$kg <- apply(child,1,get_kg)
#> child
# Id G4_R_2_4 G4_R_2_5 G4_R_2_5_option2 kg
#219 13237 16.0 NA NA 16.0
#220 139129 8.5 55.7 47.2 8.5
#221 139215 28.9 NA NA 28.9
#222 139216 NA 46.7 8.5 8.5
#223 139264 12.4 NA NA 12.4
#224 139281 13.6 NA NA 13.6
#225 139366 16.1 NA NA 16.1
#226 139376 61.8 NA NA 61.8
#227 140103 NA 48.6 9.1 9.1
We could do this using pmin. Assuming that your 'G4' columns are 'character' class, we convert those columns to 'numeric' class and use pmin on those columns.
indx <- grep('^G4', names(child))
child[indx] <- lapply(child[indx], as.numeric)
d1 <- child[indx]
child$kgN <- ifelse(is.na(d1[,1]), do.call(pmin, c(d1[-1], na.rm=TRUE)), d1[,1])
child$kgN
#[1] 16.0 8.5 28.9 8.5 12.4 13.6 16.1 61.8 9.1
Or without using ifelse
cbind(d1[,1], do.call(pmin, c(d1[-1], na.rm=TRUE)))[cbind(1:nrow(d1),
(is.na(d1[,1]))+1L)]
#[1] 16.0 8.5 28.9 8.5 12.4 13.6 16.1 61.8 9.1
Benchmarks
set.seed(24)
child1 <- as.data.frame(matrix(sample(c(NA,0:50), 1e6*3, replace=TRUE),
ncol=3, dimnames=list(NULL, c('G4_R_2_4', 'G4_R_2_5',
'G4_R_2_5_option2'))) )
cyberj0g <- function(){
with(child1, ifelse(is.na(G4_R_2_4) & G4_R_2_5 <
G4_R_2_5_option2, G4_R_2_5, ifelse(is.na(G4_R_2_4) &
G4_R_2_5 > G4_R_2_5_option2, G4_R_2_5_option2, G4_R_2_4)))
}
get_kg <- function(x){
if(!is.na(x[2])) return (x[2])
return (min(x[3], x[4], na.rm = T))}
RHertel <- function() apply(child1,1,get_kg)
akrun <- function(){cbind(child1[,1], do.call(pmin, c(child1[-1],
na.rm=TRUE)))[cbind(1:nrow(child1), (is.na(child1[,1]))+1L)]}
system.time(cyberj0g())
# user system elapsed
# 0.451 0.000 0.388
system.time(RHertel())
# user system elapsed
# 11.808 0.000 10.928
system.time(akrun())
# user system elapsed
# 0.000 0.000 0.084
library(microbenchmark)
microbenchmark(cyberj0g(), akrun(), unit='relative', times=20L)
#Unit: relative
# expr min lq mean median uq max neval cld
# cyberj0g() 3.750391 4.137777 3.538063 4.091793 2.895156 3.197511 20 b
# akrun() 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 20 a
I'm pretty sure the problem is that you're not testing whether the values are NAs, you're testing whether they are equal to the string "NA", which they never are. This should work:
child$kg <- ifelse(is.na(child$G4_R_2_4) &
child$G4_R_2_5 < child$G4_R_2_5_option2,
child$G4_R_2_5,
ifelse(is.na(child$G4_R_2_4) &
child$G4_R_2_5 > child$G4_R_2_5_option2,
child$G4_R_2_5_option2,
child$G4_R_2_4))
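To make the point about the string "NA" concrete, here is a small sketch on four rows taken from the question. It assumes the columns are already numeric, and the names cmp and fallback are illustrative. It first shows what the == 'NA' comparison returns, then combines is.na with pmin as one possible way to pick the lesser value.

```r
# Four rows from the question, assuming numeric columns
child <- data.frame(
  Id               = c(13237, 139129, 139216, 140103),
  G4_R_2_4         = c(16.0, 8.5, NA, NA),
  G4_R_2_5         = c(NA, 55.7, 46.7, 48.6),
  G4_R_2_5_option2 = c(NA, 47.2, 8.5, 9.1)
)

# Comparing with the string "NA" never gives TRUE:
# real values give FALSE and missing values give NA
cmp <- child$G4_R_2_4 == "NA"
cmp
# [1] FALSE FALSE    NA    NA

# Element-wise minimum of the two fallback columns, ignoring NA
fallback <- pmin(child$G4_R_2_5, child$G4_R_2_5_option2, na.rm = TRUE)

# Use G4_R_2_4 when present, otherwise the lesser fallback value
child$kg <- ifelse(is.na(child$G4_R_2_4), fallback, child$G4_R_2_4)
child$kg
# [1] 16.0  8.5  8.5  9.1
```

Because ifelse returns NA wherever its test is NA, the original == 'NA' condition can never select the fallback branch for the missing rows.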

Why subsetting rows with 'apply' in data frame doesn't work in R

I have data that looks like this.
Name|ID|p72|p78|p51|p49|c36.1|c32.1|c32.2|c36.2|c37
hsa-let-7a-5p|MIMAT0000062|9.1|38|12.7|185|8|4.53333333333333|17.9|23|63.3
hsa-let-7b-5p|MIMAT0000063|11.3|58.6|27.5|165.6|20.4|8.5|21|30.2|92.6
hsa-let-7c|MIMAT0000064|7.8|40.2|9.6|147.8|11.8|4.53333333333333|15.4|17.7|62.3
hsa-let-7d-5p|MIMAT0000065|4.53333333333333|27.7|13.4|158.1|8.5|4.53333333333333|14.2|13.5|50.5
hsa-let-7e-5p|MIMAT0000066|6.2|4.53333333333333|4.53333333333333|28|4.53333333333333|4.53333333333333|5.6|4.7|12.8
hsa-let-7f-5p|MIMAT0000067|4.53333333333333|4.53333333333333|4.53333333333333|78.2|4.53333333333333|4.53333333333333|6.8|4.53333333333333|8.9
hsa-miR-15a-5p|MIMAT0000068|4.53333333333333|70.3|10.3|147.6|4.53333333333333|4.53333333333333|21.1|30.2|100.8
hsa-miR-16-5p|MIMAT0000069|9.5|562.6|60.5|757|25.1|4.53333333333333|89.4|142.9|613.9
hsa-miR-17-5p|MIMAT0000070|10.5|71.6|27.4|335.1|6.3|10.1|51|51|187.1
hsa-miR-17-3p|MIMAT0000071|4.53333333333333|4.53333333333333|4.53333333333333|17.2|4.53333333333333|4.53333333333333|9.5|4.53333333333333|7.3
hsa-miR-18a-5p|MIMAT0000072|4.53333333333333|14.6|4.53333333333333|53.4|4.53333333333333|4.53333333333333|9.5|25.5|29.7
hsa-miR-19a-3p|MIMAT0000073|4.53333333333333|11.6|4.53333333333333|42.8|4.53333333333333|4.53333333333333|4.53333333333333|5.5|17.9
hsa-miR-19b-3p|MIMAT0000074|8.3|93.3|15.8|248.3|4.53333333333333|6.3|44.7|53.2|135
hsa-miR-20a-5p|MIMAT0000075|4.53333333333333|75.2|23.4|255.7|6.6|4.53333333333333|43.8|38|130.3
hsa-miR-21-5p|MIMAT0000076|6.2|19.7|18|299.5|6.8|4.53333333333333|49.9|68.5|48
hsa-miR-22-3p|MIMAT0000077|40.4|128.4|65.4|547.1|56.5|33.4|104.9|84.1|248.3
hsa-miR-23a-3p|MIMAT0000078|58.3|99.3|58.6|617.9|36.6|21.4|107.1|125.5|120.9
hsa-miR-24-1-5p|MIMAT0000079|4.53333333333333|4.53333333333333|4.53333333333333|9.2|4.53333333333333|4.53333333333333|4.53333333333333|4.9|4.53333333333333
hsa-miR-24-3p|MIMAT0000080|638.2|286.9|379.5|394.4|307.8|240.4|186|234.2|564
What I want to do is simply pick the rows where all the values are greater than 10.
The data clearly show that more than one row satisfies this condition.
> dat<-read.delim("http://dpaste.com/1215552/plain/",sep="|",na.strings="",header=TRUE,blank.lines.skip=TRUE,fill=FALSE)
But why does this code of mine report only the last one?
> dat[apply(dat[, -1], MARGIN = 1, function(x) all(x > 10)), ]
Name ID p72 p78 p51 p49 c36.1 c32.1 c32.2 c36.2 c37
19 hsa-miR-24-3p MIMAT0000080 638.2 286.9 379.5 394.4 307.8 240.4 186 234.2 564
What is the right way to do it?
Update:
alexwhan's solution works. But I wonder how I can generalize his approach
so that it can handle data with missing values (NA)
dat<-read.delim("http://dpaste.com/1215354/plain/",sep="\t",na.strings="",header=FALSE,blank.lines.skip=TRUE,fill=FALSE)
Since you're including your ID column (which is a factor) in the all(), it's getting messed up: apply() converts each row to character, so x > 10 becomes a string comparison. Try:
dat[apply(dat[, -c(1,2)], MARGIN = 1, function(x) all(x > 10)), ]
# Name ID p72 p78 p51 p49 c36.1 c32.1 c32.2 c36.2 c37
# 16 hsa-miR-22-3p MIMAT0000077 40.4 128.4 65.4 547.1 56.5 33.4 104.9 84.1 248.3
# 17 hsa-miR-23a-3p MIMAT0000078 58.3 99.3 58.6 617.9 36.6 21.4 107.1 125.5 120.9
# 19 hsa-miR-24-3p MIMAT0000080 638.2 286.9 379.5 394.4 307.8 240.4 186.0 234.2 564.0
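The coercion can be checked on a tiny stand-in for the data. The three rows below are hypothetical, with only two of the numeric columns, and the sketch also shows one way to avoid hard-coding the positions of the non-numeric columns by selecting numeric columns with is.numeric (the names num_cols and keep are illustrative).

```r
# Hypothetical three-row stand-in for the real data
dat <- data.frame(
  Name = c("a", "b", "c"),
  ID   = c("M1", "M2", "M3"),
  p72  = c(9.1, 40.4, 638.2),
  p78  = c(38, 128.4, 286.9),
  stringsAsFactors = FALSE
)

# Dropping only the first column still leaves ID in, so the matrix
# that apply() works on is character, not numeric
is.character(as.matrix(dat[, -1]))
# [1] TRUE

# Pick the numeric columns by type instead of by position
num_cols <- sapply(dat, is.numeric)

# Keep rows in which no numeric value is <= 10
keep <- rowSums(dat[, num_cols, drop = FALSE] <= 10) == 0
dat[keep, ]
```

Only rows "b" and "c" are kept here, since row "a" has 9.1 in p72.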
EDIT
For the case where you have NA, you can just use the na.rm argument for all(). Using your new data (from the comment):
dat<-read.delim("http://dpaste.com/1215354/plain/",sep="\t",na.strings="",header=FALSE,blank.lines.skip=TRUE,fill=FALSE)
dat[apply(dat[, -c(1,2)], MARGIN = 1, function(x) all(x > 10, na.rm = T)), ]
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
# 7 hsa-miR-15a-5p MIMAT0000068 NA 70.3 10.3 147.6 NA NA 21.1 30.2 100.8
# 16 hsa-miR-22-3p MIMAT0000077 40.4 128.4 65.4 547.1 56.5 33.4 104.9 84.1 248.3
# 17 hsa-miR-23a-3p MIMAT0000078 58.3 99.3 58.6 617.9 36.6 21.4 107.1 125.5 120.9
# 19 hsa-miR-24-3p MIMAT0000080 638.2 286.9 379.5 394.4 307.8 240.4 186.0 234.2 564.0
# 20 hsa-miR-25-3p MIMAT0000081 19.3 78.6 25.6 84.3 14.9 16.9 19.1 27.2 113.8
# 21 hsa-miR-26a-5p MIMAT0000082 NA 22.8 31.0 561.2 12.4 NA 67.0 55.8 48.9
Another idea is to transform your data to long format (or molten format). I think it is an even better way to avoid the missing-values problem:
library(reshape2)
dat.m <- melt(dat,id.vars=c('Name','ID'))
dat.m$value <- as.numeric(dat.m$value)
library(plyr)
res <- ddply(dat.m,.(Name,ID), summarise, keepme = all(value > 10))
res[res$keepme,]
# Name ID keepme
# 16 hsa-miR-22-3p MIMAT0000077 TRUE
# 17 hsa-miR-23a-3p MIMAT0000078 TRUE
# 19 hsa-miR-24-3p MIMAT0000080 TRUE
