I'm trying to use complete.cases to clear out the NAs from a file.
I've been using help from this site but it isn't working and I'm no longer sure if what I'm trying to do is possible.
juulDataRaw <- read.csv(url("http://blah"));
juulDataRaw[complete.cases(juulDataRaw),]
I tried this (one of the examples from here)
dog<-structure(list(Sample = 1:6
,gene = c("ENSG00000208234","ENSG00000199674","ENSG00000221622","ENSG00000207604","ENSG00000207431","ENSG00000221312")
,hsap = c(0,0,0,0,0,0)
,mmul = c(NA,2,NA,NA,NA,1)
,mmus = c(NA,2,NA,NA,NA,2)
,rnor = c(NA,2,NA,1,NA,3)
,cfam = c(NA,2,NA,2,NA,2))
,.Names = c("Sample", "gene", "hsap", "mmul", "mmus", "rnor", "cfam"), class = "data.frame", row.names = c(NA, -6L))
dog[complete.cases(dog),]
and that works.
So can mine be done?
What is the difference between the two?
Aren't they both just data frames?
You have quotes around the numeric values so they are read in as factors. That makes the "NA" just another string rather than an R NA.
> juulDataRaw[] <- lapply(juulDataRaw, as.character)
> juulDataRaw[] <- lapply(juulDataRaw, as.numeric)
Warning messages:
1: In lapply(juulDataRaw, as.numeric) : NAs introduced by coercion
2: In lapply(juulDataRaw, as.numeric) : NAs introduced by coercion
3: In lapply(juulDataRaw, as.numeric) : NAs introduced by coercion
> juulDataRaw[complete.cases(juulDataRaw),]
age height igf1 weight
55 6.00 111.6 98 19.1
57 6.08 116.7 242 21.7
61 6.26 120.3 196 24.7
66 6.40 115.5 179 19.6
69 6.42 115.6 126 20.6
71 6.43 116.1 142 20.2
80 6.61 130.3 236 28.0
81 6.63 122.2 148 21.6
83 6.70 126.2 174 26.1
84 6.72 125.6 136 22.6
85 6.72 121.0 164 24.4
snipped remaining output.....
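A minimal sketch of the same problem on made-up data (the objects raw, fixed and clean and their values are assumptions for illustration, not the original file): a quoted "NA" stored as a factor level is not a missing value, so complete.cases() keeps every row until the columns are converted to numeric.

```r
# Toy data: the string "NA" is just another factor level here
raw <- data.frame(
  age    = factor(c("6.00", "NA", "6.26")),
  height = factor(c("111.6", "116.7", "NA"))
)

# Nothing is recognised as missing yet, so every row counts as complete
all(complete.cases(raw))

# Convert factor -> character -> numeric; the "NA" strings become real NAs
fixed <- raw
fixed[] <- lapply(fixed, function(col) as.numeric(as.character(col)))

# Now only the rows without missing values survive
clean <- fixed[complete.cases(fixed), ]
clean
```

Going through as.character() first matters: calling as.numeric() directly on a factor returns the internal level codes rather than the printed values.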
I have a simplified dataframe that looks like this:
df
volume ask1 ask2 bid1 bid2
0 38 NA 38 37.9
100 38.1 38.2 37.8 38.2
0 38.4 38.5 38.2 38.3
0 38.4 38.5 38.2 NA
200 38.3 38.1 38 38.4
250 38.4 38.2 NA 38.6
I want to have another column, which contains the mean of df[1:i, 2:5] on the ith row.
I can do this with a for loop:
df[, "midpoint"] <- NA
for (i in 1:nrow(df)) {
df$midpoint[i] <- mean(as.matrix(df[c(1:i), c(2:5)]), na.rm = TRUE)
}
But as my dataframe is actually large, the for loop takes a long time.
I have tried sapply but failed:
df[, "midpoint"] <- sapply(df, function(i) mean(as.matrix(df[c(1:i), c(2:5)]), na.rm = TRUE))
Could anyone give me some suggestions?
With sapply you can do :
mat <- as.matrix(df[, 2:5])
df$midpoint <- sapply(seq(nrow(df)), function(i) mean(mat[1:i, ], na.rm = TRUE))
You can also take the cumulative mean of the row means, which is faster but introduces a small error whenever rows have different numbers of non-missing values.
library(dplyr)
df %>%
mutate(res = rowMeans(select(., 2:5), na.rm = TRUE),
res = cummean(res))
# volume ask1 ask2 bid1 bid2 midpoint res
#1 0 38.0 NA 38.0 37.9 37.96667 37.96667
#2 100 38.1 38.2 37.8 38.2 38.02857 38.02083
#3 0 38.4 38.5 38.2 38.3 38.14545 38.13056
#4 0 38.4 38.5 38.2 NA 38.19286 38.18958
#5 200 38.3 38.1 38.0 38.4 38.19444 38.19167
#6 250 38.4 38.2 NA 38.6 38.22381 38.22639
Here midpoint is the actual answer from the for loop or sapply code and res is the answer from the above calculation.
You were close with your sapply command, but you need to iterate over the row indices and average rows 1 to x each time.
Try
sapply(1:nrow(df), function(x) mean(as.matrix(df[1:x, 2:5]), na.rm = TRUE))
cumsum(rowSums(df[2:5], na.rm = TRUE)) / cumsum(rowSums(!is.na(df[2:5])))
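As a rough check, the running mean can be computed without a loop as the cumulative sum of non-missing values divided by the cumulative count of non-missing values. The sketch below rebuilds the example data from the question and compares that against the row-by-row version; the names prices, loop_mean and fast_mean are made up for illustration.

```r
# Example data from the question
df <- data.frame(
  volume = c(0, 100, 0, 0, 200, 250),
  ask1   = c(38, 38.1, 38.4, 38.4, 38.3, 38.4),
  ask2   = c(NA, 38.2, 38.5, 38.5, 38.1, 38.2),
  bid1   = c(38, 37.8, 38.2, 38.2, 38, NA),
  bid2   = c(37.9, 38.2, 38.3, NA, 38.4, 38.6)
)
prices <- as.matrix(df[, 2:5])

# Reference: mean of all non-missing values in rows 1..i
loop_mean <- sapply(seq_len(nrow(df)),
                    function(i) mean(prices[1:i, ], na.rm = TRUE))

# Vectorised: running total of values over running count of values
running_sum <- cumsum(rowSums(prices, na.rm = TRUE))
running_n   <- cumsum(rowSums(!is.na(prices)))
fast_mean   <- unname(running_sum / running_n)

# TRUE when both approaches agree
all.equal(loop_mean, fast_mean)
```

Summing each row with na.rm = TRUE before taking cumsum() keeps one NA from propagating down the rest of a column.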
I'm trying to create a function that summarizes several vectors and the prompt is
Write a function data_summary which takes three inputs:\
`dataset`: A data frame\
`vars`: A character vector whose elements are names of columns from dataset which the user wants summaries for\
`group.name`: A length one character vector which gives the name of the column from dataset which contains the factor which will be used as a grouping variable\
`var.names`: A character vector of the same length as vars which gives the names that the user would like used as the entries under “Variable” in the resulting output. This should be set equal to vars by default, so the default behavior is to use the column names from dataset.
The output of the function should be a data frame with the following structure:
Column names of the data frame will be:\
`Variable`\
`Missing`\
The `first` level of the factor group.name\
The `second` level of the factor group.name\
…\
The `kth` level of the factor group.name\
`p-value`
I've set up the code already,
data_summary <- function(dataset,vars,group.name,var.names) {
}
but I'm unsure how to proceed because I do not understand what this is trying to accomplish and what the output should look like. There is an example that shows
#data_summary<-function(dataset, vars,group.name, var.name){}
#example
#data_summary(titanic4, c("survived", "female", "age", "sibsp", "parch", "fare", "cabin"), "pclass")
#data_summary(titanic4, c("survived", "female", "age", "sibsp", "parch", "fare", "cabin"), "pclass", c("Survival rate", "% Female", "Age", "# siblings/spouses aboard", "# children/parents aboard", "Fare ($)", "Cabin"))
But it really did not help me outside of inputting the arguments for the function.
You can use the dplyr package for this function. I don't know which functions you want to summarise your data frame with, so I use all the statistics that the base summary function returns.
My data:
> NewSKUMatrix
# A tibble: 268,918 x 4
LagerID FilialID CSBID Price
<int> <int> <int> <dbl>
1 233 2578 1005 38.3
2 333 2543 NA 61.0
3 334 2543 NA 15.0
4 335 2543 NA 11.0
5 337 2301 NA 71.0
6 338 2031 NA 37.0
7 338 2044 NA 35.0
8 338 2054 NA 36.0
9 338 2060 NA 37.0
10 338 2063 NA 36.0
# ... with 268,908 more rows
Function:
library(dplyr)

data_summary <- function(data,
variables,
values,
names = NULL) {
if (is.null(x = names)) {
names <- variables
}
data %>%
group_by_at(.vars = variables) %>%
summarise_at(
.vars = values,
.funs = list(
Min. = min,
`1st Qu.` = ~ quantile(x = ., probs = 0.25),
Median = median,
Mean = mean,
`3rd Qu.` = ~ quantile(x = ., probs = 0.75),
Max. = max
)
) %>%
rename_at(.vars = variables,
.funs = ~ names)
}
Output:
data_summary(NewSKUMatrix,
c('LagerID'),
c('Price'),
c('SKU'))
# A tibble: 32,454 x 7
SKU Min. `1st Qu.` Median Mean `3rd Qu.` Max.
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 17 39.0 39.0 39.0 39.0 39.0 39.0
2 18 120. 120. 120. 121. 120. 140.
3 21 289. 289. 289. 289. 289. 289.
4 24 37.0 37.0 37.0 45.2 45.2 70.0
5 25 14.0 14.0 14.0 14.0 14.0 14.0
6 55 30.9 30.9 30.9 30.9 30.9 30.9
7 117 26.9 26.9 26.9 26.9 26.9 26.9
8 118 24.8 24.9 24.9 25.1 25.1 25.7
9 119 24.8 24.8 24.9 25.1 25.3 25.7
10 158 104. 108. 108. 107. 108. 108.
# ... with 32,444 more rows
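For readers without dplyr, a roughly comparable grouped summary can be sketched in base R with aggregate(). This is a swapped-in technique, not the function above: summarise_by_group and six_numbers are hypothetical names, and the built-in mtcars data stands in for NewSKUMatrix.

```r
# Base-R sketch: six summary statistics of one column per group
summarise_by_group <- function(data, group, value) {
  six_numbers <- function(x) {
    c(Min    = min(x),
      Q1     = quantile(x, 0.25, names = FALSE),
      Median = median(x),
      Mean   = mean(x),
      Q3     = quantile(x, 0.75, names = FALSE),
      Max    = max(x))
  }
  # aggregate() stores the six statistics as a matrix column called "x"
  out <- aggregate(data[[value]], by = data[group], FUN = six_numbers)
  # Flatten that matrix into ordinary columns next to the grouping column
  cbind(out[group], as.data.frame(out$x))
}

res <- summarise_by_group(mtcars, "cyl", "mpg")
res
```

Unlike the dplyr version, this sketch handles a single grouping column and a single value column at a time.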
I am trying to calculate the pseudo-median (Hodges-Lehmann estimator) calculated from the wilcox.test function in R between several different columns in my dataset (8 separate fields/columns).
> head(fantasyproj2)
V1 V2 V3 V4 V5 V6 V7 V8
1 25 25.87 20.35 26.65 27.20 27.0 17.970 27.70
2 24 27.48 19.81 19.57 22.20 27.5 16.350 20.04
3 22 19.89 17.62 21.99 19.12 26.0 22.484 23.70
4 21 17.72 16.09 15.55 18.60 18.5 17.450 14.59
5 21 21.56 17.90 18.46 20.80 23.0 16.540 19.76
6 20 NA 17.73 15.84 19.00 20.5 19.080 15.05
They are all numeric and include some missing values (the empty cells have been changed to NA in R). When I run the code:
fantasyproj$hodges.lehmann <- apply(fantasyproj2[,c(1,2,3,4,5,6,7,8)],1, function(x) wilcox.test(x, conf.int=TRUE, na.action=na.exclude)$estimate)
I get the error:
Error in uniroot(wdiff, c(mumin, mumax), tol = 1e-04, zq = qnorm(alpha/2, : f() values at end points not of opposite sign
There is not much literature on this error out there, except where it says to add the argument exact = TRUE to the statement, but that does not help. Any help would be greatly appreciated!
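One possible workaround, offered as a sketch rather than a diagnosis of the uniroot error: the one-sample Hodges-Lehmann estimator is the median of the Walsh averages (x_i + x_j) / 2 for i <= j, so it can be computed directly per row without calling wilcox.test. The helper name hodges_lehmann is hypothetical, and the data frame below only reproduces the head() output shown above.

```r
# Pseudo-median of one vector, ignoring missing values
hodges_lehmann <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) return(NA_real_)
  walsh <- outer(x, x, "+") / 2                 # all pairwise averages
  median(walsh[upper.tri(walsh, diag = TRUE)])  # keep pairs with i <= j
}

# First six rows from the question
fantasyproj2 <- data.frame(
  V1 = c(25, 24, 22, 21, 21, 20),
  V2 = c(25.87, 27.48, 19.89, 17.72, 21.56, NA),
  V3 = c(20.35, 19.81, 17.62, 16.09, 17.90, 17.73),
  V4 = c(26.65, 19.57, 21.99, 15.55, 18.46, 15.84),
  V5 = c(27.20, 22.20, 19.12, 18.60, 20.80, 19.00),
  V6 = c(27.0, 27.5, 26.0, 18.5, 23.0, 20.5),
  V7 = c(17.970, 16.350, 22.484, 17.450, 16.540, 19.080),
  V8 = c(27.70, 20.04, 23.70, 14.59, 19.76, 15.05)
)

# One estimate per row
hl <- apply(fantasyproj2, 1, hodges_lehmann)
hl
```

This gives the point estimate only; it does not produce the confidence interval that wilcox.test(conf.int = TRUE) would report.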
I have a data frame like the one you see here.
DRSi TP DOC DN date Turbidity Anions
158 5.9 3371 264 14/8/06 5.83 2246.02
217 4.7 2060 428 16/8/06 6.04 1632.29
181 10.6 1828 219 16/8/06 6.11 1005.00
397 5.3 1027 439 16/8/06 5.74 314.19
2204 81.2 11770 1827 15/8/06 9.64 2635.39
307 2.9 1954 589 15/8/06 6.12 2762.02
136 7.1 2712 157 14/8/06 5.83 2049.86
1502 15.3 4123 959 15/8/06 6.48 2648.12
1113 1.5 819 195 17/8/06 5.83 804.42
329 4.1 2264 434 16/8/06 6.19 2214.89
193 3.5 5691 251 17/8/06 5.64 1299.25
1152 3.5 2865 1075 15/8/06 5.66 2573.78
357 4.1 5664 509 16/8/06 6.06 1982.08
513 7.1 2485 586 15/8/06 6.24 2608.35
1645 6.5 4878 208 17/8/06 5.96 969.32
Before I got here I used the following code to remove the columns that had no values at all or some NAs.
rem = NULL
for(col.nr in 1:dim(E.3)[2]){
if(anyNA(E.3[, col.nr])){ # TRUE for columns with some NAs and for all-NA columns
rem = c(rem, col.nr)
}
}
E.4 <- E.3[, -rem]
Now I need to remove the "date" column but not based on its column name, rather based on the fact that it's a character string.
I've seen here (Remove an entire column from a data.frame in R) already how to simply set it to NULL and other options but I want to use a different argument.
First use is.character to find all columns with class character. However, make sure that your date column is really a character, not a Date or a factor. Otherwise test with is.factor, or with inherits(x, "Date") for Date columns, instead of is.character.
Then just subset the columns that are not characters in the data.frame, e.g.
df[, !sapply(df, is.character)]
I was having a similar problem, but the answer above doesn't resolve it for Date columns (which is what I needed), so I found another solution:
df[, !grepl("Date|factor|character", sapply(df, class))]
This returns your df without Date, character and factor columns. Using !grepl() rather than -grep() keeps all columns when nothing matches.
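A small sketch of both approaches on a cut-down version of the example data (three columns only; the names no_chr, drop_me and no_date are made up for illustration). It drops columns by type rather than by name, first when date is a character column and then after it has been converted to a Date.

```r
df <- data.frame(
  DRSi      = c(158, 217, 181),
  date      = c("14/8/06", "16/8/06", "16/8/06"),
  Turbidity = c(5.83, 6.04, 6.11),
  stringsAsFactors = FALSE
)

# Case 1: date is a character column
no_chr <- df[, !sapply(df, is.character), drop = FALSE]

# Case 2: date has been parsed into a Date; is.character() no longer matches
df$date <- as.Date(df$date, format = "%d/%m/%y")
drop_me <- sapply(df, function(col) inherits(col, c("Date", "factor", "character")))
no_date <- df[, !drop_me, drop = FALSE]

names(no_chr)
names(no_date)
```

drop = FALSE keeps the result a data frame even when only one column is left.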
I have a data that looks like this.
Name|ID|p72|p78|p51|p49|c36.1|c32.1|c32.2|c36.2|c37
hsa-let-7a-5p|MIMAT0000062|9.1|38|12.7|185|8|4.53333333333333|17.9|23|63.3
hsa-let-7b-5p|MIMAT0000063|11.3|58.6|27.5|165.6|20.4|8.5|21|30.2|92.6
hsa-let-7c|MIMAT0000064|7.8|40.2|9.6|147.8|11.8|4.53333333333333|15.4|17.7|62.3
hsa-let-7d-5p|MIMAT0000065|4.53333333333333|27.7|13.4|158.1|8.5|4.53333333333333|14.2|13.5|50.5
hsa-let-7e-5p|MIMAT0000066|6.2|4.53333333333333|4.53333333333333|28|4.53333333333333|4.53333333333333|5.6|4.7|12.8
hsa-let-7f-5p|MIMAT0000067|4.53333333333333|4.53333333333333|4.53333333333333|78.2|4.53333333333333|4.53333333333333|6.8|4.53333333333333|8.9
hsa-miR-15a-5p|MIMAT0000068|4.53333333333333|70.3|10.3|147.6|4.53333333333333|4.53333333333333|21.1|30.2|100.8
hsa-miR-16-5p|MIMAT0000069|9.5|562.6|60.5|757|25.1|4.53333333333333|89.4|142.9|613.9
hsa-miR-17-5p|MIMAT0000070|10.5|71.6|27.4|335.1|6.3|10.1|51|51|187.1
hsa-miR-17-3p|MIMAT0000071|4.53333333333333|4.53333333333333|4.53333333333333|17.2|4.53333333333333|4.53333333333333|9.5|4.53333333333333|7.3
hsa-miR-18a-5p|MIMAT0000072|4.53333333333333|14.6|4.53333333333333|53.4|4.53333333333333|4.53333333333333|9.5|25.5|29.7
hsa-miR-19a-3p|MIMAT0000073|4.53333333333333|11.6|4.53333333333333|42.8|4.53333333333333|4.53333333333333|4.53333333333333|5.5|17.9
hsa-miR-19b-3p|MIMAT0000074|8.3|93.3|15.8|248.3|4.53333333333333|6.3|44.7|53.2|135
hsa-miR-20a-5p|MIMAT0000075|4.53333333333333|75.2|23.4|255.7|6.6|4.53333333333333|43.8|38|130.3
hsa-miR-21-5p|MIMAT0000076|6.2|19.7|18|299.5|6.8|4.53333333333333|49.9|68.5|48
hsa-miR-22-3p|MIMAT0000077|40.4|128.4|65.4|547.1|56.5|33.4|104.9|84.1|248.3
hsa-miR-23a-3p|MIMAT0000078|58.3|99.3|58.6|617.9|36.6|21.4|107.1|125.5|120.9
hsa-miR-24-1-5p|MIMAT0000079|4.53333333333333|4.53333333333333|4.53333333333333|9.2|4.53333333333333|4.53333333333333|4.53333333333333|4.9|4.53333333333333
hsa-miR-24-3p|MIMAT0000080|638.2|286.9|379.5|394.4|307.8|240.4|186|234.2|564
What I want to do is simply pick the rows where all the values are greater than 10.
> dat<-read.delim("http://dpaste.com/1215552/plain/",sep="|",na.strings="",header=TRUE,blank.lines.skip=TRUE,fill=FALSE)
But why does this code of mine only report the last one?
The data clearly show that more rows satisfy this condition.
> dat[apply(dat[, -1], MARGIN = 1, function(x) all(x > 10)), ]
Name ID p72 p78 p51 p49 c36.1 c32.1 c32.2 c36.2 c37
19 hsa-miR-24-3p MIMAT0000080 638.2 286.9 379.5 394.4 307.8 240.4 186 234.2 564
What is the right way to do it?
Update:
alexwhan's solution works. But I wonder how I can generalize his approach
so that it can handle data with missing values (NA):
dat<-read.delim("http://dpaste.com/1215354/plain/",sep="\t",na.strings="",header=FALSE,blank.lines.skip=TRUE,fill=FALSE)
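Since you're including your ID column (which is a factor) in the all(), it's getting messed up: apply() converts the data frame to a matrix, and with a non-numeric column present every value becomes a string, so x > 10 is a string comparison. Try: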
dat[apply(dat[, -c(1,2)], MARGIN = 1, function(x) all(x > 10)), ]
# Name ID p72 p78 p51 p49 c36.1 c32.1 c32.2 c36.2 c37
# 16 hsa-miR-22-3p MIMAT0000077 40.4 128.4 65.4 547.1 56.5 33.4 104.9 84.1 248.3
# 17 hsa-miR-23a-3p MIMAT0000078 58.3 99.3 58.6 617.9 36.6 21.4 107.1 125.5 120.9
# 19 hsa-miR-24-3p MIMAT0000080 638.2 286.9 379.5 394.4 307.8 240.4 186.0 234.2 564.0
EDIT
For the case where you have NA, you can just use the na.rm argument of all(), which ignores the missing values in each row. Using your new data (from the comment):
dat<-read.delim("http://dpaste.com/1215354/plain/",sep="\t",na.strings="",header=FALSE,blank.lines.skip=TRUE,fill=FALSE)
dat[apply(dat[, -c(1,2)], MARGIN = 1, function(x) all(x > 10, na.rm = T)), ]
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
# 7 hsa-miR-15a-5p MIMAT0000068 NA 70.3 10.3 147.6 NA NA 21.1 30.2 100.8
# 16 hsa-miR-22-3p MIMAT0000077 40.4 128.4 65.4 547.1 56.5 33.4 104.9 84.1 248.3
# 17 hsa-miR-23a-3p MIMAT0000078 58.3 99.3 58.6 617.9 36.6 21.4 107.1 125.5 120.9
# 19 hsa-miR-24-3p MIMAT0000080 638.2 286.9 379.5 394.4 307.8 240.4 186.0 234.2 564.0
# 20 hsa-miR-25-3p MIMAT0000081 19.3 78.6 25.6 84.3 14.9 16.9 19.1 27.2 113.8
# 21 hsa-miR-26a-5p MIMAT0000082 NA 22.8 31.0 561.2 12.4 NA 67.0 55.8 48.9
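To make the effect of na.rm concrete, here is a small sketch on made-up data (the values and the names num_cols, keep_ignore_na and keep_strict are assumptions for illustration). It selects the numeric columns by type instead of by position, then shows two policies: ignoring NAs, as above, or requiring every value to be present and greater than 10.

```r
dat <- data.frame(
  Name = c("a", "b", "c", "d"),
  ID   = c("id1", "id2", "id3", "id4"),
  p1   = c(9.1, 40.4, NA, 58.3),
  p2   = c(38, 128.4, 70.3, 4.5),
  stringsAsFactors = FALSE
)

# Work on numeric columns only, so no string comparison can sneak in
num_cols <- sapply(dat, is.numeric)

# Policy 1: NAs are ignored; a row passes if all observed values exceed 10
keep_ignore_na <- apply(dat[, num_cols], 1,
                        function(x) all(x > 10, na.rm = TRUE))

# Policy 2: NAs count as failures; every numeric cell must exceed 10
keep_strict <- rowSums(dat[, num_cols] > 10, na.rm = TRUE) == sum(num_cols)

dat[keep_ignore_na, ]  # rows "b" and "c"
dat[keep_strict, ]     # row "b" only
```

Which policy is appropriate depends on whether a missing measurement should disqualify a row.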
Another idea is to transform your data to long format (or molten format). I think it is an even better way to avoid the missing values problem:
library(reshape2)
dat.m <- melt(dat,id.vars=c('Name','ID'))
dat.m$value <- as.numeric(dat.m$value)
library(plyr)
res <- ddply(dat.m,.(Name,ID), summarise, keepme = all(value > 10))
res[res$keepme,]
# Name ID keepme
# 16 hsa-miR-22-3p MIMAT0000077 TRUE
# 17 hsa-miR-23a-3p MIMAT0000078 TRUE
# 19 hsa-miR-24-3p MIMAT0000080 TRUE