Delete a row when the first column contains a value equal to zero [closed] - unix

I have a data file with 5 columns and N rows, and I want to delete every row whose first column equals zero (0.00). For example, this is the original file:
-1.3 -2.00 -3.00 4.00 9.00
0.10 -0.20 -0.80 4.50 1.70
0.00 -3.40 -6.80 5.60 9.30
-0.4 -3.20 -4.70 0.80 -0.9
1.03 -2.00 -3.00 4.00 9.00
0.00 -6.80 -9.30 3.40 5.60
0.00 -4.70 -0.80 8.90 -0.3
And this is the file that I want to get:
-1.3 -2.00 -3.00 4.00 9.00
0.10 -0.20 -0.80 4.50 1.70
-0.4 -3.20 -4.70 0.80 -0.9
1.03 -2.00 -3.00 4.00 9.00
Please help me

I would use GNU AWK for this task in the following way. Let file.txt content be
-1.3 -2.00 -3.00 4.00 9.00
0.10 -0.20 -0.80 4.50 1.70
0.00 -3.40 -6.80 5.60 9.30
-0.4 -3.20 -4.70 0.80 -0.9
1.03 -2.00 -3.00 4.00 9.00
0.00 -6.80 -9.30 3.40 5.60
0.00 -4.70 -0.80 8.90 -0.3
then
awk '!($1=="0.00")' file.txt
gives output
-1.3 -2.00 -3.00 4.00 9.00
0.10 -0.20 -0.80 4.50 1.70
-0.4 -3.20 -4.70 0.80 -0.9
1.03 -2.00 -3.00 4.00 9.00
Explanation: the single condition describes the lines you want to keep: not (!) the first field ($1) being equal (==) to the string 0.00 ("0.00").
(tested in gawk 4.2.1)

Related

How to read multiple data with changing NA patterns?

I am trying to read in some measurement data with the following code:
UGT2008 <- rbind.fill(lapply(filelist[1:70], fread, header = F, dec = ".", sep = "\t", na.strings = c("NA", "%-", "%"), skip = 1L, stringsAsFactors = FALSE))
I use this code because I have multiple files that I want to bind together into one big dataset.
The problem is that bad values, which should be treated as NA, are marked with "%" at the beginning of the value. So NA is not one single string, but a set of different numbers starting with "%".
There are too many different bad values to list them all in na.strings.
After reading the data, all columns are character, but should be numeric.
The data look like this.
Datum Zeit Temp1/grad Temp2/grad Pyrr+/W/qm Pyrr-/W/qm Global/W/qm H-Flux/W/qm Windr./grad Pegel/cmWS Phar/µmol Bspg./V Widerst./kOhm Blattb./j/n
18.12.00 09:55 2.64 -98.14 -42.34 47.23 68.14 7.44 341.08 0.15 151.76 11.08 %-2546.78 1.00
18.12.00 09:56 2.63 -98.13 -19.07 47.04 65.36 7.31 346.73 0.02 151.28 11.06 %-2546.78 1.00
18.12.00 09:57 2.62 -98.14 -43.73 44.92 64.32 7.36 353.86 -0.01 147.53 11.07 %-2546.78 1.00
18.12.00 09:58 2.75 -98.18 -43.83 44.21 63.42 7.40 360.33 0.12 143.96 11.10 %-2546.78 1.00
18.12.00 09:59 2.65 -98.12 -43.53 43.60 63.42 7.40 356.76 0.12 144.44 11.08 %-2546.78 1.00
18.12.00 10:00 2.74 -98.18 -43.70 43.67 63.42 7.40 359.96 0.13 144.73 11.10 %-2546.78 1.00
18.12.00 10:01 2.62 -98.14 -44.24 42.90 61.00 7.57 3.66 0.14 139.34 11.16 %-2546.78 1.00
18.12.00 10:02 2.62 -98.12 -44.34 40.52 58.08 7.06 356.71 0.00 136.45 11.08 %-2546.78 1.00
18.12.00 10:03 2.74 -98.18 -46.03 41.04 59.19 7.53 360.35 0.14 135.87 11.12 %-2546.78 1.00
18.12.00 10:04 2.63 -98.12 -44.64 42.35 60.86 7.31 347.55 0.13 140.11 11.12 %-2546.78 1.00
18.12.00 10:05 2.62 -98.13 -20.39 43.54 60.37 7.14 361.00 -0.02 144.35 11.09 %-2546.78 1.00
18.12.00 10:06 2.72 -98.18 -45.32 41.20 58.92 7.36 353.24 0.13 135.77 11.13 %-2546.78 1.00
18.12.00 10:07 2.73 -98.18 -45.56 40.91 57.88 7.36 356.10 0.10 134.04 11.13 %-2546.78 1.00
18.12.00 10:08 2.62 -98.12 -43.05 41.94 58.85 7.01 6.54 0.01 140.11 11.14 %-2546.78 1.00
18.12.00 10:09 2.63 -98.14 -43.90 43.06 60.72 7.23 338.23 -0.01 144.25 11.10 %-2546.78 1.00
18.12.00 10:10 2.62 -98.13 -43.86 43.48 61.27 7.23 356.67 -0.01 145.12 11.10 %-2546.78 1.00
18.12.00 10:11 2.63 -98.13 -44.13 42.74 59.26 7.19 360.77 -0.01 141.07 11.11 %-2546.78 1.00
18.12.00 10:12 2.62 -98.12 -45.18 41.39 58.43 7.36 360.31 0.13 136.06 11.15 %-2546.78 1.00
18.12.00 10:13 2.61 -98.18 -31.82 40.72 58.08 7.36 0.85 0.00 140.20 11.14 %-2546.78 1.00
18.12.00 10:14 2.63 -98.13 -44.88 41.42 59.12 7.53 6.60 0.09 139.53 11.20 %-2546.78 1.00
18.12.00 10:15 2.62 -98.11 -43.29 41.71 59.82 7.10 10.82 0.00 143.77 11.16 %-2546.78 1.00
18.12.00 10:16 2.62 -98.12 -43.05 43.99 64.32 7.31 7.32 0.12 151.09 11.20 %-2546.78 1.00
18.12.00 10:17 2.74 -98.18 -40.82 48.32 71.39 7.36 156.24 0.11 166.50 11.19 %-2546.78 1.00
18.12.00 10:18 2.61 -98.18 -38.28 52.98 74.17 7.06 188.01 0.01 178.16 11.16 %-2546.78 1.00
18.12.00 10:19 2.62 -98.13 -37.61 53.94 76.94 7.44 142.70 0.12 179.41 11.22 %-2546.78 1.00
18.12.00 10:20 2.63 -98.12 -37.40 53.49 76.04 7.40 305.02 0.11 179.51 11.21 %-2546.78 1.00
18.12.00 10:21 2.63 -98.14 -38.52 52.27 73.89 7.31 312.70 -0.01 179.61 11.20 %-2546.78 1.00
18.12.00 10:22 2.63 -98.13 -14.97 52.82 71.60 7.06 280.18 -0.01 176.72 11.20 %-2546.78 1.00
I tried
na.strings = c("NA",grepl("^","%"))
but that's not working.
na.strings = c("NA",patter=("%*"))
is also not working.
Do you have any idea how to set changing patterns for na.strings, or how to identify "%" as the start of an NA value?
Cheers,
Florian
Data:
df <- data.frame(
v1 = c("%123", "123", "5.5"),
v2 = c("45.00006", "%-12.8899", "%900.77"),
v3 = c("55", "66", "%432.002")
)
Solution:
Use as.numeric and lapply to accomplish both goals: turning the %-prefixed values into NA and converting the columns of the data frame to numeric:
df[] <- lapply(df, as.numeric)
v1 v2 v3
[1,] NA 45.00006 55
[2,] 123.0 NA 66
[3,] 5.5 NA NA
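As a side note (this is my own sketch, not part of the original answer), the same idea can be folded back into the fread workflow from the question: read the files as before, then coerce only the measurement columns. The file list and column positions here are assumptions:
library(data.table)
library(plyr)

filelist <- list.files(pattern = "\\.txt$", full.names = TRUE)   # hypothetical file list

# read as before; the "%..." entries force the affected columns in as character
UGT2008 <- rbind.fill(lapply(filelist, fread, header = FALSE, dec = ".", sep = "\t",
                             na.strings = "NA", skip = 1L, stringsAsFactors = FALSE))

# coerce the measurement columns (assumed here to be column 3 onwards; the first two
# are date and time) to numeric: anything that is not a valid number, such as
# "%-2546.78", becomes NA, and suppressWarnings() hides the coercion messages
num_cols <- 3:ncol(UGT2008)
UGT2008[num_cols] <- suppressWarnings(lapply(UGT2008[num_cols], as.numeric))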

Is there an R function I could use to create a clustered column chart from an imported CSV dataset using ggplot2?

I want to plot a stacked column chart using ggplot2, with R1, R2 and R3 as the y variables while the variety names remain on the x axis.
I have tried it in Excel and it worked, but I decided to import the dataset in CSV format into R for a more polished look, as this is part of my final year project.
varieties R1 R2 R3 Relative.yield SD
1 bd 0.40 2.65 1.45 1.50 1.13
2 bdj1 4.60 NA 2.80 3.70 1.27
3 bdj2 2.40 1.90 0.50 1.60 0.98
4 bdj3 2.40 1.65 5.20 3.08 1.87
5 challenge 2.10 5.15 1.35 2.87 2.01
6 doris 4.20 2.50 2.55 3.08 0.97
7 fel 0.80 2.40 0.75 1.32 0.94
8 fel2 NA 0.70 1.90 1.30 0.85
9 felbv 0.10 2.95 2.05 1.70 1.46
10 felnn 1.50 4.05 1.25 2.27 1.55
11 lad1 0.55 2.20 0.20 0.98 1.07
12 lad2 0.50 NA 0.50 0.50 0.00
13 lad3 1.10 3.90 1.00 2.00 1.65
14 lad4 1.50 1.65 0.50 1.22 0.63
15 molete1 2.60 1.80 2.75 2.38 0.51
16 molete2 1.70 4.70 4.20 3.53 1.61
17 mother's delight 0.10 4.00 1.90 2.00 1.95
18 ojaoba1a 1.90 3.45 2.75 2.70 0.78
19 ojaoba1b 4.20 2.75 4.30 3.75 0.87
20 ojoo 2.80 NA 3.60 3.20 0.57
21 omini 0.20 0.30 0.25 0.25 0.05
22 papa1 2.20 6.40 3.55 4.05 2.14
23 pk5 1.00 2.75 1.10 1.62 0.98
24 pk6 2.30 1.30 3.10 2.23 0.90
25 sango1a 0.40 0.90 1.55 0.95 0.58
26 sango1b 2.60 5.10 3.15 3.62 1.31
27 sango2a 0.50 0.55 0.75 0.60 0.13
28 sango2b 2.95 NA 2.60 2.78 0.25
29 usman 0.60 3.50 1.20 1.77 1.53
30 yau 0.05 0.85 0.20 0.37 0.43
> barplot(yield$R1)
> barplot(yield$Relative.yield)
> barplot(yield$Relative.yield, names.arg = varieties)
Error in barplot.default(yield$Relative.yield, names.arg = varieties) :
object 'varieties' not found
> ggplot(data = yield, mapping = aes(x = varieties, y = yield[,2:4])) + geom_()
Error in geom_() : could not find function "geom_"
> ggplot(data = yield, mapping = aes(x = varieties, y = yield[,2:4])) + geom()
Error in geom() : could not find function "geom"
You should put it in long format first; tidyr::gather provides this functionality:
library(tidyverse)
gather(df[1:4], R, value, R1:R3) %>%
  ggplot(aes(varieties, value, fill = R)) + geom_col()
#> Warning: Removed 5 rows containing missing values (position_stack).
data
df <- read.table(h=T,strin=F,text=
" varieties R1 R2 R3 Relative.yield SD
1 bd 0.40 2.65 1.45 1.50 1.13
2 bdj1 4.60 NA 2.80 3.70 1.27
3 bdj2 2.40 1.90 0.50 1.60 0.98
4 bdj3 2.40 1.65 5.20 3.08 1.87
5 challenge 2.10 5.15 1.35 2.87 2.01
6 doris 4.20 2.50 2.55 3.08 0.97
7 fel 0.80 2.40 0.75 1.32 0.94
8 fel2 NA 0.70 1.90 1.30 0.85
9 felbv 0.10 2.95 2.05 1.70 1.46
10 felnn 1.50 4.05 1.25 2.27 1.55
11 lad1 0.55 2.20 0.20 0.98 1.07
12 lad2 0.50 NA 0.50 0.50 0.00
13 lad3 1.10 3.90 1.00 2.00 1.65
14 lad4 1.50 1.65 0.50 1.22 0.63
15 molete1 2.60 1.80 2.75 2.38 0.51
16 molete2 1.70 4.70 4.20 3.53 1.61
17 'mother\\'s delight' 0.10 4.00 1.90 2.00 1.95
18 ojaoba1a 1.90 3.45 2.75 2.70 0.78
19 ojaoba1b 4.20 2.75 4.30 3.75 0.87
20 ojoo 2.80 NA 3.60 3.20 0.57
21 omini 0.20 0.30 0.25 0.25 0.05
22 papa1 2.20 6.40 3.55 4.05 2.14
23 pk5 1.00 2.75 1.10 1.62 0.98
24 pk6 2.30 1.30 3.10 2.23 0.90
25 sango1a 0.40 0.90 1.55 0.95 0.58
26 sango1b 2.60 5.10 3.15 3.62 1.31
27 sango2a 0.50 0.55 0.75 0.60 0.13
28 sango2b 2.95 NA 2.60 2.78 0.25
29 usman 0.60 3.50 1.20 1.77 1.53
30 yau 0.05 0.85 0.20 0.37 0.43"
)
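As a side note (not from the original answer), gather() has since been superseded in tidyr by pivot_longer(); an equivalent call would look like this:
library(tidyverse)

df %>%
  select(varieties, R1:R3) %>%
  pivot_longer(R1:R3, names_to = "R", values_to = "value") %>%
  ggplot(aes(varieties, value, fill = R)) +
  geom_col()
As with the gather() version, the rows containing NA are dropped by the stacking step with a warning.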

How to run a Shapiro test on multiple columns of a data frame, avoiding two errors: "all 'x' values are identical" and "missing value where TRUE/FALSE needed"?

I have a dataframe like this:
head(Betula, 10)
year start Start_DayOfYear end End_DayOfYear duration DateMax Max_DayOfYear BetulaPollenMax SPI Jan.NAO Jan.AO
1 1997 <NA> NA <NA> NA NA <NA> NA NA NA -0.49 -0.46
2 1998 <NA> 143 <NA> 184 41 <NA> 146 42 361 0.39 -2.08
3 1999 <NA> 148 <NA> 188 40 <NA> 158 32 149 0.77 0.11
4 2000 <NA> 135 <NA> 197 62 <NA> 156 173 917 0.60 1.27
5 2001 <NA> 143 <NA> 175 32 <NA> 154 113 457 0.25 -0.96
Jan.SO Feb.NAO Feb.AO Feb.SO Mar.NAO Mar.AO Mar.SO Apr.NAO Apr.AO Apr.SO DecJanFebMarApr.NAO DecJanFebMar.NAO
1 0.5 1.70 1.89 1.7 1.46 1.09 -0.4 -1.02 0.32 -0.6 0.14 0.43
2 -2.7 -0.11 -0.18 -2.0 0.87 -0.25 -2.4 -0.68 -0.04 -1.4 0.27 0.51
3 1.8 0.29 0.48 1.0 0.23 -1.49 1.3 -0.95 0.28 1.4 0.39 0.73
4 0.7 1.70 1.08 1.7 0.77 -0.45 1.3 -0.03 -0.28 1.2 0.49 0.62
5 1.0 0.45 -0.62 1.7 -1.26 -1.69 0.9 0.00 0.91 0.2 -0.28 -0.35
DecJanFeb.NAO DecJan.NAO JanFebMarApr.NAO JanFebMar.NAO JanFeb.NAO FebMarApr.NAO FebMar.NAO MarApr.NAO
1 0.08 -0.73 0.41 0.89 0.61 0.71 1.58 0.22
2 0.38 0.63 0.12 0.38 0.14 0.03 0.38 0.10
3 0.89 1.19 0.09 0.43 0.53 -0.14 0.26 -0.36
4 0.57 0.01 0.76 1.02 1.15 0.81 1.24 0.37
5 -0.04 -0.29 -0.14 -0.19 0.35 -0.27 -0.41 -0.63
DecJanFebMarApr.AO DecJanFebMar.AO DecJanFeb.AO DecJan.AO JanFebMarApr.AO JanFebMar.AO JanFeb.AO FebMarApr.AO
1 0.55 0.61 0.45 -0.27 0.71 0.84 0.72 1.10
2 -0.24 -0.29 -0.30 -0.37 -0.64 -0.84 -1.13 -0.16
3 0.08 0.04 0.54 0.58 -0.16 -0.30 0.30 -0.24
4 -0.15 -0.11 0.00 -0.54 0.41 0.63 1.18 0.12
5 -0.74 -1.15 -0.97 -1.14 -0.59 -1.09 -0.79 -0.47
FebMar.AO MarApr.AO DecJanFebMarApr.SO DecJanFebMar.SO DecJanFeb.SO DecJan.SO JanFebMarApr.SO JanFebMar.SO
1 1.49 0.71 0.04 0.20 0.40 -0.25 0.30 0.60
2 -0.22 -0.15 -1.42 -1.43 -1.10 -0.65 -2.13 -2.37
3 -0.51 -0.61 1.38 1.38 1.40 1.60 1.38 1.37
4 0.32 -0.37 1.14 1.13 1.07 0.75 1.23 1.23
5 -1.16 -0.39 0.60 0.70 0.63 0.10 0.95 1.20
JanFeb.SO FebMarApr.SO FebMar.SO MarApr.SO TmaxAprI TminAprI TmeanAprI RainfallAprI HumidityAprI SunshineAprI
1 1.10 0.23 0.65 -0.50 3.27 -3.86 -0.44 0.82 76.3 3.45
2 -2.35 -1.93 -2.20 -1.90 4.52 -3.28 -0.15 0.12 73.5 7.12
3 1.40 1.23 1.15 1.35 4.11 -3.86 -0.34 1.32 78.4 4.85
4 1.20 1.40 1.50 1.25 6.11 -1.31 1.93 0.80 71.9 4.20
5 1.35 0.93 1.30 0.55 1.46 -2.37 -1.04 2.83 84.4 1.21
CloudAprI WindAprI SeeLevelPressureAprI TmaxAprII TminAprII TmeanAprII RainfallAprII HumidityAprII
1 6.30 5.26 1008.63 12.12 2.11 6.17 0.23 76.5
2 3.93 3.86 1022.39 5.57 -0.44 1.82 0.83 77.9
3 5.02 3.23 1007.09 0.20 -6.36 -3.23 2.63 82.5
4 6.15 5.13 1012.21 2.74 -4.88 -2.35 0.34 76.0
5 7.50 3.90 1009.50 6.75 -3.22 1.16 0.32 71.5
SunshineAprII CloudAprII WindAprII SeeLevelPressureAprII TmaxAprIII TminAprIII TmeanAprIII RainfallAprIII
1 3.12 6.53 5.19 1024.31 7.35 0.33 3.37 0.33
2 2.41 6.85 3.70 1012.01 6.34 0.76 2.69 2.01
3 4.99 5.87 6.23 1019.66 8.65 0.73 4.23 0.70
4 6.63 5.17 5.84 1022.62 5.84 -1.81 2.02 0.00
5 6.11 4.82 3.92 1018.81 8.47 1.02 4.17 1.09
HumidityAprIII SunshineAprIII CloudAprIII WindAprIII SeeLevelPressureAprIII TmaxDecI TminDecI TmeanDecI
1 75.0 3.73 6.40 4.08 1009.91 -0.90 -5.88 -3.67
2 83.5 1.52 7.31 4.66 1008.33 5.33 0.01 2.46
3 73.4 6.62 5.12 3.16 1017.01 -0.24 -6.93 -3.64
4 69.0 8.80 4.80 4.99 1021.18 4.67 1.86 2.79
5 72.7 5.33 5.41 4.27 1005.48 3.69 -1.43 1.65
RainfallDecI HumidityDecI SunshineDecI CloudDecI WindDecI SeeLevelPressureDecI TmaxDecII TminDecII TmeanDecII
1 0.12 77.3 0.22 5.08 3.49 1003.15 7.99 0.77 4.10
2 1.10 73.5 0.04 6.29 5.21 999.94 0.24 -4.74 -2.67
3 2.41 82.3 0.00 6.70 4.92 998.64 1.22 -5.90 -2.05
4 3.13 88.1 0.00 7.97 4.00 997.82 2.76 -3.89 -0.54
5 1.60 79.1 0.07 5.44 5.76 996.35 10.82 4.36 6.90
RainfallDecII HumidityDecII SunshineDecII CloudDecII WindDecII SeeLevelPressureDecII TmaxDecIII TminDecIII
1 1.90 71.3 0 4.96 5.55 1007.16 4.78 -2.12
2 4.34 82.2 0 7.03 6.06 998.02 2.07 -4.60
3 1.94 78.6 0 6.53 5.82 1008.33 2.09 -2.48
4 1.45 77.2 0 6.57 5.26 1005.11 -1.49 -8.37
5 1.15 66.6 0 5.74 5.47 1030.02 1.40 -7.34
TmeanDecIII RainfallDecIII HumidityDecIII SunshineDecIII CloudDecIII WindDecIII SeeLevelPressureDecIII TmaxFebI
1 1.15 3.96 82.36 0 6.01 4.02 991.60 -0.23
2 -0.51 4.10 81.18 0 6.67 3.91 986.52 0.79
3 -0.61 1.97 81.27 0 6.21 5.53 982.13 2.19
4 -5.28 1.26 79.64 0 6.11 4.22 1019.63 3.27
5 -3.45 1.19 82.18 0 6.20 4.77 1015.53 2.42
TminFebI TmeanFebI RainfallFebI HumidityFebI SunshineFebI CloudFebI WindFebI SeeLevelPressureFebI TmaxFebII
1 -6.67 -3.57 0.84 84.3 1.11 6.81 5.35 990.51 2.97
2 -7.79 -4.49 2.31 72.2 1.88 4.73 4.53 990.39 3.31
3 -4.14 -1.77 0.42 73.3 1.29 6.02 5.57 1007.67 1.55
4 -2.48 0.04 2.28 77.0 0.46 6.84 4.29 982.97 -1.24
5 -3.52 -0.74 1.98 81.5 0.76 5.78 4.93 1008.29 6.71
TminFebII TmeanFebII RainfallFebII HumidityFebII SunshineFebII CloudFebII WindFebII SeeLevelPressureFebII
1 -2.31 -0.10 1.44 82.2 1.07 6.45 4.42 980.59
2 -4.85 -0.99 3.84 75.0 2.54 5.91 5.05 999.98
3 -5.76 -2.44 2.89 75.3 0.40 6.95 5.82 990.44
4 -8.47 -4.65 3.33 83.1 0.63 6.55 4.95 1000.10
5 -0.25 3.01 1.38 66.1 1.16 6.18 6.28 1001.46
TmaxFebIII TminFebIII TmeanFebIII RainfallFebIII HumidityFebIII SunshineFebIII CloudFebIII WindFebIII
1 0.05 -6.01 -3.35 4.60 83.50 1.29 6.58 4.71
2 -0.45 -7.43 -4.51 2.93 78.38 1.00 6.91 5.99
3 2.13 -4.51 -1.21 2.90 79.38 2.51 5.76 5.46
4 0.59 -3.79 -1.92 5.94 88.33 1.40 6.86 6.70
5 -2.68 -7.23 -5.05 1.39 83.88 1.13 7.41 5.69
SeeLevelPressureFebIII TmaxJanI TminJanI TmeanJanI RainfallJanI HumidityJanI SunshineJanI CloudJanI WindJanI
1 980.25 0.38 -5.57 -3.36 0.01 82.9 0.27 3.45 2.97
2 997.71 4.29 -0.03 2.08 3.70 82.9 0.00 7.39 5.01
3 988.45 1.02 -4.47 -1.87 2.22 82.3 0.00 6.94 4.29
4 987.21 0.04 -6.28 -3.03 4.99 85.8 0.00 5.84 4.75
5 1023.84 -0.33 -5.11 -3.17 0.66 81.2 0.00 7.08 3.88
SeeLevelPressureJanI TmaxJanII TminJanII TmeanJanII RainfallJanII HumidityJanII SunshineJanII CloudJanII
1 1023.71 0.09 -6.48 -2.50 4.29 86.5 0.01 7.23
2 984.57 -0.34 -6.49 -3.61 2.74 80.2 0.23 6.99
3 1004.06 0.32 -5.59 -3.03 5.28 83.3 0.00 6.68
4 983.42 8.38 1.46 4.97 0.64 69.3 0.10 6.13
5 1010.31 7.35 3.00 5.09 1.27 66.3 0.03 6.19
WindJanII SeeLevelPressureJanII TmaxJanIII TminJanIII TmeanJanIII RainfallJanIII HumidityJanIII SunshineJanIII
1 5.42 998.88 5.66 -2.39 1.97 1.03 74.27 0.65
2 6.38 1011.44 3.84 -3.32 -0.37 0.70 73.55 0.55
3 6.24 980.15 4.33 -5.19 -0.59 2.23 76.64 0.69
4 6.44 1019.41 4.09 -2.67 0.05 2.18 71.73 0.42
5 6.74 1006.10 4.43 -0.86 1.58 1.91 80.09 0.20
CloudJanIII WindJanIII SeeLevelPressureJanIII TmaxMarI TminMarI TmeanMarI RainfallMarI HumidityMarI
1 6.47 7.59 1004.59 2.83 -3.60 -0.72 2.14 79.9
2 5.25 4.72 1019.95 -5.31 -12.52 -9.52 2.28 72.6
3 5.34 4.65 1001.66 -0.70 -6.67 -4.47 1.39 81.0
4 5.85 4.83 1007.23 0.10 -7.91 -3.98 2.36 80.2
5 6.53 3.63 992.53 -0.38 -4.59 -2.27 3.00 86.4
SunshineMarI CloudMarI WindMarI SeeLevelPressureMarI TmaxMarII TminMarII TmeanMarII RainfallMarII HumidityMarII
1 0.85 6.77 6.64 986.96 -1.48 -8.43 -5.58 1.09 81.0
2 2.92 5.91 4.68 1013.17 6.53 -1.81 2.56 0.43 65.5
3 2.40 5.71 4.02 1014.62 0.53 -5.17 -2.90 5.20 82.8
4 0.91 7.02 5.87 1006.64 5.32 -0.94 1.23 1.11 74.4
5 0.19 7.82 4.49 999.35 1.60 -4.29 -1.89 0.95 79.3
SunshineMarII CloudMarII WindMarII SeeLevelPressureMarII TmaxMarIII TminMarIII TmeanMarIII RainfallMarIII
1 2.12 5.51 3.93 1021.57 3.88 -1.95 0.55 1.42
2 2.25 6.29 6.11 1008.31 3.95 -2.46 -0.15 1.30
3 1.00 6.61 5.77 1006.63 -0.68 -6.60 -4.07 0.70
4 2.16 6.61 6.45 1003.23 5.49 -0.68 1.65 1.58
5 4.07 5.21 3.14 1017.24 -0.66 -7.21 -4.00 1.37
HumidityMarIII SunshineMarIII CloudMarIII WindMarIII SeeLevelPressureMarIII
1 80.45 2.80 6.13 4.03 995.31
2 72.09 3.98 5.99 5.14 1000.32
3 78.73 2.34 6.46 3.81 1005.67
4 74.64 2.85 6.54 6.34 1013.45
5 79.45 4.71 5.65 4.95 1010.47
[ reached 'max' / getOption("max.print") -- omitted 5 rows ]
And I would like to do the normality test for all columns at once. I tried
apply(x, shapiro.test)
Betula_shapiro <- apply(Betula, shapiro.test)
Error in FUN(X[[i]], ...) : is.numeric(x) is not TRUE
and it didn't work. I also tried this:
Betula <- apply(Betula[which(sapply(Betula, is.numeric))], 2, shapiro.test)
Error in FUN(newX[, i], ...) : all 'x' values are identical
f<-function(x){if(diff(range(x))==0)list()else shapiro.test(x)}
Betula <- apply(Betula[which(sapply(Betula, is.numeric))], 2, f)
Error in if (diff(range(x)) == 0) list() else shapiro.test(x) :
missing value where TRUE/FALSE needed
So I did:
Betula_numerics_only <- Betula[which(sapply(Betula, is.numeric))]
Selecting columns with at least 3 non-missing values and applying shapiro.test on them:
Betula_numerics_only_filled_columns <- Betula_numerics_only[which(apply(Betula_numerics_only, 2, function(f) sum(!is.na(f))>=3 ))]
Betula_shapiro<-apply(Betula_numerics_only_filled_columns, 2, shapiro.test)
Error in FUN(newX[, i], ...) : all 'x' values are identical
Could you please help me with this problem?
Since I was talking about readability in my comment, I felt I should provide something more readable as an answer.
Let's make some dummy data:
data_test <- data.frame(matrix(rnorm(100, 10, 1), ncol = 5, byrow = T), stringsAsFactors = F)
Let's apply shapiro.test to each column:
apply(data_test, 2, shapiro.test)
In case there are non-numeric columns:
Let's add a dummy character column for testing purposes
data_test$non_numeric <- sample(c("hello", "hi", "good morning"), NROW(data_test), replace = T)
and try to apply the test again
apply(data_test, 2, shapiro.test)
which results in:
> apply(data_test, 2, shapiro.test)
Error: is.numeric(x) is not TRUE
To solve this, we select only the numeric columns using sapply:
data_test[which(sapply(data_test, is.numeric))]
and combine it with the apply:
apply(data_test[which(sapply(data_test, is.numeric))], 2, shapiro.test)
Removing columns that are all NA:
data_test_numerics_only <- data_test[which(sapply(data_test, is.numeric))]
Selecting columns with at least 3 non-missing values and applying shapiro.test on them:
data_test_numerics_only_filled_colums = data_test_numerics_only[which(apply(data_test_numerics_only, 2, function(f) sum(!is.na(f)) >= 3))]
apply(data_test_numerics_only_filled_colums, 2, shapiro.test)
We will get this running, let's try once more :)
Remove non-numeric columns:
Betula_numerics <- Betula[which(sapply(Betula, is.numeric))]
Remove columns with fewer than 3 non-missing values:
Betula_numerics_filled <- Betula_numerics[which(apply(Betula_numerics, 2, function(f) sum(!is.na(f)) >= 3))]
Remove columns with zero variance
Betula_numerics_filled_not_constant <- Betula_numerics_filled [apply(Betula_numerics_filled , 2, function(f) var(f, na.rm = T) != 0)]
Shapiro.test and hope for the best :)
apply(Betula_numerics_filled_not_constant, 2, shapiro.test)
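For completeness, the three filters can be rolled into one small helper; a sketch (the function name and structure are mine, not from the original answer):
# keep only columns that are numeric, have between 3 and 5000 non-missing values
# (the range shapiro.test accepts), and are not constant, then test each of them
safe_shapiro <- function(df) {
  ok <- vapply(df, function(x) {
    is.numeric(x) &&
      sum(!is.na(x)) >= 3 && sum(!is.na(x)) <= 5000 &&
      var(x, na.rm = TRUE) > 0
  }, logical(1))
  lapply(df[ok], shapiro.test)
}

Betula_shapiro <- safe_shapiro(Betula)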

How to skip NA when applying geometric-mean function

I have the following data frame:
1 8.03 0.37 0.55 1.03 1.58 2.03 15.08 2.69 1.63 3.84 1.26 1.9692516
2 4.76 0.70 NA 0.12 1.62 3.30 3.24 2.92 0.35 0.49 0.42 NA
3 6.18 3.47 3.00 0.02 0.19 16.70 2.32 69.78 3.72 5.51 1.62 2.4812459
4 1.06 45.22 0.81 1.07 8.30 196.23 0.62 118.51 13.79 22.80 9.77 8.4296220
5 0.15 0.10 0.07 1.52 1.02 0.50 0.91 1.75 0.02 0.20 0.48 0.3094169
7 0.27 0.68 0.09 0.15 0.26 1.54 0.01 0.21 0.04 0.28 0.31 0.1819510
I want to calculate the geometric mean for each row. My code is:
dat <- read.csv("MXreport.csv")
if(any(dat$X18S > 25)){ print("Fail!") } else { print("Pass!")}
datpass <- subset(dat, dat$X18S <= 25)
gene <- datpass[, 42:52]
gm_mean <- function(x){ prod(x)^(1/length(x))}
gene$score <- apply(gene, 1, gm_mean)
head(gene)
I got this output after typing this code:
1 8.03 0.37 0.55 1.03 1.58 2.03 15.08 2.69 1.63 3.84 1.26 1.9692516
2 4.76 0.70 NA 0.12 1.62 3.30 3.24 2.92 0.35 0.49 0.42 NA
3 6.18 3.47 3.00 0.02 0.19 16.70 2.32 69.78 3.72 5.51 1.62 2.4812459
4 1.06 45.22 0.81 1.07 8.30 196.23 0.62 118.51 13.79 22.80 9.77 8.4296220
5 0.15 0.10 0.07 1.52 1.02 0.50 0.91 1.75 0.02 0.20 0.48 0.3094169
7 0.27 0.68 0.09 0.15 0.26 1.54 0.01 0.21 0.04 0.28 0.31 0.1819510
The problem is that I get NA after applying the geometric mean function to the rows that contain NA. How do I skip the NAs and still calculate the geometric mean for those rows?
When I used gene <- na.exclude(datpass[, 42:52]), it dropped the rows containing NA and did not calculate a geometric mean for them at all. That is not what I want. I want to calculate the geometric mean for the rows that contain NA as well. How do I do this?
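One way to handle this (a sketch, since no accepted answer appears here) is to drop the NAs inside the function itself, so each row's geometric mean is computed over its non-missing values only. This assumes all values are positive, as in the sample data:
gm_mean <- function(x) {
  x <- x[!is.na(x)]            # drop NAs so they don't poison the product
  prod(x)^(1 / length(x))      # exp(mean(log(x))) is an equivalent, overflow-safer form
}

gene$score <- apply(gene, 1, gm_mean)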

R: `unlist` using a lot of time when summing a subset of a matrix

I have a program which pulls data out of a MySQL database, decodes a pair of
binary columns, and then sums together a subset of the rows within the pair
of binary columns. Running the program on a sample data set takes 12-14 seconds,
with 9-10 of those taken up by unlist. I'm wondering if there is any way to
speed things up.
Structure of the table
The rows I'm getting from the database look like:
| array_length | mz_array | intensity_array |
|--------------+-----------------+-----------------|
| 98 | 00c077e66340... | 002091c37240... |
| 74 | c04a7c7340... | db87734000... |
where array_length is the number of little-endian doubles in the two arrays
(they are guaranteed to be the same length). So the first row has 98 doubles in
each of mz_array and intensity_array. array_length has a mean of 825 and a
median of 620 with 13,000 rows.
Decoding the binary arrays
Each row gets decoded by being passed to the following function. Once the binary
arrays have been decoded, array_length is no longer needed.
DecodeSpectrum <- function(array_length, mz_array, intensity_array) {
sapply(list(mz_array=mz_array, intensity_array=intensity_array),
readBin,
what="double",
endian="little",
n=array_length)
}
Summing the arrays
The next step is to sum the values in intensity_array, but only if their
corresponding entry in mz_array is within a certain window. The arrays are
ordered by mz_array, ascending. I am using the following function to sum up
the intensity_array values:
SumInWindow <- function(spectrum, lower, upper) {
sum(spectrum[spectrum[,1] > lower & spectrum[,1] < upper, 2])
}
Where spectrum is the output from DecodeSpectrum, a matrix.
Operating over list of rows
Each row is handled by:
ProcessSegment <- function(spectra, window_bounds) {
lower <- window_bounds[1]
upper <- window_bounds[2]
## Decode a single spectrum and sum the intensities within the window.
SumDecode <- function (...) {
SumInWindow(DecodeSpectrum(...), lower, upper)
}
do.call("mapply", c(SumDecode, spectra))
}
And finally, the rows are fetched and handed off to ProcessSegment with this
function:
ProcessAllSegments <- function(conn, window_bounds) {
nextSeg <- function() odbcFetchRows(conn, max=batchSize, buffsize=batchSize)
while ((res <- nextSeg())$stat == 1 && res$data[[1]] > 0) {
print(ProcessSegment(res$data, window_bounds))
}
}
I'm doing the fetches in segments so that R doesn't have to load the entire data
set into memory at once (it was causing out of memory errors). I'm using the
RODBC driver because the RMySQL driver isn't able to return unsullied binary
values (as far as I could tell).
Performance
For a sample data set of about 140MiB, the whole process takes around 14 seconds
to complete, which is not that bad for 13,000 rows. Still, I think there's room
for improvement, especially when looking at the Rprof output:
$by.self
self.time self.pct total.time total.pct
"unlist" 10.26 69.99 10.30 70.26
"SumInWindow" 1.06 7.23 13.92 94.95
"mapply" 0.48 3.27 14.44 98.50
"as.vector" 0.44 3.00 10.60 72.31
"array" 0.40 2.73 0.40 2.73
"FUN" 0.40 2.73 0.40 2.73
"list" 0.30 2.05 0.30 2.05
"<" 0.22 1.50 0.22 1.50
"unique" 0.18 1.23 0.36 2.46
">" 0.18 1.23 0.18 1.23
".Call" 0.16 1.09 0.16 1.09
"lapply" 0.14 0.95 0.86 5.87
"simplify2array" 0.10 0.68 11.48 78.31
"&" 0.10 0.68 0.10 0.68
"sapply" 0.06 0.41 12.36 84.31
"c" 0.06 0.41 0.06 0.41
"is.factor" 0.04 0.27 0.04 0.27
"match.fun" 0.04 0.27 0.04 0.27
"<Anonymous>" 0.02 0.14 13.94 95.09
"unique.default" 0.02 0.14 0.06 0.41
$by.total
total.time total.pct self.time self.pct
"ProcessAllSegments" 14.66 100.00 0.00 0.00
"do.call" 14.50 98.91 0.00 0.00
"ProcessSegment" 14.50 98.91 0.00 0.00
"mapply" 14.44 98.50 0.48 3.27
"<Anonymous>" 13.94 95.09 0.02 0.14
"SumInWindow" 13.92 94.95 1.06 7.23
"sapply" 12.36 84.31 0.06 0.41
"DecodeSpectrum" 12.36 84.31 0.00 0.00
"simplify2array" 11.48 78.31 0.10 0.68
"as.vector" 10.60 72.31 0.44 3.00
"unlist" 10.30 70.26 10.26 69.99
"lapply" 0.86 5.87 0.14 0.95
"array" 0.40 2.73 0.40 2.73
"FUN" 0.40 2.73 0.40 2.73
"unique" 0.36 2.46 0.18 1.23
"list" 0.30 2.05 0.30 2.05
"<" 0.22 1.50 0.22 1.50
">" 0.18 1.23 0.18 1.23
".Call" 0.16 1.09 0.16 1.09
"nextSeg" 0.16 1.09 0.00 0.00
"odbcFetchRows" 0.16 1.09 0.00 0.00
"&" 0.10 0.68 0.10 0.68
"c" 0.06 0.41 0.06 0.41
"unique.default" 0.06 0.41 0.02 0.14
"is.factor" 0.04 0.27 0.04 0.27
"match.fun" 0.04 0.27 0.04 0.27
$sample.interval
[1] 0.02
$sampling.time
[1] 14.66
I'm surprised to see unlist taking up so much time; this says to me that there
might be some redundant copying or rearranging going on. I'm new at R, so it's
entirely possible that this is normal, but I'd like to know if there's anything
glaringly wrong.
Update: sample data posted
I've posted the full version of the program
here and the sample data I use
here. The sample data is the
gzipped output from mysqldump. You need to set the proper environment
variables for the script to connect to the database:
MZDB_HOST
MZDB_DB
MZDB_USER
MZDB_PW
To run the script, you must specify the run_id and the window boundaries. I
run the program like this:
Rscript ChromatoGen.R -i 1 -m 600 -M 1200
These window bounds are pretty arbitrary, but select roughly a half to a third
of the range. If you want to print the results, put a print() around the call
to ProcessSegment within ProcessAllSegments. Using those parameters, the
first 5 should be:
[1] 7139.682 4522.314 3435.512 5255.024 5947.999
You probably want to limit the number of results, unless you want 13,000 numbers filling your screen :) The simplest way is to just add LIMIT 5 at the end of the query.
I've figured it out!
The problem was in the sapply() call. sapply does a fair amount of renaming and attribute setting, which slows things down massively for arrays of this size. Replacing DecodeSpectrum with the following code brought the sample time from 14.66 seconds down to 3.36 seconds, more than a 4-fold speedup!
Here's the new body of DecodeSpectrum:
DecodeSpectrum <- function(array_length, mz_array, intensity_array) {
## needed to tell `vapply` how long the result should be. No, there isn't an
## easier way to do this.
resultLength <- rep(1.0, array_length)
vapply(list(mz_array=mz_array, intensity_array=intensity_array),
readBin,
resultLength,
what="double",
endian="little",
n=array_length,
USE.NAMES=FALSE)
}
The Rprof output now looks like:
$by.self
self.time self.pct total.time total.pct
"<Anonymous>" 0.64 19.75 2.14 66.05
"DecodeSpectrum" 0.46 14.20 1.12 34.57
".Call" 0.42 12.96 0.42 12.96
"FUN" 0.38 11.73 0.38 11.73
"&" 0.16 4.94 0.16 4.94
">" 0.14 4.32 0.14 4.32
"c" 0.14 4.32 0.14 4.32
"list" 0.14 4.32 0.14 4.32
"vapply" 0.12 3.70 0.66 20.37
"mapply" 0.10 3.09 2.54 78.40
"simplify2array" 0.10 3.09 0.30 9.26
"<" 0.08 2.47 0.08 2.47
"t" 0.04 1.23 2.72 83.95
"as.vector" 0.04 1.23 0.08 2.47
"unlist" 0.04 1.23 0.08 2.47
"lapply" 0.04 1.23 0.04 1.23
"unique.default" 0.04 1.23 0.04 1.23
"NextSegment" 0.02 0.62 0.50 15.43
"odbcFetchRows" 0.02 0.62 0.46 14.20
"unique" 0.02 0.62 0.10 3.09
"array" 0.02 0.62 0.04 1.23
"attr" 0.02 0.62 0.02 0.62
"match.fun" 0.02 0.62 0.02 0.62
"odbcValidChannel" 0.02 0.62 0.02 0.62
"parent.frame" 0.02 0.62 0.02 0.62
$by.total
total.time total.pct self.time self.pct
"ProcessAllSegments" 3.24 100.00 0.00 0.00
"t" 2.72 83.95 0.04 1.23
"do.call" 2.68 82.72 0.00 0.00
"mapply" 2.54 78.40 0.10 3.09
"<Anonymous>" 2.14 66.05 0.64 19.75
"DecodeSpectrum" 1.12 34.57 0.46 14.20
"vapply" 0.66 20.37 0.12 3.70
"NextSegment" 0.50 15.43 0.02 0.62
"odbcFetchRows" 0.46 14.20 0.02 0.62
".Call" 0.42 12.96 0.42 12.96
"FUN" 0.38 11.73 0.38 11.73
"simplify2array" 0.30 9.26 0.10 3.09
"&" 0.16 4.94 0.16 4.94
">" 0.14 4.32 0.14 4.32
"c" 0.14 4.32 0.14 4.32
"list" 0.14 4.32 0.14 4.32
"unique" 0.10 3.09 0.02 0.62
"<" 0.08 2.47 0.08 2.47
"as.vector" 0.08 2.47 0.04 1.23
"unlist" 0.08 2.47 0.04 1.23
"lapply" 0.04 1.23 0.04 1.23
"unique.default" 0.04 1.23 0.04 1.23
"array" 0.04 1.23 0.02 0.62
"attr" 0.02 0.62 0.02 0.62
"match.fun" 0.02 0.62 0.02 0.62
"odbcValidChannel" 0.02 0.62 0.02 0.62
"parent.frame" 0.02 0.62 0.02 0.62
$sample.interval
[1] 0.02
$sampling.time
[1] 3.24
It's possible that some additional performance could be squeezed out of messing
with the do.call('mapply', ...) call, but I'm satisfied enough with the
performance as is that I'm not willing to waste time on that.
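If you want to see the sapply versus vapply overhead in isolation, here is a small self-contained sketch with synthetic data (the 825-element length mirrors the mean array_length quoted above; absolute timings will of course vary by machine):
set.seed(1)
n <- 825
raw_doubles <- writeBin(rnorm(n), raw(), endian = "little")   # fake little-endian payload

f_sapply <- function() sapply(list(mz = raw_doubles, intensity = raw_doubles),
                              readBin, what = "double", endian = "little", n = n)
f_vapply <- function() vapply(list(mz = raw_doubles, intensity = raw_doubles),
                              readBin, rep(1.0, n),
                              what = "double", endian = "little", n = n,
                              USE.NAMES = FALSE)

system.time(for (i in 1:10000) f_sapply())
system.time(for (i in 1:10000) f_vapply())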
