Clean bad data automatically [duplicate] - r

This question already has answers here:
How to convert a factor to integer\numeric without loss of information?
(12 answers)
Closed 9 years ago.
I am building an App using shiny and openair to analyze wind data.
Right now the user has to “clean” the data before uploading it.
I would like to do this automatically.
Some of the data is empty and some of it is not numeric, so it is not possible to build a wind rose.
I want to:
1. Estimate how much of the data is not numeric
2. Cut it out and leave only numeric data
In my data, the "NO2.mg" column is read as a factor and not as an integer because it does not consist only of numbers.
Here is a reproducible example:
no2<-factor(c(5,4,"c1",54,"c5",seq(2:50)))
no2
[1] 5 4 c1 54 c5 1 2 3 4 5 6 7 8 9 10 11 12 13 14
[20] 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
[39] 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
52 Levels: 1 10 11 12 13 14 15 16 17 18 19 2 20 21 22 ... c5
> as.numeric(no2)
[1] 45 34 51 46 52 1 12 23 34 45 47 48 49 50 2 3 4 5 6
[20] 7 8 9 10 11 13 14 15 16 17 18 19 20 21 22 24 25 26 27
[39] 28 29 30 31 32 33 35 36 37 38 39 40 41 42 43 44

Worst R haiku ever:
Some of the data is empty,
some of is not numeric,
so it is not possible to build a wind rose.

To convert a factor to numeric, you need to convert to character first:
no2<-factor(c(5,4,"c1",54,"c5",seq(2:50)))
no2_num <- as.numeric(as.character(no2))
#Warning message:
# NAs introduced by coercion
no2_clean <- na.omit(no2_num) #remove NAs resulting from the bad data
# [1] 5 4 54 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
# [40] 37 38 39 40 41 42 43 44 45 46 47 48 49
# attr(,"na.action")
# [1] 3 5
# attr(,"class")
# [1] "omit"
length(attr(no2_clean,"na.action"))/length(no2)*100
#[1] 3.703704
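The two steps above (convert, then count what was lost) can be wrapped in one function for use inside the app. The following is only a sketch: clean_numeric and the names of its list elements are hypothetical, not part of shiny or openair, and it assumes that empty strings and existing NA values should count as bad data alongside the non-numeric entries.

```r
# Hypothetical helper (not from the original posts): convert a factor or
# character vector to numeric and report how much of it was unusable.
clean_numeric <- function(x) {
  # as.character() recovers the labels; suppressWarnings() hides the expected
  # "NAs introduced by coercion" warning for non-numeric entries.
  num <- suppressWarnings(as.numeric(as.character(x)))
  bad <- is.na(num)
  list(
    values      = num[!bad],       # numeric entries only
    n_bad       = sum(bad),        # how many entries were dropped
    percent_bad = 100 * mean(bad)  # share of dropped entries, in percent
  )
}

# Same example factor as in the question.
no2 <- factor(c(5, 4, "c1", 54, "c5", seq(2:50)))
res <- clean_numeric(no2)
res$n_bad
res$percent_bad
```

For the example factor this reports 2 bad entries out of 54, which matches the percentage computed above.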

This is how I did it. I am sure someone has a better way, and I'd love it if you shared it with me.
This is my data:
no2<-factor(c(5,4,"c1",54,"c5",seq(2:50)))
To count the bad data:
sum(is.na((as.numeric(as.vector(no2)))))
And to estimate the percentage of bad data:
sum(is.na((as.numeric(as.vector(no2)))))/length(no2)*100

Related

R plot numbers of factor levels having n, n+1, .... counts

I have a very large dataset (> 200000 lines) with 6 variables (only the first two shown)
>head(gt7)
ChromKey POS
1 2447 25
2 2447 183
3 26341 75
4 26341 2213
5 26341 2617
6 54011 1868
I have converted the ChromKey variable to a factor with > 55000 levels.
> gt7[1] <- lapply(gt7[1], factor)
> is.factor(gt7$ChromKey)
[1] TRUE
I can further make a table with counts of ChromKey levels
> table(gt7$ChromKey)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
88 88 44 33 11 11 33 22 121 11 22 11 11 11 22 11 33
18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
22 22 44 55 22 11 22 66 11 11 11 22 11 11 11 187 77
35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51
77 11 44 11 11 11 11 11 11 22 66 11 22 11 44 22 22
... output cropped
Which I can save in table format
> table <- table(gt7$ChromKey)
> head(table)
1 2 3 4 5 6
88 88 44 33 11 11
I would like to know whether it is possible to get a table (and histogram) of the number of levels that have each specific count. From the example above, I would expect
88 44 33 11
2 1 1 2
I would very much appreciate any hint.
We can apply table again to that output to get the frequency of each count:
table(table(gt7$ChromKey))
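As a small self-contained illustration, the sketch below rebuilds the six counts shown by head(table) from made-up keys and tabulates them twice. The object names chrom_key, counts and count_freq are placeholders and are not from the original data.

```r
# Made-up keys standing in for gt7$ChromKey: level 1 appears 88 times, etc.
chrom_key <- factor(c(rep(1, 88), rep(2, 88), rep(3, 44),
                      rep(4, 33), rep(5, 11), rep(6, 11)))

counts     <- table(chrom_key)  # number of rows per level
count_freq <- table(counts)     # number of levels sharing each count
count_freq
```

The result lists each distinct count in ascending order (11, 33, 44, 88) together with the number of levels that share it (2, 1, 1, 2). Passing count_freq to barplot() would be one way to draw the histogram.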

What is the name and reason for the [1] at the output prompt?

What's the name for the [1] below?
What is its significance?
Is it always only [1]? If not, then under what conditions is it something else? (example please)
> bb <- c(5,6,7)
> bb
[1] 5 6 7
It is the index of the first element printed on that line. In your case all three elements fit on one line, so the only prefix is [1]:
bb <- c(5,6,7)
> bb
# [1] 5 6 7
For a longer vector the output wraps, and each line starts with the index of its first element. Try:
c(1:50)
#[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
#[35] 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
You can also avoid displaying the index by using cat:
cat(c(1:50))
#1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
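The bracketed number is easier to see with a narrow console. The sketch below assumes a width of 40 characters, chosen only for illustration, and captures the printed lines so that each prefix can be inspected.

```r
x <- 101:130

old <- options(width = 40)         # narrow console so the vector wraps
lines <- capture.output(print(x))  # one string per printed line
options(old)                       # restore the previous width

writeLines(lines)
```

Every captured line begins with a bracketed index, and the element right after it is x at that index. Where the lines break depends on the console width.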

Creating a sequence in R [duplicate]

This question already has answers here:
Create integer sequences defined by 'from' and 'to' vectors
(2 answers)
Closed 5 years ago.
Let's say, I created two vectors like:
Ncla = 10
CC.1 = seq(2,((Ncla *Ncla)-Ncla),(Ncla+1))
CC.2 = seq(Ncla,((Ncla *Ncla)-Ncla),(Ncla))
and, I tried to create the following sequence:
#[1] 2 3 4 5 6 7 8 9 10 13 14 15 16 17 18 19 20 24 25 26
# 27 28 29 30 35 36 37 38 39 40 46 47 48 49 50 57 58 59 60 68 69 70 79 80 90
using the statement:
for(i in 1:(Ncla-1)) A.1[i]={c(seq(CC.1[i],CC.2[i],length = 1))}
but it doesn't work.
Any help is greatly appreciated.
Try
unlist(Map(seq, CC.1, CC.2))
# [1] 2 3 4 5 6 7 8 9 10 13 14 15 16 17 18 19 20 24 25 26 27 28 29 30 35
#[26] 36 37 38 39 40 46 47 48 49 50 57 58 59 60 68 69 70 79 80 90
Or
unlist(sapply(seq_along(CC.1), function(i) seq(CC.1[i], CC.2[i])))
Or
A.1 <- list()
for(i in seq_along(CC.1)) A.1[[i]] <- seq(CC.1[i], CC.2[i])
unlist(A.1)
# [1] 2 3 4 5 6 7 8 9 10 13 14 15 16 17 18 19 20 24 25 26 27 28 29 30 35
#[26] 36 37 38 39 40 46 47 48 49 50 57 58 59 60 68 69 70 79 80 90
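If this is needed in more than one place, the Map approach can be wrapped in a small function. This is a sketch only: seq_ranges is a hypothetical name, and it assumes both vectors have the same length and that each start is no larger than its end (seq counts downwards otherwise).

```r
# Hypothetical helper: concatenate the runs from[i]:to[i] into one vector.
seq_ranges <- function(from, to) {
  stopifnot(length(from) == length(to))
  unlist(Map(seq, from, to))
}

Ncla <- 10
CC.1 <- seq(2, Ncla * Ncla - Ncla, Ncla + 1)  # run starts: 2, 13, 24, ...
CC.2 <- seq(Ncla, Ncla * Ncla - Ncla, Ncla)   # run ends: 10, 20, 30, ...

result <- seq_ranges(CC.1, CC.2)
length(result)
```

With the vectors from the question, result holds the same 45 values as the outputs shown above.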
test<-NULL
for(i in 1:(Ncla-1)) {
A.1=c(seq(CC.1[i],CC.2[i],1))
test<-c(test,A.1)
}
test
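Your mistake: you were not accumulating the results from one iteration to the next. Note also that length = 1 in your loop asks seq for a single value, whereas the step of 1 used here returns the whole run.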

converting readHTMLTable to numeric

I am working with readHTMLTable and am having difficulty performing calculations on the columns: when I convert a column with as.numeric, its values are replaced by their ranks.
Can anyone help?
a=readHTMLTable("http://www.nhl.com/ice/standings.htm?season=20132014&type=LEA",which=3,trim=F)
> a[,5]
[1] 54 54 52 52 51 51 46 46 46 46 43 45 42 43 39 40 38 37 38 35 37 37 38 36 36 34 35 29 29 21
Levels: 21 29 34 35 36 37 38 39 40 42 43 45 46 51 52 54
> a[,5]=as.numeric(a[,5])
> a[,5]
[1] 16 16 15 15 14 14 13 13 13 13 11 12 10 11 8 9 7 6 7 4 6 6 7 5 5 3 4 2 2 1
I would like to be able to apply functions to the values of a[,5], not the ranks. For example, mean(a[,5]) should be (54+54+52...+21)/30, not
mean(a[,5])
[1] 8.933333
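The problem is that you are converting a factor variable to numeric, which returns the internal integer codes of the levels rather than the values they represent. See this post.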
The canonical way to handle the problem would be as.numeric(levels(a[,5]))[a[,5]]
However, the method I often use is as.numeric(as.character(a[,5])) because it's easier to remember.
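To make the difference concrete, here is a minimal sketch on a made-up four-element factor standing in for a[,5]. The names f, codes, via_levels and via_chr are illustrative only.

```r
# Small made-up factor; its levels sort as "21", "52", "54".
f <- factor(c("54", "54", "52", "21"))

codes      <- as.numeric(f)                # internal level codes, not the values
via_levels <- as.numeric(levels(f))[f]     # convert the levels once, index by code
via_chr    <- as.numeric(as.character(f))  # convert every label individually

codes
via_levels
mean(via_levels)
```

Both idioms return the original values (54, 54, 52, 21), while as.numeric(f) alone returns the codes 3, 3, 2, 1. The levels-based form converts each distinct level only once, which can matter for long vectors with few levels.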

How do you create an inverse sequence in R

I have a sequence (below) whose pattern does not change. I can create a vector of the values that are missing from this pattern (bad, below), but I can't figure out how to produce the sequence itself as a vector. How would one create a vector that contains the sequence below but stops at a specific value (the last one in the pattern) instead of 51? Thanks
bad <- seq(1,51,by=3)
2,3,5,6,8,9,11,12,14,15,17,18,20,21,23,24,26,27,29,30,32,33,35,36,38,39,41,42,44,45,47,48,50,51
The most straightforward way I can think of is to use "recycling" of a logical vector:
(1:51)[c(FALSE, TRUE, TRUE)]
# [1] 2 3 5 6 8 9 11 12 14 15 17 18 20 21 23 24 26 27 29 30 32 33 35 36 38 39
# [27] 41 42 44 45 47 48 50 51
> bad <- 2:51
> bad[!bad %% 3 == 1]
[1] 2 3 5 6 8 9 11 12 14 15 17 18 20 21 23 24 26 27 29 30 32 33 35 36 38 39 41 42 44 45 47 48 50 51
cumsum(rep(c(2,1), 51/3))
probably inefficient though.
a = 2:51
b = seq(1, 51, by=3)
setdiff(a,b)
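Since the question asks for an end point other than 51, any of these can be parameterised by the upper limit. A minimal sketch, assuming the pattern is always "drop 1, 4, 7, ..." and using the hypothetical name skip_every_third:

```r
# Hypothetical helper: all integers from 1 to n except 1, 4, 7, ...
skip_every_third <- function(n) {
  x <- seq_len(n)
  x[x %% 3 != 1]  # keep values whose remainder modulo 3 is not 1
}

skip_every_third(12)
res51 <- skip_every_third(51)
length(res51)
```

skip_every_third(12) gives 2 3 5 6 8 9 11 12, and with n = 51 the result has the same 34 values as the answers above.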
