Remove rows whose number is "NA" in R

I am trying to delete some rows according to a filter.
clean <-function(z){
f <- z[!(z$"Current_status" == "T" & z$"Start_date" == 2012),]
return(f)
}
It worked on all data frames but one. On that one, the rows I want to delete were blanked out (NA in every column) instead of being removed. This is what I get:
Current_status Start_date
1 O 2005
2 O 2004
3 O 2004
4 O 2002
5 O 2002
NA <NA> NA
NA.1 <NA> NA
8 O 0
9 O 0
10 O 0
11 O 0
NA.2 <NA> NA
I tried several methods but none worked.
My hypothesis is that the problem arises because the row names of these rows also changed to "NA".
How could I get rid of these rows?
Many thanks!

You could subset with the help of is.na():
f <- f[!is.na(f$Current_status) & !is.na(f$Start_date), ]
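The NA rows most likely come from the logical index itself: a comparison involving NA returns NA, and indexing a data frame with NA produces an all-NA row. As a minimal sketch, assuming the affected data frame has missing values in the filter columns (the data frame `z` below is made up for illustration), the filter can be wrapped in which() so that only definite matches are dropped and the other rows are left untouched:

```r
# Hypothetical data with missing values in both filter columns
z <- data.frame(
  Current_status = c("O", "T", NA, "T", "O"),
  Start_date     = c(2005, 2012, 2012, NA, 0),
  stringsAsFactors = FALSE
)

clean <- function(z) {
  # which() returns only the positions where the condition is TRUE,
  # so NA comparisons are never used as row indices
  drop <- which(z$Current_status == "T" & z$Start_date == 2012)
  if (length(drop) == 0) return(z)  # z[-integer(0), ] would drop every row
  z[-drop, ]
}

res <- clean(z)
print(res)
```

Unlike the is.na() approach, this keeps rows that contain missing values, which may or may not be what is wanted.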

Related

R - enter basic formula

I am new to R and struggling to understand its quirks. I'm trying to do something which should be really simple, but is turning out to be apparently very complicated.
I am used to Excel, SQL and Minitab, where you can enter a value in one column which includes references to other columns and parameters. However, R doesn't seem to be allowing me to do this.
I have a table with (currently) four columns:
Date Pallets Lt Tt
1 28/12/2011 491 NA NA
2 29/12/2011 385 NA 0.787890411
3 30/12/2011 662 NA NA
4 31/12/2011 28 NA NA
5 01/01/2012 46 NA NA
6 02/01/2012 403 NA NA
7 03/01/2012 282 NA NA
8 04/01/2012 315 NA NA
9 05/01/2012 327 NA NA
10 06/01/2012 458 NA NA
and have a parameter "beta", with a value which I have assigned as 0.0002.
All I want to do is assign a formula to rows 3:10 which is:
beta*(Pallets t - Pallets t-1)+(1-beta)*Tt t-1.
I thought that the appropriate code might be:
Table[3:10,4]<-beta*(Table[3:10,"Pallets"]-Table[2:9,"Pallets"])+(1-beta)*Table[2:9,"Tt"]
However, this doesn't work. The first time I enter this formula, it generates:
Date Pallets Lt Tt
1 28/12/2011 491 NA NA
2 29/12/2011 385 NA 0.7878904
3 30/12/2011 662 NA 0.8431328
4 31/12/2011 28 NA NA
5 01/01/2012 46 NA NA
6 02/01/2012 403 NA NA
7 03/01/2012 282 NA NA
8 04/01/2012 315 NA NA
9 05/01/2012 327 NA NA
10 06/01/2012 458 NA NA
So it's generated the correct answer for the second item in the series, but not for any of the subsequent values.
It seems as though R doesn't automatically update each row, and the relationship to each other row, when you enter a formula, as Excel does. Having said that, Excel actually would require me to enter the formula in cell [4,Tt], and then drag this down to all of the other cells. Perhaps R is the same, and there is an equivalent to "dragging down" which I need to do?
Finally, I also noticed that when I change the value of the beta parameter, through, e.g. beta<-0.5, and then print the Table values again, they are unchanged - so the table hasn't updated even though I have changed the value of the parameter.
Appreciate that these are basic questions, but I am very new to R.
In R, the computations are not made "cell by cell", but are vectorised - in your example, R takes the vectors Table[3:10,"Pallets"], Table[2:9,"Pallets"] and Table[2:9,"Tt"] as they are at the moment, computes the resulting vector, and finally assigns it to Table[3:10,4].
If you want to do the computation "cell by cell", so that each row can use the value just computed for the previous row, you have to use a for loop:
beta <- 0.5
df <- data.frame(v1 = 1:12, v2 = 0)
for (i in 3:10) {
df[i, "v2"] <- beta * (df[i, "v1"] - df[i-1, "v1"]) + (1 - beta) * df[i-1, "v2"]
}
df
v1 v2
1 1 0.0000000
2 2 0.0000000
3 3 0.5000000
4 4 0.7500000
5 5 0.8750000
6 6 0.9375000
7 7 0.9687500
8 8 0.9843750
9 9 0.9921875
10 10 0.9960938
11 11 0.0000000
12 12 0.0000000
As for your second question, R never updates values on its own (imagine having set manual calculation in Excel), so you need to repeat the computation after changing beta.
Although it is generally bad design, you can iterate over the rows in a loop:
# Difference between consecutive Pallets values (0 for the first row)
Table$temp <- c(0, diff(Table$Pallets, 1))
# Start the recursion from 0
prevTt <- 0
for (i in 1:10)
{
Table$Tt[i] <- Table$temp[i] * beta + (1 - beta) * prevTt
prevTt <- Table$Tt[i]
}
Table$temp <- NULL
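To apply the same idea to the table from the question, here is a minimal sketch that starts the recursion from the known Tt value in row 2 and wraps the loop in a function, so it can be rerun whenever beta changes. The helper name `update_tt` is hypothetical, and the Date and Lt columns are left out for brevity:

```r
# Data from the question; Tt is known only for row 2
Table <- data.frame(
  Pallets = c(491, 385, 662, 28, 46, 403, 282, 315, 327, 458),
  Tt      = c(NA, 0.787890411, rep(NA, 8))
)

# Hypothetical helper: fills Tt for rows 3..n, each row using the
# value just computed for the row before it
update_tt <- function(tbl, beta) {
  for (i in 3:nrow(tbl)) {
    tbl$Tt[i] <- beta * (tbl$Pallets[i] - tbl$Pallets[i - 1]) +
      (1 - beta) * tbl$Tt[i - 1]
  }
  tbl
}

Table <- update_tt(Table, beta = 0.0002)

# Nothing is recomputed automatically: a new beta needs a new call
Table2 <- update_tt(Table, beta = 0.5)
```

Row 3 of the first result agrees with the single value (0.8431328) that the vectorised assignment in the question produced.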

Issue with NA values when removing rows from data frame in R

This is my data frame:
ID <- c('TZ1','TZ2','TZ3','TZ4')
hr <- c(56,32,38,NA)
cr <- c(1,4,5,2)
data <- data.frame(ID,hr,cr)
ID hr cr
1 TZ1 56 1
2 TZ2 32 4
3 TZ3 38 5
4 TZ4 NA 2
I want to remove the rows where data$hr == 56. This is what I want the end product to be:
ID hr cr
2 TZ2 32 4
3 TZ3 38 5
4 TZ4 NA 2
This is what I thought would work:
data = data[data$hr !=56,]
However the resulting data frame looks like this:
ID hr cr
2 TZ2 32 4
3 TZ3 38 5
NA <NA> NA NA
How can I modify my code to account for the NA value so this doesn't happen? Thank you for your help, I can't figure it out.
EDIT: I also want to keep the NA value in the data frame.
The issue is that == and != return NA when one of the operands is NA, and indexing with NA creates an all-NA row in the result. One way to build a logical index that contains only TRUE/FALSE values is to include is.na in the comparison.
data[!(data$hr==56 & !is.na(data$hr)),]
# ID hr cr
#2 TZ2 32 4
#3 TZ3 38 5
#4 TZ4 NA 2
We could also apply the reverse logic
subset(data, hr!=56|is.na(hr))
# ID hr cr
#2 TZ2 32 4
#3 TZ3 38 5
#4 TZ4 NA 2
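Another option, shown here only as a sketch of an alternative and not taken from the answer above, is %in%, which never returns NA. The code below first prints the index that != produces, then filters with %in%:

```r
data <- data.frame(ID = c("TZ1", "TZ2", "TZ3", "TZ4"),
                   hr = c(56, 32, 38, NA),
                   cr = c(1, 4, 5, 2))

# != propagates the missing value into the index:
# idx is FALSE, TRUE, TRUE, NA
idx <- data$hr != 56
print(idx)

# %in% treats NA as "no match" and returns FALSE for it,
# so the row with the missing hr is kept as a real row
kept <- data[!(data$hr %in% 56), ]
print(kept)
```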

2 variables in a for loop in R

I have two vectors of different lengths that I would like to reference in a single for loop.
n=1:50
m=letters[1:14]
I tried a single loop to read it
for (i in c(11:22,24,25)){
cat (paste(n[i],m[i],sep='\t'),sep='\n')
}
and ended up with:
11 k
12 l
13 m
14 n
15 NA
16 NA
17 NA
18 NA
19 NA
20 NA
21 NA
22 NA
24 NA
25 NA
but I would like to obtain:
11 a
12 b
13 c
...
25 n
Is there a way to declare two loop variables at once?
for (i in c(11:22,24,25) and j in 1:14){
cat (paste(n[i],m[j],sep='\t'),sep='\n')
}
or something similar to get the result I want?
No, there isn't. But you can do this:
ind_j <- c(11:22,24,25)
ind_k <- 1:14
for (i in seq_along(ind_j)){
cat (paste(n[ind_j[i]],m[ind_k[i]],sep='\t'),sep='\n')
}
Of course, it's very probable that you shouldn't use a for loop for your actual problem.
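As a sketch of what the loop-free version could look like for the example in the question: subsetting n with the index vector already lines it up element by element with m, so paste() can build all lines in one call. The name `ind` is introduced here for illustration:

```r
n <- 1:50
m <- letters[1:14]
ind <- c(11:22, 24, 25)   # 14 positions in n, one per element of m

# Guard against silent recycling if the lengths ever differ
stopifnot(length(ind) == length(m))

# paste() is vectorised: element k of n[ind] is joined with element k of m
out <- paste(n[ind], m, sep = "\t")
cat(out, sep = "\n")
```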
If you want m to start over when it has reached the end, you can take advantage of recycling in R.
cat(paste(n, m, sep='\t', collapse='\n'), '\n')
When the end of m is reached, it will start over until all elements of n have been iterated over. If you need this in a loop, replace cat with a for loop.
Your problem lies in the values assigned to i in for (i in c(11:22,24,25)): i takes the values 11, 12, 13, 14, 15, ...
You then look up m[i].
But remember that m has only 14 elements, so for index 15 and above you get NA.
Maybe this is what you wanted. There are more robust answers here, and @Roland's is far better, but in my opinion this addresses the indexing problem without changing your initial approach:
for (i in c(1:12,14,15)){cat (paste(n[i],m[i],sep='\t'),sep='\n')}
If you just subtract 10 from your sequence, the indices fall within the range of m, except for 15, which still returns NA because m has only 14 elements:
1 a
2 b
3 c
4 d
5 e
6 f
7 g
8 h
9 i
10 j
11 k
12 l
14 n
15 NA

looping over the name of the columns in R for creating new columns

I am trying to loop over the column names of an existing data frame and create new columns based on the existing ones. Here is my sample data:
sample<-list(c(10,12,17,7,9,10),c(NA,NA,NA,10,12,13),c(1,1,1,0,0,0))
sample<-as.data.frame(sample)
colnames(sample)<-c("x1","x2","D")
>sample
x1 x2 D
10 NA 1
12 NA 1
17 NA 1
7 10 0
9 12 0
10 13 0
Now I am trying to use a for loop to generate two variables, x1.imp and x2.imp, that take the values from the D=0 rows when D=1 and the values from the D=1 rows when D=0. (For this example I don't actually need a loop, but my original dataset has a large number of columns, so there I really do.) This is my attempt:
for (i in names(sample[,1:2])){
sample$i.imp<-with (sample, ifelse (D==1, i[D==0],i[D==1]))
i=i+1
return(sample)
}
Error in i + 1 : non-numeric argument to binary operator
However, the following works, but it doesn't name the new columns x1.imp and x2.imp:
for(i in sample[,1:2]){
impt.i<-with(sample,ifelse(D==1,i[D==0],i[D==1]))
i=i+1
print(as.data.frame(impt.i))
}
impt.i
1 7
2 9
3 10
4 10
5 12
6 17
impt.i
1 10
2 12
3 13
4 NA
5 NA
6 NA
Note that I already know a solution without a loop. I want one that uses a loop.
Expected output:
x1 x2 D x1.imp x2.imp
10 NA 1 7 10
12 NA 1 9 12
17 NA 1 10 13
7 10 0 10 NA
9 12 0 12 NA
10 13 0 17 NA
I would greatly appreciate your valuable input in this regard.
This is nuts, but since you are asking for it... Your code with minimum changes would be:
for (i in colnames(sample)[1:2]){
sample[[paste0(i, '.impt')]] <- with(sample, ifelse(D==1, get(i)[D==0],get(i)[D==1]))
}
A few comments:
I replaced names(sample[,1:2]) with the more elegant colnames(sample)[1:2].
The $ operator is for interactive usage. When programming, i.e. when the column name is stored in a variable and has to be evaluated, you need to use [ or [[, hence I replaced sample$i.imp with sample[[paste0(i, '.impt')]].
Inside with, i[D==0] will not give you x1[D==0] when i is "x1", hence the need to dereference it using get.
You should not name your data.frame sample, as it is also the name of a pretty common function.
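The difference between $ and [[ can be seen in a small sketch. The data frame `df` and its values are made up for illustration:

```r
df <- data.frame(x1 = c(10, 12, 17), x2 = c(NA, 1, 2))
i <- "x1"

# `$` does not evaluate i: it looks for a column literally named "i"
by_dollar <- df$i          # NULL, because there is no such column

# `[[` evaluates i first and then looks up the column "x1"
by_bracket <- df[[i]]

# The same applies when creating a column whose name is built from i
df[[paste0(i, ".imp")]] <- df[[i]] * 2
print(names(df))
```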
This should work,
test <- sample[,"D"] == 1
for (.name in names(sample)[1:2]){
newvar <- paste(.name, "impt", sep=".")
sample[[newvar]] <- ifelse(test, sample[!test, .name],
sample[test, .name])
}
sample

R throw away rows on multiple conditions

I have a question about filtering my dataset. My dataset looks like this:
PROJECT FREQ
1 <NA> NA
2 <NA> NA
3 FSHD 0.01282051
4 <NA> NA
5 <NA> NA
6 GROEI,CMS 0.02564103
7 <NA> NA
8 GROEI 0.00000132
9 <NA> NA
10 NMD,BRCA 0.03846154
Here is my problem: I want to throw away all the rows that do not have GROEI in the PROJECT field and a value bigger than 0.01 in the FREQ field.
I thought about something like this, but it doesn't work:
a1<-a[!(a$PROJECT != "GROEI" & a$FREQINHDB >= 0.02),]
Can anyone help me with this?
Thanks!
Since you want to match on a partial string, you can use grepl to match a regular expression with your data:
na.omit(a[!grepl("GROEI", a$PROJECT), ])
n PROJECT FREQ
3 3 FSHD 0.01282051
10 10 NMD,BRCA 0.03846154
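The call above drops the rows that contain GROEI and does not use FREQ. If the intent was instead to keep only rows where PROJECT contains GROEI and FREQ is above 0.01 (this reading of the question is an assumption), the two conditions could be combined as in the sketch below. The variable names `has_groei`, `high_freq` and `keep` are introduced here for illustration:

```r
a <- data.frame(
  PROJECT = c(NA, NA, "FSHD", NA, NA, "GROEI,CMS", NA, "GROEI", NA, "NMD,BRCA"),
  FREQ    = c(NA, NA, 0.01282051, NA, NA, 0.02564103, NA, 0.00000132,
              NA, 0.03846154),
  stringsAsFactors = FALSE
)

# grepl() returns FALSE (not NA) for missing strings
has_groei <- grepl("GROEI", a$PROJECT)

# Guard the numeric comparison so NA never reaches the row index
high_freq <- !is.na(a$FREQ) & a$FREQ > 0.01

keep <- a[has_groei & high_freq, ]
print(keep)
```

Because both conditions are plain TRUE/FALSE vectors, no all-NA rows appear and na.omit() is not needed.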
