Choose some items in a dataframe and change them - r

I have a data frame with some information. Some data is NA. Something like:
id fact sex
1 1 3 M
2 2 6 F
3 3 NA <NA>
4 4 8 F
5 5 2 F
6 6 2 M
7 7 NA <NA>
8 8 1 F
9 9 10 M
10 10 10 M
I have to change fact by some rule (e.g. multiply by 3 the elements that have sex == "M").
I tried survey$fact[survey$sex== "M"] <- survey$fact[survey$sex== "M"] * 3, but I get an error because of the NAs.
I know I can check whether an element is NA with is.na(x) and add this condition inside [...], but I hope a more elegant solution exists.

I really like ifelse, it always seems to have the desired behaviour with respect to NA values for me.
survey$fact <- ifelse(survey$sex == "M", survey$fact * 3, survey$fact)
?ifelse shows that the first argument is the test, the second the value assigned if the test is true and the final argument the value if false. If you assign the original data.frame column as the false return value, it will assign rows for which the test fails without modifying them.
This is an extension of what you asked, to show that you can also test for NA values.
survey$fact <- ifelse(is.na(survey$sex), survey$fact * 2, survey$fact)
I also like that it's very readable.
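A minimal sketch of the ifelse approach on the question's data. The construction of survey below is my reconstruction from the printed table, not the asker's actual code. One caveat worth knowing: wherever the test itself is NA, ifelse returns NA, so a row with a missing sex but a known fact would lose its fact value. In this data both are missing together, so nothing is lost.

```r
# Rebuild the data frame from the table printed in the question (assumed values).
survey <- data.frame(
  id   = 1:10,
  fact = c(3, 6, NA, 8, 2, 2, NA, 1, 10, 10),
  sex  = c("M", "F", NA, "F", "F", "M", NA, "F", "M", "M")
)

# The test is NA where sex is NA; ifelse() returns NA for those rows
# instead of raising an error.
survey$fact <- ifelse(survey$sex == "M", survey$fact * 3, survey$fact)
survey$fact
```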

which() can filter out those NAs:
survey$fact[which(survey$sex == "M")] <- survey$fact[which(survey$sex== "M")] * 3
There are many ways you can make that a little cleaner, e.g.:
males <- which(survey$sex == "M")
survey$fact[males] <- 3 * survey$fact[males]
or
survey <- within(survey, fact[males] <- 3 * fact[males])
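To see why which helps, here is a small sketch, again assuming the survey data is rebuilt from the question's table. The comparison produces a logical vector with NA in it, and which keeps only the positions that are TRUE, so the index used for the assignment is NA-free.

```r
# Data frame reconstructed from the question's printed table (assumed values).
survey <- data.frame(
  id   = 1:10,
  fact = c(3, 6, NA, 8, 2, 2, NA, 1, 10, 10),
  sex  = c("M", "F", NA, "F", "F", "M", NA, "F", "M", "M")
)

is_male <- survey$sex == "M"   # logical vector, NA where sex is missing
males   <- which(is_male)      # integer positions of TRUE only; NA is dropped

# Integer positions contain no NA, so the assignment is accepted.
survey$fact[males] <- 3 * survey$fact[males]
survey
```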

Related

why does sub-setting a dataframe result in NA rows [duplicate]

I've been encountering what I think is a bug. It's not a big deal, but I'm curious if anyone else has seen this. Unfortunately, my data is confidential, so I have to make up an example, and it's not going to be very helpful.
When subsetting my data, I occasionally get mysterious NA rows that aren't in my original data frame. Even the rownames are NA. E.g.:
example <- data.frame("var1"=c("A", "B", "A"), "var2"=c("X", "Y", "Z"))
example
var1 var2
1 A X
2 B Y
3 A Z
then I run:
example[example$var1=="A",]
var1 var2
1 A X
3 A Z
NA <NA> <NA>
Of course, the example above does not actually give you this mysterious NA row; I am adding it here to illustrate the problem I'm having with my data.
Maybe it has to do with the fact that I'm importing my original data set using Google's read.xlsx package and then executing wide to long reshape before subsetting.
Thanks
Wrap the condition in which:
df[which(df$number1 < df$number2), ]
How it works:
It returns the row numbers where the condition matches (where the condition is TRUE) and subsets the data frame on those rows accordingly.
Say that:
which(df$number1 < df$number2)
returns row numbers 1, 2, 3, 4 and 5.
As such, writing:
df[which(df$number1 < df$number2), ]
is the same as writing:
df[c(1, 2, 3, 4, 5), ]
Or an even simpler version is:
df[1:5, ]
I see this was already answered by the OP, but since his comment is buried deep within the comment section, here's my attempt to fix this issue (at least with my data, which was behaving the same way).
First of all, some sample data:
> df <- data.frame(name = LETTERS[1:10], number1 = 1:10, number2 = c(10:3, NA, NA))
> df
name number1 number2
1 A 1 10
2 B 2 9
3 C 3 8
4 D 4 7
5 E 5 6
6 F 6 5
7 G 7 4
8 H 8 3
9 I 9 NA
10 J 10 NA
Now for a simple filter:
> df[df$number1 < df$number2, ]
name number1 number2
1 A 1 10
2 B 2 9
3 C 3 8
4 D 4 7
5 E 5 6
NA <NA> NA NA
NA.1 <NA> NA NA
The problem here is that comparing against the NAs in the third column yields NA rather than TRUE or FALSE, and R returns an all-NA row for every NA in a logical row index. That is why two extra rows appear in the result. Here's my fix, which requires knowing which column contains the NAs:
> df[df$number1 < df$number2 & !is.na(df$number2), ]
name number1 number2
1 A 1 10
2 B 2 9
3 C 3 8
4 D 4 7
5 E 5 6
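The fixes discussed so far can be checked side by side. This is only a sketch using the sample data from this answer; the variable names cond, raw, by_which and by_is_na are mine.

```r
df <- data.frame(name = LETTERS[1:10], number1 = 1:10,
                 number2 = c(10:3, NA, NA))

cond <- df$number1 < df$number2   # NA for rows 9 and 10

raw      <- df[cond, ]                 # 5 matches plus 2 all-NA rows
by_which <- df[which(cond), ]          # which() drops the NA positions
by_is_na <- df[cond & !is.na(cond), ]  # NA & FALSE is FALSE, so NAs are excluded

nrow(raw)
nrow(by_which)
nrow(by_is_na)
```

Testing is.na on the condition itself, rather than on one column, avoids having to know which column holds the NAs.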
I get the same problem when using code similar to what you posted. Using the function subset()
subset(example,example$var1=="A")
the NA row instead gets excluded.
Using dplyr:
library(dplyr)
filter(df, number1 < number2)
I find using %in% instead of == can solve this issue although I am still wondering why.
For example, instead of:
df[df$num == 1,]
use:
df[df$num %in% c(1),]
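As for the "why": %in% is built on match(), and a missing value on the left-hand side simply counts as "no match", so the result is FALSE rather than NA (unless NA itself is in the lookup table). A quick sketch with a made-up vector num:

```r
num <- c(1, NA, 2, 1)   # hypothetical column with one missing value

eq  <- num == 1     # comparison propagates NA
inn <- num %in% 1   # membership test never yields NA here

eq
inn
```

With no NA in the index, the subset contains no all-NA rows.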
> example <- data.frame("var1"=c("A", NA, "A"), "var2"=c("X", "Y", "Z"))
> example
var1 var2
1 A X
2 <NA> Y
3 A Z
> example[example$var1=="A",]
var1 var2
1 A X
NA <NA> <NA>
3 A Z
This is probably the result you are seeing.
Try wrapping the condition in which() to avoid the NAs:
example[which(example$var1=="A"),]
var1 var2
1 A X
3 A Z
Another cause may be that you get the condition wrong, such as checking if a factor column is equal to a value that is not among its levels. Troubled me for a while.

When trying to replace values, "missing values are not allowed in subscripted assignments of data frames"

I have a table that has two columns: whether you were sick (H01) and the number of days sick (H03). However, the number of days sick is NA if H01 == false, and I would like to set it to 0. When I do this:
test <- pe94.person[pe94.person$H01 == 12,]
test$H03 <- 0
It works fine. However, I'd like to replace the values in the original dataframe. This, however, fails:
pe94.person[pe94.person$H01 == 12,]$H03 <- 0
It returns:
> pe94.person[pe94.person$H01 == 12,]$H03 <- 0
Error in `[<-.data.frame`(`*tmp*`, pe94.person$H01 == 12, , value = list( :
missing values are not allowed in subscripted assignments of data frames
Any idea why this is? For what it's worth, here's a frequency table:
> table(pe94.person[pe94.person$H01 == 12,]$H03)
2 3 5 28
3 1 1 1
It is due to missing values in the H01 variable.
> x <- data.frame(a=c(NA,2:5), b=c(1:5))
> x
a b
1 NA 1
2 2 2
3 3 3
4 4 4
5 5 5
> x[x$a==2,]$b <- 99
Error in `[<-.data.frame`(`*tmp*`, x$a == 2, , value = list(a = NA_integer_, :
missing values are not allowed in subscripted assignments of data frames
The assignment won't work because x$a has a missing value.
Subsetting first works:
> z <- x[x$a==2,]
> z$b <- 99
> z
a b
NA NA 99
2 2 99
But that's because the [<- function apparently can't handle missing values in its extraction indices, even though [ can:
> `[<-`(x,x$a==2,,99)
Error in `[<-.data.frame`(x, x$a == 2, , 99) :
missing values are not allowed in subscripted assignments of data frames
So instead, try adding an !is.na(x$a) condition when you're doing the assignment:
> `[<-`(x,!is.na(x$a) & x$a==2,'b',99)
a b
1 NA 1
2 2 99
3 3 3
4 4 4
5 5 5
Or, more commonly:
> x[!is.na(x$a) & x$a==2,]$b <- 99
> x
a b
1 NA 1
2 2 99
3 3 3
4 4 4
5 5 5
Note that this behavior is described in the documentation:
The replacement methods can be used to add whole column(s) by specifying non-existent column(s), in which case the column(s) are added at the right-hand edge of the data frame and numerical indices must be contiguous to existing indices. On the other hand, rows can be added at any row after the current last row, and the columns will be in-filled with missing values. Missing values in the indices are not allowed for replacement.
You can use ifelse, like so
pe94.person$foo <- ifelse(!is.na(pe94.person$H01) & pe94.person$H01 == 12, 0, pe94.person$H03)
Check that foo meets your criteria and then go ahead and assign it to pe94.person$H03 directly. I find it safer to assign the result to a new variable and usually use that in subsequent analysis.
There might be an NA somewhere in the column that is causing the error. Run the index on a specific column instead of the entire data frame.
movies[movies$Actors == "N/A",] = NA #ERROR
movies$Actors[movies$Actors == "N/A"] = NA #Works
I realise the question is very old, but I think the most elegant solution is by using the which() function:
pe94.person[which(pe94.person$H01 == 12),]$H03 <- 0
should do what the original poster asked for. Because which() drops the NAs and keeps the (positions of the) TRUE results only.
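Here is a small sketch of that on the toy data frame x from the earlier answer, since pe94.person itself isn't available. The tryCatch wrapper and the name failed are only there to show the failure without stopping the script.

```r
x <- data.frame(a = c(NA, 2:5), b = 1:5)

# A logical row index containing NA is rejected by the data frame
# replacement method, so this attempt fails and leaves x unchanged.
failed <- tryCatch({
  x[x$a == 2, "b"] <- 99
  FALSE
}, error = function(e) TRUE)

# which() returns only the positions that are TRUE, so no NA reaches `[<-`.
x[which(x$a == 2), "b"] <- 99
x
```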
Simply use the subset() function to exclude all NAs from the selection.
It works like x[cond & !is.na(cond), ], where cond is the logical condition. Look at this data:
> x <- data.frame(a = c(T,F,T,F,NA,F,T, F, NA,NA,T,T,F),
+ b = c(F,T,T,F,T, T,NA,NA,F, T, T,F,F))
Subsetting with [ operator returns this:
> x[x$b == T & x$a == F, ]
a b
2 FALSE TRUE
NA NA NA
6 FALSE TRUE
NA.1 NA NA
NA.2 NA NA
And subset() does what we want:
> subset(x, b == T & a == F)
a b
2 FALSE TRUE
6 FALSE TRUE
To change the values of subsetted variables:
> ss <- subset(x, b == T & a == F)
> x[rownames(ss), 'a'] <- T
> x[c(2,6), ]
a b
2 TRUE TRUE
6 TRUE TRUE
The following works. Note that there is no comma in the subsetting:
x <- data.frame(a=c(NA,2:5), b=c(1:5))
x$a[x$a==2] <- 99
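The reason, as far as I understand it, is that this assigns into the plain vector x$a rather than into the data frame. For atomic vectors, an NA in the index selects nothing to replace, and R allows that as long as the replacement value has length one. With a longer replacement it is ambiguous which values to use, so R raises an error. A sketch (v and longer_fails are illustrative names):

```r
x <- data.frame(a = c(NA, 2:5), b = 1:5)

# Length-one replacement: the NA position is skipped silently.
x$a[x$a == 2] <- 99
x$a

# Replacement longer than one with NA in the index: R refuses.
v <- c(NA, 2, 3)
longer_fails <- tryCatch({
  v[v == 2] <- c(7, 8)
  FALSE
}, error = function(e) TRUE)
longer_fails
```

This is also why the assignment in the first question fails: its right-hand side has more than one element.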
