Getting NA in Standard deviation - r

When I compute the mean it works:
mean(b1temp[1, ])
but standard deviation returns NA:
sd(b1temp[1, ])
NA
So I modified the call, but it still returns NA:
sd(b1temp[1, ], na.rm=FALSE)
NA
My dataset contains only one row. Is this an issue?

The problem here is incorrect subsetting of the data.frame: when you execute b1temp[1, ] you get only one number, and the standard deviation of a single number is not defined, which is why you get NA.
By default, data.frame data are organized by column, not by row. So to apply sd to your data you should subset the column: b1temp[, 1].
Please see the code and simulation below:
b1temp <- data.frame(x = 1:10)
b1temp[1, ]
# [1] 1
mean(b1temp[1, ])
# [1] 1
sd(b1temp[1, ])
# [1] NA
sd(1)
# [1] NA
b1temp[, 1]
# [1] 1 2 3 4 5 6 7 8 9 10
sd(b1temp[, 1])
# [1] 3.02765
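If your data really are stored as a single row across many columns, one option (a small sketch with made-up data, not the poster's) is to flatten that row into a numeric vector before calling sd():
b1row <- data.frame(t1 = 1, t2 = 4, t3 = 7)   # hypothetical one-row data frame
sd(unlist(b1row[1, ]))                         # flatten the row to a numeric vector
# [1] 3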

Related

Removing NA from a calculation

I have some problems with NA values because my dataset from Excel does not have the same number of values in each column, so it shows NA. The Similarity Index function Psicalc in the RInSp package deletes every row containing an NA value when it does the calculation.
B F
4 7
5 6
6 8
7 5
NA 4
NA 3
NA 2
Do you know how to handle the NA values, or remove them without deleting the whole row or affecting the package? Also, when I use import.RinSP I get the message
In if (class(filename) == "character") { :
the condition has length > 1 and only the first element will be used
Thank you so much
Many R functions (specifically in base R) have an na.rm argument, which is FALSE by default. That means that if you omit this argument and your data contain NA, your calculation will result in NA. To ignore the NAs in a calculation, include the na.rm argument and set it to TRUE.
Example:
x <- c(4,5,6,7,NA,NA)
mean(x) # Oops!
[1] NA
mean(x, na.rm=TRUE)
[1] 5.5
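If, as in this question, the NAs only pad out a shorter column, one generic approach (not specific to the RInSp package) is to clean each column separately instead of dropping whole rows:
# Toy data resembling the question: column B is shorter and padded with NA
dat <- data.frame(B = c(4, 5, 6, 7, NA, NA, NA),
                  F = c(7, 6, 8, 5, 4, 3, 2))
mean(dat$B, na.rm = TRUE)        # per-column calculation, NAs ignored
# [1] 5.5
B_clean <- dat$B[!is.na(dat$B)]  # keep only the observed values of B
B_clean
# [1] 4 5 6 7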

RowSums NA + NA gives 0 [duplicate]

This question already has answers here:
There is pmin and pmax each taking na.rm, why no psum?
(3 answers)
Closed 6 years ago.
I'd just like to understand a (for me) weird behavior of the function rowSums. Imagine I have this super simple data frame:
a = c(NA, NA,3)
b = c(2,NA,2)
df = data.frame(a,b)
df
   a  b
1 NA  2
2 NA NA
3  3  2
and now I want a third column that is the sum of the other two. I cannot simply use + because of the NAs:
df$c <- df$a + df$b
df
   a  b  c
1 NA  2 NA
2 NA NA NA
3  3  2  5
but if I use rowSums, the rows that are entirely NA come out as 0, while rows with only one NA work fine:
df$d <- rowSums(df, na.rm=T)
df
   a  b  c  d
1 NA  2 NA  2
2 NA NA NA  0
3  3  2  5 10
Am I missing something?
Thanks to all.
One option with rowSums is to compute the row sums with na.rm=TRUE and then multiply by a term that restores NA for the all-NA rows: rowSums(!is.na(df)) counts the non-NA values per row, negating that (!) gives TRUE only for rows that are all NA, and NA^ turns those TRUEs into NA and everything else into 1.
rowSums(df, na.rm=TRUE) *NA^!rowSums(!is.na(df))
#[1] 2 NA 10
Because
sum(numeric(0))
# 0
Once you use na.rm = TRUE in rowSums, what remains of the second row is numeric(0), and the sum of that is 0.
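A quick check of that behaviour on an all-NA row:
sum(c(NA, NA), na.rm = TRUE)  # nothing is left after removing the NAs, so the sum is 0
# [1] 0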
If you want to retain NA for the all-NA rows, it takes two stages. I recommend writing a small function for this purpose:
my_rowSums <- function(x) {
  if (is.data.frame(x)) x <- as.matrix(x)
  z <- base::rowSums(x, na.rm = TRUE)
  z[!base::rowSums(!is.na(x))] <- NA
  z
}
my_rowSums(df)
# [1] 2 NA 10
This is particularly useful if the input x is a data frame (as in your case). base::rowSums first checks whether the input is a matrix; if it gets a data frame, it converts it to a matrix first. That type conversion is in fact more costly than the actual row-sum computation. Note that we call base::rowSums twice, so to reduce the conversion overhead we make sure x is a matrix beforehand.
For @akrun's "hacking" answer, I suggest:
akrun_rowSums <- function(x) {
  if (is.data.frame(x)) x <- as.matrix(x)
  rowSums(x, na.rm = TRUE) * NA^!rowSums(!is.na(x))
}
akrun_rowSums(df)
# [1] 2 NA 10

Subsetting a vector with variables changes the length of the result vector

I want to subset a vector called text using variables for the slicing points. The problem is that the resulting vector has the same length as the input vector, and the last 10 elements are NA. What is the reason for this?
> data_starts_at+1
[1] 11
> length(text)
[1] 12335
> x = text[11:12335]
> length(x)
[1] 12325
> x = text[data_starts_at+1:length(text)]
> length(x)
[1] 12335
> tail(x)
[1] NA NA NA NA NA NA
Update:
It looks like this problem is related to operator precedence:
> x = text[(data_starts_at+1):length(text)]
> length(x)
[1] 12325
But using x = text[data_starts_at+1:length(text)] appears to subset correctly, except that 10 additional NA elements are appended at the end of x.
It is a problem of parentheses (operator precedence):
x <- 1:10
data_starts_at <- 2
data_starts_at+1:10
[1] 3 4 5 6 7 8 9 10 11 12
# this adds data_starts_at to each element of the vector 1:10
# the trailing NAs appear because you then ask for elements that do not exist (indices 11 & 12)
x[(data_starts_at+1):10]
[1] 3 4 5 6 7 8 9 10   # this is correct
Hope this helps.
Use correct parentheses:
x = text[(data_starts_at+1):length(text)]
What you were doing was taking the indices 1:length(text) and adding data_starts_at to each of these values, because : has higher precedence than +.
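A small standalone illustration of the precedence difference (: binds more tightly than +):
start <- 2
start + 1:5    # parsed as start + (1:5) -> 3 4 5 6 7 (five elements)
(start + 1):5  # parsed as 3:5           -> 3 4 5     (three elements)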

When trying to replace values, "missing values are not allowed in subscripted assignments of data frames"

I have a table that has two columns: whether you were sick (H01) and the number of days sick (H03). However, the number of days sick is NA if H01 == false, and I would like to set it to 0. When I do this:
test <- pe94.person[pe94.person$H01 == 12,]
test$H03 <- 0
That works fine, but I'd like to replace the values in the original data frame. This, however, fails:
pe94.person[pe94.person$H01 == 12,]$H03 <- 0
It returns:
> pe94.person[pe94.person$H01 == 12,]$H03 <- 0
Error in `[<-.data.frame`(`*tmp*`, pe94.person$H01 == 12, , value = list( :
missing values are not allowed in subscripted assignments of data frames
Any idea why this is? For what it's worth, here's a frequency table:
> table(pe94.person[pe94.person$H01 == 12,]$H03)
2 3 5 28
3 1 1 1
It is due to missingness in the H01 variable.
> x <- data.frame(a=c(NA,2:5), b=c(1:5))
> x
a b
1 NA 1
2 2 2
3 3 3
4 4 4
5 5 5
> x[x$a==2,]$b <- 99
Error in `[<-.data.frame`(`*tmp*`, x$a == 2, , value = list(a = NA_integer_, :
missing values are not allowed in subscripted assignments of data frames
The assignment won't work because x$a has a missing value.
Subsetting first works:
> z <- x[x$a==2,]
> z$b <- 99
> z
    a  b
NA NA 99
2   2 99
But that's because the [<- function apparently can't handle missing values in its extraction indices, even though [ can:
> `[<-`(x,x$a==2,,99)
Error in `[<-.data.frame`(x, x$a == 2, , 99) :
missing values are not allowed in subscripted assignments of data frames
So instead, try specifying the !is.na(x$a) part when you're doing the assignment:
> `[<-`(x,!is.na(x$a) & x$a==2,'b',99)
   a  b
1 NA  1
2  2 99
3  3  3
4  4  4
5  5  5
Or, more commonly:
> x[!is.na(x$a) & x$a==2,]$b <- 99
> x
   a  b
1 NA  1
2  2 99
3  3  3
4  4  4
5  5  5
Note that this behavior is described in the documentation:
The replacement methods can be used to add whole column(s) by specifying non-existent column(s), in which case the column(s) are added at the right-hand edge of the data frame and numerical indices must be contiguous to existing indices. On the other hand, rows can be added at any row after the current last row, and the columns will be in-filled with missing values. Missing values in the indices are not allowed for replacement.
You can use ifelse, like so
pe94.person$foo <- ifelse(!is.na(pe94.person$H01) & pe94.person$H01 == 12, 0, pe94.person$H03)
Check that foo meets your criteria, and then assign it to pe94.person$H03 directly. I find it safer to assign to a new variable first and usually use that in subsequent analysis.
There might be an NA somewhere in the column that is causing the error. Run the index on a specific column instead of the entire data frame.
movies[movies$Actors == "N/A",] = NA #ERROR
movies$Actors[movies$Actors == "N/A"] = NA #Works
I realise the question is very old, but I think the most elegant solution is to use the which() function:
pe94.person[which(pe94.person$H01 == 12),]$H03 <- 0
should do what the original poster asked for, because which() drops the NAs and keeps only the positions of the TRUE results.
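For example, on a toy data frame like the one used above, indexing with which() lets the assignment go through despite the NA:
x <- data.frame(a = c(NA, 2:5), b = 1:5)
x[which(x$a == 2), ]$b <- 99   # which() drops the NA index, so the assignment succeeds
x
#    a  b
# 1 NA  1
# 2  2 99
# 3  3  3
# 4  4  4
# 5  5  5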
Simply use the subset() function to exclude all NA rows from the result.
It works as x[subset & !is.na(subset)]. Look at this data:
> x <- data.frame(a = c(T,F,T,F,NA,F,T, F, NA,NA,T,T,F),
+                 b = c(F,T,T,F,T, T,NA,NA,F, T, T,F,F))
Subsetting with [ operator returns this:
> x[x$b == T & x$a == F, ]
         a    b
2    FALSE TRUE
NA      NA   NA
6    FALSE TRUE
NA.1    NA   NA
NA.2    NA   NA
And subset() does what we want:
> subset(x, b == T & a == F)
      a    b
2 FALSE TRUE
6 FALSE TRUE
To change the values of subsetted variables:
> ss <- subset(x, b == T & a == F)
> x[rownames(ss), 'a'] <- T
> x[c(2,6), ]
     a    b
2 TRUE TRUE
6 TRUE TRUE
The following also works. Note that there is no comma in the subsetting: x$a is indexed as a plain vector, and with a length-one replacement value the NA positions in the logical index are simply skipped:
x <- data.frame(a=c(NA,2:5), b=c(1:5))
x$a[x$a==2] <- 99

Putting output in R into excel

I have code that generates 2 columns of data (e.g. Number, Median) for a particular person, but I have taken samples from 7 people,
so basically I get this output:
[[1]]
Number Median
1 5
2 3
.....
[[2]]
Number Median
1 6
2 4
....
[[3]]
Number Median
1 3
2 5
and so on, up to [[7]].
I tried transferring this output to Excel using this code:
write.csv(cbind(data),"data1.csv")
and I get this type of output:
list(c(Median =....... (it lists all the medians along the rows).
But I want it to save the 'Median' and 'Number' data in columns, NOT rows.
If I just type
write.csv(data,"data1.csv")
I get an error:
arguments imply differing number of rows: 157, 179, 178, 180
As Marius said, you have a list of data.frames, which can't be written to a .csv file directly. You need to do:
NewDataFrame <- do.call("rbind", YourList)
write.csv(NewDataFrame, "Data.csv")
do.call takes the elements of a list and applies whatever function you tell it (in this case rbind) to all of them.
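A minimal sketch of that approach, assuming each list element has the same two columns (Number, Median) as in the question; the toy list here is made up:
data <- list(data.frame(Number = 1:2, Median = c(5, 3)),
             data.frame(Number = 1:2, Median = c(6, 4)))
combined <- do.call("rbind", data)                   # stack the data frames on top of each other
write.csv(combined, "data1.csv", row.names = FALSE)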
Here are two options. Both use the following sample data:
myList <- list(data.frame(matrix(1:4, ncol = 2)),
data.frame(matrix(3:10, ncol = 2)),
data.frame(matrix(11:14, ncol =2)))
myList
# [[1]]
# X1 X2
# 1 1 3
# 2 2 4
#
# [[2]]
# X1 X2
# 1 3 7
# 2 4 8
# 3 5 9
# 4 6 10
#
# [[3]]
# X1 X2
# 1 11 13
# 2 12 14
Option 1: Write a csv file where the data.frames are presented as they are in the list
sink("list_of_dataframes.csv", type="output")
invisible(lapply(myList, function(x) dput(write.csv(x))))
sink()
If you open the resulting "list_of_dataframes.csv" file in a text editor, you will get something that looks like this. When you read it into a spreadsheet program, the first column will include the row names, and a NULL line separates each data.frame:
"","X1","X2"
"1",1,3
"2",2,4
NULL
"","X1","X2"
"1",3,7
"2",4,8
"3",5,9
"4",6,10
NULL
"","X1","X2"
"1",11,13
"2",12,14
NULL
Option 2: Write (or search around for) a version of cbind that accommodates binding data.frames with differing numbers of rows.
Here is one such function that I've written:
cbind2 <- function(datalist) {
nrows <- max(sapply(datalist, nrow))
expandmyrows <- function(mydata, rowsneeded) {
temp1 = names(mydata)
rowsneeded = rowsneeded - nrow(mydata)
temp2 = setNames(data.frame(
matrix(rep(NA, length(temp1) * rowsneeded),
ncol = length(temp1))), temp1)
rbind(mydata, temp2)
}
do.call(cbind, lapply(datalist, expandmyrows, rowsneeded = nrows))
}
And here is that function applied to your list:
cbind2(myList)
# X1 X2 X1 X2 X1 X2
# 1 1 3 3 7 11 13
# 2 2 4 4 8 12 14
# 3 NA NA 5 9 NA NA
# 4 NA NA 6 10 NA NA
That output should be easy for you to use with write.csv and related functions.
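For instance, assuming the myList from above, something like:
write.csv(cbind2(myList), "list_side_by_side.csv", row.names = FALSE)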
