R: if a value is negative or NA, update another data.frame - r

I have two data.frames A and B.
A contains negative, positive and NA values.
B contains only positive and NA values.
The dimensions of the data frames are the same.
data.frame A looks like this:
ENSMUSG00000000001.4/Gnai3 0.1943315 0.3021675 NA NA
ENSMUSG00000000003.9/Pbsn -1.4843914 -1.2608270 -0.2587953 -0.46167430
ENSMUSG00000000028.8/Cdc45 -0.2388901 -0.1106236 0.9046436 0.08968331
ENSMUSG00000000037.9/Scml 0.3242902 0.5385371 0.2311202 0.51110287
ENSMUSG00000000049.5/Apoh -1.7606033 -1.8159545 -0.2087083 -1.09614630
ENSMUSG00000000056.7/Narf NA NA -0.3747798 -0.55547798
I need to check whether each value in this table is NA or negative and, if so, set the value at the same indices in data.frame B to 0.999.
For example:
The first record of A has two NA values, indexes are [1,4] and [1,5] meaning, I will update B[1,4]=0.999 and B[1,5]=0.999.
I could do this with nested loops over columns and rows, but it would take too much time. Is there a faster way?

You can pass a Boolean mask as an index if it's the same size:
B[is.na(A) | A < 0] <- 0.999
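To see the mask at work, here is a minimal sketch on two small data frames. The column names and values are made up for illustration, and it assumes every column is numeric (a column of gene names would have to be kept as row names or excluded first).

```r
# Hypothetical stand-ins for the question's A and B (numeric columns only).
A <- data.frame(s1 = c(0.19, -1.48, NA), s2 = c(NA, -1.26, 0.5))
B <- data.frame(s1 = c(0.01, 0.02, 0.03), s2 = c(0.04, NA, 0.06))

# TRUE wherever A is NA or negative. is.na() comes first so that
# TRUE | NA evaluates to TRUE and the mask itself contains no NA.
mask <- is.na(A) | A < 0

# Overwrite the matching cells of B in a single vectorised assignment.
B[mask] <- 0.999
print(B)
```

Only the cells of B whose counterpart in A is NA or negative are changed; the rest of B, including its own NA values elsewhere, is left alone.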

I would use ifelse to do this, since the dataframes have the same dimensions.
A<-matrix(data=1:15,nrow=5) # create matrices (convert data frames with as.matrix() first)
B<-matrix(data=16:30,nrow=5)
B[1,2]<-NA # introduce some NA and negative values
B[5,3]<-(-1)
ifelse(is.na(B) | B<=0,A,B) # new matrix with "updated" values

Related

Combine table and matrix with R

I am performing an analysis in R. I want to fill the first row of a matrix with the content of a table. The problem is that the content of the table varies with the data, so sometimes certain identifiers that appear in the matrix do not appear in the table.
> random.evaluate
DNA LINE LTR/ERV1 LTR/ERVK LTR/ERVL LTR/ERVL-MaLR other SINE
[1,] NA NA NA NA NA NA NA NA
> y
DNA LINE LTR/ERVK LTR/ERVL LTR/ERVL-MaLR SINE
1 1 1 1 1 4
Due to this, when I try to join the data of the matrix with the data of the table, I get the following error
random.evaluate[1,] <- y
Error in random.evaluate[1, ] <- y :
number of items to replace is not a multiple of replacement length
Could someone help me fix this bug? I have found solutions to this error but in my case they do not work for me.
First check which column names of the table exist in the matrix.
For the names that exist in both, just set the values as usual, indexing the matrix by those names.
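A minimal sketch of that idea follows. The objects are hypothetical reconstructions of random.evaluate and y from the question, and matching on intersect() of the names is one possible way to do the check, not necessarily what the answer had in mind.

```r
# Hypothetical reconstruction of the objects from the question.
cols <- c("DNA", "LINE", "LTR/ERV1", "LTR/ERVK", "LTR/ERVL",
          "LTR/ERVL-MaLR", "other", "SINE")
random.evaluate <- matrix(NA_integer_, nrow = 1, ncol = length(cols),
                          dimnames = list(NULL, cols))
y <- table(c("DNA", "LINE", "LTR/ERVK", "LTR/ERVL", "LTR/ERVL-MaLR",
             rep("SINE", 4)))

# Keep only the identifiers present in both objects, then assign by name
# so that each count lands in the column with the same label.
common <- intersect(colnames(random.evaluate), names(y))
random.evaluate[1, common] <- as.vector(y[common])
print(random.evaluate)
```

Columns that are missing from the table (here LTR/ERV1 and other) simply stay NA, and the lengths on both sides of the assignment always agree.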

Why do I get NA when indexing a vector (or data frame) with a condition that nothing matches?

When I index a vector or data frame in R, I sometimes get an empty vector (e.g. numeric(0), integer(0), or factor(0)...), and sometimes get NA.
I guess that I get NA when the vector or data frame I am dealing with contains NA.
For example,
iris_test = iris
iris_test$Sepal.Length[1] = NA
iris[iris$Sepal.Length < 0, "Sepal.Length"] # numeric(0)
iris_test[iris_test$Sepal.Length < 0, "Sepal.Length"] # NA
It's intuitive to me to get numeric(0) when no values match my condition
(no search result --> no element in the resulting vector --> numeric(0)).
However, why do I get NA rather than numeric(0)?
Your assumption is more or less correct: you get NA values when there is an NA in the data.
The comparison yields NA values
iris_test$Sepal.Length < 0
#[1] NA FALSE FALSE FALSE.....
When you subset a vector with NA it returns NA. See for example,
iris$Sepal.Length[c(1, NA)]
#[1] 5.1 NA
This is what the second case returns. In the first case, all the values are FALSE, so you get numeric(0)
iris$Sepal.Length[FALSE]
#numeric(0)
Adding to @Ronak's answer:
The discussion of NA in R for Data Science made NA easy for me to understand. NA stands for Not Available, which represents an unknown value. According to the book, missing values are "contagious": almost any operation involving an unknown (NA) value will also be unknown. Here are some examples:
# Is unknown greater than 0? Result is unknown (NA)
NA > 0
#NA
# Is unknown less than 0? Output is unknown (NA).
NA < 0
# NA
# Is unknown equal to unknown? Output is unknown(NA).
NA == NA
# NA
Getting back to your data, when you do:
iris_test$Sepal.Length[1] = NA
you are setting the value of iris_test$Sepal.Length[1] to "unknown" (NA).
The question then becomes "Is unknown less than 0?".
The answer is unknown, and that is why your subsetting returns NA as output.
There is a function called is.na() which I'm sure you're aware of to handle missing values.
Hope that adds some insight to your question.
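If the goal is to get numeric(0) back even when the data contains NA, one option is to keep the NA out of the index. The sketch below shows two common ways, which() and an explicit is.na() guard; the variable names are only for illustration.

```r
iris_test <- iris
iris_test$Sepal.Length[1] <- NA

# The comparison is NA for the first row and FALSE for all others.
cond <- iris_test$Sepal.Length < 0

# which() returns only the positions that are TRUE, dropping NA and FALSE.
with_which <- iris_test$Sepal.Length[which(cond)]

# Alternatively, treat NA in the condition as "no match" explicitly.
with_isna <- iris_test$Sepal.Length[!is.na(cond) & cond]

print(with_which)
print(with_isna)
```

Both results are empty numeric vectors, because no known value is below 0 and the unknown one is no longer passed through as an index.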

I am trying to make a function that checks how many NAs are in each column, and then deletes the column if more than 20% of the entries are blank

I am programming in R for a commercial real estate project from this place I started to work at. I have data frames that have 195 categories for each of the properties sold in that area for the last year. The categories are along the top and the properties along the row.
I tried to make a function called cuttingvariables1 to cut out the number of variables first by taking a subset of the categories based on if they have seller, buyer, buyers, listing in the column name.
I was able to get it to work when I ran it as commands, but why isn't it working when I put it in a function in the source file and run it from there?
Cuttingvariables2 is my second function and I do not understand why it stops working at line 7 for that loop. The loop is meant to check every na_count for each category and then see if it is greater than 20% the number of properties listed in that loaded csv. If it is, then the column gets deleted.
Any help would be appreciated.
cuttingvariables1 <- function(dataset)
(
dataset <- (subset(dataset,select=c(!grepl("Seller|Buyer|Buyers|Listing",names(dataset))))
)
)
Cuttingvariables2 function below!
cuttingvariables2 <- function(dataset)
{
z = ncol(dataset)
na_count <- c(lapply(dataset, function(y) sum(length(which(is.na(y))))))
setDT(na_count, keep.rownames = TRUE)[]
j = ncol(na_count)
for (i in 1:j) if((as.integer(na_count[,i])) > (nrow(dataset)/5)) na_count <- na_count[,-i]
for (i in 1:ncol(dataset)) if(colnames(dataset)[i] %in% (colnames(na_count))) dataset <- dataset[,-i]
return (dataset[1:5,1:5])
return (colnames(dataset))
}
#sample data
BROWNSVILLEMF2016TO2017[1:12,1:5]
Actual.Cap.Rate Age Asking.Price Assessed.Improved Assessed.Land
1 NA 31 NA 12039000 1776000
2 NA NA NA 1434000 1452000
3 NA 87 NA 306900 270000
4 NA 11 NA 432900 337950
5 NA 89 NA 281700 107100
6 4.5 87 3300000 NA NA
7 NA 96 NA 427500 66150
8 NA 87 NA 1228000 300000
9 NA 95 NA NA NA
10 NA 95 NA NA NA
11 NA 87 NA 210755 14418
12 NA 87 NA NA NA
I would not use subset directly with grep because you have so many fields. There may be very different versions of the words, and you want them whether they are capitalized or not.
(be sure to check my R grammar I have been working in python all day)
# Empty list: you will build up a list of column names
extractList <- list()
# Keywords to look for in the column names, stored as a lowercase character vector
nameList <- c("seller", "buyer", "buyers", "listing")
# Outer loop: walk the column indices and pull the column name off each one at a time
for (i in seq_len(ncol(dataset))) {
  cName <- names(dataset)[i]
  lcName <- tolower(cName)
  # Inner loop: compare each keyword in nameList with the lowercased column name
  for (j in nameList) {
    # On a match, append the ORIGINAL (not lowercased) column name to the list
    if (grepl(j, lcName)) { extractList <- append(cName, extractList) }
  }
}
# Now remove duplicate names from the extract list
extractList <- unique(extractList)
At this point you should have a concatenated list of column names each of which has one (or more) of those four words in ANY FORM capital or lowercase or camel case...which was the point of lowering the case of the column name before comparing them. Now you just need to subset the data frame the easy way!
newSet<- dataset[,which((names(dataset) %in% extractList)==TRUE)]
This creates a logical vector with %in% statement so only names in the data frame which appear on the new list of unique column names with ANY version of your keywords will show as TRUE and be included in the columns of the new set.
Now you should have a complete set of data with only the types of column names you are looking to use. DO NOT JUST USE THIS...look at it and try to understand why some of the more esoteric tricks are at play so that you can work through similar problems in the future.
Almost forgot:
install.packages("questionr")
Then load the package and run:
library(questionr)
freq.na(newSet)
This will give you a formatted table with the number of missing values and the percentage of NAs for each column. You can assign it to a variable and use it in your vetting process!
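For the second part of the question (dropping columns where more than 20% of the entries are NA), a base R sketch without loops could look like this. The function name drop_sparse_columns, its max_na parameter and the example data are hypothetical; the column names are only borrowed from the sample shown in the question.

```r
# Hypothetical helper: keep the columns whose share of NA entries
# is at most max_na (0.2 means "drop if more than 20% are NA").
drop_sparse_columns <- function(dataset, max_na = 0.2) {
  na_share <- colMeans(is.na(dataset))         # fraction of NA per column
  dataset[, na_share <= max_na, drop = FALSE]  # drop = FALSE keeps a data frame
}

# Made-up example data in the spirit of the question's sample.
example <- data.frame(
  Actual.Cap.Rate = c(NA, NA, NA, NA, NA, 4.5),
  Age             = c(31, NA, 87, 11, 89, 87),
  Assessed.Land   = c(1776000, 1452000, 270000, 337950, 107100, 66150)
)

kept <- drop_sparse_columns(example)
print(names(kept))
```

Here Actual.Cap.Rate is mostly NA and is dropped, while Age (one NA in six rows) and Assessed.Land are kept. Because the function returns the filtered data frame as its value, the caller has to assign the result, as in kept <- drop_sparse_columns(example).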

Conditional searching which omits NA values

I'm doing a conditional search of part of a dataset that has multiple NA values within each row.
Something like this (a preview)..
time1 time2 time3 time4 slice1 slice2 slice3 slice4
pt1 1 3 NA NA NA 1 3 5
pt2 NA 1 3 5 5 2 2 4
I want to do some conditional searching which applies a condition (comparing whether one column within a row is larger than another) for each row. I want to find all the rows (pt's) where a variable column (e.g. time1) is smaller than the corresponding column (e.g. slice 1).
all.smaller<-subset(patientdata, time1>slice1 & time2>slice2 & time3>slice3 & time4>slice4, na.rm=TRUE, select=c(1))
When I use this code (on a larger expanded table of this format), it only returns the rows without any NAs, where all the values are added in. This makes sense given the use of '&'.
My question is: Is there a way to ignore the NAs and return the rows where the condition (time1>slice1, time2>slice2, etc.) holds for every pair of columns in which both values are provided?
Any help is appreciated. Thanks.
You can make a function that takes a boolean (possibly NA) and maps it to TRUE if it is NA and its value otherwise.
na.true <- function(x) ifelse(is.na(x), TRUE, x)
You can then replace your subset with
na.true(time1 > slice1) & na.true(time2 > slice2) & na.true(time3 > slice3) & na.true(time4 > slice4)
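Put together as a runnable sketch, with a small patientdata frame rebuilt from the preview in the question (the values are assumed to match that preview):

```r
# Rebuilt from the two-row preview in the question.
patientdata <- data.frame(
  time1  = c(1, NA), time2  = c(3, 1), time3  = c(NA, 3), time4  = c(NA, 5),
  slice1 = c(NA, 5), slice2 = c(1, 2), slice3 = c(3, 2),  slice4 = c(5, 4),
  row.names = c("pt1", "pt2")
)

# Map NA comparisons to TRUE so they never veto a row.
na.true <- function(x) ifelse(is.na(x), TRUE, x)

all.smaller <- subset(
  patientdata,
  na.true(time1 > slice1) & na.true(time2 > slice2) &
    na.true(time3 > slice3) & na.true(time4 > slice4)
)
print(rownames(all.smaller))
```

pt1 is kept because its only complete pair (time2, slice2) satisfies the condition, while pt2 is dropped because time2 > slice2 is FALSE.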
You could try this.
n=1:4
cond <- paste0('((is.na(time',n,')|is.na(slice',n,'))|(time',n,'>slice',n,'))')
conds <- paste(cond, collapse=' & ')
all.smaller <- subset( patientdata, eval(parse(text=conds)) )
Essentially this checks if either time or slice are NA and forces a TRUE, and if not, check whether time is greater than slice. (Individually for each index.) It becomes clearer if you print out conds to see what it looks like.
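The same rule can also be written without eval(parse()) by comparing the two blocks of columns as matrices. This is an alternative sketch, not the approach from the answer above; it assumes the columns are named time1..time4 and slice1..slice4 as in the question's preview.

```r
# Rebuilt from the two-row preview in the question.
patientdata <- data.frame(
  time1  = c(1, NA), time2  = c(3, 1), time3  = c(NA, 3), time4  = c(NA, 5),
  slice1 = c(NA, 5), slice2 = c(1, 2), slice3 = c(3, 2),  slice4 = c(5, 4),
  row.names = c("pt1", "pt2")
)

time_cols  <- paste0("time", 1:4)
slice_cols <- paste0("slice", 1:4)

# Element-wise comparison; a cell is NA when either value is missing.
ok <- as.matrix(patientdata[time_cols]) > as.matrix(patientdata[slice_cols])

# Keep a row when none of its known comparisons is FALSE.
keep <- rowSums(!ok, na.rm = TRUE) == 0
all.smaller <- patientdata[keep, ]
print(rownames(all.smaller))
```

Counting the FALSE cells with na.rm = TRUE is what makes the NA pairs neutral, and the code does not change if more time/slice pairs are added to the name vectors.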

summing across columns with missing values in a data.frame

I want to get the index of the column with the highest sum. However, I don't know how to handle missing values to make the correct calculation. NAs should be omitted (i.e. ignored during summing) and not converted to "0".
x=rep(NA,3); y=c(NA,0,-1); z=c(0, rep(NA,2))
data=cbind(x,y,z)
x y z
[1,] NA NA 0
[2,] NA 0 NA
[3,] NA -1 NA
I want to get the index of a column with the highest value. In the example above it's [,3].
However the functions
which.max(colSums(!is.na(data)))
or
apply(data,2,sum, na.rm=T)
don't generate the expected output.
Any help appreciated. Thx.
You can determine the column index of the column whose sum is greatest among the columns with non missing values in this way:
dataAvailIdx <- which(apply(data,2,function(x) any(!is.na(x))))
dataAvailIdx[which.max(colSums(data[,dataAvailIdx,drop=FALSE],na.rm=TRUE))]
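To see why the attempt with na.rm=TRUE alone picks the wrong column, it helps to look at the intermediate sums. The sketch below uses the data from the question; resetting the all-NA columns to NA before calling which.max() is a variation on the answer above, and the variable names are only illustrative.

```r
x <- rep(NA, 3); y <- c(NA, 0, -1); z <- c(0, rep(NA, 2))
data <- cbind(x, y, z)

# With na.rm = TRUE an all-NA column sums to 0, so x ties with z
# and which.max() returns the first of the tied columns.
sums <- colSums(data, na.rm = TRUE)
first_try <- which.max(sums)

# Mark columns that hold no data at all as NA again;
# which.max() skips missing values.
has_data <- colSums(!is.na(data)) > 0
sums[!has_data] <- NA
best <- which.max(sums)
print(best)
```

first_try points at column x (index 1), whereas best points at column z (index 3), which is the expected result.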
