Extracting values from matrix to add to a dataframe column using dplyr - r

I am using mutate to add a fifth column (h) and a sixth column (d) to my data frame, which contains 37975 rows and 4 columns named i, j, k, x. For d, I am picking values from a matrix that was originally a raster. Adding this column (d) is causing problems. I receive the following error when I try to view the data frame.
Error: cannot allocate vector of size 9.3 Gb
Error: no more error handlers available (recursive errors?); invoking 'abort' restart
I use the following code to do this:
vxfile <- vxfile %>%
select(1:4) %>%
mutate(h = k+as.numeric(sub(".*\\s","",txt)),
d = as.vector(dt_mat[35-j, i+1]))
When I check lengths of the columns, column d is different. It is 1442100625 (37975 x 37975) while others are obviously 37975
When I check the class of each column, column d is "matrix" "array" and the others are numeric
When I check the str(vxfile), column d is $ d : num [1:37975, 1:37975] NA NA NA NA NA NA NA NA NA NA ...
Clearly the problem is with how I am picking the value from the matrix. Could someone explain why this is causing this problem?
I tried dt_mat[cbind(35-j, i+1)] and it seems to be working
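A small illustration of why the two forms differ, using a toy matrix in place of dt_mat (the values and the names rows, cols, block and picked are made up for this sketch). Indexing with two separate vectors returns every row/column combination, whereas indexing with a two-column matrix returns one element per (row, column) pair, which is what a new data frame column needs:

```r
# Toy stand-ins for dt_mat, 35-j and i+1 (hypothetical values)
dt_mat <- matrix(1:12, nrow = 3)   # 3 rows x 4 columns, filled column-wise
rows <- c(1, 2, 3)
cols <- c(4, 1, 2)

# Two index vectors: all combinations, so a length(rows) x length(cols) matrix
block <- dt_mat[rows, cols]
dim(block)

# One two-column index matrix: one value per (row, column) pair
picked <- dt_mat[cbind(rows, cols)]
picked
```

With 37975 row indices and 37975 column indices the first form asks for a 37975 x 37975 result, which would explain the allocation error; the pairwise values are only the diagonal of that block.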

Related

Combine table and matrix with R

I am performing an analysis in R. I want to fill the first row of a matrix with the content of a table. The problem is that the content of the table varies with the data, so sometimes certain identifiers that appear in the matrix do not appear in the table.
> random.evaluate
     DNA LINE LTR/ERV1 LTR/ERVK LTR/ERVL LTR/ERVL-MaLR other SINE
[1,]  NA   NA       NA       NA       NA            NA    NA   NA
> y
DNA LINE LTR/ERVK LTR/ERVL LTR/ERVL-MaLR SINE
  1    1        1        1             1    4
Due to this, when I try to join the data of the matrix with the data of the table, I get the following error
random.evaluate[1,] <- y
Error in random.evaluate[1, ] <- y :
number of items to replace is not a multiple of replacement length
Could someone help me fix this bug? I have found solutions to this error but in my case they do not work for me.
First check whether the column names of the table exist in the matrix.
Check this link
For the names that exist, just set the values as usual.
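One way to do that check and assignment in one step is to index the matrix row by name. This is only a sketch: the objects below are rebuilt by hand from the printout in the question, and the name common is made up here.

```r
# Hypothetical reconstruction of the objects shown in the question
classes <- c("DNA", "LINE", "LTR/ERV1", "LTR/ERVK", "LTR/ERVL",
             "LTR/ERVL-MaLR", "other", "SINE")
random.evaluate <- matrix(NA_real_, nrow = 1, ncol = length(classes),
                          dimnames = list(NULL, classes))
y <- table(c("DNA", "LINE", "LTR/ERVK", "LTR/ERVL", "LTR/ERVL-MaLR",
             rep("SINE", 4)))

# Keep only the names present in both objects, then assign by name
common <- intersect(names(y), colnames(random.evaluate))
random.evaluate[1, common] <- as.vector(y[common])
random.evaluate
```

Columns that are missing from the table (here LTR/ERV1 and other) simply stay NA.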

Why do I get NA when I index a vector (or data frame) with a condition that nothing matches?

When I index a vector or data frame in R, I sometimes get an empty vector (e.g. numeric(0), integer(0), or factor(0)...), and sometimes get NA.
I guess that I get NA when the vector or dataframe I deal with contains NA.
For example,
iris_test = iris
iris_test$Sepal.Length[1] = NA
iris[iris$Sepal.Length < 0, "Sepal.Length"] # numeric(0)
iris_test[iris_test$Sepal.Length < 0, "Sepal.Length"] # NA
It's intuitive for me to get numeric(0) when I find values that do not match my condition
(no search result --> no element in the resulted vector --> numeric(0)).
However, why do I get NA rather than numeric(0)?
Your assumption is essentially correct: you get NA values when there is an NA in the data.
The comparison yields NA values
iris_test$Sepal.Length < 0
#[1] NA FALSE FALSE FALSE.....
When you subset a vector with NA it returns NA. See for example,
iris$Sepal.Length[c(1, NA)]
#[1] 5.1 NA
This is what the second case returns. In the first case, all the values are FALSE, so you get numeric(0)
iris$Sepal.Length[FALSE]
#numeric(0)
Adding to @Ronak's answer:
The discussion of NA in R for Data Science made NA easy for me to understand. NA stands for Not Available, which is a representation of an unknown value. According to that book, missing values are "contagious"; almost any operation involving an unknown (NA) value will also be unknown. Here are some examples:
# Is unknown greater than 0? Result is unknown (NA)
NA > 0
#NA
# Is unknown less than 0? Output is unknown (NA).
NA < 0
# NA
# Is unknown equal to unknown? Output is unknown(NA).
NA == NA
# NA
Getting back to your data, when you do
iris_test$Sepal.Length[1] = NA
you are setting the value of iris_test$Sepal.Length[1] to "unknown" (NA).
The question then becomes "Is unknown less than 0?".
The answer is unknown, and that is why your subsetting returns NA as output. The value is unknown (NA).
There is a function called is.na() which I'm sure you're aware of to handle missing values.
Hope that adds some insight to your question.
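If you want numeric(0) even when the data contain NA, two common base R options are to wrap the condition in which() or to combine it with !is.na(). A minimal sketch on the same iris_test data (the names cond, with_which and with_isna are made up here):

```r
iris_test <- iris
iris_test$Sepal.Length[1] <- NA

cond <- iris_test$Sepal.Length < 0   # NA for row 1, FALSE everywhere else

# which() keeps only the positions that are TRUE, so the NA is dropped
with_which <- iris_test$Sepal.Length[which(cond)]

# Equivalent: require the condition to be non-missing and TRUE
with_isna <- iris_test$Sepal.Length[!is.na(cond) & cond]

with_which
with_isna
```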

Assign a value in an R data frame without checking whether the index is empty

df = data.frame(A=c(1,1),B=c(2,2))
df$C = NA
df[is.na(df$B),]$C=5
Each time I want to assign a new value and the index turns out to be empty, as with is.na(df$B) here, R raises the error "replacement has 1 row, data has 0".
Is there a way for R to simply assign nothing in these cases instead of raising an error?
We can do this in a single line instead of assigning 'C' to NA and then subsetting the data.frame. The below code will assign 5 to 'C' where there are NA elements in 'B' or else it will be NA
df$C[is.na(df$B)] <- 5
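A short sketch of how this behaves in both situations, with C initialised explicitly for clarity (the name unchanged is made up for the illustration):

```r
df <- data.frame(A = c(1, 1), B = c(2, 2))
df$C <- NA_real_

# No NA in B: the index selects nothing, so the assignment is a silent no-op
df$C[is.na(df$B)] <- 5
unchanged <- all(is.na(df$C))

# Introduce one missing value in B; the same line now fills the matching row
df$B[2] <- NA
df$C[is.na(df$B)] <- 5
df
```

The difference from the original code is that the column is extracted first and then indexed, so an empty index never produces a zero-row data frame to assign into.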

I am trying to make a function that checks how many NA values are in each column category, and then deletes the column if more than 20% of the entries are blank

I am programming in R for a commercial real estate project from this place I started to work at. I have data frames that have 195 categories for each of the properties sold in that area for the last year. The categories are along the top and the properties along the row.
I tried to make a function called cuttingvariables1 to cut out the number of variables first by taking a subset of the categories based on if they have seller, buyer, buyers, listing in the column name.
I was able to make it work when I ran it as individual commands, but why doesn't it work when I define it as a function in the source file and run it from there?
Cuttingvariables2 is my second function and I do not understand why it stops working at line 7 for that loop. The loop is meant to check every na_count for each category and then see if it is greater than 20% the number of properties listed in that loaded csv. If it is, then the column gets deleted.
Any help would be appreciated.
cuttingvariables1 <- function(dataset)
(
dataset <- (subset(dataset,select=c(!grepl("Seller|Buyer|Buyers|Listing",names(dataset))))
)
)
Cuttingvariables2 function below!
cuttingvariables2 <- function(dataset)
{
z = ncol(dataset)
na_count <- c(lapply(dataset, function(y) sum(length(which(is.na(y))))))
setDT(na_count, keep.rownames = TRUE)[]
j = ncol(na_count)
for (i in 1:j) if((as.integer(na_count[,i])) > (nrow(dataset)/5)) na_count <- na_count[,-i]
for (i in 1:ncol(dataset)) if(colnames(dataset)[i] %in% (colnames(na_count))) dataset <- dataset[,-i]
return (dataset[1:5,1:5])
return (colnames(dataset))
}
#sample data
BROWNSVILLEMF2016TO2017[1:12,1:5]
Actual.Cap.Rate Age Asking.Price Assessed.Improved Assessed.Land
1 NA 31 NA 12039000 1776000
2 NA NA NA 1434000 1452000
3 NA 87 NA 306900 270000
4 NA 11 NA 432900 337950
5 NA 89 NA 281700 107100
6 4.5 87 3300000 NA NA
7 NA 96 NA 427500 66150
8 NA 87 NA 1228000 300000
9 NA 95 NA NA NA
10 NA 95 NA NA NA
11 NA 87 NA 210755 14418
12 NA 87 NA NA NA
I would not use subset directly with grep because you have so many fields. There may be very different versions of the words, and you want them whether they are capitalized or not.
(be sure to check my R grammar I have been working in python all day)
#Empty character vector - you will build up the matching column names here
extractList <- character(0)
#names you are looking for in column names, saved in lowercase
nameList <- c("seller","buyer","buyers","listing")
#Outer loop: walk the column indices and pull off one column name at a time
for (i in seq_len(ncol(dataset))){
  cName <- names(dataset)[i]
  lcName <- tolower(cName)
  #Inner loop: compare each keyword in nameList with the lowercased column name
  for (j in nameList){
    #if the keyword is found, append the column name NOT LOWER CASE, ***ORIGINAL***
    if (grepl(j, lcName, fixed = TRUE)) { extractList <- append(extractList, cName) }
  }
}
#Now remove duplicate names from the extract list
extractList <- unique(extractList)
At this point you should have a concatenated list of column names each of which has one (or more) of those four words in ANY FORM capital or lowercase or camel case...which was the point of lowering the case of the column name before comparing them. Now you just need to subset the data frame the easy way!
newSet<- dataset[,which((names(dataset) %in% extractList)==TRUE)]
This creates a logical vector with %in% statement so only names in the data frame which appear on the new list of unique column names with ANY version of your keywords will show as TRUE and be included in the columns of the new set.
Now you should have a complete set of data with only the types of column names you are looking to use. DO NOT JUST USE THIS...look at it and try to understand why some of the more esoteric tricks are at play so that you can work through similar problems in the future.
Almost forgot:
install.packages("questionr")
Then:
freq.na(newSet)
will give you a formatted table with the number missing and the percent of NAs for each column; you can assign this to a variable to use in your vetting process!
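For comparison, both steps from the question can also be written without loops in base R. This is only a sketch: it assumes the goal is what the question's code does (drop the Seller/Buyer/Listing columns, then drop any column with more than 20% missing values), and the helper name drop_sparse_columns, its arguments, and the toy data are all made up for the illustration.

```r
# Hypothetical helper; the name, the pattern and the 20% default are assumptions
drop_sparse_columns <- function(dataset, pattern = "seller|buyer|listing",
                                max_na = 0.2) {
  # Drop columns whose names contain any keyword, regardless of case
  keep_name <- !grepl(pattern, names(dataset), ignore.case = TRUE)
  dataset <- dataset[, keep_name, drop = FALSE]
  # colMeans(is.na(.)) is the share of missing entries in each column
  keep_na <- colMeans(is.na(dataset)) <= max_na
  dataset[, keep_na, drop = FALSE]
}

toy <- data.frame(
  Age = c(31, NA, 87, 11, 89, 87, 96, 87, 95, 95),  # 10% missing: kept
  Asking.Price = c(rep(NA, 9), 3300000),            # 90% missing: dropped
  Seller.Name = letters[1:10],                      # keyword match: dropped
  Assessed.Land = 1:10                              # complete: kept
)
result <- drop_sparse_columns(toy)
names(result)
```

Building logical vectors first and subsetting once avoids deleting columns inside a loop, where the column indices shift after every deletion.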

Ordering the data based on 2 columns in a data frame

I have a data frame x with columns a,b,c.
I am trying to order the data based on columns a and c and my ordering criteria is ascending on a and descending on c. Below is the code I wrote.
z <- x[order(x$a,-x$c),]
This code gives me a warning as below.
Warning message:
In Ops.factor(x$c) : - not meaningful for factors
Also, when I check the data frame z using head(z), it gives me wrong data as below:
30708908 0.3918980 NA
22061768 0.4022183 NA
21430343 0.4118651 NA
21429828 0.4134888 NA
21425966 0.4159323 NA
22057521 0.4173094 NA
and initially there weren't any NA values in column c of the data frame x. I have gone through a lot of threads but couldn't find a solution. Can anybody please suggest one?
try this
install.packages('plyr');
library('plyr');
z<-arrange(x,a,desc(c));
In addition, you can use the
options(stringsAsFactors = FALSE)
before you create your frame, or while creating your 'x' data frame, specify
stringsAsFactors = FALSE
z <- x[order(x$a, -xtfrm(as.character(x$c))), ]  # unary minus fails on character, so negate xtfrm() ranks instead
z
If, as Roman suspects, you have digits in your factor levels, you may need to do as he suggests and add as.numeric; otherwise 9 will be greater than 10
z <- x[order(x$a,-as.numeric(as.character(x$c)) ), ]
z
But if they are characters, then you will again get all NAs, so it really depends on the nature of the levels of x$c
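A small self-contained check of the numeric case, on made-up data where c is a factor whose levels look like numbers. The second call is an alternative that avoids negation by passing a per-key decreasing vector, which order() accepts with method = "radix":

```r
# Toy data (hypothetical); c is a factor with number-like levels
x <- data.frame(a = c(1, 1, 2, 2),
                c = factor(c("9", "10", "3", "20")))

# factor -> character -> numeric, then negate for descending order
z_num <- x[order(x$a, -as.numeric(as.character(x$c))), ]

# Same ordering without negation: ascending on a, descending on c
c_num <- as.numeric(as.character(x$c))
z_radix <- x[order(x$a, c_num, decreasing = c(FALSE, TRUE), method = "radix"), ]

z_num
```

Calling as.numeric() directly on the factor would return the internal level codes rather than the printed values, which is why the as.character() step is needed.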
