I am trying to make a function that checks how many NAs are in each column category, and then deletes the column if more than 20% of the entries are blank - r

I am programming in R for a commercial real estate project at a place where I recently started working. I have data frames that have 195 categories for each of the properties sold in that area over the last year. The categories run along the top and the properties down the rows.
I tried to make a function called cuttingvariables1 to cut down the number of variables, first by taking a subset of the categories based on whether they have seller, buyer, buyers, or listing in the column name.
I was able to get it to work when I ran it as individual commands, but why isn't it working when I put it in a function in the source file and run it from there?
Cuttingvariables2 is my second function, and I do not understand why it stops working at line 7, in the loop. The loop is meant to check the na_count for each category and see whether it is greater than 20% of the number of properties listed in the loaded csv. If it is, the column gets deleted.
Any help would be appreciated.
cuttingvariables1 <- function(dataset)
(
dataset <- (subset(dataset,select=c(!grepl("Seller|Buyer|Buyers|Listing",names(dataset))))
)
)
Cuttingvariables2 function below!
cuttingvariables2 <- function(dataset)
{
z = ncol(dataset)
na_count <- c(lapply(dataset, function(y) sum(length(which(is.na(y))))))
setDT(na_count, keep.rownames = TRUE)[]
j = ncol(na_count)
for (i in 1:j) if((as.integer(na_count[,i])) > (nrow(dataset)/5)) na_count <- na_count[,-i]
for (i in 1:ncol(dataset)) if(colnames(dataset)[i] %in% (colnames(na_count))) dataset <- dataset[,-i]
return (dataset[1:5,1:5])
return (colnames(dataset))
}
#sample data
BROWNSVILLEMF2016TO2017[1:12,1:5]
Actual.Cap.Rate Age Asking.Price Assessed.Improved Assessed.Land
1 NA 31 NA 12039000 1776000
2 NA NA NA 1434000 1452000
3 NA 87 NA 306900 270000
4 NA 11 NA 432900 337950
5 NA 89 NA 281700 107100
6 4.5 87 3300000 NA NA
7 NA 96 NA 427500 66150
8 NA 87 NA 1228000 300000
9 NA 95 NA NA NA
10 NA 95 NA NA NA
11 NA 87 NA 210755 14418
12 NA 87 NA NA NA

I would not use subset directly with grep because you have so many fields. There may be very different versions of the words, and you want them whether they are capitalized or not.
(Be sure to check my R grammar; I have been working in Python all day.)
#Empty List - you will build a list of names
extractList<-list()
#names you are looking for in column names saved as a list (lowercase)
nameList<- c("seller","buyer","buyers","listing")
#Create the outer loop to grab index of columns and pull the column name off each one at a time
for (i in 1:ncol(dataset)){
cName<-names(dataset[i])
lcName<-tolower(cName)
#Created a loop within that loop to compare each keyword on your nameList to the columns to see if the word is in the title (with title case lowered)
for (j in nameList){
#if it is append the column name to the list NOT LOWER CASE, ***ORIGINAL***
if(grepl(j, lcName)==TRUE ){extractList=append(cName,extractList)}
} }
#Now remove duplicates names for the extract list
extractList<-unique(extractList)
At this point you should have a concatenated list of column names each of which has one (or more) of those four words in ANY FORM capital or lowercase or camel case...which was the point of lowering the case of the column name before comparing them. Now you just need to subset the data frame the easy way!
newSet<- dataset[,which((names(dataset) %in% extractList)==TRUE)]
This creates a logical vector with %in% statement so only names in the data frame which appear on the new list of unique column names with ANY version of your keywords will show as TRUE and be included in the columns of the new set.
Now you should have a complete set of data with only the types of column names you are looking to use. DO NOT JUST USE THIS...look at it and try to understand why some of the more esoteric tricks are at play so that you can work through similar problems in the future.
Almost forgot:
install.packages("questionr")
Then:
library(questionr)
freq.na(newSet)
will give you a formatted table with the number of missing values and the percent of NAs for each column. You can assign this to a variable to use in your vetting process!
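For comparison, here is a minimal base R sketch that does both steps from the question in one function. The name drop_sparse_columns, its arguments, and the Seller.Name column in the toy data are made up for illustration. Following the `!grepl(...)` in the question, it drops the columns whose names match the keywords; remove the `!` if you want to keep them instead, as the loop above does.

```r
# Sketch only: drop_sparse_columns, pattern and max_na_share are hypothetical names.
drop_sparse_columns <- function(dataset,
                                pattern = "seller|buyer|listing",
                                max_na_share = 0.2) {
  # Drop columns whose name contains any keyword, ignoring case
  keep_name <- !grepl(pattern, names(dataset), ignore.case = TRUE)
  dataset <- dataset[, keep_name, drop = FALSE]

  # Share of NA values in each remaining column
  na_share <- colMeans(is.na(dataset))

  # Keep only columns where at most max_na_share of the entries are NA
  dataset[, na_share <= max_na_share, drop = FALSE]
}

# Toy data loosely based on the sample in the question (Seller.Name is invented)
sample_data <- data.frame(
  Actual.Cap.Rate = c(NA, NA, NA, NA, NA, 4.5),
  Age             = c(31, NA, 87, 11, 89, 87),
  Seller.Name     = c("a", "b", "c", "d", "e", "f"),
  stringsAsFactors = FALSE
)

result <- drop_sparse_columns(sample_data)
names(result)
```

Using colMeans(is.na(...)) avoids the loop over column indices, so columns are never removed while they are still being iterated over.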

Related

Selecting Specific Data in an Data Frame to replace using the Row and Column names

I am attempting to replace specific NA values with 0 in my data table. I do not want all NAs replaced, only those under certain conditions. For example, "replace NA with zeros when the row is Cole_1 and the column includes the designation 'Fall1'". I have a huge data set, so I need as little manual designating as possible; numbering each column is not an option. Basically, I want to be able to target the cells like playing battleship.
I have tried:
whentest <- count_order_site %>%
when(select(contains("Fall1")) &
count_order_site[count_order_site$Point_Name == "Cole_1", ],
count_order_site[is.na(count_order_site)] <- 0 )
but get an error "contains() must be used within a selecting function."
I'm not even sure if this is the right path to get what I want.
The basic layout idea:
Point Name  ACWO_Fall1  ACWO_FAll2  HOSP_FAll1
Cole_1      NA          3           NA
Cole_2      3           NA          5
After the functions the data would look like:
Point Name  ACWO_Fall1  ACWO_FAll2  HOSP_FAll1
Cole_1      0           3           0
Cole_2      3           NA          5
If I understand correctly, you can use mutate across to include columns that contain certain character values, such as "Fall1". Then, with the replace function, replace those values that are missing using is.na and where the point_name has a specific value, such as "Cole_1".
The example below has a couple extra columns to demonstrate if the logic is correct.
library(tidyverse)
df %>%
mutate(across(contains("Fall1"), ~replace(., is.na(.) & point_name == "Cole_1", 0)))
Output
point_name ACWO_Fall1 ACWO_Fall2 HOSP_Fall1 Other1 Other_Fall1
1 Cole_1 0 3 0 NA 6
2 Cole_2 3 NA 5 NA NA
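If you would rather avoid the tidyverse dependency, the same idea can be sketched in base R. The data frame below is an assumption, rebuilt by hand to mirror the output shown above; note that grep is case-sensitive by default, so add ignore.case = TRUE if your real column names vary in case.

```r
# Hypothetical data frame mirroring the output shown above
df <- data.frame(
  point_name  = c("Cole_1", "Cole_2"),
  ACWO_Fall1  = c(NA, 3),
  ACWO_Fall2  = c(3, NA),
  HOSP_Fall1  = c(NA, 5),
  Other1      = c(NA, NA),
  Other_Fall1 = c(6, NA),
  stringsAsFactors = FALSE
)

# Columns whose name contains "Fall1"
fall1_cols <- grep("Fall1", names(df), value = TRUE)

# Rows to target
target_row <- df$point_name == "Cole_1"

# Replace NA with 0 only where both the row and the column match
for (col in fall1_cols) {
  hit <- target_row & is.na(df[[col]])
  df[[col]][hit] <- 0
}

df
```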

Combine table and matrix with R

I am performing an analysis in R. I want to fill the first row of a matrix with the content of a table. The problem I have is that the content of the table varies depending on the data, so sometimes certain identifiers that appear in the matrix do not appear in the table.
> random.evaluate
DNA LINE LTR/ERV1 LTR/ERVK LTR/ERVL LTR/ERVL-MaLR other SINE
[1,] NA NA NA NA NA NA NA NA
> y
DNA LINE LTR/ERVK LTR/ERVL LTR/ERVL-MaLR SINE
1 1 1 1 1 4
Due to this, when I try to join the data of the matrix with the data of the table, I get the following error
random.evaluate[1,] <- y
Error in random.evaluate[1, ] <- y :
number of items to replace is not a multiple of replacement length
Could someone help me fix this bug? I have found solutions to this error but in my case they do not work for me.
First check which column names of the table exist in the matrix.
For the names that exist, just set the values as usual, indexing the matrix by those names.
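A minimal sketch of that idea, assuming the matrix and table look like the ones printed in the question (the input vector used to build y here is invented so the counts match):

```r
# Matrix with one row of NA and the identifiers as column names
ids <- c("DNA", "LINE", "LTR/ERV1", "LTR/ERVK", "LTR/ERVL",
         "LTR/ERVL-MaLR", "other", "SINE")
random.evaluate <- matrix(NA_integer_, nrow = 1,
                          dimnames = list(NULL, ids))

# Hypothetical input that reproduces the counts shown for y
y <- table(c("DNA", "LINE", "LTR/ERVK", "LTR/ERVL", "LTR/ERVL-MaLR",
             rep("SINE", 4)))

# Identifiers present in both the table and the matrix
shared <- intersect(names(y), colnames(random.evaluate))

# Fill by name; identifiers missing from y stay NA
random.evaluate[1, shared] <- as.vector(y[shared])

random.evaluate
```

Indexing by name rather than position means the lengths always agree, so the "not a multiple of replacement length" error goes away. If you want zeros instead of NA for the missing identifiers, initialise the matrix with 0.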

R: if a value is less or is na update another data.frame

I have two data.frames A and B.
A contains negative, absolute and NA values.
B contains only positive and NA values.
The dimensions of the data frames are the same.
data.frame A looks like this:
ENSMUSG00000000001.4/Gnai3 0.1943315 0.3021675 NA NA
ENSMUSG00000000003.9/Pbsn -1.4843914 -1.2608270 -0.2587953 -0.46167430
ENSMUSG00000000028.8/Cdc45 -0.2388901 -0.1106236 0.9046436 0.08968331
ENSMUSG00000000037.9/Scml 0.3242902 0.5385371 0.2311202 0.51110287
ENSMUSG00000000049.5/Apoh -1.7606033 -1.8159545 -0.2087083 -1.09614630
ENSMUSG00000000056.7/Narf NA NA -0.3747798 -0.55547798
I need to check if a value is NA or negative in this table then I need to update data.frame B on the same indices to the value 0.999.
For example:
The first record of A has two NA values, indexes are [1,4] and [1,5] meaning, I will update B[1,4]=0.999 and B[1,5]=0.999.
I could do this with nested loops over the columns and rows, but it would take too much time. Is there a faster way?
You can pass a Boolean mask as an index if it's the same size:
b[is.na(a) | a < 0] <- 0.999
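Here is a small self-contained run of that one-liner. The two data frames are made-up stand-ins for A and B with the same dimensions.

```r
# Made-up stand-ins for A and B (same dimensions)
a <- data.frame(s1 = c(0.19, -1.48, NA),
                s2 = c(NA, -0.26, -0.37))
b <- data.frame(s1 = c(0.5, 0.2, NA),
                s2 = c(0.1, NA, 0.3))

# TRUE wherever a is NA or negative; is.na() covers the cells
# where the comparison a < 0 would itself be NA
mask <- is.na(a) | a < 0

# Overwrite the matching cells of b in one vectorised step
b[mask] <- 0.999

b
```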
I would use ifelse to do this, since the dataframes have the same dimensions.
A<-matrix(data=1:15,nrow=5) # create matrices (convert a data frame with as.matrix() first)
B<-matrix(data=16:30,nrow=5)
B[1,2]<-NA # introduce some NA and negative values
B[5,3]<-(-1)
ifelse(is.na(B) | B<=0,A,B) # new matrix with "updated" values

Check each element of a column against vector

I have two dataframes. I need to check each element of a column in one data against each element of the second dataframe and when there is a match copy something from a different column in the second dataframe back to another column in the first dataframe.
Here is some fake data to play with:
df1 <-data.frame(c("267119002","257051033",NA,"267098003","267099020","267047006"))
names(df1)[1]<-"ID"
df2 <-data.frame(c("257051033","267098003","267119002","267047006","267099020"))
names(df2)[1]<-"ID"
df2$vals <-c(11,22,33,44,55)
Basically what I want to do is for each ID in df1, check for the corresponding matching row in df2, and copy the value of df2$vals back to df1. Merge is not really an option cause in the real data I need to repeat this for many columns and multiple merges would result in df1 getting stupidly big. I need to keep it lean! And df1 may contain NA's in which case I want to place NA in the new column instead of a value.
You can use match:
df2[match(df1$ID,df2$ID),]
ID vals
3 267119002 33
1 257051033 11
NA <NA> NA
2 267098003 22
5 267099020 55
4 267047006 44
And if you want to remove NA:
df2[na.omit(match(df1$ID,df2$ID)),]
ID vals
3 267119002 33
1 257051033 11
2 267098003 22
5 267099020 55
4 267047006 44
Ok, so thanks to agstudy's answer I was able to figure this out myself. This does exactly what I want!
fetcher <-function(x){
y <- df2$vals[which(match(df2$ID,x)==TRUE)]
return(y)
}
sapply(df1$ID,function(x) fetcher(x))
Thanks for the inspiration agstudy
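For reference, the match index can also be used directly to fill a new column in df1 without a helper function. This is a sketch using the fake data from the question; IDs with no match in df2, including NA, come back as NA.

```r
df1 <- data.frame(ID = c("267119002", "257051033", NA, "267098003",
                         "267099020", "267047006"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(ID = c("257051033", "267098003", "267119002",
                         "267047006", "267099020"),
                  vals = c(11, 22, 33, 44, 55),
                  stringsAsFactors = FALSE)

# match() gives, for each df1$ID, the row position in df2 (or NA);
# indexing df2$vals with it copies the values across in df1's order
df1$vals <- df2$vals[match(df1$ID, df2$ID)]

df1
```

Because only one vector is added to df1, this can be repeated for as many columns as needed without df1 growing the way repeated merges would.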

Making compatible (equal) dimensions for two vectors in R

I have a vector called classes that is the output of an analysis that used listwise deletion. As a result, the cases included in classes are a subset of the entire dataset -- some cases were dropped because of incomplete data.
Selection is a dummy variable that occurs with every case in my dataset. A shortened example of my data is below. There is also a unique case ID for every observation.
classes <- c(1,2,1,1,1,2,3,3,3,1,1,1,3,3,2,2,2)
selection <- c(1,0,0,0,1,1,1,1,0,0,0,0,0,1,1,1,1,0,0,0,1,1,1,0,1,0)
case <-seq(1,26,1)
I would like to create a new version of selection (say, selection2) so that it only includes cases that are in classes. Basically, I would like both variables to be the same length for comparison purposes, where the cases that are NOT included in classes are also not included in selection2.
I thought this would be an easy fix, but I've spent a lot of time getting nowhere, so I thought I'd ask. Thanks in advance!
If they are to be the same length, then the reduced version must have NA's:
> selection2 <- selection
> is.na(selection2) <- !selection2 %in% classes
> selection2
[1] 1 NA NA NA 1 1 1 1 NA NA NA NA NA 1 1 1 1 NA NA NA 1 1 1 NA 1 NA
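Note that the snippet above compares the values of selection with the values of classes, not the case IDs. If you know which case IDs survived the listwise deletion, you can subset by ID instead and get a vector of the same length as classes. The kept_cases vector below is purely hypothetical, since the question does not show which cases were dropped.

```r
classes   <- c(1,2,1,1,1,2,3,3,3,1,1,1,3,3,2,2,2)
selection <- c(1,0,0,0,1,1,1,1,0,0,0,0,0,1,1,1,1,0,0,0,1,1,1,0,1,0)
case      <- seq(1, 26, 1)

# Hypothetical: IDs of the 17 cases that were kept in the analysis
kept_cases <- c(1, 2, 3, 5, 6, 8, 9, 10, 12, 13, 15, 16, 18, 20, 21, 23, 25)

# Keep only the selection values whose case ID was retained
selection2 <- selection[case %in% kept_cases]

length(selection2) == length(classes)
```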

Resources