Making compatible (equal) dimensions for two vectors in R - r

I have a vector called classes that is the output of an analysis that used listwise deletion. As a result, the cases included in classes are a subset of the entire dataset -- some cases were dropped because of incomplete data.
Selection is a dummy variable that occurs with every case in my dataset. A shortened example of my data is below. There is also a unique case ID for every observation.
classes <- c(1,2,1,1,1,2,3,3,3,1,1,1,3,3,2,2,2)
selection <- c(1,0,0,0,1,1,1,1,0,0,0,0,0,1,1,1,1,0,0,0,1,1,1,0,1,0)
case <-seq(1,26,1)
I would like to create a new version of selection (say, selection2) so that it only includes cases that are in classes. Basically, I would like both variables to be the same length for comparison purposes, where the cases that are NOT included in classes are also not included in selection2.
I thought this would be an easy fix, but I've spent a lot of time getting nowhere, so I thought I'd ask. Thanks in advance!

If they are to be the same length, then the reduced version must have NA's:
> selection2 <- selection
> is.na(selection2) <- !selection2 %in% classes
> selection2
[1] 1 NA NA NA 1 1 1 1 NA NA NA NA NA 1 1 1 1 NA NA NA 1 1 1 NA 1 NA
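If the goal is instead a selection2 with exactly as many elements as classes, one option is to subset selection by case ID. This is only a sketch: it assumes you can recover which case IDs survived the listwise deletion (for example from the row names of the data the analysis actually used). The kept_cases vector below is made up for illustration and is not taken from the question.

```r
classes   <- c(1,2,1,1,1,2,3,3,3,1,1,1,3,3,2,2,2)
selection <- c(1,0,0,0,1,1,1,1,0,0,0,0,0,1,1,1,1,0,0,0,1,1,1,0,1,0)
case      <- seq_len(26)

# Hypothetical: IDs of the 17 cases that were kept by the analysis.
kept_cases <- c(1, 2, 4, 5, 7, 8, 9, 11, 12, 14, 15, 17, 19, 20, 22, 24, 26)

# Keep only the selection values whose case ID is among the kept cases.
selection2 <- selection[case %in% kept_cases]

# Both vectors now have one element per retained case.
length(selection2) == length(classes)
```

With the two vectors aligned this way, something like table(classes, selection2) compares them case by case.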

Related

Selecting Specific Data in a Data Frame to replace using the Row and Column names

I am attempting to replace specific NA values with 0 in my data table. I do not want all NAs replaced, only those under certain conditions. For example, "replace NA with zeros when the row is Cole_1 and the column includes the designation 'Fall1'". I have a huge data set, so I need as little manual designating as possible; numbering each column is not an option. Basically, I want to be able to target the cells like playing battleship.
I have tried:
whentest <- count_order_site %>%
when(select(contains("Fall1")) &
count_order_site[count_order_site$Point_Name == "Cole_1", ],
count_order_site[is.na(count_order_site)] <- 0 )
but get an error "contains() must be used within a selecting function."
I'm not even sure if this is the right path to get what I want.
The basic layout idea:
Point Name  ACWO_Fall1  ACWO_FAll2  HOSP_FAll1
Cole_1      NA          3           NA
Cole_2      3           NA          5
After the functions the data would look like:
Point Name  ACWO_Fall1  ACWO_FAll2  HOSP_FAll1
Cole_1      0           3           0
Cole_2      3           NA          5
If I understand correctly, you can use mutate across to include columns that contain certain character values, such as "Fall1". Then, with the replace function, replace those values that are missing using is.na and where the point_name has a specific value, such as "Cole_1".
The example below has a couple extra columns to demonstrate if the logic is correct.
library(tidyverse)
df %>%
mutate(across(contains("Fall1"), ~replace(., is.na(.) & point_name == "Cole_1", 0)))
Output
point_name ACWO_Fall1 ACWO_Fall2 HOSP_Fall1 Other1 Other_Fall1
1 Cole_1 0 3 0 NA 6
2 Cole_2 3 NA 5 NA NA
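For anyone who wants to try this without the tidyverse, here is a rough base R equivalent. The df below is my reconstruction of the example data from the output shown, so treat its exact contents as an assumption. ignore.case = TRUE is used because contains() also ignores case by default.

```r
# Assumed example data, reconstructed from the output above.
df <- data.frame(
  point_name  = c("Cole_1", "Cole_2"),
  ACWO_Fall1  = c(NA, 3),
  ACWO_Fall2  = c(3, NA),
  HOSP_Fall1  = c(NA, 5),
  Other1      = c(NA, NA),
  Other_Fall1 = c(6, NA)
)

# Columns whose name contains "Fall1", and rows that belong to Cole_1.
cols <- grep("Fall1", names(df), ignore.case = TRUE)
rows <- df$point_name == "Cole_1"

# Replace NA with 0 only where both conditions hold.
for (j in cols) {
  df[rows & is.na(df[[j]]), j] <- 0
}
df
```

Cole_2 keeps its NA values and ACWO_Fall2 is left alone, which is the battleship-style targeting the question asks for.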

Combine table and matrix with R

I am performing an analysis in R. I want to fill the first row of a matrix with the content of a table. The problem I have is that the content of the table varies with the data, so sometimes certain identifiers that appear in the matrix do not appear in the table.
> random.evaluate
DNA LINE LTR/ERV1 LTR/ERVK LTR/ERVL LTR/ERVL-MaLR other SINE
[1,] NA NA NA NA NA NA NA NA
> y
DNA LINE LTR/ERVK LTR/ERVL LTR/ERVL-MaLR SINE
1 1 1 1 1 4
Due to this, when I try to join the data of the matrix with the data of the table, I get the following error
random.evaluate[1,] <- y
Error in random.evaluate[1, ] <- y :
number of items to replace is not a multiple of replacement length
Could someone help me fix this bug? I have found solutions to this error but in my case they do not work for me.
First, check whether the column names of the table exist in the matrix.
Check this link
For each name that exists, just set the value as usual.
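A minimal sketch of that idea, assuming y is a one-dimensional table whose names are a subset of the matrix column names. The objects are rebuilt here from the printed output in the question, so the construction is illustrative only.

```r
cols <- c("DNA", "LINE", "LTR/ERV1", "LTR/ERVK", "LTR/ERVL",
          "LTR/ERVL-MaLR", "other", "SINE")
random.evaluate <- matrix(NA, nrow = 1, ncol = length(cols),
                          dimnames = list(NULL, cols))

# Rebuild a table like y: one of each class, four SINE, no LTR/ERV1 or other.
y <- table(c("DNA", "LINE", "LTR/ERVK", "LTR/ERVL", "LTR/ERVL-MaLR",
             rep("SINE", 4)))

# Assign by name, using only the names present in both objects.
common <- intersect(names(y), colnames(random.evaluate))
random.evaluate[1, common] <- as.vector(y[common])
random.evaluate
```

Columns that are missing from the table (here LTR/ERV1 and other) simply stay NA. If you would rather have zeros, initialise the matrix with 0 instead of NA.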

R: Dropping variables using number of observations

I have a large dataset, and I'm trying to drop some of my variables based on how many observations each has. For instance, I would like to drop any variable in my dataframe where n < 3 (total observations for that variable is less than 3). Since R can count observations for each variable using describe, can't I use that number to subset the data instead of having to type in each variable name each time I pull in a new version? (Each version has different variables that will have low n's, and there are over 40 variables.) Thanks so much for your help!
For instance, my data looks like this:
ID Runaway Aggressive Emergency Hospitalization Injury
1 3 NA 4 1 NA
2 NA NA 2 1 NA
3 4 NA 6 2 3
4 1 NA 1 1 NA
I want to be able to drop "Aggressive" and "Injury" based on their n's being 0 and 1 respectively. However, instead of telling R to drop them by variable name, it would be much more convenient if it was possible to tell R to drop any variable where n < 3 (or whatever number I choose) as I'll be using this code for multiple versions of this dataset. I have tried using column numbers (which is better than writing them out) but it's still pretty tedious when I have to describe() the data, figure out which variables have low n's, and then drop 28 variables or subset() around them.
This works but it's cumbersome...
UIRCorrelation <- UIRKidUnique61[c(28, 30, 32, 34:38, 42, 54:74)]
For some reason, my example looks different when I'm editing versus when I save so I also included an image of it. Sorry. This is the first time I've ever used stack overflow to ask a question. I actually spent a lot of time googling this but couldn't find an answer relating to n.
This line did not work: DF[, sapply(DF, function(col) length(na.omit(col))) > 4]
With DF being your data frame:
DF[, sapply(DF, function(col) length(na.omit(col))) > 4]
This function did the trick:
valid <- function(x) {sum(!is.na(x))}
N <- apply(UIRCorrelation,2,valid)
UIRCorrelation2 <- UIRCorrelation[N > 3]
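The same idea can be written without a helper function. The sketch below rebuilds the small example from the question and keeps every column with at least min_n non-missing values. min_n is just an illustrative name for the cutoff.

```r
df <- data.frame(
  ID              = 1:4,
  Runaway         = c(3, NA, 4, 1),
  Aggressive      = c(NA, NA, NA, NA),
  Emergency       = c(4, 2, 6, 1),
  Hospitalization = c(1, 1, 2, 1),
  Injury          = c(NA, NA, 3, NA)
)

min_n <- 3                          # smallest n a variable needs to be kept
n_obs <- colSums(!is.na(df))        # non-missing observations per column
df_kept <- df[, n_obs >= min_n, drop = FALSE]

names(df_kept)
```

Here Aggressive (n = 0) and Injury (n = 1) are dropped, and nothing has to be listed by name or column number when a new version of the data comes in.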

I am trying make a function that checks how many na are in each column category, and then delete the column if more than 20% of the entries are blank

I am programming in R for a commercial real estate project from this place I started to work at. I have data frames that have 195 categories for each of the properties sold in that area for the last year. The categories are along the top and the properties along the row.
I tried to make a function called cuttingvariables1 to cut out the number of variables first by taking a subset of the categories based on if they have seller, buyer, buyers, listing in the column name.
I was able to get it to work when I ran it as individual commands, but why isn't it working when I make it a function in a source file and run it from there?
Cuttingvariables2 is my second function and I do not understand why it stops working at line 7 for that loop. The loop is meant to check every na_count for each category and then see if it is greater than 20% of the number of properties listed in that loaded csv. If it is, then the column gets deleted.
Any help would be appreciated.
cuttingvariables1 <- function(dataset)
(
dataset <- (subset(dataset,select=c(!grepl("Seller|Buyer|Buyers|Listing",names(dataset))))
)
)
Cuttingvariables2 function below!
cuttingvariables2 <- function(dataset)
{
z = ncol(dataset)
na_count <- c(lapply(dataset, function(y) sum(length(which(is.na(y))))))
setDT(na_count, keep.rownames = TRUE)[]
j = ncol(na_count)
for (i in 1:j) if((as.integer(na_count[,i])) > (nrow(dataset)/5)) na_count <- na_count[,-i]
for (i in 1:ncol(dataset)) if(colnames(dataset)[i] %in% (colnames(na_count))) dataset <- dataset[,-i]
return (dataset[1:5,1:5])
return (colnames(dataset))
}
#sample data
BROWNSVILLEMF2016TO2017[1:12,1:5]
Actual.Cap.Rate Age Asking.Price Assessed.Improved Assessed.Land
1 NA 31 NA 12039000 1776000
2 NA NA NA 1434000 1452000
3 NA 87 NA 306900 270000
4 NA 11 NA 432900 337950
5 NA 89 NA 281700 107100
6 4.5 87 3300000 NA NA
7 NA 96 NA 427500 66150
8 NA 87 NA 1228000 300000
9 NA 95 NA NA NA
10 NA 95 NA NA NA
11 NA 87 NA 210755 14418
12 NA 87 NA NA NA
I would not use subset directly with grep because you have so many fields. There may be very different versions of the words and you want them whether they are capitalized or not.
(be sure to check my R grammar I have been working in python all day)
# Empty list - you will build a list of names
extractList <- list()
# Names you are looking for in the column names, saved in lowercase
nameList <- c("seller", "buyer", "buyers", "listing")
# Outer loop: grab each column by index and pull its name, one at a time
for (i in 1:ncol(dataset)) {
  cName <- names(dataset[i])
  lcName <- tolower(cName)
  # Inner loop: compare each keyword in nameList to the lowercased column name
  for (j in nameList) {
    # If it matches, append the column name to the list NOT LOWER CASE, ***ORIGINAL***
    if (grepl(j, lcName)) {
      extractList <- append(cName, extractList)
    }
  }
}
# Now remove duplicate names from the extract list
extractList <- unique(extractList)
At this point you should have a concatenated list of column names each of which has one (or more) of those four words in ANY FORM capital or lowercase or camel case...which was the point of lowering the case of the column name before comparing them. Now you just need to subset the data frame the easy way!
newSet <- dataset[, which(names(dataset) %in% extractList)]
This creates a logical vector with %in% statement so only names in the data frame which appear on the new list of unique column names with ANY version of your keywords will show as TRUE and be included in the columns of the new set.
Now you should have a complete set of data with only the types of column names you are looking to use. DO NOT JUST USE THIS...look at it and try to understand why some of the more esoteric tricks are at play so that you can work through similar problems in the future.
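Once the loop version makes sense, the same case-insensitive matching can be sketched in one step with grepl and ignore.case = TRUE. The toy data frame and its column names below are invented for illustration. Negating the logical vector gives the complementary set of columns, which is what the question's !grepl(...) was after.

```r
# Invented example columns, mixing capitalisations of the keywords.
toy <- data.frame(
  Seller.Name   = c("A", "B"),
  BUYER_ID      = c(1, 2),
  Age           = c(31, 87),
  listingBroker = c("X", "Y")
)

# TRUE for every column whose name contains a keyword, in any case.
# "buyer" already matches "buyers", so it is listed once.
hit <- grepl("seller|buyer|listing", names(toy), ignore.case = TRUE)

with_keywords    <- toy[, hit, drop = FALSE]   # only the matching columns
without_keywords <- toy[, !hit, drop = FALSE]  # everything else

names(with_keywords)
names(without_keywords)
```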
Almost forgot:
install.packages("questionr")
Then:
library(questionr)
freq.na(newSet)
will give you a formatted table with the number missing and the percent of NAs for each column. You can assign this to a variable and use it in your vetting process!
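For the 20% rule itself, here is a base R sketch that needs no extra package. drop_sparse_columns and max_na_share are names I made up, and the toy data only loosely follows the sample in the question.

```r
# Drop every column whose share of NA values exceeds max_na_share.
drop_sparse_columns <- function(dataset, max_na_share = 0.2) {
  na_share <- colMeans(is.na(dataset))   # fraction of NA per column
  dataset[, na_share <= max_na_share, drop = FALSE]
}

toy <- data.frame(
  Age           = c(31, NA, 87, 11, 89, 87, 96, 87, 95, 95),
  Asking.Price  = c(NA, NA, NA, NA, NA, 3300000, NA, NA, NA, NA),
  Assessed.Land = c(1776000, 1452000, 270000, 337950, 107100,
                    NA, 66150, 300000, NA, NA)
)

trimmed <- drop_sparse_columns(toy)
names(trimmed)
```

Because the function returns a new data frame, the result has to be assigned, as in trimmed <- drop_sparse_columns(toy). Calling it without assigning leaves the original untouched, which may also be why the sourced version of cuttingvariables1 appeared not to work.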

R - remove rows from a data frame with empty lines (not only numbers)

This issue seems to have been treated already, but after checking I couldn't find a solution. I load a table from a file and it can happen (I don't know how) that some entire lines are empty. So when I load the data frame I get
# id c1 c2
# 1 a 1 2
# 2 b 2 4
# 3 NA NA
# 4 d 6 1
# 5 e 7 5
# 6 NA NA
if I do
apply(df, 1, function(x) all(is.na(x)))
I get all FALSE, as the first column is not a number (the table is much bigger, with mixed character and numeric columns), and I can't filter these lines. I cannot sort it out with na.omit or complete.cases either.
Is there any function or expression to check empty rows?
You may be able to cut this problem off at the source with the parameters you pass to read.csv:
For instance, if the blank fields are empty strings or a single space you could use
df <- read.csv(<your other logic here>, na.strings=c("NA","", " "))
This question seems to raise similar issues: read.csv blank fields to NA
If this works, then you can use the apply logic to work with the offending rows.
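Put together, a small sketch of both steps might look like this. The inline CSV text stands in for your file and is an assumption based on the printed data frame in the question.

```r
# Stand-in for the real file: two lines consist of empty fields only.
raw <- "id,c1,c2
a,1,2
b,2,4
,,
d,6,1
e,7,5
,,
"

# Treat empty strings and single spaces as NA while reading.
df <- read.csv(text = raw, na.strings = c("NA", "", " "))

# A row is empty when every one of its cells is NA.
empty <- rowSums(is.na(df)) == ncol(df)
df_clean <- df[!empty, ]
df_clean
```

rowSums(is.na(df)) avoids apply(), which first converts a mixed data frame to a character matrix. Rows with only some values missing are kept.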
