Removing NA from a variable in R - r

I want to know if I can remove NAs from a variable without creating a new subset?
The only solutions I find are making me create a new dataset. But I want to delete those rows that have NA in that variable right from the original dataset.
From:
   Title Length
1- A     NA
2- B     2
3- C     7
To:
   Title Length
2- B     2
3- C     7
Is it even possible?
The best solution I found was this one (but as I said, it creates a new dataset):
completerecords <- na.omit(data$emp_length)
Thank you,
Dani

You can reuse the same dataset name to overwrite the original one. In your example, it would be:
data <- data[!is.na(data$emp_length),]
Note that this way you would remove only the rows that have NA in the column you're interested in, as requested. If some other rows have NA values in different columns, these rows will not be affected.
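A minimal sketch of the overwrite approach, using a made-up three-row data frame that mirrors the example in the question (the column names Title and emp_length are assumptions for illustration):

```r
# Toy data mirroring the question; the column names are illustrative assumptions
data <- data.frame(Title = c("A", "B", "C"),
                   emp_length = c(NA, 2, 7))

# Keep only the rows where emp_length is not NA, overwriting the original object
data <- data[!is.na(data$emp_length), ]

print(data)
```

The surviving rows keep their original row names (2 and 3 here); rownames(data) <- NULL renumbers them if that matters.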

Related

R - Update value of a column based on condition

I need to update all the values of a column, using as reference another df.
The two dataframes have equal structures:
cod name dom_by
1 A 3
2 B 4
3 C 1
4 D 2
I tried to use the following line, but apparently it did not work:
df2$name[df2$dom_by==df1$cod] <- df1$name[df2$dom_by==df1$cod]
It keeps saying that replacement has 92 rows, data has 2.
(df1 has 92 rows and df2 has 2).
Although it seems like a simple problem, I still cannot solve it, even after some searching.
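One possible reading of the question is that each dom_by value in df2 should be looked up in df1$cod, and the matching df1$name copied over. Under that assumption (the toy data below is made up, and idx and found are names introduced here), match() avoids the length mismatch that == produces when the two data frames have different numbers of rows:

```r
# Made-up data with the structure described in the question
df1 <- data.frame(cod = 1:4,
                  name = c("A", "B", "C", "D"),
                  dom_by = c(3, 4, 1, 2),
                  stringsAsFactors = FALSE)
df2 <- data.frame(cod = c(10, 11),
                  name = c("x", "y"),
                  dom_by = c(3, 1),
                  stringsAsFactors = FALSE)

# For each row of df2, find the position of its dom_by value in df1$cod
idx <- match(df2$dom_by, df1$cod)

# Only overwrite names where a matching cod was found
found <- !is.na(idx)
df2$name[found] <- df1$name[idx[found]]

print(df2)
```

Rows of df2 with no matching cod in df1 keep their original name.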

Subtraction of rows from data table in R

I'm new to this site (and new to R) so I hope this is the right way to approach my problem.
I searched at this site but couldn't find the answer I'm looking for.
My problem is the following:
I have imported a table from a database into R (it says it's a data frame) and I want to subtract the values from a particular column (row by row). Thereafter, I'd like to assign these differences to a new column called 'Difference' in the same data frame.
Could anyone please tell me how to do this?
Many thanks,
Arjan
To add a new column, assign to it with df$newcol <- value, where df is the name of your data frame and newcol is the name you want; in this case it would be "Difference". If you want to subtract one existing column from another, just use arithmetic operators.
df$Difference <- (df$col1 - df$col2)
I'm going to assume you want to subtract the values in one column from those in another; is this correct? This can be done pretty easily.
First, I'm just going to make up some data.
df <- data.frame(v1 = rnorm(10,100,4), v2 = rnorm(10,25,4))
You can subtract the values in one column from another by doing just that.
Use $ to specify columns. Assigning to a new name after the $ creates a new column:
df$Differences <- df$v1 - df$v2
df
v1 v2 Differences
1 98.63754 29.54652 69.09102
2 99.49724 24.27766 75.21958
3 102.73056 25.01621 77.71435
4 100.87495 26.92563 73.94933
5 103.01357 17.46149 85.55208
6 97.24901 20.82983 76.41917
7 100.73915 27.95460 72.78454
8 98.14175 24.19351 73.94824
9 102.63738 21.74604 80.89133
10 105.78443 16.79960 88.98483
Hope this helps
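If "row by row" instead means the difference between consecutive rows of a single column, which is another possible reading of the question, base R's diff() may be closer to what is wanted. A sketch with made-up data; the column name value is an assumption:

```r
# Made-up data; 'value' is a hypothetical column name
df <- data.frame(value = c(10, 13, 9, 15))

# diff() returns n - 1 differences, so pad with NA to match nrow(df)
df$Difference <- c(NA, diff(df$value))

print(df)
```

The first row has no previous row to subtract, hence the leading NA.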

Filtering a data frame based on multiple columns sharing a name

I have been working with R for a few months now and still consider myself a beginner. Thanks to this community, I've learned so much about R already. I can't thank you enough for that.
Now, I have a question that somehow always comes back to me at some point and is so basic in nature that I have the feeling, that I should already have solved it myself at some point.
It is related to this question: filtering data frame based on NA on multiple columns
I have a data.frame that contains a variable number of columns with a specific string (e.g. "type") in the name.
Here, is a simplified example:
data <- data.frame(name=c("aaa","bbb","ccc","ddd"),
'type_01'=c("match", NA, NA, "match"),
'type_02'=c("part",NA,"match","match"),
'type_03'=c(NA,NA,NA,"part"))
> data
name type_01 type_02 type_03
1 aaa match part <NA>
2 bbb <NA> <NA> <NA>
3 ccc <NA> match <NA>
4 ddd match match part
OK, I know that I can filter on these columns with...
which(is.na(data$'type_01') & is.na(data$'type_02') & is.na(data$'type_03'))
[1] 2
but since the number of type columns is variable (up to 20 sometimes) in my data, I would rather get them with something like ...
grep("type", names(data))
[1] 2 3 4
... and apply the condition to all of the columns, without specifying them individually.
In the example here, I am looking for the NAs, but that might not always be the case.
Is there a simple way to apply a condition to multiple columns sharing a common name without specifying them one by one?
You don't need to loop or apply anything. Continuing from your grep method,
i1 <- grep("type", names(a))
which(rowSums(is.na(a[i1])) == length(i1))
#[1] 2
Note: I renamed your data frame to a, since data is already the name of a function in R.
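The same rowSums() pattern extends to conditions other than NA. As a sketch, assuming the example data from the question (renamed to a), the following finds the rows where at least one type column equals "match"; the names is_match, rows_any and rows_all_na are introduced here for illustration:

```r
# Example data from the question, renamed to 'a'
a <- data.frame(name = c("aaa", "bbb", "ccc", "ddd"),
                type_01 = c("match", NA, NA, "match"),
                type_02 = c("part", NA, "match", "match"),
                type_03 = c(NA, NA, NA, "part"),
                stringsAsFactors = FALSE)

# Positions of all columns whose name contains "type"
i1 <- grep("type", names(a))

# Logical matrix: TRUE where a type column equals "match" (comparisons with NA give NA)
is_match <- a[i1] == "match"

# Rows with at least one "match"; na.rm = TRUE skips the NA comparisons
rows_any <- which(rowSums(is_match, na.rm = TRUE) > 0)

# Rows where every type column is NA, as in the answer above
rows_all_na <- which(rowSums(is.na(a[i1])) == length(i1))
```

With this toy data, rows_any picks rows 1, 3 and 4, and rows_all_na picks row 2. Replacing > 0 with == length(i1) would require the condition to hold in every type column.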

I am trying to make a function that checks how many NAs are in each column, and then deletes the column if more than 20% of the entries are blank

I am programming in R for a commercial real estate project from this place I started to work at. I have data frames that have 195 categories for each of the properties sold in that area for the last year. The categories are along the top and the properties along the row.
I tried to make a function called cuttingvariables1 to cut out the number of variables first by taking a subset of the categories based on if they have seller, buyer, buyers, listing in the column name.
I was able to get it to work when I ran it as commands, but why isn't it working when I define it as a function in the source file and run it from there?
Cuttingvariables2 is my second function and I do not understand why it stops working at line 7 for that loop. The loop is meant to check every na_count for each category and then see if it is greater than 20% of the number of properties listed in that loaded csv. If it is, then the column gets deleted.
Any help would be appreciated.
cuttingvariables1 <- function(dataset)
(
dataset <- (subset(dataset,select=c(!grepl("Seller|Buyer|Buyers|Listing",names(dataset))))
)
)
Cuttingvariables2 function below!
cuttingvariables2 <- function(dataset)
{
z = ncol(dataset)
na_count <- c(lapply(dataset, function(y) sum(length(which(is.na(y))))))
setDT(na_count, keep.rownames = TRUE)[]
j = ncol(na_count)
for (i in 1:j) if((as.integer(na_count[,i])) > (nrow(dataset)/5)) na_count <- na_count[,-i]
for (i in 1:ncol(dataset)) if(colnames(dataset)[i] %in% (colnames(na_count))) dataset <- dataset[,-i]
return (dataset[1:5,1:5])
return (colnames(dataset))
}
#sample data
BROWNSVILLEMF2016TO2017[1:12,1:5]
Actual.Cap.Rate Age Asking.Price Assessed.Improved Assessed.Land
1 NA 31 NA 12039000 1776000
2 NA NA NA 1434000 1452000
3 NA 87 NA 306900 270000
4 NA 11 NA 432900 337950
5 NA 89 NA 281700 107100
6 4.5 87 3300000 NA NA
7 NA 96 NA 427500 66150
8 NA 87 NA 1228000 300000
9 NA 95 NA NA NA
10 NA 95 NA NA NA
11 NA 87 NA 210755 14418
12 NA 87 NA NA NA
I would not use subset directly with grep because you have so many fields. There may be very different versions of the words and you want them whether they are capitalized or not.
(be sure to check my R grammar I have been working in python all day)
#Empty List - you will build a list of names
extractList<-list()
#names you are looking for in column names saved as a list (lowercase)
nameList<- c("seller","buyer","buyers","listing")
#Create the outer loop to grab index of columns and pull the column name off each one at a time
for (i in 1:ncol(dataset)){
cName<-names(dataset[i])
lcName<-tolower(cName)
#Created a loop within that loop to compare each keyword on your nameList to the columns to see if the word is in the title (with title case lowered)
for (j in nameList){
#if it is append the column name to the list NOT LOWER CASE, ***ORIGINAL***
if(grepl(j, lcName)==TRUE ){extractList=append(cName,extractList)}
} }
#Now remove duplicate names from the extract list
extractList<-unique(extractList)
At this point you should have a concatenated list of column names each of which has one (or more) of those four words in ANY FORM capital or lowercase or camel case...which was the point of lowering the case of the column name before comparing them. Now you just need to subset the data frame the easy way!
newSet<- dataset[,which((names(dataset) %in% extractList)==TRUE)]
This creates a logical vector with %in% statement so only names in the data frame which appear on the new list of unique column names with ANY version of your keywords will show as TRUE and be included in the columns of the new set.
Now you should have a complete set of data with only the types of column names you are looking to use. DO NOT JUST USE THIS...look at it and try to understand why some of the more esoteric tricks are at play so that you can work through similar problems in the future.
Almost forgot:
install.packages("questionr")
Then load the package and call:
library(questionr)
freq.na(newSet)
This will give you a formatted table with the number of missing values and the percentage of NAs for each column. You can assign it to a variable to use in your vetting process!
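For comparison, a vectorized sketch of what the question describes: drop the columns whose names contain the keywords (ignoring case), then drop the columns with more than 20% NA. The function name drop_sparse_columns, its arguments, and the toy data are assumptions for illustration, not the asker's code:

```r
# Hypothetical helper: returns a new data frame, it does not modify its argument
drop_sparse_columns <- function(dataset,
                                pattern = "seller|buyer|listing",
                                max_na = 0.2) {
  # Drop columns whose names match any keyword, regardless of capitalization
  keep_name <- !grepl(pattern, names(dataset), ignore.case = TRUE)
  dataset <- dataset[, keep_name, drop = FALSE]

  # colMeans(is.na(x)) is the fraction of NA values in each column
  na_frac <- colMeans(is.na(dataset))
  dataset[, na_frac <= max_na, drop = FALSE]
}

# Made-up data loosely shaped like the sample in the question
toy <- data.frame(Age = c(31, 45, 87, 11, 89),
                  Asking.Price = c(NA, NA, NA, 3300000, NA),
                  Buyer.Name = c("x", "y", "z", "w", "v"),
                  Assessed.Land = c(1, 2, 3, 4, 5))

result <- drop_sparse_columns(toy)
print(names(result))
```

Because R functions work on a copy of their argument, the result has to be assigned back (for example dataset <- drop_sparse_columns(dataset)) for the change to persist outside the function.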

Exclude rows that contain NA in a particular column in subsets

I am trying to exclude rows of a subset which contain an NA for a particular column that I choose. I have a CSV spreadsheet of survey data with this kind of organization, for instance:
name idnum term type q2 q3
bob 0321 1 2 0 .
. . 3 1 5 3
ron . 2 4 2 1
. 2561 4 3 4 2
When I was creating my R workspace, I set it up such that data <- read.csv(..., na.strings='.'). For purposes of my analysis, I then created subsets by term and type, like set13 <- subset(data, term==1 & type==2), for example. When I tried to conduct t-tests, I noticed that the function threw out any instance of NA, effectively cutting my sample size in half.
For my analysis, I want to exclude responses that are missing survey items, such as Bob from my example, missing question 3. But I still want to include rows that have one or more NAs in the name or idnum columns. So, in essence, I want to pick by columns which NAs are omitted. (Keep in mind, this is just an example - my actual CSV has about 1000 rows, so each subset may contain 100-150 rows.)
I know this can be done using data frames, but I'm not sure how to incorporate that into my given subset format. Is there a way to do this?
Check out complete.cases as shown in the answer to this SO post.
data[complete.cases(data[,3:6]),]
This will return all rows with complete information in columns 3 through 6.
Another approach.
data[rowSums(is.na(data[,3:6]))==0,]
Another option is
data[!Reduce(`|`, lapply(data[3:6], is.na)),]
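To combine this with the subsets from the question, one option is to filter on the survey columns first and then subset by term and type. A sketch with made-up data shaped like the example; survey_cols, clean and set24 are names introduced here:

```r
# Made-up data shaped like the question's example ('.' already read in as NA)
data <- data.frame(name = c("bob", NA, "ron", NA),
                   idnum = c("0321", NA, NA, "2561"),
                   term = c(1, 3, 2, 4),
                   type = c(2, 1, 4, 3),
                   q2 = c(0, 5, 2, 4),
                   q3 = c(NA, 3, 1, 2),
                   stringsAsFactors = FALSE)

# Only these columns must be complete; NAs in name or idnum are tolerated
survey_cols <- c("q2", "q3")
clean <- data[complete.cases(data[, survey_cols]), ]

# Subsetting afterwards uses == for comparison, not =
set24 <- subset(clean, term == 2 & type == 4)
```

Selecting the columns by name rather than by position (3:6) keeps the filter valid if the column order in the CSV changes.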
