Exclude rows that contain NA in a particular column in subsets - r

I am trying to exclude rows of a subset that contain an NA in a particular column that I choose. I have a CSV spreadsheet of survey data with this kind of organization, for instance:
name idnum term type q2 q3
bob  0321  1    2    0  .
.    .     3    1    5  3
ron  .     2    4    2  1
.    2561  4    3    4  2
When I was creating my R workspace, I set it up such that data <- read.csv(..., na.strings='.'). For purposes of my analysis, I then created subsets by term and type, like set13 <- subset(data, term==1 & type==2), for example. When I tried to conduct t-tests, I noticed that the function threw out any row containing an NA, effectively cutting my sample size in half.
For my analysis, I want to exclude responses that are missing survey items, such as Bob in my example, who is missing question 3. But I still want to include rows that have one or more NAs in the name or idnum columns. So, in essence, I want to choose by column which NAs are omitted. (Keep in mind, this is just an example - my actual CSV has about 1000 rows, so each subset may contain 100-150 rows.)
I know this can be done using data frames, but I'm not sure how to incorporate that into my given subset format. Is there a way to do this?

Check out complete.cases as shown in the answer to this SO post.
data[complete.cases(data[,3:6]),]
This will return all rows with complete information in columns 3 through 6.
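Applied to your workflow, that could look like this (a sketch, assuming columns 3:6 hold term, type, q2, and q3 as in your example):
set13 <- subset(data, term == 1 & type == 2)
set13 <- set13[complete.cases(set13[, 3:6]), ]  # drop rows missing any survey item
t.test(set13$q2, set13$q3)  # the t-test now sees only complete responses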

Another approach.
data[rowSums(is.na(data[,3:6]))==0,]

Another option is
data[!Reduce(`|`, lapply(data[3:6], is.na)),]
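All three approaches pick out the same rows; a quick check on a toy frame built from the example above (a sketch):
data <- read.csv(text = "name,idnum,term,type,q2,q3
bob,0321,1,2,0,.
.,.,3,1,5,3
ron,.,2,4,2,1
.,2561,4,3,4,2", na.strings = ".")
identical(data[complete.cases(data[, 3:6]), ],
          data[rowSums(is.na(data[, 3:6])) == 0, ])
#[1] TRUE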

Related

Removing NA from a variable in R

I want to know if I can remove NAs from a variable without creating a new subset.
The only solutions I can find make me create a new dataset, but I want to delete the rows that have NA in that variable right from the original dataset.
From:
Title Length
1- A NA
2- B 2
3- C 7
To:
Title Length
2- B 2
3- C 7
Is it even possible?
The best solution I found was this one (but as I said, it creates a new dataset):
completerecords <- na.omit(data$emp_length)
Thank you,
Dani
You can reuse the same dataset name to overwrite the original one:
In your example, it would be:
data <- data[!is.na(data$emp_length),]
Note that this way you would remove only the rows that have NA in the column you're interested in, as requested. If some other rows have NA values in different columns, these rows will not be affected.
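For instance, reproducing the From/To example above (a quick sketch; here the column is Length rather than your emp_length):
data <- data.frame(Title = c("A", "B", "C"), Length = c(NA, 2, 7))
data <- data[!is.na(data$Length), ]
data
#   Title Length
# 2     B      2
# 3     C      7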

Filtering a data frame based on multiple columns sharing a name

I have been working with R for a few months now and still consider myself a beginner. Thanks to this community, I've learned so much about R already - I can't thank you enough for that.
Now, I have a question that keeps coming back to me and is so basic in nature that I feel I should have solved it myself long ago.
It is related to this question: filtering data frame based on NA on multiple columns
I have a data.frame that contains a variable number of columns sharing a specific string (e.g. "type") in their names.
Here, is a simplified example:
data <- data.frame(name=c("aaa","bbb","ccc","ddd"),
                   type_01=c("match", NA, NA, "match"),
                   type_02=c("part", NA, "match", "match"),
                   type_03=c(NA, NA, NA, "part"))
> data
name type_01 type_02 type_03
1 aaa match part <NA>
2 bbb <NA> <NA> <NA>
3 ccc <NA> match <NA>
4 ddd match match part
OK, I know that I can find the all-NA rows with...
which(is.na(data$'type_01') & is.na(data$'type_02') & is.na(data$'type_03'))
[1] 2
but since the number of type columns varies (up to 20 sometimes) in my data, I would rather get them with something like ...
grep("type", names(data))
[1] 2 3 4
... and apply the condition to all of the columns, without specifying them individually.
In the example here, I am looking for the NAs, but that might not always be the case.
Is there a simple way to apply a condition to multiple columns sharing a common name, without specifying them one by one?
You don't need to loop or apply anything. Continuing from your grep method,
i1 <- grep("type", names(a))
which(rowSums(is.na(a[i1])) == length(i1))
#[1] 2
NOTE: I renamed your data frame to a, since data is already defined as a function in R.
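Since the condition won't always be is.na, the same pattern generalizes; a sketch, reusing i1 from above (here: rows where at least one type column equals "match"):
which(rowSums(a[i1] == "match", na.rm = TRUE) > 0)
#[1] 1 3 4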

Selecting rows based on grepl results in multiple columns in R

I have data (df) like this with 50 diagnosis codes (dx.1 through dx.50) per patient:
ID dx.1   dx.2   ... dx.50
1  150200 140650     250400
2  752802 851812     NA
3  441402 450220     NA
4  853406 853200     150404
5  250604 NA         NA
I would like to select the rows that have any of the diagnosis codes starting with "250". So in the example, it would be ID 1 and 5.
After stumbling around for a while, I finally came up with this:
df$select = rowSums(sapply(df[,2:ncol(df)], function(x) grepl("\\<250", x)))
selected = df[df$select>0,]
It's kind of clunky and takes a while since I'm running it on several thousand rows.
Is there a better/faster way to do this?
Is there an easy way to extend this to multiple search criteria?
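One possible refinement (a sketch, not a benchmarked answer): grepl is already vectorized over each column, so the helper column can be skipped, and anchoring the pattern with ^ states "starts with" directly. Multiple criteria fit into one alternation:
dx <- df[, grep("^dx", names(df))]  # just the diagnosis columns
hit <- Reduce(`|`, lapply(dx, function(x) !is.na(x) & grepl("^250", x)))  # !is.na guards the NAs
selected <- df[hit, ]
# several prefixes at once: grepl("^(250|441)", x)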

Understanding the syntax for Column vs Row indexing in R

I'm a bit confused on the filtering scheme on an R data frame.
For example, let's say we have the following data frame titled dframe:
> str(dframe)
'data.frame': 143 obs. of 3 variables:
$ Year : int 1999 2005 2007 2008 2009 2010 2005 2006 2007 2008 ...
$ Name : Factor w/ 18 levels "AADAM","AADEN",..: 1 1 2 2 2 2 3 3 3 3 ...
$ Frequency: int 5 6 10 34 38 12 10 6 10 5 ...
Now if I want to filter dframe to the rows where the value of Name is "AADAM", the proper filter is:
dframe[dframe$Name=="AADAM",]
The part where I'm confused is why the comma doesn't come first. Why isn't it this: dframe[,dframe$Name=="AADAM"]
UPDATE: You clarified that your question is really "Please give examples of what sort of logical expressions are valid for filtering columns?"
I agree with you the syntax appears weird initially, but it has the following logic.
The bottom line is that column-filter expressions are typically less rich and expressive than row-filtering expressions, and in particular you can't chain logical indexing the way you do with rows.
Best way is to think of indexing expressions as the general form:
dframe[<row-index-expression>,<col-index-expression>]
where either index-expression is optional, so you can give just one, and (crucially!) the comma is needed to disambiguate whether it's row- or column-indexing:
dframe[<row-index-expression>,] # such as dframe[dframe$Name=="ADAM",]
dframe[,<col-index-expression>]
Before we look at examples of col-index-expressions and what's valid (and invalid) to include in one, let's review how R does indexing - I had the same confusion when I started with it.
In this example, you have three columns. You can refer to them by their string names 'Year','Name','Frequency', or by the column indices 1,2,3, where the numbers correspond to the entries of colnames(dframe). R does indexing using the '[' operator, and also the '[[' operator. Here are some valid examples of column-indexing:
dframe[,2] # column 2 / Name
dframe[,'Name'] # column 2 / Name
dframe[,c('Name','Frequency')] # string vector - very common
dframe[,c(2,3)] # integer vector - also very common
dframe[,c(F,T,T)] # logical vector - very rarely seen, and a pain in the butt to compute
Now, if you choose to use a logical expression for the column-index, it must be a valid expression that doesn't use the column names - inside the brackets, the columns don't know their own names.
Suppose you wanted to dynamically filter "give me only the factor columns from dframe". Something like:
sapply(dframe, is.factor)  # a named logical vector, one entry per column
(Note that apply(dframe, 2, is.factor) would not work here: apply coerces the data frame to a matrix, which strips the factor class.)
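Used as a column-index on the example frame, that keeps just the Name column (a sketch; drop=FALSE keeps the result a data frame even when only one column survives):
dframe[, sapply(dframe, is.factor), drop=FALSE]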
For more help and examples on indexing look at the '[' operator help-page:
Type ?'['
dframe[,dframe$Name=="ADAM"] is an invalid attempt at column-indexing, because the columns know nothing about Name=="ADAM".
Addendum: code to generate an example dataframe (because you didn't give us a dput output):
set.seed(123)
N = 10
randomName <- function() { paste(sample(letters, size=runif(1)*6+2, replace=TRUE), collapse='') }  # paste, not cat: cat only prints and returns NULL
dframe = data.frame(Year=round(runif(N,1980,2014)),
                    Name=as.factor(replicate(N, randomName())),
                    Frequency=round(runif(N, 2,40)))
You have to remember that when you're sub-setting, the part before the comma specifies which rows you want, and the part after the comma specifies which columns you want, i.e.:
dframe[rowsyouwant, columnsyouwant]
You're filtering based on columns, but you want all of the columns in your result, so the space after the comma is blank. You want some sub-set of rows, so your filtering specification goes before the comma, where the rows you want are specified.
As others have indicated, requesting a certain subset of a data frame requires the syntax [rows, columns]. Since dframe[has 143 rows, has 3 columns], any request for some part of dframe should be of the form
dframe[which of the 143 rows do I want?, which of the 3 columns do I want?].
Because dframe$Name is a vector of length 143, the comparison dframe$Name=='AADAM' is a vector of T/F values that also has length 143. So,
dframe[dframe$Name=='AADAM',]
is like saying
dframe[of the 143 rows I want these ones, I want all columns]
whereas
dframe[,dframe$Name=='AADAM']
generates an error because it's like saying
dframe[I want all rows, of the 143 columns I want these ones]
On a side note, you may want to look into the subset() function if you're not already familiar with it. You could get the same result by writing subset(dframe, Name=='AADAM')
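One subtlety worth noting, since it bites people filtering on columns that contain NAs: logical indexing with [ returns an all-NA row for every NA in the condition, while subset() silently drops those rows (a sketch, imagining Frequency had missing values):
dframe[dframe$Frequency > 5, ]  # NA-condition rows come back as NA rows
subset(dframe, Frequency > 5)   # NA-condition rows are dropped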
As others have said, the structure within brackets is row, then column.
One way I think of the syntax of selecting data from a data.frame using:
dframe[dframe$Name=="AADAM",]
is to think of a noun, then a verb where:
dframe[] is the noun. It is the object on which you want to perform an action
and
[dframe$Name=="AADAM",] is the verb. It is the action you want to perform.
I have a silly way of expressing this to myself, but it keeps things straight in my mind:
Hey, you! dframe! I am going to... ...in this case, select all of your rows in which Name is equal to AADAM!
By keeping the column portion of [dframe$Name=="AADAM",] blank you are saying you want to keep all columns.
Sometimes it can be a little difficult to remember that you have to write dframe both inside and outside the brackets.
As for exactly why row comes first and column comes second, I do not know, but row had to be either first or second.
dframe <- read.table(text = '
Year Name Frequency
1 ADAM 4
3 BOB 10
7 SALLY 5
2 ADAM 12
4 JIM 3
12 ADAM 7
', header = TRUE)
dframe[,dframe$Name=="ADAM"]
# Error in `[.data.frame`(dframe, , dframe$Name == "ADAM") :
# undefined columns selected
dframe[dframe$Name=="ADAM",]
# Year Name Frequency
# 1 1 ADAM 4
# 4 2 ADAM 12
# 6 12 ADAM 7
dframe[,'Name']
# [1] ADAM BOB SALLY ADAM JIM ADAM
# Levels: ADAM BOB JIM SALLY
dframe[dframe$Name=="ADAM",'Name']
# [1] ADAM ADAM ADAM
# Levels: ADAM BOB JIM SALLY

Filtering large data sets to exclude an identical element across all columns

I am a relatively new R user, and most of the complex coding (and packages) looks like Greek to me. It has been a long time since I used a programming language (Java/Perl) and I have only used R for very simple manipulations in the past (basic loading data from file, subsetting, ANOVA/T-Test). However, I am working on a project where I had no control over the data layout and the data file is very lengthy.
In my data, I have 172 rows, one per survey participant, and 158 columns, each representing a question. The answers to each question are 1-5. The raw data uses the number "99" to indicate that a question was not answered. I need to exclude any question a participant did not answer, without excluding the entire participant.
Part Q001 Q002 Q003 Q004
1    2    4    99   2
2    3    99   1    3
3    4    4    2    5
4    99   1    3    2
5    1    3    4    2
In the past I have used the subset feature to filter my data
data.filter <- subset(data, Q001 != 99)
This works fine when all my answers are contained in one column - then it just deletes the whole row where the answer is unavailable.
However, with the answers in this set spread across 158 columns, if I subset out 99 in column 1 (Q001), I also filter out that entire participant.
I'd like to know if there is a way to filter/subset the data so that my large data set ends up with 'blanks' (NAs) where the "99"s occurred, so that the 99s do not inflate or otherwise interfere with the statistics I run on the rest of the numbers. I need to be able to calculate means per question and run ANOVAs and t-tests on various questions.
Part Q001 Q002 Q003 Q004
1    2    4         2
2    3         1    3
3    4    4    2    5
4         1    3    2
5    1    3    4    2
Is this possible to do in R? I've tried filtering the file before loading it into R, but R won't read the data file when it has blanks, and I'd like to be able to use the whole data set without creating a subset for each question (which I will do if I have to... it's just time consuming if there is a better approach or package to use).
Any assistance would be greatly appreciated!
You could replace the "99" with NA and then calculate the column means omitting NAs:
df <- replicate(20, sample(c(1,2,3,99), 4))  # toy data: 4 respondents x 20 questions
colMeans(df)  # nono - the 99s inflate the means
dfc <- df
dfc[dfc == 99] <- NA  # recode 99 as missing
colMeans(dfc, na.rm = TRUE)  # per-question means, NAs dropped
You can also indicate which values are NAs when you read your data in. For your particular case:
mydata <- read.table("dat_base", header = TRUE, na.strings = "99")
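From there, per-question statistics work directly on mydata; a sketch (assuming the Part column comes first, as in the example data):
sapply(mydata[-1], mean, na.rm = TRUE)  # mean of each question, blanks ignored
t.test(mydata$Q001, mydata$Q002)  # t.test drops each vector's NAs on its own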
