Trouble removing missing data from data frame - r

I'm new to R so please excuse my very basic question:
I have a data frame that has a lot of missing data. I've used na.omit to remove missing data as in:
data2 <- na.omit(data1)
However, some of the variables are factors that still seem to have "" as one of the categories, as in:
> str(data2$smoker)
Factor w/ 3 levels "","No","Yes": 2 2 2 2 2 2 2 3 3 2 ...
When I look at "data2" it still has missing values. What am I doing wrong?
Help and advice much appreciated.
Greg

NA is not the same as "".
What is the difference?
NA indicates a missing value
"" is an empty string, which is a type of value
na.omit will remove NA values, but it will not remove empty strings.
I suggest turning "" into NA before using na.omit:
data1[data1$smoker == "", "smoker"] <- NA
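A minimal sketch of the whole round trip, assuming a small made-up data1 (the column values below are hypothetical). It also calls droplevels afterwards, because a factor keeps "" as an unused level even once no row contains it:

```r
# Hypothetical stand-in for data1; only the smoker column mirrors the question
data1 <- data.frame(
  smoker = factor(c("No", "", "Yes", "No", NA)),
  age    = c(30, 41, 38, NA, 52)
)

# "" is a real value, not a missing one
is.na("")   # FALSE
is.na(NA)   # TRUE

# Recode empty strings as NA; which() skips entries that are already NA
data1$smoker[which(data1$smoker == "")] <- NA

# Drop incomplete rows, then drop the now-unused "" level
data2 <- droplevels(na.omit(data1))
levels(data2$smoker)
```

Without droplevels, str(data2$smoker) would still list "" among the levels even though no observation uses it.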

Related

R help converting non numeric column to numeric

I'm trying to help my friend, Director of Sales, make sense of his logged call data. There is one column in particular in which he is interested, "Disposition". This column has string values and I'm trying to convert them to numeric values (i.e. "Not Answered" converted to 1, "Answered" converted to 2, etc.) and remove any row with no values entered. I've created data frames, used as.numeric, created and deleted columns/rows, etc. to no avail. I'm just trying to run simple R code to give him some insight. Any and all help is much appreciated. Thanks in advance!
P.S. I'm unsure as to whether I should provide some code due to the fact that there is a lot of delicate information (personal phone numbers and emails).
First off: You should always provide representative sample data; if your data is sensitive in nature, provide mock-up data.
That aside, to recode a character vector as numeric you could convert to factor and then use as.numeric. For example:
# Sample data
column <- c("Not Answered", "Answered", "Something else", "Others")
# Convert character vector to factor
column <- factor(column, levels = as.character(unique(column)))
# Convert to numeric
as.numeric(column)
#[1] 1 2 3 4
The numbering can be adjusted by changing the order of the factor levels.
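As a rough illustration of that adjustment, the sketch below passes an explicit level order to factor. The particular order chosen here is only an assumption for the example:

```r
# Same sample data as above
column <- c("Not Answered", "Answered", "Something else", "Others")

# Hypothetical ordering: the position of each level becomes its numeric code
level_order <- c("Answered", "Not Answered", "Others", "Something else")
codes <- as.numeric(factor(column, levels = level_order))
codes
```

Any value that does not appear in level_order would be coded as NA, so the list of levels has to be complete.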
Alternatively, you can create a new column and fill it with the numeric values using an ifelse statement. To illustrate, let's assume this is your dataframe:
df <- data.frame(
Disposition = c(rep(c("answer", "no answer", "whatever", NA),3)),
Anything = c(rnorm(12))
)
df
Disposition Anything
1 answer 2.54721951
2 no answer 1.07409803
3 whatever 0.60482744
4 <NA> 2.08405038
5 answer 0.31799860
6 no answer -1.17558239
7 whatever 0.94206106
8 <NA> 0.45355501
9 answer 0.01787330
10 no answer -0.07629330
11 whatever 0.83109679
12 <NA> -0.06937357
Now you define a new column, say df$Analysis, and assign to it numbers based on the information in df$Disposition:
df$Analysis <- ifelse(df$Disposition=="no answer", 1,
ifelse(df$Disposition=="answer", 2, 3))
df
Disposition Anything Analysis
1 answer 2.54721951 2
2 no answer 1.07409803 1
3 whatever 0.60482744 3
4 <NA> 2.08405038 NA
5 answer 0.31799860 2
6 no answer -1.17558239 1
7 whatever 0.94206106 3
8 <NA> 0.45355501 NA
9 answer 0.01787330 2
10 no answer -0.07629330 1
11 whatever 0.83109679 3
12 <NA> -0.06937357 NA
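With more than a handful of categories, nested ifelse calls get hard to read. One possible alternative is a named lookup vector, sketched below. The name codes is made up for this example, and unlike the ifelse version, any value not listed in the lookup becomes NA rather than 3:

```r
# Same Disposition values as in the example data frame
df <- data.frame(
  Disposition = rep(c("answer", "no answer", "whatever", NA), 3)
)

# Hypothetical lookup table: names are the text values, elements are the codes
codes <- c("no answer" = 1, "answer" = 2, "whatever" = 3)

# Index the lookup by the text; as.character() guards against factor columns
df$Analysis <- unname(codes[as.character(df$Disposition)])
head(df, 4)
```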
The advantage of this method is that you keep the original information unchanged. If you now want to remove NA values from the dataframe, use na.omit. NB: this removes not only the rows where df$Disposition is NA but every row with an NA in any column:
df_clean <- na.omit(df)
df_clean
Disposition Anything Analysis
1 answer 2.5472195 2
2 no answer 1.0740980 1
3 whatever 0.6048274 3
5 answer 0.3179986 2
6 no answer -1.1755824 1
7 whatever 0.9420611 3
9 answer 0.0178733 2
10 no answer -0.0762933 1
11 whatever 0.8310968 3
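If only rows with a missing Disposition should go, while rows with NA in other columns stay, the filter can be restricted to that one column. A small sketch with made-up data:

```r
# Hypothetical data with NAs in two different columns
df <- data.frame(
  Disposition = c("answer", "no answer", NA, "whatever"),
  Anything    = c(1.5, NA, 0.3, 2.2)
)

# Keep rows where Disposition is present, whatever the other columns hold
df_disp <- df[!is.na(df$Disposition), ]

# For comparison: na.omit drops every row that has an NA anywhere
df_all <- na.omit(df)

nrow(df_disp)
nrow(df_all)
```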

Counting the number of individual letters in a string in R

I have the following string of characters:
pig<-c("A","B","C","D","AB","ABC","AB","AA","CD","CA",NA)
I am trying to get R to tell me how many of each letter there are in total and how many NAs there are. Thus, in this case I would like the result to look like this:
print(cow)
A B C D NA
6 3 4 2 1
I have tried table in combination with strsplit but cannot figure out exactly how to do it. Any thoughts? Thanks!
You would need to use NULL (or the empty character "") for the split value in strsplit(), then unlist it. Then, in table() you'll want to use the useNA argument to include any NA values. Here we'll use "ifany", so that if there are any NA values they will be shown in the table and if there are not, NA will not be shown in the result at all.
table(unlist(strsplit(pig, NULL)), useNA = "ifany")
#
# A B C D <NA>
# 7 4 4 2 1
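The same steps can be broken apart to see what each one contributes. In this sketch the intermediate name parts and the second input vector are assumptions made for illustration:

```r
pig <- c("A", "B", "C", "D", "AB", "ABC", "AB", "AA", "CD", "CA", NA)

# Split every string into single characters; an NA element stays NA
parts <- unlist(strsplit(pig, ""))

# Count letters, adding an NA column only when NAs are present
cow <- table(parts, useNA = "ifany")
cow["A"]            # count for a single letter
sum(is.na(parts))   # number of NAs, computed directly

# With no NA in the input, "ifany" adds no NA column at all
no_na <- table(unlist(strsplit(c("A", "AB"), "")), useNA = "ifany")
names(no_na)
```

Using useNA = "always" instead would show the NA column even when its count is zero.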

R - remove rows from a data frame with empty lines (not only numbers)

This issue seems to have been covered before, but after searching I couldn't find a solution. I load a table from a file, and it can happen (I don't know how) that some entire lines are empty. So the data frame I get looks like this:
# id c1 c2
# 1 a 1 2
# 2 b 2 4
# 3 NA NA
# 4 d 6 1
# 5 e 7 5
# 6 NA NA
If I do
apply(df, 1, function(x) all(is.na(x)))
I got all FALSE as the first column is not a number (the table is much bigger with mixed character and numeric columns) and I can't filter these lines. Also with na.omit or complete.cases I cannot sort it out.
Is there any function or expression to check empty rows?
You may be able to cut this problem off at the source with the parameters you pass to read.csv:
For instance, if the blanks are empty strings or single spaces, you could use
df <- read.csv(<your other logic here>, na.strings=c("NA","", " "))
This question seems to raise similar issues: read.csv blank fields to NA
If this works, then you can use the apply logic to work with the offending rows.
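A minimal sketch of both steps, assuming the file content shown in csv_text (made up to resemble the question's table, with the empty rows represented by empty fields). Once the blanks are read as NA, the all-NA rows can be found with rowSums(is.na(df)), which does the same job as the apply check:

```r
# Hypothetical file content; rows 3 and 6 contain only empty fields
csv_text <- "id,c1,c2\na,1,2\nb,2,4\n,,\nd,6,1\ne,7,5\n,,\n"

# Read empty fields as NA, in character columns too
df <- read.csv(text = csv_text,
               na.strings = c("NA", "", " "),
               stringsAsFactors = FALSE)

# A row is empty when every one of its columns is NA
empty_rows <- rowSums(is.na(df)) == ncol(df)
df_clean <- df[!empty_rows, ]
df_clean
```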

Making compatible (equal) dimensions for two vectors in R

I have a vector called classes that is the output of an analysis that used listwise deletion. As a result, the cases included in classes are a subset of the entire dataset -- some cases were dropped because of incomplete data.
Selection is a dummy variable that occurs with every case in my dataset. A shortened example of my data is below. There is also a unique case ID for every observation.
classes <- c(1,2,1,1,1,2,3,3,3,1,1,1,3,3,2,2,2)
selection <- c(1,0,0,0,1,1,1,1,0,0,0,0,0,1,1,1,1,0,0,0,1,1,1,0,1,0)
case <-seq(1,26,1)
I would like to create a new version of selection (say, selection2) so that it only includes cases that are in classes. Basically, I would like both variables to be the same length for comparison purposes, where the cases that are NOT included in classes are also not included in selection2.
I thought this would be an easy fix, but I've spent a lot of time getting nowhere, so I thought I'd ask. Thanks in advance!
If they are to be the same length, then the reduced version must have NA's:
> selection2 <- selection
> is.na(selection2) <- !selection2 %in% classes
> selection2
[1] 1 NA NA NA 1 1 1 1 NA NA NA NA NA 1 1 1 1 NA NA NA 1 1 1 NA 1 NA
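Note that the code above compares the values of selection with the values of classes, not case IDs. If the IDs of the cases that survived listwise deletion are known, selection could instead be subset by ID. The sketch below assumes such a vector exists; kept_cases is entirely hypothetical, since the question does not say which cases were dropped:

```r
classes   <- c(1,2,1,1,1,2,3,3,3,1,1,1,3,3,2,2,2)
selection <- c(1,0,0,0,1,1,1,1,0,0,0,0,0,1,1,1,1,0,0,0,1,1,1,0,1,0)
case      <- seq(1, 26, 1)

# Hypothetical: IDs of the 17 cases retained by the analysis
kept_cases <- c(1, 2, 4, 5, 7, 8, 10, 11, 13, 14, 16, 17, 19, 20, 22, 23, 25)

# Keep only the selection values whose case ID was retained
selection2 <- selection[case %in% kept_cases]

length(selection2) == length(classes)
```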

Barplot Error in R

I recently created a barplot in R using some sample data with no trouble. Then I tried it again using the real data which was exactly the same as the sample data except there was more of it. The problem is now I get this error:
Error in barplot.default(table(datafr)) :
'height' must be a vector or a matrix
I don't know if this is of help but when I print out the table these are what the last lines look like.
33333 2010-09-13-19:25:50.206 Google Chrome-#135 NA
[ reached getOption("max.print") -- omitted 342611 rows ]
Is it possible that this is too much data to process? Any suggestion as to how I can fix this?
Thanks :)
EDIT 1
Hey Joris,
Here is the info from str(datafr) :
'data.frame': 375944 obs. of 3 variables:
$ TIME : Factor w/ 375944 levels "2010-09-11-19:28:34.680 ",..: 1 2 3 4 5 6 7 8 9 10 ...
$ FOCUS.APP: Factor w/ 107 levels " Finder-#101 ",..: 3 3 3 3 3 3 3 3 1 1 ...
$ X : logi NA NA NA NA NA NA ...
and from traceback()
3: stop("'height' must be a vector or a matrix")
2: barplot.default(table(datafr))
1: barplot(table(datafr))
I also ran the other command you told me, but the feedback was super verbose; too much to print here. Let me know if you need any other info or if the last information was really important I can figure out a way to post it.
Thanks,
Ah, that solves the problem: your table has 3 dimensions, and barplot can't deal with that. Pass only the 2 columns you want to plot to the barplot function, e.g.:
# sample data
Df <- data.frame(
TIME = as.factor(seq.Date(as.Date("2010-09-11"),as.Date("2010-09-20"),by="day")),
FOCUS.APP = as.factor(rep(c("F101","F102"),5)),
X = sample(c(TRUE,FALSE,NA),10,replace=TRUE)
)
# make tables
T1 <- table(Df)
T2 <- table(Df[,-3])
# plot tables
barplot(T1) # fails: T1 has 3 dimensions
barplot(T2) # works: T2 has only 2 dimensions
That said, the plot must look interesting, to say the least. I don't know what you are trying to do, but you might want to reconsider your approach.
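One simpler plot, offered only as a guess at what might be wanted, counts how often each application appears. The data below are made up, and only the column name FOCUS.APP comes from the question:

```r
# Hypothetical log with one row per observation
Df <- data.frame(
  FOCUS.APP = c("Finder", "Chrome", "Chrome", "Finder", "Chrome", "Mail")
)

# A one-dimensional table of counts is a valid 'height' for barplot
counts <- table(Df$FOCUS.APP)
length(dim(counts))

barplot(counts, ylab = "Number of observations")
```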
