Remove NA from a dataset in R - r

I have used this code to remove rows where Age is blank:
data <- data[data$Age != "",]
in this dataset
Initial Age Type
1 S 21 Customer
2 D Enquirer
3 T 35 Customer
4 D 36 Customer
However if I run the above code, I get this:
Initial Age Type
1 S 21 Customer
N/A N/A N/A N/A
3 T 35 Customer
4 D 36 Customer
When all I want is:
Initial Age Type
1 S 21 Customer
3 T 35 Customer
4 D 36 Customer
I just want the dataset without any NAs: I want to remove every row where Age is either NA or just "".
I have tried the na.omit function but this deletes everything from my dataset.
This is an example dataset, but in my real dataset there are over 1000 columns and I would like to remove all rows that are NA for a particular column name.
This is my first post, so I apologise if this isn't the right way to write up my code; I am also very new to R.
Also, the row number has been converted to NA, which I don't want; it is messing up my calculations.
Thank you for taking time to read and commenting this post.

As pointed out in the comments, it would be good to know what the exact values in the "empty" Age cells are. When I recreate the above data snippet using:
data <- data.frame(Initial = c("S", "D", "T", "D"),
                   Age = c(21, "", 35, 36),
                   Type = c("Customer", "Enquirer", "Customer", "Customer"))
We can see that "Age" is transformed into a column of type "character".
Using the following code we can effectively remove those "empty" Age rows:
data <- subset(data, is.finite(as.numeric(Age)))
This takes the subset of the dataframe "data" where a numeric version of the Age variable is a finite number, thus eliminating the rows with missing Age values.
Hope this solves your problem!

Thank you @M.P.Maurits
This formula worked!
data <- subset(data, is.finite(as.numeric(Age)))
The column was actually an integer, but converting it to numeric and filtering removed all rows that were imported as blank but shown as NAs. I didn't think that integer versus numeric would make a difference.
Thank you to everyone else who also commented, much appreciated :)
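A possible explanation for the all-NA row in the question, sketched in base R. It assumes, as the follow-up above suggests, that the blank Age cell was imported as NA in an integer column; the toy data frame and the names bad, clean and clean2 are illustrative only. Comparing NA with "" gives NA rather than FALSE, and a logical NA in a row index produces a row filled with NA.

```r
# Toy data: Age is an integer column where the blank cell was imported as NA
data <- data.frame(Initial = c("S", "D", "T", "D"),
                   Age = c(21L, NA, 35L, 36L),
                   Type = c("Customer", "Enquirer", "Customer", "Customer"))

# NA != "" evaluates to NA, and an NA row index yields an all-NA row
bad <- data[data$Age != "", ]

# Keep only rows where Age is neither NA nor an empty string
clean <- data[!is.na(data$Age) & data$Age != "", ]

# If only NA matters, complete.cases() on the single column also works
clean2 <- data[complete.cases(data$Age), ]
```

Because the check touches only the Age column, the other columns (however many there are) do not affect which rows are kept, unlike na.omit on the whole data frame.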

A simple solution based on dplyr's function filter:
library(dplyr)
data %>%
  filter(Age != "")
Initial Age Type
1 S 21 Customer
2 T 35 Customer
3 D 36 Customer

Related

R Count the number of occurrence produces strange output

I have a dataset consisting of 10,000 rows. I draw a random sample of 1,000 rows from it.
Name Age ...
Alice
Jasmine
Alice
Joel
Jimmy
Alice
Alex
Agar
Agar
When I count the number of occurrences of each name in the column
name <- table(example['Name'], useNA = "ifany")
The output is strange: it includes a name, Bruce, with a count of 0. Bruce does not appear in the random sample of 1,000 rows, only in the original dataset. I only want to use the random sample. Is the 0 count normal? How can I get rid of it, or is that impossible?
Alice 3
Jasmine 1
Joel 1
Agar 2
Jimmy 1
Alex 1
Bruce 0
You may use droplevels to drop unused factor levels.
name <- table(droplevels(example['Name']))
Consider this example -
set.seed(123)
#Sample dataframe
df <- data.frame(a = factor(sample(c('A', 'B', 'C'), 10, replace = TRUE)))
#Select only first 5 rows so we don't have any row with "A" value.
df1 <- df[1:5, , drop = FALSE]
table(df1['a'])
#A B C
#0 1 4
table(droplevels(df1['a']))
#B C
#1 4
Sounds like your name field is a factor variable. You are getting totals based on the factor levels. Note in the help text for table(), "Only when exclude is specified (i.e., not by default) and non-empty, will table potentially drop levels of factor arguments." Sounds like you may want to specify exclude to drop factor levels. Or consider refactoring the name field of the random data set with the unique values found just in that set.
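The "refactoring" suggestion above could look roughly like this. It is a minimal sketch on made-up data (full, samp, counts_all and counts_used are hypothetical names), showing that calling factor() again on the sampled column rebuilds the levels from the values actually present.

```r
# Hypothetical full data with a factor column
full <- data.frame(Name = factor(c("Alice", "Bruce", "Joel", "Alice")))

# A subset that happens not to contain Bruce
samp <- full[c(1, 3, 4), , drop = FALSE]

# The subset still carries the original levels, so Bruce is counted as 0
counts_all <- table(samp$Name)

# Re-creating the factor keeps only the levels present in the subset
counts_used <- table(factor(samp$Name))
```

This has the same effect as droplevels on a single column; droplevels is handier when a whole data frame has several factor columns.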

R remove rows, that hasn't got the same value in two columns

Sorry for asking about something that should be an easy job; I am a geology student trying to use R for my work at school.
I'd like to remove the rows from my data where the values in two particular columns do not match.
example:
e F 14 14
t D 14 12
j A 11 11
a R 14 13
So the second row should be removed and the fourth as well. The columns with the letters should not be relevant, just the two with the numbers.
Suppose your data is stored in df; then do the following:
df <- data.frame(col1 = c('e','t','j','a'),
                 col2 = c('F','D','A','R'),
                 col3 = c(14,14,11,14),
                 col4 = c(14,12,11,13))
df <- df[df$col3==df$col4,]
Simple subset operation:
new_df <- subset(df, columnX == columnY)
So assume the rows that you want to remove are rows 2 and 3.
The key idea is to form the set of rows you want to remove and keep the complement of that set.
In R, a negative index in the row position drops those rows, which gives the complement.
So, assuming the data.frame is called myData:
myData <- myData[-c(2, 3), ]
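Instead of typing the row numbers by hand, the set of rows to drop can be computed from the condition. The sketch below reuses the example columns from the question (col3, col4); to_drop and kept are illustrative names. One caveat worth guarding against: if no rows mismatch, which() returns an empty vector, and indexing with -integer(0) selects zero rows rather than all of them.

```r
df <- data.frame(col1 = c('e', 't', 'j', 'a'),
                 col2 = c('F', 'D', 'A', 'R'),
                 col3 = c(14, 14, 11, 14),
                 col4 = c(14, 12, 11, 13))

# Row numbers where the two numeric columns differ
to_drop <- which(df$col3 != df$col4)

# Drop them, but leave df unchanged when there is nothing to drop
kept <- if (length(to_drop) > 0) df[-to_drop, ] else df
```

The logical-index form shown earlier, df[df$col3 == df$col4, ], avoids the empty-vector case altogether.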

Data handling: 2 independent factors, which decide the position of a numeric value in a new data frame

I am new to Stackoverflow and to R, so I hope you can be a bit patient and excuse any formatting mistakes.
I am trying to write an R-script, which allows me to automatically analyze the raw data of a qPCR machine.
I was quite successful in cleaning up the data, but at some point I run into trouble. My goal is to consolidate the data into a comprehensive table.
The initial data frame (DF) looks something like this:
Sample Detector Value
1 A 1
1 B 2
2 A 3
3 A 2
3 B 3
3 C 1
My goal is to have a dataframe with the Sample-names as row names and Detector as column names.
A B C
1 1 2 NA
2 3 NA NA
3 2 3 1
My approach
First I took out the names of samples and detectors and saved them in vectors as factors.
detectors = summary(DF$Detector)
detectors = names(detectors)
samples = summary(DF$Sample)
samples = names(samples)
Then I subsetted the detectors into a new dataframe based on the name of the detector in the dataframe.
for (i in 1:length(detectors)){
assign(detectors[i], DF[which(DF$Detector == detectors[i]),])
}
Then I initialize an empty dataframe with the right column and row names:
result = data.frame(matrix(NA, nrow = length(samples), ncol = length(detectors)))
colnames(result) = detectors
rownames(result) = samples
So now the problem. I have to get the values from the detector subsets into the result dataframe. Here it is important that each value finds its way to the right position in the dataframe. The issue is that the subsets do not all have the same number of values, since some samples lack some detectors.
I tried to do the following: iterate through the detector subsets, compare the row name (= sample name) of each with the other, and if they are the same write the value into the new dataframe. If they are not the same, it should write an NA.
for (i in 1:length(detectors)){
  for (j in 1:length(get(detectors[i])$Sample)){
    result[j,i] = ifelse(get(detectors[i])$Sample[j] == rownames(result[j,]), get(detectors[i])$Ct.Mean[j], NA)
  }
}
The trouble is that this stops iterating through the detector's Sample column and switches to the next detector. My understanding is that the samples being compared get out of sync, so every subsequent ifelse yields NA.
I tried to circumvent it by replacing the "no" argument of ifelse(test, yes, no) with j=j+1 to get it back in sync, but unfortunately this didn't work.
I hope I could make my problem understandable to you!
Looking forward to hearing any suggestions or comments (also on how to generally improve my code ;)
We can use acast from library(reshape2) to convert from 'long' to 'wide' format.
library(reshape2)
acast(DF, Sample~Detector, value.var='Value') #returns a matrix output
# A B C
#1 1 2 NA
#2 3 NA NA
#3 2 3 1
If we need a data.frame output, use dcast.
Or use spread from library(tidyr), which will also have the 'Sample' as an additional column (in tidyr 1.0.0 and later, spread is superseded by pivot_wider).
library(tidyr)
spread(DF, Detector, Value)
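If installing a package is not an option, the same long-to-wide reshaping can be sketched in base R with tapply. This is only one possible approach and assumes each Sample/Detector pair occurs at most once, so that taking the mean of a cell simply returns its single value; wide and result are illustrative names.

```r
DF <- data.frame(Sample   = c(1, 1, 2, 3, 3, 3),
                 Detector = c("A", "B", "A", "A", "B", "C"),
                 Value    = c(1, 2, 3, 2, 3, 1))

# Rows are samples, columns are detectors; missing combinations become NA
wide <- with(DF, tapply(Value, list(Sample, Detector), mean))

# Convert the matrix to a data frame with sample names as row names
result <- as.data.frame(wide)
```

No loops or get()/assign() calls are needed, because tapply places each value by its pair of group labels rather than by position.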

Get the position of maximum value and the respective row element in a Data frame

I created a data frame named "data" that has 100 rows of names and corresponding ages (column names "NAMES" and "AGES"). Now I am trying to find the maximum age with the max() function, using
max(data[,"AGES"])
I get the maximum age, but I also want to get its position and the names of the people who have the maximum age. After getting those names, I want to sort them alphabetically. How do I do this?
I tried searching on the net, but wasn't successful in putting the different pieces together.
Let's first generate some demo data:
data <- data.frame(NAMES = replicate(100, paste(sample(letters, 8, replace = TRUE), collapse = "")),
                   AGES = sample(20:60, 100, replace = TRUE))
head(data)
NAMES AGES
1 oepefudt 21
2 ibmuaemm 49
3 mkockaqu 23
4 whyzomna 59
5 omqqtbsz 35
6 qnbmjmuf 25
We can then find the rows that have the maximum age, extract their names, and finally sort them in alphabetical order in a single line:
sort(as.character(data$NAMES[data$AGES==max(data$AGES)]))
Or maybe more transparently:
# Find the maximum age
max.age<-max(data$AGES)
# Which rows have the maximum age value?
ind<-which(data$AGES==max.age)
# Extract the name using the ind from above
persons<-as.character(data$NAMES[ind])
# Sort the names
persons.sorted<-sort(persons)
persons.sorted
Would this help?
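A small sketch on made-up data with a tie for the maximum, to show why which() with == is used above rather than which.max(): which.max() reports only the first position of the maximum. The four-row data frame and the names first_pos, pos and oldest are illustrative.

```r
data <- data.frame(NAMES = c("zoe", "adam", "mia", "bob"),
                   AGES = c(60, 60, 35, 42),
                   stringsAsFactors = FALSE)

# which.max() returns only the first row holding the maximum
first_pos <- which.max(data$AGES)

# which() with == returns every row holding the maximum
pos <- which(data$AGES == max(data$AGES))

# Keep the full rows (position, name and age) and sort them by name
oldest <- data[pos, ]
oldest <- oldest[order(oldest$NAMES), ]
```

The row names of oldest still carry the original positions, so both the position and the name are available after sorting.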

count of entries in data frame in R

I'm looking to get a count for the following data frame:
> Santa
Believe Age Gender Presents Behaviour
1 FALSE 9 male 25 naughty
2 TRUE 5 male 20 nice
3 TRUE 4 female 30 nice
4 TRUE 4 male 34 naughty
of the number of children who believe. What command would I use to get this?
(The actual data frame is much bigger. I've just given you the first four rows...)
Thanks!
You could use table:
R> x <- read.table(textConnection('
Believe Age Gender Presents Behaviour
1 FALSE 9 male 25 naughty
2 TRUE 5 male 20 nice
3 TRUE 4 female 30 nice
4 TRUE 4 male 34 naughty'
), header=TRUE)
R> table(x$Believe)
FALSE TRUE
1 3
I think of this as a two-step process:
1. subset the original data frame according to the filter supplied (Believe==FALSE); then
2. get the row count of this subset
For the first step, the subset function is a good way to do this (just an alternative to ordinary index or bracket notation).
For the second step, I would use dim or nrow.
One advantage of using subset: you don't have to parse the result it returns to get the result you need; just call nrow on it directly.
So in your case:
v = nrow(subset(Santa, Believe==FALSE)) # 'subset' returns a data.frame
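or wrapped in a small helper function (the column name is passed as a string here, because subset() cannot resolve a bare column name that is handed through a function argument):
fnx = function(col, lev){nrow(Santa[which(Santa[[col]] == lev), ])}
fnx("Believe", TRUE)
[1] 3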
Aside from nrow, dim will also do the job. This function returns the dimensions of a data frame (rows, cols) so you just need to supply the appropriate index to access the number of rows:
v = dim(subset(Santa, Believe==FALSE))[1]
An answer to the OP posted before this one shows the use of a contingency table. I don't like that approach for the general problem as recited in the OP. Here's the reason. Granted, the general problem of "how many rows in this data frame have value x in column C?" can be answered using a contingency table as well as using a "filtering" scheme (as in my answer here). If you want row counts for all values of a given factor variable (column), then a contingency table (via calling table and passing in the column(s) of interest) is the most sensible solution; however, the OP asks for the count of a particular value in a factor variable, not counts across all values. So the contingency table computes more than is needed, which carries a performance hit (it might be big or trivial, depending on the size of the data frame and the processing pipeline in which this function resides). And of course, once the result from the call to table is returned, you still have to extract from that result just the count that you want.
So that's why, to me, this is a filtering rather than a cross-tab problem.
sum(Santa$Believe)
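Why this works, as a brief sketch on the four rows shown in the question: in arithmetic, TRUE counts as 1 and FALSE as 0, so summing a logical column counts its TRUE values. The na.rm = TRUE argument is an addition here, on the assumption that the full data frame might contain missing values; n_believers and share are illustrative names.

```r
Santa <- data.frame(Believe = c(FALSE, TRUE, TRUE, TRUE),
                    Age = c(9, 5, 4, 4),
                    Gender = c("male", "male", "female", "male"))

# Number of children who believe (TRUE is counted as 1)
n_believers <- sum(Santa$Believe, na.rm = TRUE)

# The mean of a logical column gives the proportion of TRUE values
share <- mean(Santa$Believe, na.rm = TRUE)
```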
You can do summary(Santa$Believe) and you will get the count for TRUE and FALSE
dplyr makes this really easy.
library(dplyr)
x <- Santa %>%
  count(Believe)
If you wanted to count by a group, for instance how many males vs. females believe, just add a group_by:
x <- Santa %>%
  group_by(Gender) %>%
  count(Believe)
A one-line solution with data.table could be
library(data.table)
setDT(x)[,.N,by=Believe]
Believe N
1: FALSE 1
2: TRUE 3
Using sqldf also fits here:
library(sqldf)
sqldf("SELECT Believe, Count(1) as N FROM Santa
GROUP BY Believe")
