I have a dataframe, which includes a corrupt row with NAs and "". I cannot remove this from the .csv file I am importing into R since Excel cannot deal with (opening) the size of the .csv document.
I do a check when I first read.csv() like below to remove the row with NA:
if ( any( is.na(unique(data$A)) ) ){
print("WARNING: data has a corrupt row in it!")
data <- data[ !is.na(data$A) , ]
}
However, as if it were a factor, the A column remembers NA as a level:
> summary(data$A)
Mode FALSE TRUE NA's
logical 185692 36978 0
This obviously causes issues when I am trying to fit a linear model. How can I get rid of the NA as a logical level here?
I tried this, but it doesn't seem to work:
A <- as.logical(droplevels(factor(data_combine$A)))
summary(A)
Mode FALSE TRUE NA's
logical 185692 36978 0
unique(A)
[1] FALSE TRUE
First, your data$A is not a factor, it's a logical. The summary print methods are not the same for factors and logicals. Logicals use summary.default while factors dispatch to summary.factor. Plus it tells you in the result that the variable is a logical.
fac <- factor(c(NA, letters[1:4]))
log <- c(NA, logical(4), !logical(2))
summary(fac)
# a b c d NA's
# 1 1 1 1 1
summary(log)
# Mode FALSE TRUE NA's
# logical 4 2 1
See ?summary for the differences.
Second, your call
A <- as.logical(droplevels(factor(data_combine$A)))
summary(A)
is also calling summary.default because you wrapped droplevels with as.logical (why?). So don't change data_combine$A at all, and just try
summary(data_combine$A)
and see how that goes. For more information, please provide a sample of your data.
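A minimal sketch of the point above, using a small made-up data frame (the column values are assumptions, not the asker's data): after dropping the NA row, the logical column has no levels at all and no remaining NAs.

```r
# Hypothetical stand-in for the imported data: row 2 is the corrupt one.
data <- data.frame(A = c(TRUE, NA, FALSE, TRUE),
                   B = c("x", "", "y", "z"),
                   stringsAsFactors = FALSE)

# Same filter as in the question.
data <- data[!is.na(data$A), ]

is.factor(data$A)   # a logical vector is not a factor
levels(data$A)      # so it carries no levels attribute
anyNA(data$A)       # and no NA values are left after filtering
```

If anyNA() reports FALSE here, a trailing "NA's 0" in the summary printout is only a display detail and should not affect a model fit.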
As mentioned in my other answer, those actually are not factor levels. Since you asked how to remove the NA printing on summary, I'm undeleting this answer.
The NA printing is hard-coded into a summary for a logical vector. Here's the relevant code from summary.default.
# value <- if (is.logical(object))
# c(Mode = "logical", {
# tb <- table(object, exclude = NULL)
# if (!is.null(n <- dimnames(tb)[[1L]]) && any(iN <- is.na(n)))
# dimnames(tb)[[1L]][iN] <- "NA's"
# tb
# })
The exclude = NULL in table is the problem. If we look at the exclude argument in table with a logical vector log, we can see that when it is NULL the NAs always print out.
log <- c(NA, logical(4), NA, !logical(2), NA)
table(log, exclude = NULL) ## with NA values
# log
# FALSE TRUE <NA>
# 4 2 3
table(log[!is.na(log)], exclude = NULL) ## NA values removed
#
# FALSE TRUE <NA>
# 4 2 0
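As far as I know, what exclude = NULL does by default has differed between R versions, so the printout above may not reproduce everywhere. A sketch that does not rely on that default is to pass useNA explicitly (the vector x below is made up for illustration):

```r
x <- c(TRUE, FALSE, FALSE, NA)
clean <- x[!is.na(x)]          # NA values removed

tab_no     <- table(clean, useNA = "no")      # never show an NA column
tab_ifany  <- table(clean, useNA = "ifany")   # show it only if NAs exist
tab_always <- table(clean, useNA = "always")  # always show it, even with count 0

tab_with_na <- table(x, useNA = "ifany")      # NA column appears, count 1
```

With useNA = "always" the NA column is printed with a count of 0, which is the same kind of output the question is about.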
To make your summary print the way you want it, we can write a summary method based on the original source code.
summary.logvec <- function(object, exclude = NA) {
stopifnot(is.logical(object))
value <- c(Mode = "logical", {
tb <- table(object, exclude = exclude)
if(is.null(exclude)) {
if (!is.null(n <- dimnames(tb)[[1L]]) && any(iN <- is.na(n)))
dimnames(tb)[[1L]][iN] <- "NA's"
}
tb
})
class(value) <- c("summaryDefault", "table") # reuse the default summary print method
value
}
And then here are the results. Since we set exclude = NA as the default in our summary method, the NAs will not print unless we set it to NULL.
summary(log) ## original vector
# Mode FALSE TRUE NA's
# logical 4 2 3
class(log) <- "logvec"
summary(log, exclude = NULL) ## prints NA when exclude = NULL
# Mode FALSE TRUE NA's
# logical 4 2 3
summary(log) ## NA's don't print
# Mode FALSE TRUE
# logical 4 2
Now that I've done all this I'm wondering if you have tried to run your linear model.
I have a list of data.frames. I want to send each data.frame to a function using lapply. Inside the function I want to check whether the name of a data.frame includes a particular string. If the string in question is present I want to perform one series of operations. Otherwise I want to perform a different series of operations. I cannot figure out how to check whether the string in question is present from within the function.
I wish to use base R. This seems to be a possible solution but I cannot get it to work:
In R, how to get an object's name after it is sent to a function?
Here is an example list followed by an example function further below.
matrix.apple1 <- read.table(text = '
X3 X4 X5
1 1 1
1 1 1
', header = TRUE)
matrix.apple2 <- read.table(text = '
X3 X4 X5
1 1 1
2 2 2
', header = TRUE)
matrix.orange1 <- read.table(text = '
X3 X4 X5
10 10 10
20 20 20
', header = TRUE)
my.list <- list(matrix.apple1 = matrix.apple1,
matrix.orange1 = matrix.orange1,
matrix.apple2 = matrix.apple2)
This operation can check whether each object name contains the string apple
but I am not sure how to use this information inside the function further below.
grepl('apple', names(my.list), fixed = TRUE)
#[1] TRUE FALSE TRUE
Here is an example function. Based on hours of searching and trial-and-error I perhaps am supposed to use deparse(substitute(x)) but so far it only returns x or something similar.
table.function <- function(x) {
# The three object names are:
# 'matrix.apple1', 'matrix.orange1' and 'matrix.apple2'
myObjectName <- deparse(substitute(x))
print(myObjectName)
# perform a trivial example operation on a data.frame
my.table <- table(as.matrix(x))
# Test whether an object name contains the string 'apple'
contains.apple <- grep('apple', myObjectName, fixed = TRUE)
# Use the result of the above test to perform a trivial example operation.
# With my code 'my.binomial' is always given the value of 0 even though
# 'apple' appears in the name of two of the data.frames.
my.binomial <- ifelse(contains.apple == 1, 1, 0)
return(list(my.table = my.table, my.binomial = my.binomial))
}
table.function.output <- lapply(my.list, function(x) table.function(x))
These are the results of print(myObjectName):
#[1] "x"
#[1] "x"
#[1] "x"
table.function.output
Here are the rest of the results of table.function showing that my.binomial is always 0.
The first and third value of my.binomial should be 1 because the names of the first and third data.frames contain the string apple.
# $matrix.apple1
# $matrix.apple1$my.table
# 1
# 6
# $matrix.apple1$my.binomial
# logical(0)
#
# $matrix.orange1
# $matrix.orange1$my.table
# 10 20
# 3 3
# $matrix.orange1$my.binomial
# logical(0)
#
# $matrix.apple2
# $matrix.apple2$my.table
# 1 2
# 3 3
# $matrix.apple2$my.binomial
# logical(0)
You could redesign your function to use the list names instead:
table_function <- function(myObjectName) {
# The three object names are:
# 'matrix.apple1', 'matrix.orange1' and 'matrix.apple2'
myObject <- get(myObjectName)
print(myObjectName)
# perform a trivial example operation on a data.frame
my.table <- table(as.matrix(myObject))
# Test whether an object name contains the string 'apple'
contains.apple <- grep('apple', myObjectName, fixed = TRUE)
# Use the result of the above test to perform a trivial example operation.
# With my code 'my.binomial' is always given the value of 0 even though
# 'apple' appears in the name of two of the data.frames.
my.binomial <- +(contains.apple == 1)
return(list(my.table = my.table, my.binomial = my.binomial))
}
lapply(names(my.list), table_function)
This returns
[[1]]
[[1]]$my.table
1
6
[[1]]$my.binomial
[1] 1
[[2]]
[[2]]$my.table
10 20
3 3
[[2]]$my.binomial
integer(0)
[[3]]
[[3]]$my.table
1 2
3 3
[[3]]$my.binomial
[1] 1
If you want to keep the list names, you could use
sapply(names(my.list), table_function, simplify = FALSE, USE.NAMES = TRUE)
instead of lapply.
Use Map and pass both the list data and its name to the function. Change your function to accept two arguments.
table.function <- function(data, name) {
# The three object names are:
# 'matrix.apple1', 'matrix.orange1' and 'matrix.apple2'
print(name)
# perform a trivial example operation on a data.frame
my.table <- table(as.matrix(data))
# Test whether an object name contains the string 'apple'
contains.apple <- grep('apple', name, fixed = TRUE)
# Use the result of the above test to perform a trivial example operation.
# With my code 'my.binomial' is always given the value of 0 even though
# 'apple' appears in the name of two of the data.frames.
my.binomial <- as.integer(contains.apple == 1)
return(list(my.table = my.table, my.binomial = my.binomial))
}
Map(table.function, my.list, names(my.list))
#[1] "matrix.apple1"
#[1] "matrix.orange1"
#[1] "matrix.apple2"
#$matrix.apple1
#$matrix.apple1$my.table
#1
#6
#$matrix.apple1$my.binomial
#[1] 1
#$matrix.orange1
#$matrix.orange1$my.table
#10 20
# 3 3
#$matrix.orange1$my.binomial
#integer(0)
#...
#...
The same functionality is provided by imap in purrr where you don't need to explicitly pass the names.
purrr::imap(my.list, table.function)
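Two details are worth spelling out. Inside lapply(my.list, function(x) table.function(x)) the argument is literally the symbol x, which is why deparse(substitute(x)) returns "x". And grep() returns integer(0) when nothing matches, which is why the orange entry has no value at all. A minimal sketch that uses grepl() instead, so every element gets a 0 or a 1 (flag_apple is a hypothetical name, and the small data frames are stand-ins for the ones in the question):

```r
# Stand-in data with the same names as in the question.
my.list <- list(matrix.apple1  = data.frame(X3 = c(1, 1),   X4 = c(1, 1)),
                matrix.orange1 = data.frame(X3 = c(10, 20), X4 = c(10, 20)),
                matrix.apple2  = data.frame(X3 = c(1, 2),   X4 = c(1, 2)))

# Hypothetical helper: receives the data and its name as separate arguments.
flag_apple <- function(data, name) {
  list(my.table    = table(as.matrix(data)),
       # grepl() always returns TRUE or FALSE, never a zero-length result
       my.binomial = as.integer(grepl("apple", name, fixed = TRUE)))
}

out <- Map(flag_apple, my.list, names(my.list))
flags <- vapply(out, function(o) o$my.binomial, integer(1))
```

Here flags is a named integer vector with one entry per data frame, and Map keeps the list names on the result.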
I would like to check whether at least one element of my data_frame_1 is in data_frame_2 and add it as a new column.
My code:
library(data.table)
object_to_check <- data.table(c('aaax', 'bbbx', 'cccy', 'dddk', 'mmmt'))
colnames(object_to_check) <- 'x'
list_of_element <- data.table(c('ax', 'kh', 'dk'))
colnames(list_of_element) <- 'y'
Fun2 <- function(element_to_find, string_to_check) {
element_to_find <- '0'
if (element_to_find == '0') {
for (i in 1:length(list_of_element)) {
m <- lista[i]
element_to_find <- ifelse(grepl(m, string_to_check, ignore.case = T) == T,string_to_check,'')
}
}
}
object_to_check <- object_to_check[, check := Fun2(check, x)]
My code gives me this warning:
Warning message:
In `[.data.table`(object_to_check, , `:=`(check, Fun2(check, x))) :
Adding new column 'check' then assigning NULL (deleting it).
I'm stuck on this and I can't find a solution to my problem. Can someone help me?
Desired output:
x check
aaax ax
bbbx NA
cccy NA
dddk dk
mmmt NA
Thanks
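For context, the warning comes from Fun2 ending in a for loop: the value of a for loop is NULL, so the function returns NULL and := has nothing to assign. A minimal base R sketch of one way to get the desired output, assuming a plain substring match is what is wanted (first_match is a hypothetical helper name, and the vectors stand in for the two columns):

```r
x <- c("aaax", "bbbx", "cccy", "dddk", "mmmt")   # strings to check
y <- c("ax", "kh", "dk")                         # elements to look for

# Hypothetical helper: return the first pattern found in s, or NA if none is.
first_match <- function(s, patterns) {
  hit <- patterns[vapply(patterns,
                         function(p) grepl(p, s, fixed = TRUE),
                         logical(1))]
  if (length(hit) > 0) hit[[1]] else NA_character_
}

check <- vapply(x, first_match, character(1), patterns = y, USE.NAMES = FALSE)
result <- data.frame(x = x, check = check, stringsAsFactors = FALSE)
```

The same check vector could be assigned as a data.table column with :=, since it has one value per row.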
You can do it like this
> df1 <- data.frame(row.names=1:4, var1=c(TRUE, TRUE, FALSE, FALSE), var2=c(1,2,3,4))
> df2 <- data.frame(row.names=5:7, var1=c(FALSE, TRUE, FALSE), var2=c(5,2,3))
> df1
var1 var2
1 TRUE 1
2 TRUE 2
3 FALSE 3
4 FALSE 4
> df2
var1 var2
5 FALSE 5
6 TRUE 2
7 FALSE 3
There are also some easier ways available. You can use the all.equal(target, current, ...) function. It does not sort the data frames.
Another way is to use the identical() function.
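A short sketch of how the two functions differ, on made-up data frames (the names d1, d2 and d3 are assumptions for illustration): identical() is an exact test that returns a single TRUE or FALSE, while all.equal() allows a small numeric tolerance and returns a description of the differences when the objects are not equal, so it is usually wrapped in isTRUE().

```r
d1 <- data.frame(var1 = c(TRUE, FALSE), var2 = c(1, 2))
d2 <- d1                       # exact copy
d3 <- d1
d3$var2[2] <- 3                # one changed value

same_exact <- identical(d1, d2)            # exact comparison
diff_msg   <- all.equal(d1, d3)            # character description of differences
same_d3    <- isTRUE(all.equal(d1, d3))    # safe to use inside if()

# Tiny floating point differences: all.equal tolerates them, identical does not.
near_equal <- isTRUE(all.equal(1, 1 + 1e-10))
near_ident <- identical(1, 1 + 1e-10)
```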
Say I have a data frame df and want to subset it based on the value of column a.
df <- data.frame(a = 1:4, b = 5:8)
df
Is it necessary to include a which function in the brackets or can I just include the logical test?
df[df$a == "2",]
# a b
#2 2 6
df[which(df$a == "2"),]
# a b
#2 2 6
It seems to work the same either way... I was getting some strange results in a large data frame (i.e., getting empty rows returned as well as the correct ones) but once I cleaned the environment and reran my script it worked fine.
df$a == "2" returns a logical vector, while which(df$a=="2") returns indices. If there are missing values in the vector, the first approach will include them in the returned value, but which will exclude them.
For example:
x=c(1,NA,2,10)
x[x==2]
[1] NA 2
x[which(x==2)]
[1] 2
x==2
[1] FALSE NA TRUE FALSE
which(x==2)
[1] 3
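The same thing happens with data frame rows, which may explain the empty rows mentioned in the question. A minimal sketch with a made-up data frame that has one missing value in column a:

```r
df <- data.frame(a = c(1, NA, 2, 2), b = 5:8)

# Logical indexing: the NA in the condition produces a row made entirely of NAs.
by_logical <- df[df$a == 2, ]

# which() drops the NA, so only rows that really match come back.
by_which <- df[which(df$a == 2), ]

nrow(by_logical)   # one more row than by_which
nrow(by_which)
```

So which() is not required when the column has no missing values, but it is the safer choice when it might.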
For several days already I've been stuck with a problem in R, trying to make duplicate levels in multiple factor columns in data frame unique using a loop. This is part of a larger project.
I have more than 200 SPSS data sets where the number of cases vary between 4,000 and 23,000 and the number of variables vary between 120 and 1,200 (an excerpt of one of the SPSS data sets can be found here). The files contain both numeric and factor variables and many of the factor ones have duplicated levels. I have used read.spss from the foreign package to import them in data frames, keeping the value labels because I need them for further use. During the import R warns me about the duplicated levels in the factor columns:
> adn <- read.spss("/tmp/adn_110.sav", use.value.labels = TRUE,
use.missings = TRUE, to.data.frame = TRUE)
Warning messages:
1: In read.spss("/tmp/adn_110.sav", use.value.labels = TRUE, use.missings = TRUE, :
/tmp/adn_110.sav: Unrecognized record type 7, subtype 18 encountered in system file
2: In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else paste0(labels, :
duplicated levels in factors are deprecated
3: In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else paste0(labels, :
duplicated levels in factors are deprecated
The data frame, exported as .RData, can be found here. When I use table (for example) to get the counts for each level of any factor column, all duplicated levels are displayed, but the counts for all duplicated levels are added to the first occurrence of the duplicate levels and for all others 0s are returned:
> table(adn[["adn01"]], useNA = "ifany")
Incorrect Incorrect Partially correct Partially correct
8 0 4 0
Correct <NA>
2 1
Warning message:
In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else paste0(labels, :
duplicated levels in factors are deprecated
I know I can easily treat the factor as.numeric when calling table. However, I need the level names displayed in the output. I can use make.unique to make the levels for individual factor columns unique, appending a number at the end of the duplicate levels:
> levels(adn[["adn01"]]) <- make.unique(levels(adn[["adn01"]]), sep = " ")
Works like a charm. Then table shows me the correct counts:
> table(adn[["adn01"]], useNA = "ifany")
Incorrect Incorrect 1 Partially correct
5 3 1
Partially correct 1 Correct <NA>
3 2 1
However, doing this for each factor column in each of the more than 200 files, where the number of variables varies between 120 and 1,200, would be a mission of a lifetime. And if the files change I will have to redo everything. I naively thought looping through the columns would be easy. However, make.unique requires names. I have tried the following:
> lapply(adn[ , 1:length(adn)], make.unique(as.vector(attr(adn[ , 1:length(adn)],
"levels"))))
Error in make.unique(as.vector(attr(adn[, 1:length(adn)], "levels"))) :
'names' must be a character vector
No luck. I have tried many other things in the last days, including classical for loops. Still the same: 'names' must be a character vector. I guess the problem is in indexing the attribute levels of the columns, which is a list component, but I can't figure out what. Additional issue may be that not all columns are factors. Can someone help?
EDIT:
The solution provided by akrun works perfectly. Thank you once again!
Try
load('adn.RData')
indx <- sapply(adn, is.factor)
adn[indx] <- lapply(adn[indx], function(x) {
levels(x) <- make.unique(levels(x))
x })
table(adn[['adn01']], useNA='ifany')
# Incorrect Incorrect.1 Partially correct Partially correct.1
# 5 3 1 3
# Correct <NA>
# 2 1
table(adn[['adn03']], useNA='ifany')
# Incorrect Partially correct Correct <NA>
# 6 3 5 1
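To see what make.unique does in isolation, here is a small sketch on a plain character vector that mimics the duplicated labels above (lvls is just an illustrative name). The default separator is ".", and the sep argument gives the space-separated form used in the question.

```r
lvls <- c("Incorrect", "Incorrect",
          "Partially correct", "Partially correct", "Correct")

dotted <- make.unique(lvls)              # default separator "."
spaced <- make.unique(lvls, sep = " ")   # separator used in the question

dotted
spaced
```

The first occurrence of each label is left unchanged and later duplicates get a numeric suffix, which is why the counts in table() end up split correctly.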
Update
If you have multiple files, you can read the files into a list and then do the processing on the list. For example, assuming that the files are in the working directory:
files <- list.files(pattern='^adn\\d+')
lst1 <- lapply(files, function(x) read.spss(x, use.value.labels = TRUE,
use.missings = TRUE, to.data.frame = TRUE)) #not tested
For testing purposes, I am creating lst1 with the same dataset adn.
adn1 <- adn
lst1 <- list(adn, adn1)
Now you can apply make.unique to each list element:
lst2 <- lapply(lst1, function(dat) {
indx <- sapply(dat, is.factor)
dat[indx] <- lapply(dat[indx], function(x){
levels(x) <- make.unique(levels(x))
x})
dat})
lapply(lst2, function(x) table(x[['adn01']], useNA='ifany'))
# [[1]]
# Incorrect Incorrect.1 Partially correct Partially correct.1
# 5 3 1 3
# Correct <NA>
# 2 1
# [[2]]
# Incorrect Incorrect.1 Partially correct Partially correct.1
# 5 3 1 3
# Correct <NA>
# 2 1
I have the piece to display NAs, but I can't figure it out.
try(na.fail(x))
> Error in na.fail.default(x) : missing values in object
# display NAs
myvector[is.na(x)]
# returns
NA NA NA NA
The only thing I get from this is the length of the NA vector, which is actually not too helpful when the NAs were caused by a bug in my code that I am trying to track. How can I get the index of the NA element(s)?
I also tried:
subset(x,is.na(x))
which has the same effect.
EDIT:
y <- complete.cases(x)
x[!y]
# just returns another
NA NA NA NA
You want the which function:
which(is.na(arr))
is.na() will return a boolean index of the same shape as the original data frame.
In other words, any cells in that m x n index with the value TRUE correspond to NA values in the original data frame.
You can then use this to change the NAs, if you wish:
DF[is.na(DF)] = 999
To get the total number of data rows with at least one NA:
cc = complete.cases(DF)
num_missing = nrow(DF) - sum(cc)
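Putting those pieces together, a minimal sketch on made-up data (y and DF are illustrative): which() gives positions in a vector, and with arr.ind = TRUE it gives the row and column of every NA cell in a data frame.

```r
y <- c(1, NA, 3, NA)
na_pos <- which(is.na(y))                  # positions of NA in the vector

DF <- data.frame(a = c(1, NA, 3), b = c(NA, 5, 6))
cells <- which(is.na(DF), arr.ind = TRUE)  # one row per NA cell: row, col

# Number of rows with at least one NA.
n_incomplete <- nrow(DF) - sum(complete.cases(DF))

na_pos
cells
n_incomplete
```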
which(Dataset$variable=="") will return the row numbers at which that column contains an empty string.
R code using a loop and a condition:
# Testing for missing values
is.na(x) # returns TRUE if x is missing
y <- c(1,NA,3,NA)
is.na(y)
# returns a vector (F T F T)
# Print the index of NA values
for(i in 1:length(y)) {
if(is.na(y[i])) {
cat(i, ' ')
}
}
Output is:
2  4
Also:
which(is.na(y))