I have a data frame with a single column that contains some data and some empty cells.
When I check the levels of that column, it shows three levels because the empty cells are treated as a level. I want to delete that level.
Suppose I have
## editor note: starting from R 4.0.0, `stringsAsFactors` defaults to FALSE
## we now explicitly need `stringsAsFactors = TRUE`
df <- data.frame(fan = c("a","b"," ","a","b"), stringsAsFactors = TRUE)
I have tried this code
droplevels(df)
but it is not working.
'droplevels' does work. No need for complex code:
df <- data.frame(fan = c("a","b"," ","a","b"), stringsAsFactors = TRUE)
df
# fan
#1 a
#2 b
#3
#4 a
#5 b
df$fan[df$fan==' ']=NA
df$fan = droplevels(df$fan)
str(df)
#'data.frame': 5 obs. of 1 variable:
# $ fan: Factor w/ 2 levels "a","b": 1 2 NA 1 2
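For context on why the plain droplevels(df) call in the question appears to do nothing: droplevels() only removes levels that no value uses, and here the " " level is still used by row 3. A minimal sketch, assuming the same toy data as above:

```r
df <- data.frame(fan = c("a", "b", " ", "a", "b"), stringsAsFactors = TRUE)

# " " is still in use, so droplevels() keeps all three levels
nlevels(droplevels(df)$fan)

# once the blank is recoded as NA, the level is unused and can be dropped
df$fan[df$fan == " "] <- NA
levels(droplevels(df)$fan)
```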
When you read your file to R, you may avoid 'empty cell' being included as a factor level in the first place, by using the na.strings argument in read.csv (or in read.xxx). The na.strings argument defines "strings which are to be interpreted as NA values".
Here is an example where I read a text file (foo.csv) which I created from your 'df':
df2 <- read.csv(file = "foo.csv", na.strings = " ")
df2
# fan
# 1 a
# 2 b
# 3 <NA>
# 4 a
# 5 b
str(as.factor(df2$fan))
# Factor w/ 2 levels "a","b": 1 2 NA 1 2
When the file is read the empty fields are now treated as NA, and 'blank' is thus not included as a factor level.
From ?read.table: "Blank fields are [...] considered to be missing values in logical, integer, numeric and complex fields". However, in your data, the variable "fan" is a character. If you then have stringsAsFactors = TRUE in options or in read.xxx, the character vector is converted to a factor.
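If a file mixes truly empty fields and fields holding a single space, na.strings can take several values. A small sketch, assuming a made-up two-column file passed inline through the text argument (the contents and the names csv_text and df2 are illustrative only):

```r
# Hypothetical CSV contents: row 3 holds a single space, row 4 is empty
csv_text <- "id,fan\n1,a\n2,b\n3, \n4,\n5,b"

df2 <- read.csv(text = csv_text,
                na.strings = c("", " "),
                stringsAsFactors = TRUE)

levels(df2$fan)      # levels after both kinds of blank are read as NA
sum(is.na(df2$fan))  # number of blanks that became NA
```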
Try:
df$fan[grepl("^\\s*$", df$fan)] <- NA #in case you have c(" ", "", "a", "b", " ")
Explanation
^\\s*$ matches an empty string ('') as well as strings made up only of whitespace (" ", "  ", and so on), so it is more general than comparing against a single space.
str(droplevels(df))
#'data.frame': 5 obs. of 1 variable:
#$ fan: Factor w/ 2 levels "a","b": 1 2 NA 1 2
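To see which values the pattern catches, here is a quick check on a plain character vector (the vector x is made up for illustration):

```r
x <- c(" ", "", "a", "b", "   ", "\t")

# TRUE for empty strings and for strings containing only whitespace
blank <- grepl("^\\s*$", x)
blank

x[blank] <- NA
x
```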
If you want to create a new dataset with the empty cells deleted
df1 <- droplevels(df[!grepl("^\\s*$", df$fan),,drop=FALSE] )
str(df1)
#'data.frame': 4 obs. of 1 variable:
#$ fan: Factor w/ 2 levels "a","b": 1 2 1 2
If you are reading a csv file, this might help:
data <- read.csv(file = "data.csv", na.strings = "", stringsAsFactors = TRUE)
I modified the earlier answer by adding stringsAsFactors = TRUE.
So later it will not report NA in any subsequent analysis, such as CreateTableOne.
I have a few factor variables (AGE, ACT_TYPE, GENDER) in my data frame. Instead of printing the level distribution of each factor variable separately, I used a for loop to print the distributions. However, nothing is printed. Please let me know how to resolve the issue.
> str(combin)
Classes ‘data.table’ and 'data.frame': 500000 obs. of 333 variables:
$ CUSTOMER_ID : int 385793 286891 108751 278651 23637 130723 5694 275523 163723 469852 ...
$ ACT_TYPE : Factor w/ 2 levels "CSA","SA": 1 1 1 1 1 1 2 2 2 1 ...
$ GENDER : Factor w/ 3 levels "","F","M": 3 3 3 3 3 3 3 3 3 3 ...
$ LEGAL_ENTITY : Factor w/ 7 levels "ASSOCIATION",..: 3 3 3 3 3 3 3 3 3 3
combin[, prop.table(table(GENDER))]
GENDER
F M
0.000272 0.232436 0.767292
combin[, prop.table(table(ACT_TYPE))]
ACT_TYPE
CSA SA
0.710686 0.289314
If I replace the calls above with a for loop, I don't see any output.
Please let me know where I am going wrong...
for(i in names(combin)) {
combin[, prop.table(table(names(combin)[i]))]
}
Also, please suggest how I can add a condition to the for loop so that the
distribution is printed only for factor variables.
You could use purrr to loop through each column of the data frame and return a list, where each item in the list corresponds to a column and the columns that are factors are the prop.tables
library(purrr)
library(tibble)  # tibble() replaces the deprecated data_frame()
#generate some random data like yours
mydf <- tibble(
id = sample(1:100, 10, replace = FALSE)
, ACT_TYPE = factor(sample(c("CSA", "SA"), 10, replace = TRUE))
, GENDER = factor(sample(c("", "F", "M"), 10, replace = TRUE))
)
# use map_if to generate prop.tables when the column is a factor
map_if(mydf, is.factor, ~prop.table(table(.x)))
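As for why the original loop shows nothing: results are not auto-printed inside a for loop, and names(combin)[i] indexes by a column name rather than a position. A base R sketch, assuming a small made-up stand-in for combin (the [[ ]] column access should also work on a data.table):

```r
# Hypothetical stand-in for `combin`
combin <- data.frame(
  CUSTOMER_ID = 1:6,
  ACT_TYPE = factor(c("CSA", "CSA", "SA", "CSA", "SA", "CSA")),
  GENDER = factor(c("M", "F", "M", "M", "", "M"))
)

for (col in names(combin)) {
  if (is.factor(combin[[col]])) {
    cat(col, "\n")
    # nothing is auto-printed inside a loop, so call print() explicitly
    print(prop.table(table(combin[[col]])))
  }
}

# the same distributions collected in a named list
props <- lapply(Filter(is.factor, combin), function(x) prop.table(table(x)))
```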
After reading in data and cleaning it, I ended up with factor columns that have levels that should no longer be there.
For example, d below has one blank cell in excel. When it’s read in, the factor columns have a level "", which shouldn’t be part of the data.
d <- read.csv(header = TRUE, stringsAsFactors = TRUE, text='
x,y,value
a,one,1
,,5
b,two,4
c,three,10
')
d
#> x y value
#> 1 a one 1
#> 2 5
#> 3 b two 4
#> 4 c three 10
str(d)
#> 'data.frame': 4 obs. of 3 variables:
#> $ x : Factor w/ 4 levels "","a","b","c": 2 1 3 4
#> $ y : Factor w/ 4 levels "","one","three",..: 2 1 4 3
#> $ value: int 1 5 4 10
How do I remove this "" level from the factors (there are about 20 factor columns in the data frame) without deleting every row that has just one empty cell? Deleting those rows would reduce my sample size from 299000 to just 7 observations (which I have tried before).
One way would be to replace the '' with NA and use droplevels to remove the unused levels
d[1:2] <- lapply(d[1:2], function(x) droplevels(replace(x, x=="", NA)))
levels(d$x)
#[1] "a" "b" "c"
levels(d$y)
#[1] "one" "three" "two"
Another option, applied while reading the dataset (assuming the OP wants factor columns), would be
d <- read.csv("yourfile.csv", na.strings = "", stringsAsFactors = TRUE)
This should make sure that the '' will be read as NA.
Update
Suppose, there are numeric columns in between and we need to do the replace/droplevels only for the factor columns
d[] <- lapply(d, function(x) if(is.factor(x)) droplevels(replace(x, x== "", NA))
else x)
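Since the concern was losing rows, here is a quick check that this approach keeps every row and only recodes the blanks. It is a sketch that rebuilds the example d inline; n_before is an illustrative name:

```r
d <- read.csv(text = "x,y,value\na,one,1\n,,5\nb,two,4\nc,three,10",
              stringsAsFactors = TRUE)
n_before <- nrow(d)

d[] <- lapply(d, function(x) {
  if (is.factor(x)) droplevels(replace(x, x == "", NA)) else x
})

nrow(d) == n_before   # no rows were removed
colSums(is.na(d))     # blanks per column, now NA
```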
I'm trying to get the error rates for a Naive Bayes classifier by adding in each variable incrementally. For example, I have 25 variables in my dataset and I want to get the error rate of the model as I add in one variable at a time. That is, it would output the error rate of the model with the first 2 columns, then with the first 3 columns, then with the first 4 columns, and so on up to the last column.
Here is the pseudocode of what I'm trying to achieve
START
IMPORT DATASET WITH ALL VARIABLES
num_variables = num_dataset_cols
i= 1
WHILE (i <= num_variables)
{
CREATE NEW DATASET WITH x COLUMNs
BUILD THE MODEL
GET THE ERROR RATE
ADD IN NEXT COLUMN
i = i + 1
}
Here is a reproducible example. Obviously you can't build a NB classifier with this data, but that's not my problem. My problem is adding in the columns one by one. So far, the only way I can do it is by overwriting each column. For a NB classifier, the first column is the class node, so there must be at least 2 columns starting off in order for it to run.
#REPRODUCIBLE EXAMPLE
col1 <- c("A", "B", "C", "D", "E")
col2 <- c(1,2,3,4,5)
col3 <- c(TRUE, FALSE, FALSE, TRUE, FALSE)
col4 <- c("n","y","y","n","y")
col5 <- c("10", "15", "50", "100", "20")
dataset <- data.frame(col1, col2, col3, col4,col5)
num_variables <- ncol(dataset)
i <- 1
while (i < num_variables) {  # i + 1 must not exceed the number of columns
data <- dataset[c(1, i+1)]
str(data)
#BUILD MODEL AND GET VALIDATION ERROR
#INCREMENT i TO GET NEXT COLUMN
i <- i + 1
}
You should be able to see from the str(data) that each time the column is overwritten. Does anyone know how I could go about adding each column without overwriting the previous one? Someone suggested an array to me, but I'm not too familiar with arrays in R. Would this work?
I think this is what you want.
col1 <- c("A", "B", "C", "D", "E")
col2 <- c(1,2,3,4,5)
col3 <- c(TRUE, FALSE, FALSE, TRUE, FALSE)
col4 <- c("n","y","y","n","y")
col5 <- c("10", "15", "50", "100", "20")
dataset <- data.frame(col1, col2, col3, col4, col5, stringsAsFactors = TRUE)  # factors, as in the output below
dataset
num_variables <- ncol(dataset)
num_variables
i <- 1
while (i <= num_variables) {
data <- dataset[, 1:i]
print(str(data))
#BUILD MODEL AND GET VALIDATION ERROR
#INCREMENT i TO GET NEXT COLUMN
i <- i + 1
}
Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
NULL
'data.frame': 5 obs. of 2 variables:
$ col1: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
$ col2: num 1 2 3 4 5
NULL
'data.frame': 5 obs. of 3 variables:
$ col1: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
$ col2: num 1 2 3 4 5
$ col3: logi TRUE FALSE FALSE TRUE FALSE
NULL
'data.frame': 5 obs. of 4 variables:
$ col1: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
$ col2: num 1 2 3 4 5
$ col3: logi TRUE FALSE FALSE TRUE FALSE
$ col4: Factor w/ 2 levels "n","y": 1 2 2 1 2
NULL
'data.frame': 5 obs. of 5 variables:
$ col1: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
$ col2: num 1 2 3 4 5
$ col3: logi TRUE FALSE FALSE TRUE FALSE
$ col4: Factor w/ 2 levels "n","y": 1 2 2 1 2
$ col5: Factor w/ 5 levels "10","100","15",..: 1 3 5 2 4
NULL
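Since the classifier needs the class column plus at least one predictor, a variant of the loop can start at two columns and use drop = FALSE so that the subset is always a data frame. A sketch under those assumptions, using a shortened version of the example data:

```r
dataset <- data.frame(
  col1 = c("A", "B", "C", "D", "E"),
  col2 = c(1, 2, 3, 4, 5),
  col3 = c(TRUE, FALSE, FALSE, TRUE, FALSE),
  col4 = c("n", "y", "y", "n", "y"),
  stringsAsFactors = TRUE
)

# start at 2 so the class column always has at least one predictor
for (i in 2:ncol(dataset)) {
  data <- dataset[, seq_len(i), drop = FALSE]  # columns 1..i, always a data frame
  cat("columns:", paste(names(data), collapse = ", "), "\n")
  # BUILD MODEL AND GET VALIDATION ERROR
}
```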
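You can use the append function after defining an output variable before the loop
output <- list()
and then, inside the loop, assign the result back, since append returns a new object rather than modifying output in place:
data <- dataset[c(1, i+1)]
output <- append(output, list(data))
str(data)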
Using the "assign" function within a while loop can be helpful for issues like this. You don't show the model syntax, but something like this should work:
dataset$errorrate <- [whatever makes this calculation, assuming it is vectorized]
name <- paste0("errorrate", i)
assign(name, dataset$errorrate)
...
This should leave you with i variables, each containing the error estimate for one model run. If you are looking for one parameter estimate per model, you can assign each single estimate a unique name within the global environment using the process above and then rbind them together after the loop has finished.
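An alternative to creating one global variable per run is to collect the estimates in a single named vector. The sketch below assumes a hypothetical fit_and_score() that stands in for the model fitting; it returns a made-up number, not a real error rate:

```r
# Hypothetical placeholder: returns a made-up score, NOT a real error rate
fit_and_score <- function(data) ncol(data) / 10

dataset <- data.frame(col1 = c("A", "B", "C"),
                      col2 = 1:3,
                      col3 = c(TRUE, FALSE, TRUE))

sizes <- 2:ncol(dataset)
error_rates <- vapply(sizes, function(i) {
  fit_and_score(dataset[, seq_len(i), drop = FALSE])
}, numeric(1))
names(error_rates) <- paste0("first_", sizes, "_cols")
error_rates
```

Keeping the results in one vector avoids having to look the variables up by name afterwards.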
I am curious about the behaviour of transform. Two ways I might try creating a new column as character not as factor:
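## editor note: as of R 4.0.0 `stringsAsFactors` defaults to FALSE, so the factor results
## shown below reflect the behaviour of earlier R versions
x <- data.frame(Letters = LETTERS[1:3], Numbers = 1:3, stringsAsFactors = TRUE)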
y <- transform(x, Alphanumeric = as.character(paste(Letters, Numbers)))
x$Alphanumeric = with(x, as.character(paste(Letters, Numbers)))
x
y
str(x$Alphanumeric)
str(y$Alphanumeric)
The results "look" the same:
> x
Letters Numbers Alphanumeric
1 A 1 A 1
2 B 2 B 2
3 C 3 C 3
> y
Letters Numbers Alphanumeric
1 A 1 A 1
2 B 2 B 2
3 C 3 C 3
But look inside and only one has worked:
> str(x$Alphanumeric) # did convert to character
chr [1:3] "A 1" "B 2" "C 3"
> str(y$Alphanumeric) # but transform didn't
Factor w/ 3 levels "A 1","B 2","C 3": 1 2 3
I didn't find ?transform very useful to explain this behaviour - presumably Alphanumeric was coerced back to being a factor - or find a way to stop it (something like stringsAsFactors = FALSE for data.frame). What is the safest way to do this? Are there similar pitfalls to beware of, for instance with the apply or plyr functions?
This is not so much an issue with transform as it is with data.frame, where stringsAsFactors defaulted to TRUE before R 4.0.0 (transform builds its result by calling data.frame). Add an argument setting it to FALSE and you'll be on your way:
y <- transform(x, Alphanumeric = paste(Letters, Numbers),
stringsAsFactors = FALSE)
str(y)
# 'data.frame': 3 obs. of 3 variables:
# $ Letters : Factor w/ 3 levels "A","B","C": 1 2 3
# $ Numbers : int 1 2 3
# $ Alphanumeric: chr "A 1" "B 2" "C 3"
I generally use within instead of transform, and it seems to not have this problem:
y <- within(x, {
Alphanumeric = paste(Letters, Numbers)
})
str(y)
# 'data.frame': 3 obs. of 3 variables:
# $ Letters : Factor w/ 3 levels "A","B","C": 1 2 3
# $ Numbers : int 1 2 3
# $ Alphanumeric: chr "A 1" "B 2" "C 3"
This is because it takes an approach similar to your with approach: Create a character vector and add it (via [<-) into the existing data.frame.
You can view the source of each of these by typing transform.data.frame and within.data.frame at the prompt.
As for other pitfalls, that's much too broad a question. One thing that comes to mind right away is that apply would create a matrix from a data.frame, so all the columns would be coerced to a single type.
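The apply pitfall can be seen on a small data frame like the one in the question. This is a minimal sketch, with stringsAsFactors set explicitly so the result does not depend on the R version:

```r
x <- data.frame(Letters = c("A", "B", "C"), Numbers = 1:3,
                stringsAsFactors = FALSE)

sapply(x, class)    # works column by column, so each class is kept
apply(x, 2, class)  # goes through as.matrix(), so every column is character
```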
I have a table that look like this
A B C
AB ABC CBS
AB ABC
ADS
BBB
I want to use the columns as character vectors, so I used this
A= as.character(table$A)
This results in c("AB", "AB", ""), but my goal was c("AB", "AB"), that is, without the empty cell "". To get rid of the empty cell I used A=A[!A==""], which gives the result I want, but there must be a more elegant way of accomplishing the same goal.
My questions are: 1) is there a better way of removing empty characters/cells?
Or, more generally, 2) is there a way to transform the 3 columns (A, B, C) into character vectors A, B, C without the empty cells?
Thanks
'data.frame': 3 obs. of 3 variables:
$ A: Factor w/ 2 levels "","AB": 2 2 1
$ B: Factor w/ 3 levels "","ABC","ADS": 2 1 3
$ C: Factor w/ 3 levels "ABC","BBB","CBS": 3 1 2
Try specifying the argument na.strings during data import. Also, instead of using read.csv(), you could write read.csv2() which uses sep = ";" by default.
# Import data
data <- read.csv2("/path/to/data.csv", header = TRUE,
na.strings = "", stringsAsFactors = FALSE)
str(data)
'data.frame': 4 obs. of 3 variables:
$ A: chr "AB" "AB" NA NA
$ B: chr "ABC" NA "ADS" NA
$ C: chr "CBS" "ABC" NA "BBB"
# Exclude NAs
as.character(na.exclude(data$A))
[1] "AB" "AB"
If you prefer not to read your data set again, you can use:
# not in ('') or ("")
A <- table$A[!table$A %in% '']
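For the second part of the question (all three columns at once), one possibility is to loop over the columns with lapply. This is a sketch on a made-up stand-in for the asker's table (called tab here to avoid masking base::table). The result is a list, because the columns end up with different lengths:

```r
# Hypothetical stand-in for the asker's `table`
tab <- data.frame(
  A = c("AB", "AB", ""),
  B = c("ABC", "", "ADS"),
  C = c("CBS", "ABC", "BBB"),
  stringsAsFactors = TRUE
)

# convert each column to character and keep only the non-empty strings
cols <- lapply(tab, function(col) {
  col <- as.character(col)
  col[nzchar(col)]
})
cols$A
```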