Here are three vectors.
vec1 <- 1:6
vec2 <- c('radio', 'newspaper', 'web-page', 'chat', 'tv', 'web-page')
vec3 <- c(0, 0, 1, 1, 0, 1)
The task is to form a data frame with the following structure using these vectors.
'data.frame': 6 obs. of 3 variables:
$ id : int 1 2 3 4 5 6
$ response: Factor w/ 2 levels "No","Yes": 1 1 2 2 1 2
$ medium : chr "radio" "newspaper" "web-page" "chat" ...
Here is my solution.
dfr <- data.frame(id = vec1, response = vec3, medium = vec2, stringsAsFactors = FALSE)
dfr$response <- factor(x = , levels = , labels = )
My question is: "What values should the arguments (x, levels, labels) have and why?"
Talking about this line:
dfr$response <- factor(x = , levels = , labels = )
We can pass the labels straight to factor(): by default the levels are the sorted unique values of vec3 (0 and 1), so the labels 'No' and 'Yes' are applied in that order.
df <- data.frame(id = vec1, response = factor(vec3, labels = c('No', 'Yes')),
medium = vec2, stringsAsFactors = FALSE)
str(df)
#'data.frame': 6 obs. of 3 variables:
#$ id : int 1 2 3 4 5 6
#$ response: Factor w/ 2 levels "No","Yes": 1 1 2 2 1 2
#$ medium : chr "radio" "newspaper" "web-page" "chat" ...
You can read ?factor for more details.
In this call:
x is the vector of data that you want to turn into a factor, in this case the responses, x = dfr$response.
levels is a vector of the values that x might take. The default is the distinct values of x in ascending order (numeric or alphabetical), so here the default would be c(0, 1). You don't have to supply the levels, since they are detected automatically, but when you are adding labels it is good practice to state them, so that labels and levels still line up if you have many levels and get the order mixed up.
labels is either a character vector with one label per level or a single string. Duplicated labels can be used to map several values to the same label. For your task you would use c("No", "Yes"). By default the labels are the levels themselves, i.e. no relabelling.
So your final code will be
dfr$response <- factor(x=dfr$response, levels=c(0,1), labels=c("No", "Yes"))
As a minor aside, people generally use df to represent a data frame, rather than dfr. It doesn't make any difference, but is just the commonly used notation.
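To see how levels and labels pair up, here is a minimal sketch using the vec3 values from the question. The names response and flipped are only for illustration. Labels are matched to levels by position, so writing the levels out makes the pairing explicit.

```r
vec3 <- c(0, 0, 1, 1, 0, 1)

# labels[i] is attached to levels[i]
response <- factor(vec3, levels = c(0, 1), labels = c("No", "Yes"))

# Listing the levels in the opposite order (with labels to match) keeps the
# same value-to-label mapping; only the order of the levels changes
flipped <- factor(vec3, levels = c(1, 0), labels = c("Yes", "No"))

levels(response)
levels(flipped)
identical(as.character(response), as.character(flipped))
```

If the labels were given as c("Yes", "No") without also reordering the levels, 0 would be labelled "Yes", which is the kind of mix-up that explicit levels help you spot.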
I have this df
df = data.frame(x = 1:3)
converted to a factor
df$x = factor(df$x)
the levels by default are
str(df)
now let's make level 2 as the reference level
df$x = relevel(df$x,ref=2)
Everything is fine up to this point, but when I try to make level 1 the reference level again, it doesn't work:
df$x = relevel(df$x,ref=2)
str(df)
df$x = relevel(df$x,ref=1)
str(df)
Appreciate the help.
From ?relevel,
ref: the reference level, typically a string.
I'll key off of "typically". Looking at the code of stats:::relevel.factor, one key part is
if (is.character(ref))
ref <- match(ref, lev)
This means to me that after this expression, ref is now (assumed to be) an integer that corresponds to the index within the levels. In that context, your ref=1 is saying to use the first level by its index (which is already first).
Try using a string.
relevel(df$x,ref=1)
# [1] 1 2 3
# Levels: 2 1 3
relevel(df$x,ref="1")
# [1] 1 2 3
# Levels: 1 2 3
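If the levels look like numbers, one way to avoid the ambiguity is to always pass ref as a string. The wrapper below is a hypothetical helper (relevel_by_name is not part of base R), shown only as a sketch of that idea.

```r
# Hypothetical helper: always treat `name` as a level label, never as an index
relevel_by_name <- function(f, name) {
  relevel(f, ref = as.character(name))
}

x <- relevel(factor(1:3), ref = 2)  # levels are now "2", "1", "3"

by_index <- relevel(x, ref = 1)     # index 1 is "2", so nothing changes
by_name  <- relevel_by_name(x, 1)   # looks up the level called "1"

levels(by_index)
levels(by_name)
```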
I'm trying to create a BN using bnlearn, but I keep getting an error:
Error in check.data(data, allowed.types = discrete.data.types) : variable Variable1 must have at least two levels.
It gives me that error for every one of my variables, even though they are all factors and have more than one level. As you can see, in this case my variable "model" has 4 levels.
As I can't share the variables and dataset, I've created a small data set and the code that goes with it. I get the same problem. I know I've only shared 2 variables, but I get the same error for all the variables.
library(tidyverse)
library (bnlearn)
library(openxlsx)
DataFull <- read.xlsx("(.....)/test.xlsx", sheet = 1, startRow = 1, colNames = TRUE)
set.seed(600)
DataFull <- as_tibble(DataFull)
DataFull$Variable1 <- as.factor(DataFull$Variable1)
DataFull$TargetVar <- as.factor(DataFull$TargetVar)
DataFull <- na.omit(DataFull)
DataFull <- droplevels(DataFull)
DataFull <- DataFull[sample(nrow(DataFull)),]
Data <- DataFull[1:as.integer(nrow(DataFull)*0.70)-1,]
Datatest <- DataFull[as.integer(nrow(DataFull)*0.70):nrow(DataFull),]
nrow(Data)+nrow(Datatest)==nrow(DataFull)
FocusVar <- as.character("TargetVar")
BN.naive <- naive.bayes(Data, FocusVar)
Using str(data), I can see that the variable has 2 or more levels already:
str(Data)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 27586 obs. of 2 variables:
$ Variable1: Factor w/ 3 levels "Small","Medium",..: 2 2 3 3 3 3 3 3 3 3 ...
$ TargetVar: Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 2 1 1 1 ...
Link to data set: https://drive.google.com/open?id=1VX2xkPdeHKdyYqEsD0FSm1BLu1UCtOj9eVIVfA_KJ3g
bnlearn expects a data.frame and doesn't work with tibbles, so keep your data as a data.frame by omitting the line DataFull <- as_tibble(DataFull).
Example
library(tibble)
library (bnlearn)
d <- as_tibble(learning.test)
hc(d)
Error in check.data(x) : variable A must have at least two levels.
In particular, it is the line from bnlearn:::check.data
if (nlevels(x[, col]) < 2)
stop("variable ", col, " must have at least two levels.")
In a standard data.frame, learning.test[,"A"] returns a vector, so nlevels(learning.test[,"A"]) works as expected. By design, however, you cannot extract vectors like this from tibbles: d[,"A"] is still a tbl_df and not a vector, hence nlevels(d[,"A"]) doesn't work as expected and returns zero.
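The same effect can be reproduced in base R without tibble. In the sketch below, drop = FALSE is used to mimic the way a tibble keeps a single-column selection as a data frame; the toy data and variable names are made up for illustration.

```r
# Toy data: one factor column with three levels
df <- data.frame(A = factor(c("a", "b", "c", "a")))

as_vector <- df[, "A"]                # plain data.frame: returns the factor
as_frame  <- df[, "A", drop = FALSE]  # tibble-like: still a data frame

nlevels(as_vector)  # counts the factor's levels
nlevels(as_frame)   # a data frame has no levels attribute, so this is 0
```

If the data already arrives as a tibble, converting it with as.data.frame() before passing it to bnlearn should restore the vector behaviour.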
I have data with fifty various categorical values in a column labelled "cat", and a second column with a continuous numerical value "amount". I only want to plot the subset of "cat" with an "amount" greater than 5. Why do I have the ghost-label on my x-axis for those intermediate rows that should be omitted based on my subset?
Example code:
cat<-c("a","b","c","d","e")
amount<-c(4,15,18,2,9)
df<-data.frame(cat=cat,amount=amount)
df1<-subset(df,amount >5)
library(plotly)
p <- plot_ly(df1, x = ~cat, y = ~amount)
p
df1 printed out:
cat amount
2 b 15
3 c 18
5 e 9
And the plot generated:
It is interesting that "a" doesn't appear on my x axis, but "d" does. I take it there is something going on with the row numbers, but why is this and how can I prevent this from happening?
Thank you in advance.
subset does not drop the unused levels of a factor, as shown below:
str(df1)
'data.frame': 3 obs. of 2 variables:
$ cat : Factor w/ 5 levels "a","b","c","d",..: 2 3 5
$ amount: num 15 18 9
So stringsAsFactors = FALSE will import cat as a character vector, which you can use directly or convert to a factor after subsetting.
df <- data.frame(cat=cat,amount=amount, stringsAsFactors = FALSE)
df1 <- subset(df,amount >5)
plot_ly(df1, x = ~cat, y = ~amount)
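If you would rather keep cat as a factor, another option is to drop the unused levels after subsetting with droplevels(). This is a minimal sketch on the example data; the column is built with factor() explicitly so that it behaves the same regardless of the stringsAsFactors default.

```r
df <- data.frame(
  cat = factor(c("a", "b", "c", "d", "e")),
  amount = c(4, 15, 18, 2, 9)
)

kept    <- subset(df, amount > 5)  # still carries all five levels
cleaned <- droplevels(kept)        # only the levels that actually occur

levels(kept$cat)
levels(cleaned$cat)
```

Passing cleaned to plot_ly in place of df1 should then leave no empty categories on the x axis.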
I have a data frame (called hp) that contains several columns with NAs. These columns are of class factor. I want to change them to character, fill the NAs with "none", and change them back to factor. I have 14 columns, so I'd like to do this in a loop, but it doesn't work.
Thanks for your help.
The columns:
miss_names<-c("Alley","MasVnrType","FireplaceQu","PoolQC","Fence","MiscFeature","GarageFinish", "GarageQual","GarageCond","BsmtQual","BsmtCond","BsmtExposure","BsmtFinType1",
"BsmtFinType2","Electrical")
The loop:
for (i in miss_names){
hp[i]<-as.character(hp[i])
hp[i][is.na(hp[i])]<-"NONE"
hp[i]<-as.factor(hp[i])
print(hp[i])
}
Error in sort.list(y) : 'x' must be atomic for 'sort.list'
Have you called 'sort' on a list?
Use addNA() to add NA as a factor level and then replace that level with whatever you want. You don't have to turn the factors into a character vector first. You can loop over all the factors in the data frame and replace them one by one.
# Sample data
dd <- data.frame(
x = sample(c(NA, letters[1:3]), 20, replace = TRUE),
y = sample(c(NA, LETTERS[1:3]), 20, replace = TRUE),
stringsAsFactors = TRUE # the columns must be factors; the default is FALSE since R 4.0
)
# Loop over the columns
for (i in seq_along(dd)) {
xx <- addNA(dd[, i])
levels(xx) <- c(levels(dd[, i]), "none")
dd[, i] <- xx
}
This gives us
> str(dd)
'data.frame': 20 obs. of 2 variables:
$ x: Factor w/ 4 levels "a","b","c","none": 1 4 1 4 4 1 4 3 3 3 ...
$ y: Factor w/ 4 levels "A","B","C","none": 1 1 2 2 1 3 3 3 4 1 ...
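As for why the original loop fails: hp[i] with single brackets returns a one-column data frame (a list), not the column itself, which is why as.factor() complains that 'x' must be atomic. A minimal sketch of the same loop using [[ ]] instead, on a small made-up hp with only two of the columns:

```r
# Made-up stand-in for hp, just to make the example runnable
hp <- data.frame(
  Alley = factor(c("Grvl", NA, "Pave")),
  Fence = factor(c(NA, "MnPrv", NA))
)
miss_names <- c("Alley", "Fence")

for (i in miss_names) {
  col <- as.character(hp[[i]])  # [[ ]] extracts the column as a vector
  col[is.na(col)] <- "NONE"     # replace the missing values
  hp[[i]] <- as.factor(col)     # convert back to a factor
}

str(hp)
```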
An alternative solution using the purrr library, with the same data as @Johan Larsson:
library(purrr)
set.seed(15)
dd <- data.frame(
x = sample(c(NA, letters[1:3]), 20, replace = TRUE),
y = sample(c(NA, LETTERS[1:3]), 20, replace = TRUE),
stringsAsFactors = TRUE) # the columns must be factors; the default is FALSE since R 4.0
# Create a function to convert NA to none
convert.to.none <- function(x){
y <- addNA(x)
levels(y) <- c(levels(x), "none")
x <- y
return(x) }
# use the map function to cycle through dd's columns
map_df(dd, convert.to.none)
Allows for scaling of your work.
I have an R data frame with factor columns.
dataframe <- read.csv("import.csv")
dataframe$col1 = as.factor(dataframe$col1)
dataframe$col2 = as.factor(dataframe$col2)
...
How can I generate a new row from labels?
newRow = dataframe[1,] #template
newRow[1] = ?
newRow[2] = ?
Let's say col1 includes "TestValue". I would like to set the value of newRow[1] to "TestValue" as if it had been selected from my dataframe. How can I do that?
I know I can get factor index like so:
match(c("TestValue"),levels(dataframe$col1))
[1] 3
But whenever I assign anything to newRow[1], I seem to change its type.
Thanks in advance.
You could assign a factor to newRow[1] and maintain the levels too.
In your case:
newRow[1] <- factor('TestValue', levels = levels(dataframe$col1))
As an example:
df <- data.frame(a = letters, b = letters, stringsAsFactors = TRUE) # factors are no longer the default since R 4.0
new <- df[1, ]
new[1] <- factor('b', levels = levels(df[[1]]))
Output:
> str(new)
'data.frame': 1 obs. of 2 variables:
$ a: Factor w/ 26 levels "a","b","c","d",..: 2
$ b: Factor w/ 26 levels "a","b","c","d",..: 1
Column a is still a factor with all the levels.
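Because the new row keeps the full set of levels, it can be appended to the original data without the column changing type. A minimal sketch, assuming a small made-up dataframe in place of the imported CSV:

```r
# Made-up stand-in for the data read from import.csv
dataframe <- data.frame(
  col1 = factor(c("A", "B", "TestValue")),
  col2 = factor(c("x", "y", "z"))
)

newRow <- dataframe[1, ]  # template row
newRow$col1 <- factor("TestValue", levels = levels(dataframe$col1))

combined <- rbind(dataframe, newRow)
str(combined)
```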