R, generating dataframe row from factor values

I have an R data frame with factor columns.
dataframe <- read.csv("import.csv")
dataframe$col1 = as.factor(dataframe$col1)
dataframe$col2 = as.factor(dataframe$col2)
...
How can I generate a new row from labels?
newRow = dataframe[1,] #template
newRow[1] = ?
newRow[2] = ?
Let's say col1 includes "TestValue". I would like to set the value of newRow[1] to "TestValue" as if it had been selected from my dataframe. How can I do that?
I know I can get factor index like so:
match(c("TestValue"),levels(dataframe$col1))
[1] 3
But whenever I assign anything to newRow[1], I seem to change its type.
Thanks in advance.

You could assign a factor to newRow[1] and maintain the levels too.
In your case:
newRow[1] <- factor('TestValue', levels = levels(dataframe$col1))
As an example:
df <- data.frame(a = letters, b = letters, stringsAsFactors = TRUE) # factors are no longer the default since R 4.0.0
new <- df[1, ]
new[1] <- factor('b', levels = levels(df[[1]]))
Output:
> str(new)
'data.frame': 1 obs. of 2 variables:
$ a: Factor w/ 26 levels "a","b","c","d",..: 2
$ b: Factor w/ 26 levels "a","b","c","d",..: 1
Column a is still a factor with all of its levels.
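As a minimal sketch of the same idea applied to several columns at once, the snippet below builds a new row from labels and appends it. The data frame and the column names col1 and col2 are made up for illustration; only factor(), levels() and rbind() from base R are used.

```r
# Hypothetical data standing in for the imported CSV
df <- data.frame(
  col1 = c("A", "B", "TestValue"),
  col2 = c("x", "y", "x"),
  stringsAsFactors = TRUE
)

# Use the first row as a template, then overwrite each factor column
# with a factor that carries the full set of levels of the original column
new_row <- df[1, ]
new_row$col1 <- factor("TestValue", levels = levels(df$col1))
new_row$col2 <- factor("y", levels = levels(df$col2))

# Appending keeps both columns as factors with unchanged levels
combined <- rbind(df, new_row)
str(combined)
```

Because the new values are created with the original levels, the row can be appended without turning the columns into character vectors.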

Related

How to change factor arguments? (R)

Here are three vectors.
vec1 <- 1:6
vec2 <- c('radio', 'newspaper', 'web-page', 'chat', 'tv', 'web-page')
vec3 <- c(0, 0, 1, 1, 0, 1)
The task is to form a data frame with the following structure using these vectors.
'data.frame': 6 obs. of 3 variables:
$ id : int 1 2 3 4 5 6
$ response: Factor w/ 2 levels "No","Yes": 1 1 2 2 1 2
$ medium : chr "radio" "newspaper" "web-page" "chat" ...
Here is my solution.
dfr <- data.frame(id = vec1, response = vec3, medium = vec2, stringsAsFactors = FALSE)
dfr$response <- factor(x = , levels = , labels = )
My question is: "What values should the arguments (x, levels, labels) have and why?"
Talking about this line:
dfr$response <- factor(x = , levels = , labels = )
We can pass labels for vec3 directly, because by default the levels are taken from the sorted unique values of vec3 (here 0 and 1).
df <- data.frame(id = vec1, response = factor(vec3, labels = c('No', 'Yes')),
medium = vec2, stringsAsFactors = FALSE)
str(df)
#'data.frame': 6 obs. of 3 variables:
#$ id : int 1 2 3 4 5 6
#$ response: Factor w/ 2 levels "No","Yes": 1 1 2 2 1 2
#$ medium : chr "radio" "newspaper" "web-page" "chat" ...
You can read ?factor for more details.
In this:
x is the vector of data that you want to turn into a factor, in this case the responses: x = dfr$response.
levels is a vector of the values that x might take. The default is the distinct values of x in ascending order (numeric or alphabetical), so here the default would be c(0, 1). You don't have to supply levels, since they are detected automatically, but when you add labels it is good practice to state them explicitly so the labels line up with the right values, especially if there are many levels and the order is easy to mix up.
labels is either a single string or a vector with one label per level; duplicated labels map several values to the same level. For your task you would use c("No", "Yes"). The default for labels is the levels themselves, i.e. no relabelling.
So your final code will be
dfr$response <- factor(x=dfr$response, levels=c(0,1), labels=c("No", "Yes"))
As a minor aside, people generally use df to represent a data frame, rather than dfr. It doesn't make any difference, but is just the commonly used notation.
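A short sketch of how the three arguments interact, using the vec3 from the question. The extra vectors only_ones and collapsed are made up for illustration, and duplicated labels need R 3.4.0 or later.

```r
vec3 <- c(0, 0, 1, 1, 0, 1)

# Explicit levels and labels: 0 becomes "No", 1 becomes "Yes"
response <- factor(vec3, levels = c(0, 1), labels = c("No", "Yes"))

# Explicit levels keep the mapping stable even when a value is absent;
# without levels = c(0, 1), two labels for one distinct value would be an error
only_ones <- factor(c(1, 1), levels = c(0, 1), labels = c("No", "Yes"))

# Duplicated labels collapse several values into the same level
collapsed <- factor(c(0, 1, 2), levels = c(0, 1, 2),
                    labels = c("No", "Yes", "Yes"))

str(response)
```

Stating levels explicitly is what makes the second call safe: the label "Yes" stays attached to the value 1 regardless of which values happen to occur in the data.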

BNlearn R error “variable Variable1 must have at least two levels.”

I'm trying to create a BN using bnlearn, but I keep getting an error:
Error in check.data(data, allowed.types = discrete.data.types) : variable Variable1 must have at least two levels.
It gives me that error for every one of my variables, even though they are all factors and have more than one level. For example, my variable "model" has 4 levels.
As I can't share the original variables and dataset, I've created a small data set and matching code that reproduce the problem. I've only shared 2 variables, but I get the same error for all of them.
library(tidyverse)
library (bnlearn)
library(openxlsx)
DataFull <- read.xlsx("(.....)/test.xlsx", sheet = 1, startRow = 1, colNames = TRUE)
set.seed(600)
DataFull <- as_tibble(DataFull)
DataFull$Variable1 <- as.factor(DataFull$Variable1)
DataFull$TargetVar <- as.factor(DataFull$TargetVar)
DataFull <- na.omit(DataFull)
DataFull <- droplevels(DataFull)
DataFull <- DataFull[sample(nrow(DataFull)),]
Data <- DataFull[1:(as.integer(nrow(DataFull)*0.70)-1),]
Datatest <- DataFull[as.integer(nrow(DataFull)*0.70):nrow(DataFull),]
nrow(Data)+nrow(Datatest)==nrow(DataFull)
FocusVar <- as.character("TargetVar")
BN.naive <- naive.bayes(Data, FocusVar)
Using str(data), I can see that the variable has 2 or more levels already:
str(Data)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 27586 obs. of 2 variables:
$ Variable1: Factor w/ 3 levels "Small","Medium",..: 2 2 3 3 3 3 3 3 3 3 ...
$ TargetVar: Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 2 1 1 1 ...
Link to data set: https://drive.google.com/open?id=1VX2xkPdeHKdyYqEsD0FSm1BLu1UCtOj9eVIVfA_KJ3g
bnlearn expects a data.frame and doesn't work with tibbles, so keep your data as a data.frame by omitting the line DataFull <- as_tibble(DataFull) (or convert it back with as.data.frame() before calling bnlearn).
Example
library(tibble)
library (bnlearn)
d <- as_tibble(learning.test)
hc(d)
Error in check.data(x) : variable A must have at least two levels.
In particular, it is the line from bnlearn:::check.data
if (nlevels(x[, col]) < 2)
stop("variable ", col, " must have at least two levels.")
In a standard data.frame, learning.test[, "A"] returns a vector, so nlevels(learning.test[, "A"]) works as expected. By design, however, you cannot extract vectors like this from tibbles: d[, "A"] is still a tbl_df, not a vector, so nlevels(d[, "A"]) returns zero.
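The difference can be reproduced in base R alone. The sketch below uses drop = FALSE on an ordinary data frame to mimic what tibble's single-bracket indexing does; the data frame df and its column A are made up for illustration and are not the learning.test data.

```r
# A small stand-in data frame with one factor column
df <- data.frame(A = factor(c("a", "b", "c", "a")))

# Default data.frame behaviour: a single column is dropped to a vector
as_vector <- df[, "A"]

# drop = FALSE keeps a one-column data frame, which is what a tibble returns
as_frame <- df[, "A", drop = FALSE]

nlevels(as_vector)        # 3: a factor with three levels
nlevels(as_frame)         # 0: a data frame has no levels attribute
nlevels(as_frame[["A"]])  # 3: [[ ]] always extracts the column itself
```

Code that has to work with both classes can use [[ ]] to extract a column, since it returns the underlying vector for data frames and tibbles alike.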

Plotly Histogram Retains Ghost Categorical X Values

I have data with fifty various categorical values in a column labelled "cat", and a second column with a continuous numerical value "amount". I only want to plot the subset of "cat" with an "amount" greater than 5. Why do I have the ghost-label on my x-axis for those intermediate rows that should be omitted based on my subset?
Example code:
cat<-c("a","b","c","d","e")
amount<-c(4,15,18,2,9)
df<-data.frame(cat=cat,amount=amount)
df1<-subset(df,amount >5)
library(plotly)
p <- plot_ly(df1, x = ~cat, y = ~amount)
p
df1 printed out:
cat amount
2 b 15
3 c 18
5 e 9
And the plot generated:
It is interesting that "a" doesn't appear on my x axis, but "d" does. I take it there is something going on with the row numbers, but why is this and how can I prevent this from happening?
Thank you in advance.
subset does not drop the unused levels of a factor, as shown below:
str(df1)
'data.frame': 3 obs. of 2 variables:
$ cat : Factor w/ 5 levels "a","b","c","d",..: 2 3 5
$ amount: num 15 18 9
So stringsAsFactors = FALSE will create cat as a character vector, which you can convert to a factor after subsetting or use directly.
df <- data.frame(cat=cat,amount=amount, stringsAsFactors = FALSE)
df1 <- subset(df,amount >5)
plot_ly(df1, x = ~cat, y = ~amount)
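If the column has to stay a factor, another option is to drop the unused levels after subsetting. This is a base R sketch using droplevels(), not part of the answer above. stringsAsFactors = TRUE is set explicitly so that cat is a factor on any R version, and the plotting call is left out to keep the snippet self-contained.

```r
cat <- c("a", "b", "c", "d", "e")
amount <- c(4, 15, 18, 2, 9)
df <- data.frame(cat = cat, amount = amount, stringsAsFactors = TRUE)

# subset() keeps all five levels even though only three rows remain
df1 <- subset(df, amount > 5)
levels(df1$cat)

# droplevels() removes the levels that no longer occur in the data
df2 <- droplevels(df1)
levels(df2$cat)
```

Passing df2 instead of df1 to plot_ly should then leave only the categories that are actually present on the x axis.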

R check character values for numeric and change var datatype automatically

I have many dataframes where all the data is character. I can guess that a var containing only numbers should be changed to a numeric data type. I have hundreds of columns, though, so I don't want to type out each one in order to change it.
Is there another way to automate this process and to scan a column of data check if the character has a numeric value and change it into a numeric type from character type?
employee <- c('John Doe','Peter Gynn','Jolie Hope')
salary <- c("21000", "23400", "26800")
gender <- c("M", "M", "F")
rank <- c("5", "109", "2")
df <- data.frame(employee, salary, gender, rank)
I don't want to have to do this for each column/var
df$rank <- as.numeric(df$rank)
I would like to do something like this
i <- sapply(df, is.vector.of.columns.contaning.numeric.values)
df[i] <- lapply(df[i], as.numeric)
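We can write a function that encodes the "looks numeric" condition. It tries as.numeric on every value and checks whether any result is NA; if so, at least one value cannot be coerced to a number and the column is kept as is. The as.character step matters for factor columns, because as.numeric applied directly to a factor returns the internal level codes rather than the values.
smartConvert <- function(x) {
  # Convert via character so that factors yield their labels, not their codes
  num <- suppressWarnings(as.numeric(as.character(x)))
  if (any(is.na(num))) x else num
}
df[] <- lapply(df, smartConvert)
str(df)
# (with stringsAsFactors = TRUE, the default before R 4.0.0; otherwise employee and gender are chr)
# 'data.frame': 3 obs. of 4 variables:
# $ employee: Factor w/ 3 levels "John Doe","Jolie Hope",..: 1 3 2
# $ salary : num 21000 23400 26800
# $ gender : Factor w/ 2 levels "F","M": 2 2 1
# $ rank : num 5 109 2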
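As an alternative sketch, base R's utils::type.convert() can do the guessing for a whole data frame. This is a different technique from the hand-written function, shown here under the assumption that the columns are character vectors (hence stringsAsFactors = FALSE) and that R is recent enough for type.convert() to have a data frame method.

```r
employee <- c("John Doe", "Peter Gynn", "Jolie Hope")
salary <- c("21000", "23400", "26800")
gender <- c("M", "M", "F")
rank <- c("5", "109", "2")
df <- data.frame(employee, salary, gender, rank, stringsAsFactors = FALSE)

# as.is = TRUE leaves non-numeric columns as character instead of factor
converted <- type.convert(df, as.is = TRUE)
str(converted)
```

Note that type.convert() picks the narrowest type that fits, so whole numbers such as the salaries become integer rather than double.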

How can I merge an empty data frame and a data frame in R

I'm trying to merge two data frames like this:
data1 <- data.frame(hola = as.numeric(), toma = as.character())
data2 <- data.frame(hola = as.numeric(1), toma = as.character("cadenita"))
data1
data2
merge(data1, data2)
But it just doesn't work, when I explore each I get:
> str(data1)
'data.frame': 0 obs. of 2 variables:
$ hola: num
$ toma: Factor w/ 0 levels:
> str(data2)
'data.frame': 1 obs. of 2 variables:
$ hola: num 1
$ toma: Factor w/ 1 level "cadenita": 1
I suspect it has something to do with the character column (toma), but I don't understand what's happening. Can anyone give me a hand?
I'm answering a long time later, but I bumped into the same question, so the answer may be useful to others.
What you are looking for is appending the tables, not merging them. If you are using data.table, then rbindlist(list(data1, data2)) will do the job. data1 may even be completely empty/undefined, i.e. data1 <- data.frame()
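For readers not using data.table, the same appending can be sketched with base R's rbind(). merge() performs a join on the shared columns, so an input with no rows has nothing to match against, whereas rbind() simply stacks rows. stringsAsFactors = FALSE is an assumption added here to keep toma a character column on any R version.

```r
data1 <- data.frame(hola = numeric(), toma = character(),
                    stringsAsFactors = FALSE)
data2 <- data.frame(hola = 1, toma = "cadenita",
                    stringsAsFactors = FALSE)

# Stack the rows of both data frames; the zero-row one contributes nothing
combined <- rbind(data1, data2)
str(combined)

# A data frame with no columns at all is also accepted
combined_from_empty <- rbind(data.frame(), data2)
```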
