I have a survey dataset with 40 ordered factor variables. The variables are converted to character when the data is imported. I am thinking of using an apply function here; please correct me if I am wrong.
Below is my data manipulation:
### data
v1 <- as.character(c(1,4,2,4,3,1,3,4,5,2,2,3,6,5,4,6,5,4,5,6,6,2,4,3,4,5,6,1,6,3,5,6,3,2,4,5,3,2,4,5,3,2,4))
v2 <- as.character(c(3,4,1,4,5,1,3,1,5,6,4,3,4,5,6,3,3,5,4,3,3,5,6,3,4,3,4,6,3,1,1,3,4,5,6,1,3,6,4,3,1,6,5))
df <- data.frame(v1,v2)
### transform into ordered factor
df$v1.f <- as.factor(df$v1)
df$v1.f <- ordered(df$v1.f, levels = c("1", "2", "3", "4", "5", "6"))
The real levels are unsorted characters, which is why I included the step. I don't mind typing this for all variables, but it seems redundant.
My second issue is with the output. I would like to create a fancy report and know how to generate the numbers for it:
v1.freq <- table(df$v1.f)
v1.perc <- round(prop.table(v1.freq),2)*100
v1.med <- median(as.integer(df$v1.f))  # median() needs numeric input; the integer codes match the levels "1" to "6"
How can I print a single table that contains this information for multiple variables at once, especially when a level has no answers (see v2, where there is no response for level 2; table() simply skips that level)?
How do I turn the R output into a table that has the levels as headers and the frequencies and percentages as rows for multiple variables?
Copy/pasting the numbers into an Excel sheet seems, again, unnecessary and prone to errors.
First, you might want to check whether your data import function has a stringsAsFactors option.
Then, as I understand it, you want to transform all of your variables into ordered factors. You can wrap this in a dplyr pipeline, and use forcats to handle factors. Let's take your data:
library(tidyverse)
df %>%
  mutate(across(1:2, ~ factor(.))) %>%
  mutate(across(1:2, ~ ordered(.))) %>%
  str()
Output:
'data.frame': 43 obs. of 2 variables:
$ v1: Ord.factor w/ 6 levels "1"<"2"<"3"<"4"<..: 1 4 2 4 3 1 3 4 5 2 ...
$ v2: Ord.factor w/ 5 levels "1"<"3"<"4"<"5"<..: 2 3 1 3 4 1 2 1 4 5 ...
As you can see, the variables are transformed into ordered factors, with levels sorted alphabetically. To explain: mutate alters your variables, and across specifies which variables you want to change and how. Here, we mutate variables 1 to 2 and apply the functions factor and then ordered to them. If the alphabetical ordering isn't the one you want, you can still mutate each column by itself and supply the levels argument.
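If every variable shares the same scale, the levels argument can also be supplied once for all columns. The following is a minimal base R sketch, not the approach used above; the toy values and the name `lvls` are assumptions for illustration.

```r
# Toy data standing in for the survey columns (hypothetical values)
df <- data.frame(
  v1 = c("1", "4", "2", "6", "3", "5"),
  v2 = c("3", "4", "1", "5", "6", "1"),  # note: no "2" in v2
  stringsAsFactors = FALSE
)

# The full set of answer levels, in the desired order
lvls <- as.character(1:6)

# Convert every column to an ordered factor with the same explicit levels;
# levels that never occur (such as "2" in v2) are kept
df[] <- lapply(df, factor, levels = lvls, ordered = TRUE)

str(df)
```

Because the levels are given explicitly, each column ends up with all six levels regardless of which answers actually occur in it.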
For the second question: since v2 has no level "2", unlike v1, you cannot merge the two variables into one table unless you add the missing level to v2 as well. You can still use janitor::tabyl to get frequencies and proportions, creating one table per variable:
library(janitor)
df2 <- df %>%
  mutate(across(1:2, ~ factor(.))) %>%
  mutate(across(1:2, ~ ordered(.)))
map(df2, tabyl)
Output:
$v1
.x[[i]] n percent
1 3 0.06976744
2 7 0.16279070
3 8 0.18604651
4 10 0.23255814
5 8 0.18604651
6 7 0.16279070
$v2
.x[[i]] n percent
1 7 0.1627907
3 13 0.3023256
4 9 0.2093023
5 7 0.1627907
6 7 0.1627907
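Once all variables carry the same explicit levels, the per-variable tables line up and can be stacked into one report. This is a rough base R sketch under that assumption; the toy data and the names `freq`, `perc`, and `report` are hypothetical.

```r
# Toy data (hypothetical values); v2 has no "2" and no "6"
df <- data.frame(
  v1 = c("1", "4", "2", "4", "3", "6", "5"),
  v2 = c("3", "4", "1", "4", "5", "1", "3"),
  stringsAsFactors = FALSE
)
lvls <- as.character(1:6)
df[] <- lapply(df, factor, levels = lvls, ordered = TRUE)

# One row per variable, one column per level; unused levels show up as 0
freq <- t(sapply(df, table))

# Row-wise percentages
perc <- round(prop.table(freq, margin = 1) * 100)

# Stack counts and percentages into a single matrix
report <- rbind(freq, perc)
rownames(report) <- c(paste0(names(df), ".n"), paste0(names(df), ".pct"))
report
```

The resulting matrix can be passed to write.csv() or to a table-formatting function, which avoids copying numbers by hand.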
df = data.frame(table(train$department, train$is_outcome))
Here department and is_outcome are both factors (is_outcome is binary). The result looks like the table below, with only 2 variables (fields). I want the department column to be part of the data frame, i.e. a data frame of 3 variables:
                      0     1
Analytics          4840   512
Finance            2330   206
HR                 2282   136
Legal               986    53
Operations        10325  1023
Procurement        6450   688
R&D                 930    69
Sales & Marketing 15627  1213
Technology         6370   768
One way I learnt was...
df = data.frame(table(train$department , train$is_outcome))
write.csv(df,"df.csv")
rm(df)
df = read.csv("df.csv")
colnames(df) = c("department", "outcome_0","outcome_1")
but I cannot save a file every time in my program.
Is there any way to do it directly?
When you are trying to create tables from a matrix in R, you end up with trial.table. The object trial.table looks exactly the same as the matrix trial, but it really isn’t. The difference becomes clear when you transform these objects to a data frame. Take a look at the outcome of this code:
> trial.df <- as.data.frame(trial)
> str(trial.df)
'data.frame': 2 obs. of 2 variables:
$ sick : num 34 11
$ healthy: num 9 32
Here you get a data frame with two variables (sick and healthy) with two observations each. On the other hand, if you convert the table to a data frame, you get the following result:
> trial.table.df <- as.data.frame(trial.table)
> str(trial.table.df)
'data.frame': 4 obs. of 3 variables:
$ Var1: Factor w/ 2 levels "risk","no_risk": 1 2 1 2
$ Var2: Factor w/ 2 levels "sick","healthy": 1 1 2 2
$ Freq: num 34 11 9 32
The as.data.frame() function converts a table to a data frame in a format that you need for regression analysis on count data. If you need to summarize the counts first, you use table() to create the desired table.
Now you get a data frame with three variables. The first two — Var1 and Var2 — are factor variables for which the levels are the values of the rows and the columns of the table, respectively. The third variable — Freq — contains the frequencies for every combination of the levels in the first two variables.
In fact, you also can create tables in more than two dimensions by adding more variables as arguments, or by transforming a multidimensional array to a table using as.table(). You can access the numbers the same way you do for multidimensional arrays, and the as.data.frame() function creates as many factor variables as there are dimensions.
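The difference between the two conversions can be reproduced with a small sketch. The counts are the ones shown in the output above, but the way `trial` is constructed here is an assumption, as are the names `long`, `wide`, and `group`. The last step shows one possible way to get a wide data frame that carries the row labels as a regular column.

```r
# Rebuild the matrix and the table (construction is assumed)
trial <- matrix(c(34, 11, 9, 32), ncol = 2)
colnames(trial) <- c("sick", "healthy")
rownames(trial) <- c("risk", "no_risk")
trial.table <- as.table(trial)

# Long format: one row per cell, columns Var1, Var2, Freq
long <- as.data.frame(trial.table)

# Wide format: drop the table class so the matrix method is used,
# then move the row names into an ordinary column
wide <- as.data.frame(unclass(trial.table))
wide <- cbind(group = rownames(wide), wide)
rownames(wide) <- NULL

long
wide
```

The wide version has three variables (the label column plus one column per outcome) without writing anything to disk.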
Preface:
I have seen this post: "How to convert a factor to an integer/numeric without a loss of information?", but it does not really apply to the issue I am having. It addresses converting a factor vector to numeric, but the issue I am having is larger than that.
Problem:
I am trying to convert a column in a dataframe from a factor to a numeric, while representing the dataframe using paste0. Here is an example:
aa=1:10
bb=rnorm(10)
dd=data.frame(aa,bb)
get(paste0("d","d"))[,2]=as.factor(get(paste0("d","d"))[,2])
(The actual code I am using requires me to use the paste0 function)
I get the error: target of assignment expands to non-language object
I am not sure how to do this. I think the paste0 function is what is messing it up.
First, this is not really a natural way to think about things or to code things in R. It can be done, but if you rephrase your question to give the bigger picture, someone can probably provide more natural ways of doing this in R. (Like the named lists @joran mentioned in the comment.)
With that said, to do this in R, you need to split apart the three steps you're trying to do in one line: get the data frame with the specified variable, make the desired column a factor, and then assign back to the variable name. Here I've wrapped this in a function, so the assignment needs to be made in pos=1 instead of the default, which would name it only within the function.
tof <- function(dfname, colnum) {
d <- get(dfname)
d[, colnum] <- factor(d[, colnum])
assign(dfname, d, pos=1)
}
dd <- data.frame(aa=1:10, bb=rnorm(10))
str(dd)
## 'data.frame': 10 obs. of 2 variables:
## $ aa: int 1 2 3 4 5 6 7 8 9 10
## $ bb: num -1.4824 0.7904 0.0258 1.2075 0.2455 ...
tof("dd", 2)
str(dd)
## 'data.frame': 10 obs. of 2 variables:
## $ aa: int 1 2 3 4 5 6 7 8 9 10
## $ bb: Factor w/ 10 levels "-1.48237228248052",..: 1 8 4 9 5 10 2 7 3 6
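For comparison, the named-list approach mentioned above might look like the following minimal sketch. The list name `frames` is hypothetical; the idea is that a name built with paste0 can index a list directly, so neither get nor assign is needed.

```r
# Keep the data frames in a named list instead of separate global variables
frames <- list(dd = data.frame(aa = 1:10, bb = rnorm(10)))

# Build the name as a string, then use it to index the list
nm <- paste0("d", "d")
frames[[nm]][, 2] <- as.factor(frames[[nm]][, 2])

str(frames[[nm]])
```

Unlike get(...)[, 2] <- ..., indexing with [[nm]] is a valid assignment target, so the replacement happens in place.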
I have a column of type factor in my data whose summary looks as follows
$COL_256
0 1 <NA>
31557 0 0
As you can see, there are only three levels for this column and two of them have zero occurrences, which means it's basically just one factor level.
The trouble with this is that, when I do certain operations like regression, I get an error which says,
contrasts can be applied only to factors with 2 or more levels
How can I remove columns like this one, which have all their occurrences in just one of the factor levels?
EDIT : I tried droplevels(df) as suggested but now my column looks as follows and gives the same error.
$COL_256
0
31557
You could test each variable to see whether it is constant and drop it if so. E.g.:
dat <- data.frame(y=1:3,x=factor("a",levels=c("a","b")),x2=letters[c(1,2,1)])
# y = numeric, x=constant with 2 factor levels, x2=not constant with 2 factor levels
dat[sapply(dat,function(x) length(levels({if(is.factor(x)) droplevels(x) else x}))!=1 )]
# y x2
#1 1 a
#2 2 b
#3 3 a
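The test above only looks at factor levels. A variant that counts distinct observed values would also catch constant character or numeric columns. This is a sketch of that alternative, not the method above; the extra column `x3` and the names `keep` and `dat_clean` are made up for illustration.

```r
dat <- data.frame(
  y  = 1:3,
  x  = factor("a", levels = c("a", "b")),  # constant factor with an unused level
  x2 = letters[c(1, 2, 1)],                # not constant
  x3 = "k"                                 # constant character column
)

# Keep a column only if it has more than one distinct non-missing value
keep <- vapply(
  dat,
  function(col) length(unique(col[!is.na(col)])) > 1,
  logical(1)
)

dat_clean <- dat[, keep, drop = FALSE]
names(dat_clean)
```

Using drop = FALSE keeps the result a data frame even when only one column survives.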
All of my data comes in character format. When I try transforming a subset of the data into numeric using apply, it doesn't seem to work.
df2 <- as.data.frame(matrix(as.character(1:9),3,3))
df2[,-2] <- apply(df2[,-2], 2, as.numeric)
apply(df2, 2, class)
Could somebody point me out what I am doing wrong in the example above?
Thanks
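As commented above, a matrix in R can only hold values of a single type. You cannot change some of the values to numeric and leave others as characters. apply() coerces a data frame to a matrix before it does anything else, so apply(df2, 2, class) reports the same class for every column no matter what the columns actually contain. If you want different data types, you can use a data.frame, but even then you can only have one data type per column.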
For your example case:
df2 <- as.data.frame(matrix(as.character(1:9),3,3))
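will create a data.frame with factors in each column (in R versions before 4.0.0, where stringsAsFactors defaulted to TRUE; from 4.0.0 on the columns are character). If you want to convert the second column to numeric, you can do: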
df2$V2 <- as.numeric(levels(df2$V2))[df2$V2]
Or
df2$V2 <- as.numeric(as.character(df2$V2))
So you don't need to use apply in this case.
str(df2)
#'data.frame': 3 obs. of 3 variables:
# $ V1: Factor w/ 3 levels "1","2","3": 1 2 3
# $ V2: num 4 5 6
# $ V3: Factor w/ 3 levels "7","8","9": 1 2 3
If you wanted to convert all columns to numeric, you can do:
# if the columns were factors before:
df2[] <- lapply(df2, function(i) as.numeric(levels(i))[i])
Or
# if the columns were characters before:
df2[] <- lapply(df2, as.numeric)
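To convert only a subset of columns and then check the result, something along these lines should work. It is a sketch that assumes character columns (stringsAsFactors = FALSE is set explicitly so the behaviour does not depend on the R version); the name `cols` is hypothetical.

```r
df2 <- as.data.frame(matrix(as.character(1:9), 3, 3), stringsAsFactors = FALSE)

# Convert every column except the second
cols <- c("V1", "V3")
df2[cols] <- lapply(df2[cols], as.numeric)

# sapply works column by column on the data frame itself,
# so it reports the real class of each column
sapply(df2, class)
```

Here sapply(df2, class) is used for the check instead of apply(df2, 2, class), because apply would first turn the whole data frame into a character matrix.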