I have a survey dataset with 40 ordered factor variables. The variables are transformed into characters when the data is imported. Please correct me if I am wrong, but I am thinking of using the apply function here.
Below my data manipulation:
### data
v1 <- as.character(c(1,4,2,4,3,1,3,4,5,2,2,3,6,5,4,6,5,4,5,6,6,2,4,3,4,5,6,1,6,3,5,6,3,2,4,5,3,2,4,5,3,2,4))
v2 <- as.character(c(3,4,1,4,5,1,3,1,5,6,4,3,4,5,6,3,3,5,4,3,3,5,6,3,4,3,4,6,3,1,1,3,4,5,6,1,3,6,4,3,1,6,5))
df <- data.frame(v1,v2)
### transform into ordered factor
df$v1.f <- as.factor(df$v1)
df$v1.f <- ordered(df$v1.f, levels = c("1", "2", "3", "4", "5", "6"))
The real levels are unsorted characters, which is why I included the step. I don't mind typing this for all variables, but it seems redundant.
My second issue is with the output. I would like to create a fancy report and know how to generate the numbers for it:
v1.freq <- table(df$v1.f)
v1.perc <- round(prop.table(v1.freq),2)*100
v1.med <- median(df$v1)
How can I print a single table that contains all of this information for multiple variables at once, especially when a level has no answers (see v2, where there is no response for level 2; table() simply skips over the level)?
How do I turn the R output into a table that has the levels as headers and the frequencies and percentages as rows for multiple variables?
Copy/pasting the numbers into an Excel Sheet seems - again - unnecessary and prone to errors.
First, you might want to check whether your data import function has a stringsAsFactors option.
Then, as I understand it, you want to transform all of your variables into ordered factors. You can wrap this in a dplyr pipeline, and use forcats to handle factors. Let's take your data:
library(tidyverse)
df %>%
  mutate(across(1:2, ~factor(.))) %>%
  mutate(across(1:2, ~ordered(.))) %>%
  str()
Output:
'data.frame': 43 obs. of 2 variables:
$ v1: Ord.factor w/ 6 levels "1"<"2"<"3"<"4"<..: 1 4 2 4 3 1 3 4 5 2 ...
$ v2: Ord.factor w/ 5 levels "1"<"3"<"4"<"5"<..: 2 3 1 3 4 1 2 1 4 5 ...
As you can see, the variables are converted to ordered factors, with the levels sorted alphabetically. To explain: mutate modifies your variables, and across specifies which variables to change and how. Here, we mutate variables 1 to 2 and apply the functions factor and then ordered to them. If the alphabetical ordering isn't the one you want, you can still mutate a column by itself and supply the levels argument.
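If the real levels are unsorted character values, one possible base-R sketch is to pass the full set of levels explicitly to every column at once. The shortened data and the `lev` vector below are assumptions for illustration; replace `lev` with the actual level order of your survey.

```r
# Toy data standing in for the imported survey
# (assumption: all columns share the same answer scale)
df <- data.frame(
  v1 = c("1", "4", "2", "4", "3", "6", "5"),
  v2 = c("3", "4", "1", "4", "5", "1", "6")
)

# The intended order of the levels, written out once
lev <- c("1", "2", "3", "4", "5", "6")

# Convert every column to an ordered factor with the same levels
df[] <- lapply(df, factor, levels = lev, ordered = TRUE)

str(df)
```

Because the levels are given explicitly, a level that nobody chose (here "2" in v2) is still part of the factor.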
For the second question: since V2, unlike V1, has no level "2", you cannot merge the two variables unless you add a level "2" to V2 with NA. You can still use janitor::tabyl to get the frequencies and create one table per variable:
library(janitor)
df2 <- df %>%
  mutate(across(1:2, ~factor(.))) %>%
  mutate(across(1:2, ~ordered(.)))
map(df2, tabyl)
Output:
$v1
.x[[i]] n percent
1 3 0.06976744
2 7 0.16279070
3 8 0.18604651
4 10 0.23255814
5 8 0.18604651
6 7 0.16279070
$v2
.x[[i]] n percent
1 7 0.1627907
3 13 0.3023256
4 9 0.2093023
5 7 0.1627907
6 7 0.1627907
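When every variable is given the same set of levels, table() reports an empty level as 0, so the per-variable counts can be stacked into one table with the levels as headers. A minimal base-R sketch, on made-up data and with an assumed common level set `lev`:

```r
# Toy data; v2 has no answer "2"
df <- data.frame(
  v1 = c("1", "4", "2", "4", "3", "6", "5"),
  v2 = c("3", "4", "1", "4", "5", "1", "6")
)
lev <- as.character(1:6)  # assumption: all variables share these levels

# Same levels for every column, so empty levels are kept
df[] <- lapply(df, factor, levels = lev, ordered = TRUE)

# One row per variable, one column per level
freq <- t(sapply(df, table))

# Row-wise percentages, rounded to whole numbers
perc <- round(prop.table(freq, margin = 1) * 100)

freq
perc
```

Both results are ordinary matrices, so they could be exported with write.csv() instead of being copied by hand.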
I have the following dataframe
OCC1990 Skilllevel
3 1
8 2
12 2
14 3
15 1
As illustrated above it contains a long list of occupations assigned to a specific skill level.
My actual dataframe is a household survey with millions of rows, including a column which is also named OCC1990.
My goal is to implement my assigned skill levels from the above-listed data frame into the household survey.
In the past I have used the following code for smaller dataframes, which is a pretty manual approach:
cps_data[cps_data$OCC1990 %in% 3,"skilllevel"] <- 1
cps_data[cps_data$OCC1990 %in% 4:7,"skilllevel"] <- 1
cps_data[cps_data$OCC1990 %in% 8,"skilllevel"] <- 2
But I don't want to spend hours copying and pasting, and doing so increases the probability of making mistakes, so I'm looking for a different, more direct way.
I've already tried to merge both dataframes, but this results in an error related to the size of the vector.
Is there a way other than merging the two dataframes to assign the skill level to the occupations in the survey?
Many thanks in advance
Xx freddy
Use data.table for a large dataset.
Create two vectors, levels and labels: levels contains the unique values of OCC1990 and labels contains the new skill levels you want to apply.
Now use levels and labels inside the factor function to modify the skill level. (I used Skilllevel = 3 for OCC1990 = 8.)
library(data.table)
setDT(df)
levels <- c(3:7,8) # unique values of OCC1990
labels <- c(rep(1,5), 3) # new Skill levels corresponding to OCC1990
setkey(df, OCC1990) # sort OCC1990 for speed before filtering
df[ OCC1990 %in% levels, Skilllevel := as.integer(as.character(factor(OCC1990, levels = levels, labels = labels)))]
head(df)
# OCC1990 Skilllevel
#1: 3 1
#2: 8 3
#3: 12 2
#4: 14 3
#5: 15 1
If you still run into memory limits, read the data in chunks (for example with fread), apply the operation above to each chunk, and append the result to a new file.
Data:
df <- read.table(text='OCC1990 Skilllevel
3 1
8 2
12 2
14 3
15 1 ', header=TRUE)
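Since the skill levels already exist as a lookup table, another option that avoids a full merge is to index the lookup with match(). This is only a sketch: the small `cps_data` below is invented to stand in for the household survey, and `lookup` repeats the table from the question.

```r
# Lookup table: one skill level per occupation code
lookup <- data.frame(
  OCC1990    = c(3L, 8L, 12L, 14L, 15L),
  Skilllevel = c(1L, 2L, 2L, 3L, 1L)
)

# Hypothetical stand-in for the large survey data
cps_data <- data.frame(OCC1990 = c(8L, 3L, 15L, 99L, 12L))

# For each survey row, find the matching lookup row and take its skill level;
# occupations missing from the lookup get NA
cps_data$skilllevel <- lookup$Skilllevel[match(cps_data$OCC1990, lookup$OCC1990)]

cps_data
```

match() returns one position per survey row, so the result always has the same number of rows as the survey.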
df = data.frame(table(train$department , train$is_outcome))
Here department and is_outcome are both factors.
is_outcome is binary, and df looks like the table below.
It contains only 2 variables (fields), whereas I want the department column to be part of the data frame, i.e. a data frame with 3 variables:
                      0     1
Analytics          4840   512
Finance            2330   206
HR                 2282   136
Legal               986    53
Operations        10325  1023
Procurement        6450   688
R&D                 930    69
Sales & Marketing 15627  1213
Technology         6370   768
One way I learnt was...
df = data.frame(table(train$department , train$is_outcome))
write.csv(df,"df.csv")
rm(df)
df = read.csv("df.csv")
colnames(df) = c("department", "outcome_0","outcome_1")
but I cannot save a file every time in my program.
Is there any way to do it directly?
When you are trying to create tables from a matrix in R, you end up with trial.table. The object trial.table looks exactly the same as the matrix trial, but it really isn’t. The difference becomes clear when you transform these objects to a data frame. Take a look at the outcome of this code:
> trial.df <- as.data.frame(trial)
> str(trial.df)
'data.frame': 2 obs. of 2 variables:
$ sick : num 34 11
$ healthy: num 9 32
Here you get a data frame with two variables (sick and healthy) with each two observations. On the other hand, if you convert the table to a data frame, you get the following result:
> trial.table.df <- as.data.frame(trial.table)
> str(trial.table.df)
'data.frame': 4 obs. of 3 variables:
$ Var1: Factor w/ 2 levels "risk","no_risk": 1 2 1 2
$ Var2: Factor w/ 2 levels "sick","healthy": 1 1 2 2
$ Freq: num 34 11 9 32
$ Freq: num 34 11 9 32
The as.data.frame() function converts a table to a data frame in a format that you need for regression analysis on count data. If you need to summarize the counts first, you use table() to create the desired table.
Now you get a data frame with three variables. The first two — Var1 and Var2 — are factor variables for which the levels are the values of the rows and the columns of the table, respectively. The third variable — Freq — contains the frequencies for every combination of the levels in the first two variables.
In fact, you also can create tables in more than two dimensions by adding more variables as arguments, or by transforming a multidimensional array to a table using as.table(). You can access the numbers the same way you do for multidimensional arrays, and the as.data.frame() function creates as many factor variables as there are dimensions.
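For the department question above, a wide data frame with one row per department can be built directly from the two-way table, without writing a file. A minimal sketch, assuming a small made-up `train` data frame; as.data.frame.matrix() keeps the table's row/column layout instead of the long Var1/Var2/Freq format.

```r
# Hypothetical stand-in for the real training data
train <- data.frame(
  department = c("HR", "HR", "Legal", "Legal", "Legal"),
  is_outcome = c(0, 1, 0, 0, 1)
)

tab <- table(train$department, train$is_outcome)

# Keep the wide layout: one row per department, one column per outcome
counts <- as.data.frame.matrix(tab)

# Move the row names into a proper column and rename the count columns
df <- data.frame(
  department = rownames(counts),
  outcome_0  = counts[["0"]],
  outcome_1  = counts[["1"]]
)

df
```

If the long format is acceptable, as.data.frame(tab) already contains the department as its first column.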
Preface:
I have seen this post: "How to convert a factor to an integer/numeric without a loss of information?", but it does not really apply to the issue I am having. It addresses converting a factor vector to numeric, but my issue is broader than that.
Problem:
I am trying to convert a column in a dataframe from a factor to a numeric, while representing the dataframe using paste0. Here is an example:
aa=1:10
bb=rnorm(10)
dd=data.frame(aa,bb)
get(paste0("d","d"))[,2]=as.factor(get(paste0("d","d"))[,2])
(The actual code I am using requires me to use the paste0 function)
I get the error: target of assignment expands to non-language object
I am not sure how to do this, I think what is messing it up is the paste0 function.
First, this is not really a natural way to think about things or to code things in R. It can be done, but if you rephrase your question to give the bigger picture, someone can probably provide more natural ways of doing this in R. (Like the named lists @joran mentioned in the comment.)
With that said, to do this in R, you need to split apart the three steps you're trying to do in one line: get the data frame with the specified variable, make the desired column a factor, and then assign back to the variable name. Here I've wrapped this in a function, so the assignment needs to be made in pos=1 instead of the default, which would name it only within the function.
tof <- function(dfname, colnum) {
  d <- get(dfname)
  d[, colnum] <- factor(d[, colnum])
  assign(dfname, d, pos=1)
}
dd <- data.frame(aa=1:10, bb=rnorm(10))
str(dd)
## 'data.frame': 10 obs. of 2 variables:
## $ aa: int 1 2 3 4 5 6 7 8 9 10
## $ bb: num -1.4824 0.7904 0.0258 1.2075 0.2455 ...
tof("dd", 2)
str(dd)
## 'data.frame': 10 obs. of 2 variables:
## $ aa: int 1 2 3 4 5 6 7 8 9 10
## $ bb: Factor w/ 10 levels "-1.48237228248052",..: 1 8 4 9 5 10 2 7 3 6
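The named-list approach mentioned above could look roughly like this. It is a sketch under the assumption that the data frames can live in one list (here called `frames`, a made-up name) rather than as separate variables; the name built with paste0 is then just a list index, so no get() or assign() is needed.

```r
# Keep the data frames in a named list instead of separate variables
frames <- list(
  dd = data.frame(aa = 1:10, bb = rnorm(10))
)

# Build the name as before and use it to index the list
name <- paste0("d", "d")

# Assignment through [[name]] works because it is ordinary indexing
frames[[name]][, 2] <- as.factor(frames[[name]][, 2])

str(frames[[name]])
```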
I have a data frame where each column is of type factor and has over 3000 levels.
Is there a way to replace each level with a numeric value?
Consider the inbuilt data frame InsectSprays
> str(InsectSprays)
'data.frame': 72 obs. of 2 variables:
$ count: num 10 7 20 14 14 12 10 23 17 20 ...
$ spray: Factor w/ 6 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
The replacement should be as follows:
A=1,B=2,C=3,D=4,E=5,F=6.
If there are 3000 levels:
"USA"=1, "UK"=2, ..., "France"=3000.
The solution should automatically detect the levels (e.g. 3000), then replace each level with a number from 1 to 3000.
For the InsectSprays example, you can use:
levels(InsectSprays$spray) <- 1:6
Should generalize to your problem.
Factor variables already have underlying numeric values corresponding to each factor level. You can see this as follows:
as.numeric(InsectSprays$spray)
or
x = factor(c("A","D","B","G"))
as.numeric(x)
If you want to add specific numeric values corresponding to each level, you can, for example, merge in those values from a lookup table:
# Create a lookup table with the numeric values you want to correspond to each level of spray
lookup = data.frame(spray=levels(InsectSprays$spray), sprayNumeric=c(5,4,1,2,3,6))
# Merge lookup values into your data frame
InsectSprays = merge(InsectSprays, lookup, by="spray")
Based on this tutorial (https://statisticsglobe.com/how-to-convert-a-factor-to-numeric-in-r/), I have used the following code to convert factor levels into specific numbers:
levels(InsectSprays$spray) # to check the order levels are stored
levels(InsectSprays$spray) <- c(0, 1, 2, 3, 4, 5) # assign the number I want to each level
InsectSprays$spray <- as.numeric(as.character(InsectSprays$spray)) # to change from factor to numeric
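To number the levels automatically, without typing them out, the integer codes of the factor can be used directly. A small sketch on a copy of InsectSprays (the names `dat` and `key` are chosen here for illustration); the codes follow the order returned by levels().

```r
dat <- InsectSprays  # work on a copy of the built-in data

# Record which number belongs to which level before converting
key <- setNames(seq_along(levels(dat$spray)), levels(dat$spray))

# Replace every factor column by its integer codes (1 .. number of levels)
dat[] <- lapply(dat, function(x) if (is.factor(x)) as.integer(x) else x)

key
str(dat)
```

The same lapply() call works unchanged for a data frame whose factor columns have thousands of levels.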
I have a column of type factor in my data whose summary looks as follows
$COL_256
0 1 <NA>
31557 0 0
As you can see, there are only three levels for this column and two of them have zero occurrences, which means it's basically just one factor level.
The trouble with this is that, when I do certain operations like regression, I get an error which says,
contrasts can be applied only to factors with 2 or more levels
How can I remove columns like this one, which have all of their occurrences in just one factor level?
EDIT : I tried droplevels(df) as suggested but now my column looks as follows and gives the same error.
$COL_256
0
31557
You could test the status of each variable and drop it if it is constant. E.g.:
dat <- data.frame(y=1:3,x=factor("a",levels=c("a","b")),x2=letters[c(1,2,1)])
# y = numeric, x=constant with 2 factor levels, x2=not constant with 2 factor levels
dat[sapply(dat,function(x) length(levels({if(is.factor(x)) droplevels(x) else x}))!=1 )]
# y x2
#1 1 a
#2 2 b
#3 3 a
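A variant of the same idea counts the distinct observed values in each column instead of the factor levels. Note that this is a slightly different rule, shown only as a sketch: it would also drop constant numeric or character columns, which may or may not be what you want.

```r
dat <- data.frame(
  y  = 1:3,
  x  = factor("a", levels = c("a", "b")),  # constant, with an unused level
  x2 = letters[c(1, 2, 1)]
)

# TRUE for columns with more than one distinct non-missing value
keep <- vapply(dat, function(col) length(unique(col[!is.na(col)])) > 1, logical(1))

dat_clean <- dat[, keep, drop = FALSE]
names(dat_clean)
```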