If I have a file with many columns, and the data are all numbers, how can I know whether a specific column is categorical or quantitative data? Is there an area of study for this kind of problem? If not, what are some heuristics that can be used to decide?
Some heuristics that I can think of:
Likely to be categorical data
make a summary of the unique values; if the count is < some_threshold, there is a higher chance that the column is categorical
if the data are highly concentrated (low std.)
if the unique values are highly sequential and start from 1
if all the values in the column have a fixed length (may be an ID/date)
if it has a very small p-value in a test against Benford's Law
if it has a very small p-value in a Chi-square test against the result column
Likely to be quantitative data
if the column has floating-point numbers
if the column has sparse values
if the column has negative values
Other
Maybe quantitative columns are more likely to be next to other quantitative columns (and vice versa)
I am using R, but the question doesn't need to be R specific.
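A few of the heuristics above can be sketched in R as follows. This is only a rough illustration: the function name column.hints, the max.unique threshold and the example data frame are assumptions made up for this sketch, and the final guess is a crude rule of thumb rather than a reliable classifier.

```r
# Hypothetical helper: collect simple per-column hints for an all-numeric data frame.
column.hints <- function(df, max.unique = 10) {
  out <- data.frame(
    column = names(df),
    # number of distinct non-missing values
    n.unique = sapply(df, function(x) length(unique(x[!is.na(x)]))),
    # any value with a fractional part?
    has.fraction = sapply(df, function(x) any(x %% 1 != 0, na.rm = TRUE)),
    # any negative value?
    has.negative = sapply(df, function(x) any(x < 0, na.rm = TRUE)),
    # are the distinct values exactly 1, 2, ..., k?
    sequential.from.1 = sapply(df, function(x) {
      u <- sort(unique(x[!is.na(x)]))
      length(u) > 0 && identical(as.numeric(u), as.numeric(seq_along(u)))
    }),
    row.names = NULL
  )
  # Crude rule of thumb (an assumption, not a validated method)
  out$guess <- ifelse(out$n.unique <= max.unique & !out$has.fraction & !out$has.negative,
                      "categorical", "quantitative")
  out
}

# Made-up example data: a coded group column and a measured weight column
set.seed(1)
example <- data.frame(group = rep(1:3, times = 10),
                      weight = rnorm(30, mean = 70, sd = 10))
hints <- column.hints(example)
print(hints)
```

The threshold would need tuning for each data set, and columns such as years or ages can easily be misclassified.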
This assumes someone coded the data correctly.
Perhaps you are suggesting the data were not coded or labeled correctly, that it was all entered as numeric and some of it really is categorical. In that case, I do not know how one could tell with any certainty. Categorical data can have decimal places and can be negative.
The question I would ask myself in such a situation is what difference does it make how I treat the data?
If you are interested in the second scenario perhaps you should ask your question on Stack Exchange.
my.data <- read.table(text = '
aa bb cc dd
10 100 1000 1
20 200 2000 2
30 300 3000 3
40 400 4000 4
50 500 5000 5
60 600 6000 6
', header = TRUE, colClasses = c('numeric', 'character', 'numeric', 'character'))
my.data
# one way
str(my.data)
'data.frame': 6 obs. of 4 variables:
$ aa: num 10 20 30 40 50 60
$ bb: chr "100" "200" "300" "400" ...
$ cc: num 1000 2000 3000 4000 5000 6000
$ dd: chr "1" "2" "3" "4" ...
Here is a way to record the information:
my.class <- rep('empty', ncol(my.data))
for(i in 1:ncol(my.data)) {
my.class[i] <- class(my.data[,i])
}
> my.class
[1] "numeric" "character" "numeric" "character"
EDIT
Here is a way to record class for each column without using a for-loop:
my.class <- sapply(my.data, class)
I have a dataset that looks like the following, recording species richness of spiders in different habitats of a garden:
'data.frame': 6 obs. of 5 variables:
$ ID : int 1 2 3 4 5 6
$ species_count: num 10 13 15 17 22 9
$ habitat_type : Factor w/ 2 levels "wall","tree": 1 2 1 2 1 2
$ wall_height : num 153 NA 160 NA 170 NA
$ tree_diameter: num NA 48 NA 52 NA 71
I want to create a lm with species_count as the dependent variable and habitat_type, wall_height and tree_diameter as the independent variables, however the NA's are tricky.
lm.1 <- lm(species_count ~ habitat_type + wall_height + tree_diameter,
data = DF, na.action = na.exclude)
throws up the following error:
Error in contrasts<-(tmp, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
as na.exclude and na.omit delete the entire rows.
Using:
DF$wall_height <- na.exclude(DF$wall_height)
and
DF$tree_diameter <- na.exclude(DF$tree_diameter)
just repeats the values, giving tree_diameter values to wall and vice versa, like so:
DF[1,]
ID species_count habitat_type wall_height tree_diameter
1 1 10 wall 153 48
Is there a way to omit NA values only whilst retaining the rest of the information within the row, or will I have to use separate linear models?
Thanks in advance for any help and hope that I've been clear enough in explaining the issue.
The fundamental problem is that wall_height doesn't apply to the tree obs and vice versa.
So there is nothing to be gained by trying to analyze the data from wall and tree habitats together. In principle, you can compare the two habitats, and then evaluate how habitat-specific characteristics are associated with species numbers within a habitat.
In practice, you face a problem of very few observations. Usually you want about 10 cases per predictor that you are using in your model. You might be able to do an adequate comparison of the 2 habitats, but any results within a habitat, with only 3 observations each, will be highly suspect.
A couple of other thoughts. First, count data are often better analyzed with a different type of model, a Poisson generalized linear model. Second, the numbers of species are presumably represented by different numbers of individuals of each. There is probably some information to be gleaned from that, which should be explained in the ecology literature on species diversity.
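As a rough sketch of the suggestions above, one could compare the two habitats with a Poisson GLM and then fit a separate model within each habitat. The data frame DF is rebuilt here from the six rows shown in the question; the object names fit.habitat, fit.wall and fit.tree are illustrative. With only three observations per habitat, the within-habitat fits only show the mechanics and should not be interpreted.

```r
# Data rebuilt from the str() output in the question
DF <- data.frame(
  ID = 1:6,
  species_count = c(10, 13, 15, 17, 22, 9),
  habitat_type = factor(c("wall", "tree", "wall", "tree", "wall", "tree"),
                        levels = c("wall", "tree")),
  wall_height = c(153, NA, 160, NA, 170, NA),
  tree_diameter = c(NA, 48, NA, 52, NA, 71)
)

# Step 1: compare habitats using all rows (no habitat-specific predictors)
fit.habitat <- glm(species_count ~ habitat_type, family = poisson, data = DF)

# Step 2: habitat-specific predictors, fitted only where they apply
fit.wall <- glm(species_count ~ wall_height, family = poisson,
                data = subset(DF, habitat_type == "wall"))
fit.tree <- glm(species_count ~ tree_diameter, family = poisson,
                data = subset(DF, habitat_type == "tree"))

summary(fit.habitat)
```

Fitting on subsets avoids the contrasts error because no row is dropped for a missing value in a predictor that does not apply to it.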
df = data.frame(table(train$department , train$outcome))
Here department and the outcome column are both factors (is_outcome is binary), and df looks like the table below. It contains only 2 variables (fields), whereas I want the department column to be part of the data frame, i.e. a data frame of 3 variables:
0 1
Analytics 4840 512
Finance 2330 206
HR 2282 136
Legal 986 53
Operations 10325 1023
Procurement 6450 688
R&D 930 69
Sales & Marketing 15627 1213
Technology 6370 768
One way I learnt was...
df = data.frame(table(train$department , train$is_outcome))
write.csv(df,"df.csv")
rm(df)
df = read.csv("df.csv")
colnames(df) = c("department", "outcome_0","outcome_1")
but I cannot save a file every time in my program. Is there any way to do it directly?
When you are trying to create tables from a matrix in R, you end up with trial.table. The object trial.table looks exactly the same as the matrix trial, but it really isn’t. The difference becomes clear when you transform these objects to a data frame. Take a look at the outcome of this code:
> trial.df <- as.data.frame(trial)
> str(trial.df)
'data.frame': 2 obs. of 2 variables:
$ sick : num 34 11
$ healthy: num 9 32
Here you get a data frame with two variables (sick and healthy) with each two observations. On the other hand, if you convert the table to a data frame, you get the following result:
> trial.table.df <- as.data.frame(trial.table)
> str(trial.table.df)
'data.frame': 4 obs. of 3 variables:
$ Var1: Factor w/ 2 levels "risk","no_risk": 1 2 1 2
$ Var2: Factor w/ 2 levels "sick","healthy": 1 1 2 2
$ Freq: num 34 11 9 32
The as.data.frame() function converts a table to a data frame in a format that you need for regression analysis on count data. If you need to summarize the counts first, you use table() to create the desired table.
Now you get a data frame with three variables. The first two — Var1 and Var2 — are factor variables for which the levels are the values of the rows and the columns of the table, respectively. The third variable — Freq — contains the frequencies for every combination of the levels in the first two variables.
In fact, you also can create tables in more than two dimensions by adding more variables as arguments, or by transforming a multidimensional array to a table using as.table(). You can access the numbers the same way you do for multidimensional arrays, and the as.data.frame() function creates as many factor variables as there are dimensions.
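For the department table asked about earlier in this thread, one possible direct route (a sketch, not the only way) is to build the three-column data frame from the table's row names and columns, without writing a file to disk. The toy train data below is made up for illustration; the column names outcome_0 and outcome_1 follow the question.

```r
# Made-up stand-in for the question's training data
train <- data.frame(
  department = c("HR", "HR", "Legal", "Legal", "Legal"),
  is_outcome = c(0, 1, 0, 0, 1)
)

# Two-way table: one row per department, one column per outcome value
tab <- table(train$department, train$is_outcome)

# Wide data frame with department as an ordinary column
df <- data.frame(
  department = rownames(tab),
  outcome_0 = as.vector(tab[, "0"]),
  outcome_1 = as.vector(tab[, "1"])
)
print(df)
```

Indexing the table by column name ("0" and "1") assumes both outcome values occur in the data.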
I have a dataframe that looks like this:
Sensor NewValue NewDate
1 iphone/NuhKZFrx/noise 1.00000 2015-10-20 23:26:14
2 iphone/NuhKZFrx/noiseS 58.63411 2015-10-20 23:26:14
3 iphone/wlhAlrPQ/noise 0.00000 2015-10-21 08:03:28
4 iphone/wlhAlrPQ/noiseS 65.26167 2015-10-21 08:03:28
[...]
with the following datatypes:
'data.frame': 405 obs. of 3 variables:
$ Sensor : Factor w/ 28 levels "iphone/5mZU0HWz/noise",..: 11 12 23 24 9 10 23 24 21 22 ...
$ NewValue: num 1 58.6 0 65.3 3 ...
$ NewDate : POSIXct, format: "2015-10-20 23:26:13" "2015-10-20 23:26:14" "2015-10-21 08:03:28" "2015-10-21 08:03:28" .
The Sensor field is set up like this: <model>/<uniqueID>/<type>. And I want to find out if there is a correlation between noise and noiseS for each uniqueID at a given time.
For a single uniqueID it works fine since there are only two factors. I tried to use xtabs(NewValue~NewDate+Sensor, data=dataNoises) but that gives me zeros since there aren't values for every ID at any time ...
What could I do to somehow compose the factors so that I only have one factor for noise and one for noiseS? Or is there an easier way to solve this problem?
What I want to do is the following:
Date noise noiseS
2015-10-20 23:26:14 1 58.63
2015-10-20 23:29:10 4 78.33
And then compute the Pearson correlation coefficient between noise and noiseS.
If I understand your question correctly, you just want a 2-level factor that distinguishes between noise and noiseS?
That can be easily achieved by defining a new column in the dataframe and populating it with the output of grepl(). A MWE:
a <- "blahblahblahblahnoise"
aa <- "blahblahblahblahnoiseS"
b <- "noiseS"
type <- vector()
type[1] <- grepl(b, a)
type[2] <- grepl(b, aa)
type <- as.factor(type)
This two-level factor would let you build a simple model of the means for noise (type[i]==FALSE) and noiseS (type[i]==TRUE), but would not let you evaluate the CORRELATION between the types for a given UniqueID and time. One way to do this would be to create separate columns for data with type==FALSE and type==TRUE, where rows correspond to a specific UniqueID+time combination. In this case, you would need to think carefully about what you want to learn and when you assume data to be independent. For example, if you want to learn whether noise and noiseS are correlated across time for a given uniqueID, then you would need to make a separate factor for uniqueID and include it in your model as an effect (possibly a random effect, depending on your purposes and your data).
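The "separate columns" idea above could be sketched like this. The first four rows of dataNoises come from the question; the last two rows are made up for illustration (the values 4 and 78.33 appear in the question's desired output, but their uniqueID is an assumption). The sketch also assumes that paired noise and noiseS readings share exactly the same timestamp, which may not hold in the real data.

```r
# Small stand-in for the question's data frame
dataNoises <- data.frame(
  Sensor = c("iphone/NuhKZFrx/noise", "iphone/NuhKZFrx/noiseS",
             "iphone/wlhAlrPQ/noise", "iphone/wlhAlrPQ/noiseS",
             "iphone/NuhKZFrx/noise", "iphone/NuhKZFrx/noiseS"),
  NewValue = c(1, 58.63411, 0, 65.26167, 4, 78.33),
  NewDate = as.POSIXct(c("2015-10-20 23:26:14", "2015-10-20 23:26:14",
                         "2015-10-21 08:03:28", "2015-10-21 08:03:28",
                         "2015-10-20 23:29:10", "2015-10-20 23:29:10"),
                       tz = "UTC"),
  stringsAsFactors = TRUE
)

# Split <model>/<uniqueID>/<type> into its parts
parts <- do.call(rbind, strsplit(as.character(dataNoises$Sensor), "/", fixed = TRUE))
dataNoises$uniqueID <- parts[, 2]
dataNoises$type <- parts[, 3]

# One row per uniqueID + time, with noise and noiseS side by side
keep <- c("uniqueID", "NewDate", "NewValue")
noise <- dataNoises[dataNoises$type == "noise", keep]
noiseS <- dataNoises[dataNoises$type == "noiseS", keep]
wide <- merge(noise, noiseS, by = c("uniqueID", "NewDate"),
              suffixes = c(".noise", ".noiseS"))

# Pearson correlation across all paired readings
r <- cor(wide$NewValue.noise, wide$NewValue.noiseS)
```

This pools all uniqueIDs into one correlation; whether that is appropriate depends on the independence assumptions discussed above.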
Here is what I have:
tmp[1,]
percentages percentages.1 percentages.2 percentages.3 percentages.4 percentages.5 percentages.6 percentages.7 percentages.8 percentages.9
0.0329489291598023 0.0391268533772652 0.0292421746293245 0.0354200988467875 0.0284184514003295 0.035831960461285 0.0308896210873147 0.0345963756177924 0.0366556836902801 0.0403624382207578
I try converting this to numeric, since the class is factor, but I get:
as.numeric(as.character(tmp[1,]))
[1] 35 36 35 36 31 32 31 34 36 34
Where did these integers come from?
Your problem is that indexing by rows of a data frame gives surprising results.
Reconstruct your object:
tmp <- read.csv(text=
"0.0329489291598023,0.0391268533772652,0.0292421746293245,0.0354200988467875,0.0284184514003295,0.035831960461285,0.0308896210873147,0.0345963756177924,0.0366556836902801,0.0403624382207578",
header=FALSE,colClasses=rep("factor",10))
Inspect:
str(tmp[1,])
## 'data.frame': 1 obs. of 10 variables:
## $ V1 : Factor w/ 1 level "0.0329489291598023": 1
## $ V2 : Factor w/ 1 level "0.0391268533772652": 1
## ... etc.
Converting via as.character() totally messes things up:
str(as.character(tmp[1,]))
## chr [1:10] "1" "1" "1" "1" "1" "1" "1" "1" "1" "1"
On the other hand, this (converting to a matrix first) works fine:
as.numeric(as.matrix(tmp)[1,])
## [1] 0.03294893 0.03912685 0.02924217 0.03542010 0.02841845 0.03583196
## [7] 0.03088962 0.03459638 0.03665568 0.04036244
That said, I have to admit that I do not understand the particular magic that makes as.character() applied to a data frame drop the information about factor levels and convert everything first to the underlying numerical codes, and then to character -- I don't know where precisely you would go to read about this. (The bottom line is "don't extract rows of data frames if you can help it; convert them to matrices first if necessary.")
As an alternative to converting to matrix, you can just transpose the dataframe row to a column:
as.numeric(as.character(t(tmp[1,])))
## [1] 0.03294893 0.03912685 0.02924217 0.03542010 0.02841845 0.03583196
## [7] 0.03088962 0.03459638 0.03665568 0.04036244
I think the integers seen by the OP
[1] 35 36 35 36 31 32 31 34 36 34
are the underlying integer codes of the factor levels: the data frame had multiple rows (36 or more), and these are the codes for the first row.
ETA I see that t() converts a data frame to a matrix, so my solution is the same as Ben's.
Perhaps the reason as.character() doesn't work with a dataframe row is that the levels of the different columns may differ, so there isn't a common set of levels(). In these circumstances as.matrix() will convert to character, so it solves the problem.
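Another option, sketched here under the assumption that every column is a factor holding numeric text, is to convert column by column before extracting any row, so that each factor is translated through its own levels. The two-row tmp below is made up for illustration.

```r
# Made-up factor columns whose levels are numeric strings
tmp <- data.frame(a = factor(c("0.5", "0.25")),
                  b = factor(c("0.1", "0.9")))

# Convert each column through its own levels: factor -> character -> numeric
num <- tmp
num[] <- lapply(tmp, function(col) as.numeric(as.character(col)))

# Rows can now be extracted as plain numbers
first.row <- unlist(num[1, ])
print(first.row)
```

Assigning into num[] keeps the data frame structure and column names while replacing the column contents.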
Many surveys have codes for different kinds of missingness. For instance, a codebook might indicate:
0-99 Data
-1 Question not asked
-5 Do not know
-7 Refused to respond
-9 Module not asked
Stata has a beautiful facility for handling these multiple kinds of missingness, in that it allows you to assign a generic . to missing data, but more specific kinds of missingness (.a, .b, .c, ..., .z) are allowed as well. All the commands which look at missingness report answers for all the missing entries however specified, but you can sort out the various kinds of missingness later on as well. This is particularly helpful when you believe that refusal to respond has different implications for the imputation strategy than does question not asked.
I have never run across such a facility in R, but I would really like to have this capability. Are there any ways of marking several different types of NA? I could imagine creating more data (either a vector of length nrow(my.data.frame) containing the types of missingness, or a more compact index of which rows had what types of missingness), but that seems pretty unwieldy.
I know what you are looking for, and that is not implemented in R. I have no knowledge of a package where that is implemented, but it's not too difficult to code it yourself.
A workable way is to add a dataframe to the attributes, containing the codes. To prevent doubling the whole dataframe and save space, I'd add the indices in that dataframe instead of reconstructing a complete dataframe.
eg :
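NACode <- function(x,code){
    # replace every coded value with NA, column by column
    Df <- sapply(x,function(i){
        i[i %in% code] <- NA
        i
    })
    id <- which(is.na(Df))
    # (id - 1) keeps entries in the last row from wrapping to row 0 of the next column
    rowid <- (id - 1L) %% nrow(x) + 1L
    colid <- (id - 1L) %/% nrow(x) + 1
    NAdf <- data.frame(
        id,rowid,colid,
        value = as.matrix(x)[id]
    )
    Df <- as.data.frame(Df)
    attr(Df,"NAcode") <- NAdf
    Df
}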
This allows you to do:
> Df <- data.frame(A = 1:10,B=c(1:5,-1,-2,-3,9,10) )
> code <- list("Missing"=-1,"Not Answered"=-2,"Don't know"=-3)
> DfwithNA <- NACode(Df,code)
> str(DfwithNA)
'data.frame': 10 obs. of 2 variables:
$ A: num 1 2 3 4 5 6 7 8 9 10
$ B: num 1 2 3 4 5 NA NA NA 9 10
- attr(*, "NAcode")='data.frame': 3 obs. of 4 variables:
..$ id : int 16 17 18
..$ rowid: int 6 7 8
..$ colid: num 2 2 2
..$ value: num -1 -2 -3
The function can also be adjusted to add an extra attribute that gives you the label for the different values, see also this question. You could backtransform by :
ChangeNAToCode <- function(x,code){
    NAval <- attr(x,"NAcode")
    for(i in which(NAval$value %in% code))
        x[NAval$rowid[i],NAval$colid[i]] <- NAval$value[i]
    x
}
> Dfback <- ChangeNAToCode(DfwithNA,c(-2,-3))
> str(Dfback)
'data.frame': 10 obs. of 2 variables:
$ A: num 1 2 3 4 5 6 7 8 9 10
$ B: num 1 2 3 4 5 NA -2 -3 9 10
- attr(*, "NAcode")='data.frame': 3 obs. of 4 variables:
..$ id : int 16 17 18
..$ rowid: int 6 7 8
..$ colid: num 2 2 2
..$ value: num -1 -2 -3
This allows you to change only the codes you want, if that is ever necessary. The function can be adapted to return all codes when no argument is given. Similar functions can be constructed to extract data based on the code; I guess you can figure that one out yourself.
In short: using attributes and indices might be a nice way of doing it.
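One way such an extraction function might look is sketched below. RowsWithCode is a hypothetical helper, not part of the answer above; it only assumes a data frame carrying an "NAcode" attribute with rowid and value columns, as in the str() output shown earlier. The attribute is built by hand here so the sketch runs on its own.

```r
# Hypothetical helper: return the rows that contained any of the given codes
RowsWithCode <- function(x, code) {
  NAval <- attr(x, "NAcode")
  rows <- unique(NAval$rowid[NAval$value %in% code])
  x[rows, , drop = FALSE]
}

# Stand-in for DfwithNA, with the attribute attached manually
Df <- data.frame(A = 1:10, B = c(1:5, NA, NA, NA, 9, 10))
attr(Df, "NAcode") <- data.frame(id = 16:18, rowid = 6:8,
                                 colid = c(2, 2, 2), value = c(-1, -2, -3))

# Rows where the original value was -2 or -3
picked <- RowsWithCode(Df, c(-2, -3))
print(picked)
```

Note that subsetting a data frame with `[` generally drops custom attributes, so the result no longer carries "NAcode".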
The most obvious way seems to use two vectors:
Vector 1: a data vector, where all missing values are represented using NA. For example, c(2, 50, NA, NA)
Vector 2: a vector of factors, indicating the type of data. For example, factor(c(1, 1, -1, -7)) where factor 1 indicates a correctly answered question.
Having this structure would give you a great deal of flexibility, since all the standard na.rm arguments still work with your data vector, but you can use more complex concepts with the factor vector.
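A minimal sketch of this two-vector layout, using the codebook from the question: the variable names (responses, value, status, survey) and the status labels are made up for illustration.

```r
# Raw survey responses, with negative codes marking kinds of missingness
responses <- c(2, 50, -1, -7)
missing.codes <- c(-1, -5, -7, -9)
is.coded <- responses %in% missing.codes

# Vector 1: data, with every kind of missingness collapsed to NA
value <- ifelse(is.coded, NA, responses)

# Vector 2: factor recording why each entry is (or is not) missing;
# 1 stands for a correctly answered question, as in the text above
status <- factor(ifelse(is.coded, responses, 1),
                 levels = c(1, -1, -5, -7, -9),
                 labels = c("answered", "not asked", "do not know",
                            "refused", "module not asked"))

survey <- data.frame(value, status)

# Standard NA handling still works on the data vector
avg <- mean(survey$value, na.rm = TRUE)

# The factor lets you pick out one kind of missingness
refused <- survey[survey$status == "refused", ]
```

Keeping both vectors in one data frame means they are subset together, which addresses part of the bookkeeping concern.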
Update following questions from @gsk3
Data storage will dramatically increase: The data storage will double. However, if doubling the size causes real problems, it may be worth thinking about other strategies.
Programs don't automatically deal with it: That's a strange comment. Some functions by default handle NAs in a sensible way. However, you want to treat the NAs differently, so that implies that you will have to do something bespoke. If you want to just analyse data where the NAs are "Question not asked", then just use a data frame subset.
Now you have to manipulate two vectors together every time you want to conceptually manipulate a variable: I suppose I envisaged a data frame of the two vectors. I would subset the data frame based on the second vector.
There's no standard implementation, so my solution might differ from someone else's: True. However, if an off-the-shelf package doesn't meet your needs, then (almost) by definition you want to do something different.
I should state that I have never analysed survey data (although I have analysed large biological data sets). My answers above appear quite defensive, but that's not my intention. I think your question is a good one, and I'm interested in other responses.
This is more than just a "technical" issue. You should have a thorough statistical background in missing value analysis and imputation. One solution requires playing with R and ggobi. You can assign extremely negative values to several types of NA (put NAs into margin), and do some diagnostics "manually". You should bear in mind that there are three types of missingness:
MCAR - missing completely at random, where P(missing|observed,unobserved) = P(missing)
MAR - missing at random, where P(missing|observed,unobserved) = P(missing|observed)
MNAR - missing not at random (or non-ignorable), where P(missing|observed,unobserved) depends on the unobserved values and so cannot be simplified as above.
IMHO this question is more suitable for CrossValidated.
But here's a link from SO that you may find useful:
Handling missing/incomplete data in R--is there function to mask but not remove NAs?
You can dispense with NA entirely and just use the coded values. You can then also roll them up to a global missing value. I often prefer to code without NA since NA can cause problems in coding and I like to be able to control exactly what is going into the analysis. I have also used the string "NA" to represent NA, which often makes things easier.
-Ralph Winters
I usually use them as values, as Ralph already suggested, since the type of missing value seems to be data, but on one or two occasions where I mainly wanted it for documentation I have used an attribute on the value, e.g.
> a <- NA
> attr(a, 'na.type') <- -1
> print(a)
[1] NA
attr(,"na.type")
[1] -1
That way my analysis is clean but I still keep the documentation. But as I said: usually I keep the values.
Allan.
I'd like to add to the "statistical background component" here. Statistical Analysis with Missing Data is a very good read on this.