How to delete columns in a dataset that have binomial variables - r

The dataset named data has both categorical and continuous variables. I would like to the delete categorical variables.
I tried:
data.1 <- data[,colnames(data)[[3L]]!=0]
No error is printed, but categorical variables stay in data.1. Where are problems ?
The summary of "head(data)" is
id 1,2,3,4,...
age 45,32,54,23,...
status 0,1,0,0,...
...
(more variables like as I wrote above)
All variables are defined as "Factor".

What are you trying to do with that code? First of all, colnames(data) is not a list so using [[]] doesn't make sense. Second, The only thing you test is whether the third column name is not equal to zero. As a column name can never start with a number, that's pretty much always true. So your code translates to :
data1 <- data[,TRUE]
Not what you intend to do.
I suppose you know the meaning of binomial. One way of doing that is defining your own function is.binomial() like this :
is.binomial <- function(x,na.action=c('na.omit','na.fail','na.pass'){
FUN <- match.fun(match.arg(na.action))
length(unique(FUN(x)))==2
}
in case you want to take care of NA's. This you can then apply to your dataframe :
data.1 <- data[!sapply(data,is.binomial)]
This way you drop all binomial columns, i.e. columns with only two distinct values.

#Shimpei Morimoto,
I think you need a different approach.
Are the categorical variables defines in the dataframe as factors?
If so you can use:
data.1 <- data[,!apply(data,2,is.factor)]
The test you perform now is if the colname number 3L is not 0.
I think this is not the case.
Another approach is
data.1 <- data[,-3L]
works only if 3L is a number and the only column with categorical variables

I think you're getting there, with your last comment to #Mischa Vreeburg. It might make sense (as you suggest) to reformat your original data file, but you should also be able to solve the problem within R. I can't quite replicate the undefined columns error you got.
Construct some data that look as much like your data as possible:
X <- read.csv(textConnection(
"id,age,pre.treat,status
1,'27', 0,0
2,'35', 1,0
3,'22', 0,1
4,'24', 1,2
5,'55', 1,3
, ,yes(vs)no,"),
quote="\"'")
Take a look:
str(X)
'data.frame': 6 obs. of 4 variables:
$ id : int 1 2 3 4 5 NA
$ age : int 27 35 22 24 55 NA
$ pre.treat: Factor w/ 3 levels " 0"," 1","yes(vs)no": 1 2 1 2 2 3
$ status : int 0 0 1 2 3 NA
Define #Joris Mey's function:
is.binomial <- function(x,na.action=c('na.omit','na.fail','na.pass')) {
FUN <- match.fun(match.arg(na.action))
length(unique(FUN(x)))==2
}
Try it out: you'll see that it does not detect pre.treat as binomial, and keeps all the variables.
sapply(X,is.binomial)
X1 <- X[!sapply(X,is.binomial)]
names(X1)
## keeps everything
We can drop the last row and try again:
X2 <- X[-nrow(X),]
sapply(X2,is.binomial)
It is true in general that R does not expect "extraneous" information such as level IDs to be in the same column as the data themselves. On the one hand, you can do even better in the R world by simply leaving the data as their original, meaningful values ("no", "yes", or "healthy", "sick" rather than 0, 1); on the other hand the data take up slightly more space if stored as a text file, and, more important, it becomes harder to incorporate other meta-data such as units in the file along with the data ...

Related

R: Dropping variables using number of observations

I have a large dataset, and I'm trying to drop some of my variables based on how many observations each has. For instance, I would like to drop any variable in my dataframe where n < 3 (total observations for that variable is less than 3). Since R can count observations for each variable using describe, can't I use that number to subset the data instead of having to type in each variable name each time I pull in a new version (each version has different variables that will have low n's and there are over 40 variables). Thanks so much for your help!
For instance, my data looks like this:
ID Runaway Aggressive Emergency Hospitalization Injury
1 3 NA 4 1 NA
2 NA NA 2 1 NA
3 4 NA 6 2 3
4 1 NA 1 1 NA
I want to be able to drop "Aggressive" and "Injury" based on their n's being 0 and 1 respectively. However, instead of telling R to drop them by variable name, it would be much more convenient if it was possible to tell R to drop any variable where n < 3 (or whatever number I choose) as I'll be using this code for multiple versions of this dataset. I have tried using column numbers (which is better than writing them out) but it's still pretty tedious when I have to describe() the data, figure out which variables have low n's, and then drop 28 variables or subset() around them.
This works but it's cumbersome...
UIRCorrelation <- UIRKidUnique61[c(28, 30, 32, 34:38, 42, 54:74)]
For some reason, my example looks different when I'm editing versus when I save so I also included an image of it. Sorry. This is the first time I've ever used stack overflow to ask a question. I actually spent a lot of time googling this but couldn't find an answer relating to n.
This line did not work: DF[, sapply(DF, function(col) length(na.omit(col))) > 4]
DF being your dataframe
DF[, sapply(DF, function(col) length(na.omit(col))) > 4]
This function did the trick:
valid <- function(x) {sum(!is.na(x))}
N <- apply(UIRCorrelation,2,valid)
UIRCorrelation2 <- UIRCorrelation[N > 3]

R: NAs show up when writing to a data frame, occurs inside second condition of if/else statement

I'm trying to run some code in R that reads data from one data frame and writes it to a new data frame depending on conditionals about what text occurs in the first "read" data frame.
For some reason when I write to a new data frame I get "NA" values for the characters in the final column, but this only happens with the second if statement when I try to write the text "Byeeeeee", "HiThere" comes up fine.
Any insight into why this is working for the first if statement, but not the second?
input file is "testWayang", output file is "outputWayang"
Thanks!!!
outputWayang<-data.frame(startId=numeric(), studId=numeric(), sessNum=numeric(), startTime=numeric(), problemId=numeric(), action=character())
for(i in 1:50){
if(testWayang[i,4]=="beginProblem"){
outputWayang = rbind(outputWayang, c(testWayang[i,1], testWayang[i,2], testWayang[i,3], testWayang[i,7], testWayang[i+1,9], as.character("HiThere")))
}
else if(testWayang[i,4]=="studentFeedback"){
outputWayang = rbind(outputWayang, c(testWayang[i,1], testWayang[i,2], testWayang[i,3], testWayang[i,7], testWayang[i,9], as.character("Byeeeeeee")))
}
write.csv(outputWayang, file="place I write my file to on my computer", append=TRUE)
}
You are attempting to create a vector with the c function that has both numeric and character values. That will fail you. It's also the case that R newbies often fail to recognize problems related to R's default creation of factor variables. You might be able to rbind a list to an existing dataframe. Let's try with a simple example:
rbind( data.frame(a=1,b="y"), c(2,"n"))
a b
1 1 y
2 2 <NA>
Warning message:
In `[<-.factor`(`*tmp*`, ri, value = "n") :
invalid factor level, NA generated
So that replicates your failure; It's also the case taht you inadvertently coerced the first colum of what you might expect to be numeri class to character class:
> str( rbind( data.frame(a=1,b="y"), c(2,"n")) )
'data.frame': 2 obs. of 2 variables:
$ a: chr "1" "2"
$ b: Factor w/ 1 level "y": 1 NA
Just using a list also fails because of the factor issue mentioned above, but if you create the dataframe with a factor that has all possible levels then you get success:
rbind( data.frame(a=1,b=factor("y",levels=c("y","n"))), list(2,"n"))
a b
1 1 y
2 2 n
I'n not sure what else might be going on since your didn't include enough data to do further investigation.

Loading and expanding contingency table

I have a data file that represents a contingency table that I need to work with. The problem is I can't figure out how to load it properly.
Data structure:
Rows: individual churches
1st Column: Name of the church
2nd - 12th column: Mean age of followers
Every cell: Number of people who follows corresponding church and are correspondingly old.
//In the original data set only the age range was available (e.g. between 60-69) so to enable computation with it I decided to replace it with mean age (e. g. 64.5 instead of 60-69)
Data sample:
name;7;15;25
catholic;25000;30000;15000
hinduism;5000;2000;3000
...
I tried to simply load the data and make them a 'table' so I could expand it but it didn't work (only produced something really weird).
dataset <- read.table("C:/.../dataset.csv", sep=";", quote="\"")
dataset_table <- as.table(as.matrix(dataset))
When I tried use the data as they were to produce a simple graph it didn't work either.
barplot(dataset[2,2:4])
Error in barplot.default(dataset[2,2:4]) : 'height' must be a vector or a matrix
Classing dataset[2,2:4] showed me that it is a 'list' which I don't understand (I guess it is because dataset is data.frame and not table).
If someone could point me into the right direction how to properly load the data as a table and then work with them, I'd be forever grateful :).
If your file is already a contingency table, don't use as.table().
df <- read.table(header=T,sep=";",text="name;7;15;25
catholic;25000;30000;15000
hinduism;5000;2000;3000")
colnames(df)[-1] <- substring(colnames(df)[-1],2)
barplot(as.matrix(df[2,2:4]), col="lightblue")
The transformation of colnames(...) is because R doesn't like column names that start with a number, so it prepends X. This codes just gets rid of that.
EDIT (Response to OP's comment)
If you want to convert the df defined above to a table suitable for use with expand.table(...) you have to set dimnames(...) and names(dimnames(...)) as described in the documentation for expand.table(...).
tab <- as.matrix(df[-1])
dimnames(tab) <- list(df$name,colnames(df)[-1])
names(dimnames(tab)) <- c("name","age")
library(epitools)
x.tab <- expand.table(tab)
str(x.tab)
# 'data.frame': 80000 obs. of 2 variables:
# $ name: Factor w/ 2 levels "catholic","hinduism": 1 1 1 1 1 1 1 1 1 1 ...
# $ age : Factor w/ 3 levels "7","15","25": 1 1 1 1 1 1 1 1 1 1 ...

R biglm with categorical variables

I have a large data set I working with in R using some of the big.___() packages. It's ~ 10 gigs (100mmR x 15C) and looks like this:
Price Var1 Var2
12.45 1 1
33.67 1 2
25.99 3 3
14.89 2 2
23.99 1 1
... ... ...
I am trying to predict price based on Var1 and Var2.
The problem I've come up with is that Var1 and Var2 are categorical / factor variables.
Var1 and Var2 each have 3 levels (1,2 and 3) but there are only 6 combinations in the data set
(1,1; 1,2; 1,3; 2,2; 2,3; 3,3)
To use factor variables in biglm() they must be present in each chunk of data that biglm uses (my understanding is that biglm breaks the data set into 'x' number of chunks and updates the regression parameters after analyzing each chunk in order to get around dealing with data sets that are larger than RAM).
I've tried to subset the data but my computer can't handle it or my code is wrong:
bm11 <- big.matrix(150000000, 3)
bm11 <- subset(x, x[,2] == 1 & x[,3] == 1)
The above gives me a bunch of these:
Error: cannot allocate vector of size 1.1 Gb
Does anyone have any suggestions for working around this issue?
I'm using R 64-bit on a windows 7 machine w/ 4 gigs of RAM.
You do not need all the data or all values present in each chunk, you just need all the levels accounted for. This means that you can have a chunk like this:
curchunk <- data.frame( Price=c(12.45, 33.67), Var1=factor( c(1,1), levels=1:3),
Var2 = factor( 1:2, levels=1:3 ) )
and it will work. Even though there is only 1 value in Var1 and 2 values in Var2, all three levels are present in both so it will do the correct thing.
Also biglm does not break the data into chunks for you, but expects you to give it manageble chunks to work with. Work through the examples to see this better. A common methodology with biglm is to read from a file or database, read in the first 'n' rows (where 'n' is a reasonble subset) and pass them to biglm (possibly after making sure all the factors have all the levels specified), then remove that chunk of data from memory and read in the next 'n' rows and pass that to update, continues with this until the end of the file removing the used chunks each time (so you have enough memory room for the next one).

How do I handle multiple kinds of missingness in R?

Many surveys have codes for different kinds of missingness. For instance, a codebook might indicate:
0-99 Data
-1 Question not asked
-5 Do not know
-7 Refused to respond
-9 Module not asked
Stata has a beautiful facility for handling these multiple kinds of missingness, in that it allows you to assign a generic . to missing data, but more specific kinds of missingness (.a, .b, .c, ..., .z) are allowed as well. All the commands which look at missingness report answers for all the missing entries however specified, but you can sort out the various kinds of missingness later on as well. This is particularly helpful when you believe that refusal to respond has different implications for the imputation strategy than does question not asked.
I have never run across such a facility in R, but I would really like to have this capability. Are there any ways of marking several different types of NA? I could imagine creating more data (either a vector of length nrow(my.data.frame) containing the types of missingness, or a more compact index of which rows had what types of missingness), but that seems pretty unwieldy.
I know what you look for, and that is not implemented in R. I have no knowledge of a package where that is implemented, but it's not too difficult to code it yourself.
A workable way is to add a dataframe to the attributes, containing the codes. To prevent doubling the whole dataframe and save space, I'd add the indices in that dataframe instead of reconstructing a complete dataframe.
eg :
NACode <- function(x,code){
Df <- sapply(x,function(i){
i[i %in% code] <- NA
i
})
id <- which(is.na(Df))
rowid <- id %% nrow(x)
colid <- id %/% nrow(x) + 1
NAdf <- data.frame(
id,rowid,colid,
value = as.matrix(x)[id]
)
Df <- as.data.frame(Df)
attr(Df,"NAcode") <- NAdf
Df
}
This allows to do :
> Df <- data.frame(A = 1:10,B=c(1:5,-1,-2,-3,9,10) )
> code <- list("Missing"=-1,"Not Answered"=-2,"Don't know"=-3)
> DfwithNA <- NACode(Df,code)
> str(DfwithNA)
'data.frame': 10 obs. of 2 variables:
$ A: num 1 2 3 4 5 6 7 8 9 10
$ B: num 1 2 3 4 5 NA NA NA 9 10
- attr(*, "NAcode")='data.frame': 3 obs. of 4 variables:
..$ id : int 16 17 18
..$ rowid: int 6 7 8
..$ colid: num 2 2 2
..$ value: num -1 -2 -3
The function can also be adjusted to add an extra attribute that gives you the label for the different values, see also this question. You could backtransform by :
ChangeNAToCode <- function(x,code){
NAval <- attr(x,"NAcode")
for(i in which(NAval$value %in% code))
x[NAval$rowid[i],NAval$colid[i]] <- NAval$value[i]
x
}
> Dfback <- ChangeNAToCode(DfwithNA,c(-2,-3))
> str(Dfback)
'data.frame': 10 obs. of 2 variables:
$ A: num 1 2 3 4 5 6 7 8 9 10
$ B: num 1 2 3 4 5 NA -2 -3 9 10
- attr(*, "NAcode")='data.frame': 3 obs. of 4 variables:
..$ id : int 16 17 18
..$ rowid: int 6 7 8
..$ colid: num 2 2 2
..$ value: num -1 -2 -3
This allows to change only the codes you want, if that ever is necessary. The function can be adapted to return all codes when no argument is given. Similar functions can be constructed to extract data based on the code, I guess you can figure that one out yourself.
But in one line : using attributes and indices might be a nice way of doing it.
The most obvious way seems to use two vectors:
Vector 1: a data vector, where all missing values are represented using NA. For example, c(2, 50, NA, NA)
Vector 2: a vector of factors, indicating the type of data. For example, factor(c(1, 1, -1, -7)) where factor 1 indicates the a correctly answered question.
Having this structure would give you a create deal of flexibility, since all the standard na.rm arguments still work with your data vector, but you can use more complex concepts with the factor vector.
Update following questions from #gsk3
Data storage will dramatically increase: The data storage will double. However, if doubling the size causes real problem it may be worth thinking about other strategies.
Programs don't automatically deal with it. That's a strange comment. Some functions by default handle NAs in a sensible way. However, you want to treat the NAs differently so that implies that you will have to do something bespoke. If you want to just analyse data where the NA's are "Question not asked", then just use a data frame subset.
now you have to manipulate two vectors together every time you want to conceptually manipulate a variable I suppose I envisaged a data frame of the two vectors. I would subset the data frame based on the second vector.
There's no standard implementation, so my solution might differ from someone else's. True. However, if an off the shelf package doesn't meet your needs, then (almost) by definition you want to do something different.
I should state that I have never analysed survey data (although I have analysed large biological data sets). My answers above appear quite defensive, but that's not my intention. I think your question is a good one, and I'm interested in other responses.
This is more than just a "technical" issue. You should have a thorough statistical background in missing value analysis and imputation. One solution requires playing with R and ggobi. You can assign extremely negative values to several types of NA (put NAs into margin), and do some diagnostics "manually". You should bare in mind that there are three types of NA:
MCAR - missing completely at random, where P(missing|observed,unobserved) = P(missing)
MAR - missing at random, where P(missing|observed,unobserved) = P(missing|observed)
MNAR - missing not at random (or non-ignorable), where P(missing|observed,unobserved) cannot be quantified in any way.
IMHO this question is more suitable for CrossValidated.
But here's a link from SO that you may find useful:
Handling missing/incomplete data in R--is there function to mask but not remove NAs?
You can dispense with NA entirely and just use the coded values. You can then also roll them up to a global missing value. I often prefer to code without NA since NA can cause problems in coding and I like to be able to control exactly what is going into the analysis. If have also used the string "NA" to represent NA which often makes things easier.
-Ralph Winters
I usually use them as values, as Ralph already suggested, since the type of missing value seems to be data, but on one or two occasions where I mainly wanted it for documentation I have used an attribute on the value, e.g.
> a <- NA
> attr(a, 'na.type') <- -1
> print(a)
[1] NA
attr(,"na.type")
[1] -1
That way my analysis is clean but I still keep the documentation. But as I said: usually I keep the values.
Allan.
I´d like to add to the "statistical background component" here. Statistical analysis with missing data is a very good read on this.

Resources