I recently created a barplot in R using some sample data with no trouble. Then I tried it again using the real data which was exactly the same as the sample data except there was more of it. The problem is now I get this error:
Error in barplot.default(table(datafr)) :
'height' must be a vector or a matrix
I don't know if this is of help but when I print out the table these are what the last lines look like.
33333 2010-09-13-19:25:50.206 Google Chrome-#135 NA
[ reached getOption("max.print") -- omitted 342611 rows ]]
Is it possible that this is too much data to process? Any suggestion as to how I can fix this?
Thanks :)
EDIT 1
Hey Joris,
Here is the info from str(datafr) :
'data.frame': 375944 obs. of 3 variables:
$ TIME : Factor w/ 375944 levels "2010-09-11-19:28:34.680 ",..: 1 2 3 4 5 6 7 8 9 10 ...
$ FOCUS.APP: Factor w/ 107 levels " Finder-#101 ",..: 3 3 3 3 3 3 3 3 1 1 ...
$ X : logi NA NA NA NA NA NA ...
and from traceback()
3: stop("'height' must be a vector or a matrix")
2: barplot.default(table(datafr))
1: barplot(table(datafr))
I also ran the other command you told me, but the feedback was super verbose; too much to print here. Let me know if you need any other info or if the last information was really important I can figure out a way to post it.
Thanks,
Ah, that solves the problem: you have 3 dimensions in your table, and barplot can't deal with that. Take only the 2 columns you want to use for the barplot function, e.g.:
# sample data
Df <- data.frame(
TIME = as.factor(seq.Date(as.Date("2010-09-11"),as.Date("2010-09-20"),by="day")),
FOCUS.APP = as.factor(rep(c("F101","F102"),5)),
X = sample(c(TRUE,FALSE,NA),10,replace=TRUE)
)
# make tables
T1 <- table(Df)
T2 <- table(Df[,-3])
# plot tables
barplot(T1) # 3-d table: fails with "'height' must be a vector or a matrix"
barplot(T2)
That said, the resulting plot must look interesting, to say the least. I don't know what you are trying to do, but I'd say you might want to reconsider your approach.
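If the goal is to see how often each application had focus (that is my assumption about the intent, it is not stated in the question), one possible approach is to tabulate FOCUS.APP on its own, which gives a one-dimensional table that barplot accepts. The data frame below is a small made-up stand-in for datafr:

```r
# Hypothetical stand-in for datafr, with the same three columns as in str(datafr)
datafr <- data.frame(
  TIME      = factor(sprintf("2010-09-11-19:28:%02d", 1:6)),
  FOCUS.APP = factor(c("Finder", "Chrome", "Chrome", "Finder", "Chrome", "Mail")),
  X         = NA
)

# One-dimensional table: number of observations per application
counts <- table(datafr$FOCUS.APP)

# Bars ordered from most to least frequent; las = 2 rotates the axis labels
barplot(sort(counts, decreasing = TRUE), las = 2)
```

With 107 applications the labels will still be crowded, so plotting only the most frequent ones may be worth considering.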
Related
Could anyone please help me with this?
I am trying to select some features from my dataset. Previously I was able to do it like this:
My data had 50 features, and here I reduce them to 24.
trainQ1 <- df_2015_Q1[,1:24]
#trainQ1 12312312 obs. of 24 variables
But now, when I use the same approach:
age <- trainQ1[,1:4]
#and returns
#age int [1:4] 1 2 3 4
What is going on here?
Preface:
I have seen this post: How to convert a factor to an integer\numeric without a loss of information?, but it does not really apply to the issue I am having. It addresses converting a factor vector to numeric, but the issue I am having is larger than that.
Problem:
I am trying to convert a column in a dataframe from a factor to a numeric, while representing the dataframe using paste0. Here is an example:
aa=1:10
bb=rnorm(10)
dd=data.frame(aa,bb)
get(paste0("d","d"))[,2]=as.factor(get(paste0("d","d"))[,2])
(The actual code I am using requires me to use the paste0 function)
I get the error: target of assignment expands to non-language object
I am not sure how to do this, I think what is messing it up is the paste0 function.
First, this is not really a natural way to think about things or to code things in R. It can be done, but if you rephrase your question to give the bigger picture, someone can probably provide more natural ways of doing this in R (like the named lists @joran mentioned in the comment).
With that said, to do this in R, you need to split apart the three steps you're trying to do in one line: get the data frame with the specified variable, make the desired column a factor, and then assign back to the variable name. Here I've wrapped this in a function, so the assignment needs to be made in pos=1 instead of the default, which would name it only within the function.
tof <- function(dfname, colnum) {
d <- get(dfname)
d[, colnum] <- factor(d[, colnum])
assign(dfname, d, pos=1)
}
dd <- data.frame(aa=1:10, bb=rnorm(10))
str(dd)
## 'data.frame': 10 obs. of 2 variables:
## $ aa: int 1 2 3 4 5 6 7 8 9 10
## $ bb: num -1.4824 0.7904 0.0258 1.2075 0.2455 ...
tof("dd", 2)
str(dd)
## 'data.frame': 10 obs. of 2 variables:
## $ aa: int 1 2 3 4 5 6 7 8 9 10
## $ bb: Factor w/ 10 levels "-1.48237228248052",..: 1 8 4 9 5 10 2 7 3 6
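For comparison, here is a minimal sketch of the named-list alternative mentioned above. The list name frames is made up for illustration. Because the data frames live in a list, a name built with paste0 can be used directly on the left-hand side of an assignment, so neither get nor assign is needed:

```r
# Keep the data frames in a named list instead of as separate variables
frames <- list(dd = data.frame(aa = 1:10, bb = rnorm(10)))

# Build the name at run time, as in the question
name <- paste0("d", "d")

# [[ ]] accepts a character name, so in-place modification works
frames[[name]][, 2] <- factor(frames[[name]][, 2])

str(frames[[name]])
```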
I'm new to R so please excuse my very basic question:
I have a data frame that has a lot of missing data. I've used na.omit to remove missing data as in:
data2 <- na.omit(data1)
However, some of the variables are factors that still seem to have "" as one of the categories, as in:
> str(data2$smoker)
Factor w/ 3 levels "","No","Yes": 2 2 2 2 2 2 2 3 3 2 ...
When I look at "data2", it still has missing values. What am I doing wrong?
Help and advice much appreciated.
Greg
NA is not the same as "".
What is the difference?
NA indicates a missing value
"" is an empty string, which is a type of value
na.omit will remove NA values, but it will not remove empty strings.
I suggest turning "" into NA before using na.omit:
data1[data1$smoker == "", "smoker"] <- NA
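One caveat: if smoker already contains real NA values, the comparison data1$smoker == "" returns NA for those rows, and a data frame refuses NA in a row index on assignment. A slightly more defensive sketch, using made-up data in place of data1, wraps the comparison in which() and then drops the unused "" level:

```r
# Hypothetical data mirroring the question: a factor with "" as a level plus a real NA
data1 <- data.frame(
  smoker = factor(c("No", "", "Yes", NA, "No")),
  age    = c(34, 51, 47, 40, NA)
)

# which() discards NA comparisons, so rows that are already NA cause no error
data1$smoker[which(data1$smoker == "")] <- NA

data2 <- na.omit(data1)

# na.omit removes rows but keeps "" as an (empty) factor level; drop it explicitly
data2$smoker <- droplevels(data2$smoker)

levels(data2$smoker)
nrow(data2)
```

Alternatively, passing na.strings = "" when reading the file avoids the problem at the source.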
I am having rather lengthy problems with my data set, and I believe my troubles trace back to importing the data. I have looked at many other questions and answers, as well as as many help sites as I could find, but I can't seem to make anything work. I am attempting to run some TTests on my data and have thus far been unable to do so. I believe the root cause is that the data is imported as class NULL. I've tried to include as much information here as I can to show what I am working with and the types of issues I am having (in case the issue is in some other area).
An overview of my data and what I've been doing so far is this:
Example File data (as displayed in R after reading data from .csv file):
Part  Q001  Q002  LA003  Q004  SA005  D106
1     5     3     text   99    text   3
2     3           text   2     text   2
3     2     4            3     text   5
4     99    5     text   2            2
5     4     2            1     text   3
So in my data, the "answers" are 1 through 5. 99 represents a question that was answered N/A. Blanks represent unanswered questions. The 'text' questions are long and short answers/comments from a survey. All of them are stored in a large data set of over 150 participants (Part) and over 300 questions (labeled Q, LA, SA, or D for a question with a 1-5 answer, a long answer, a short answer, or a demographic (also numeric answers, 0 through 6 or so)).
When I import the data, I need to have it disregard any blank or 99 answers so they do not interfere with statistics. I also don't care about the comments, so I filter all of them out.
EDIT: data file looks like:
Part,Q001,Q002,LA003,Q004,SA005,D006
1,5,3,text,99,text,3
2,3,,text,2,text,2
etc...
I am using the following lines to read the data:
data.all <- read.table("data.csv", header=TRUE, sep=",", na.strings = c("","99"))
data <- data.all[, !(colnames(data.all) %in% c("LA003", "SA005")
now, when I type
class(data$Q001)
I get NULL
I need these to be Numeric. I can use summary(data) to get the means and such, but when I try to run ttests, I get errors including NULL.
I tried to turn this column into numerics by using
data<-sapply(data,as.numeric)
and I tried
data[,1]<-as.numeric(as.character(data[,1]))
(and with 2 instead of 1, but I don't really understand the sapply syntax, I saw it in several other answers and was trying to make it work)
when I then type
class(data$Q001)
I get "Error: $ operator is invalid for atomic vectors"
If I do not try to use sapply, and I try to run a ttest, I've created subsets such as
data.2<-subset(data, D106 == "2")
data.3<-subset(data, D106 == "3")
and I use
t.test(data.2$Q001~data.3$Q001, na.rm=TRUE)
and I get "invalid type (NULL) for variable 'data.2$Q001'"
I tried using the different syntax, trying to see if I can get anything to work, and
t.test(data.2$Q001, data.3$Q001, na.rm=TRUE)
gives "In is.na(d) : is.na() applied to non-(list or vector) of type 'NULL'" and "In mean.default(x) : argument is not numeric or logical: returning NA"
So, now that I think I've been clear about what I'm trying to do and some of the things I've tried...
How can I import my data so that numbers (specifically any number in a column with a header starting with Q) are accurately read as numbers and do not get a NULL class applied to them? What do I need to do in order to get my data properly imported to run TTests on it? I've used TTests on plenty of data in the past, but it has always been data I recorded manually in excel (and thus had only one column of numbers with no blanks or NAs) and I've never had an issue, and I just do not understand what it is about this data set that I can't get it to work. Any assistance in the right direction is much appreciated!
This works for me:
> z <- read.table(textConnection("Part,Q001,Q002,LA003,Q004,SA005,D006
+ 1,5,3,text,99,text,3
+ 2,3,,text,2,text,2
+ "),header=TRUE,sep=",",na.strings=c("","99"))
> str(z)
'data.frame': 2 obs. of 7 variables:
$ Part : int 1 2
$ Q001 : int 5 3
$ Q002 : int 3 NA
$ LA003: Factor w/ 1 level "text": 1 1
$ Q004 : int NA 2
$ SA005: Factor w/ 1 level "text": 1 1
$ D006 : int 3 2
> z2 <- z[,!(colnames(z) %in% c("LA003","SA005"))]
> str(z2)
'data.frame': 2 obs. of 5 variables:
$ Part: int 1 2
$ Q001: int 5 3
$ Q002: int 3 NA
$ Q004: int NA 2
$ D006: int 3 2
> z2$Q001
[1] 5 3
> class(z2$Q001)
[1] "integer"
The only thing I can think of is that your second command (which was missing some terminating parentheses and brackets) didn't work at all, you missed seeing the error message, and you are referring to some previously defined data object that doesn't have the same columns defined. For example, class(z$QQQ) is NULL following the above example.
edit: it appears that the original problem was some weird/garbage characters in the header that messed up the name of the first column. Manually renaming the column (names(data)[1] <- "Q001") seems to have fixed the problem.
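As a side note on the other errors in the question: sapply() collapses a data frame into a matrix, which is why $ then fails with "invalid for atomic vectors". A rough sketch of the whole workflow follows. The first two data rows come from the question, the remaining rows are made up so that both groups have enough observations, and the object name survey is my own choice (used here to avoid masking the built-in data function):

```r
# File contents inlined so the example is self-contained; rows 3-5 are invented
csv <- "Part,Q001,Q002,LA003,Q004,SA005,D006
1,5,3,text,99,text,3
2,3,,text,2,text,2
3,2,4,,3,text,2
4,99,5,text,2,,3
5,4,2,,1,text,3"

data.all <- read.csv(text = csv, na.strings = c("", "99"))
names(data.all)   # inspect the header for stray characters before anything else

survey <- data.all[, !(names(data.all) %in% c("LA003", "SA005"))]

# sapply() returns a matrix here, so `$` no longer works on the result
m <- sapply(survey, as.numeric)

# Assigning into survey[] converts the columns but keeps the data frame
survey[] <- lapply(survey, as.numeric)

# Two-sample t-test on Q001 between demographic groups 2 and 3;
# t.test() drops NA values itself, so no na.rm argument is needed
g2 <- survey$Q001[survey$D006 == 2]
g3 <- survey$Q001[survey$D006 == 3]
res <- t.test(g2, g3)
res$p.value
```

Note that the formula interface expects outcome ~ group, e.g. t.test(Q001 ~ D006, data = ...) on a data frame restricted to two groups, rather than one column on each side of the tilde.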
Many surveys have codes for different kinds of missingness. For instance, a codebook might indicate:
0-99 Data
-1 Question not asked
-5 Do not know
-7 Refused to respond
-9 Module not asked
Stata has a beautiful facility for handling these multiple kinds of missingness, in that it allows you to assign a generic . to missing data, but more specific kinds of missingness (.a, .b, .c, ..., .z) are allowed as well. All the commands which look at missingness report answers for all the missing entries however specified, but you can sort out the various kinds of missingness later on as well. This is particularly helpful when you believe that refusal to respond has different implications for the imputation strategy than does question not asked.
I have never run across such a facility in R, but I would really like to have this capability. Are there any ways of marking several different types of NA? I could imagine creating more data (either a vector of length nrow(my.data.frame) containing the types of missingness, or a more compact index of which rows had what types of missingness), but that seems pretty unwieldy.
I know what you are looking for, and it is not implemented in R. I don't know of a package that implements it, but it's not too difficult to code yourself.
A workable way is to add a dataframe to the attributes, containing the codes. To prevent doubling the whole dataframe and save space, I'd add the indices in that dataframe instead of reconstructing a complete dataframe.
eg :
NACode <- function(x,code){
Df <- sapply(x,function(i){
i[i %in% code] <- NA
i
})
id <- which(is.na(Df))
rowid <- (id - 1L) %% nrow(x) + 1L  # 1-based row; plain id %% nrow(x) gives 0 for the last row
colid <- (id - 1L) %/% nrow(x) + 1
NAdf <- data.frame(
id,rowid,colid,
value = as.matrix(x)[id]
)
Df <- as.data.frame(Df)
attr(Df,"NAcode") <- NAdf
Df
}
This allows you to do:
> Df <- data.frame(A = 1:10,B=c(1:5,-1,-2,-3,9,10) )
> code <- list("Missing"=-1,"Not Answered"=-2,"Don't know"=-3)
> DfwithNA <- NACode(Df,code)
> str(DfwithNA)
'data.frame': 10 obs. of 2 variables:
$ A: num 1 2 3 4 5 6 7 8 9 10
$ B: num 1 2 3 4 5 NA NA NA 9 10
- attr(*, "NAcode")='data.frame': 3 obs. of 4 variables:
..$ id : int 16 17 18
..$ rowid: int 6 7 8
..$ colid: num 2 2 2
..$ value: num -1 -2 -3
The function can also be adjusted to add an extra attribute that gives you the label for the different values; see also this question. You could back-transform with:
ChangeNAToCode <- function(x,code){
NAval <- attr(x,"NAcode")
for(i in which(NAval$value %in% code))
x[NAval$rowid[i],NAval$colid[i]] <- NAval$value[i]
x
}
> Dfback <- ChangeNAToCode(DfwithNA,c(-2,-3))
> str(Dfback)
'data.frame': 10 obs. of 2 variables:
$ A: num 1 2 3 4 5 6 7 8 9 10
$ B: num 1 2 3 4 5 NA -2 -3 9 10
- attr(*, "NAcode")='data.frame': 3 obs. of 4 variables:
..$ id : int 16 17 18
..$ rowid: int 6 7 8
..$ colid: num 2 2 2
..$ value: num -1 -2 -3
This lets you change back only the codes you want, if that is ever necessary. The function can be adapted to restore all codes when no argument is given. Similar functions can be constructed to extract data based on the code; I guess you can figure that one out yourself.
In short: using attributes and indices might be a nice way of doing it.
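As a rough illustration of such an extraction function, here is a compact, self-contained sketch. It is a variation on the idea rather than the code above: the positions come from which(..., arr.ind = TRUE) instead of integer arithmetic, and the helper name rows_with_code is made up.

```r
# Toy data from the example above
Df <- data.frame(A = 1:10, B = c(1:5, -1, -2, -3, 9, 10))
codes <- c(-1, -2, -3)

# Logical matrix marking the coded cells
hit <- sapply(Df, function(col) col %in% codes)

# Record row, column and original value of every coded cell
pos <- which(hit, arr.ind = TRUE)
NAdf <- data.frame(rowid = pos[, "row"], colid = pos[, "col"],
                   value = as.matrix(Df)[hit])

# Blank out the coded cells and attach the bookkeeping as an attribute
Df[hit] <- NA
attr(Df, "NAcode") <- NAdf

# Hypothetical helper: return the rows in which a given code occurred
rows_with_code <- function(x, code) {
  na <- attr(x, "NAcode")
  x[unique(na$rowid[na$value %in% code]), , drop = FALSE]
}

out <- rows_with_code(Df, -2)
out
```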
The most obvious way seems to use two vectors:
Vector 1: a data vector, where all missing values are represented using NA. For example, c(2, 50, NA, NA)
Vector 2: a factor indicating the type of each entry. For example, factor(c(1, 1, -1, -7)), where level 1 indicates a correctly answered question.
Having this structure would give you a great deal of flexibility, since all the standard na.rm arguments still work with your data vector, but you can use more complex concepts with the factor vector.
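A minimal sketch of this layout, with the two vectors held side by side in a data frame. The column names value and status and the level labels are my own choices, based on the codebook in the question:

```r
# Values with plain NA, plus a parallel factor recording why a value is missing
answers <- data.frame(
  value  = c(2, 50, NA, NA),
  status = factor(c(1, 1, -1, -7),
                  levels = c(1, -1, -7),
                  labels = c("answered", "not asked", "refused"))
)

# Standard NA handling works unchanged on the data vector
avg <- mean(answers$value, na.rm = TRUE)

# The factor lets you count or select one specific kind of missingness
table(answers$status)
refused <- subset(answers, status == "refused")
```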
Update following questions from @gsk3
"Data storage will dramatically increase": The data storage will double. However, if doubling the size causes a real problem, it may be worth thinking about other strategies.
"Programs don't automatically deal with it": That's a strange comment. Some functions handle NAs in a sensible way by default. However, you want to treat the NAs differently, which implies that you will have to do something bespoke. If you just want to analyse data where the NAs are "Question not asked", then use a data frame subset.
"Now you have to manipulate two vectors together every time you want to conceptually manipulate a variable": I suppose I envisaged a data frame of the two vectors. I would subset the data frame based on the second vector.
"There's no standard implementation, so my solution might differ from someone else's": True. However, if an off-the-shelf package doesn't meet your needs, then (almost) by definition you want to do something different.
I should state that I have never analysed survey data (although I have analysed large biological data sets). My answers above appear quite defensive, but that's not my intention. I think your question is a good one, and I'm interested in other responses.
This is more than just a "technical" issue. You should have a thorough statistical background in missing value analysis and imputation. One solution requires playing with R and ggobi. You can assign extremely negative values to several types of NA (put NAs into margin), and do some diagnostics "manually". You should bear in mind that there are three types of NA:
MCAR - missing completely at random, where P(missing|observed,unobserved) = P(missing)
MAR - missing at random, where P(missing|observed,unobserved) = P(missing|observed)
MNAR - missing not at random (or non-ignorable), where P(missing|observed,unobserved) cannot be quantified in any way.
IMHO this question is more suitable for CrossValidated.
But here's a link from SO that you may find useful:
Handling missing/incomplete data in R--is there function to mask but not remove NAs?
You can dispense with NA entirely and just use the coded values. You can then also roll them up to a global missing value. I often prefer to code without NA, since NA can cause problems in coding and I like to be able to control exactly what is going into the analysis. I have also used the string "NA" to represent NA, which often makes things easier.
-Ralph Winters
I usually use them as values, as Ralph already suggested, since the type of missing value seems to be data, but on one or two occasions where I mainly wanted it for documentation I have used an attribute on the value, e.g.
> a <- NA
> attr(a, 'na.type') <- -1
> print(a)
[1] NA
attr(,"na.type")
[1] -1
That way my analysis is clean but I still keep the documentation. But as I said: usually I keep the values.
Allan.
I'd like to add to the "statistical background component" here. Statistical Analysis with Missing Data is a very good read on this.