I'm in the middle of doing some research, using RStudio (with R version 3.6.1) to analyze the data I've gathered. The data comprise continuous item responses ranging from 0 to 100. Because the responses are continuous, I have to use the Continuous Response Model (CRM) to perform the Item Response Theory (IRT) analysis, so I decided to use the EstCRM package (version 1.4) created by Cengiz Zopluoglu.
Below is an example of the data I use. In this example the data set consists of three variables (items), named PA1 through PA3, and five participants. The actual data set consists of 100+ items and 100+ participants.
> str(Data_ManDown)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 5 obs. of 3 variables:
$ PA1: num 100 75 20 49 90
$ PA2: num 100 75 80 100 80
$ PA3: num 0 30 40 100 80
The data, which I named "Data_ManDown", is passed to the EstCRM package, specifically to the EstCRMitem function.
CRM <- EstCRMitem(Data_ManDown[, 1:3],
                  max.item = c(100, 100, 100),
                  min.item = c(0, 0, 0),
                  max.EMCycle = 500,
                  converge = .01,
                  type = "Wang&Zeng",
                  BFGS = TRUE)
CRM$param
The problem I encountered starts when running the EstCRMitem command. The console shows this message:
The column vectors are not numeric. Please check your data
When I checked the data, it turned out that the data I imported from Excel is treated as a list.
> class(Data_ManDown)
[1] "tbl_df" "tbl" "data.frame"
> typeof(Data_ManDown)
[1] "list"
> is.numeric(Data_ManDown)
[1] FALSE
Then I decided to coerce my data to numeric using as.numeric and put it back into a data frame so it could be run through the EstCRMitem command. I found that the data must not only be numeric but also remain a data frame.
> iPA1 <- as.numeric(Data_ManDown[[1]])
> iPA2 <- as.numeric(Data_ManDown[[2]])
> iPA3 <- as.numeric(Data_ManDown[[3]])
>
> is.numeric(iPA1)
[1] TRUE
> is.numeric(iPA2)
[1] TRUE
> is.numeric(iPA3)
[1] TRUE
> ManDown_Num <- as.data.frame(cbind(PA1 = iPA1, PA2 = iPA2, PA3 = iPA3))
>
> is.numeric(ManDown_Num)
[1] FALSE
> class(ManDown_Num)
[1] "data.frame"
> typeof(ManDown_Num)
[1] "list"
Unfortunately, the data came back as a list when I combined the three variables into one data frame, so EstCRMitem still failed to run. I have also tried a few other approaches, such as using unlist or putting iPA1 through iPA3 into separate single-column data frames. That worked, but it is not really effective, because I would have to analyze the items one by one. Considering the large amount of data, this method is preferably a last resort.
All in all, the main question is: is there (or could there possibly be) a method that turns this kind of data into a data.frame with many elements (many participant rows and many item columns) whose columns are numeric at the same time, so that it can be analyzed with the EstCRMitem command?
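To illustrate, the kind of one-step conversion I am hoping for would look something like this sketch (I don't know whether something along these lines exists or behaves the way I need):
# Hoped-for one-liner: coerce every column to numeric and
# return a plain base-R data frame in a single step
ManDown_Num <- as.data.frame(lapply(Data_ManDown, as.numeric))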
On a side note, I also have looked for references in other questions that may be similar and might help such as:
How to convert a data frame column to numeric type?
Unlist all list elements in a dataframe
I find that this case may be a bit different, though, so there's no apparent solution yet. Also, thank you for taking the time to look into this question and for the help.
I have a data frame with several variables that represent ID numbers (the data frames in the workspace are all originally tables from a normalized database). I was surprised to see that I am sometimes able to reference an ID's description before I use the merge to map the description in, but only if I use the $ notation. For example: I set up data frame q to include the variable "LocationID". Then I do the following...
Example for 1 & 2:
> colnames(q)
[1] "LocationID" "PlanID" "Rate"
> sort(unique(q[,'Location'])) #This fails. duh
Error in `[.data.frame`(q, , "Location") : undefined columns selected
> sort(unique(q$Location)) #This works. what?
[1] 1 2 3
Questions
1. Why does the second one work? Maybe that's looking a gift horse in the mouth.
2. Why doesn't the first one work if the second one does?
3. For the above example, q is constructed from another data frame with more variables. This fails for the larger data frame. Why does it fail?
Example for 3:
> dim(y)
[1] 207171 86
> q <- y[,cbind('LocationID','PlanID','Rate')]
> dim(q)
[1] 207171 3
> unique(y$Location)
NULL
> unique(q$Location)
[1] 1 2 3
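For anyone comparing: this looks like the partial name matching that $ performs on data frames, while [ requires an exact column name. A minimal sketch of the same behavior (column values invented):
# $ partially matches column names; [ does not
q <- data.frame(LocationID = c(1, 2, 3), PlanID = 4:6, Rate = 7:9)
q$Location        # partial match on "LocationID": returns 1 2 3
q[, "Location"]   # error: undefined columns selected
y <- cbind(q, LocationName = c("x", "y", "z"))
y$Location        # NULL: "Location" now matches two columns, so the match is ambiguous
That ambiguity would also explain why the 86-column data frame returns NULL in question 3.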
Factors can help prevent some kinds of programming errors in R: you cannot perform an equality check between factors that use different levels, and you are warned when performing greater/less-than checks on unordered factors.
a <- factor(letters[1:3])
b <- factor(letters[1:3], levels=letters[4:1])
a == b
## Error in Ops.factor(a, b) : level sets of factors are different
a < a
## [1] NA NA NA
## Warning message:
## In Ops.factor(a, a) : < not meaningful for factors
However, contrary to my expectation, this check is not performed when merging data frames:
ad <- data.frame(x=a, a=as.numeric(a))
bd <- data.frame(x=b, b=as.numeric(b))
merge(ad, bd)
## x a b
## 1 a 1 4
## 2 b 2 3
## 3 c 3 2
Those factors simply seem to be coerced to characters.
Is a "safe merge" available somewhere that would do the check? Do you see specific reasons for not doing this check by default?
Example (real-life use case): Assume two spatial data sets with very similar but not identical subdivision in, say, communes. The data sets refer to slightly different points in time, and some of the communes have merged during that time span. Each data set has a "commune ID" column, perhaps even named identically. While the semantics of this column are very similar, I wouldn't want to (accidentally) merge the data sets over this commune ID column. Instead, I construct a matching table between "old" and "new" commune IDs. If the commune IDs are encoded as factors, a "safe merge" would give a correctness check for the merge operation at no extra (implementation) cost and very little computational cost.
The "safe guard" with merge is the by= parameter. You can set exactly which columns you think should match. If you match up two factor columns, R will use the the labels for those values to match them up. So "a" will match with "a" regardless of how the hidden inner working of factor have coded those values. That's what a user sees, so that's how it will be merged. It's just like with numeric values, you can choose to merge on columns that have complete different ranges (the first column has 1:10, the second has 100:1000). When the by value is set, R will do what it's asked. And if you don't explicitly set the by parameter, then R will find all shared column names in the two data.frames and use that.
And many times when merging, you don't always expect matches. Sometimes you're using all.x or all.y to specifically get unmatched records. In this case, depending on how the different data.frames were created, one may not know about the levels it doesn't have. So it's not at all unreasonable to try to merge them.
So basically R treats factors like characters during merging, because it assumes that you already know that the two columns belong together.
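A minimal sketch of that label-based matching, reusing the ad/bd frames from the question with an explicit by=:
# merge pairs rows by the labels "a", "b", "c",
# not by the underlying integer codes (1:3 in ad, 4:2 in bd)
a  <- factor(letters[1:3])
b  <- factor(letters[1:3], levels = letters[4:1])
ad <- data.frame(x = a, a = as.numeric(a))
bd <- data.frame(x = b, b = as.numeric(b))
merge(ad, bd, by = "x")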
Well, with much credit (and apologies) to MrFlick:
> attributes(ad$x)
$levels
[1] "a" "b" "c"
$class
[1] "factor"
> attributes(ad$a)
NULL
> attributes(bd$b)
NULL
> adfoo<-merge(ad,bd)
> attributes(adfoo$x)
$levels
[1] "a" "b" "c"
$class
[1] "factor"
So in fact the merged column $x is a factor, although only levels common to both ad and bd are merged. The other columns were coerced via as.numeric long ago.
I have a matrix with individual column names (the row names are not important), like this
TestMat<-matrix(1:25,ncol=5,nrow=5)
colnames(TestMat)<-c("A","B","C","D","E")
TestMat
For various reasons, but mostly because a package will later need it, I can't alter the values in the matrix and they all have to be integers.
Now I want to categorize my column names (e.g. A, B, and D into "Group 1"; C and E into "Group 2"). The idea is that the matrix will get smaller later on, as values in the matrix are randomly diminished. As soon as a column sum reaches zero, that column will be dropped. Along this process I want to see how the fraction/size of one group changes compared to the other groups.
I thought the easiest way would be to just name all the corresponding columns identical:
TestMat2<-matrix(1:25,ncol=5,nrow=5)
colnames(TestMat2)<-c("Group1","Group1","Group2","Group1","Group2")
TestMat2
But this gives me error messages later on in the analysis, as R starts numbering the identical column names along the lines of "Group1", "Group1.1", "Group2", "Group1.2", "Group2.1".
I have tried my luck with the "class", "attr" and "factor" commands on my column names, but got nowhere.
Is there a trick or command, I've maybe never heard of?
As per the comments: why not put the grouping in another variable? Then something like:
> TestMat <- matrix(1:25, ncol=5, nrow=5)
> colnames(TestMat) <- c("A","B","C","D","E")
> grp <- factor(c("Group1","Group1","Group2","Group1","Group2"))  # grp rather than F, which is shorthand for FALSE
... do something to your matrix ...
> summary(grp[colSums(TestMat) > 40])
Group1 Group2
     1      2
Is that it? (Substitute 40 with 0.)
The Bioconductor package Biobase defines a class ExpressionSet that allows annotations on the rows and columns of a matrix:
library(Biobase)
exprs = matrix(1:25,ncol=5,nrow=5, dimnames=list(NULL, LETTERS[1:5]))
df = data.frame(grp=c("Group1","Group1","Group2","Group1","Group2"),
                row.names=colnames(exprs))
eset = ExpressionSet(exprs, AnnotatedDataFrame(df))
You can access columns in the data frame with $, subset with [, and extract with exprs(), e.g.,
> exprs(eset[, eset$grp == "Group1"])
A B D
1 1 6 16
2 2 7 17
3 3 8 18
4 4 9 19
5 5 10 20
or
> eset[,colSums(exprs(eset)) > 40]$grp
[1] Group2 Group1 Group2
Levels: Group1 Group2
The GenomicRanges package defines a similar class SummarizedExperiment when the rows are annotated with genomic ranges.
This coordinated integration of data and annotation on data is a really good thing, reducing the chance for 'clerical' errors when matrix and annotation are independent; I'm surprised so many comments suggest that you separately maintain two structures.
Thanks for all the helpful comments. I haven't posted here since my original post, because I first wanted to try all promising approaches and find a final solution to my problem.
I tried the Biobase package with its option for annotations, as well as Stephen's idea of grouping everything via a second variable.
As it turned out, as soon as the matrix diminished in size (as part of the analysis), the external grouping failed, because the column numbers and the grouping no longer matched, and I couldn't find a way to combine the Bioconductor approach with my code.
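For the record, one way to keep the two in sync would be to subset the matrix and the grouping factor with the same logical index (a sketch, assuming the grouping factor grp from the answer above):
# drop zero-sum columns and the matching grouping entries in one step
keep    <- colSums(TestMat) > 0
TestMat <- TestMat[, keep, drop = FALSE]   # drop = FALSE keeps it a matrix
grp     <- droplevels(grp[keep])           # grouping stays aligned with the columns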
I found a (somewhat roundabout) solution, though, if anybody cares:
I already stated that if I name my columns identically for grouping, R later numbers my groups, so they are no longer identical.
But then I just searched for the first however-many letters necessary to identify the proper group:
length(colnames(TestMat2)[substr(colnames(TestMat2),1,6) == "Group1"])
This way I can always check the fraction of one group of columns versus the others.
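For what it's worth, the same count can be written a bit more directly (a sketch; startsWith() needs R >= 3.3):
# count the columns whose name starts with "Group1"
sum(startsWith(colnames(TestMat2), "Group1"))
sum(grepl("^Group1", colnames(TestMat2)))   # equivalent, via a regular expression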
Thanks for your answers and help. I learned a lot and I think Bioconductor will come in handy in the future.
Cheers!
I am a relative newcomer to R. I have searched for the last two workdays trying to figure this out and failed. I have a list of factors generated by a function. I have 9 items in the list of different lengths.
>summary(list_dataframes)
Length Class Mode
[1,] 1757 factor numeric
[2,] 1776 factor numeric
[3,] 1737 factor numeric
[4,] 1766 factor numeric
[5,] 1783 factor numeric
[6,] 1751 factor numeric
[7,] 1744 factor numeric
[8,] 1749 factor numeric
[9,] 1757 factor numeric
Part of a sample of the data as it comes out:
list_dataframes
[[1]]
[1] 1776234_at 1779003_at 1776344_at 1777664_at 1772541_at 1774525_at
[[2]]
[1] 1771703_at 1776299_at 1772744_at 1780116_at 1775451_at 1778821_at
[7] 1774342_at
[[3]]
[1] 1780116_at 1776262_at 1775451_at 1780200_at 1775704_at
I am not sure why it says the Mode is "numeric". The individual entries are a mix of numbers and letters, like "S35_at".
I would like to make this into a table of nine columns and 1783 rows without creating duplicate values. (Hence I tried using do.call, and it didn't work; I ended up with a mess full of duplicates.) The shorter ones can have NAs in the empty spaces or be blank.
I need to be able to eventually end up with something I can put into a spread sheet.
There has to be a way to do this. Thank you!
I guess I should add that this initially came out as data frames when I had four columns of data. But I only need one column, and when I changed the function that creates this list so that it returns only the one column I actually need, the elements seem to no longer be data frames.
dput(head(list_dataframes))
list(structure(c(3605L, 5065L, 3663L, 4349L, 1655L, 2700L, 5692L, plus many more
.Label = c("1769308_at",
"1769311_at", "1769312_at", "1769313_at", "1769314_at", "1769317_at", plus many more
this pattern is repeated nine more times
What I am trying to do is produce a table that would look like this:
a= xyz,tuv,efg,hij,def
b= xyz,tuv,efg
c= tuv,efg,hij,def
What I want to make is a table that is
a   b   c
xyz xyz tuv
tuv tuv efg
efg efg hij
hij NA  def
def NA  NA
NA could be blank as well.
After much reading of the manual section on lists, I determined that I had generated a buried list of lists: it had nine items, with the data I wanted buried two layers down, i.e. to see it I had to use [[1]]. It was further complicated by the fact that subsetting a data frame down to a single column drops it to a factor instead of leaving it a data frame. To fix it (sort of) I added one step to my function that changed that factor back into a data frame.
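For context, a minimal sketch of that single-column drop behavior (probe IDs invented in the style of the data above):
df <- data.frame(probe = factor(c("1776234_at", "1779003_at")))
class(df[, 1])                # "factor": single-bracket subsetting drops to a vector
class(df[, 1, drop = FALSE])  # "data.frame": drop = FALSE keeps the frame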
After that, when I used lapply to generate my result, at least the factor issue was resolved. I could then use the following steps to pull the data frames out.
library(gdata)  # provides cbindX(), which pads shorter columns with NA

first   <- list_dataframes[[1]]
second  <- list_dataframes[[2]]
third   <- list_dataframes[[3]]
fourth  <- list_dataframes[[4]]
fifth   <- list_dataframes[[5]]
sixth   <- list_dataframes[[6]]
seventh <- list_dataframes[[7]]
eighth  <- list_dataframes[[8]]
ninth   <- list_dataframes[[9]]
all_results <- cbindX(first, second, third, fourth, fifth, sixth, seventh, eighth, ninth)
I could then write the csv file using write.csv and get the correct result I was after. So I guess I have my answer; it does work now.
However I still think I am missing something in making this work optimally even though it is now giving me the correct result I was after.
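For comparison, a compact base-R alternative that avoids extracting the nine elements by hand (a sketch; lengths() needs R >= 3.2, and the file name is made up):
# pad each list element with NA to a common length, then bind the columns
maxlen <- max(lengths(list_dataframes))
padded <- lapply(list_dataframes,
                 function(x) c(as.character(x), rep(NA, maxlen - length(x))))
all_results <- do.call(cbind, padded)
write.csv(all_results, "all_results.csv", row.names = FALSE)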
Factor-class variables are vectors of integer mode with an attached attribute, a character vector specifying the labels to be used in displaying the integer values. I would think the safest way to bind these together would be to convert the factor columns to character class and then merge with all=TRUE. Why not post a simple example with three dataframes or factors, of length 10, 9 and 8, that has whatever level of complexity is in your data? I cannot actually discern the structure for sure from the summary output.
If you want to make them all factors with a common set of levels, then use this:
shared_levels <- unique(unlist(lapply(list_dataframes, levels)))  # pool the levels of all nine factors
length(shared_levels)
new_list <- lapply(list_dataframes, factor, levels=shared_levels)
As stated in the comment, I still do not understand what sort of table you imagine being produced. Need a concrete example.
Many surveys have codes for different kinds of missingness. For instance, a codebook might indicate:
0-99 Data
-1 Question not asked
-5 Do not know
-7 Refused to respond
-9 Module not asked
Stata has a beautiful facility for handling these multiple kinds of missingness, in that it allows you to assign a generic . to missing data, but more specific kinds of missingness (.a, .b, .c, ..., .z) are allowed as well. All the commands which look at missingness report answers for all the missing entries however specified, but you can sort out the various kinds of missingness later on as well. This is particularly helpful when you believe that refusal to respond has different implications for the imputation strategy than does question not asked.
I have never run across such a facility in R, but I would really like to have this capability. Are there any ways of marking several different types of NA? I could imagine creating more data (either a vector of length nrow(my.data.frame) containing the types of missingness, or a more compact index of which rows had what types of missingness), but that seems pretty unwieldy.
I know what you're looking for, and it is not implemented in R. I have no knowledge of a package where it is implemented, but it's not too difficult to code it yourself.
A workable way is to add a data frame containing the codes to the attributes. To avoid doubling the whole data frame and to save space, I'd store the indices in that attribute data frame instead of reconstructing a complete data frame. E.g.:
NACode <- function(x, code){
  # replace every coded value with NA, column by column
  Df <- sapply(x, function(i){
    i[i %in% code] <- NA
    i
  })
  # record where the NAs are and which code each cell held
  id    <- which(is.na(Df))
  rowid <- (id - 1L) %% nrow(x) + 1L   # 1-based row index
  colid <- (id - 1L) %/% nrow(x) + 1   # 1-based column index
  NAdf <- data.frame(
    id, rowid, colid,
    value = as.matrix(x)[id]
  )
  Df <- as.data.frame(Df)
  attr(Df, "NAcode") <- NAdf
  Df
}
This allows you to do:
> Df <- data.frame(A = 1:10,B=c(1:5,-1,-2,-3,9,10) )
> code <- list("Missing"=-1,"Not Answered"=-2,"Don't know"=-3)
> DfwithNA <- NACode(Df,code)
> str(DfwithNA)
'data.frame': 10 obs. of 2 variables:
$ A: num 1 2 3 4 5 6 7 8 9 10
$ B: num 1 2 3 4 5 NA NA NA 9 10
- attr(*, "NAcode")='data.frame': 3 obs. of 4 variables:
..$ id : int 16 17 18
..$ rowid: int 6 7 8
..$ colid: num 2 2 2
..$ value: num -1 -2 -3
The function can also be adjusted to add an extra attribute that gives you the labels for the different values; see also this question. You could back-transform with:
ChangeNAToCode <- function(x, code){
  # restore the original coded values, but only for the requested codes
  NAval <- attr(x, "NAcode")
  for(i in which(NAval$value %in% code))
    x[NAval$rowid[i], NAval$colid[i]] <- NAval$value[i]
  x
}
> Dfback <- ChangeNAToCode(DfwithNA,c(-2,-3))
> str(Dfback)
'data.frame': 10 obs. of 2 variables:
$ A: num 1 2 3 4 5 6 7 8 9 10
$ B: num 1 2 3 4 5 NA -2 -3 9 10
- attr(*, "NAcode")='data.frame': 3 obs. of 4 variables:
..$ id : int 16 17 18
..$ rowid: int 6 7 8
..$ colid: num 2 2 2
..$ value: num -1 -2 -3
This allows you to change only the codes you want, if that is ever necessary. The function can be adapted to return all codes when no argument is given. Similar functions can be constructed to extract data based on the code; I guess you can figure that one out yourself.
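For instance, a minimal sketch of such an extractor (the helper name is made up):
WhichNACode <- function(x, code){
  # look up which cells carried the given code(s)
  NAval <- attr(x, "NAcode")
  NAval[NAval$value %in% code, c("rowid", "colid", "value")]
}
WhichNACode(DfwithNA, c(-2, -3))   # the "Not Answered" and "Don't know" cells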
But in one line: using attributes and indices might be a nice way of doing it.
The most obvious way seems to use two vectors:
Vector 1: a data vector, where all missing values are represented using NA. For example, c(2, 50, NA, NA)
Vector 2: a vector of factors, indicating the type of data. For example, factor(c(1, 1, -1, -7)), where factor 1 indicates a correctly answered question.
Having this structure would give you a great deal of flexibility, since all the standard na.rm arguments still work with your data vector, but you can use more complex concepts with the factor vector.
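A minimal sketch of that pairing (the type labels are invented):
resp <- c(2, 50, NA, NA)                                   # data vector, plain NAs
type <- factor(c("data", "data", "not_asked", "refused"))  # parallel missingness type
mean(resp, na.rm = TRUE)    # standard NA handling still works
resp[type != "refused"]     # drop one specific kind of missingness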
Update following questions from @gsk3
"Data storage will dramatically increase": the data storage will double. However, if doubling the size causes a real problem, it may be worth thinking about other strategies.
"Programs don't automatically deal with it": that's a strange comment. Some functions by default handle NAs in a sensible way. However, you want to treat the NAs differently, so that implies you will have to do something bespoke. If you want to analyse only the data where the NAs are "Question not asked", then just use a data frame subset.
"Now you have to manipulate two vectors together every time you want to conceptually manipulate a variable": I suppose I envisaged a data frame of the two vectors. I would subset the data frame based on the second vector.
"There's no standard implementation, so my solution might differ from someone else's": true. However, if an off-the-shelf package doesn't meet your needs, then (almost) by definition you want to do something different.
I should state that I have never analysed survey data (although I have analysed large biological data sets). My answers above appear quite defensive, but that's not my intention. I think your question is a good one, and I'm interested in other responses.
This is more than just a "technical" issue. You should have a thorough statistical background in missing value analysis and imputation. One solution requires playing with R and GGobi. You can assign extremely negative values to the several types of NA (putting the NAs into the margin) and do some diagnostics "manually". You should bear in mind that there are three types of missingness:
MCAR - missing completely at random, where P(missing|observed,unobserved) = P(missing)
MAR - missing at random, where P(missing|observed,unobserved) = P(missing|observed)
MNAR - missing not at random (or non-ignorable), where P(missing|observed,unobserved) cannot be quantified in any way.
IMHO this question is more suitable for CrossValidated.
But here's a link from SO that you may find useful:
Handling missing/incomplete data in R--is there function to mask but not remove NAs?
You can dispense with NA entirely and just use the coded values. You can then also roll them up into a global missing value. I often prefer to code without NA, since NA can cause problems in coding and I like to be able to control exactly what goes into the analysis. I have also used the string "NA" to represent NA, which often makes things easier.
-Ralph Winters
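A minimal sketch of the roll-up Ralph describes, with codes mirroring the question's codebook:
x    <- c(10, 25, -1, -7, 40)   # negative values are the missingness codes
x_na <- replace(x, x < 0, NA)   # collapse every code into one global NA
mean(x_na, na.rm = TRUE)        # analyze with standard NA handling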
I usually use them as values, as Ralph already suggested, since the type of missing value seems to be data in its own right. But on one or two occasions where I mainly wanted it for documentation, I have used an attribute on the value, e.g.
> a <- NA
> attr(a, 'na.type') <- -1
> print(a)
[1] NA
attr(,"na.type")
[1] -1
That way my analysis is clean but I still keep the documentation. But as I said: usually I keep the values.
Allan.
I'd like to add to the "statistical background component" here. "Statistical Analysis with Missing Data" is a very good read on this.