After merging, missing values have a spurious value - R

After merging, the resulting dataset doesn't have the same number of non-missing values (the variable on which I merge has no duplicates in either dataset); instead it has the same number of missing values - meaning I get 72229 spurious values, all exactly equal to one value in the second dataset.
There is just one row in the second dataset with that spurious value, and it looks perfectly normal. If I set its value to missing, I get the desired result (except for the one missing value). If I set its value to 1000, I get 72229 copies of 1000 in the result. So I thought it was something about that row, but if I try to make a reproducible example using that row and some others, the error doesn't occur.
I was unable to build a reproducible example from a subset of the data small enough that I could comfortably share it, so I'm mostly soliciting sage advice. Advice on how to reproduce the problem would also be welcome.
> table(is.na(cogdj$int.youth))
FALSE TRUE
1731 178
> sum(duplicated(soep$PERSNR))
[1] 0
> sum(duplicated(cogdj$PERSNR))
[1] 0
> soep = merge(soep,cogdj[,c('PERSNR','ana','ded','mat','ari','int.youth')],by="PERSNR",all.x=T,incomparables=NA)
> table(is.na(soep$int.youth))
FALSE TRUE
73959 178
> nrow(soep[which(round(soep$int.youth,20)==1.6737269266506955567),])
[1] 72229
> nrow(cogdj[which(round(cogdj$int.youth,20)==1.6737269266506955567),])
[1] 1
> cogdj[which(round(cogdj$int.youth,20)==1.6737269266506955567),c('PERSNR','int.youth')]
PERSNR int.youth
1 609104 1.673727
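In case it helps with diagnosis, here is a hedged sketch of checks I would try (column names as in the question; none of this is from the original post): confirm that the merge key is stored the same way in both data frames, and cross-check the result against a plain match() lookup, which sidesteps merge() and the incomparables argument entirely.
> str(soep$PERSNR)    # same storage type (numeric/character/factor) in both?
> str(cogdj$PERSNR)
> chk <- cogdj$int.youth[match(soep$PERSNR, cogdj$PERSNR)]   # 'chk' is a hypothetical helper
> table(is.na(chk))   # if the keys align, this should show the expected non-missing count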

Related

Combining Numeric Values as Data Frame in R to Run EstCRM

I'm in the middle of doing some research, using RStudio (R version 3.6.1) to analyze the gathered data. The data consist of continuous item responses ranging from 0 to 100. Because the responses are continuous, I have to use the Continuous Response Model (CRM) to perform the Item Response Theory (IRT) analysis, so I decided to use the EstCRM package (version 1.4) created by Cengiz Zopluoglu.
Below is an example of the data I use. In this example the data set consists of 3 variables (items), named PA1 through PA3, and 5 participants. The actual data set consists of 100+ items and 100+ participants.
> str(Data_ManDown)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 5 obs. of 3 variables:
 $ PA1: num 100 75 20 49 90
 $ PA2: num 100 75 80 100 80
 $ PA3: num 0 30 40 100 80
The data which I named "Data_ManDown" is to be run in this EstCRM package, particularly EstCRMitem.
CRM <- EstCRMitem(Data_ManDown[,1:3],
max.item=c(100,100,100),
min.item=c(0,0,0),
max.EMCycle=500,
converge=.01,
type="Wang&Zeng",
BFGS=TRUE)
CRM$param
The problem I encountered starts when running the EstCRMitem command. The console shows this notification.
The column vectors are not numeric. Please check your data
When I checked the data, it seems that the data I uploaded from Excel is treated as a list.
> class(Data_ManDown)
[1] "tbl_df" "tbl" "data.frame"
> typeof(Data_ManDown)
[1] "list"
> is.numeric(Data_ManDown)
[1] FALSE
Then I decided to coerce my data into numeric using as.numeric and put it back into a data frame so it could be run in the EstCRMitem command. I found that the data must not only be numeric but also be a data frame.
> iPA1 <- as.numeric(Data_ManDown[[1]])
> iPA2 <- as.numeric(Data_ManDown[[2]])
> iPA3 <- as.numeric(Data_ManDown[[3]])
>
> is.numeric(iPA1)
[1] TRUE
> is.numeric(iPA2)
[1] TRUE
> is.numeric(iPA3)
[1] TRUE
> ManDown_Num <- as.data.frame(cbind(PA1 = iPA1, PA2 = iPA2, PA3 = iPA3))
>
> is.numeric(ManDown_Num)
[1] FALSE
> class(ManDown_Num)
[1] "data.frame"
> typeof(ManDown_Num)
[1] "list"
Unfortunately, the data came back as a list when I combined the 3 variables back into one data frame, so EstCRMitem still failed to run. Alternatively, I have also tried a few other ways, such as using unlist or putting iPA1 through iPA3 into separate data frames. That worked, but it is not really practical because I would have to analyze the items one by one. Given the large amount of data, that method is a last resort.
All in all, the main question is: is there an alternative method that produces a data.frame with many elements (many participant rows and many item columns) whose columns are all numeric at the same time, so that it can be analyzed with the EstCRMitem command?
On a side note, I also have looked for references in other questions that may be similar and might help such as:
How to convert a data frame column to numeric type?
Unlist all list elements in a dataframe
Although I find that this case may be a bit different, so there's no apparent solution yet. Thank you for taking the time to look into this question and for the help.
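Not an answer from the thread, just a sketch of the usual fix under the assumption that the Excel import left some columns non-numeric: coerce every column and rebuild a plain data.frame in one step, then check the columns rather than the whole object (a data frame is a list of columns by construction, so is.numeric() and typeof() on the whole frame will always report FALSE and "list").
> ManDown_Num <- as.data.frame(lapply(Data_ManDown, as.numeric))
> sapply(ManDown_Num, is.numeric)   # every column should now be TRUE
 PA1  PA2  PA3
TRUE TRUE TRUE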

Missing rows after subsetting datatable on a single column

I have a data.table, DT, with columns A, B and C. I want only one A per unique B, and I want to choose that A based on the value of C (choose the largest C).
Based on this (incredibly helpful) SO page, Use data.table to get first of subgroup based on a variable, I tried something like this:
test <- data.table(A=c(1:3,1:2),B=c(1:5),C=c(11:15))
setkey(test,A,C)
test[,.SD[.N],by="A"]
In my test case, this gives me an answer that seems right:
# A B C
# 1: 1 6 16
# 2: 2 7 17
# 3: 3 8 18
# 4: 4 4 14
# 5: 5 5 15
And, as expected, the number of rows matches the number of unique entries for "A" in my DT:
length(unique(test$A))
# 5
However, when I apply this to my actual dataset, I am missing approximately 20% of my initially ~2 million rows.
I cannot seem to put together a test dataset that will recreate this type of a loss. There are no null values in the actual dataset. What else could be a factor in a dataset that would cause a discrepancy between the number of results from something like test[,.SD[.N],by="A"] and length(unique(test$A))?
Thanks to @Eddi's debugging coaching, here's the answer, at least for my dataset: differential handling of numbers in scientific notation.
In particular: in my actual dataset, columns A and B were very long numbers that, upon import from SQL to R, had been read in scientific notation. It turns out that test[,.SD[.N],by="A"] and length(unique(test$A)) handled this differently: length(unique(test$A)) preserved the difference between two values that differed only in a small digit not visible in the collapsed scientific-notation output, whereas test[,.SD[.N],by="A"] was, in essence, rounding the values and thus collapsing some of them together.
(I feel foolish that I didn't catch this myself before posting, but much appreciate the help - I hope somehow this spares someone else the same confusion, perhaps!)
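For anyone who runs into the same thing, a hedged, generic illustration of the underlying hazard (plain R behaviour, not the poster's data): IDs too large for exact double-precision representation can silently collapse, and older data.table versions additionally rounded the last bytes of numeric keys when grouping (see ?setNumericRounding). Importing such ID columns as character (or bit64::integer64) avoids both problems.
> id1 <- 123456789012345678   # hypothetical IDs beyond 2^53
> id2 <- 123456789012345679
> id1 == id2
[1] TRUE
> length(unique(c(id1, id2)))
[1] 1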

R is returning an ID that is not in the data frame?

I have a data frame with several variables that represent ID numbers (the data frames in the workspace are all originally tables from a normalized database). I was surprised to see that I am sometimes able to reference an ID's description before I use the merge to map the description in, but only if I use the $ notation. For example: I set up data frame q to include the variable "LocationID". Then I do the following...
Example for 1 & 2:
> colnames(q)
[1] "LocationID" "PlanID" "Rate"
> sort(unique(q[,'Location'])) #This fails. duh
Error in `[.data.frame`(q, , "Location") : undefined columns selected
> sort(unique(q$Location)) #This works. what?
[1] 1 2 3
Questions
1. Why does the second one work? Maybe that's looking a gift horse in the mouth.
2. Why doesn't the first one work if the second one does?
3. For the above example, q is constructed from another data frame with more variables. This fails for the larger data frame. Why does it fail?
Example for 3:
> dim(y)
[1] 207171 86
q<-y[,cbind('LocationID','PlanID','Rate')]
> dim(q)
[1] 207171 3
> unique(y$Location)
NULL
> unique(q$Location)
[1] 1 2 3
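A hedged, self-contained illustration of what is probably going on here (toy data, not the poster's): $ on a data frame does partial name matching, while [ requires an exact column name, and once more than one column shares the prefix the partial match becomes ambiguous and $ returns NULL.
> d <- data.frame(LocationID = 1:3)
> d$Location              # partial match to "LocationID"
[1] 1 2 3
> d$LocationName <- c("a", "b", "c")
> d$Location              # ambiguous prefix, so the partial match fails
NULL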

Easy or default way to exclude rows with NA values from individual operations and comparisons

I work with survey data, where missing values are the rule rather than the exception. My datasets always have lots of NAs, and for simple statistics I usually want to work with cases that are complete on the subset of variables required for that specific operation, and ignore the other cases.
Most of R's base functions return NA if there are any NAs in the input. Additionally, subsets using comparison operators will return a row of NAs for any row with an NA on one of the variables. I literally never want either of these behaviors.
I would like for R to default to excluding rows with NAs for the variables it's operating on, and returning results for the remaining rows (see example below).
Here are the workarounds I currently know about:
Specify na.rm=T: Not too bad, but not all functions support it.
Add !is.na() to all comparison operations: Works, but it's annoying and error-prone to do this by hand, especially when there are multiple variables involved.
Use complete.cases(): Not helpful because I don't want to exclude cases that are missing any variable, just the variables being used in the current operation.
Create a new data frame with the desired cases: Often each row is missing a few scattered variables. That means that every time I wanted to switch from working with one variable to another, I'd have to explicitly create a new subset.
Use imputation: Not always appropriate, especially when computing descriptives or just examining the data.
I know how to get the desired results for any given case, but dealing with NAs explicitly for every piece of code I write takes up a lot of time. Hopefully there's some simple solution that I'm missing. But complex or partial solutions would also be welcome.
Example:
> z<-data.frame(x=c(413,612,96,8,NA), y=c(314,69,400,NA,8888))
# current behavior:
> z[z$x < z$y ,]
x y
3 96 400
NA NA NA
NA.1 NA NA
# Desired behavior:
> z[z$x < z$y ,]
x y
3 96 400
# What I currently have to do in order to get the desired output:
> z[(z$x < z$y) & !is.na(z$x) & !is.na(z$y) ,]
x y
3 96 400
One trick for dealing with NAs in inequalities when subsetting is to do
z[which(z$x < z$y),]
# x y
# 3 96 400
The which() silently drops NA values.
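For completeness, two other standard approaches behave the same way as the which() trick (this is general base-R/dplyr behaviour, not part of the original answer): subset() documents that missing values in the condition are taken as false, and dplyr::filter() likewise keeps only rows where the condition is TRUE.
> subset(z, x < y)        # rows where the condition is NA are dropped
   x   y
3 96 400
> # library(dplyr); filter(z, x < y) would give the same rows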

global condition on vector

Hi,
I would like to perform a comparison operation on my vector: it contains numerical values that I want to transform with 2^x. However, if any value is greater than 65000 after the transformation, I would like no transformation to be applied to the entire vector.
Currently I'm trying this:
final <- ifelse(2^vec > 65000, vec, 2^vec)
It works, but element by element: if a value is greater than 65000 after transformation this code returns the initial value, but if it doesn't exceed 65000 it returns the transformed value, so I end up with a mixed vector of transformed and non-transformed values.
Here is an example:
> vec
32.82 576.47 36.45 78.93 8.77 63.28 176.86 1.88 291.97 35.59
And the result after running my code:
> final
32.820000 576.470000 36.450000 78.930000 436.549065 63.280000 176.860000 3.680751 291.970000 35.590000
Here, you can see that some values have been transformed and some not. In this situation I would like final = vec. I tried using a "break" instead of vec for the "yes" branch of the ifelse, but it doesn't work. Probably something like that could work, but I don't know what.
If someone has an idea ^^
Thanks
How's this?
log_if_bigger = function(vec, thresh){
  if (any(vec > thresh)) {
    return(log2(vec))
  } else {
    return(vec)
  }
}
Usage:
# if any values are bigger than 0, then log - here there are:
> log_if_bigger(c(1,2,3,4),0)
[1] 0.000000 1.000000 1.584963 2.000000
# if any values are bigger than 9, then log - here there aren't:
> log_if_bigger(c(1,2,3,4),9)
[1] 1 2 3 4
Then you just want something like:
final = log_if_bigger(vec, 65000)
or possibly:
final = log_if_bigger(vec, log2(65000))
based on your condition where you test 2^vec>65000
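If the goal is exactly what the question describes (apply 2^x only when no transformed value exceeds 65000, and otherwise leave the vector untouched), a small variant of the same idea should do it; this is a sketch only, and the helper name pow2_if_below is mine, not from the thread:
pow2_if_below <- function(vec, thresh = 65000){
  out <- 2^vec
  if (any(out > thresh)) {
    return(vec)   # at least one transformed value is too large: keep the original
  }
  out
}
final <- pow2_if_below(vec)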
