Missing rows after subsetting datatable on a single column - r

I have a datatable, DT, with columns A, B and C. I want only one A per unique B, and I want to choose that A based on the value of C (choose the largest C).
Based on this (incredibly helpful) SO page, Use data.table to get first of subgroup based on a variable, I tried something like this:
test <- data.table(A=c(1:3,1:2),B=c(1:5),C=c(11:15))
setkey(test,A,C)
test[,.SD[.N],by="A"]
In my test case, this gives me an answer that seems right:
# A B C
# 1: 1 6 16
# 2: 2 7 17
# 3: 3 8 18
# 4: 4 4 14
# 5: 5 5 15
And, as expected, the number of rows matches the number of unique entries for "A" in my DT:
length(unique(test$A))
# 5
However, when I apply this to my actual dataset, I am missing approximately 20% of my initially ~2 million rows.
I cannot seem to put together a test dataset that will recreate this type of a loss. There are no null values in the actual dataset. What else could be a factor in a dataset that would cause a discrepancy between the number of results from something like test[,.SD[.N],by="A"] and length(unique(test$A))?

Thanks to #Eddi's debugging coaching, here's the answer, at least for my dataset: differential handling of numbers in scientific notation.
In particular: In my actual dataset, columns A and B were very long numbers that, upon import from SQL to R, had been imported in scientific notation. It turns out the test[,.SD[.N],by="A"] and length(unique(test$A)) commands were handling this differently: length(unique(test$A)) was preserving the difference between two values that differed only in a small digit that is not visible in the collapsed scientific notation format printed as visual output, but test[,.SD[.N],by="A"] was, in essence, rounding the values and thus collapsing some of them together.
(I feel foolish that I didn't catch this myself before posting, but much appreciate the help - I hope somehow this spares someone else the same confusion, perhaps!)

Related

R: Dropping variables using number of observations

I have a large dataset, and I'm trying to drop some of my variables based on how many observations each has. For instance, I would like to drop any variable in my dataframe where n < 3 (total observations for that variable is less than 3). Since R can count observations for each variable using describe, can't I use that number to subset the data instead of having to type in each variable name each time I pull in a new version (each version has different variables that will have low n's and there are over 40 variables). Thanks so much for your help!
For instance, my data looks like this:
ID Runaway Aggressive Emergency Hospitalization Injury
1 3 NA 4 1 NA
2 NA NA 2 1 NA
3 4 NA 6 2 3
4 1 NA 1 1 NA
I want to be able to drop "Aggressive" and "Injury" based on their n's being 0 and 1 respectively. However, instead of telling R to drop them by variable name, it would be much more convenient if it was possible to tell R to drop any variable where n < 3 (or whatever number I choose) as I'll be using this code for multiple versions of this dataset. I have tried using column numbers (which is better than writing them out) but it's still pretty tedious when I have to describe() the data, figure out which variables have low n's, and then drop 28 variables or subset() around them.
This works but it's cumbersome...
UIRCorrelation <- UIRKidUnique61[c(28, 30, 32, 34:38, 42, 54:74)]
For some reason, my example looks different when I'm editing versus when I save so I also included an image of it. Sorry. This is the first time I've ever used stack overflow to ask a question. I actually spent a lot of time googling this but couldn't find an answer relating to n.
This line did not work: DF[, sapply(DF, function(col) length(na.omit(col))) > 4]
DF being your dataframe
DF[, sapply(DF, function(col) length(na.omit(col))) > 4]
This function did the trick:
valid <- function(x) {sum(!is.na(x))}
N <- apply(UIRCorrelation,2,valid)
UIRCorrelation2 <- UIRCorrelation[N > 3]

Is there an easy way to rank values in R using two criteria (the second one being for the ties)?

I have been wondering whether there is a simple way of ranking values in R with two criteria: one for the main ranking and the other to rank the ties.
For instance, suppose we have the following sets of number:
a <- c(9,13,6,3,7,1,13)
b <- c(1,4,3,6,5,7,2)
Now, suppose we want to rank a using the information in b to handle the ties in rank(a), so we end up having the following:
> 5 7 3 2 4 1 6
Is there an easy way to get that in R? The options in rank to deal with ties are of no help for that.
PS: there is a similar question on rank and ties already, but it's not a duplicate since it's not really asking the same thing despite its title suggesting so: Is there a simple way to rank on multiple criteria that preserves ties in R?
Assuming that all ties are actually broken:
order(order(a, b))
#[1] 5 7 3 2 4 1 6
There are probably more efficient alternatives.

Changing vector of 1-10 to vector of 1-3 using R

I am using R to analyze a survey. Several of the columns include numbers 1-10, depending on how survey respondents answered the respective questions. I'd like to change the 1-10 scale to a 1-3 scale. Is there a simple way to do this? I was writing a complicated set of for loops and if statements, but I feel like there must be a better way in R.
I'd like to change numbers 1-3 to 1; numbers 4 and 8 to 2; numbers 5-7 to 3, and numbers 9 and 10 to NA.
So in the snippet below, OriginalColumn would become NewColumn.
OriginalColumn=c(4,9,1,10,8,3,2,7,5,6)
NewColumn=c(2,NA,1,NA,2,1,1,3,3,3)
Is there an easy way to do this without a bunch of crazy for loops? Thanks!
You can do this using positional indexing:
> c(1,1,1,2,3,3,3,2,NA,NA)[OriginalColumn]
[1] 2 NA 1 NA 2 1 1 3 3 3
It is better than repeated/nested ifelse because it is vectorized (thus easier to read, write, and understand; and probably faster). In essence, you're creating a new vector that contains that new values for every value you want to replace. So, for values 1:3 you want 1, thus the first three elements of the vector are 1, and so forth. You then use your original vector to extract the new values based on the positions of the original values.
You could also try
library(car)
recode(OriginalColumn, '1:3=1; c(4,8)=2; 5:7=3; else=NA')
#[1] 2 NA 1 NA 2 1 1 3 3 3

r - How can I "add" additional information to column names without altering the names themselves?

I have a matrix with individual column names (the row names are not important), like this
TestMat<-matrix(1:25,ncol=5,nrow=5)
colnames(TestMat)<-c("A","B","C","D","E")
TestMat
For various reasons, but mostly because a package will later need it, I can't alter the values in the matrix and they all have to be integers.
Now I want to categorize my colum names (e.g. A, B and D into "Group 1" and C and E into "Group 2"). The idea is, that the matrix will get smaller later on, as values in the matrix are randomly diminished. As soon as a column-sum reaches zero, that column will be dropped. Along this process I want to see how the fraction/size of one group changes, compared to the other groups.
I thought the easiest way would be to just name all the corresponding columns identical:
TestMat2<-matrix(1:25,ncol=5,nrow=5)
colnames(TestMat2)<-c("Group1","Group1","Group2","Group1","Group2")
TestMat2
But this gives me error-messages later on in the analysis, as R starts numbering the identical column-names in a way of "Group1" "Group1.1" "Group2" "Group1.2" "Group2.1".
I have tried my luck with "class", "attr" and "factor" commands to my column names, but don't get anywhere.
Is there a trick or command, I've maybe never heard of?
as per the comments why not put the grouping in another variable then something like:
> TestMat<-matrix(1:25,ncol=5,nrow=5)
> colnames(TestMat)<-c("A","B","C","D","E")
> F=factor(c("Group1","Group1","Group2","Group1","Group2"))
... do something to your matrix...
> summary(F[colSums(TestMat) >= 40])
Group1 Group2
1 2
Is that it (subs. 40 for 0)?
The Bioconductor package Bioboase defines a class ExpressionSet that allows annotations on rows and columns of a matrix
library(Biobase)
exprs = matrix(1:25,ncol=5,nrow=5, dimnames=list(NULL, LETTERS[1:5]))
df = data.frame(grp=c("Group1","Group1","Group2","Group1","Group2"),
row.names=colnames(exprs))
eset = ExpressionSet(exprs, AnnotatedDataFrame(df))
You can access columns in the data frame with $, subset with [, and extract with exprs(), e.g.,
> exprs(eset[, eset$grp == "Group1"])
A B D
1 1 6 16
2 2 7 17
3 3 8 18
4 4 9 19
5 5 10 20
or
> eset[,colSums(exprs(eset)) > 40]$grp
[1] Group2 Group1 Group2
Levels: Group1 Group2
The GenomicRanges package defines a similar class SummarizedExperiment when the rows are annotated with genomic ranges.
This coordinated integration of data and annotation on data is a really good thing, reducing the chance for 'clerical' errors when matrix and annotation are independent; I'm surprised so many comments suggest that you separately maintain two structures.
Thanks for all the helpful comments. I haven't posted here since my original post, because I first wanted to try all promising approaches and find a final solution to my problem.
I tried the Biobase package with its option for annotations, as well as Stephen's idea of grouping everything via a second variable.
As it turned out, as soon as the matrix diminished in size (as a part of the analysis) the external grouping failed, as column-numbers and grouping didn't match anymore and I couldn't find a way to combine the Bioconductor approach and my code.
I found a (somewhat roundabout) solution, though, if anybody cares:
I already stated, that, if I group my column-names identical for grouping, R later numbers my groups and they are thus not idential any longer.
But I then just searched for the first such-and-such neccessary letters to identify the proper group:
length(colnames(TestMat2)[substr(colnames(TestMat2),1,6) == "Group1"])
This way I can always check the fraction of one group of columns versus the others.
Thanks for your answers and help. I learned a lot and I think Bioconductor will come in handy in the future.
Cheers!

count of entries in data frame in R

I'm looking to get a count for the following data frame:
> Santa
Believe Age Gender Presents Behaviour
1 FALSE 9 male 25 naughty
2 TRUE 5 male 20 nice
3 TRUE 4 female 30 nice
4 TRUE 4 male 34 naughty
of the number of children who believe. What command would I use to get this?
(The actual data frame is much bigger. I've just given you the first four rows...)
Thanks!
You could use table:
R> x <- read.table(textConnection('
Believe Age Gender Presents Behaviour
1 FALSE 9 male 25 naughty
2 TRUE 5 male 20 nice
3 TRUE 4 female 30 nice
4 TRUE 4 male 34 naughty'
), header=TRUE)
R> table(x$Believe)
FALSE TRUE
1 3
I think of this as a two-step process:
subset the original data frame according to the filter supplied
(Believe==FALSE); then
get the row count of this subset
For the first step, the subset function is a good way to do this (just an alternative to ordinary index or bracket notation).
For the second step, i would use dim or nrow
One advantage of using subset: you don't have to parse the result it returns to get the result you need--just call nrow on it directly.
so in your case:
v = nrow(subset(Santa, Believe==FALSE)) # 'subset' returns a data.frame
or wrapped in an anonymous function:
>> fnx = function(fac, lev){nrow(subset(Santa, fac==lev))}
>> fnx(Believe, TRUE)
3
Aside from nrow, dim will also do the job. This function returns the dimensions of a data frame (rows, cols) so you just need to supply the appropriate index to access the number of rows:
v = dim(subset(Santa, Believe==FALSE))[1]
An answer to the OP posted before this one shows the use of a contingency table. I don't like that approach for the general problem as recited in the OP. Here's the reason. Granted, the general problem of how many rows in this data frame have value x in column C? can be answered using a contingency table as well as using a "filtering" scheme (as in my answer here). If you want row counts for all values for a given factor variable (column) then a contingency table (via calling table and passing in the column(s) of interest) is the most sensible solution; however, the OP asks for the count of a particular value in a factor variable, not counts across all values. Aside from the performance hit (might be big, might be trivial, just depends on the size of the data frame and the processing pipeline context in which this function resides). And of course once the result from the call to table is returned, you still have to parse from that result just the count that you want.
So that's why, to me, this is a filtering rather than a cross-tab problem.
sum(Santa$Believe)
You can do summary(santa$Believe) and you will get the count for TRUE and FALSE
DPLYR makes this really easy.
x<-santa%>%
count(Believe)
If you wanted to count by a group; for instance, how many males v females believe, just add a group_by:
x<-santa%>%
group_by(Gender)%>%
count(Believe)
A one-line solution with data.table could be
library(data.table)
setDT(x)[,.N,by=Believe]
Believe N
1: FALSE 1
2: TRUE 3
using sqldf fits here:
library(sqldf)
sqldf("SELECT Believe, Count(1) as N FROM Santa
GROUP BY Believe")

Resources