Factors can help prevent some kinds of programming errors in R: you cannot perform equality checks between factors that use different level sets, and you are warned when performing greater/less-than comparisons on unordered factors.
a <- factor(letters[1:3])
b <- factor(letters[1:3], levels=letters[4:1])
a == b
## Error in Ops.factor(a, b) : level sets of factors are different
a < a
## [1] NA NA NA
## Warning message:
## In Ops.factor(a, a) : < not meaningful for factors
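For contrast, the equality check works fine once both factors have the same level set:
a == factor(c("a", "c", "c"), levels = letters[1:3])
## [1]  TRUE FALSE  TRUE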
However, contrary to my expectation, this check is not performed when merging data frames:
ad <- data.frame(x=a, a=as.numeric(a))
bd <- data.frame(x=b, b=as.numeric(b))
merge(ad, bd)
## x a b
## 1 a 1 4
## 2 b 2 3
## 3 c 3 2
Those factors simply seem to be coerced to characters.
Is a "safe merge" available somewhere that would do the check? Do you see specific reasons for not doing this check by default?
Example (real-life use case): Assume two spatial data sets with very similar but not identical subdivision in, say, communes. The data sets refer to slightly different points in time, and some of the communes have merged during that time span. Each data set has a "commune ID" column, perhaps even named identically. While the semantics of this column are very similar, I wouldn't want to (accidentally) merge the data sets over this commune ID column. Instead, I construct a matching table between "old" and "new" commune IDs. If the commune IDs are encoded as factors, a "safe merge" would give a correctness check for the merge operation at no extra (implementation) cost and very little computational cost.
The "safe guard" with merge is the by= parameter. You can set exactly which columns you think should match. If you match up two factor columns, R will use the the labels for those values to match them up. So "a" will match with "a" regardless of how the hidden inner working of factor have coded those values. That's what a user sees, so that's how it will be merged. It's just like with numeric values, you can choose to merge on columns that have complete different ranges (the first column has 1:10, the second has 100:1000). When the by value is set, R will do what it's asked. And if you don't explicitly set the by parameter, then R will find all shared column names in the two data.frames and use that.
And many times when merging, you don't always expect matches. Sometimes you're using all.x or all.y specifically to get unmatched records. In this case, depending on how the different data.frames were created, one may not know about the levels it doesn't have. So it's not at all unreasonable to try to merge them.
So basically R is treating factors like characters during merging, because it assumes that you already know that the two columns belong together.
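That said, if you want the check the question asks for, it is easy to add yourself. Here is a minimal sketch (safe_merge is a made-up name) that mimics the Ops.factor level-set check before delegating to merge():
safe_merge <- function(x, y, by = intersect(names(x), names(y)), ...) {
  for (col in by) {
    if (is.factor(x[[col]]) && is.factor(y[[col]]) &&
        !identical(levels(x[[col]]), levels(y[[col]])))
      stop("level sets of '", col, "' are different")
  }
  merge(x, y, by = by, ...)
}
safe_merge(ad, bd)
## Error in safe_merge(ad, bd) : level sets of 'x' are different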
Well, with much credit (and apologies to) MrFlick:
> attributes(ad$x)
$levels
[1] "a" "b" "c"
$class
[1] "factor"
> attributes(ad$a)
NULL
> attributes(ad$b)
NULL
> adfoo<-merge(ad,bd)
> attributes(adfoo$x)
$levels
[1] "a" "b" "c"
$class
[1] "factor"
So in fact the merged column $x is a factor, although only the levels common to both ad and bd are kept. The other columns were coerced via as.numeric long ago.
I read a csv file into a data frame named rr. The character column was treated as a factor, which was nice.
Do I understand correctly that the levels are just the unique values of the column? i.e.
levels(rr$col) == unique(rr$col)
Then I wanted to strip leading and trailing whitespace. (I didn't know about the strip.white option in read.csv.)
So I did
rr$col = str_trim(rr$col)  # str_trim() is from the stringr package
Now the rr$col is no longer a factor. So I did
rr$col = as.factor(rr$col)
But I see now that levels(rr$col) is missing some unique values! Why?
"Level" is a special property of a variable (column). They are handy because they are retained even if a subset does not contain any values from a specific level. Take for example
x <- as.factor(rep(letters[1:3], each = 3))
If we subset only elements under levels a and b, c is left out. It will be detected with levels(), but not unique(). The latter will see which values appear in the subset only.
> x[c(1,2, 4)]
[1] a a b
Levels: a b c
> levels(x[c(1,2, 4)])
[1] "a" "b" "c"
> unique(x[c(1,2, 4)])
[1] a b
Levels: a b c
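If you want the levels to shrink so that they match the values actually present, base R's droplevels() removes the unused ones:
> droplevels(x[c(1,2, 4)])
[1] a a b
Levels: a b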
I work with survey data, where missing values are the rule rather than the exception. My datasets always have lots of NAs, and for simple statistics I usually want to work with cases that are complete on the subset of variables required for that specific operation, and ignore the other cases.
Most of R's base functions return NA if there are any NAs in the input. Additionally, subsets using comparison operators will return a row of NAs for any row with an NA on one of the variables. I literally never want either of these behaviors.
I would like for R to default to excluding rows with NAs for the variables it's operating on, and returning results for the remaining rows (see example below).
Here are the workarounds I currently know about:
Specify na.rm=TRUE: Not too bad, but not all functions support it.
Add !is.na() to all comparison operations: Works, but it's annoying and error-prone to do this by hand, especially when there are multiple variables involved.
Use complete.cases(): Not helpful because I don't want to exclude cases that are missing any variable, just the variables being used in the current operation.
Create a new data frame with the desired cases: Often each row is missing a few scattered variables. That means that every time I wanted to switch from working with one variable to another, I'd have to explicitly create a new subset.
Use imputation: Not always appropriate, especially when computing descriptives or just examining the data.
I know how to get the desired results for any given case, but dealing with NAs explicitly for every piece of code I write takes up a lot of time. Hopefully there's some simple solution that I'm missing. But complex or partial solutions would also be welcome.
Example:
> z<-data.frame(x=c(413,612,96,8,NA), y=c(314,69,400,NA,8888))
# current behavior:
> z[z$x < z$y ,]
x y
3 96 400
NA NA NA
NA.1 NA NA
# Desired behavior:
> z[z$x < z$y ,]
x y
3 96 400
# What I currently have to do in order to get the desired output:
> z[(z$x < z$y) & !is.na(z$x) & !is.na(z$y) ,]
x y
3 96 400
One trick for dealing with NAs in inequalities when subsetting is to do
z[which(z$x < z$y),]
# x y
# 3 96 400
The which() silently drops NA values.
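subset() behaves the same way: its condition treats NA as FALSE, so rows with NAs in the compared columns are silently dropped.
subset(z, x < y)
#    x   y
# 3 96 400
And complete.cases() need not consider every variable; passing it just the columns in use addresses the concern about it excluding too much:
z[complete.cases(z[c("x", "y")]), ]
#     x   y
# 1 413 314
# 2 612  69
# 3  96 400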
I have a matrix with individual column names (the row names are not important), like this
TestMat<-matrix(1:25,ncol=5,nrow=5)
colnames(TestMat)<-c("A","B","C","D","E")
TestMat
For various reasons, but mostly because a package will later need it, I can't alter the values in the matrix and they all have to be integers.
Now I want to categorize my column names (e.g. A, B and D into "Group1" and C and E into "Group2"). The idea is that the matrix will get smaller later on, as values in the matrix are randomly diminished. As soon as a column sum reaches zero, that column will be dropped. Along this process I want to see how the fraction/size of one group changes compared to the other groups.
I thought the easiest way would be to just name all the corresponding columns identical:
TestMat2<-matrix(1:25,ncol=5,nrow=5)
colnames(TestMat2)<-c("Group1","Group1","Group2","Group1","Group2")
TestMat2
But this gives me error messages later on in the analysis, as R deduplicates the identical column names into "Group1", "Group1.1", "Group2", "Group1.2", "Group2.1".
I have tried my luck with the "class", "attr" and "factor" commands on my column names, but got nowhere.
Is there a trick or command, I've maybe never heard of?
As per the comments, why not put the grouping in another variable? Then something like:
> TestMat<-matrix(1:25,ncol=5,nrow=5)
> colnames(TestMat)<-c("A","B","C","D","E")
> grp <- factor(c("Group1","Group1","Group2","Group1","Group2"))  # grp rather than F, which masks FALSE
... do something to your matrix ...
> summary(grp[colSums(TestMat) > 40])
Group1 Group2 
     1      2 
Is that it (with 40 substituted here for your 0)?
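If the matrix loses columns as the analysis proceeds, one way to keep the external grouping in sync (a sketch building on the above) is to give the grouping vector the column names and index it by colnames():
grp <- c(A = "Group1", B = "Group1", C = "Group2", D = "Group1", E = "Group2")
keep <- colSums(TestMat) > 40              # stand-in for "column sum reaches zero"
TestMat <- TestMat[, keep, drop = FALSE]
table(grp[colnames(TestMat)])
# Group1 Group2 
#      1      2 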
The Bioconductor package Biobase defines a class ExpressionSet that allows annotations on the rows and columns of a matrix:
library(Biobase)
exprs = matrix(1:25,ncol=5,nrow=5, dimnames=list(NULL, LETTERS[1:5]))
df = data.frame(grp=c("Group1","Group1","Group2","Group1","Group2"),
row.names=colnames(exprs))
eset = ExpressionSet(exprs, AnnotatedDataFrame(df))
You can access columns in the data frame with $, subset with [, and extract with exprs(), e.g.,
> exprs(eset[, eset$grp == "Group1"])
A B D
1 1 6 16
2 2 7 17
3 3 8 18
4 4 9 19
5 5 10 20
or
> eset[,colSums(exprs(eset)) > 40]$grp
[1] Group2 Group1 Group2
Levels: Group1 Group2
The GenomicRanges package defines a similar class, SummarizedExperiment, for when the rows are annotated with genomic ranges.
This coordinated integration of data and annotation is a really good thing, reducing the chance for 'clerical' errors that arise when matrix and annotation are maintained independently; I'm surprised so many comments suggest that you separately maintain two structures.
Thanks for all the helpful comments. I haven't posted here since my original post, because I first wanted to try all promising approaches and find a final solution to my problem.
I tried the Biobase package with its option for annotations, as well as Stephen's idea of grouping everything via a second variable.
As it turned out, as soon as the matrix diminished in size (as part of the analysis), the external grouping failed, since column numbers and grouping no longer matched, and I couldn't find a way to combine the Bioconductor approach with my code.
I found a (somewhat roundabout) solution, though, if anybody cares:
I already stated that, if I name my columns identically for grouping, R later numbers my groups, and they are thus no longer identical.
But I then just searched for however many leading letters are necessary to identify the proper group:
length(colnames(TestMat2)[substr(colnames(TestMat2),1,6) == "Group1"])
This way I can always check the fraction of one group of columns versus the others.
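The same count can also be written a little more directly, since the comparison already yields a logical vector that sum() will count:
sum(substr(colnames(TestMat2), 1, 6) == "Group1")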
Thanks for your answers and help. I learned a lot and I think Bioconductor will come in handy in the future.
Cheers!
I am a relative newcomer to R. I have spent the last two workdays searching for a way to do this and failed. I have a list of factors generated by a function; the list has 9 items of different lengths.
> summary(list_dataframes)
Length Class Mode
[1,] 1757 factor numeric
[2,] 1776 factor numeric
[3,] 1737 factor numeric
[4,] 1766 factor numeric
[5,] 1783 factor numeric
[6,] 1751 factor numeric
[7,] 1744 factor numeric
[8,] 1749 factor numeric
[9,] 1757 factor numeric
Part of a sample of the data as it comes out:
list_dataframes
[[1]]
[1] 1776234_at 1779003_at 1776344_at 1777664_at 1772541_at 1774525_at
[[2]]
[1] 1771703_at 1776299_at 1772744_at 1780116_at 1775451_at 1778821_at
[7] 1774342_at
[[3]]
[1] 1780116_at 1776262_at 1775451_at 1780200_at 1775704_at
I am not sure why it says the Mode is "numeric". The individual entries are a mix of numbers and letters, like "S35_at".
I would like to make this into a table of nine columns and 1783 rows without duplicating any values. (Hence I tried using do.call, and it didn't work; I ended up with a mess full of duplicates.) The shorter columns can have NAs in the empty spaces or be blank.
I need to be able to eventually end up with something I can put into a spread sheet.
There has to be a way to do this. Thank you!
I guess I should add that this initially came out as data frames, back when I had four columns of data. But I only need one of the columns, and when I changed the function that creates this list so that it returns just that one column, the result seems to no longer be a data frame.
dput(head(list_dataframes))
list(structure(c(3605L, 5065L, 3663L, 4349L, 1655L, 2700L, 5692L, plus many more
.Label = c("1769308_at",
"1769311_at", "1769312_at", "1769313_at", "1769314_at", "1769317_at", plus many more
this pattern is repeated nine more times
What I am trying to do is produce a table that would look like this:
a= xyz,tuv,efg,hij,def
b= xyz,tuv,efg
c= tuv,efg,hij,def
What I want to make is a table that is
a    b    c
xyz  xyz  tuv
tuv  tuv  efg
efg  efg  hij
hij  NA   def
def  NA   NA
NA could be blank as well.
After much reading of the manual section on lists, I determined that I had generated a buried list of lists. It had nine items, with the data I wanted buried two layers down, i.e. to see it I had to use [[1]]. It was further complicated by the fact that subsetting a data frame down to a single column drops it to a factor rather than keeping it as a data frame. To fix it (sort of) I added one step to my function that turns that factor back into a data frame.
After that, when I used lapply to generate my result, at least the factor issue was resolved. I could then use the following steps to pull the data frames out.
library(gdata)  # provides cbindX(), which cbinds objects with differing row counts, padding with NA

first <- list_dataframes[[1]]
second <- list_dataframes[[2]]
third <- list_dataframes[[3]]
fourth <- list_dataframes[[4]]
fifth <- list_dataframes[[5]]
sixth <- list_dataframes[[6]]
seventh <- list_dataframes[[7]]
eighth <- list_dataframes[[8]]
ninth <- list_dataframes[[9]]
all_results <- cbindX(first, second, third, fourth, fifth, sixth, seventh, eighth, ninth)
I could then write the csv file using write.csv and get the correct result I was after. So I guess I have my answer; it does work now.
However I still think I am missing something in making this work optimally even though it is now giving me the correct result I was after.
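For what it's worth, one simplification under the same approach: do.call() can hand the whole list to gdata's cbindX() at once, so the nine intermediate variables aren't needed:
library(gdata)
all_results <- do.call(cbindX, list_dataframes)
write.csv(all_results, "all_results.csv", row.names = FALSE)  # file name is illustrative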
Factor-class variables are vectors of integer mode with an attached attribute, a character vector, specifying the labels to be used in displaying the integer values. I would think the safest way to bind these together would be to convert the factor columns to character class and then merge with all=TRUE. Why not post a simple example with three dataframes or factors of length 10, 9 and 8 that has whatever level of complexity is in your data? I cannot actually discern the structure for sure from the summary output.
If you want to make them all factors with a common set of levels, then use this:
shared_levels <- unique(unlist(lapply(list_dataframes, levels)))
length(shared_levels)
new_list <- lapply(list_dataframes, factor, levels=shared_levels)
As stated in the comment, I still do not understand what sort of table you imagine being produced. Need a concrete example.
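If the goal is simply the nine-column table padded with NA, a minimal base-R sketch (assuming each list element is a factor, as the summary output suggests) would be:
n <- max(lengths(list_dataframes))
padded <- lapply(list_dataframes, function(f) {
  out <- as.character(f)
  length(out) <- n   # assigning a longer length pads with NA
  out
})
result <- do.call(cbind, padded)   # 1783 x 9 character matrix
write.csv(result, "result.csv", row.names = FALSE)  # file name is illustrative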
I have a large data frame (570 rows by 200000 columns) in R. For those of you that are familiar with PLINK, I am trying to create a PED file for a GWAS analysis. Plink requires that each missing character be coded with a zero. The non-missing values are "A", "T", "C", or "G".
So, for example, the data structure looks like this in the data frame.
COL1 COL2
PT1 A T
PT2 T T
PT3 A A
PT4 A T
PT5 0 0
PT6 A A
PT7 T A
PTn T T
When I run my file in Plink, I get an error. I went back to check my file in R and found that the zeros were "character" type. Is it possible to have two different data types (numeric and character) in a given column in R? I've tried making the 0's a numeric type while keeping the letters as character type, but it won't work.
I think Justin's advice will probably fix the problem you have with Plink, but I want to answer the question you put in bold...
Is it possible to have two different data types (numeric and character) in a given column in R?
Not really, but in this particular scenario, since it is a discrete variable, kind of yes. R has the factor basic type, an enumerated type in some other languages.
For example try this:
x = factor(c("0","A","C","G","T"),levels=c(0,"A","T","G","C"))
print(x)
[1] 0 A C G T
Levels: 0 A T G C
You can transform them back to integers (the first level is 1 by default) and to characters:
> as.integer(x)
[1] 1 2 5 4 3
> as.character(x)
[1] "0" "A" "C" "G" "T"
Now, when you read a table with read.table, you can indicate that all character columns should be read as factors, even those with quotes around them.
mydata = read.table("yourData.tsv", sep="\t", stringsAsFactors=TRUE)
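To then recode every genotype column of a data frame with this common level set (a sketch; df stands in for your data frame of "A"/"T"/"C"/"G"/"0" columns):
geno_levels <- c("0", "A", "T", "G", "C")
df[] <- lapply(df, factor, levels = geno_levels)  # df is hypothetical; [] keeps the data.frame shape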