model.matrix dropping a column - r

I have a data frame that I want to use to generate a design matrix.
>ct<-read.delim(filename, skip=0, as.is=TRUE, sep="\t", row.names = 1)
> ct
s2 s6 S10 S14 S3 S7 S11 S15 S4 S8 S12 S16
group 1 1 1 1 2 2 2 2 3 3 3 3
donor 1 2 3 4 1 2 3 4 1 2 3 4
>factotum<-apply(ct,1,as.factor) # to turn rows into factors.
>design <- model.matrix(~0 + factotum[,1] + factotum[,2])
Eventually, I'll generate a string and use as.formula() instead of hard coding the formula. Anyway, this works and produces a design matrix. It leaves a column out though.
>design
factotum[, 1]1 factotum[, 1]2 factotum[, 1]3 factotum[, 2]2 factotum[, 2]3 factotum[, 2]4
1 1 0 0 0 0 0
2 1 0 0 1 0 0
3 1 0 0 0 1 0
4 1 0 0 0 0 1
5 0 1 0 0 0 0
6 0 1 0 1 0 0
7 0 1 0 0 1 0
8 0 1 0 0 0 1
9 0 0 1 0 0 0
10 0 0 1 1 0 0
11 0 0 1 0 1 0
12 0 0 1 0 0 1
By my reasoning, the column names should be:
factotum[, 1]1 factotum[, 1]2 factotum[, 1]3, factotum[,2]1, factotum[, 2]2 factotum[, 2]3 factotum[, 2]4. These would be renamed as group1,group2,group3,donor1,donor2,donor3,donor4.
Which means that factotum[,2]1, or donor1, is missing. What am I doing that causes it to be left out? Any help would be appreciated.
Cheers
Ben.

There are several things here.
(1) apply(ct,1,as.factor) doesn't turn the rows into factors. Try str(factotum) and you'll see that it failed: apply() simplifies its result back into a character matrix. I'm not sure what the fastest way is, but this should work:
factotum <- data.frame(lapply(data.frame(t(ct)), as.factor))
(2) Since you are working with factors, model.matrix creates dummy coding. In this case, donor has four values. If you are 2, then you get a 1 in the column factotum[,2]2. If you are 3 or 4, you get a 1 in their respective columns. So what if you are a 1? Well, that simply means that you are 0 in all three columns. In this way, you only need three columns to create four groups. The value 1 for donor is called the reference group here, which is the group with which the other groups are compared.
(3) So now the question is... Why doesn't group (or factotum[,1]) have only TWO columns? We could easily code three levels with two columns, right? Well... yes, this is exactly what happens when you use:
design <- model.matrix(~ factotum[,1] + factotum[,2])
However, since you specify that there is no intercept, you'll get an extra column for group.
(4) Usually you don't have to create the design matrix yourself. I'm not sure what function you want to use next, but in most cases the functions take care of it for you.
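To make points (2) and (3) concrete, here is a minimal sketch that rebuilds group and donor as plain factors (assuming the same 3 x 4 layout as in the question) and compares the column names with and without the intercept. The cbind() step is only an illustration: it shows that a donor1 column would add no new information, because the matrix rank stays the same.

```r
# Factors laid out as in the question: 3 groups x 4 donors
group <- factor(rep(1:3, each = 4))
donor <- factor(rep(1:4, times = 3))

with_intercept <- model.matrix(~ group + donor)
no_intercept   <- model.matrix(~ 0 + group + donor)

colnames(with_intercept)
# "(Intercept)" "group2" "group3" "donor2" "donor3" "donor4"
colnames(no_intercept)
# "group1" "group2" "group3" "donor2" "donor3" "donor4"

# Illustration only: append the "missing" donor1 column by hand
full <- cbind(no_intercept, donor1 = as.numeric(donor == "1"))

# The rank does not increase, so donor1 is redundant:
# group1 + group2 + group3 equals donor1 + donor2 + donor3 + donor4
rank_without <- qr(no_intercept)$rank
rank_with    <- qr(full)$rank
```

Either way the matrix has six independent columns. Dropping the intercept only changes which six are shown.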

Related

How to add multiple values to data.frame without loop?

Suppose I have a matrix D that holds death counts per year for specific ages.
I want to fill this matrix with the appropriate death counts, which are stored in
the vector Age, but the following code gives me the wrong answer. How should I write the code without a loop?
# Year and age grid for tables
Years=c(2007:2017)
Ages=c(60:70)
#Data.frame of deaths
D=data.frame(matrix(ncol=length(Years),nrow=length(Ages))); D[is.na(D)]=0
colnames(D)=Years
rownames(D)=Ages
Age=c(60,61,62,65,65,65,68,69,60)
year=2010
D[as.character(Age),as.character(year)]<-
D[as.character(Age),as.character(year)]+1
D[,'2010'] # 1 1 1 0 0 1 0 0 1 1 0
# Should be 2 1 1 0 0 3 0 0 1 1 0
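Indexed assignment evaluates the right-hand side once, so an age that appears several times in the index is still only incremented once. You need to use table to count the ages first: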
AgeTable = table(Age)
D[names(AgeTable), as.character(year)] = AgeTable
D[,'2010']
[1] 2 1 1 0 0 3 0 0 1 1 0
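Note that the assignment above overwrites whatever is already stored in that column. If the counts should be added to existing values instead, something along these lines might work. This is only a sketch: the pre-existing count of 5 for age 60 is a made-up value for illustration, and col is a helper name that is not in the question.

```r
Years <- 2007:2017
Ages  <- 60:70

# Data frame of deaths, initialised to zero
D <- as.data.frame(matrix(0, nrow = length(Ages), ncol = length(Years)))
colnames(D) <- as.character(Years)
rownames(D) <- as.character(Ages)

# Hypothetical count already recorded for age 60 in 2010
D["60", "2010"] <- 5

Age  <- c(60, 61, 62, 65, 65, 65, 68, 69, 60)
year <- 2010

counts <- table(Age)          # one entry per distinct age
col    <- as.character(year)

# Each distinct age is indexed once, so duplicates are no longer a problem
D[names(counts), col] <- D[names(counts), col] + as.vector(counts)

D[, col]
# 7 1 1 0 0 3 0 0 1 1 0
```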

merge multiple columns with condition

I have a data frame like this:
Q17a_17 Q17a_18 Q17a_19 Q17a_20 Q17a_21 Q17a_22 Q17a_23
1 NA NA NA NA NA NA NA
2 0 0 0 0 0 0 1
3 0 0 0 0 0 1 1
4 0 0 0 0 0 0 1
5 1 0 0 0 1 1 0
6 0 0 0 0 0 1 1
7 1 1 0 0 1 0 1
And I would like to merge Q17a_17, Q17a_19 and Q17a_23 in a new column with a new name. The "old" columns Q17a_17, Q17a_19 and Q17a_23 should be deleted.
The new column should hold a single value per row, under the following conditions: "NA" if there was "NA" before, "1" if there was a "1" in any of the columns before (like in row 3 or 4 or 7) and "0" if there were only zeros before.
Maybe this is really simple, but I have been struggling with it for hours...
The approach I use here is to first compute a vector which is NA when an NA value occurs in at least one of the three columns, and zero otherwise. Also, we compute a vector containing the numerical result you want. What you want can be obtained by logically ORing together the three columns. Then, adding these two computed vectors together produces the desired result.
na.vector <- df$Q17a_17 * df$Q17a_19 * df$Q17a_23
na.vector[!is.na(na.vector)] <- 0
num.vector <- as.numeric(df$Q17a_17 | df$Q17a_19 | df$Q17a_23)
df$new_column <- na.vector + num.vector
df <- df[ , -which(names(df) %in% c("Q17a_17", "Q17a_19", "Q17a_23"))]
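A shorter variant of the same idea, as a sketch: rowSums() returns NA as soon as one of the selected columns is NA, so a single comparison covers all three conditions. The data frame below copies four of the question's columns. Row 8 (all zeros) is a hypothetical extra row added only to exercise the "0" case, and cols is a helper name not used in the answer above.

```r
df <- data.frame(
  Q17a_17 = c(NA, 0, 0, 0, 1, 0, 1, 0),
  Q17a_19 = c(NA, 0, 0, 0, 0, 0, 0, 0),
  Q17a_22 = c(NA, 0, 1, 0, 1, 1, 0, 0),
  Q17a_23 = c(NA, 1, 1, 1, 0, 1, 1, 0)   # last entry belongs to the hypothetical row 8
)

cols <- c("Q17a_17", "Q17a_19", "Q17a_23")

# NA if any of the three is NA, 1 if any is 1, 0 if all are 0
df$new_column <- as.numeric(rowSums(df[cols]) > 0)

# Drop the old columns by name
df <- df[, !names(df) %in% cols]

df$new_column
# NA 1 1 1 1 1 1 0
```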

Sample random column in dataframe

I have the following data stored in model$data:
model$data
[[1]]
Category1 Category2 Category3 Category4
3555 1 0 0 0
6447 1 0 0 0
5523 1 0 1 0
7550 1 0 1 0
6330 1 0 1 0
2451 1 0 0 0
4308 1 0 1 0
8917 0 0 0 0
4780 1 0 1 0
6802 1 0 1 0
2021 1 0 0 0
5792 1 0 1 0
5475 1 0 1 0
4198 1 0 0 0
223 1 0 1 0
4811 1 0 1 0
678 1 0 1 0
I am trying to use this call to pick one of the column names at random:
sample(colnames(model$data), 1)
But I receive the following error message:
Error in sample.int(length(x), size, replace, prob) :
invalid first argument
Is there a way to avoid that error?
Notice this?
model$data
[[1]]
The [[1]] means that model$data is a list, whose first component is a data frame. To do anything with it, you need to pass model$data[[1]] to your code, not model$data.
sample(colnames(model$data[[1]]), 1)
This seems to be a near-duplicate of Random rows in dataframes in R and should probably be closed as duplicate. But for completeness, adapting that answer to sampling column-indices is trivial:
You don't need to generate a vector of column names, only their indices. Keep it simple.
Sample your column indices from 1:ncol(df) instead of 1:nrow(df).
Then put those column indices on the right-hand side of the comma in df[, ...]:
df[, sample(ncol(df), 1)]
The 1 is because you apparently want to take a sample of size 1.
One minor complication is that your data frame is model$data[[1]], since your model$data looks like a list with one element which is a data frame, rather than a plain data frame. So first, assign df <- model$data[[1]].
Finally, if you really want the sampled column name(s) as well as their indices:
samp_col_idxs <- sample(ncol(df), 1)
samp_col_names <- colnames(df)[samp_col_idxs]
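Putting the pieces together, here is a small self-contained sketch. The model object is a hypothetical stand-in built to mimic the structure shown in the question (a list whose first element is a data frame). It also shows where the error comes from: colnames() of a plain list is NULL, so sample() has nothing to draw from.

```r
# Hypothetical stand-in for the asker's object: a list holding one data frame
model <- list(data = list(data.frame(
  Category1 = c(1, 1, 0),
  Category2 = c(0, 0, 0),
  Category3 = c(0, 1, 0),
  Category4 = c(0, 0, 0)
)))

# A plain list has no column names, hence the "invalid first argument" error
list_colnames <- colnames(model$data)   # NULL

# Extract the data frame first, then sample a column index
df <- model$data[[1]]
set.seed(1)                             # only to make the draw reproducible
idx <- sample(ncol(df), 1)

picked_name   <- colnames(df)[idx]      # name of the sampled column
picked_column <- df[[idx]]              # its values as a vector
```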

Finding "similar" rows performing a conditional join with sqldf

Say I have a data.table (it could also be a data.frame, that doesn't matter to me) with numeric columns a, b, c, d and e.
Each row of the table represents an article and a-e are numeric characteristics of the articles.
What I want to find out is which articles are similar to each other, based on columns a, b and c.
I define "similar" by allowing a, b and c to vary +/- 1 at most.
That is, article x is similar to article y if neither a, b nor c differs by more than 1. Their values for d and e don't matter and may differ significantly.
I've already tried a couple of approaches but didn't get the desired result. What I want to achieve is to get a result table which contains only those rows that are similar to at least one other row. Plus, duplicates must be excluded.
Particularly, I'm wondering if this is possible using the sqldf library. My idea is to somehow join the table with itself under the given conditions, but I can't get it to work properly. Any ideas (not necessarily using sqldf)?
Suppose our input data frame is the built-in 11x8 anscombe data frame. Its first three column names are x1, x2 and x3. Then here are some solutions.
1) sqldf This returns the pairs of row numbers of similar rows:
library(sqldf)
ans <- anscombe
ans$id <- 1:nrow(ans)
sqldf("select a.id, b.id
from ans a
join ans b on abs(a.x1 - b.x1) <= 1 and
abs(a.x2 - b.x2) <= 1 and
abs(a.x3 - b.x3) <= 1")
Add another condition, and a.id < b.id, if each row should not be paired with itself and we also want to exclude the reverse of each pair. Alternatively, add and not a.id = b.id to exclude only the self pairs.
2) dist This returns a matrix m whose i,j-th element is 1 if rows i and j are similar and 0 if not based on columns 1, 2 and 3.
# matrix of pairs (1 = similar, 0 = not)
m <- (as.matrix(dist(anscombe[1:3], method = "maximum")) <= 1) + 0
giving:
1 2 3 4 5 6 7 8 9 10 11
1 1 0 0 1 1 0 0 0 0 0 0
2 0 1 0 1 0 0 0 0 0 1 0
3 0 0 1 0 0 1 0 0 1 0 0
4 1 1 0 1 0 0 0 0 0 0 0
5 1 0 0 0 1 0 0 0 1 0 0
6 0 0 1 0 0 1 0 0 0 0 0
7 0 0 0 0 0 0 1 0 0 1 1
8 0 0 0 0 0 0 0 1 0 0 1
9 0 0 1 0 1 0 0 0 1 0 0
10 0 1 0 0 0 0 1 0 0 1 0
11 0 0 0 0 0 0 1 1 0 0 1
We could add m[lower.tri(m, diag = TRUE)] <- 0 to exclude self pairs and the reverse of each pair if desired or diag(m) <- 0 to just exclude self pairs.
We can create a data frame of similar row number pairs like this. To keep the output short we have excluded self pairs and the reverse of each pair.
# two-column data.frame of pairs excluding self pairs and reverses
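# as.integer() turns the factor columns into row numbers; since R 4.1.0, c() on a factor stays a factor and cannot be compared with <
subset(as.data.frame.table(m), as.integer(Var1) < as.integer(Var2) & Freq == 1)[1:2]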
giving:
Var1 Var2
34 1 4
35 2 4
45 1 5
58 3 6
91 3 9
93 5 9
101 2 10
106 7 10
117 7 11
118 8 11
Here is how to draw a network graph and a raster plot of the above:
# network graph
library(igraph)
g <- graph_from_adjacency_matrix(m) # current name of the deprecated graph.adjacency()
plot(g)
# raster plot
library(ggplot2)
ggplot(as.data.frame.table(m), aes(Var1, Var2, fill = factor(Freq))) +
geom_raster()
I am quite new to R, so don't expect too much.
What if you build, for each column, a matrix of the pairwise distances between its values? Then you can find the combinations that differ by no more than 1 from each other. That way you find the matching pairs for (a). Repeat this for (b) and (c), and keep the pairs that show up in all three.
Alternatively, this can probably be done as a cube as well.
Just a hint to think about.
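A rough sketch of that idea in base R, under a few assumptions: the articles data frame is made up for illustration, within_one() is a hypothetical helper name, and outer() builds the pairwise difference matrix for one column. The three logical matrices are combined with &, and only rows with at least one similar partner are kept.

```r
# Made-up example data: rows 1 and 2 are similar on a, b and c
articles <- data.frame(
  a = c(1, 2, 10, 5),
  b = c(1, 1, 10, 5),
  c = c(3, 4, 0, 9),
  d = c(100, -7, 3, 4)    # ignored for the comparison
)

# TRUE where two values of one column differ by at most 1
within_one <- function(v) abs(outer(v, v, "-")) <= 1

# A pair is similar only if it is within 1 on a AND b AND c
similar <- Reduce(`&`, lapply(articles[c("a", "b", "c")], within_one))
diag(similar) <- FALSE    # a row does not count as its own partner

# Rows that are similar to at least one other row, each listed once
result <- articles[rowSums(similar) > 0, ]
```

Because the result is obtained by subsetting the original rows, each article appears only once, however many partners it has.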

Data Transformations in R

I need to look at the data in a data frame in a different way. Here is the problem.
I have a data frame as follows
Person Item BuyOrSell
1 a B
1 b S
1 a S
2 d B
3 a S
3 e S
I need it to be transformed as follows, showing the number of transactions made by each Person on each individual item.
Person a b d e
1 2 1 0 0
2 0 0 1 0
3 1 0 0 1
I was able to achieve the above in R by using:
table(Person,Item)
The new requirement I have is to see the data as follows. Show the sum of all transactions made by the Person on individual items broken by the transaction type (B or S)
Person aB aS bB bS dB dS eB eS
1 1 1 0 1 0 0 0 0
2 0 0 0 0 1 0 0 0
3 0 1 0 0 0 0 0 1
So I created a new column by pasting together the values of Item and BuyOrSell.
df$newcol<-paste(Item,"-",BuyOrSell,sep="")
table(Person,newcol)
and was able to achieve the above results.
Is there a better way in R to do this type of transformation ?
Your way (creating a new column via paste) is probably the easiest. You could also do this:
library(reshape2)
dcast(df, Person ~ Item + BuyOrSell, fun.aggregate = length, drop = FALSE)
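If you would rather stay in base R and skip the extra column, interaction() can build the combined factor on the fly. This is only a sketch: the data frame is retyped from the question, combo and counts are names made up here, and lex.order = TRUE is used so that the columns come out grouped by item (aB, aS, bB, ...).

```r
# Data retyped from the question
df <- data.frame(
  Person    = c(1, 1, 1, 2, 3, 3),
  Item      = c("a", "b", "a", "d", "a", "e"),
  BuyOrSell = c("B", "S", "S", "B", "S", "S")
)

# Combined Item/BuyOrSell factor; unused combinations are kept as levels
combo <- interaction(df$Item, df$BuyOrSell, sep = "", lex.order = TRUE)

# Person x combination counts, including the all-zero columns
counts <- table(Person = df$Person, combo)
counts
```

Since interaction() keeps every Item/BuyOrSell combination as a level, the table also contains the columns that never occur (such as bB), much like drop=FALSE does in the dcast call.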
