Create virtual categorical column based on existing values - r

Reference:
Transpose and create categorical values in R
Follow-up to this question. While both model.matrix and data.table work very well with values already in it, how can we use them to simulate a column?
Meaning, from data in the same data frame,
data <- read.table(header=T, text='
subject weight sex test
1 2 M control
2 3 F cond1
3 2 F cond2
4 4 M control
5 3 F control
6 2 F control
')
If I were to simulate the case statement with OR condition from SQL in R, how do I go about it? In SQL I would do:
CASE WHEN (sex = 'F' OR sex = 'M') AND CONTROL IS NOT NULL THEN 1 ELSE 0 END AS F_M_CONTROL
CASE WHEN (sex = 'F' OR sex = 'M') AND COND1 IS NOT NULL THEN 1 ELSE 0 END AS F_M_COND1
bringing the output to:
subject weight control_F_M control_M control_F cond1_F_M cond1_F cond1_M
1 2 0 1 0 0 0 0
2 3 0 0 1 0 0 0
3 2 0 0 0 0 1 0
4 4 0 1 0 0 0 0
5 3 1 0 0 0 0 0
6 2 1 0 0 0 0 0
Any idea how I can generate the "Control_F_M" and Cond1_F_M columns in R?
Thanks in advance,
Bee
Edit:
To generate the aforementioned output, I'm using data.table and dcast as suggested before.
I could use if-else if I knew all the values in the column test. I apologize for not clarifying this earlier. The challenge, of course, is that the column is dynamic, so I'm hoping to generate that many columns dynamically as an extension of the code below or with a similar approach.
dcast(data, subject+weight~test+sex, fun=length, drop=c(TRUE,FALSE))
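One possible reading of the SQL above, sketched in base R: for every value that occurs in test, add an indicator column that is 1 when sex is F or M and the row has that test value. The column naming (value plus "_F_M") is an assumption that follows the names used in the question, and the loop picks up whatever values test contains, so nothing is hard-coded.

```r
# Example data from the question
data <- read.table(header = TRUE, text = "
subject weight sex test
1 2 M control
2 3 F cond1
3 2 F cond2
4 4 M control
5 3 F control
6 2 F control
")

# One indicator column per distinct test value (assumed naming: <value>_F_M)
for (lev in unique(data$test)) {
  data[[paste0(lev, "_F_M")]] <-
    as.integer(data$sex %in% c("F", "M") & data$test == lev)
}

print(data)
```

This only mirrors the CASE WHEN logic; the per-sex columns can still come from the dcast call above.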

Related

How to add multiple values to data.frame without loop?

Suppose I have a matrix D which consists of death counts per year by specific ages.
I want to fill this matrix with the appropriate death counts, which are stored in
the vector Age, but the following code gives me the wrong answer. How should I write the code without a loop?
# Year and age grid for tables
Years=c(2007:2017)
Ages=c(60:70)
#Data.frame of deaths
D=data.frame(matrix(ncol=length(Years),nrow=length(Ages))); D[is.na(D)]=0
colnames(D)=Years
rownames(D)=Ages
Age=c(60,61,62,65,65,65,68,69,60)
year=2010
D[as.character(Age),as.character(year)]<-
D[as.character(Age),as.character(year)]+1
D[,'2010'] # 1 1 1 0 0 1 0 0 1 1 0
# Should be 2 1 1 0 0 3 0 0 1 1 0
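You need to use table. Assignment with a repeated index writes each cell only once, so duplicated ages are not accumulated; table counts them first: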
AgeTable = table(Age)
D[names(AgeTable), as.character(year)] = AgeTable
D[,'2010']
[1] 2 1 1 0 0 3 0 0 1 1 0
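If counts have to be added on top of values already in D (for example when deaths arrive in several batches), the same idea can be wrapped in a small helper. This is only a sketch; add_deaths is a hypothetical name, not something from the original answer.

```r
# Year and age grid, as in the question
Years <- 2007:2017
Ages <- 60:70
D <- data.frame(matrix(0, nrow = length(Ages), ncol = length(Years)))
colnames(D) <- Years
rownames(D) <- Ages

# Hypothetical helper: add the counts of each age to the column for one year
add_deaths <- function(D, Age, year) {
  counts <- table(Age)          # one entry per distinct age
  rows <- names(counts)         # ages as character row names
  col <- as.character(year)
  D[rows, col] <- D[rows, col] + as.vector(counts)
  D
}

D <- add_deaths(D, c(60, 61, 62, 65, 65, 65, 68, 69, 60), 2010)
print(D[, "2010"])
```

Because the helper adds to the existing cells instead of overwriting them, calling it again for the same year accumulates the counts.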

Extract columns from df by subset of column id characters

I am working on a gene expression dataset with hundreds of samples. Each sample in the data frame has a unique column ID (example: OHC_112 or IHC_123). I want to make a new data frame containing only the columns whose names contain "OHC". How can I do this?
I am struggling to make a workable example data frame, but this is the best I was able to do.
Data frame "DF"
OHC_1 OHC_2 OHC_3 IHC_4 IHC_5 OHC_6
Gene1 1 1 0 1 1 0
Gene2 0 0 0 1 1 0
Gene3 1 1 1 0 0 1
Gene4 1 1 1 0 0 0
I got close by using the following subset command
newDF <- subset(DF, ,select = OHC_1:OHC_3)
This allows me to subset the dataframe by a range of the columns but does not allow me to choose all the columns containing "OHC" in the header.
Thanks for your help!
Just subset the columns with names that match using grepl?
> DF[, grepl("OHC",names(DF))]
OHC_1 OHC_2 OHC_3 OHC_6
1 1 1 0 0
2 0 0 0 0
3 1 1 1 1
4 1 1 1 0
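You can make a shorter call that is also more generalizable with negative grep. For example, this drops the columns OHC_1 to OHC_3:
DF.2 <- DF[, -grep("^OHC_[1-3]$", names(DF))]
Since grep returns numeric positions, you can use negative vector indexing to remove columns. You could add further numbers or more complex patterns. Note that if the pattern matches nothing, grep returns integer(0) and the result has no columns at all.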
We can use select with matches from tidyverse
library(tidyverse)
DF %>%
select(matches("^OHC"))
# OHC_1 OHC_2 OHC_3 OHC_6
#Gene1 1 1 0 0
#Gene2 0 0 0 0
#Gene3 1 1 1 1
#Gene4 1 1 1 0
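For completeness, a self-contained base R sketch of the grepl approach. The data frame is rebuilt from the table in the question, and startsWith is shown as an alternative when the match is a fixed prefix rather than a regular expression.

```r
# Example data frame rebuilt from the question
DF <- data.frame(
  OHC_1 = c(1, 0, 1, 1),
  OHC_2 = c(1, 0, 1, 1),
  OHC_3 = c(0, 0, 1, 1),
  IHC_4 = c(1, 1, 0, 0),
  IHC_5 = c(1, 1, 0, 0),
  OHC_6 = c(0, 0, 1, 0),
  row.names = c("Gene1", "Gene2", "Gene3", "Gene4")
)

# Logical index of the columns whose names start with "OHC_"
keep <- grepl("^OHC_", names(DF))
newDF <- DF[, keep, drop = FALSE]   # drop = FALSE keeps a data frame even for one match

# Same selection without a regular expression
same <- DF[, startsWith(names(DF), "OHC_"), drop = FALSE]

print(newDF)
```

A logical index is safe when nothing matches: the result is simply a data frame with zero columns and the original rows.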

Finding "similar" rows performing a conditional join with sqldf

Say I have a data.table (it could also be a data.frame, it doesn't matter to me) which has numeric columns a, b, c, d and e.
Each row of the table represents an article and a-e are numeric characteristics of the articles.
What I want to find out is which articles are similar to each other, based on columns a, b and c.
I define "similar" by allowing a, b and c to vary +/- 1 at most.
That is, article x is similar to article y if neither a, b nor c differs by more than 1. Their values for d and e don't matter and may differ significantly.
I've already tried a couple of approaches but didn't get the desired result. What I want to achieve is to get a result table which contains only those rows that are similar to at least one other row. Plus, duplicates must be excluded.
In particular, I'm wondering if this is possible using the sqldf library. My idea is to somehow join the table with itself under the given conditions, but I can't put it together properly. Any ideas (not necessarily using sqldf)?
Suppose our input data frame is the built-in 11x8 anscombe data frame. Its first three column names are x1, x2 and x3. Then here are some solutions.
1) sqldf This returns the pairs of row numbers of similar rows:
library(sqldf)
ans <- anscombe
ans$id <- 1:nrow(ans)
sqldf("select a.id, b.id
from ans a
join ans b on abs(a.x1 - b.x1) <= 1 and
abs(a.x2 - b.x2) <= 1 and
abs(a.x3 - b.x3) <= 1")
Add the further condition "and a.id < b.id" if each row should not be paired with itself and the reverse of each pair should also be excluded, or add "and not a.id = b.id" to exclude only self pairs.
2) dist This returns a matrix m whose i,j-th element is 1 if rows i and j are similar and 0 if not based on columns 1, 2 and 3.
# matrix of pairs (1 = similar, 0 = not)
m <- (as.matrix(dist(anscombe[1:3], method = "maximum")) <= 1) + 0
giving:
   1 2 3 4 5 6 7 8 9 10 11
1  1 0 0 1 1 0 0 0 0  0  0
2  0 1 0 1 0 0 0 0 0  1  0
3  0 0 1 0 0 1 0 0 1  0  0
4  1 1 0 1 0 0 0 0 0  0  0
5  1 0 0 0 1 0 0 0 1  0  0
6  0 0 1 0 0 1 0 0 0  0  0
7  0 0 0 0 0 0 1 0 0  1  1
8  0 0 0 0 0 0 0 1 0  0  1
9  0 0 1 0 1 0 0 0 1  0  0
10 0 1 0 0 0 0 1 0 0  1  0
11 0 0 0 0 0 0 1 1 0  0  1
We could add m[lower.tri(m, diag = TRUE)] <- 0 to exclude self pairs and the reverse of each pair if desired or diag(m) <- 0 to just exclude self pairs.
We can create a data frame of similar row number pairs like this. To keep the output short we have excluded self pairs and the reverse of each pair.
# two-column data.frame of pairs excluding self pairs and reverses
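# Var1 and Var2 are factors; compare their integer codes (c() keeps them as factors in R >= 4.1)
subset(as.data.frame.table(m), as.integer(Var1) < as.integer(Var2) & Freq == 1)[1:2]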
giving:
Var1 Var2
34 1 4
35 2 4
45 1 5
58 3 6
91 3 9
93 5 9
101 2 10
106 7 10
117 7 11
118 8 11
The pairs can also be drawn as a network graph, followed by a raster plot of the matrix:
# network graph
library(igraph)
# graph_from_adjacency_matrix replaces the deprecated graph.adjacency
g <- graph_from_adjacency_matrix(m)
plot(g)
# raster plot
library(ggplot2)
ggplot(as.data.frame.table(m), aes(Var1, Var2, fill = factor(Freq))) +
geom_raster()
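To get what the question actually asks for, the rows that are similar to at least one other row with duplicates removed, the dist matrix can be reduced to a row filter. The following is a sketch on a small made-up data frame (the articles values are hypothetical, chosen so that one row is isolated and one is an exact duplicate).

```r
# Hypothetical articles: rows 1 and 2 are similar, row 3 is isolated,
# row 4 is an exact duplicate of row 2
articles <- data.frame(
  a = c(1, 2, 10, 2),
  b = c(5, 5, 20, 5),
  c = c(3, 4, 30, 4),
  d = c(100, 200, 300, 200),
  e = c(7, 8, 9, 8)
)

# TRUE where the largest difference over a, b and c is at most 1
m <- as.matrix(dist(articles[c("a", "b", "c")], method = "maximum")) <= 1
diag(m) <- FALSE                      # a row does not count as its own match

keep <- rowSums(m) > 0                # similar to at least one other row
similar_rows <- unique(articles[keep, ])

print(similar_rows)
```

Here unique drops rows that are identical in every column, which is one interpretation of "duplicates must be excluded"; restrict it to a subset of columns if duplicates are defined differently.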
I am quite new to R, so don't expect too much.
What if you build, for each column, a matrix of the pairwise distances between its values? From that you can find the combinations that differ by no more than 1 from each other. This way you can find the matching pairs for (a). Repeat this for (b) and (c), and keep the pairs that appear in all three.
Alternatively, this can probably be done as a cube as well.
Just a thought.

Create mutually exclusive dummy variables from categorical variable in R [duplicate]

A while ago, I asked a question about creating a categorical variable from mutually exclusive dummy variables. Now, it turns out I want to do the opposite.
How would one go about creating dummy variables in a long-form dataset from a single categorical variable (time)? e.g. the dataframe below...
id time
1 1
1 2
1 3
1 4
would become...
id time time_dummy_1 time_dummy_2 time_dummy_3 time_dummy_4
1 1 1 0 0 0
1 2 0 1 0 0
1 3 0 0 1 0
1 4 0 0 0 1
I'm sure this is trivial (and please let me know if this question is a duplicate -- I'm not sure it is, but will happily remove if so). Thanks!
You can try the dummies library.
R Code:
# Creating the data frame
id <- c(1,1,1,1)
time <- c(1,2,3,4)
data <- data.frame(id, time)
install.packages("dummies")
library(dummies)
data <- cbind(data, dummy(data$time))
Output:
id time data1 data2 data3 data4
1 1 1 0 0 0
1 2 0 1 0 0
1 3 0 0 1 0
1 4 0 0 0 1
Further you can rename the newly added dummy variable headers to suit your needs
R Code:
# Rename column headers
colnames(data)[colnames(data)=="data1"] <- "time_dummy_1"
colnames(data)[colnames(data)=="data2"] <- "time_dummy_2"
colnames(data)[colnames(data)=="data3"] <- "time_dummy_3"
colnames(data)[colnames(data)=="data4"] <- "time_dummy_4"
Output:
id time time_dummy_1 time_dummy_2 time_dummy_3 time_dummy_4
1 1 1 0 0 0
1 2 0 1 0 0
1 3 0 0 1 0
1 4 0 0 0 1
Hope this helps.
If your data is
id <- c(1,1,1,1)
time <- c(1,2,3,4)
df <- data.frame(id,time)
you can try
time <- as.character(time)
unique.time <- as.character(unique(df$time))
# Create a dichotomous dummy-variable for each time
x <- sapply(unique.time, function(x)as.numeric(df$time == x))
or
time.f = factor(time)
# "- 1" drops the intercept so that every level gets its own dummy column
dummies = model.matrix(~ time.f - 1)
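Putting the model.matrix variant together with the column names asked for in the question, a minimal base R sketch could look like this (the time_dummy_ prefix is taken from the desired output above):

```r
df <- data.frame(id = c(1, 1, 1, 1), time = c(1, 2, 3, 4))

# One indicator column per level; "- 1" removes the intercept
time_f <- factor(df$time)
dummies <- model.matrix(~ time_f - 1)

# Rename the columns after the factor levels and attach them to the data
colnames(dummies) <- paste0("time_dummy_", levels(time_f))
result <- cbind(df, dummies)

print(result)
```

Without the "- 1", model.matrix keeps an intercept column and omits the first level, so the dummies would no longer be one per category.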

Restructure Data in R

I am just starting to get beyond the basics in R and have come to a point where I need some help. I want to restructure some data. Here is what a sample dataframe may look like:
ID Sex Res Contact
1 M MA ABR
1 M MA CON
1 M MA WWF
2 F FL WIT
2 F FL CON
3 X GA XYZ
I want the data to look like:
ID SEX Res ABR CON WWF WIT XYZ
1 M MA 1 1 1 0 0
2 F FL 0 1 0 1 0
3 X GA 0 0 0 0 1
What are my options? How would I do this in R?
In short, I am looking to keep the values of the Contact column and use them as column names in the restructured data frame. I want to hold a variable set of columns constant (in the example above, I held ID, Sex, and Res constant).
Also, is it possible to control the values in the restructured data? I may want to keep the data as binary. I may want some data to have the value be the count of times each contact value exists for each ID.
The reshape package is what you want. Documentation here: http://had.co.nz/reshape/. Not to toot my own horn, but I've also written up some notes on reshape's use here: http://www.ling.upenn.edu/~joseff/rstudy/summer2010_reshape.html
For your purpose, this code should work
library(reshape)
data$value <- 1
cast(data, ID + Sex + Res ~ Contact, fun.aggregate = length)
model.matrix works great (this was asked recently, and gappy had this good answer):
> model.matrix(~ factor(d$Contact) -1)
factor(d$Contact)ABR factor(d$Contact)CON factor(d$Contact)WIT factor(d$Contact)WWF factor(d$Contact)XYZ
1 1 0 0 0 0
2 0 1 0 0 0
3 0 0 0 1 0
4 0 0 1 0 0
5 0 1 0 0 0
6 0 0 0 0 1
attr(,"assign")
[1] 1 1 1 1 1
attr(,"contrasts")
attr(,"contrasts")$`factor(d$Contact)`
[1] "contr.treatment"
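Neither answer shows the ID, Sex and Res columns next to the indicators in base R. A sketch using table, which covers both the count and the binary variant asked about in the question, might look like this (d is the example data frame from the question, rebuilt here):

```r
d <- data.frame(
  ID = c(1, 1, 1, 2, 2, 3),
  Sex = c("M", "M", "M", "F", "F", "X"),
  Res = c("MA", "MA", "MA", "FL", "FL", "GA"),
  Contact = c("ABR", "CON", "WWF", "WIT", "CON", "XYZ")
)

# Counts of each Contact value per ID, as a data frame with one row per ID
counts <- as.data.frame.matrix(table(d$ID, d$Contact))

# Columns held constant, one row per ID
ids <- unique(d[c("ID", "Sex", "Res")])

# Count version: how often each contact occurs for each ID
wide_counts <- cbind(ids, counts[as.character(ids$ID), ])
rownames(wide_counts) <- NULL

# Binary version: 1 if the contact occurs at all, else 0
wide_binary <- cbind(ids, (counts[as.character(ids$ID), ] > 0) + 0)
rownames(wide_binary) <- NULL

print(wide_counts)
```

The contact columns come out in alphabetical order (ABR, CON, WIT, WWF, XYZ) because table sorts the values; reorder them afterwards if the order in the question matters.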

Resources