Data Transformations in R - r

I have a need to look at the data in a data frame in a different way. Here is the problem..
I have a data frame as follows
Person Item BuyOrSell
1 a B
1 b S
1 a S
2 d B
3 a S
3 e S
I need it be transformed into this way. Show the sum of all transactions made by the Person on individual items.
Person a b d e
1 2 1 0 0
2 0 0 1 0
3 1 0 0 1
I was able to achieve the above by using the
table(Person,Item) in R
The new requirement I have is to see the data as follows. Show the sum of all transactions made by the Person on individual items broken by the transaction type (B or S)
Person aB aS bB bS dB dS eB eS
1 1 1 0 1 0 0 0 0
2 0 0 0 0 1 0 0 0
3 1 0 0 0 0 0 0 1
So i created a new column and appended the values of both the Item and BuyOrSell.
df$newcol<-paste(Item,"-",BuyOrSell,sep="")
table(Person,newcol)
and was able to achieve the above results.
Is there a better way in R to do this type of transformation ?

Your way (creating a new column via paste) is probably the easiest. You could also do this:
require(reshape2)
dcast(Person~Item+BuyOrSell,data=df,fun.aggregate=length,drop=FALSE)

Related

Create a new conditional column based on binary values in other columns? (R)

My data looks like this:
A B C
1 0 0
0 1 0
0 1 0
0 0 1
This is what I’m going for:
A B C New_Column
1 0 0 A
0 1 0 B
0 1 0 B
0 0 1 C
So I'm creating a new column that tells me which of the three variables (A, B, or C) is present. Only one of the three columns will contain a 1 per row. What is the best way to go about this?
We can use max.col
df1$New_Column <- names(df1)[max.col(df1, "first")]

How to add multiple values to data.frame without loop?

Suppose I have matrix D which consists of death counts per year by specific ages.
I want to fill this matrix with appropriate death counts that is stored in
vector Age, but the following code gives me wrong answer. How should I write the code without making a loop?
# Year and age grid for tables
Years=c(2007:2017)
Ages=c(60:70)
#Data.frame of deaths
D=data.frame(matrix(ncol=length(Years),nrow=length(Ages))); D[is.na(D)]=0
colnames(D)=Years
rownames(D)=Ages
Age=c(60,61,62,65,65,65,68,69,60)
year=2010
D[as.character(Age),as.character(year)]<-
D[as.character(Age),as.character(year)]+1
D[,'2010'] # 1 1 1 0 0 1 0 0 1 1 0
# Should be 2 1 1 0 0 3 0 0 1 1 0
You need to use table
AgeTable = table(Age)
D[names(AgeTable), as.character(year)] = AgeTable
D[,'2010']
[1] 2 1 1 0 0 3 0 0 1 1 0

Finding "similar" rows performing a conditional join with sqldf

Say I got a data.table (can also be data.frame, doesn't matter to me) which has numeric columns a, b, c, d and e.
Each row of the table represents an article and a-e are numeric characteristics of the articles.
What I want to find out is which articles are similar to each other, based on columns a, b and c.
I define "similar" by allowing a, b and c to vary +/- 1 at most.
That is, article x is similar to article y if neither a, b nor c differs by more than 1. Their values for d and e don't matter and may differ significantly.
I've already tried a couple of approaches but didn't get the desired result. What I want to achieve is to get a result table which contains only those rows that are similar to at least one other row. Plus, duplicates must be excluded.
Particularly, I'm wondering if this is possible using the sqldf library. My idea is to somehow join the table with itself under the given conditions, but I don't get it together properly. Any ideas (not necessarily using sqldf)?
Suppose our input data frame is the built-in 11x8 anscombe data frame. Its first three column names are x1, x2 and x3. Then here are some solutions.
1) sqldf This returns the pairs of row numbers of similar rows:
library(sqldf)
ans <- anscombe
ans$id <- 1:nrow(ans)
sqldf("select a.id, b.id
from ans a
join ans b on abs(a.x1 - b.x1) <= 1 and
abs(a.x2 - b.x2) <= 1 and
abs(a.x3 - b.x3) <= 1")
Add another condition and a.id < b.id if each row should not be paired with itself and if we want to exclude the reverse of each pair or add and not a.id = b.id to just exclude self pairs.
2) dist This returns a matrix m whose i,j-th element is 1 if rows i and j are similar and 0 if not based on columns 1, 2 and 3.
# matrix of pairs (1 = similar, 0 = not)
m <- (as.matrix(dist(anscombe[1:3], method = "maximum")) <= 1) + 0
giving:
1 2 3 4 5 6 7 8 9 10 11
1 1 0 0 1 1 0 0 0 0 0 0
2 0 1 0 1 0 0 0 0 0 1 0
3 0 0 1 0 0 1 0 0 1 0 0
4 1 1 0 1 0 0 0 0 0 0 0
5 1 0 0 0 1 0 0 0 1 0 0
6 0 0 1 0 0 1 0 0 0 0 0
7 0 0 0 0 0 0 1 0 0 1 1
8 0 0 0 0 0 0 0 1 0 0 1
9 0 0 1 0 1 0 0 0 1 0 0
10 0 1 0 0 0 0 1 0 0 1 0
11 0 0 0 0 0 0 1 1 0 0 1
We could add m[lower.tri(m, diag = TRUE)] <- 0 to exclude self pairs and the reverse of each pair if desired or diag(m) <- 0 to just exclude self pairs.
We can create a data frame of similar row number pairs like this. To keep the output short we have excluded self pairs and the reverse of each pair.
# two-column data.frame of pairs excluding self pairs and reverses
subset(as.data.frame.table(m), c(Var1) < c(Var2) & Freq == 1)[1:2]
giving:
Var1 Var2
34 1 4
35 2 4
45 1 5
58 3 6
91 3 9
93 5 9
101 2 10
106 7 10
117 7 11
118 8 11
Here is a network graph of the above. Note that answer continues after the graph:
# network graph
library(igraph)
g <- graph.adjacency(m)
plot(g)
# raster plot
library(ggplot2)
ggplot(as.data.frame.table(m), aes(Var1, Var2, fill = factor(Freq))) +
geom_raster()
I am quite new to R so don't expect to much.
What if you create from your values (which are basically vectors) a matrix with the distance from the two values. So you can find those combinations which have a difference of less than 1 from each other. Via this way you can find the matching (a)-pairs. Repeat this with (b) and (c) and find those which are included in all pairs.
Alternatively this can probably be done as a cube as well.
Just as a thought hint.

Create virtual categorical column based on existing values

Reference:
Transpose and create categorical values in R
Follow-up to this question. While both model.matrix and data.table work very well with values already in it, how can we use them to simulate a column?
Meaning, from data in the same data frame,
data <- read.table(header=T, text='
subject weight sex test
1 2 M control
2 3 F cond1
3 2 F cond2
4 4 M control
5 3 F control
6 2 F control
')
If I were to simulate the case statement with OR condition from SQL in R, how do I go about it? In SQL I would do:
case when ( sex = 'F' OR sex = 'M') AND CONTROL IS NOT NULL THEN 1 ELSE 0 AS F_M_CONTROL
case when (sex = 'F' OR sex = 'M') AND COND1 IS NOT NULL THEN 1 ELSE 0 AS F_M_COND1
bringing the output to:
subject weight control_F_M control_M condtrol_F cond1_F_M cond1_F cond1_M
1 2 0 1 0 0 0 0
2 3 0 0 1 0 0 0
3 2 0 0 0 0 1 0
4 4 0 1 0 0 0 0
5 3 1 0 0 0 0 0
6 2 1 0 0 0 0 0
Any idea how I can generate the "Control_F_M" and Cond1_F_M columns in R?
Thanks in advance,
Bee
Edit:
To generate the afore mentioned output, i'm using the data table & dcast as suggested before.
I can use If-Else if I knew all the values in the column: test. I apologize for not clarifying this earlier. The challenge ofcourse is that the column is dynamic and so I'm hoping to generate that many columns dynamically as an extension to the below or using a similar approach.
dcast(data, subject+weight~test+sex, fun=length, drop=c(TRUE,FALSE))

Restructure Data in R

I am just starting to get beyond the basics in R and have come to a point where I need some help. I want to restructure some data. Here is what a sample dataframe may look like:
ID Sex Res Contact
1 M MA ABR
1 M MA CON
1 M MA WWF
2 F FL WIT
2 F FL CON
3 X GA XYZ
I want the data to look like:
ID SEX Res ABR CON WWF WIT XYZ
1 M MA 1 1 1 0 0
2 F FL 0 1 0 1 0
3 X GA 0 0 0 0 1
What are my options? How would I do this in R?
In short, I am looking to keep the values of the CONT column and use them as column names in the restructred data frame. I want to hold a variable set of columns constant (in th example above, I held ID, Sex, and Res constant).
Also, is it possible to control the values in the restructured data? I may want to keep the data as binary. I may want some data to have the value be the count of times each contact value exists for each ID.
The reshape package is what you want. Documentation here: http://had.co.nz/reshape/. Not to toot my own horn, but I've also written up some notes on reshape's use here: http://www.ling.upenn.edu/~joseff/rstudy/summer2010_reshape.html
For your purpose, this code should work
library(reshape)
data$value <- 1
cast(data, ID + Sex + Res ~ Contact, fun = "length")
model.matrix works great (this was asked recently, and gappy had this good answer):
> model.matrix(~ factor(d$Contact) -1)
factor(d$Contact)ABR factor(d$Contact)CON factor(d$Contact)WIT factor(d$Contact)WWF factor(d$Contact)XYZ
1 1 0 0 0 0
2 0 1 0 0 0
3 0 0 0 1 0
4 0 0 1 0 0
5 0 1 0 0 0
6 0 0 0 0 1
attr(,"assign")
[1] 1 1 1 1 1
attr(,"contrasts")
attr(,"contrasts")$`factor(d$Contact)`
[1] "contr.treatment"

Resources