How do I convert table formats in R

Specifically, I used the following setup:
newdata <- tapply(mydata(#), list(mydata(X), mydata(Y)), sum)
I currently have a table laid out as follows:
X = State, Y = County within State, # = a numerical total of something
__ Y1 Y2 Y3 Yn
X1 ## ## ## ##
X2 ## ## ## ##
X3 ## ## ## ##
Xn ## ## ## ##
What I need is a table listed as follows:
X1 Y1 ##
X1 Y2 ##
X1 Y3 ##
X1 Yn ##
X2 Y1 ##
X2 Y2 ##
X2 Y3 ##
X2 Yn ##
Xn Y1 ##
Xn Y2 ##
Xn Y3 ##
Xn Yn ##

library(reshape2)
new_data <- melt(old_data, id.vars=1)
See ?melt for more details on the syntax.
Example:
> df <- data.frame(x=1:5, y1=rnorm(5), y2=rnorm(5))
> df
x y1 y2
1 1 -1.3417817 -1.1777317
2 2 -0.4014688 1.4653270
3 3 0.4050132 1.5547598
4 4 0.1622901 -1.2976084
5 5 -0.7207541 -0.1203277
> melt(df, id.vars=1)
x variable value
1 1 y1 -1.3417817
2 2 y1 -0.4014688
3 3 y1 0.4050132
4 4 y1 0.1622901
5 5 y1 -0.7207541
6 1 y2 -1.1777317
7 2 y2 1.4653270
8 3 y2 1.5547598
9 4 y2 -1.2976084
10 5 y2 -0.1203277

Some example data
mydata <- data.frame(num=rnorm(40),
gp1=rep(LETTERS[1:2],2),
gp2=rep(letters[1:2],each=2))
And applying tapply to it:
tmp <- tapply(mydata$num, list(mydata$gp1, mydata$gp2), sum)
The result of tapply is a matrix, but you can treat it like a table and use as.data.frame.table to convert it. This does not rely on any additional packages.
as.data.frame.table(tmp)
The two different data structures look like:
> tmp
a b
A 8.381483 6.373657
B 2.379303 -1.189488
> as.data.frame.table(tmp)
Var1 Var2 Freq
1 A a 8.381483
2 B a 2.379303
3 A b 6.373657
4 B b -1.189488
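To connect this back to the state/county layout in the question, here is a minimal sketch under assumed data: the column names state, county and total and the values are made up for illustration. Naming the elements of the list passed to tapply carries over to the column names of the long table, and responseName replaces the default Freq.

```r
# Hypothetical input: one row per observation, with made-up totals
mydata <- data.frame(
  state  = c("X1", "X1", "X1", "X2", "X2", "X2"),
  county = c("Y1", "Y2", "Y1", "Y1", "Y2", "Y2"),
  total  = c(1, 2, 3, 4, 5, 6)
)

# Wide matrix: states in rows, counties in columns
wide <- tapply(mydata$total,
               list(state = mydata$state, county = mydata$county),
               sum)

# Long data frame: one row per state/county pair
long <- as.data.frame(as.table(wide), responseName = "total")

# Sort so that all counties of a state appear together
long <- long[order(long$state, long$county), ]
long
```

The ordering step is only there to reproduce the X1 Y1, X1 Y2, ... layout asked for; as.data.frame on a table otherwise lists the cells column by column.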

Related

Selection of argument within a function based on the comparison of two vectors

Given is a dataframe with the vectors x1 and y1:
x1 <- c(1,1,2,2,3,4)
y1 <- c(0,0,1,1,2,2)
df1 <- data.frame(x1,y1)
Also, I have a dataframe with the different values from the vector y1 and a corresponding probability:
y <- c(0,1,2)
p <- c(0.1,0.6,0.9)
df2 <- data.frame(y,p)
The following function compares a given probability (p) with a random number (runif(1)). Based on the result of the comparison, the value of df1$x1 is either incremented or left unchanged and stored in df1$x2 (a new random number has to be drawn for each value of x1):
example_function <- function(x,p){
if(runif(1) <= p) return(x + 1)
return(x)
}
set.seed(123)
df1$x2 <- unlist(lapply(df1$x1,example_function,0.5))
> df1$x2
[1] 2 1 3 2 3 5
Here is my problem: In the example above I chose 0.5 for the argument "p" (manually). Instead, I would like to select a probability p from df2 based on the values for y1 associated with x1 in df1. Accordingly, I want p in
df1$x2 <- unlist(lapply(df1$x1,example_function,p))
to be derived from df2.
For example, df1$x1[3], which is a 2, belongs to df1$y1[3], which is a 1. df2 shows that a 1 for y is associated with p = 0.6. In that case, the argument p for df1$x1[3] in "example_function" should be 0.6. How can this kind of lookup for the value p be integrated into the described function?
df1$x2 <- unlist(lapply(df1$x1,
function(z) {
example_function(z, df2$p[df2$y == df1$y1[df1$x1 == z][1]])
}))
df1
# x1 y1 x2
# 1 1 0 1
# 2 2 0 2
# 3 3 1 4
# 4 4 1 4
# 5 5 2 6
# 6 6 2 7
There is no need to do anything complicated here. You can get what you want using vector-expressions.
To pick your probabilities given p and y1, simply subscript:
> p[y1]
[1] 0.1 0.1 0.6 0.6
and then pick your x2 from x1 and the sample like this:
> ifelse(runif(1) <= p[y1], x1, x1 + 2)
[1] 3 4 3 4
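Because y1 contains 0 and R indices start at 1, subscripting p directly by y1 silently drops the zero entries, which is why only four probabilities come back above. A minimal sketch of a vectorised alternative, assuming the df1 and df2 from the question, looks the probabilities up with match() and draws one random number per row. The name p_row is introduced here purely for illustration.

```r
df1 <- data.frame(x1 = c(1, 1, 2, 2, 3, 4), y1 = c(0, 0, 1, 1, 2, 2))
df2 <- data.frame(y = c(0, 1, 2), p = c(0.1, 0.6, 0.9))

# Look up the probability for each row of df1 by matching y1 against df2$y
p_row <- df2$p[match(df1$y1, df2$y)]

set.seed(123)
# One uniform draw per row; add 1 where the draw is at or below that row's p
df1$x2 <- df1$x1 + (runif(nrow(df1)) <= p_row)
df1$x2
```

runif(nrow(df1)) produces the same sequence as that many successive runif(1) calls after the same seed, so the result should agree with calling example_function row by row.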
One way to solve the problem is working with "merge" and "mapply" instead of "lapply":
df_new <- merge(df1, df2, by.x = 'y1', by.y = 'y')
set.seed(123)
df1$x2 <- mapply(example_function,df1$x1,df_new$p)
> df1
x1 y1 x2
1 1 0 1
2 1 0 1
3 2 1 3
4 2 1 2
5 3 2 3
6 4 2 5

R - Manhattan / Euclidean distance calculations into a matrix

I would like to ask for help with distance measures for continuous variables.
There is an example:
x1 = (0,0)
x2 = (1,0)
x3 = (5,5)
The example is to find the distance matrix for L1-norm and L2-norm(Euclidean).
I don't know how to compute in R to get the following answer:
I have tried to do it like this but it didn't work as expected.
y2 <- c(0,0)
y3 <- c(1,0)
y4 <- c(5,5)
y5 <- rbind(y2,y3,y4)
dist(y5)
y2 <- c(0,0)
y3 <- c(1,0)
y4 <- c(5,5)
mat <- rbind(y2, y3, y4)
d1 <- dist(mat, upper=TRUE, diag=TRUE, method="manhattan")
d1
# y2 y3 y4
# y2 0 1 10
# y3 1 0 9
# y4 10 9 0
d2 <- dist(mat, upper=TRUE, diag=TRUE)^2
d2
# y2 y3 y4
# y2 0 1 50
# y3 1 0 41
# y4 50 41 0
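Squaring the default dist() result gives squared Euclidean distances. If the plain L2 distances are wanted, and as an ordinary square matrix rather than a dist object, a minimal sketch using the three points from the question could look like this (the names pts, l1 and l2 are chosen here for illustration):

```r
# The three points from the question, one per row
pts <- rbind(x1 = c(0, 0), x2 = c(1, 0), x3 = c(5, 5))

# L1 (Manhattan) distances as a full symmetric matrix
l1 <- as.matrix(dist(pts, method = "manhattan"))

# L2 (Euclidean) distances, not squared
l2 <- as.matrix(dist(pts, method = "euclidean"))

l1
round(l2, 3)
```

as.matrix() fills in both triangles and the zero diagonal, so individual entries can be read off by name, for example l2["x1", "x3"].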

How to "unmelt" data with reshape r

I have a data frame that I melted using the reshape package that I would like to "un melt".
Here is a toy example of the melted data (the real data frame is 500x100 or larger):
variable<-c(rep("X1",3),rep("X2",3),rep("X3",3))
value<-c(rep(rnorm(1,.5,.2),3),rep(rnorm(1,.5,.2),3),rep(rnorm(1,.5,.2),3))
dat <-data.frame(variable,value)
dat
variable value
1 X1 0.5285376
2 X1 0.5285376
3 X1 0.5285376
4 X2 0.1694908
5 X2 0.1694908
6 X2 0.1694908
7 X3 0.7446906
8 X3 0.7446906
9 X3 0.7446906
Each variable (X1, X2,X3) has values estimated at 3 different times (which in this toy example happen to be the same, but this is never the case).
I would like to get it (back) in the form of :
X1 X2 X3
1 0.5285376 0.1694908 0.7446906
2 0.5285376 0.1694908 0.7446906
3 0.5285376 0.1694908 0.7446906
Basically, I would like the variable column to be sorted on ID (X1, X2, etc.) and become column headings. I have tried various permutations of cast, dcast, recast, etc., and can't seem to get the data into the format that I want. It was easy enough to 'melt' the data from the wide form to the long form (e.g. the dat dataset), but getting it back is proving difficult. Any ideas? I know this is relatively simple, but I am having a hard time conceptualizing how to do this in reshape or reshape2.
Thanks,
LP
I typically do this by creating an id column and then using dcast:
> dat
variable value
1 X1 0.4299397
2 X1 0.4299397
3 X1 0.4299397
4 X2 0.2531551
5 X2 0.2531551
6 X2 0.2531551
7 X3 0.3972119
8 X3 0.3972119
9 X3 0.3972119
> dat$id <- rep(1:3,times = 3)
> dcast(data = dat,formula = id~variable,fun.aggregate = sum,value.var = "value")
id X1 X2 X3
1 1 0.4299397 0.2531551 0.3972119
2 2 0.4299397 0.2531551 0.3972119
3 3 0.4299397 0.2531551 0.3972119
Depending on how robust you need this to be, the following will cast correctly for a varying number of occurrences of each variable (and in any order).
> variable<-c(rep("X1",5),rep("X2",4),rep("X3",3))
> value<-c(rep(rnorm(1,.5,.2),5),rep(rnorm(1,.5,.2),4),rep(rnorm(1,.5,.2),3))
> dat <-data.frame(variable,value)
> dat <- dat[order(rnorm(nrow(dat))),]
> dat
variable value
11 X3 1.0294454
8 X2 0.6147509
2 X1 0.3537012
7 X2 0.6147509
9 X2 0.6147509
5 X1 0.3537012
4 X1 0.3537012
12 X3 1.0294454
3 X1 0.3537012
1 X1 0.3537012
10 X3 1.0294454
6 X2 0.6147509
> dat$id = numeric(nrow(dat))
> for (i in 1:nrow(dat)){
+ dat_temp <- dat[1:i,]
+ dat[i,]$id <- nrow(dat_temp[dat_temp$variable == dat[i,]$variable,])
+ }
> cast(dat, id~variable, value = 'value')
id X1 X2 X3
1 1 0.3537012 0.6147509 1.029445
2 2 0.3537012 0.6147509 1.029445
3 3 0.3537012 0.6147509 1.029445
4 4 0.3537012 0.6147509 NA
5 5 0.3537012 NA NA
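The running count built by the loop above can also be produced without a loop. A minimal sketch, assuming the same variable/value layout with made-up values: it uses base R's ave() for the per-variable counter and tapply() for the reshaping so that it runs without extra packages. The resulting id column could equally be passed to dcast or cast as in the answers above.

```r
# Toy long-format data with an unequal number of rows per variable
dat <- data.frame(
  variable = c(rep("X1", 5), rep("X2", 4), rep("X3", 3)),
  value    = c(rep(0.35, 5), rep(0.61, 4), rep(1.03, 3))
)
set.seed(1)
dat <- dat[sample(nrow(dat)), ]  # shuffle rows to mimic arbitrary order

# Number the occurrences of each variable in order of appearance: 1, 2, 3, ...
dat$id <- ave(seq_along(dat$variable), dat$variable, FUN = seq_along)

# One row per id, one column per variable; missing combinations become NA
wide <- tapply(dat$value, list(id = dat$id, variable = dat$variable), sum)
wide
```

Each id/variable cell holds exactly one value here, so sum simply passes it through.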

R: Data Frame Manipulations

I have two data frames:
>df1
type id1 id2 id3 count1 count2 count3
a x1 y1 z1 10 20 0
b x2 y2 z2 20 0 30
c x3 y3 z3 10 10 10
>df2
id prop
x1 10
x2 5
x3 100
y1 0
y2 50
y3 80
z1 10
z2 20
z3 30
The count* columns act as weights. I want to join the tables so that TotalProp is the sum of each id's prop weighted by the corresponding count.
For example, for the first row in df1: TotalProp = 10 (prop for x1) * 10 (count1) + 0 (prop for y1) * 20 (count2) + 10 (prop for z1) * 0 (count3) = 100
Hence my final table looks like this:
>result
type id1 id2 id3 TotalProp
a x1 y1 z1 100
b x2 y2 z2 700
c x3 y3 z3 2100
Any idea how I can do this?
Thanks.
One line solution first and then explanation using multiple steps
df1
## type id1 id2 id3 count1 count2 count3
## 1 a x1 y1 z1 10 20 0
## 2 b x2 y2 z2 20 0 30
## 3 c x3 y3 z3 10 10 10
df2
## id prop
## x1 x1 10
## x2 x2 5
## x3 x3 100
## y1 y1 0
## y2 y2 50
## y3 y3 80
## z1 z1 10
## z2 z2 20
## z3 z3 30
rownames(df2) <- df2$id
result <- data.frame(
  type = df1$type,
  TotalProp = rowSums(
    matrix(df2[unlist(df1[, c("id1", "id2", "id3")]), "prop"], nrow = nrow(df1)) *
      as.matrix(df1[, c("count1", "count2", "count3")])
  )
)
result
## type TotalProp
## 1 a 100
## 2 b 700
## 3 c 2100
Stepwise explanation
First we get all the id values in a vector for which we want to fetch corresponding prop values from df2
Step 1
unlist(df1[, c("id1", "id2", "id3")])
## id11 id12 id13 id21 id22 id23 id31 id32 id33
## "x1" "x2" "x3" "y1" "y2" "y3" "z1" "z2" "z3"
Step 2
We name the rows of df2 with df2$id.
rownames(df2) <- df2$id
Step 3
Then using result from step 1, we get corresponding prop values
df2[unlist(df1[, c("id1", "id2", "id3")]), "prop"]
## [1] 10 5 100 0 50 80 10 20 30
Step 4
Convert the vector from step 3 back to 2 dimensional form
matrix(df2[unlist(df1[, c("id1", "id2", "id3")]), "prop"], nrow = nrow(df1))
## [,1] [,2] [,3]
## [1,] 10 0 10
## [2,] 5 50 20
## [3,] 100 80 30
Step 5
Multiply result of Step 4 with counts from df1
as.matrix(df1[, c("count1", "count2", "count3")])
## count1 count2 count3
## [1,] 10 20 0
## [2,] 20 0 30
## [3,] 10 10 10
matrix(df2[unlist(df1[, c("id1", "id2", "id3")]), "prop"], nrow = nrow(df1)) *
as.matrix(df1[, c("count1", "count2", "count3")])
## count1 count2 count3
## [1,] 100 0 0
## [2,] 100 0 600
## [3,] 1000 800 300
Step 6
Apply rowSums to result from step 5 to get desired TotalProp values
rowSums(matrix(df2[unlist(df1[,c('id1','id2','id3')]),'prop'], nrow=nrow(df1)) * as.matrix(df1[,c('count1', 'count2', 'count3')]))
## [1] 100 700 2100
My solution relies on the data structure, so it is not universal, but short.
m1 <- as.matrix(df1[, tail(names(df1), 3)])
m2 <- matrix(df2$prop, 3)
rowSums(m1 * m2)
[1] 100 700 2100
It does not use ids whatsoever, so be careful!
And another way...
TotalProp <- apply(df1,1,function(x) {
sapply(x[2:4],function(x)df2[df2$id==x,]$prop) %*% as.numeric(x[5:7])
})
result <- cbind(df1[1:4],TotalProp)
%*% is the inner product operator: it multiplies two vectors element by element and sums the result, so this is somewhat like @ChinmayPatil's answer, which uses rowSums. The steps are:
1. For each row in df1, extract the prop values from df2 whose id matches columns 2:4 of df1.
2. Form the inner product of the vector from step 1 with the vector formed from columns 5:7 of df1.
3. Repeat for each row of df1 [apply(df1, 1, ...)].
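The same lookup-then-weight idea can be written with match(), which does not depend on row names or on column positions. This is a minimal sketch that rebuilds the data from the question; the helper names id_cols, count_cols and props are introduced here for illustration.

```r
df1 <- data.frame(
  type = c("a", "b", "c"),
  id1 = c("x1", "x2", "x3"),
  id2 = c("y1", "y2", "y3"),
  id3 = c("z1", "z2", "z3"),
  count1 = c(10, 20, 10),
  count2 = c(20, 0, 10),
  count3 = c(0, 30, 10),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  id = c("x1", "x2", "x3", "y1", "y2", "y3", "z1", "z2", "z3"),
  prop = c(10, 5, 100, 0, 50, 80, 10, 20, 30),
  stringsAsFactors = FALSE
)

id_cols <- c("id1", "id2", "id3")
count_cols <- c("count1", "count2", "count3")

# For each id column, fetch the matching prop values from df2
props <- sapply(df1[id_cols], function(ids) df2$prop[match(ids, df2$id)])

# Weight each prop by its count and add up across the three columns
df1$TotalProp <- rowSums(props * as.matrix(df1[count_cols]))
df1[c("type", id_cols, "TotalProp")]
```

An id that is missing from df2 would come back as NA from match() and make that row's TotalProp NA, which is usually easier to spot than a silently misaligned row.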

Selecting values from a 3-column dataframe in R

I have a three-column data frame with the variables x, y and z. x is a list of places, y is a list of times, and z is a list of names. The names do not start at the same initial time across the places:
x y z
x1 1 NA
x1 2 z2
x1 3 z3
x1 4 z1
x2 1 NA
x2 2 NA
x2 3 z5
x2 4 z3
x3 1 z3
x3 2 z1
x3 3 z2
x3 4 z2
How do I find the first z for every x? I want the output matrix or dataframe to be:
x z
x1 z2
x2 z5
x3 z3
EDITED, after example data was supplied
You can use function ddply() in package plyr
dat <- "x y z
x1 1 NA
x1 2 z2
x1 3 z3
x1 4 z1
x2 1 NA
x2 2 NA
x2 3 z5
x2 4 z3
x3 1 z3
x3 2 z1
x3 3 z2
x3 4 z2"
df <- read.table(textConnection(dat), header=TRUE, stringsAsFactors=FALSE)
library(plyr)
ddply(df, .(x), function(x)x[!is.na(x$z), ][1, "z"])
x V1
1 x1 z2
2 x2 z5
3 x3 z3
If you don't want to use plyr
t(data.frame(lapply(split(df, as.factor(df$x)), function(k) head(k$z[!is.na(k$z)], 1))))
[,1]
x1 "z2"
x2 "z5"
x3 "z3"
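A slightly more readable base R variant is sketched below. It assumes that "first" means the earliest y within each x, so the rows are sorted first, and it returns a data frame with the x and z columns asked for. The names first_z and result are chosen here for illustration.

```r
dat <- "x y z
x1 1 NA
x1 2 z2
x1 3 z3
x1 4 z1
x2 1 NA
x2 2 NA
x2 3 z5
x2 4 z3
x3 1 z3
x3 2 z1
x3 3 z2
x3 4 z2"
df <- read.table(text = dat, header = TRUE, stringsAsFactors = FALSE)

# Make sure rows are in time order within each place
df <- df[order(df$x, df$y), ]

# For each place, keep the first z that is not NA
first_z <- sapply(split(df$z, df$x), function(z) z[!is.na(z)][1])

result <- data.frame(x = names(first_z), z = unname(first_z))
result
```

A place whose z values are all NA would get NA in the result rather than being dropped.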
