Aloha,
I am trying to cast() a dataset so that every unique combination of W-X-Y returns the maximum value of Z and the associated week. For example:
W X Y week Z
w1 x1 y1 1 0
w1 x1 y1 2 0.1
w1 x1 y1 3 0.2
w2 x2 y1 1 0.5
w2 x2 y1 2 0.7
w2 x2 y1 3 0.3
w3 x1 y1 1 0.1
w3 x1 y1 2 0.2
w3 x1 y1 3 0.5
w4 x2 y2 1 0.7
w4 x2 y2 2 0.3
w4 x2 y2 3 0.1
w5 x1 y2 1 0.3
w5 x1 y2 2 0.1
w5 x1 y2 3 0.2
Can I do this with cast()? With the following I can return the maximum Z per unique W-X-Y combination, but not the week:
cast(foo, W + X + Y ~ ., max, value="Z")
For the above dataset, I would like the output to look as such:
W X Y week Z
w1 x1 y1 3 0.2
w2 x2 y1 2 0.7
w3 x1 y1 3 0.5
w4 x2 y2 1 0.7
w5 x1 y2 1 0.3
Mahalo for your suggestions!
cast is not the right tool for this. Consider instead the functions in the plyr package:
library("plyr")
ddply(foo, .(W, X, Y), summarise, week=week[which.max(Z)], Z=max(Z))
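If plyr is not available, the same result can be sketched in base R by sorting each W-X-Y group so its largest Z comes first and then keeping the first row per group. This is only an illustration: `foo` is rebuilt here from the question's table, and the names `ord` and `best` are made up for the example.

```r
# Rebuild the example data from the question
foo <- data.frame(
  W    = rep(c("w1", "w2", "w3", "w4", "w5"), each = 3),
  X    = rep(c("x1", "x2", "x1", "x2", "x1"), each = 3),
  Y    = rep(c("y1", "y1", "y1", "y2", "y2"), each = 3),
  week = rep(1:3, times = 5),
  Z    = c(0, 0.1, 0.2, 0.5, 0.7, 0.3, 0.1, 0.2, 0.5,
           0.7, 0.3, 0.1, 0.3, 0.1, 0.2),
  stringsAsFactors = FALSE
)

# Sort by group, with the largest Z first inside each group
ord <- foo[order(foo$W, foo$X, foo$Y, -foo$Z), ]

# Keep only the first row of every W-X-Y combination
best <- ord[!duplicated(ord[c("W", "X", "Y")]), ]
rownames(best) <- NULL
best
```

Because the sort is stable, ties in Z resolve to the earliest row, which matches what which.max() does.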
I am still getting the hang of R and coding in general, so bear with me on this.
My problem: this is a dimension reduction idea consisting of three steps. I need help with the first two.
1. Bin rows.
2. Transpose the binned rows into new columns, so the number of columns is multiplied by the bin size and the number of rows is divided by it.
3. Perform PCA to then reduce the columns.
So the data would go from this:
A B C D
1 W1 X1 Y1 Z1
2 W2 X2 Y2 Z2
3 W3 X3 Y3 Z3
4 W4 X4 Y4 Z4
5 W5 X5 Y5 Z5
6 W6 X6 Y6 Z6
So, if I bin by 2 and transpose, it would look something like this:
A A B B C C D D
1 W1 W2 X1 X2 Y1 Y2 Z1 Z2
2 W3 W4 X3 X4 Y3 Y4 Z3 Z4
3 W5 W6 X5 X6 Y5 Y6 Z5 Z6
I'm pretty sure I need to nest bin and transpose in some sort of function, but I'm not sure which comes first, or really at all how to approach this, so any suggestions will help!
I really hope this makes some sense, let me know how I can rephrase if needed!
EDIT
I am working with integer data types; here is a snippet of my actual data that I'd like to bin and expand.
> head(dataset[1:4])
EMG1 EMG2 EMG3 EMG4
1 32744 32571 32935 32279
2 32788 32934 32767 32624
3 32828 33202 32587 32377
4 32870 33269 32423 32954
5 32838 33319 32126 32721
6 32903 33502 32652 32151
Assuming these letter-digit entries are not supposed to be stand-ins for numerics, I would first run this:
dat[] <- lapply(dat, as.character) # ensures we get rid of factors
This uses recycling of logical indices inside a function that gets serially applied across your dataframe to create two lists from each column. That is then coerced to a dataframe. The initial result res has rather odd names which get shortened with some simple regex work.
res <- data.frame( lapply(dat,
function(cl){list( list(cl[c(TRUE,FALSE)],
list(cl[!c(TRUE,FALSE)]) )) }))
names(res) <- sub("\\..+$", "", names(res))
> res
A A B B C C D D
1 W1 W2 X1 X2 Y1 Y2 Z1 Z2
2 W3 W4 X3 X4 Y3 Y4 Z3 Z4
3 W5 W6 X5 X6 Y5 Y6 Z5 Z6
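For a general bin size, and for numeric data such as the EMG columns in the question, one possible alternative is to fold each column into a matrix with one row per bin and then bind the pieces side by side. This is a sketch that assumes the number of rows is an exact multiple of the bin size; `bin` is a name introduced here for illustration.

```r
# Toy data matching the question's layout
dat <- data.frame(
  A = paste0("W", 1:6), B = paste0("X", 1:6),
  C = paste0("Y", 1:6), D = paste0("Z", 1:6),
  stringsAsFactors = FALSE
)

bin <- 2  # rows per bin (assumed to divide nrow(dat) exactly)
stopifnot(nrow(dat) %% bin == 0)

# Fold each column into a matrix: one row per bin, one column per position in the bin
folded <- lapply(dat, function(cl) matrix(cl, ncol = bin, byrow = TRUE))

# Bind the per-column matrices side by side and repeat the original names
res <- do.call(cbind, folded)
colnames(res) <- rep(names(dat), each = bin)
res
```

The result is a matrix rather than a data frame. With numeric input it stays numeric, so it should be usable as input to a PCA function such as prcomp().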
I would like to ask help on distance measures for continuous variables
There is an example:
x1 = (0,0)
x2 = (1,0)
x3 = (5,5)
The example is to find the distance matrix for the L1-norm and the L2-norm (Euclidean).
I don't know how to compute this in R to get the following answer:
I have tried to do it like this but it didn't work as expected.
y2 <- c(0,0)
y3 <- c(1,0)
y4 <- c(5,5)
y5 <- rbind(y2,y3,y4)
dist(y5)
y2 <- c(0,0)
y3 <- c(1,0)
y4 <- c(5,5)
mat <- rbind(y2, y3, y4)
d1 <- dist(mat, upper=TRUE, diag=TRUE, method="manhattan")
d1
# y2 y3 y4
# y2 0 1 10
# y3 1 0 9
# y4 10 9 0
d2 <- dist(mat, upper=TRUE, diag=TRUE)^2  # squared Euclidean distances
d2
# y2 y3 y4
# y2 0 1 50
# y3 1 0 41
# y4 50 41 0
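Note that d2 above holds squared Euclidean distances. If the plain (unsquared) L2 distances are wanted instead, a minimal sketch is to drop the squaring; the names `l1`, `l2` and `l2_manual` are introduced here only for illustration.

```r
y2 <- c(0, 0)
y3 <- c(1, 0)
y4 <- c(5, 5)
mat <- rbind(y2, y3, y4)

# Full symmetric matrices instead of the compact "dist" object
l1 <- as.matrix(dist(mat, method = "manhattan"))  # L1 norm
l2 <- as.matrix(dist(mat, method = "euclidean"))  # L2 norm (the default method)

# Hand check of one entry: distance between (0,0) and (5,5)
l2_manual <- sqrt(sum((y2 - y4)^2))

l1
l2
```

Squaring `l2` element-wise recovers the values printed for d2.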
I have a data frame that I melted using the reshape package that I would like to "un melt".
Here is a toy example of the melted data (the real data frame is 500x100 or larger):
variable<-c(rep("X1",3),rep("X2",3),rep("X3",3))
value<-c(rep(rnorm(1,.5,.2),3),rep(rnorm(1,.5,.2),3),rep(rnorm(1,.5,.2),3))
dat <-data.frame(variable,value)
dat
variable value
1 X1 0.5285376
2 X1 0.5285376
3 X1 0.5285376
4 X2 0.1694908
5 X2 0.1694908
6 X2 0.1694908
7 X3 0.7446906
8 X3 0.7446906
9 X3 0.7446906
Each variable (X1, X2,X3) has values estimated at 3 different times (which in this toy example happen to be the same, but this is never the case).
I would like to get it (back) in the form of :
X1 X2 X3
1 0.5285376 0.1694908 0.7446906
2 0.5285376 0.1694908 0.7446906
3 0.5285376 0.1694908 0.7446906
Basically, I would like the variable column to be sorted on ID (X1, X2, etc.) and become column headings. I have tried various permutations of cast, dcast, recast, etc., and can't seem to get the data into the format that I want. It was easy enough to 'melt' data from the wide form to the longer form (e.g. the dat dataset), but getting it back is proving difficult. Any ideas? I know this is relatively simple, but I am having a hard time conceptualizing how to do this in reshape or reshape2.
Thanks,
LP
I typically do this by creating an id column and then using dcast:
> dat
variable value
1 X1 0.4299397
2 X1 0.4299397
3 X1 0.4299397
4 X2 0.2531551
5 X2 0.2531551
6 X2 0.2531551
7 X3 0.3972119
8 X3 0.3972119
9 X3 0.3972119
> dat$id <- rep(1:3,times = 3)
> dcast(data = dat,formula = id~variable,fun.aggregate = sum,value.var = "value")
id X1 X2 X3
1 1 0.4299397 0.2531551 0.3972119
2 2 0.4299397 0.2531551 0.3972119
3 3 0.4299397 0.2531551 0.3972119
Depending on how robust you need this to be, the following will correctly cast for a varying number of occurrences of each variable (and in any order).
> variable<-c(rep("X1",5),rep("X2",4),rep("X3",3))
> value<-c(rep(rnorm(1,.5,.2),5),rep(rnorm(1,.5,.2),4),rep(rnorm(1,.5,.2),3))
> dat <-data.frame(variable,value)
> dat <- dat[order(rnorm(nrow(dat))),]
> dat
variable value
11 X3 1.0294454
8 X2 0.6147509
2 X1 0.3537012
7 X2 0.6147509
9 X2 0.6147509
5 X1 0.3537012
4 X1 0.3537012
12 X3 1.0294454
3 X1 0.3537012
1 X1 0.3537012
10 X3 1.0294454
6 X2 0.6147509
> dat$id = numeric(nrow(dat))
> for (i in 1:nrow(dat)){
+ dat_temp <- dat[1:i,]
+ dat[i,]$id <- nrow(dat_temp[dat_temp$variable == dat[i,]$variable,])
+ }
> cast(dat, id~variable, value = 'value')
id X1 X2 X3
1 1 0.3537012 0.6147509 1.029445
2 2 0.3537012 0.6147509 1.029445
3 3 0.3537012 0.6147509 1.029445
4 4 0.3537012 0.6147509 NA
5 5 0.3537012 NA NA
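The running id built by the loop above can also be computed without a loop. The sketch below uses base R's ave() to number the occurrences of each variable in row order, then spreads the values with tapply() so it runs without any package. The data are a small made-up example, and `wide` is an illustrative name.

```r
# Small shuffled example with unequal counts per variable
variable <- c("X3", "X2", "X1", "X2", "X1", "X1", "X3")
value    <- unname(c(X1 = 0.35, X2 = 0.61, X3 = 1.03)[variable])
dat <- data.frame(variable, value, stringsAsFactors = FALSE)

# Occurrence counter within each variable, in the current row order
dat$id <- ave(seq_along(dat$variable), dat$variable, FUN = seq_along)

# One row per id, one column per variable; missing combinations become NA
wide <- tapply(dat$value, list(id = dat$id, variable = dat$variable),
               FUN = function(v) v[1])
wide
```

The same `id` column can be passed to cast() or dcast() with the formula id ~ variable, as in the answers above.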
Specifically,
I used the following set up:
newdata <- tapply(mydata(#), list(mydata(X), mydata(Y)), sum)
I currently have a table laid out as follows:
X= State, Y= County within State, #= a numerical total of something
__ Y1 Y2 Y3 Yn
X1 ## ## ## ##
X2 ## ## ## ##
X3 ## ## ## ##
Xn ## ## ## ##
What I need is a table listed as follows:
X1 Y1 ##
X1 Y2 ##
X1 Y3 ##
X1 Yn ##
X2 Y1 ##
X2 Y2 ##
X2 Y3 ##
X2 Yn ##
Xn Y1 ##
Xn Y2 ##
Xn Y3 ##
Xn Yn ##
library(reshape2)
new_data <- melt(old_data, id.vars=1)
Look into ?melt for more details on syntax.
example:
> df <- data.frame(x=1:5, y1=rnorm(5), y2=rnorm(5))
> df
x y1 y2
1 1 -1.3417817 -1.1777317
2 2 -0.4014688 1.4653270
3 3 0.4050132 1.5547598
4 4 0.1622901 -1.2976084
5 5 -0.7207541 -0.1203277
> melt(df, id.vars=1)
x variable value
1 1 y1 -1.3417817
2 2 y1 -0.4014688
3 3 y1 0.4050132
4 4 y1 0.1622901
5 5 y1 -0.7207541
6 1 y2 -1.1777317
7 2 y2 1.4653270
8 3 y2 1.5547598
9 4 y2 -1.2976084
10 5 y2 -0.1203277
Some example data
mydata <- data.frame(num=rnorm(40),
gp1=rep(LETTERS[1:2],2),
gp2=rep(letters[1:2],each=2))
And applying tapply to it:
tmp <- tapply(mydata$num, list(mydata$gp1, mydata$gp2), sum)
The result of tapply is a matrix, but you can treat it like a table and use as.data.frame.table to convert it. This does not rely on any additional packages.
as.data.frame.table(tmp)
The two different data structures look like:
> tmp
a b
A 8.381483 6.373657
B 2.379303 -1.189488
> as.data.frame.table(tmp)
Var1 Var2 Freq
1 A a 8.381483
2 B a 2.379303
3 A b 6.373657
4 B b -1.189488
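The default column names Var1, Var2 and Freq can be replaced in the same step. As a sketch, naming the grouping list passed to tapply sets the names of the first two columns, and the responseName argument of as.data.frame.table sets the third. The names `state`, `county` and `total` are illustrative, chosen to match the question's description, and the data are deterministic so the sums are easy to check.

```r
# Deterministic example data: 8 values in 2 x 2 groups
mydata <- data.frame(
  num = 1:8,
  gp1 = rep(c("A", "B"), times = 4),
  gp2 = rep(rep(c("a", "b"), each = 2), times = 2),
  stringsAsFactors = FALSE
)

# Naming the list elements names the dimensions of the resulting matrix
tmp <- tapply(mydata$num, list(state = mydata$gp1, county = mydata$gp2), sum)

# Long format with meaningful column names
long <- as.data.frame.table(tmp, responseName = "total")
long
```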
I have a 3-dimensional array, the variables being x, y and z. x is a list of places, y is a list of times, and z is a list of names. The names do not start at the same initial time across the places:
x y z
x1 1 NA
x1 2 z2
x1 3 z3
x1 4 z1
x2 1 NA
x2 2 NA
x2 3 z5
x2 4 z3
x3 1 z3
x3 2 z1
x3 3 z2
x3 4 z2
How do I find the first z for every x? I want the output matrix or dataframe to be:
x z
x1 z2
x2 z5
x3 z3
EDITED, after example data was supplied
You can use function ddply() in package plyr
dat <- "x y z
x1 1 NA
x1 2 z2
x1 3 z3
x1 4 z1
x2 1 NA
x2 2 NA
x2 3 z5
x2 4 z3
x3 1 z3
x3 2 z1
x3 3 z2
x3 4 z2"
df <- read.table(textConnection(dat), header=TRUE, stringsAsFactors=FALSE)
library(plyr)
ddply(df, .(x), function(x)x[!is.na(x$z), ][1, "z"])
x V1
1 x1 z2
2 x2 z5
3 x3 z3
If you don't want to use plyr:
t(data.frame(lapply(split(df, as.factor(df$x)), function(k) head(k$z[!is.na(k$z)], 1))))
[,1]
x1 "z2"
x2 "z5"
x3 "z3"