I have a data frame that I melted using the reshape package that I would like to "un melt".
Here is a toy example of the melted data (the real data frame is 500x100 or larger):
variable<-c(rep("X1",3),rep("X2",3),rep("X3",3))
value<-c(rep(rnorm(1,.5,.2),3),rep(rnorm(1,.5,.2),3),rep(rnorm(1,.5,.2),3))
dat <-data.frame(variable,value)
dat
variable value
1 X1 0.5285376
2 X1 0.5285376
3 X1 0.5285376
4 X2 0.1694908
5 X2 0.1694908
6 X2 0.1694908
7 X3 0.7446906
8 X3 0.7446906
9 X3 0.7446906
Each variable (X1, X2, X3) has values estimated at 3 different times (in this toy example the values happen to be the same, but that is never the case in the real data).
I would like to get it (back) in the form of :
X1 X2 X3
1 0.5285376 0.1694908 0.7446906
2 0.5285376 0.1694908 0.7446906
3 0.5285376 0.1694908 0.7446906
Basically, I would like the variable column to be sorted on ID (X1, X2, etc.) and become column headings. I have tried various permutations of cast, dcast, recast, etc., and can't seem to get the data in the format that I want. It was easy enough to 'melt' data from the wide form to the long form (e.g. the dat dataset), but getting it back is proving difficult. Any ideas? I know this is relatively simple, but I am having a hard time conceptualizing how to do this in reshape or reshape2.
Thanks,
LP
I typically do this by creating an id column and then using dcast:
> dat
variable value
1 X1 0.4299397
2 X1 0.4299397
3 X1 0.4299397
4 X2 0.2531551
5 X2 0.2531551
6 X2 0.2531551
7 X3 0.3972119
8 X3 0.3972119
9 X3 0.3972119
> dat$id <- rep(1:3,times = 3)
> dcast(data = dat,formula = id~variable,fun.aggregate = sum,value.var = "value")
id X1 X2 X3
1 1 0.4299397 0.2531551 0.3972119
2 2 0.4299397 0.2531551 0.3972119
3 3 0.4299397 0.2531551 0.3972119
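For the balanced case above, where every variable has the same number of rows, base R's unstack() may also be worth a look. This is only a minimal sketch and not part of the answer above; the values are made up for illustration, and it assumes the rows within each variable are already in the order you want.

```r
# Toy melted data: three variables, three rows each (made-up values)
dat <- data.frame(
  variable = rep(c("X1", "X2", "X3"), each = 3),
  value    = rep(c(0.43, 0.25, 0.40), each = 3)
)

# unstack() splits `value` by `variable`; because all groups have the
# same length here, the result is a data frame with one column per variable
wide <- unstack(dat, value ~ variable)
wide
```

If the groups have different lengths, unstack() returns a list rather than a data frame, so the id-based approach is the safer choice there.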
Depending on how robust you need this to be, the following will cast correctly when the variables occur a varying number of times (and in any order).
> variable<-c(rep("X1",5),rep("X2",4),rep("X3",3))
> value<-c(rep(rnorm(1,.5,.2),5),rep(rnorm(1,.5,.2),4),rep(rnorm(1,.5,.2),3))
> dat <-data.frame(variable,value)
> dat <- dat[order(rnorm(nrow(dat))),]
> dat
variable value
11 X3 1.0294454
8 X2 0.6147509
2 X1 0.3537012
7 X2 0.6147509
9 X2 0.6147509
5 X1 0.3537012
4 X1 0.3537012
12 X3 1.0294454
3 X1 0.3537012
1 X1 0.3537012
10 X3 1.0294454
6 X2 0.6147509
> dat$id = numeric(nrow(dat))
> for (i in 1:nrow(dat)){
+ dat_temp <- dat[1:i,]
+ dat[i,]$id <- nrow(dat_temp[dat_temp$variable == dat[i,]$variable,])
+ }
> cast(dat, id~variable, value = 'value')
id X1 X2 X3
1 1 0.3537012 0.6147509 1.029445
2 2 0.3537012 0.6147509 1.029445
3 3 0.3537012 0.6147509 1.029445
4 4 0.3537012 0.6147509 NA
5 5 0.3537012 NA NA
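The row-by-row loop above can get slow on larger data. A vectorized way to build the same running id is sketched below with base R's ave(). This is a suggested alternative, not the code from the answer above, and the values are made up. The final step uses tapply() only so that the sketch runs without extra packages; the same id column should work with cast(dat, id~variable, value = 'value') as shown above.

```r
# Toy long-format data with unequal group sizes, in shuffled row order
set.seed(1)
dat <- data.frame(
  variable = c(rep("X1", 5), rep("X2", 4), rep("X3", 3)),
  value    = c(rep(0.35, 5), rep(0.61, 4), rep(1.03, 3))
)
dat <- dat[sample(nrow(dat)), ]

# Running count within each level of `variable`, in order of appearance
dat$id <- ave(seq_len(nrow(dat)), dat$variable, FUN = seq_along)

# Each (id, variable) pair now occurs at most once, so it can be spread
# into a wide matrix; empty cells are filled with NA
wide <- tapply(dat$value, list(dat$id, dat$variable), function(v) v[1])
wide
```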
I have a matrix that looks like below:
x1<-c(1,2,3,4,5,6,NA)
x2<-c(1,2,NA,4,5,NA,NA)
x3<-c(1,2,3,4,NA,NA,NA)
x4<-c(1,2,3,NA,NA,NA,NA)
x5<-c(1,2,NA,NA,NA,NA,NA)
x<-cbind(x1,x2,x3,x4,x5)
I want to calculate the sum of the last 3 non-NA values of each column. If a column has fewer than 3 non-NA values (like column 5), I want to sum all the non-NA values in that column. The output should look like
15 11 9 6 3
Thank you!
You can use apply with tail to sum the last three non-NA values of each column:
apply(x, 2, function(x) sum(tail(x[!is.na(x)], 3)))
#x1 x2 x3 x4 x5
#15 11 9 6 3
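To see why no special case is needed for short columns, here is the same logic stepped through for column x5 from the question. It is just a sketch of the intermediate values; the names non_na, last3 and total are made up for illustration.

```r
x5 <- c(1, 2, NA, NA, NA, NA, NA)

non_na <- x5[!is.na(x5)]   # drop the NAs, leaving 1 2
last3  <- tail(non_na, 3)  # tail() returns whatever is there if fewer than 3
total  <- sum(last3)
total
```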
It also works with a custom function (@GKi's answer is pretty cool):
# Build function
myfun <- function(y)
{
  # Count the non-NA values
  i <- sum(!is.na(y))
  if (i < 3)
  {
    # Fewer than 3 non-NA values: sum them all
    r1 <- sum(y, na.rm = TRUE)
  } else
  {
    # Otherwise sum the last 3 non-NA values
    y1 <- y[!is.na(y)]
    y2 <- y1[(length(y1) - 2):length(y1)]
    r1 <- sum(y2)
  }
  return(r1)
}
# Apply to each column
apply(x, 2, myfun)
Output:
x1 x2 x3 x4 x5
15 11 9 6 3
One dplyr option using the logic from @GKi could be:
library(dplyr)
x %>%
  data.frame() %>%
  summarise(across(everything(), ~ sum(tail(na.omit(.), 3))))
x1 x2 x3 x4 x5
1 15 11 9 6 3
Or:
x %>%
data.frame() %>%
summarise(across(everything(), ~ sum(rev(na.omit(.))[1:3], na.rm = TRUE)))
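The na.rm = TRUE in the second variant matters because indexing with [1:3] past the end of a short vector pads the result with NA. A small base R sketch of that behaviour, using column x5 from the question (vals and padded are made-up names):

```r
x5 <- c(1, 2, NA, NA, NA, NA, NA)

vals   <- rev(na.omit(x5))  # non-NA values, last one first
padded <- vals[1:3]         # only 2 values exist, so the third slot is NA

sum(padded)                 # NA, because of the padding
sum(padded, na.rm = TRUE)   # the padding is ignored
```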
Using sapply from base R
sapply(as.data.frame(x), function(x) sum(tail(na.omit(x), 3)))
# x1 x2 x3 x4 x5
#15 11 9 6 3
I have a data frame, df, that includes NA values.
df <- data.frame( X1= c(NA, 1, 4, NA),
X2 = c(34, 75, 1, 4),
X3= c(2,9,3,5))
My ideal outcome looks like:
X1 X2 X3 Min
1 NA 34 2 X3
2 1 75 9 X1
3 4 1 3 X2
4 NA 4 5 X2
I have tried
df$Min <- colnames(df)[apply(df,1,which.min, na.rm=TRUE)]
but this one didn't work
You don't need na.rm=TRUE with which.min(): it has no na.rm argument and already ignores NA values. Try this instead:
df$Min <- colnames(df)[apply(df,1,which.min)]
Output:
X1 X2 X3 Min
1 NA 34 2 X3
2 1 75 9 X1
3 4 1 3 X2
4 NA 4 5 X2
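A quick way to convince yourself of this is to run which.min() on a single row. The sketch below uses the first row of df from the question; row1, idx, res and empty are names made up for the illustration.

```r
row1 <- c(X1 = NA, X2 = 34, X3 = 2)

# NA is skipped; the result is the position of the smallest non-NA value
idx <- which.min(row1)
names(row1)[idx]

# which.min() takes only `x`, so passing na.rm fails with an error
res <- try(which.min(row1, na.rm = TRUE), silent = TRUE)
inherits(res, "try-error")

# Caveat: a row that is entirely NA gives a zero-length result
empty <- which.min(c(NA_real_, NA_real_))
length(empty)
```

The last case is worth keeping in mind: if some row is all NA, apply() can no longer simplify its result to a plain vector, so the one-liner would need extra handling.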
Code:
foo <- names(df)
df$Min <- apply(df, 1, function(x) foo[which.min(x)])
df
Output:
X1 X2 X3 Min
1 NA 34 2 X3
2 1 75 9 X1
3 4 1 3 X2
4 NA 4 5 X2
Here's an idea that will likely be faster and does not require any looping. You could replace NA with Inf, negate the data, then find the column holding the maximum in each row via max.col().
names(df)[max.col(-replace(df, is.na(df), Inf))]
# [1] "X3" "X1" "X2" "X2"
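One thing to be aware of: max.col() breaks ties at random by default. If rows can contain ties and you want a reproducible answer, a variant along these lines might be preferable. It is a sketch using the df from the question; converting to a matrix first is my own choice, not part of the answer above.

```r
df <- data.frame(X1 = c(NA, 1, 4, NA),
                 X2 = c(34, 75, 1, 4),
                 X3 = c(2, 9, 3, 5))

m <- as.matrix(df)
m[is.na(m)] <- Inf   # NA can never be the row minimum

# Negate so the row minimum becomes the row maximum;
# ties.method = "first" picks the leftmost column on ties
min_col <- names(df)[max.col(-m, ties.method = "first")]
min_col
```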
Also, not to forget, a data.table solution, given that dt <- as.data.table(df)
dt[ , Min:=names(dt)[match(min(.SD, na.rm=T), .SD)], by=1:nrow(dt)][]
# X1 X2 X3 Min
#1: NA 34 2 X3
#2: 1 75 9 X1
#3: 4 1 3 X2
#4: NA 4 5 X2
Not much simpler than the solutions above, just extending the choices here.
The data contains four fields: id, x1, x2, and x3.
id <- c(1,2,3,4,5,6,7,8,9,10)
x1 <- c(2,4,5,3,6,4,3,6,7,7)
x2 <- c(0,1,2,6,7,6,0,8,2,2)
x3 <- c(5,3,4,5,8,3,4,2,5,6)
DF <- data.frame(id, x1,x2,x3)
Before I ask the question, let me create a new field (minX) which is the min of (x1,x2,x3)
DF$minX <- pmin(DF$x1, DF$x2, DF$x3)
I need to create a new field, y, that is defined as follows
if min(x1,x2,x3) = x1, then y = "x1"
if min(x1,x2,x3) = x2, then y = "x2"
if min(x1,x2,x3) = x3, then y = "x3"
Note: we assume no ties.
As a simple solution, do:
VARS <- colnames(DF)[-1]
y <- VARS[apply(DF[, -1], MARGIN = 1, FUN = which.min)]
DF$y <- y
The function which.min returns the index of the minimum. If the minimum is not unique it returns the first one. Since you guarantee that there is no tie, this is not an issue here.
Finally, you should be familiar with apply, right? MARGIN = 1 applies the function FUN row-wise, while MARGIN = 2 applies FUN column-wise. It is a useful way to avoid a for loop when dealing with a matrix. Since your data frame contains only numeric/integer values, it behaves like a matrix, so we can use apply.
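In case the MARGIN argument is unfamiliar, here is a tiny sketch on a made-up 2x3 matrix showing the difference between the two directions:

```r
# Columns are (1, 2), (3, 4), (5, 6)
m <- matrix(1:6, nrow = 2)

row_sums <- apply(m, MARGIN = 1, FUN = sum)  # one result per row
col_sums <- apply(m, MARGIN = 2, FUN = sum)  # one result per column

row_sums
col_sums
```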
Here is another option using pmin and max.col
library(data.table)
setDT(DF)[, c("minx", "y") := list(do.call(pmin, .SD),
names(.SD)[max.col(-1*.SD)]), .SDcols= x1:x3]
DF
# id x1 x2 x3 minx y
# 1: 1 2 0 5 0 x2
# 2: 2 4 1 3 1 x2
# 3: 3 5 2 4 2 x2
# 4: 4 3 6 5 3 x1
# 5: 5 6 7 8 6 x1
# 6: 6 4 6 3 3 x3
# 7: 7 3 0 4 0 x2
# 8: 8 6 8 2 2 x3
# 9: 9 7 2 5 2 x2
#10: 10 7 2 6 2 x2
A data.table solution:
# create variables
id <- c(1,2,3,4,5,6,7,8,9,10)
x1 <- c(2,4,5,3,6,4,3,6,7,7)
x2 <- c(0,1,2,6,7,6,0,8,2,2)
x3 <- c(5,3,4,5,8,3,4,2,5,6)
DF <- data.frame(id, x1,x2,x3)
# load package and set data table, calculating min
library(data.table)
setDT(DF)[, minx := apply(.SD, 1, min), .SDcols=c("x1", "x2", "x3")]
# Create variable with name of minimum
DF[, y := apply(.SD, 1, function(x) names(x)[which.min(x)]), .SDcols = c("x1", "x2", "x3")]
# call result
DF
## id x1 x2 x3 minx y
1: 1 2 0 5 0 x2
2: 2 4 1 3 1 x2
3: 3 5 2 4 2 x2
4: 4 3 6 5 3 x1
5: 5 6 7 8 6 x1
6: 6 4 6 3 3 x3
7: 7 3 0 4 0 x2
8: 8 6 8 2 2 x3
9: 9 7 2 5 2 x2
10: 10 7 2 6 2 x2
The last step can be called directly, without the need to calculate minx.
Please note that data.table is particularly fast on large data sets.
######## EDIT TO ADD: DPLYR METHOD #########
For completeness, this would be a dplyr method to produce the same (final) result. This solution is credited to @eipi10 in a question I started out of this problem (see here):
DF %>% mutate(y = apply(.[,2:4], 1, function(x) names(x)[which.min(x)]))
This solution takes about the same time as the data.table one provided in the original answer when applied to a data frame with 1e6 rows (about 17 seconds on my Sony laptop).
Specifically,
I used the following set up:
newdata <- tapply(mydata(#), list(mydata(X), mydata(Y)), sum)
I currently have a table that is laid out as follows:
X= State, Y= County within State, #= a numerical total of something
__ Y1 Y2 Y3 Yn
X1 ## ## ## ##
X2 ## ## ## ##
X3 ## ## ## ##
Xn ## ## ## ##
What I need is a table listed as follows:
X1 Y1 ##
X1 Y2 ##
X1 Y3 ##
X1 Yn ##
X2 Y1 ##
X2 Y2 ##
X2 Y3 ##
X2 Yn ##
Xn Y1 ##
Xn Y2 ##
Xn Y3 ##
Xn Yn ##
library(reshape2)
new_data <- melt(old_data, id.vars=1)
Look into ?melt for more details on syntax.
example:
> df <- data.frame(x=1:5, y1=rnorm(5), y2=rnorm(5))
> df
x y1 y2
1 1 -1.3417817 -1.1777317
2 2 -0.4014688 1.4653270
3 3 0.4050132 1.5547598
4 4 0.1622901 -1.2976084
5 5 -0.7207541 -0.1203277
> melt(df, id.vars=1)
x variable value
1 1 y1 -1.3417817
2 2 y1 -0.4014688
3 3 y1 0.4050132
4 4 y1 0.1622901
5 5 y1 -0.7207541
6 1 y2 -1.1777317
7 2 y2 1.4653270
8 3 y2 1.5547598
9 4 y2 -1.2976084
10 5 y2 -0.1203277
Some example data
mydata <- data.frame(num=rnorm(40),
gp1=rep(LETTERS[1:2],2),
gp2=rep(letters[1:2],each=2))
And applying tapply to it:
tmp <- tapply(mydata$num, list(mydata$gp1, mydata$gp2), sum)
The result of tapply is a matrix, but you can treat it like a table and use as.data.frame.table to convert it. This does not rely on any additional packages.
as.data.frame.table(tmp)
The two different data structures look like:
> tmp
a b
A 8.381483 6.373657
B 2.379303 -1.189488
> as.data.frame.table(tmp)
Var1 Var2 Freq
1 A a 8.381483
2 B a 2.379303
3 A b 6.373657
4 B b -1.189488
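If the default Var1/Var2/Freq column names are not what you want, naming the grouping list in tapply() and passing responseName to as.data.frame.table() may help. The following is a sketch on made-up data; the column names state, county and total are placeholders for illustration, not names from the question.

```r
# Small made-up data set: two states, two counties, one numeric total
mydata <- data.frame(
  state  = c("A", "A", "B", "B", "A", "B"),
  county = c("a", "b", "a", "b", "a", "b"),
  num    = c(1, 2, 3, 4, 5, 6)
)

# Names given to the list entries become the dimension names of the matrix
tmp <- tapply(mydata$num,
              list(State = mydata$state, County = mydata$county),
              sum)

# Those names carry over as column names; responseName replaces "Freq"
long <- as.data.frame.table(tmp, responseName = "total")

# Order by state, then county, to match the layout asked for
long <- long[order(long$State, long$County), ]
long
```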
I have a data set with three variables: x, y and z. x is a list of places, y is a list of times, and z is a list of names. The names do not start at the same initial time across the places:
x y z
x1 1 NA
x1 2 z2
x1 3 z3
x1 4 z1
x2 1 NA
x2 2 NA
x2 3 z5
x2 4 z3
x3 1 z3
x3 2 z1
x3 3 z2
x3 4 z2
How do I find the first z for every x? I want the output matrix or dataframe to be:
x z
x1 z2
x2 z5
x3 z3
EDITED, after example data was supplied
You can use the ddply() function from the plyr package:
dat <- "x y z
x1 1 NA
x1 2 z2
x1 3 z3
x1 4 z1
x2 1 NA
x2 2 NA
x2 3 z5
x2 4 z3
x3 1 z3
x3 2 z1
x3 3 z2
x3 4 z2"
df <- read.table(textConnection(dat), header=TRUE, stringsAsFactors=FALSE)
library(plyr)
ddply(df, .(x), function(x)x[!is.na(x$z), ][1, "z"])
x V1
1 x1 z2
2 x2 z5
3 x3 z3
If you don't want to use plyr:
t(data.frame(lapply(split(df, as.factor(df$x)), function(k) head(k$z[!is.na(k$z)], 1))))
[,1]
x1 "z2"
x2 "z5"
x3 "z3"
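Another base R possibility, sketched here as an alternative rather than taken from the answers above, is to drop the NA rows and keep the first remaining row per place with duplicated(). It assumes that "first" means the smallest y within each x; ok and first_z are made-up names.

```r
# Same data as in the question, built directly
df <- data.frame(
  x = rep(c("x1", "x2", "x3"), each = 4),
  y = rep(1:4, times = 3),
  z = c(NA, "z2", "z3", "z1",
        NA, NA, "z5", "z3",
        "z3", "z1", "z2", "z2"),
  stringsAsFactors = FALSE
)

ok <- df[!is.na(df$z), ]        # keep rows that have a name
ok <- ok[order(ok$x, ok$y), ]   # make sure times are ascending within place

# duplicated() flags every row after the first for each x
first_z <- ok[!duplicated(ok$x), c("x", "z")]
first_z
```

This returns a data frame with columns x and z, which matches the output format asked for in the question.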