In the "An Introduction to reshape2" package Sean C. Anderson presents the following example.
He uses the airquality data and renames the column names
names(airquality) <- tolower(names(airquality))
The data look like
# ozone solar.r wind temp month day
# 1 41 190 7.4 67 5 1
# 2 36 118 8.0 72 5 2
# 3 12 149 12.6 74 5 3
# 4 18 313 11.5 62 5 4
# 5 NA NA 14.3 56 5 5
# 6 28 NA 14.9 66 5 6
Then he melts them by
aql <- melt(airquality, id.vars = c("month", "day"))
to get
# month day variable value
# 1 5 1 ozone 41
# 2 5 2 ozone 36
# 3 5 3 ozone 12
# 4 5 4 ozone 18
# 5 5 5 ozone NA
# 6 5 6 ozone 28
Finally he gets the original one (different column order) by
aqw <- dcast(aql, month + day ~ variable)
My Quesiton
Assume now that we do not have ID variables (i.e. month and day) and have melted the data as follows
aql <- melt(airquality)
which look like
# variable value
# 1 ozone 41
# 2 ozone 36
# 3 ozone 12
# 4 ozone 18
# 5 ozone NA
# 6 ozone 28
My question is how can I get the original ones? The original ones would correspond to
# ozone solar.r wind temp
# 1 41 190 7.4 67
# 2 36 118 8.0 72
# 3 12 149 12.6 74
# 4 18 313 11.5 62
# 5 NA NA 14.3 56
# 6 28 NA 14.9 66
Another option is unstack
out <- unstack(aql,value~variable)
head(out)
# ozone solar.r wind temp month day
#1 41 190 7.4 67 5 1
#2 36 118 8.0 72 5 2
#3 12 149 12.6 74 5 3
#4 18 313 11.5 62 5 4
#5 NA NA 14.3 56 5 5
#6 28 NA 14.9 66 5 6
As the question is about dcast, we can create a sequence column and then use dcast
aql$indx <- with(aql, ave(seq_along(variable), variable, FUN=seq_along))
out1 <- dcast(aql, indx~variable, value.var='value')[,-1]
head(out1)
# ozone solar.r wind temp month day
#1 41 190 7.4 67 5 1
#2 36 118 8.0 72 5 2
#3 12 149 12.6 74 5 3
#4 18 313 11.5 62 5 4
#5 NA NA 14.3 56 5 5
#6 28 NA 14.9 66 5 6
If you are using data.table, the devel version of data.table ie. v1.9.5 also has dcast function. Instructions to install the devel version are here
library(data.table)#v1.9.5+
setDT(aql)[, indx:=1:.N, variable]
dcast(aql, indx~variable, value.var='value')[,-1]
One option using split,
out <- data.frame(sapply(split(aql, aql$variable), `[[`, 2))
Here, the data is split by the variable column, then the second column of each group is combined back into a data frame (the [[ function with the argument 2 is passed to sapply)
head(out)
# Ozone Solar.R Wind Temp Month Day
# 1 41 190 7.4 67 5 1
# 2 36 118 8.0 72 5 2
# 3 12 149 12.6 74 5 3
# 4 18 313 11.5 62 5 4
# 5 NA NA 14.3 56 5 5
# 6 28 NA 14.9 66 5 6
Related
I have two versions of datasets sharing the same columns (more or less). Let's take as an example
db = airquality
db1 = airquality[,-c(6)]
db1$Ozone[db1$Ozone < 30] <- 24
db1$Month[db1$Month == 5] <- 24
db
db1
If I would like to transfer two columns 'Ozone' and 'Wind' from the dataset 'db1' to the 'db' dataset by writing a code using the pipe operator %>% or another iterative method to achieve this result, which code you may possibly suggest?
Thanks
You csn do:
library(dplyr)
db1 %>%
select(Ozone, Wind) %>%
bind_cols(db)
Note that in this example, since some column names will be duplicated in the final result, dplyr will automatically rename the duplicates by appending numbers to the end of the column names.
Base R:
cbind(db, db1[,c(1,3)])
Ozone Solar.R Wind Temp Month Day Ozone Wind
1 41 190 7.4 67 5 1 41 7.4
2 36 118 8.0 72 5 2 36 8.0
3 12 149 12.6 74 5 3 24 12.6
4 18 313 11.5 62 5 4 24 11.5
5 NA NA 14.3 56 5 5 NA 14.3
6 28 NA 14.9 66 5 6 24 14.9
7 23 299 8.6 65 5 7 24 8.6
8 19 99 13.8 59 5 8 24 13.8
9 8 19 20.1 61 5 9 24 20.1
10 NA 194 8.6 69 5 10 NA 8.6
11 7 NA 6.9 74 5 11 24 6.9
12 16 256 9.7 69 5 12 24 9.7
.
.
.
airquality
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
7 23 299 8.6 65 5 7
8 19 99 13.8 59 5 8
9 8 19 20.1 61 5 9
Hi there,
How do I replace values in Ozone to be binary? If NA then 0 and if a value then 1.
Thanks
H
Assuming your dataframe is called airquality
airquality$Ozone <- ifelse(is.na(airquality$Ozone), 0, 1)
airquality$Ozone <- as.integer(!is.na(airquality$Ozone))
Alternatively
airquality$Ozone[!is.na(airquality$Ozone)] <- 1L
airquality$Ozone[is.na(airquality$Ozone)] <- 0L
I have two data frame. One with all the data A, and a smaller one B that contain an unique identifier of A and column names of A. I am trying to add a column on A base on what the B is pointed to. In another word, I need to get data from A pointed by B.
For example
A<-airquality
B<-data.frame(Month=unique(A$Month),col=c("Ozone","Solar.R", "Wind", "wind","Solar.R"))
This would give me the following
> head(A)
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
> B
Month col
1 5 Ozone
2 6 Solar.R
3 7 Wind
4 8 wind
5 9 Solar.R
The result should be something like
> head(A)
Ozone Solar.R Wind Temp Month Day ADDED
1 41 190 7.4 67 5 1 41
2 36 118 8.0 72 5 2 36
3 12 149 12.6 74 5 3 12
4 18 313 11.5 62 5 4 18
5 NA NA 14.3 56 5 5 NA
6 28 NA 14.9 66 5 6 28
> tail(A)
Ozone Solar.R Wind Temp Month Day ADDED
148 14 20 16.6 63 9 25 20
149 30 193 6.9 70 9 26 193
150 NA 145 13.2 77 9 27 145
151 14 191 14.3 75 9 28 191
152 18 131 8.0 76 9 29 131
153 20 223 11.5 68 9 30 223
The only way I can do it is
for(i in 1:nrow(B))
{
j<-A$Month==B$Month[i]
k<-subset(A, select=B$col[i])[j,]
A$ADDED[j]<-k
}
while this does work, it become extremely slow as I have a big dataset. I feel like I am doing it the dumb way. What is a good way of doing this?
Thank
You could do this with the sapply or lapply-functions.
ADDED <- sapply(1:nrow(B), function(i){
A[A$Month==B$Month[i], (B$col[i])]
})
A$ADDED <- unlist(ADDED)
For partial matching, you would have to filter the data, to get only rows where B has values and then assign the values. But befor we have to assign a value for all rows of the ADDED column; in this case NA.
A$ADDED = NA
A[A$Month %in% B$Month,]$ADDED <- unlist(ADDED)
That is already talking only a 1/3 of the time, compared to a for-loop.
appl <- function(){
ADDED <- sapply(1:nrow(B), function(i){
A[A$Month==B$Month[i], (B$col[i])]
})
A$ADDED1 <- unlist(ADDED)
}
lappl <- function(){
ADDED <- lapply(1:nrow(B), function(i){
A[A$Month==B$Month[i], (B$col[i])]
})
A$ADDED1 <- unlist(ADDED)
}
forlo <- function(){
for(i in 1:nrow(B)) {
j<-A$Month==B$Month[i]
k<-subset(A, select=B$col[i])[j,]
A$ADDED[j]<-k
}
}
library(microbenchmark)
mc <- microbenchmark(times = 1000,
sapply = appl(),
lapply = lappl(),
forloop = forlo()
)
mc
Unit: microseconds
expr min lq mean median uq max neval cld
sapply 337.478 359.2125 378.6964 369.7775 385.474 2324.913 1000 a
lapply 319.367 340.7990 366.8448 349.2510 362.532 9051.828 1000 a
forloop 964.136 1013.6415 1074.5584 1032.5070 1059.825 5116.802 1000 b
I have a nested list which contains set of data.frame objects in it, now I want them flatten out. I used most common approach like unlist method, it is not properly fatten out my list, the output was not well represented. How can I make this happen more efficiently? Does anyone knows any trick of doing this operation? Thanks.
example:
mylist <- list(pass=list(Alpha.df1_yes=airquality[2:4,], Alpha.df2_yes=airquality[3:6,],Alpha.df3_yes=airquality[2:5,],Alpha.df4_yes=airquality[7:9,]),
fail=list(Alpha.df1_no=airquality[5:7,], Alpha.df2_no=airquality[8:10,], Alpha.df3_no=airquality[13:16,],Alpha.df4_no=airquality[11:13,]))
I tried like this, it works but output was not properly arranged.
res <- lapply(mylist, unlist)
after flatten out, I would like to do merge them without duplication:
out <- lapply(res, rbind.data.frame)
my desired output:
mylist[[1]]$pass:
Ozone Solar.R Wind Temp Month Day
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
How can make this sort of flatten output more compatibly represented? Can anyone propose possible idea of doing this in R? Thanks a lot.
Using lapply and duplicated:
res <- lapply(mylist, function(i){
x <- do.call(rbind, i)
x[ !duplicated(x), ]
rownames(x) <- NULL
x
})
res$pass
# Ozone Solar.R Wind Temp Month Day
# 1 36 118 8.0 72 5 2
# 2 12 149 12.6 74 5 3
# 3 18 313 11.5 62 5 4
# 4 12 149 12.6 74 5 3
# 5 18 313 11.5 62 5 4
# 6 NA NA 14.3 56 5 5
# 7 28 NA 14.9 66 5 6
# 8 36 118 8.0 72 5 2
# 9 12 149 12.6 74 5 3
# 10 18 313 11.5 62 5 4
# 11 NA NA 14.3 56 5 5
# 12 23 299 8.6 65 5 7
# 13 19 99 13.8 59 5 8
# 14 8 19 20.1 61 5 9
Above still returns a list, if we want to keep all in one dataframe with no lists, then:
res <- do.call(rbind, unlist(mylist, recursive = FALSE))
res <- res[!duplicated(res), ]
res
# Ozone Solar.R Wind Temp Month Day
# pass.Alpha.df1_yes.2 36 118 8.0 72 5 2
# pass.Alpha.df1_yes.3 12 149 12.6 74 5 3
# pass.Alpha.df1_yes.4 18 313 11.5 62 5 4
# pass.Alpha.df2_yes.5 NA NA 14.3 56 5 5
# pass.Alpha.df2_yes.6 28 NA 14.9 66 5 6
# pass.Alpha.df4_yes.7 23 299 8.6 65 5 7
# pass.Alpha.df4_yes.8 19 99 13.8 59 5 8
# pass.Alpha.df4_yes.9 8 19 20.1 61 5 9
# fail.Alpha.df2_no.10 NA 194 8.6 69 5 10
# fail.Alpha.df3_no.13 11 290 9.2 66 5 13
# fail.Alpha.df3_no.14 14 274 10.9 68 5 14
# fail.Alpha.df3_no.15 18 65 13.2 58 5 15
# fail.Alpha.df3_no.16 14 334 11.5 64 5 16
# fail.Alpha.df4_no.11 7 NA 6.9 74 5 11
# fail.Alpha.df4_no.12 16 256 9.7 69 5 12
I have the following data.table:
Month Day Lat Long Temperature
1: 10 01 80.0 180 -6.383330333333309
2: 10 01 77.5 180 -6.193327999999976
3: 10 01 75.0 180 -6.263328333333312
4: 10 01 72.5 180 -5.759997333333306
5: 10 01 70.0 180 -4.838330999999976
---
117020: 12 31 32.5 310 11.840003833333355
117021: 12 31 30.0 310 13.065001833333357
117022: 12 31 27.5 310 14.685003333333356
117023: 12 31 25.0 310 15.946669666666690
117024: 12 31 22.5 310 16.578336333333358
For every location (given by Lat and Long), I have a temperature for each day from 1 October to 31 December.
There are 1,272 locations consisting of each pairwise combination of Lat:
Lat
1 80.0
2 77.5
3 75.0
4 72.5
5 70.0
--------
21 30.0
22 27.5
23 25.0
24 22.5
and Long:
Long
1 180.0
2 182.5
3 185.0
4 187.5
5 190.0
---------
49 300.0
50 302.5
51 305.0
52 307.5
53 310.0
I'm trying to create a data.table that consists of 1,272 rows (one per location) and 92 columns (one per day). Each element of that data.table will then contain the temperature at that location on that day.
Any advice about how to accomplish that goal without using a for loop?
Here we use ChickWeights as the data, where we use "Chick-Diet" as the equivalent of your "lat-lon", and "Time" as your "Date":
dcast.data.table(data.table(ChickWeight), Chick + Diet ~ Time)
Produces:
Chick Diet 0 2 4 6 8 10 12 14 16 18 20 21
1: 18 1 1 1 NA NA NA NA NA NA NA NA NA NA
2: 16 1 1 1 1 1 1 1 1 NA NA NA NA NA
3: 15 1 1 1 1 1 1 1 1 1 NA NA NA NA
4: 13 1 1 1 1 1 1 1 1 1 1 1 1 1
5: ... 46 rows omitted
You will likely need to lat + lon ~ Month + Day or some such for your formula.
In the future, please make your question reproducible as I did here by using a built-in data set.
First create a date value using the lubridate package (I assumed year = 2014, adjust as necessary):
library(lubridate)
df$datetext <- paste(df$Month,df$Day,"2014",sep="-")
df$date <- mdy(df$datetext)
Then one option is to use the tidyr package to spread the columns:
library(tidyr)
spread(df[,-c(1:2,6)],date,Temperature)
Lat Long 2014-10-01 2014-12-31
1 22.5 310 NA 16.57834
2 25.0 310 NA 15.94667
3 27.5 310 NA 14.68500
4 30.0 310 NA 13.06500
5 32.5 310 NA 11.84000
6 70.0 180 -4.838331 NA
7 72.5 180 -5.759997 NA
8 75.0 180 -6.263328 NA
9 77.5 180 -6.193328 NA
10 80.0 180 -6.383330 NA