which.min within reshape2's dcast()?

I would like to extract the value of var2 that corresponds to the minimum value of var1 in each building-month combination. Here's my (fake) data set:
head(mydata)
# building month var1 var2
#1 A 1 -26.96333 376.9633
#2 A 1 165.38759 317.3993
#3 A 1 47.46345 271.0137
#4 A 2 73.47784 294.8171
#5 A 2 107.80130 371.7668
#6 A 2 10.16384 308.7975
Reproducible code:
## create a fake data set:
set.seed(142)
mydata1 = data.frame(building = rep(LETTERS[1:5], 6), month = sort(rep(1:6, 5)),
                     var1 = rnorm(30, 50, 35), var2 = runif(30, 200, 400))
mydata2 = data.frame(building = rep(LETTERS[1:5], 6), month = sort(rep(1:6, 5)),
                     var1 = rnorm(30, 60, 35), var2 = runif(30, 150, 400))
mydata3 = data.frame(building = rep(LETTERS[1:5], 6), month = sort(rep(1:6, 5)),
                     var1 = rnorm(30, 40, 35), var2 = runif(30, 250, 400))
mydata = rbind(mydata1, mydata2, mydata3)
mydata = mydata[order(mydata[, "building"], mydata[, "month"]), ]
row.names(mydata) = 1:nrow(mydata)
## here is how I pull the minimum value of var1 for each building-month combination:
require(reshape2)
m1 = melt(mydata, id.var = 1:2)
d1 = dcast(m1, building ~ month, function(x) min(max(x, 0), na.rm = TRUE),
           subset = .(variable == "var1"))
This pulls out the minimum value of var1 for each building-month combo...
head(d1)
# building 1 2 3 4 5 6
#1 A 165.38759 107.80130 93.32816 73.23279 98.55546 107.58780
#2 B 92.08704 98.94959 57.79610 94.10530 80.86883 99.75983
#3 C 93.38284 100.13564 52.26178 62.37837 91.98839 97.44797
#4 D 82.43440 72.43868 66.83636 105.46263 133.02281 94.56457
#5 E 70.09756 61.44406 30.78444 68.24334 94.35605 61.60610
However, what I want is a data frame set up exactly like d1 that instead shows the value of var2 corresponding to the minimum value pulled for var1 (shown in d1 above). My gut tells me it should be a variation on which.min(), but I haven't gotten this to work with dcast() or ddply(). Any help is appreciated!

It may be possible in one step, but I'm more familiar with plyr than reshape2:
dcast(ddply(mydata, .(building, month), summarize, value = var2[which.min(var1)]),
      building ~ month)
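For readers on current tooling, the same "value of var2 at the minimum of var1" idea translates to dplyr plus tidyr. This is a sketch on a small made-up data set (not the OP's mydata) so the result is easy to check by eye:

```r
library(dplyr)
library(tidyr)

# Tiny deterministic stand-in for mydata
toy <- data.frame(
  building = c("A", "A", "B", "B"),
  month    = c(1, 1, 1, 1),
  var1     = c(5, 2, 7, 3),
  var2     = c(10, 20, 30, 40)
)

wide <- toy %>%
  group_by(building, month) %>%
  summarise(value = var2[which.min(var1)], .groups = "drop") %>%
  pivot_wider(names_from = month, values_from = value)
# building A: min var1 is 2, so value is 20; building B: min var1 is 3, so value is 40
```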

Related

R Data.Table Random Sample Groups

DATA = data.table(STUDENT = c(1,1,2,2,2,2,2,3,3,3,3,3,4),
                  SCORE = c(5,6,8,3,14,5,6,9,0,12,13,14,19))
WANT = data.table(STUDENT = c(1,1,4),
                  SCORE = c(5,6,19))
I have DATA and wish to create WANT, which takes a random sample of 2 STUDENT values and includes all of their rows. I present WANT as an example.
I tried this with no success:
WANT = DATA[, .SD[sample(x = .N, size = 2)], by = STUDENT]
Sample the unique values of STUDENT, then filter all the rows for those students:
library(data.table)
set.seed(1357)
DATA[STUDENT %in% sample(unique(STUDENT), 2)]
# STUDENT SCORE
#1: 1 5
#2: 1 6
#3: 4 19
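The same sample-then-filter idea carries over outside data.table as well; a sketch in dplyr terms (the intermediate name picked is my own, not from the answer):

```r
library(dplyr)

DATA <- data.frame(STUDENT = c(1,1,2,2,2,2,2,3,3,3,3,3,4),
                   SCORE   = c(5,6,8,3,14,5,6,9,0,12,13,14,19))

set.seed(1357)
picked <- sample(unique(DATA$STUDENT), 2)     # two distinct students
WANT <- DATA %>% filter(STUDENT %in% picked)  # keep every row for those students
```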

Merge a dataframe by creating subsets based on time period and a unique ID number

I am looking to create a dataframe that lists a unique ID with the movement of n different amounts across a period of m timesteps. I currently generate a subset for each timestep and then merge all these subsets onto a separate dataframe that contains just the unique IDs. See below:
set.seed(129)
df1 <- data.frame(
  id      = rep(1:7, 3),
  time    = rep(1:3, each = 7),
  amount1 = runif(21, 0, 50),
  amount2 = runif(21, -20, 600),
  amount3 = runif(21, -15, 200),
  amount4 = runif(21, -3, 300)
)
df2 <- data.frame(
  id = unique(df1$id)
)
sub_1 <- subset(df1, time == 1)
sub_2 <- subset(df1, time == 2)
sub_3 <- subset(df1, time == 3)
df2<-merge(df2,sub_1,by.x = "id",by.y = "id", all=TRUE)
df2<-merge(df2,sub_2,by.x = "id",by.y = "id", all=TRUE)
df2<-merge(df2,sub_3,by.x = "id",by.y = "id", all=TRUE)
#df2
id time.x amount1.x amount2.x amount3.x amount4.x time.y amount1.y amount2.y amount3.y amount4.y time amount1 amount2 amount3 amount4
1 1 1 6.558261 -17.713007 46.477430 195.061597 2 18.5453843 269.7406808 132.588713 80.40133 3 24.943217 488.1025 103.473479 198.51302
2 2 1 15.736044 230.018563 72.604346 -2.513162 2 48.8537058 356.5593748 161.239261 246.25985 3 35.559262 406.4749 66.278064 30.11592
3 3 1 8.057720 386.814867 101.997370 152.269564 2 0.7334493 0.7842648 66.603965 156.12478 3 42.170220 450.0306 195.872986 109.73098
4 4 1 15.575282 527.033563 37.403278 197.529341 2 37.8372445 370.0410836 6.074847 273.46715 3 20.302206 290.0026 -2.101649 112.88488
5 5 1 4.230635 427.294382 112.771237 199.401096 2 15.3735066 376.8945806 104.382371 224.09730 3 8.050933 291.6123 53.660734 270.37200
6 6 1 29.087870 9.330858 129.400932 70.801129 2 38.9966662 421.9258798 -3.891286 290.59259 3 17.919554 581.1735 137.100314 129.78561
7 7 1 4.380303 463.658580 4.120219 56.527016 2 6.0582455 484.4981686 67.820164 72.05615 3 43.556746 170.0745 41.134708 247.99512
I have a major issue with this: as the values of m and n increase, this method becomes long and ugly. Is there a cleaner way to do it? Ideally a one-liner, so I don't have to make, say, 15 subsets when m = 15.
Thanks
You just need your original df1 dataset:
library(tidyverse)
df1 %>%
  group_split(time) %>%           # create your subsets and store them as a list of data frames
  reduce(left_join, by = "id")    # sequentially join those subsets
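An alternative that avoids repeated joins entirely is to pivot the long table wide. A sketch with a single amount column and made-up numbers (pivot_wider also accepts several values_from columns at once, which would cover amount1 through amount4):

```r
library(tidyr)

toy <- data.frame(
  id      = rep(1:2, times = 3),
  time    = rep(1:3, each = 2),
  amount1 = 1:6
)

wide <- pivot_wider(toy, id_cols = id,
                    names_from   = time,
                    values_from  = amount1,
                    names_prefix = "amount1_t")
# one row per id, one amount1_t<k> column per timestep
```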

Creating new columns with mutate

I can work out a solution to my problem, but only in a very suboptimal way, so it is not suited to a large df. Let me explain.
I have a big dataframe and I need to create new columns by subtracting two other ones. Let me show you with a simple df.
library(dplyr)
# note: E and F shadow built-ins (F is FALSE); fine here, but risky in real code
A <- rnorm(10)
B <- rnorm(10)
C <- rnorm(10)
D <- rnorm(10)
E <- rnorm(10)
F <- rnorm(10)
df1 <- tibble(A, B, C, D, E, F)
# A tibble: 10 x 6
A B C D E F
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 -2.8750025 0.4685855 2.4435767 1.6999761 -1.3848386 -0.58992249
2 0.2551404 1.8555876 0.8365116 -1.6151186 -1.7754623 0.04423463
3 0.7740396 -1.0756147 0.6830024 -2.3879337 -1.3165875 -1.36646493
4 0.2059932 0.9322016 1.2483196 -0.1787840 0.3546773 -0.12874831
5 -0.4561725 -0.1464692 -0.7112905 0.2791592 0.5835127 0.16493237
6 1.2401795 -1.1422917 -0.6189480 -1.4975416 0.5653565 -1.32575021
7 -1.6173618 0.2283430 0.6154920 0.6082847 0.0273447 0.16771783
8 0.3340799 -0.5096500 -0.5270123 -0.2814217 -2.3732234 0.27972188
9 -0.4841361 0.1651265 0.0296500 0.4324903 -0.3895971 -2.90426195
10 -2.7106357 0.5496335 0.3081533 -0.3083264 -0.1341055 -0.17927807
I need (i) to subtract pairs of columns that sit the same distance apart: D-A, E-B, F-C, while (ii) giving each new column a name based on the names of the initial variables.
I did it this way, and it works:
df2 <- df1 %>%
  transmute(!!paste0("diff", "D", "A") := D - A,
            !!paste0("diff", "E", "B") := E - B,
            !!paste0("diff", "F", "C") := F - C)
# A tibble: 10 x 3
diffDA diffEB diffFC
<dbl> <dbl> <dbl>
1 4.5749785 -1.8534241 -3.0334991
2 -1.8702591 -3.6310500 -0.7922769
3 -3.1619734 -0.2409728 -2.0494674
4 -0.3847772 -0.5775242 -1.3770679
5 0.7353317 0.7299819 0.8762229
6 -2.7377211 1.7076482 -0.7068022
7 2.2256465 -0.2009983 -0.4477741
8 -0.6155016 -1.8635734 0.8067342
9 0.9166264 -0.5547236 -2.9339120
10 2.4023093 -0.6837390 -0.4874314
However, I have many columns and I would like a simpler way to write this. I tried many things (like mutate_all, mutate_at or add_columns) but nothing worked.
OK, here's a method that will work for the full width of your data set.
library(dplyr)
df1 <- tibble(A = rnorm(10), B = rnorm(10), C = rnorm(10),
              D = rnorm(10), E = rnorm(10), F = rnorm(10),
              G = rnorm(10), H = rnorm(10), I = rnorm(10))
ct <- 1:(ncol(df1) - 3)   # stop 3 short of the last column so df1[[i + 3]] exists
diff_tbl <- tibble(testcol = rnorm(10))
for (i in ct) {
  new_tbl <- tibble(col = df1[[i + 3]] - df1[[i]])
  names(new_tbl)[1] <- paste0('diff', colnames(df1)[i + 3], colnames(df1)[i])
  diff_tbl <- bind_cols(diff_tbl, new_tbl)
}
diff_tbl <- diff_tbl %>%
  select(-testcol)
df1 <- bind_cols(df1, diff_tbl)
Basically, what you are doing is creating a second dummy tibble to hold the differences, iterating over the possible differences (i.e. pairs of columns three apart), assembling them into a single tibble, and then binding those columns to the original tibble. As you can see, I extended df1 by three extra columns and the whole thing still worked.
It's probable that there's a more elegant way to do this, but this method definitely works. One slightly awkward thing is that I had to create diff_tbl with a dummy column and then remove it before the final bind_cols() call, but that's not a major issue, I think.
You could divide the data frame in two parts and do
inds <- ncol(df1) / 2
df1[paste0("diff", names(df1)[(inds + 1):ncol(df1)], names(df1)[1:inds])] <-
  df1[(inds + 1):ncol(df1)] - df1[1:inds]
In base R (note that column names with dashes in them are non-syntactic and not recommended):
result = df1[4:6] - df1[1:3]
names(result) = paste(names(df1)[4:6], names(df1)[1:3], sep = "-")
result
# D-A E-B F-C
# 1 0.12459065 0.05855622 0.6134559
# 2 -2.65583389 0.26425762 0.8344115
# 3 -1.48761765 -3.13999402 1.3008065
# 4 -4.37469763 1.37551178 1.3405191
# 5 1.01657135 -0.90690359 1.5848562
# 6 -0.34050959 -0.57687686 -0.3794937
# 7 0.85233808 0.57911293 -0.8896393
# 8 0.01931559 0.91385740 3.2685647
# 9 -0.62012982 -2.34166712 -0.4001903
# 10 -2.21764146 0.05927664 0.3965072
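The halves-based pairing can also be written functionally; a sketch with purrr::map2_dfc over the two column halves, on deterministic toy data (the helper names lhs and rhs are my own):

```r
library(purrr)
library(tibble)

toy  <- tibble(A = 1:3, B = 4:6, C = 7:9, D = 10:12)
half <- ncol(toy) / 2
lhs  <- names(toy)[(half + 1):ncol(toy)]  # "C", "D"
rhs  <- names(toy)[1:half]                # "A", "B"

# one single-column tibble per pair, bound together column-wise
result <- map2_dfc(lhs, rhs, function(x, y) {
  tibble(!!paste0("diff", x, y) := toy[[x]] - toy[[y]])
})
# diffCA = C - A, diffDB = D - B
```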

dplyr mutate on column subset (one function on all these columns combined)

I have a dataframe with some info columns and some measurement columns. For the measurements, I want to calculate the Mahalanobis distance, but I can't find a clean dplyr approach. I would like something like:
library(dplyr)
library(anomalyDetection)
test <- data.frame(id = LETTERS[1:10],
                   A = rnorm(10, 0, 2),
                   B = rnorm(10, 5, 3))
test <- test %>%
  mutate(MD = mahalanobis_distance(. %>% dplyr::select(one_of(c("A", "B")))))
I know that the following works:
test <- test %>%
  mutate(MD = mahalanobis_distance(test %>% dplyr::select(one_of(c("A", "B")))))
but that breaks down if any other step precedes the mutate() call:
test <- test %>%
  mutate(group = id %in% LETTERS[1:5]) %>%
  group_by(group) %>%
  mutate(MD = mahalanobis_distance(test %>% dplyr::select(one_of(c("A", "B")))))
We can split the data on the logical vector, then use map_df to create the 'MD' column by applying mahalanobis_distance to each piece of the split dataset:
library(purrr)
library(dplyr)
library(anomalyDetection)
test %>%
  split(.$id %in% LETTERS[1:5]) %>%
  map_df(~ mutate(., MD = mahalanobis_distance(.[-1])))
# id A B MD
#1 F -0.7829759 4.22808758 2.9007659
#2 G 2.4246532 5.96043439 1.3520245
#3 H -4.8649537 4.95510794 3.0842137
#4 I 1.2221836 5.36154775 0.2921482
#5 J 0.6995204 5.63616864 0.3708477
#6 A 1.2374543 5.17288708 1.4382259
#7 B -2.7815555 0.06437452 2.1244313
#8 C -2.2160242 2.74747556 0.5088291
#9 D 0.8561507 2.70631852 1.5174367
#10 E -1.6427978 6.23758354 2.4110771
NOTE: There was no seed set while creating the dataset in the OP's post
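If the anomalyDetection package isn't available, the classical squared Mahalanobis distance is also in base R as stats::mahalanobis, and dplyr's group_modify keeps the per-group logic inside one pipe. A sketch on deterministic toy data (my own, not the OP's random test):

```r
library(dplyr)

toy <- data.frame(id = LETTERS[1:10],
                  A  = 1:10,
                  B  = c(2, 1, 4, 3, 6, 5, 8, 7, 10, 9))

res <- toy %>%
  mutate(group = id %in% LETTERS[1:5]) %>%
  group_by(group) %>%
  group_modify(~ {
    # .x is the group's rows without the grouping column
    m <- as.matrix(.x[c("A", "B")])
    mutate(.x, MD = mahalanobis(m, colMeans(m), cov(m)))
  }) %>%
  ungroup()
# within each group of 5, the squared distances sum to (n - 1) * p = 4 * 2
```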

Calling a data.frame from global.env and adding a column with the data.frame name

I have a dataset consisting of pairs of data.frames (which are almost exact pairs, but not similar enough to merge directly) that I need to munge together. Luckily, each df has an identifier for the date it was created, which can be used to match the pair. E.g.
df_0101 <- data.frame(a = rnorm(1:10),
                      b = runif(1:10))
df_0102 <- data.frame(a = rnorm(5:20),
                      b = runif(5:20))
df2_0101 <- data.frame(a2 = rnorm(1:10),
                       b2 = runif(1:10))
df2_0102 <- data.frame(a2 = rnorm(5:20),
                       b2 = runif(5:20))
Therefore, the first thing I need to do is mutate a new column onto each data.frame holding this date (01_01, 01_02, etc.), i.e.
df_0101 <- df_0101 %>%
  mutate(df_name = "df_0101")
but obviously in a programmatic manner.
I can call every data.frame in the global environment using
l_df <- Filter(function(x) is(x, "data.frame"), mget(ls()))
head(l_df)
$df_0101
a b
1 0.7588803 0.17837296
2 -0.2592187 0.45445752
3 1.2221744 0.01553190
4 1.1534353 0.72097071
5 0.7279514 0.96770448
$df_0102
a b
1 -0.33415584 0.53597308
2 0.31730849 0.32995013
3 -0.18936533 0.41024220
4 0.49441962 0.22123885
5 -0.28985964 0.62388478
$df2_0101
a2 b2
1 -0.5600229 0.6283224
2 0.5944657 0.7384586
3 1.1284180 0.4656239
4 -0.4737340 0.1555984
5 -0.3838161 0.3373913
$df2_0102
a2 b2
1 -0.67987149 0.65352466
2 1.46878953 0.47135011
3 0.10902751 0.04460594
4 -1.82677732 0.38636357
5 1.06021443 0.92935144
but I have no idea how to then pull the name of each df down into a new column on each. Any ideas?
Thanks for reading.
We can use Map in base R:
Map(cbind, names = names(l_df), l_df)
If we are going the tidyverse way, then:
library(tidyverse)
map2(names(l_df), l_df, ~ cbind(names = .x, .y))
Also, the list can be combined into a single dataset with bind_rows:
bind_rows(l_df, .id = "names")
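purrr::imap passes each list element together with its name, which reads slightly more directly than pairing names(l_df) with l_df by hand; a sketch on two tiny made-up frames:

```r
library(purrr)

l_df <- list(df_0101 = data.frame(a = 1:2),
             df_0102 = data.frame(a = 3:4))

# .x is the data.frame, .y is its name in the list
tagged <- imap(l_df, ~ cbind(.x, df_name = .y))
# each data.frame now carries its own name in a df_name column
```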
