I have a data frame with some info columns and some measurement columns. For the measurements, I want to calculate the Mahalanobis distance, but I can't arrive at a clean dplyr approach. I would like to have something like:
library(anomalyDetection)
test <- data.frame(id = LETTERS[1:10],
                   A = rnorm(10, 0, 2),
                   B = rnorm(10, 5, 3))
test <- test %>%
  mutate(MD = mahalanobis_distance(. %>% dplyr::select(one_of(c("A", "B")))))
I know that the following works:
test <- test %>%
  mutate(MD = mahalanobis_distance(test %>% dplyr::select(one_of(c("A", "B")))))
but that breaks down if other steps precede the mutate call:
test <- test %>%
  mutate(group = id %in% c(LETTERS[1:5])) %>%
  group_by(group) %>%
  mutate(MD = mahalanobis_distance(test %>% dplyr::select(one_of(c("A", "B")))))
We can split the data based on the logical vector, then use map_df to create the 'MD' column by applying mahalanobis_distance to each split dataset.
library(purrr)
library(dplyr)
library(anomalyDetection)
test %>%
  split(.$id %in% LETTERS[1:5]) %>%
  map_df(~ mutate(., MD = mahalanobis_distance(.[-1])))
# id A B MD
#1 F -0.7829759 4.22808758 2.9007659
#2 G 2.4246532 5.96043439 1.3520245
#3 H -4.8649537 4.95510794 3.0842137
#4 I 1.2221836 5.36154775 0.2921482
#5 J 0.6995204 5.63616864 0.3708477
#6 A 1.2374543 5.17288708 1.4382259
#7 B -2.7815555 0.06437452 2.1244313
#8 C -2.2160242 2.74747556 0.5088291
#9 D 0.8561507 2.70631852 1.5174367
#10 E -1.6427978 6.23758354 2.4110771
NOTE: No seed was set when creating the dataset in the OP's post, so the values above will not reproduce exactly.
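For what it's worth, newer dplyr removes the need for the split entirely: inside a grouped mutate(), pick() hands the current group's columns to the function directly. A sketch, assuming dplyr >= 1.1.0 for pick() (on dplyr 1.0.x, cur_data() plays the same role):
library(dplyr)
library(anomalyDetection)
test %>%
  mutate(group = id %in% LETTERS[1:5]) %>%
  group_by(group) %>%
  # pick(A, B) returns the current group's A and B columns as a data frame
  mutate(MD = mahalanobis_distance(pick(A, B))) %>%
  ungroup()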
This seems like a simple operation, but I cannot find a good way to do it in R. I have a column, P, that has many rows with multiple inputs:
P:
[340000, 410000]
[450000, 450000]
530000
110000
[330000, 440000]
510000
440000
620000
320000
Desired P1 (the * values should be randomly selected; each value is listed on its own line):
340000*
450000*
530000
110000
440000*
510000
440000
620000
320000
I want to build a new column, P1, that randomly selects one value from every row vector starting with "[" in column P and carries over the other, single-valued rows unchanged. This is part of a larger effort to clean the column so it is usable for regression.
Right now, I've come up with this tidyverse code as the best option for mutating:
foo <- data.frame(P=="[")
foo %>%
  rowwise() %>%
  mutate(P1 = sample(P, 1))
But this isn't returning the output I need. Aside from sample(), I'm not sure what else can be used for random selection from a [] vector. What would be the best way to go about this? Appreciate the help.
You can remove the [] from the column values, split the data on commas to get each value in a different row, and then select one random value for each row.
library(dplyr)
df %>%
  mutate(P1 = gsub('\\[|\\]', '', P),
         row = row_number()) %>%
  tidyr::separate_rows(P1, sep = ',\\s*') %>%
  group_by(row) %>%
  slice_sample(n = 1) %>%  # in older versions of dplyr use sample_n(1)
  ungroup() %>%
  select(-row)
# P P1
# <chr> <chr>
#1 [340000, 410000] 340000
#2 [450000, 450000] 450000
#3 530000 530000
#4 110000 110000
#5 [330000, 440000] 440000
#6 510000 510000
#7 440000 440000
#8 620000 620000
#9 320000 320000
In base R, you can implement the same logic with:
df$P1 <- sapply(strsplit(gsub('\\[|\\]', '', df$P), ',\\s*'), sample, 1)
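One caveat worth knowing about sample(): when its first argument is a single number, it draws from 1:x instead of returning x. That does not bite here, because the split values are character strings:
sample(10, 1)        # draws from 1:10, not just 10
sample("530000", 1)  # character input: always returns "530000"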
data
df <- structure(list(P = c("[340000, 410000]", "[450000, 450000]",
"530000", "110000", "[330000, 440000]", "510000", "440000", "620000",
"320000")), class = "data.frame", row.names = c(NA, -9L))
I have a table describing a train track, with each row being a segment of the track with a from and a to station, as well as a trackID and a segment ID. The station names are completely random, not as structured as they appear here.
tracks <- data.frame(
  trackID = c(rep("A", 4), rep("B", 4)),
  segment = letters[1:8],
  from = paste0("station_1", 1:8),
  to = paste0("station_2", 1:8)
)
tracks
trackID segment from to
1 A a station_11 station_21
2 A b station_12 station_22
3 A c station_13 station_23
4 A d station_14 station_24
5 B e station_15 station_25
6 B f station_16 station_26
7 B g station_17 station_27
8 B h station_18 station_28
I have another table with sightings made on this train, and I would like to know the corresponding trackID for each sighting. The table looks like this:
sightings <- data.frame(from = c("station_24", "station_28", "station_14"),
                        to = c("station_14", "station_16", "station_25"))
sightings
from to
1 station_24 station_14
2 station_28 station_16
3 station_14 station_25
I could gather the trackID from the to and from information provided in the sightings table. BUT, from and to in the sightings table do not correspond with the from and to in the tracks table: from and to can be in different segments and can be interchanged (to-from). In some problematic cases, from and to are on different trackIDs, which would then return no match. The desired output from this example would be:
from to trackID
1 station_24 station_14 A
2 station_28 station_16 B
3 station_14 station_25 <NA> # no match since station_14 and 25 are from two different trackIDs
In my mind, the solution involves collapsing the tracks table by trackID and then doing a double partial matching of strings (using grepl()?). The next lines take care of the collapsing, but I have no clue where to go from here. Can someone point me in the right direction?
Solutions with R / dplyr very much preferred, but I would take anything!
library(dplyr)
tracks %>%
  group_by(trackID) %>%
  summarise(
    from_to = paste(paste(from, collapse = ","), paste(to, collapse = ","), sep = ",")
  )
tracks
trackID from_to
<fct> <chr>
1 A station_11,station_12,station_13,station_14,station_21,station_22,station_23,station_24
2 B station_15,station_16,station_17,station_18,station_25,station_26,station_27,station_28
EDIT: It seems that I've oversimplified my problem in my minimal example. The main issue is that stations (from and to) are not unique in the table, and not even unique to a trackID. Only a combination of to and from is unique to a trackID. I've accepted the answer as it solves the problem as stated, but I will also provide my own solution that I've come up with in the meantime.
A double-join can work.
(Notes: you don't appear to be using segment, so I'm discarding it here, but this might be adapted if needed. Also, I added stringsAsFactors=FALSE to your data, since otherwise combining vectors of factors can be problematic.)
library(dplyr)
tracksmod <- bind_rows(
  select(tracks, trackID, sta = from),
  select(tracks, trackID, sta = to)
)
head(tracksmod)
# trackID sta
# 1 A station_11
# 2 A station_12
# 3 A station_13
# 4 A station_14
# 5 B station_15
# 6 B station_16
sightings %>%
  left_join(select(tracksmod, trackID, from = sta), by = "from") %>%
  left_join(select(tracksmod, trackID2 = trackID, to = sta), by = "to") %>%
  mutate(trackID = if_else(trackID == trackID2, trackID, NA_character_)) %>%
  select(-trackID2)
# from to trackID
# 1 station_24 station_14 A
# 2 station_28 station_16 B
# 3 station_14 station_25 <NA>
I did not assume that directionality was important; that is, I'm not assuming that a station listed in from must always be in the from column. This is why I converted tracks to tracksmod: to identify a station's trackID regardless of direction.
As I've stated in the EDIT of my question, I oversimplified my problem in the minimal example. Here is an updated version of the data that resembles my actual data more accurately. I've also added stringsAsFactors = FALSE, as commented by @r2evans.
tracks <- data.frame(
  trackID = c(rep("A", 4), rep("B", 4)),
  segment = letters[1:8],
  from = paste0("station_1", c(1:4, 1, 2, 5, 6)),
  to = paste0("station_2", 1:8),
  stringsAsFactors = FALSE
)
sightings <- data.frame(
  from = c("station_24", "station_28", "station_14"),
  to = c("station_14", "station_11", "station_25"),
  trackID = c("A", "B", NA),
  stringsAsFactors = FALSE
)
I've solved the problem by collapsing the tracks table on the basis of trackID and then using the purrr package to use the loop functions in a nested manner.
library(dplyr)
# Collapsing the tracks-dataframe
tracks_collapse <- tracks %>%
  group_by(trackID) %>%
  summarise(
    from_to = paste(paste(from, collapse = ","), paste(to, collapse = ","), sep = ",")
    # from = list(from),
    # to = list(to),
    # stas = list(c(from, to))
  )
# a helper function to remove NAs when looking for matches
remove_na <- function(x){x[!is.na(x)]}
library(purrr)
pmap_dfr(sightings, function(from, to, trackID) { # pmap_dfr runs over a data.frame and returns a data.frame
  data.frame(
    from = from, # recreates the sightings data.frame
    to = to,     # ditto
    trackID = paste( # collapses the resulting vector
      remove_na( # removes the NA values
        pmap_chr( # matches every row of the sightings data.frame against the tracks data.frame
          tracks_collapse,
          function(trackID, from_to) {
            # partial string matching; returns the trackID if both strings match
            ifelse(grepl(from, from_to) & grepl(to, from_to), trackID, NA)
          }
        )
      ), collapse = ","
    )
  )
})
Output:
from to trackID
1 station_24 station_14 A
2 station_28 station_11 B
3 station_14 station_25 <NA>
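For reference, here is a join-based sketch of the same logic (assuming the updated tracks and sightings data above): build a long station-to-trackID lookup, keep the trackIDs that contain both stations of a sighting, and re-attach the sightings with no match as NA.
library(dplyr)
# every station paired with the trackID(s) it appears on, in either direction
station_track <- bind_rows(
  select(tracks, trackID, sta = from),
  select(tracks, trackID, sta = to)
) %>%
  distinct()
sightings %>%
  select(from, to) %>%
  mutate(sighting = row_number()) %>%
  # candidate tracks containing the `from` station ...
  left_join(station_track, by = c("from" = "sta")) %>%
  # ... restricted to those that also contain the `to` station
  semi_join(station_track, by = c("to" = "sta", "trackID")) %>%
  distinct(sighting, trackID) %>%
  # re-attach sightings with no match so they come back as NA
  right_join(sightings %>% select(from, to) %>% mutate(sighting = row_number()),
             by = "sighting") %>%
  arrange(sighting) %>%
  select(from, to, trackID)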
I have a dataset consisting of pairs of data.frames (which are almost exact pairs, but not enough to merge directly) which I need to munge together. Luckily, each df has an identifier for the date it was created which can be used to reference the pair. E.g.
df_0101 <- data.frame(a = rnorm(1:10),
b = runif(1:10))
df_0102 <- data.frame(a = rnorm(5:20),
b = runif(5:20))
df2_0101 <- data.frame(a2 = rnorm(1:10),
b2 = runif(1:10))
df2_0102 <- data.frame(a2 = rnorm(5:20),
b2 = runif(5:20))
Therefore, the first thing I need to do is mutate a new column on each data.frame containing this date (01_01 / 01_02 / etc.), i.e.
df_0101 <- df_0101 %>%
  mutate(df_name = "df_0101")
but obviously in a programmatic manner.
I can call every data.frame in the global environment using
l_df <- Filter(function(x) is(x, "data.frame"), mget(ls()))
head(l_df)
$df_0101
a b
1 0.7588803 0.17837296
2 -0.2592187 0.45445752
3 1.2221744 0.01553190
4 1.1534353 0.72097071
5 0.7279514 0.96770448
$df_0102
a b
1 -0.33415584 0.53597308
2 0.31730849 0.32995013
3 -0.18936533 0.41024220
4 0.49441962 0.22123885
5 -0.28985964 0.62388478
$df2_0101
a2 b2
1 -0.5600229 0.6283224
2 0.5944657 0.7384586
3 1.1284180 0.4656239
4 -0.4737340 0.1555984
5 -0.3838161 0.3373913
$df2_0102
a2 b2
1 -0.67987149 0.65352466
2 1.46878953 0.47135011
3 0.10902751 0.04460594
4 -1.82677732 0.38636357
5 1.06021443 0.92935144
but I have no idea how to then pull the names of each df down into a new column on each. Any ideas?
Thanks for reading,
We can use Map in base R
Map(cbind, names = names(l_df), l_df)
If we are going by the tidyverse way, then
library(tidyverse)
map2(names(l_df), l_df, ~(cbind(names = .x, .y)))
Also, this can be combined into a single dataset with bind_rows:
bind_rows(l_df, .id = "names")
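As a sketch of yet another option, purrr::imap() iterates over a list and its names in one call, which produces the df_name column from the question directly:
library(purrr)
library(dplyr)
# add each data frame's own name as a df_name column
l_df_named <- imap(l_df, ~ mutate(.x, df_name = .y))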
Here's my problem:
I am using a function that returns a named vector. Here's a toy example:
toy_fn <- function(x) {
  y <- c(mean(x), sum(x), median(x), sd(x))
  names(y) <- c("Right", "Wrong", "Unanswered", "Invalid")
  y
}
I am using group_by in dplyr to apply this function for each group (typical split-apply-combine). So, here's my toy data.frame:
set.seed(1234567)
toy_df <- data.frame(id = 1:1000,
                     group = sample(letters, 1000, replace = TRUE),
                     value = runif(1000))
And here's the result I am aiming for:
toy_summary <- toy_df %>%
  group_by(group) %>%
  summarize(Right = toy_fn(value)["Right"],
            Wrong = toy_fn(value)["Wrong"],
            Unanswered = toy_fn(value)["Unanswered"],
            Invalid = toy_fn(value)["Invalid"])
> toy_summary
Source: local data frame [26 x 5]
group Right Wrong Unanswered Invalid
1 a 0.5038394 20.15358 0.5905526 0.2846468
2 b 0.5048040 15.64892 0.5163702 0.2994544
3 c 0.5029442 21.62660 0.5072733 0.2465612
4 d 0.5124601 14.86134 0.5382463 0.2681955
5 e 0.4649483 17.66804 0.4426197 0.3075080
6 f 0.5622644 12.36982 0.6330269 0.2850609
7 g 0.4675324 14.96104 0.4692404 0.2746589
It works! But it is just not cool to call the same function four times. I would rather have dplyr take the named vector and create a new variable for each element in the vector. Something like this:
toy_summary <- toy_df %>%
  group_by(group) %>%
  summarize(toy_fn(value))
This, unfortunately, does not work because "Error: expecting a single value".
I thought, OK, let's just convert the vector to a data.frame using data.frame(as.list(x)). But that does not work either. I tried many things, but I couldn't trick dplyr into thinking it's actually receiving one single value (observation) for 4 different variables. Is there any way to help dplyr realize that?
One possible solution is to use dplyr's standard-evaluation (SE) capabilities. For example, set up your functions as follows:
dots <- setNames(list(~ mean(value),
                      ~ sum(value),
                      ~ median(value),
                      ~ sd(value)),
                 c("Right", "Wrong", "Unanswered", "Invalid"))
Then you can use summarize_ (with a _) as follows:
toy_df %>%
  group_by(group) %>%
  summarize_(.dots = dots)
# Source: local data table [26 x 5]
#
# group Right Wrong Unanswered Invalid
# 1 o 0.4490776 17.51403 0.4012057 0.2749956
# 2 s 0.5079569 15.23871 0.4663852 0.2555774
# 3 x 0.4620649 14.78608 0.4475117 0.2894502
# 4 a 0.5038394 20.15358 0.5905526 0.2846468
# 5 t 0.5041168 24.19761 0.5330790 0.3171022
# 6 m 0.4806628 21.14917 0.4805273 0.2825026
# 7 c 0.5029442 21.62660 0.5072733 0.2465612
# 8 w 0.4932484 17.75694 0.4891746 0.3309680
# 9 q 0.5350707 22.47297 0.5608505 0.2749941
# 10 g 0.4675324 14.96104 0.4692404 0.2746589
# .. ... ... ... ... ...
Though it looks nice, there is a big catch here: you have to know the column you are going to operate on a priori (value) when setting up the function, so it won't work on some other column name unless you set up dots accordingly.
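The standard-evaluation interface shown above has since been superseded. As a sketch of the modern idiom (assuming rlang >= 0.4.0, which provides {{ }}), the hypothetical helper below takes the target column as an argument, which removes the hard-coded value problem:
library(dplyr)
# summarize_named() is a hypothetical helper, not part of dplyr
summarize_named <- function(data, col) {
  data %>%
    summarize(Right      = mean({{ col }}),
              Wrong      = sum({{ col }}),
              Unanswered = median({{ col }}),
              Invalid    = sd({{ col }}))
}
toy_df %>%
  group_by(group) %>%
  summarize_named(value)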
As a bonus, here's a simple data.table solution using your original function:
library(data.table)
setDT(toy_df)[, as.list(toy_fn(value)), by = group]
# group Right Wrong Unanswered Invalid
# 1: o 0.4490776 17.51403 0.4012057 0.2749956
# 2: s 0.5079569 15.23871 0.4663852 0.2555774
# 3: x 0.4620649 14.78608 0.4475117 0.2894502
# 4: a 0.5038394 20.15358 0.5905526 0.2846468
# 5: t 0.5041168 24.19761 0.5330790 0.3171022
# 6: m 0.4806628 21.14917 0.4805273 0.2825026
# 7: c 0.5029442 21.62660 0.5072733 0.2465612
# 8: w 0.4932484 17.75694 0.4891746 0.3309680
# 9: q 0.5350707 22.47297 0.5608505 0.2749941
# 10: g 0.4675324 14.96104 0.4692404 0.2746589
#...
You can also try this with do():
toy_df %>%
  group_by(group) %>%
  do(res = toy_fn(.$value))
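The res column this returns is a list-column. If you would rather have the named elements spread into separate columns, one variant (a sketch using the same do() mechanism) wraps the result in a one-row data frame:
toy_df %>%
  group_by(group) %>%
  # as.list() turns the named vector into list elements; data.frame() makes
  # a one-row data frame per group, which do() binds back together
  do(data.frame(as.list(toy_fn(.$value))))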
This is not a dplyr solution, but if you like pipes:
library(magrittr)
toy_summary <- toy_df %>%
  split(.$group) %>%
  lapply(function(x) toy_fn(x$value)) %>%
  do.call(rbind, .)
# > head(toy_summary)
# Right Wrong Unanswered Invalid
# a 0.5038394 20.15358 0.5905526 0.2846468
# b 0.5048040 15.64892 0.5163702 0.2994544
# c 0.5029442 21.62660 0.5072733 0.2465612
# d 0.5124601 14.86134 0.5382463 0.2681955
# e 0.4649483 17.66804 0.4426197 0.3075080
# f 0.5622644 12.36982 0.6330269 0.2850609
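Note that the do.call(rbind, .) step returns a matrix with the groups as row names; to get back to a data frame with an explicit group column, something like this works (a sketch assuming the tibble package is available):
library(tibble)
toy_summary %>%
  as.data.frame() %>%
  rownames_to_column("group")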
Apparently there's a problem when using median (not sure what's going on there), but apart from that you can normally use an approach like the following with summarise_each to apply multiple functions. Note that you can specify the names of the resulting columns by using a named vector as input to funs_():
x <- c(Right = "mean", Wrong = "sd", Unanswered = "sum")
toy_df %>%
  group_by(group) %>%
  summarise_each(funs_(x), value)
#Source: local data frame [26 x 4]
#
# group Right Wrong Unanswered
#1 a 0.5038394 0.2846468 20.15358
#2 b 0.5048040 0.2994544 15.64892
#3 c 0.5029442 0.2465612 21.62660
#4 d 0.5124601 0.2681955 14.86134
#5 e 0.4649483 0.3075080 17.66804
#6 f 0.5622644 0.2850609 12.36982
#7 g 0.4675324 0.2746589 14.96104
#8 h 0.4921506 0.2879830 21.16248
#9 i 0.5443600 0.2945428 22.31876
#10 j 0.5276048 0.3236814 20.57659
#.. ... ... ... ...
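As a side note, summarise_each() and funs_() have since been deprecated; a rough modern equivalent (a sketch assuming dplyr >= 1.0.0) uses across() with a named list of functions:
toy_df %>%
  group_by(group) %>%
  # .names = "{.fn}" names each output column after its function
  summarise(across(value,
                   list(Right = mean, Wrong = sd, Unanswered = sum),
                   .names = "{.fn}"))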
Using list(as_tibble(as.list(...))) followed by an unnest from tidyr does the trick:
toy_summary2 <- toy_df %>%
  group_by(group) %>%
  summarize(Col = list(as_tibble(as.list(toy_fn(value))))) %>%
  unnest()
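On recent versions this can be shortened further, since summarize() unpacks an unnamed one-row data frame into columns (a sketch assuming dplyr >= 1.0.0 and tibble >= 3.0.0, which introduced as_tibble_row()):
library(dplyr)
library(tibble)
toy_df %>%
  group_by(group) %>%
  # the unnamed one-row tibble is unpacked into Right/Wrong/Unanswered/Invalid
  summarize(as_tibble_row(toy_fn(value)))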
I would like to extract the value of var2 that corresponds to the minimum value of var1 in each building-month combination. Here's my (fake) data set:
head(mydata)
# building month var1 var2
#1 A 1 -26.96333 376.9633
#2 A 1 165.38759 317.3993
#3 A 1 47.46345 271.0137
#4 A 2 73.47784 294.8171
#5 A 2 107.80130 371.7668
#6 A 2 10.16384 308.7975
Reproducible code:
## create fake data set:
set.seed(142)
mydata1 = data.frame(building = rep(LETTERS[1:5], 6), month = sort(rep(1:6, 5)),
                     var1 = rnorm(30, 50, 35), var2 = runif(30, 200, 400))
mydata2 = data.frame(building = rep(LETTERS[1:5], 6), month = sort(rep(1:6, 5)),
                     var1 = rnorm(30, 60, 35), var2 = runif(30, 150, 400))
mydata3 = data.frame(building = rep(LETTERS[1:5], 6), month = sort(rep(1:6, 5)),
                     var1 = rnorm(30, 40, 35), var2 = runif(30, 250, 400))
mydata = rbind(mydata1, mydata2, mydata3)
mydata = mydata[order(mydata[, "building"], mydata[, "month"]), ]
row.names(mydata) = 1:nrow(mydata)
## here is how I pull the minimum value of v1 for each building-month combination:
require(reshape2)
m1 = melt(mydata, id.var=1:2)
d1 = dcast(m1, building ~ month, function(x) min(max(x, 0), na.rm = TRUE),
           subset = .(variable == "var1"))
This pulls out the minimum value of var1 for each building-month combo...
head(d1)
# building 1 2 3 4 5 6
#1 A 165.38759 107.80130 93.32816 73.23279 98.55546 107.58780
#2 B 92.08704 98.94959 57.79610 94.10530 80.86883 99.75983
#3 C 93.38284 100.13564 52.26178 62.37837 91.98839 97.44797
#4 D 82.43440 72.43868 66.83636 105.46263 133.02281 94.56457
#5 E 70.09756 61.44406 30.78444 68.24334 94.35605 61.60610
However, what I want is a data frame set up exactly like d1 that instead shows the value of var2 corresponding to the minimum value pulled for var1 (shown in d1 above). My gut tells me it should be a variation on which.min(), but I haven't gotten this to work with dcast() or ddply(). Any help is appreciated!
It may be possible in one step, but I'm more familiar with plyr than reshape2:
dcast(ddply(mydata, .(building, month), summarize, value = var2[which.min(var1)]),
      building ~ month)
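For comparison, a dplyr/tidyr sketch of the same result (assuming dplyr >= 1.0.0 for slice_min() and tidyr >= 1.0.0 for pivot_wider()):
library(dplyr)
library(tidyr)
mydata %>%
  group_by(building, month) %>%
  slice_min(var1, n = 1, with_ties = FALSE) %>%  # the row with the smallest var1
  ungroup() %>%
  select(building, month, var2) %>%
  pivot_wider(names_from = month, values_from = var2)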