Minimum value matching across values in multiple columns - r

I would like to return a dataframe with the minimum value of column one based on the values of columns 2-4:
df <- data.frame(one = rnorm(1000),
two = sample(letters, 1000, replace = T),
three = sample(letters, 1000, replace = T),
four = sample(letters, 1000, replace = T))
I can do:
library(dplyr)
df_group <- df %>%
  group_by(two) %>%
  filter(one == min(one))
This gets me the lowest value of all the "m's" in column two, but what if column three or four had a lower "m" value in column one?
The output should look like this:
one two
1 -0.311609752 r
2 0.053166742 n
3 1.546485810 a
4 -0.430308725 d
5 -0.145428664 c
6 0.419181639 u
7 0.008881661 i
8 1.223517580 t
9 0.797273157 b
10 0.790565358 v
11 -0.560031797 e
12 -1.546234090 q
13 -1.847945540 l
14 -1.489130228 z
15 -1.203255034 g
16 0.146969892 m
17 -0.552363433 f
18 -0.006234646 w
19 0.982932856 s
20 0.751936728 o
21 0.220751258 h
22 -1.557436228 y
23 -2.034885868 k
24 -0.463354387 j
25 -0.351448850 p
26 1.331365941 x
I don't care which column has the lowest value for a given letter, I just need the lowest value and the letter column.
I'm trying to wrap my head around writing this simply. This might be a duplicate, but I didn't know how to word the title and couldn't find any material or previous questions on how to do it.

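Another solution, based on data.table:
library(data.table)
setDT(df)
# melt every column except "one" into a single "value" column,
# then take the minimum of "one" for each letter
melt(df, measure.vars = setdiff(names(df), "one"))[
  , .(one = min(one)), by = value]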

You can do something like this:
library(dplyr); library(tidyr)
df %>% gather(cols, letts, -one) %>% # gather all letters into one column
group_by(letts) %>%
summarise(one = min(one)) # do a group by summary for each letter
# A tibble: 26 × 2
# letts one
# <chr> <dbl>
#1 a -2.092327
#2 b -2.461102
#3 c -3.055858
#4 d -2.092327
#5 e -2.461102
#6 f -2.249439
#7 g -1.941632
#8 h -2.543310
#9 i -3.055858
#10 j -1.896974
# ... with 16 more rows
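
The same stack-then-summarise idea can also be sketched in base R, without any packages. This is only an illustration on a tiny made-up data frame (the values below are assumptions chosen so the result is easy to check by hand, not the asker's data):

```r
# Small hypothetical example with the same column layout as the question
df <- data.frame(one   = c(0.5, -1.2, 0.3, 2.0),
                 two   = c("a", "b", "a", "c"),
                 three = c("b", "c", "c", "a"),
                 four  = c("c", "c", "b", "d"),
                 stringsAsFactors = FALSE)

letter_cols <- c("two", "three", "four")

# Stack the letter columns on top of each other and repeat "one" to match
long <- data.frame(
  one   = rep(df$one, times = length(letter_cols)),
  letts = unlist(df[letter_cols], use.names = FALSE)
)

# Minimum of "one" for each letter, regardless of which column it came from
res <- aggregate(one ~ letts, data = long, FUN = min)
res
```

Because unlist() concatenates the columns in order, repeating `one` once per letter column keeps every value aligned with its letter.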

Related

Wide table to long table

I have a data frame called "result" that looks like this:
result <- structure(list(D = c(2986.286, 2842.54, 2921), E = c(3020.458,
2943.926, 2860.763), F = c(3008.644, 3142.134, 3002.515), G = c(2782.983,
3135.148, 2873.025), H = c(2874.082, 3066.655, 2778.107), I = c(2592.377,
3017.99, 2859.603), J = c(3051.184, 3011.467, 3007.769)), class = "data.frame", row.names = c("Above average",
"Below average", "very Good"))
I have tried the following:
result_long <- pivot_longer(result, 1:7, names_to="combination", values_to ="price.m" )
But this way I lose my "index": "Above average", ..., "very Good".
My result is:
# A tibble: 21 x 2
letter value
<chr> <dbl>
1 D 2986.
2 E 3020.
3 F 3009.
4 G 2783.
5 H 2874.
6 I 2592.
7 J 3051.
8 D 2843.
9 E 2944.
10 F 3142.
# ... with 11 more rows
Anyone know how I can achieve the same result but keep the column/index "Above average", "Very good", "Below average"?
While a tibble can have row names (e.g., when converting from a regular data frame), they are removed when subsetting with the [ operator. A warning will be raised when attempting to assign non-NULL row names to a tibble.
Generally, it is best to avoid row names, because they are basically a character column with different semantics than every other column.
https://tibble.tidyverse.org/reference/rownames.html
In your case pivot_longer is removing the row names, but you can save them as a column with rownames_to_column from the tibble package before transforming with pivot_longer, like this:
library(tibble)
library(tidyr)
library(dplyr)
result_long <- result %>%
rownames_to_column("id") %>%
pivot_longer(
-id,
names_to="combination",
values_to ="price.m"
)
# A tibble: 21 x 3
id combination price.m
<chr> <chr> <dbl>
1 Above average D 2986.
2 Above average E 3020.
3 Above average F 3009.
4 Above average G 2783.
5 Above average H 2874.
6 Above average I 2592.
7 Above average J 3051.
8 Below average D 2843.
9 Below average E 2944.
10 Below average F 3142.
# ... with 11 more rows
In base R, we may convert to table and wrap with as.data.frame
as.data.frame.table(as.matrix(result))
-output
Var1 Var2 Freq
1 Above average D 2986.286
2 Below average D 2842.540
3 very Good D 2921.000
4 Above average E 3020.458
5 Below average E 2943.926
6 very Good E 2860.763
7 Above average F 3008.644
8 Below average F 3142.134
...
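
If the default names Var1, Var2 and Freq are not wanted, the base R route can be adjusted along these lines. This is a sketch on a cut-down, made-up version of the data; the column names "id", "combination" and "price.m" are borrowed from the tidyr answer above:

```r
# Hypothetical two-by-two stand-in for the asker's data frame
result <- data.frame(D = c(1, 2), E = c(3, 4),
                     row.names = c("Above average", "Below average"))

# responseName sets the name of the value column (instead of "Freq")
long <- as.data.frame.table(as.matrix(result), responseName = "price.m")

# Rename the two dimension columns (Var1 = row names, Var2 = column names)
names(long)[1:2] <- c("id", "combination")
long
```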

use assign / create new object with value (dplyr)

I want to create a new variable with the value of column number in my DF by its name.
I have managed to do this:
firstCol <- which(colnames(Mydf) == "Cars")
It takes the number of the column named "Cars" and assigns it to the object firstCol. This works well in base R.
Lately I've been using dplyr and pipes, and I'd like to create the same variable with a pipe (%>%), but I haven't managed to.
Can you help me?
thanks,
Ido
The dplyr way to do this is select.
Here is an example using some made up data:
library(dplyr)
df <- data.frame(cars = sample(LETTERS, 100, replace = TRUE),
                 mpg = runif(100, 15, 45),
                 color = sample(c("green", "red", "blue", "silver"),
                                100, replace = TRUE)) %>% as_tibble()
df %>% select(cars)
# A tibble: 100 x 1
cars
<chr>
1 R
2 V
3 I
4 Q
5 P
6 D
7 J
8 Q
9 R
10 A
# ... with 90 more rows
You can also remove columns with select(-col_name)
df %>% select(-mpg)
# A tibble: 100 x 2
cars color
<chr> <chr>
1 R blue
2 V silver
3 I red
4 Q green
5 P silver
6 D silver
7 J green
8 Q blue
9 R red
10 A silver
# ... with 90 more rows
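
If the goal is the column's position rather than its contents, the original which() line can be written as a pipe. A minimal sketch, assuming a made-up Mydf (the column names other than "Cars" are hypothetical); `%>%` comes from magrittr and is also available after library(dplyr):

```r
library(magrittr)  # provides %>%; dplyr re-exports the same operator

# Hypothetical data frame in which "Cars" is the second column
Mydf <- data.frame(Id = 1:3,
                   Cars = c("a", "b", "c"),
                   Mpg = c(21, 30, 18))

# match() returns the position of "Cars" among the column names;
# the dot marks where the piped value goes
firstCol <- Mydf %>% names() %>% match("Cars", .)
firstCol
```

The assignment still happens with `<-` on the left; the pipe only replaces the nested call.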

A clean way for adding variable-length values to data frame by group

I am creating random data. It should contain variables id and val where values cannot overlap within a single id but can overlap across id-s. Different id-s have different number of values n. I can create the desired result manually as:
n <- c(3,2,4)
data.frame(id=rep(letters[1:3], n),
val=c(sample(10, n[1]),
sample(10, n[2]),
sample(10, n[3])))
id val
1 a 5
2 a 10
3 a 4
4 b 9
5 b 10
6 c 10
7 c 5
8 c 2
9 c 9
I can also imagine different solutions involving looping over groups and using rbind, or rep-ing the id-s the corresponding number of times. But all such approaches feel dirty, and may not scale to many variables and large data.
Are there any cleaner ways to achieve it? Something like (in dplyrish):
data.frame(id=letters[1:3]) %>%
mutate(i = row_number()) %>%
group_by(id) %>%
summarize_into_df(id=id, val=sample(10, n[i]))
You can loop through n with lapply, create a list column using sample, then unnest it:
library(dplyr)
library(tidyr)
n <- c(3,2,4)
data.frame(id = letters[seq_along(n)]) %>%
  mutate(val = lapply(n, sample, x = 10)) %>%  # one sampled vector per id
  unnest(val)                                  # expand the list column into rows
# id val
#1 a 9
#2 a 4
#3 a 10
#4 b 4
#5 b 8
#6 c 5
#7 c 10
#8 c 8
#9 c 2
Or without using any package, which is very close to what you have, just replace manual construct with unlist(lapply(...)):
data.frame(id = rep(letters[1:length(n)], n),
val = unlist(lapply(n, sample, x=10)))

Evaluation order inconsistency with dplyr mutate

I have 2 functions that I use inside a mutate call. One produces per row results as expected while the other repeats the same value for all rows:
library(dplyr)
df <- data.frame(X = rpois(5, 10), Y = rpois(5,10))
pv <- function(a, b) {
fisher.test(matrix(c(a, b, 10, 10), 2, 2),
alternative='greater')$p.value
}
div <- function(a, b) a/b
mutate(df, d = div(X,Y), p = pv(X, Y))
which produces something like:
X Y d p
1 9 15 0.6000000 0.4398077
2 8 7 1.1428571 0.4398077
3 9 14 0.6428571 0.4398077
4 11 15 0.7333333 0.4398077
5 11 7 1.5714286 0.4398077
i.e. the d column varies, but p is constant and its value does not actually correspond to the X and Y values in any of the rows.
I suspect this relates to NSE, but I do not understand how from what little I have been able to find out about it.
What accounts for the different behaviours of div and pv? How do I fix pv?
We need rowwise
df %>%
rowwise() %>%
mutate(d = div(X,Y), p = pv(X,Y))
# X Y d p
# <int> <int> <dbl> <dbl>
#1 10 9 1.111111 0.5619072
#2 12 8 1.500000 0.3755932
#3 9 8 1.125000 0.5601923
#4 11 16 0.687500 0.8232217
#5 16 10 1.600000 0.3145350
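In the OP's code, pv receives the whole 'X' and 'Y' columns as input and returns a single value, which mutate then recycles across all rows. div, by contrast, uses the vectorised / operator and so returns one value per row.
Or, as @Frank mentioned, mapply can be used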
df %>%
mutate(d = div(X,Y), p = mapply(pv, X, Y))
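
To see why the unmodified pv yields one value, it may help to look at what matrix() does with whole columns. The sketch below uses base R only and the X and Y values from the question's printed output; Vectorize() is shown as one more option alongside rowwise and mapply, not as the answerers' method:

```r
pv <- function(a, b) {
  fisher.test(matrix(c(a, b, 10, 10), 2, 2),
              alternative = "greater")$p.value
}

X <- c(9, 8, 9, 11, 11)
Y <- c(15, 7, 14, 15, 7)

# With whole columns, c(X, Y, 10, 10) has 12 elements; matrix() keeps only
# the first four (all taken from X), so a single p-value comes back.
m <- suppressWarnings(matrix(c(X, Y, 10, 10), 2, 2))
m

# Vectorize() wraps pv so it is called once per pair of elements
pv_vec <- Vectorize(pv)
p <- pv_vec(X, Y)
p
```

This also explains why the constant p in the question matched none of the rows: it was computed from the first four X values alone.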

Getting a summary data frame for all the combinations of categories represented in two columns

I am working with a data frame corresponding to the example below:
set.seed(1)
dta <- data.frame("CatA" = rep(c("A","B","C"), 4), "CatNum" = rep(1:2,6),
"SomeVal" = runif(12))
I would like to quickly build a data frame that would have sum values for all the combinations of the categories derived from the CatA and CatNum as well as for the categories derived from each column separately. On the primitive example above, for the first couple of combinations, this can be achieved with use of simple code:
df_sums <- data.frame(
"Category" = c("Total for A",
"Total for A and 1",
"Total for A and 2"),
"Sum" = c(sum(dta$SomeVal[dta$CatA == 'A']),
sum(dta$SomeVal[dta$CatA == 'A' & dta$CatNum == 1]),
sum(dta$SomeVal[dta$CatA == 'A' & dta$CatNum == 2]))
)
This produces an informative data frame of sums:
Category Sum
1 Total for A 2.1801780
2 Total for A and 1 1.2101839
3 Total for A and 2 0.9699941
This solution would be grossly inefficient when applied to a data frame with multiple categories. I would like to achieve the following:
Cycle through all the categories, including categories derived from each column separately as well as from both columns at the same time.
Achieve some flexibility with respect to how the function is applied; for instance, I may want to apply mean instead of sum.
Save the "Total for" string as a separate object that I can easily edit when applying a function other than sum.
I was initially thinking of using dplyr, along these lines:
require(dplyr)
df_sums_experiment <- dta %>%
group_by(CatA, CatNum) %>%
summarise(TotVal = sum(SomeVal))
But it's not clear to me how I could apply multiple groupings simultaneously. As stated, I'm interested in grouping by each column separately and by the combination of both columns. I would also like to create a string column that would indicate what is combined and in what order.
You could use tidyr to unite the columns and gather the data. Then use dplyr to summarise:
library(dplyr)
library(tidyr)
dta %>% unite(measurevar, CatA, CatNum, remove=FALSE) %>%
gather(key, val, -SomeVal) %>%
group_by(val) %>%
summarise(sum(SomeVal))
val sum(SomeVal)
(chr) (dbl)
1 1 2.8198078
2 2 3.0778622
3 A 2.1801780
4 A_1 1.2101839
5 A_2 0.9699941
6 B 1.4405782
7 B_1 0.4076565
8 B_2 1.0329217
9 C 2.2769138
10 C_1 1.2019674
11 C_2 1.0749464
Just loop over the column combinations, compute the quantities you want and then rbind them together:
library(data.table)
dt = as.data.table(dta) # or setDT to convert in place
cols = c('CatA', 'CatNum')
rbindlist(apply(combn(c(cols, ""), length(cols)), 2,
function(i) dt[, sum(SomeVal), by = c(i[i != ""])]), fill = T)
# CatA CatNum V1
# 1: A 1 1.2101839
# 2: B 2 1.0329217
# 3: C 1 1.2019674
# 4: A 2 0.9699941
# 5: B 1 0.4076565
# 6: C 2 1.0749464
# 7: A NA 2.1801780
# 8: B NA 1.4405782
# 9: C NA 2.2769138
#10: NA 1 2.8198078
#11: NA 2 3.0778622
Split, then use lapply:
#result
res <- do.call(rbind,
lapply(
c(split(dta,dta$CatA),
split(dta,dta$CatNum),
split(dta,dta[,1:2])),
function(i)sum(i[,"SomeVal"])))
#prettify the result
res1 <- data.frame(Category=paste0("Total for ",rownames(res)),
Sum=res[,1])
res1$Category <- sub("."," and ",res1$Category,fixed=TRUE)
row.names(res1) <- seq_along(row.names(res1))
res1
# Category Sum
# 1 Total for A 2.1801780
# 2 Total for B 1.4405782
# 3 Total for C 2.2769138
# 4 Total for 1 2.8198078
# 5 Total for 2 3.0778622
# 6 Total for A and 1 1.2101839
# 7 Total for B and 1 0.4076565
# 8 Total for C and 1 1.2019674
# 9 Total for A and 2 0.9699941
# 10 Total for B and 2 1.0329217
# 11 Total for C and 2 1.0749464
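
The question also asked for a swappable summary function and an editable label. One way this could be sketched in base R is shown below; summarise_groups and its arguments are hypothetical names introduced for illustration, not part of any package:

```r
set.seed(1)
dta <- data.frame(CatA = rep(c("A", "B", "C"), 4),
                  CatNum = rep(1:2, 6),
                  SomeVal = runif(12))

# Hypothetical helper: apply `fun` to `value` within each grouping,
# where `groupings` is a list of character vectors of column names.
summarise_groups <- function(data, value, groupings,
                             fun = sum, label = "Total for ") {
  pieces <- lapply(groupings, function(cols) {
    # Build one key per row, e.g. "A" or "A and 1"
    key <- do.call(paste, c(data[cols], sep = " and "))
    res <- tapply(data[[value]], key, fun)
    data.frame(Category = paste0(label, names(res)),
               Value = as.vector(res))
  })
  do.call(rbind, pieces)
}

groupings <- list("CatA", "CatNum", c("CatA", "CatNum"))
out <- summarise_groups(dta, "SomeVal", groupings)
out

# Same groupings with a different function and label
out_mean <- summarise_groups(dta, "SomeVal", groupings,
                             fun = mean, label = "Mean for ")
```

Adding another grouping is then a matter of appending a vector of column names to the list.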

Resources