My data frame is set as follows:
Black White Red Blue
0.8 0.1 0.07 0.03
0.3 0.6 0 0.1
0.1 0.6 0.25 0.05
I wanted my data frame to look like this:
Black White Red Blue Color1 Color2 Color3 Color4
0.8 0.1 0.07 0.03 0.8 0.1 0.07 0.03
0.3 0.6 0 0.1 0.6 0.3 0.1 0
0.1 0.6 0.25 0.05 0.6 0.25 0.1 0.05
In which Color1 represents the largest value for each row, Color2 represents the second largest value, Color3 represents the third largest, and Color4 represents the smallest value for each row.
So far, I've used this function to obtain what I wanted, which is the result above:
maxn <- function(n) function(x) order(x, decreasing = TRUE)[n]
df$Color1 <- apply(df, 1, max)
df$Color2 <- apply(df, 1, function(x)x[maxn(3)(x)])
df$Color3 <- apply(df, 1, function(x)x[maxn(4)(x)])
df$Color4 <- apply(df, 1, function(x)x[maxn(5)(x)])
Is there a more concise way for me to arrange my dataset?
Additionally, a bit off-topic: I'm not sure whether it's because I'm working with a CSV file, but whenever I use
df$Color2 <- apply(df, 1, function(x) x[maxn(2)(x)])
it returns the same result as both
apply(df, 1, max)
and
apply(df, 1, function(x) x[maxn(1)(x)])
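(A likely explanation, unrelated to the file format: once df$Color1 <- apply(df, 1, max) has run, each row already contains its maximum twice, so the second-largest value of the extended row is that duplicated maximum. A small illustration:)
x <- c(Black = 0.8, White = 0.1, Red = 0.07, Blue = 0.03, Color1 = 0.8)
x[order(x, decreasing = TRUE)[2]]
# 0.8 (the Color1 copy) -- the same as max(x), because the maximum now appears twice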
One option is to use sort with apply, transpose the result, and then cbind it with the data frame:
cbind(df, t(apply(df, 1, sort, decreasing = TRUE)))
# Black White Red Blue 1 2 3 4
# 1 0.8 0.1 0.07 0.03 0.8 0.10 0.07 0.03
# 2 0.3 0.6 0.00 0.10 0.6 0.30 0.10 0.00
# 3 0.1 0.6 0.25 0.05 0.6 0.25 0.10 0.05
Update: based on a suggestion from @dww, column names can be assigned as:
df[paste0('color',1:4)] = t(apply(df, 1, sort, decreasing = TRUE))
# Black White Red Blue color1 color2 color3 color4
# 1 0.8 0.1 0.07 0.03 0.8 0.10 0.07 0.03
# 2 0.3 0.6 0.00 0.10 0.6 0.30 0.10 0.00
# 3 0.1 0.6 0.25 0.05 0.6 0.25 0.10 0.05
It's quite a bit more complex, but a speedier solution when you're dealing with a large number of rows is to do the sorting/ordering only once and reshape the result back into a matrix:
matrix(x[order(-row(x), x, decreasing=TRUE)], nrow=nrow(x), ncol=ncol(x), byrow=TRUE)
Some timings:
x <- matrix(rnorm(300000*5), nrow=300000, ncol=5)
system.time(t(apply(x, 1, sort, decreasing=TRUE)))
# user system elapsed
# 14.13 0.00 14.13
system.time(
matrix(x[order(-row(x),x, decreasing=TRUE)], nrow=nrow(x), ncol=ncol(x), byrow=TRUE)
)
# user system elapsed
# 0.10 0.00 0.09
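A minimal sketch (assuming df is the original four-colour data frame from the question) of how this faster approach can be combined with the named columns shown earlier; the linear-indexing trick needs a matrix, so the data frame is converted first:
m <- as.matrix(df)
df[paste0('color', 1:4)] <- matrix(m[order(-row(m), m, decreasing = TRUE)],
                                   nrow = nrow(m), ncol = ncol(m), byrow = TRUE)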
I am trying to implement a for loop in R to fill a data frame with combinations of learning rates and decays used in machine learning. The idea is to try several learning rates and decays, calculate the error metrics for these combinations, and save them in a dataset, so I can point out which combination is best.
Below are the code and my result. I don't understand why I get this result.
learning_rate = c(0.01, 0.02)
decay = c(0, 1e-1)
combinations = length(learning_rate) * length(decay)
df <- data.frame(Combination=character(combinations),
lr=double(combinations),
decay=double(combinations),
result=character(combinations),
stringsAsFactors=FALSE)
for (i in 1:combinations) {
for (lr in learning_rate) {
for (dc in decay) {
df[i, 1] = i
df[i, 2] = lr
df[i, 3] = dc
df[i, 4] = 10*lr + dc*4 # Here I'd do some machine learning; this simple equation is just an example
}
}
}
This is the result I get. It seems that only the Combination column came out right. What did I do wrong?
Combination lr decay result
1 0.02 0.1 0.6
2 0.02 0.1 0.6
3 0.02 0.1 0.6
4 0.02 0.1 0.6
I expected this result
Combination lr decay result
1 0.01 0 0.1
2 0.01 1e-1 0.5
3 0.02 0 0.2
4 0.02 1e-1 0.6
The problem with your code is that the outer loop over i is independent of the two inner loops: for each value of i, the inner loops run through every (lr, dc) pair and keep overwriting row i, so every row ends up holding the last combination. It is simpler to drop the counter loop and build the rows directly.
Tuning with a for loop:
df <- data.frame()
for (lr in learning_rate) {
for (dc in decay) {
df <- rbind(df, data.frame(
lr = lr,
decay = dc,
result = 10*lr + dc*4
))
}
}
df
# lr decay result
# 1 0.01 0.0 0.1
# 2 0.01 0.1 0.5
# 3 0.02 0.0 0.2
# 4 0.02 0.1 0.6
Tuning with mapply():
df <- expand.grid(lr = learning_rate, decay = decay)
ML.fun <- function(lr, dc) 10*lr + dc*4
df$result <- mapply(ML.fun, lr = df$lr, dc = df$decay)
df
# lr decay result
# 1 0.01 0.0 0.1
# 2 0.02 0.0 0.2
# 3 0.01 0.1 0.5
# 4 0.02 0.1 0.6
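For completeness, a minimal sketch (not part of the answers above) of how the original nested loop could be fixed instead: keep a single counter and advance it inside the innermost loop, so each combination gets its own row.
learning_rate <- c(0.01, 0.02)
decay <- c(0, 1e-1)
combinations <- length(learning_rate) * length(decay)
df <- data.frame(Combination = integer(combinations),
                 lr = double(combinations),
                 decay = double(combinations),
                 result = double(combinations))
i <- 0
for (lr in learning_rate) {
  for (dc in decay) {
    i <- i + 1                   # one row per (lr, dc) combination
    df[i, 1] <- i
    df[i, 2] <- lr
    df[i, 3] <- dc
    df[i, 4] <- 10*lr + dc*4     # placeholder for the ML step
  }
}
df then matches the expected result shown in the question.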
I have a data.frame like this:
value condition
1 0.46 value > 0.5
2 0.96 value == 0.79
3 0.45 value <= 0.65
4 0.68 value == 0.88
5 0.57 value < 0.9
6 0.10 value > 0.01
7 0.90 value >= 0.6
8 0.25 value < 0.91
9 0.04 value > 0.2
structure(list(value = c(0.46, 0.96, 0.45, 0.68, 0.57, 0.1, 0.9,
0.25, 0.04), condition = c("value > 0.5", "value == 0.79", "value <= 0.65",
"value == 0.88", "value < 0.9", "value > 0.01", "value >= 0.6",
"value < 0.91", "value > 0.2")), class = "data.frame", row.names = c(NA,
-9L))
I would like to evaluate the strings in the condition column for every row.
So the result would look like this.
value condition goal
1 0.46 value > 0.5 FALSE
2 0.96 value == 0.79 FALSE
3 0.45 value <= 0.65 TRUE
4 0.68 value == 0.88 FALSE
5 0.57 value < 0.9 TRUE
6 0.10 value > 0.01 TRUE
7 0.90 value >= 0.6 TRUE
8 0.25 value < 0.91 TRUE
9 0.04 value > 0.2 FALSE
I suppose there is a handy NSE solution within the dplyr framework. I have experimented with !! and expr() and others. I got some promising results when trying to subset by condition using
result <- df[0,]
for(i in 1:nrow(df)) {
result <- rbind(result, filter_(df[i,], bquote(.(df$condition[i]))))
}
But I don't like the solution and it's not exactly what I'm after.
I hope someone can help.
UPDATE: I'm trying to avoid eval(parse(..)).
Not entirely sure whether this is what you are looking for; however, you can also use lazy_eval() from lazyeval:
library(dplyr)
library(lazyeval)
df %>%
  rowwise() %>%
  mutate(res = lazy_eval(sub("value", value, condition)))
value condition res
<dbl> <chr> <lgl>
1 0.46 value > 0.5 FALSE
2 0.96 value == 0.79 FALSE
3 0.45 value <= 0.65 TRUE
4 0.68 value == 0.88 FALSE
5 0.570 value < 0.9 TRUE
6 0.1 value > 0.01 TRUE
7 0.9 value >= 0.6 TRUE
8 0.25 value < 0.91 TRUE
9 0.04 value > 0.2 FALSE
And even though it is very close to eval(parse(...)), another possibility is parse_expr() from rlang:
df %>%
rowwise() %>%
mutate(res = eval(rlang::parse_expr(condition)))
One straightforward and easy solution would be to use eval(parse(...)):
library(dplyr)
df %>%
rowwise() %>%
mutate(goal = eval(parse(text = condition)))
# A tibble: 9 x 3
# value condition goal
# <dbl> <chr> <lgl>
#1 0.46 value > 0.5 FALSE
#2 0.96 value == 0.79 FALSE
#3 0.45 value <= 0.65 TRUE
#4 0.68 value == 0.88 FALSE
#5 0.570 value < 0.9 TRUE
#6 0.1 value > 0.01 TRUE
#7 0.9 value >= 0.6 TRUE
#8 0.25 value < 0.91 TRUE
#9 0.04 value > 0.2 FALSE
However, I would recommend reading up on why eval(parse(...)) is generally discouraged before using it.
Using match.fun:
# split each condition on spaces; keep the comparison function and the numeric threshold
myFun <- lapply(strsplit(df1$condition, " "), function(i){
  list(f = match.fun(i[2]),
       v = as.numeric(i[3]))
})
# apply each row's comparison function to its value and threshold
df1$goal <- mapply(function(x, y){
  x[["f"]](y, x[["v"]])
}, x = myFun, y = df1$value)
# value condition goal
# 1 0.46 value > 0.5 FALSE
# 2 0.96 value == 0.79 FALSE
# 3 0.45 value <= 0.65 TRUE
# 4 0.68 value == 0.88 FALSE
# 5 0.57 value < 0.9 TRUE
# 6 0.10 value > 0.01 TRUE
# 7 0.90 value >= 0.6 TRUE
# 8 0.25 value < 0.91 TRUE
# 9 0.04 value > 0.2 FALSE
If you want to avoid eval(parse(...)), you can try this:
library(tidyverse)
df %>% mutate(bound = as.numeric(str_extract(condition, "[0-9 \\.]*$")),
goal = case_when(grepl("==", condition) ~ value == bound,
grepl(">=", condition) ~ value >= bound,
grepl("<=", condition) ~ value <= bound,
grepl(">", condition) ~ value > bound,
grepl("<", condition) ~ value < bound,
TRUE ~ NA))
value condition bound goal
1 0.46 value > 0.5 0.50 FALSE
2 0.96 value == 0.79 0.79 FALSE
3 0.45 value <= 0.65 0.65 TRUE
4 0.68 value == 0.88 0.88 FALSE
5 0.57 value < 0.9 0.90 TRUE
6 0.10 value > 0.01 0.01 TRUE
7 0.90 value >= 0.6 0.60 TRUE
8 0.25 value < 0.91 0.91 TRUE
9 0.04 value > 0.2 0.20 FALSE
I am trying to change a data frame so that I keep only those columns whose value in the first row is among the n largest.
For example, let's assume I want to keep only the columns whose value in row 1 is among the 2 largest (top 2).
dat1 = data.frame(a = c(0.1,0.2,0.3,0.4,0.5), b = c(0.6,0.7,0.8,0.9,0.10), c = c(0.12,0.13,0.14,0.15,0.16), d = c(NA, NA, NA, NA, 0.5))
a b c d
1 0.1 0.6 0.12 NA
2 0.2 0.7 0.13 NA
3 0.3 0.8 0.14 NA
4 0.4 0.9 0.15 NA
5 0.5 0.1 0.16 0.5
Columns a and d should be removed, because their row-1 values (0.1 and NA) are not among the two largest in that row; 0.6 and 0.12 in columns b and c are larger. The result would be:
b c
1 0.6 0.12
2 0.7 0.13
3 0.8 0.14
4 0.9 0.15
5 0.1 0.16
Is there a simple way to subset this? I do not want to order it, because that will create problems with other data frames I have that are related.
Complementing pieca's answer, you can encapsulate that into a function.
This way, the returned data.frame also keeps its original column order (it is not sorted).
get_nth <- function(df, n) {
df[] <- lapply(df, as.numeric)  # make sure every column is numeric
cols <- names(sort(df[1, ], na.last = NA, decreasing = TRUE))
cols <- cols[seq(n)]
df <- df[names(df) %in% cols]
return(df)
}
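A quick usage example, which should reproduce the desired output from the question:
get_nth(dat1, 2)
#     b    c
# 1 0.6 0.12
# 2 0.7 0.13
# 3 0.8 0.14
# 4 0.9 0.15
# 5 0.1 0.16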
Hope this works for you.
Sort the first row of your data.frame, and then subset by names:
cols <- names(sort(dat1[1,], na.last = NA, decreasing = TRUE))
> dat1[,cols[1:2]]
b c
1 0.6 0.12
2 0.7 0.13
3 0.8 0.14
4 0.9 0.15
5 0.1 0.16
You can rank the negated first row and keep the columns whose rank is within the top n:
> r <- rank(-dat1[1,], na.last = TRUE)
> r <- r <= 2
> dat1[,r]
b c
1 0.6 0.12
2 0.7 0.13
3 0.8 0.14
4 0.9 0.15
5 0.1 0.16
I need to do some calculations as per the formula below, where each new value uses the already-updated value of the previous column:
new B1 = A1 + (1 - A1) * B1
Example (first row):
B1 = 0.2 + (1 - 0.2) * 0.4
   = 0.52
C1 = 0.52 + (1 - 0.52) * 0.8
   = 0.904
D1 = 0.904 + (1 - 0.904) * 0.5
   = 0.952
The same logic applies to the other rows and columns (there are 11 in total).
dataframe:
df
A B C D
0.2 0.4 0.8 0.5
0.4 0.5 0.6 0.2
0.8 0.1 0.5 0.4
0.3 0.4 0.1 0.8
Expected output:
A B C D
0.2 0.52 0.904 0.952
0.4 0.7 0.88 0.904
0.8 0.82 0.91 0.946
0.3 0.58 0.622 0.9244
I tried it with the code below:
Df <- df[-ncol(df)] + (1 - df[-ncol(df)]) * df[-1]
I was able to get column B as in the expected output, but it does not work for the rest of the columns.
Please help, thanks. BM.
You can do this recursively as follows:
do.call(cbind, Reduce(f = function(A1, B1) A1+(1-A1)*B1,
x = df,
accumulate = TRUE))
Explanation:
Since df is a data.frame, which is a list of vectors, Reduce takes each column vector in turn and applies your function, and accumulate = TRUE keeps every intermediate result. do.call(cbind, ...) then combines those results into a matrix.
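A small sketch (not part of the original answer) of how the original column names and a data.frame result can be kept, assuming the df shown in the question:
res <- do.call(cbind, Reduce(function(A1, B1) A1 + (1 - A1) * B1, df, accumulate = TRUE))
res <- as.data.frame(res)
names(res) <- names(df)   # restore A, B, C, D
res                       # should match the expected output in the question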
Example dataframe:
col_1 col_2 col_3 col_4
f1 0.1 0.2 0.3 0.4
f2 0.01 0.02 0.03 0.04
f3 0.001 0.002 0.003 0.004
I want to rename the columns by splitting their names on "_" to get this:
1 2 3 4
f1 0.1 0.2 0.3 0.4
f2 0.01 0.02 0.03 0.04
f3 0.001 0.002 0.003 0.004
Then I'd like to plot densities (f.name vs f.value) on the same plot (for instance: http://ggplot2.tidyverse.org/reference/geom_freqpoly-11.png), so I guess I need to melt the data into something like this:
col f.name f.value
1 f1 0.1
2 f1 0.2
3 f1 0.3
4 f1 0.4
1 f2 0.01
2 f2 0.02
3 f2 0.03
4 f2 0.04
1 f3 0.001
2 f3 0.002
3 f3 0.003
4 f3 0.004
Any suggestions on how to do that?
Without having tested the code: use the packages 'dplyr', 'tidyr' and 'ggplot2'. Where df is your input data frame, the following should work:
library(dplyr)
library(tidyr)
library(ggplot2)
df %>% gather(col, val, starts_with('col')) %>%
  separate(col, into = c('nah', 'col'), sep = '_') %>%
  ggplot(aes(x = val, colour = col)) + geom_freqpoly()
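A rough sketch (not part of the answer above) that also keeps the row names f1, f2, f3, in case you want one curve per f.name instead; it assumes df carries the row names shown in the question:
library(dplyr); library(tidyr); library(ggplot2)
df %>%
  tibble::rownames_to_column("f.name") %>%
  gather(col, f.value, -f.name) %>%
  separate(col, into = c("nah", "col"), sep = "_") %>%
  ggplot(aes(x = f.value, colour = f.name)) + geom_freqpoly()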