I have this dataframe:
df <- structure(list(x = c(1, 5, 6, 7, 8), y = c("a", "e", "f", "g",
"h")), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-5L))
x y
<dbl> <chr>
1 1 a
2 5 e
3 6 f
4 7 g
5 8 h
With complete() from the tidyr package, I can do:
df %>%
complete(x = full_seq(min(x):max(x), 1))
x y
<dbl> <chr>
1 1 a
2 2 NA
3 3 NA
4 4 NA
5 5 e
6 6 f
7 7 g
8 8 h
Now I would like to do the same with the y column:
df %>%
complete(y = full_seq(min(y):max(y), 1))
This obviously will not work.
How can I use complete from tidyr package for alphabetical order?
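I don't think that's possible in general: a range between two strings is only well defined for single letters, so strings with more than one letter could not be completed this way. For single lower-case letters, you can still use the built-in letters vector (this first version works only because x happens to match each letter's position in the alphabet):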
df %>%
complete(y = letters[full_seq(min(x):max(x), 1)])
or, relying entirely on y:
df %>%
complete(y = letters[which(letters == min(y)):which(letters == max(y))])
y x
1 a 1
2 b NA
3 c NA
4 d NA
5 e 5
6 f 6
7 g 7
8 h 8
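If tidyr is not available, the same idea can be sketched in base R with match() and merge(). This is only an illustration under the assumption that y holds single lower-case letters; the names idx, all_y, and out are made up for the example.

```r
# Base R sketch; idx, all_y, and out are illustrative names.
df <- data.frame(x = c(1, 5, 6, 7, 8),
                 y = c("a", "e", "f", "g", "h"),
                 stringsAsFactors = FALSE)

# Position of each letter in the alphabet
idx <- match(df$y, letters)

# Every letter between the smallest and largest position
all_y <- letters[seq(min(idx), max(idx))]

# Left join the full letter range onto df; letters missing from df get x = NA
out <- merge(data.frame(y = all_y, stringsAsFactors = FALSE), df,
             by = "y", all.x = TRUE)
out
```

Using match() rather than which(letters == ...) also handles the whole column in one call.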
I have the following dataframe called df (dput below):
> df
group value
1 A 5
2 A 1
3 A 1
4 A 5
5 B 8
6 B 2
7 B 2
8 B 3
9 C 10
10 C 1
11 C 1
12 C 8
I would like to filter groups based on the difference between their highest value (max) and second-highest value. The difference should be less than or equal to 2 (<= 2). This means that group B should be removed, because its highest value is 8 and its second-highest value is 3, a difference of 5. The desired output should look like this:
group value
1 A 5
2 A 1
3 A 1
4 A 5
5 C 10
6 C 1
7 C 1
8 C 8
So I was wondering if anyone knows how to filter groups based on the difference between their highest and second-highest value?
dput of df:
df<-structure(list(group = c("A", "A", "A", "A", "B", "B", "B", "B",
"C", "C", "C", "C"), value = c(5, 1, 1, 5, 8, 2, 2, 3, 10, 1,
1, 8)), class = "data.frame", row.names = c(NA, -12L))
Using dplyr
library(dplyr)
df %>%
group_by(group) %>%
filter(abs(diff(sort(value, decreasing = TRUE)[1:2])) <= 2) %>%
ungroup()
# A tibble: 8 × 2
group value
<chr> <dbl>
1 A 5
2 A 1
3 A 1
4 A 5
5 C 10
6 C 1
7 C 1
8 C 8
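To see which groups pass the condition and why, it can help to compute the gap per group first. The following is a base R sketch, not part of the answer above; top2_gap, gaps, and keep are hypothetical names introduced for illustration.

```r
df <- data.frame(group = rep(c("A", "B", "C"), each = 4),
                 value = c(5, 1, 1, 5, 8, 2, 2, 3, 10, 1, 1, 8),
                 stringsAsFactors = FALSE)

# Hypothetical helper: highest value minus second-highest value,
# or NA when a group has fewer than two values
top2_gap <- function(v) {
  s <- sort(v, decreasing = TRUE)
  if (length(s) < 2) NA_real_ else s[1] - s[2]
}

# One gap per group: A = 0, B = 5, C = 2
gaps <- tapply(df$value, df$group, top2_gap)

# Groups whose gap is at most 2, then the matching rows of df
keep <- names(gaps)[!is.na(gaps) & gaps <= 2]
res <- df[df$group %in% keep, ]
res
```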
A base R alternative
grp <- na.omit(aggregate(. ~ group, df, function(x)
abs(diff(sort(x, decreasing = TRUE)[1:2])) <= 2))
do.call(rbind, c(mapply(function(g, v)
list(df[df$group == g & v,]), grp$group, grp$value), make.row.names = FALSE))
group value
1 A 5
2 A 1
3 A 1
4 A 5
5 C 10
6 C 1
7 C 1
8 C 8
One possibility would be to first create a vector of the groups that meet your condition and then filter the original data.frame. Here is what I had in mind:
library(dplyr)
group_to_keep <-
df %>%
group_by(group) %>%
slice_max(value, n = 2, with_ties = FALSE) %>% # exactly two rows per group, even with ties
filter(abs(diff(value)) <= 2) %>%
pull(group) %>%
unique()
df %>%
filter(group %in% group_to_keep)
You can use ave.
df[ave(df$value, df$group, FUN=\(x) diff(sort(c(-x, Inf)))[1]) <= 2,]
# group value
#1 A 5
#2 A 1
#3 A 1
#4 A 5
#9 C 10
#10 C 1
#11 C 1
#12 C 8
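The expression inside FUN is terse, so here is a small sketch of what it computes for a single group, using group B's values from the question. The variable names s, gap, and single are only for illustration.

```r
x <- c(8, 2, 2, 3)           # group B from the question

# Negating makes the largest value sort first; Inf pads the end
s <- sort(c(-x, Inf))        # -8 -3 -2 -2 Inf

# First difference: (-3) - (-8), i.e. highest minus second highest
gap <- diff(s)[1]            # 5

# A group with a single value gets a gap of Inf, so it fails the <= 2 test
single <- diff(sort(c(-7, Inf)))[1]
```

ave() then repeats that gap for every row of the group, which is why the result can be used directly as a row filter.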
If you can be sure that every group always has at least two values, you can use either of the following.
df[ave(df$value, df$group, FUN=\(x) diff(tail(sort(x), 2))) <= 2,]
df[ave(df$value, df$group, FUN=\(x) diff(sort(-x)[1:2])) <= 2,]
I am new to R. I would like to calculate the mean for each row of a dataframe, but using a different subset of columns for each row. I have two extra columns giving the names of the columns that represent the "start" and the "end" of the range I should use to calculate each mean.
Let's take this example
dframe <- data.frame(a=c("2","3","4", "2"), b=c("1","3","6", "2"), c=c("4","5","6", "3"), d=c("4","2","8", "5"), e=c("a", "c", "a", "b"), f=c("c", "d", "d", "c"))
dframe
Which provides the following dataframe:
a b c d e f
1 2 1 4 4 a c
2 3 3 5 2 c d
3 4 6 6 8 a d
4 2 2 3 5 b c
The columns e and f represent the first and last column I use to calculate the mean for each row.
For example, in row 1, the mean would be calculated from columns a, b, and c ((2+1+4)/3 -> 2.3)
So I would like to obtain the following output:
a b c d e f mean
1 2 1 4 4 a c 2.3
2 3 3 5 2 c d 3.5
3 4 6 6 8 a d 6
4 2 2 3 5 b c 2.5
I learnt how to create the indices, and I then want to use rowMeans, but I cannot find the correct arguments.
dframe %>%
mutate(e_indice = match(e, colnames(dframe)))%>%
mutate(f_indice = match(f, colnames(dframe)))%>%
mutate(mean = rowMeans(????, na.rm = TRUE))
Thanks a lot for your help
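One dplyr option could be the following (it assumes columns a to d are numeric; in dplyr 1.1.0 and later, cur_data() is deprecated in favour of pick()):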
dframe %>%
rowwise() %>%
mutate(mean = rowMeans(cur_data()[match(e, names(.)):match(f, names(.))]))
a b c d e f mean
<dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl>
1 2 1 4 4 a c 2.33
2 3 3 5 2 c d 3.5
3 4 6 6 8 a d 6
4 2 2 3 5 b c 2.5
I would define a helper function that lets you slice the indices you want
from a matrix.
rowSlice <- function(x, start, stop) {
replace(x, col(x) < start | col(x) > stop, NA)
}
rowSlice(matrix(1, 4, 4), c(1, 3, 1, 2), c(3, 4, 4, 3))
#> [,1] [,2] [,3] [,4]
#> [1,] 1 1 1 NA
#> [2,] NA NA 1 1
#> [3,] 1 1 1 1
#> [4,] NA 1 1 NA
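The helper works because R recycles start and stop down the columns of col(x), so each row is compared with its own start and stop. A minimal sketch of that recycling, with a made-up matrix m and vector start (both are assumptions for illustration):

```r
m <- matrix(0, nrow = 3, ncol = 4)
start <- c(1, 2, 4)          # one start column per row

# col(m) holds each cell's column number; start is recycled column by
# column, so row i is always compared with start[i]
mask <- col(m) >= start
mask
```

This only lines up when start and stop have one entry per row of the matrix (or a single value).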
Then use across() to select the relevant columns, slice them,
and take the rowMeans().
library(dplyr)
dframe <- data.frame(
a = c(2, 3, 4, 2),
b = c(1, 3, 6, 2),
c = c(4, 5, 6, 3),
d = c(4, 2, 8, 5),
e = c("a", "c", "a", "b"),
f = c("c", "d", "d", "c")
)
dframe %>%
mutate(ei = match(e, colnames(dframe))) %>%
mutate(fi = match(f, colnames(dframe))) %>%
mutate(
mean = across(a:d) %>%
rowSlice(ei, fi) %>%
rowMeans(na.rm = TRUE)
)
#> a b c d e f ei fi mean
#> 1 2 1 4 4 a c 1 3 2.333333
#> 2 3 3 5 2 c d 3 4 3.500000
#> 3 4 6 6 8 a d 1 4 6.000000
#> 4 2 2 3 5 b c 2 3 2.500000
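A base R solution. First, convert columns a to d to numeric. Then create a list of the column indices over which to take each row's mean. Finally, apply the mean to the selected columns of each row.
dframe[1:4] <- lapply(dframe[1:4], as.numeric)
# SIMPLIFY = FALSE keeps s a list even when every range has the same length
s <- mapply(seq, match(dframe$e, colnames(dframe)), match(dframe$f, colnames(dframe)), SIMPLIFY = FALSE)
dframe$mean <- sapply(seq(nrow(dframe)), function(x) rowMeans(dframe[x, s[[x]], drop = FALSE]))
a b c d e f mean
1 2 1 4 4 a c 2.333333
2 3 3 5 2 c d 3.500000
3 4 6 6 8 a d 6.000000
4 2 2 3 5 b c 2.500000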
A base R approach using apply
dframe$mean <- apply(dframe, 1, function(x)
mean(as.numeric(x[which(names(x) == x["e"]) : which(names(x) == x["f"])])))
dframe
a b c d e f mean
1 2 1 4 4 a c 2.333333
2 3 3 5 2 c d 3.500000
3 4 6 6 8 a d 6.000000
4 2 2 3 5 b c 2.500000
I would like to create a new variable X that returns a certain value (e.g., -4) based on the value of name. For each multiple of 9 starting from 1 (i.e., 1, 10, 19, and so on), I would like X to be -4, etc. This is the original data:
# A tibble: 5 x 2
ID name
<chr> <dbl>
1 A 1
2 B 5
3 C 10
4 D 19
5 E 25
And this is the expected output:
# A tibble: 5 x 3
ID name X
<chr> <dbl> <dbl>
1 A 1 -4
2 B 5 NA
3 C 10 -4
4 D 19 -4
5 E 25 NA
While the following works, I was wondering if there is a more efficient piece of code I could use, since I have to do this up to values of 81.
df%>%
mutate(X = case_when(
name == 1 | name == 10 | name == 19 ~ -4
))
dput code
structure(list(ID = c("A", "B", "C", "D", "E"),
name = c(1, 5, 10, 19, 25)), row.names = c(NA, -5L), class = c("tbl_df",
"tbl", "data.frame"))
Yes, you can do this with R's modulo operator %%. This allows you to calculate the remainder after division. So "each multiple of 9 starting from 1" is equivalent to "zero remainder when a number minus 1 is divided by 9."
We can test this with numbers 1 to 27:
((1:27)-1) %% 9
> 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8
To add this to a dplyr pipeline, we can write:
df %>%
mutate(X = case_when((name - 1) %% 9 == 0 ~ -4))
Or alternatively:
df %>%
mutate(X = ifelse((name - 1) %% 9 == 0, -4, NA))
I prefer the second: ifelse makes it more obvious to readers of the code that there are only two possible outcomes.
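As a quick check, the modulo test can be compared against an explicit list of target values. This is a sketch using the name values from the question; targets, X_mod, and X_in are illustrative names, and the upper bound of 81 is taken from the question.

```r
name <- c(1, 5, 10, 19, 25)

# Explicit targets up to 81: 1, 10, 19, ..., 73
targets <- seq(1, 81, by = 9)

# Modulo version from above, and an equivalent lookup with %in%
X_mod <- ifelse((name - 1) %% 9 == 0, -4, NA)
X_in  <- ifelse(name %in% targets, -4, NA)

identical(X_mod, X_in)
```

The two differ only outside the 1 to 81 range: the modulo version would also flag values such as 82, while the %in% version stops at the last target.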
I have the data frame df1 below. (Edited so that the columns have different numbers of repeated values.)
> dput(df1)
structure(list(...1 = c("a", "b", "c", "d", "e"), x = c(5, 10,
20, 20, 25), y = c(2, 6, 6, 6, 10), z = c(6, 2, 1, 8, 1)), row.names = c(NA,
-5L), class = c("tbl_df", "tbl", "data.frame"))
>df1
x y z
a 5 2 6
b 10 6 2
c 20 6 1
d 20 6 8
e 25 10 1
I would like to get a df2 which only has the unique values from each column 'x','y' and 'z'.
I tried:
df2<-apply(df1,2, unique)
df2 <- do.call(cbind, df2)
df2 <- as.data.frame(df2)
Desired output:
>df2
 x  y z
 5  2 6
10  6 2
20 10 1
25    8
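Tibbles can't have row names, so the row names were stored in a new column (...1) in your data. You can delete that first column and then use unique on all columns. Note that this only works when every column has the same number of unique values; the output below appears to come from a version of the data in which that was the case.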
library(dplyr)
df1$...1 <- NULL
df1 %>% summarise(across(.fns = unique))
# x y z
# <dbl> <dbl> <dbl>
#1 5 2 6
#2 10 6 2
#3 20 8 1
#4 25 10 8
Or in base R :
df2 <- data.frame(sapply(df1, unique))
When the columns have different numbers of unique values, you could pad the shorter ones with NA:
tmp <- lapply(df1, unique)
data.frame(sapply(tmp, `[`, 1:max(lengths(tmp))))
# x y z
#1 5 2 6
#2 10 6 2
#3 20 10 1
#4 25 NA 8
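The padding comes from how `[` treats out-of-range positions: indexing past the end of a vector returns NA. A minimal sketch with a made-up list of unique values (tmp, n, and padded here are illustrative):

```r
tmp <- list(x = c(5, 10, 20, 25), y = c(2, 6, 10))

# Length of the longest vector
n <- max(lengths(tmp))

# Positions beyond a vector's length come back as NA
padded <- lapply(tmp, `[`, seq_len(n))
padded$y
```

Once every element has the same length, the list can be turned into a data frame with data.frame(padded).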
I have the following frame:
df <- structure(list(returns = list(c(1,2,3,4,5,6), c(7,8,9,10,11,12)), indexId = c("a", "b")), class = "data.frame", row.names = 1:2)
Is there an easy way to convert this into a normal data.frame so it appears as:
Choice ppl
1 a
2 a
3 a
4 a
5 a
6 a
7 b
8 b
9 b
10 b
11 b
12 b
I have a solution using a for loop, but I am looking for something simpler.
All help is much appreciated!
df <- structure(list(returns = list(c(1,2,3,4,5,6), c(7,8,9,10,11,12)),
indexId = c("a", "b")), class = "data.frame", row.names = 1:2)
library(tidyverse)
df %>% unnest(returns) # in current tidyr, list columns are expanded with unnest()
# returns indexId
# 1 1 a
# 2 2 a
# 3 3 a
# 4 4 a
# 5 5 a
# 6 6 a
# 7 7 b
# 8 8 b
# 9 9 b
# 10 10 b
# 11 11 b
# 12 12 b
Or, in base R:
data.frame(choice = unlist(df$returns), ppl = rep(df$indexId, lengths(df$returns)))
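The base R version also works when the list elements have different lengths, because rep() repeats each id once per element of its vector. A small sketch with made-up data (two returns for a, three for b):

```r
df <- data.frame(indexId = c("a", "b"), stringsAsFactors = FALSE)
df$returns <- list(c(1, 2), c(3, 4, 5))   # list column with uneven lengths

# lengths() gives one count per list element: 2 and 3
long <- data.frame(choice = unlist(df$returns),
                   ppl = rep(df$indexId, times = lengths(df$returns)),
                   stringsAsFactors = FALSE)
long
```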