Reorder, exclude a column and keep others in R? - r

Here is my toy dataframe:
structure(list(a = c(1, 2), b = c(3, 4), c = c(5, 6), d = c(7,
8)), .Names = c("a", "b", "c", "d"), row.names = c(NA, -2L), class = c("tbl_df",
"tbl", "data.frame"))
Now I want to reorder the columns, exclude one of them, and keep the others:
df %>% select(-a, d, everything())
I want my df to be :
d b c
7 3 5
8 4 6
I get the following:
b c d a
<dbl> <dbl> <dbl> <dbl>
1 3 5 7 1
2 4 6 8 2

Keep the -a last in the select() call. Even though a is removed at the beginning, the everything() at the end still matches against the column names of the whole dataset, so a is added back.
df%>%
select(d, everything(), -a)
# A tibble: 2 x 3
# d b c
# <dbl> <dbl> <dbl>
#1 7 3 5
#2 8 4 6
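For comparison, the same reordering can be sketched in base R, where the column order is built explicitly, so there is nothing like everything() that could re-add a. This is only an illustration on the toy data (stored here as a plain data.frame); the name keep is made up for the example.

```r
# Toy data from the question, as a plain data.frame
df <- data.frame(a = c(1, 2), b = c(3, 4), c = c(5, 6), d = c(7, 8))

# Put d first, then every remaining column except a and d
keep <- c("d", setdiff(names(df), c("a", "d")))
result <- df[keep]
result
```

Because the excluded column never enters keep, the order of the steps no longer matters.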

Related

R Join without duplicates

Currently, when joining two datasets (of different years), I get duplicates of the second one when it has fewer observations than the first.
Below, ID 1 only has 1 observation in year y, but it gets repeated because the first dataset of year x has three observations in total. I don't want the duplicates, but simply NAs.
So what I currently get is this:
ID Value.x N.x Value.y N.y
<dbl> <chr> <dbl> <chr> <dbl>
1 1 A 6 A 2
2 1 B 7 A 2
3 1 C 1 A 2
What I want is:
ID Value.x N.x Value.y N.y
<dbl> <chr> <dbl> <chr> <dbl>
1 1 A 6 A 2
2 1 B 7 NA NA
3 1 C 1 NA NA
The end result is that my manager can tell in year x customer 1 ordered A, B, C in n.x quantities. In year y they only ordered A in n.y quantities.
Data:
structure(list(ID = c(1, 1, 1), Value = c("A", "B", "C"), N = c(6,
7, 1)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-3L))
structure(list(ID = 1, Value = "A", N = 2), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -1L))
I would do it like this:
merge(tbl_df1, tbl_df2, by = c("ID", "Value"), all = TRUE)
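A small sketch of what that merge() call returns on the sample data, assuming the two tables are stored as tbl_df1 and tbl_df2 (the question does not name them, and plain data frames are used here instead of tibbles). Because Value is a join key it appears only once in the result, so the Value.y column at the end is an optional extra step to mirror the layout asked for.

```r
# Sample data from the question (year x and year y)
tbl_df1 <- data.frame(ID = c(1, 1, 1), Value = c("A", "B", "C"),
                      N = c(6, 7, 1), stringsAsFactors = FALSE)
tbl_df2 <- data.frame(ID = 1, Value = "A", N = 2, stringsAsFactors = FALSE)

# Full outer join on both keys: rows without a match get NA
# instead of a repeated copy of the single year-y row
joined <- merge(tbl_df1, tbl_df2, by = c("ID", "Value"), all = TRUE)

# Optional: rebuild a Value.y column that is NA wherever year y had no order
joined$Value.y <- ifelse(is.na(joined$N.y), NA_character_, joined$Value)
joined
```

Joining on ID alone is what produces the repeated rows; adding Value to the keys is what turns them into NAs.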

Transform multiple columns of the same name and different suffixes into a panel structure

I need to put multiple variables of the same name but with different suffixes in a panel structure. For example, transforming this layout:
[image not included in the source]
into this structure:
[image not included in the source]
I tried to combine the pivot_longer and pivot_wider functions from the tidyverse package, but I was not successful, as the variables are distributed between numerics, integers, characters, etc.
I appreciate any help.
Here's the reproducible example:
structure(list(class.x = c(4, 4, 4, 4, 4), class.y = c("a", "a",
"a", "a", "a"), class.x.x = structure(c(9.88131291682493e-324,
9.88131291682493e-324, 9.88131291682493e-324, 9.88131291682493e-324,
9.88131291682493e-324), class = "integer64"), var1.x = c(1, 1,
1, 1, 1), var1.y = c(0, 0, 0, 0, 0), var1.x.x = c("b", "b", "b",
"b", "b"), var2.x = c(9, 9, 9, 9, 9), var2.y = c(5, 5, 5, 5,
5), var2.x.x = c("c", "c", "c", "c", "c")), class = "data.frame", row.names = c(NA,
-5L))
df %>%
pivot_longer(everything(),
names_to = c('.value','Variable'),
names_pattern = '([^.]+)[.](.*)',
values_transform = as.character)
# A tibble: 15 x 4
Variable class var1 var2
<chr> <chr> <chr> <chr>
1 x 4 1 9
2 y a 0 5
3 x.x 0 b c
4 x 4 1 9
5 y a 0 5
6 x.x 0 b c
7 x 4 1 9
8 y a 0 5
9 x.x 0 b c
10 x 4 1 9
11 y a 0 5
12 x.x 0 b c
13 x 4 1 9
14 y a 0 5
15 x.x 0 b c
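To see what names_pattern is doing, the same regular expression can be tried on the column names with base R's sub(). This is just an illustration of the split, not part of the pipeline: the first group takes everything up to the first dot (it becomes the output column name via .value), and the second group takes the rest (it becomes Variable).

```r
# A few of the column names from the example data
nms <- c("class.x", "class.y", "class.x.x", "var1.x", "var1.y", "var1.x.x")
pattern <- "([^.]+)[.](.*)"

# Keep only the first or the second capture group
value_part  <- sub(pattern, "\\1", nms)
suffix_part <- sub(pattern, "\\2", nms)

data.frame(name = nms, value_part, suffix_part)
```

This is why class.x.x ends up with the suffix x.x rather than being split into three pieces.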
Note that the dput you provided differs from the picture you posted.
First we create names whose parts are all separated by a single dot.
Then we have to convert all columns to character: I do it with mutate(across(...)); KU99 did it more elegantly with values_transform!
Now we can apply pivot_longer with the names_sep argument.
Finally we bring the data into shape.
library(tidyverse)
df %>%
rename_with(~str_replace_all(., fixed(".x.x"), ".z")) %>% # fixed() matches the dots literally
mutate(across(everything(), as.character)) %>%
pivot_longer(
everything(),
names_to = c(".value", "var1_2"),
names_sep ="\\."
) %>%
arrange(var1_2) %>%
mutate(Variable=ifelse(var1_2 == "z", "x.x", var1_2), .keep="unused")
class var1 var2 Variable
<chr> <chr> <chr> <chr>
1 4 1 9 x
2 4 1 9 x
3 4 1 9 x
4 4 1 9 x
5 4 1 9 x
6 a 0 5 y
7 a 0 5 y
8 a 0 5 y
9 a 0 5 y
10 a 0 5 y
11 9.88131291682493e-324 b c x.x
12 9.88131291682493e-324 b c x.x
13 9.88131291682493e-324 b c x.x
14 9.88131291682493e-324 b c x.x
15 9.88131291682493e-324 b c x.x

How to run a function over all the values of a column/variable for multiple columns/variables

I'm new to R so grateful if someone could help me here, because I've now tried a lot of things myself that have been unsuccessful and I'm so frustrated!
I have a big dataset that I have manipulated into two types of dataframe layouts where the variables of interest (A, B, C...) are either unique rows or unique columns. (A, B, C...) are categorical, and their values are integers.
LAYOUT 1
A, 1, 6, 11...
B, 2, 7, 12...
C, 3, 8, 13...
D, 4, 9, 14...
E, 5, 10, 15...
LAYOUT 2
A, B, C, D, E...
1, 2, 3, 4, 5...
6, 7, 8, 9, 10...
11, 12, 13, 14, 15...
I want to run a number of math functions like mean() over each variable (A, B, C...) and record the results in a new data frame that shows the outcome of each function for each variable.
i.e.
X, mean_X, mode_X, sd_X...
A, mean(A), mode(A), sd(A)...
B, mean(B), mode(B), sd(B)...
C, mean(C), mode(C), sd(C)...
D, mean(D), mode(D), sd(D)...
E, mean(E), mode(E), sd(E)...
However, because the dataset is big, I can't do this manually by selecting each variable. I can't figure out how to do this on either of the layouts.
I'm happy with whichever layout you choose, but is there a way to do this simply, preferably using just base, dplyr, or tidyr?
Thank you in advance!
It seems to me you are looking for apply():
A=c(1, 6, 11)
B=c(2, 7, 12)
C=c(3, 8, 13)
D=c(4, 9, 14)
df<-cbind.data.frame(A,B,C,D)
vars<-names(df) # original columns only, so the new mean column is not added into the sum
df$mean<-apply(df[vars], 1, mean) # 1 is applying the function along rows, 2 along columns
df$sum<-apply(df[vars], 1, sum)
df
A B C D mean sum
1 1 2 3 4 2.5 10
2 6 7 8 9 7.5 30
3 11 12 13 14 12.5 50
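If the goal is one row per variable, as in the table sketched in the question, the functions need to run down each column rather than across each row. Here is a minimal base R sketch using sapply(), which applies a function to every column of a data frame (the name stats is just for this example). Note that base R's mode() returns the storage type, not the most frequent value, so it is left out here.

```r
# Layout 2 style data: one column per variable
df <- data.frame(A = c(1, 6, 11), B = c(2, 7, 12), C = c(3, 8, 13), D = c(4, 9, 14))

# One row per variable, one column per summary function
stats <- data.frame(
  X    = names(df),
  mean = sapply(df, mean),
  sd   = sapply(df, sd),
  sum  = sapply(df, sum),
  row.names = NULL,
  stringsAsFactors = FALSE
)
stats
```

Adding another statistic is a matter of adding one more sapply() line, so nothing has to be typed per variable.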
You can get the data in long format so that it is easier to apply multiple functions.
If you have Layout 1 like this :
layout1 <- structure(list(V1 = c("A", "B", "C", "D", "E"), V2 = 1:5, V3 = 6:10,
V4 = 11:15), class = "data.frame", row.names = c(NA, -5L))
layout1
# V1 V2 V3 V4
#1 A 1 6 11
#2 B 2 7 12
#3 C 3 8 13
#4 D 4 9 14
#5 E 5 10 15
You can do :
library(dplyr)
library(tidyr)
layout1 %>%
pivot_longer(cols = where(is.numeric)) %>%
group_by(V1) %>%
summarise(mean = mean(value),
sd = sd(value),
sum = sum(value))
# V1 mean sd sum
# <chr> <dbl> <dbl> <int>
#1 A 6 5 18
#2 B 7 5 21
#3 C 8 5 24
#4 D 9 5 27
#5 E 10 5 30
If you have data in the form of layout 2
layout2 <- structure(list(A = c(1L, 6L, 11L), B = c(2L, 7L, 12L), C = c(3L,
8L, 13L), D = c(4L, 9L, 14L), E = c(5L, 10L, 15L)),
class = "data.frame", row.names = c(NA, -3L))
layout2
# A B C D E
#1 1 2 3 4 5
#2 6 7 8 9 10
#3 11 12 13 14 15
You can apply the functions using across :
layout2 %>%
summarise(across(everything(),
list(mean = mean, sd = sd, sum = sum), .names = '{col}_{fn}')) %>%
pivot_longer(cols = everything(),
names_to = c('X', '.value'),
names_sep = '_')
# A tibble: 5 x 4
# X mean sd sum
# <chr> <dbl> <dbl> <int>
#1 A 6 5 18
#2 B 7 5 21
#3 C 8 5 24
#4 D 9 5 27
#5 E 10 5 30

Left_join fill NA entries with data values from the second dataframe

I have two fairly complicated data.frames and managed to simplify the first step of my problem here. I have a reference table and another that contains my data as follows:
REFERENCE
ref <- structure(list(group = c("A", "B", "C"), position = c("a", "a",
"b")), row.names = c(NA, -3L), class = c("tbl_df", "tbl", "data.frame"))
DATA
df <- structure(list(position = c("a", "a"), value = c(1, 1), name = c("foo",
"bar")), row.names = c(NA, -2L), class = c("tbl_df", "tbl", "data.frame"))
I used left_join(ref,df,by="position") %>% arrange(name) to obtain:
1 A a 1 foo
2 A a 1 bar
3 B a 1 foo
4 B a 1 bar
5 C b NA NA
The ideal output however is:
group position value name
<chr> <chr> <dbl> <chr>
1 A a 1 bar
2 B a 1 bar
3 C b 0 bar
4 A a 1 foo
5 B a 1 foo
6 C b 0 foo
I would like the NAs in the name column to be replaced with the names from df, and the NAs in the value column to be replaced with 0. In the real df, the name column has more values than just foo.
We could use crossing to get the combinations, then replace the 'value' column values to 0 where the 'position' columns are not equal
library(dplyr)
library(tidyr)
crossing(ref, df %>%
rename(position2 = position)) %>%
arrange(name) %>%
mutate(value = replace(value, position != position2 , 0)) %>%
select(-position2)
# A tibble: 6 x 4
# group position value name
# <chr> <chr> <dbl> <chr>
#1 A a 1 bar
#2 B a 1 bar
#3 C b 0 bar
#4 A a 1 foo
#5 B a 1 foo
#6 C b 0 foo
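The same idea can be sketched in base R, where merge() with by = NULL returns the Cartesian product of the two tables. This is an illustration only, using plain data frames and a two-row df with value 1 for both foo and bar; combos is a name made up for the example.

```r
ref <- data.frame(group = c("A", "B", "C"), position = c("a", "a", "b"),
                  stringsAsFactors = FALSE)
df <- data.frame(position = c("a", "a"), value = c(1, 1), name = c("foo", "bar"),
                 stringsAsFactors = FALSE)

# Rename so both position columns survive the cross join under distinct names
names(df)[names(df) == "position"] <- "position2"

# by = NULL: every row of ref paired with every row of df
combos <- merge(ref, df, by = NULL)

# Where the positions disagree there is no real match, so the value is 0
combos$value[combos$position != combos$position2] <- 0
combos$position2 <- NULL
combos <- combos[order(combos$name, combos$group), ]
combos
```

With more names in df, each one still gets a full set of groups, which is the behaviour a plain left join cannot give.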

How do I select column based on value in another column with dplyr?

My data frame looks like this:
id A T C G ref var
1 1 10 15 7 0 A C
2 2 11 9 2 3 A G
3 3 2 31 1 12 T C
I'd like to create two new columns: ref_count and var_count which will have following values:
Value from A column and value from C column, since ref is A and var is C
Value from A column and value from G column, since ref is A and var is G
etc.
So I'd like to select a column based on the value in another column for each row.
Thanks!
We can use pivot_longer to reshape into 'long' format, filter the rows and then reshape it to 'wide' format with pivot_wider
library(dplyr)
library(tidyr)
df1 %>%
pivot_longer(cols = A:G) %>%
group_by(id) %>%
filter(name == ref|name == var) %>%
mutate(nm1 = if_else(name == ref, 'ref_count', 'var_count')) %>% # label by match, not by row order
ungroup %>%
select(id, value, nm1) %>%
pivot_wider(names_from = nm1, values_from = value) %>%
left_join(df1, .)
# A tibble: 3 x 9
# id A T C G ref var ref_count var_count
#* <int> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
#1 1 10 15 7 0 A C 10 7
#2 2 11 9 2 3 A G 11 3
#3 3 2 31 1 12 T C 31 1
Or in base R, we can also make use of the vectorized row/column indexing
df1$ref_count <- as.matrix(df1[2:5])[cbind(seq_len(nrow(df1)), match(df1$ref, names(df1)[2:5]))]
df1$var_count <- as.matrix(df1[2:5])[cbind(seq_len(nrow(df1)), match(df1$var, names(df1)[2:5]))]
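The base R lines rely on matrix indexing with a two-column index matrix: each row of the index is a (row, column) pair, so exactly one value is picked per row. Here is a small sketch of that mechanism on the same numbers (the names m, ref_letter, var_letter, idx_ref and idx_var are only for illustration):

```r
# The four count columns as a numeric matrix
m <- cbind(A = c(10, 11, 2), T = c(15, 9, 31), C = c(7, 2, 1), G = c(0, 3, 12))
ref_letter <- c("A", "A", "T")
var_letter <- c("C", "G", "C")

# Pair each row number with the column position of that row's letter
idx_ref <- cbind(seq_len(nrow(m)), match(ref_letter, colnames(m)))
idx_var <- cbind(seq_len(nrow(m)), match(var_letter, colnames(m)))

ref_count <- m[idx_ref]
var_count <- m[idx_var]
ref_count
var_count
```

Since the lookup is vectorized, no reshaping or per-row loop is needed.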
data
df1 <- structure(list(id = 1:3, A = c(10, 11, 2), T = c(15, 9, 31),
C = c(7, 2, 1), G = c(0, 3, 12), ref = c("A", "A", "T"),
var = c("C", "G", "C")), row.names = c(NA, -3L), class = c("tbl_df",
"tbl", "data.frame"))
The following is a tidyverse alternative without creating a long dataframe that needs filtering. It essentially uses tidyr::nest() to nest the dataframe by rows, after which the correct column can be selected for each row.
library(dplyr)
library(tidyr)
library(purrr) # for map()
df1 %>%
nest(data = -id) %>%
mutate(
data = map(
data,
~mutate(., refcount = .[[ref]], var_count = .[[var]])
)
) %>%
unnest(data)
#> # A tibble: 3 × 9
#> id A T C G ref var refcount var_count
#> <int> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
#> 1 1 10 15 7 0 A C 10 7
#> 2 2 11 9 2 3 A G 11 3
#> 3 3 2 31 1 12 T C 31 1
A variant of this does not need the (assumed row-specific) id column but defines the nested groups from the unique values of ref and var directly:
df1 %>%
nest(data = -c(ref, var)) %>%
mutate(
data = pmap(
list(data, ref, var),
function(df, ref, var) {
mutate(df, refcount = df[[ref]], var_count = df[[var]])
}
)
) %>%
unnest(data)
The data were specified by akrun:
df1 <- structure(list(id = 1:3, A = c(10, 11, 2), T = c(15, 9, 31),
C = c(7, 2, 1), G = c(0, 3, 12), ref = c("A", "A", "T"),
var = c("C", "G", "C")), row.names = c(NA, -3L), class = c("tbl_df",
"tbl", "data.frame"))
