use assign / create new object with value (dplyr) - r

I want to create a new variable that holds the position (column number) of a column in my data frame, looked up by its name.
I have managed to do this:
firstCol <- which(colnames(Mydf) == "Cars")
It finds the position of the column named "Cars" and assigns that number to the object firstCol. This works fine in base R.
Lately I've been using dplyr and pipes, and I'm trying to do the same thing in a pipeline (%>%), but I can't work out how to write this line with pipes.
Can you help me?
thanks,
Ido

The dplyr way to do this is select.
Here is an example using some made up data:
library(dplyr)
df <- data.frame(cars = sample(LETTERS, 100, replace = TRUE),
                 mpg = runif(100, 15, 45),
                 color = sample(c("green", "red", "blue", "silver"),
                                100, replace = TRUE)) %>% as_tibble()
df %>% select(cars)
# A tibble: 100 x 1
cars
<chr>
1 R
2 V
3 I
4 Q
5 P
6 D
7 J
8 Q
9 R
10 A
# ... with 90 more rows
You can also remove columns with select(-col_name):
df %>% select(-mpg)
# A tibble: 100 x 2
cars color
<chr> <chr>
1 R blue
2 V silver
3 I red
4 Q green
5 P silver
6 D silver
7 J green
8 Q blue
9 R red
10 A silver
# ... with 90 more rows
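If the column position itself is still needed inside a pipeline, the base R line from the question can be piped directly. The following is a minimal sketch, assuming a small made-up Mydf; the data and the name firstCol_base are illustrative and not from the question:

```r
library(dplyr)

# Hypothetical data frame standing in for the asker's Mydf
Mydf <- data.frame(Id = 1:3,
                   Cars = c("a", "b", "c"),
                   Mpg = c(20, 25, 30))

# Base R version from the question
firstCol_base <- which(colnames(Mydf) == "Cars")

# Piped version: pass the column names along, then look up the name.
# The dot marks where the piped value goes in match().
firstCol <- Mydf %>%
  colnames() %>%
  match("Cars", .)

firstCol
```

In most dplyr code the position is not needed at all, because select(), mutate() and friends accept the column name directly.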

Related

Finding the mode using dplyr

I have some code written using the dplyr package, and I want to calculate the mode. Currently I get results back with a column that says "character" all the way down. The mode should be the most frequently occurring value, which in my case could be a letter, a number or a symbol.
eth.data<-data.comb %>%
group_by(Ethnicity, `Qualification Title`, `Qualification Number`, `OutGrade`)%>%
summarise(`Number of Learners`=n(), `Mode` = mode(`OutGrade`)) %>%
group_by(`Qualification Number`)%>%
mutate(`Total Number of Learners`= sum(`Number of Learners`)) %>%
arrange(`Total Number of Learners`)
Take a look at ?mode. mode tells you the storage mode of an object (e.g. "character" for character vectors). If you want the statistical mode, write your own function, see this question.
Also, if you group_by OutGrade, then you will have precisely 1 unique OutGrade in the summarise function, so don't do that.
Let us set up an example (which you should do when you are asking a question!).
df <- data.frame(group=rep(LETTERS[1:5], each=20),
grade=sample(letters[1:15], 100, replace=T))
mymode <- function(x) {
t <- table(x)
names(t)[ which.max(t) ]
}
df %>% group_by(group) %>% summarise(mode=mymode(grade))
The result is what you want:
# A tibble: 5 x 2
group mode
<chr> <chr>
1 A l
2 B f
3 C g
4 D g
5 E c
Note that if you did group_by(group, grade), the summarise function would be called for each combination of group and grade, so the results would have been very different:
# A tibble: 55 x 3
# Groups: group [5]
group grade mode
<chr> <chr> <chr>
1 A a a
2 A b b
3 A f f
4 A h h
5 A i i
6 A k k
7 A l l
8 A m m
9 A n n
10 B a a
# … with 45 more rows
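The difference between base R's mode() and a statistical mode can be checked on a small vector. This is a minimal sketch; the vector x and the name stat_mode are made up for illustration, and the helper mirrors the mymode function above:

```r
# Small made-up vector in which "b" is the most frequent value
x <- c("b", "a", "b", "c", "b", "a")

# Base R mode() reports how the object is stored, not its most common value
storage <- mode(x)

# Statistical mode: tabulate the values and take the name of the largest count
stat_mode <- function(x) {
  counts <- table(x)
  names(counts)[which.max(counts)]
}

most_common <- stat_mode(x)

storage      # "character"
most_common  # "b"
```

Two caveats of this approach: the result is always a character string, even for numeric input, and when several values tie, which.max() returns only the first one in the table's sorted order.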

How to calculate overlap between different categories in R

I have read around the forum but I have not found my desired answer.
I have the following dataset:
Dataset
The important columns are TGEClass and peptide.
I would like to calculate the overlap between the different TGE classes.
I used calculate.overlap(TGE) from VennDiagram, but that does not give me the desired result.
The R code with a dummy dataset:
library(VennDiagram)
# A simple single-set diagram
C1 <- as.data.frame(letters[1:10])
C2 <- as.data.frame(letters[1:10])
data <- cbind(C1, C2)
overlap <- calculate.overlap(data)
overlap <- as.data.frame(overlap)
The result:
a1 a2 a3
1 a a a
2 b b b
3 c c c
4 d d d
5 e e e
6 f f f
The desired result will look like this:
TGEClass
Desired Result
10 genes are expressed in both TGE classes
50 genes in only alternative
60 genes in only short
It is basically a Venn diagram, but in table format.
Please note that each gene has a different number of TGE class categories.
I am very new to R, so any help will be greatly appreciated.
Thanks very much,
Ishack
The output of VennDiagram::calculate.overlap() is not very convenient for later use (as.data.frame only worked here by luck, because both vectors happen to be the same size).
You can use the tidyverse to compute the overlap yourself and return the summary:
library(tidyverse)
list(
"Cardiome" = letters[1:10],
"SuperSet" = letters[8:24]
) %>%
map2_dfr(., names(.), ~tibble::enframe(.x) %>% mutate(group=.y)) %>%
add_count(value) %>%
group_by(value) %>%
summarise(group2 = ifelse(n()==2, "both", group)) %>%
count(group2)
#> # A tibble: 3 x 2
#> group2 n
#> <chr> <int>
#> 1 both 3
#> 2 Cardiome 7
#> 3 SuperSet 14
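The same three counts can be cross-checked with base R set operations. This is a minimal sketch reusing the two example vectors from the answer above; the object names and the group labels are made up for illustration:

```r
# Same two example sets as in the tidyverse answer
cardiome <- letters[1:10]
superset <- letters[8:24]

# intersect() gives the shared items, setdiff() the items unique to one side
overlap_summary <- data.frame(
  group = c("both", "Cardiome only", "SuperSet only"),
  n = c(length(intersect(cardiome, superset)),
        length(setdiff(cardiome, superset)),
        length(setdiff(superset, cardiome)))
)

overlap_summary
```

This only covers two sets; with more classes the number of intersections grows quickly, and the long-format tidyverse approach above scales better.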
If you want to stick with the output of VennDiagram::calculate.overlap(), you can use something like:
library(tidyverse)
overlap <- VennDiagram::calculate.overlap(
x = list(
"Cardiome" = letters[1:10],
"SuperSet" = letters[8:24]
)
)
map2_dfr(overlap, names(overlap), ~tibble::enframe(.x) %>% mutate(group=.y)) %>%
spread(group, group) %>%
mutate(a1_only = !is.na(a1) & is.na(a2),
a2_only = !is.na(a2) & is.na(a1),
both = !is.na(a2) & !is.na(a1)) %>%
summarise_at(c("a1_only", "a2_only", "both"), sum) %>%
gather(group, number, everything())
#> # A tibble: 3 x 2
#> group number
#> <chr> <int>
#> 1 a1_only 10
#> 2 a2_only 17
#> 3 both 0

Make multiple random number of copies of rows in a dataframe

I have a data frame in R with 100 rows of unique first name, last name and address. I also have columns for weather 1 and weather 2. I want to make a random number of copies, between 50 and 100, of each row. How would I do that?
df$fname df$lname df$street df$town df$state df$weather1 df$weather2
Using iris and base R:
#example data
iris2 <- iris[1:100, ]
#replicate rows at random
iris2[rep(1:100, times = sample(50:100, 100, replace = TRUE)), ]
Each row of iris2 will be replicated a random number of times between 50 and 100.
This is probably not the easiest way to do this, but...
What I've done here is, for each row of the data set, select just that row, make 1-3 copies of it (substitute 50:100 for 1:3), and finally stack all the results together.
library(dplyr)
library(purrr)
df <- tibble(foo = 1:3, bar = letters[1:3])
map_dfr(seq_len(nrow(df)), ~{
df %>%
slice(.x) %>%
sample_n(size = sample(1:3, 1), replace = TRUE)
})
#> # A tibble: 7 x 2
#> foo bar
#> <int> <chr>
#> 1 1 a
#> 2 1 a
#> 3 1 a
#> 4 2 b
#> 5 2 b
#> 6 3 c
#> 7 3 c
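Another option is tidyr::uncount(), which repeats each row according to a weight column. The following is a minimal sketch, assuming a small made-up table; the names people, n_copies and copies are illustrative and not from the question:

```r
library(dplyr)
library(tidyr)

set.seed(123)

# Hypothetical stand-in for the asker's data: one row per person
people <- tibble(id = 1:5,
                 fname = c("Ann", "Bob", "Cy", "Dee", "Eve"))

copies <- people %>%
  # draw one random copy count between 50 and 100 for every row
  mutate(n_copies = sample(50:100, n(), replace = TRUE)) %>%
  # repeat each row n_copies times; the weight column is dropped by default
  uncount(n_copies)

# number of copies made of each original row
table(copies$id)
```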

Minimum value matching across values in multiple columns

I would like to return a dataframe with the minimum value of column one based on the values of columns 2-4:
df <- data.frame(one = rnorm(1000),
two = sample(letters, 1000, replace = T),
three = sample(letters, 1000, replace = T),
four = sample(letters, 1000, replace = T))
I can do:
df_group <- df %>%
group_by(two) %>%
filter(one == min(one))
This gets me the lowest value of column one among all the "m" rows in column two, but what if a row with "m" in column three or four has a lower value in column one?
The output should look like this:
one two
1 -0.311609752 r
2 0.053166742 n
3 1.546485810 a
4 -0.430308725 d
5 -0.145428664 c
6 0.419181639 u
7 0.008881661 i
8 1.223517580 t
9 0.797273157 b
10 0.790565358 v
11 -0.560031797 e
12 -1.546234090 q
13 -1.847945540 l
14 -1.489130228 z
15 -1.203255034 g
16 0.146969892 m
17 -0.552363433 f
18 -0.006234646 w
19 0.982932856 s
20 0.751936728 o
21 0.220751258 h
22 -1.557436228 y
23 -2.034885868 k
24 -0.463354387 j
25 -0.351448850 p
26 1.331365941 x
I don't care which column has the lowest value for a given letter, I just need the lowest value and the letter column.
I'm trying to wrap my head around writing this simplistically. This might be a duplicate, but I didn't know how to word the title and couldn't find any material or previous questions on how to do it.
Another solution, based on data.table:
library(data.table)
setDT(df)
melt(df,
measure=grep("one",names(df),invert = TRUE,value=TRUE))[
,min(one),value]
You can do something like this:
library(dplyr); library(tidyr)
df %>% gather(cols, letts, -one) %>% # gather all letters into one column
group_by(letts) %>%
summarise(one = min(one)) # do a group by summary for each letter
# A tibble: 26 × 2
# letts one
# <chr> <dbl>
#1 a -2.092327
#2 b -2.461102
#3 c -3.055858
#4 d -2.092327
#5 e -2.461102
#6 f -2.249439
#7 g -1.941632
#8 h -2.543310
#9 i -3.055858
#10 j -1.896974
# ... with 16 more rows
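In current tidyr, gather() is superseded by pivot_longer(), so the same idea can be written as follows. This is a minimal sketch under that assumption; the seed and the names col, letter and result are made up for illustration:

```r
library(dplyr)
library(tidyr)

set.seed(42)
df <- data.frame(one   = rnorm(1000),
                 two   = sample(letters, 1000, replace = TRUE),
                 three = sample(letters, 1000, replace = TRUE),
                 four  = sample(letters, 1000, replace = TRUE))

result <- df %>%
  # stack the three letter columns into one long "letter" column
  pivot_longer(cols = -one, names_to = "col", values_to = "letter") %>%
  # the minimum of column one for each letter, wherever the letter appeared
  group_by(letter) %>%
  summarise(one = min(one), .groups = "drop")

head(result)
```

Each original row contributes three rows to the long table, so a value of column one is considered for every letter that appears next to it in columns two to four.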

Getting a summary data frame for all the combinations of categories represented in two columns

I am working with a data frame corresponding to the example below:
set.seed(1)
dta <- data.frame("CatA" = rep(c("A","B","C"), 4), "CatNum" = rep(1:2,6),
"SomeVal" = runif(12))
I would like to quickly build a data frame that would have sum values for all the combinations of the categories derived from the CatA and CatNum as well as for the categories derived from each column separately. On the primitive example above, for the first couple of combinations, this can be achieved with use of simple code:
df_sums <- data.frame(
"Category" = c("Total for A",
"Total for A and 1",
"Total for A and 2"),
"Sum" = c(sum(dta$SomeVal[dta$CatA == 'A']),
sum(dta$SomeVal[dta$CatA == 'A' & dta$CatNum == 1]),
sum(dta$SomeVal[dta$CatA == 'A' & dta$CatNum == 2]))
)
This produces an informative data frame of sums:
Category Sum
1 Total for A 2.1801780
2 Total for A and 1 1.2101839
3 Total for A and 2 0.9699941
This solution would be grossly inefficient when applied to a data frame with multiple categories. I would like to achieve the following:
Cycle through all the categories, including categories derived from each column separately as well as from both columns at the same time.
Achieve some flexibility with respect to how the function is applied; for instance, I may want to apply mean instead of sum.
Save the "Total for" string as a separate object that I could easily edit when applying a function other than sum.
I was initially thinking of using dplyr, along these lines:
require(dplyr)
df_sums_experiment <- dta %>%
group_by(CatA, CatNum) %>%
summarise(TotVal = sum(SomeVal))
But it's not clear to me how I could apply multiple groupings simultaneously. As stated, I'm interested in grouping by each column separately and by the combination of both columns. I would also like to create a string column that would indicate what is combined and in what order.
You could use tidyr to unite the columns and gather the data. Then use dplyr to summarise:
library(dplyr)
library(tidyr)
dta %>% unite(measurevar, CatA, CatNum, remove=FALSE) %>%
gather(key, val, -SomeVal) %>%
group_by(val) %>%
summarise(sum(SomeVal))
val sum(SomeVal)
(chr) (dbl)
1 1 2.8198078
2 2 3.0778622
3 A 2.1801780
4 A_1 1.2101839
5 A_2 0.9699941
6 B 1.4405782
7 B_1 0.4076565
8 B_2 1.0329217
9 C 2.2769138
10 C_1 1.2019674
11 C_2 1.0749464
Just loop over the column combinations, compute the quantities you want and then rbind them together:
library(data.table)
dt = as.data.table(dta) # or setDT to convert in place
cols = c('CatA', 'CatNum')
rbindlist(apply(combn(c(cols, ""), length(cols)), 2,
function(i) dt[, sum(SomeVal), by = c(i[i != ""])]), fill = T)
# CatA CatNum V1
# 1: A 1 1.2101839
# 2: B 2 1.0329217
# 3: C 1 1.2019674
# 4: A 2 0.9699941
# 5: B 1 0.4076565
# 6: C 2 1.0749464
# 7: A NA 2.1801780
# 8: B NA 1.4405782
# 9: C NA 2.2769138
#10: NA 1 2.8198078
#11: NA 2 3.0778622
Split, then use lapply:
#result
res <- do.call(rbind,
lapply(
c(split(dta,dta$CatA),
split(dta,dta$CatNum),
split(dta,dta[,1:2])),
function(i)sum(i[,"SomeVal"])))
#prettify the result
res1 <- data.frame(Category=paste0("Total for ",rownames(res)),
Sum=res[,1])
res1$Category <- sub("."," and ",res1$Category,fixed=TRUE)
row.names(res1) <- seq_along(row.names(res1))
res1
# Category Sum
# 1 Total for A 2.1801780
# 2 Total for B 1.4405782
# 3 Total for C 2.2769138
# 4 Total for 1 2.8198078
# 5 Total for 2 3.0778622
# 6 Total for A and 1 1.2101839
# 7 Total for B and 1 0.4076565
# 8 Total for C and 1 1.2019674
# 9 Total for A and 2 0.9699941
# 10 Total for B and 2 1.0329217
# 11 Total for C and 2 1.0749464
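The question also asks for a swappable summary function and an editable label. One way to get both with dplyr is to wrap the three groupings in a small helper. This is a minimal sketch, not taken from any of the answers above; the helper summarise_combos and its arguments fun and label are hypothetical names:

```r
library(dplyr)

set.seed(1)
dta <- data.frame(CatA = rep(c("A", "B", "C"), 4),
                  CatNum = rep(1:2, 6),
                  SomeVal = runif(12))

# Hypothetical helper: summarise SomeVal by CatA, by CatNum and by both,
# with the summary function and the label prefix passed in as arguments
summarise_combos <- function(data, fun = sum, label = "Total for") {
  summarise_by <- function(d) {
    d %>%
      group_by(Category) %>%
      summarise(Value = fun(SomeVal), .groups = "drop")
  }
  bind_rows(
    data %>% mutate(Category = as.character(CatA)) %>% summarise_by(),
    data %>% mutate(Category = as.character(CatNum)) %>% summarise_by(),
    data %>% mutate(Category = paste(CatA, "and", CatNum)) %>% summarise_by()
  ) %>%
    mutate(Category = paste(label, Category))
}

sums  <- summarise_combos(dta)
means <- summarise_combos(dta, fun = mean, label = "Mean for")

sums
```

The helper is hard-wired to the two columns of this example; generalising it to an arbitrary set of grouping columns would need something like the combn() loop in the data.table answer.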
