How to plot Insertions and Deletions - r

I'm trying to plot indel lengths from a file created by vcftools with the option "--hist-indel-len". With that file, I want to plot the insertions and deletions: if the length is negative it is a deletion, and if it is positive it is an insertion. The COUNT column should be on the y-axis from 0 to the maximum value, and the x-axis should run from the minimum length (-15 in this case) to the maximum length (15 in this case).
The data looks like:
LENGTH COUNT
1 -15 117
2 -14 178
3 -13 198
4 -12 414
5 -11 314
6 -10 451
7 -9 547
8 -8 1114
9 -7 1214
10 -6 2371
11 -5 3822
12 -4 9229
13 -3 17333
14 -2 20373
15 -1 19774
16 0 202129
17 1 22259
18 2 10101
19 3 4940
20 4 2458
21 5 1343
22 6 987
23 7 535
24 8 427
25 9 317
26 10 307
27 11 161
28 12 270
29 13 116
30 14 121
31 15 95
With this data.frame I'm trying to get a plot like:
My attempt was using:
z <- read.csv("/home/userx/out.indel.hist", sep = "\t")
zz <- table(z)
barplot(zz, main="Insertion and Deletions",
xlab="Length", ylab="Count", col=c("darkblue","red"),
legend = rownames(zz), beside=TRUE)
Result:
Any help would be appreciated.

A relatively easy solution using ggplot2 and the provided data is to create a grouping variable to colour by, then plot with geom_col:
library(tidyverse)
# create a grouping variable
dat2 %>%
  mutate(fill = ifelse(LENGTH < 0, "minus", "plus")) -> dat2
ggplot(dat2) +
  geom_col(aes(x = LENGTH, y = COUNT, fill = fill))
The data:
dat2 <- structure(list(LENGTH = -15:15, COUNT = c(117L, 178L, 198L, 414L,
314L, 451L, 547L, 1114L, 1214L, 2371L, 3822L, 9229L, 17333L,
20373L, 19774L, 202129L, 22259L, 10101L, 4940L, 2458L, 1343L,
987L, 535L, 427L, 317L, 307L, 161L, 270L, 116L, 121L, 95L)), .Names = c("LENGTH",
"COUNT"), class = "data.frame", row.names = c(NA, -31L))
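For completeness, the original barplot() attempt can also be fixed in base R: the problem is that table(z) re-tabulates the data frame instead of using the COUNT column directly. A minimal self-contained sketch (the colour choices are arbitrary):

```r
# the indel data from the question
dat <- structure(list(LENGTH = -15:15, COUNT = c(117L, 178L, 198L, 414L,
  314L, 451L, 547L, 1114L, 1214L, 2371L, 3822L, 9229L, 17333L,
  20373L, 19774L, 202129L, 22259L, 10101L, 4940L, 2458L, 1343L,
  987L, 535L, 427L, 317L, 307L, 161L, 270L, 116L, 121L, 95L)),
  class = "data.frame", row.names = c(NA, -31L))

# plot COUNT against LENGTH, colouring deletions red and insertions blue
barplot(dat$COUNT, names.arg = dat$LENGTH,
        col = ifelse(dat$LENGTH < 0, "red", "darkblue"),
        main = "Insertions and Deletions", xlab = "Length", ylab = "Count")
```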

Related

Creating differences in a new column for certain dates in R

I have a data frame that looks like this:
Date Value1 Value2 Value3
1997Q1 100 130 120
1997Q1 100 130 124
1997Q1 120 136 154
1997Q2 180 145 154
1997Q2 186 134 126
1997Q2 186 124 176
1997Q3 190 143 176
1997Q3 192 143 123
I would like to calculate the differences for each value column within the same date, for example the differences in the Value1 column for 1997Q1, then 1997Q2, and so on.
I would like these differences to be shown in new columns, so that the result would look something like this:
Date Value1 Value2 Value3 DiffVal1 DiffVal2 DiffVal3
1997Q1 100 130 120 0 0 4
1997Q1 100 130 124 20 6 30
1997Q1 120 136 154 N/A N/A N/A
1997Q2 180 145 154 6 -11 -28
1997Q2 186 134 126 0 10 50
1997Q2 186 124 176 N/A N/A N/A
1997Q3 190 143 176 2 0 -53
1997Q3 192 143 123
You can use dplyr functions for this. The formula ~ .x - lead(.x) is applied to every value column, selected with starts_with: we take the current value minus the next value. If you need it the other way round, use ~ lag(.x) - .x.
library(dplyr)
df1 %>%
  group_by(Date) %>%
  mutate(across(starts_with("Value"), ~ .x - lead(.x), .names = "diff_{.col}"))
If the values are numeric and the column names are not easily matched by a prefix, you can use mutate(across(where(is.numeric), ~ .x - lead(.x), .names = "diff_{.col}")) instead.
# A tibble: 8 × 7
# Groups: Date [3]
Date Value1 Value2 Value3 diff_Value1 diff_Value2 diff_Value3
<chr> <int> <int> <int> <int> <int> <int>
1 1997Q1 100 130 120 0 0 -4
2 1997Q1 100 130 124 -20 -6 -30
3 1997Q1 120 136 154 NA NA NA
4 1997Q2 180 145 154 -6 11 28
5 1997Q2 186 134 126 0 10 -50
6 1997Q2 186 124 176 NA NA NA
7 1997Q3 190 143 176 -2 0 53
8 1997Q3 192 143 123 NA NA NA
data:
df1 <- structure(list(Date = c("1997Q1", "1997Q1", "1997Q1", "1997Q2",
"1997Q2", "1997Q2", "1997Q3", "1997Q3"), Value1 = c(100L, 100L,
120L, 180L, 186L, 186L, 190L, 192L), Value2 = c(130L, 130L, 136L,
145L, 134L, 124L, 143L, 143L), Value3 = c(120L, 124L, 154L, 154L,
126L, 176L, 176L, 123L)), class = "data.frame", row.names = c(NA,
-8L))
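The lag() variant mentioned in the answer can be sketched the same way; this assumes df1 from the data block above, and computes previous-minus-current within each Date:

```r
library(dplyr)
df1 %>%
  group_by(Date) %>%
  mutate(across(starts_with("Value"), ~ lag(.x) - .x, .names = "diff_{.col}"))
```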

Adding groups to rows in a dataframe in R

I want to transform my data from this:
current data.frame
to this: desired data.frame
I have no clue how to start, any help is welcome!
Thanks in advance,
Mitch
One solution uses melt() from the reshape package:
library(reshape)
Data:
df<-structure(list(age_group = c("<20", ">70", "20-29", "30-39",
"40-49", "50-59", "60-69"), no = c(19L, 1L, 447L, 196L, 92L,
55L, 24L), yes = c(21L, 1L, 664L, 371L, 204L, 137L, 63L), total = c(2L,
0L, 217L, 175L, 112L, 82L, 39L)), class = "data.frame", row.names = c(NA,
-7L))
Code:
long <- melt(df)
long[order(long$age_group), ]
age_group variable value
1 <20 no 19
8 <20 yes 21
15 <20 total 2
2 >70 no 1
9 >70 yes 1
16 >70 total 0
3 20-29 no 447
10 20-29 yes 664
17 20-29 total 217
4 30-39 no 196
11 30-39 yes 371
18 30-39 total 175
5 40-49 no 92
12 40-49 yes 204
19 40-49 total 112
6 50-59 no 55
13 50-59 yes 137
20 50-59 total 82
7 60-69 no 24
14 60-69 yes 63
21 60-69 total 39
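A tidyr equivalent, for reference; pivot_longer() is the modern replacement for melt(), and this sketch reuses the df from the Data block above:

```r
library(tidyr)
library(dplyr)

df %>%
  pivot_longer(cols = c(no, yes, total),
               names_to = "variable", values_to = "value") %>%
  arrange(age_group)
```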

Comparing between dataframes to find numbers +/- 1 of the observation in the same position in both dataframes

I have two dataframes, one at time point 1 and one at time point 2. Each dataframe has a column "midpoint", and I want to compare the midpoint in dataframe 2 (time point 2) to the midpoint in dataframe 1 (time point 1), such that if it is within +/- 1, a unique id is assigned in a column called "id" for each comparison that returns TRUE. If it is FALSE, the id should be blank or 0. I've been playing around with the ifelse() function with little success so far. I've been trying to create a function so that the dataframe at time point n is compared to the previous time point (n-1).
For some context: I will eventually use the purrr package to loop this over every time point (about 130 in total).
Ignore the minimum and maximum columns; they are relevant for something different. I appreciate any help!
Dataframe 1 (time point 1)
structure(list(Object = c(2666L, 2668L, 2671L, 2674L, 2676L,
2677L, 2678L, 2679L, 2680L, 2682L, 2683L, 2684L, 2685L, 2686L,
2687L, 2689L, 2692L, 2693L, 2694L, 2695L, 2696L), minimum = c(4L,
39L, 147L, 224L, 419L, 531L, 595L, 641L, 669L, 723L, 810L, 836L,
907L, 978L, 1061L, 1129L, 1290L, 1519L, 1749L, 1843L, 1897L),
maximum = c(22L, 85L, 173L, 242L, 449L, 587L, 627L, 655L,
702L, 740L, 828L, 890L, 923L, 1024L, 1086L, 1144L, 1302L,
1544L, 1780L, 1870L, 1925L), midpoint = c(13, 62, 160, 233,
434, 559, 611, 648, 685.5, 731.5, 819, 863, 915, 1001, 1073.5,
1136.5, 1296, 1531.5, 1764.5, 1856.5, 1911)), row.names = c(NA,
-21L), class = c("tbl_df", "tbl", "data.frame"))
Dataframe 2 (time point 2)
structure(list(Object = c(2645L, 2646L, 2650L, 2652L, 2655L,
2656L, 2657L, 2658L, 2659L, 2661L, 2662L, 2663L, 2664L, 2665L,
2667L, 2670L, 2675L, 2681L, 2688L, 2690L, 2691L), minimum = c(4L,
40L, 147L, 224L, 415L, 532L, 595L, 641L, 670L, 722L, 811L, 835L,
907L, 978L, 1061L, 1128L, 1289L, 1520L, 1748L, 1843L, 1897L),
maximum = c(22L, 85L, 173L, 242L, 445L, 588L, 627L, 655L,
702L, 739L, 828L, 891L, 923L, 1022L, 1085L, 1143L, 1302L,
1544L, 1779L, 1870L, 1925L), midpoint = c(13, 62.5, 160,
233, 430, 560, 611, 648, 686, 730.5, 819.5, 863, 915, 1000,
1073, 1135.5, 1295.5, 1532, 1763.5, 1856.5, 1911)), row.names = c(NA,
-21L), class = c("tbl_df", "tbl", "data.frame"))
Expected output:
object minimum maximum midpoint id
2645 4 22 13 1
2646 40 85 62.5 2
2650 147 173 260 3
So the output is an additional column in dataframe 2, with a unique id for each instance where the midpoint of an observation in df2 is within +/- 1 of the corresponding observation in df1. I want to compare to the (n-1)th dataframe because it represents the previous time point.
You can subset df2 on those rows in which df2$midpoint is within the desired range of df1$midpoint, store that subset as a new dataframe, and add an id column to it:
df2new <- df2[df2$midpoint >= df1$midpoint - 1 & df2$midpoint <= df1$midpoint + 1, ]
df2new$id <- 1:nrow(df2new)
df2new
# A tibble: 20 x 5
Object minimum maximum midpoint id
<int> <int> <int> <dbl> <int>
1 2645 4 22 13 1
2 2646 40 85 62.5 2
3 2650 147 173 160 3
4 2652 224 242 233 4
5 2656 532 588 560 5
6 2657 595 627 611 6
7 2658 641 655 648 7
8 2659 670 702 686 8
9 2661 722 739 730. 9
10 2662 811 828 820. 10
11 2663 835 891 863 11
12 2664 907 923 915 12
13 2665 978 1022 1000 13
14 2667 1061 1085 1073 14
15 2670 1128 1143 1136. 15
16 2675 1289 1302 1296. 16
17 2681 1520 1544 1532 17
18 2688 1748 1779 1764. 18
19 2690 1843 1870 1856. 19
20 2691 1897 1925 1911 20
Alternatively, if you wanted to keep df2 as it is but flag those rows that fall into the desired range with 1 and those that don't with 0, you could do this:
df2$id <- ifelse(df2$midpoint >= df1$midpoint - 1 & df2$midpoint <= df1$midpoint + 1, 1, 0)
df2
# A tibble: 21 x 5
Object minimum maximum midpoint id
<int> <int> <int> <dbl> <dbl>
1 2645 4 22 13 1
2 2646 40 85 62.5 1
3 2650 147 173 160 1
4 2652 224 242 233 1
5 2655 415 445 430 0
6 2656 532 588 560 1
7 2657 595 627 611 1
8 2658 641 655 648 1
9 2659 670 702 686 1
10 2661 722 739 730. 1
# … with 11 more rows
... and if you wanted a continuous range of id values that still marks the rows outside the range with 0 (rather than repeating the previous id), then use cumsum on id:
df2$id2 <- cumsum(df2$id)
df2$id2[df2$id < 1] <- 0  # reset the out-of-range rows back to 0
to obtain this:
df2
# A tibble: 21 x 6
Object minimum maximum midpoint id id2
<int> <int> <int> <dbl> <dbl> <dbl>
1 2645 4 22 13 1 1
2 2646 40 85 62.5 1 2
3 2650 147 173 160 1 3
4 2652 224 242 233 1 4
5 2655 415 445 430 0 0
6 2656 532 588 560 1 5
7 2657 595 627 611 1 6
8 2658 641 655 648 1 7
9 2659 670 702 686 1 8
10 2661 722 739 730. 1 9
# … with 11 more rows
Since you want to check the points one by one, this is probably what you are looking for:
d <- df2$midpoint - df1$midpoint
i1 <- cumsum(abs(d) <= 1)  # running id over the in-range rows
i1[abs(d) > 1] <- 0        # zero out the rows outside +/- 1
i1
#[1] 1 2 3 4 0 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
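Because the condition is symmetric around zero, abs() keeps it readable. A sketch combining the flagging and numbering steps, assuming df1 and df2 are the two tibbles from the question:

```r
in_range <- abs(df2$midpoint - df1$midpoint) <= 1  # TRUE when within +/- 1
df2$id <- ifelse(in_range, cumsum(in_range), 0)    # running id, 0 outside the range
```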

How can I delete "a lot" of rows from a dataframe in r

I tried all the similar posts but none of the answers seemed to work for me. I want to delete 8500+ rows (by rowname only) from a dataframe with 27,000+ rows. The other columns are completely different, but the smaller dataset was derived from the larger one, and just looking for names shows me that whatever I look for in the smaller df is present in the larger df. I could of course do this manually (busy work for sure!), but it seems like there should be a simple computational answer.
I have tried:
fordel<-df2[1,]
df3<-df1[!rownames(df1) %in% fordel
l1<- as.vector(df2[1,])
df3<- df1[1-c(l1),]
and lots of other crazy ideas!
Here is a smallish example: df1:
Ent_gene_id clone57_RNA clone43_RNA_2 clone67_RNA clone55_RNA
ENSMUSG00000000001.4 10634 6954 6835 6510
ENSMUSG00000000003.15 0 0 0 0
ENSMUSG00000000028.14 559 1570 807 1171
ENSMUSG00000000031.15 5748 174 4103 146
ENSMUSG00000000037.16 37 194 49 96
ENSMUSG00000000049.11 0 3 1 0
ENSMUSG00000000056.7 1157 1125 806 947
ENSMUSG00000000058.6 75 304 123 169
ENSMUSG00000000078.6 4012 4391 5637 3854
ENSMUSG00000000085.16 381 560 482 368
ENSMUSG00000000088.6 2667 4777 3483 3450
ENSMUSG00000000093.6 3 48 41 22
ENSMUSG00000000094.12 23 201 102 192
df2
structure(list(base_mean = c(7962.408875, 947.1240794, 43.76698418),
    log2foldchange = c(-0.363434063, -0.137403759, -0.236463207),
    lfcSE = c(0.096816743, 0.059823215, 0.404929452),
    stat = c(-3.753834854, -2.296830066, -0.583961493)),
    row.names = c("ENSMUSG00000000001.4", "ENSMUSG00000000056.7",
    "ENSMUSG00000000093.6"), class = "data.frame")
I want to delete from df1 the rows corresponding to the rownames in df2.
Tried to format it, but seems no longer formatted... oh well....
Suggestions really appreciated!
You mentioned row names, but your data does not include them, so I'll assume they really don't matter (or exist). Also, your df2 has more column headers than columns; not sure what's going on there, so I'll ignore it.
Data
df1 <- structure(list(Ent_gene_id = c("ENSMUSG00000000001.4", "ENSMUSG00000000003.15",
"ENSMUSG00000000028.14", "ENSMUSG00000000031.15", "ENSMUSG00000000037.16",
"ENSMUSG00000000049.11", "ENSMUSG00000000056.7", "ENSMUSG00000000058.6",
"ENSMUSG00000000078.6", "ENSMUSG00000000085.16", "ENSMUSG00000000088.6",
"ENSMUSG00000000093.6", "ENSMUSG00000000094.12"), clone57_RNA = c(10634L,
0L, 559L, 5748L, 37L, 0L, 1157L, 75L, 4012L, 381L, 2667L, 3L,
23L), clone43_RNA_2 = c(6954L, 0L, 1570L, 174L, 194L, 3L, 1125L,
304L, 4391L, 560L, 4777L, 48L, 201L), clone67_RNA = c(6835L,
0L, 807L, 4103L, 49L, 1L, 806L, 123L, 5637L, 482L, 3483L, 41L,
102L), clone55_RNA = c(6510L, 0L, 1171L, 146L, 96L, 0L, 947L,
169L, 3854L, 368L, 3450L, 22L, 192L)), class = "data.frame", row.names = c(NA,
-13L))
df2 <- structure(list(Ent_gene_id = c("ENSMUSG00000000001.4", "ENSMUSG00000000056.7",
"ENSMUSG00000000093.6"), base_mean = c(7962.408875, 947.1240794,
43.76698418), log2foldchange = c(-0.36343406, -0.137403759, -0.236463207
), pvalue = c(0.00017415, 0.021628466, 0.55924622)), class = "data.frame", row.names = c(NA,
-3L))
Base
df1[!df1$Ent_gene_id %in% df2$Ent_gene_id,]
# Ent_gene_id clone57_RNA clone43_RNA_2 clone67_RNA clone55_RNA
# 2 ENSMUSG00000000003.15 0 0 0 0
# 3 ENSMUSG00000000028.14 559 1570 807 1171
# 4 ENSMUSG00000000031.15 5748 174 4103 146
# 5 ENSMUSG00000000037.16 37 194 49 96
# 6 ENSMUSG00000000049.11 0 3 1 0
# 8 ENSMUSG00000000058.6 75 304 123 169
# 9 ENSMUSG00000000078.6 4012 4391 5637 3854
# 10 ENSMUSG00000000085.16 381 560 482 368
# 11 ENSMUSG00000000088.6 2667 4777 3483 3450
# 13 ENSMUSG00000000094.12 23 201 102 192
dplyr
dplyr::anti_join(df1, df2, by = "Ent_gene_id")
# Ent_gene_id clone57_RNA clone43_RNA_2 clone67_RNA clone55_RNA
# 1 ENSMUSG00000000003.15 0 0 0 0
# 2 ENSMUSG00000000028.14 559 1570 807 1171
# 3 ENSMUSG00000000031.15 5748 174 4103 146
# 4 ENSMUSG00000000037.16 37 194 49 96
# 5 ENSMUSG00000000049.11 0 3 1 0
# 6 ENSMUSG00000000058.6 75 304 123 169
# 7 ENSMUSG00000000078.6 4012 4391 5637 3854
# 8 ENSMUSG00000000085.16 381 560 482 368
# 9 ENSMUSG00000000088.6 2667 4777 3483 3450
# 10 ENSMUSG00000000094.12 23 201 102 192
Edit: same thing but with row names:
# update my df1 to change Ent_gene_id from a column to rownames
rownames(df1) <- df1$Ent_gene_id
df1$Ent_gene_id <- NULL
# use your updated df2 (from dput)
# df2 <- structure(...)
df1[ !rownames(df1) %in% rownames(df2), ]
# clone57_RNA clone43_RNA_2 clone67_RNA clone55_RNA
# ENSMUSG00000000003.15 0 0 0 0
# ENSMUSG00000000028.14 559 1570 807 1171
# ENSMUSG00000000031.15 5748 174 4103 146
# ENSMUSG00000000037.16 37 194 49 96
# ENSMUSG00000000049.11 0 3 1 0
# ENSMUSG00000000058.6 75 304 123 169
# ENSMUSG00000000078.6 4012 4391 5637 3854
# ENSMUSG00000000085.16 381 560 482 368
# ENSMUSG00000000088.6 2667 4777 3483 3450
# ENSMUSG00000000094.12 23 201 102 192
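The same filter also reads naturally with subset(); a sketch using the df1 and df2 objects from the Data section above:

```r
# keep the rows of df1 whose gene id does not appear in df2
subset(df1, !Ent_gene_id %in% df2$Ent_gene_id)
```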

Change data set from wide to long while retaining group id, and also gathering columns [duplicate]

This question already has answers here:
Reshaping multiple sets of measurement columns (wide format) into single columns (long format)
(8 answers)
Closed 5 years ago.
I'd really appreciate some help getting this messy set of new survey data into a usable form. It was collected in a strange way and now I've got strange data to work with. I've looked through tidyr and used those approaches to no end. I suspect my problem is that I'm thinking about this dataset all wrong and I'm blind to some real answer. But given all the things I need to do to this df, I cant figure out where to start and thus where to start googling.
What I need:
For each person to be their own row
Each person retains their GroupID and Treated value
For the variables currently attached to each person individually to become columns (age, weight, height)
Fake (and much smaller):
structure(list(GroupID = 1:5, Treated = c("Y", "Y", "N", "Y",
"N"), person1_age = c(45L, 33L, 71L, 19L, 52L), person1_weight = c(187L,
145L, 136L, 201L, 168L), person1_height = c(69L, 64L, 51L, 70L,
66L), person2_age = c(54L, 20L, 48L, 63L, 26L), person2_weight = c(140L,
122L, 186L, 160L, 232L), person2_height = c(62L, 70L, 65L, 72L,
74L), person3_age = c(21L, 56L, 40L, 59L, 67L), person3_weight = c(112L,
143L, 187L, 194L, 159L), person3_height = c(61L, 69L, 73L, 63L,
72L)), .Names = c("GroupID", "Treated", "person1_age", "person1_weight",
"person1_height", "person2_age", "person2_weight", "person2_height",
"person3_age", "person3_weight", "person3_height"), row.names = c(NA,
5L), class = "data.frame")
Any help or further readings you could point me to would be very much appreciated.
reshape can do this with the appropriate arguments. Passing varying as a list, with one component per long variable, keeps the mapping to v.names explicit (a flat varying vector is easy to get mislabelled):
> reshape(x, direction="long",
+         varying=list(c("person1_age", "person2_age", "person3_age"),
+                      c("person1_weight", "person2_weight", "person3_weight"),
+                      c("person1_height", "person2_height", "person3_height")),
+         timevar="person", v.names=c("age", "weight", "height"))
    GroupID Treated person age weight height id
1.1       1       Y      1  45    187     69  1
2.1       2       Y      1  33    145     64  2
3.1       3       N      1  71    136     51  3
4.1       4       Y      1  19    201     70  4
5.1       5       N      1  52    168     66  5
1.2       1       Y      2  54    140     62  1
2.2       2       Y      2  20    122     70  2
3.2       3       N      2  48    186     65  3
4.2       4       Y      2  63    160     72  4
5.2       5       N      2  26    232     74  5
1.3       1       Y      3  21    112     61  1
2.3       2       Y      3  56    143     69  2
3.3       3       N      3  40    187     73  3
4.3       4       Y      3  59    194     63  4
5.3       5       N      3  67    159     72  5
Because each component of the varying list is matched positionally to v.names, this does not depend on the column order of the original data.
We can also use melt from data.table, which can take multiple patterns in the measure argument:
library(data.table)
melt(setDT(x), measure = patterns("age$", "weight$", "height$"),
variable.name = "person", value.name = c("age", "weight", "height"))
# GroupID Treated person age weight height
# 1: 1 Y 1 45 187 69
# 2: 2 Y 1 33 145 64
# 3: 3 N 1 71 136 51
# 4: 4 Y 1 19 201 70
# 5: 5 N 1 52 168 66
# 6: 1 Y 2 54 140 62
# 7: 2 Y 2 20 122 70
# 8: 3 N 2 48 186 65
# 9: 4 Y 2 63 160 72
#10: 5 N 2 26 232 74
#11: 1 Y 3 21 112 61
#12: 2 Y 3 56 143 69
#13: 3 N 3 40 187 73
#14: 4 Y 3 59 194 63
#15: 5 N 3 67 159 72
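tidyr can express the same reshape declaratively: the special ".value" sentinel in names_to tells pivot_longer() that the part of the column name after the separator becomes its own column. A sketch, assuming x is the wide data frame from the question:

```r
library(tidyr)
pivot_longer(x, cols = starts_with("person"),
             names_prefix = "person",          # strip the "person" prefix
             names_to = c("person", ".value"), # "1_age" -> person = "1", column "age"
             names_sep = "_")
```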
