Counting and separating values of a dataframe - r

Good evening, dear community. I have the following query. I have a data frame of genes with more than 100 thousand rows, where the total column takes values between 213 and 221 (i.e. there are rows with total equal to 213, 214, 215, 216, 217, 218, 219, 220 and 221).
mix total
1 A2M-ACTB 221
2 A2M-ACTG1 221
3 A2M-ANXA1 221
4 A2M-APP 221
5 A2M-B2M 221
6 A2M-CD24 221
7 A2M-CD74 221
8 A2M-COL1A2 221
9 A2M-COL3A1 221
10 A2M-DSP 221
11 A2M-EEF1A1 221
12 A2M-ENO1 221
13 A2M-FN1 221
14 A2M-GAPDH 221
15 A2M-HLA-B 221
16 A2M-HSP90AB1 221
17 A2M-MGP 221
18 A2M-RPL13A 221
19 A2M-RPS6 221
20 A2M-TM4SF1 221
What I would like first is to count the number of gene rows with total 221, then 220, and so on down to 213. Then I would like to separate each of those groups into its own object.
I hope I have explained clearly, thank you very much for your suggestions.
Kind regards
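A minimal sketch of one way to do both steps, assuming the data frame is called df with the mix and total columns shown above: table() gives the row counts per value of total, and split() separates the rows into one data frame per value.
# number of rows for each value of total (213, 214, ..., 221)
counts <- table(df$total)
counts
# one data frame per value of total, e.g. groups[["221"]], groups[["213"]], ...
groups <- split(df, df$total)
# optionally turn each group into its own object in the workspace
# (the n213 ... n221 names are just illustrative)
list2env(setNames(groups, paste0("n", names(groups))), envir = .GlobalEnv)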

Related

Loop through and mutate multiple columns using an incrementing variable

My goal is to perform multiple column operations in one line of code without hard coding the variable names.
structure(list(Subject = 1:6, Congruent_1 = c(359, 391, 384,
316, 287, 403), Congruent_2 = c(361, 378, 322, 286, 276, 363),
Congruent_3 = c(342, 355, 334, 274, 297, 335), Congruent_4 = c(365,
503, 324, 256, 266, 388), Congruent_5 = c(335, 354, 320,
272, 260, 337), Incongruent_1 = c(336, 390, 402, 305, 310,
400), Incongruent_2 = c(366, 407, 386, 280, 243, 393), Incongruent_3 = c(323,
455, 317, 308, 259, 325), Incongruent_4 = c(361, 392, 357,
274, 342, 350), Incongruent_5 = c(300, 366, 378, 263, 258,
349)), row.names = c(NA, 6L), class = "data.frame")
My data looks like this.
I need to do column subtraction and save those new values into new columns. For example, a new column named selhistB_1 should be computed as Incongruent_1 - Congruent_1. I tried to write a for loop that indexes the existing columns by their names and creates new columns using the same indexing variable:
for(i in 1:5)(
DP4 = mutate(DP4, as.name(paste("selhistB_",i,sep="")) = as.name(paste("Incongruent_",i,sep="")) - as.name(paste("Congruent_",i,sep="")))
)
but I received this error:
Error: unexpected '=' in: "for(i in 1:5)( DP4 = mutate(DP4, as.name(paste("selhistB_",i,sep="")) ="
I'd rather use this modular approach than hard-code it and write out "selhistB_1 = Incongruent_1 - Congruent_1" five times inside mutate().
I also wonder if I could achieve the same goal on the long version of this data; maybe that would make more sense.
library(dplyr)
library(tidyr)
d %>%
  pivot_longer(-Subject,
               names_to = c(".value", "id"),
               names_sep = "_") %>%
  mutate(selhistB = Incongruent - Congruent) %>%
  pivot_wider(names_from = id, values_from = c(Congruent, Incongruent, selhistB))
Or just skip the last pivot, and keep everything long.
As long as you are already using tidyverse packages, the following code will do exactly what you need:
library(dplyr)
for (i in 1:5) {
  DP4 <- DP4 %>%
    mutate(UQ(sym(paste0("selhistB_", i))) :=
             UQ(sym(paste0("Incongruent_", i))) - UQ(sym(paste0("Congruent_", i))))
}
DP4
Subject Congruent_1 Congruent_2 Congruent_3 Congruent_4 Congruent_5
1 1 359 361 342 365 335
2 2 391 378 355 503 354
3 3 384 322 334 324 320
4 4 316 286 274 256 272
5 5 287 276 297 266 260
6 6 403 363 335 388 337
Incongruent_1 Incongruent_2 Incongruent_3 Incongruent_4 Incongruent_5
1 336 366 323 361 300
2 390 407 455 392 366
3 402 386 317 357 378
4 305 280 308 274 263
5 310 243 259 342 258
6 400 393 325 350 349
selhistB_1 selhistB_2 selhistB_3 selhistB_4 selhistB_5
1        -23          5        -19         -4        -35
2         -1         29        100       -111         12
3         18         64        -17         33         58
4        -11         -6         34         18         -9
5         23        -33        -38         76         -2
6         -3         30        -10        -38         12
You can use split.default to split on the column-name suffix, then loop over the list and subtract the Congruent column from the Incongruent column, i.e.
lapply(split.default(df[-1], sub('.*_', '', names(df[-1]))), function(i) i[2] - i[1])
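To attach those differences back onto the data frame as selhistB_* columns, a small follow-up sketch (using [[ so each list element is a plain numeric vector):
diffs <- lapply(split.default(df[-1], sub('.*_', '', names(df[-1]))),
                function(i) i[[2]] - i[[1]])       # Incongruent_k - Congruent_k per suffix k
df[paste0("selhistB_", names(diffs))] <- diffs     # add selhistB_1 ... selhistB_5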
Subtracting over all matching columns and then using cbind, try:
x <- df1[, grepl("^I", colnames(df1)) ] - df1[, grepl("^C", colnames(df1)) ]
names(x) <- paste0("selhistB_", seq_along(names(x)))
res <- cbind(df1, x)
res
Subject Congruent_1 Congruent_2 Congruent_3 Congruent_4 Congruent_5
1 1 359 361 342 365 335
2 2 391 378 355 503 354
3 3 384 322 334 324 320
4 4 316 286 274 256 272
5 5 287 276 297 266 260
6 6 403 363 335 388 337
Incongruent_1 Incongruent_2 Incongruent_3 Incongruent_4 Incongruent_5
1 336 366 323 361 300
2 390 407 455 392 366
3 402 386 317 357 378
4 305 280 308 274 263
5 310 243 259 342 258
6 400 393 325 350 349
selhistB_1 selhistB_2 selhistB_3 selhistB_4 selhistB_5
1        -23          5        -19         -4        -35
2         -1         29        100       -111         12
3         18         64        -17         33         58
4        -11         -6         34         18         -9
5         23        -33        -38         76         -2
6         -3         30        -10        -38         12
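For completeness, a plain base-R sketch that avoids both the loop and non-standard evaluation, assuming the data frame is called DP4 as in the question: the two paired blocks of columns can be subtracted in one step.
# select the two blocks by name and subtract them elementwise
inc <- DP4[paste0("Incongruent_", 1:5)]
con <- DP4[paste0("Congruent_", 1:5)]
DP4[paste0("selhistB_", 1:5)] <- inc - con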

How to detect duplicates of single values across all rows and columns in an R data.frame

I have a large data set consisting of a header row and a series of values in each column. I want to detect the presence and number of duplicates of these values within the whole data set.
1 2 3 4 5 6 7
734 456 346 545 874 734 455
734 783 482 545 456 948 483
So, for example, it would detect 734 three times, 456 twice, etc.
I've tried using the duplicated function in R, but this seems to only work on whole rows or whole columns. Using
duplicated(df)
doesn't pick up any duplicates, though I know there are two duplicates in the first row.
So I'm asking how to detect duplicates both within and between columns/rows.
Cheers
You can use table() and data.frame() to see the occurrence of each value:
data.frame(table(v))
such that
v Freq
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 6 1
7 7 1
8 346 1
9 455 1
10 456 2
11 482 1
12 483 1
13 545 2
14 734 3
15 783 1
16 874 1
17 948 1
DATA
v <- c(1, 2, 3, 4, 5, 6, 7, 734, 456, 346, 545, 874, 734, 455, 734,
783, 482, 545, 456, 948, 483)
You can transform it to a vector and then use table() as follows:
library(data.table)
library(dplyr)
df <- fread("734 456 346 545 874 734 455
734 783 482 545 456 948 483")
df %>% unlist() %>% table()
# 346 455 456 482 483 545 734 783 874 948
# 1 1 2 1 1 2 3 1 1 1
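If only the values that occur more than once are of interest, the same table can be filtered afterwards (a small follow-up sketch on the df built above):
tab <- table(unlist(df))   # counts over every cell of the data frame
tab[tab > 1]               # keep only the duplicated values
# 456 545 734
#   2   2   3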

Rank() in R excluding zeros

I am trying to duplicate "manually" the example in this Wikipedia article using R.
Here is the data:
after = c(125, 115, 130, 140, 140, 115, 140, 125, 140, 135)
before = c(110, 122, 125, 120, 140, 124, 123, 137, 135, 145)
sgn = sign(after-before)
abs = abs(after - before)
d = data.frame(after,before,sgn,abs)
after before sgn abs
1 125 110 1 15
2 115 122 -1 7
3 130 125 1 5
4 140 120 1 20
5 140 140 0 0
6 115 124 -1 9
7 140 123 1 17
8 125 137 -1 12
9 140 135 1 5
10 135 145 -1 10
If I try to rank the rows based on the abs column, the 0 entry is naturally ranked as 1:
rank = rank(abs)
(d = data.frame(after,before,sgn,abs,rank))
after before sgn abs rank
1 125 110 1 15 8.0
2 115 122 -1 7 4.0
3 130 125 1 5 2.5
4 140 120 1 20 10.0
5 140 140 0 0 1.0
6 115 124 -1 9 5.0
7 140 123 1 17 9.0
8 125 137 -1 12 7.0
9 140 135 1 5 2.5
10 135 145 -1 10 6.0
However, zeros are ignored in the Wilcoxon signed-rank test.
How can I get R to ignore that row, so as to end up with:
after before sgn abs rank
1 125 110 1 15 7.0
2 115 122 -1 7 3.0
3 130 125 1 5 1.5
4 140 120 1 20 9.0
5 140 140 0 0 0
6 115 124 -1 9 4.0
7 140 123 1 17 8.0
8 125 137 -1 12 6.0
9 140 135 1 5 1.5
10 135 145 -1 10 5.0
SOLUTION (accepted answer below):
after = c(125, 115, 130, 140, 140, 115, 140, 125, 140, 135)
before = c(110, 122, 125, 120, 140, 124, 123, 137, 135, 145)
sgn = sign(after-before)
abs = abs(after - before)
d = data.frame(after,before,sgn,abs)
d$rank = rank(replace(abs, abs == 0, NA), na.last = 'keep')
d$multi = d$sgn * d$rank
(W=abs(sum(d$multi, na.rm = T)))
9
From the Wikipedia article:
Exclude pairs with |x2,i − x1,i| = 0. Let Nr be the reduced sample size.
We need to exclude zeroes. By my thinking, you should replace zeroes with NA, and then tell rank() to exclude NAs from consideration when ranking. Since you need to return a vector of the same length as the input, you can pass na.last = 'keep' so that the NAs stay in place in the result:
d$rank <- rank(replace(abs, abs == 0, NA), na.last = 'keep');
d;
## after before sgn abs rank
## 1 125 110 1 15 7.0
## 2 115 122 -1 7 3.0
## 3 130 125 1 5 1.5
## 4 140 120 1 20 9.0
## 5 140 140 0 0 NA
## 6 115 124 -1 9 4.0
## 7 140 123 1 17 8.0
## 8 125 137 -1 12 6.0
## 9 140 135 1 5 1.5
## 10 135 145 -1 10 5.0
The subtraction-based solutions will not work if the input vector contains no zeroes or more than one zero.
You could create the new column and then just update the rank where the abs value isn't 0
d$rank <- 0 # default value for rows with abs=0
d$rank[d$abs!=0] <- rank(d$abs[d$abs!=0])
If you wanted to drop the row completely, you could just do
transform(subset(d, abs!=0), rank=rank(abs))
A quick way to do it would be to rank as normal and then do:
d$rank <- ifelse(d$rank == 1, 0, d$rank - 1)
This switches all ranks of 1 to 0, and reduces any other ranks by 1.
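Putting the pieces together, a short end-to-end sketch that drops the zero difference entirely and reproduces the statistic computed in the question's solution (W = 9):
d0 <- subset(d, abs != 0)          # exclude pairs with zero difference (Nr = 9 here)
d0$rank <- rank(d0$abs)            # rank the remaining absolute differences
W <- abs(sum(d0$sgn * d0$rank))    # Wilcoxon signed-rank statistic
W
# [1] 9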

Spline Interpolation in R

I have some data with
h = c[157 144 80 106 124 46 207 188 190 208 143 170 162 178 155 163 162 149 135 160 149 147 133 146 126 120 151 74 122 145 160 155 173 126 172 93]
Then I get the local maxima with the diff function:
max = [1 5 7 10 12 14 16 20 24 27 31 33 35 36]
I have simple MATLAB code for the spline interpolation:
maxenv = spline(max,h(max),1:N);
This code will show result
maxenv = 157 86.5643152828762 67.5352696350679 84.9885891697257 124 169.645228239041 207 224.396380746179 223.191793341491 208 185.421032390413 170 172.173624690130 178 172.759468849065 163 158.147870987344 157.874968589889 159.414581897490 160 157.622863516083 153.308219179638 148.839465253375 146 146.051320982064 148.167322961480 151 153.474200222188 155.606188003845 157.685081783579 160 163.653263154551 173 186.027639098700 172 93
Now I'm trying the same in R but get an error:
maxenv <- spline(max, h(max), seq(N))
Then error
Error in xy.coords(x, y) : 'x' and 'y' lengths differ
Your code and question are really not clear, so I'm not 100% sure this is what you are looking for:
# vector of values
h <- c(157, 144, 80, 106, 124, 46, 207, 188, 190, 208, 143, 170, 162, 178, 155,
163, 162, 149, 135, 160, 149, 147, 133, 146, 126, 120, 151, 74, 122, 145,
160, 155, 173, 126, 172, 93)
# local maxima
lmax <- h[c(1, which(diff(sign(diff(h)))==-2)+1, length(h))]
# spline calculation
spl <- spline(1:length(lmax), lmax)
# visual inspection
plot(spl)
lines(spl)
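If the goal is the same output as the MATLAB call (the envelope evaluated at every sample 1:N), the interpolation nodes and the evaluation grid can be passed explicitly; a sketch along those lines, assuming h is the vector defined above. Note h[idx] rather than h(max): square brackets index a vector in R.
idx <- c(1, which(diff(sign(diff(h))) == -2) + 1, length(h))  # positions of local maxima plus the endpoints
maxenv <- spline(idx, h[idx], xout = seq_along(h))$y          # upper envelope evaluated at 1..N
plot(h, type = "l")
lines(seq_along(h), maxenv, col = "red")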

How can I draw a boxplot of an R table?

I have a table produced by calling table(...) on a column of data, and I get a table that looks like:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
346 351 341 333 345 415 421 425 429 437 436 469 379 424 387 419 392 396 381 421
I'd like to draw a boxplot of these frequencies, but calling boxplot on the table results in an error:
Error in Axis.table(x = c(333, 368.5, 409.5, 427, 469), side = 2) :
only for 1-D table
I've tried coercing the table to an array with as.array but it seems to make no difference. What am I doing wrong?
If I understand you correctly, boxplot(c(tab)) or boxplot(as.vector(tab)) should work (credit to #joran as well).
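A small reproducible sketch of the same idea (the data below are simulated with sample(), not the question's counts):
set.seed(1)
x   <- sample(0:19, 8000, replace = TRUE)   # hypothetical raw data
tab <- table(x)                             # 1-D frequency table, as in the question
boxplot(as.vector(tab), ylab = "frequency per value")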
