Remove Columns from a table that are 90% one value - r

Example Data:
A<- c(1,2,3,4,1,2,3,4,1,2)
B<- c("A","B","C","D","E","F","G","H","I","J")
C<- c(1,1,1,1,1,1,1,1,1,0)
D<- c(TRUE,TRUE,TRUE,TRUE,TRUE,TRUE,TRUE,TRUE,TRUE,FALSE)
df1<-data.frame(A,B,C,D)
df1 %>%
select_if(
###column is <90% one value
)
So I have a table with a few columns that are predominantly one value, like C and D in the example above. I need to drop any column in which a single value makes up 90% or more of the entries. How can I remove the columns that meet this criterion?

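We may use select with where: get the frequency count of each column with table, convert the counts to proportions, take the max, and keep the column only if that value is less than .90. Note that table drops NA values by default, so the proportions are computed over the non-missing entries.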
library(dplyr)
df1 <- df1 %>%
select(where(~ max(proportions(table(.))) < .90))
data
df1 <- structure(list(A = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2), B = c("A",
"B", "C", "D", "E", "F", "G", "H", "I", "J"), C = c(1, 1, 1,
1, 1, 1, 1, 1, 1, 0), D = c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,
TRUE, TRUE, TRUE, FALSE)), class = "data.frame", row.names = c(NA,
-10L))
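For comparison, the same filter can be sketched in base R without dplyr. This is only a minimal sketch, not part of the answer above: the names top_share, keep and df1_small are made up for illustration, and prop.table is used in place of proportions so that it also runs on older R versions.

```r
# Example data from the question
df1 <- data.frame(
  A = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2),
  B = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J"),
  C = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 0),
  D = c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Share of the most frequent value in a vector
# (hypothetical helper; table() ignores NA values by default)
top_share <- function(x) max(prop.table(table(x)))

# TRUE for columns whose most frequent value covers less than 90% of entries
keep <- vapply(df1, top_share, numeric(1)) < 0.90

# Keep only those columns; drop = FALSE preserves the data.frame class
df1_small <- df1[, keep, drop = FALSE]
print(names(df1_small))
```

With the example data, C and D each have a top share of exactly 0.9, so only A and B remain.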


Separate entries in dataframe in new rows in R [duplicate]

I have data.frame df below.
df <- data.frame(id = c(1:12),
A = c("alpha", "alpha", "beta", "beta", "gamma", "gamma", "gamma", "delta",
"epsilon", "epsilon", "zeta", "eta"),
B = c("a", "a; b", "a", "c; d; e", "e", "e", "c; f", "g", "a", "g; h", "f", "d"),
C = c(NA, 4, 2, 7, 4, NA, 9, 1, 1, NA, 3, NA),
D = c("ii", "ii", "i", "iii", "iv", "v", "viii", "v", "viii", "i", "iii", "i"))
Column 'B' contains four entries with semicolons. How can I copy each of these rows and enter in column 'B' each of the separate values?
The expected result df2 is:
df2 <- data.frame(id = c(1, 2, 2, 3, 4, 4, 4, 5, 6, 7, 7, 8, 9, 10, 10, 11, 12),
A = c(rep("alpha", 3), rep("beta", 4), rep("gamma", 4), "delta", rep("epsilon", 3),
"zeta", "eta"),
B = c("a", "a", "b", "a", "c", "d", "e", "e", "e", "c", "f", "g", "a", "g", "h", "f", "d"),
C = c(NA, 4, 4, 2, 7, 7, 7, 4, NA, 9, 9, 1, 1, NA, NA, 3, NA),
D = c("ii", "ii", "ii", "i", "iii", "iii", "iii", "iv", "v", "viii", "viii", "v", "viii", "i", "i", "iii", "i"))
I tried this, but no luck:
df2 <- df
# split the values in column B
df2$B <- unlist(strsplit(as.character(df2$B), "; "))
# repeat the rows for each value in column B
df2 <- df2[rep(seq_len(nrow(df2)), sapply(strsplit(as.character(df1$B), "; "), length)),]
# match the number of rows in column B with the number of rows in df2
df2$id <- rep(df2$id, sapply(strsplit(as.character(df1$B), "; "), length))
# sort the dataframe by id
df2 <- df2[order(df2$id),]
We may use separate_rows here. Specify sep as ; followed by zero or more whitespace characters (\\s*) to expand the rows
library(tidyr)
df_new <- separate_rows(df, B, sep = ";\\s*")
Checking against the OP's expected output:
> all.equal(df_new, df2, check.attributes = FALSE)
[1] TRUE
In base R, we may replicate the row indices by the lengths of the list returned by strsplit
lst1 <- strsplit(df$B, ";\\s*")
df_new2 <- transform(df[rep(seq_len(nrow(df)), lengths(lst1)),], B = unlist(lst1))
row.names(df_new2) <- NULL
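To make the base R steps easier to follow, here is a small worked sketch on a three-row toy frame. The object names (toy, parts, row_idx, expanded) and the toy values are made up for illustration; the technique is the same split-then-repeat idea as above.

```r
# Toy data: row 2 holds two values, row 3 holds three
toy <- data.frame(id = 1:3,
                  B = c("a", "a; b", "c; d; e"),
                  stringsAsFactors = FALSE)

# Split each entry on ";" followed by optional whitespace
parts <- strsplit(toy$B, ";\\s*")

# Repeat each row index once per piece found in that row
row_idx <- rep(seq_len(nrow(toy)), lengths(parts))

# Expand the rows, then overwrite B with the individual pieces
expanded <- toy[row_idx, ]
expanded$B <- unlist(parts)
row.names(expanded) <- NULL
print(expanded)
```

Here row_idx is 1 2 2 3 3 3, so the other columns are copied once for every value that was packed into B.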

Volcano plot for multiple clusters

I am trying to make a volcano plot for different clusters. I have 2 conditions, untreated vs. treated. I have a differential expression excel file that cellranger generated for me but within the file it has multiple clusters each which have a fold change and p value. How do I create a volcano plot that contains all the clusters rather than one? Would I have to do a volcano plot for each cluster and then combine them all somehow?
I used this code to generate the plot for just one of the clusters...
macrophage_list <- read.table("differential_expression_macrophage.csv", header = TRUE, sep = ",")
EnhancedVolcano(macrophage_list, lab = as.character(macrophage_list$FeatureName), x = 'Cluster1.Log2.Fold.Change', y = 'Cluster1.Adjusted.P.Value', xlim = c(-8,8), title = 'Macrophage', pCutoff = 10e-5, FCcutoff = 1.5, pointSize = 3.0, labSize = 3.0)
How do I merge all the information in the excel file to create a volcano plot?
I uploaded each data cluster one by one and then merged them by using rbind, but is there a simpler/quicker way to do this?
output for dput(gene_list[1:20, 1:14])
structure(list(Feature.ID = structure(1:20, .Label = c("a", "b",
"c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o",
"p", "q", "r", "s", "t"), class = "factor"), Feature.Name = structure(1:20, .Label = c("A",
"B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N",
"O", "P", "Q", "R", "S", "T"), class = "factor"), Cluster.1_Mean.Counts = c(0.000960904,
0.000320301, 0.001281205, 0.000320301, 0.000320301, 0.016335362,
0.000960904, 0, 0.001601506, 0.000320301, 0.007046627, 0.026585,
0.017296265, 0.004804518, 0, 0.874742598, 0.017616566, 0.007366928,
0.008327831, 0.001921807), Cluster.1_Log2.fold.change = c(0.291978774,
1.954943787, -2.008530337, -2.482461526, 3.539906287, 0.407455991,
-0.214981215, 1.539906287, 0.802940693, 2.539906287, -1.333136538,
-1.879953595, -0.52422405, -0.877946228, 1.539906287, -0.629373147,
1.118442519, 0.170672478, 1.065975099, 1.099333696), Cluster.1_Adjusted.p.value = c(1,
0.910243711, 0.04672812, 0.080866038, 0.610296549, 0.80063597,
1, 1, 0.951841603, 0.797013021, 0.103401275, 0.000594428, 0.907754993,
0.532689631, 1, 0.480958806, 0.078345008, 1, 0.198557945, 0.668312142
), Cluster.2_Mean.Counts = c(0.000902278, 0.001804555, 0.006315943,
0.004511388, 0, 0.029775159, 0.001804555, 0, 0.002706833, 0,
0.023459216, 0.128123411, 0.030677437, 0.009022775, 0, 2.174488883,
0.018947828, 0.019850106, 0.010827331, 0.000902278), Cluster.2_Log2.fold.change = c(0.792589781,
4.769869705, 0.35201719, 0.839132367, 3.184907204, 1.32985554,
0.962514783, 3.184907204, 1.725475586, 2.599944703, 0.560416339,
0.580736324, 0.407299626, 0.184907204, 3.184907204, 0.816580902,
1.120776867, 1.742684876, 1.409613491, 0.599944703), Cluster.2_Adjusted.p.value = c(1,
0.153573448, 1, 0.737977734, 1, 0.14478935, 0.853816767, 1, 0.47952604,
1, 0.65316285, 0.507251471, 0.776636022, 1, 1, 0.346630571, 0.285006452,
0.060868933, 0.21546202, 1), Cluster.3_Mean.Counts = c(0.001813813,
0, 0.019045032, 0.00725525, 0, 0.022672657, 0.000906906, 0, 0,
0, 0.029927908, 0.043531502, 0.046252221, 0.029021001, 0, 3.146057931,
0.020858845, 0.013603594, 0.008162157, 0), Cluster.3_Log2.fold.change = c(1.455721575,
2.192687169, 2.008262598, 1.504631175, 3.192687169, 0.9044422,
0.334706174, 3.192687169, -0.451169021, 2.607724668, 0.931421856,
-1.032594057, 1.038258504, 1.970294748, 3.192687169, 1.412371018,
1.26985503, 1.14829305, 0.991053308, -0.451169021), Cluster.3_Adjusted.p.value = c(0.757752635,
1, 0.032609935, 0.33316083, 1, 0.441825712, 1, 1, 1, 1, 0.380305075,
0.605158722, 0.339946318, 0.016952505, 1, 0.056529024, 0.259458704,
0.339639234, 0.536765022, 1), Cluster.4_Mean.Counts = c(0.000641899,
0, 0.002567596, 0.004493293, 0, 0.010270384, 0.003209495, 0,
0.000641899, 0, 0.028243557, 0.160474756, 0.012196081, 0.005135192,
0, 1.199709274, 0.005135192, 0.004493293, 0.005777091, 0.001283798
), Cluster.4_Log2.fold.change = c(0.269229783, 1.661547206, -0.886889419,
0.778904157, 2.661547206, -0.289908942, 1.602653517, 2.661547206,
0.076584705, 2.076584705, 0.854192284, 0.961549693, -0.967809414,
-0.644261223, 2.661547206, -0.104384578, -0.790579612, -0.467735811,
0.459913345, 0.722947751), Cluster.4_Adjusted.p.value = c(1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0.584036686, 1, 1, 1, 1, 1, 1,
1, 1)), class = "data.frame", row.names = c(NA, 20L))
Your dataset needs to be reshaped into long format. First, so that the column names can be split with a single pattern, make sure an underscore separates the cluster from the measurement in each name (the dots are escaped so that they match a literal dot rather than any character):
colnames(df) <- gsub("\\.Mean", "_Mean", colnames(df))
colnames(df) <- gsub("\\.Log2", "_Log2", colnames(df))
colnames(df) <- gsub("\\.Adjus", "_Adjus", colnames(df))
Now we can reshape it with the pivot_longer function from the tidyr package, using a pattern that splits each name at the underscore:
library(tidyr)
final_df <- df %>% pivot_longer(., -c(Feature.ID, Feature.Name), names_to = c("set",".value"), names_pattern = "(.+)_(.+)")
# A tibble: 80 x 6
   Feature.ID Feature.Name set       Mean.Counts Log2.fold.change Adjusted.p.value
   <fct>      <fct>        <chr>           <dbl>            <dbl>            <dbl>
 1 a          A            Cluster.1    0.000961            0.292                1
 2 a          A            Cluster.2    0.000902            0.793                1
 3 a          A            Cluster.3     0.00181             1.46            0.758
 4 a          A            Cluster.4    0.000642            0.269                1
 5 b          B            Cluster.1    0.000320             1.95            0.910
 6 b          B            Cluster.2     0.00180             4.77            0.154
 7 b          B            Cluster.3           0             2.19                1
 8 b          B            Cluster.4           0             1.66                1
 9 c          C            Cluster.1     0.00128            -2.01           0.0467
10 c          C            Cluster.2     0.00632            0.352                1
# … with 70 more rows
Now, we can create the volcano plot by using ggplot2 and ggrepel libraries for the labeling of Feature.Name (if you don't have ggrepel, you have to install it):
library(ggplot2)
library(ggrepel)
ggplot(final_df, aes(x = Log2.fold.change,y = -log10(Adjusted.p.value), label = Feature.Name))+
geom_point()+
geom_text_repel(data = subset(final_df, Adjusted.p.value < 0.05),
aes(label = Feature.Name))
And you get your volcano plot with all clusters merged, all points in the same color, and labels on every Feature.Name with an adjusted p-value < 0.05.
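If the clusters should stay distinguishable instead of sharing one color, one option is to map the cluster column to colour, or to facet by it. The following is a minimal sketch on made-up toy data (toy_wide and its values are invented for illustration, not taken from the question), assuming tidyr and ggplot2 are installed:

```r
library(tidyr)
library(ggplot2)

# Invented toy data in the same wide layout: <cluster>_<measurement>
toy_wide <- data.frame(
  Feature.Name = c("A", "B", "C"),
  Cluster.1_Log2.fold.change = c(0.3, -2.0, 1.9),
  Cluster.1_Adjusted.p.value = c(1, 0.04, 0.9),
  Cluster.2_Log2.fold.change = c(0.8, 0.4, 4.8),
  Cluster.2_Adjusted.p.value = c(1, 1, 0.15),
  stringsAsFactors = FALSE
)

# One row per feature and cluster; the cluster label goes into "set"
toy_long <- pivot_longer(
  toy_wide,
  cols = -Feature.Name,
  names_to = c("set", ".value"),
  names_pattern = "(.+)_(.+)"
)

# Colour by cluster and draw one panel per cluster
p <- ggplot(toy_long,
            aes(x = Log2.fold.change,
                y = -log10(Adjusted.p.value),
                colour = set)) +
  geom_point() +
  facet_wrap(~ set)
print(p)
```

Dropping the facet_wrap line keeps everything in a single panel, with color alone separating the clusters.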

Building sequence data for a recommender system- replacing cross-tabular matrix with a variable value

I am trying to build a sequence data for a recommender system. I have built a cross-tabular data (Table 1) and Table 2 as shown below:
I have been trying to replace all the 1's in Table 1 by the "Grade" from the Table 2 in R.
Any insight/suggestion is greatly appreciated.
Instead of replacing the values in the first table with those from the second, the second table can be reshaped directly to 'wide' format with dcast
library(reshape2)
res <- dcast(df2, St.No. ~ Courses, value.var = 'Grade')[names(df1)]
res
#  St.No. Math Phys Chem CS
#1      1         A       B
#2      2         B    B
#3      3    A         A  C
#4      4    B    B       D
If we need to replace the blanks with 0
res[res == ""] <- "0"
data
df1 <- data.frame(St.No. = 1:4, Math = c(0, 0, 1, 1), Phys = c(1, 1, 0, 1),
Chem = c(0, 1, 1, 0), CS = c(1, 0, 1, 1))
df2 <- data.frame(St.No. = rep(1:4, each = 4), Courses = rep(c("Math",
"Phys", "Chem", "CS"), 4),
Grade = c("", "A", "", "B", "", "B", "B", "",
"A", "", "A", "C", "B", "B", "", "D"),
stringsAsFactors = FALSE)
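If you do want to fill Table 1 in place, as the question literally asks, the lookup can be sketched with matrix indexing in base R. This is only an illustrative alternative to the dcast answer: the names course_cols, grade_mat, idx and filled are hypothetical, and it assumes every student/course pair appears at most once in df2.

```r
df1 <- data.frame(St.No. = 1:4, Math = c(0, 0, 1, 1), Phys = c(1, 1, 0, 1),
                  Chem = c(0, 1, 1, 0), CS = c(1, 0, 1, 1))
df2 <- data.frame(St.No. = rep(1:4, each = 4),
                  Courses = rep(c("Math", "Phys", "Chem", "CS"), 4),
                  Grade = c("", "A", "", "B", "", "B", "B", "",
                            "A", "", "A", "C", "B", "B", "", "D"),
                  stringsAsFactors = FALSE)

course_cols <- setdiff(names(df1), "St.No.")

# Grade lookup with the same row/column layout as the 0/1 part of df1
grade_mat <- matrix("", nrow = nrow(df1), ncol = length(course_cols),
                    dimnames = list(NULL, course_cols))
idx <- cbind(match(df2$St.No., df1$St.No.), match(df2$Courses, course_cols))
grade_mat[idx] <- df2$Grade

# Where df1 has a 1, take the grade; everywhere else keep "0"
filled <- ifelse(as.matrix(df1[course_cols]) == 1, grade_mat, "0")
res2 <- data.frame(St.No. = df1$St.No., filled, stringsAsFactors = FALSE)
print(res2)
```

The two-column matrix idx addresses one cell of grade_mat per row of df2, which avoids looping over students and courses.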

Conditional displaying values in R

I'd like to see which values have a particular entry issue, but I'm not getting things done right.
For instance, I need to print the values from column "c" conditional on a given value of "b", say where [b==0].
Finally, I need to add a new string for those whose condition is true.
df<- structure(list(a = c(11.77, 10.9, 10.32, 10.96, 9.906, 10.7,
11.43, 11.41, 10.48512, 11.19), b = c(2, 3, 2, 0, 0, 0, 1, 2,
4, 0), c = c("q", "c", "v", "f", "", "e", "e", "v", "a", "c")), .Names = c("a",
"b", "c"), row.names = c(NA, -10L), class = "data.frame")
I tried this without success:
if(df[b]==0){
print(df$c)
}
if((df[b]==0)&(df[c]=="v")){
df[c] <-paste("2")
}
Thanks for helping.
The correct syntax is like df[rows, columns], so you could try:
df[df$b==0, "c"]
You can accomplish changing values using ifelse:
df$c <- ifelse(df$b==0 & df$c=="v", paste(df$c, 2, sep=""), df$c)
Does this help?
rows <- which(df$b==0)
if (length(rows)>0) {
print(df$c[rows])
df$c[rows] <- paste(df$c[rows],'2')
## maybe you wanted to have:
# df$c[rows] <- '2'
}
There are several ways to subset data in R, e.g.:
df$c[df$b == 0]
df[df$b == 0, "c"]
subset(df, b == 0, c)
with(df, c[b == 0])
# ...
To conditionally add another column (here: TRUE/FALSE):
df$e <- FALSE; df$e[df$b == 0] <- TRUE
df <- transform(df, e = ifelse(b == 0, TRUE, FALSE))
df <- within(df, e <- ifelse(b == 0, TRUE, FALSE))
# ...
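Putting the pieces together: the attempt in the question fails because if() expects a single TRUE or FALSE, while df$b == 0 yields one value per row. A minimal sketch of the vectorised version, using the question's data (the column name flag and the label "check" are made up for illustration):

```r
df <- data.frame(
  a = c(11.77, 10.9, 10.32, 10.96, 9.906, 10.7, 11.43, 11.41, 10.48512, 11.19),
  b = c(2, 3, 2, 0, 0, 0, 1, 2, 4, 0),
  c = c("q", "c", "v", "f", "", "e", "e", "v", "a", "c"),
  stringsAsFactors = FALSE
)

# One logical per row: TRUE where b is 0
is_zero <- df$b == 0

# Show the values of c for those rows
print(df$c[is_zero])

# Add a string marker for the rows that meet the condition
df$flag <- ifelse(is_zero, "check", "")
print(df)
```

The same logical vector can be reused for both the display and the assignment, so the condition is only written once.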

How to reorder rows in a matrix

I have a matrix and would like to reorder its rows so that, for example, row 5 moves to position 2 and row 2 moves to position 7. I have a txt file listing all the row names in the desired order, separated by \n. I thought I could read it into R and then index the matrix (in my case 'k') with it, something like k[txt file,] -> k_new, but this does not work because the identifiers are not the first column of the matrix; they are defined as rownames.
k[ c(1,5,3,4,7,6,2), ] #But probably not what you meant....
Or perhaps (if your 'k' object rownames are something other than the default character-numeric sequence):
k[ char_vec , ] # where char_vec will get matched to the row names.
(dat <- structure(list(person = c(1, 1, 1, 1, 2, 2, 2, 2), time = c(1,
2, 3, 4, 1, 2, 3, 4), income = c(100, 120, 150, 200, 90, 100,
120, 150), disruption = c(0, 0, 0, 1, 0, 1, 1, 0)), .Names = c("person",
"time", "income", "disruption"), row.names = c("h", "g", "f",
"e", "d", "c", "b", "a"), class = "data.frame"))
dat[ c('h', 'f', 'd', 'b') , ]
#-------------
  person time income disruption
h      1    1    100          0
f      1    3    150          0
d      2    1     90          0
b      2    3    120          1
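For the txt-file part of the question, a minimal sketch, assuming the file holds one row name per line. The matrix k, its row names and the temporary file are made up for illustration:

```r
# Toy matrix with row names (stand-in for the real 'k')
k <- matrix(1:12, nrow = 4,
            dimnames = list(c("r1", "r2", "r3", "r4"), c("x", "y", "z")))

# Stand-in for the real txt file: one row name per line, in the wanted order
order_file <- tempfile(fileext = ".txt")
writeLines(c("r3", "r1", "r4", "r2"), order_file)

# Read the names as a character vector
wanted <- readLines(order_file)

# Guard against names that do not exist in the matrix
stopifnot(all(wanted %in% rownames(k)))

# Index by name; drop = FALSE keeps the result a matrix
k_new <- k[wanted, , drop = FALSE]
print(k_new)
```

Because the character vector is matched against rownames(k), it does not matter that the identifiers are not stored in a column.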
