How to replace 0's with blanks in kable - r

I'm creating a rather large HTML table using the kable function, and this table has a lot of 0's in it. To show the relevant information more clearly, I'm trying to hide the 0's in the table by replacing them with blank space.
Right now, I'm trying something like this but it's not working:
my_table = knitr::kable(...)
cat(gsub(0," ",my_table), sep = '\n')
Something similar to the above works to remove NA's, but I can't seem to get it to work for 0's.
Thanks in advance!
EDIT: example data:
Product = c('A','B','A','A','C','B')
Month = c('Jan', 'Feb', 'Feb', 'Apr', 'Jan', 'Feb')
my_data = data.frame(Product, Month)
my_table = table(my_data)
kable(my_table) #This has the 0's which I don't want
| Product | Month |
| --- | --- |
| A | Jan |
| B | Feb |
| A | Feb |
| A | Apr |
| C | Jan |
| B | Feb |
Current output:
  Jan Feb Mar Apr
A   1   1   0   1
B   0   2   0   0
C   1   0   0   0
Desired output:
  Jan Feb Mar Apr
A   1   1   -   1
B   -   2   -   -
C   1   -   -   -
except "-" would be a blank space instead of a dash
EDIT2: never mind, I figured it out even though this is really hacky:
my_kable = knitr::kable(my_table)
gsub(0, ' ', my_kable)
lol

The reason your original gsub() wasn't working is that it flattens the kable output to a plain character vector, losing the table structure. One of many options that maintains the structure is to replace the zeros in the underlying table before calling kable(), using the replace() function:
knitr::kable(replace(my_table, my_table==0, ""))
#| |Apr |Feb |Jan |
#|:--|:---|:---|:---|
#|A |1 |1 |1 |
#|B | |2 | |
#|C | | |1 |

You can use base R gsub() on the kable output (fine here, but note it replaces every "0" character, so multi-digit counts like 10 would be mangled):
gsub(0, " ", kable(my_table))
To get:
| | Apr| Feb| Jan|
|:--|---:|---:|---:|
|A | 1| 1| 1|
|B | | 2| |
|C | | | 1|

You can also match the zero together with its leading space, which keeps zeros inside multi-digit numbers intact:
gsub(" 0", " ", kable(my_table))

Related

Is there an elegant pyspark solution for the following ranking problem?

Is there an elegant function for the following problem?
I've been tasked with creating a function that determines the differences in days and ranks the values. The closest positive number ranks as 0 and is the 'starting point'. From the starting point, negative values are assigned decreasing negative ranks and non-negative values are assigned increasing non-negative ranks:
| Datediff() | Rank |
| --- | --- |
| -50 | -3 |
| -32 | -2 |
| -1 | -1 |
| 5 | 0 |
| 14 | 1 |
| 32 | 2 |
| 128 | 3 |
| 254 | 4 |
My solution so far would be to separate the negative and positive numbers and use the Window.partitionBy() function to assign the corresponding ranks. That would work, but I'm curious whether there is a more elegant solution. :)
You can use Window to generate serial numbers to use as rank:
from pyspark.sql.functions import col, row_number
from pyspark.sql.window import Window
df = spark.createDataFrame([(10,),(-10,),(5,),(-5,),(15,),(-15,)], ["dated_diff"])
window = Window.orderBy(df["dated_diff"])
df = df.select("dated_diff", row_number().over(window).alias("row_number"))
df.show()
+----------+----------+
|dated_diff|row_number|
+----------+----------+
| -15| 1|
| -10| 2|
| -5| 3|
| 5| 4|
| 10| 5|
| 15| 6|
+----------+----------+
Then find the rank of the first non-negative number (ordering by row_number so that first() is deterministic):
first_positive_rank = df.filter("dated_diff >= 0").orderBy("row_number").first()["row_number"]
print(first_positive_rank)
>> 4
OR
first_positive_rank = df.filter("dated_diff<0").count() + 1
print(first_positive_rank)
>> 4
And finally subtract that rank from the rank of every row:
df = df.withColumn("row_number", col("row_number") - first_positive_rank)
df.show()
+----------+----------+
|dated_diff|row_number|
+----------+----------+
| -15| -3|
| -10| -2|
| -5| -1|
| 5| 0|
| 10| 1|
| 15| 2|
+----------+----------+

How to match two columns in one dataframe using values in another dataframe in R

I have two dataframes. One is a set of ≈4000 entries that looks similar to this:
| grade_col1 | grade_col2 |
| --- | --- |
| A-| A-|
| B | 86|
| C+| C+|
| B-| D |
| A | A |
| C-| 72|
| F | 96|
| B+| B+|
| B | B |
| A-| A-|
The other is a set of ≈700 entries that look similar to this:
| grade | scale |
| --- | --- |
| A+|100|
| A+| 99|
| A+| 98|
| A+| 97|
| A | 96|
| A | 95|
| A | 94|
| A | 93|
| A-| 92|
| A-| 91|
| A-| 90|
| B+| 89|
| B+| 88|
...and so on.
What I'm trying to do is create a new column that shows whether grade_col2 matches grade_col1, with a binary 0-1 output (0 = no match, 1 = match). Most of grade_col2 is recorded as a letter grade, but every once in a while an entry was accidentally entered as a numeric grade instead. I want this match column to give me a "1" even when grade_col2 is a numeric grade instead of a letter grade. In other words, if grade_col1 is B and grade_col2 is 86, I want this to still be read as a match. Only when the number disagrees with the letter, e.g. grade_col1 is F and grade_col2 is 96, would this not be a match (just as grade_col1 B- against grade_col2 D is not a match).
The second data frame gives me the information I need to translate between one and the other (entries between 97-100 are A+, between 93-96 are A, and so on). I just don't know how to run a script that uses this information to find matches through all ≈4000 entries. Theoretically, I could do this manually, but the real dataset is so lengthy that this isn't realistic.
I had been thinking of using nested if_else statements with dplyr, but once I got past the first "if" statement I got stuck. I'd appreciate any help people can offer.
You can do this using a join.
Let your first dataframe be grades_df and your second dataframe be lookup_df, then you want something like the following:
output = grades_df %>%
  # join on the lookup, keeping everything in the grades table
  left_join(lookup_df, by = c("grade_col2" = "scale")) %>%
  # combine grade_col2 from grades_df and grade from lookup_df
  mutate(grade_col2b = ifelse(is.na(grade), grade_col2, grade)) %>%
  # indicator column
  mutate(indicator = ifelse(grade_col1 == grade_col2b, 1, 0))
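For a self-contained illustration, here's a minimal sketch with toy data (the toy data frames are assumptions based on the question; since grade_col2 is stored as text, the numeric scale column is converted to character before joining):
library(dplyr)
# Toy versions of the two data frames described in the question
grades_df = data.frame(grade_col1 = c("A-", "B", "C+", "F"),
                       grade_col2 = c("A-", "86", "C+", "96"),
                       stringsAsFactors = FALSE)
lookup_df = data.frame(grade = c("A", "B", "B-", "C+"),
                       scale = c(96, 86, 82, 77),
                       stringsAsFactors = FALSE)
output = grades_df %>%
  # scale must be character to join against the text column grade_col2
  left_join(lookup_df %>% mutate(scale = as.character(scale)),
            by = c("grade_col2" = "scale")) %>%
  mutate(grade_col2b = ifelse(is.na(grade), grade_col2, grade),
         indicator = ifelse(grade_col1 == grade_col2b, 1, 0))
Here B/86 gets indicator 1 because 86 translates to B, while F/96 gets 0 because 96 translates to A.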

WGCNA package: value matching function output contains wrong NAs

I use the WGCNA package for analyzing co-expressed genes. Here I try to form a data frame, analogous to the expression data, that will hold the clinical traits, using the following code.
Table traitData:
| x | sample | NoduleperPlant |
|- |- |- |
| 1 | 1021_verbena_rep_1 | 2 |
| 2 | 1021_verbena_rep_2 | 3 |
| 3 | 1021_verbena_rep_3 | 1 |
| 4 | 1021_camporegio_rep_1 | 2 |
| 5 | 1021_camporegio_rep_2 | 3 |
| 6 | 1021_camporegio_rep_3 | 4 |
| 7 | BL225C_camporegio_rep_1 | 5 |
| 8 | BL225C_camporegio_rep_2 | 4 |
| 9 | BL225C_camporegio_rep_3 | 1 |
Table dfxpr (only some of the genes are shown):
|FIELD1 |aacC-1|aacC4-1|aapJ-1|aapM-1|aapP-1|aapQ-1|aarF-1|
|-----------------------|------|-------|------|------|------|------|------|
|X1021_verbena_rep_1 |42 |46 |12412 |935 |3354 |2876 |550 |
|X1021_verbena_rep_2 |52 |37 |11775 |946 |2970 |2824 |514 |
|X1021_verbena_rep_3 |12 |22 |5077 |397 |1462 |1228 |230 |
|X1021_camporegio_rep_1 |52 |71 |12983 |1454 |3408 |3248 |707 |
|X1021_camporegio_rep_2 |20 |65 |9240 |803 |2807 |3146 |445 |
|X1021_camporegio_rep_3 |28 |53 |11030 |1065 |3480 |3410 |582 |
|BL225C_camporegio_rep_1|29 |19 |6346 |375 |938 |768 |118 |
|BL225C_camporegio_rep_2|51 |62 |12938 |781 |1765 |1629 |291 |
|BL225C_camporegio_rep_3|52 |43 |6462 |504 |1120 |1091 |238 |
traitData = read.csv("NodulPerPlantTraitForLowGroup.csv"); # this csv file contains 3 columns: the first holds irrelevant information, the second the sample names, and the third the values measured for the trait
# remove columns that hold information I do not need.
allTraits = traitData[, -1];
allTraits = allTraits[, 1:2];
# Form a data frame analogous to expression data that will hold the clinical traits.
lowNoduleSamples = rownames(dfxpr) #dfxpr is a data frame containing 9 observations (i.e. samples) and 6398 variables (i.e. genes)
traitRows = match(lowNoduleSamples, allTraits$sample); # this is the line where I get NAs, although I know the samples should all match
datTraits = allTraits[traitRows, -1]; # and this line then yields NAs too
rownames(datTraits) = allTraits[traitRows, 1];
collectGarbage();
How can I fix the problem?
I added drop = FALSE to this line: datTraits = allTraits[traitRows, -1]
datTraits = allTraits[traitRows, -1, drop = FALSE]
I realized that my allTraits contains only 2 columns; when I remove the first one, I'm left with a single column, and R converts that into a plain vector unless I add the drop = FALSE argument.
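That fixes the shape of datTraits, but the NAs from match() are worth a separate look. A hedged guess from the two tables above: read.csv() with its default check.names = TRUE passes column names through make.names(), which prepends an "X" to any name starting with a digit, and when an expression matrix is transposed so samples become rows, those X-prefixed column names end up as the row names. That would explain why dfxpr shows X1021_verbena_rep_1 while the trait file holds 1021_verbena_rep_1, so match() cannot pair them. A minimal sketch to normalize the names before matching:
# Strip the "X" that R prepends to names beginning with a digit
# (assumption: the sample names otherwise agree between the two tables)
lowNoduleSamples = sub("^X(?=[0-9])", "", rownames(dfxpr), perl = TRUE)
traitRows = match(lowNoduleSamples, allTraits$sample)
Alternatively, reading the original expression file with check.names = FALSE keeps the names untouched in the first place.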

Control digits in specific cells

I have a table that looks like this:
+-----------------------------------+-------+--------+------+
|                                   | Male  | Female | n    |
+-----------------------------------+-------+--------+------+
| way more than my fair share       | 2,4   | 21,6   | 135  |
| a little more than my fair share  | 5,4   | 38,1   | 244  |
| about my fair share               | 54,0  | 35,3   | 491  |
| a little less than my fair share  | 25,1  | 3,0    | 153  |
| way less than my fair share       | 8,7   | 0,7    | 51   |
| Can't say                         | 4,4   | 1,2    | 31   |
| n                                 | 541,0 | 564,0  | 1105 |
+-----------------------------------+-------+--------+------+
Everything is fine, but I would like to show no decimal digits at all in the last row, since it holds the margins (real case counts). Is there any way in R to manipulate specific cells and their digits?
Thanks!
You could use ifelse to output the numbers in different formats in different rows, as in the example below. However, it will take some additional finagling to get the values in the last row to line up by place value with the previous rows:
library(knitr)
library(tidyverse)
# Fake data
set.seed(10)
dat = data.frame(category=c(LETTERS[1:6],"n"), replicate(3, rnorm(7, 100,20)))
dat %>%
  mutate_if(is.numeric, funs(sprintf(ifelse(category == "n", "%1.0f", "%1.1f"), .))) %>%
  kable(align = "lrrr")
|category | X1| X2| X3|
|:--------|-----:|-----:|-----:|
|A | 100.4| 92.7| 114.8|
|B | 96.3| 67.5| 101.8|
|C | 72.6| 94.9| 80.9|
|D | 88.0| 122.0| 96.1|
|E | 105.9| 115.1| 118.5|
|F | 107.8| 95.2| 109.7|
|n | 76| 120| 88|
The huxtable package makes it easy to decimal-align the values (see the Vignette for more on table formatting):
library(huxtable)
tab = dat %>%
  mutate_if(is.numeric, funs(sprintf(ifelse(category == "n", "%1.0f", "%1.1f"), .))) %>%
  hux() %>%
  add_colnames()
align(tab)[-1] = "."
tab
Here's what the output looks like when knitted to PDF from an R Markdown document (screenshot of the decimal-aligned table not included).

Is it possible to plot two variables using a third one as filter in R?

I am trying to create a plot of two variables (DATE and INT_RATE), using the content of a third variable, GRADE, to filter the rows.
The problem is that I can't figure out how to apply GRADE as a row filter.
Below I provide a sample of the starting data as well as a sketch of the plot I'm trying to achieve.
Thanks in advance.
STARTING DATA
| DATE | INT_RATE | GRADE |
| --- | --- | --- |
| 1-jan | 5% | A | <-- A
| 5-feb | 3% | B |
| 9-feb | 2% | D |
| 1-apr | 3% | A | <-- A
| 5-jun | 5% | A | <-- A
| 1-aug | 3% | G |
| 1-sep | 2% | E |
| 3-nov | 1% | C |
| 8-dec | 8% | A | <-- A
| . | . | . |
| . | . | . |
| . | . | . |
And this is the kind of graph I would like to achieve; it is a very basic one, except for the filtering work needed beforehand.
WANTED RESULT (drawn as ASCII art in the original post): a plot for GRADE "A" only, with DATE on the x-axis and INT_RATE on the y-axis, connecting the points (1-jan, 5%), (1-apr, 3%), (5-jun, 5%) and (8-dec, 8%).
EDIT 1:
Following the precious help from @apax I managed to get a plot, but the result is not satisfying because of the weird way R displays it (I think it may be related to the dataset being very large, 800k rows). Do you have any suggestions?
By the way, this solved my problem:
plot(INT_RATE ~ DATE, data = filter(df, GRADE == "A"))
I am also uploading a PNG of the malformed chart.
Thanks again to all.
Here's a quick one-liner solution, where I assume your data is stored in an object named df. Note the formula interface, since base plot() only accepts a data argument that way:
library(dplyr) ## For filter() below
plot(INT_RATE ~ DATE, data = filter(df, GRADE == "A"))
You could use ggplot2 and facet_wrap(...):
library(ggplot2)
ggplot(mtcars, aes(x = mpg, y = disp)) +
  geom_point() +
  facet_wrap(~ cyl)
For your data:
ggplot(data, aes(x = DATE, y = INT_RATE)) +
  geom_line() +
  facet_wrap(~ GRADE)
P.S. This gives separate graphs for all grades, but that should not be a problem.
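If you only want GRADE "A", here is a minimal dplyr + ggplot2 sketch of the combined approach (assuming df holds DATE, INT_RATE and GRADE, with DATE already parsed as a Date so the x-axis orders correctly):
library(dplyr)
library(ggplot2)
df %>%
  filter(GRADE == "A") %>%                # keep only the rows for grade A
  ggplot(aes(x = DATE, y = INT_RATE)) +
  geom_line() +
  geom_point(alpha = 0.3)                 # low alpha eases 800k-row overplotting
With that many rows it can also help to aggregate (e.g. a mean INT_RATE per month) before plotting.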
