How to fix corrplot formatting issue in R - r

When I run the following chunk of code, I get a very small and unusable correlation plot (see photo). I think it is because the column labels are too long. Is there any way to fix this without modifying the column names (for example, by displaying only a substring of each name)?
library(corrplot)
# keep only the numeric columns of df_train
numeric.var <- sapply(df_train, is.numeric)
corr.matrix <- cor(df_train[, numeric.var])
corrplot(corr.matrix, main = "\n\nCorrelation Plot for Numerical Variables",
         method = "number")
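One way to try the "substring of names" idea from the question is to shorten the labels on the correlation matrix itself, so the data frame keeps its original column names. The sketch below is only an illustration under assumptions: the toy data frame and its column names are hypothetical, base R's abbreviate() stands in for whatever shortening rule suits the real names, and the plot is drawn only if corrplot is installed. tl.cex is corrplot's text-label size argument.

```r
# Toy data with deliberately long column names (hypothetical, not df_train).
set.seed(1)
toy <- data.frame(
  average_monthly_household_income  = rnorm(20),
  number_of_previous_purchases_made = rnorm(20),
  days_since_last_customer_contact  = rnorm(20)
)
corr.matrix <- cor(toy)

# Shorten the labels on the matrix only; the data frame keeps its names.
short <- abbreviate(colnames(corr.matrix), minlength = 8, named = FALSE)
dimnames(corr.matrix) <- list(short, short)

# Draw the plot only when corrplot is available; tl.cex shrinks the labels.
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(corr.matrix, method = "number", tl.cex = 0.7)
}
```

Using substr(colnames(corr.matrix), 1, 8) instead of abbreviate() would give plain prefixes, but prefixes are not guaranteed to be unique.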

Related

How to use corrplot with is.corr=FALSE

I previously made a fully functional correlation plot with corrplot (my plot). Now I need to present the underlying data in the same look. So my goal is to have triangular similarity matrices in the same colours as my correlation plot. Imagine it like conditional formatting in Excel.
My data: my data from Excel
Link to CSV data file
It is loaded as a CSV, and R reads the file without problems.
My code: corrplot(Phylogeny, is.corr=FALSE, method="number", cl.lim=c(0,1))
The error it throws me: Error in if (any(corr < cl.lim[1]) || any(corr > cl.lim[2])) { : Missing value, where TRUE/FALSE is required
I made sure all columns are numeric
I made sure to fill the missing bits with NAs (because that was a problem somewhere before)
I made sure all my values are between 0 and 1, which is the limit I want (at one point, while trying things out, I was told that my values were not within the limit)
the error does not change when I change the limit
the error does not change when I take is.corr=FALSE out (default=TRUE)
I played around with corrplot.mixed and it's still not working
have been referencing information from Corrplot Intro
I have looked into the condformat function, but I am not really sure if it can fill each cell with one colour according to the overall gradient, like the one I used for my correlation plot.
What am I missing here that it does not want to give me my table back with pretty colours?
I had the same error, but I was able to fix it by converting my data.frame to a matrix. I ended up with corrplot(as.matrix(df), is.corr = FALSE).
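The error text in the question is what R reports whenever an if() condition evaluates to NA. The base-R sketch below only illustrates that general mechanism with a made-up vector; it is not a confirmed diagnosis of what happens inside corrplot or of why the as.matrix() conversion helps.

```r
# A comparison against NA yields NA, and any() over FALSE/NA values stays NA.
vals <- c(0.2, NA, 0.9)   # hypothetical values, all within [0, 1] except the NA
cond <- any(vals < 0) || any(vals > 1)
print(cond)  # NA

# Using that NA as an if() condition raises an error (caught here for display).
msg <- tryCatch(
  {
    if (cond) "out of range" else "in range"
  },
  error = function(e) conditionMessage(e)
)
print(msg)

# Ignoring missing values gives a usable TRUE/FALSE instead.
cond_ok <- any(vals < 0, na.rm = TRUE) || any(vals > 1, na.rm = TRUE)
```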
If I am understanding correctly, your posted data are already a correlation matrix - although not a fully symmetrical one of the sort that would be produced with the call cor on raw data.
In that case, the problem is just that you have variable names (Species) as a column in your data. Change this column to row names, drop the variable names, and call corrplot as user9536160 suggests:
library(readr)    # read_csv()
library(dplyr)    # %>% and select()
library(corrplot)
# read in your data
phyl <- as.data.frame(read_csv("Phylogeny.csv"))
# name rows and drop variable names in the df itself
row.names(phyl) <- phyl$Species
phyl <- phyl %>%
  select(-Species)
# call corrplot
corrplot(as.matrix(phyl), is.corr = FALSE)
The result:

How to get variable labels to be saved as labels from R to Stata?

I have an R dataset X that contains variable labels, which can be retrieved with:
var_label(X)
after loading library(labelled).
Thereafter, I save it as a Stata dataset as:
write.dta(X, 'X.dta', val.labels=var_label(X))
I get an unused argument error message. I tried as well:
write.dta(X, 'X.dta', val.labels=c(var_label(X)))
to no avail. Any guidance on this is much appreciated.
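A possible direction, offered as a sketch and not as a tested answer for this dataset: foreign::write.dta() has no val.labels argument, which would explain the unused argument error, and according to its documentation it takes variable labels from a "var.labels" attribute on the data frame. The example below assumes the labels are stored the way labelled stores them (a "label" attribute on each column); the data frame X here is a hypothetical stand-in.

```r
# Hypothetical stand-in for X, with a "label" attribute on each column.
X <- data.frame(age = c(31, 45), inc = c(2.1, 3.4))
attr(X$age, "label") <- "Age in years"
attr(X$inc, "label") <- "Income (thousands)"

# Collect one label per column, falling back to the column name.
labs <- vapply(names(X), function(nm) {
  lab <- attr(X[[nm]], "label")
  if (is.null(lab)) nm else as.character(lab)[1]
}, character(1))

# write.dta() reads variable labels from the "var.labels" attribute.
attr(X, "var.labels") <- unname(labs)

# Write to a temporary file only if the foreign package is available.
if (requireNamespace("foreign", quietly = TRUE)) {
  foreign::write.dta(X, file.path(tempdir(), "X.dta"))
}
```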

R extract labels from a rda data frame

I am looking at some data downloaded from ICPSR, and I am specifically using their R data file (.rda). Beneath the column name of each variable, there is a description of the variable (a.k.a. its label). An example is attached as well.
I tried various ways to get the labels, including base::label, Hmisc::label, labelled::var_label, and sjlabelled::get_label, but none worked.
So I am asking for any ideas on how to extract the labels from this data file.
Thanks very much in advance!
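This could work using purrr:
# load library
library(purrr)
# get number of columns
n <- ncol(yourdata)
# extract the "label" attribute of each column as a character vector;
# columns without a label yield NA instead of an error
labels <- map_chr(seq_len(n), function(x) attr(yourdata[[x]], "label") %||% NA_character_)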
This worked for me (I am working with ICPSR 35206):
labels <- attributes(yourdata)$variable.labels
Make sure that your attribute referring to the labels is actually called "variable.labels".
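The two answers look in different places: a "label" attribute on each column, and a "variable.labels" attribute on the data frame. A small base-R helper can check both. This is a sketch under assumptions: get_var_labels is a hypothetical name, and the toy data frames only imitate the two storage styles, not an actual ICPSR file.

```r
get_var_labels <- function(d) {
  # Data-frame-level storage: the "variable.labels" attribute.
  df_level <- attr(d, "variable.labels")
  if (!is.null(df_level)) return(df_level)
  # Column-level storage: one "label" attribute per column, NA if absent.
  vapply(d, function(col) {
    lab <- attr(col, "label")
    if (is.null(lab)) NA_character_ else as.character(lab)[1]
  }, character(1))
}

# Hypothetical toy data frames, one per storage style.
a <- data.frame(V1 = 1:2, V2 = 3:4)
attr(a, "variable.labels") <- c(V1 = "CASE ID", V2 = "AGE")

b <- data.frame(V1 = 1:2)
attr(b$V1, "label") <- "CASE ID"

labels_a <- get_var_labels(a)
labels_b <- get_var_labels(b)
```

If neither call returns labels for a real file, str(attributes(yourdata)) shows which attribute names are actually present.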

Getting subscripts from Excel into R

I just started learning R, but I already have my first problem. I want to display my data in a graph. My data is in an Excel sheet converted to a .csv file. But I have some chemical formulas like Fe2O3 in my data, and in the .csv all subscripts are gone. That doesn't look very nice. Is there any way to get the subscripts from the original Excel file into R?
I would really appreciate your help :)
Edit: My data contains 6 chemical formulas displayed on the x-axis, which all contain subscripts (e.g. Fe2O3, ZnCl2, CO2, ...), and numeric values displayed on the y-axis. The graph is a bar chart. I am not sure if there is a way to either change the numbers to subscripts in R or keep them prior to the import.
The graph looks like this. But I would like to have the numbers as subscripts:
I don't know that there's a way to bring the formatting from excel into a CSV and then R, unless you can make those subscripts using unicode. UTF8 symbols for subscript letters
Given that your list of chemicals is short, it's not much work to tweak the chemical names to help ggplot interpret them with subscripts. You'll want brackets around the numbers, plus tildes afterwards if there are more elements to include. Then we also tell scale_x_discrete to "parse" the labels and convert those symbols to formatting.
library(tibble)
library(ggplot2)
set.seed(42)
chem_df <- tibble(
  Chemicals =
    c("AgNO3", "Al2SiO5", "CO2", "Fe2O3", "FeSO4", "ZnCl2"),
  Chemicals_parsed =
    c("AgNO[3]", "Al[2]~SiO[5]", "CO[2]", "Fe[2]~O[3]", "FeSO[4]", "ZnCl[2]"),
  Mean = rnorm(6, 50, 30))
# parse each axis break itself, so labels stay matched to their bars
# regardless of how the x values are sorted
ggplot(chem_df, aes(x = Chemicals_parsed, y = Mean)) + geom_col() +
  scale_x_discrete(name = "Chemicals",
                   labels = function(x) parse(text = x))
To add to the excellent answer of @JonSpring, you can write a function which will convert strings like "Al2SiO5" to strings like "Al[2]~SiO[5]", so you don't have to make all the conversions manually:
library(stringr)
chem.form <- function(s){
s <- str_replace_all(s,"([0-9]+)","[\\1]~")
if(endsWith(s,"~")) s <- substr(s,1,nchar(s) - 1)
s
}
Chemicals <- c("AgNO3", "Al2SiO5", "CO2", "Fe2O3", "FeSO4", "ZnCl2")
Chemicals_parsed <- as.vector(sapply(Chemicals,chem.form))
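The same conversion can also be sketched in base R without stringr or sapply, since gsub() and sub() are already vectorised. chem_form_vec is a hypothetical name; like chem.form() above, it treats every run of digits as a subscript, so a formula with a leading coefficient would be mis-set.

```r
# Vectorised base-R variant of chem.form(): wrap each run of digits in
# brackets followed by a tilde, then drop a trailing tilde if present.
chem_form_vec <- function(s) {
  sub("~$", "", gsub("([0-9]+)", "[\\1]~", s))
}

Chemicals <- c("AgNO3", "Al2SiO5", "CO2", "Fe2O3", "FeSO4", "ZnCl2")
Chemicals_parsed <- chem_form_vec(Chemicals)
print(Chemicals_parsed)

# Check that every result parses as an R (plotmath) expression.
exprs <- parse(text = Chemicals_parsed)
```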

R: Issues with line graph of germination through time

I'm still in the process of learning R using Swirl and RStudio, and a goal I've set for myself is to recreate this graph. I have a small dataset that I will link below (it's saved as a plain text CSV file that I import into R with headings enabled).
If I try to plot that dataset without changing anything, I get this, which is obviously not the goal.
At first I thought the problem would be in the class of my imported dataset, defined as kt. After class(kt) turned out to be data.frame I figured that wasn't the problem. Should I be trying to rewrite the table to something that R can plot instantly, or should I be trying to extract each species individually, plot them separately and then combining the different plots into one graph? Perhaps there is something wrong with my dates, I know that R handles dates in a specific way. Maybe these solutions are not even needed and I'm just doing something stupidly simple wrong, but I can't find it myself.
Your help is much appreciated.
Dataset:
Species,week 0,week 1,week 2,week 3,week 4,week 5,week 6,week 7,week 8,week 9,week 10,week 11,week 12,week 13,week 14,week 15,week 16,week 17,week 18
Caesalpinia coriaria,0.0%,24.0%,28.0%,28.0%,32.0%,37.0%,40.0%,46.0%,52.0%,56.0%,63.0%,64.0%,68.0%,71.0%,72.0%,,,,
Coccoloba swartzii,0.0%,0.0%,1.0%,10.0%,19.0%,31.0%,33.0%,39.0%,43.0%,48.0%,52.0%,52.0%,52.0%,52.0%,52.0%,52.0%,52.0%,55.0%,
Cordia dentata,0.0%,5.0%,18.0%,21.0%,24.0%,26.0%,27.0%,30.0%,32.0%,32.0%,32.0%,32.0%,32.0%,32.0%,33.0%,33.0%,33.0%,34.0%,35.0%
Guaiacum officinale,0.0%,0.0%,0.0%,0.0%,4.0%,5.0%,5.0%,5.0%,7.0%,8.0%,8.0%,8.0%,8.0%,8.0%,8.0%,8.0%,8.0%,,
Randia aculeata,0.0%,0.0%,0.0%,4.0%,13.0%,14.0%,18.0%,19.0%,21.0%,21.0%,21.0%,21.0%,21.0%,22.0%,22.0%,22.0%,22.0%,,
Schoepfia schreberi,0.0%,0.0%,0.0%,0.0%,0.0%,0.0%,1.0%,4.0%,8.0%,11.0%,13.0%,21.0%,21.0%,24.0%,24.0%,25.0%,27.0%,,
Prosopis juliflora,0.0%,7.5%,31.3%,34.2%,,,,,,,,,,,,,,,
Something like this??
# df is the imported data frame (kt in the question)
# get rid of "%" signs
df <- data.frame(sapply(df, function(x) gsub("%", "", x, fixed = TRUE)))
# convert cols 2:20 to numeric
df[, 2:20] <- sapply(df[, 2:20], function(x) as.numeric(as.character(x)))
library(reshape2)
library(ggplot2)
# reshape from wide to long: one row per species and week
gg <- melt(df, id = "Species")
ggplot(gg, aes(x = variable, y = value, color = Species, group = Species)) +
  geom_line() +
  theme_bw() +
  theme(legend.position = "bottom", legend.title = element_blank())
There are lots of problems here.
First, if your dataset really has those % signs, then R interprets the data as character (and, in R versions before 4.0.0, imports it as factors by default). So first we have to get rid of the % (using gsub(...)), and then we have to convert what's left to numeric. With factors, you have to convert to character first, then numeric, so: as.numeric(as.character(...)). All of this could have been avoided if you exported the data without the % signs!
Plotting multiple curves with different colors is something the ggplot2 package was designed for (among many other things), so we use that. ggplot prefers data in "long" format: all the data in one column, with a second column distinguishing the different datasets. Your data is in "wide" format, with the data spread over different columns. So we convert to long using melt(...) from the reshape2 package. The result, gg, has three columns: Species, variable and value. value contains the actual data and variable contains the week number.
So now we create a ggplot object, setting the x-axis to the variable column, the y-axis to the value column, with color mapped to Species, and we tell ggplot to plot lines (using geom_line(...)).
The rest is to position the legend at the bottom, and turn off some of the ggplot default formatting.
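For readers who want to see the cleaning and wide-to-long steps without reshape2, here is a base-R sketch on a small excerpt of the dataset (two species, four weeks). The variable names are illustrative, and storing the week as an integer is an assumption made here so that the x-axis could be treated as numeric; the answer above keeps it as a factor.

```r
# Two species and four weeks copied from the dataset above.
csv_text <- "Species,week 0,week 1,week 2,week 3
Cordia dentata,0.0%,5.0%,18.0%,21.0%
Prosopis juliflora,0.0%,7.5%,31.3%,34.2%"

# check.names = FALSE keeps headers such as "week 0" unchanged.
kt <- read.csv(text = csv_text, check.names = FALSE, stringsAsFactors = FALSE)
week_cols <- setdiff(names(kt), "Species")

# Strip the percent sign and convert each week column to numeric.
kt[week_cols] <- lapply(kt[week_cols], function(x) {
  as.numeric(sub("%", "", x, fixed = TRUE))
})

# Wide to long: one row per species and week.
long <- data.frame(
  Species = rep(kt$Species, times = length(week_cols)),
  week    = rep(as.integer(sub("week ", "", week_cols, fixed = TRUE)),
                each = nrow(kt)),
  value   = unlist(kt[week_cols], use.names = FALSE)
)
print(long)
```

The long data frame can then be passed to ggplot with aes(x = week, y = value, color = Species) and geom_line().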
