Value not mapping correctly on Y axis using GGPLOT - r

I am new to R and have some very simple code. I am trying to create a bar chart with 2 variables and 6 observations; however, the data appears to be plotted incorrectly. The combined value for MAYBE is 5.9, the combined value for NO is 5.3, and the combined value for YES is 5.3. Categories MAYBE and NO appear to show correctly; however, YES appears to show 3.2 and not 5.3. Can you please review and advise what might be wrong with my code?
library(tidyverse)
xaxis_data <- c("YES","NO","MAYBE")
yaxis_data <- c(2.1,1.6,3.4,3.2,3.7,2.5)
data_to_plot <- data.frame(cbind(xaxis_data,yaxis_data),stringsAsFactors = FALSE)
ggplot(data=data_to_plot) +
geom_bar(mapping=aes(x = xaxis_data,y=yaxis_data,fill = xaxis_data),stat="identity")

The issue is that cbind returns a matrix, and a matrix can hold only a single type. Because xaxis_data is character, the whole matrix, including yaxis_data, is converted to character. Instead, we can construct the data frame with data.frame alone.
data_to_plot <- data.frame(xaxis_data,yaxis_data,stringsAsFactors = FALSE)
str(data_to_plot)
#'data.frame': 6 obs. of 2 variables:
#$ xaxis_data: chr "YES" "NO" "MAYBE" "YES" ...
#$ yaxis_data: num 2.1 1.6 3.4 3.2 3.7 2.5
If we use the cbind with data.frame
str(data.frame(cbind(xaxis_data,yaxis_data),stringsAsFactors = FALSE))
#'data.frame': 6 obs. of 2 variables:
#$ xaxis_data: chr "YES" "NO" "MAYBE" "YES" ...
#$ yaxis_data: chr "2.1" "1.6" "3.4" "3.2" ... ### character class
Using the OP's code
library(ggplot2)
ggplot(data=data_to_plot) +
geom_bar(mapping=aes(x = xaxis_data,y=yaxis_data,
fill = xaxis_data), stat="identity")
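To see the coercion without plotting anything, here is a minimal base R sketch using the OP's two vectors. The names m, good and totals are just illustrative; the per-category sums are what the stacked bars should add up to.

```r
xaxis_data <- c("YES", "NO", "MAYBE")
yaxis_data <- c(2.1, 1.6, 3.4, 3.2, 3.7, 2.5)

# cbind() builds a matrix, so the numbers are coerced to character
m <- cbind(xaxis_data, yaxis_data)
is.character(m)

# data.frame() alone keeps each column's own type
good <- data.frame(xaxis_data, yaxis_data, stringsAsFactors = FALSE)
is.numeric(good$yaxis_data)

# combined value per category
totals <- tapply(good$yaxis_data, good$xaxis_data, sum)
totals
```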

R- expand.grid given a data.frame of parameter names and sequence definitions

I have a data.frame that arbitrarily defines parameter names and sequence boundaries:
dfParameterValues <- data.frame(ParameterName = character(), seqFrom = integer(), seqTo = integer(), seqBy = integer())
row1 <- data.frame(ParameterName = "parameterA", seqFrom = 1, seqTo = 2, seqBy = 1)
row2 <- data.frame(ParameterName = "parameterB", seqFrom = 5, seqTo = 7, seqBy = 1)
row3 <- data.frame(ParameterName = "parameterC", seqFrom = 10, seqTo = 11, seqBy = 1)
dfParameterValues <- rbind(dfParameterValues, row1)
dfParameterValues <- rbind(dfParameterValues, row2)
dfParameterValues <- rbind(dfParameterValues, row3)
I would like to use this approach to create a grid of c parameter columns based on the number of unique ParameterNames that contain r rows of all possible combinations of the sequences given by seqFrom, seqTo, and seqBy. The result would therefore look somewhat like this or should have a content like the following:
ParameterA ParameterB ParameterC
1 5 10
1 5 11
1 6 10
1 6 11
1 7 10
1 7 11
2 5 10
2 5 11
2 6 10
2 6 11
2 7 10
2 7 11
Edit: Note that the parameter names and their numbers are not known in advance. The data.frame comes from elsewhere so I cannot use the standard static expand.grid approach and need something like a flexible function that creates the expanded grid based on any dataframe with the columns ParameterName, seqFrom, seqTo, seqBy.
I've been playing around with for loops (which is bad to begin with) and it hasn't led me to any elegant ideas. I can't seem to find a way to come up with the result by using tidyr without constructing the sequences separately first, either. Do you have any elegant approaches?
Bonus kudos for extending this to include not only numerical sequences, but vectors/sets of characters / other factors, too.
Many thanks!
Going off CPak's answer, you could use
my_table <- expand.grid(apply(dfParameterValues, 1, function(x) seq(as.numeric(x['seqFrom']), as.numeric(x['seqTo']), as.numeric(x['seqBy']))))
names(my_table) <- c("ParameterA", "ParameterB", "ParameterC")
my_table <- my_table[order(my_table$ParameterA, my_table$ParameterB), ]
@smanski's answer is technically correct (and should arguably be accepted since it motivated this), but it is also a good example of when to be careful when using apply with data.frames. In this case, the frame contains at least one column that is character, so all columns are converted, resulting in the need to use as.numeric. The safer alternative is to only pull the columns needed, such as either of:
expand.grid(apply(dfParameterValues[,-1], 1,
function(x) seq(x['seqFrom'], x['seqTo'], x['seqBy']) ))
expand.grid(apply(dfParameterValues[,c("seqFrom","seqTo","seqBy")], 1,
function(x) seq(x['seqFrom'], x['seqTo'], x['seqBy']) ))
I prefer the second, because it only pulls what it needs and therefore what it "knows" should be numeric. (I find explicit is often safer.)
The reason this is happening is that apply silently converts the data to a matrix, so to see the effects, try:
str(as.matrix(dfParameterValues))
# chr [1:3, 1:4] "parameterA" "parameterB" "parameterC" " 1" " 5" ...
# - attr(*, "dimnames")=List of 2
# ..$ : chr [1:3] "1" "2" "3"
# ..$ : chr [1:4] "ParameterName" "seqFrom" "seqTo" "seqBy"
str(as.matrix(dfParameterValues[c("seqFrom","seqTo","seqBy")]))
# num [1:3, 1:3] 1 5 10 2 7 11 1 1 1
# - attr(*, "dimnames")=List of 2
# ..$ : chr [1:3] "1" "2" "3"
# ..$ : chr [1:3] "seqFrom" "seqTo" "seqBy"
(Note the chr on the first and the num on the second.)
Neither one preserves the parameter names. To do that, just sandwich the call with setNames:
setNames(
expand.grid(apply(dfParameterValues[,c("seqFrom","seqTo","seqBy")], 1,
function(x) seq(x['seqFrom'], x['seqTo'], x['seqBy']) )),
dfParameterValues$ParameterName)
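Since the parameter names are not known in advance, the same idea can be wrapped in a small function. This is only a sketch: make_grid is a made-up name, and it uses Map instead of apply so that the result is always a list (apply simplifies to a matrix when every sequence happens to have the same length, which expand.grid would then treat as a single vector).

```r
dfParameterValues <- data.frame(
  ParameterName = c("parameterA", "parameterB", "parameterC"),
  seqFrom = c(1, 5, 10),
  seqTo   = c(2, 7, 11),
  seqBy   = c(1, 1, 1),
  stringsAsFactors = FALSE
)

# make_grid is a hypothetical helper, not part of any package
make_grid <- function(params) {
  # one sequence per row; Map always returns a list
  seqs <- Map(seq, params$seqFrom, params$seqTo, params$seqBy)
  names(seqs) <- as.character(params$ParameterName)
  expand.grid(seqs, KEEP.OUT.ATTRS = FALSE, stringsAsFactors = FALSE)
}

grid <- make_grid(dfParameterValues)
nrow(grid)   # 2 * 3 * 2 combinations
names(grid)
```

Note that expand.grid varies the first column fastest, so the rows come out in a different order than the table in the question; sort afterwards if the order matters.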

How to get labels from hclust result

Let's say I have a dataset like this:
dt<-data.frame(id=1:4,X=sample(4),Y=sample(4))
and then I try to do hierarchical clustering using the code below:
dis<-dist(dt[,-1])
clusters <- hclust(dis)
plot(clusters)
and it works well.
The point is, when I ask for
clusters$labels
it gives me NULL, whereas I expect to see the labels of the individuals in order, like
1, 4, 2, 3
It is important to have them in the order in which they appear in the plot.
Use clusters$order rather than clusters$labels if you have not assigned any labels.
In fact, you can see all the components of the result by calling summary:
clusters <- hclust(dis)
plot(clusters)
summary(clusters)
clusters$order
You can compare with the plot I got at my end; it is of course a little different from yours, because the data are sampled randomly.
My outcome:
> clusters$order
[1] 4 1 2 3
Content of summary command:
> summary(clusters)
Length Class Mode
merge 6 -none- numeric
height 3 -none- numeric
order 4 -none- numeric
labels 0 -none- NULL
method 1 -none- character
call 2 -none- call
dist.method 1 -none- character
You can see that labels is NULL, which is why you are not getting any labels. To get labels you need to assign them first, either with clusters$labels <- c("A","B","C","D") or by setting row names on the data. Once labels are assigned, the plot shows the names/labels instead of the row numbers.
In my case I have not assigned any names, hence I get the numbers instead.
You can put the labels in the plot function itself as well.
From the documentation ?hclust
labels
A character vector of labels for the leaves of the tree. By
default the row names or row numbers of the original data are used. If
labels = FALSE no labels at all are plotted.
You could use the following code:
# your data, I changed the id to characters to make it more clear
set.seed(1234) # for reproducibility
dt<-data.frame(id=c("A", "B", "C", "D"),X=sample(4),Y=sample(4))
dt
# your code, no labels
dis<-dist(dt[,-1])
clusters <- hclust(dis)
clusters$labels
# add labels, plot and check labels
clusters$labels <- dt$id
plot(clusters)
## labels in the order plotted
clusters$labels[clusters$order]
## [1] A D B C
## Levels: A B C D
Please let me know whether this is what you want.
Please make sure you use rownames(...) to ensure your data has labels
> rownames(dt) <- dt$id
> dt
id X Y
1 1 2 1
2 2 4 3
3 3 1 2
4 4 3 4
> dis<-dist(dt[,-1])
> clusters <- hclust(dis)
> str(clusters)
List of 7
$ merge : int [1:3, 1:2] -1 -2 1 -3 -4 2
$ height : num [1:3] 1.41 1.41 3.16
$ order : int [1:4] 1 3 2 4
$ labels : chr [1:4] "1" "2" "3" "4"
$ method : chr "complete"
$ call : language hclust(d = dis)
$ dist.method: chr "euclidean"
- attr(*, "class")= chr "hclust"
>
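Putting the two answers together, here is a small sketch with fixed (not sampled) data so it is reproducible. The values in X and Y are made up for illustration. Setting row names before calling dist means hclust picks the labels up on its own, and indexing the labels by clusters$order gives them in the order they are drawn in the plot.

```r
# illustrative data, not the OP's sampled values
dt <- data.frame(id = c("A", "B", "C", "D"),
                 X = c(1, 2, 8, 9),
                 Y = c(1, 2, 8, 9),
                 stringsAsFactors = FALSE)
rownames(dt) <- dt$id

clusters <- hclust(dist(dt[, -1]))

clusters$labels                  # taken from the row names
clusters$order                   # leaf positions, left to right in the plot
clusters$labels[clusters$order]  # labels in plotted order
```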

R nested for loop with as.numeric runs but does not change to num

Unfortunately, I was not able to find the solution to my question on here; thus, this new question.
Here is what I am trying to do:
Read in a table from Wikipedia with the xml2 package and put it in a data frame. So far so good. Now, I would like to convert the chr items into num with as.numeric, leaving out the first column and the first row in every column.
if(!require("pacman")) install.packages("pacman")
pacman::p_load(rvest, dplyr, xml2)
uebergewicht <- read_html("https://de.wikipedia.org/wiki/%C3%9Cbergewicht")
uebergewicht <- uebergewicht %>%
html_nodes("table") %>%
.[[2]] %>% # table number two at link
html_table(fill=TRUE)
for (i in 2:5){
for (j in 2:6){
uebergewicht[j,i] <- as.numeric(uebergewicht[j,i])
}
}
The code runs without any errors but leaves the chr items unchanged. What am I doing wrong?
It is because each column of a data.frame is a vector, and all elements of a vector must be the same type. Since you are not changing all of the elements to type numeric, the class of the entire vector remains chr. See the example below.
vector<-c("a", as.numeric(1), as.numeric(3), as.numeric(5))
str(vector)
chr [1:4] "a" "1" "3" "5"
vector[2:4]<-as.numeric(vector[2:4])
str(vector)
chr [1:4] "a" "1" "3" "5"
If you tried something like vector<-as.numeric(vector) you would get:
Warning message:
NAs introduced by coercion
And you will lose some of the information stored in your vector.
str(vector)
num [1:4] NA 1 3 5
Thanks for your comments!
I was able to solve it via subsetting:
uebergewicht <- read_html("https://de.wikipedia.org/wiki/%C3%9Cbergewicht")
uebergewicht <- uebergewicht %>%
html_nodes("table") %>%
.[[2]] %>%
html_table(fill=TRUE)
ueber_sub <- uebergewicht[2:6, 2:5]
names(ueber_sub) <- c("jungen.ueber", "jungen.adi", "mädchen.ueber", "mädchen.adi")
for (i in 1:4){
ueber_sub[,i] <- as.numeric(ueber_sub[,i])
}
> str(ueber_sub)
'data.frame': 5 obs. of 4 variables:
$ jungen.ueber : num 6.4 8.9 11.3 9 8.8
$ jungen.adi : num 2.5 7 7 8.2 6.3
$ mädchen.ueber: num 6 9 11.6 8.1 8.5
$ mädchen.adi : num 3.3 5.7 7.3 8.9 6.4
Through subsetting, we get rid of the first line of each column, which would cause the column-vector to stay chr. I simply recoded the column names to reflect the information from row 1. That way, everything that goes into the for-loop is converted to num.
I didn't have any issues with the packages by the way.
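The same behaviour can be reproduced offline with a toy data frame, which may make it easier to see why the cell-by-cell loop had no effect. The data here are invented for illustration (one header-like string on top of numeric strings); it is a sketch of the idea, not the Wikipedia table.

```r
# toy stand-in for the scraped table: first row is text, the rest are numbers
df <- data.frame(a = c("header", "1.5", "2.5"),
                 b = c("header", "3", "4"),
                 stringsAsFactors = FALSE)

# assigning numbers into part of a character column coerces them back to character
df[2:3, "a"] <- as.numeric(df[2:3, "a"])
class(df$a)

# drop the text row first, then convert whole columns at once
sub <- df[-1, , drop = FALSE]
sub[] <- lapply(sub, as.numeric)
str(sub)
```

Using sub[] <- lapply(...) keeps the data frame structure and avoids the explicit for loop over columns.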

Converting from a character to a numeric data frame

I have a character data frame in R which has NaNs in it. I need to remove any row with a NaN and then convert it to a numeric data frame.
If I just do as.numeric on the data frame, I run into the following
Error: (list) object cannot be coerced to type 'double'
As @thijs van den bergh points you to,
dat <- data.frame(x=c("NaN","2"),y=c("NaN","3"),stringsAsFactors=FALSE)
dat <- as.data.frame(sapply(dat, as.numeric)) #<- sapply is here
dat[complete.cases(dat), ]
# x y
#2 2 3
Is one way to do this.
Your error comes from trying to make a data.frame numeric. The sapply option I show is instead making each column vector numeric.
Note that data.frames are not numeric or character, but rather are a list which can be all numeric columns, all character columns, or a mix of these or other types (e.g.: Date/logical).
dat <- data.frame(x=c("NaN","2"),y=c("NaN","3"),stringsAsFactors=FALSE)
is.list(dat)
# [1] TRUE
The example data just has two character columns:
> str(dat)
'data.frame': 2 obs. of 2 variables:
$ x: chr "NaN" "2"
$ y: chr "NaN" "3"
...which you could add a numeric column to like so:
> dat$num.example <- c(6.2,3.8)
> dat
x y num.example
1 NaN NaN 6.2
2 2 3 3.8
> str(dat)
'data.frame': 2 obs. of 3 variables:
$ x : chr "NaN" "2"
$ y : chr "NaN" "3"
$ num.example: num 6.2 3.8
So, when you try to do as.numeric R gets confused because it is wondering how to convert this list object which may have multiple types in it. user1317221_G's answer uses the ?sapply function, which can be used to apply a function to the individual items of an object. You could alternatively use ?lapply which is a very similar function (read more on the *apply functions here - R Grouping functions: sapply vs. lapply vs. apply. vs. tapply vs. by vs. aggregate )
I.e. - in this case, to each column of your data.frame, you can apply the as.numeric function, like so:
data.frame(lapply(dat,as.numeric))
The lapply call is wrapped in a data.frame to make sure the output is a data.frame and not a list. That is, running:
lapply(dat,as.numeric)
will give you:
> lapply(dat,as.numeric)
$x
[1] NaN 2
$y
[1] NaN 3
$num.example
[1] 6.2 3.8
While:
data.frame(lapply(dat,as.numeric))
will give you:
> data.frame(lapply(dat,as.numeric))
x y num.example
1 NaN NaN 6.2
2 2 3 3.8
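For the original question (convert, then drop rows with NaN), the two steps can be combined as in the sketch below. It assigns into dat[] so the object stays a data.frame without re-wrapping, and uses drop = FALSE so a single surviving row is still returned as a data.frame. The name clean is just illustrative.

```r
dat <- data.frame(x = c("NaN", "2"), y = c("NaN", "3"),
                  stringsAsFactors = FALSE)

# convert every column in place; dat[] keeps the data.frame structure
dat[] <- lapply(dat, as.numeric)

# complete.cases() treats NaN as missing, so this drops the first row
clean <- dat[complete.cases(dat), , drop = FALSE]
str(clean)
```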

Data frame and the very common mistake while using character columns

A very unexpected behavior of the useful data.frame in R arises from character columns being converted to factors. This causes many problems if it is not taken into account. For example, consider the following code:
foo=data.frame(name=c("c","a"),value=1:2)
# name value
# 1 c 1
# 2 a 2
bar=matrix(1:6,nrow=3)
rownames(bar)=c("a","b","c")
# [,1] [,2]
# a 1 4
# b 2 5
# c 3 6
Then what do you expect of running bar[foo$name,]? It normally should return the rows of bar that are named according to the foo$name that means rows 'c' and 'a'. But the result is different:
bar[foo$name,]
# [,1] [,2]
# b 2 5
# a 1 4
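The reason is that foo$name is not a character vector but a factor, which is stored as integer codes. Indexing with it uses those codes (c is level 2, a is level 1), not the labels.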
foo$name
# [1] c a
# Levels: a c
To have the expected behavior, I manually convert it to character vector:
foo$name = as.character(foo$name)
bar[foo$name,]
# [,1] [,2]
# c 3 6
# a 1 4
But the problem is that we may easily forget to do this and end up with hidden bugs in our code. Is there any better solution?
This is a feature and R is working as documented. This can be dealt with generally in a few ways:
use the argument stringsAsFactors = FALSE in the call to data.frame(). See ?data.frame
if you detest this behaviour so, set the option globally via
options(stringsAsFactors = FALSE)
(as noted by @JoshuaUlrich in comments) a third option is to wrap character variables in I(....). This alters the class of the object being assigned to the data frame component to include "AsIs". In general this shouldn't be a problem as the object inherits (in this case) the class "character" so should work as before.
You can check what the default for stringsAsFactors is on the currently running R process via:
> default.stringsAsFactors()
[1] TRUE
The issue is slightly wider than data.frame() in scope as this also affects read.table(). In that function, as well as the two options above, you can also tell R what all the classes of the variables are via argument colClasses and R will respect that, e.g.
> tmp <- read.table(text = '"Var1","Var2"
+ "A","B"
+ "C","C"
+ "B","D"', header = TRUE, colClasses = rep("character", 2), sep = ",")
> str(tmp)
'data.frame': 3 obs. of 2 variables:
$ Var1: chr "A" "C" "B"
$ Var2: chr "B" "C" "D"
In the example data below, author and title are automatically converted to factor (unless you add the argument stringsAsFactors = FALSE when you are creating the data). What if we forgot to change the default setting and don't want to set the options globally?
Some code I found somewhere (most likely SO) uses sapply() to identify factors and convert them to strings.
dat = data.frame(title = c("title1", "title2", "title3"),
author = c("author1", "author2", "author3"),
customerID = c(1, 2, 1))
# > str(dat)
# 'data.frame': 3 obs. of 3 variables:
# $ title : Factor w/ 3 levels "title1","title2",..: 1 2 3
# $ author : Factor w/ 3 levels "author1","author2",..: 1 2 3
# $ customerID: num 1 2 1
dat[sapply(dat, is.factor)] = lapply(dat[sapply(dat, is.factor)],
as.character)
# > str(dat)
# 'data.frame': 3 obs. of 3 variables:
# $ title : chr "title1" "title2" "title3"
# $ author : chr "author1" "author2" "author3"
# $ customerID: num 1 2 1
I assume this would be faster than re-reading in the dataset with the stringsAsFactors = FALSE argument, but have never tested.
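Here is a self-contained sketch of the indexing pitfall from the question. It passes stringsAsFactors = TRUE explicitly so it behaves the same on any R version (since R 4.0.0 the default for data.frame() is FALSE, so newer installations will not show the problem unless you ask for factors).

```r
# force factor conversion so the example is version-independent
foo <- data.frame(name = c("c", "a"), value = 1:2, stringsAsFactors = TRUE)
bar <- matrix(1:6, nrow = 3, dimnames = list(c("a", "b", "c"), NULL))

# a factor index is used via its integer codes (c = 2, a = 1)
by_factor <- rownames(bar[foo$name, ])
by_factor   # "b" "a"

# converting to character indexes by name, as intended
by_name <- rownames(bar[as.character(foo$name), ])
by_name     # "c" "a"
```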
