R lattice -- unexpected legend - r

I want to graphically compare carbon pools under different forms of land use.
I read in my data and
later remove a factor level.
However, when I plot the data the legend is wrong.
The first legend symbol has no label, and
the deleted factor level (SPEZIALKULTUREN) is still shown.
How can I control the legend?
A sample of the data is shown below.
[image: sample of the data]
agdata <- read.csv("C:/Jandl/LfdProjekte/2017FAO/2020Paper/Agdata.csv", header = FALSE, sep = ";", dec = ",")
colnames(agdata) <- c("State","AgType","Cpool")
agdata$Cpool <- as.numeric(agdata$Cpool)
levels(agdata$AgType)[levels(agdata$AgType)=="ACKERLAND"] <- "crop"
levels(agdata$AgType)[levels(agdata$AgType)=="ALMEN"] <- "alpine meadow"
levels(agdata$AgType)[levels(agdata$AgType)=="EXTENSIVES GRÜNLAND *"] <- "extens grassland"
levels(agdata$AgType)[levels(agdata$AgType)=="INTENSIVES GRÜNLAND**"] <- "intens grassland"
levels(agdata$AgType)[levels(agdata$AgType)=="WEINGARTEN***"] <- "vineyard"
#dropping SPEZIALKULTUREN
agdata <- subset(agdata,agdata$AgType !="SPEZIALKULTUREN")
agdata <- subset(agdata,agdata$State !="")
agdata <- subset(agdata,agdata$State !="Wien")
library(lattice)
colors = c("lightsalmon3", "lightgoldenrod2", "cadetblue4", "yellow", "red", "blue")
barchart(
data = agdata,
origin = 0,
Cpool ~ State,
groups = AgType ,
xlab = list (
label = "State",
font = 2,
cex = 1),
ylab= list (
label = "C pool t/ha",
font = 2,
cex = 1),
#ylim=c(0,25),
labels = TRUE,
auto.key = list(space="top", columns= 3),
par.settings = list(superpose.polygon = list(col = colors)))

I am not sure, but it has happened to me that, after using the subset function, removed factor levels remain (empty) in the data frame and occasionally show up in plots. For example, consider this simple dataset named "myexample":
> myexample
var1 var2
1 1 a
2 5 a
3 6 b
4 3 b
5 7 c
Now I can subset it to keep only rows where var2 is either a or b
> myexample2<-subset(myexample, var2=="a" | var2=="b")
> myexample2
var1 var2
1 1 a
2 5 a
3 6 b
4 3 b
> levels(myexample2$var2)
[1] "a" "b" "c"
It looks as if "c" was dropped, but the level is still there. To properly get rid of it, you can use the function droplevels():
> myexample2<-droplevels(myexample2)
> myexample2
var1 var2
1 1 a
2 5 a
3 6 b
4 3 b
> levels(myexample2$var2)
[1] "a" "b"
And now it is really gone. Maybe give it a try and see if at least the removed factor level is no longer in the plot.
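As a hedged illustration of how this applies to the question's data, the sketch below uses a small made-up stand-in for agdata (the state names and values are invented). After subset() the unused level is still present, and droplevels() removes it, which should keep it out of a legend that is built from the factor levels.

```r
# Made-up stand-in for the question's agdata (values are illustrative only)
agdata <- data.frame(
  State  = c("Tirol", "Tirol", "Wien", "Salzburg"),
  AgType = factor(c("crop", "SPEZIALKULTUREN", "crop", "vineyard")),
  Cpool  = c(50, 60, 55, 40)
)

# Row filtering alone keeps every original level of the factor
agdata <- subset(agdata, AgType != "SPEZIALKULTUREN" & State != "Wien")
levels_before <- levels(agdata$AgType)
print(levels_before)   # still includes "SPEZIALKULTUREN"

# droplevels() discards levels that no longer occur in the data
agdata <- droplevels(agdata)
levels_after <- levels(agdata$AgType)
print(levels_after)
```

Calling droplevels() once, after the last subset() and before barchart(), is usually enough because it works on every factor column of the data frame.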


Control which nodes to change size igraph

This may look simple but I am not able to do this. I want to plot two types of nodes, small and big using a cutoff. The values are in the attribute degree.
Here is a small toy example
g1 <- graph(edges=c(1,2, 2,3, 3, 1, 4,2), n=4, directed=F) %>%
set_vertex_attr("names", value = LETTERS[1:4])
g1_degree <- degree(g1, mode = "total")
g1_degree
[1] 2 3 2 1
g1 <- set_vertex_attr(g1, "degree", value = g1_degree)
plot(g1, vertex.size=V(g1)$degree)
This gives me every node according to the degree, but I want nodes of degree 2 and 3 big and 1 small.
So I tried to edit the values within V(g1)$degree
ifelse(V(g1)$degree < 2, yes = V(g1)$degree==1, no = V(g1)$degree==5)
FALSE FALSE FALSE TRUE
Ok, I checked my degree values, but how can I overwrite the TRUE or FALSE using the cutoffs I need?
Here are two solutions.
One with ifelse, like in the question.
g1 <- set_vertex_attr(g1, "degree", value = ifelse(V(g1)$degree < 2, 1, 5))
V(g1)$degree
#[1] 5 5 5 1
And another with findInterval. This has better performance than ifelse, which can be important in large data sets.
i <- findInterval(V(g1)$degree, c(0, 2, Inf))
g1 <- set_vertex_attr(g1, "degree", value = c(1, 5)[i])
V(g1)$degree
#[1] 5 5 5 1
With different new sizes set with the findInterval index, c(10, 50)[i], the graph can be plotted as follows.
g1 <- set_vertex_attr(g1, "degree", value = c(10, 50)[i])
plot(g1, vertex.size = V(g1)$degree)
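The two mappings can also be checked on a plain vector, without igraph. This is a minimal sketch that reuses the degrees from the toy graph; the three-size variant and its sizes (5, 15, 30) are assumptions added only to show how findInterval extends to more cutoffs.

```r
# Degrees from the toy graph in the question
deg <- c(2, 3, 2, 1)

# Two sizes: below 2 is small (1), everything else is big (5)
size_ifelse   <- ifelse(deg < 2, 1, 5)
size_interval <- c(1, 5)[findInterval(deg, c(0, 2, Inf))]

# Three sizes only need one more break point and one more size
# (the sizes 5, 15, 30 are arbitrary illustration values)
size_three <- c(5, 15, 30)[findInterval(deg, c(0, 2, 3, Inf))]

print(size_ifelse)
print(size_interval)
print(size_three)
```

With ifelse, a third size would need a nested call, whereas findInterval only needs a longer vector of breaks.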

ggpairs only plotting 1 of 5 plots then error

I am getting the below error when trying to plot the dat data frame
library(GGally)
library(ggplot2)
dat = data.frame(a=rnorm(5) , b= rnorm(5) ,c =rnorm(5) , d=rnorm(5) , e= c(1,2,3,4,5))
dat
a b c d e
1 0.21444531 1.9972134 2.1988103 -0.47624689 1
2 -0.32468591 0.6007088 1.3124130 -0.78860284 2
3 0.09458353 -1.2512714 -0.2651451 -0.59461727 3
4 -0.89536336 -0.6111659 0.5431941 1.65090747 4
5 -1.31080153 -1.1854801 -0.4143399 -0.05402813 5
ggpairs(dat ,mapping=aes(color =e),upper=list(continuous=wrap("cor",size=2)), columns = c("a","b","c","d"))
Error:
Error in `$<-.data.frame`(`*tmp*`, "label", value = ": ") :
  replacement has 1 row, data has 0
I would like to color the data points using column "e"
Any ideas?
If you factorize e then it runs:
dat$e <- factor(dat$e)
ggpairs(dat,mapping=aes(color=e),upper=list(continuous=wrap("cor",size=2)), columns = c("a","b","c","d"))
But that is a pretty ugly figure not to mention a useless comparison.
If you eliminate the mapping then the code also runs fine:
ggpairs(dat,upper=list(continuous=wrap("cor",size=2)), columns = c("a","b","c","d"))

Label Encoder functionality in R?

In Python, scikit-learn has a great class called LabelEncoder that maps categorical levels (strings) to an integer representation.
Is there anything in R to do this? For example if there is a variable called color with values {'Blue','Red','Green'} the encoder would translate:
Blue => 1
Green => 2
Red => 3
and create an object with this mapping to then use for transforming new data in a similar fashion.
Add:
It doesn't seem like just factors will work because there is no persisting of the mapping. If the new data has an unseen level from the training data, the entire structure changes. Ideally I would like the new levels labeled missing or 'other' somehow.
sample_dat <- data.frame(a_str=c('Red','Blue','Blue','Red','Green'))
sample_dat$a_int<-as.integer(as.factor(sample_dat$a_str))
sample_dat$a_int
#[1] 3 1 1 3 2
sample_dat2 <- data.frame(a_str=c('Red','Blue','Blue','Red','Green','Azure'))
sample_dat2$a_int<-as.integer(as.factor(sample_dat2$a_str))
sample_dat2$a_int
# [1] 4 2 2 4 3 1
Create your vector of data:
colors <- c("red", "red", "blue", "green")
Create a factor:
factors <- factor(colors)
Convert the factor to numbers:
as.numeric(factors)
Output: (note that this is in alphabetical order)
# [1] 3 3 1 2
You can also set a custom numbering system: (note that the output now follows the "rainbow color order" that I defined)
rainbow <- c("red","orange","yellow","green","blue","purple")
ordered <- factor(colors, levels = rainbow)
as.numeric(ordered)
# [1] 1 1 5 4
See ?factor.
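Regarding the concern in the question that a plain factor does not persist the mapping: one possible approach is to store the levels learned from the training data and pass them to factor() again for new data. The sketch below assumes a hypothetical helper named encode and assumes that all unseen values should share one extra "other" code; neither is part of base R.

```r
# Levels learned from the training data act as the stored mapping
train <- c("Red", "Blue", "Blue", "Red", "Green")
lvls  <- sort(unique(train))

# Hypothetical helper: encode x with a fixed set of levels
encode <- function(x, lvls, other = TRUE) {
  codes <- as.integer(factor(x, levels = lvls))
  # Assumption: unseen values (NA after factor()) share one extra code
  if (other) codes[is.na(codes)] <- length(lvls) + 1L
  codes
}

new_data    <- c("Red", "Blue", "Azure", "Green")
train_codes <- encode(train, lvls)
new_codes   <- encode(new_data, lvls)                 # "Azure" gets the extra code
new_na      <- encode(new_data, lvls, other = FALSE)  # "Azure" becomes NA

print(train_codes)
print(new_codes)
print(new_na)
```

Because lvls is an ordinary character vector, it can be saved with saveRDS() and reused later, so the codes for known values stay stable when new values appear.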
Try CatEncoders package. It replicates the Python sklearn.preprocessing functionality.
library(CatEncoders)
# variable to encode values
colors = c("red", "red", "blue", "green")
lab_enc = LabelEncoder.fit(colors)
# new values are transformed to NA
values = transform(lab_enc, c('red', 'red', 'yellow'))
values
# [1] 3 3 NA
# doing the inverse: given the encoded numbers return the labels
inverse.transform(lab_enc, values)
# [1] "red" "red" NA
I would add the functionality of reporting the non-matching labels with a warning.
PS: It also has the OneHotEncoder function.
If I understand correctly what you want:
# function which returns function which will encode vectors with values of 'vec'
label_encoder = function(vec){
levels = sort(unique(vec))
function(x){
match(x, levels)
}
}
colors = c("red", "red", "blue", "green")
color_encoder = label_encoder(colors) # create encoder
encoded_colors = color_encoder(colors) # encode colors
encoded_colors
new_colors = c("blue", "green", "green") # new vector
encoded_new_colors = color_encoder(new_colors)
encoded_new_colors
other_colors = c("blue", "green", "green", "yellow")
color_encoder(other_colors) # NA's are introduced
# save and restore to disk
saveRDS(color_encoder, "color_encoder.RDS")
c_encoder = readRDS("color_encoder.RDS")
c_encoder(colors) # same result
# dealing with multiple columns
# create data.frame
set.seed(123) # make result reproducible
color_dataframe = as.data.frame(
matrix(
sample(c("red", "blue", "green", "yellow"), 12, replace = TRUE),
ncol = 3)
)
color_dataframe
# encode each column
for (column in colnames(color_dataframe)){
color_dataframe[[column]] = color_encoder(color_dataframe[[column]])
}
color_dataframe
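The closure above only encodes. A possible extension, sketched here under the assumption that an inverse mapping is also wanted (as with sklearn's inverse_transform), returns a small list of functions that share the same levels. The name make_label_encoder and the list layout are hypothetical.

```r
# Hypothetical helper: build paired encode/decode functions over fixed levels
make_label_encoder <- function(vec) {
  lv <- sort(unique(vec))
  list(
    encode = function(x) match(x, lv),  # unseen values become NA
    decode = function(i) lv[i],         # NA codes decode to NA
    levels = lv
  )
}

enc     <- make_label_encoder(c("red", "red", "blue", "green"))
codes   <- enc$encode(c("green", "yellow", "red"))
decoded <- enc$decode(codes)

print(codes)
print(decoded)
```

The returned list can be stored with saveRDS() in the same way as the single function above.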
I wrote the following, which I think works; its efficiency and how it scales are not yet tested.
str2Int.fit_transform<-function(df, plug_missing=TRUE){
list_of_levels=list() #empty list
#loop through the columns
for (i in 1: ncol(df))
{
#only
if (is.character(df[,i]) || is.factor(df[,i]) ){
#deal with missing
if(plug_missing){
#if factor
if (is.factor(df[,i])){
df[,i] = factor(df[,i], levels=c(levels(df[,i]), 'MISSING'))
df[,i][is.na(df[,i])] = 'MISSING'
}else{ #if character
df[,i][is.na(df[,i])] = 'MISSING'
}
}#end missing IF
levels<-unique(df[,i]) #distinct levels
list_of_levels[[colnames(df)[i]]] <- levels #set list with name of the columns to the levels
df[,i] <- as.numeric(factor(df[,i], levels = levels))
}#end if character/factor IF
}#end loop
return (list(list_of_levels,df)) #return the list of levels and the new DF
}#end of function
str2Int.transform<-function(df,list_of_levels,plug_missing=TRUE)
{
#loop through the columns
for (i in 1: ncol(df))
{
#only
if (is.character(df[,i]) || is.factor(df[,i]) ){
#deal with missing
if(plug_missing){
#if factor
if (is.factor(df[,i])){
df[,i] = factor(df[,i], levels=c(levels(df[,i]), 'MISSING'))
df[,i][is.na(df[,i])] = 'MISSING'
}else{ #if character
df[,i][is.na(df[,i])] = 'MISSING'
}
}#end missing IF
levels=list_of_levels[[colnames(df)[i]]]
if (! is.null(levels)){
df[,i] <- as.numeric(factor(df[,i], levels = levels))
}
}# character or factor
}#end of loop
return(df)
}#end of function
######################################################
# Test the functions
######################################################
###Test fit transform
# as strings
sample_dat <- data.frame(a_fact=c('Red','Blue','Blue',NA,'Green'), a_int=c(1,2,3,4,5), a_str=c('a','b','c','a','v'),stringsAsFactors=FALSE)
result<-str2Int.fit_transform(sample_dat)
result[[1]] #list of levels
result[[2]] #transformed df
#as factors
sample_dat <- data.frame(a_fact=c('Red','Blue','Blue',NA,'Green'), a_int=c(1,2,3,4,5), a_str=c('a','b','c','a','v'),stringsAsFactors=TRUE)
result<-str2Int.fit_transform(sample_dat)
result[[1]] #list of levels
result[[2]] #transformed df
###Test transform
str2Int.transform(sample_dat,result[[1]])
It's hard to believe that no one has mentioned caret's dummyVars function.
This is a widely searched question, and people don't want to write their own methods or copy and paste other users' methods; they want a package, and caret is the closest thing to sklearn in R.
EDIT: I now realize that what the user actually wants is to turn strings into integer codes, which is just as.numeric(as.factor(x)), but I'm going to leave this here because one-hot encoding is often the more appropriate way to encode categorical data.
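The answer names dummyVars but shows no call. As a stand-in that avoids any assumption about caret's interface, the sketch below produces a comparable one-hot matrix with base R's model.matrix; the small data frame d is invented for illustration.

```r
# Invented example data: one categorical column
d <- data.frame(color = factor(c("red", "red", "blue", "green")))

# "- 1" drops the intercept so that every level gets its own 0/1 column
onehot <- model.matrix(~ color - 1, data = d)
print(onehot)
```

Each row has exactly one 1, in the column for that row's level; the column names are the variable name followed by the level.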
# input P to the function below is a dataframe containing only categorical variables
numlevel <- function(P) {
n <- dim(P)[2]
for(i in 1: n) {
m <- length(unique(P[[i]]))
levels(P[[i]]) <- c(1:m)
}
return(P)
}
Q <- numlevel(P)
df<- mtcars
head(df)
df$cyl <- factor(df$cyl)
df$carb <- factor(df$carb)
vec <- sapply(df, is.factor)
catlevels <- sapply(df[vec], levels)
#store the levels for each category
#level appearing first is coded as 1, second as 2 so on
df <- sapply(df, as.numeric)
class(df) #matrix
df <- data.frame(df)
#converting back to dataframe
head(df)
# Data
Country <- c("France", "Spain", "Germany", "Spain", "Germany", "France")
Age <- c(34, 27, 30, 32, 42, 30)
Purchased <- c("No", "Yes", "No", "No", "Yes", "Yes")
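# stringsAsFactors = TRUE is needed since R 4.0.0 so that the is.factor() check below finds columns
df <- data.frame(Country, Age, Purchased, stringsAsFactors = TRUE)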
df
# Output
Country Age Purchased
1 France 34 No
2 Spain 27 Yes
3 Germany 30 No
4 Spain 32 No
5 Germany 42 Yes
6 France 30 Yes
Using CatEncoders package : Encoders for Categorical Variables
library(CatEncoders)
# Saving names of categorical variables
factors <- names(which(sapply(df, is.factor)))
# Label Encoder
for (i in factors){
encode <- LabelEncoder.fit(df[, i])
df[, i] <- transform(encode, df[, i])
}
df
# Output
Country Age Purchased
1 1 34 1
2 3 27 2
3 2 30 1
4 3 32 1
5 2 42 2
6 1 30 2
Using R base : factor function
# Label Encoder
levels <- c("France", "Spain", "Germany", "No", "Yes")
labels <- c(1, 2, 3, 1, 2)
for (i in factors){
df[, i] <- factor(df[, i], levels = levels, labels = labels, ordered = TRUE)
}
df
# Output
Country Age Purchased
1 1 34 1
2 2 27 2
3 3 30 1
4 2 32 1
5 3 42 2
6 1 30 2
Here is an easy and neat solution:
From the superml package:
https://www.rdocumentation.org/packages/superml/versions/0.5.3
There is a LabelEncoder class:
https://www.rdocumentation.org/packages/superml/versions/0.5.3/topics/LabelEncoder
install.packages("superml")
library(superml)
lbl <- LabelEncoder$new()
lbl$fit(sample_dat$column)
sample_dat$column <- lbl$fit_transform(sample_dat$column)
decode_names <- lbl$inverse_transform(sample_dat$column)

Assign values to a dataframe subset in R

I'm having trouble assigning a dataframe to a subset of another. In the example below, the line
ds[cavities,] <- join(ds[cavities,1:4], fillings, by="ZipCode", "left")
only modifies one column instead of two. I would expect it either to modify no columns or both, not only one. I wrote the function to fill in the PrefName and CountyID columns in dataframe ds where they are NA by joining ds to another dataframe cs.
As you can see if you run it, the test is failing because PrefName is not getting filled in. After doing a bit of debugging, I realized that join() is doing exactly what it is expected to do, but the actual assignment of the result of that join somehow drops the PrefName back to a NA.
# fully copy-paste-run-able (but broken) code
suppressMessages({
library("plyr")
library("methods")
library("testthat")
})
# Fill in the missing PrefName/CountyIDs in delstat
# - Find the missing values in Delstat
# - Grab the CityState Primary Record values
# - Match on zipcode to fill in the holes in the delstat data
# - Remove any codes that could not be fixed
# - #param ds: delstat dataframe with 6 columns (see test case)
# - #param cs: citystate dataframe with 6 columns (see test case)
getMissingCounties <- function(ds, cs) {
if (length(is.na(ds$CountyID))) {
cavities <- which(is.na(ds$CountyID))
fillings <- cs[cs$PrimRec==TRUE, c(1,3,4)]
ds[cavities,] <- join(ds[cavities,1:4], fillings, by="ZipCode", "left")
ds <- ds[!is.na(ds$CountyID),]
}
return(ds)
}
test_getMissingCounties <- function() {
ds <- data.frame(
CityStateKey = c(1, 2, 3, 4 ),
ZipCode = c(11, 22, 33, 44 ),
Business = c(1, 1, 1, 1 ),
Residential = c(1, 1, 1, 1 ),
PrefName = c("One", NA , NA, NA),
CountyID = c(111, NA, NA, NA))
cs <- data.frame(
ZipCode = c(11, 22, 22, 33, 55 ),
Name = c("eh", "eh?", "eh?", "eh!?", "ah." ),
PrefName = c("One", "To", "Two", "Three", "Five"),
CountyID = c(111, 222, 222, 333, 555 ),
PrimRec = c(TRUE, FALSE, TRUE, TRUE, TRUE ),
CityStateKey = c(1, 2, 2, 3, 5 ))
expected <- data.frame(
CityStateKey = c(1, 2, 3 ),
ZipCode = c(11, 22, 33 ),
Business = c(1, 1, 1 ),
Residential = c(1, 1, 1 ),
PrefName = c("One", "Two", "Three"),
CountyID = c(111, 222, 333 ))
expect_equal(getMissingCounties(ds, cs), expected)
}
# run the test
test_getMissingCounties()
The results are:
CityStateKey ZipCode Business Residential PrefName CountyID
1 11 1 1 One 111
2 22 1 1 <NA> 222
3 33 1 1 <NA> 333
Any ideas why PrefName is getting set to NA by the assignment or how to do the assignment so I don't lose data?
The short answer is that you can avoid this problem by making sure that there are no factors in your data frames. You do this by using stringsAsFactors=FALSE in the call(s) to data.frame(...). Note that many of the data import functions, including read.table(...) and read.csv(...) also convert character to factor by default. You can defeat this behavior the same way.
This problem is actually quite subtle, and is also a good example of how R's "silent coercion" between data types creates all sorts of problems.
The data.frame(...) function converts any character vectors to factors by default (this was the default before R 4.0.0). So in your code ds$PrefName is a factor with one level, and cs$PrefName is a factor with 5 levels. So in your assignment statement:
ds[cavities,] <- join(ds[cavities,1:4], fillings, by="ZipCode", "left")
the 5th column on the LHS is a factor with 1 level, and the 5th column on the RHS is a factor with 5 levels.
Under some circumstances, when you assign a factor with more levels to a factor with fewer levels, the missing levels are set to NA. Consider this:
x <- c("A","B",NA,NA,NA) # character vector
y <- LETTERS[1:5] # character vector
class(x); class(y)
# [1] "character"
# [1] "character"
df <- data.frame(x, y, stringsAsFactors = TRUE) # x and y coerced to factor (the pre-4.0.0 default)
sapply(df,class) # df$x and df$y are factors
# x y
# "factor" "factor"
# assign rows 3:5 of col 2 to col 1
df[3:5,1] <- df[3:5,2] # fails with a warning
# Warning message:
# In `[<-.factor`(`*tmp*`, iseq, value = 3:5) :
# invalid factor level, NA generated
df # missing levels set to NA
# x y
# 1 A A
# 2 B B
# 3 <NA> C
# 4 <NA> D
# 5 <NA> E
The example above is equivalent to your assignment statement. However, notice what happens if you assign all of column 2 to column 1.
# assign all of col 2 to col 1
df <- data.frame(x, y, stringsAsFactors = TRUE)
df[,1] <- df[,2] # succeeds!!
df
# x y
# 1 A A
# 2 B B
# 3 C C
# 4 D D
# 5 E E
This works.
Finally, a note on debugging: if you are debugging a function, sometimes it is useful to run through the statements line by line at the command line (e.g., in the global environment). If you did that, you would have gotten the warning above, whereas inside a function call the warnings are suppressed.
The constraints of the test can be satisfied by reimplementing getMissingCounties with:
merge(ds[1:4], subset(cs, PrimRec)[c(1, 3, 4)], by="ZipCode")
Caveat: the ZipCode column is always emitted first, which differs from your expected result.
But to answer the subassignment question: it breaks, because the level sets of PrefName are incompatible between ds and cs. Either avoid using a factor or relevel them. You might have missed R's warning about this, because testthat was somehow suppressing warnings.
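Both answers suggest either avoiding factors or aligning the level sets. A minimal sketch of each option on the small x/y example follows; stringsAsFactors = TRUE is set explicitly so that the factor behaviour is reproduced regardless of the R version, and the variable names as_char and releveled are only illustrative.

```r
# Reproduce the factor columns from the example above
df <- data.frame(x = c("A", "B", NA, NA, NA),
                 y = LETTERS[1:5],
                 stringsAsFactors = TRUE)

# Option 1: work with character columns, so there are no levels to conflict
as_char <- df
as_char[] <- lapply(as_char, as.character)
as_char[3:5, "x"] <- as_char[3:5, "y"]

# Option 2: keep factors, but extend the target's levels before assigning
releveled <- df
levels(releveled$x) <- union(levels(releveled$x), levels(releveled$y))
releveled[3:5, "x"] <- releveled[3:5, "y"]

print(as_char)
print(releveled)
```

In both cases rows 3 to 5 of x receive C, D and E instead of NA.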

Define factors whose levels depend on another variable

Consider this mock data:
set.seed(20120220)
x <- c(rep("a", 4), rep("b", 4))
y <- c(sample(c(1, 2), 8, replace = TRUE))
z <- data.frame(cbind(x, y))
Data frame z will look like this:
x y
1 a 1
2 a 1
3 a 1
4 a 2
5 b 2
6 b 1
7 b 2
8 b 2
I want to run something akin to factor(z$y, levels = 1:2, labels = c("alpha", "beta")), but I don't want every 1 to become alpha and every 2 to become beta. I want that to happen only for x = a. If x = b, I want 1 to become gamma and 2 to become delta.
In other words, I want my data frame to look like this:
x y
1 a alpha
2 a alpha
3 a alpha
4 a beta
5 b delta
6 b gamma
7 b delta
8 b delta
This is what I came up with so far:
for (i in 1:nrow(z)) {
if (z$x[i] == "a")
z$y[i] <- factor(z$y[i], levels = 1:2, labels = c("alpha", "beta"))
else
z$y[i] <- factor(z$y[i], levels = 1:2, labels = c("gamma", "delta"))
}
But it gives me several warning messages (one for each i) like this:
Warning messages:
1: In `[<-.factor`(`*tmp*`, i, value = c(NA, 1L, 1L, 2L, 2L, 1L, 2L, :
invalid factor level, NAs generated
And then, when I call z again, the data frame is a mess, every y has been made into <NA>.
I bet there's a simple solution for this, but I've been trying several approaches for hours to no avail. My head is about to explode! Help!
> z$ynew <- ifelse(z$x == "a", ifelse(z$y == 1, "alpha", "beta"),
                   ifelse(z$y == 1, "gamma", "delta"))
> z
x y ynew
1 a 1 alpha
2 a 1 alpha
3 a 1 alpha
4 a 2 beta
5 b 2 delta
6 b 1 gamma
7 b 2 delta
8 b 2 delta
If you want 'ynew' to be a factor then just: z$ynew <- factor(z$ynew)
What about using merge?
# define x and y to 'alpha', 'beta' etc. correspondences
# (it's just one row for each possible factor)
auxDf <- data.frame( x = c('a', 'a', 'b', 'b' ),
y = c( 1, 2, 1, 2 ),
newy= c('alpha', 'beta', 'gamma', 'delta'))
# merge the 2 data.frame getting a new data.frame with the factors column
newDf <- merge(z,auxDf)
newDf
Here's one additional step to make the previous answer even a bit quicker -
you can use 'unique' to pull out all the unique combinations in a data frame.
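auxDf <- unique(z)
# sort the combinations so that the labels below line up with (a,1), (a,2), (b,1), (b,2)
auxDf <- auxDf[order(auxDf$x, auxDf$y), ]
auxDf$newy <- c('alpha', 'beta', 'gamma', 'delta')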
Then, as in the previous post
newDf <- merge(z,auxDf)
newDf
I've managed to come up with a solution that works, even though it is quite messy.
First, create subsets of the data frame z for each x
z1 <- subset(z, x == "a")
z2 <- subset(z, x == "b")
Then, apply factor() to each subset:
z1$y <- factor(z1$y, levels = 1:2, labels = c("alpha", "beta"))
z2$y <- factor(z2$y, levels = 1:2, labels = c("gamma", "delta"))
And finally, reunite the subsets into the original object.
z <- rbind(z1, z2)
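Another option, sketched here as an alternative to the answers above, is a named lookup vector keyed on the combination of x and y. The data frame is typed in directly from the table in the question (rather than regenerated with sample(), whose output depends on the R version), and the "x.y" key format is an arbitrary choice.

```r
# The data exactly as printed in the question
z <- data.frame(x = rep(c("a", "b"), each = 4),
                y = c(1, 1, 1, 2, 2, 1, 2, 2))

# One entry per (x, y) combination; names follow the pattern "x.y"
lookup <- c("a.1" = "alpha", "a.2" = "beta",
            "b.1" = "gamma", "b.2" = "delta")

# Build the key for each row and look up its label
z$ynew <- unname(lookup[paste(z$x, z$y, sep = ".")])
print(z)
```

Unlike merge, this keeps the original row order, and factor(z$ynew) converts the result to a factor if one is needed.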
