I would like to have a column in my barplot for missing data.
adult <- read.csv(
"http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
header = FALSE,
na.strings = "?",
strip.white = TRUE
)
colnames(adult) <- c("age", "workClass", "fnlwgt", "education", "educationNum", "maritalStatus", "occupation", "relationship", "race", "sex", "capitalGain", "capitalLoss", "hoursPerWeek", "nativeCountry", "prediction")
barplot(table(adult$workClass), main="Job Distribution", xlab="Job", ylab="Count",las=2)
I know that in this dataset, there are 1836 missing values for workClass, from
length(which(is.na(adult$workClass)))
You can use the argument useNA = "ifany" in table.
tab <- table(adult$workClass, useNA = "ifany")
# Federal-gov Local-gov Never-worked Private
# 960 2093 7 22696
# Self-emp-inc Self-emp-not-inc State-gov Without-pay
# 1116 2541 1298 14
# <NA>
# 1836
By default, the name of the NA count is NA itself. You can change the name to the character string "NA" with the following command.
names(tab)[is.na(names(tab))] <- "NA"
Now, the plot displays the name "NA" on the x axis too.
barplot(tab, main = "Job Distribution", xlab = "Job", ylab = "Count", las = 2)
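The same steps can be tried on a small vector without downloading the data. This is only a minimal sketch: `work_class` is made-up illustration data, not part of the adult dataset.

```r
# Toy stand-in for adult$workClass (hypothetical values)
work_class <- c("Private", NA, "State-gov", "Private", NA, "Local-gov")

# Count every category and keep a slot for missing values
tab <- table(work_class, useNA = "ifany")

# The NA count has a missing name; replace it with the string "NA"
names(tab)[is.na(names(tab))] <- "NA"

print(tab)
# barplot(tab, las = 2) would now draw a bar labelled "NA"
```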
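You can combine useNA = "ifany" in table() with names.arg in barplot():
barplot(table(adult$workClass, useNA = "ifany"),
names.arg = c(levels(factor(adult$workClass)), "NA's"))
c(levels(factor(adult$workClass)), "NA's") creates a vector that contains the names of all the levels (categories) of the variable, followed by the custom name "NA's" to represent the missing values. The factor() call is needed because read.csv() in R 4.0.0 and later reads strings as character vectors by default, and levels() returns NULL for a character vector.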
I am trying to apply conditional formatting to cells whose values fall between two bounds, using library(openxlsx). The problem is the first conditional formatting rule, rule = c(0.015,0.020): it doesn't work.
library(openxlsx)
library(scales) # provides label_percent()
## Dataframe
cost_table <- read.table(text = "FRUIT COST SUPPLY_RATE
1 APPLE 15 0.026377
2 ORANGE 14 0.01122
3 KIWI 13 0.004122
5 BANANA 11 0.017452
6 AVOCADO 10 0.008324 " , header = TRUE)
## This is the line where I label the %.
cost_table$SUPPLY_RATE <- label_percent(accuracy = 0.01)(cost_table$SUPPLY_RATE)
## Creating workbook and sheet
Fruits_Table <- createWorkbook()
addWorksheet(Fruits_Table,"List 1")
writeData(Fruits_Table,"List 1",cost_table)
## Style color for conditional formatting
posStyle <- createStyle(fontColour = "#006100", bgFill = "#C6EFCE")
negStyle <- createStyle(fontColour = "#9C0006", bgFill = "#FFC7CE")
## Here is the error
conditionalFormatting(Fruits_Table, "List 1",
cols = 3,
rows = 2:6, rule = c(0.015,0.020), style = posStyle, type='between'
)
conditionalFormatting(Fruits_Table, "List 1",
cols = 3,
rows = 2:6, rule = ">0.020", style = negStyle
)
Replace this line
cost_table$SUPPLY_RATE <- label_percent(accuracy = 0.01)(cost_table$SUPPLY_RATE)
with this
class(cost_table$SUPPLY_RATE) <- "percentage"
label_percent() replaces the numeric column with a character column, so Excel has no numbers to compare against the bounds in the rule. Setting the class to "percentage" keeps the values numeric and tells Excel to apply a percentage format to the cells.
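The difference between the two approaches can be checked in base R before writing any workbook. This is a sketch under assumptions: sprintf() is used here only as a stand-in for label_percent(), and the final comparison only approximates what a "between" rule tests.

```r
rate <- c(0.026377, 0.01122, 0.004122, 0.017452, 0.008324)

# Stand-in for label_percent(): formatting turns the numbers into text
as_text <- sprintf("%.2f%%", rate * 100)
is.character(as_text)

# Tagging the class leaves the stored values as plain doubles
pct <- rate
class(pct) <- "percentage"
typeof(pct)

# A 'between' rule with c(0.015, 0.020) roughly amounts to this numeric check
in_band <- unclass(pct) >= 0.015 & unclass(pct) <= 0.020
in_band
```

Only the BANANA row (0.017452) falls inside the band, which is the cell the first rule is expected to highlight.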
I have the following data frame:
> agg_2
# A tibble: 3 × 3
bcs default_flag pred_default
<chr> <dbl> <dbl>
1 high-score 0.00907 0.0121
2 low-score 0.0345 0.0353
3 mid-score 0.0210 0.0204
I plot it as a bar plot using the following code:
barplot(t(as.matrix(agg_2[,-1])),
main = "Actual Default vs Predicted Default",
xlab = "Score Category",
ylab = "Default Rate",
names.arg = c("High Score", "Low Score", "Mid Score"),
col = gray.colors(2),
beside = TRUE)
legend("topleft",
c("Default", "Pred. Default"),
fill = gray.colors(2))
and it gives me this:
How can I rearrange the data frame/matrix so that the pairs of bars in the bar plot are as follows: Low Score then Mid Score then High Score?
Here is one potential solution:
agg_2 <- read.table(text = "bcs default_flag pred_default
high-score 0.00907 0.0121
low-score 0.0345 0.0353
mid-score 0.0210 0.0204", header = TRUE)
agg_2$bcs <- factor(agg_2$bcs, levels = c("low-score", "mid-score", "high-score"), ordered = TRUE)
agg_2 <- agg_2[order(agg_2$bcs),]
barplot(t(as.matrix(agg_2[,-1])),
main = "Actual Default vs Predicted Default",
xlab = "Score Category",
ylab = "Default Rate",
names.arg = agg_2$bcs,
col = gray.colors(2),
beside = TRUE)
legend("topright",
c("Default", "Pred. Default"),
fill = gray.colors(2))
Created on 2022-06-21 by the reprex package (v2.0.1)
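As an alternative sketch that skips the factor conversion, the rows can be reordered with match() against the desired order. The vector name `wanted` and the matrix name `rates` are introduced here for illustration only.

```r
agg_2 <- data.frame(
  bcs = c("high-score", "low-score", "mid-score"),
  default_flag = c(0.00907, 0.0345, 0.0210),
  pred_default = c(0.0121, 0.0353, 0.0204)
)

# Desired left-to-right order of the bar pairs
wanted <- c("low-score", "mid-score", "high-score")

# match() returns the row positions of 'wanted' within agg_2$bcs
agg_2 <- agg_2[match(wanted, agg_2$bcs), ]

# One column per score category; barplot() uses the column names as labels
rates <- t(as.matrix(agg_2[, -1]))
colnames(rates) <- agg_2$bcs

barplot(rates, beside = TRUE, col = gray.colors(2))
```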
I am stuck on what is probably a simple problem: Loop on xts objects.
I would like to make four different plots for the elements in the basket: basket <- cbind(AAPLG, GEG, SPYG, WMTG)
> head(basket)
new.close new.close.1 new.close.2 new.close.3
2000-01-04 1.0000000 1.0000000 1.0000000 1.0000000
2000-01-05 1.0146341 0.9982639 1.0017889 0.9766755
2000-01-06 0.9268293 1.0115972 0.9856887 0.9903592
2000-01-07 0.9707317 1.0507639 1.0429338 1.0651532
2000-01-10 0.9536585 1.0503472 1.0465116 1.0457161
2000-01-11 0.9048780 1.0520833 1.0339893 1.0301664
This is my idea so far, as I cannot simply put in i as column name:
tickers <- c("AAPLG", "GEG", "SPYG", "WMTG")
par(mfrow=c(2,2))
for (i in 1:4){
print(plot(x = basket[, [i]], xlab = "Time", ylab = "Cumulative Return",
main = "Cumulative Returns", ylim = c(0.0, 3.5), major.ticks= "years",
minor.ticks = FALSE, col = "red"))
}
This is the error I get when running the script:
Error: unexpected ',' in " main = "Cumulative Returns","
> minor.ticks = FALSE, col = "red"))
Error: unexpected ',' in " minor.ticks = FALSE,"
> }
Error: unexpected '}' in "}"
Any help is very much appreciated.
As mentioned, remove the square brackets around i:
par(mfrow=c(2,2))
for (i in 1:4){
print(plot(x = basket[, i], xlab = "Time", ylab = "Cumulative Return",
main = "Cumulative Returns", ylim = c(0.0, 3.5), major.ticks= "years",
minor.ticks = FALSE, col = "red"))
}
Even better, assign names when building the xts object with cbind, or rename its columns as you would for a data frame, then iterate over the column names for both column references and plot titles:
# PASS NAMES WITH cbind
basket <- cbind(AAPLG=AAPLG, GEG=GEG, SPYG=SPYG, WMTG=WMTG)
# RENAME AFTER cbind
# basket <- cbind(AAPLG, GEG, SPYG, WMTG)
# colnames(basket) <- c("AAPLG", "GEG", "SPYG", "WMTG")
par(mfrow=c(2,2))
sapply(colnames(basket), function(col)
print(plot(x = basket[, col], xlab = "Time", ylab = "Cumulative Return",
main = paste(col, "Cumulative Returns"), ylim = c(0.0, 3.5),
major.ticks= "years", minor.ticks = FALSE, col = "red"))
)
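The idea of iterating over column names can be tried without the original price series. The sketch below assumes a plain numeric matrix with simulated values as a stand-in for the xts object, and uses base plot() rather than plot.xts, so the xts-specific arguments (major.ticks, minor.ticks) are left out.

```r
set.seed(42)  # reproducible toy data, not real returns
n <- 50
basket <- cbind(
  AAPLG = cumprod(1 + rnorm(n, 0, 0.02)),
  GEG   = cumprod(1 + rnorm(n, 0, 0.02)),
  SPYG  = cumprod(1 + rnorm(n, 0, 0.02)),
  WMTG  = cumprod(1 + rnorm(n, 0, 0.02))
)

old_par <- par(mfrow = c(2, 2))  # 2 x 2 grid of panels

# Each column name selects the series and also builds the panel title
titles <- vapply(colnames(basket), function(col) {
  main <- paste(col, "Cumulative Returns")
  plot(basket[, col], type = "l", xlab = "Time",
       ylab = "Cumulative Return", main = main, col = "red")
  main
}, character(1))

par(old_par)  # restore the previous layout
```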
I have a dataframe with a column of strings that I want to further label into the following categories: city, country, and continent. I used gsub to replace all the cities with "City," all the countries with "Country," and all the continents with "Continent."
#This is what I have
dataframe
Color Letter Words
red A Paris,Asia,parrot,Antarctica,North America,cat,lizard
blue A Panama,New York,Africa,dog,Tokyo,Washington DC,fish
red B Copenhagen,bird,USA,Japan,Chicago,Mexico,insect
blue B Israel,Antarctica,horse,South America,North America,turtle,Brazil
#This is what I want
dataframe
Color Letter New
red A City,Continent
blue A Country,City,Continent
red B City,Country
blue B Country,Continent
#This is the code I have so far
dataframe$New <- NA
#groups all the cities
dataframe$New <- lapply(dataframe$Words, function(x) {
gsub("Paris|New York|Tokyo|Washington DC|Copenhagen|Chicago", "City", x)})
#groups all the countries
dataframe$New <- lapply(dataframe$Words, function(x) {
gsub("Panama|USA|Japan|Mexico|Israel|Brazil", "Country", x)})
#groups all the continents
dataframe$New <- lapply(dataframe$Words, function(x) {
gsub("Asia|Antarctica|Africa|North America|South America", "Continent", x)})
dataframe$Words <- NULL
How do I prevent dataframe$New from being overwritten each time, and how do I delete the extra words (e.g. fish, horse, cat)?
The above data is an example based on a very large dataset. In the dataset the Words column has many repeats. See below for some sample rows from dataframe$Words:
Words
Panama,Paris
Panama,Israel,cat
Panama,Paris,horse,
Panama,Asia
Panama
Panama,Chicago
Israel,Chicago
Israel,lizard,Paris
Israel,Panama,horse,Africa
Consider pasting several ifelse calls checking for specific strings:
dataframe$New <- paste(ifelse(grepl("Paris|New York|Tokyo|Washington DC|Copenhagen|Chicago", dataframe$Words), "City", "N/A"),
ifelse(grepl("Panama|USA|Japan|Mexico|Israel|Brazil", dataframe$Words), "Country", "N/A"),
ifelse(grepl("Asia|Antarctica|Africa|North America|South America", dataframe$Words), "Continent", "N/A"),
sep=",")
dataframe$New <- gsub("N/A,|,N/A", "", dataframe$New)
dataframe
# Color Letter Words New
# 1 red A Paris,Asia,parrot,Antarctica,North America,cat,lizard City,Continent
# 2 blue A Panama,New York,Africa,dog,Tokyo,Washington DC,fish City,Country,Continent
# 3 red B Copenhagen,bird,USA,Japan,Chicago,Mexico,insect City,Country
# 4 blue B Israel,Antarctica,horse,South America,North America,turtle,Brazil Country,Continent
Or a DRYer version with do.call + lapply:
strs <- list(c("Paris|New York|Tokyo|Washington DC|Copenhagen|Chicago", "City"),
c("Panama|USA|Japan|Mexico|Israel|Brazil", "Country"),
c("Asia|Antarctica|Africa|North America|South America", "Continent"))
dataframe$New2 <- do.call(paste,
c(lapply(strs, function(s) ifelse(grepl(s[1], dataframe$Words), s[2], "N/A")),
list(sep=",")))
dataframe$New2 <- gsub("N/A,|,N/A", "", dataframe$New2)
It may be better to create a key/value list and then extract the elements after replacing each matched key with its value.
library(gsubfn)
# key val list
lst1 <- list(Paris = "City", `New York` = "City", Tokyo = "City",
`Washington DC` = "City",
Copenhagen = "City", Chicago = "City", Panama = "Country",
USA = "Country", Japan = "Country", Mexico = "Country", Israel = "Country",
Brazil = "Country", Asia = "Continent", Antarctica = "Continent",
Africa = "Continent", `North America` = "Continent",
`South America` = "Continent")
Extract the matching values with strapply into a list, loop over the list with sapply and paste the unique strings that are either 'City', 'Continent' or 'Country'
nm1 <- c("City", "Continent", "Country")
df1$New <- sapply(strapply(df1$Words, "([^,]+)", lst1), function(x)
paste(unique(x[x %in% nm1]), collapse=","))
df1$New
#[1] "City,Continent" "Country,City,Continent"
#[3] "City,Country" "Country,Continent"
data
df1 <- structure(list(Color = c("red", "blue", "red", "blue"), Letter = c("A",
"A", "B", "B"), Words = c("Paris,Asia,parrot,Antarctica,North America,cat,lizard",
"Panama,New York,Africa,dog,Tokyo,Washington DC,fish",
"Copenhagen,bird,USA,Japan,Chicago,Mexico,insect",
"Israel,Antarctica,horse,South America,North America,turtle,Brazil"
)), class = "data.frame", row.names = c(NA, -4L))
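If adding a package is not an option, the same key/value idea can be sketched in base R with strsplit() and a named character vector. The names `lookup`, `words` and `labels` are introduced here for illustration; this is an alternative under those assumptions, not the answer's own method.

```r
# Named vector: names are the keys, elements are the category labels
lookup <- c(
  Paris = "City", `New York` = "City", Tokyo = "City",
  `Washington DC` = "City", Copenhagen = "City", Chicago = "City",
  Panama = "Country", USA = "Country", Japan = "Country",
  Mexico = "Country", Israel = "Country", Brazil = "Country",
  Asia = "Continent", Antarctica = "Continent", Africa = "Continent",
  `North America` = "Continent", `South America` = "Continent"
)

words <- c(
  "Paris,Asia,parrot,Antarctica,North America,cat,lizard",
  "Panama,New York,Africa,dog,Tokyo,Washington DC,fish",
  "Copenhagen,bird,USA,Japan,Chicago,Mexico,insect",
  "Israel,Antarctica,horse,South America,North America,turtle,Brazil"
)

labels <- vapply(strsplit(words, ",", fixed = TRUE), function(w) {
  hit <- lookup[w]  # words that are not keys come back as NA
  paste(unique(hit[!is.na(hit)]), collapse = ",")
}, character(1))

labels
```

Unmatched words such as "cat" or "fish" are dropped because they are not keys in `lookup`, and unique() keeps each category once, in order of first appearance.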
When I try to run the following code I get an error:
value <- as.matrix(wsu.wide[, c(4, 3, 2)])
Error in `[.data.frame`(wsu.wide, , c(4, 3, 2)) : undefined columns selected
How do I get this line to work? It's part of dcasting my data.
This is the full code:
library(readxl)
library(reshape2)
Store_and_Regional_Sales_Database <- read_excel("~/Downloads/Data_Files/Store and Regional Sales Database.xlsx", skip = 2)
store <- Store_and_Regional_Sales_Database
freq <- table(store$`Sales Region`)
freq
rel.freq <- freq / nrow(store)
rel.freq
rel.freq.scaled <- rel.freq * 100
rel.freq.scaled
labs <- paste(names(rel.freq.scaled), "\n", "(", rel.freq.scaled, "%", ")", sep = "")
pie(rel.freq.scaled, labels = labs, main = "Pie Chart of Sales Region")
monitor <- store[which(store$`Item Description` == '24" Monitor'),]
wsu <- as.data.frame(monitor[c("Week Ending", "Store No.", "Units Sold")])
wsu.wide <- dcast(wsu, "Store No." ~ "Week Ending", value.var = "Units Sold")
value <- as.matrix(wsu.wide[, c(4, 3, 2)])
Thanks.
Edit:
This is my table called "monitor":
When I then run wsu <- as.data.frame(monitor[c("Week Ending", "Store No.", "Units Sold")]), I create another data frame with only the variables "Week Ending", "Store No." and "Units Sold".
However, when I run the wsu.wide code, the output I get is only this:
Why do I only get this small table when I'm asking to dcast my data?
I don't understand what goes wrong after this step.
The problem is at the line:
wsu.wide <- dcast(wsu, "Store No." ~ "Week Ending", value.var="Units Sold")
Instead of double quotation marks ("), use backticks (`) in the formula. A quoted string in a formula is a constant, not a reference to a column:
wsu.wide <- dcast(wsu, `Store No.` ~ `Week Ending`, value.var = "Units Sold")
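The difference between the two spellings can be inspected with base R's all.vars(), which lists the variable names a formula refers to. This is a small illustrative sketch; the object names `quoted` and `backticked` are made up for the example.

```r
# Quoted strings are constants, so the formula refers to no variables
quoted <- "Store No." ~ "Week Ending"
all.vars(quoted)
# character(0)

# Backticks create names, so both columns are picked up
backticked <- `Store No.` ~ `Week Ending`
all.vars(backticked)
# [1] "Store No."   "Week Ending"
```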
To avoid this kind of problem, it is better not to use spaces in R object names. For example, rename the Sales Region variable to sales_region, using an underscore. See e.g. Google's R Style Guide.
Please see the code below. I simulated your data, because extracting it from the picture is quite cumbersome:
library(readxl)
library(reshape2)
#simulation
n <- 4
Store_and_Regional_Sales_Database <- data.frame(
a = seq_along(LETTERS[1:n]),
sr = LETTERS[1:n],
sr2 = '24" Monitor',
sr3 = 1:4,
sr4 = 2:5,
sr5 = 3:6)
names(Store_and_Regional_Sales_Database)[2:6] <- c(
"Sales Region", "Item Description",
"Week Ending", "Store No.", "Units Sold")
# algorithm
store <- Store_and_Regional_Sales_Database
freq <- table(store$`Sales Region`)
freq
rel.freq <- freq/nrow(store)
rel.freq
rel.freq.scaled <- rel.freq * 100
rel.freq.scaled
labs <- paste(names(rel.freq.scaled), "\n", "(", rel.freq.scaled, "%", ")", sep = "")
pie(rel.freq.scaled, labels = labs, main = "Pie Chart of Sales Region")
monitor <- store[which(store$`Item Description` == '24" Monitor'),]
wsu <- as.data.frame(monitor[c("Week Ending", "Store No.", "Units Sold")])
wsu.wide <- dcast(wsu, `Store No.` ~ `Week Ending`, value.var = "Units Sold")
value <- as.matrix(wsu.wide[ ,c(4,3,2)])
Output:
3 2 1
[1,] NA NA 3
[2,] NA 4 NA
[3,] 5 NA NA
[4,] NA NA NA