Reading from CSV and Plotting Boxes in R

I am looking for the most convenient way of creating boxplots for different values and groups read from a CSV file in R.
First, I read my Sheet into memory:
Sheet <- read.csv("D:/mydata/Table.csv", sep = ";")
Which just works fine.
names(Sheet)
correctly gives me the headers of the different columns.
I can also access and filter different groups into separate lists, like
myData1 <- Sheet[Sheet$Group == 'Group1',]$MyValue
myData2 <- Sheet[Sheet$Group == 'Group2',]$MyValue
...
and draw a boxplot using
boxplot(myData1, myData2, ..., main = "Distribution")
where the ... stand for more lists I have filled using the selection method above.
However, I have seen that using a formula could do these steps of selection and boxplotting in one go. But when I use something like
boxplot(Sheet~Group, Sheet)
it won't work because I get the following error:
invalid type (list) for variable 'Sheet'
The data in the CSV looks like this:
No;Gender;Type;Volume;Survival
1;m;HCM;150;45
2;m;UCM;202;103
3;f;HCM;192;5
4;m;T4;204;101
...
So I have multiple possible groups and different values which I'd like to represent as a box plot for each group. For example, I could group by gender or group by type.
How can I easily draw multiple boxes from my CSV data without having to grab them all manually out of the data?
Thanks for your help.

Try it like this:
Sheet <- data.frame(Group = gl(2, 50, labels=c("Group1", "Group2")),
MyValue = runif(100))
boxplot(MyValue ~ Group, data=Sheet)
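Applied to the CSV layout from the question, the same formula interface might look like the sketch below. It assumes the columns are named as in the sample (Gender, Type, Volume, Survival) and reads the four sample rows from an inline string instead of D:/mydata/Table.csv, so it can be run as-is.

```r
# Sample rows copied from the question; in practice read the real file with
# read.csv("D:/mydata/Table.csv", sep = ";")
csv_text <- "No;Gender;Type;Volume;Survival
1;m;HCM;150;45
2;m;UCM;202;103
3;f;HCM;192;5
4;m;T4;204;101"
Sheet <- read.csv(text = csv_text, sep = ";")

# Formula: value column on the left, grouping column on the right
bp_type   <- boxplot(Volume ~ Type, data = Sheet, main = "Volume by type")
bp_gender <- boxplot(Survival ~ Gender, data = Sheet, main = "Survival by gender")
```

The call in the question failed because the left side of the formula named the whole data frame (Sheet) rather than one of its columns.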

Using ggplot2:
library(ggplot2)
ggplot(Sheet, aes(x = Group, y = MyValue)) +
  geom_boxplot()
The advantage of using ggplot2 is that you have lots of possibilities for customizing the appearance of your boxplot.

Related

How to use R character vector element as string and variable inside function?

I am trying to apply SPSS style category labels to my dataset in R. I think my question arises as I do not know how to parse variables correctly, so is not necessarily related to just these types of data.
To begin with, doing this manually as per the expss library documentation works fine:
library(expss)
#Load in the data
data(mtcars)
#Apply Variable Labels and Value Labels (and Numeric Coding) to each Variable.
mtcars = apply_labels(mtcars,
vs = "Engine",
vs = c("V-engine" = 1,
"Straight engine" = 2,
"Other engine" = 3)
)
Now my problem arises if I have my "Variable Names", "Variable Labels", "Value Labels" and corresponding "Value Numeric Codes" stored in some R data type and I try to use them in the apply_labels function. For example, if I have these stored in character vectors like so:
#Load in the data
data(mtcars)
#Value Labels
value_lab<-c("V-engine","Straight engine","Other engine")
#Value's Numeric coding
value_num<-c("1","2","3")
#Variable names
var <- c("vs")
#Variable Labels
var_lab<-c("Engine")
Then my question is, how would I use my character vector elements inside the apply_labels function? e.g. how would I do something like this:
#Apply Variable Labels and Value Labels (and Numeric Coding) to each Variable.
mtcars = apply_labels(mtcars,
var[1] = var_lab[1],
var[1] = c(value_lab[1] = value_num[1],
value_lab[2] = value_num[2],
value_lab[3] = value_num[3])
)
I have tried various combinations of paste and toString without success. My next step will be to apply this to my 500,000+ rows x 20,000 columns of data with a to-be-determined number of possible Value Labels/Numeric Codings.
Obligatory: I am new to R.
Thank you.
To achieve your desired result, store your variable and value labels in named lists and vectors. You can then use do.call to pass the variable and value labels to apply_labels.
To make the example more interesting I added labels for a second variable.
library(expss)
# Variable Labels
var_labels <- list(vs = "Engine", am = "Transmission")
#Value Labels
val_labels <- list(
vs = c("V-engine" = 0, "Straight engine" = 1),
am = c("Automatic" = 0, "Manual" = 1)
)
mtcars2 <- do.call(apply_labels, c(list(data = mtcars), var_labels, val_labels))
table(mtcars2$am, mtcars2$vs)
#>
#>             V-engine Straight engine
#>   Automatic       12               7
#>   Manual           6               7
Great, thank you! That has led me to understand named lists and build a solution with setNames.
I ended up not using expss. It appeared to work within R and labelled everything as expected, but when I exported the final dataframe from R to SPSS using haven::write_sav, the value labels were not maintained (but the variable labels were).
Instead I used the haven labelled vector class to apply the Variable and Value labels. My final solution looks like this:
#Load in the data
data(mtcars)
#Variables
var <- c("vs")
#Variable Labels
var_labels<-c("Engine")
#Value Labels (for first Variable)
value_labs<-c("V-engine","Straight engine","Other engine")
#Value's Numeric coding
value_num<-c("1","2","3")
#Make a named vector to use as the value labels
value_labels <- setNames(as.integer(value_num),value_labs)
#Apply the label with haven
mtcars[,c(var[1])]<-haven::labelled(mtcars[, c(var[1])],
labels=value_labels,
label=var_labels[1])
#Save out in spss format
haven::write_sav(mtcars, "test.sav")
Also, I have set it up so my data comes in one grouping of values labels at a time, but your example of expanding to the second variable helped me generalise this too, so thanks again!
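A minimal sketch of how setNames and do.call fit together, using the character vectors from the question. The object name call_args is made up for illustration, and the expss call is only attempted if that package happens to be installed; it mirrors the do.call answer above rather than documenting expss itself.

```r
# Labels stored as plain character vectors, as in the question
var       <- c("vs")
var_lab   <- c("Engine")
value_lab <- c("V-engine", "Straight engine", "Other engine")
value_num <- c("1", "2", "3")

# Named list of variable labels: list(vs = "Engine")
var_labels <- setNames(as.list(var_lab), var)

# Named list of value labels: list(vs = c("V-engine" = 1L, ...))
val_labels <- setNames(list(setNames(as.integer(value_num), value_lab)), var[1])

# do.call() turns list names into argument names, so the names can come from data
call_args <- c(list(data = mtcars), var_labels, val_labels)
str(call_args[-1])

# Only attempted if expss is installed; same pattern as the answer above
if (requireNamespace("expss", quietly = TRUE)) {
  mtcars2 <- do.call(expss::apply_labels, call_args)
}
```

Because the argument names are just the names of a list, the same code works unchanged when var holds many variables.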

R extract labels from a rda data frame

I am looking at some data downloaded from ICPSR, specifically their R data file (.rda). Beneath the column name of each data file, there are some descriptions of the variables (a.k.a. labels). An example is attached as well.
I tried various ways to get the labels, including base::label, Hmisc::label, labelled::var_label, sjlabelled::get_label, etc., but none worked.
Does anyone have an idea how to extract the labels from this data file?
Thanks very much in advance!
This could work using purrr:
#load library
library(purrr)
#get col n
n <- ncol(yourdata)
#extract labels as vector
labels <- map_chr(1:n, function(x) attr(yourdata[[x]], "label") )
This worked for me (I am working with ICPSR 35206):
attributes(yourdata)$variable.labels -> labels
Make sure that your attribute referring to the labels is actually called "variable.labels".
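Since the two answers look in different places (an attribute on each column versus an attribute on the data frame), here is a small base-R sketch that checks both. The toy data frame and its column names are hypothetical stand-ins for an ICPSR file, and the attribute names "label" and "variable.labels" are the ones mentioned in the answers, not guaranteed for every dataset.

```r
# Toy data frame standing in for an ICPSR file (hypothetical columns)
yourdata <- data.frame(AGE = c(34, 51), SEX = c(1, 2))
attr(yourdata$AGE, "label") <- "Age of respondent"            # per-column style
attr(yourdata, "variable.labels") <- c(AGE = "Age of respondent",
                                       SEX = "Sex of respondent")  # data-frame style

# Inspect what is actually stored before guessing an attribute name
names(attributes(yourdata))

# Per-column labels; columns without a "label" attribute become NA
col_labels <- vapply(yourdata, function(col) {
  lab <- attr(col, "label", exact = TRUE)
  if (is.null(lab)) NA_character_ else as.character(lab)[1]
}, character(1))

# Data-frame-level labels, if present (NULL otherwise)
df_labels <- attr(yourdata, "variable.labels", exact = TRUE)
```

Returning NA for unlabelled columns avoids the error that a strict character mapper raises when attr() gives NULL.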

R for loop overwriting variable data

I am trying to use a for loop to create a ggplot for each column in a dataframe. I am pretty new to this so my approach may be very wrong here.
I have written a function to create the ggplot:
create_scatter <- function(df, x, y) {
ggplot(df, aes(x, y)) +
geom_point() +
xlab(name) +
ylab("quality")
}
And a for loop to iterate through the Dataframe columns by name (to get the name of the column for use later) then get the contents of the column for the plotting function.
for (name in names(whiteWines)) {
for (column in whiteWines[name]) {
assign(paste0(name, "_scatter"),
create_scatter(whiteWines, column, whiteWines$quality))
}
}
Using assign() I am able to create a variable name from the column name on the fly and assign the results of ggplot to it.
I am then using grid.arrange to arrange the resulting plots in a 3 x 4 grid.
grid.arrange(fixed.acidity_scatter,
volatile.acidity_scatter,
citric.acid_scatter,
residual.sugar_scatter,
chlorides_scatter,
free.sulfur.dioxide_scatter,
total.sulfur.dioxide_scatter,
density_scatter,
pH_scatter,
sulphates_scatter,
alcohol_scatter,
layout_matrix = rbind(c(1,2,3), c(4,5,6), c(7,8,9), c(10,11,12)))
When executed all scatter plots are created, however they all contain the data from the last scatter plot in the loop.
Undesired Results
If I wrap the assign statement in a print() statement then I do get the desired outcome in the grid, but each individual plot gets printed as well.
Desired Results
Dataset
You're probably looking for something more like this:
library(readr)
library(tidyr)
library(dplyr)
library(ggplot2)
ww <- read_delim(file = "~/Downloads/winequality-white.csv",delim = ";")
ww_long <- ww %>%
gather(key = measure,value = value,`fixed acidity`:`alcohol`)
ggplot(data = ww_long,aes(x = quality,y = value)) +
facet_wrap(~measure,scales = "free_y") +
geom_point()
R has some tools that can be very tempting for beginners as they think through solving a problem. Among them are assign(), get() and eval(parse(text = )). A solution built on them usually causes more problems than it solves; there's typically a better way, but it will require digging a little deeper into the "normal" way of doing things in R.
These are the variables in the data:
"fixed acidity";"volatile acidity";"citric acid";"residual sugar";"chlorides";"free sulfur dioxide";"total sulfur dioxide";"density";"pH";"sulphates";"alcohol";"quality"
These are sample rows:
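If one plot object per column is still wanted, a named list can replace assign(). The sketch below is one way to do that, not the answerer's code: it uses mtcars as a stand-in for whiteWines (with mpg playing the role of quality) and passes column names as strings, looked up through ggplot2's .data pronoun.

```r
library(ggplot2)

# Stand-in data: mtcars instead of whiteWines, mpg instead of quality
df <- mtcars[, c("mpg", "wt", "hp", "disp")]
predictors <- setdiff(names(df), "mpg")

# Column names arrive as strings and are resolved inside aes() via .data
create_scatter <- function(df, x_name, y_name) {
  ggplot(df, aes(x = .data[[x_name]], y = .data[[y_name]])) +
    geom_point() +
    xlab(x_name) +
    ylab(y_name)
}

# One named list holds all plots; no assign() or get() needed
plots <- lapply(predictors, create_scatter, df = df, y_name = "mpg")
names(plots) <- paste0(predictors, "_scatter")
```

If gridExtra is in use, the list can probably be passed straight to grid.arrange(grobs = plots, ncol = 3) instead of naming each plot by hand.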
7;0.27;0.36;20.7;0.045;45;170;1.001;3;0.45;8.8;6
6.3;0.3;0.34;1.6;0.049;14;132;0.994;3.3;0.49;9.5;6
8.1;0.28;0.4;6.9;0.05;30;97;0.9951;3.26;0.44;10.1;6
7.2;0.23;0.32;8.5;0.058;47;186;0.9956;3.19;0.4;9.9;6
7.2;0.23;0.32;8.5;0.058;47;186;0.9956;3.19;0.4;9.9;6
8.1;0.28;0.4;6.9;0.05;30;97;0.9951;3.26;0.44;10.1;6
6.2;0.32;0.16;7;0.045;30;136;0.9949;3.18;0.47;9.6;6
7;0.27;0.36;20.7;0.045;45;170;1.001;3;0.45;8.8;6
6.3;0.3;0.34;1.6;0.049;14;132;0.994;3.3;0.49;9.5;6
8.1;0.22;0.43;1.5;0.044;28;129;0.9938;3.22;0.45;11;6
All from the Excel sheet.

Getting subscripts from Excel into R

I just started learning R but I already have my first problem. I want to display my data in a graph. My data is in an Excel sheet converted to a .csv file. But I have some chemical formulas like Fe2O3 in my data, and in the .csv all subscripts are gone. That doesn't look very nice. Is there any way to get the subscripts from the original Excel file into R?
I would really appreciate your help :)
Edit: My data contains 6 chemical formulas displayed on the x-axis, which all contain subscripts (e.g. Fe2O3, ZnCl2, CO2, ...), and numeric values displayed on the y-axis. The graph is a bar chart. I am not sure if there is a way to either change the numbers to subscripts in R or keep them prior to the import.
The graph looks like this. But I would like to have the numbers as subscripts:
I don't know that there's a way to bring the formatting from Excel into a CSV and then R, unless you can make those subscripts using Unicode (see "UTF8 symbols for subscript letters").
Given that your list of chemicals is short, it's not much work to tweak the chemical names to help ggplot interpret them with subscripts. You'll want brackets around the numbers, plus tildes afterwards if there are more elements to include. Then we also tell scale_x_discrete to "parse" the labels and convert those symbols to formatting.
library(tibble)
library(ggplot2)
set.seed(42)
chem_df <- tibble(
Chemicals =
c("AgNO3", "Al2SiO5", "CO2", "Fe2O3", "FeSO4", "ZnCl2"),
Chemicals_parsed =
c("AgNO[3]", "Al[2]~SiO[5]", "CO[2]", "Fe[2]~O[3]", "FeSO[4]", "ZnCl[2]"),
Mean = rnorm(6, 50, 30))
ggplot(chem_df, aes(x=Chemicals_parsed, Mean)) + geom_col() +
scale_x_discrete(name = "Chemicals",
labels=parse(text=chem_df$Chemicals_parsed))
To add to the excellent answer of @JonSpring, you can write a function which will convert strings like "Al2SiO5" to strings like "Al[2]~SiO[5]", so you don't have to manually make all the conversions:
library(stringr)
chem.form <- function(s){
s <- str_replace_all(s,"([0-9]+)","[\\1]~")
if(endsWith(s,"~")) s <- substr(s,1,nchar(s) - 1)
s
}
Chemicals <- c("AgNO3", "Al2SiO5", "CO2", "Fe2O3", "FeSO4", "ZnCl2")
Chemicals_parsed <- as.vector(sapply(Chemicals,chem.form))
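The same conversion can also be sketched in base R without stringr or sapply, since gsub and sub are already vectorised. The function name chem_form is made up for this illustration; the final parse() call is only there to check that each converted string is a valid expression of the kind passed to scale_x_discrete above.

```r
# Vectorised base-R take on the same idea: wrap each digit run in [ ] and
# follow it with ~, then drop a trailing ~ at the end of the string
chem_form <- function(s) {
  sub("~$", "", gsub("([0-9]+)", "[\\1]~", s))
}

Chemicals <- c("AgNO3", "Al2SiO5", "CO2", "Fe2O3", "FeSO4", "ZnCl2")
Chemicals_parsed <- chem_form(Chemicals)

# Each converted string should parse as one expression
label_exprs <- parse(text = Chemicals_parsed)
```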

Plot data from several large data files in ggplot

I have several data files (numeric) with around 150000 rows and 25 columns. Previously I used gnuplot (where the number of script lines is proportional to the number of plot objects) to plot the data, but as I now have to do some additional analysis with it, I moved to R and ggplot2.
How should I organize the data, though? Is one big data.frame with an additional column marking which file the data comes from really the only option? Or is there some way around that?
Edit: To be a bit more precise, I'll give as an example in what form I have the data now:
filelst=c("filea.dat", "fileb.dat", "filec.dat")
dat=list()
for(i in seq_along(filelst)) {
dat[[i]]=read.table(filelst[i])
}
Assuming you have filenames ending with ".dat", here's a mockup example of the strategies proposed by Chase:
require(plyr)
# list the files
lf = list.files(pattern = "\\.dat$")
str(lf)
# 1. read the files into a data.frame
d = ldply(lf, read.table, header = TRUE, skip = 1) # or whatever options to read
str(d) # should contain all the data, and an ID column called L1
# use the data, e.g. plot
pdf("all.pdf")
d_ply(d, "L1", plot, t="l")
dev.off()
# or using ggplot2
ggplot(d, aes(x, y, colour=L1)) + geom_line()
# 2. read the files into a list
ld = lapply(lf, read.table, header = TRUE, skip = 1) # or whatever options to read
names(ld) = gsub("\\.dat$", "", lf) # strip the file extension
str(ld)
# use the data, e.g. plot
pdf("all2.pdf")
lapply(names(ld), function(ii) plot(ld[[ii]], main=ii, type="l"))
dev.off()
# 3. is not fun
Your question is a little vague. If I followed along properly, I think you have three main options:
1. Do as you suggest and then use any one of the "split-apply-combine" functions that exist in R to conduct your analyses by group. These functions may include by, aggregate, ave, package(plyr), package(data.table) and many others.
2. Store your data objects as separate elements in a list(). Then use lapply() and friends to work on them.
3. Keep everything separate in different data objects and work on them individually. This is probably the most inefficient way to go about doing things, unless you have memory constraints et al.
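The first two options can be sketched in base R as follows. The three small files are written to a temporary directory purely so the example runs on its own; their contents, the column names x and y, and the id column name file are all made up for illustration.

```r
# Create three small stand-in files in a temporary directory (hypothetical data)
tmp_dir <- tempfile("datfiles")
dir.create(tmp_dir)
for (nm in c("filea", "fileb", "filec")) {
  write.table(data.frame(x = 1:5, y = (1:5)^2),
              file.path(tmp_dir, paste0(nm, ".dat")), row.names = FALSE)
}

# List option: one list element per file, named after the file
lf <- list.files(tmp_dir, pattern = "\\.dat$", full.names = TRUE)
ld <- lapply(lf, read.table, header = TRUE)
names(ld) <- sub("\\.dat$", "", basename(lf))

# Single data frame option: stack the list, adding a column marking the source file
d <- do.call(rbind, Map(function(df, id) cbind(df, file = id), ld, names(ld)))
rownames(d) <- NULL

# Per-file summaries work on either form
sapply(ld, nrow)
aggregate(y ~ file, data = d, FUN = mean)
```

Keeping the list around and stacking it only when a plot or grouped summary needs it gives both forms at little extra cost.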

Resources