Creating a histogram from a subset created from the subset function - r

This is how I've retrieved my dataset, everything is good so far.
> mantis<-read.csv("mantis.csv")
> attach(mantis)
The dataset provides numerical data on body mass/length/claw strength/etc. of FEMALE and MALE mantises. The object is to create a histogram showing the body masses of ONLY female mantises. I created a subset;
> mantis_sub<-subset(mantis, Sex=="f",select="Body.Mass.g")
Then I tried;
> hist(mantis_sub)
Error in hist.default(mantis_sub) : 'x' must be numeric
I've searched this link;
Plot a histogram of subset of a data
...and I cannot figure out how to properly create this histogram. I am unfortunately not fluent enough in R to understand the solution and the textbook I'm using does not cover this.

It is because mantis_sub is a data frame (i.e. a table, here with the single column Body.Mass.g), not a vector of numbers, and hist() only accepts a numeric vector.
You need to extract the column you want a histogram of. To do this, write mantis_sub${column name}. The dollar sign extracts the named column from the mantis_sub table as a vector.
E.g. for the column named "Body.Mass.g":
hist(mantis_sub$Body.Mass.g)
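If you want histograms of many columns automatically, loop through them. The columns must be present in the data frame, so keep all of them in select when building the subset (the names below are examples; use the ones in your data):
for (column in c("BodyMass", "ClawStrength")) {
  hist(mantis_sub[[column]], main = column)
}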
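To see why the error occurs, here is a small sketch using a made-up stand-in for the mantis data (the values are invented; only the Sex and Body.Mass.g column names come from the question). It shows that subset() with select still returns a data frame, and that pulling the column out as a vector gives hist() something numeric:

```r
# Toy stand-in for mantis.csv; the values are invented for illustration
mantis <- data.frame(
  Sex = c("f", "m", "f", "f", "m"),
  Body.Mass.g = c(2.1, 1.4, 2.6, 2.3, 1.2)
)

# subset() with select= returns a one-column data frame, not a numeric vector
mantis_sub <- subset(mantis, Sex == "f", select = "Body.Mass.g")
print(class(mantis_sub))

# Extract the column as a numeric vector before plotting
female_mass <- mantis_sub$Body.Mass.g
hist(female_mass, main = "Female body mass", xlab = "Body mass (g)")

# Equivalent one-liner without the intermediate data frame
hist(mantis$Body.Mass.g[mantis$Sex == "f"])
```

Either form works; the one-liner simply skips the intermediate data frame.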

Related

Create a histogram of specific columns and rows from a `data.frame` in R

## my data frame
crime = read.csv("url")
## specific columns that need to be represented
property_crime = crime$Burglary + crime$Theft + crime$`Motor Vehical Theft`
## the rows that I am looking for have the name "harris" within the column named "county_name"
## my attempt
with(crime, hist(harris))
## Error in hist(harris) : object 'harris' not found
Not sure why I am getting object 'harris' not found as that is the name under the county_name column. I'm new to R, could someone walk me through the process of displaying a histogram only including the values of specific columns and specific rows?
the rows that I am looking for have the name "harris" within the column named "county_name"
You have to tell R the same logic that you are telling us.
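There are several ways to do this in R, but I'll show the base R way. Note that with(crime, hist(harris)) fails because harris is a value stored inside the county_name column, not the name of a column or object.
We can select the desired rows and a column of crime by indexing like data.frame[rows, columns]. So, in your case, crime[harris_rows, column] should work, where column is the numeric column you want to plot. To get harris_rows, we can build a logical index like so: crime$county_name == "harris". Your property_crime vector has one value per row of crime, so the same logical index applies to it. If we put all of this together and call hist():
hist(property_crime[crime$county_name == "harris"])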
You don't provide a reproducible example, but you can check a similar logic with the mtcars dataset. Here, I am making the histogram of the cars with mpg > 15
hist(mtcars[mtcars$mpg >15, "mpg"])
# this is another option that produces the same result
# hist(mtcars$mpg[mtcars$mpg >15])
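The same row filter can be written a few ways. A short sketch on the built-in mtcars data, checking that logical indexing and subset() pick out the same values (the variable names here are only for illustration):

```r
# Logical index: TRUE for the rows to keep
keep <- mtcars$mpg > 15

# Three ways to get the filtered mpg values as a numeric vector
via_df_index  <- mtcars[keep, "mpg"]
via_vec_index <- mtcars$mpg[keep]
via_subset    <- subset(mtcars, mpg > 15)$mpg

# All three hold the same numbers, so any of them can be passed to hist()
print(identical(via_df_index, via_vec_index))
print(identical(via_df_index, via_subset))
hist(via_subset, main = "mpg > 15", xlab = "mpg")
```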

R: Text and function in one 'cell'

I have to create a table where I analyse 9 variables in a bigger data set. For each variable, I have to state how it is scaled, what the measure of central tendency is, and what the dispersion measure is.
As, depending on how the variable is scaled, I have different measures, I would like to specify that inside the corresponding cell of the table I'm writing. Example:
"Median: (median(GB$government,na.rm=T)"
or
"Median:" (median(GB$government, na.rm=T)
This doesn't work, RStudio warns me because of an unexpected symbol. The code I have is this (it includes specify_decimal because I have to include two decimals of each value - that function works flawlessly so don't mind it :)
MZT <- c("Median:" specify_decimal(median(GB$government,na.rm=T),2),
specify_decimal(Modus(GB$local),2),specify_decimal(Modus(GB$gender),2),
specify_decimal(mean(GB$height,na.rm=T),2),
specify_decimal(mean(GB$weight,na.rm=T),2),specify_decimal(mean(GB$age,na.rm=T),2),
specify_decimal(mean(GB$education,na.rm=T),2),
specify_decimal(median(GB$income,na.rm=T),2),
specify_decimal(median(GB$father_educ,na.rm=T),2))
/ edit: I now understand how kable works :D
One way to make custom tables in R is to use the knitr::kable() function, along with R Markdown. Here is a trivial example that prints a table comparing sample and theoretical values for an exponential distribution where lambda = 0.2.
library(knitr)
Statistic <- c("Mean","Std. Deviation")
Sample <- c(round(5.220134,2),round(5.4018713,2))
Theoretical <- c(5.0,5.0)
theTable <- data.frame(Statistic,Sample,Theoretical)
rownames(theTable) <- NULL
kable(theTable)
...and the text based output:
> kable(theTable)
|Statistic      | Sample| Theoretical|
|:--------------|------:|-----------:|
|Mean           |   5.22|           5|
|Std. Deviation |   5.40|           5|
When run in R Markdown, the output renders as a formatted table (image not reproduced here).
Explanation
I used the following technique to create the table.
A data frame is used as the container to hold the data.
Each column in the table is a column of data in the data frame.
The first column stores the names of the rows.
The second through n-th columns store different values related to each row.
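For the original question of putting a label and a computed value in the same cell, one option is to build each cell as a string before it goes into the data frame. A minimal sketch, assuming a made-up GB data frame (the values are invented) and using sprintf() in place of the asker's specify_decimal helper:

```r
# Hypothetical stand-in for the GB data set from the question
GB <- data.frame(
  government = c(3, 5, NA, 4, 2),
  height = c(170.2, 165.5, 180.1, NA, 175.0)
)

# sprintf() joins the label and the value and fixes the value at two decimals
central <- c(
  sprintf("Median: %.2f", median(GB$government, na.rm = TRUE)),
  sprintf("Mean: %.2f", mean(GB$height, na.rm = TRUE))
)

# Each cell is now ordinary text, so it fits in a data frame column
summary_table <- data.frame(
  Variable = c("government", "height"),
  Central.tendency = central
)
print(summary_table)
```

The resulting summary_table can then be passed to kable() in the same way as theTable above. paste("Median:", value) would also work if the value is already formatted.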

R dataframe not creating properly

I have used the following code to obtain mean decrease in accuracy for random forest
AAA<-randomForest(CPercentage~., data=data, importance= T)
BBB<-as.data.frame(importance(AAA))
I have created the following dataframe by the above process
                %IncMSE IncNodePurity
Campaigntype  3.4815273   207.5336052
Email        -1.1606079  2042.5660103
get           4.9073550    35.1237017
free          2.8777972    14.5362957
new           8.4464445    93.3491610
buy           5.9636483    23.9926669
just          4.1262164    21.5611278
month         4.0817729    16.6345631
I am able to obtain the second and third columns with BBB$`%IncMSE` and BBB$IncNodePurity. I want to subset this based on the first column, which appears unnamed, but I am unable to do so. When I write this data frame to a CSV file it works, and all three columns are listed separately. However, I am unable to separate the first two columns. Is there any way I can do this and rename the first column? I will be grateful to anyone who helps.
It seems that your "first column" is actually the row names of the data frame. You can create a column in your data frame from the row names, and then reset the row names so they're row numbers instead of names.
Try this:
BBB$col_one <- rownames(BBB)
rownames(BBB) <- 1:nrow(BBB)
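A self-contained sketch of the same idea, using three rows copied from the table in the question rather than a real randomForest fit (the column name variable is just an illustrative choice):

```r
# Small stand-in for importance(AAA); check.names = FALSE keeps the "%IncMSE" name
BBB <- data.frame(
  `%IncMSE` = c(3.4815273, -1.1606079, 4.9073550),
  IncNodePurity = c(207.5336052, 2042.5660103, 35.1237017),
  row.names = c("Campaigntype", "Email", "get"),
  check.names = FALSE
)

# Rows can be selected by row name directly
print(BBB[c("Email", "get"), ])

# Or move the row names into a real column and reset them to 1, 2, 3, ...
BBB$variable <- rownames(BBB)
rownames(BBB) <- NULL

# Now the former row names can be used in ordinary subsetting
print(BBB[BBB$variable == "Email", ])

# Non-syntactic column names such as %IncMSE need backticks with $
print(BBB$`%IncMSE`)
```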

How to check if a column has numeric or categorical levels in R?

I am trying to plot 9 barplots in a 3X3 matrix in R using base-R wrapped inside a for loop. (I am working on a workhorse solution for visualizing every column before I begin working on manipulating data) Below is the code:
library(ISLR);
library(ggplot2);
# load wage data
data(Wage)
par(mfrow=c(3,3))
for(i in 1:(dim(Wage)[2]-2)){
plot(Wage[,i],main = paste0(names(Wage)[i]),las = 2)
}
Unfortunately, this does not work properly for the first 2 columns, because they are numeric and actually need a histogram. I understand that I need to fit an if-else condition somewhere inside the for() statement, but that is giving me errors. Below is the output, where the first 2 columns are plotted wrong. (Age and year are actually numeric, and I may need to put them on the x-axis instead of defaulting them to y.)
Could you suggest an edit/hack? I also learnt that I can't use par() when I am wrapping ggplot inside for, so I had to use base R; otherwise ggplot would have been great aesthetically.
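One possible shape for that if-else is a check with is.numeric() on each column. A minimal sketch, using the built-in iris data instead of Wage so it runs without extra packages; the helper name plot_column is made up for illustration:

```r
# Draw a histogram for numeric columns and a bar plot of counts otherwise.
# Returns the kind of plot drawn, which makes the branch easy to inspect.
plot_column <- function(x, name) {
  if (is.numeric(x)) {
    hist(x, main = name, xlab = name)
    "histogram"
  } else {
    barplot(table(x), main = name, las = 2)
    "barplot"
  }
}

# iris has four numeric columns and one factor column
par(mfrow = c(2, 3))
kinds <- character(ncol(iris))
for (i in seq_len(ncol(iris))) {
  kinds[i] <- plot_column(iris[[i]], names(iris)[i])
}
print(kinds)
```

For the Wage data the same loop should apply with par(mfrow = c(3, 3)) and the column range from the question, assuming the categorical columns are factors or character vectors.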

check for and plot correlation of string properties in R

I need to check for and plot the correlation between a few properties in R, many of which are string-based.
Consider the following CSV example data extract for webpage hits:
id;lang;type
1;EN;browser
2;EN;ios
3;DE;android
4;DE;browser
5;FR;ios
the type and lang columns contain only strings, and (as far as I understand) cannot be used for plotting or correlation analysis. So I would need to convert them into numbers, right? But how do I reattach the string when plotting language against browser type?
If I consider some methods like PCA, are they even possible with number-converted strings, as there is no useful information in the distance or distribution that way?
One possible solution is an if/else statement. You compare the strings and assign an auxiliary variable a number for each string.
Once you have the numbers in a list or a vector, you can create a new data frame with the numeric values. After that you can draw your plot.
Here is the code:
id <- 1:5
lang <- c("EN", "EN", "DE", "DE", "FR")
type <- c("browser", "ios", "android", "browser", "ios")
data <- data.frame(id, type, lang)
# One single-row data frame per input row; they are combined after the loop
tmp <- vector(mode = "list", length = nrow(data))
auxType <- NULL
auxLang <- NULL
for (i in 1:nrow(data)) {
  # Assign lang
  if (data$lang[i] == "EN") {
    auxLang <- 1
  } else if (data$lang[i] == "DE") {
    auxLang <- 2
  } else if (data$lang[i] == "FR") {
    auxLang <- 3
  }
  # Assign type
  if (data$type[i] == "browser") {
    auxType <- 1
  } else if (data$type[i] == "ios") {
    auxType <- 2
  } else if (data$type[i] == "android") {
    auxType <- 3
  }
  # Create an auxiliary data frame for this row
  tmp[[i]] <- data.frame(data$id[i], auxType, auxLang)
}
# Stack the per-row data frames and give the columns readable names
allData <- do.call(rbind, tmp)
names(allData) <- c("id", "types", "lang")
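A shorter alternative to the loop, swapping in factor() for the if/else chain: a factor stores integer codes and keeps the original strings as labels, which also addresses the question of reattaching the strings when plotting. This is a sketch on the same five rows; the level order is chosen here to match the codes used above, and the codes remain arbitrary labels, so distances between them carry no meaning.

```r
data <- data.frame(
  id = 1:5,
  lang = c("EN", "EN", "DE", "DE", "FR"),
  type = c("browser", "ios", "android", "browser", "ios")
)

# The position of each level determines its integer code
lang_f <- factor(data$lang, levels = c("EN", "DE", "FR"))
type_f <- factor(data$type, levels = c("browser", "ios", "android"))

# Same layout as allData above, built without a loop
allData <- data.frame(
  id = data$id,
  types = as.integer(type_f),
  lang = as.integer(lang_f)
)
print(allData)

# The string labels are still attached to the factors, so a cross-tabulation
# of language against type can be plotted with readable names
counts <- table(lang_f, type_f)
barplot(counts, beside = TRUE, legend.text = rownames(counts))
```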
