Struggling to create a box plot, histogram, and qqplot in R [closed] - r

I am a very new R user, and I am trying to use R to create a box plot of prices at Target vs. Walmart. I also want to create two histograms, one for the prices at each store, as well as Q-Q plots. I keep getting various errors, including "Error in hist.default(mydata) : 'x' must be numeric" from hist(mydata) and "Error in x[floor(d)] + x[ceiling(d)] : non-numeric argument to binary operator" from boxplot(mydata). I have correctly uploaded my CSV file and I will attach my data for clarity. I have also added a direct copy and paste of some of my code. I have tried using hist(mydata), boxplot(mydata), and qqplot(mydata) as well, all of which returned errors such as "'x' must be numeric". I'm sorry if any of this is dumb; I am extremely new to R, not to mention extremely bad at it. Thank you all for your help!
#[Workspace loaded from ~/.RData]
mydata <- read.csv(file.choose(), header = T) names(mydata)
#Error: unexpected symbol in " mydata <- read.csv(file.choose(), header = T) names"
mydata <- read.csv(file.choose(), header = T)
names(mydata)
#[1] "Product" "Walmart" "Target"
mydata
Product
1 Sara lee artesano bread
2 Store brand dozen large eggs
3 Store brand 2% milk 1 gallon (128 fl oz)
4 12.4 oz cheez its
5 Ritz cracker fresh stacks 8ct, 11.8 oz
6 Sabra classic hummus 10 oz
7 Oreo chocolate sandwich cookies 14.3 oz
8 Motts applesauce 6 ct/4oz cups
9 Bananas (each)
10 Hass Avocado (each)
11 Chips ahoy original family size, 18.2 oz
12 Lays potato chips party size, 13 oz
13 Amy’s frozen mexican casserole, 9.5 oz
14 Jack’s frozen pizza original thin crust, 13.8 oz
15 Store brand sweet cream unsalted butter, 4 count, 16 oz
16 Sour cream and onion pringles, 5.5 oz
17 Philadelphia original cream cheese spread, 8 oz
18 Daisy sour cream, regular, 16 oz:
19 Kraft singles, 24 ct/16 oz:
20 Doritos nacho cheese, party size, 14.5 oz
21 Tyson Fun Chicken nuggets, 1.81 lb (29 oz), frozen
22 Kraft mac n cheese original, 7.25 oz
23 appleapple gogo squeeze, 12ct, 3.2 oz each
24 Yoplait original french vanilla yogurt, 6oz
25 Essentia bottled water, 1 liter
26 Premium oyster crackers, 9oz
27 Aunt Jemima buttermilk pancake miz, 32 oz
28 Eggo frozen homestyle waffles, 10ct/12.3 oz
29 Kellogg's Froot Loops, 10.1 oz
30 Tostitos scoops tortilla chips, 10 oz
Walmart Target
1 2.98 2.99
2 1.93 1.99
3 2.92 2.99
4 3.14 3.19
5 3.28 3.29
6 3.68 3.69
7 3.48 3.39
8 2.26 2.29
9 0.17 0.25
10 1.18 1.19
11 3.98 4.49
12 4.48 4.79
13 4.58 4.59
14 3.42 3.59
15 3.18 2.99
16 1.78 1.79
17 3.24 3.39
18 1.94 2.29
19 4.18 4.39
20 4.48 4.79
21 6.42 6.69
22 1.00 0.99
23 5.98 6.49
24 0.56 0.69
25 1.88 1.99
26 3.12 2.99
27 2.64 2.79
28 2.63 2.69
29 2.98 2.99
30 3.48 3.99
hist(mydata)
#Error in hist.default(mydata) : 'x' must be numeric
x<-sample(LETTERS[1:5],20,replace=TRUE)
df<-data.frame(x)
df
x
1 E
2 B
3 A
4 B
5 E
6 B
7 A
8 A
9 C
10 E
11 A
12 B
13 A
14 B
15 C
16 D
17 C
18 E
19 A
20 D
x<-sample(LETTERS[1:5],20,replace=TRUE)
df<-data.frame(x)
hist(df$x)
#Error in hist.default(df$x) : 'x' must be numeric
x<-sample(LETTERS[1:5],20,replace=TRUE)
df<-data.frame(x)
barplot(table(df$x))
boxplot(mydata)
#Error in x[floor(d)] + x[ceiling(d)] :
# non-numeric argument to binary operator
qqplot("Walmart")
#Error in sort(y) : argument "y" is missing, with no default
qqplot(mydata)
#Error in `[.data.frame`(x, order(x, na.last = na.last, decreasing = decreasing)) :
# undefined columns selected
#In addition: Warning message:
#In xtfrm.data.frame(x) : cannot xtfrm data frames

There seems to be a problem with the data you uploaded, but no matter. I will just create data resembling your problem and show you how to do it with some simple code. (Some may offer alternatives like ggplot, but I think my example uses shorter code and is more intuitive.)
First, we can load ggpubr for plotting functions:
# Load ggpubr for plotting functions:
library(ggpubr)
Then we can create a new data frame, first with the prices and store names, then combining them into a data frame we can use:
# Create price values and store values:
prices.1 <- c(1,2,3,4,5,3)
prices.2 <- c(8,6,4,2,0,1)
store <- c("walmart",
"walmart",
"walmart",
"target",
"target",
"target")
# Create dataframe for these values:
store.data <- data.frame(prices.1,
prices.2,
store)
Now we can plug our data into all of these plots in nearly the same way each time. The first part of the code is the plot function name, the data argument is our stored data frame, and the x and y values are the names of the columns we use as variables:
# Scatterplot:
ggscatter(data = store.data,
x="prices.1",
y="prices.2")
# Boxplot:
ggboxplot(data = store.data,
x="store",
y="prices.1")
# Histogram:
gghistogram(data = store.data,
x="prices.1")
# QQ Plot:
ggqqplot(data = store.data,
x="prices.1")
There are simpler alternatives, such as base R functions like this one, but I find they are much harder to customize compared to ggpubr and ggplot:
plot(x,y)
Of course, you can customize the ggpubr and ggplot output to look much better, but that's up to you and what you want to learn:
ggboxplot(data = store.data,
x="store",
y="prices.1",
fill = "store",
title = "Prices of Merchandise by Store",
caption = "*Data obtained from Stack Overflow",
palette = "jco",
legend = "none",
xlab ="Store Name",
ylab = "Prices of Merchandise",
ggtheme = theme_pubclean())
Hope that's helpful. Let me know if you have questions!
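For the data in the question itself, the errors appear to come from passing the whole data frame, including the text Product column, to functions that expect numeric vectors. Here is a minimal base R sketch, assuming the columns are named Walmart and Target as in the names(mydata) output. The small data frame is only a stand-in built from the first five rows of the printed table, and the titles and axis labels are my own choices:

```r
# Stand-in for mydata: first five rows of the table printed in the question
mydata <- data.frame(
  Product = c("Sara lee artesano bread", "Store brand dozen large eggs",
              "Store brand 2% milk 1 gallon (128 fl oz)", "12.4 oz cheez its",
              "Ritz cracker fresh stacks 8ct, 11.8 oz"),
  Walmart = c(2.98, 1.93, 2.92, 3.14, 3.28),
  Target  = c(2.99, 1.99, 2.99, 3.19, 3.29)
)

# Keep only the numeric price columns; Product is text and cannot be plotted
prices <- mydata[sapply(mydata, is.numeric)]

# Box plot: one box per store
boxplot(prices, ylab = "Price", main = "Prices by store")

# Histograms: one per store, each from a single numeric column
hist(prices$Walmart, xlab = "Price", main = "Walmart prices")
hist(prices$Target, xlab = "Price", main = "Target prices")

# Normal Q-Q plots: qqnorm() takes a single numeric vector
qqnorm(prices$Walmart, main = "Walmart: normal Q-Q"); qqline(prices$Walmart)
qqnorm(prices$Target, main = "Target: normal Q-Q"); qqline(prices$Target)
```

Note that qqplot() compares two samples, so qqplot(mydata$Walmart, mydata$Target) would plot one store's quantiles against the other's rather than against a normal distribution.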

Related

Plotting missing data

I'm trying to plot the following dataset, imputed with the LOCF method, according to this procedure:
> dati
# A tibble: 27 x 6
id sex d8 d10 d12 d14
<dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 F 21 20 21.5 23
2 2 F 21 21.5 24 25.5
3 3 NA NA 24 NA 26
4 4 F 23.5 24.5 25 26.5
5 5 F 21.5 23 22.5 23.5
6 6 F 20 21 21 22.5
7 7 F 21.5 22.5 23 25
8 8 F 23 23 23.5 24
9 9 F NA 21 NA 21.5
10 10 F 16.5 19 19 19.5
# ... with 17 more rows
dati_locf <- dati %>% mutate(across(everything(),na.locf)) %>%
mutate(across(everything(),na.locf,fromlast = T))
apply(dati_locf[which(dati_locf$sex=="F"),1:4], 1, function(x) lines(x, col = "green"))
However, when I run the last line to plot the dataset, it returns both of these warning and error messages:
Warning in xy.coords(x, y) : a NA has been produced by coercion
Error in plot.xy(xy.coords(x, y), type = type, ...) :
plot.new has not been called yet
Called from: plot.xy(xy.coords(x, y), type = type, ...)
Can you explain why they occur and how I could fix them? I have attached the page I was taken to after running it.
If you just want to plot the LOCF imputation for one variable to see how good the fit for the imputations looks for this one variable, you can use the following:
library(imputeTS)
# Example 1: Visualize imputation by LOCF
imp_locf <- na_locf(tsAirgap)
ggplot_na_imputations(tsAirgap, imp_locf)
tsAirgap is an example time series that comes with the imputeTS package. You would have to replace it with the time series or variable you want to plot. Imputed values are shown in red. As you can see, for this series last observation carried forward would be kind of OK, but there are algorithms in the imputeTS package that give a better result (e.g. na_kalman or na_seadec). Here is also an example of next observation carried backward, since you also used NOCB.
library(imputeTS)
# Example 2: Visualize imputation by NOCB
imp_locf <- na_locf(tsAirgap, option = "nocb")
ggplot_na_imputations(tsAirgap, imp_locf)
There are several problems here:
apply will convert its first argument to matrix and since the second column is character it gives a character matrix. Clearly one can't plot that with lines.
presumably we want to plot columns 3:6, not 1:4
na.locf will produce multiple values that are the same wherever there is an NA but what we really want is to connect non-NA points. Use na.approx instead.
lines can only be used after plot but there is no plot command. Use matplot instead.
Making these changes we have the following.
library(zoo)
# see Note below for dati in reproducible form
matplot(na.approx(dati[3:6]), type = "l", ylab = "")
legend("topright", names(dati)[3:6], col = 1:4, lty = 1:4)
We could alternately use ggplot2 graphics. First convert to zoo and then use na.approx and autoplot. Omit facet=NULL if you want separate panels.
library(ggplot2)
autoplot(na.approx(zoo(dati[3:6])), facet = NULL)
Note
We provide dati in reproducible form below. Note that the sex column contains only NA and F, so without further direction read.table would interpret those as a logical NA and FALSE. Instead we specify that the sex column is character in the read.table line.
Lines <- "
id sex d8 d10 d12 d14
1 1 F 21 20 21.5 23
2 2 F 21 21.5 24 25.5
3 3 NA NA 24 NA 26
4 4 F 23.5 24.5 25 26.5
5 5 F 21.5 23 22.5 23.5
6 6 F 20 21 21 22.5
7 7 F 21.5 22.5 23 25
8 8 F 23 23 23.5 24
9 9 F NA 21 NA 21.5
10 10 F 16.5 19 19 19.5"
dati <- read.table(text = Lines, colClasses = list(sex = "character"))
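If the goal is to keep the original one-line-per-subject approach with lines(), the "plot.new has not been called yet" error can be avoided by opening an empty plot first. The following is a minimal base R sketch under my own assumptions: scores is a small numeric matrix holding the d8 to d14 values of three complete rows from dati, and ages holds the x positions 8, 10, 12, 14 implied by the column names.

```r
# Numeric-only toy data: rows 1, 2 and 4 of dati, columns d8..d14
scores <- rbind(c(21, 20, 21.5, 23),
                c(21, 21.5, 24, 25.5),
                c(23.5, 24.5, 25, 26.5))
ages <- c(8, 10, 12, 14)  # assumed from the column names d8, d10, d12, d14

# Open an empty plot with suitable axis limits; lines() needs an existing plot
plot(range(ages), range(scores), type = "n",
     xlab = "Age", ylab = "Measurement")

# Add one line per subject; only numeric values are passed to lines()
for (i in seq_len(nrow(scores))) {
  lines(ages, scores[i, ], col = "green")
}
```

Looping over a numeric matrix also sidesteps the coercion warning, since the character sex column never reaches lines().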

Applying a label depending on which condition is met using R

I would like to use a simple R function where the contents of a specified data frame column are read row by row, then depending on the value, a string is applied to that row in a new column.
So far, I've tried to use a combination of loops and generating individual columns which were combined later. However, I cannot seem to get the syntax right.
The input looks like this:
head(data,10)
# A tibble: 10 x 5
Patient T1Score T2Score T3Score T4Score
<dbl> <dbl> <dbl> <dbl> <dbl>
1 3 96.4 75 80.4 82.1
2 5 100 85.7 53.6 55.4
3 6 82.1 85.7 NA NA
4 7 82.1 85.7 60.7 28.6
5 8 100 76.8 64.3 57.7
6 10 46.4 57.1 NA 75
7 11 71.4 NA NA NA
8 12 98.2 92.9 85.7 82.1
9 13 78.6 89.3 37.5 42.9
10 14 89.3 100 64.3 87.5
and the function I have written looks like this:
minMax<-function(x){
#make an empty data frame for the output to go
output<-data.frame()
#making sure the rest of the commands only look at what I want them to look at in the input object
a<-x[2:5]
#here I'm gathering the columns necessary to perform the calculation
minValue<-apply(a,1,min,na.rm=T)
maxValue<-apply(a,1,max,na.rm=T)
tempdf<-as.data.frame((cbind(minValue,maxValue)))
Difference<-tempdf$maxValue-tempdf$minValue
referenceValue<-ave(Difference)
referenceValue<-referenceValue[1]
#quick aside to make the first two thirds of the output file
output<-as.data.frame((cbind(x[1],Difference)))
#Now I need to define the class based on the referenceValue, and here is where I run into trouble.
apply(output, 1, FUN =
for (i in Difference) {
ifelse(i>referenceValue,"HIGH","LOW")
}
)
output
}
I also tried...
if (i>referenceValue) {
apply(output,1,print("HIGH"))
}else(print("LOW")) {}
}
)
output
}
Regardless, both end up giving me the error message,
c("'for (i in Difference) {' is not a function, character or symbol", "' ifelse(i > referenceValue, \"HIGH\", \"LOW\")' is not a function, character or symbol", "'}' is not a function, character or symbol")
The expected output should look like:
Patient Difference Toxicity
3 21.430000 LOW
5 46.430000 HIGH
6 3.570000 LOW
7 57.140000 HIGH
8 42.310000 HIGH
10 28.570000 HIGH
11 0.000000 LOW
12 16.070000 LOW
13 51.790000 HIGH
14 35.710000 HIGH
Is there a better way for me to organize the last loop?
Since you seem to be using tibbles anyway, here's a much shorter version using dplyr and tidyr:
> d %>%
gather(key = tscore,value = score,T1Score:T4Score) %>%
group_by(Patient) %>%
summarise(Difference = max(score,na.rm = TRUE) - min(score,na.rm = TRUE)) %>%
ungroup() %>%
mutate(AvgDifference = mean(Difference),
Toxicity = if_else(Difference > mean(Difference),"HIGH","LOW"))
# A tibble: 10 x 4
Patient Difference AvgDifference Toxicity
<int> <dbl> <dbl> <chr>
1 3 21.4 30.3 LOW
2 5 46.4 30.3 HIGH
3 6 3.6 30.3 LOW
4 7 57.1 30.3 HIGH
5 8 42.3 30.3 HIGH
6 10 28.6 30.3 LOW
7 11 0 30.3 LOW
8 12 16.1 30.3 LOW
9 13 51.8 30.3 HIGH
10 14 35.7 30.3 HIGH
I think maybe your expected output might have been based on a slightly different average difference, so this output is very slightly different.
And a much simpler base R version if you prefer:
d$min <- apply(d[,2:5],1,min,na.rm = TRUE)
d$max <- apply(d[,2:5],1,max,na.rm = TRUE)
d$diff <- d$max - d$min
d$avg_diff <- mean(d$diff)
d$toxicity <- with(d,ifelse(diff > avg_diff,"HIGH","LOW"))
A few notes on your existing code:
as.data.frame((cbind(minValue,maxValue))) is not an advisable way to create data frames. This is more awkward than simply doing data.frame(minValue = minValue,maxValue = maxValue) and risks unintended coercion from cbind.
ave is for computing summaries over groups; just use mean if you have a single vector
The FUN argument in apply expects a function, not an arbitrary expression, which is what you're trying to pass at the end. The general syntax for an "anonymous" function in that context would be apply(...,FUN = function(arg) { do some stuff and return exactly the thing you want}).
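To make the last note concrete, here is a minimal sketch of apply with an anonymous function, followed by a vectorized ifelse for the label. The data frame d holds only the first three rows of the question's data, and the column names Difference and Toxicity follow the expected output; treat it as an illustration rather than a drop-in replacement for minMax.

```r
# First three patients from the question's data
d <- data.frame(
  Patient = c(3, 5, 6),
  T1Score = c(96.4, 100, 82.1),
  T2Score = c(75, 85.7, 85.7),
  T3Score = c(80.4, 53.6, NA),
  T4Score = c(82.1, 55.4, NA)
)

# FUN is a function of one row; it returns that row's max minus min
d$Difference <- apply(d[, 2:5], 1, function(row) {
  max(row, na.rm = TRUE) - min(row, na.rm = TRUE)
})

# ifelse is vectorized, so no explicit loop over rows is needed
d$Toxicity <- ifelse(d$Difference > mean(d$Difference), "HIGH", "LOW")

d[, c("Patient", "Difference", "Toxicity")]
```

With only three rows the mean difference differs from the full data set, so the labels here are not meant to match the expected output in the question.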

Wordcloud in R: color based on data in other column

I'm trying to create a wordcloud where the color of each word is based on another column in the data frame. I'm using the packages wordcloud2 and RColorBrewer, with the following (sample) code:
set.seed(1)
DF <- data.frame(
word = c('football','tennis','squash','curling','baseball','diving','archery','cricket','cycling','hockey','formula1','rugby','volleyball','tabletennis','swimming','shooting','taekwondo','judo','handball','horseracing'),
freq = sample(100:1000,20),
diff = sample(-100:100,20)/100)
library(wordcloud2)
library(RColorBrewer)
color_range_number <- length(unique(DF$diff))
custColorPal <- colorRampPalette(c("#ff0000","#00cc00"))
custColors <- custColorPal(color_range_number)
colors <- custColors[factor(DF$diff)]
wordcloud2(data = DF, color = colors)
DF is as follows:
word freq diff
1 football 339 0.87
2 tennis 434 -0.58
3 squash 614 0.29
4 curling 915 -0.76
5 baseball 280 -0.48
6 diving 904 -0.25
7 archery 945 -0.98
8 cricket 690 -0.26
9 cycling 661 0.67
10 hockey 155 -0.35
11 formula1 283 -0.08
12 rugby 257 0.13
13 volleyball 710 -0.07
14 tabletennis 441 -0.65
15 swimming 782 0.54
16 shooting 540 0.24
17 taekwondo 735 0.46
18 judo 976 -0.81
19 handball 435 0.32
20 horseracing 785 0.93
In DF, I'd like to use column 'diff' to assign colors to the words: the more negative the more red, the more positive, the more green.
However, I'm getting unexpected results. For example, 'hockey' is colored green, whereas it should have a more red color because its 'diff' value is -0.35. See also this screenshot.
I think it is due to the fact that not all words are plotted, since for example 'horseracing' is not plotted.
My questions:
Is it correct to state that the colors are 'misassigned' due to the fact that not all words are plotted?
How can I make sure that all words are always plotted? Reducing the value of the 'size' argument is not always a guarantee. It might be good to note that I'd like to include this wordcloud in a PDF via rmarkdown.
The line colors <- custColors[factor(DF$diff)] is causing the issue I reckon. Try this...
set.seed(1)
DF <- data.frame(
word = c('football','tennis','squash','curling','baseball','diving','archery','cricket','cycling','hockey','formula1','rugby','volleyball','tabletennis','swimming','shooting','taekwondo','judo','handball','horseracing'),
freq = sample(100:1000,20),
diff = sample(-100:100,20)/100)
library(wordcloud2)
library(RColorBrewer)
color_range_number = nrow(DF)
custColorPal <- colorRampPalette(c("#ff0000","#00cc00"))
custColors <- custColorPal(color_range_number)
wordcloud2(data = DF, color = custColors)
All the custColorPal function is doing is a lookup of the diff value to scale it in the range of the colours. Factors are being interpreted as equally spaced numbers in the range 1 to 20.
As for the second question, my suggestion is to make the font smaller so that there is a better chance that all words will be displayed.
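As a possible alternative, the color of each word can be tied to the diff value itself by scaling diff linearly onto a fine-grained red-to-green ramp. This is only a sketch under my own assumptions: the data frame holds four rows copied from the printed DF, the ramp has 101 steps, and the scale runs from the smallest to the largest diff present. The wordcloud2 call is left as a comment so the block runs without the package installed.

```r
# Four rows copied from the printed DF in the question
DF <- data.frame(
  word = c("football", "archery", "hockey", "horseracing"),
  freq = c(339, 945, 155, 785),
  diff = c(0.87, -0.98, -0.35, 0.93)
)

# Red-to-green ramp with 101 steps (step count is an arbitrary choice)
ramp <- colorRampPalette(c("#ff0000", "#00cc00"))(101)

# Scale diff linearly so the minimum maps to step 1 and the maximum to step 101
lo <- min(DF$diff)
hi <- max(DF$diff)
idx <- round((DF$diff - lo) / (hi - lo) * 100) + 1

# One color per row, in the same order as DF
word_colors <- ramp[idx]

# Then, with the package loaded:
# wordcloud2(data = DF, color = word_colors)
```

Because the colors are matched to rows by position, a word that is dropped from the plot may still shift the colors of the words after it, so this does not by itself answer the first question.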

changing variable value in data frame

I have a data frame:
id,male,exposure,age,tol
9,0,1.54,tol12,1.79
9,0,1.54,tol13,1.9
9,0,1.54,tol14,2.12
9,0,1.54,tol11,2.23
However, I want the values of the age variable to be (11,12,13,14) not (tol11,tol12,tol13,tol14). I tried the following, but it does not make a difference.
levels(tolerance_wide$age)[levels(tolerance_wide$age)==tol11] <- 11
levels(tolerance_wide$age)[levels(tolerance_wide$age)==tol12] <- 12
Any help would be appreciated.
(data from Singer, Willett book)
Assuming that your data frame is named foo:
foo$age <- as.numeric(gsub("tol", "", foo$age))
id male exposure age tol
1: 9 0 1.54 12 1.79
2: 9 0 1.54 13 1.90
3: 9 0 1.54 14 2.12
4: 9 0 1.54 11 2.23
Here we use two functions:
gsub to replace pattern in a string (we replace tol with nothing "").
as.numeric to transform gsub output (which is character) into numbers
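For a reproducible version of the two steps above, the four rows from the question can be read in from text. This is a sketch: the name foo follows the answer, and stringsAsFactors = TRUE is set on purpose to mimic the factor column implied by the levels() attempt in the question.

```r
# The four rows shown in the question, read from inline text
foo <- read.csv(text = "id,male,exposure,age,tol
9,0,1.54,tol12,1.79
9,0,1.54,tol13,1.9
9,0,1.54,tol14,2.12
9,0,1.54,tol11,2.23", stringsAsFactors = TRUE)

# gsub coerces the factor to character, strips "tol", and as.numeric converts
foo$age <- as.numeric(gsub("tol", "", foo$age))

str(foo$age)
```

Calling as.numeric directly on a factor would return the internal level codes instead, which is why the text is cleaned first.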

R merge with itself

Can I merge data like
name,#797,"Stachy, Poland"
at_rank,#797,1
to_center,#797,4.70
predicted,#797,4.70
According to the second column and take the first column as column names?
name at_rank to_center predicted
#797 "Stachy, Poland" 1 4.70 4.70
Upon request, the whole set of data: http://sprunge.us/cYSJ
The first problem, of reading the data in, should not be a problem if your strings with commas are quoted (which they seem to be). Using read.csv with the header=FALSE argument does the trick with the data you shared. (Of course, if the data file had headers, delete that argument.)
From there, you have several options. Here are two.
reshape (base R) works fine for this:
myDF <- read.csv("http://sprunge.us/cYSJ", header=FALSE)
myDF2 <- reshape(myDF, direction="wide", idvar="V2", timevar="V1")
head(myDF2)
# V2 V3.name V3.at_rank V3.to_center V3.predicted
# 1 #1 Kitoman 1 2.41 2.41
# 5 #2 Hosaena 2 4.23 9.25
# 9 #3 Vinzelles, Puy-de-Dôme 1 5.20 5.20
# 13 #4 Whitelee Wind Farm 6 3.29 8.07
# 17 #5 Steveville, Alberta 1 9.59 9.59
# 21 #6 Rocher, Ardèche 1 0.13 0.13
The reshape2 package is also useful in these cases. It has simpler syntax and the output is also a little "cleaner" (at least in terms of variable names).
library(reshape2)
myDFw_2 <- dcast(myDF, V2 ~ V1)
# Using V3 as value column: use value.var to override.
head(myDFw_2)
# V2 at_rank name predicted to_center
# 1 #1 1 Kitoman 2.41 2.41
# 2 #10 4 Icaraí de Minas 6.07 8.19
# 3 #100 2 Scranton High School (Pennsylvania) 5.78 7.63
# 4 #1000 1 Bat & Ball Inn, Clanfield 2.17 2.17
# 5 #10000 3 Tăuteu 1.87 5.87
# 6 #10001 1 Oak Grove, Northumberland County, Virginia 5.84 5.84
Look at the reshape package from Hadley. If I understand correctly, you are just pivoting your data from long to wide.
I think in this case all you really need to do is transpose, cast to data.frame, set the colnames to the first row and then remove the first row. It might be possible to skip the last step through some combination of arguments to data.frame but I don't know what they are right now.
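The transpose approach described above could look roughly like the following for the four rows in the question. It is a sketch that handles only a single ID; the names long and wide are my own, and for the full data set with many IDs the reshape or dcast approaches shown earlier are the better fit.

```r
# The four rows from the question; read.csv keeps "#" because comment.char = ""
long <- read.csv(text = 'name,#797,"Stachy, Poland"
at_rank,#797,1
to_center,#797,4.70
predicted,#797,4.70', header = FALSE, stringsAsFactors = FALSE)

# Transpose the key and value columns: row 1 holds the keys, row 2 the values
wide <- as.data.frame(t(long[, c("V1", "V3")]), stringsAsFactors = FALSE)

# Use the first row as column names, then drop it
names(wide) <- unlist(wide[1, ])
wide <- wide[-1, , drop = FALSE]

# Label the remaining row with the ID and restore numeric types
rownames(wide) <- long$V2[1]
wide[] <- lapply(wide, type.convert, as.is = TRUE)

wide
```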
