I have a data frame called InsectSprays that has two columns, count and spray. When I try to use split to create boxplots for each value of spray, I get the error shown below.
Can anyone explain the error for me? It's no doubt clear I'm new to R.
#
class(InsectSprays)
[1] "data.frame"
#
head(InsectSprays)
count spray
1 10 A
2 7 A
3 20 A
4 14 A
5 14 A
6 12 A
#
boxplot(split(x=InsectSprays,f=InsectSprays$spray))
Error in sort.int(x, na.last = na.last, decreasing = decreasing, ...) :
'x' must be atomic
boxplot expects a basic ('atomic') object like a series of numbers 1:10 or a list of basic atomic objects list(1:10,2:11). Your split produces a list of data.frames which boxplot doesn't know how to handle. Luckily, boxplot can also take a formula if you want to get results per group, like:
boxplot(count ~ spray, data=InsectSprays)
If you were working with a different function that didn't have this possibility, you would need to loop over the split list. Possibly something like:
## divide the plot window into 3 columns/2 rows
par(mfrow=c(2,3))
## loop over each object and `boxplot` the `count` column
lapply(split(InsectSprays, InsectSprays$spray), \(x) boxplot(x$count) )
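Another option, staying close to the original split() attempt, is to split only the count column rather than the whole data frame, so that boxplot receives a list of atomic vectors. This is a minimal sketch using the built-in InsectSprays dataset; the name counts_by_spray and the axis labels are illustrative, not from the question.

```r
# Split just the numeric column by spray: the result is a list of
# plain numeric vectors, one per spray level
counts_by_spray <- split(InsectSprays$count, InsectSprays$spray)

# Inspect the structure: each element is atomic, not a data.frame
str(counts_by_spray)

# boxplot accepts a list of atomic vectors and draws one box per element
boxplot(counts_by_spray, xlab = "spray", ylab = "count")
```

This should draw the same kind of per-group plot as the formula interface; the formula version is usually shorter when the data already live in a data frame.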
I've been assigned a homework problem that asks:
Write a function called numeric_summary which takes two inputs:
`x`: A numeric vector
`group`: A factor vector of the same length as x
and produces a list as output which contains the following elements:
`missing`: The number of missing values in x
`means`: The means of x for each level of groups
`sds`: The standard deviations of x for each level of groups
`p.value`: The p-value for a test of the hypothesis that the means across the levels of groups are the same (versus the two-sided alternative)
`is.binary`: Set to FALSE for this function
At the moment I am not concerned with creating the list of outputs, but I am unsure how to code the two input aspect, especially with them being two different types. There is an example given:
#numeric_summary <- function(x, group){}
#
# for example:
#numeric_summary(titanic4$age, titanic4$pclass)
I assume that the function should work for any x and group passed to it, so my difficulty lies in making that aspect modular, so to speak. The packages used are tidyverse, PASWR, and knitr.
We can create a function like
numeric_summary <- function(x, group) {
#Count missing values
list(missing = sum(is.na(x)),
#Mean by group
means = tapply(x, group, mean),
#SD by group
sds = tapply(x, group, sd))
}
numeric_summary(mtcars$mpg, mtcars$cyl)
#$missing
#[1] 0
#$means
# 4 6 8
#26.66 19.74 15.10
#$sds
# 4 6 8
#4.510 1.454 2.560
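The function above covers three of the five requested elements. A possible extension that also fills in p.value and is.binary is sketched below. It rests on assumptions not stated in the question: that the intended test is a classical one-way ANOVA F test (here via oneway.test with var.equal = TRUE), and that missing values should be ignored when computing the group means and standard deviations.

```r
numeric_summary <- function(x, group) {
  # Accept any grouping vector, not only ready-made factors
  group <- as.factor(group)
  list(
    # Number of missing values in x
    missing = sum(is.na(x)),
    # Mean and SD by group, skipping NAs (assumption)
    means = tapply(x, group, mean, na.rm = TRUE),
    sds = tapply(x, group, sd, na.rm = TRUE),
    # Assumption: equal-variance one-way ANOVA for "all means equal"
    p.value = oneway.test(x ~ group, var.equal = TRUE)$p.value,
    is.binary = FALSE
  )
}

res <- numeric_summary(mtcars$mpg, mtcars$cyl)
str(res)
```

Because x and group are ordinary function arguments, the same call works for any pair of equal-length vectors, for example numeric_summary(titanic4$age, titanic4$pclass).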
I have a few tables that count the frequency of eye color across three dataframes. I thought I could convert those tables into dataframes, then merge them into a new dataframe. I believe the error comes from the fact that I transformed the tables into dataframes and merge seems to add on the rows. But I need:
Color Tb1 Tb2 Tb3
Bl 5 0 3
Blk 6 7 0
One complication is that not every dataframe has both Black and Blue eye colors in it. So I need to account for that, then change the NAs to 0s.
I have tried:
chart<-merge(tb1,tb2,tb3)
chart
Error in fix.by(by.x, x) :
'by' must specify one or more columns as numbers, names or logical
AND
chart<-merge(tb1,tb2,tb3,all=T)
chart
Error in fix.by(by.x, x) :
'by' must specify one or more columns as numbers, names or logical
Example code:
one<-c('Black','Black','Black','Black','Black','Black','Blue','Blue','Blue','Blue','Blue')
two<-c('Black','Black','Black','Black','Black','Black','Black')
three<-c('Blue','Blue','Blue')
tb1<-table(one)
tb2<-table(two)
tb3<-table(three)
tb1<-as.data.frame(tb1)
tb2<-as.data.frame(tb2)
tb3<-as.data.frame(tb3)
You can convert all tables directly into one tibble using bind_rows from the package dplyr:
# creating the setup given in the question
one<-c('Black','Black','Black','Black','Black','Black','Blue','Blue','Blue','Blue','Blue')
two<-c('Black','Black','Black','Black','Black','Black','Black')
three<-c('Blue','Blue','Blue')
tb1<-table(one)
tb2<-table(two)
tb3<-table(three)
# note that there is no need for individual dataframes
# bind the rows of the given tables into a tibble
result <- dplyr::bind_rows(tb1, tb2, tb3)
# replace NAs with 0 values
result[is.na(result)] <- 0
# check the resulting tibble
result
# # A tibble: 3 x 2
# Black Blue
# <dbl> <dbl>
# 1 6 5
# 2 7 0
# 3 0 3
Doing it your way, I would do something like the following (the column names will need to be corrected afterwards):
newframe <- merge(tb1, tb2, by.x ="one", by.y ="two", all = TRUE)
merge(newframe, tb3, by.x ="one", by.y ="three", all = TRUE)
However, for nicer ways, check the joins in the dplyr package.
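To get the layout requested in the question (one row per color, one column per table, zeros instead of NAs) with base R only, the pairwise merge can be repeated over a list. This is a rough sketch: the helper count_colors and the column names Color, Tb1, Tb2, Tb3 are made up for illustration.

```r
# Example vectors from the question
one   <- c(rep("Black", 6), rep("Blue", 5))
two   <- rep("Black", 7)
three <- rep("Blue", 3)

# Hypothetical helper: frequency table as a data frame with a shared
# key column "Color" and a count column named after the source
count_colors <- function(x, name) {
  setNames(as.data.frame(table(x), stringsAsFactors = FALSE),
           c("Color", name))
}

tables <- list(count_colors(one, "Tb1"),
               count_colors(two, "Tb2"),
               count_colors(three, "Tb3"))

# merge() joins only two data frames at a time, so fold it over the list;
# all = TRUE keeps colors that are missing from one side
chart <- Reduce(function(a, b) merge(a, b, by = "Color", all = TRUE), tables)

# Colors absent from a table come back as NA; turn them into 0
chart[is.na(chart)] <- 0
chart
```

Giving every table the same key column name avoids the by.x/by.y bookkeeping and the later renaming step.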
I have a matrix B that is 10 rows x 2 columns:
B = matrix(c(1:20), nrow=10, ncol=2)
Some of the rows are technical duplicates, and they correspond to the same number in a list of length 20 (list1).
list1 = c(1,1,1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,8,8)
list1 = as.list(list1)
I would like to use this list (list1) to take the mean of any duplicate values for all columns in B such that I end up with a matrix or data.frame with 8 rows and 2 columns (all the duplicates are averaged).
Here is my code:
aggregate.data.frame(B, by=list1, FUN=mean)
And it generates this error:
Error in aggregate.data.frame(B, by = list1, FUN = mean) :
arguments must have same length
What am I doing wrong?
Thank you!
Your data have 2 variables (2 columns), each with 10 observations (10 rows). The function aggregate.data.frame expects each element of the by list to have the same length as the number of observations (rows) in your data. You are getting an error because the vector in your list has 20 values, while you only have 10 observations per variable. So, for example, you can do the following, because now you have 1 variable with 20 observations and list1 holds a vector with 20 elements.
B <- 1:20
list1 <- list(B=c(1,1,1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,8,8))
aggregate.data.frame(B, by=list1, FUN=mean)
The code will also work if you give it a matrix with 2 columns and 20 rows.
aggregate.data.frame(cbind(B,B), by=list1, FUN=mean)
I think this answer addresses why you are getting an error. However, I am not sure that it addresses what you are actually trying to do. How do you expect to end up with 8 rows and 2 columns? What exactly would the cells in that matrix represent?
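If the duplicates are meant to be whole rows of B, the grouping vector needs one label per row, that is, 10 labels rather than 20. The sketch below uses a made-up grouping (row_group) purely to illustrate the length rule; it is not the grouping from the question, which has 20 entries.

```r
B <- matrix(1:20, nrow = 10, ncol = 2)

# Hypothetical grouping: one label per ROW of B (length 10, not 20)
row_group <- c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5)
stopifnot(length(row_group) == nrow(B))

# Average both columns within each group of duplicate rows
averaged <- aggregate(as.data.frame(B), by = list(group = row_group), FUN = mean)
averaged
```

The result has one row per distinct label and keeps both columns of B (named V1 and V2 by as.data.frame), plus the group column.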
I have a list object as shown below ->
> myaggregate
input$AgeAndGender input$CTR
1 Female_<18 0.030041698
2 Female_18-24 0.010918938
3 Female_25-34 0.009839806
4 Female_35-44 0.010193773
5 Female_45-54 0.009996056
6 Female_55-64 0.020024678
7 Female_65+ 0.030060728
8 Male_<18 0.028356698
9 Male_18-24 0.011031902
10 Male_25-34 0.010218562
11 Male_35-44 0.010168911
12 Male_45-54 0.010021256
13 Male_55-64 0.020191223
14 Male_65+ 0.029717747
I'm trying to plot a bar graph representing the CTR levels (Y axis) for each value in AgeAndGender (X axis).
When I attempt a simple plot however I run into the following issue ->
> ggplot(data= myaggregate,aes(x=input$AgeAndGender,y=input$CTR))+geom_bar()
Error in data.frame(x = c("Male_35-44", "Female_65+", "Male_25-34", "Female_45-54", :
arguments imply differing number of rows: 3378934, 14
I'm sure I'm missing something pretty basic. Any help is appreciated!
If you are just wanting to plot the values, then you need stat="identity" like in the following example:
library(ggplot2)
AgeAndGender <- c("f1","f2","f3")
CTR <- c(.1,.15,.12)
myaggregate <- data.frame(AgeAndGender, CTR)
ggplot(data= myaggregate,aes(x=AgeAndGender, y=CTR)) + geom_bar(stat = "identity")
This produces a bar chart with one bar per AgeAndGender value.
Looking at your comment about your data being in a list concerns me. Try making myaggregate a dataframe.
I was able to plot with something like what you are using, but it's a rather weird construction. Data frames do not generally have dollar signs in their column names because $ is an infix operator in R. I read in the data with read.table, and the dollar signs get converted to periods. I put back the column names as you have them with:
names(myaggregate) <- c('input$AgeAndGender', 'input$CTR')
And then you can get a rather messy barplot with:
ggplot(data= myaggregate,aes(x=`input$AgeAndGender`,y=`input$CTR`))+ geom_bar(stat = "identity")
When you run your code as written, the name input$AgeAndGender without backticks is interpreted as the AgeAndGender column of a separate object called input, not as a column of myaggregate. Ordinary quotes will not work here either; you need backticks.
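A tidier route may be to rename the columns once so that no backticks are needed at all. The sketch below builds a small stand-in for myaggregate from three rows of the printed output; the requireNamespace guard is only there so the snippet still runs where ggplot2 is not installed.

```r
# Stand-in for the aggregated data, with the awkward names from the question
myaggregate <- data.frame(c("Female_<18", "Female_18-24", "Male_<18"),
                          c(0.030041698, 0.010918938, 0.028356698))
names(myaggregate) <- c("input$AgeAndGender", "input$CTR")

# Strip the "input$" prefix so the columns get ordinary syntactic names
names(myaggregate) <- sub("^input\\$", "", names(myaggregate))

# Plot only if ggplot2 is available
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(myaggregate, aes(x = AgeAndGender, y = CTR)) +
    geom_bar(stat = "identity")
  print(p)
}
```

After the rename, aes() looks the names up inside myaggregate, so the lengths of x and y always match.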
Suppose I have a data frame in R where I would like to use 2 columns "factor1" and "factor2" as factors and I need to calculate mean value for all other columns per each pair of the above mentioned factors. After running the code below, the last line gives the following warnings:
Warning messages:
1: In split.default(seq_along(x), f, drop = drop, ...) :
data length is not a multiple of split variable
...
Why is it happening and what should I do to make it right?
Thanks.
Here is my code:
# Create data frame
myDataFrame <- data.frame(factor1=c(1,1,1,2,2,2,3,3,3), factor2=c(3,3,3,4,4,4,5,5,5), val1=c(1,2,3,4,5,6,7,8,9), val2=c(9,8,7,6,5,4,3,2,1))
# Split by 2 columns (factors)
splitDataFrame <- split(myDataFrame, list(myDataFrame$factor1, myDataFrame$factor2))
# Calculate mean value for each column per each pair of factors
splitMeanValues <- lapply(splitDataFrame, function(x) apply(x, 2, mean))
# Combine back to a reduced table where there is only one value (mean) per pair of factors
MeanValues <- unsplit(splitMeanValues, list(unique(myDataFrame$factor1), unique(myDataFrame$factor2)))
EDIT1: Added data frame creation (see above)
If you need to calculate the mean for all other columns than the factors, you can use the formula syntax of aggregate()
aggregate(.~factor1+factor2, myDataFrame, FUN=mean)
That returns
factor1 factor2 val1 val2
1 1 3 2 8
2 2 4 5 5
3 3 5 8 2
Your split() method didn't work because when you unsplit you must have the same number of rows as when you split your data. You were reducing the number of rows for all groups to just one row. Plus, unsplit really should be used with the exact same list of factors that was used to do the split; otherwise groups may get out of order. You could do a split, then lapply some collapsing function, and then rbind the list back into a single data.frame if you really wanted, but for a simple mean, aggregate is probably best.
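For completeness, the split / lapply / rbind route mentioned above could look roughly like this. It is a sketch, not a recommendation over aggregate; the names pieces, collapsed and meanValues are illustrative.

```r
myDataFrame <- data.frame(factor1 = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                          factor2 = c(3, 3, 3, 4, 4, 4, 5, 5, 5),
                          val1 = 1:9, val2 = 9:1)

# drop = TRUE discards factor combinations that never occur in the data
pieces <- split(myDataFrame,
                list(myDataFrame$factor1, myDataFrame$factor2),
                drop = TRUE)

# Collapse each piece to a one-row data frame of column means
collapsed <- lapply(pieces, function(d) as.data.frame(as.list(colMeans(d))))

# Stack the one-row pieces instead of calling unsplit()
meanValues <- do.call(rbind, collapsed)
meanValues
```

Here rbind replaces unsplit because each group has been reduced to a single row, so there is no longer a one-to-one match with the original rows.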
The same result can be obtained with summaryBy() in the doBy package, although it's pretty much the same as aggregate() in this case.
> library(doBy)
> summaryBy( . ~ factor1+factor2, data = myDataFrame)
# factor1 factor2 val1.mean val2.mean
# 1 1 3 2 8
# 2 2 4 5 5
# 3 3 5 8 2
Have you tried aggregate?
# 'by' must be a list; val1 stands in for whichever value column you want to average
aggregate(myDataFrame$val1, by = list(myDataFrame$factor1), FUN = mean)
aggregate(myDataFrame$val1, by = list(myDataFrame$factor2), FUN = mean)