hist.default: 'x' must be numeric

Just picking up R and I have the following question:
Say I have the following data.frame:
v1 v2 v3
3 16 a
44 457 d
5 23 d
34 122 c
12 222 a
...and so on
I would like to create a histogram or barchart for this in R, but instead of having the x-axis be one of the numeric values, I would like a count by v3. (2 a, 1 c, 2 d...etc.)
If I do hist(dataFrame$v3), I get the error that 'x' must be numeric.
Why can't it count the instances of each different string like it can for the other columns?
What would be the simplest code for this?

OK. First of all, you should know exactly what a histogram is. It is not a plot of counts of distinct values. It is a visualization for continuous variables: the values are grouped into bins, and the bar heights estimate the underlying probability density function. So do not try to use hist on categorical data. (That's why hist tells you that the value you pass must be numeric.)
If you just want counts of discrete values, that's just a basic bar plot. You can calculate counts of values in R for discrete data using table and then plot that with the basic barplot() command.
barplot(table(dataFrame$v3))
If you want to require a minimum number of observations, try
tbl<-table(dataFrame$v3)
atleast <- function(i) {function(x) x>=i}
barplot(Filter(atleast(10), tbl))
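As a minimal sketch of the above, assuming the five rows shown in the question make up the whole data frame (the name dataFrame comes from the question; the threshold of 2 is chosen here only so that something survives the filter):

```r
# Rebuild the five example rows from the question.
dataFrame <- data.frame(
  v1 = c(3, 44, 5, 34, 12),
  v2 = c(16, 457, 23, 122, 222),
  v3 = c("a", "d", "d", "c", "a")
)

# table() counts how often each distinct value occurs.
tbl <- table(dataFrame$v3)
print(tbl)

# Keep only the values seen at least twice (illustrative threshold).
frequent <- tbl[tbl >= 2]

# One bar per remaining category; the bar height is the count.
barplot(frequent, xlab = "v3", ylab = "count")
```

Subsetting the table with a logical condition does the same job as the Filter() call above and may be easier to read.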

Related

R: ggplot:: geom_path: Each group consist of only one observation. Do you need to adjust the group aesthetic?

I am trying to make a line graph in ggplot and I am having difficulty diagnosing my error. I've read through nearly all the similar threads, but have been unable to solve my issue.
I am trying to plot Japanese CPI. I downloaded the data online from FRED.
my str looks like:
str(jpycpi)
data.frame: 179 obs. of 2 variables:
$ DATE : Factor w/ 179 levels "2000-04-01","2000-05-01",..: 1 2 3 4 5 6 7 8 9 10 ...
$ JPNCPIALLMINMEI: num 103 103 103 102 103 ...
My code to plot:
ggplot(jpycpi, aes(x=jpycpi$DATE, y=jpycpi$JPNCPIALLMINMEI)) + geom_line()
it gives me an error saying:
geom_path: Each group consists of only one observation. Do you need to adjust the group aesthetic?
I have tried the following and have been able to plot it, but the x-axis of the graph is distorted for some odd reason. That code is below:
ggplot(jpycpi, aes(x=jpycpi$DATE, y=jpycpi$JPNCPIALLMINMEI, group=1)) + geom_line()
The "Each group consists of only one observation" error message happens because your x aesthetic is a factor. ggplot takes that to mean that your independent variable is categorical, which doesn't make sense in conjunction with geom_line.
In this case, the right way to fix it is to convert that column of the data to a Date vector. ggplot understands how to use all of R's date/time classes as the x aesthetic.
Converting from a factor to a Date is a little tricky. A direct conversion,
jpycpi$DATE <- as.Date(jpycpi$DATE)
works in R version 3.3.1, but, if I remember correctly, would give nonsense results in older versions of the interpreter, because as.Date would look only at the ordinals of the factor levels, not at their labels. Instead, one should write
jpycpi$DATE <- as.Date(as.character(jpycpi$DATE))
Conversion from a factor to a character vector does look at the labels, so the subsequent conversion to a Date object will do the Right Thing.
You probably got a factor for $DATE in the first place because you used read.table or read.csv to load up the data set. In R versions before 4.0.0, the default behavior of these functions is to attempt to convert each column to a numeric vector, and failing that, to convert it to a factor; since R 4.0.0 the default is stringsAsFactors = FALSE, so such a column stays a character vector instead. (See ?type.convert for the exact behavior.) If you're going to be importing lots of data with date columns, it's worth learning how to use the colClasses argument to read.table; this is more efficient and doesn't have gotchas like the above.
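A minimal sketch of the colClasses approach, assuming a two-column CSV laid out like the FRED download. The three rows below are made-up placeholder values, not the real series, and csv_text is a hypothetical name used only for this example:

```r
# Three placeholder rows in the same layout as the FRED file
# (values are illustrative, not real CPI data).
csv_text <- "DATE,JPNCPIALLMINMEI
2000-04-01,103.1
2000-05-01,103.2
2000-06-01,102.9"

# colClasses declares the type of each column up front, so DATE is
# parsed as a Date and never passes through a factor.
jpycpi <- read.csv(text = csv_text, colClasses = c("Date", "numeric"))

str(jpycpi)
```

With DATE stored as a Date, ggplot(jpycpi, aes(x = DATE, y = JPNCPIALLMINMEI)) + geom_line() should draw a single line without needing group = 1.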

How do you create a three way contingency table in R when you only have probabilities?

I'd like to create a three-way contingency table in R. Let's assume that one of my variables is a categorical variable with two levels, another is a categorical variable with two levels, and the third is a categorical variable with three levels.
I know that you have to use the
table()
function, but I'm not sure how to set things up. Most of the explanations that I've seen online assume that you are given the data in a format similar to this, but I just have an actual contingency table that is similar to the one in example 1 here.
How do I make a contingency table when I don't have an actual data file?
For example, in the example that I linked to could I create factors like this?
yesFactor<- c(19,0,11,6)
For example this is my code below
# Load the MASS package, which provides loglm()
library(MASS)
datos2.array <- array(c(16,7,15,34,5,3,1,1,3,8,1,3),
                      dim=c(2,2,3),
                      dimnames=list(Conduct=c("Buena","Mala"),
                                    Riesgo=c("N","R"),
                                    Adversidad=c("Bajo","Medio","Alto")))
my.table <- ftable(datos2.array)
model2 <- loglm(~ Riesgo + Adversidad, data = datos2.array)
When I try printing the table I get the following
               Adversidad Bajo Medio Alto
Conduct Riesgo
Buena   N                   16     5    3
        R                   15     1    1
Mala    N                    7     3    8
        R                   34     1    3
I need a 2X3X2 table with N and R being factors of Bajo, Medio and Alto.
I've also tried the other way
datos2.array <- array(c(16,7,15,34,5,3,1,1,3,8,1,3),
                      dim=c(2,3,2),
                      dimnames=list(Conduct=c("Buena","Mala"),
                                    Adversidad=c("Bajo","Medio","Alto"),
                                    Riesgo=c("N","R")))
and that is wrong as well.

Simple line plot using R ggplot2

I have the following data in .csv format. As I am new to ggplot2 graphs, I am not able to plot it the way I want.
T L
141.5453333 1
148.7116667 1
154.7373333 1
228.2396667 1
148.4423333 1
131.3893333 1
139.2673333 1
140.5556667 2
143.719 2
214.3326667 2
134.4513333 3
169.309 8
161.1313333 4
I tried to plot a line graph using the following code
data<-read.csv("sample.csv",header=TRUE,sep=",")
ggplot(data,aes(T,L))+geom_line()
but I got the following image, which is not what I want
I want an image like the following
Can anybody help me?
You want to use a variable for the x-axis that has lots of duplicated values and expect the software to guess that the order you want those points plotted is given by the order they appear in the data set. This also means the values of the variable for the x-axis no longer correspond to the actual coordinates in the coordinate system you're plotting in, i.e., you want to map a value of "L=1" to different locations on the x-axis depending on where it appears in your data.
This type of fairly non-sensical thing does not work in ggplot2 out of the box. You have to define a separate variable that has a proper mapping to values on the x-axis ("id" in the code below) and then overwrite the labels with the values for "L".
The code below shows you how to do this, but it seems like a different graphical display would probably be better suited for this kind of data.
data <- as.data.frame(matrix(scan(text="
141.5453333 1
148.7116667 1
154.7373333 1
228.2396667 1
148.4423333 1
131.3893333 1
139.2673333 1
140.5556667 2
143.719 2
214.3326667 2
134.4513333 3
169.309 8
161.1313333 4
"), ncol=2, byrow=TRUE))
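names(data) <- c("T", "L")
# Row index used as the actual x coordinate
data$id <- 1:nrow(data)
library(ggplot2)
# Plot against id, then relabel the axis ticks with the values of L
ggplot(data,aes(x=id, y=T))+geom_line() + xlab("L") +
  scale_x_continuous(breaks=data$id, labels=data$L)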
You have an error in your code, try this:
ggplot(data,aes(x=L, y=T))+geom_line()
Default arguments for aes are:
aes(x, y, ...)

R:More than 52 levels in a predicting factor, truncated for printout

Hi, I'm a beginner in the R programming language. I wrote some code for a regression tree using the rpart package. In my data, some of my independent variables have more than 100 levels. After running the rpart function
I'm getting the following warning message: "More than 52 levels in a predicting factor, truncated for printout", and my tree is displayed in a very weird way. For example, my tree splits by location, which has around 70 distinct levels, but the label displayed in the tree is "ZZZZZZZZZZZZZZZZ...........", even though I don't have any location called "ZZZZZZZZ".
Please help me.
Thanks in advance.
Many of the functions in R have limits on the number of levels a factor-type variable can have (e.g. randomForest caps the number of levels of a factor; the cap was 32 in older versions of the package).
One way that I've seen it dealt with especially in data mining competitions is to:
1) Determine maximum number of levels allowed for a given function (call this X).
2) Use table() to determine the number of occurrences of each level of the factor and rank them from greatest to least.
3) For the top X - 1 levels of the factor leave them as is.
4) For the remaining levels (those ranked X or lower), change them all to a single level that identifies them as low-occurrence levels.
Here's an example that's a bit long but hopefully helps:
# Generate 1000 random numbers between 0 and 100.
vars1 <- data.frame(values1=(round(runif(1000) * 100,0)))
# Changes values to factor variable.
vars1$values1 <- factor(vars1$values1)
# Show top 6 rows of data frame.
head(vars1)
# Show the number of unique factor levels
length(unique(vars1$values1 ))
# Create a table showing how often each level occurs.
table1 <- as.data.frame(table(values1 = vars1$values1))
# Order the table in descending order of frequency.
table1 <- table1[order(-table1$Freq),]
head(table1)
# Assuming we want to use CART, we choose the top 51
# levels to leave unchanged.
# Get the labels of the 51 most frequently occurring levels.
noChange <- as.character(table1$values1[1:51])
# We use '-1000' as the catch-all level to avoid overlap w/ other levels
# (i.e. if '52' was actually one of the levels).
# ifelse() checks whether each value is among the top 51 levels. If so it
# keeps the label, if not it changes it to '-1000'. as.character() is
# needed because ifelse() would otherwise return the factor's integer codes.
vars1$newFactor <- factor(ifelse(vars1$values1 %in% noChange,
                                 as.character(vars1$values1), "-1000"))
# Show the number of levels of the new factor column.
length(unique(vars1$newFactor))
Finally, you may want to consider using truncated variables in rpart as the tree display gets very busy when there are a large number of variables or they have long names.

R plot frequency of strings with specific pattern

Given a data frame with a column that contains strings, I would like to plot the frequency of strings that contain a certain pattern. For example:
strings <- c("abcd","defd","hfjfjcd","kgjgcdjrye","yryriiir","twtettecd")
df <- as.data.frame(strings)
df
     strings
1       abcd
2       defd
3    hfjfjcd
4 kgjgcdjrye
5   yryriiir
6  twtettecd
I would like to plot the frequency of the strings that contain the pattern "cd".
Anyone with a quick solution?
I presume from your question that you meant to have some entries that appear more than once, so I've added one duplicate string:
x <- c("abcd","abcd","defd","hfjfjcd","kgjgcdjrye","yryriiir","twtettecd")
To find only those strings that contain a specific pattern, use grep or grepl:
y <- x[grepl("cd", x)]
To get a table of frequencies, you can use table
table(y)
y
      abcd    hfjfjcd kgjgcdjrye  twtettecd
         2          1          1          1
And you can plot it using plot or barplot as follows:
barplot(table(y))
Others have already mentioned grepl. Here is an implementation with plot.density, using grepl to mark each string as a match (1) or a non-match (0):
plot( density(0+grepl("cd", strings)) )
If you don't like the extension of the density plot beyond the range there are other methods in the 'logspline' package that allow one to get sharp border at range extremes. Searching RSiteSearch
Check the "kernlab" package.
You can define a kernel (pattern), which could be any kind of string, and count the matches later on.
