pie3D: x values must be positive numbers - r

I load the supportdata variable as follows.
supportdata=aggregate(scoredata$Support, list(Topic = scoredata$Topic), sum)
slices <- supportdata[2]
lbls <- supportdata[1]
typeof(slices)
# 3D Exploded Pie Chart Below
pie3D(slices,labels=lbls,explode=0.1,main="Year wise scores for topic 1")
and I get the below error:
Error in pie3D(slices, labels = lbls, explode = 0.1, main = "Year wise
scores for topic 1") :pie3D: x values must be positive numbers
The supportdata variable contains the following data. It is generated with the aggregate function, which sums the scores in the second column for each topic.
# supportdata
#
# Topic x
#
# 1 c 14
# 2 c# 80
# 3 c++ 15
# 4 css 4
# 5 html 3
# 6 .net 3
# 7 php 0
# 8 sql 0
How do I get rid of this error? I tried searching but couldn't find a solution to this problem. I tried casting with as.numeric and as.integer, but R says the list cannot be coerced to type double or integer. :(

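Your problem is indexing with [ rather than [[. On a data frame, [ returns a one-column data frame (which is a list) rather than a numeric vector, and pie3D does not accept that as numeric input; [[ extracts the column itself.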
library("plotrix")
pie3D(supportdata[[2]],labels=supportdata[[1]],
explode=0.1,main="Year wise scores for topic 1")
works fine, as does
with(supportdata,pie3D(x,labels=Topic,
explode=0.1,main="Year wise scores for topic 1"))
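The difference between the two kinds of indexing can be checked without plotting anything. This is a minimal sketch in base R; the three-row supportdata below is a shortened stand-in for the data in the question, not the asker's actual object.

```r
# Shortened stand-in for the aggregated data from the question
supportdata <- data.frame(Topic = c("c", "c#", "c++"), x = c(14, 80, 15))

single <- supportdata[2]     # one-column data frame
double <- supportdata[[2]]   # the column itself, as a numeric vector

class(single)        # "data.frame"
typeof(single)       # "list"
is.numeric(single)   # FALSE
is.numeric(double)   # TRUE
```

This also explains the coercion error in the question: as.numeric is being applied to a list-like data frame rather than to the column inside it.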

The solution below also works, in addition to the one provided by Ben. Transposing with t() converts each one-column data frame to a matrix.
slices <- t(supportdata[2])
lbls <- t(supportdata[1])
pie3D(slices,labels=lbls,explode=0.1,main="Pie Diagram for Support")

Related

How to import single-column CSV as an array in R? (values importing as column names)

I have an "upside down" comma separated list of 1000 single-digit integers. What I'm trying to do is produce a count of each integer, and a histogram showing their distribution.
e.g. 3,1,3,2,1,2,1,1,0,3,1,2,0,2,1,1,2,1,2,1,2,0,0,1,....
However, when I read.csv(), R produces a data frame with 0 observations of 1000 variables, and my numbers as column names.
How can I bring in this unconventional data format as a single column data-frame/array? I have tried x <- x[-1, ]
TIA!
As pointed out in the comments, scan("filename.csv", sep = ",") is the way to go here.
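A small sketch of why scan behaves differently from read.csv here. It uses the text argument with a short made-up string instead of a file, so the ten values below are only an illustration.

```r
# Ten sample values standing in for the contents of the file
raw <- "3,1,3,2,1,2,1,1,0,3"

# read.csv treats the only line as a header: zero rows, one column per value
bad <- read.csv(text = raw)
dim(bad)            # 0 rows, 10 columns

# scan reads the same text as a plain numeric vector
x <- scan(text = raw, sep = ",", quiet = TRUE)
counts <- table(x)  # count of each integer
counts
```

With a real file, the first argument of scan would be the file name, as in the answer.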
The following code produces a histogram and prints out the proportion of each value in the data.
library(tidyverse)
x = scan("text.csv", sep=",")
N = length(x)
df <- as.data.frame(x)
ggplot(df) + geom_histogram(aes(x=x))
count(df, x) %>% mutate(n=n/N)
Output:
x n
1 0 0.171
2 1 0.352
3 2 0.319
4 3 0.158
...Confirming my suspicion that Math.round(Math.random() * N) in JS is not uniform and under-represents the borders (in this case 0 and 3), because each border value only collects numbers from a half-width interval (e.g. 0.6 rounds up to 1). And of course, all of this was in vain, because I should have been using Math.floor() this whole time! :)
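The border effect can be reproduced in R. This is a sketch only: runif stands in for Math.random(), and N = 3 is assumed from the output above. With rounding, 0 and 3 should each come out near 1/6 and 1 and 2 near 1/3, while flooring a draw scaled by 4 should give roughly 1/4 for each value.

```r
set.seed(1)                      # for reproducibility
u <- runif(100000)               # stand-in for Math.random()

rounded <- round(u * 3)          # analogue of Math.round(Math.random() * 3)
floored <- floor(u * 4)          # analogue of Math.floor(Math.random() * 4)

prop_round <- prop.table(table(rounded))  # borders under-represented
prop_floor <- prop.table(table(floored))  # close to uniform
prop_round
prop_floor
```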
Running the same test, with Math.floor() instead of Math.round(), truncates the top number but produces a far more even result:

R get all categories in column

I have a large dataset (a data frame) where I want to find the number and the names of the categories in a column.
For example, if my df looked like this:
A B
1 car
2 car
3 bus
4 car
5 plane
6 plane
7 plane
8 plane
9 plane
10 train
I would want to find :
car
bus
plane
train
4
How would I do that?
categories <- unique(yourDataFrame$yourColumn)
numberOfCategories <- length(categories)
Pretty painless.
This gives unique, length of unique, and frequency:
table(df$B)
bus car plane train
1 3 5 1
length(table(df$B))
[1] 4
You can simply use unique:
x <- unique(df$B)
And it will extract the unique values in the column. You can use it with lapply to get them from each column too!
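For the per-column case, here is a minimal sketch using lapply, which keeps each column's type (apply would first coerce the data frame to a matrix). The data frame is rebuilt from the example in the question.

```r
# Example data from the question
df <- data.frame(
  A = 1:10,
  B = c("car", "car", "bus", "car", rep("plane", 5), "train")
)

per_column <- lapply(df, unique)        # list with the unique values of each column
n_per_column <- lengths(per_column)     # number of unique values per column

per_column$B    # "car" "bus" "plane" "train"
n_per_column    # A: 10, B: 4
```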
I would recommend you use factors here, if you are not already. It's straightforward and simple.
levels() gives the unique categories and nlevels() gives the number of them. If we run droplevels() on the data first, we take care of any levels that may no longer be in the data.
with(droplevels(df), list(levels = levels(B), nlevels = nlevels(B)))
# $levels
# [1] "bus" "car" "plane" "train"
#
# $nlevels
# [1] 4
Additionally, to see sorted values you can use the following:
sort(table(df$B), decreasing = TRUE)
And you will see the values in the decreasing order.
First, make sure your column has the correct data type. R has most probably read it in as 'chr', which you can check with 'str(df)'.
For the data you have provided as an example, you will want to change this to a 'factor': df$column <- as.factor(df$column)
Once the data is in the correct format, you can use 'levels(df$column)' to list the levels present in the dataset.
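A short sketch of the factor route on the example data, including the droplevels step mentioned earlier. The subset that removes "train" is a made-up illustration of how stale levels can linger.

```r
# Example data from the question, read in as character
df <- data.frame(
  A = 1:10,
  B = c("car", "car", "bus", "car", rep("plane", 5), "train"),
  stringsAsFactors = FALSE
)

df$B <- as.factor(df$B)
levels(df$B)    # "bus" "car" "plane" "train"
nlevels(df$B)   # 4

# After subsetting, unused levels stay until they are dropped
sub <- df[df$B != "train", ]
nlevels(sub$B)               # still 4
nlevels(droplevels(sub)$B)   # 3
```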

Datamatrix in R - extracting data from scatterplot

I am trying to modify an R script, but I have only basic experience with R:
question 1:
In the line for (i in 1:nrow(x)), what does the integer 1 actually do? Changing the value to 2 or higher seems to have a big effect on the output.
question 2:
I have been getting the message:
"Error in if (p[2] > a + b * p[1]) { :
missing value where TRUE/FALSE needed"
In general, what might be causing this?
Any help is much appreciated!
question edited:
Say I have a dataframe for plotting scatterplot. The dataframe would be organized in the following fashion (in CSV format):
name ABC EFG
1 32 45
2 56 67
to, say 200 000 entries
I am going to first do a scatterplot, after which I am going to subset a portion of the dataset into A using alphahull and export them as XYZ. The script for doing this:
#plot first plot containing all data
plot(x = X$ABC,
y = X$EFG,
pch=20,
)
#subset data using ahull. choose 4 points on the plot
A <- ahull(locator(4, type="p", pch=20), alpha=10000)
#exporting subset
XYZ <- {}
for (i in 1:nrow(X)) { if (inahull(A, c(X$ABC[i],X$EFG[i]))) XYZ <- rbind(XYZ,X[i,])}
I am getting the following message if the number of data points in the subset that I choose is too large:
Error in if (p[2] > a + b * p[1]) { :
missing value where TRUE/FALSE needed
Question 1 - this is a for loop. 1:nrow(x) generates the sequence 1, 2, ..., nrow(x), so the body executes once for each row in the matrix or data frame x (not sure what x is here exactly). Changing the 1 to 2 makes the loop start at the second row, so it runs one fewer time and skips row 1. Without the rest of the code I can't say much else.
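The effect of the starting value can be seen by printing the sequences themselves. A minimal sketch with a made-up three-row data frame; seq_len is shown as a commonly used alternative that behaves sensibly when there are no rows.

```r
x <- data.frame(v = c(10, 20, 30))   # made-up data with 3 rows

rows_from_1 <- 1:nrow(x)   # 1 2 3: visits every row
rows_from_2 <- 2:nrow(x)   # 2 3: skips the first row

# With zero rows, 1:nrow() counts down instead of being empty
empty <- x[0, , drop = FALSE]
countdown <- 1:nrow(empty)        # 1 0
safe <- seq_len(nrow(empty))      # integer(0), so a loop body never runs
```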
Question 2 - can you post the whole code? It apparently needs to evaluate that expression and one or more of the values is missing.
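The message itself is what R reports whenever the condition of an if evaluates to NA. A minimal sketch that reproduces it with made-up values; in the question the comparison happens inside inahull, so the NA may come from the hull geometry rather than from the input data, and cleaning the input as shown is only one thing to try.

```r
p <- c(1, NA)   # made-up point with a missing coordinate

# Capture the error message instead of stopping the script
msg <- tryCatch(
  if (p[2] > 0) "above" else "below",
  error = function(e) conditionMessage(e)
)
msg   # "missing value where TRUE/FALSE needed"

# isTRUE() treats NA as FALSE, so the condition is always usable
guarded <- if (isTRUE(p[2] > 0)) "above" else "not above"

# Hypothetical input with a missing value; keep only complete rows
X <- data.frame(ABC = c(32, 56, NA), EFG = c(45, 67, 10))
X_clean <- X[complete.cases(X[, c("ABC", "EFG")]), ]
nrow(X_clean)   # 2
```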
Say you have data x
set.seed(123) # for reproducibility
x<-as.data.frame(rnorm(10)) # generate random number and store it as dataframe
k<-2 # assign k as 2
for (i in (1:nrow(x))){
cat("this is row",i,"\n")
show (k)
k<-k+i
}
show (k)
this is row 1
[1] 2
this is row 2
[1] 3
this is row 3
[1] 5
this is row 4
[1] 8
this is row 5
[1] 12
this is row 6
[1] 17
this is row 7
[1] 23
this is row 8
[1] 30
this is row 9
[1] 38
this is row 10
[1] 47
> show (k)
[1] 57

Get ordered kmeans cluster labels

Say I have a data set x and do the following kmeans cluster:
fit <- kmeans(x,2)
My question is in regards to the output of fit$cluster: I know that it will give me a vector of integers (from 1:k) indicating the cluster to which each point is allocated. Instead, is there a way to have the clusters be labeled 1,2, etc... in order of decreasing numerical value of their center?
For example: If x=c(1.5,1.4,1.45,.2,.3,.3) , then fit$cluster should result in (1,1,1,2,2,2) but not result in (2,2,2,1,1,1)
Similarly, if x=c(1.5,.2,1.45,1.4,.3,.3) then fit$cluster should return (1,2,1,1,2,2), instead of (2,1,2,2,1,1)
Right now, fit$cluster seems to label the cluster numbers randomly. I've looked into documentation but haven't been able to find anything. Please let me know if you can help!
I had a similar problem. I had a vector of ages that I wanted to separate into 5 factor groups based on a logical ordinal set. I did the following:
I ran the k-means function:
k5 <- kmeans(all_data$age, centers = 5, nstart = 25)
I built a data frame of the k-means indexes and centres; then arranged it by centre value.
library(dplyr) # provides tibble(), arrange() and %>%
kmeans_index <- as.numeric(rownames(k5$centers))
k_means_centres <- as.numeric(k5$centers)
# tibble() replaces the deprecated data_frame()
k_means_df <- tibble(index=kmeans_index, centres=k_means_centres)
k_means_df <- k_means_df %>%
arrange(centres)
Now that the centres are in the df in ascending order, I created my 5 element factor list and bound it to the data frame:
factors <- c("very_young", "young", "middle_age", "old", "very_old")
k_means_df <- cbind(k_means_df, factors)
Looks like this:
> k_means_df
index centres factors
1 2 23.33770 very_young
2 5 39.15239 young
3 1 55.31727 middle_age
4 4 67.49422 old
5 3 79.38353 very_old
I saved my cluster values in a data frame and created a dummy factor column:
cluster_vals <- tibble(cluster=k5$cluster, factor=NA_character_)
Finally, I iterated through the factor options in k_means_df and replaced the cluster value with my factor/character value within the cluster_vals data frame:
for (i in 1:nrow(k_means_df))
{
index_val <- k_means_df$index[i]
factor_val <- as.character(k_means_df$factors[i])
cluster_vals <- cluster_vals %>%
mutate(factor=replace(factor, cluster==index_val, factor_val))
}
Voila; I now have a vector of factors/characters that were applied based on their ordinal logic to the randomly created cluster vector.
# A tibble: 3,163 x 2
cluster factor
<int> <chr>
1 4 old
2 2 very_young
3 2 very_young
4 2 very_young
5 3 very_old
6 3 very_old
7 4 old
8 4 old
9 2 very_young
10 5 young
# ... with 3,153 more rows
Hope this helps.
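The same mapping can probably be done without the loop by letting factor() do the relabelling. This is a sketch in base R, not the code from the answer above: the ages are simulated, three groups are used instead of five, and the label names are made up.

```r
set.seed(42)
# Simulated ages in three well-separated groups (illustration only)
age <- c(rnorm(50, 25, 2), rnorm(50, 50, 2), rnorm(50, 75, 2))

k3 <- kmeans(age, centers = 3, nstart = 25)

# Cluster ids sorted by ascending centre
ord <- order(k3$centers[, 1])

# Use that order as the factor levels and attach the ordinal labels
age_group <- factor(k3$cluster, levels = ord,
                    labels = c("young", "middle_age", "old"),
                    ordered = TRUE)

group_means <- tapply(age, age_group, mean)   # increases from young to old
group_means
```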
K-means is a randomized algorithm: the initial centers are chosen at random, so it is expected that the labels are neither consistent across runs nor sorted in "ascending" order.
But you can of course remap the labels as you like.
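One way to do that remapping for the example in the question, as a minimal sketch: order the cluster ids by decreasing center, then look up each point's cluster in that ordering. The variable names are made up for illustration.

```r
x <- c(1.5, 1.4, 1.45, 0.2, 0.3, 0.3)

set.seed(1)
fit <- kmeans(x, centers = 2, nstart = 10)

# Cluster ids sorted so that the largest center comes first
by_center <- order(fit$centers[, 1], decreasing = TRUE)

# Position of each point's cluster id in that ordering becomes its new label
ordered_cluster <- match(fit$cluster, by_center)
ordered_cluster   # 1 1 1 2 2 2
```

The relabelled vector no longer depends on which id kmeans happened to assign to which cluster.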
You seem to be using 1-dimensional data. Then k-means is actually not the best choice for you.
In contrast to 2- and higher-dimensional data, 1-dimensional data can efficiently be sorted. If your data is 1-dimensional, use an algorithm that exploits this for efficiency. There are much better algorithms for 1-dimensional data than for multivariate data.

Read multidimensional group data in R

I have done a lot of googling, but I didn't find a satisfactory solution to my problem.
Say we have a data file like this:
Tag v1 v2 v3
A 1 2 3
B 1 2 2
C 5 6 1
A 9 2 7
C 1 0 1
The first line is the header. The first column is the group id (the data have 3 groups: A, B, C), while the other columns are values.
I want to read this file in R so that I can apply different functions on the data.
For example I tried to read the file and tried to get column mean
dt<-read.table(file_name,header=TRUE) #gives warnings
apply(dt,2,mean) #gives NA NA NA
I want to read this file and get the column means. Then I want to separate the data into 3 groups (according to Tag A, B, C) and calculate the column-wise mean for each group. Any help is appreciated.
apply(dt,2,mean) doesn't work because apply coerces the first argument to an array via as.matrix (as is stated in the first paragraph of the Details section of ?apply). Since the first column is character, all elements in the coerced matrix object will be character.
Try this instead:
sapply(dt,mean) # works because data.frames are lists
To calculate column means by groups:
# using base functions
grpMeans1 <- t(sapply(split(dt[,c("v1","v2","v3")], dt[,"Tag"]), colMeans))
# using plyr
library(plyr)
grpMeans2 <- ddply(dt, "Tag", function(x) colMeans(x[,c("v1","v2","v3")]))
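For a version that runs without a file or extra packages, here is a sketch that reads the sample data from the question via the text argument and uses aggregate for the group means. It is an alternative to the two approaches above, not a replacement for them.

```r
# Sample data from the question, read from a string instead of a file
dt <- read.table(text = "Tag v1 v2 v3
A 1 2 3
B 1 2 2
C 5 6 1
A 9 2 7
C 1 0 1", header = TRUE)

# as.matrix() on mixed columns yields a character matrix,
# which is why apply(dt, 2, mean) returns NA
is.character(as.matrix(dt))   # TRUE

# Overall column means, numeric columns only
col_means <- colMeans(dt[, c("v1", "v2", "v3")])
col_means   # v1 3.4, v2 2.4, v3 2.8

# Column means per Tag
grp_means <- aggregate(cbind(v1, v2, v3) ~ Tag, data = dt, FUN = mean)
grp_means
```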

Resources