R: get all categories in a column

I have a large dataset (data frame) where I want to find the number and the names of the categories in a column.
For example, my df looks like this:
 A  B
 1  car
 2  car
 3  bus
 4  car
 5  plane
 6  plane
 7  plane
 8  plane
 9  plane
10  train
I would want to find:
car
bus
plane
train
4
How would I do that?

categories <- unique(yourDataFrame$yourColumn)
numberOfCategories <- length(categories)
Pretty painless.
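A minimal, self-contained sketch using the example data from the question:

```r
df <- data.frame(A = 1:10,
                 B = c("car", "car", "bus", "car", "plane",
                       "plane", "plane", "plane", "plane", "train"))

categories <- unique(df$B)   # unique values, in order of first appearance
categories                   # "car" "bus" "plane" "train"
length(categories)           # 4
```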

This gives unique, length of unique, and frequency:
table(df$B)
  bus   car plane train
    1     3     5     1
length(table(df$B))
[1] 4

You can simply use unique:
x <- unique(df$B)
And it will extract the unique values in the column. You can use it with lapply to get the unique values from each column too!
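For example, to collect the unique values of every column at once (a sketch; lapply keeps each column's values as-is):

```r
df <- data.frame(A = c(1, 1, 2), B = c("car", "car", "bus"))

uniques <- lapply(df, unique)   # one list entry per column
uniques$A                       # 1 2
uniques$B                       # "car" "bus"
```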

I would recommend you use factors here, if you are not already. It's straightforward and simple.
levels() gives the unique categories and nlevels() gives the number of them. If we run droplevels() on the data first, we take care of any levels that may no longer be in the data.
with(droplevels(df), list(levels = levels(B), nlevels = nlevels(B)))
# $levels
# [1] "bus" "car" "plane" "train"
#
# $nlevels
# [1] 4

Additionally, to see sorted values you can use the following:
sort(table(df$B), decreasing = TRUE)
And you will see the values in the decreasing order.
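With the example data, that puts the most frequent category first (a sketch; the order among tied counts of 1 may vary):

```r
df <- data.frame(B = c("car", "car", "bus", "car", "plane",
                       "plane", "plane", "plane", "plane", "train"))

counts <- sort(table(df$B), decreasing = TRUE)
names(counts)[1]   # "plane"
counts[[1]]        # 5
```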

First, make sure that your column is of the correct data type. Most probably R read it in as character ('chr'), which you can check with str(df).
For the data you have provided as an example, you will want to change this to a factor: df$column <- as.factor(df$column)
Once the data is in the correct format, you can then use levels(df$column) to get a summary of the levels you have in the dataset.
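A quick sketch of that workflow, assuming the column was read in as character:

```r
df <- data.frame(B = c("car", "car", "bus", "plane", "train"),
                 stringsAsFactors = FALSE)

str(df)                 # B is chr at this point
df$B <- as.factor(df$B)
levels(df$B)            # "bus" "car" "plane" "train" (alphabetical by default)
nlevels(df$B)           # 4
```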

Related

How to calculate most frequent occurring terms/words in a document collection/corpus using R?

First I create a document term matrix like below
dtm <- DocumentTermMatrix(docs)
Then I take the column sums to get the total occurrence of each word, as below:
totalsums <- colSums(as.matrix(dtm))
My totalsums (R says type 'double') looks like this for the first 7 elements:
aaab aabb aabc aacc abbb abbc abcc ...
9 2 10 4 7 3 12 ...
I managed to sort this with the following command
sorted.sums <- sort(totalsums, decreasing=T)
Now I want to extract the first 4 terms/words with the highest sums which are greater than value 5.
I could get the first 4 highest with sorted.sums[1:4] but how can I set a threshold value?
I managed to do this with the order function as below, but is there a way to do this other than with the sort function, or without using the findFreqTerms function?
ord.totalsums <- order(totalsums)
findFreqTerms(dtm, lowfreq=5)
Appreciate your thoughts on this.
You can use
sorted.sums[sorted.sums > 5][1:4]
But if you have at least 4 values that are greater than 5, using sorted.sums[1:4] alone should work as well.
To get the words you can use names.
names(sorted.sums[sorted.sums > 5][1:4])
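Putting it together on a stand-in for totalsums (a plain named vector, so the tm package isn't needed to try the indexing logic):

```r
# same values as the question's first 7 elements
totalsums <- c(aaab = 9, aabb = 2, aabc = 10, aacc = 4,
               abbb = 7, abbc = 3, abcc = 12)

sorted.sums <- sort(totalsums, decreasing = TRUE)
top <- sorted.sums[sorted.sums > 5][1:4]   # top 4 among values above the threshold
names(top)                                 # "abcc" "aabc" "aaab" "abbb"
```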

R: how to map a set of integers to another set of integers

I have a data set where each individual has a unique person ID. I'm interested in turning these ID numbers to another set of more manageable type integer IDs.
ID <- c(59970013552, 51730213552, 1233923, 2949394, 9999999999)
Essentially, I'd like to map these IDs to a new_ID, where
> new_ID
[1] 1 2 3 4 5
The reason I'm doing this is that my analysis requires as.integer(ID), and R will coerce large integers into NA. I have tried using as.integer64 from the bit64 package, but the class integer64 is not compatible with my analysis.
I've also thought to just do ID - min(ID) + 1 to get around having huge ID numbers. But this also doesn't work, because some of my larger IDs are so large that even if I subtract the min(ID) value, as.integer(ID) will still coerce them to NA.
This should be a duplicate, but I couldn't find a relevant answer, hence posting one.
We can use match
match(ID, unique(ID))
#[1] 1 2 3 4 5
Or convert the ID into a factor with levels in order of first appearance:
as.integer(factor(ID, levels = unique(ID)))
#[1] 1 2 3 4 5
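The NA coercion that motivates the question is easy to see directly; a quick check (values above .Machine$integer.max become NA):

```r
ID <- c(59970013552, 51730213552, 1233923, 2949394, 9999999999)

suppressWarnings(as.integer(ID))   # NA NA 1233923 2949394 NA
match(ID, unique(ID))              # 1 2 3 4 5
```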

Get ordered kmeans cluster labels

Say I have a data set x and do the following kmeans cluster:
fit <- kmeans(x,2)
My question is in regards to the output of fit$cluster: I know that it will give me a vector of integers (from 1:k) indicating the cluster to which each point is allocated. Instead, is there a way to have the clusters be labeled 1,2, etc... in order of decreasing numerical value of their center?
For example: If x=c(1.5,1.4,1.45,.2,.3,.3) , then fit$cluster should result in (1,1,1,2,2,2) but not result in (2,2,2,1,1,1)
Similarly, if x=c(1.5,.2,1.45,1.4,.3,.3) then fit$cluster should return (1,2,1,1,2,2), instead of (2,1,2,2,1,1)
Right now, fit$cluster seems to label the cluster numbers randomly. I've looked into documentation but haven't been able to find anything. Please let me know if you can help!
I had a similar problem. I had a vector of ages that I wanted to separate into 5 factor groups based on a logical ordinal set. I did the following:
I ran the k-means function:
k5 <- kmeans(all_data$age, centers = 5, nstart = 25)
I built a data frame of the k-means indexes and centres; then arranged it by centre value.
library(dplyr)  # needed for data_frame, %>%, arrange and mutate below

kmeans_index <- as.numeric(rownames(k5$centers))
k_means_centres <- as.numeric(k5$centers)
k_means_df <- data_frame(index = kmeans_index, centres = k_means_centres)
k_means_df <- k_means_df %>%
  arrange(centres)
Now that the centres are in the df in ascending order, I created my 5 element factor list and bound it to the data frame:
factors <- c("very_young", "young", "middle_age", "old", "very_old")
k_means_df <- cbind(k_means_df, factors)
Looks like this:
> k_means_df
index centres factors
1 2 23.33770 very_young
2 5 39.15239 young
3 1 55.31727 middle_age
4 4 67.49422 old
5 3 79.38353 very_old
I saved my cluster values in a data frame and created a dummy factor column:
cluster_vals <- data_frame(cluster=k5$cluster, factor=NA)
Finally, I iterated through the factor options in k_means_df and replaced the cluster value with my factor/character value within the cluster_vals data frame:
for (i in 1:nrow(k_means_df)) {
  index_val <- k_means_df$index[i]
  factor_val <- as.character(k_means_df$factors[i])
  cluster_vals <- cluster_vals %>%
    mutate(factor = replace(factor, cluster == index_val, factor_val))
}
Voila; I now have a vector of factors/characters that were applied based on their ordinal logic to the randomly created cluster vector.
# A tibble: 3,163 x 2
cluster factor
<int> <chr>
1 4 old
2 2 very_young
3 2 very_young
4 2 very_young
5 3 very_old
6 3 very_old
7 4 old
8 4 old
9 2 very_young
10 5 young
# ... with 3,153 more rows
Hope this helps.
K-means is a randomized algorithm. It is expected behavior that the labels are not consistent across runs, nor ordered in "ascending" order.
But you can of course remap the labels as you like, you know...
You seem to be using 1-dimensional data. Then k-means is actually not the best choice for you.
In contrast to 2- and higher-dimensional data, 1-dimensional data can efficiently be sorted. If your data is 1-dimensional, use an algorithm that exploits this for efficiency. There are much better algorithms for 1-dimensional data than for multivariate data.
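If you do want labels ordered by decreasing centre value after the fact, one way is to remap them with rank; a sketch for the OP's 1-dimensional example (ranks and new_cluster are names I've made up):

```r
set.seed(1)
x <- c(1.5, 1.4, 1.45, 0.2, 0.3, 0.3)
fit <- kmeans(x, 2)

ranks <- rank(-fit$centers)                    # 1 = cluster with the largest centre
new_cluster <- as.integer(ranks[fit$cluster])  # relabel each point accordingly
new_cluster                                    # 1 1 1 2 2 2, whatever kmeans' labels were
```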

Add a row above row headers in R

I want to add an extra row above what is row 1 of the following dataframe (i.e. above the labels a, b and Percent):
a<-c(1:5)
b<-c(4,3,2,1,1)
Percent<-c(40,30,20,10,10)
df1<-data.frame(a,b,Percent)
These dataframes represent questions in an interview analysis I am doing, and I want to include the question descriptor above the row headers so I can easily identify which dataframe belongs to which question (i.e. "Age"). I have been using rbind to add rows, but is it possible to use this command above the row headers?
Thanks.
If it is just meta-data, you can add it as an attribute to the data.frame.
> attr(df1, "Question") <- "Age"
> attributes(df1)
$names
[1] "a" "b" "Percent"
$row.names
[1] 1 2 3 4 5
$class
[1] "data.frame"
$Question
[1] "Age"
If you want the question to be printed above the data.frame,
you can define a Question class, that extends data.frame,
and override the print method.
class(df1) <- c( "Question", class(df1) )
print.Question <- function( x, ... ) {
if( ! is.null( attr(x, "Question") ) ) {
cat("Question:", attr(x, "Question"), "\n")
}
print.data.frame(x)
}
df1
But that looks overkill: it may be simpler to just add a column.
> df1$Question <- "Age"
> df1
a b Percent Question
1 1 4 40 Age
2 2 3 30 Age
3 3 2 20 Age
4 4 1 10 Age
5 5 1 10 Age
I wish this were part of core R, but I hacked up a solution with Jason Bryer's likert package, using attributes to store column names and having the likert function read these attributes and use them when plotting. It only works with that function though - there is a function in Hmisc called label, but again none of the functions care about this (including the functions that show data frames etc.).
Here's a writeup of my hack http://reganmian.net/blog/2013/10/02/likert-graphs-in-r-embedding-metadata-for-easier-plotting/, with a link to the code.
rbind is really the only way to go, but note that every column would then be coerced to character. For example:
cols <- c("Age", "Age", "Age")
df1 <- rbind(cols,df1)
str(df1)
Definitely agree with Vincent on this one. I do this quite frequently with survey data; if it's all in one data.frame, I generally set a comment attribute on each element of the data.frame(). It's also useful when you perform multiple operations and want to maintain reasonable colnames(df1). It's not good practice, but if this is for presentation you can always set check.names=F when you create your data.frame().
a<-c(1:5)
b<-c(4,3,2,1,1)
Percent<-c(40,30,20,10,10)
df1<-data.frame(a,b,Percent)
comment(df1$a) <- "Q1a. This is a likert scale"
comment(df1$b) <- "Q1b. This is another likert scale"
comment(df1$Percent) <- "QPercent. This is some other question"
Then, if I "forget" what's in the columns, I can take a quick peak:
sapply(df1, comment)

count of entries in data frame in R

I'm looking to get a count for the following data frame:
> Santa
Believe Age Gender Presents Behaviour
1 FALSE 9 male 25 naughty
2 TRUE 5 male 20 nice
3 TRUE 4 female 30 nice
4 TRUE 4 male 34 naughty
of the number of children who believe. What command would I use to get this?
(The actual data frame is much bigger. I've just given you the first four rows...)
Thanks!
You could use table:
R> x <- read.table(textConnection('
Believe Age Gender Presents Behaviour
1 FALSE 9 male 25 naughty
2 TRUE 5 male 20 nice
3 TRUE 4 female 30 nice
4 TRUE 4 male 34 naughty'
), header=TRUE)
R> table(x$Believe)
FALSE TRUE
1 3
I think of this as a two-step process:
1. subset the original data frame according to the filter supplied (Believe==FALSE); then
2. get the row count of this subset
For the first step, the subset function is a good way to do this (just an alternative to ordinary index or bracket notation).
For the second step, i would use dim or nrow
One advantage of using subset: you don't have to parse the result it returns to get the result you need--just call nrow on it directly.
so in your case:
v = nrow(subset(Santa, Believe==FALSE)) # 'subset' returns a data.frame
or wrapped in an anonymous function:
fnx <- function(fac, lev) { nrow(subset(Santa, fac == lev)) }
fnx(Santa$Believe, TRUE)
[1] 3
Aside from nrow, dim will also do the job. This function returns the dimensions of a data frame (rows, cols) so you just need to supply the appropriate index to access the number of rows:
v = dim(subset(Santa, Believe==FALSE))[1]
An answer to the OP posted before this one shows the use of a contingency table. I don't like that approach for the general problem as stated in the OP. Granted, the general question how many rows in this data frame have value x in column C? can be answered using a contingency table as well as using a "filtering" scheme (as in my answer here). If you want row counts for all values of a given factor column, then a contingency table (calling table and passing in the column(s) of interest) is the most sensible solution; however, the OP asks for the count of one particular value, not counts across all values. There may also be a performance hit (it might be big or trivial, depending on the size of the data frame and the processing pipeline context in which this function resides). And once the result from the call to table is returned, you still have to parse out just the count that you want.
So that's why, to me, this is a filtering rather than a cross-tab problem.
sum(Santa$Believe)
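This works because logical TRUE coerces to 1 and FALSE to 0 when summed; a minimal check with the question's first four rows:

```r
Believe <- c(FALSE, TRUE, TRUE, TRUE)
sum(Believe)   # 3, the number of TRUE values
```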
You can do summary(Santa$Believe) and you will get the count for TRUE and FALSE.
dplyr makes this really easy:
library(dplyr)
x <- Santa %>%
  count(Believe)
If you want to count by a group - for instance, how many males vs. females believe - just add a group_by:
x <- Santa %>%
  group_by(Gender) %>%
  count(Believe)
A one-line solution with data.table could be
library(data.table)
setDT(x)[,.N,by=Believe]
Believe N
1: FALSE 1
2: TRUE 3
using sqldf fits here:
library(sqldf)
sqldf("SELECT Believe, Count(1) as N FROM Santa
GROUP BY Believe")
