Discretization by only one column of the dataset using mdlp() - r

library(discretization)
data("CO2")
disc<- mdlp(CO2[4])
I just need to discretize the 4th column of the data set provided, but this gives the error "Error in data[1, ] : incorrect number of dimensions". Could you please help me fix this?

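I don't know if this is what you're going for, but 1) mdlp needs more than just one column of data (it treats the last column as the class variable), and 2) it also has trouble working with complex objects like CO2, which is a groupedData object rather than a plain data frame. Here is one way to make it execute: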
CO2.df <- as.data.frame(CO2) # strips the extra info
mdlp(CO2.df[,4:5])

Related

Converting chr to numeric and still not able to take mean

I am working with a dataframe from NYC opendata. On the information page it claims that a column, ACRES, is numeric, but when I download it is chr. I've tried the following:
parks$ACRES <- as.numeric(as.character(parks$ACRES))
which turned the column info type into dbl, but I was unable to take the mean, so I tried:
parks$ACRES <- as.integer(as.numeric(parks$ACRES))
I've also tried sapply() and I get the message "NAs introduced by coercion". I tried convert() too, but R didn't recognize it, though it is supposed to be part of dplyr.
Either way I get NA as a result for the mean.
I've tried taking the mean a few different ways:
mean(parks[["ACRES"]])
mean(parks$ACRES)
Neither of these worked. Is it the dataframe? I'm wondering whether, since it is from the government, there are limits.
I'd appreciate any help.
You have NAs in your data. Either they were there before you converted or some of the data can't be converted to numeric directly (do you have comma separators for the 1000s in your input? Those need to be removed before converting to numeric).
Identifying why you have NAs and fixing if necessary is the first step you'll need to do. If the NAs are valid then what you want to do is to add the na.rm = TRUE parameter to the mean function which ignores NAs while calculating the mean.
Check to see how ACRES is being loaded in (i.e., what data type is it?). If it's being loaded in as a factor, calling as.numeric() on it directly returns the internal level codes rather than the original values. The way to avoid this is to use the 'stringsAsFactors = FALSE' argument in your read.csv or whatever function you're using to read in the data.
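To make the answers above concrete, here is a minimal sketch in base R. The acres_raw vector is a made-up stand-in for the ACRES column (the real values are not shown in the question), so the specific strings are assumptions.

```r
# Hypothetical stand-in for parks$ACRES as read from the CSV (values are invented)
acres_raw <- c("1.5", "1,200.25", "", "0.75", "n/a")

# Strip thousands separators, then convert; unparseable entries become NA
acres_num <- suppressWarnings(as.numeric(gsub(",", "", acres_raw, fixed = TRUE)))

# Inspect which original entries could not be converted
failed <- acres_raw[is.na(acres_num)]

# mean() returns NA if any NA is present, unless na.rm = TRUE is given
mean_with_na <- mean(acres_num)
mean_without_na <- mean(acres_num, na.rm = TRUE)
```

Looking at failed first tells you whether the NAs are genuine missing values or a formatting problem that should be fixed before averaging.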

Column of original data is returning length as 0 in R

I am working in RStudio and trying to create a table. The error I keep getting is "Error in table(players, fitmod1$classification) : all arguments must have the same length". When I check the length of my data, fitmod1$classification is returning a value, but players is returning 0. I have no idea how to fix this.
players is a qualitative column of the Hitters data in the R package ISLR. fitmod1 is an mclust model. I am attaching my code below so hopefully that helps! Thanks
Your issue is that the players are the row names and not an actual column of the data. So when you subset the Hitters data frame with:
players <- Hitters[,0]
you end up with a data frame that has no columns (though the rows are still named, which is what you are seeing when you view it in RStudio).
Instead you want to get the row names and store them as a vector:
players <- row.names(Hitters)
You will now be able to generate a table.
Here is all of the code. (By the way, it is much easier for us as a community to answer your questions if you use the code feature in Stack Overflow rather than attaching a png. This way we can copy and paste your code rather than having to type it by hand.)
library(ISLR)
library(mclust)
data(Hitters)
Hitters=Hitters[,c(1:7)]
Hitters<-na.omit(Hitters)
players <- row.names(Hitters)
fitmod1<-Mclust(Hitters, G=3, modelNames=c("VEE"))
table(players, fitmod1$classification)
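The same behaviour can be reproduced without ISLR or mclust. The sketch below uses mtcars, which ships with base R and also keeps its identifiers in the row names. It is only an illustration of the row-name point, not the original poster's data.

```r
# mtcars, like Hitters, stores its identifiers as row names, not as a column
ids_wrong <- mtcars[, 0]          # data frame with 32 rows and 0 columns
n_wrong <- length(ids_wrong)      # length() of a data frame counts columns

ids <- row.names(mtcars)          # character vector of the row names
n_ids <- length(ids)

# With a proper vector, table() has two arguments of equal length
tab <- table(ids, mtcars$cyl)
```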

After removing some rows in this data frame all information turns NA, what's wrong?

I'm working on a replication of the study for this particular data that you could find in this link, the data is named AProrok_AJPS.tab, please click on Download and then you can choose the RData format.
I want to remove all the rows whose value in a specific column is 1, so with this code:
df <- data[data$unknownleader!=1,]
After that, however, all the data becomes NA; it becomes all blank, basically. I tried to change the type of the data between integer, factor, etc., but all resulted in the same problem. I am not sure what it is about this data file that causes this problem. Could anyone please investigate and show me a possible way to fix it?
Ok so thanks to @PaulHiemstra for pointing out that the problem arose from the NA in the dataset. Then, based on this thread, I could come up with a solution:
First replace all the NA values in the unknownleader column of the original data with 0:
data$unknownleader <- replace(data$unknownleader, is.na(data$unknownleader), 0)
Then remove the rows as described in the question:
df <- data[data$unknownleader==0, ]
Note that since the unknownleader variable happens to be binary, it still makes sense to replace NA with 0. For other datasets some appropriate adjustments might be needed.
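For anyone wondering why the rows turn into NA in the first place: a logical index that contains NA selects an all-NA row. Here is a small sketch on an invented five-row data frame (the column names mirror the question, the values are made up), together with two alternatives that do not require overwriting the NA values.

```r
# Toy data; the values are invented for illustration
d <- data.frame(id = 1:5, unknownleader = c(0, 1, NA, 0, 1))

# NA != 1 evaluates to NA, and indexing rows with NA yields an all-NA row
bad <- d[d$unknownleader != 1, ]

# which() drops the NA positions, so only rows that are known not to be 1 remain
good <- d[which(d$unknownleader != 1), ]

# %in% never returns NA, so this keeps the rows where unknownleader is missing
keep_na <- d[!(d$unknownleader %in% 1), ]
```

Whether rows with a missing unknownleader should be dropped or kept depends on the study being replicated; the two alternatives just make that choice explicit.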

Strangeness with filtering in R and showing summary of filtered data

I have a data frame loaded using the CSV Library in R, like
mySheet <- read.csv("Table.csv", sep=";")
I now can print a summary on that mySheet object
summary(mySheet)
and it will show me a summary for each column. For example, one column named Diagnose has the unique values RCM, UCM, HCM and it shows the number of occurrences of each of these values.
I now filter by a diagnose, like
subSheet <- mySheet[mySheet$Diagnose=='UCM',]
which seems to be working, when I just type subSheet in the console it will print only the rows where the value has been matched with 'UCM'
However, if I do a summary on that subSheet, like
summary(subSheet)
it still 'knows' about the other two possibilities RCM and HCM and prints those having a value of 0. However, I expected that the new created object will NOT know about the possible values of the original mySheet I initially loaded.
Is there any way to get rid of those other possible values after filtering? I also tried subset but this one just seems to be some kind of shortcut to '[' for the interactive mode... I also tried DROP=TRUE as option, but this one didn't change the game.
Totally mind squeezing :D Any help is highly appreciated!
What you are dealing with here are factors created when reading the csv file. You can get subSheet to forget the unused factor levels with
subSheet$Diagnose <- droplevels(subSheet$Diagnose)
or
subSheet$Diagnose <- subSheet$Diagnose[ , drop=TRUE]
just before you do summary(subSheet).
Personally I dislike factors, as they cause me too many problems, and I only convert strings to factors when I really need to. So I would have started with something like
mySheet <- read.csv("Table.csv", sep=";", stringsAsFactors=FALSE)
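Here is a small self-contained sketch of the same thing, assuming a made-up four-row version of the sheet (the real Table.csv is not available, so the values are invented). It also uses droplevels() on the whole data frame, which drops unused levels from every factor column at once.

```r
# Invented stand-in for the data read from Table.csv
mySheet <- data.frame(
  Diagnose = factor(c("RCM", "UCM", "HCM", "UCM")),
  value    = 1:4
)

subSheet <- mySheet[mySheet$Diagnose == "UCM", ]
levels_before <- levels(subSheet$Diagnose)   # the filtered factor still carries all levels

# droplevels() on a data frame drops unused levels in all factor columns
subSheet <- droplevels(subSheet)
levels_after <- levels(subSheet$Diagnose)
```

After this, summary(subSheet) only reports the levels that actually occur in the filtered rows.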

Limiting Window Size and/or Removing Specific Rows of Time Values In R

I'm trying to figure out how to observe just one particular section of the data in the graph below (e.g. 5pm onwards). I know there are basically two methods of doing this:
1) Method 1: Limiting the window size, which requires the following function:
symbols(Data$Times, Data$y, circles=Data$z, xlim=c("5:00pm","10:00pm"))
The problem is, I get an "invalid 'xlim' value" error when I try to input the two time endpoints.
2) Method 2: Clearing out the rows in Data$Times that have values over 5pm.
The problem here is that I'm not sure how to sort the rows by earliest time -> latest time OR how to define a new variable such that TimesPM <- Data$Times>"5pm" (what I typed just now obviously did not work.)
Any ideas? Thanks in advance.
ETA: This is what I plotted:
Times<-strptime(DATA$Time,format="%I:%M%p")
symbols(Times, y, circles=z, xaxt='n', inches=.4, fg="3", bg=(a), xlab="Times", ylab="y")
axis.POSIXct(1, at=Times, format="%I:%M%p")
Both approaches have the problem that, in all likelihood, your date-time values will not compare meaningfully with a character string like "5:00pm", even after the coercion performed by the ">" comparison operator. To get the best advice you need to present str(DATA$Times), dput(head(DATA$Times)), or class(DATA$Times). Generally, plotting functions recognize either valid date or date-time classes or their numeric representation. If the ordering operation is not working, that raises the question of whether you have a proper class. But your axis labeling suggests a date-time format of some sort, so we just need to figure out what class it really is.
Because you are creating Times from the character values in your Time column, you can apply the restriction right after parsing with strptime() (as.POSIXlt() cannot parse strings like "5:00pm" without a format, so parse first and subset by hour afterwards). You still have not offered the requested clarifications, so I have no way to give tested or even very specific code, but you might be doing something like
Times <- strptime(DATA$Time, format="%I:%M%p")
# strptime() returns POSIXlt, which has an $hour field (0-23);
# keep everything from 5pm through the 10pm hour
Times <- Times[Times$hour >= 17 & Times$hour <= 22]
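Since the actual data was never posted, here is a sketch on an invented DATA frame covering both methods from the question. The values, the column layout, and the use of tz = "UTC" are all assumptions. Parsing "%p" also depends on the locale, so "pm"/"am" may not parse outside an English or C locale.

```r
# Invented example data; Time is a character column like in the question
DATA <- data.frame(
  Time = c("3:15pm", "5:00pm", "7:30pm", "11:05am", "10:45pm"),
  y    = c(2, 4, 1, 5, 3),
  z    = c(1, 2, 1, 3, 2)
)

# Parse once; tz = "UTC" is an assumption made to avoid daylight-saving surprises
Times <- strptime(DATA$Time, format = "%I:%M%p", tz = "UTC")

# Method 2: keep only rows at or after 5pm, using the POSIXlt hour field
keep <- Times$hour >= 17
evening <- DATA[keep, ]

# Method 1: build numeric axis limits from parsed endpoints instead of raw strings
xlim <- as.numeric(as.POSIXct(
  strptime(c("5:00pm", "10:00pm"), format = "%I:%M%p", tz = "UTC")
))
```

The numeric xlim should then be usable in a call such as symbols(as.numeric(as.POSIXct(Times)), DATA$y, circles = DATA$z, xlim = xlim), since both the x values and the limits are seconds since the epoch.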
