When I try to order a large number data set
using
test <- StatePop[with(StatePop, order(StatePop$CENSUS_2010_POP, StatePop$state.name)), ]
It gives me :
I figured this one.
Commas in the large numbers was a problem.
I used gsub to remove commas and tried to order again.
it worked!
Related
I have a dataset that has 210 columns and 150K rows. Some of the columns in my data set are suppose to be integers but when I import the dataset into R those columns appear as double. I am not sure what the issue is ? I have used read_csv function and specified that col_names = TRUE. Am I missing something? Please guide me.
when I try to correct this using ceiling or floor, some of the values change by +/-1 and others stay the same which is not ideal.
because there are a lot of columns, is difficult to set the class/type from each one. Did you tried
round(...)
or maybe you can change the columns that are double with this:
for (col in colnames(data)) {
if (is.double(data[,col])) {
class(data[,col]) <- "integer"
}
}
I loaded a data big set with read_delim(), since there I have the possibility to skip the first 4 rows of the data set which is not important for me. The data set is separated by ";". My Problem is the following:
I have some numbers like
-0,000364929204806685
0,00367021351121366
-0,0184237491339445
as you can see this numbers are seperated by commas. Therefore if i change the type of it to "numeric", during the loading phase, i get a formatting error like -3.649292e+14 for the first number. Thus i have to load the data as characters.
But now I am not able to do numeric calculations. as.numeric() doesen't work.
Is there any possibility to change this characters to numeric?
Thanks
Matthias
Thanks everybody for help, it can be solved by using gsub(). In the upper example:
as.numeric(gsub(",", ".", Dat[1,12]))
provides:
-0.0003649292
I have the set of data below. It has a few rows of unwanted characters before the numbers I want to read in, as well as a few unwanted rows after the data. I created a substring that will serve as my first column, which is purely numerical. There is data, when the set is read in, above and below these numericals that were converted to NA. Is there a way, other than skip and nrow, that I can remove the NA rows and read in only those rows that are numerical?
x<-read.csv("..."),
header=FALSE, na.strings="Y")
y<-substr(x$V1,1,8)
y<-as.numeric(y)
x2<-cbind(y,x1)
x2<-as.data.frame(x2)
I have tried:
if (x$y == is.numeric) {
print(x)
} else {
print("")}
But that is clearly wrong as all I get are errors. I have been trying different combinations of the above code, as well as:
x3<-sapply(x$y,is.numeric)
x[x3,]
But nothing I try is working.. I am either completely off or am missing something.
UPDATE: I was able to do this with both methods that were answered below.. but the problem now is, since the rows above the numeric rows contained characters, my columns are factors rather than numeric. Rather than actually deleting the rows, we were just temporarily removing them. Is there a way to permanently remove them so that my columns will be class numeric?
If this is just the case of remove rows containing NAs, have you tried using complete.cases? Perhaps something like:
x2[complete.cases(x2),]
Also if would be great if you could provide a minimal reproducible sample.
I imported a set of data into RStudio containing 85 variables and 139 observations. All values are integers except for the last column which is blank and for some reason was imported alongside everything else in the .csv file I created from a .xls file.
As such, this last column is all NA values. The problem is that when I try to run any kind of analysis it seems to be reading that all values are NA values. Despite this, in the data window in RStudio everything seems to be fine. Are there solutions to this problem that don't involve the data? Is it almost certainly the data that's the problem?
It seems strange that when opening the file anywhere else and even viewing it in R
The most likely issue is that the file is being imported as all text rather than as numeric data. If all of the data is numeric you can just use colClasses="numeric" as an argument to the read.csv() function and that should import correctly. You could also change the data class once it is in R, or give colClasses a vector of different classes if you have a variety of different data types (logical, character, numeric etc.) in your file.
Edit
Seeing as colClasses is not working (it is hard to say why without looking at your data), you can try this:
MyDF<-data.frame(sapply(MyDF,FUN=as.numeric))
Where MyDF is your datafraome. That will change all of your columns to numeric. If you have some charcter/factor/logical values in there this may not work as expected. You might want to check your excel file/csv to see why it is importing a NA column. It could be that there is a cell with a space in it that is being pulled in and this is throwing things off. You could always try deleting that empty column and retrying your import.
If you want to omit your last column while reading the data itself, you can try the following code. In this example, I am assuming that your file has 5 columns and the 5th column has NA values. So, you want to skip reading 5th column in your data set.
data <- read.csv (fileName, ....) [,1:4]
or, if you want to use column names, you can use:
data <- read.csv (fileName, ....) [,c('col1','col2','col3','col4')]
This will read all the observations from selected columns within your data set.
Hope this helps.
If you are trying too find the mean and standard deviation you can use
Data<- mean( dataframe$colname , na.rm = TRUE)
Data1<- sd( dataframe$colname , na.rm = TRUE)
This will give u the answer after omitting the na values from the column
I have a data frame loaded using the CSV Library in R, like
mySheet <- read.csv("Table.csv", sep=";")
I now can print a summary on that mySheet object
summary(mySheet)
and it will show me a summary for each column, for example, one column named Diagnose has the unique values RCM, UCM, HCM and it shows the number of occurences of each of these values.
I now filter by a diagnose, like
subSheet <- mySheet[mySheet$Diagnose=='UCM',]
which seems to be working, when I just type subSheet in the console it will print only the rows where the value has been matched with 'UCM'
However, if I do a summary on that subSheet, like
summary(subSheet)
it still 'knows' about the other two possibilities RCM and HCM and prints those having a value of 0. However, I expected that the new created object will NOT know about the possible values of the original mySheet I initially loaded.
Is there any way to get rid of those other possible values after filtering? I also tried subset but this one just seems to be some kind of shortcut to '[' for the interactive mode... I also tried DROP=TRUE as option, but this one didn't change the game.
Totally mind squeezing :D Any help is highly appreciated!
What you are dealing with here are factors from reading the csv file. You can get subSheet to forget the missing factors with
subSheet$Diagnose <- droplevels(subSheet$Diagnose)
or
subSheet$Diagnose <- subSheet$Diagnose[ , drop=TRUE]
just before you do summary(subSheet).
Personally I dislike factors, as they cause me too many problems, and I only convert strings to factors when I really need to. So I would have started with something like
mySheet <- read.csv("Table.csv", sep=";", stringsAsFactors=FALSE)