I have a dataset with 210 columns and 150K rows. Some of the columns are supposed to be integers, but when I import the dataset into R those columns appear as doubles. I am not sure what the issue is. I used the read_csv() function and specified col_names = TRUE. Am I missing something? Please guide me.
When I try to correct this using ceiling() or floor(), some of the values change by ±1 and others stay the same, which is not ideal.
Because there are a lot of columns, it is difficult to set the class/type for each one. Have you tried
round(...)
Or you can convert the columns that are doubles with something like this:
for (col in colnames(data)) {
  if (is.double(data[[col]])) {
    # round() first so near-integer noise is not truncated off by one,
    # then coerce; use [[ ]] so this also works on tibbles from read_csv()
    data[[col]] <- as.integer(round(data[[col]]))
  }
}
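As a self-contained sketch of that coercion on a toy data frame (the column names and values here are invented for illustration):

```r
# Toy data frame whose whole-number values were imported as doubles
data <- data.frame(id = c(1.0, 2.0, 3.0), score = c(10.0, 20.0, 30.0))

# Coerce every double column to integer; round() guards against
# tiny floating-point noise such as 2.0000000001
for (col in colnames(data)) {
  if (is.double(data[[col]])) {
    data[[col]] <- as.integer(round(data[[col]]))
  }
}

str(data)  # both columns are now integer
```

With readr you could instead declare the types up front via the col_types argument of read_csv(), which avoids the post-hoc loop entirely.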
I am writing this post to ask for advice on looping code to rename columns by index.
I have a data set that has scale item columns positioned next to each other. Unfortunately, they are oddly named.
I want to rename each column in this format: SimRac1, SimRac2, SimRac3, and so on. I know the location of the columns (columns 30 to 37), and I know these scale items are ordered in such a way that they can be named and numbered in increasing order from left to right.
The code I currently have works, but it is not efficient. There are other scales, in different locations, that also need to be renamed in a similar fashion; this would result in dozens of lines of code.
See below code.
names(Total)[30] <- "SimRac1"
names(Total)[31] <- "SimRac2"
names(Total)[32] <- "SimRac3"
names(Total)[33] <- "SimRac4"
names(Total)[34] <- "SimRac5"
names(Total)[35] <- "SimRac6"
names(Total)[36] <- "SimRac7"
names(Total)[37] <- "SimRac8"
I want to loop this code so that I only have a chunk of code that does the work.
I was thinking perhaps a "for loop" would help.
Hence, the below code
for (i in Total[,30:37]){
names(Total)[i] <- "SimRac(1:8)"
}
This, unfortunately, does not work. The chunk runs without error, but it doesn't do anything.
Please advise.
In the OP's code, "SimRac(1:8)" is a literal string constant, so every column would get the same name. To build dynamic names, use paste0.
We do not need a loop here. We can use a vectorized function to create the names, then assign them to a subset of names(Total):
names(Total)[30:37] <- paste0('SimRac', 1:8)
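A minimal sketch of this on a toy data frame (eight columns standing in for columns 30:37; the original odd names are invented here):

```r
# Toy data frame with 8 oddly named scale-item columns
Total <- as.data.frame(matrix(1:16, nrow = 2))
names(Total) <- paste0("odd_name_", 1:8)

# Vectorized rename: paste0() builds "SimRac1" .. "SimRac8" in one call,
# and the whole block of names is assigned at once -- no loop needed
names(Total)[1:8] <- paste0("SimRac", 1:8)

names(Total)  # "SimRac1" "SimRac2" ... "SimRac8"
```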
I have a large dataset looking like:
There are overall 43 different values for PID. I have identified PIDs that need to be removed and summarized them in a vector:
I want to remove all observations (rows) from my data set that contain one of the PIDs from the vector NullNK. I have tried writing a function for it, but I get an error (I have never written functions before):
for (i in length(NullNK)){
SR_DynUeber_einfam <- SR_DynUeber_einfam [-which(SR_DynUeber_einfam$PID == NullNK(i)),]
}
How can I efficiently remove the observations from my original data set that contain PIDs from the NullNK vector?
What is wrong with my function?
Thanks!
For basic operations like this, for loops are often not needed. This does what you are looking for:
SR_DynUeber_einfam[!SR_DynUeber_einfam$PID %in% NullNK,]
One mistake in your function is NullNK(i): in R you subset a vector with NullNK[i]; parentheses mean a function call. Another is that for (i in length(NullNK)) iterates only once, over the single value length(NullNK); you would want for (i in seq_along(NullNK)).
Hope this helps!
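Here is the %in% approach on a toy version of the data (PIDs and values are invented; the real data has 43 distinct PIDs):

```r
# Toy data: 6 observations across 4 PIDs
SR_DynUeber_einfam <- data.frame(PID = c(1, 2, 2, 3, 4, 4), x = 1:6)
NullNK <- c(2, 4)  # PIDs to drop

# Keep only the rows whose PID is NOT in NullNK
SR_clean <- SR_DynUeber_einfam[!SR_DynUeber_einfam$PID %in% NullNK, ]

SR_clean$PID  # 1 3
```

Note that %in% also sidesteps a subtle bug in the -which() pattern: if no rows match, which() returns a zero-length vector and negative indexing drops everything.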
When I try to order a data set containing large numbers
using
test <- StatePop[with(StatePop, order(CENSUS_2010_POP, state.name)), ]
It gives me an error.
I figured this one out. Commas in the large numbers were the problem. I used gsub() to remove the commas, converted the column to numeric, and tried to order again. It worked!
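A sketch of that fix on toy data (state names and population figures are invented):

```r
# Toy population column read in as text because of thousands separators
StatePop <- data.frame(state.name      = c("B", "A", "C"),
                       CENSUS_2010_POP = c("1,200", "300", "25,000"),
                       stringsAsFactors = FALSE)

# Strip the commas, convert to numeric, then order by population
StatePop$CENSUS_2010_POP <- as.numeric(gsub(",", "", StatePop$CENSUS_2010_POP))
test <- StatePop[order(StatePop$CENSUS_2010_POP, StatePop$state.name), ]

test$CENSUS_2010_POP  # 300 1200 25000
```

Without the gsub() step, "1,200" sorts as a character string, so the ordering is lexicographic rather than numeric.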
I have the set of data below. It has a few rows of unwanted characters before the numbers I want to read in, as well as a few unwanted rows after the data. I created a substring that will serve as my first column, which is purely numerical. When the set is read in, the rows above and below these numeric rows are converted to NA. Is there a way, other than skip and nrows, to remove the NA rows and read in only the rows that are numeric?
x <- read.csv("...", header = FALSE, na.strings = "Y")
y <- substr(x$V1, 1, 8)
y <- as.numeric(y)
x2 <- cbind(y, x1)
x2 <- as.data.frame(x2)
I have tried:
if (x$y == is.numeric) {
print(x)
} else {
print("")}
But that is clearly wrong as all I get are errors. I have been trying different combinations of the above code, as well as:
x3<-sapply(x$y,is.numeric)
x[x3,]
But nothing I try is working. I am either completely off or am missing something.
UPDATE: I was able to do this with both methods answered below, but the problem now is that, since the rows above the numeric rows contained characters, my columns are factors rather than numeric. Rather than actually deleting the rows, we were just temporarily removing them. Is there a way to permanently remove them so that my columns will be class numeric?
If this is just a case of removing rows containing NAs, have you tried using complete.cases? Perhaps something like:
x2[complete.cases(x2),]
Also, it would be great if you could provide a minimal reproducible sample.
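A sketch of complete.cases on toy data (the columns here stand in for the substring column and the rest of the file):

```r
# Toy data: the two leading header-junk rows became NA after as.numeric()
x2 <- data.frame(y = c(NA, NA, 1, 2, 3),
                 v = c(NA, NA, 10, 20, 30))

# complete.cases() is TRUE only for rows with no NA in any column
x2_clean <- x2[complete.cases(x2), ]

nrow(x2_clean)  # 3
```

For the factor issue in the UPDATE, assigning the filtered result back (x2 <- x2[complete.cases(x2), ]) removes the rows permanently; the columns can then be converted with as.numeric(as.character(...)) if they were read as factors.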
I imported a set of data into RStudio containing 85 variables and 139 observations. All values are integers except for the last column which is blank and for some reason was imported alongside everything else in the .csv file I created from a .xls file.
As such, this last column is all NA values. The problem is that when I try to run any kind of analysis it seems to be reading that all values are NA values. Despite this, in the data window in RStudio everything seems to be fine. Are there solutions to this problem that don't involve the data? Is it almost certainly the data that's the problem?
It seems strange, since the file looks fine when opened anywhere else and even when viewed in R.
The most likely issue is that the file is being imported as text rather than as numeric data. If all of the data is numeric, you can pass colClasses = "numeric" as an argument to read.csv() and it should import correctly. You could also change the data class once it is in R, or give colClasses a vector of classes if your file mixes data types (logical, character, numeric, etc.).
Edit
Seeing as colClasses is not working (it is hard to say why without looking at your data), you can try this:
MyDF <- data.frame(sapply(MyDF, FUN = as.numeric))
where MyDF is your data frame. That will change all of your columns to numeric. If you have some character/factor/logical values in there, this may not work as expected. You might want to check your Excel/CSV file to see why it is importing an NA column. It could be that a cell contains a space that is being pulled in and throwing things off. You could always try deleting that empty column and retrying your import.
If you want to omit the last column while reading the data itself, you can try the following code. In this example, I am assuming that your file has 5 columns and the 5th column has NA values, so you want to skip reading the 5th column.
data <- read.csv(fileName, ....)[, 1:4]
or, if you want to use column names, you can use:
data <- read.csv(fileName, ....)[, c('col1', 'col2', 'col3', 'col4')]
This will read all the observations from the selected columns of your data set.
Hope this helps.
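A runnable sketch of the column-subset idea, using an inline CSV in place of the file (column names and values are invented; the 5th column is empty, as in the question):

```r
# Inline CSV standing in for the file; col5 is empty and reads as all NA
csv_text <- "col1,col2,col3,col4,col5
1,2,3,4,
5,6,7,8,"

# Subsetting the result of read.csv() drops the trailing NA column
# at read time, so it never enters the analysis
data <- read.csv(text = csv_text)[, 1:4]

ncol(data)  # 4
```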
If you are trying to find the mean and standard deviation, you can use
Data <- mean(dataframe$colname, na.rm = TRUE)
Data1 <- sd(dataframe$colname, na.rm = TRUE)
This will give you the answer after omitting the NA values from the column.
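For example, on a toy column containing an NA (names and values invented):

```r
# Toy column with one missing value
dataframe <- data.frame(colname = c(2, 4, 6, NA))

# na.rm = TRUE drops the NA before computing; without it both
# mean() and sd() would return NA
Data  <- mean(dataframe$colname, na.rm = TRUE)  # 4
Data1 <- sd(dataframe$colname, na.rm = TRUE)    # 2
```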