Removing non-numeric values from data in R from Excel

All other questions and answers I have seen on this have not been useful.
I have imported data from an Excel sheet into R and I want to remove all of the non-numeric data from a particular column so I can perform calculations. Ideally I would like to be able to create a function that does this.
Assuming the data frame is called price and the column I want is Q1:
I have found answers to this question where I could do this by using
convert <- as.numeric(as.character(price$Q1))
ncolumn <- convert[!is.na(convert)]
However, when I try and create the function I want to have the inputs to be both the name of the data frame as well as the name of the column. I have tried using price[2] instead of price$Q1 in that first line I showed but it doesn't seem to work. I also tried extracting the name of the column and the name of the data frame and using the $ notation instead of [ ] and it still doesn't work.

Guessing from the question, would the below be what you are trying to do?
#function with inputs of both name of a data frame and name of a column
NumConv <- function(data, column){
convert<-as.numeric(as.character(data[, column]))
convert[!is.na(convert)]
}
#executing the function with your assumed example
NumConv(price, "Q1")
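The `$` attempts in the question probably failed because `data$column` looks for a column literally named "column" rather than using the value stored in the argument. A minimal sketch of the same idea using `[[ ]]`, which takes the column name as a string and also works if the import produced a tibble (for example via readxl). The `price` data below is made up for illustration, and `num_conv` is a hypothetical name:

```r
# Hypothetical stand-in for the sheet imported from Excel
price <- data.frame(Q1 = c("10", "n/a", "12.5", "-", "7"),
                    stringsAsFactors = FALSE)

num_conv <- function(data, column) {
  # [[ ]] returns the column as a plain vector for data frames and tibbles
  # suppressWarnings() hides the "NAs introduced by coercion" warning
  converted <- suppressWarnings(as.numeric(as.character(data[[column]])))
  # drop everything that could not be parsed as a number
  converted[!is.na(converted)]
}

result <- num_conv(price, "Q1")
print(result)
```

Note that this also drops values that were already missing in the sheet, since they become NA as well.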

Related

Data extraction in R - multiple columns

Hello, I have this type of table consisting of a single row and several columns. I tried some code to extract my KD_PL parameters, without success. Do you know a way in R to extract all the KD_PLs and store them in a vector or data frame?
I tried this:
KDPL <- select("KD_PL.", which(substr(colnames(max_LnData), start=1, stop=6)))
This should do the trick:
library(tidyverse)
KDPL <- max_LnData %>% select(starts_with("KD_PL."))
This function selects all columns from your old dataset starting with "KD_PL." and stores them in a new dataframe KDPL.
If you only want the names of the columns to be saved, you could use the following:
KDPL_names <- colnames(KDPL)
This saves the column names in the vector KDPL_names.
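If loading the tidyverse is not an option, the same selection can be sketched in base R with `startsWith()`. The one-row `max_LnData` below is invented for illustration; only the "KD_PL." prefix comes from the question:

```r
# Hypothetical one-row table with a mix of KD_PL and other columns
max_LnData <- data.frame(KD_PL.1 = 0.12, KD_PL.2 = 0.34,
                         Other = 5, KD_PL.3 = 0.56)

# logical vector marking the columns whose name starts with "KD_PL."
is_kdpl <- startsWith(names(max_LnData), "KD_PL.")

# single-bracket indexing keeps the result a data frame
KDPL <- max_LnData[is_kdpl]

# for a single row, unlist() gives a named numeric vector instead
KDPL_vec <- unlist(KDPL)
print(KDPL_vec)
```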

Trying to predict in R

I created a data set using a random row generator:
training_data <- fulldata[sample(nrow(fulldata),100),]
I am under the impression that I can create a second data set of the rest of the data ... rest_data <- fulldata[-training_data] is the code I jotted down in my notes but I am getting
"Error in '[.default'(fulldata, -training_data) :
What part of my code is incorrect?
Assuming that fulldata is a data frame, you need a comma in the subscript to indicate that you want the rows of the data frame (i.e. fulldata[rows,columns]). But the rows of the new data frame training_data sit at positions 1:100, so you need a different sort of indicator that corresponds between training_data and fulldata to show which rows of fulldata should not be included. What you might do is use the rownames, something like:
rest_data <- fulldata[!(rownames(fulldata) %in% rownames(training_data)), ]
which should tell R to drop the rows of fulldata whose rownames occur in training_data. If you have something like an ID variable that is unique to each row you could also use this:
rest_data <- fulldata[!(fulldata$ID %in% training_data$ID), ]
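Another option, if you can rerun the sampling step, is to keep the sampled row positions in their own variable and reuse them with a negative index. This is only a sketch: `fulldata` here is simulated, and `train_idx` is a name introduced for illustration.

```r
set.seed(1)  # only so the example is reproducible

# Hypothetical data standing in for the real fulldata
fulldata <- data.frame(ID = 1:250, value = rnorm(250))

# keep the sampled row positions instead of discarding them
train_idx <- sample(nrow(fulldata), 100)

training_data <- fulldata[train_idx, ]   # the 100 sampled rows
rest_data     <- fulldata[-train_idx, ]  # every row that was not sampled

print(c(nrow(training_data), nrow(rest_data)))
```

Because both subsets are built from the same index vector, they cannot overlap.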

Dropping unary variables in R

I would like to understand how I can drop variables from a data frame in R if they are unary, that is, if they contain only one value. I sometimes have data frames with thousands of variables, and one of my first steps would be to get rid of those variables (which are often handed over to me from a data warehouse).
I understand that I can drop columns like
drops <- c("x","z")
DF[,!(names(DF) %in% drops)]
as outlined here:
Drop data frame columns by name
But I would like some way of searching through all the variables, and dropping unary only.
I think this should identify a "nonunary" variable according to your definition:
nonunary <- function(x) length(unique(x))>1
And this should filter the variables in a data frame accordingly:
DF[sapply(DF,nonunary)]
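A small worked example may help show what this keeps and drops. The data frame is invented for illustration, and `nonunary_ignore_na` is a hypothetical variant, not part of the answer above. One thing to be aware of is that `unique()` treats NA as a value of its own, so a column holding one real value plus NAs counts as non-unary unless the NAs are removed first.

```r
# Hypothetical data: a and c are constant, b varies, d is constant apart from an NA
DF <- data.frame(a = c(1, 1, 1),
                 b = c(1, 2, 3),
                 c = c("x", "x", "x"),
                 d = c(NA, 2, 2))

nonunary <- function(x) length(unique(x)) > 1
kept <- DF[sapply(DF, nonunary)]

# variant that ignores missing values when counting distinct values
nonunary_ignore_na <- function(x) length(unique(x[!is.na(x)])) > 1
kept_ignore_na <- DF[sapply(DF, nonunary_ignore_na)]

print(names(kept))
print(names(kept_ignore_na))
```

Which behaviour is right depends on whether a missing value should count as information in your data.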

Removing rows causes "row.names" column to appear when displayed with View()

To remove rows from a data frame, I use the following command:
data <- data[-1, ]
for example to remove the first row. I need to remove the first 6 rows, so I used the following:
data <- data[-c(1,2,3,4,5,6), ]
OR
data <- data[-(1:6), ]
This works as far as removing the rows, but it introduces a new column called row.names that I cannot get rid of unless I use the command:
row.names(data) <- NULL
What is the reason for this? Is there a better way of removing a number of rows/columns with one command?
Example:
after the following code:
tquery <- tquery[-(1:6), ]
This is the data:
Although it seems as such, you are not actually adding a column to the data. What you are seeing is just a result of using View(). The function is showing the "row.names" attribute of the data frame as the first column, but you didn't really add the column.
This is expected and documented behavior. From the Details section of help(View):
If there are row names on the data frame that are not 1:nrow, they are displayed in a separate first column called row.names.
So since you subsetted the data, the row names are technically not 1:nrow any more and hence the new column is introduced in the viewer.
Print your data in the console and you'll see the difference.
View(mtcars) ## because the mtcars row names are not 1:nrow
versus
mtcars
Basically, don't trust View() to display an exact representation of the actual data. Instead use attributes(), *names(), dim(), length(), etc. or just peek at the data with head().
See the R help via "?row.names" for more info. From the documentation: "All data frames have a row names attribute".
?row.names ## get more information about row.names from the R help
row.names is not a new column, but rather an attribute of every data frame. It is simply metadata and is ignored by most functions. When you use the data frame in a function, the row names will not interfere; when you write it out (e.g. to CSV with write.csv()), pass row.names = FALSE if you do not want them written as a first column. This is similar to how Excel has row numbers on the left margin, which is referential data for the application.
str(your_dataframe) ## see that those columns don't exist
colnames(your_dataframe) ## see column names
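To see this outside of View(), here is a short sketch on a made-up data frame (`df` and `trimmed` are names chosen for illustration). It checks that removing rows leaves the columns untouched, that the surviving rows keep their original labels, and that setting the row names to NULL renumbers them.

```r
# Hypothetical ten-row data frame with two columns
df <- data.frame(id = 1:10, value = letters[1:10])

# drop the first six rows
trimmed <- df[-(1:6), ]

col_names <- names(trimmed)     # the columns are unchanged
rn_before <- rownames(trimmed)  # original labels survive the subsetting

# reset the row names to the default 1..n sequence
rownames(trimmed) <- NULL
rn_after <- rownames(trimmed)

print(col_names)
print(rn_before)
print(rn_after)
```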

Specifying names of columns to be used in a loop R

I have a df with over 30 columns and over 200 rows, but for simplicity will use an example with 8 columns.
X1<-c(sample(100,25))
B<-c(sample(4,25,replace=TRUE))
C<-c(sample(2,25,replace =TRUE))
Y1<-c(sample(100,25))
Y2<-c(sample(100,25))
Y3<-c(sample(100,25))
Y4<-c(sample(100,25))
Y5<-c(sample(100,25))
df<-cbind(X1,B,C,Y1,Y2,Y3,Y4,Y5)
df<-as.data.frame(df)
I wrote a function that melts the data and generates a plot with X1 giving the x-axis values, faceted using the values in B and C.
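library(reshape2) # provides melt()
library(ggplot2) # provides ggplot(), facet_grid() and ggsave()
plotdata<-function(l){
# reshape to long format, keeping only the requested measure column(s)
melted<-melt(df,id.vars=c("X1","B","C"),measure.vars=l)
p<-ggplot(melted,aes(x=X1,y=value))+geom_point()
p2<-p+facet_grid(B ~ C)
ggsave(filename=paste("X_vs_",l,"_faceted.jpeg",sep=""),plot=p2)
}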
I can then manually input the required Y variable
plotdata("Y1")
I don't want to generate plots for all columns. I could just type the column of interest into plotdata and then get the result, but this seems quite inelegant (and time consuming). I would prefer to be able to manually specify the columns of interest e.g. "Y1","Y3","Y4" and then write a loop function to do all those specified.
However, I am new to writing for loops and can't find a way to loop over the specific column names that are required for my function to work. A standard for(i in 1:length(df)) wouldn't be appropriate because I only want to loop over the user-specified columns.
Apologies if there is already an answer to this on Stack Overflow. I couldn't find it if there was.
Thanks to Roland for providing the following answer:
Try
for (x in c("Y1","Y3","Y4")) {plotdata(x)}
The index variable doesn't have to be numeric; for iterates over the elements of any vector, including a character vector of column names.
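To illustrate the same pattern without writing plot files, here is a sketch that loops over a character vector of column names and computes each column's mean. The data frame is a cut-down, made-up version of the one in the question, and `cols` and `col_means` are names introduced for illustration.

```r
set.seed(42)  # only so the example is reproducible

# Hypothetical data frame with a few of the Y columns
df <- data.frame(X1 = sample(100, 25),
                 Y1 = sample(100, 25),
                 Y3 = sample(100, 25),
                 Y4 = sample(100, 25))

# the user-specified columns of interest
cols <- c("Y1", "Y3", "Y4")

col_means <- numeric(0)
for (x in cols) {
  # x is the column name itself, so it can be used for indexing and for naming
  col_means[x] <- mean(df[[x]])
}

print(col_means)
```

Swapping the loop body for `plotdata(x)` gives the loop from the answer above.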
