I'm writing a function for a data set called opps, which holds part number sales data, and I'm trying to break the data down into smaller data sets that are specific to each part number. I am trying to name each data set with the argument "modNum". Here is what I have so far:
# modNum (Modified Product Number) takes a product number that looks
# like "950-0004-00" and makes it "opQty950.0004.00"
productNumber <- function(prodNum,modNum){
path <- "C:/Users/Data/"
readFile <- paste(path,"/opps.csv",sep="")
oppsQty <- read.csv(file=readFile,sep=",")
oppsQty$Line.Created.date <- as.Date(as.character(oppsQty$Line.Created),
"%m/%d/%Y")
modNum <- oppsQty[oppsQty$Part.Number=="prodNum",]
}
productNumber(280-0213-00,opQty280.0213.00)
#Error: object 'opQty910.0002.01' not found
The line I believe I'm having problems with is
modNum <- oppsQty[oppsQty$Part.Number=="prodNum",]
and it's because in order for the code to work, there have to be quotation marks around prodNum, but when I put the quotation marks in the code, prodNum is no longer seen as the argument to be filled in. When I put the quotation marks inside the argument like this:
productNumber(280-0213-00,"opQty280.0213.00")
I still have a problem.
How can I get around this?
I have tried rewriting the oppsQty$Part.Number variable to be numeric (shown below) so that I can eliminate the quotation marks altogether, but I still have errors...
productNumber <- function(prodNum,nameNum){
path <- "C:/Users/Data"
readFile <- paste(path,"/opps.csv",sep="")
oppsQty <- read.csv(file=readFile,sep=",")
oppsQty$Line.Created.date <- as.Date(as.character(oppsQty$Line.Created),
"%m/%d/%Y")
#ifelse(oppsQty$Part.Number=="Discount",
# oppsQty$Part.Number=="000000000",
# oppsQty$Part.Number)
oppsQty$Part <- paste(substr(oppsQty$Part.Number,1,3),
substr(oppsQty$Part.Number,5,8),
substr(oppsQty$Part.Number,10,11),sep = "")
oppsQty$Part <- as.numeric(oppsQty$Part)
oppsQty$Part[is.na(oppsQty$Part)] <- 0
nameNum <- oppsQty[oppsQty$Part==prodNum,]
}
> productNumber(401110201,opQty401.1102.01)
Warning message:
In productNumber(401110201, opQty401.1102.01) : NAs introduced by coercion
Help is much appreciated!
Thank you!
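At the moment you are passing prodNum as a numeric expression, so
280-0213-00 is evaluated as 67 (280 - 213 - 0 = 67).
You should pass (and treat) prodNum as a character string, since that is what you intend,
i.e. "280-0213-00". Inside the function, compare against prodNum without quotes: "prodNum" in quotes is the literal string, not the argument.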
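To make that concrete, here is a minimal sketch. It assumes a small in-memory stand-in for opps.csv (the Qty column and the sample rows are made up for illustration) and a simplified productNumber() that takes the data as an argument and returns the subset, so the caller chooses the name of the result:

```r
# Hypothetical stand-in for the data that would be read from opps.csv
oppsQty <- data.frame(
  Part.Number = c("280-0213-00", "950-0004-00", "280-0213-00"),
  Qty = c(5, 2, 7),
  stringsAsFactors = FALSE
)

# Without quotes, the part number is just arithmetic
280-0213-00   # 67

# Compare against the argument itself (no quotes around prodNum)
# and return the subset instead of assigning it inside the function
productNumber <- function(data, prodNum) {
  data[data$Part.Number == prodNum, ]
}

# The caller picks the name of the resulting data set
opQty280.0213.00 <- productNumber(oppsQty, "280-0213-00")
```

Returning the subset avoids having to pass the target name as an argument at all; an assignment inside the function only creates a local variable that disappears when the function exits.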
This is what I'm trying to do:
I have a large excel sheet I'm importing to R.
The data needs to be cleaned so one of the procedures is to test for character length.
Once the program finds a string that is too long, it needs to prompt the operator for a replacement
The operator inputs an alternative, and the program replaces the original with the input text.
The code I have seems to work procedurally, but the variable I have is not overwriting the original value.
library(tidyr)
library(dplyr)
library(janitor)
library(readxl)
fileToOpen <-read_excel(file.choose(),sheet="Data")
MasterFile <- fileToOpen
#This line checks the remaining bad strings in the column
CPNErrors <- nrow(filter(MasterFile,nchar(Field_to_Check) > 26))
#This line selects the bad field from the first in the list of strings to exceed the limit
TEST <- select(filter(MasterFile,nchar(Field_to_Check) > 26),Field_to_Check)[1,]
#This is the loop -- prompts the operator for a replacement, assigns a variable to the input and then replaces the bad value in the data frame
while (CPNErrors >= 1) {message("Replace ",TEST," with what?"); var=readline();MasterFile$Field_to_Check[MasterFile$Field_to_Check == TEST] <- var;print(var)}
The prompt works and assigns the readline() to the var, but the code will not replace the original string as a variable. When I run the code separately outside the loop, it will replace as long as I input an exact string (no variable assignment), so there's some syntactical thing I'm missing.
I've been searching for hours, and am just starting out in R, so if anyone can offer any assistance I'd greatly appreciate it.
EDIT -- ok... I think I found the source of the problem, but I don't know how to fix it. When I run
MasterFile$Field_to_Check[MasterFile$Field_to_Check == TEST]
It comes with a null result, but if I run
MasterFile$Field_to_Check[MasterFile$Field_to_Check == "Some Text that's in the data frame"]
It comes out with a result. Any idea on why I can't filter this list by the variable? The TEST variable comes out as expected.
Try this approach with a for loop, which replaces each value by its index instead of matching it against TEST:
CPNErrors <- which(nchar(MasterFile$Field_to_Check) > 26)
for(i in CPNErrors){
var=readline(paste0("Replace ",MasterFile$Field_to_Check[i]," with what? "))
MasterFile$Field_to_Check[i] <- var
}
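For testing without an operator at the keyboard, the same index-based idea can be wrapped in a function. This is only a sketch: replace_long() and its ask argument are names made up here, and the sample strings are invented. By default it prompts with readline(), but a stub can be passed in instead:

```r
# Replace every element longer than `limit` with whatever `ask` returns.
# `ask` defaults to readline, so interactive use prompts the operator.
replace_long <- function(x, limit, ask = readline) {
  too_long <- which(nchar(x) > limit)   # positions of the bad strings
  for (i in too_long) {
    x[i] <- ask(paste0("Replace ", x[i], " with what? "))
  }
  x
}

# Non-interactive demo: a stub stands in for the operator's input
values <- c("short", "this string is definitely longer than the limit")
fixed <- replace_long(values, limit = 26,
                      ask = function(prompt) "shorter text")
```

Because the replacement is done by position, nothing depends on comparing the column against a separately extracted value such as TEST.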
Let's say that I have these vectors:
time <- c(306,455,1010,210,883,1022,310,361,218,166)
status <- c(1,1,0,1,1,0,1,1,1,1)
gender <- c(1,1,1,1,1,1,2,2,1,1)
And I turn them into this data frame:
dataset <- data.frame(time, status, gender)
I want to list the factors in the third column using this function (p/s: pardon the immaturity. I'm still learning):
getFactor<-function(dataset){
result <- list()
result["Factors"] <- unique(dataset[[3]])
return(result)
}
And all I get is this:
getFactor(dataset)
$Factors
[1] 1
Warning message:
In result["Factors"] <- unique(dataset[[3]]) :
number of items to replace is not a multiple of replacement length
I tried using levels, but all I get is an empty list. My question is (1) why does this happen? and (2) is there any other way that I can get the list of the factor in a function?
Solution is simple, you just need double brackets around "Factors" :)
In the function
result[["Factors"]] <- unique(dataset[[3]])
That should be the line.
The double brackets return an element, single brackets return that selection as a list.
Sounds silly, but try this:
test <- list()
class(test["Factors"])
class(test[["Factors"]])
The first class will be of type 'list'. The second will be of type 'NULL'. This is because the single brackets returns a subset as a list, and the double brackets return the element itself. It's useful depending on the scenario. The element in this case is "NULL" because nothing has been assigned to it.
The warning "number of items to replace is not a multiple of replacement length" appears because you've asked R to put 2 values (1 and 2) into a single list element, so only the first one is kept. When you use double brackets, the whole vector is stored as one element of the list, and an element can hold a vector of any length, so it works!
Hope that makes sense!
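A small sketch of the two assignment forms side by side, using the gender vector from the question (the variable names result and single are just for illustration):

```r
gender <- c(1, 1, 1, 1, 1, 1, 2, 2, 1, 1)

# Double brackets: the whole vector becomes one list element
result <- list()
result[["Factors"]] <- unique(gender)

# Single brackets also work if the right-hand side is itself a list
# of length one, so the lengths on both sides match
single <- list()
single["Factors"] <- list(unique(gender))

result$Factors
single$Factors
```

Both lists end up with one element named Factors that holds both unique values.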
Currently, when you create your data frame, dataset$gender is a double vector (which R will automatically do if everything in it is numbers). If you want it to be a factor, you can declare it that way at the beginning:
dataset <- data.frame(time, status, gender = as.factor(gender))
Or coerce it to be a factor later:
dataset$gender <- as.factor(gender)
Then getting a vector of the levels is simple, without writing a function:
level_vector <- levels(dataset$gender)
level_vector
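If you still want a function, a minimal sketch could look like this (get_levels is a made-up name; it converts the column to a factor on the fly, so it also works when the column is still numeric):

```r
gender <- c(1, 1, 1, 1, 1, 1, 2, 2, 1, 1)
dataset <- data.frame(gender = gender)

# Return the levels of one column, coercing it to a factor first
get_levels <- function(dataset, column) {
  levels(as.factor(dataset[[column]]))
}

lv <- get_levels(dataset, "gender")
lv
```

Note that levels are always character strings, so the result here is "1" and "2", not the numbers 1 and 2.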
A note on subsetting in your function: for a plain data frame, both dataset[[3]] and dataset[, 3] return the third column as a vector, so either form works there. A single element of a list is accessed with double brackets, e.g. list[[1]].
In R, I am using readHTMLTable to read in tables from the web. The tables I want occur at indexes 16 & 17, [[16]] & [[17]].
Here is a small sample of the data for you to work with:
These are some of the urls that contain the HTML tables.
url1 = "http://www.basketball-reference.com/leagues/NBA_1980.html"
url2 = "http://www.basketball-reference.com/leagues/NBA_1981.html"
url3 = "http://www.basketball-reference.com/leagues/NBA_1982.html"
And here, I read in the tables to variables named x1, x2, and x3.
x1 = readHTMLTable(url1)
x2 = readHTMLTable(url2)
x3 = readHTMLTable(url3)
If you look at the summary of each of these summary(x1), summary(x2), summary(x3) and count down through the indexes, the tables I want are the ones named "team" and "opponent", which occur on line 16 and line 17.
I have been trying to write a loop that would cycle through these and assign the "team" table from each to variables named team.1980, team.1981, and team.1982, respectively. The "opponent" tables would follow the same pattern: opp.1980, and so forth.
This is the code for the loop I have been trying:
for(i in 1:3) {
for (j in 1980:1982) {
nam1 = paste0("team.", j)
nam2 = paste0("opp.", j)
assign(nam1, paste0("x.", i)[[16]])
assign(nam2, paste0("x.", i)[[17]])
}
}
I think the theory behind this loop works, however the problem occurs with the two assign functions:
assign(nam1, paste0("x.", i)[[16]])
assign(nam2, paste0("x.", i)[[17]])
When I run the loop, I get the error message
Error in paste0("x.", i)[[16]] : subscript out of bounds
which is the same error I get if I just run:
paste0("x", 1)[[16]]
> paste0("x", 1)[[16]]
Error in paste0("x", 1)[[16]] : subscript out of bounds
So I am pretty sure this is where my problem is. Does anyone know how I could cycle through variables and pull out indexes from each?
Please keep in mind that I am rather new to R, so simplicity would be much appreciated! Thanks in advance!
The output from readHTMLTable() is a list and the elements can be referenced by name; index isn't necessary. (Though you can use it.)
Suppose x1, x2, and x3 are defined as in your post. Then you can just do this:
for (i in 1:3) {
year <- 1980 + i - 1
eval(parse(text=paste0("team.", year, " <- x", i, '[["team"]]')))
eval(parse(text=paste0("opp.", year, " <- x", i, '[["opponent"]]')))
}
This evaluates the parsed text that's constructed dynamically in the loop. It creates 6 data frames: team.YYYY and opp.YYYY for each of the years 1980 to 1982.
Let's take a closer look at what it's doing...
First a string is constructed using paste0() to concatenate the values into a string with no separator. The first call to paste0() in the first iteration yields this string:
'team.1980 <- x1[["team"]]'
Calling parse() on this tells R to turn that string into an object called an expression. Expressions can be evaluated using eval(). So this string gets turned into an R statement and executed, thereby assigning team.1980.
This process continues for each of the 3 iterations.
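Here is a single iteration spelled out step by step, as a sketch. The x1 below is a tiny made-up stand-in for the real readHTMLTable() output, just so the code runs without a network connection:

```r
# Hypothetical stand-in for the list returned by readHTMLTable(url1)
x1 <- list(team = data.frame(wins = 60),
           opponent = data.frame(wins = 22))

# Step 1: build the statement as a string
stmt <- paste0("team.", 1980, " <- x", 1, '[["team"]]')

# Step 2: turn the string into an expression
expr <- parse(text = stmt)

# Step 3: evaluate it, which performs the assignment
eval(expr)

team.1980
```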
This may not be the best approach, but it should work in your situation. I assume you have more than just these 6, otherwise you might as well just write them as individual assignments.
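As one possible alternative (not the approach above, just a sketch), get() can look up a variable by its name and assign() can create one, which is close to what the loop in the question was attempting. Keeping everything in a named list is another option. The x1 to x3 objects below are small made-up stand-ins for the real tables:

```r
# Hypothetical stand-ins for readHTMLTable() results
x1 <- list(team = data.frame(pts = 1), opponent = data.frame(pts = 2))
x2 <- list(team = data.frame(pts = 3), opponent = data.frame(pts = 4))
x3 <- list(team = data.frame(pts = 5), opponent = data.frame(pts = 6))

# Option 1: get() fetches x1, x2, x3 by name; assign() creates team.YYYY
for (i in 1:3) {
  year <- 1980 + i - 1
  tables <- get(paste0("x", i))
  assign(paste0("team.", year), tables[["team"]])
  assign(paste0("opp.", year), tables[["opponent"]])
}

# Option 2: keep the tables in lists keyed by year
all_tables <- list("1980" = x1, "1981" = x2, "1982" = x3)
team <- lapply(all_tables, function(x) x[["team"]])
opp  <- lapply(all_tables, function(x) x[["opponent"]])

team[["1981"]]
```

The list version scales more easily to many years, since later steps can loop over team and opp instead of over dozens of separately named variables.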
I have the following block of code. I am a complete beginner in R (a few days old), so I am not sure how much of the code I need to share to explain my problem. So here is all of what I have written.
mdata <- read.csv("outcome-of-care-measures.csv",colClasses = "character")
allstate <- unique(mdata$State)
allstate <- allstate[order(allstate)]
spldata <- split(mdata,mdata$State)
if (num=="best") num <- 1
ranklist <- data.frame("hospital" = character(),"state" = character())
for (i in seq_len(length(allstate))) {
if (outcome=="heart attack"){
pdata <- spldata[[i]]
pdata[,11] <- as.numeric(pdata[,11])
bestof <- pdata[!is.na(as.numeric(pdata[,11])),][]
inorder <- order(bestof[,11],bestof[,2])
if (num=="worst") num <- nrow(bestof)
hospital <- bestof[inorder[num],2]
state <- allstate[i]
ranklist <- rbind(ranklist,c(hospital,state))
}
}
allstate is a character vector of states.
outcome can have values similar to "heart attack"
num will be numeric or "best" or "worst"
I want to create a data frame ranklist which will have hospital names and the state names which follow a certain criterion.
However I keep getting the error
invalid factor level, NA generated
I know it has something to do with rbind, but I cannot figure out what it is. I have tried googling this, and I have also tried troubleshooting using other similar questions on this site. I have checked that none of the vectors I am trying to bind are factors. I also tried forcing the coercion by wrapping hospital and state in as.character() during assignment, but that didn't work.
I would be grateful for any help.
Thanks in advance!
Since this is apparently from a Coursera assignment, I am not going to give you a solution, but I will hint at it. Have a look at the help pages for read.csv and data.frame. Both have the argument stringsAsFactors. What is the default, TRUE or FALSE? Do you want to keep the default setting? Is colClasses = "character" in line 1 necessary? Use the str function to check the classes of the columns in mdata and ranklist. read.csv additionally has an na.strings argument. If you use it correctly, the "NAs introduced by coercion" warning will also disappear and line 16 won't be necessary.
Finally, don't grow a matrix or data frame inside a loop if you know the final size beforehand. Initialize it with the correct dimensions (here 52 x 2) and assign e.g. the i-th hospital to the i-th row and first column of the data frame. That way rbind is not necessary.
By the way, you did not get an error but a warning. R didn't interrupt the loop; it just let you know that some values have been coerced to NA. You can also simplify the seq_len statement by using seq_along instead.
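To illustrate only the preallocation pattern (not the assignment itself), here is a sketch with made-up state codes and hospital names. The data frame is created at its final size with character columns and then filled row by row:

```r
# Hypothetical inputs, just to have something to fill in
states <- c("AK", "AL", "AZ")
best   <- c("Hospital A", "Hospital B", "Hospital C")

# Preallocate with the final number of rows; keep columns as character
ranklist <- data.frame(hospital = character(length(states)),
                       state    = character(length(states)),
                       stringsAsFactors = FALSE)

# Fill by position instead of growing the data frame with rbind
for (i in seq_along(states)) {
  ranklist[i, "hospital"] <- best[i]
  ranklist[i, "state"]    <- states[i]
}

str(ranklist)
```

Since both columns are character vectors from the start, no factor levels are involved and the "invalid factor level" warning cannot occur.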
I'm having some trouble understanding how R handles subsetting internally and this is causing me some issues while trying to build some functions. Take the following code:
f <- function(directory, variable, number_seq) {
##Create a empty data frame
new_frame <- data.frame()
## Add every data frame in the directory whose name is in the number_seq to new_frame
## the file variable specify the path to the file
for (i in number_seq){
file <- paste("~/", directory, "/",sprintf("%03d", i), ".csv", sep = "")
x <- read.csv(file)
new_frame <- rbind.data.frame(new_frame, x)
}
## calculate and return the mean
mean(new_frame[, variable], na.rm = TRUE)  # (*) see the note below
}
*While calculating the mean, I first tried to subset using the $ sign (new_frame$variable) and the subset function (subset(new_frame, select = variable)), but they would only return a None value. It only worked when I used new_frame[, variable].
Can anyone explain why the other subsetting didn't work? It took me a really long time to figure it out, and even though I managed to make it work, I still don't know why it didn't work the other ways. I really want to look inside the black box so I won't have the same issues in the future.
Thanks for the help.
This behavior has to do with the fact that you are subsetting inside a function.
Both new_frame$variable and subset(new_frame, select = variable) look for a column in the data frame with the name variable.
On the other hand, new_frame[, variable] uses the value of the variable argument of f(directory, variable, number_seq) to select the column.
The dollar sign ($) can only be used with literal column names. That avoids confusion with
dd<-data.frame(
id=1:4,
var=rnorm(4),
value=runif(4)
)
var <- "value"
dd$var
In this case, if $ accepted either variables or column names, which would you expect: the dd$var column or the dd$value column (because var == "value")? That's why dd[, var] is different: the index is evaluated as an ordinary R expression (here a character vector) and is never treated as a literal column name. You will get the value column with dd[, var].
I'm not quite sure why you got None with subset(); I was unable to replicate that problem.
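A short sketch of the difference inside a function, using fixed numbers instead of rnorm()/runif() so the result is reproducible (col_mean and dollar_lookup are names made up for this example):

```r
dd <- data.frame(id    = 1:4,
                 var   = c(10, 20, 30, 40),
                 value = c(0.1, 0.2, 0.3, 0.4))

# [[ ]] evaluates `variable`, so the column named by its value is used
col_mean <- function(df, variable) {
  mean(df[[variable]], na.rm = TRUE)
}

# $ takes the name literally: it looks for a column called "variable"
dollar_lookup <- function(df, variable) {
  df$variable
}

col_mean(dd, "value")
dollar_lookup(dd, "value")   # NULL, since dd has no column named "variable"
```

Inside functions, df[[variable]] or df[, variable] is therefore the form to use whenever the column name arrives as an argument.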