A lot of people seem to have this issue; however, I was not able to find a satisfying answer. If you will indulge me, I would like to be sure I understand what's happening.
I have dates in various formats in a data frame (also a common issue), so I have built a small function to handle them for me:
dateHandler <- function(inputString){
if(grepl("-",inputString)==T){
lubridate::dmy(inputString, tz="GMT")
}else{
as.POSIXct(as.numeric(inputString)*60*60*24, origin="1899-12-30", tz="GMT")
}
}
When using it on one element it works fine:
myExample <-c("18-Mar-11","42433")
> dateHandler(myExample[1])
[1] "2011-03-18 GMT"
> dateHandler(myExample[2])
[1] "2016-03-04 GMT"
However when using it on a whole column it does not work:
myDf <- as.data.frame(myExample)
> myDf <- myDf %>%
+ dplyr::mutate(dateClean=dateHandler(myExample))
Warning messages:
1: In if (grepl("-", inputString) == T) { :
the condition has length > 1 and only the first element will be used
2: 1 failed to parse.
From reading on the forum, my current understanding is that R passes a vector with all the elements of myDf$myExample to the function, which is not built to handle vector of length >1. If that is correct, the next step is to understand what to do from there. Many people recommend using ifelse rather than if but I do not understand how this would help me. Also I read that ifelse returns something of the same format as its input, which does not work for me in that case.
Thank you in advance for answering this question for the 10000th time.
Nicolas
You have two options on where to go from there. One is to apply your current function to each element using lapply. Note that lapply returns a list, so dateClean becomes a list column. As in:
myDf$dateClean <- lapply(myDf$myExample, function(x) dateHandler(x))
The other option is to build a vectorized function that is designed to take a vector as an input rather than a single data point. Here is a simple example:
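dateHandlerVectorized <- function(inputVector){
  # Coerce to character in case the column was stored as a factor
  inputVector <- as.character(inputVector)
  # Pre-allocate a POSIXct vector of NAs, one slot per input element
  output <- as.POSIXct(rep(NA_real_, length(inputVector)), origin="1970-01-01", tz="GMT")
  # TRUE where the string contains a dash, i.e. looks like "18-Mar-11"
  useLubridate <- grepl("-", inputVector)
  output[useLubridate] <- lubridate::dmy(inputVector[useLubridate], tz="GMT")
  # Everything else is treated as a day count from 1899-12-30
  output[!useLubridate] <- as.POSIXct(as.numeric(inputVector[!useLubridate])*60*60*24, origin="1899-12-30", tz="GMT")
  output
}
myDf <- myDf %>% dplyr::mutate(dateClean=dateHandlerVectorized(myExample))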
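On the ifelse concern raised in the question: ifelse is vectorized, but its result takes its attributes from the test rather than from the values, so date-times come back as bare numbers. The base-R sketch below illustrates this with two made-up date-times (the variable names are only for illustration) and shows one way to restore the class afterwards.

```r
# Two made-up date-times, for illustration only
d <- as.POSIXct(c("2011-03-18", "2016-03-04"), tz = "GMT")

# ifelse picks element-wise, but drops the POSIXct class
picked <- ifelse(c(TRUE, FALSE), d, d + 86400)
class(picked)   # "numeric"

# The numbers are seconds since 1970-01-01, so the class can be restored
restored <- as.POSIXct(picked, origin = "1970-01-01", tz = "GMT")
format(restored, "%Y-%m-%d")
```

This is why logical indexing into a pre-allocated POSIXct vector, as in the vectorized function above, tends to be the more convenient route for dates.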
I'm trying to insert a check step in my R script to determine if the structure of the CSV table I'm reading is as expected.
See details:
table.csv has the following colnames:
[1] "A","B","C","D"
This file is generated by someone else, hence I'd like to make sure at the beginning of my script that the colnames and the number/order of columns have not changed.
I tried to do the following:
#dataframes to import
df_table <- read.csv('table.csv')
#define correct structure of file
Correct_Columns <- c('A','B','C','D')
#read current structure of table
Current_Columns <- colnames(df_table)
#Check whether CSV was correctly imported from Source
if(Current_Columns != Correct_Columns)
{
# if structure has changed, stop the script.
stop('Imported CSV has a different structure, please review export from Source.')
}
#if not, continue with the rest of the script...
Thanks in advance for any help!
Using base R, I suggest you take a look at all.equal(), identical() or any().
See the following example:
a <- c(1,2)
b <- c(1,2)
c <- c(1,2)
d <- c(1,2)
df <- data.frame(a,b,c,d)
names.df <- colnames(df)
names.check <- c("a","b","c","d")
!isTRUE(all.equal(names.df,names.check))
# [1] FALSE
!identical(names.df,names.check)
# [1] FALSE
any(names.df!=names.check)
# [1] FALSE
Accordingly, your code could be modified as follows:
if(!isTRUE(all.equal(Current_Columns,Correct_Columns)))
{
# call your stop statement here
}
Your code probably throws a warning (an error as of R 4.2.0) because Current_Columns!=Correct_Columns compares the vectors element by element (i.e. running Current_Columns!=Correct_Columns on its own in the console returns a vector of TRUE/FALSE values), while if() expects a single value.
In contrast, all.equal() and identical() compare the vectors as whole objects.
For the sake of completeness, please be aware of the difference between all.equal() and identical(). identical() always returns a single TRUE or FALSE, whereas all.equal() returns TRUE or a character description of the differences, so wrap it in isTRUE() when using it in if(). For character vectors like yours either one works, but the choice can matter for numerical vectors, where all.equal() allows a small tolerance. See here for more information.
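As a minimal sketch of the check described above, the helper below stops with a message when the column names differ in content, number, or order. The name check_columns and the two in-memory data frames are hypothetical stand-ins for the imported CSV.

```r
# Hypothetical helper: stop unless the column names match exactly, in order
check_columns <- function(df, expected) {
  current <- colnames(df)
  if (!identical(current, expected)) {
    stop("Imported CSV has a different structure: expected ",
         paste(expected, collapse = ", "), " but found ",
         paste(current, collapse = ", "))
  }
  invisible(TRUE)
}

# Made-up data frames standing in for read.csv('table.csv')
good <- data.frame(A = 1, B = 2, C = 3, D = 4)
bad  <- data.frame(A = 1, C = 3, B = 2, D = 4)   # columns B and C swapped

ok  <- check_columns(good, c("A", "B", "C", "D"))
msg <- tryCatch(check_columns(bad, c("A", "B", "C", "D")),
                error = function(e) conditionMessage(e))

# all.equal() returns a description, not FALSE, when the vectors differ
diffNote <- all.equal(colnames(bad), c("A", "B", "C", "D"))
is.character(diffNote)   # TRUE
```

identical() is used here because it always yields a single logical value, which is what if() needs.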
A quick way with data.table:
library(data.table)
DT <- fread("table.csv")
Correct_Columns <- c('A','B','C','D')
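Current_Columns <- colnames(DT)
Check that the lengths agree and that there is no FALSE in the pairwise matching (the parentheses matter, because %in% binds more tightly than ==):
if(length(Current_Columns) != length(Correct_Columns) || FALSE %in% (Current_Columns == Correct_Columns)){
  stop('Imported CSV has a different structure, please review export from Source.')
}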
So I came across a very strange problem in writing a convenience function to count the number of rows in each dataframe in a list of dataframes. I think there must be some basic behavior I'm missing, like indexing over lists doesn't work the way I think it does, or something's getting coerced to the wrong type of variable or something. Can someone help a brother out?
Reproducible example:
myvec <- c(1,2,3,4,5)
df1 <- as.data.frame(rbind(myvec, myvec))
df2 <- as.data.frame(rbind(myvec, myvec, myvec))
dflist <- list(df1, df2)
nrow(dflist[[1]])
# output as expected: [1] 2
nrow(dflist[[2]])
# output as expected: [1] 3
# convenience function
countrows <- function(pglist) {
dfsizes <- rep(NA, length(pglist))
for (i in length(pglist)) {
dfsizes[i] <- nrow(pglist[[i]])
return(dfsizes)
}
}
newvector <- countrows(dflist)
newvector
# output totally not as expected: [1] NA 3
I've gotta be missing something obvious here.
Yes, I know that this could be done perfectly easily with lapply(dflist, nrow) --- and that actually does produce the right output. But clearly I don't know how to loop over the elements of a list properly, and that is a problem totally apart from there being an easier way to do what I'm trying to achieve...
Edit: a kind commenter pointed out that I had the return statement inside the for loop, oops. However, correcting that still produces the same bad output:
countrows2 <- function(pglist) {
dfsizes <- rep(NA, length(pglist))
for (i in length(pglist)) {
dfsizes[i] <- nrow(pglist[[i]])
}
return(dfsizes)
}
doom <- countrows2(dflist)
doom
# still bad output: [1] NA 3
second edit: I am bad at avoiding stupid syntax errors, like forgetting to start the loop at 1. Double whoops. See comments from Neal Fultz, who is less bad at avoiding stupid syntax errors than I am.
Your code has a problem: the for() part needs to be 1:length(pglist), not just length(pglist). As written, the loop runs only once, with i equal to length(pglist). Taking the return expression out of the loop is also necessary.
countrows <- function(pglist) {
dfsizes <- rep(NA, length(pglist))
for (i in 1:length(pglist)) {
dfsizes[i] <- nrow(pglist[[i]])
}
return(dfsizes)
}
newvector <- countrows(dflist)
newvector
This should work as expected now. Cheers
Edit: I am not allowed to comment yet
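A small variation on the fix above, offered as a sketch rather than the answer's own code: seq_along() produces the same indices as 1:length() for a non-empty list, but yields nothing for an empty one, whereas 1:length() would count down from 1 to 0. The name countrowsSafe and the toy data frames are made up for this example.

```r
# Made-up data frames with 2 and 3 rows
dflist <- list(data.frame(x = 1:2), data.frame(x = 1:3))

# countrowsSafe is a hypothetical name used only for this sketch
countrowsSafe <- function(pglist) {
  dfsizes <- integer(length(pglist))
  for (i in seq_along(pglist)) {   # visits 1, 2, ..., length(pglist)
    dfsizes[i] <- nrow(pglist[[i]])
  }
  dfsizes
}

sizes      <- countrowsSafe(dflist)   # 2 3
emptySizes <- countrowsSafe(list())   # integer(0), the loop body never runs

# The pitfall seq_along() avoids: on an empty list this gives c(1, 0)
badIndex <- 1:length(list())

# The loop-free equivalent, with an explicit return type
viaVapply <- vapply(dflist, nrow, integer(1))
```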
Does anyone know why the result of the following code is different?
a <- cbind(1:10,1:10)
b <- a
colnames(a) <- c("a","b")
colnames(b) <- c("c","d")
colnames(cbind(a,b))
> [1] "a" "b" "c" "d"
colnames(cbind(ts(a),ts(b)))
> [1] "ts(a).a" "ts(a).b" "ts(b).c" "ts(b).d"
Is this for compatibility reasons? cbind for xts and zoo does not have this feature.
I always accepted this as given, but now my code is littered with the following:
ca<-colnames(a)
cb<-colnames(b)
out <- cbind(a,b)
colnames(out) <- c(ca,cb)
This is just what the cbind.ts method does. You can see the relevant code via stats:::cbind.ts, stats:::.cbind.ts, and stats:::.makeNamesTs.
I can't explain why it was made to be different, since I didn't write it, but here's a work-around.
cbts <- function(...) {
dots <- list(...)
ists <- sapply(dots,is.ts)
if(!all(ists)) stop("argument ", which(!ists), " is not a ts object")
do.call(cbind,unlist(lapply(dots,as.list),recursive=FALSE))
}
I take it that you're interested in why this happens.
Taking a look at the body of stats:::.cbind.ts, which is the function that does column binding for time series, shows that naming is performed by .makeNamesTs. Taking a look at stats:::.makeNamesTs reveals that the names are derived directly from the arguments you pass to cbind, and there is no obvious way to influence this. As an example, try:
cbind(ts(a),ts(b, start = 2))
You will find that the start specification of the second time series appears in the name of the respective columns.
As to why that's the way things are ... I can't help you there!
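If the goal is only to stop repeating the rename step, the four lines from the question can be wrapped in a small helper. This is a sketch of that idea for two inputs; cbind_keep_names is a hypothetical name, not a function from stats.

```r
# Hypothetical helper wrapping the rename step from the question
cbind_keep_names <- function(x, y) {
  out <- cbind(x, y)                              # dispatches to cbind.ts for ts inputs
  colnames(out) <- c(colnames(x), colnames(y))    # restore the original column names
  out
}

a <- cbind(1:10, 1:10)
b <- a
colnames(a) <- c("a", "b")
colnames(b) <- c("c", "d")

res <- cbind_keep_names(ts(a), ts(b))
colnames(res)
```

The result is still a time series, only with the original column names put back.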
I am just beginning to learn R and am having an issue that is leaving me fairly confused. My goal is to create an empty vector and append elements to it. Seems easy enough, but solutions that I have seen on stackoverflow don't seem to be working.
To wit,
> a <- numeric()
> append(a,1)
[1] 1
> a
numeric(0)
I can't quite figure out what I'm doing wrong. Anyone want to help a newbie?
append does something that is somewhat different from what you are thinking. See ?append.
In particular, note that append does not modify its argument. It returns the result.
You want the function c:
> a <- numeric()
> a <- c(a, 1)
> a
[1] 1
Your a vector is not being passed by reference, so when it is modified you have to store it back into a. You cannot access a and expect it to be updated.
You just need to assign the return value to your vector, just as Matt did:
> a <- numeric()
> a <- append(a, 1)
> a
[1] 1
Matt is right that c() is preferable (fewer keystrokes and more versatile) though your use of append() is fine.
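To make the point about return values concrete, here is a short sketch. It also uses the after argument of append(), which is the one thing append() offers that c() does not.

```r
a <- numeric()

append(a, 1)      # returns c(1), but the result is discarded
length(a)         # still 0: a itself was not modified
lenBefore <- length(a)

a <- append(a, 1)               # store the result back into a
a <- c(a, 2)                    # c() works the same way
a <- append(a, 1.5, after = 1)  # insert after the first element
a
```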
I'm writing a function for a data set called opps on part number sales data, and I'm trying to break the data down into smaller data sets that are specific to the part numbers. I am trying to name the data sets as the argument "modNum". Here is what I have so far-
# modNum (Modified Product Number) takes a product number that looks
# like "950-0004-00" and makes it "opQty950.0004.00"
productNumber <- function(prodNum,modNum){
path <- "C:/Users/Data/"
readFile <- paste(path,"/opps.csv",sep="")
oppsQty <- read.csv(file=readFile,sep=",")
oppsQty$Line.Created.date <- as.Date(as.character(oppsQty$Line.Created),
"%m/%d/%Y")
modNum <- oppsQty[oppsQty$Part.Number=="prodNum",]
}
productNumber(280-0213-00,opQty280.0213.00)
#Error: object 'opQty910.0002.01' not found
The line I believe I'm having problems with is
modNum <- oppsQty[oppsQty$Part.Number=="prodNum",]
and it's because in order for the code to work, there have to be quotation marks around prodNum, but when I put the quotation marks in the code,
prodNum is no longer seen as the argument to be filled in. When I put the quotation marks inside the argument like this:
productNumber(280-0213-00,"opQty280.0213.00")
I still have a problem.
How can I get around this?
I have tried rewriting the oppsQty$Part.Number variable to be numeric (shown below) so that I can eliminate the quotation marks altogether, but I still have errors...
productNumber <- function(prodNum,nameNum){
path <- "C:/Users/Data"
readFile <- paste(path,"/opps.csv",sep="")
oppsQty <- read.csv(file=readFile,sep=",")
oppsQty$Line.Created.date <- as.Date(as.character(oppsQty$Line.Created),
"%m/%d/%Y")
#ifelse(oppsQty$Part.Number=="Discount",
# oppsQty$Part.Number=="000000000",
# oppsQty$Part.Number)
oppsQty$Part <- paste(substr(oppsQty$Part.Number,1,3),
substr(oppsQty$Part.Number,5,8),
substr(oppsQty$Part.Number,10,11),sep = "")
oppsQty$Part <- as.numeric(oppsQty$Part)
oppsQty$Part[is.na(oppsQty$Part)] <- 0
nameNum <- oppsQty[oppsQty$Part==prodNum,]
}
> productNumber(401110201,opQty401.1102.01)
Warning message:
In productNumber(401110201, opQty401.1102.01) : NAs introduced by coercion
Help is much appreciated!
Thank you!
At the moment you are passing prodNum as a numeric expression, so
280-0213-00 is evaluated as 67 (280 - 213 - 0 = 67).
You should pass (and treat) prodNum as a character string, since that is what you intend,
i.e. "280-0213-00".
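A minimal sketch of how this could look in practice, under a few assumptions: the opps data frame below is made up, subsetByPart is a hypothetical name, and the comparison uses the prodNum variable itself rather than the literal string "prodNum". The function returns the subset, and the caller chooses the name to assign it to, instead of passing the desired name in as an argument.

```r
# Unquoted, the part number is just arithmetic
280-0213-00   # 67

# Made-up stand-in for the data read from opps.csv
opps <- data.frame(Part.Number = c("280-0213-00", "950-0004-00", "280-0213-00"),
                   Qty = c(1, 2, 3),
                   stringsAsFactors = FALSE)

# Hypothetical helper: compare against the value of prodNum, not the text "prodNum"
subsetByPart <- function(data, prodNum) {
  data[data$Part.Number == prodNum, ]
}

# The caller decides what the result is called
opQty280.0213.00 <- subsetByPart(opps, "280-0213-00")
nrow(opQty280.0213.00)   # 2
```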