Argument is of length 0, no NA's - r

I have a dataframe that looks like this:
logentrytime ord_lat_dt0 ord_lat_dt1 ord_lat_dt2 ord_lat_dt3 ord_lat_dt4 ord_lat_dt5 ord_lat_dt6 ord_lat_dt7 ord_lat_dt8 ord_lat_dt9 ord_num0 ord_num1 ord_num2
1 2016-11-10 14:23:36 0 0 0 0 0 0 2016-02-12 0 0 0 0 0 0
2 2016-11-10 14:22:22 0 0 0 0 0 0 2016-02-12 0 0 0 0 0 0
3 2016-11-07 16:02:45 0 0 0 0 0 0 2016-02-12 0 0 0 0 0 0
4 2016-11-07 21:10:00 0 0 0 0 0 0 2016-02-12 0 0 0 0 0 0
5 2016-11-07 16:03:29 0 0 0 0 0 0 2016-02-12 0 0 0 0 0 0
6 2016-11-10 14:23:05 0 0 0 0 0 0 2016-02-12 0 0 0 0 0 0
Where ord_lat_dt columns are last purchase date of a customer. ord_lat_dt[0-9] were pulled from different database tables. Thus each row represents one customer, and their last order date will be indicated in one of the 9 columns.
I would like to merge these, but before I do, want to calculate "months_since_last_purchase" based on the date in each column.
Thus, I have converted the date columns into character strings, and am looping through using these functions:
elapsed_time <- function(end_date, start_date) {
ed <- as.POSIXlt(end_date)
sd <- as.POSIXlt(start_date)
12 * (ed$year - sd$year) + (ed$mon - sd$mon)
}
convert_time <- function(data, column){
for(i in seq(1,length(data$column))){
if((data$column[i]!= "0") ==TRUE){
data$column[i] <- elapsed_months(Sys.time(), as.Date(data$column[i], format="%Y/%m/%d"))
}
}
return(data)
}
test1 <- convert_time(test2, ord_lat_dt0)
But I obtain the error
Error in if ((data$column[i] != "0") == TRUE) { :
argument is of length zero
I've also tried changing the if statement to check :
grepl("[-]", data$column[i])==FALSE)
But I obtain the same error.
Any ideas?
If you decide to down vote, please explain to me what is wrong with my question. I am trying to learn and would like to make sure I am asking correctly.
NOTE: I am having a different issue and have completely changed the question. Thus some of the comments below do not apply. Because of the down votes, I could not open a new question.

The problem here is that when you do data_theme[is.na(data_theme)] <- 0, the NAs in the date columns are replaced as well. But the date columns are in POSIXct format, and if you try as.POSIXct(0), it will throw an error.
One solution could be to do it in two steps: first replace the NAs in the numeric columns, then do whatever you want with the POSIXct values:
library(dplyr)
df %>%
mutate_if(is.numeric, ~ if_else(is.na(.), 0, .))
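For comparison, here is a minimal base R sketch of the same two-step idea. The data frame and its column names (when, n) are made up for illustration. Only the numeric columns are touched, so the POSIXct column keeps its NA:

```r
# Hypothetical example data: one POSIXct column, one numeric column
df <- data.frame(
  when = as.POSIXct(c("2016-11-10 14:23:36", NA), tz = "UTC"),
  n    = c(NA, 3)
)

# is.numeric() is FALSE for date-time columns, so they are skipped
num_cols <- vapply(df, is.numeric, logical(1))

# Replace NA by 0 in the numeric columns only
df[num_cols] <- lapply(df[num_cols], function(x) replace(x, is.na(x), 0))
```

The NA in when can then be handled separately, for example by leaving it as NA.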

You can only replace all NAs by the value 0 if all columns are numeric first. This could for instance be achieved by writing a little function to first convert a column to numeric if needed, and then replace the NAs. Using lapply you can loop over the columns, and make the resulting list of columns a data frame again afterwards.
f <- function(x) {
x <- as.numeric(x)
x[is.na(x)] <- 0
x
}
data_theme <- as.data.frame(lapply(data_theme, f))
Of course, this will also convert any meaningful datetimes to numbers.
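As for the error in the question as it is currently worded: data$column looks for a column literally named "column", finds none and returns NULL, so the comparison has length zero and if() fails. The following is only a sketch under assumptions. The tiny data frame, the helper name months_between, the now argument and the choice to return NA for rows without a date are all made up for illustration. It passes the column name as a string and uses [[ ]], and it parses the dates with "%Y-%m-%d" because the sample data use dashes rather than slashes:

```r
# Hypothetical miniature of the question's data
test2 <- data.frame(ord_lat_dt0 = c("0", "2016-02-12"),
                    stringsAsFactors = FALSE)

is.null(test2$column)  # TRUE: there is no column literally called "column"

# Whole months between two dates (same arithmetic as in the question)
months_between <- function(end_date, start_date) {
  ed <- as.POSIXlt(end_date)
  sd <- as.POSIXlt(start_date)
  12 * (ed$year - sd$year) + (ed$mon - sd$mon)
}

# column is a character string; "0" entries become NA (an assumption)
convert_time <- function(data, column, now = Sys.Date()) {
  values   <- data[[column]]
  has_date <- values != "0"
  months   <- rep(NA_real_, length(values))
  months[has_date] <- months_between(
    now, as.Date(values[has_date], format = "%Y-%m-%d"))
  data[[column]] <- months
  data
}

test1 <- convert_time(test2, "ord_lat_dt0")
```

Working on the whole column at once also removes the need for the loop over rows.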

Related

A problem with ifelse according to two variables from different dataframes in R

I have two data frames. The first one (A) contain information about GOALS, and the second one (B) contains the specific information about the IDs which had that GOAL:
> A
GOAL
1 A116642173
2 A116642174
3 A116642175
4 A116642176
5 A116642178
6 A116642181
> B
ID GOAL
1 1873 A116433509
2 478 A116642178
3 2165 A116192937
4 165 A116192937
5 313 A116433701
6 475 A116367456
I would like to create new columns in one of these data frames based on the other one. So, first I create additional columns:
> idkids=c(313,475,165,478,1873,2165)
> ids<-c(idkids)
> A[ ,paste0(ids)]<-0
> A
GOAL 313 475 165 478 1873 2165
1 A116642173 0 0 0 0 0 0
2 A116642174 0 0 0 0 0 0
3 A116642175 0 0 0 0 0 0
4 A116642176 0 0 0 0 0 0
5 A116642178 0 0 0 0 0 0
6 A116642181 0 0 0 0 0 0
I tried to use ifelse to find the GOAL for a specific ID, but I couldn't get it to work. I have tried two ways:
for (i in 1:kids){
A[ ,i+1]<-ifelse(A[ ,i+1]%in%B$ID,"",
ifelse(A$GOAL%in%B$GOAL, 1, 0))
}
for (i in 1:kids){
A[ ,i+1]<-ifelse(A[,i+1]%in%B$ID & A$GOAL%in%B$GOAL,1,0)
}
But my code doesn't recognize the specific ID and doesn't give me 1 (TRUE) or 0 (FALSE). It gives me 0 for all the columns... Can anyone help me, please?
Here is one method to reshape the 'B' data into 'wide' and then do a join
library(dplyr)
library(tidyr)
pivot_wider(B, names_from = ID, values_from = ID, values_fn = length,
values_fill = 0) %>%
right_join(A)
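If you would rather keep the columns you already planned in A, a base R sketch of the same result could look like this. It rebuilds the sample data from the question and assumes a cell should be 1 exactly when B contains a row with that ID and that GOAL:

```r
# Sample data copied from the question
A <- data.frame(GOAL = c("A116642173", "A116642174", "A116642175",
                         "A116642176", "A116642178", "A116642181"))
B <- data.frame(ID   = c(1873, 478, 2165, 165, 313, 475),
                GOAL = c("A116433509", "A116642178", "A116192937",
                         "A116192937", "A116433701", "A116367456"))
ids <- c(313, 475, 165, 478, 1873, 2165)

# One column per ID: 1 where this row's GOAL is among that ID's goals in B
for (id in ids) {
  goals_for_id <- B$GOAL[B$ID == id]
  A[[as.character(id)]] <- as.integer(A$GOAL %in% goals_for_id)
}
```

With this sample only ID 478 shares a GOAL with A (A116642178), so that is the only cell set to 1.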

Sample random column in dataframe

I have the following data in model$data:
model$data
[[1]]
Category1 Category2 Category3 Category4
3555 1 0 0 0
6447 1 0 0 0
5523 1 0 1 0
7550 1 0 1 0
6330 1 0 1 0
2451 1 0 0 0
4308 1 0 1 0
8917 0 0 0 0
4780 1 0 1 0
6802 1 0 1 0
2021 1 0 0 0
5792 1 0 1 0
5475 1 0 1 0
4198 1 0 0 0
223 1 0 1 0
4811 1 0 1 0
678 1 0 1 0
I am trying to use this formula to get an index of the column names:
sample(colnames(model$data), 1)
But I receive the following error message:
Error in sample.int(length(x), size, replace, prob) :
invalid first argument
Is there a way to avoid that error?
Notice this?
model$data
[[1]]
The [[1]] means that model$data is a list, whose first component is a data frame. To do anything with it, you need to pass model$data[[1]] to your code, not model$data.
sample(colnames(model$data[[1]]), 1)
This seems to be a near-duplicate of Random rows in dataframes in R and should probably be closed as duplicate. But for completeness, adapting that answer to sampling column-indices is trivial:
you don't need to generate a vector of column-names, only their indices. Keep it simple.
sample your col-indices from 1:ncol(df) instead of 1:nrow(df)
then put those column-indices on the RHS of the comma in df[, ...]
df[, sample(ncol(df), 1)]
the 1 is because you apparently want to take a sample of size 1.
one minor complication is that your dataframe is model$data[[1]], since your model$data looks like a list with one element which is a dataframe, rather than a plain dataframe. So first, assign df <- model$data[[1]]
finally, if you really really want the sampled column-name(s) as well as their indices:
samp_col_idxs <- sample(ncol(df), 1)
samp_col_names <- colnames(df) [samp_col_idxs]
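Putting both answers together, here is a small self-contained sketch. The model object is a made-up stand-in with the same shape as in the question (a list whose data element is itself a list holding one data frame):

```r
set.seed(1)  # only to make the sketch reproducible

# Hypothetical stand-in for the question's object
model <- list(data = list(data.frame(
  Category1 = c(1, 1, 0),
  Category2 = c(0, 0, 0),
  Category3 = c(0, 1, 0),
  Category4 = c(0, 0, 0)
)))

is.null(colnames(model$data))  # TRUE: a plain list has no column names

df <- model$data[[1]]                    # extract the data frame first
picked_name <- sample(colnames(df), 1)   # one random column name
picked_col  <- df[[picked_name]]         # the values of that column
```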

How can I select columns and rows with variable in R?

I have an object currency. I would like to select one column, and only the rows where it equals 1, using the variable Pair.
>currency
EURUSD EURUSDi USDJPY USDJPYi GBPUSD GBPUSDi AUDUSD AUDUSDi XAUUSD XAUUSDi zeroes
2000-07-16 0 0 0 0 0 1 0 0 0 0 0
2000-07-23 0 0 0 0 0 1 0 0 0 0 0
2000-07-30 0 0 0 0 0 1 0 0 0 0 0
2000-08-06 0 0 0 0 0 0 0 0 0 1 0
2000-08-13 0 1 0 0 0 0 0 0 0 0 0
From the console I can do it with subset like this :
> subset(currency$GBPUSDi, GBPUSDi == 1)
GBPUSDi
2000-07-16 1
2000-07-23 1
2000-07-30 1
2000-08-06 1
2000-08-13 1
2000-08-20 1
But as soon as it is used in a script with the variable Pair, it fails. I've searched the documentation for hours and I'm getting a headache trying to figure out what is wrong.
Here are the different commands I've tried:
subset (currency$Pair, Pair == 1)
subset (currency, Pair = 1, select = Pair)
weights$Cur[currency$Pair = 1]
The one that works is currency[,c(Pair)], but it only selects the column. How can I add the row selection Pair == 1?
currency[,c(Pair)][Pair = 1] and subset (currency[,c(Pair)], Pair = 1), with either = or ==, don't work.
currency$Pair[currency$Pair == 1] should work if currency has a column literally named Pair ($Pair selects the column Pair, and [currency$Pair == 1] keeps the values equal to 1). It looks like it doesn't work in your case because currency doesn't contain a variable named Pair.
If currency is not a data frame but a matrix, you can try
currency[currency[, c("Pair")] == 1, c("Pair")]
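If Pair is a variable that holds the column name as a string (which the question's working currency[,c(Pair)] suggests), the name has to be passed unquoted inside [ ]. A minimal sketch, assuming a cut-down copy of the question's data with only two of its columns:

```r
# Hypothetical cut-down version of the question's object
currency <- data.frame(
  GBPUSDi = c(1, 1, 1, 0, 0),
  XAUUSDi = c(0, 0, 0, 1, 0),
  row.names = c("2000-07-16", "2000-07-23", "2000-07-30",
                "2000-08-06", "2000-08-13")
)

Pair <- "GBPUSDi"  # the column name lives in a variable

# Rows where that column equals 1, keeping only that column;
# drop = FALSE keeps the row names (dates) attached
selected <- currency[currency[, Pair] == 1, Pair, drop = FALSE]

# The same indexing also works when the object is a matrix
m <- as.matrix(currency)
selected_m <- m[m[, Pair] == 1, Pair, drop = FALSE]
```

subset() and $ do not look up the value of Pair; they treat Pair as a literal column name, which is why those attempts fail.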

Find # of rows between events in R

I have a series of data in the format (true/false). eg it looks like it can be generated from rbinom(n, 1, .1). I want a column that represents the # of rows since the last true. So the resulting data will look like
true/false gap
0 0
0 0
1 0
0 1
0 2
1 0
1 0
0 1
What is an efficient way to go from true/false to gap? (In practice this will be done on a large dataset with many different ids.)
DF <- read.table(text="true/false gap
0 0
0 0
1 0
0 1
0 2
1 0
1 0
0 1", header=TRUE)
DF$gap2 <- sequence(rle(DF$true.false)$lengths) * #create a sequence for each run length
(1 - DF$true.false) * #multiply with 0 for all 1s
(cumsum(DF$true.false) != 0L) #multiply with zero for the leading zeros
# true.false gap gap2
#1 0 0 0
#2 0 0 0
#3 1 0 0
#4 0 1 1
#5 0 2 2
#6 1 0 0
#7 1 0 0
#8 0 1 1
The cumsum part might not be the most efficient for large vectors. Something like
if (DF$true.false[1] == 0) DF$gap2[seq_len(rle(DF$true.false)$lengths[1])] <- 0
might be an alternative (and of course the rle result could be stored temporarily to avoid calculating it twice).
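Since the question mentions many different ids, here is a sketch of how the rle approach above could be applied per group with ave(). The function name gap_since_true and the columns id and tf are made up for illustration:

```r
# Rows since the last 1, with 0 on the 1s and on any leading 0s
gap_since_true <- function(x) {
  runs <- rle(x)  # computed once and reused
  sequence(runs$lengths) * (1 - x) * (cumsum(x) != 0)
}

# Hypothetical data with two ids
DF <- data.frame(id = rep(c("a", "b"), each = 4),
                 tf = c(0, 0, 1, 0,  0, 1, 1, 0))

# ave() applies the function within each id and keeps the row order
DF$gap <- ave(DF$tf, DF$id, FUN = gap_since_true)
```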
Ok, let me put this in answer
1) No brainer method
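data['gap'] = 0
for (i in 2:nrow(data)){
# extend the gap on a 0 row; a 1 row keeps its initial gap of 0
# (note: zeros before the first 1 are counted as well)
if (data[i,'true/false'] == 0){
data[i,'gap'] = data[i-1,'gap'] + 1
}
}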
2) No if check
data['gap'] = 0
for (i in 2:nrow(data)){
# (1 - true/false) is 1 on a 0 row and 0 on a 1 row, which resets the gap
data[i,'gap'] = (data[i-1,'gap'] + 1) * (1 - data[i,'true/false'])
}
I really don't know which is faster: both do the same number of reads from data, but (1) has an if statement, and I don't know how fast that is compared to a single multiplication.

R: Converting multiple binary columns into one factor variable whose factors are binary column names

I am a new R user. Currently I am working on a dataset in which I have to transform multiple binary columns into a single factor column.
Here is an example.
The current dataset looks like this:
$ Property.RealEstate : num 1 1 1 0 0 0 0 0 1 0 ...
$ Property.Insurance : num 0 0 0 1 0 0 1 0 0 0 ...
$ Property.CarOther : num 0 0 0 0 0 0 0 1 0 1 ...
$ Property.Unknown : num 0 0 0 0 1 1 0 0 0 0 ...
Property.RealEstate Property.Insurance Property.CarOther Property.Unknown
1 0 0 0
0 1 0 0
1 0 0 0
0 1 0 0
0 0 1 0
0 0 0 1
Recoded column should be:
Property
1 Real estate
2 Insurance
3 Real estate
4 Insurance
5 CarOther
6 Unknown
It is basically a reverse of melt.matrix function.
Thank You all for your Precious Inputs. It does work.
But one issue though,
I have some rows which takes value as:
Property.RealEstate Property.Insurance Property.CarOther Property.Unknown
0 0 0 0
I want these to be marked as NA or Null
Would be a help if you suggest on this as well.
Thank You
> mat <- matrix(c(0,1,0,0,0,
+ 1,0,0,0,0,
+ 0,0,0,1,0,
+ 0,0,1,0,0,
+ 0,0,0,0,1), ncol = 5, byrow = TRUE)
> colnames(mat) <- c("Level1","Level2","Level3","Level4","Level5")
> mat
Level1 Level2 Level3 Level4 Level5
[1,] 0 1 0 0 0
[2,] 1 0 0 0 0
[3,] 0 0 0 1 0
[4,] 0 0 1 0 0
[5,] 0 0 0 0 1
Create a new factor based upon the index of each 1 in each row
Use the matrix column names as the labels for each level
NewFactor <- factor(apply(mat, 1, function(x) which(x == 1)),
labels = colnames(mat))
> NewFactor
[1] Level2 Level1 Level4 Level3 Level5
Levels: Level1 Level2 Level3 Level4 Level5
also you can try:
factor(mat%*%(1:ncol(mat)), labels = colnames(mat))
You can also use Tomas's solution, which I found somewhere on SO:
as.factor(colnames(mat)[mat %*% 1:ncol(mat)])
Melt is certainly a solution. I'd suggest using the reshape2 melt as follows:
library(reshape2)
df=data.frame(Property.RealEstate=c(0,0,1,0,0,0),
Property.Insurance=c(0,1,0,1,0,0),
Property.CarOther=c(0,0,0,0,1,0),
Property.Unknown=c(0,0,0,0,0,1))
#add id column (presumably you have ids more meaningful than row numbers)
df$row=1:nrow(df)
#melt to "long" format
long=melt(df,id="row")
#only keep 1's
long=long[which(long$value==1),]
#merge in ids for NA entries
long=merge(df[,"row",drop=F],long,all.x=T)
#clean up to match example output
long=long[order(long$row),"variable",drop=F]
names(long)="Property"
long$Property=gsub("Property.","",long$Property,fixed=T)
#results
long
Alternately, you can just do it in the naïve way. I think it's more transparent than any of the other suggestions (including my other suggestion).
df=data.frame(Property.RealEstate=c(0,0,1,0,0,0),
Property.Insurance=c(0,1,0,1,0,0),
Property.CarOther=c(0,0,0,0,1,0),
Property.Unknown=c(0,0,0,0,0,1))
propcols=c("Property.RealEstate", "Property.Insurance", "Property.CarOther", "Property.Unknown")
df$Property=NA
for(colname in propcols)({
coldata=df[,colname]
df$Property[which(coldata==1)]=colname
})
df$Property=gsub("Property.","",df$Property,fixed=T)
Something different:
Get the data:
dat <- data.frame(Property.RealEstate=c(1,0,1,0,0,0),Property.Insurance=c(0,1,0,1,0,0),Property.CarOther=c(0,0,0,0,1,0),Property.Unknown=c(0,0,0,0,0,1))
Reshape it:
names(dat)[row(t(dat))[t(dat)==1]]
#[1] "Property.RealEstate" "Property.Insurance" "Property.RealEstate"
#[4] "Property.Insurance" "Property.CarOther" "Property.Unknown"
If you want it cleaned up, do:
gsub("Property\\.","",names(dat)[row(t(dat))[t(dat)==1]])
#[1] "RealEstate" "Insurance" "RealEstate" "Insurance" "CarOther" "Unknown"
If you prefer a factor output:
factor(row(t(dat))[t(dat)==1],labels=names(dat))
...and cleaned up:
factor(row(t(dat))[t(dat)==1],labels=gsub("Property\\.","",names(dat)) )
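For the follow-up about rows that are all zeros, one possible sketch uses max.col() and then blanks out the rows without any 1. The three-row data frame here is made up, and the approach assumes each row contains at most one 1:

```r
# Hypothetical data: the third row has no 1 at all
dat <- data.frame(Property.RealEstate = c(1, 0, 0),
                  Property.Insurance  = c(0, 1, 0),
                  Property.CarOther   = c(0, 0, 0),
                  Property.Unknown    = c(0, 0, 0))

m <- as.matrix(dat)
lev <- sub("^Property\\.", "", names(dat))  # cleaned-up level names

# Index of the column holding the 1 in each row
idx <- max.col(m, ties.method = "first")
# Rows that are all zeros get NA instead of a level
idx[rowSums(m) == 0] <- NA

Property <- factor(lev[idx], levels = lev)
```

Here Property comes out as RealEstate, Insurance, NA, with all four names kept as levels.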
