How to match one row from one column to the next 5-10 rows in another column in R?

I have a table in R (see the link below) that shows daily observations of two events: observ1 and observ2. I would like to add a third column called 'check'. The check column should be TRUE if observ1 equals 1 and, 5 to 10 days later, observ2 also equals 1.
As you can see in the table, the check value on row 14 is TRUE. The reason is that observ1 was 1 on row 6 and then, after 9 days, observ2 was also 1.
I do not know how to code this in R to produce the 'check' column. I appreciate any assistance!
See my table here

Posting a picture of the data is not considered a good way to ask a question. Most posters call dput() on their data.frame and paste the console output into the question, in the format I have used below (see Data). This is considered good practice for future questions. In any case, I hope this solution helps:
Base R solution:
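df1$check <- with(
  df1,
  vapply(
    seq_along(observ2),
    function(i) {
      if (i - 5 <= 0) {
        # Fewer than 5 earlier rows: no valid look-back window
        NA
      } else {
        # Look back 5 to 10 rows, clipped at the first row
        ir <- max(i - 10, 1)
        ir2 <- (any(observ1[ir:(i - 5)] == 1) & observ2[i] == 1)
        # TRUE on a match, NA otherwise
        ifelse(ir2, ir2, NA)
      }
    },
    logical(1)
  )
)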
Data:
df1 <- structure(list(day = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 20), observ1 = c(1, 0, 0, 0, 0, 1,
0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0), observ2 = c(0, 0,
0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -20L))
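If FALSE is preferred over NA for the non-matching rows, the same look-back idea can be sketched as follows. This is only a variant under that assumption, not the answer's own code; the vectors are copied from the data above and the name `window` is illustrative.

```r
# observ1 and observ2 copied from the sample data above
observ1 <- c(1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0)
observ2 <- c(0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1)

check <- vapply(seq_along(observ2), function(i) {
  # Rows 5 to 10 positions before row i, dropping indices before the first row
  window <- seq(i - 10, i - 5)
  window <- window[window >= 1]
  # TRUE only if observ2 is 1 now and observ1 was 1 somewhere in the window
  length(window) > 0 && observ2[i] == 1 && any(observ1[window] == 1)
}, logical(1))

which(check)
```

On this sample, rows 14 and 20 are flagged, since observ1 is 1 on row 6 and again on row 14.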

Related

One-hot coding to numeric [duplicate]

This question already has answers here:
How do I dichotomise efficiently
(5 answers)
How to one hot encode several categorical variables in R
(5 answers)
Closed 9 months ago.
I am working on a project that requires me to one-hot encode a single variable, and I cannot seem to do it correctly.
I simply want to one-hot encode the variable data$Ratings so that the values 1, 2 and 3 are separated into their own columns in the data frame, each equal to either 0 or 1. E.g., if data$Ratings = 3, then the dummy for 3 would be 1. All the other columns should not change.
structure(list(ID = c(284921427, 284926400, 284946595, 285755462,
285831220, 286210009, 286313771, 286363959, 286566987, 286682679
), AUR = c(4, 3.5, 3, 3.5, 3.5, 3, 2.5, 2.5, 2.5, 2.5), URC = c(3553,
284, 8376, 190394, 28, 47, 35, 125, 44, 184), Price = c(2.99,
1.99, 0, 0, 2.99, 0, 0, 0.99, 0, 0), AgeRating = c(1, 1, 1, 1,
1, 1, 1, 1, 1, 1), Size = c(15853568, 12328960, 674816, 21552128,
34689024, 48672768, 6328320, 64333824, 2657280, 1466515), HasSubtitle = c(0,
0, 0, 0, 0, 1, 0, 0, 0, 0), InAppSum = c(0, 0, 0, 0, 0, 1.99,
0, 0, 0, 0), InAppMin = c(0, 0, 0, 0, 0, 1.99, 0, 0, 0, 0), InAppMax = c(0,
0, 0, 0, 0, 1.99, 0, 0, 0, 0), InAppCount = c(0, 0, 0, 0, 0,
1, 0, 0, 0, 0), InAppAvg = c(0, 0, 0, 0, 0, 1.99, 0, 0, 0, 0),
descriptionTermCount = c(263, 204, 97, 272, 365, 368, 113,
129, 61, 87), LanguagesCount = c(17, 1, 1, 17, 15, 1, 0,
1, 1, 1), EngSupported = c(2, 2, 2, 2, 2, 2, 1, 2, 1, 2),
GenreCount = c(2, 2, 2, 2, 3, 3, 3, 2, 3, 2), months = c(7,
7, 7, 7, 7, 7, 7, 8, 8, 8), monthsSinceUpdate = c(29, 17,
25, 29, 15, 6, 71, 12, 23, 134), GameFree = c(0, 0, 0, 0,
0, 1, 0, 0, 0, 0), Ratings = c(3, 3, 3, 3, 2, 3, 2, 3, 2,
3)), row.names = c(NA, 10L), class = "data.frame")
install.packages("mlbench")
install.packages("neuralnet")
install.packages("mltools")
library(mlbench)
library(dplyr)
library(caret)
library(mltools)
library(tidyr)
data2 <- mutate_if(data, is.factor,as.numeric)
data3 <- lapply(data2, function(x) as.numeric(as.character(x)))
data <- data.frame(data3)
summary(data)
head(data)
str(data)
View(data)
#
dput(head(data, 10))
data %>% mutate(value = 1) %>% spread(data$Ratings, value, fill = 0 )
Is this what you want? I will assume your data is called data and continue with that for the data frame you supplied:
library(plm)
plm::make.dummies(data$Ratings) # returns a matrix
## 2 3
## 2 1 0
## 3 0 1
# returns the full data frame with dummies added:
plm::make.dummies(data, col = "Ratings")
## [not printed to save space]
There are some options for plm::make.dummies, e.g., you can select the base category via base and you can choose whether to include the base (add.base = TRUE) or not (add.base = FALSE).
The help page ?plm::make.dummies has more examples and explanation as well as a comparison for LSDV model estimation by a factor variable and by explicitly self-created dummies.
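If adding a package dependency is not desirable, the same kind of dummy columns can be sketched in base R. This is a minimal illustration, not part of the plm approach above: the `ID` values are placeholders, and the names `rating_levels` and the `Ratings_` prefix are chosen here for the example.

```r
# Ratings copied from the question's sample; ID is a placeholder column
data <- data.frame(
  ID = 1:10,
  Ratings = c(3, 3, 3, 3, 2, 3, 2, 3, 2, 3)
)

# Levels to encode, including 1 even though it does not occur in this sample
rating_levels <- c(1, 2, 3)

# One 0/1 column per level
dummies <- sapply(rating_levels, function(k) as.integer(data$Ratings == k))
colnames(dummies) <- paste0("Ratings_", rating_levels)

# Append the dummy columns; the existing columns are left unchanged
data_onehot <- cbind(data, dummies)
head(data_onehot)
```

Listing the levels explicitly keeps a column for a rating that happens to be absent from the data, which may or may not be what is wanted.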

Histogram in R with 1 bin for zeros only

Is it possible to make a histogram in R with bins of different sizes? I'm working with count data and the zeros need to have their own bin, but the other numbers can be binned into whatever would make sense. A single histogram for all fish counts is fine.
fish<-structure(list(num = c(0, 11, 1, 0,
13, 11, 0, 1, 0, 0, 11, 11, 0, 10, 1, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 13, 9, 0, 2, 1, 0,
7, 6, 0, 4, 8, 0, 8, 6, 0)),
class = "data.frame", row.names = c(NA,-50L))
Use cut to put 0 in its own bin, then use seq to create the remaining bins. The last break has to lie above the largest count (13 here), otherwise those values fall outside every bin and are silently dropped as NA:
barplot(table(cut(fish$num, c(0, seq(1, 16, 3)), right = FALSE)), space = 0)
Edit: The first bar only includes zeros, see:
table(fish$num)
# 0 1 2 4 6 7 8 9 10 11 13
#31 4 1 1 2 1 2 1 1 4 2
table(cut(fish$num, c(0, seq(1, 16, 3)), right = FALSE))
# [0,1) [1,4) [4,7) [7,10) [10,13) [13,16)
# 31 5 3 4 5 2
You can specify the breaks to be whatever you want with the breaks parameter to the hist() function. You should combine this with the freq = TRUE parameter, or else you'll get a density plot, and the density at zero will be infinite. The plot doesn't look very good though, because the zero bin has no width:
hist(fish$num, breaks = c(0, 0, 5, 10, 15, 20), freq=TRUE)
I'm not sure what you'd want to do instead.
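One possible alternative is to bin the counts with cut and draw the result as a bar chart, so the zero bin gets a normal width. The sketch below assumes integer counts and uses illustrative bin labels; the plotting call is commented out so the snippet runs without a graphics device.

```r
# Fish counts copied from the question
num <- c(0, 11, 1, 0, 13, 11, 0, 1, 0, 0, 11, 11, 0, 10, 1, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 13, 9, 0, 2, 1, 0,
         7, 6, 0, 4, 8, 0, 8, 6, 0)

# Left-closed bins: [0,1) holds only zeros, the rest are three counts wide
breaks <- c(0, seq(1, 16, 3))
labels <- c("0", "1-3", "4-6", "7-9", "10-12", "13-15")  # illustrative labels

bins <- cut(num, breaks = breaks, right = FALSE, labels = labels)
counts <- table(bins)
counts

# barplot(counts, space = 0)  # uncomment to draw the bars
```

The bar heights are raw frequencies, so the unequal bin widths are not reflected in the areas the way they would be in a density histogram.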

Specifying number of trials, bootstrap

For an assignment, I am applying mixture modeling with the mixtools package in R. When I try to determine the optimal number of components with the bootstrap, I get the following error:
Error in boot.comp(y, x, N = NULL, max.comp = 2, B = 5, sig = 0.05, arbmean = TRUE, :
Number of trials must be specified!
I found out that I have to supply N, which the documentation describes as: "An n-vector of number of trials for the logistic regression type logisregmix. If NULL, then N is an n-vector of 1s for binary logistic regression."
But I don't know how to find out what N should actually be to make my bootstrap work.
Link to my data:
https://www.kaggle.com/blastchar/telco-customer-churn
My code:
data <- read.csv("Desktop/WA_Fn-UseC_-Telco-Customer-Churn.csv", stringsAsFactors = FALSE,
na.strings = c("NA", "N/A", "Unknown*", "NULL", ".P"))
data <- droplevels(na.omit(data))
testdf <- data[c(5033:7032),] # take the test rows before data is truncated
data <- data[c(1:5032),]
data <- subset(data, select = -customerID)
set.seed(100)
library(plyr)
library(mixtools)
data$Churn <- revalue(data$Churn, c("Yes"=1, "No"=0))
y <- as.numeric(data$Churn)
x <- model.matrix(Churn ~ . , data = data)
x <- x[, -1] #remove intercept
x <-x[,-c(7, 11, 13, 15, 17, 19, 21)] #multicollinearity
a <- boot.comp(y, x, N = NULL, max.comp = 2, B = 100,
sig = 0.05, arbmean = TRUE, arbvar = TRUE,
mix.type = "logisregmix", hist = TRUE)
Below there is more information about my predictors:
dput(x[1:4,])
structure(c(0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1,
34, 2, 45, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0,
0, 0, 1, 1, 0, 29.85, 56.95, 53.85, 42.3, 29.85, 1889.5, 108.15,
1840.75), .Dim = c(4L, 23L), .Dimnames = list(c("1", "2", "3",
"4"), c("genderMale", "SeniorCitizen", "PartnerYes", "DependentsYes",
"tenure", "PhoneServiceYes", "MultipleLinesYes", "InternetServiceFiber optic",
"InternetServiceNo", "OnlineSecurityYes", "OnlineBackupYes",
"DeviceProtectionYes", "TechSupportYes", "StreamingTVYes", "StreamingMoviesYes",
"ContractOne year", "ContractTwo year", "PaperlessBillingYes",
"PaymentMethodCredit card (automatic)", "PaymentMethodElectronic check",
"PaymentMethodMailed check", "MonthlyCharges", "TotalCharges"
)))
My response variable is binary
I hope you guys can help me out!
Looking in the source code of mixtools::boot.comp, which is scary as it is over 800 lines long and in serious need of refactoring, the offending lines are:
if (mix.type == "logisregmix") {
  if (is.null(N))
    stop("Number of trials must be specified!")
Despite what the documentation says, N must be specified.
Try to set it to a vector of 1s: N = rep(1, length(y)) or N = rep(1, nrow(x))
In fact, if you look in mixtools::logisregmixEM, the internal function called by boot.comp, you'll see how N is set if NULL:
n <- length(y)
if (is.null(N)) {
  N = rep(1, n)
}
Too bad this is never reached if N is NULL, because boot.comp stops with an error first. This is a bug.
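The workaround can be sketched as follows. The toy `y` is made up so the snippet runs on its own; in the real script, `y` and `x` are the objects built in the question. The `boot.comp` call is the question's call with only `N` changed, and it is left commented out because it needs mixtools and the full data.

```r
# Toy binary response standing in for the question's y (hypothetical values)
y <- c(0, 1, 1, 0, 1)

# One trial per observation, which is what N = NULL is documented to mean
N <- rep(1, length(y))

# With the real y and x from the question, the call would then become:
# a <- boot.comp(y, x, N = N, max.comp = 2, B = 100,
#                sig = 0.05, arbmean = TRUE, arbvar = TRUE,
#                mix.type = "logisregmix", hist = TRUE)
```

Whether the bootstrap then converges on this data is a separate matter that this sketch does not address.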

Optimize multiple replacement based on condition per ID

What is the fastest way to do this? I have many 'value' columns (>100) in which I have to replace values when the matching 'aux' column is zero.
For example, 'value1' should be set to zero whenever 'value1aux' (in the same row) is zero.
Original data:
df <- data.frame(ID = c(1,1,1,1,1,1,1,1),
value1 = c(23, 0, 4, 1, 0, 0, 8, 12),
value2 = c(0, 12, 56, 7, 8, 1, 8, 12),
value1aux = c(0, 0, 89, 65, 0, 0, 0, 1),
value2aux = c (1,1,0,0,4,15,67,12))
Result desired data:
df <- data.frame(ID = c(1,1,1,1,1,1,1,1),
value1 = c(0, 0, 4, 1, 0, 0, 0, 12),
value2 = c(0, 12, 0, 0, 8, 1, 8, 12),
value1aux = c(0, 0, 89, 65, 0, 0, 0, 1),
value2aux = c (1,1,0,0,4,15,67,12))
Code to optimize:
names <- colnames(df[2:3])
names2 <- colnames(df[4:5])
for (i in 1:nrow(df)){
df[i,names] <- replace (df[i,names], df[i,names2] == 0, 0)}
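One way to avoid the row-by-row loop is to build a logical mask from all the aux columns at once and assign through it. This is a sketch that assumes every value column has a partner named with an `aux` suffix, as in the sample; `value_cols`, `aux_cols` and `vals` are names introduced here.

```r
df <- data.frame(ID = c(1, 1, 1, 1, 1, 1, 1, 1),
                 value1 = c(23, 0, 4, 1, 0, 0, 8, 12),
                 value2 = c(0, 12, 56, 7, 8, 1, 8, 12),
                 value1aux = c(0, 0, 89, 65, 0, 0, 0, 1),
                 value2aux = c(1, 1, 0, 0, 4, 15, 67, 12))

# Pair each value column with its aux column by name (assumed naming scheme)
value_cols <- c("value1", "value2")
aux_cols <- paste0(value_cols, "aux")

# Logical matrix marking the cells whose aux partner is zero
vals <- df[value_cols]
vals[df[aux_cols] == 0] <- 0
df[value_cols] <- vals

df
```

With more than 100 columns, `value_cols` could be derived from `names(df)` rather than typed out; for very large tables, a data.table approach may be faster still, but that is not shown here.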

The number of stretches in the vector when the param is equal to 0

How can I find the number of stretches (blocks) in the vector when the param is equal to 0? In this example, the answer would be 3.
The vector param:
param <- c(25, 20, 18, 5, 1, 0, 0, 0, 1, 5, 0, 0, 3, 6, 9, 0, 0)
I'm going to assume a "stretch" consists of at least two consecutive values. With your test data
x<- c(25, 20, 18, 5, 1, 0, 0, 0, 1, 5, 0, 0, 3, 6, 9, 0, 0)
I would use the rle() function to calculate the run lengths:
with(rle(x), sum(values==0 & lengths>1))
# [1] 3
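To see what rle() produces, and how the count changes if a single isolated zero should also count as a stretch, here is a small sketch. The names `runs`, `n_long_zero_runs` and `n_any_zero_runs` are illustrative.

```r
x <- c(25, 20, 18, 5, 1, 0, 0, 0, 1, 5, 0, 0, 3, 6, 9, 0, 0)

# rle() collapses consecutive repeats into (value, run length) pairs
runs <- rle(x)
runs$values
runs$lengths

# Runs of zeros that are at least two values long (the assumption above)
n_long_zero_runs <- sum(runs$values == 0 & runs$lengths > 1)

# Runs of zeros of any length, if a lone zero should count too
n_any_zero_runs <- sum(runs$values == 0)
```

For this vector both counts agree, because every run of zeros has length 2 or 3.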
