Create new column in dataframe using if {} else {} in R - r

I'm trying to add a conditional column to a dataframe, but not getting the results I'm expecting.
I have a dataframe with values recorded for the column "steps" across 5-minute intervals over various days. I'm trying to impute missing values in the 'steps' column by using the mean number of steps for a given 5-minute interval on the days that do have measurements. n.b. I tried using the MICE package for this but it just crashed my computer so I opted for a more manual workaround.
As an intermediate stage, I have bound an additional column to the existing dataframe with the mean number of steps for that interval. What I want to do next is create a column that returns that mean if the raw number of steps is missing (NA), and just uses the raw value otherwise. Here's my code for that part:
activityTimeAvgs$stepsImp <- if(is.na(activityTimeAvgs$steps)){
activityTimeAvgs$avgsteps
} else {
activityTimeAvgs$steps
}
What I expected to happen is that the if statement would evaluate as TRUE if 'steps' is NA and consequently give 'avgsteps'; in cases where 'steps' is not NA I would expect it to just use the raw value for 'steps'. However, the output just gives the value for 'avgsteps' in every row, which is not much use. I also get the following warning:
Warning message:
In if (is.na(activityTimeAvgs$steps)) { :
the condition has length > 1 and only the first element will be used
Any ideas where I'm going wrong?
Thanks in advance.

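The if statement is not suitable for this: it expects a single TRUE/FALSE condition, not a vector, which is why you get the warning. You need to use the vectorised ifelse: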
activityTimeAvgs$stepsImp <- ifelse(is.na(activityTimeAvgs$steps), activityTimeAvgs$avgsteps, activityTimeAvgs$steps)
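To see the whole thing end to end, here is a minimal sketch on made-up data. The toy values and the use of ave() to build avgsteps are assumptions for illustration; only the column names steps, avgsteps and stepsImp come from the question.

```r
# Toy data: 3 days x 3 five-minute intervals, with some missing step counts
activityTimeAvgs <- data.frame(
  day      = rep(1:3, each = 3),
  interval = rep(c(0, 5, 10), times = 3),
  steps    = c(NA, 4, 6,  2, NA, 8,  4, 6, NA)
)

# Mean steps per interval, ignoring NAs, repeated on every row of that interval
activityTimeAvgs$avgsteps <- ave(
  activityTimeAvgs$steps,
  activityTimeAvgs$interval,
  FUN = function(x) mean(x, na.rm = TRUE)
)

# Vectorised choice: take the interval mean where steps is NA, else the raw value
activityTimeAvgs$stepsImp <- ifelse(
  is.na(activityTimeAvgs$steps),
  activityTimeAvgs$avgsteps,
  activityTimeAvgs$steps
)

print(activityTimeAvgs)
```

ifelse() tests every row, whereas if() only looks at the first element of the condition, which explains why the original code returned avgsteps everywhere (the first value of steps was presumably NA).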

Related

traj step1measures: why am I getting an NA error if I have no missing values?

This is what happens when I run traj::step1measures
step1measures(datamat, timemat, ID = TRUE)
Error in if (cor.mat[i_row, i_col] > 0.999) { : missing value where TRUE/FALSE needed
2. check.correlation(output[, -1], verbose)
1. step1measures(datamat, timemat, ID = TRUE)
I have checked multiple times and I am sure that there are no null or missing values in the data and time matrices. Any suggestions for what's going wrong here/ where a missing value could be popping up?
There are a few reasons why you may be hitting this error:
At least one row of your data does not have the required number of data points. A minimum of 4 data points per row is required. You can see the data requirements in the function's documentation: https://cran.rstudio.com/web/packages/traj/traj.pdf
Your data contains an ID column but you did not indicate that to the function.
Any other unexpected combination of data values that yields 'Inf', 'NA' or 'NaN' for one of the measures. This is the sneaky one. You may need to go to line 416 of the step1measures script and view the data before it's passed to the correlation function. You may notice that some data rows contain the invalid values. I would recommend removing those rows. In an ideal world, the package would catch such issues and display a better error, but that's not the case today.
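Before digging into the package source, the first and third points can be checked on the input matrix itself. This is only a sketch on a made-up matrix: the values are invented, and the layout (ID in the first column, as with ID = TRUE) is an assumption. It does not call anything from traj.

```r
# Made-up data matrix: first column is the ID, the rest are measurements
datamat <- rbind(
  c(1, 5.1, 5.4, 6.0, 6.2, 6.8),
  c(2, 4.0, NA,  NA,  4.4, NA),   # only 2 observed values
  c(3, 3.3, 3.9, 4.1, 4.0, 4.6),
  c(4, 2.0, Inf, 2.5, 2.7, 3.0)   # contains a non-finite value
)

values <- datamat[, -1, drop = FALSE]

# Rows with fewer than the 4 data points the documentation asks for
n_obs   <- rowSums(!is.na(values))
too_few <- datamat[n_obs < 4, 1]

# Rows containing Inf, -Inf or NaN
has_bad    <- apply(values, 1, function(r) any(is.infinite(r) | is.nan(r)))
non_finite <- datamat[has_bad, 1]

print(too_few)     # IDs with too few observations
print(non_finite)  # IDs with Inf/NaN entries
```

If both checks come back empty, the invalid values are being produced inside the measure calculations, and inspecting the intermediate output as described above is the next step.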

How to decile bin continuous variables?

I am trying to decile my data into equal bins and summarise it to see if there are any existing patterns with respect to the Dependent Variable. While summarising the data, I also want to see the lower bound and the upper bound of a variable for each decile.
I have written the below code in R-
telecom_final_Analyse<-read.csv("sampletelecomfinal.csv")
col_name_final<-colnames(telecom_final_Analyse)
Variable_profile<-vector("list",79) #I have 79 variables
names(Variable_profile)<-col_name_final
for (j in 1:79) {
if(class(telecom_final_Analyse[,col_name_final[j]])=="numeric" || class(telecom_final_Analyse[,col_name_final[j]])=="integer"){
telecom_final_Analyse%>%mutate(dec=ntile(telecom_final_Analyse[,col_name_final[j]],10))->telecom_final_Analyse
z<-as.name(col_name_final[j])
telecom_final_Analyse%>%group_by(dec)%>%summarise(n=sum(churn),N=n(),churn_percentage=n/N,greaterthan = min(z,na.rm=TRUE),lessthan=max(z,na.rm=TRUE))->Variable_profile[[col_name_final[j]]]
}
else{
x<-as.name(col_name_final[j])
telecom_final_Analyse%>%group_by_(x)%>%summarise(n=sum(churn),N=n(),churn_percentage=n/N)->Variable_profile[[col_name_final[j]]]
}
}
I am getting the following error - Error in min(z, na.rm = TRUE) : invalid 'type' (symbol) of argument
The following is the code I used for one variable to get the desired output. I want to get the same output for all integer/numeric variables in the dataset:
telecom_final_Analyse%>%mutate(dec=ntile(telecom_final_Analyse$eqpdays ,10))->telecom_final_Analyse
telecom_final_Analyse%>%group_by(dec)%>%summarise(n=sum(churn),N=n(),churn_percentage=n/N,greaterthan=min(eqpdays,na.rm=TRUE),lessthan=max(eqpdays,na.rm=TRUE))
I am able to do it manually for 1 variable, this is the output I got. The same way I want for my other continuous variables as well
I've not run this (no reprex), but you can extend your code for the single variable with mutate_if(is.numeric, {a function}, {some parameters})
See: https://dplyr.tidyverse.org/reference/mutate_all.html
So try...
telecom_final_Analyse%>%mutate_if(is.numeric, ntile, 10)
Note that this will mutate the existing columns. If you want to keep the old ones and create new ones, you can wrap multiple functions in "list(first_function, second_function)"; the output data set will then be wider than before. It's all there in the online help.
Hope this works for you
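For the per-decile lower and upper bounds asked about in the question, one possible approach is sketched below. It is not the answer's mutate_if route: it loops over the numeric columns and refers to each one through dplyr's .data pronoun, which avoids passing a symbol from as.name() straight to min() (the source of the "invalid 'type' (symbol)" error). The toy data and the column mou are made up; churn and eqpdays are taken from the question.

```r
library(dplyr)

# Toy stand-in for the telecom data (values are invented)
set.seed(1)
toy <- data.frame(
  churn   = rbinom(200, 1, 0.3),
  eqpdays = sample(1:1000, 200, replace = TRUE),
  mou     = runif(200, 0, 500)   # hypothetical second predictor
)

# Every numeric column except the dependent variable
predictors <- setdiff(names(toy)[sapply(toy, is.numeric)], "churn")

# One summary table per predictor, stored in a named list
Variable_profile <- lapply(setNames(predictors, predictors), function(v) {
  toy %>%
    mutate(dec = ntile(.data[[v]], 10)) %>%   # decile of the current column
    group_by(dec) %>%
    summarise(
      N = n(),
      n = sum(churn),
      churn_percentage = n / N,
      greaterthan = min(.data[[v]], na.rm = TRUE),  # lower bound of the decile
      lessthan    = max(.data[[v]], na.rm = TRUE)   # upper bound of the decile
    )
})

print(Variable_profile$eqpdays)
```

Because the decile column is created inside the pipeline for each variable, the original data frame is left untouched between iterations.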

R: errors in cor() and corrplot()

Another stumbling block. I have a large set of data (called "brightly") with about ~180k rows and 165 columns. I am trying to create a correlation matrix of these columns in R.
Several problems have arisen, none of which I can resolve with the suggestions proposed on this site and others.
First, how I created the data set: I saved it as a CSV file from Excel. My understanding is that CSV should remove any formatting, such that anything that is a number should be read as a number by R. I loaded it with
brightly = read.csv("brightly.csv", header=TRUE)
But I kept getting "'x' must be numeric" error messages every time I ran cor(brightly), so I replaced all the NAs with 0s. (This may be altering my data, but I think it will be all right--anything that's "NA" is effectively 0, either for the continuous or dummy variables.)
Now I am no longer getting the error message about text. But any time I run cor()--either on all of the variables simultaneously or combinations of the variables--I get "Warning message:
In cor(brightly$PPV, brightly, use = "complete") :
the standard deviation is zero"
I am also having some of the correlations of that one variable with others show up as "NA." I have ensured that no cell in the data is "NA," so I do not know why I am getting "NA" values for the correlations.
I also tried both of the following to make REALLY sure I wasn't including any NA values:
cor(brightly$PPV, brightly, use = "pairwise.complete.obs")
and
cor(brightly$PPV,brightly,use="complete")
But I still get warnings about the SD being zero, and I still get the NAs.
Any insights as to why this might be happening?
Finally, when I try to do corrplot to show the results of the correlations, I do the following:
brightly2 <- cor(brightly)
Warning message:
In cor(brightly) : the standard deviation is zero
corrplot(brightly2, method = "number")
Error in if (min(corr) < -1 - .Machine$double.eps || max(corr) > 1 + .Machine$double.eps) { :
missing value where TRUE/FALSE needed
And instead of making my nice color-coded correlation matrix, I get this. I have yet to find an explanation of what that means.
Any help would be HUGELY appreciated! Thanks very much!!
Please check whether you replaced your NAs with 0 or with '0': the first is numeric, the second is a character string. You can also try as.numeric(column_name) to convert character '0's to numeric 0. This error also occurs if your dataset has factors: because those are not numeric values, corrplot throws this error.
It would be helpful if you put a sample of your data in the question using
str(head(your_dataset))
That would also help you check the data types of the columns.
Let me know if I am wrong.
Cheerio.
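A quick way to run these checks, sketched on a made-up data frame (all values and the columns dummy, flag and score are invented; only PPV is a name from the question). It also looks for constant columns, since a column whose standard deviation is zero is what produces the "the standard deviation is zero" warning and the NA correlations.

```r
# Made-up stand-in for the brightly data
brightly_toy <- data.frame(
  PPV   = c(1.2, 3.4, 2.2, 5.1),
  dummy = c(0, 0, 0, 0),                     # constant column: SD is zero
  flag  = factor(c("a", "b", "a", "b")),     # non-numeric column
  score = c(2, 4, 3, 6)
)

# Column classes: anything that is not numeric/integer cannot go into cor()
col_classes <- sapply(brightly_toy, class)
print(col_classes)

# Keep numeric columns only, then find the constant ones
numeric_cols  <- brightly_toy[sapply(brightly_toy, is.numeric)]
col_sd        <- sapply(numeric_cols, sd)
constant_cols <- names(col_sd)[col_sd == 0]
print(constant_cols)

# Correlation matrix without the constant columns contains no NA
usable  <- numeric_cols[, col_sd > 0, drop = FALSE]
cor_mat <- cor(usable)
print(cor_mat)
```

A matrix like cor_mat, with no NA entries, is what corrplot() needs; the "missing value where TRUE/FALSE needed" error appears when the matrix passed to it contains NA.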

Create a new data frame of the means of randomly selected rows - looped

Question:
I have a data.frame (hlth) that consists of 49 vectors - a mix of numeric(25:49) and factor(1:24). I am trying to randomly select 50 rows, then calculate column means only for the numeric columns (dropping the other values), and then place the random row mean(s) into a new data.frame (beta). I would then like to iterate this process 1000 times.
I have attempted this process but the values that get returned are identical and the new means will not enter the new data.frame
Here is a few rows and columns of the data.frame(hlth)
DateIn adgadj Sex VetMedCharges pwtfc
1/01/2006 3.033310 STEER 0.00 675.1151
1/10/1992 3.388245 STEER 2540.33 640.2261
1/10/1995 3.550847 STEER 572.78 607.6200
1/10/1996 2.893707 HEIFER 549.42 425.5217
1/10/1996 3.647233 STEER 669.18 403.8238
The code I have used thus far:
set.seed[25]
beta<-data.frame()
net.row<-function(n=50){
netcol=sample(1:nrow(hlth),size=n ,replace=TRUE)
rNames <- row.names(hlth)
subset(hlth,rNames%in%netrow,select=c(25:49))
colMeans(s1,na.rm=TRUE,dims=1)
}
beta$net.row=replicate(1000,net.row()); net.row
The two issues, that I have detected, are:
1) Returns the same value(s) each iteration
2) "Error during wrap-up: object of type 'closure' is not subsettable" when the beta$netrow
Any suggestions would be appreciated!!!
Just adding to my comment (and firstly pasting it):
netcol=sample(1:nrow(hlth),size=n ,replace=TRUE) should presumably be netrow = ... and the error is a scoping problem: R is trying to subset the function beta, presumably because it can't find netRowMeans in the data.frame you've defined, so it moves on to the global environment and throws an error there.
There are also a couple of other things. You don't assign subset(hlth,rNames%in%netrow,select=c(25:49)) to a variable, which I think you mean to assign to s1, so colMeans is probably running on something you've set in the global environment.
If you want to pass a variable directly into the data frame beta in that manner, you'll have to initialise beta with the right number of columns and rows. The column means you've passed out will be a vector of (1 x 25), so they won't fit in a single column. You would probably be better off initialising a matrix called mat or something (to avoid scoping confusion masking the actual error messages) with 25 columns and 1000 rows.
EDIT: Question has been edited slightly since I posted this, but most points still stand.
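Putting those points together, a working version might look like the sketch below. The toy data frame is invented (only a few column names are borrowed from the question), and it indexes the sampled rows directly instead of using subset() with %in%, so rows drawn more than once are counted each time.

```r
set.seed(25)   # note: set.seed() is a function call, so round brackets

# Toy stand-in for hlth: one factor column and three numeric columns
hlth_toy <- data.frame(
  Sex           = factor(sample(c("STEER", "HEIFER"), 200, replace = TRUE)),
  adgadj        = rnorm(200, mean = 3.3, sd = 0.3),
  VetMedCharges = runif(200, min = 0, max = 2500),
  pwtfc         = rnorm(200, mean = 550, sd = 100)
)

# Logical index of the numeric columns (25:49 in the real data)
num_cols <- sapply(hlth_toy, is.numeric)

# Draw n random rows and return the column means of the numeric columns
sample_means <- function(n = 50) {
  rows <- sample(nrow(hlth_toy), size = n, replace = TRUE)
  colMeans(hlth_toy[rows, num_cols, drop = FALSE], na.rm = TRUE)
}

# replicate() returns one column per iteration, so transpose to get
# 1000 rows (iterations) by 3 columns (variables)
mat <- t(replicate(1000, sample_means()))
head(mat)
```

as.data.frame(mat) converts the result to a data frame if one is needed afterwards.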

Find same observations of a list/matrix in another matrix

For a marketing class I have to write a function that calculates the retention rate of the customers (probability that a customer still is a customer). I've come so far that I isolated the ids of the individual customers and stored them in the matrix first.transactions.data. I then split them into cohorts (group of customers by time) with split() and stored them in the list cohort.
Now comes my problem: I calculated another sub-matrix from the full data set called final.period.data where I will calculate the retention rate. However, therefore I have to isolate the ids in final.period.data for each cohort. My instructor told me that I should create an additional column in final.period.data that shows TRUE or FALSE depending on whether the cohort's id and final.period.data's id are the same. For this I tried to use exists, but I always receive error messages. I tried the following:
final.period.data <- if(exists(cohort$'1'$id, where = final.period.data$id) final.period.data$same = TRUE)
but always receive error messages such as: unexpected symbol or invalid first argument. I also tried to convert the list cohort into a matrix but this didn't help either. How do I have to change the exist command or is there a simpler way to locate cohort's ids in final.period.data?
Thank you for your help.
You can just create a function that does what you want:
funct <- function(final.period.data) {
  if (final.period.data$cohort == '1' & final.period.data$id == <condition2>) {
    # Change the number for the TRUE condition
  } else {
    # If it doesn't fit the two conditions,
    # change the number for the FALSE condition
  }
}
# Create the new column first, then bind it to the data frame
vector <- logical(nrow(final.period.data))
final.period.data <- cbind(final.period.data, vector)
Then use it with the apply function, one row at a time; see the apply documentation for more information.
But I usually do it with a for loop, first creating the new column and then adding it to the data frame.
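As a simpler alternative to both exists() and a loop, membership can be tested with %in%, which is vectorised and returns TRUE/FALSE for every row. This is a different technique from the function above, shown on made-up ids; only the object names come from the question.

```r
# Made-up first transactions: customer id and the cohort it belongs to
first.transactions.data <- data.frame(
  id     = c(101, 102, 103, 104, 105),
  cohort = c(1, 1, 2, 2, 3)
)

# List of cohorts, named "1", "2", "3"
cohort <- split(first.transactions.data, first.transactions.data$cohort)

# Made-up final-period transactions
final.period.data <- data.frame(
  id     = c(102, 104, 106),
  amount = c(20, 35, 15)
)

# TRUE where the id in final.period.data also appears in cohort 1
final.period.data$same <- final.period.data$id %in% cohort[["1"]]$id
print(final.period.data)

# Share of cohort 1 customers still present in the final period
retention_1 <- mean(cohort[["1"]]$id %in% final.period.data$id)
print(retention_1)
```

exists() checks whether a variable of a given name is defined, not whether a value occurs in a vector, which is why it fails here.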
