Estimating new columns with n-1 cases in R - r

I am trying to build a code for a fund analysis, which starts from the returns of the fund at different frequencies. I have been able to split the data by frequencies, that is, daily, weekly, monthly, quarterly and yearly, but the next step is not quite working for me. I have done it many times on excel and SPSS, but since R is a new language for me, it is proving to be challenging. A sample of my data is given herewith :
Date Dat1 Dat2
30/06/2009 54.26 1307.16
31/07/2009 65.28 1425.40
31/08/2009 70.71 1498.97
30/09/2009 76.18 1552.84
30/10/2009 71.92 1532.74
30/11/2009 77.14 1559.57
What I wish to do is to have two more columns, with n-1 elements in them, starting at the second index. So the entries in the new columns for the 30/06/2009 would be '-' and '-' and for the 31/07/2009 row would be the values of (65.28-54.26)/54.26 and (1425.40-1307.16)/1307.16 and so forth, until the very last case. But when I run the simple code
Daily$Dat1.Return <- diff(log(Daily$Dat1)) I get the following error :
Error in `$<-.data.frame`(`*tmp*`, Dat1, value = c(-0.0616981144702153, :
replacement has 2161 rows, data has 2162
How can I get the columns I want?

Related

Behaviour of MICE package in R

I am using the MICE package to impute missing values - the input data are up to six parallel hourly temperature measurements for Scottish weather stations across a calendar year. None of the vectors have more than 5% NAs as I have filtered ones with more out. Most of the sets work fine with MICE but with a few I get an error message:
This data set, which generated the error message has five columns
iter imp variable
1 1 986Error in terms.formula(tmp, simplify = TRUE) :
invalid term in model formula
986 is the station number which is the column name here for the first column. The third to fifth columns don't have any NAs and the first and second have fewer than 1% - but they do have the NAs concentrated as strings of 20 or so near the beginning of the data set. I am wondering whether MICE has a problem with a too large concentration of the NAs in particular regions but I can't find any reference to this in the literature. Has anyone else come across this as a problem, and if so, what did they do about it? Thanks Nick Wray

How do I keep my missing values to stay the same after I do mice imputation and save my results?

As a new R user I'm having trouble understanding why the NA valus in my dataframe keep changing. I'm running my code on Kaggle. Maybe that's where my problem is arising from?
Original dataframe titled "abc"
There are multiple columns that have NA values so I decided to try using multiple imputation to handle the na values.
So I created a new dataframe with just the columns that had na values and begin imputation
This is the new dataframe titled "abc1"
abc1 <- select(abc, c(9,10,15,16,17,18,19,25,26))
#mice imputation
input_data = abc1
my_imp = mice(input_data, m=5, method="pmm", maxit=20)
summary(input_data$m_0_9)
my_imp$imp$m_0_9
When the imputation begins it creates 5 columns that contain new values to fill in for the NA values of column m_0_9 and I choose which column.
Imputation of column 'm_0_9'
Then I run this code:
final_clean_abc1 <- complete(my_imp,5)
This assigns the values from column 5 of the last image to the NA values in my "abc1" dataframe and saves as "final_clean_abc1."
Lastly I replace the columns from the original "abc" dataframe that had missing values with the new columns in "final_clean_abc1."
I know this probably isnt the cleanest:
abc$m_0_9 <- final_clean_abc1$m_0_9
abc$m_10_12 <- final_clean_abc1$m_10_12
abc$f_0_9 <- final_clean_abc1$f_0_9
abc$f_10_12 <- final_clean_abc1$f_10_12
abc$f_13_14 <- final_clean_abc1$f_13_14
abc$f_15 <- final_clean_abc1$f_15
abc$f_16 <- final_clean_abc1$f_16
abc$asian_pacific_islander <- final_clean_abc1$asian_pacific_islander
abc$american_indian <- final_clean_abc1$american_indian
Now that I have a dataframe 'abc' with no missing values this is where my problem arises. I should be seeing '162' for row 10 for the m_0_9 column but when I save my code and view it on Kaggle I get the value '7' for that specific row and column. As shown in the photo below.
"abc" dataframe with no NA values
Hopefully this makes sense I tried to be as specific as I could be.
There are multiple stochastic processes going on in mice to impute multiple values for one target value, of which are then averaged. You should not expect the same result each time you run mice.
From the MICE documentation
In the first step, the dataset with missing values (i.e. the
incomplete dataset) is copied several times. Then in the next step,
the missing values are replaced with imputed values in each copy of
the dataset. In each copy, slightly different values are imputed due
to random variation. This results in mulitple imputed datasets. In the
third step, the imputed datasets are each analyzed and the study
results are then pooled into the final study result. In this Chapter,
the first phase in multiple imputation, the imputation step, is the
main topic. In the next Chapter, the analysis and pooling phases are
discussed.
https://bookdown.org/mwheymans/bookmi/multiple-imputation.html
We have a wonderful series of vignettes that detail the use of mice. Part of this series is the stochastic nature of the algorithm and how to fix that. Setting mice(yourdata, seed = 123) would generate the same set of multiple imputation every time.

How to calculate the average of different groups in a dataset using R

I have a dataset in R that I would like to find the average of a given variable for each year in the dataset (here, from 1871-2019). Not every year has the same number of entries, and so I have encountered two problems: first, how to find the average of the variable for each year, and second, how to add the column of averages to the dataset. I am unsure how to approach the first problem, but I attempted a version of the second problem by simply finding the sum of each group and then trying to add those values to the dataset for each entry of a given year with the code teams$SBtotal <- tapply(teams$SB, teams$yearID, FUN=sum). That code resulted in an error that notes replacement has 149 rows, data has 2925. I know that this can be done less quickly in Excel, but I'm hoping to be able to use R to solve this problem.
The tapply should work
data(iris)
tapply(iris$Sepal.Length, iris$Species, FUN = sum)

How to compute questionnaire total score and subscores by summing all and a selection of columns in R?

I'm new in R and I'm having a little issue. I hope some of you can help me!
I have a data.frame including answers at a single questionnaire.
The rows indicate the participants.
The first columns indicates the participant ID.
The following columns include the answers to each item of the questionnaire (item.1 up to item.20).
I need to create two new vectors:
total.score <- sum of all 20 values for each participant
subscore <- sum of some of the items
I would like to use a function, like a sum(A:T) in Excel.
Just to recap, I'm using R and not other software.
I already did it by summing each vector just with the symbol +
(data$item.1 + data$item.2 + data$item.3 etc...)
but it is a slow way to do it.
Answers range from 0 to 3 for each item, so I expect a total score ranging from 0 to 60.
Thank you in advance!!
Let's use as example this data from a national survey with a questionnaire
If you download the .csv file to your working directory
data <- read.csv("2016-SpanishSurveyBreastfeedingKnowledge-AELAMA.csv", sep = "\t")
Item names are p01, p02, p03...
Imagine you want a subtotal of the first five questions (from p01 to p05)
You can give a name to the group:
FirstFive <- c("p01", "p02", "p03", "p04", "p05")
I think this is worthy because of probably you will want to perform more tasks with this group (analysis, add or delete a question from the group...), and because it helps you to provide meaningful names (for instance "knowledge", "attitudes"...)
And then create the subtotal variable:
data$subtotal1 <- rowSums(data[ , FirstFive])
You can check that the new variable is the sum
head(data[ , c(FirstFive, "subtotal2")])
(notice that FirstFive is not quoted, because it is an object outside data, but subtotal2 is quoted, because it is the name of a variable in data)
You can compute more subtotals and use them to compute a global score
You could may be save some keystrokes if you know that these variables are the columns 20 to 24:
names(data)[20:24]
And then sum them as
rowSums(data[ , c(20:24)])
I think this is what you asked for, but I would avoid doing this way, as it is easier to make mistakes, whick can be hard to be detected

Conditionally create new column in R

I would like to create a new column in my dataframe that assigns a categorical value based on a condition to the other observations.
In detail, I have a column that contains timestamps for all observations. The columns are ordered ascending according to the timestamp.
Now, I'd like to calculate the difference between each consecutive timestamp and if it exceeds a certain threshold the factor should be increased by 1 (see Desired Output).
Desired Output
I tried solved it with a for loop, however that takes a lot of time because the dataset is huge.
After searching for a bit I found this approach and tried to adapt it: R - How can I check if a value in a row is different from the value in the previous row?
ind <- with(df, c(TRUE, timestamp[-1L] > (timestamp[-length(timestamp)]-7200)))
However, I can not make it work for my dataset.
Thanks for your help!

Resources