Looping to calculate sum of adjacent values while skipping blanks - r

I have a time series of emotional responses and I want to calculate a variable from the sum of absolute differences between these responses. For example, I have 10 variables for the intensity of sadness for T1-T10. However, there is some missing data for some participants, because some only responded for e.g. T1-5 or T1-8. So the number of responses I have for every participant varies.
Now I want to calculate a new variable (SAD_s) from the sum of absolute differences between these variables like this (T1s is the intensity of sadness for T1, T2s for T2 and so on):
COMPUTE SAD_s=abs(T2s-T1s)+abs(T3s-T2s) + abs(T4s-T3s) +abs(T5s-T4s)+abs(T6s-T5s) + abs(T7s-T6s) +abs(T8s-T7s)+abs(T9s-T8s) + abs(T10s-T9s) .
EXECUTE.
However, that only works for participants with the maximum of possible responses. For everyone else with missing data I get no value.
How can I make this work for participants who have missing data at the end of the time series (e.g. missing values from T7 onward, but complete data before that)? In principle, I would also like a solution for participants with missing values in between (e.g. T1-T7 complete, T8 missing, T9-T10 complete), but I would prioritize the former.
I also have a variable indicating the number of Ts participants responded to. I have a faint idea that I need to use a loop that is being repeated the number of times this variable indicates, but I don't know how to implement that.

If you want to just skip the missing value and still calculate differences between all pairs of adjacent valid values, you can go this way:
compute #lstvr=T1s.
compute sad_s=0.
do repeat vr=T2s to T10s.
if not missing(vr) and not missing(#lstvr) sad_s=sad_s+abs(vr-#lstvr).
if not missing(vr) #lstvr=vr.
end repeat.
If, as I understand from your comment, you do not want to compare values from the two sides of a missing value, just fix the second line within the loop like this:
compute #lstvr=vr. /* instead of "if not missing(vr) #lstvr=vr."
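Since the question is tagged r, here is a rough R sketch of the same two variants. It assumes the data sit in a data frame (called dat here, a made-up name) with columns T1s to T10s; the helper name sum_abs_diff and the bridge_gaps argument are also just illustrative.

```r
# Sum of absolute differences between adjacent responses for one participant.
# bridge_gaps = TRUE : drop missing values first, so the values on either
#                      side of a gap are compared with each other.
# bridge_gaps = FALSE: only pairs where both neighbours are present count.
sum_abs_diff <- function(x, bridge_gaps = TRUE) {
  if (bridge_gaps) {
    x <- x[!is.na(x)]
  }
  sum(abs(diff(x)), na.rm = TRUE)
}

# Toy data: three time points only, to keep the example short.
dat <- data.frame(T1s = c(1, 1), T2s = c(3, NA), T3s = c(2, 4))

# apply() passes each row (participant) to the helper as a numeric vector.
dat$SAD_s <- apply(dat[c("T1s", "T2s", "T3s")], 1, sum_abs_diff)
```

With the real data the column selection would be something like dat[paste0("T", 1:10, "s")]. Participants with fewer than two valid responses get 0, the same as in the syntax above where sad_s starts at 0.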

Related

replacing missing values in R with the one value that follows (not the mean)

I'm trying to replace missing values in R with the value that follows. I have annual income data by country, and where the income value for, say, 2001 is missing for country A, I want it to pull the next value. (This is for time series analysis with several countries and different columns for different variables; income is just one of them.)
I wrote the code below to replace the missing values with the mean, but statistically I think it makes more sense to replace each missing value with the value right below it (the next year's value). The numbers differ a lot between countries, so an average would be taken over all years for all countries.
Social_data_R<-within(Social_data_R,incomeNAavg[is.na(income)]<-mean(income,na.rm=TRUE))
I tried replacing the mean part of the code above with income[i+1], but R didn't recognize 'i' (I imported the data from Excel, so I didn't create the data frame manually).
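One possible way to do this in base R is sketched below. The column names country and year are assumptions (only income and Social_data_R appear in the question), fill_next is a made-up helper, and the toy data frame is only a stand-in for the real one.

```r
# Replace each NA with the next non-missing value further down the vector.
# Walking backwards lets runs of several NAs all pick up the same later value.
fill_next <- function(x) {
  for (i in rev(seq_along(x)[-length(x)])) {
    if (is.na(x[i])) x[i] <- x[i + 1]
  }
  x
}

# Toy stand-in for the real data.
Social_data_R <- data.frame(
  country = c("A", "A", "A", "B", "B", "B"),
  year    = rep(2000:2002, times = 2),
  income  = c(10, NA, 30, NA, NA, 5)
)

# Rows must be in year order within each country for "next" to mean "next year".
Social_data_R <- Social_data_R[order(Social_data_R$country, Social_data_R$year), ]

# ave() applies fill_next separately within each country.
Social_data_R$income_filled <- ave(Social_data_R$income,
                                   Social_data_R$country,
                                   FUN = fill_next)
```

A missing value in a country's last year has nothing after it, so it stays NA. Because the fill runs within each country, values are never borrowed from the following country.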

Calculating the offset between two columns in a dataframe but ignoring some of the outliers in one of those

I have a dataframe with two columns. One is the baseline (baseline_CO2), which I calculated from a previous set of data; the other is a set of data that I believe to be offset with respect to this baseline.
I want to quantify this offset and calculate its value in order to correct my original data (CO2_LICOR). To do this accurately, I need to exclude some of the outlier peak values in the CO2_LICOR data from the offset calculation, say all values over 350.
Can anyone help?
The dataframe looks like the following:
If you want to compare the two columns, then you can use the approach Jon Spring suggested.
df$offset <- df$baseline_CO2 - df$CO2_LICOR
If you want to filter out these values first, then something like
df_filtered <- df[df$CO2_LICOR < 350, ]
should work (note the comma, which makes it a row filter).
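Putting the two steps together, a minimal sketch could look like this. It assumes the offset is roughly constant, so a single mean is a fair summary, and it keeps the sign convention used above (baseline minus measured). The toy data and the CO2_corrected column name are made up.

```r
# Toy data: the third reading is a peak that should not influence the offset.
df <- data.frame(
  baseline_CO2 = c(300, 302, 301, 303),
  CO2_LICOR    = c(310, 312, 400, 313)
)

# Rows used for estimating the offset: measured value present and below 350.
keep <- !is.na(df$CO2_LICOR) & df$CO2_LICOR < 350

# Mean difference over the retained rows only.
offset <- mean(df$baseline_CO2[keep] - df$CO2_LICOR[keep], na.rm = TRUE)

# Apply the estimated offset to every measured value, peaks included.
df$CO2_corrected <- df$CO2_LICOR + offset
```

Here the peak is left out only when estimating the offset; it is still corrected afterwards. If the offset drifts over time, a single mean is probably too crude.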

How to measure the fixed effect of a pair of individuals using the plm package in R?

I have a panel data set consisting of bonds with daily prices observed over a period of time. Thus each bond is repeated downwards with the corresponding daily price observations and dates (ref picture below). Half of the bonds are green (identified by a dummy variable) and each green bond is matched with a non-green bond, each pair is identified with a pair-id. So a green bond and its matched non-green bond have the same pair-id, and are observed over the same time span (say 100 days each), but the individual bond-id is unique.
I want to measure the fixed effect within each pair of bonds to figure out if there is a significant difference in yield to maturity (variable used = ask.yield) between the green bond and its matching non-green bond. Thus, I believe when identifying the paneldata in R, that the individual should be pair.id and the time index should be date. I use the following regression:
fixed <- plm(ask.yield ~ liquidity + green, data = paneldata, index = c("pair.id", "dates"), model = "within")
Desired output (do not mind the numbers):
I get an error message saying:
Error in pdim.default(index[1], index[2]) :
duplicate couples (id-time)
I understand the error message – each pair.id in the panel data is recorded over the same dates twice (one time for the green bond, and one for the matching non-green bond).
Does anyone know how to get around this problem and still be able to measure the fixed effect within each pair of bonds?
From the error, there are duplicates in the index, i.e. the combinations of pair.id and dates are not unique. Can you check whether the date values are unique within each pair.id?
If they are, you might need to convert the dates to strings: depending on the data type, the dates may be converted to values that introduce duplicates.
I hope this helps. Since I don't have the data, I have no way to reproduce the problem.
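The check can be done in base R along these lines. The toy data below only mimics the structure described in the question, and bond.id is an assumed column name for the unique bond identifier.

```r
# Toy panel: one pair (pair.id 1) made up of a green and a non-green bond,
# each observed on the same two dates.
paneldata <- data.frame(
  pair.id = c(1, 1, 1, 1),
  bond.id = c("g1", "g1", "n1", "n1"),
  dates   = as.Date(c("2020-01-01", "2020-01-02",
                      "2020-01-01", "2020-01-02")),
  green   = c(1, 1, 0, 0)
)

# Every (pair.id, dates) combination occurs twice, which is what plm rejects.
dup_pair <- duplicated(paneldata[c("pair.id", "dates")])
sum(dup_pair)

# The same check with the bond identifier as the individual index.
dup_bond <- duplicated(paneldata[c("bond.id", "dates")])
any(dup_bond)
```

If the second check shows no duplicates, indexing by the bond identifier would at least make the index valid for plm. It does not settle the modelling question, though: green is constant within a bond, so a within model on bond.id would absorb it.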

Removing data frames from a list that contains a certain value under a variable in R

I currently have a list of 27 correlation matrices with 7 variables each, for social science research.
Some correlations are "NA" due to missing data.
When I do the analysis, however, I do not analyse all variables in one go.
In a particular instance, I would like to keep a matrix only if one of the variables has at least some value in it, i.e. something other than "NA". Since there are 7 variables, I want to keep anything where that variable does NOT consist of 6 "NA"s plus the correlation with itself, 1. This is the tricky part: 1 is a value, but it is meaningless to me in a correlation matrix.
Appreciate if anyone could enlighten me regarding the code.
I am rather new to R, and the only thought I have is to use an if statement to set the condition. But I have been trying for hours but to no avail, as this is my first real coding experience.
Thanks a lot.
Since you didn't provide sample data, I am first going to convert your matrix into a dataframe, and then I am going to assume you want to see whether your dataframe df has a variable var with at least one value that is neither NA nor 1.
df <- as.data.frame(as.table(matrix)) should convert your matrix into a dataframe (here matrix stands for one of your correlation matrices).
table(df$var) will show you the distribution of values in your dataframe's variable. From there you can make a judgement call on whether to keep the variable or not.
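If the goal is to drop whole matrices from the list automatically, a sketch along these lines might work. It assumes every matrix has row and column names; the variable name "c", the list name mats, and the helper has_usable_cor are all made up for the example.

```r
# TRUE if `var` has at least one non-NA correlation with another variable.
# The diagonal (correlation with itself) is left out of the check.
has_usable_cor <- function(m, var) {
  others <- setdiff(colnames(m), var)
  any(!is.na(m[var, others]))
}

# Two toy 3x3 correlation matrices; in the first, "c" has no usable values.
vars <- c("a", "b", "c")
m1 <- matrix(c(1, 0.2, NA,
               0.2, 1, NA,
               NA, NA, 1), nrow = 3, dimnames = list(vars, vars))
m2 <- matrix(c(1, 0.5, 0.3,
               0.5, 1, 0.1,
               0.3, 0.1, 1), nrow = 3, dimnames = list(vars, vars))
mats <- list(first = m1, second = m2)

# Keep only the matrices in which "c" correlates with something.
kept <- Filter(function(m) has_usable_cor(m, "c"), mats)
```

Excluding the diagonal by name avoids having to treat the value 1 specially, so a genuine off-diagonal correlation of 1 would still count as a value.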

How to calculate Zscore in R

I have log2ratio values for each chromosome position (137221 coordinates) for different samples (15 samples). I want to calculate the Z-score of the log2ratio for each chromosome position (row). I also want to exclude the first three columns because they contain IDs. There are also some NAs scattered among the values.
Thanking you in advance
It's not completely clear what you want. If you want a single Z-score for each entire row (i.e., its mean divided by its standard error), using all but the first three columns, then
f <- function(x) {
mean(x,na.rm=TRUE)/(sd(x,na.rm=TRUE)/sqrt(length(na.omit(x))))
}
apply(as.matrix(df[,-(1:3)]),1,f)
will do it. That gives you a vector with one value per row.
If you want the whole table of normalized data (a Z-score for every sample, computed within each row) then I think
t(scale(t(as.matrix(df[,-(1:3)]))))
should work. If neither of those works, you need to post a reproducible example, or at least tell us precisely what the error messages are.
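For what it's worth, here is a small self-contained sketch of the row-wise normalization written out by hand, which makes the NA handling explicit. The toy data frame and its column names are invented; only the layout (three ID columns followed by sample columns) follows the question.

```r
# Toy data: three ID columns followed by three sample columns.
df <- data.frame(
  chr   = c("1", "1"),
  start = c(100, 200),
  end   = c(150, 250),
  s1    = c(1, 2),
  s2    = c(2, NA),
  s3    = c(3, 4)
)

# Drop the ID columns and work on a numeric matrix.
vals <- as.matrix(df[, -(1:3)])

# Per-row mean and standard deviation, ignoring NAs.
row_mean <- rowMeans(vals, na.rm = TRUE)
row_sd   <- apply(vals, 1, sd, na.rm = TRUE)

# R recycles the per-row vectors down the columns, so each cell is
# centred and scaled by the statistics of its own row.
z <- (vals - row_mean) / row_sd
```

Cells that were NA stay NA in the result, and a row with fewer than two non-missing values gets NA throughout because its standard deviation is undefined.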
