Replace NA with data from another column in R

I know how to make the NA's blanks with the following code:
IMILEFT IMIRIGHT IMIAVG
NA NA NA
NA 71.15127 NA
72.18310 72.86607 72.52458
70.61460 68.00766 69.31113
69.39032 69.91261 69.65146
72.58609 72.75168 72.66888
70.85714 NA NA
NA 69.88203 NA
74.47109 73.07963 73.77536
70.44855 71.28647 70.86751
NA 72.33503 NA
69.82818 70.45144 70.13981
68.66929 69.79866 69.23397
72.46879 71.50685 71.98782
71.11888 71.98336 71.55112
NA 67.86667 NA
IMILEFT <- ((ASLCOMPTEST$LHML + ASLCOMPTEST$LRML)/(ASLCOMPTEST$LFML +
ASLCOMPTEST$LTML)*100)
IMILEFT <- sapply(IMILEFT, as.character)
IMILEFT[is.na(IMILEFT)] <- ""
But when I run that code, I can no longer compute the average of "IMILEFT" and "IMIRIGHT", or set "IMIAVG" to the value of whichever column has a number.
IMIAVG<-((IMILEFT + IMIRIGHT)/2)
Error in IMILEFT + IMIRIGHT : non-numeric argument to binary operator
I get the same error if I use as.numeric.

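Try the following. Leave the NAs as they are; converting the columns to character is what breaks the arithmetic.
rowSums(M, na.rm=TRUE) / (2 - (is.na(L) + is.na(R)))
## WHERE
M = cbind(IMILEFT, IMIRIGHT)
L = IMILEFT
R = IMIRIGHT
If you have rows where both columns are NA, the denominator above is 0 and the result is NaN. To avoid dividing by zero, use this denominator instead (those rows then come out as 0):
pmax(1, 2 - (is.na(L) + is.na(R)))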
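As a worked illustration of averaging two columns while skipping NAs, here is a minimal sketch on a few rows of the question's data. It counts the non-missing values per row instead of subtracting the NA count from 2, and it sets rows with no data at all to NA; that last choice is an assumption made here, not something the question specifies.

```r
# A few rows taken from the question's table
IMILEFT  <- c(NA, NA, 72.18310, 70.85714)
IMIRIGHT <- c(NA, 71.15127, 72.86607, NA)

M <- cbind(IMILEFT, IMIRIGHT)

# Number of non-missing values in each row (0, 1 or 2)
n_present <- rowSums(!is.na(M))

# Sum the available values and divide by how many there were
IMIAVG <- rowSums(M, na.rm = TRUE) / n_present

# Rows where both columns are NA give 0/0 = NaN; report them as NA
IMIAVG[n_present == 0] <- NA

print(data.frame(IMILEFT, IMIRIGHT, IMIAVG))
```

Rows with a single value keep that value, which is the "use the other column" behaviour the question asks for.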

Related

compute diff of rows with NAs values in data frame using R

I have a data frame (9000 x 304) that looks like this:
date a b
1997-01-01 8.720551 10.61597
1997-01-02 na na
1997-01-03 8.774251 na
1997-01-04 8.808079 11.09641
I want to calculate differences such as:
first <- data[i-1,] - data[i-2,]
second <- data[i,] - data[i-1,]
third <- data[i,] - data[i-2,]
I want to ignore the NA values: if a value is NA, I want to use the last non-NA value in that column instead.
For example, for the second diff at i = 4 in column b:
11.09641 - 10.61597 is the value of b_diff on 1997-01-04
This is what I did, but it keeps producing NA values:
first <- NULL
for (i in 3:nrow(data)){
first <-rbind(first, data[i-1,] - data[i-2,])
}
second <- NULL
for (i in 3:nrow(data)){
second <- rbind(second, data[i,] - data[i-1,])
}
third <- NULL
for (i in 3:nrow(data)){
third <- rbind(third, data[i,] - data[i-2,])
}
There may be a way to solve it with the aggregate function, but I need a solution that works on big data and does not require specifying each column name separately. Moreover, my column names are in a foreign language.
Thank you very much ! I hope I gave you all the information you need to help me, otherwise, let me know please.
You can use fill to replace each NA with the last non-NA value above it (the default direction is "down"), and then use across and lag to compute the new variables. It is unclear exactly what your expected output is, but you can also replace the default value of lag when it does not exist (e.g. for the first value), using lag(.x, default = ...).
library(dplyr)
library(tidyr)
data %>%
  fill(a, b) %>%
  mutate(across(a:b, ~ lag(.x) - lag(.x, n = 2), .names = "first_{.col}"),
         across(a:b, ~ .x - lag(.x), .names = "second_{.col}"),
         across(a:b, ~ .x - lag(.x, n = 2), .names = "third_{.col}"))
date a b first_a first_b second_a second_b third_a third_b
1 1997-01-01 8.720551 10.61597 NA NA NA NA NA NA
2 1997-01-02 8.720551 10.61597 NA NA 0.000000 0.00000 NA NA
3 1997-01-03 8.774251 10.61597 0.0000 0 0.053700 0.00000 0.053700 0.00000
4 1997-01-04 8.808079 11.09641 0.0537 0 0.033828 0.48044 0.087528 0.48044
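With 304 columns it is impractical to type a column range by hand. The sketch below shows one possible way to apply the same fill-then-lag idea to every column except date. The variable value_cols and the assumption that the date column is literally named "date" are illustrative; the sample data is the four rows from the question, with real NA values rather than the string "na".

```r
library(dplyr)
library(tidyr)

data <- data.frame(
  date = as.Date(c("1997-01-01", "1997-01-02", "1997-01-03", "1997-01-04")),
  a = c(8.720551, NA, 8.774251, 8.808079),
  b = c(10.61597, NA, NA, 11.09641)
)

# Collect the value columns up front so the across() calls below
# do not pick up the columns that mutate() creates along the way.
value_cols <- setdiff(names(data), "date")

result <- data %>%
  fill(all_of(value_cols)) %>%   # carry the last non-NA value downwards
  mutate(
    across(all_of(value_cols), ~ lag(.x) - lag(.x, n = 2), .names = "first_{.col}"),
    across(all_of(value_cols), ~ .x - lag(.x), .names = "second_{.col}"),
    across(all_of(value_cols), ~ .x - lag(.x, n = 2), .names = "third_{.col}")
  )

print(result)
```

Selecting by a stored character vector also sidesteps typing column names in a foreign language.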

Creating Data Table of Regression Coefficients

I have a model with the following regression coefficient values:
(Intercept) radius perimeter compactness concavepoints
-2.3003926746 0.0743984303 -0.0111031732 -2.5826629017 5.3127565914
radius.stderr smoothness.stderr compactness.stderr concavity.stderr radius.worst
0.4256225882 16.9805981122 -3.8819567231 0.9488969352 0.1408605366
texture.worst area.worst concavity.worst symmetry.worst fractaldimension.worst
0.0105317616 -0.0009867991 0.3504860653 0.8536208289 4.7503948408
I want to make a data table with the variable names in one column, and the corresponding regression coefficient values in the other column.
This is what I have tried so far:
var_names = coef(summary(model_B))[, 0]
coef_vals = coef(summary(model_B))[, 1]
data.table(Variables=c(var_names), RegressionCoefficients = c(coef_values))
But I get the following output with the 'Variables' column all NA:
Variables RegressionCoefficients
<dbl> <dbl>
NA -2.3003926746
NA 0.0743984303
NA -0.0111031732
NA -2.5826629017
NA 5.3127565914
NA 0.4256225882
NA 16.9805981122
NA -3.8819567231
NA 0.9488969352
NA 0.1408605366
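Use names to access the names of the coefficients. Indexing with [, 0] selects zero columns (R indexes from 1), which is why the Variables column came out as NA.
var_names = names(coef(model_B))
coef_vals = coef(model_B)
data.table(Variables = var_names, RegressionCoefficients = coef_vals)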
Variables RegressionCoefficients
1: (Intercept) 2.984208e-16
2: radius 1.000000e+00
3: perimeter 1.000000e+00
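For a self-contained illustration, the sketch below fits a small model on the built-in mtcars data (a stand-in for model_B, which is not available here) and builds the two-column table. It uses data.frame so that it runs without extra packages; data.table() accepts the same named arguments.

```r
# Stand-in model; the original model_B is not available
model <- lm(mpg ~ wt + hp, data = mtcars)

coefs <- coef(model)   # named numeric vector

coef_table <- data.frame(
  Variables = names(coefs),                 # names come from the vector's names
  RegressionCoefficients = unname(coefs)
)
print(coef_table)

# The same names are the row names of the summary coefficient matrix
print(rownames(coef(summary(model))))
```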

How to workaround NAs in subscripted assignments

I have some data with a representative subpart here
id visitdate ecgday
5130 1999-09-22 1999-09-22
6618 NA 1999-12-01
10728 2000-06-27 2000-06-27
968 1999-04-19 1999-04-19
5729 1999-09-23 NA
1946 NA NA
15070 1999-11-09 NA
What I want is to create a new variable visitday that is equal to ecgday unless ecgday is NA. In that case visitday should equal visitdate, and if both visitdate and ecgday are NA, visitday should be NA.
I have tried
int99$visitday <- int99$visitdate
int99$visitday[!is.na(int99$ecgday) & int99$ecgday > int99$visitdate] <- int99$ecgday[!is.na(int99$ecgday) & int99$ecgday > int99$visitdate]
but it gave the error:
Error in [.data.frame(int99, , c("id", "visitday", "visitdate", :
undefined columns selected
which I understand. Any workaround to get the desired result?
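This should do it.
First, if ecgday is NA, visitday takes visitdate; otherwise it takes ecgday:
int99$visitday <- ifelse(is.na(int99$ecgday), int99$visitdate , int99$ecgday)
For cases where both are NA, you can add a second ifelse:
int99$visitday <- ifelse(is.na(int99$visitdate), int99$ecgday , int99$visitdate)
Note that this second assignment overwrites the first and prefers visitdate whenever it is present. The first line on its own already returns NA when both dates are NA, because visitdate is NA in that case.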
Thanks to Derek Corcoran
That worked except for one small thing: visitday ended up numeric even though both ecgday and visitdate are Date.
That was easily fixed by adding a line:
int99$visitday <- ifelse(is.na(int99$ecgday), int99$visitdate , int99$ecgday)
int99$visitday <- ifelse(is.na(int99$visitdate), int99$ecgday , int99$visitdate)
int99$visitday <- as.Date(int99$visitday, origin="1970-01-01")
Thank You so much.
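ifelse() drops the Date class, which is why the as.Date(..., origin = ...) step is needed afterwards. A possible alternative, sketched here on the question's sample rows, is to start from ecgday and fill its gaps through a logical index, which keeps the Date class. The name use_visit is just an illustrative helper.

```r
int99 <- data.frame(
  id = c(5130, 6618, 10728, 968, 5729, 1946, 15070),
  visitdate = as.Date(c("1999-09-22", NA, "2000-06-27", "1999-04-19",
                        "1999-09-23", NA, "1999-11-09")),
  ecgday = as.Date(c("1999-09-22", "1999-12-01", "2000-06-27", "1999-04-19",
                     NA, NA, NA))
)

# Start from ecgday, then fill its gaps from visitdate.
# Rows where both are NA stay NA because visitdate is NA there too.
int99$visitday <- int99$ecgday
use_visit <- is.na(int99$ecgday)
int99$visitday[use_visit] <- int99$visitdate[use_visit]

print(int99)
```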
In my view the best way to deal with such NA comparisons is to change the dates to numeric and all NAs to 0. Quite possibly I did not understand the question correctly, but in case you want to set the new variable to the later of visitdate and ecgday, you can try this.
It can also be adapted to any other requirement.
int99<- read.table(header = T, colClasses = c("numeric", "Date","Date"),
text="id visitdate ecgday
5130 1999-09-22 1999-09-22
6618 NA 1999-12-01
10728 2000-06-27 2000-06-27
968 1999-04-19 1999-04-19
5729 1999-09-23 NA
1946 NA NA
15070 1999-11-09 NA" )
dt<- apply(int99[,2:3], 2 , zoo::as.Date)
dt
dt[is.na(dt)]<- 0
dt
mx<- apply(dt,1,max)
mx[mx==0]<- NA
int99$visitday<- zoo::as.Date(mx)
int99

calculating a parameters in equation

SMDI_t = p * SMDI_{t-1} + q * SMD_t
SMDI_t = SMD_t / 50
I want to apply the above equations to my dataset (SMD). I first need to divide the first column of my dataset by 50 (eqn 2) and call it SMDI, then apply the first equation, where I combine SMDI at time t-1 with the original SMD. I have two values each of p and q (p_dry and p_wet, q_dry and q_wet). I want to use p_dry and q_dry in equation one if my cell value is positive, otherwise p_wet and q_wet. I wrote the following code but it gives me an error: NA/NaN argument. Please help.
3.343327144 0.076583722 -4.316073117 -6.064319011 -1.034313982 1.711678831 2.062381759 5.632386548 6.017760438
4.467709087 1.632745678 -2.045736377 -3.601413064 1.695347213 3.295933998 4.070685302 7.743864617 8.348716373
8.256385028 5.635534811 2.707796712 1.572985845 6.066710978 7.095101029 7.941167874 11.37490758 12.15712496
NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA
-47.4749727 -62.45954133 -69.42311677 -68.04854477 -69.86363461 -56.6566393 -44.02624374 -34.68257496 -5.528397863
-57.44464723 -74.11667952 -83.07777747 -81.88546602 -84.32488173 -72.37428075 -61.04778523 -51.84892678 -20.81696219
-12.6032741 -26.27089119 -36.55478576 -30.40468773 -36.15889518 -33.71339142 -16.63378788 -4.849972012 -1.667644897
-28.28948158 -38.05693676 -43.2879285 -35.34546364 -40.09848824 -34.40754496 -18.41988896 -9.867125675 -7.493617422
NA NA NA NA NA NA NA NA NA
-35.04117468 -38.74252722 -42.69080876 -43.06064215 -40.85844545 -36.79603495 -37.92408262 -34.51428202 -32.54118632
-29.35688054 -33.7004665 -37.88555224 -39.06340145 -37.19884049 -29.8488303 -32.48244008 -28.52426895 -28.39245064
-1.422800439 -6.972537109 -11.86824507 -13.14543917 -9.893061342 1.11258721 -0.415834635 2.424939039 2.65615071
Codes:
data=read.table('SMD.csv', header=TRUE, sep=',')
SMD=data.matrix(data)
p_dry<-0.1542
q_dry<-0.0338
p_wet<-0.1660
q_wet<-0.0333
SMDI<- matrix(0,nrow=nrow(SMD),ncol=ncol(SMD))
for (i in 2:nrow(SMD)) {
for(j in 1:ncol){
if(is.na(SMD[i,j])){
SMD[i,j]<-NaN
SMDI[1,j] <-SMD[1,j]/50
if(SMD[i,j]<0)
SMDI[i,j]<- p_dry[j]*SMDI[i-1,j]+SMD[i,j]*q_dry[j] else
SMDI[i,j]<- p_wet[j]*SMDI[i-1,j]+SMD[i,j]*q_wet[j]
}
}
}
write.table(SMDI,(file='SMDI.csv')
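The NA/NaN argument error most likely comes from for(j in 1:ncol): ncol is a function, so it should be 1:ncol(SMD). Beyond that, you don't need a loop over every cell. R works with vectors, so whole rows can be updated at once.
SMDI <- SMD/50 # second equation; row 1 is the starting value
# defining matrices of p and q values matching the sign of each SMD cell
p <- ifelse(SMD>0, p_dry, p_wet)
q <- ifelse(SMD>0, q_dry, q_wet)
# first equation: SMDI in row i depends on SMDI in row i-1
for (i in 2:nrow(SMD)) SMDI[i, ] <- p[i, ]*SMDI[i-1, ] + q[i, ]*SMD[i, ]
Note that "SMDIt-1" means SMDI at the previous time step, not SMDIt minus 1, and that an NA row carries forward into all later rows unless it is handled separately.
Edit: replaced SMD[, 1] with SMD to calculate values for whole matrix.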
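The first equation is a recurrence: each SMDI value depends on the SMDI value one step earlier. Below is a minimal sketch on a toy matrix, assuming that rows are time steps and that the series restarts from SMD/50 after a missing row. Both of those are assumptions made for illustration; the question does not say how gaps should be handled.

```r
p_dry <- 0.1542; q_dry <- 0.0338
p_wet <- 0.1660; q_wet <- 0.0333

# Toy data: 5 time steps (rows), 2 cells (columns), one missing row
SMD <- matrix(c(  3.34,  -4.32,
                  4.47,  -2.05,
                    NA,     NA,
                -47.47, -62.46,
                -57.44, -74.12), ncol = 2, byrow = TRUE)

# Per-cell parameters: dry values for positive cells, wet values otherwise
p <- ifelse(SMD > 0, p_dry, p_wet)
q <- ifelse(SMD > 0, q_dry, q_wet)

SMDI <- SMD / 50                      # starting values
for (i in 2:nrow(SMD)) {
  prev <- SMDI[i - 1, ]
  step <- p[i, ] * prev + q[i, ] * SMD[i, ]
  # Assumed restart rule: keep SMD/50 wherever the previous step is missing
  SMDI[i, ] <- ifelse(is.na(prev), SMDI[i, ], step)
}
print(SMDI)
```

The loop runs over rows only; each row update is vectorized across all columns.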

Binning average of matrix

I have a matrix with n rows and n columns and I would like to average it in bins of 10 rows at a time, so that I end up with a matrix of size n/10-by-n. I loaded the matlab library and tried the following code:
nRemove = rem(size(a,1),10);
a = a(1:end-nRemove,:)
Avg = mean(reshape(a,10,[],n));
AvgF = squeeze(Avg);
but it didn't work. Which code should I use?
Thanks!!
Here is another way to do it:
set.seed(5)
x = matrix(runif(1000), ncol = 10)
nr = nrow(x)
gr = rep(1:floor(nr/10), each = 10)
aggregate(x ~ gr, FUN=mean)[,-1]
which results in
NA NA.1 NA.2 NA.3 NA.4 NA.5 NA.6 NA.7
1 0.5295264 0.5957229 0.4502069 0.5168083 0.3398190 0.4075922 0.6059122 0.5127865
2 0.4778341 0.3967321 0.4069635 0.4514742 0.6172677 0.2486085 0.6340686 0.4052600
3 0.5168132 0.5117207 0.5202261 0.5068593 0.5218041 0.4925462 0.5169584 0.4919296
4 0.3299557 0.3314723 0.4503393 0.3965103 0.6166598 0.5525628 0.4943880 0.6048207
5 0.6145423 0.5853235 0.4822182 0.3377771 0.3540784 0.5974846 0.5202577 0.5769518
6 0.5009249 0.5203701 0.3940540 0.4237508 0.3199265 0.4817713 0.4655320 0.6124400
7 0.7335082 0.5856578 0.3929621 0.6403662 0.5347719 0.5658542 0.4226456 0.7196593
8 0.4976663 0.5205538 0.4529273 0.4757352 0.6980300 0.5694570 0.4384924 0.5481236
9 0.5275932 0.5014861 0.5363340 0.5664576 0.5006055 0.5611069 0.3803889 0.4680865
10 0.4560031 0.5527328 0.4419076 0.6893043 0.5161281 0.5895931 0.3965911 0.3842419
NA.8 NA.9
1 0.3711607 0.5541607
2 0.4379255 0.4159131
3 0.5048523 0.5884052
4 0.4642687 0.4572388
5 0.6054209 0.5174784
6 0.4659952 0.5332438
7 0.4568273 0.3943798
8 0.6978356 0.5087778
9 0.4897584 0.4710949
10 0.6310546 0.4775762
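t( sapply(1:(NROW(A) %/% 10), function(x) colMeans(A[ ((x-1)*10 + 1):(x*10), ] ) ) )
Each bin x covers rows (x-1)*10 + 1 through x*10. You need the transpose operation to re-orient the result, because sapply returns one column per bin. One often needs to do so after an 'apply' operation.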
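Another base-R option is rowsum(), which adds up rows by group; dividing by the bin size then gives the bin means. The sketch below drops leftover rows first, as the MATLAB attempt in the question does. The names bin, n_keep and x_trim are illustrative.

```r
set.seed(5)
x <- matrix(runif(1030), ncol = 10)    # 103 rows, not a multiple of 10

bin <- 10
n_keep <- nrow(x) - nrow(x) %% bin     # drop the 3 leftover rows
x_trim <- x[seq_len(n_keep), , drop = FALSE]

# Group label for every kept row: 1,1,...,1,2,2,...,2,...
gr <- rep(seq_len(n_keep / bin), each = bin)

# Sum rows within each group, then divide by the bin size to get means
binned <- rowsum(x_trim, gr) / bin
print(dim(binned))
```

The result already has one row per bin, so no transpose is needed.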

Resources