How to return the imputed values in R

Is there any function in R that can help to return imputed values, for example:
x <- c(23,23,25,43,34,22,78,NA,98,23,30,NA,21,78,22,76,NA,77,33,98,22,NA,52,87,NA,23,
23)
Using single linear imputation,
na.approx(x)
I get the imputed data as:
[1] 23 23 25 43 34 22 78 35 98 23 30 24 21 78 22 76 22 77 33 98 22 14 52 87 59
[26] 23 23
How can I extract just the imputed values without scanning the completed dataset one by one? For example, if the data contain $n=200$ observations and 20 of them are missing, can I get the 20 estimates of the missing values directly?

I am not 100 percent sure I understood you correctly, but does this help?
First save the positions of the original NA values; for example, the first NA is at the 8th position. Store these positions in a dummy variable:
library(zoo)  # provides na.approx()
dummy <- which(is.na(x))  # indices of the original NAs
Now get the corresponding values in the imputed data:
imputeddata <- na.approx(x)
for (i in dummy) {
  print(imputeddata[i])
}

You could use is.na to select only those values that were previously NA.
> x <- c(23,23,25,43,34,22,78,NA,98,23,30,NA,21,78,22,76,NA,77,33,98,22,NA,52,87,NA,23,23)
> na.approx(x)[is.na(x)]
[1] 88.0 25.5 76.5 37.0 55.0
Hope that helps.
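To keep each imputed value together with its position, the same indexing idea can be wrapped in a small data frame. This is a minimal sketch: it uses base R's approx() as a stand-in for zoo's na.approx() so that it runs without extra packages, and the names missing_idx and imputed are made up for illustration.

```r
x <- c(23, 23, 25, 43, 34, 22, 78, NA, 98, 23, 30, NA, 21, 78, 22, 76, NA,
       77, 33, 98, 22, NA, 52, 87, NA, 23, 23)

# Positions of the original NAs (hypothetical helper name)
missing_idx <- which(is.na(x))

# Linear interpolation evaluated only at the missing positions;
# approx() ignores the NA entries when building the interpolant
imputed <- data.frame(
  position = missing_idx,
  value    = approx(seq_along(x), x, xout = missing_idx)$y
)
print(imputed)
```

With zoo loaded, na.approx(x)[missing_idx] could be used for the value column instead.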

Related

R: many nested loops to remove rows in multiple data frames

I have 18 data frames called regular55, regular56, regular57, collar55, collar56, etc. In each data frame, I want to delete the first row of each nest.
Each data frame looks like this:
nest interval
1 17 -8005
2 17 183
3 17 186
4 17 221
5 17 141
6 17 30
7 17 158
8 17 23
9 17 199
10 17 51
11 17 169
12 17 176
13 31 905
14 31 478
15 31 40
16 31 488
17 31 16
18 31 203
19 31 54
20 31 341
21 31 54
22 50 -14164
23 50 98
24 50 1438
25 71 240
26 71 725
27 71 819
28 85 -13935
29 85 45
30 85 589
31 85 47
32 85 161
33 85 67
The solution I came up with to avoid writing out the function for each one of the 18 data frames includes many nested loops:
for (i in 5:7){
for (j in 5:7) {
for (k in c("regular","collar")){
for (l in c(unique(paste0(k,i,j,"$nest")))){
paste0(k,i,j)=paste0(k,i,j)[(-c(which((paste0(k,i,j,"$nest")) == l )
[1])),]
}}}}
I'm basically selecting the first value at "which" there is a "unique" value of nest. However, I get:
Error in paste0(k, i, j)[(-c(which((paste0(k, i, j, "$nest")) == l)[1])), :
incorrect number of dimensions
It might be because "paste0(k,i,j)" is only considered as a character and not recognized as the name for a data frame.
Any ideas on how to fix this? Or any other ways to delete the first rows for each nest in every data frame?
Thanks to help from the comments, my problem was solved.
Originally, I divided my data frame using a for loop and then grouped it into one list:
for (i in 5:7) {
for (j in 5:7) {
for (k in c("regular","collar")){
assign(paste0(k,i,j),
df[df$x == i & df$y == j & df$z == k,])
}}}
df.list <- mget(ls(pattern = "^(regular|collar)[5-7][5-7]$"))
I later found a way to split my data frame directly into a list based on multiple columns (R subsetting a data frame into multiple data frames based on multiple column values):
df.list <- split(df, interaction(df$x, df$y, df$z), drop = TRUE)
Finally, I was able to apply the function to remove the first rows of each nest:
library(dplyr)
# slice(-1) drops the first row of each nest, even when a nest has only one row
df.list.updated <- lapply(df.list, function(d) d %>% group_by(nest) %>% slice(-1))
It is definitely easier to work from a list of data frames.
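For comparison, the same step can be sketched in base R without dplyr. This assumes each data frame has a nest column as in the question; the two small data frames and the helper name drop_first_per_nest are invented for illustration.

```r
# Two toy data frames standing in for the 18 real ones
regular55 <- data.frame(nest     = c(17, 17, 17, 31, 31, 50),
                        interval = c(-8005, 183, 186, 905, 478, -14164))
collar55  <- data.frame(nest     = c(71, 71, 85, 85, 85),
                        interval = c(240, 725, -13935, 45, 589))
df.list <- list(regular55 = regular55, collar55 = collar55)

# duplicated() is FALSE for the first occurrence of each nest and TRUE
# afterwards, so keeping only the TRUE rows removes the first row per nest
drop_first_per_nest <- function(d) d[duplicated(d$nest), , drop = FALSE]

df.list.updated <- lapply(df.list, drop_first_per_nest)
print(df.list.updated)
```

A nest that occurs only once (nest 50 here) disappears entirely, which matches the behaviour of slice(-1).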

Subsetting - R prints data in reverse order- [R 3.2.2, Win10 Pro, 64-bit]

Aim: To retrieve the last two entries of the data. (I am aware of the tail function and of direct indexing.)
Code:
> tdata <- read.csv("hw1_data.csv")
> temp <- tdata[(nrow(tdata)-1):nrow(tdata), ]
> temp
Ozone Solar.R Wind Temp Month Day
152 18 131 8.0 76 9 29
153 20 223 11.5 68 9 30
> temp <- tdata[nrow(tdata)-1:nrow(tdata), ]
> temp
Ozone Solar.R Wind Temp Month Day
152 18 131 8.0 76 9 29
151 14 191 14.3 75 9 28
150 NA 145 13.2 77 9 27
149 30 193 6.9 70 9 26
148 14 20 16.6 63 9 25
147 7 49 10.3 69 9 24
.
.
.
While taking a subset using the extract operator, I used the nrow() function to get the total number of rows in the data, subtracted one from it, and used the sequence operator (:) to run up to nrow(tdata), i.e. the total number of rows.
When I use parentheses, the logic works fine, but when I skip the parentheses the output is nearly the whole data frame in reverse order.
I can tell that precedence rules are at play, but I am unable to figure out the exact logic. I am new to R, so any formal explanation would be valuable.
As suspected correctly in the post, the observed behavior is in fact a matter of operator precedence.
A complete list of the operator syntax and precedence rules in R can be obtained by typing
help(Syntax)
in the console.
In this context, R programmers sometimes refer to a well-known and rather witty quote which encourages the use of parentheses:
library(fortunes)
fortune(138)
nrow(tdata) = 153
So the first line you run is:
temp <- tdata[(nrow(tdata)-1):nrow(tdata),]
This executes as tdata[152:153,]
Second line:
temp <- tdata[nrow(tdata)-1:nrow(tdata),]
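Because : has higher precedence than binary -, this executes as tdata[153-(1:153),], i.e. tdata[152:0,]
So it returns the following rows, in this order:
tdata[152,]
tdata[151,]
...
tdata[1,]
The final index 0 selects nothing, so it is silently dropped.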
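The difference is easy to check on plain numbers. The sketch below assumes the built-in airquality dataset as a stand-in for hw1_data.csv (it has the same columns and 153 rows, but that equivalence is an assumption); the variable names are illustrative.

```r
tdata <- airquality   # assumed stand-in for read.csv("hw1_data.csv")
n <- nrow(tdata)

with_parens    <- (n - 1):n   # subtraction first, then the sequence
without_parens <- n - 1:n     # the sequence 1:n first, then n minus each element

print(with_parens)
print(head(without_parens))
print(tail(without_parens, 1))

# Index 0 is dropped when subsetting, so one row fewer than n comes back
last_two <- tdata[with_parens, ]
reversed <- tdata[without_parens, ]
print(nrow(reversed))
```

tail(tdata, 2) gives the same two rows as the parenthesised version.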

R sum multiple columns with multiple row

So I have this data:
10 21 22 23 23 43
20 12 26 43 23 65
21 54 64 73 25 75
My expected outcome is:
142
189
312
I tried to use:
df = data.matrix(df)
df = colSums(df)
df = as.data.frame(df)
However, the sums are wrong. How can I improve or correct this solution?
We can use rowSums
rowSums(df)
#[1] 142 189 312
If your data is stored as factors (check with str(df)), you must convert it to numeric using as.numeric(as.character()), column by column.
In that situation I suggest doing:
df[] <- lapply(df, function(col) as.numeric(as.character(col)))
rowSums(df)
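A small reproduction may help to see why factors matter here. The sketch below rebuilds the three rows with every column deliberately stored as a factor (an assumption about how the data was read in); data.matrix() replaces factors by their internal codes, so summing its result does not give the expected totals.

```r
# The question's data, with each column forced to be a factor
df <- data.frame(
  V1 = factor(c("10", "20", "21")),
  V2 = factor(c("21", "12", "54")),
  V3 = factor(c("22", "26", "64")),
  V4 = factor(c("23", "43", "73")),
  V5 = factor(c("23", "23", "25")),
  V6 = factor(c("43", "65", "75"))
)

# Sums of the factor codes, not of the printed numbers
wrong <- rowSums(data.matrix(df))

# Convert each column via character back to numeric, then sum by row
df_num <- df
df_num[] <- lapply(df, function(col) as.numeric(as.character(col)))
right <- rowSums(df_num)

print(wrong)
print(right)
```

Note also that the original attempt used colSums, which totals each column; the expected output is one total per row, so rowSums is the function to use.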

Summing values after every third position in data frame in R

I am new to R. I have a data frame like following
>df=data.frame(Id=c("Entry_1","Entry_1","Entry_1","Entry_2","Entry_2","Entry_2","Entry_3","Entry_4","Entry_4","Entry_4","Entry_4"),Start=c(20,20,20,37,37,37,68,10,10,10,10),End=c(50,50,50,78,78,78,200,94,94,94,94),Pos=c(14,34,21,50,18,70,101,35,2,56,67),Hits=c(12,34,17,89,45,87,1,5,6,3,26))
Id Start End Pos Hits
Entry_1 20 50 14 12
Entry_1 20 50 34 34
Entry_1 20 50 21 17
Entry_2 37 78 50 89
Entry_2 37 78 18 45
Entry_2 37 78 70 87
Entry_3 68 200 101 1
Entry_4 10 94 35 5
Entry_4 10 94 2 6
Entry_4 10 94 56 3
Entry_4 10 94 67 26
For each entry I would like to iterate over the data frame in 3 different modes. For example, for Entry_1, mode_1 = seq(20,50,3), mode_2 = seq(21,50,3) and mode_3 = seq(22,50,3). I would like to sum all the values in column "Hits" whose corresponding values in column "Pos" fall in mode_1, mode_2 or mode_3, and generate a data frame like the following:
Id Mode_1 Mode_2 Mode_3
Entry_1 0 17 34
Entry_2 87 89 0
Entry_3 1 0 0
Entry_4 26 8 0
I tried the following code:
mode_1=0
mode_2=0
mode_3=0
mode_1_sum=0
mode_2_sum=0
mode_3_sum=0
for(i in dim(df)[1])
{
if(df$Pos[i] %in% seq(df$Start[i],df$End[i],3))
{
mode_1_sum=mode_1_sum+df$Hits[i]
print(mode_1_sum)
}
mode_1=mode_1_sum+counts
print(mode_1)
ifelse(df$Pos[i] %in% seq(df$Start[i]+1,df$End[i],3))
{
mode_2_sum=mode_2_sum+df$Hits[i]
print(mode_2_sum)
}
mode_2_sum=mode_2_sum+counts
print(mode_2)
ifelse(df$Pos[i] %in% seq(df$Start[i]+2,df$End[i],3))
{
mode_3_sum=mode_3_sum+df$Hits[i]
print(mode_3_sum)
}
mode_3_sum=mode_3_sum+counts
print(mode_3_sum)
}
But the above code only prints 26. Can anyone guide me on how to generate my desired output? I can provide more details if needed. Thanks in advance.
It's not an elegant solution, but it works.
m <- 3 # Number of modes you want
# Mode (1..m) of each Pos, or 0 if Pos lies outside [Start, End]
foo <- ((df$Pos - df$Start)%%m + 1) * (df$Start <= df$Pos) * (df$End >= df$Pos)
tab <- matrix(0,nrow(df),m)
for(i in 1:m) tab[foo==i,i] <- df$Hits[foo==i]
aggregate(tab,list(df$Id),FUN=sum)
# Group.1 V1 V2 V3
# 1 Entry_1 0 17 34
# 2 Entry_2 87 89 0
# 3 Entry_3 1 0 0
# 4 Entry_4 26 8 0
-- EXPLANATION --
First, we find the indices of df$Pos that lie between df$Start and df$End (inclusive, because seq(Start, End, 3) contains Start itself). These comparisons give 1 if TRUE and 0 if FALSE. Next, we take the difference between df$Pos and df$Start, take it mod 3 (which gives a vector of 0s, 1s and 2s), and add 1 to get the right mode. We multiply these things together, so that the values that fall within the interval retain the right mode and the values that fall outside the interval become 0.
Next, we create a matrix of zeros that will contain the values. Then, we use a for-loop to fill in the matrix, one column per mode. Finally, we aggregate the matrix by Id.
I tried looking for a quicker solution, but the main problem I could not work around is the varying intervals for each row.
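The same modular-arithmetic idea can also be sketched with xtabs(), which builds the Id-by-mode table in one step. This is only an illustration under the assumption that a position counts when it lies in [Start, End] inclusive; the column name mode and the labels Mode_1..Mode_3 are chosen here to match the desired output.

```r
df <- data.frame(
  Id    = c(rep("Entry_1", 3), rep("Entry_2", 3), "Entry_3", rep("Entry_4", 4)),
  Start = c(20, 20, 20, 37, 37, 37, 68, 10, 10, 10, 10),
  End   = c(50, 50, 50, 78, 78, 78, 200, 94, 94, 94, 94),
  Pos   = c(14, 34, 21, 50, 18, 70, 101, 35, 2, 56, 67),
  Hits  = c(12, 34, 17, 89, 45, 87, 1, 5, 6, 3, 26)
)

m <- 3  # number of modes

# Rows whose position falls inside the entry's interval
inside <- df$Pos >= df$Start & df$Pos <= df$End

# Offset from Start modulo m decides the mode (0 -> Mode_1, 1 -> Mode_2, ...)
df$mode <- factor((df$Pos - df$Start) %% m + 1,
                  levels = 1:m, labels = paste0("Mode_", 1:m))

# Sum Hits for every Id/mode combination; empty cells become 0
result <- xtabs(Hits ~ Id + mode, data = df[inside, ])
print(result)
```

Because mode is a factor with all m levels declared, modes that never occur for an entry still appear as 0 columns.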

Looping through rows, creating and reusing multiple variables

I am building a streambed hydrology calculator in R using multiple tables from an Access database. I am having trouble automating and calculating the same set of indices for multiple sites. The following sample dataset describes my data structure:
> Thalweg
StationID AB0 AB1 AB2 AB3 AB4 AB5 BC1 BC2 BC3 BC4 Xdep_Vdep
1 1AAUA017.60 47 45 44 55 54 6 15 39 15 11 18.29
2 1AXKR000.77 30 27 24 19 20 18 9 12 21 13 6.46
3 2-BGU005.95 52 67 62 42 28 25 23 26 11 19 20.18
4 2-BLG011.41 66 85 77 83 63 35 10 70 95 90 67.64
5 2-CSR003.94 29 35 46 14 19 14 13 13 21 48 6.74
where each column represents certain field-measured parameters (i.e. depth of a reach section) and each row represents a different site.
I have successfully used the apply functions to simultaneously calculate simple functions on multiple rows:
> Xdepth <- apply(Thalweg[, 2:11], 1, mean) # Mean Depth
> Xdepth
1 2 3 4 5
33.1 19.3 35.5 67.4 25.2
and appending the results back to the proper station in a dataframe.
However, I am struggling when I want to calculate and save variables that are subsequently used for further calculations. I cannot seem to loop or apply the same function to multiple columns on a single row and complete the same calculations over the next row without mixing variables and data.
I want to do:
Residual_AB0 <- min(Xdep_Vdep, Thalweg$AB0)
Residual_AB1 <- min((Residual_AB0 + other_variables), Thalweg$AB1)
Residual_AB2 <- min((Residual_AB1 + other_variables), Thalweg$AB2)
Residual_AB3 <- min((Residual_AB2 + other_variables), Thalweg$AB3)
# etc.
Depth_AB0 <- (Thalweg$AB0 - Residual_AB0)
Depth_AB1 <- (Thalweg$AB1 - Residual_AB1)
Depth_AB2 <- (Thalweg$AB2 - Residual_AB2)
# etc.
I have tried and subsequently failed at for loops such as:
for (i in nrow(Thalweg)){
Residual_AB0 <- min(Xdep_Vdep, Thalweg$AB0)
Residual_AB1 <- min((Residual_AB0 + Stacks_Equation), Thalweg$AB1)
Residual_AB2 <- min((Residual_AB1 + Stacks_Equation), Thalweg$AB2)
Residual_AB3 <- min((Residual_AB2 + Stacks_Equation), Thalweg$AB3)
Residuals <- data.frame(Thalweg$StationID, Residual_AB0, Residual_AB1, Residual_AB2, Residual_AB3)
}
Is there a better way to approach looping through multiple lines of data when I need unique variables saved for each specific row that I am currently calculating? Thank you for any suggestions.
Your exact problem is still a mystery to me...
but it looks like you want a double for loop that keeps one residual per site:
residual <- numeric(nrow(Thalweg))
for(i in 1:nrow(Thalweg)){
  residual[i] <- Thalweg[i,"Xdep_Vdep"]
  for(j in 2:11){
    residual[i] <- min(residual[i],Thalweg[i,j])
  }
}
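Two details of the attempted loop are worth noting: for (i in nrow(Thalweg)) visits only the single value nrow(Thalweg) rather than every row, and min() collapses a whole column to one number, whereas pmin() works element-wise across sites. A minimal sketch of the chained recurrence from the question follows. It covers only AB0 to AB3, and step is a hypothetical placeholder for "other_variables" / "Stacks_Equation", whose real definition is not given.

```r
Thalweg <- data.frame(
  StationID = c("1AAUA017.60", "1AXKR000.77", "2-BGU005.95",
                "2-BLG011.41", "2-CSR003.94"),
  AB0 = c(47, 30, 52, 66, 29),
  AB1 = c(45, 27, 67, 85, 35),
  AB2 = c(44, 24, 62, 77, 46),
  AB3 = c(55, 19, 42, 83, 14),
  Xdep_Vdep = c(18.29, 6.46, 20.18, 67.64, 6.74)
)

step <- 0  # hypothetical stand-in for the unspecified "other_variables"
cols <- c("AB0", "AB1", "AB2", "AB3")

# One row per site, one column per residual
residuals <- matrix(NA_real_, nrow(Thalweg), length(cols),
                    dimnames = list(Thalweg$StationID,
                                    paste0("Residual_", cols)))

prev <- Thalweg$Xdep_Vdep            # starting value for every site
for (k in seq_along(cols)) {
  add  <- if (k == 1) 0 else step    # the first residual has no added term
  prev <- pmin(prev + add, Thalweg[[cols[k]]])   # element-wise over sites
  residuals[, k] <- prev
}

# Depth_ABk = ABk - Residual_ABk for every site at once
depths <- as.matrix(Thalweg[cols]) - residuals
print(residuals)
print(depths)
```

Looping over columns and letting pmin() handle all sites at once keeps each site's residual chain separate, so no per-row variables are needed.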
