Combine rows into one, replace NA - r

I have the following dataframe: (this is just a small sample)
VALUE COUNT AREA n_dd-2000 n_dd-2001 n_dd-2002 n_dd-2003 n_dd-2004 n_dd-2005 n_dd-2006 n_dd-2007 n_dd-2008 n_dd-2009 n_dd-2010
2 16 2431 243100 NA NA NA NA NA NA 3.402293 3.606941 4.000461 3.666381 3.499614
3 16 2610 261000 3.805082 4.013435 3.98 3.490139 3.433857 3.27813 NA NA NA NA NA
4 16 35419 3541900 NA NA NA NA NA NA NA NA NA NA NA
and I would like to combine all three rows into one row replacing NA with the number that appears in each column (there's only one number per column). Just ignore the first three columns. I used this code:
bdep[4,4:9] <- bdep[3,4:9]
to replace NA's with numbers from another row, but can't figure out how to repeat it for all the columns. The columns 4 and beyond have a sequence in each row of six numbers followed by 20 NA's, so I've tried going down the road of using lapply() and seq() or for loops, but my efforts are failing.

A simple solution is to replace the NAs with zeroes and then sum each column across all rows. Does this work for you?
#data
bdep <- rbind(c(rep(NA,6),3.402293,3.606941,4.000461,3.666381,3.499614),
c(3.805082,4.013435,3.98,3.490139,3.433857,3.27813, rep(NA,5)),
c(rep(NA,11)))
#solution
bdep2 <- ifelse(is.na(bdep), 0, bdep)
bdep3 <- apply(bdep2, 2, sum)
bdep3 #the row you want?
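The same column-wise collapse can be sketched with colSums() instead of ifelse() plus apply(). One thing to watch: a column that is NA in every row comes out as 0 with either approach, so the sketch below puts NA back for such columns. The small matrix is made-up illustration data, not the real bdep.

```r
# Toy matrix: one value per column at most, third row entirely NA
bdep <- rbind(c(NA, NA, 3.4),
              c(3.8, NA, NA),
              c(NA, NA, NA))

# Sum each column while skipping NAs
collapsed <- colSums(bdep, na.rm = TRUE)

# Columns with no observed value would otherwise show 0; restore NA there
collapsed[colSums(!is.na(bdep)) == 0] <- NA
collapsed
```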

I finally came to a solution by patching together some code I found in other posts (esp. sequencing and for loops). I think this would be considered messy coding, so I'd welcome other solutions. This should better describe what I was trying to do in the OP, where I was trying to generalize too much. Specifically, I have 17 variables, measured over 14 years (that's 238 columns), and something happened while generating these data where the first 6 years of a variable are in one row and the following 8 years are in the other row, so instead of re-running the model, I just wanted to combine the two rows into one.
Below are some sample data, simplified from my real scenario.
Create the data frame:
df <- data.frame(
VALUE = c(16, 16, 16),
COUNT = c(2431, 2610, 35419),
AREA = c(243100, 261000, 3541900),
n_dd_2000 = c(NA, 3.805, NA),
n_dd_2001 = c(3.402, NA, NA)
)
The next two lines describe the blocks of columns to copy. In info, the blocks start at column 4 and advance by 1 column, for 2 blocks; in info2, there is a single block starting at column 5. In both, len gives the number of columns in each block:
info <- data.frame(start=seq(4, by=1, length.out=2), len=rep(1,2))
info2 <- data.frame(start=seq(5, by=1, length.out=1), len=rep(1,1))
This is the code from my real dataset, where I started at column 4, repeated the pattern every 14 columns, out 17 times, and looked at the first 6, then 8 columns: info <- data.frame(start=seq(4, by=14, length.out=17), len=rep(c(6,8),17))
The two for loops below write the specified values in the sequence from row 2 and row 1 to row 3, respectively:
foo = sequence(info$len) + rep(info$start-1, info$len)
foo2 = sequence(info2$len) + rep(info2$start-1, info2$len)
for(n in 1:length(foo)){
df[3,foo[n]] <- df[2,foo[n]]
}
for(n in 1:length(foo2)){
df[3,foo2[n]] <- df[1,foo2[n]]
}
Then I removed the first two rows I got those values from and I'm left with one complete row, no NA's:
df <- df[-(1:2),]
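Since there is at most one number per column, an alternative that avoids the column bookkeeping is to take the first non-NA value in each measurement column. This is a minimal base R sketch on the sample data above; measure_cols and combined are illustrative names, and it assumes the identifier columns of row 3 are the ones to keep.

```r
df <- data.frame(
  VALUE = c(16, 16, 16),
  COUNT = c(2431, 2610, 35419),
  AREA = c(243100, 261000, 3541900),
  n_dd_2000 = c(NA, 3.805, NA),
  n_dd_2001 = c(3.402, NA, NA)
)

# Columns 4 onward hold the measurements
measure_cols <- 4:ncol(df)

# Start from row 3 (keeps its VALUE, COUNT and AREA)
combined <- df[3, ]

# For every measurement column, pick the first non-NA entry (NA if there is none)
combined[measure_cols] <- lapply(df[measure_cols], function(x) x[!is.na(x)][1])
combined
```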

Related

Adding a vector to a column, without specifying the other columns

I would like to add a vector to a column, without specifying the other columns. I have example data as follows.
library(data.table)
dat <- fread("A B C D
one 2 three four
two 3 NA one")
vector_to_add <- c("five", "six")
Desired output:
out <- fread("A B C D
one 2 three four
two 3 NA one
NA NA five NA
NA NA six NA")
I saw some answers using an approach where vectors are used to rowbind:
row3 <- c(NA, NA, "five", NA)
I would however like to find a solution in which I do not have to specify the whole row.
EDIT: Shortly after posting I realised that it would probably be easiest to take an existing row, make the row NA, and replace the value in the column where the vector would be added, for each entry in the vector. This is however still quite a cumbersome solution I guess.
If you name your vector, then you can rbind that column and fill the rest of the cells with NAs.
df_to_add <- data.frame(C=c("five", "six"))
rbind(dat, df_to_add, fill=TRUE)
A B C D
1: one 2 three four
2: two 3 <NA> one
3: <NA> NA five <NA>
4: <NA> NA six <NA>
You can also use the rbindlist() function from the data.table package to append the vector without spelling out the other columns. rbindlist() takes a list of data.tables, data.frames or lists and binds them by row into a single data.table.
In your case, wrap the vector in a one-column data.table named after the target column and pass fill = TRUE, so that the columns missing from it are filled with NA:
library(data.table)
dat <- fread("A B C D
one 2 three four
two 3 NA one")
vector_to_add <- c("five", "six")
# Wrap the vector in a one-column data.table named after the target column
to_add <- data.table(C = vector_to_add)
# fill = TRUE pads the columns absent from to_add (A, B, D) with NA
out <- rbindlist(list(dat, to_add), fill = TRUE)
After running this code, the data table out should contain the desired output:
A B C D
1: one 2 three four
2: two 3 <NA> one
3: <NA> NA five <NA>
4: <NA> NA six <NA>
Further vectors can be appended in the same way by adding more items to the list passed to rbindlist().
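The idea from the question's edit (take all-NA rows and fill in only the target column) can also be sketched in base R, without data.table. This assumes dat is a plain data.frame with character columns; new_rows is an illustrative name.

```r
dat <- data.frame(
  A = c("one", "two"),
  B = c(2L, 3L),
  C = c("three", NA),
  D = c("four", "one"),
  stringsAsFactors = FALSE
)
vector_to_add <- c("five", "six")

# Indexing rows with NA yields all-NA rows that keep the column names and types
new_rows <- dat[rep(NA_integer_, length(vector_to_add)), ]

# Fill only the target column
new_rows$C <- vector_to_add

out <- rbind(dat, new_rows)
rownames(out) <- NULL
out
```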

how to fill missing values in a vector with the mean of value before and after the missing one

Currently I am trying to impute values in a vector in R. The conditions of the imputation are:
Find all NA values.
Then check if they have an existing value before and after them.
Also check if the value which follows the NA is larger than the value before the NA.
If the conditions are met, calculate a mean taking the values before and after.
Replace the NA value with the imputed one.
# example one
input_one = c(1,NA,3,4,NA,6,NA,NA)
# example two
input_two = c(NA,NA,3,4,5,6,NA,NA)
# example three
input_three = c(NA,NA,3,4,NA,6,NA,NA)
I started out to write code to detect the values which can be imputed. But I got stuck with the following.
# incomplete function to detect the values
sapply(split(!is.na(input[c(rbind(which(is.na(c(input)))-1, which(is.na(c(input)))+1))]),
rep(1:(length(!is.na(input[c(which(is.na(c(input)))-1, which(is.na(c(input)))+1)]))/2), each = 2)), all)
This however only detects the NAs which might be imputable and it only works with example one. It is incomplete and unfortunately super hard to read and understand.
Any help with this would be highly appreciated.
We can use dplyr's lag and lead functions for that:
input_three = c(NA,NA,3,4,NA,6,NA,NA)
library(dplyr)
ifelse(is.na(input_three) & lead(input_three) > lag(input_three),
(lag(input_three) + lead(input_three))/ 2,
input_three)
Returns:
[1] NA NA 3 4 5 6 NA NA
Edit
Explanation:
We use ifelse which is the vectorized version of if. I.e. everything within ifelse will be applied to each element of the vectors.
First we test whether each element is NA and whether the following element is greater than the previous one. To get the previous and following elements we can use dplyr's lag and lead functions:
lag offsets a vector to the right (default is 1 step):
lag(1:5)
Returns:
[1] NA 1 2 3 4
lead offsets a vector to the left:
lead(1:5)
Returns:
[1] 2 3 4 5 NA
Now to the 'test' clause of ifelse:
is.na(input_three) & lead(input_three) > lag(input_three)
Which returns:
[1] NA NA FALSE FALSE TRUE FALSE NA NA
Then, if the ifelse test evaluates to TRUE, we return the sum of the previous and following elements divided by 2; otherwise we return the original element.
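If dplyr is not available, the same shift-and-compare logic can be sketched in base R. Here prev and nxt are illustrative stand-ins for lag() and lead(), built by padding the vector with NA on one side.

```r
input_three <- c(NA, NA, 3, 4, NA, 6, NA, NA)

# Shift right: each position sees the value before it
prev <- c(NA, head(input_three, -1))
# Shift left: each position sees the value after it
nxt <- c(tail(input_three, -1), NA)

# Impute only where the element is NA and the next value exceeds the previous one
result <- ifelse(is.na(input_three) & nxt > prev,
                 (prev + nxt) / 2,
                 input_three)
result
# [1] NA NA  3  4  5  6 NA NA
```

Positions where a neighbour is missing make the test NA, so ifelse() returns NA there, which leaves the leading and trailing gaps untouched.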
Here's an example using the imputeTS library. It takes account of more than one NA in the sequence, ensures that the mean is calculated if the next valid observation is greater than the last valid observation and also ignores NA at the beginning and end.
library(imputeTS)
library(dplyr) # provides the lag() and lead() used below
myimpute <- function(series) {
# Find where each NA is
nalocations <- is.na(series)
# Find the previous and the next observation for each row
last1 <- lag(series)
next1 <- lead(series)
# Carry forward the last and next observations over sequences of NA
# Each row will then get a last and next that can be averaged
cflast <- na_locf(last1, na_remaining = 'keep')
cfnext <- na_locf(next1, option = 'nocb', na_remaining = 'keep')
# Make a data frame
df <- data.frame(series, nalocations, last1, cflast, next1, cfnext)
# Calculate the mean where there is currently a NA
# making sure that the next is greater than the last
df$mean <- ifelse(df$nalocations, ifelse(df$cflast < df$cfnext, (df$cflast+df$cfnext)/2, NA), NA)
imputedseries <- ifelse(df$nalocations, ifelse(!is.na(df$mean), df$mean, NA), series)
#list(df, imputedseries) # comment this in and return it to see the intermediate data frame for debugging
imputedseries
}
myimpute(c(NA,NA,3,4,NA,NA,6,NA,NA,8,NA,7,NA,NA,9,NA,11,NA,NA))
# [1] NA NA 3 4 5 5 6 7 7 8 NA 7 8 8 9 10 11 NA NA
There is also the na_ma function in the imputeTS package for imputing moving averages.
In your case this would be with the following settings:
na_ma(x, k = 1, weighting = "simple")
k = 1 (meaning 1 value before and 1 after the NA are taken into account)
weighting = "simple" (the mean of these two values is calculated)
This can be applied quite easily with basically 1 line of code:
library(imputeTS)
na_ma(yourData, k = 1, weighting = "simple")
You could also choose to take more values before and after the NA into account, e.g. k = 3. An interesting feature when you take more than one value on each side into account is the possibility of choosing a different weighting, e.g. with weighting = "linear" the weights decrease in arithmetic progression (a Linear Weighted Moving Average), meaning the further the values are from the NA, the less impact they have.

Replace NA for multiple columns with average of values from other dataframe

I am trying to replace NA values in multiple columns from dataframe x1 by the average of the values from dataframes x2 and x3, based on the common and distinct attribute 'ID'.
All the dataframes(each dataframe is for a particular year) have the same column structure:
ID A B C .....
01 2 5 7 .....
02 NA NA NA .....
03 5 4 8 .....
I have found an answer to do it for 1 column at a time, thanks to this post.
x1$A[is.na(x1$A)] <- (x2$A[match(x1$ID[is.na(x1$A)],x2$ID)] + x3$A[match(x1$ID[is.na(x1$A)],x3$ID)])/2
But since I have about 100 columns to apply this to, I would really like to have a smarter way to do it.
I tried the suggestions from this post and also from here.
I came up with this code, but couldn't make it work.
x1[6:105] = as.data.frame(lapply(x1[6:105], function(x) ifelse(is.na(x), (x2$x[match(x1$ID, x2$ID)]+x3$x[match(x1$ID, x3$ID)])/2, x1$x)))
Got the following error:
Error in ifelse(is.na(x), (x2$x[match(x1$ID, x2$ID)] + x3$x[match(x1$ID, : replacement has length zero
I initially thought function(x) worked on the entire column and x represented the column name, but I think it represents each individual cell value and that is why it won't work.
I am a novice in R, I would surely appreciate some guidance to let me know where I am going wrong, in applying the logic to multiple columns.
for (i in 1:ncol(x1)) {
nas <- is.na(x1[,i]) # where are NAs
if (sum(nas)==0) next
ids <- x1$ID[nas] # ids of NAs
nam <- colnames(x1)[i] # colname of the column
x1[nas, i] <- (x2[match(ids, x2$ID), nam] + x3[match(ids, x3$ID), nam]) / 2 # look up the same IDs in x2 and x3
}
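On why the lapply() attempt fails: inside lapply(x1[6:105], function(x) ...), x is the vector of values of one column, not its name, so x2$x looks for a column literally called "x" and returns NULL. A loop-free variant of the approach above can be sketched by aligning x2 and x3 to x1 once and filling all columns through a logical mask. The three small data frames are invented for illustration, and value_cols is a hypothetical name for the columns to fill.

```r
x1 <- data.frame(ID = c("01", "02", "03"), A = c(2, NA, 5), B = c(5, NA, 4))
x2 <- data.frame(ID = c("03", "02", "01"), A = c(1, 4, 3), B = c(2, 6, 1))
x3 <- data.frame(ID = c("02", "01", "03"), A = c(6, 9, 7), B = c(8, 3, 5))

# Every column except the key gets filled
value_cols <- setdiff(names(x1), "ID")

# Row positions in x2 and x3 that correspond to each row of x1
i2 <- match(x1$ID, x2$ID)
i3 <- match(x1$ID, x3$ID)

# Element-wise average of the aligned x2 and x3 values
avg <- (as.matrix(x2[i2, value_cols]) + as.matrix(x3[i3, value_cols])) / 2

# Replace only the NA cells of x1 with the matching averages
vals <- as.matrix(x1[value_cols])
na_mask <- is.na(vals)
vals[na_mask] <- avg[na_mask]
x1[value_cols] <- vals
x1
```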

Recode values omitting NA's

I want to recode the values in a matrix in such a way that all values <=.2 become 2, <=.4 become 3 etc. However, there are missing values in my data, which I do not want to change (keep them NA). Below is a simplified version of my code. Using na.omit works perfectly for the first changes
try <- matrix(c(0.78,0.62,0.29,0.47,0.30,0.63,0.30,0.20,0.15,0.58,0.52,0.64,
0.76,0.32,0.64,0.50,0.67,0.27, NA), nrow = 19)
try[na.omit(try <= .2)] <- 2 #Indeed changes .20 and .15 to 2 and leaves the NA as NA
However, when I do the same for a higher category, the NA is also changed:
try[na.omit(try <= .8)] <- 5 #changes all other values including the NA to 5
Can someone explain to me what is the difference between the two and why the second one also changes the NA-value while the first one does not? Or am I doing something else wrong?
You can do
try[try <= .8] <- 5
The NA values will remain as NA
Or create a logical condition to exclude the NA values
try[try <=.8 & !is.na(try)] <- 5
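As for why the two na.omit() calls behave differently: na.omit() drops the NA from the logical index, leaving it one element shorter than the data, and R then recycles the index from its start to cover the last position. Whether the NA gets overwritten therefore depends on whether the first element of the index happens to be TRUE. A minimal sketch on a short vector (x, idx_low and idx_high are illustrative names):

```r
x <- c(0.78, 0.20, 0.15, NA)

# Index for "<= .2" after na.omit: FALSE TRUE TRUE (length 3, data has length 4)
idx_low <- as.vector(na.omit(x <= .2))
low <- x
low[idx_low] <- 2    # 4th position reuses idx_low[1] = FALSE, so the NA survives

# Index for "<= .8" after na.omit: TRUE TRUE TRUE
idx_high <- as.vector(na.omit(x <= .8))
high <- x
high[idx_high] <- 5  # 4th position reuses idx_high[1] = TRUE, so the NA is overwritten

low
high
```

Indexing with the unmodified comparison, or adding !is.na(), keeps the index the same length as the data, so no recycling takes place.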

Data Manipulation, Looping to add columns

I have asked this question a couple times without any help. I have since improved the code so I am hoping somebody has some ideas! I have a dataset full of 0's and 1's. I simply want to add the 10 columns together resulting in 1 column with 3835 rows. This is my code thus far:
# select for valid IDs
data = history[history$studyid %in% valid$studyid,]
sibling = data[,c('b16aa','b16ba','b16ca','b16da','b16ea','b16fa','b16ga','b16ha','b16ia','b16ja')]
# replace all NA values by 0
sibling[is.na(sibling)] <- 0
# loop over all columns and count the number of 174
apply(sibling, 2, function(x) sum(x==174))
The problem is this code adds together all the rows, I want to add together all the columns so I would result with 1 column. This is the answer I am now getting which is wrong:
b16aa b16ba b16ca b16da b16ea b16fa b16ga b16ha b16ia b16ja
68 36 22 18 9 5 6 5 4 1
In apply() you have the MARGIN set to 2, which is columns. Set the MARGIN argument to 1, so that your function, sum, will be applied across rows. This was mentioned by @sgibb.
If that doesn't work (can't reproduce example), you could try first converting the elements of the matrix to logicals X2 <- apply(sibling, c(1,2), function(x) x==174), and then use rowSums to add up the columns in each row: Xsum <- rowSums(X2, na.rm=TRUE). With this setup you do not need to first change the NA's to 0's, as you can just handle the NA's with the na.rm argument in rowSums()
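A compact sketch of the rowSums() idea, assuming the goal is one count per row of how many of the columns equal 174. The comparison is vectorised over the whole data frame, so the apply() step can be dropped. The three-column sibling data frame is a made-up stand-in for the real ten columns.

```r
sibling <- data.frame(
  b16aa = c(174, 0, NA),
  b16ba = c(174, 174, 0),
  b16ca = c(0, NA, 0)
)

# sibling == 174 gives a logical matrix; rowSums counts the TRUEs in each row
per_row <- rowSums(sibling == 174, na.rm = TRUE)
per_row
```

The result has one element per row of sibling, so it can be attached as a new column with sibling$count <- per_row.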
