I have a time series dataset with 1000 columns. Each row is, of course, a different record. There are some NA values scattered throughout the dataset.
I would like to replace each NA with either the adjacent left value or the adjacent right value; it doesn't matter which.
A neat solution and one which I was going for is to replace each NA with the value to its right, unless it is in the last column, in which case replace it with the value to its left.
I was just going to do a for loop, but I assume a function would be more efficient. Essentially, I wasn't sure how to reference the adjacent values.
Here is what I was trying:
for (entry in dataset) {
  if (any(is.na(entry)) == TRUE && entry[, 1:999]) {
    entry = entry[, 1]
  } else if (any(is.na(entry)) == TRUE && entry[, 1000]) {
    entry = cell[, -1]
  }
}
As you can tell, I'm inexperienced with R :) I'm not really sure how you index the values to the left or right.
I would suggest using na.locf on the transpose of your dataset.
The na.locf function from the zoo package is designed to replace each NA with the nearest non-NA observation. It fills along columns, and since you want to fill along rows, we just transpose the dataset first:
library(zoo)
df <- matrix(c(1, 3, 4, 10, NA, 52, NA, 11, 100), ncol = 3)
step1 <- t(na.locf(t(df), fromLast = TRUE))   # each NA takes the value to its right
step2 <- t(na.locf(t(step1), fromLast = FALSE)) # leftover NAs (last column) take the value to the left
print(df)
####      [,1] [,2] [,3]
#### [1,]    1   10   NA
#### [2,]    3   NA   11
#### [3,]    4   52  100
print(step2)
####      [,1] [,2] [,3]
#### [1,]    1   10   10
#### [2,]    3   11   11
#### [3,]    4   52  100
I do it in two steps since the last column needs different treatment from the inner columns. If you know the dplyr package, it's even more straightforward to turn this into a function:
library(dplyr)
MyReplace <- function(data) {data %>% t %>% na.locf(fromLast = TRUE) %>% na.locf %>% t}
MyReplace(df)
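If you would rather avoid the zoo dependency, here is a minimal base-R sketch of the same rule (assuming numeric rows): it scans each row from the right so every NA picks up its right-hand neighbour, then sweeps left-to-right so a trailing NA in the last column falls back to the value on its left:
fill_row <- function(x) {
  # right-to-left pass: each NA takes the value to its right
  for (j in rev(seq_along(x)[-length(x)])) {
    if (is.na(x[j])) x[j] <- x[j + 1]
  }
  # left-to-right pass: anything still NA (last column) takes the value to its left
  for (j in seq_along(x)[-1]) {
    if (is.na(x[j])) x[j] <- x[j - 1]
  }
  x
}
t(apply(df, 1, fill_row))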
I have a very large dataset with multiple duplicated column names (the values within the columns are different). I would like to remove the columns with duplicate names and lower variability. My problem is that I have too many of these duplicate variables to do this manually. One path I am trying is to use read.csv(), which automatically appends '.1' to a duplicate column name, then make a vector of the variability of all the columns and try to work with that.
df<-data.frame("A"=c(1,5,10), "A.1"=c(2,2,2), "C"=c(1,5,10), "C.1"=c(2,2,2), "C.2"=c(2,5,10))
v <- lapply(df, function(x) var(x))
Is there a way to filter out duplicates based on variability when I am importing the dataset? Again, the biggest problem is that I have too many duplicates to do this manually. Thanks in advance!
Combining techniques from base R and tidyverse:
# calculate variance for each column
dvar <- apply(df, 2, var)
library(tidyverse)
# create data frame with column names
# "grouped" column names and variance
# find column with highest variance
keep_names <- data.frame(names = names(dvar),
grouping = gsub("[[:punct:]][0-9]", "", names(dvar)),
vals = dvar) %>%
group_by(grouping) %>%
slice_max(vals) %>%
pull(names)
# pull data
df[keep_names]
# A C
# 1 1 1
# 2 5 5
# 3 10 10
df <- data.frame("A"=c(1,5,10), "A.1"=c(2,2,2), "C"=c(1,5,10), "C.1"=c(2,2,2), "C.2"=c(2,5,10))
df["var",] <- apply(df, 2, var)
nrow(df)  # 4: the three data rows plus the appended "var" row
df <- df[, which(df["var",] >= 20)]
df
Imagine "20" being your threshold: columns whose variance falls below it are dropped. I used apply here and appended the result to the data frame.
           A        C
1    1.00000  1.00000
2    5.00000  5.00000
3   10.00000 10.00000
var 20.33333 20.33333
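One caveat: the appended "var" row stays in the data frame, so drop it again once the columns are filtered:
df <- df[rownames(df) != "var", ]  # remove the helper variance row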
I would like to propose a new version using exponential search and the power of data.table.
This is a function I implemented in the dataPreparation package.
The function
dataPreparation::which_are_in_double
which_are_in_double(df)
returns 3 and 4, the indices of the columns that are duplicated in your example.
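To actually drop those columns, negative indexing works (a small sketch, assuming the function returns the column indices as above):
df_clean <- df[, -which_are_in_double(df, verbose = FALSE)]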
Build a data set with the wanted dimensions for performance tests:
df<-data.frame("A"=c(1,5,10), "A.1"=c(2,2,2), "C"=c(1,5,10), "C.1"=c(2,2,2), "C.2"=c(2,5,10))
for (i in 1:20) {
  df = rbind(df, df)
}
This results in a data.frame of size (3,145,728, 5).
The benchmark
To perform the benchmark, I use the rbenchmark library, which repeats each computation 100 times:
library(rbenchmark)
library(dataPreparation)
benchmark(
  which_are_in_double(df, verbose = FALSE),
  apply(df, 2, var) == 0
)
test replications elapsed relative
2 apply(df, 2, var) == 0 100 38.298 3.966
1 which_are_in_double(df, verbose = FALSE) 100 9.656 1.000
So which_are_in_double is about 4 times faster than the other proposed solution. The nice thing is that the bigger the data.frame, the more pronounced the performance advantage becomes.
I have a data table which looks like:
require(data.table)
df <- data.table(Day = seq(as.Date('2014-01-01'), as.Date('2014-12-31'), by = 'days'), Number = 1:365)
I want to subset my data table so that it returns just those of the first 110 rows whose Number value is higher than 10. When I use
df2 <- subset(df[1:110,], df$Number[1:110] > 10)
everything works well. However, if I subset using
df2 <- subset(df[1:110,], df[1:110,2] > 10)
R returns the following error:
Error in `[.data.table`(x, r, vars, with = FALSE) :
i is invalid type (matrix). Perhaps in future a 2 column matrix could return a list of elements of DT (in the spirit of A[B] in FAQ 2.14). Please report to data.table issue tracker if you'd like this, or add your comments to FR #657.
Should the way of subsetting not be the same? The problem is that I want to use this subset inside an apply command, where the names of the data table change. Hence, I cannot use the column name with the $-operator to refer to the second column and want to use the index number instead, but it does not work. I could rename the data table columns, or read out the column names and use the $-operator, but my apply function runs over lots of entries and I want to minimize its workload.
So how do I make subsetting with the index number work, and why do I get the mentioned error in the first place? I would like to understand what my mistake is. Thanks!
First let's understand why it doesn't work in your case. When you are doing
df[1:110,2] > 10
# Number
# [1,] FALSE
# [2,] FALSE
# [3,] FALSE
# [4,] FALSE
# [5,] FALSE
# [6,] FALSE
# [7,] FALSE
#....
it returns a one-column matrix, which is then used for subsetting.
class(df[1:110,2] > 10)
#[1] "matrix"
This works fine on a data frame:
df1 <- data.frame(df)
subset(df1[1:110,], df1[1:110,2] > 10)
# Day Number
#11 2014-01-11 11
#12 2014-01-12 12
#13 2014-01-13 13
#14 2014-01-14 14
#15 2014-01-15 15
#....
but not on a data.table. Unfortunately, subsetting doesn't work that way in data.table. You could convert the condition into a vector instead of a matrix and then use it for subsetting:
subset(df[1:110,], df[1:110][[2]] > 10)
# Day Number
# 1: 2014-01-11 11
# 2: 2014-01-12 12
# 3: 2014-01-13 13
# 4: 2014-01-14 14
# 5: 2014-01-15 15
#...
The difference becomes clearer when you compare the results of
df[matrix(TRUE), ]
vs
df1[matrix(TRUE), ]
PS - in the first case doing
subset(df[1:110,], Number > 10)
would also have worked.
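For completeness, data.table's own syntax sidesteps subset entirely; a minimal sketch of the chained form, which first takes the rows and then filters them:
df[1:110][Number > 10]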
Ultimately, I need to search column 1 of my data frame, and if I find a match to var, then I need to get the element next to it in the same row (the next column over).
I thought I could use match or %in% to find the index in a vector I got from the data frame, but I keep getting NA.
One example I looked at is
Is there an R function for finding the index of an element in a vector?, and I don't understand why my code is any different from the answers.
So in my code below, if I find b in column 1, I eventually want to get 20 from the data frame.
What am I doing wrong? I would prefer to stick to base R if possible.
> df = data.frame(names = c("a","b","c"),weight = c(10,20,30),height=c(5,10,15))
> df
names weight height
1 a 10 5
2 b 20 10
3 c 30 15
> vect = df[1]
> vect
names
1 a
2 b
3 c
> match("b", vect)
[1] NA
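For reference, the likely culprit: df[1] is still a one-column data frame rather than a vector, so match has nothing sensible to compare "b" against and returns NA. Extracting the column with [[ ]] (or $) makes it work; a minimal base-R sketch:
i <- match("b", df[[1]])  # df[[1]] is the underlying vector of names; i is 2
df[i, "weight"]           # 20, the element next to "b" in the same row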
I need to turn these two matrices corresponding to (toy) word counts:
a hope to victory win
[1,] 2 1 1 1 1
and
a chance than win
[1,] 1 1 1 1
where the word "a" appears a combined number of 3 times, and the word "win" appears 2 times (once in each matrix), into:
a win chance hope than to victory
[1,] 3 2 1 1 1 1 1
where equally-named columns combine into a single column that contains the sum.
And,
a hope to victory win chance than
[1,] 2 1 1 1 1 0 0
where the first matrix is preserved, and the columns of the second matrix that are not already present are attached at the end with all row values equal to zero.
So, if you store this data in a data frame (which is really recommended for this sort of data), the process is very simple.
(I'm including a conversion from the matrix format that works for any number of rows.)
Conversion:
newdf1 <- data.frame(Word = colnames(matrix1), Count = as.vector(t(matrix1)))
newdf2 <- data.frame(Word = colnames(matrix2), Count = as.vector(t(matrix2)))
Now you can use rbind + dplyr (or data.table).
dplyr solution:
library(dplyr)
df <- rbind(newdf1,newdf2)
result <- df %>% group_by(Word) %>% summarise(Count = sum(Count))
The answer to your second question is related:
result2 <- rbind(newdf1,data.frame(Word = setdiff(newdf2$Word,newdf1$Word), Count = 0))
(the data.table solution is very similar, but if you're new to data frames and grouping/reshaping, I recommend dplyr)
(EDITED the second solution so that it's actually giving you the unique entries)
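For reference, a sketch of the analogous data.table version, using the same newdf1/newdf2 built above:
library(data.table)
result <- rbindlist(list(newdf1, newdf2))[, .(Count = sum(Count)), by = Word]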
In R with a matrix:
one two three four
[1,] 1 6 11 16
[2,] 2 7 12 17
[3,] 3 8 11 18
[4,] 4 9 11 19
[5,] 5 10 15 20
I want to extract the submatrix whose rows have column three = 11. That is:
one two three four
[1,] 1 6 11 16
[3,] 3 8 11 18
[4,] 4 9 11 19
I want to do this without looping. I am new to R, so this is probably very obvious, but the documentation is often somewhat terse.
This is easier to do if you convert your matrix to a data frame using as.data.frame(). In that case the answers using subset or m$three will work; otherwise they will not.
To perform the operation on a matrix, you can define a column by name:
m[m[, "three"] == 11,]
Or by number:
m[m[,3] == 11,]
Note that if only one row matches, the result is an integer vector, not a matrix.
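To keep a matrix even when a single row matches, add drop = FALSE to the subscript:
m[m[, "three"] == 11, , drop = FALSE]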
I will choose a simple approach using the dplyr package.
If the data frame is called data:
library(dplyr)
result <- filter(data, three == 11)
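Note that filter expects a data frame; if you start from the matrix m above, convert it first, e.g.:
result <- filter(as.data.frame(m), three == 11)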
m <- matrix(1:20, ncol = 4)
colnames(m) <- letters[1:4]
The following command will select the first row of the matrix above.
subset(m, m[,4] == 16)
And this will select the last three.
subset(m, m[,4] > 17)
The result will be a matrix in both cases.
If you want to refer to columns by name when subsetting, you would be best off converting the matrix to a data frame with
mf <- data.frame(m)
Then you can select with
mf[ mf$a == 16, ]
Or, you could use the subset command.
subset is a very slow function, and I personally find it useless.
I assume you have a data.frame, array, or matrix called Mat with A, B, and C as column names; then all you need to do is:
In the case of one condition on one column, let's say column A:
Mat[which(Mat[,'A'] == 10), ]
In the case of multiple conditions on different columns, you can create a dummy variable. Suppose the conditions are A = 10, B = 5, and C > 2; then we have:
aux = which(Mat[,'A'] == 10)
aux = aux[which(Mat[aux,'B'] == 5)]
aux = aux[which(Mat[aux,'C'] > 2)]
Mat[aux, ]
Testing the speed with system.time shows the which method to be roughly 10x faster than the subset method.
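Assuming there are no NAs in the columns, the same multi-condition filter can also be written as a single expression by combining the tests with &:
Mat[Mat[, 'A'] == 10 & Mat[, 'B'] == 5 & Mat[, 'C'] > 2, ]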
If your matrix m has been converted to a data frame (the $ operator does not work on a matrix), just use:
R> m[m$three == 11, ]
If the dataset is called data, then all the rows meeting the condition pm2.5 > 300 can be retrieved with:
data[data['pm2.5'] > 300, ]
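Using the $ form gives a plain logical vector rather than a one-column matrix, which is the more conventional way to write the same thing:
data[data$pm2.5 > 300, ]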