Filling NA in a data frame with a specified rule in R - r

Suppose I have a data frame (df1) with a column named x:
df1<-as.data.frame(x=c(4,3,2,16,7,8,9,1,12))
colnames(df1)<-"x"
df1[2,1]<-NA
df1[3,1]<-NA
df1[4,1]<-NA
The output is:
> df1
x
1 4
2 NA
3 NA
4 NA
5 7
6 8
7 9
8 1
9 12
I want to add a column to the data frame. The new column (y) should fill each NA with the nearest non-NA value above it.
The code and its output are below (this is what I want):
df1$y<-na.locf(df1, fromLast = FALSE)
> df1
x x
1 4 4
2 NA 4
3 NA 4
4 NA 4
5 7 7
6 8 8
7 9 9
8 1 1
9 12 12
Note: I don't understand why the second column's name is "x" although I defined it as "y".
However, the above method naturally gives an error when the first entry is NA, as below:
df2<-as.data.frame(c(4,3,2,16,7,8,9,1,12))
colnames(df2)<-"x"
df2[1,1]<-NA
df2[2,1]<-NA
df2[3,1]<-NA
> df2
x
1 NA
2 NA
3 NA
4 16
5 7
6 8
7 9
8 1
9 12
When I apply the below code:
df2$y<-na.locf(df2, fromLast = FALSE)
I get the below error:
Error in `$<-.data.frame`(`*tmp*`, "y", value = list(x = c(16, 7, 8, 9, :
replacement has 6 rows, data has 9
In such situations I just want the opposite of na.locf(df2, fromLast = FALSE), namely to fill the NAs with the first non-NA value below them.
Desired output is:
x y
1 NA 16
2 NA 16
3 NA 16
4 16 16
5 7 7
6 8 8
7 9 9
8 1 1
9 12 12
So using tryCatch function, I wrote the below code:
df2$y<-tryCatch(na.locf(df2, fromLast = FALSE),
error=function(err)
{na.locf(df2, fromLast = TRUE)})
However, I got such an error:
Error in `$<-.data.frame`(`*tmp*`, "y", value = list(x = c(16, 7, 8, 9, :
replacement has 6 rows, data has 9
So in summary the problem is:
if the data frame's first entry is not NA,then fill the NA with first element above
if the data frame's first entry is NA, then fill the NA with first element below.
How can I do this in R, especially with the tryCatch function? I also don't understand why the second column's name appears as "x" instead of "y".
I will be very glad for any help. Thanks a lot.

We can do a double na.locf with the first one having the option na.rm = FALSE
library(zoo)
na.locf(na.locf(df2, na.rm = FALSE), fromLast = TRUE)
# x
#1 16
#2 16
#3 16
#4 16
#5 7
#6 8
#7 9
#8 1
#9 12
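To see why two passes are needed, here is a minimal sketch on the bare vector rather than the data frame, assuming the zoo package is installed. The variable names short, fwd and filled are only for illustration.

```r
library(zoo)

x <- c(NA, NA, NA, 16, 7, 8, 9, 1, 12)

# The default na.rm = TRUE drops the leading NAs, which is what led to
# "replacement has 6 rows, data has 9" in the question
short <- na.locf(x)

# Pass 1: carry values forward but keep the leading NAs in place
fwd <- na.locf(x, na.rm = FALSE)

# Pass 2: carry values backward to fill whatever is still NA at the top
filled <- na.locf(fwd, fromLast = TRUE)
filled
```

When the first entry is not NA, the second pass has nothing left to fill, so the same two calls cover both cases from the question.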
If we want to have two columns
transform(df2, y = na.locf(na.locf(x, na.rm = FALSE), fromLast = TRUE))
# x y
#1 NA 16
#2 NA 16
#3 NA 16
#4 16 16
#5 7 7
#6 8 8
#7 9 9
#8 1 1
#9 12 12
NOTE: Make sure to assign it to a new object or to the same object i.e. df2 <- transform(...
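On the two side questions (the column printed as "x", and tryCatch not helping), the following base R sketch may help. It does not use zoo: the data frames here are small stand-ins for what na.locf(df1) and na.locf(df2) return, so treat it as an illustration under that assumption. As far as I can tell, the column is really stored under the name y, but it holds a one-column data frame whose inner name x is what print() displays. The error is raised by the assignment, not by the call inside tryCatch, so the handler never runs unless the assignment itself is wrapped.

```r
df <- data.frame(x = 1:9)

# Assigning a data frame (not a vector) creates a nested data-frame column
df$y <- data.frame(x = 1:9)
names(df)     # the outer column names
class(df$y)   # the column is itself a data frame with an inner column x

# A 6-row replacement fails during the assignment, so the assignment
# has to sit inside tryCatch() for the error handler to be called
short <- data.frame(x = 4:9)
status <- tryCatch({
  df$z <- short
  "assigned"
}, error = function(e) "caught")
status
```

Passing the vector, as in transform(df2, y = na.locf(...x...)) above, avoids the nested column altogether.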

Related

How to combine/concatenate two dataframes one after the other but not merging common columns in R?

Suppose there are two dataframes as follows with same column names and I want to combine/concatenate one after the other without merging the common columns. There is a way of assigning it columnwise like df1[3]<-df2[1] but would like to know if there's some other way.
df1<-data.frame(A=c(1:10), B=c(2:5, rep(NA,6)))
df2<-data.frame(A=c(12:20), B=c(32:40))
Expected Output:
A B A.1 B.1
1 2 12 32
2 3 13 33
3 4 14 34
4 5 15 35
5 NA 16 36
6 NA 17 37
7 NA 18 38
8 NA 19 39
9 NA 20 40
10 NA NA NA
I tend to work with multiple frames like this as a list of frames. Try this:
LOF <- list(df1, df2)
maxrows <- max(sapply(LOF, nrow))
out <- do.call(cbind, lapply(LOF, function(z) z[seq_len(maxrows),]))
names(out) <- make.names(names(out), unique = TRUE)
out
# A B A.1 B.1
# 1 1 2 12 32
# 2 2 3 13 33
# 3 3 4 14 34
# 4 4 5 15 35
# 5 5 NA 16 36
# 6 6 NA 17 37
# 7 7 NA 18 38
# 8 8 NA 19 39
# 9 9 NA 20 40
# 10 10 NA NA NA
One advantage of this is that it allows you to work with an arbitrary number of frames, not just two.
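As a quick check of that claim, here is the same code with a third frame added. df3 is a made-up frame for illustration only.

```r
df1 <- data.frame(A = 1:10, B = c(2:5, rep(NA, 6)))
df2 <- data.frame(A = 12:20, B = 32:40)
df3 <- data.frame(A = 1:3, B = 4:6)   # hypothetical third frame

LOF <- list(df1, df2, df3)
maxrows <- max(sapply(LOF, nrow))

# Indexing past the last row pads each frame with NA rows up to maxrows
out <- do.call(cbind, lapply(LOF, function(z) z[seq_len(maxrows), ]))

# Disambiguate the repeated column names
names(out) <- make.names(names(out), unique = TRUE)
out
```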
One base R way could be
setNames(Reduce(cbind.data.frame,
Map(`length<-`, c(df1, df2), max(nrow(df1), nrow(df2)))),
paste0(names(df1), rep(c('', '.1'), each=2)))
# A B A.1 B.1
# 1 1 2 12 32
# 2 2 3 13 33
# 3 3 4 14 34
# 4 4 5 15 35
# 5 5 NA 16 36
# 6 6 NA 17 37
# 7 7 NA 18 38
# 8 8 NA 19 39
# 9 9 NA 20 40
# 10 10 NA NA NA
Another option is to use the merge function. The documentation can be a bit cryptic, so here is a short explanation of the arguments:
by -- "the name "row.names" or the number 0 specifies the row names"
all = TRUE -- keeps all original rows from both dataframes
suffixes -- specify how you want the duplicated colnames to be distinguished
sort = FALSE -- do not sort the result on the by column
merge(df1, df2, by = 0, all = TRUE, suffixes = c('', '.1'), sort = FALSE)
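Note that merging on row names adds a Row.names column at the left, and the row order of a merge with sort = FALSE is not guaranteed. A small sketch of tidying the result afterwards, assuming the default 1..n row names (the name m is only for illustration):

```r
df1 <- data.frame(A = 1:10, B = c(2:5, rep(NA, 6)))
df2 <- data.frame(A = 12:20, B = 32:40)

m <- merge(df1, df2, by = 0, all = TRUE, suffixes = c("", ".1"), sort = FALSE)

# Row.names holds the row names as text, so convert before ordering
m <- m[order(as.integer(m$Row.names)), ]

# Drop the helper column and reset the row names
m$Row.names <- NULL
rownames(m) <- NULL
m
```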
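One way would be the following. Note that rbind() appends the NA vector as a single row, so this only works as written when df1 has exactly one more row than df2, and the column names are not made unique: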
cbind(
df1,
rbind(
df2,
rep(NA, nrow(df1) - nrow(df2))
)
)

Create multiple sums

Hi,
Here is a reproducible example.
df <- data.frame("STUDENT"=c(1,2,3,4,5),
"TEST1A"=c(NA,5,5,6,7),
"TEST2A"=c(NA,8,4,6,9),
"TEST3A"=c(NA,10,5,4,6),
"TEST1B"=c(5,6,7,4,1),
"TEST2B"=c(10,10,9,3,1),
"TEST3B"=c(0,5,6,9,NA),
"TEST1TOTAL"=c(NA,23,14,16,22),
"TEST2TOTAL"=c(10,16,15,12,NA))
I have columns STUDENT through TEST3B and want to create TEST1TOTAL and TEST2TOTAL. TEST1TOTAL = TEST1A + TEST2A + TEST3A, and likewise for TEST2TOTAL with the B columns. If any of TEST1A, TEST2A, TEST3A is missing, then TEST1TOTAL should be NA.
Here is my attempt, but is there a solution with fewer lines of code? Otherwise I will need to write this line out many times, as the tests run from A through O.
TEST1TOTAL=rowSums(df[,c('TEST1A', 'TEST2A', 'TEST3A')], na.rm=TRUE)
Using just base R functions (df1 here is presumably df without the two TOTAL columns, i.e. df[, 1:7]):
output <- data.frame(df1, do.call(cbind, lapply(c("A$", "B$"), function(x) rowSums(df1[, grep(x, names(df1))]))))
Customizing colnames:
> colnames(output)[(ncol(output)-1):ncol(output)] <- c("TEST1TOTAL", "TEST2TOTAL")
> output
STUDENT TEST1A TEST2A TEST3A TEST1B TEST2B TEST3B TEST1TOTAL TEST2TOTAL
1 1 NA NA NA 5 10 0 NA 15
2 2 5 8 10 6 10 5 23 21
3 3 5 4 5 7 9 6 14 22
4 4 6 6 4 4 3 9 16 16
5 5 7 9 6 1 1 NA 22 NA
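Since the question mentions letters A through O, the same idea can be written as a loop over the letters. This is a sketch under the assumption that every score column is named TEST<number><letter>. The vector test_letters is a placeholder and would become LETTERS[1:15] for the full set.

```r
df <- data.frame(STUDENT = 1:5,
                 TEST1A = c(NA, 5, 5, 6, 7),
                 TEST2A = c(NA, 8, 4, 6, 9),
                 TEST3A = c(NA, 10, 5, 4, 6),
                 TEST1B = c(5, 6, 7, 4, 1),
                 TEST2B = c(10, 10, 9, 3, 1),
                 TEST3B = c(0, 5, 6, 9, NA))

test_letters <- c("A", "B")   # placeholder; LETTERS[1:15] for A through O

for (i in seq_along(test_letters)) {
  # Columns such as TEST1A, TEST2A, TEST3A for the letter "A"
  cols <- grep(paste0("^TEST[0-9]+", test_letters[i], "$"), names(df), value = TRUE)
  # rowSums() keeps na.rm = FALSE by default, so any missing score gives NA
  df[[paste0("TEST", i, "TOTAL")]] <- rowSums(df[, cols])
}
df
```

The attempt in the question used na.rm = TRUE, which treats missing scores as zero instead of returning NA.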
Try:
library(dplyr)
df %>%
mutate(TEST1TOTAL = TEST1A+TEST2A+TEST3A,
TEST2TOTAL = TEST1B+TEST2B+TEST3B)
or
df %>%
mutate(TEST1TOTAL = rowSums(select(df, ends_with("A"))),
TEST2TOTAL = rowSums(select(df, ends_with("B"))))
I think for what you want, Jilber Urbina's solution is the way to go. For completeness sake (and because I learned something figuring it out) here's a tidyverse way to get the score totals by test number for any number of tests.
The advantage is you don't need to specify the identifiers for the tests (beyond that they're numbered or have a trailing letter) and the same code will work for any number of tests.
library(tidyverse)
df_totals <- df %>%
gather(test, score, -STUDENT) %>% # Convert from wide to long format
mutate(test_num = paste0('TEST', gsub('[^0-9]', '', test),
'TOTAL'), # Extract test_number from variable
test_let = gsub('TEST[0-9]*', '', test)) %>% # Extract test_letter (optional)
group_by(STUDENT, test_num) %>% # group by student + test
summarize(score_tot = sum(score)) %>% # Sum score by student/test
spread(test_num, score_tot) # Spread back to wide format
df_totals
# A tibble: 5 x 4
# Groups: STUDENT [5]
STUDENT TEST1TOTAL TEST2TOTAL TEST3TOTAL
<dbl> <dbl> <dbl> <dbl>
1 1 NA NA NA
2 2 11 18 15
3 3 12 13 11
4 4 10 9 13
5 5 8 10 NA
If you want the individual scores too, just join the totals together with the original:
left_join(df, df_totals, by = 'STUDENT')
STUDENT TEST1A TEST2A TEST3A TEST1B TEST2B TEST3B TEST1TOTAL TEST2TOTAL TEST3TOTAL
1 1 NA NA NA 5 10 0 NA NA NA
2 2 5 8 10 6 10 5 11 18 15
3 3 5 4 5 7 9 6 12 13 11
4 4 6 6 4 4 3 9 10 9 13
5 5 7 9 6 1 1 NA 8 10 NA
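In current tidyr, gather() and spread() are superseded by pivot_longer() and pivot_wider(). A sketch of the same pipeline with those functions, assuming reasonably recent dplyr and tidyr versions and a df that contains only STUDENT and the score columns:

```r
library(dplyr)
library(tidyr)

df <- data.frame(STUDENT = 1:5,
                 TEST1A = c(NA, 5, 5, 6, 7),
                 TEST2A = c(NA, 8, 4, 6, 9),
                 TEST3A = c(NA, 10, 5, 4, 6),
                 TEST1B = c(5, 6, 7, 4, 1),
                 TEST2B = c(10, 10, 9, 3, 1),
                 TEST3B = c(0, 5, 6, 9, NA))

df_totals <- df %>%
  pivot_longer(-STUDENT, names_to = "test", values_to = "score") %>%  # wide to long
  mutate(test_num = paste0("TEST", gsub("[^0-9]", "", test), "TOTAL")) %>%
  group_by(STUDENT, test_num) %>%
  summarise(score_tot = sum(score), .groups = "drop") %>%             # NA if any score is missing
  pivot_wider(names_from = test_num, values_from = score_tot)         # long back to wide
df_totals
```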

Remove duplicates from each row in R dataframe

I have a dataframe with 1209 columns and 27900 rows.
In each row there are duplicated values scattered across the columns.
I have tried transposing the dataframe and removing duplicates by column, but it crashes.
After I transpose I used:
for(i in 1:ncol(df)){
#replicate column i without duplicates, fill blanks with NAs
df <- cbind.fill(df,unique(df[,1]), fill = NA)
#rename the new column
colnames(df)[n+1] <- colnames(df)[1]
#delete the old column
df[,1] <- NULL
}
But no result so far.
I would like to know if anyone has any idea.
Best
As I understand it, you would like to replace duplicated values in each column with NA?
This can be done in several ways.
First some data:
set.seed(7)
df <- data.frame(x = sample(1:20, 50, replace = TRUE),
y = sample(1:20, 50, replace = TRUE),
z = sample(1:20, 50, replace = TRUE))
head(df, 10)
#output
x y z
1 20 12 8
2 8 15 10
3 3 16 10
4 2 13 8
5 5 15 13
6 16 8 7
7 7 4 20
8 20 4 1
9 4 8 16
10 10 6 5
with purrr library:
library(purrr)
map_dfc(df, function(x) ifelse(duplicated(x), NA, x))
#output
# A tibble: 50 x 3
x y z
<int> <int> <int>
1 20 12 8
2 8 15 10
3 3 16 NA
4 2 13 NA
5 5 NA 13
6 16 8 7
7 7 4 20
8 NA NA 1
9 4 NA 16
10 10 6 5
# ... with 40 more rows
with apply in base R
as.data.frame(apply(df, 2, function(x) ifelse(duplicated(x), NA, x)))
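The question title asks about duplicates within each row rather than each column. If that is what is needed, a sketch of the row-wise variant in base R could look like this. The small df below is made-up data, and apply() converts the frame to a matrix, so it assumes all columns share one type.

```r
df <- data.frame(a = c(1, 2), b = c(1, 3), c = c(4, 2))

# apply() over rows returns the results as columns, hence the t()
res <- as.data.frame(t(apply(df, 1, function(r) replace(r, duplicated(r), NA))))
res
```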

r - Extract subsequences with specific time increments

I have a data frame df. It has several columns, two of which are dates and serial_day, corresponding to the date an observation was taken and MATLAB's serial day. I would like to restrict my time series such that the increment (in days) between two consecutive observations is 3 or 4, and separate such blocks by an NA row.
It is known that consecutive daily observations never occur and the case of 2 day separation followed by 2 day separation is rare, so it can be ignored.
In the example, increment is shown for convenience, but it is easily generated using the diff function. So, if the data frame is
serial_day increment
1 4 NA
2 7 3
3 10 3
4 12 2
5 17 5
6 19 2
7 22 3
8 25 3
9 29 4
10 34 5
I would hope to get a new data frame as:
serial_day increment
1 4 NA
2 7 3
3 10 3
4 NA NA ## Entire row of NAs
5 19 NA
6 22 3
7 25 3
8 29 4
9 NA NA ## Entire row of NAs
I can't figure out a way to do this without looping, which is usually a bad idea in R.
First you check in which rows the increment is not equal to 3 or 4. Then you'd replace these rows with a row of NAs:
inds <- which( df$increment > 4 | df$increment < 3 )
df[inds, ] <- rep(NA, ncol(df))
# serial_day increment
# 1 4 NA
# 2 7 3
# 3 10 3
# 4 NA NA
# 5 NA NA
# 6 NA NA
# 7 22 3
# 8 25 3
# 9 29 4
# 10 NA NA
This may result in multiple consecutive rows of NAs. In order to reduce these consecutive NA-rows to a single NA-row, you'd check where the NA-rows are located with which() and then see whether these locations are consecutive with diff() and remove these rows from df:
NArows <- which(rowSums(is.na(df)) == ncol(df)) # c(4, 5, 6, 10)
inds2 <- NArows[c(FALSE, diff(NArows) == 1)] # c(5, 6)
df <- df[-inds2, ]
# serial_day increment
# 1 4 NA
# 2 7 3
# 3 10 3
# 4 NA NA
# 7 22 3
# 8 25 3
# 9 29 4
# 10 NA NA
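The result above drops serial day 19, which the desired output in the question keeps as the start of the second block. A possible vectorised sketch that reproduces the desired output, assuming an observation should be kept whenever the step to its neighbour on either side is 3 or 4 (the names ok, keep, block and padded are illustrative):

```r
serial_day <- c(4, 7, 10, 12, 17, 19, 22, 25, 29, 34)

# ok[i] is TRUE when the step from observation i to i + 1 is 3 or 4
ok <- diff(serial_day) %in% c(3, 4)

# Keep an observation if the step into it or out of it is valid
keep <- c(ok, FALSE) | c(FALSE, ok)

# A new block starts wherever the step into an observation is not valid
block <- cumsum(c(TRUE, !ok))[keep]

# Append one NA after each block, then recompute the increments
padded <- unlist(lapply(split(serial_day[keep], block), function(b) c(b, NA)),
                 use.names = FALSE)
out <- data.frame(serial_day = padded, increment = c(NA, diff(padded)))
out
```

For a data frame with more columns, the same keep and block vectors could be used to subset and split the rows instead of a single vector.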

How to replace columns containing NA with the contents of the previous column?

I have a large dataframe with random columns which contain NA values. It looks like this:
2002-06-26 2002-06-27 2002-06-28 2002-07-01 2002-07-02 2002-07-03 2002-07-05
1 US1718711062 NA BMG4388N1065 US0116591092 NA AN8068571086 GB00BYMT0J19
2 US9837721045 NA US0025671050 US03662Q1058 NA BMG3223R1088 US0097281069
3 NA US00847J1051 US06652V2088 NA BMG4388N1065 US0305061097
4 NA US04351G1013 US1046741062 NA BMG7496G1033 US03836W1036
5 NA US2925621052 US1431301027 NA CA88157K1012 US06652V2088
6 NA US34988V1061 US1897541041 NA CH0044328745 US1547604090
7 NA US3596941068 US2053631048 NA GB00B5BT0K07 US1778351056
8 NA US4180561072 US2567461080 NA IE00B5LRLL25 US1999081045
9 NA US4198791018 US2925621052 NA IE00B8KQN827 US3498531017
10 NA US45071R1095 US3989051095 NA IE00BGH1M568 US42222N1037
I need code that identifies the NA columns and fills them with the contents of the previous column. For example, column "2002-06-27" should contain "US1718711062" and "US9837721045". The NA columns occur at irregular intervals.
Columns are also of random length some only containing one element so I think the best way to identify columns with no values is to look at the first row like so:
row.has.na <- which(is.na(data[1,]))
[1] 2 5
To complete my comment: as you have already computed row.has.na, the vector of indices for the NA column, here is a way to use it and get what you need:
data[, row.has.na] <- data[, row.has.na - 1]
This should work. Note that this also works if two (or more) NA columns are next to each other. Maybe there is a way around the while-loop, but...
# Create some data
data <- data.frame(col1 = 1:10, col2 = NA, col3 = 10:1, col4 = NA, col5 = NA, col6 = NA)
# Find which columns contain NA in the first row
col_NA <- which(is.na(data[1,]))
# Select the previous columns
col_replace <- col_NA - 1
# Check if any NA columns are next to each other and fix it:
while(any(diff(col_replace) == 1)){
ind <- which(diff(col_replace) == 1) + 1
col_replace[ind] <- col_replace[ind] - 1
}
# Replace the NA columns with the previous columns
data[,col_NA] <- data[,col_replace]
col1 col2 col3 col4 col5 col6
1 1 1 10 10 10 10
2 2 2 9 9 9 9
3 3 3 8 8 8 8
4 4 4 7 7 7 7
5 5 5 6 6 6 6
6 6 6 5 5 5 5
7 7 7 4 4 4 4
8 8 8 3 3 3 3
9 9 9 2 2 2 2
10 10 10 1 1 1 1
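One possible way around the while-loop is to compute, for every column, the index of the most recent non-NA column with cummax(). This is a sketch that assumes the first column is never an NA column (otherwise the source index would be 0). The names is_na_col and src are illustrative.

```r
data <- data.frame(col1 = 1:10, col2 = NA, col3 = 10:1, col4 = NA, col5 = NA, col6 = NA)

# TRUE for columns whose first row is NA
is_na_col <- is.na(unlist(data[1, ]))

# For each column, the position of the latest non-NA column at or before it
src <- cummax(ifelse(is_na_col, 0L, seq_along(is_na_col)))

# Copy the source columns into place while keeping the original column names
data[] <- data[src]
data
```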
