I want to create a "row" containing the first non-NA value that appears in a data frame. So for example, given this test data frame:
test.df <- data.frame(a=c(11,12,13,14,15,16),b=c(NA,NA,23,24,25,26), c=c(31,32,33,34,35,36), d=c(NA,NA,NA,NA,45,46))
test.df
a b c d
1 11 NA 31 NA
2 12 NA 32 NA
3 13 23 33 NA
4 14 24 34 NA
5 15 25 35 45
6 16 26 36 46
I know that I can detect the first appearance of a non-NA like this:
first.appearance <- as.numeric(sapply(test.df, function(col) min(which(!is.na(col)))))
first.appearance
[1] 1 3 1 5
This tells me that the first element in column 1 is not NA, the third element in column 2 is not NA, the first element in column 3 is not NA, and the fifth element in column 4 is not NA. But when I put the pieces together, it yields this (which is logical, but not what I want):
> test.df[first.appearance,]
a b c d
1 11 NA 31 NA
3 13 23 33 NA
1.1 11 NA 31 NA
5 15 25 35 45
I would like the output to be the first non-NA in each column. What is a base or dplyr way to do this? I am not seeing it. Thanks in advance.
a b c d
1 11 23 31 45
We can use
library(dplyr)
test.df %>%
slice(first.appearance) %>%
summarise_all(~ first(.[!is.na(.)]))
# a b c d
#1 11 23 31 45
Or it can be
test.df %>%
summarise_all(~ min(na.omit(.)))
# a b c d
#1 11 23 31 45
Or with colMins
library(matrixStats)
colMins(as.matrix(test.df), na.rm = TRUE)
#[1] 11 23 31 45
You can use :
library(tidyverse)
df %>% fill(everything(), .direction = "up") %>% head(1)
a b c d
<dbl> <dbl> <dbl> <dbl>
1 11 23 31 45
Related
I want to take differences for each pair of consecutive columns but for an arbitrary number of columns. For example...
df <- as.tibble(data.frame(group = rep(c("a", "b", "c"), each = 4),
subgroup = rep(c("adam", "boy", "charles", "david"), times = 3),
iter1 = 1:12,
iter2 = c(13:22, NA, 24),
iter3 = c(25:35, NA)))
I want to calculate the differences by column. I would normally use...
df %>%
mutate(diff_iter2 = iter2 - iter1,
diff_iter3 = iter3 - iter2)
But... I'd like to:
accomodate an arbitrary number of columns and
treat NAs such that:
if the number we're subtracting from is NA, then the result should be NA. E.g. NA - 11 = NA
if the number we're subtracting is NA, then that NA is effectively treated as a 0. E.g. 35 - NA = 35
The result should look like this...
group subgroup iter1 iter2 iter3 diff_iter2 diff_iter3
<chr> <chr> <int> <dbl> <int> <dbl> <dbl>
1 a adam 1 13 25 12 12
2 a boy 2 14 26 12 12
3 a charles 3 15 27 12 12
4 a david 4 16 28 12 12
5 b adam 5 17 29 12 12
6 b boy 6 18 30 12 12
7 b charles 7 19 31 12 12
8 b david 8 20 32 12 12
9 c adam 9 21 33 12 12
10 c boy 10 22 34 12 12
11 c charles 11 NA 35 NA 35
12 c david 12 24 NA 12 NA
Originally, this df was in long format but the problem was that I believe the lag() function operates on position within groups and all the groups aren't the same because some have missing records (hence the NA in the wider table shown above).
Starting with long format would do but then please assume the records shown above with NA values would not exist in that longer dataframe.
Any help is appreciated.
An option in tidyverse would be - loop across the columns of 'iter' other than the iter1, then get the column value by replacing the column name (cur_column()) substring by subtracting 1 (as.numeric(x) -1) with str_replace, then replace the NA elements with 0 (replace_na) based on the OP's logic, subtract from the looped column and create new columns by adding prefix in .names ("diff_{.col}" - {.col} will be the original column name)
library(dplyr)
library(stringr)
library(tidyr)
df <- df %>%
mutate(across(iter2:iter3, ~
. - replace_na(get(str_replace(cur_column(), '\\d+',
function(x) as.numeric(x) - 1)), 0), .names = 'diff_{.col}'))
-output
df
# A tibble: 12 × 7
group subgroup iter1 iter2 iter3 diff_iter2 diff_iter3
<chr> <chr> <int> <dbl> <int> <dbl> <dbl>
1 a adam 1 13 25 12 12
2 a boy 2 14 26 12 12
3 a charles 3 15 27 12 12
4 a david 4 16 28 12 12
5 b adam 5 17 29 12 12
6 b boy 6 18 30 12 12
7 b charles 7 19 31 12 12
8 b david 8 20 32 12 12
9 c adam 9 21 33 12 12
10 c boy 10 22 34 12 12
11 c charles 11 NA 35 NA 35
12 c david 12 24 NA 12 NA
Find the columns whose names start with iter, ix, and then take all but the first as df1, all but the last as df2 and replace the NAs in df2 with 0. Then subtract them and cbind df to that. No packages are used.
ix <- grep("^iter", names(df))
df1 <- df[tail(ix, -1)]
df2 <- df[head(ix, -1)]
df2[is.na(df2)] <- 0
cbind(df, diff = df1 - df2)
giving:
group subgroup iter1 iter2 iter3 diff.iter2 diff.iter3
1 a adam 1 13 25 12 12
2 a boy 2 14 26 12 12
3 a charles 3 15 27 12 12
4 a david 4 16 28 12 12
5 b adam 5 17 29 12 12
6 b boy 6 18 30 12 12
7 b charles 7 19 31 12 12
8 b david 8 20 32 12 12
9 c adam 9 21 33 12 12
10 c boy 10 22 34 12 12
11 c charles 11 NA 35 NA 35
12 c david 12 24 NA 12 NA
I have a data frame with columns containing NAs which I replace using replace_na. The problem is these column names can change in the future so I would like to put these column names in a vector and then use the vector in the replace_na function. I don't want to change the entire data frame in one go, just specified columns. When I try this as below, the code runs but it doesn't change the data frame. Can anyone suggest any edits to the code?
library(tidyverse)
col1<-c(9,NA,25,26,NA,51)
col2<-c(9,5,25,26,NA,51)
col3<-c(NA,3,25,26,NA,51)
col4<-c(9,1,NA,26,NA,51)
data<-data.frame(col1,col2,col3,col4, stringsAsFactors = FALSE)
columns<-c(col1,col2)
data<-data%>%
replace_na(list(columns=0))
A dplyr option:
columns <- c("col1" ,"col2")
dplyr::mutate(data, across(columns, replace_na, 0))
Returns:
col1 col2 col3 col4
1 9 9 NA 9
2 0 5 3 1
3 25 25 25 NA
4 26 26 26 26
5 0 0 NA NA
6 51 51 51 51
Another option would be using coalesce inside map_at:
at argument in map_at can be a character vector of column names that you would like to modify
We then use coalesce function to specify the replacement of NAs
library(dplyr)
library(purrr)
data %>%
map_at(c("col1","col2"), ~ coalesce(.x, 0)) %>%
bind_cols()
# A tibble: 6 x 4
col1 col2 col3 col4
<dbl> <dbl> <dbl> <dbl>
1 9 9 NA 9
2 0 5 3 1
3 25 25 25 NA
4 26 26 26 26
5 0 0 NA NA
6 51 51 51 51
columns value should be string, you can then use is.na as -
columns<-c("col1","col2")
data[columns][is.na(data[columns])] <- 0
data
# col1 col2 col3 col4
#1 9 9 NA 9
#2 0 5 3 1
#3 25 25 25 NA
#4 26 26 26 26
#5 0 0 NA NA
#6 51 51 51 51
Or using tidyverse -
library(dplyr)
library(tidyr)
data <- data %>% mutate(across(all_of(columns), replace_na, 0))
I want to calculate the mean in a row if at least three out of six observations in the row are != NA. If four or more NA´s are present, the mean should show NA.
Example which gives me the mean, ignoring the NA´s:
require(dplyr)
a <- 1:10
b <- a+10
c <- a+20
d <- a+30
e <- a+40
f <- a+50
df <- data.frame(a,b,c,d,e,f)
df[2,c(1,3,4,6)] <- NA
df[5,c(1,4,6)] <- NA
df[8,c(1,2,5,6)] <- NA
df <- df %>% mutate(mean = rowMeans(df[,1:6], na.rm=TRUE))
I thought about the use of
case_when
but i´m not sure how to use it correctly:
df <- df %>% mutate(mean = case_when( ~ rowMeans(df[,1:6], na.rm=TRUE), TRUE ~ NA))
You can try a base R solution saving the number of non NA values in a new variable and then use ifelse() for the mean:
#Data
a <- 1:10
b <- a+10
c <- a+20
d <- a+30
e <- a+40
f <- a+50
df <- data.frame(a,b,c,d,e,f)
df[2,c(1,3,4,6)] <- NA
df[5,c(1,4,6)] <- NA
df[8,c(1,2,5,6)] <- NA
#Code
#Count number of non NA
df$count <- rowSums( !is.na( df [,1:6]))
#Compute mean
df$Mean <- ifelse(df$count>=3,rowMeans(df [,1:6],na.rm=T),NA)
Output:
a b c d e f count Mean
1 1 11 21 31 41 51 6 26.00000
2 NA 12 NA NA 42 NA 2 NA
3 3 13 23 33 43 53 6 28.00000
4 4 14 24 34 44 54 6 29.00000
5 NA 15 25 NA 45 NA 3 28.33333
6 6 16 26 36 46 56 6 31.00000
7 7 17 27 37 47 57 6 32.00000
8 NA NA 28 38 NA NA 2 NA
9 9 19 29 39 49 59 6 34.00000
10 10 20 30 40 50 60 6 35.00000
You could do:
library(dplyr)
df %>%
rowwise %>%
mutate(
mean = case_when(
sum(is.na(c_across())) < 4 ~ mean(c_across(), na.rm = TRUE),
TRUE ~ NA_real_)
) %>% ungroup()
Output:
# A tibble: 10 x 7
a b c d e f mean
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 11 21 31 41 51 26
2 NA 12 NA NA 42 NA NA
3 3 13 23 33 43 53 28
4 4 14 24 34 44 54 29
5 NA 15 25 NA 45 NA 28.3
6 6 16 26 36 46 56 31
7 7 17 27 37 47 57 32
8 NA NA 28 38 NA NA NA
9 9 19 29 39 49 59 34
10 10 20 30 40 50 60 35
This is leveraging rowwise and c_across which basically means operating on row level, so you can use vectorized functions such as sum, mean etc. in their usual way (also with case_when).
c_across also has a cols argument where you can specify which columns you want to take into account. For example, if you'd like to take into account columns 1:6, you can specify this as:
df %>%
rowwise %>%
mutate(
mean = case_when(
sum(is.na(c_across(1:6))) < 4 ~ mean(c_across(), na.rm = TRUE),
TRUE ~ NA_real_)
) %>% ungroup()
Alternatively, if you'd e.g. like to take into account all columns except column number 2, you would do c_across(-2). You can also use column names, e.g. for the first example c_across(a:f) (all columns) or for the second c_across(-b) (all columns except b).
This is implemented internally in dplyr, but you could also do usual vector subsetting with taking the whole c_across() (which defaults to all columns, i.e. everything()) and do e.g. c_across()[1:6] or c_across()[-2].
We can create an index first and then do the assignment based on the index
i1 <- rowSums(!is.na(df)) >=3
df$Mean[i1] <- rowMeans(df[i1,], na.rm = TRUE)
df
# a b c d e f Mean
#1 1 11 21 31 41 51 26.00000
#2 NA 12 NA NA 42 NA NA
#3 3 13 23 33 43 53 28.00000
#4 4 14 24 34 44 54 29.00000
#5 NA 15 25 NA 45 NA 28.33333
#6 6 16 26 36 46 56 31.00000
#7 7 17 27 37 47 57 32.00000
#8 NA NA 28 38 NA NA NA
#9 9 19 29 39 49 59 34.00000
#10 10 20 30 40 50 60 35.00000
I've spent the better part of a day on this but I keep getting stuck. This wouldn't take me very long using index-match-match in Excel, but I'm newer to R and merging data doesn't seem very straight-forward. I've searched the site and found similar problems but no solutions specific to this type of issue.
I have two data frames. They have different lengths in both dimensions. a is 4x4 and b is 3x3. They partially overlap:
a <- data.frame("ID" = c(1:4), "A" = c(21:24), "B" = c(31:34), "C" = c(41:44))
a
ID A B C
1 1 21 31 41
2 2 22 32 42
3 3 23 33 43
4 4 24 34 44
and
b <- data.frame("ID" = c(4:6), "C" = c(22:24), "D" = c(32:34))
b
ID C D
1 4 22 32
2 5 23 33
3 6 24 34
I'm merging on "ID" number. My goal is to get them to look like
c <- data.frame("ID" = c(1:6), "A" = c(21:24, NA, NA), "B" = c(31:34, NA, NA), "C" = c(41:43,22:24), "D" = c(NA, NA, NA, 32:34))
c
ID A B C D
1 21 31 41 NA
2 22 32 42 NA
3 23 33 43 NA
4 24 34 22 32
5 NA NA 23 33
6 NA NA 24 34
As you can see, the final data frame combines the two and assigns NA to the missing information. In column "C", I would like b to overwrite a where it has numerical values. In this example, the value in c[4,3] should change from 44 to 22.
Most of this is simple enough. But getting column "C" correct has been a nightmare. I did the simple thing first:
merge(a, b, by = "ID", all = T)
It almost does the trick but ends up with duplicate row "C"s:
ID A B C.x C.y D
1 1 21 31 41 NA NA
2 2 22 32 42 NA NA
3 3 23 33 43 NA NA
4 4 24 34 44 22 32
5 5 NA NA NA 23 33
6 6 NA NA NA 24 34
This wouldn't be so bad if I could find out how to merge the duplicate rows correctly because then I could just run
merge(a[-4], b[-2], by = "ID", all = T)
ID A B D
1 1 21 31 NA
2 2 22 32 NA
3 3 23 33 NA
4 4 24 34 32
5 5 NA NA 33
6 6 NA NA 34
to merge everything else, then bring in the merged "C" after the fact.
But I can't figure it out how to deal with this part of it:
merge(a[c(1,4)], b[c(1,2)], by = "ID", all = T)
ID C.x C.y ID C
1 1 41 NA 1 1 41
2 2 42 NA 2 2 42
3 3 43 NA -> 3 3 43
4 4 44 22 4 4 22
5 5 NA 23 5 5 23
6 6 NA 24 6 6 24
There's gotta be way.
Thanks for your help!
For anyone else looking at this in the future, I realized this could also be solved using the following in base rather than dplyr:
df <- merge(a, b, by = "ID", all = T)
df[,"C"] <- ifelse(is.na(df[,"C.y"]), df[,"C.x"], df[,"C.y"])
df <- df[,-c(match("C.x", names(df)),match("C.y", names(df)))]
This ended up being the method I used because down the road I came to needing to perform some steps that were very difficult with dplyr for a novice (using variables inside mutate() and select()) and much more straightforward in base using the above syntax.
Thanks again to CPak, without whom I could not have figured this out.
Try this
library(dplyr)
starthere <- merge(a, b, by = "ID", all = T)
starthere %>%
mutate(C = ifelse(is.na(C.y), C.x, C.y)) %>%
select(-C.x, -C.y)
# ID A B D C
# 1 1 21 31 NA 41
# 2 2 22 32 NA 42
# 3 3 23 33 NA 43
# 4 4 24 34 32 22
# 5 5 NA NA 33 23
# 6 6 NA NA 34 24
I have a data frame as follows
Identifier V1 Location V2
1 12 A 21
1 12 B 24
2 20 B 15
2 20 C 18
2 20 B 23
3 43 A 10
3 43 B 17
3 43 A 18
3 43 B 20
3 43 C 25
3 43 A 30
I’d like to re-cast it with a single row for each Identifier and one column for each value in the current location column. I don’t care about the data in V1 but I need the data in V2 and these will become the values in the new columns.
Note that for the Location column there are repeated values for Identifiers 2 and 3.
I ASSUME that the first task is to make the values in the Location column unique.
I used the following (the data frame is called “Test”)
L<-length(Test$Identifier)
for (i in 1:L)
{
temp<-Test$Location[Test$Identifier==i]
temp1<-make.unique(as.character(temp), sep="-")
levels(Test$Location)=c(levels(Test$Location),temp1)
Test$Location[Test$Identifier==i]=temp1
}
This produces
Identifier V1 Location V2
1 12 A 21
1 12 B 24
2 20 B 15
2 20 C 18
2 20 B-1 23
3 43 A 10
3 43 B 17
3 43 A-1 18
3 43 B-1 20
3 43 C 25
3 50 A-2 30
Then using
cast(Test, Identifier ~ Location)
gives
Identifier A B C B-1 A-1 A-2
1 21 24 NA NA NA NA
2 NA 15 18 23 NA NA
3 10 17 25 20 18 30
And this is more or less what I want.
My questions are
Is this the right way to handle the problem?
I know R-people don’t use the “for” construction so is there a more R-elegant (relegant?) way to do this? I should mention that the real data set has over 160,000 rows and starts with over 50 unique values in the Location vector and the function takes just over an hour to run. Anything quicker would be good. I should also mention that the cast function had to be run on 20-30k rows of the output at a time despite increasing the memory limit. All the cast outputs were then merged
Is there a way to sort the columns in the output so that (here) they are A, A-1, A-2, B, B-1, C
Please be gentle with your replies!
Usually your original format is much better than your desired result. However, you can do this easily using the split-apply-combine approach, e.g., with package plyr:
DF <- read.table(text="Identifier V1 Location V2
1 12 A 21
1 12 B 24
2 20 B 15
2 20 C 18
2 20 B 23
3 43 A 10
3 43 B 17
3 43 A 18
3 43 B 20
3 43 C 25
3 43 A 30", header=TRUE, stringsAsFactors=FALSE)
#note that I make sure that there are only characters and not factors
#use as.character if you have factors
library(plyr)
DF <- ddply(DF, .(Identifier), transform, Loc2 = make.unique(Location, sep="-"))
library(reshape2)
DFwide <- dcast(DF, Identifier ~Loc2, value.var="V2")
# Identifier A B B-1 C A-1 A-2
#1 1 21 24 NA NA NA NA
#2 2 NA 15 23 18 NA NA
#3 3 10 17 20 25 18 30
If column order is important to you (usually it isn't):
DFwide[, c(1, order(names(DFwide)[-1])+1)]
# Identifier A A-1 A-2 B B-1 C
#1 1 21 NA NA 24 NA NA
#2 2 NA NA NA 15 23 18
#3 3 10 18 30 17 20 25
For reference, here's the equivalent of #Roland's answer in base R.
Use ave to create the unique "Location" columns....
DF$Location <- with(DF, ave(Location, Identifier,
FUN = function(x) make.unique(x, sep = "-")))
... and reshape to change the structure of your data.
## If you want both V1 and V2 in your "wide" dataset
## "dcast" can't directly do this--you'll need `recast` if you
## wanted both columns, which first `melt`s and then `dcast`s....
reshape(DF, direction = "wide", idvar = "Identifier", timevar = "Location")
## If you only want V2, as you indicate in your question
reshape(DF, direction = "wide", idvar = "Identifier",
timevar = "Location", drop = "V1")
# Identifier V2.A V2.B V2.C V2.B-1 V2.A-1 V2.A-2
# 1 1 21 24 NA NA NA NA
# 3 2 NA 15 18 23 NA NA
# 6 3 10 17 25 20 18 30
Reordering the columns can be done the same way that #Roland suggested.