Editing data in rows - r

I'm trying to convert my data in R, but I can't manage to get the column I want.
My dataset is shown below. The column I want is "total": the sum of D1 + D2 + D3 + D4 + D5, ignoring NA values.
NR D1 D2 D3 D4 D5 total
A 1 NA NA 1 NA 2
B NA NA NA NA NA NA
C NA 1 NA NA NA 1
It is probably quite a dumb question, but I can't get it.
I already tried:
total <- NA
total <- ifelse(D1==1, 1, total)
total <- ifelse(D2==1, total + 1, total)
total <- ifelse(D3==1, total + 1, total)
total <- ifelse(D4==1, total + 1, total)
total <- ifelse(D5==1, total + 1, total)
But it returns NA for all my rows,
and I tried:
total <- mutate(dataset, total=D1+D2+D3+D4+D5)
but then I don't get an aggregation of the values of D1 to D5.

We could use rowSums
df1$total <- rowSums(df1[startsWith(names(df1), "D")], na.rm = TRUE)
df1$total[df1$total == 0] <- NA
Or the same logic in dplyr
library(dplyr)
df1 %>%
mutate(total = na_if(rowSums(select(., starts_with('D')), na.rm = TRUE), 0))
NR D1 D2 D3 D4 D5 total
1 A 1 NA NA 1 NA 2
2 B NA NA NA NA NA NA
3 C NA 1 NA NA NA 1
data
df1 <- structure(list(NR = c("A", "B", "C"), D1 = c(1L, NA, NA), D2 = c(NA,
NA, 1L), D3 = c(NA, NA, NA), D4 = c(1L, NA, NA), D5 = c(NA, NA,
NA), total = c(2L, NA, 1L)), class = "data.frame", row.names = c(NA,
-3L))
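For context on why the two attempts in the question fail, here is a minimal base R sketch (the vectors d1 and d2 are made up for illustration and are not the asker's data). Any comparison or arithmetic involving NA yields NA, so starting from total <- NA can never produce a number, and D1 + D2 + ... is NA as soon as one term is missing. rowSums with na.rm = TRUE drops the missing values instead.

```r
# Hypothetical toy vectors, not the asker's data
d1 <- c(1, NA, NA)
d2 <- c(1, NA, 1)

# Comparison with NA is NA, so ifelse() returns NA for those elements
cmp <- ifelse(d1 == 1, 1, 0)      # 1 NA NA

# Plain addition propagates NA
plain <- d1 + d2                  # 2 NA NA

# rowSums() with na.rm = TRUE ignores NA; an all-NA row becomes 0
summed <- rowSums(cbind(d1, d2), na.rm = TRUE)   # 2 0 1

# Turn the all-NA rows back into NA, as in the answer above
all_na <- rowSums(!is.na(cbind(d1, d2))) == 0
summed[all_na] <- NA              # 2 NA 1
```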

Here is a solution with c_across and rowwise. Note that a row where every value is NA gets 0 rather than NA, as row B in the output shows.
library(dplyr)
df %>%
rowwise() %>%
mutate(Total = sum(c_across(D1:D5 & where(is.numeric)), na.rm = TRUE))
Output:
NR D1 D2 D3 D4 D5 Total
<chr> <int> <int> <lgl> <int> <lgl> <int>
1 A 1 NA NA 1 NA 2
2 B NA NA NA NA NA 0
3 C NA 1 NA NA NA 1
data:
df <- structure(list(NR = c("A", "B", "C"), D1 = c(1L, NA, NA), D2 = c(NA,
NA, 1L), D3 = c(NA, NA, NA), D4 = c(1L, NA, NA), D5 = c(NA, NA,
NA)), row.names = c(NA, -3L), class = "data.frame")

You can try the code below
df$total <- replace(u <- rowSums(!is.na(df)) - 1, u == 0, NA)
which gives
> df
NR D1 D2 D3 D4 D5 total
1 A 1 NA NA 1 NA 2
2 B NA NA NA NA NA NA
3 C NA 1 NA NA NA 1
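One caveat with this approach: rowSums(!is.na(df)) - 1 counts the non-missing D values rather than summing them, so it matches the requested total only because every non-NA entry in the example is 1. A small sketch with a made-up data frame toy (an assumption for illustration) shows the difference:

```r
# Hypothetical data where a non-NA value is not 1
toy <- data.frame(NR = c("A", "B"),
                  D1 = c(2, NA),
                  D2 = c(3, NA))

# Counts non-NA cells per row; subtract 1 for the always-present NR column
n_present <- rowSums(!is.na(toy)) - 1                    # 2 0

# Sums the values themselves
value_sum <- rowSums(toy[c("D1", "D2")], na.rm = TRUE)   # 5 0
```

If the D columns can hold values other than 1, the rowSums(..., na.rm = TRUE) variants from the other answers are the safer choice.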

And also this one:
library(dplyr)
library(purrr)
df1 <- df1[, !names(df1) %in% "total"]
df1 %>%
mutate(total = pmap_dbl(select(cur_data(), starts_with("D")), ~ ifelse(all(is.na(c(...))),
NA, sum(c(...), na.rm = TRUE))))
NR D1 D2 D3 D4 D5 total
1 A 1 NA NA 1 NA 2
2 B NA NA NA NA NA NA
3 C NA 1 NA NA NA 1

Related

In R data frame for each set of rows and column use value that is not na

I have a data frame df of the following structure:
observation x1 x2 x3 x4
"obs1" NA NA NA 51
"obs1" NA NA NA NA
"obs1" NA 25 NA NA
"obs2" NA NA NA NA
"obs2" NA NA NA NA
"obs2" NA NA NA 56
"obs3" 26 NA NA NA
"obs3" NA 82 NA NA
"obs3" NA NA "x" NA
I want a data frame df2 that, for each observation and for each column, takes the one value that is not NA. The resulting data frame should look like this:
observation x1 x2 x3 x4
"obs1" NA 25 NA 51
"obs2" NA NA NA 56
"obs3" 26 82 "x" NA
I tried to do:
only_value = function(x){
x[which(!is.na(x))]
}
df2 = df %>% lapply(only_value) %>% as.data.frame()
However, this only works if every column has the same number of non-NA values. This is not the case in my example.
A data.table option using fcoalesce may help
library(data.table)
type.convert(setDT(df)[,data.table(t(fcoalesce(asplit(.SD,1)))),observation],as.is = TRUE)
which gives
observation x1 x2 x3 x4
1: obs1 NA 25 <NA> 51
2: obs2 NA NA <NA> 56
3: obs3 26 82 x NA
Data
> dput(df)
structure(list(observation = c("obs1", "obs1", "obs1", "obs2",
"obs2", "obs2", "obs3", "obs3", "obs3"), x1 = c(NA, NA, NA, NA,
NA, NA, 26L, NA, NA), x2 = c(NA, NA, 25L, NA, NA, NA, NA, 82L,
NA), x3 = c(NA, NA, NA, NA, NA, NA, NA, NA, "x"), x4 = c(51L,
NA, NA, NA, NA, 56L, NA, NA, NA)), class = "data.frame", row.names = c(NA,
-9L))
Similarly, you can use coalesce with dplyr
df %>%
group_by(observation) %>%
summarise(across(x1:x4,~do.call(coalesce,as.list(.x))))
which gives
observation x1 x2 x3 x4
* <chr> <int> <int> <chr> <int>
1 obs1 NA 25 <NA> 51
2 obs2 NA NA <NA> 56
3 obs3 26 82 x NA
Change the only_value function to return only the first non-NA value.
only_value = function(x){
x[!is.na(x)][1]
}
Now apply this function by group to columns x1 to x4:
library(dplyr)
df %>%
group_by(observation) %>%
summarise(across(x1:x4, only_value))
# observation x1 x2 x3 x4
#* <chr> <int> <int> <chr> <int>
#1 obs1 NA 25 NA 51
#2 obs2 NA NA NA 56
#3 obs3 26 82 x NA
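The same first-non-NA idea can also be sketched in base R with aggregate, without dplyr. This is only an illustration under the assumption that each group has at most one non-NA value per column; the data frame is rebuilt from the dput output shown earlier.

```r
# Data as in the dput() output shown above
df <- data.frame(
  observation = rep(c("obs1", "obs2", "obs3"), each = 3),
  x1 = c(NA, NA, NA, NA, NA, NA, 26L, NA, NA),
  x2 = c(NA, NA, 25L, NA, NA, NA, NA, 82L, NA),
  x3 = c(NA, NA, NA, NA, NA, NA, NA, NA, "x"),
  x4 = c(51L, NA, NA, NA, NA, 56L, NA, NA, NA),
  stringsAsFactors = FALSE
)

# First non-NA value of a vector, or NA if there is none
only_value <- function(x) x[!is.na(x)][1]

# Apply it to every column within each observation
df2 <- aggregate(df[-1], by = list(observation = df$observation),
                 FUN = only_value)
```

Because aggregate applies the function column by column, the integer columns and the character column x3 keep their own types.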

How to remove specific rows in R?

A data set I'm using is the following:
C1 C2 C3 R1
R1 NA NA NA 5
R2 NA NA 0.4 7
R3 0.1 NA NA 6
R4 NA NA NA 2
From the data frame, I want to remove rows that contain a number larger than zero in any of the columns C1 to C3.
The final outcome must be:
C1 C2 C3 R1
R1 NA NA NA 5
R4 NA NA NA 2
I tried with:
df<- df %>% filter_at(vars('C1' : 'C2`), all_vars(. > 0))
but I got an error with this. How can I fix it?
You can use rowSums in base R :
cols <- paste0('C', 1:3)
df[rowSums(df[cols] > 0, na.rm = TRUE) == 0, ]
Or using filter_at :
library(dplyr)
df %>% filter_at(vars(C1:C3), all_vars(. <= 0 | is.na(.)))
# C1 C2 C3 R1
#R1 NA NA NA 5
#R4 NA NA NA 2
filter_at has been superseded, and using across inside filter is deprecated in recent dplyr versions, so this can be written with if_all as:
df %>% filter(if_all(C1:C3, ~ . <= 0 | is.na(.)))
data
df <- structure(list(C1 = c(NA, NA, 0.1, NA), C2 = c(NA, NA, NA, NA
), C3 = c(NA, 0.4, NA, NA), R1 = c(5L, 7L, 6L, 2L)),
class = "data.frame", row.names = c("R1", "R2", "R3", "R4"))
A more manual approach is as follows:
library(data.table)
df <- as.data.table(df)
if(length(which(df$C1 > 0)) > 0){df <- df[-(which(df$C1 > 0)),]}
if(length(which(df$C2 > 0)) > 0){df <- df[-(which(df$C2 > 0)),]}
if(length(which(df$C3 > 0)) > 0){df <- df[-(which(df$C3 > 0)),]}
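The length(which(...)) > 0 guards in the manual approach matter: if no row matches, which() returns an empty vector, and a negative empty index selects zero rows instead of keeping all of them. The base R sketch below uses a plain data frame (an assumption; the answer converts to a data.table first) to show the pitfall and a logical alternative that needs no guard.

```r
df <- data.frame(C1 = c(NA, NA, 0.1, NA),
                 C2 = c(NA, NA, NA, NA),
                 C3 = c(NA, 0.4, NA, NA),
                 R1 = c(5L, 7L, 6L, 2L),
                 row.names = c("R1", "R2", "R3", "R4"))

# No value in C2 is > 0, so which() is empty ...
hits <- which(df$C2 > 0)            # integer(0)
# ... and a negative empty index drops every row
dropped_all <- df[-hits, ]          # 0 rows

# A logical index keeps rows where the test is FALSE or NA
keep <- !(df$C2 > 0) | is.na(df$C2)
kept <- df[keep, ]                  # all 4 rows

# Same idea across C1..C3: drop rows with any value > 0
bad <- rowSums(df[c("C1", "C2", "C3")] > 0, na.rm = TRUE) > 0
result <- df[!bad, ]                # rows R1 and R4
```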

Find max value within a data frame interval

I have a dataframe that has x/y values every 5 seconds, with a depth value every second (time column). There is no depth where there is an x/y value.
x <- c("1430934", NA, NA, NA, NA, "1430939")
y <- c("4943206", NA, NA, NA, NA, "4943210")
time <- c(1:6)
depth <- c(NA, 10, 19, 84, 65, NA)
data <- data.frame(x, y, time, depth)
data
x y time depth
1 1430934 4943206 1 NA
2 NA NA 2 10
3 NA NA 3 19
4 NA NA 4 84
5 NA NA 5 65
6 1430939 4943210 6 NA
I would like to calculate the maximum depth between the x/y values that are not NA and add this to a new column in the row of the starting x/y values. So max depth of rows 2-5. An example of the output desired.
x y time depth newvar
1 1430934 4943206 1 NA 84
2 NA NA 2 10 NA
3 NA NA 3 19 NA
4 NA NA 4 84 NA
5 NA NA 5 65 NA
6 1430939 4943210 6 NA NA
This is to repeat whenever a new x/y value is present.
You can use ave and cumsum with !is.na to get the groups for ave like:
data$newvar <- ave(data$depth, cumsum(!is.na(data$x)), FUN=
function(x) if(all(is.na(x))) NA else {
c(max(x, na.rm=TRUE), rep(NA, length(x)-1))})
data
# x y time depth newvar
#1 1430934 4943206 1 NA 84
#2 <NA> <NA> 2 10 NA
#3 <NA> <NA> 3 19 NA
#4 <NA> <NA> 4 84 NA
#5 <NA> <NA> 5 65 NA
#6 1430939 4943210 6 NA NA
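To make the grouping step explicit, here is a small sketch of what cumsum(!is.na(x)) produces on the question's data and how the per-interval maximum ends up in the first row of each interval. It is an unrolled illustration of the same idea, not a replacement for the ave call.

```r
x     <- c("1430934", NA, NA, NA, NA, "1430939")
depth <- c(NA, 10, 19, 84, 65, NA)

# TRUE marks a row that starts a new interval; cumsum() turns that into ids
grp <- cumsum(!is.na(x))                 # 1 1 1 1 1 2

# Maximum depth per interval, NA when an interval has no depth readings
max_per_grp <- tapply(depth, grp, function(d) {
  if (all(is.na(d))) NA_real_ else max(d, na.rm = TRUE)
})

# Write each maximum into the first row of its interval only
newvar <- rep(NA_real_, length(depth))
newvar[!duplicated(grp)] <- max_per_grp  # 84 NA NA NA NA NA
```

Grouping on the positions of the non-NA x values means the intervals do not have to be exactly five rows long.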
Using dplyr, we can create groups of every 5 rows and update the first row in group as max value in the group ignoring NA values.
library(dplyr)
df %>%
group_by(grp = ceiling(time/5)) %>%
mutate(depth = ifelse(row_number() == 1, max(depth, na.rm = TRUE), NA))
In base R, we can use tapply :
inds <- seq(1, nrow(df), 5)
df$depth[inds] <- tapply(df$depth, ceiling(df$time/5), max, na.rm = TRUE)
df$depth[-inds] <- NA
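A caveat for the fixed-width grouping: with the six-row example, ceiling(time/5) puts row 6 alone in a second group whose only depth is NA, and max(..., na.rm = TRUE) on an empty set returns -Inf with a warning. The sketch below shows one possible guard; the helper name safe_max is made up for illustration.

```r
time  <- 1:6
depth <- c(NA, 10, 19, 84, 65, NA)

# Hypothetical helper: NA instead of -Inf when nothing is left after dropping NA
safe_max <- function(v) {
  v <- v[!is.na(v)]
  if (length(v) == 0) NA_real_ else max(v)
}

grp <- ceiling(time / 5)                      # 1 1 1 1 1 2

# Unguarded: the all-NA group yields -Inf (warning suppressed here)
unguarded <- suppressWarnings(tapply(depth, grp, max, na.rm = TRUE))

# Guarded: the all-NA group yields NA
guarded <- tapply(depth, grp, safe_max)
```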
Maybe you can try ave like below
df <- within(df,
newvar <- ave(depth,
ceiling(time/5),
FUN = function(x) ifelse(length(x)>1&is.na(x),max(na.omit(x)),NA)))
such that
> df
x y time depth newvar
1 1430934 4943206 1 NA 84
2 NA NA 2 10 NA
3 NA NA 3 19 NA
4 NA NA 4 84 NA
5 NA NA 5 65 NA
6 1430939 4943210 6 NA NA
DATA
df <- structure(list(x = c(1430934L, NA, NA, NA, NA, 1430939L), y = c(4943206L,
NA, NA, NA, NA, 4943210L), time = 1:6, depth = c(NA, 10L, 19L,
84L, 65L, NA)), class = "data.frame", row.names = c("1", "2",
"3", "4", "5", "6"))
Here is another option using data.table:
library(data.table)
setDT(data)[, newvar := replace(frollapply(depth, 5L, max, na.rm=TRUE, align="left"),
seq(.N) %% 5L != 1L, NA_integer_)]

remove NA values and combine non NA values into a single column

I have a data set with numeric and NA values in all columns. I would like to create a new column with all non-NA values and preserve the row names.
v1 v2 v3 v4 v5
a 1 NA NA NA NA
b NA 2 NA NA NA
c NA NA 3 NA NA
d NA NA NA 4 NA
e NA NA NA NA 5
I have tried using the coalesce function from dplyr
digital_metrics_FB <- fb_all_data %>%
mutate(fb_metrics = coalesce("v1",
"v2",
"v3",
"v4",
"v5"))
and also tried an apply function
df2 <- sapply(fb_all_data,function(x) x[!is.na(x)])
but I still cannot get it to work.
I am looking for a final result where all non-NA values come together in one column and the row names are preserved:
final
a 1
b 2
c 3
d 4
e 5
Any help would be much appreciated.
We can use pmax
do.call(pmax, c(fb_all_data , na.rm = TRUE))
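As far as I can tell, pmax returns a plain vector here, so the row names have to be reattached if they should be preserved. A minimal sketch using the example data from the question:

```r
fb_all_data <- data.frame(v1 = c(1L, NA, NA, NA, NA),
                          v2 = c(NA, 2L, NA, NA, NA),
                          v3 = c(NA, NA, 3L, NA, NA),
                          v4 = c(NA, NA, NA, 4L, NA),
                          v5 = c(NA, NA, NA, NA, 5L),
                          row.names = c("a", "b", "c", "d", "e"))

# Row-wise maximum ignoring NA; with one non-NA value per row this is that value
vals <- do.call(pmax, c(fb_all_data, na.rm = TRUE))

# Rebuild a one-column data frame that carries the original row names
final <- data.frame(final = vals, row.names = rownames(fb_all_data))
```

This only gives the intended result when each row has a single non-NA value; with several, pmax keeps the largest one.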
If there is more than one non-NA element per row and you want to combine them into a string, a simple base R option would be
data.frame(final = apply(fb_all_data, 1, function(x) toString(x[!is.na(x)])))
Or using coalesce
library(dplyr)
library(tibble)
fb_all_data %>%
rownames_to_column('rn') %>%
transmute(rn, final = coalesce(v1, v2, v3, v4, v5)) %>%
column_to_rownames('rn')
# final
#a 1
#b 2
#c 3
#d 4
#e 5
Or using tidyverse, for multiple non-NA elements
library(purrr)
fb_all_data %>%
rownames_to_column('rn') %>%
transmute(rn, final = pmap_chr(.[-1], ~ c(...) %>%
na.omit %>%
toString)) %>%
column_to_rownames('rn')
NOTE: Here we are showing data that the OP showed as example and not some other dataset
data
fb_all_data <- structure(list(v1 = c(1L, NA, NA, NA, NA), v2 = c(NA, 2L, NA,
NA, NA), v3 = c(NA, NA, 3L, NA, NA), v4 = c(NA, NA, NA, 4L, NA
), v5 = c(NA, NA, NA, NA, 5L)), class = "data.frame",
row.names = c("a",
"b", "c", "d", "e"))
With tidyverse, you can do:
library(tidyverse)
df %>%
rownames_to_column() %>%
gather(var, val, -1, na.rm = TRUE) %>%
group_by(rowname) %>%
summarise(val = paste(val, collapse = ", "))
rowname val
<chr> <chr>
1 a 1
2 b 2, 3
3 c 3
4 d 4
5 e 5
Sample data to have a row with more than one non-NA value:
df <- read.table(text = " v1 v2 v3 v4 v5
a 1 NA NA NA NA
b NA 2 3 NA NA
c NA NA 3 NA NA
d NA NA NA 4 NA
e NA NA NA NA 5", header = TRUE)

How to combine rowSums and ifelse with mutate

I need to combine rowSums and ifelse in order to create a new variable. My data looks like this:
boss var1 var2 var3 newvar
1 NA NA 3 NA
1 2 3 3 8
2 NA NA NA 0
2 NA NA NA 0
2 NA NA NA 0
1 1 NA 2 3
If boss==1 and there is more than one missing value in var1 to var3, newvar should be NA; otherwise, it should be the result of var1+var2+var3 (ignoring a single missing value, as in the last row).
If boss==2, newvar should automatically be 0.
So far, I have been able to solve parts of the problem using dplyr:
mutate(newvar=rowSums(.[,2:4],na.rm=TRUE) +
ifelse(rowSums(is.na(.[,2:4]))>1 & boss==2,NA,0))
mutate(newvar=ifelse(boss==2,0,NA))
However, I'm struggling to combine the two. Any help is much appreciated.
Here is one option with case_when where we create an index ('i1') which computes the number of NA elements in the row. The index is used in the case_when to create logical conditions to assign the values
df %>%
mutate(i1 = rowSums(is.na(.[-1]))) %>%
mutate(newvar = case_when(i1 > 1 & boss==1 ~ NA_integer_,
boss==2 ~ 0L,
i1 <=1 & boss != 2~ as.integer(rowSums(.[2:4], na.rm = TRUE)))) %>%
select(-i1)
# boss var1 var2 var3 newvar
#1 1 NA NA 3 NA
#2 1 2 3 3 8
#3 2 NA NA NA 0
#4 2 NA NA NA 0
#5 2 NA NA NA 0
#6 1 1 NA 2 3
In base R, this can be done by creating an index, without using any ifelse
i1 <- df$boss != 2
tmp <- i1 * df[-1]
df$newvar <- NA^(rowSums(is.na(tmp)) > 1 & i1) * rowSums(tmp, na.rm = TRUE)
df$newvar
#[1] NA 8 0 0 0 3
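The NA^(...) expression relies on the fact that NA^0 is 1 while NA^1 is NA, so raising NA to a logical vector works as a switch that blanks out selected elements. A tiny sketch with made-up vectors flag and sums (assumptions for illustration):

```r
# NA^0 is 1 and NA^1 is NA, so a logical exponent acts as an NA switch
flag <- c(TRUE, FALSE, FALSE)   # TRUE = this element should become NA
sums <- c(3, 8, 0)

switch_vec <- NA^flag           # NA 1 1
result <- switch_vec * sums     # NA 8 0
```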
data
df <- structure(list(boss = c(1L, 1L, 2L, 2L, 2L, 1L), var1 = c(NA,
2L, NA, NA, NA, 1L), var2 = c(NA, 3L, NA, NA, NA, NA), var3 = c(3L,
3L, NA, NA, NA, 2L)), .Names = c("boss", "var1", "var2", "var3"
), row.names = c(NA, -6L), class = "data.frame")
A solution in base R using apply:
df$newvar <- apply(df,1, function(x){
#retVal = NA
if(x["boss"]==2){
0
} else if(sum(is.na(x[-1])) > 1){
NA
} else{
sum(x[-1], na.rm = TRUE)
}
})
# boss var1 var2 var3 newvar
# 1 1 NA NA 3 NA
# 2 1 2 3 3 8
# 3 2 NA NA NA 0
# 4 2 NA NA NA 0
# 5 2 NA NA NA 0
# 6 1 1 NA 2 3
Data:
df <- read.table(text =
"boss var1 var2 var3
1 NA NA 3
1 2 3 3
2 NA NA NA
2 NA NA NA
2 NA NA NA
1 1 NA 2",
header = TRUE, stringsAsFactors = FALSE)
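Since the question asks specifically about combining rowSums and ifelse, here is a vectorized base R sketch of that combination. The intermediate names vars, n_miss and total are made up for illustration; the data are the example from the question.

```r
df <- data.frame(boss = c(1L, 1L, 2L, 2L, 2L, 1L),
                 var1 = c(NA, 2L, NA, NA, NA, 1L),
                 var2 = c(NA, 3L, NA, NA, NA, NA),
                 var3 = c(3L, 3L, NA, NA, NA, 2L))

vars   <- c("var1", "var2", "var3")
n_miss <- rowSums(is.na(df[vars]))            # missing values per row
total  <- rowSums(df[vars], na.rm = TRUE)     # sum ignoring NA

# boss == 2 -> 0; boss == 1 with more than one NA -> NA; otherwise the sum
df$newvar <- ifelse(df$boss == 2, 0,
                    ifelse(n_miss > 1, NA, total))
```

The same two helper vectors could be computed inside mutate and fed to case_when, as in the dplyr answer above.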
