I have a data frame like this:
I want to create a new column that holds the sum of the other columns, ignoring NA whenever a row contains at least one numeric value. If every value in a row is NA (like the second row), the sum column should be NA.
As this is your first activity here on SO, you should have a look at this, which describes how a minimal and reproducible example is made. You will certainly need it if you ask more questions in the future. An image is usually not accepted as a starting point.
Fortunately your table was a small one. I turned it into a tribble and then used rowSums to calculate the numbers you seem to want.
df <- tibble::tribble(
~x, ~y, ~z,
6000, NA, NA,
NA, NA, NA,
100, 7000, 1000,
0, 0, NA
)
# note: with na.rm = TRUE an all-NA row sums to 0, not NA
df$sum <- rowSums(df, na.rm = TRUE)
df
#> # A tibble: 4 x 4
#> x y z sum
#> <dbl> <dbl> <dbl> <dbl>
#> 1 6000 NA NA 6000
#> 2 NA NA NA 0
#> 3 100 7000 1000 8100
#> 4 0 0 NA 0
Created on 2020-06-15 by the reprex package (v0.3.0)
Let's say that your data frame is called df
# NA for rows that are entirely NA, otherwise the sum ignoring NA
cbind(df, sum = apply(df, 1, function(x) if (all(is.na(x))) NA else sum(x, na.rm = TRUE)))
Note that if your data frame has other columns, you will need to restrict the df call within apply to only be the columns you're after.
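To illustrate the note above, here is a minimal sketch that sums only selected columns. The `id` column and the `cols` vector are hypothetical and exist only for this example; adapt the names to your own data.

```r
# Hypothetical data: 'id' is an extra column that must not enter the sum
df <- data.frame(
  id = c("a", "b", "c", "d"),
  x  = c(6000, NA, 100, 0),
  y  = c(NA, NA, 7000, 0),
  z  = c(NA, NA, 1000, NA)
)

cols <- c("x", "y", "z")  # columns to sum (assumed names)

# NA when every selected value in the row is NA, otherwise the NA-ignoring sum
df$sum <- apply(df[cols], 1, function(x) {
  if (all(is.na(x))) NA else sum(x, na.rm = TRUE)
})

df$sum
# [1] 6000   NA 8100    0
```

Passing `df[cols]` rather than `df` also keeps `apply` from coercing the numeric columns to character because of `id`.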
You can count the non-NA values in each row of df. If a row has no non-NA value, assign NA as the output; otherwise calculate the row-wise sum using rowSums.
ifelse(rowSums(!is.na(df)) == 0, NA, rowSums(df, na.rm = TRUE))
#[1] 6000 NA 10000 8100 0
data
df <- structure(list(x = c(6000, NA, 10000, 100, 0), y = c(NA, NA,
NA, 7000, 0), z = c(NA, NA, NA, 1000, NA)), class = "data.frame",
row.names = c(NA, -5L))
I have a simplified dataframe:
test <- data.frame(
x = c(1,2,3,NA,NA,NA),
y = c(NA, NA, NA, 3, 2, NA),
a = c(NA, NA, NA, NA, NA, TRUE)
)
I want to create a new column rating that takes its value from either column x or column y. The dataset is structured so that whenever there's a numeric value in x, there's an NA in y. If both columns are NA, then the value in rating should be NA.
In this case, the expected output is: 1,2,3,3,2,NA
With coalesce:
library(dplyr)
test %>%
mutate(rating = coalesce(x, y))
x y a rating
1 1 NA NA 1
2 2 NA NA 2
3 3 NA NA 3
4 NA 3 NA 3
5 NA 2 NA 2
6 NA NA TRUE NA
library(dplyr)
test %>%
mutate(rating = if_else(is.na(x),
y, x))
x y a rating
1 1 NA NA 1
2 2 NA NA 2
3 3 NA NA 3
4 NA 3 NA 3
5 NA 2 NA 2
6 NA NA TRUE NA
Here are several solutions.
# Input
test <- data.frame(
x = c(1,2,3,NA,NA,NA),
y = c(NA, NA, NA, 3, 2, NA),
a = c(NA, NA, NA, NA, NA, TRUE)
)
# Base R solution
test$rating <- ifelse(!is.na(test$x), test$x,
ifelse(!is.na(test$y), test$y, NA))
# dplyr solution
library(dplyr)
test <- test %>%
mutate(rating = case_when(!is.na(x) ~ x,
!is.na(y) ~ y,
TRUE ~ NA_real_))
# data.table solution
library(data.table)
setDT(test)
test[, rating := ifelse(!is.na(x), x, ifelse(!is.na(y), y, NA))]
Created on 2022-12-23 with reprex v2.0.2
test <- data.frame(
x = c(1,2,3,NA,NA,NA),
y = c(NA, NA, NA, 3, 2, NA),
a = c(NA, NA, NA, NA, NA, TRUE)
)
test$rating <- dplyr::coalesce(test$x, test$y)
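For readers without dplyr, the same first-non-NA logic can be sketched in base R with indexed assignment. This only illustrates what `coalesce(x, y)` computes for two columns; it is not dplyr's implementation, and `missing_x` is a name made up for the example.

```r
test <- data.frame(
  x = c(1, 2, 3, NA, NA, NA),
  y = c(NA, NA, NA, 3, 2, NA),
  a = c(NA, NA, NA, NA, NA, TRUE)
)

# Start from x and fill its NA positions with the corresponding y values
rating <- test$x
missing_x <- is.na(rating)
rating[missing_x] <- test$y[missing_x]
test$rating <- rating

test$rating
# [1]  1  2  3  3  2 NA
```

Rows where both x and y are NA stay NA, because the value copied from y is itself NA.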
I have a data set with many columns (DATA_OLD) in which I want to exchange all values based on an allocation list with many entries (KEY).
Every value in DATA_OLD should be replaced by its counterpart (can be seen in KEY) to create DATA_NEW.
For simplicity, the example here contains a short KEY and DATA_OLD set. In reality, there are >2500 rows in KEY and >100 columns in DATA_OLD. Therefore, an approach that can be applied to the whole data set simultaneously without calling each colname of DATA_OLD is important.
KEY:
old  new
1    1
3    2
7    3
12   4
55   5
Following this example, every value "1" should be replaced with another value "1". Every value "3" should be replaced with value "2". Every value "7" should be replaced with value "3".
DATA_OLD (START):
var1  var2  var3
NA    3     NA
NA    55    NA
1     NA    NA
NA    NA    NA
3     NA    NA
55    NA    12
DATA_NEW (RESULT):
var1  var2  var3
NA    2     NA
NA    5     NA
1     NA    NA
NA    NA    NA
2     NA    NA
5     NA    4
Here is reproducible data:
KEY<-structure(list(old = c(1, 3, 7, 12, 55), new = c(1, 2, 3, 4,
5)), class = "data.frame", row.names = c(NA, -5L))
DATA_OLD<-structure(list(var1 = c(NA, NA, 1, NA, 3, 55), var2 = c(3,
55, NA, NA, NA, NA), var3 = c(1, NA, NA, NA, NA, 12)), class = "data.frame", row.names = c(NA, -6L))
DATA_NEW<-structure(list(var1 = c(NA, NA, 1, NA, 2, 5), var2 = c(2,
5, NA, NA, NA, NA), var3 = c(1, NA, NA, NA, NA, 4)), class = "data.frame", row.names = c(NA, -6L))
I have tried back and forth, and it appears that I am completely clueless. Help would be greatly appreciated! The real data set is quite large...
1) Base R Be careful here since some solutions have the side effect of converting the numeric columns to character or factor or the data frame to something else. A solution using match will generally work. The result of lapply is a list so convert back to data frame.
DATA_OLD |>
lapply(function(x) with(KEY, new[match(x, old)])) |>
as.data.frame()
or
DATA_NEW <- DATA_OLD
DATA_NEW[] <- lapply(DATA_OLD, function(x) with(KEY, new[match(x, old)]))
This last one is easy to convert to act only on some columns
DATA_NEW <- DATA_OLD
ix <- 1:2 # only convert these columns
DATA_NEW[ix] <- lapply(DATA_OLD[ix], function(x) with(KEY, new[match(x, old)]))
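To make the `match` step in the base R solutions concrete, here is a small sketch on a single vector. The value 99 is hypothetical and was added to show what happens to entries that are not in KEY; the `kept` variant is one possible fallback, not part of the answer above.

```r
KEY <- data.frame(old = c(1, 3, 7, 12, 55), new = c(1, 2, 3, 4, 5))
x <- c(NA, 3, 55, 99)  # 99 is a hypothetical value missing from KEY

# Position of each value in KEY$old, NA if absent
pos <- match(x, KEY$old)
pos
# [1] NA  2  5 NA

# Indexing with NA yields NA, so the result stays numeric
replaced <- KEY$new[pos]
replaced
# [1] NA  2  5 NA

# Optional: keep the original value where there is no match
kept <- ifelse(is.na(pos), x, replaced)
kept
# [1] NA  2  5 99
```

By default, then, any value without an entry in KEY silently becomes NA, which is worth checking for on a key with more than 2500 rows.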
2) purrr Alternately use map_dfr which returns a data frame directly:
library(purrr)
map_dfr(DATA_OLD, ~ with(KEY, new[match(.x, old)]))
3) dplyr A dplyr solution using across is the following. If there were some non-numeric columns that should not be converted then replace everything() with where(is.numeric)
library(dplyr)
DATA_OLD %>%
mutate(across(everything(), ~ with(KEY, new[match(.x, old)])))
The simplest way to implement a dictionary in R is a named vector, where you can use the names as indices:
key <- setNames(KEY$new, KEY$old)
> key
1 3 7 12 55
1 2 3 4 5
The only thing to be mindful of is that the indexing must be done by character, rather than by integer:
> key[3]
7
3 # WRONG! This is the 3rd item!
> key["3"]
3
2 # RIGHT! This is the item named "3"
Then you can apply the transformation column-wise. This turns the data into a matrix, but you can simply turn it back.
as.data.frame(apply(DATA_OLD, 2, \(col) key[as.character(col)]))
var1 var2 var3
1 NA 2 1
2 NA 5 NA
3 1 NA NA
4 NA NA NA
5 2 NA NA
6 5 NA 4
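If the detour through a matrix is unwanted, the same named-vector lookup can be applied with `lapply`, which keeps the data frame intact. This is a sketch of an alternative, assuming the KEY and DATA_OLD objects from the question.

```r
KEY <- data.frame(old = c(1, 3, 7, 12, 55), new = c(1, 2, 3, 4, 5))
DATA_OLD <- data.frame(
  var1 = c(NA, NA, 1, NA, 3, 55),
  var2 = c(3, 55, NA, NA, NA, NA),
  var3 = c(1, NA, NA, NA, NA, 12)
)

key <- setNames(KEY$new, KEY$old)

DATA_NEW <- DATA_OLD
# Assigning into DATA_NEW[] keeps the data.frame class and column names;
# unname() drops the lookup names so they do not leak into the columns
DATA_NEW[] <- lapply(DATA_OLD, function(col) unname(key[as.character(col)]))
DATA_NEW
```

Because each column is processed on its own, columns of other types could be left out by indexing both sides, for example `DATA_NEW[1:2] <- lapply(DATA_OLD[1:2], ...)`.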
Forgive the very basic question. I have some output from an experiment that had 3 different versions of the same question, depending on the condition. The output file treated each question as a separate column, so my output looks like this, where the column headers repeat:
Q1,Q2,Q3,Q1,Q2,Q3,Q1,Q2,Q3
1, 0, 1
-----------0, 1, 0
--------------------1, 1, 1
How would I be able to merge the output (preferably in Excel - my output is currently stored in an excel file, or alternatively in R), so that the desired output looks like this:
Q1,Q2,Q3
1, 0, 1
0, 1, 0
1, 1, 1
Thanks in advance!
An option in R, after reading the dataset with a function that reads the Excel file (read_excel etc.), would be to loop over the unique names of the dataset, extract the matching columns, unlist them, and remove the NA elements (if any, assuming the blanks are read as NA):
nm1 <- unique(sub("\\.\\d+", "", names(df1)))
out <- sapply(nm1, function(x) na.omit(unlist(df1[grep(x, names(df1))])))
row.names(out) <- NULL
out
# Q1 Q2 Q3
#[1,] 1 0 1
#[2,] 0 1 0
#[3,] 1 1 1
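One caveat with the unanchored `grep(x, names(df1))` above: it works for Q1 to Q3, but a name such as Q1 would also match Q10. The sketch below uses hypothetical column names to show the effect and one possible anchored pattern; `cols_for` is a helper invented for the example.

```r
# Hypothetical column names as they might look after import, including Q10
nms <- c("Q1", "Q10", "Q1.1", "Q10.1")

# Strip the numeric suffix to get the base names
base <- unique(sub("\\.\\d+$", "", nms))
base
# [1] "Q1"  "Q10"

# Unanchored pattern: "Q1" also matches the Q10 columns
grep("Q1", nms, value = TRUE)
# [1] "Q1"    "Q10"   "Q1.1"  "Q10.1"

# Anchored pattern: the base name, optionally followed by ".<digits>"
cols_for <- function(x, nms) {
  grep(paste0("^", x, "(\\.\\d+)?$"), nms, value = TRUE)
}
cols_for("Q1", nms)
# [1] "Q1"   "Q1.1"
```

This assumes the base names contain no regex metacharacters; otherwise they would need escaping first.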
Or with tidyverse with gather/spread
library(tidyverse)
gather(df1, na.rm = TRUE) %>%
# rowid() comes from data.table, not the tidyverse, so call it with its namespace
mutate(key = str_remove(key, "\\.\\d+$"), ind = data.table::rowid(key)) %>%
spread(key, value) %>%
select(-ind)
# Q1 Q2 Q3
#1 1 0 1
#2 0 1 0
#3 1 1 1
Another option is to split the data into a list of data.frames that share the same base column name, then use coalesce to reduce each one to a single vector. For every row, coalesce keeps the first non-NA element, which drops the NA padding.
split.default(df1, nm1) %>%
map_df(reduce, coalesce)
# A tibble: 3 x 3
# Q1 Q2 Q3
# <dbl> <dbl> <dbl>
#1 1 0 1
#2 0 1 0
#3 1 1 1
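The split-and-reduce idea can also be sketched in base R alone, which may help to see what each step does. `first_non_na` is a hypothetical two-argument helper standing in for `coalesce`; the data is the df1 from this answer.

```r
df1 <- data.frame(
  Q1 = c(1, NA, NA), Q2 = c(0, NA, NA), Q3 = c(1, NA, NA),
  Q1.1 = c(NA, 0, NA), Q2.1 = c(NA, 1, NA), Q3.1 = c(NA, 0, NA),
  Q1.2 = c(NA, NA, 1), Q2.2 = c(NA, NA, 1), Q3.2 = c(NA, NA, 1)
)

# Two-argument "first non-NA" helper (hypothetical name)
first_non_na <- function(a, b) ifelse(is.na(a), b, a)

# One base name per column: Q1, Q2, Q3, Q1, Q2, Q3, ...
base_names <- sub("\\.\\d+$", "", names(df1))

# Group the columns by base name, then fold each group into one vector
groups <- split.default(df1, base_names)
out <- as.data.frame(lapply(groups, function(g) Reduce(first_non_na, g)))
out
#   Q1 Q2 Q3
# 1  1  0  1
# 2  0  1  0
# 3  1  1  1
```

Here the grouping vector has one entry per column, so it does not rely on recycling the unique names.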
data
df1 <- structure(list(Q1 = c(1, NA, NA), Q2 = c(0, NA, NA), Q3 = c(1,
NA, NA), Q1.1 = c(NA, 0, NA), Q2.1 = c(NA, 1, NA), Q3.1 = c(NA,
0, NA), Q1.2 = c(NA, NA, 1), Q2.2 = c(NA, NA, 1), Q3.2 = c(NA,
NA, 1)), class = "data.frame", row.names = c(NA, -3L))
I have a dataset where I have to fill NA values using the previous value and a sum of current value in another column. Basically, my data looks like
library(lubridate)
library(tidyverse)
library(zoo)
df <- tibble(
Id = c(1, 1, 1, 1, 2, 2, 2, 2),
Time = ymd(c("2012-09-01", "2012-09-02", "2012-09-03", "2012-09-04", "2012-09-01", "2012-09-02", "2012-09-03", "2012-09-04")),
av = c(18, NA, NA, NA, 21, NA, NA, NA),
Value = c(121, NA,NA, NA, 146, NA, NA, NA)
)
# A tibble: 8 x 4
Id Time av Value
<dbl> <date> <dbl> <dbl>
1 2012-09-01 18 121
1 2012-09-02 NA NA
1 2012-09-03 NA NA
1 2012-09-04 NA NA
2 2012-09-01 21 146
2 2012-09-02 NA NA
2 2012-09-03 NA NA
2 2012-09-04 NA NA
What I want to do is: where the Value is NA, I want to replace it by sum of previous Value and current value of av. If av is NA, it can be replaced with previous value. I use na.locf function from zoo package as
df1 <- df %>% arrange(Id, Time) %>% group_by(Id) %>%
mutate(av = zoo::na.locf(av))
However, filling in for Value seems to be difficult. I can do it using for loop as
# Back up the Value column for testing
df1$Value_backup <- df1$Value
for(i in 2:nrow(df1))
{
df1$Value[i] <- ifelse(is.na(df1$Value[i]), df1$av[i] + df1$Value[i-1], df1$Value[i])
}
This produces the result I want, but for a large dataset I believe there are better ways to do it in R. I tried the complete function from tidyr, but it adds two additional rows:
df1 <- df %>% arrange(Id, Time) %>% group_by(Id) %>% mutate(av = zoo::na.locf(av)) %>%
mutate(num_rows = n()) %>%
complete(nesting(Id), Value = seq(min(Value, na.rm = TRUE),
(min(Value, na.rm = TRUE) + max(num_rows) * min(na.omit(av))), min(na.omit(av))))
The output has two extra rows; 10 instead of 8
# A tibble: 10 x 5
# Groups: Id [2]
Id Value Time av num_rows
<dbl> <dbl> <date> <dbl> <int>
1 121 2012-09-01 18 4
1 139 NA NA NA
1 157 NA NA NA
1 175 NA NA NA
1 193 NA NA NA
2 146 2012-09-01 21 4
2 167 NA NA NA
2 188 NA NA NA
2 209 NA NA NA
2 230 NA NA NA
Any help to do it faster without loops would be greatly appreciated.
In the question av starts with a non-NA in each group and is followed by NAs so if this is the general pattern then this will work. Note that it is good form to close any group_by with ungroup; however, we did not do that below so that we could compare df2 with df1.
df2 <- df %>%
group_by(Id) %>%
mutate(Value_backup = Value,
av = first(av),
Value = first(Value) + cumsum(av) - av)
identical(df1, df2)
## [1] TRUE
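The arithmetic behind `first(Value) + cumsum(av) - av` can be checked by hand: once av is constant within a group, `cumsum(av) - av` is 0, av, 2 * av, and so on. Below is a base R sketch of the same calculation, under the same assumption that only the first row of each Id holds values and that rows are already sorted by time. The Time column is left out for brevity, and `start`, `step` and `k` are names invented for the example.

```r
df <- data.frame(
  Id    = c(1, 1, 1, 1, 2, 2, 2, 2),
  av    = c(18, NA, NA, NA, 21, NA, NA, NA),
  Value = c(121, NA, NA, NA, 146, NA, NA, NA)
)

# First Value and first av of each Id, repeated along the group
start <- ave(df$Value, df$Id, FUN = function(v) rep(v[1], length(v)))
step  <- ave(df$av,    df$Id, FUN = function(a) rep(a[1], length(a)))

# 0-based position of each row within its Id
k <- ave(df$Id, df$Id, FUN = seq_along) - 1

df$Value <- start + k * step
df$Value
# [1] 121 139 157 175 146 167 188 209
```

For Id 1 this gives 121, 121 + 18, 121 + 2 * 18 and 121 + 3 * 18, matching the loop in the question.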
Note
For reproducibility first run this (taken from question except we only load needed packages):
library(dplyr)
library(tibble)
library(lubridate)
df <- tibble(
Id = c(1, 1, 1, 1, 2, 2, 2, 2),
Time = ymd(c("2012-09-01", "2012-09-02", "2012-09-03", "2012-09-04",
"2012-09-01", "2012-09-02", "2012-09-03", "2012-09-04")),
av = c(18, NA, NA, NA, 21, NA, NA, NA),
Value = c(121, NA,NA, NA, 146, NA, NA, NA)
)
df1 <- df %>% arrange(Id, Time) %>% group_by(Id) %>%
mutate(av = zoo::na.locf(av))
df1$Value_backup <- df1$Value
for(i in 2:nrow(df1))
{
df1$Value[i] <- ifelse(is.na(df1$Value[i]), df1$av[i] + df1$Value[i-1], df1$Value[i])
}
I'm trying to assign values to different columns, separately for each row, based on lookup values. I'm working in R. Here's a minimal working example:
#Item scores
item1 <- c(NA, 1, NA, 4)
item2 <- c(NA, 2, NA, 3)
item3 <- c(NA, 3, NA, NA)
item57 <- c(NA, 4, 4, 1)
mydata <- data.frame(item1, item2, item3, item57)
#Lookup values based on item score
lookup <- data.frame(score = 1:4, value=c(6, 7, 8, 10))
I have many participants (i.e., rows) assessed with a score on each of many items (i.e., columns). I'd like to create variables in my data frame for the values that are tied to the item scores (based on the lookup table). Here's my desired output:
#Desired output (adding value that is tied to item score to the original data)
desiredOutput <- cbind(mydata,
value1 = c(NA, 6, NA, 10),
value2 = c(NA, 7, NA, 8),
value3 = c(NA, 8, NA, NA),
value57 = c(NA, 10, 10, 6))
I have a fairly large dataset and would like to stay away from loops, if possible. Also, we can skip rows with all NAs, if it's faster to process.
Here's a tidyverse method. The idea is to first gather the score columns and left_join the lookup table, so that each score is matched to its value. The rest is just manipulation to get back to the desired output format: create the column names we want with gather and unite, and then finally spread back out. Note that you need rowid_to_column at the beginning so that spread knows which observations to place on which rows. If you want to get exactly your output column names, you can mix in some stringr.
item1 <- c(NA, 1, NA, 4)
item2 <- c(NA, 2, NA, 3)
item3 <- c(NA, 3, NA, NA)
item57 <- c(NA, 4, 4, 1)
mydata <- data.frame(item1, item2, item3, item57)
#Lookup values based on item score
lookup <- data.frame(score = 1:4, value=c(6, 7, 8, 10))
library(tidyverse)
mydata %>%
rowid_to_column(var = "participant") %>%
gather(items, score, starts_with("item")) %>%
left_join(lookup) %>%
gather(coltype, val, score:value) %>%
unite(colname, coltype, items) %>%
spread(colname, val)
#> Joining, by = "score"
#> participant score_item1 score_item2 score_item3 score_item57 value_item1
#> 1 1 NA NA NA NA NA
#> 2 2 1 2 3 4 6
#> 3 3 NA NA NA 4 NA
#> 4 4 4 3 NA 1 10
#> value_item2 value_item3 value_item57
#> 1 NA NA NA
#> 2 7 8 10
#> 3 NA NA 10
#> 4 8 NA 6
Created on 2018-06-19 by the reprex package (v0.2.0).
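For comparison, a base R sketch using `match` produces the value1 ... value57 column names from the question directly. It is an alternative to the reshaping pipeline above, not a translation of it, and assumes every score column starts with "item".

```r
mydata <- data.frame(
  item1  = c(NA, 1, NA, 4),
  item2  = c(NA, 2, NA, 3),
  item3  = c(NA, 3, NA, NA),
  item57 = c(NA, 4, 4, 1)
)
lookup <- data.frame(score = 1:4, value = c(6, 7, 8, 10))

# Translate every score column through the lookup table in one pass;
# NA scores have no match and therefore stay NA
values <- lapply(mydata, function(x) lookup$value[match(x, lookup$score)])

# Rename item1 -> value1, item57 -> value57, and so on
names(values) <- sub("^item", "value", names(values))

result <- cbind(mydata, as.data.frame(values))
result
```

Since `match` is vectorised over each column, this avoids explicit loops over rows, and rows that are entirely NA need no special handling.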