R: fill in cells with values from different rows

I’m trying to fill NAs in a row with values from a different row. These rows are “linked” by a case number. I want to write a loop with an if condition that goes through the entire data frame and does this, but I don’t think I grasp the R language well enough. Can anybody help me?
The data frame:
CASE <- c(1, 2, 3, 4, 5, 6)
SERIAL <-c("AB",NA, NA, "CD", NA, NA)
REF <- c(NA, 1, 1, NA, 4, 4)
PA <- c(4, NA, NA, 2, NA, NA)
PE <- c(NA, 2, NA, NA, 1, NA)
PE2 <- c(NA, NA, 3, NA, NA, 3)
df <- data.frame (CASE, SERIAL, REF, PA, PE, PE2)
CASE SERIAL REF PA PE PE2
1 AB NA 4 NA NA
2 <NA> 1 NA 2 NA
3 <NA> 1 NA NA 3
4 CD NA 2 NA NA
5 <NA> 4 NA 1 NA
6 <NA> 4 NA NA 3
In the row with CASE = 1, I want to fill in the empty PE and PE2 with the values from the rows below that reference it (REF = 1). Likewise, in the row with CASE = 4, I want to fill in the empty PE and PE2 with the values from the rows below that reference it (REF = 4). The rows with no serial number only serve to fill rows 1 and 4, so to speak. There is no way to collect the data directly into the corresponding rows. I tried this for loop, but I don't know how to reference the values correctly:
for (i in 1:dim(df)[1]{
if (data$SERIAL[i]==NA){
[data$CASE[data$REF[i]],PE] <- data$PE[i]
[data$CASE[data$REF[i]],PE2] <- data$PE2[i]}
}
)
Expected output:
CASE SERIAL REF PA PE PE2
1 1 AB NA 4 2 3
2 2 <NA> 1 NA 2 NA
3 3 <NA> 1 NA NA 3
4 4 CD NA 2 1 3
5 5 <NA> 4 NA 1 NA
6 6 <NA> 4 NA NA 3
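For reference, the loop attempted in the question can be repaired roughly as follows. This is only a sketch in base R, assuming every REF value points at exactly one CASE; the names filled and target are introduced here for illustration. The main changes are testing for missing values with is.na() (a comparison with == NA never yields TRUE), using df consistently, and indexing columns with $.

```r
# Data frame from the question
df <- data.frame(
  CASE   = c(1, 2, 3, 4, 5, 6),
  SERIAL = c("AB", NA, NA, "CD", NA, NA),
  REF    = c(NA, 1, 1, NA, 4, 4),
  PA     = c(4, NA, NA, 2, NA, NA),
  PE     = c(NA, 2, NA, NA, 1, NA),
  PE2    = c(NA, NA, 3, NA, NA, 3)
)

filled <- df  # work on a copy (hypothetical name)
for (i in seq_len(nrow(filled))) {
  # Only rows without a serial number that reference another row act as sources
  if (is.na(filled$SERIAL[i]) && !is.na(filled$REF[i])) {
    target <- which(filled$CASE == filled$REF[i])  # row being referenced
    if (!is.na(filled$PE[i]))  filled$PE[target]  <- filled$PE[i]
    if (!is.na(filled$PE2[i])) filled$PE2[target] <- filled$PE2[i]
  }
}
filled
```

A loop like this is fine for small data; the vectorised answers below will usually scale better.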

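Here is a dplyr solution that might work. It assumes each group has at most one non-NA value per column, so sum() with na.rm = TRUE simply returns that value:
library(dplyr)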
df %>%
mutate(REF = ifelse(is.na(REF), CASE, REF)) %>%
group_by(REF) %>%
summarise(SERIAL = first(SERIAL),
across(c(PA, PE, PE2), ~sum(.x, na.rm=TRUE))) %>%
rename("CASE" = "REF")
# # A tibble: 2 x 5
# CASE SERIAL PA PE PE2
# <dbl> <chr> <dbl> <dbl> <dbl>
# 1 1 AB 4 2 3
# 2 4 CD 2 1 3

withSerial = subset(df, !is.na(SERIAL))
withSerial
# CASE SERIAL REF PA PE PE2
#1 1 AB NA 4 NA NA
#4 4 CD NA 2 NA NA
noSerialwithRef = subset(df, is.na(SERIAL) & !is.na(REF))
noSerialwithRef
# CASE SERIAL REF PA PE PE2
#2 2 <NA> 1 NA 2 NA
#3 3 <NA> 1 NA NA 3
#5 5 <NA> 4 NA 1 NA
#6 6 <NA> 4 NA NA 3
withSerial$PE = subset(noSerialwithRef, !is.na(PE))$PE
withSerial$PE2 = subset(noSerialwithRef, !is.na(PE2))$PE2
withSerial
# CASE SERIAL REF PA PE PE2
#1 1 AB NA 4 2 3
#4 4 CD NA 2 1 3
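The subset approach above relies on the source rows appearing in the same order as the rows they fill. A possible variant, sketched here in base R, aligns the values explicitly with match() on REF and CASE; the names src_pe, src_pe2 and result are made up for this example, and it assumes at most one non-NA PE and PE2 per referenced case.

```r
df <- data.frame(
  CASE   = c(1, 2, 3, 4, 5, 6),
  SERIAL = c("AB", NA, NA, "CD", NA, NA),
  REF    = c(NA, 1, 1, NA, 4, 4),
  PA     = c(4, NA, NA, 2, NA, NA),
  PE     = c(NA, 2, NA, NA, 1, NA),
  PE2    = c(NA, NA, 3, NA, NA, 3)
)

# Source rows: those that reference a case and carry a value
src_pe  <- df[!is.na(df$REF) & !is.na(df$PE),  c("REF", "PE")]
src_pe2 <- df[!is.na(df$REF) & !is.na(df$PE2), c("REF", "PE2")]

# Target rows: those with a serial number
result <- df[!is.na(df$SERIAL), ]

# Look up each target CASE among the REF values of the source rows
result$PE  <- src_pe$PE[match(result$CASE, src_pe$REF)]
result$PE2 <- src_pe2$PE2[match(result$CASE, src_pe2$REF)]
result
```

If a case has no referencing row, match() returns NA and the cell simply stays NA.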

Update: Added library(tidyr) thanks to Martin Gal and added alternative code suggested by Martin Gal:
Here is another dplyr way:
fill SERIAL downward
use lead() on the PE and PE2 columns within each SERIAL group
keep only the first row of each group with slice(1)
library(dplyr)
library(tidyr)
df %>%
fill(SERIAL, .direction = "down") %>%
group_by(SERIAL) %>%
mutate(PE = lead(PE),
PE2 = lead(PE2,2)) %>%
slice(1)
# Alternative and better (suggested by Martin Gal):
df %>% fill(-c(CASE, SERIAL), .direction = "up") %>% drop_na()
CASE SERIAL REF PA PE PE2
<dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 AB NA 4 2 3
2 4 CD NA 2 1 3

Related

R Lag Variable And Skip Value Between

DATA = data.frame(STUDENT = c(1,1,1,2,2,2,3,3,4,4),
SCORE = c(6,4,8,10,9,0,2,3,3,7),
CLASS = c('A', 'B', 'C', 'A', 'B', 'C', 'B', 'C', 'A', 'B'),
WANT = c(NA, NA, 2, NA, NA, -10, NA, NA, NA, NA))
I have DATA and wish to create 'WANT', which is calculated as follows:
For each STUDENT, WANT equals SCORE(CLASS = C) - SCORE(CLASS = A), placed on the row where CLASS = C.
EX: SCORE(STUDENT = 1, CLASS = C) - SCORE(STUDENT = 1, CLASS = A) = 8-6=2
Assuming at most one 'C' and 'A' CLASS per each 'STUDENT', just subset the 'SCORE' where the CLASS value is 'C', 'A', do the subtraction and assign the value only to position where CLASS is 'C' by making all other positions to NA (after grouping by 'STUDENT')
library(dplyr)
DATA <- DATA %>%
group_by(STUDENT) %>%
mutate(WANT2 = (SCORE[CLASS == 'C'][1] - SCORE[CLASS == 'A'][1]) *
NA^(CLASS != "C")) %>%
ungroup
-output
# A tibble: 10 × 5
STUDENT SCORE CLASS WANT WANT2
<dbl> <dbl> <chr> <dbl> <dbl>
1 1 6 A NA NA
2 1 4 B NA NA
3 1 8 C 2 2
4 2 10 A NA NA
5 2 9 B NA NA
6 2 0 C -10 -10
7 3 2 B NA NA
8 3 3 C NA NA
9 4 3 A NA NA
10 4 7 B NA NA
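The NA^(CLASS != "C") term in the answer above may look cryptic. It works because R defines any value raised to the power 0 as 1, while NA raised to the power 1 stays NA. A small illustration, with class_col, mask and diff_value as made-up names:

```r
class_col <- c("A", "B", "C")

# TRUE is treated as 1 and FALSE as 0:
# NA^1 is NA (rows that are not "C"), NA^0 is 1 (rows that are "C")
mask <- NA^(class_col != "C")
mask

# Multiplying by the mask keeps the value only on the "C" row
diff_value <- 2
diff_value * mask
```

An ifelse(CLASS == "C", value, NA) call would do the same job and may be easier to read.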
Here is a solution with the data organized in a wider format first, then in a longer format below. This solution works regardless of the order of the "CLASS" column (for instance, if the CLASS order for some student is CBA or BCA instead of ABC, this solution will still work).
Solution
library(dplyr)
library(tidyr)
wider <- DATA %>% select(-WANT) %>%
pivot_wider( names_from = "CLASS", values_from = "SCORE") %>%
rowwise() %>%
mutate(WANT = C-A) %>%
ungroup()
output wider
# A tibble: 4 × 5
STUDENT A B C WANT
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 6 4 8 2
2 2 10 9 0 -10
3 3 NA 2 3 NA
4 4 3 7 NA NA
If you really want like your output example, then we can reorganize the wider data this way:
Reorganizing wider to long format
wider %>%
pivot_longer(A:C, values_to = "SCORE", names_to = "CLASS") %>%
relocate(WANT, .after = SCORE) %>%
mutate(WANT = if_else(CLASS == "C", WANT, NA_real_))
Final Output
# A tibble: 12 × 4
STUDENT CLASS SCORE WANT
<dbl> <chr> <dbl> <dbl>
1 1 A 6 NA
2 1 B 4 NA
3 1 C 8 2
4 2 A 10 NA
5 2 B 9 NA
6 2 C 0 -10
7 3 A NA NA
8 3 B 2 NA
9 3 C 3 NA
10 4 A 3 NA
11 4 B 7 NA
12 4 C NA NA
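If the original ten rows should be kept without reshaping, a lookup with match() is another option. This is a base R sketch, not part of the answers above; score_of is a hypothetical helper, and it assumes at most one row per STUDENT and CLASS.

```r
DATA <- data.frame(
  STUDENT = c(1, 1, 1, 2, 2, 2, 3, 3, 4, 4),
  SCORE   = c(6, 4, 8, 10, 9, 0, 2, 3, 3, 7),
  CLASS   = c("A", "B", "C", "A", "B", "C", "B", "C", "A", "B")
)

# For every row, return that student's SCORE in the given class (NA if absent)
score_of <- function(cls) {
  sub <- DATA[DATA$CLASS == cls, ]
  sub$SCORE[match(DATA$STUDENT, sub$STUDENT)]
}

# C minus A, kept only on the rows where CLASS is "C"
DATA$WANT <- ifelse(DATA$CLASS == "C", score_of("C") - score_of("A"), NA)
DATA
```

Like the pivot-based answer, this does not depend on the order of the CLASS values within a student.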

A computationally efficient way to find the IDs of the Type 1 rows just above and below each Type 2 row?

I have the following data
df <- tibble(Type=c(1,2,2,1,1,2),ID=c(6,4,3,2,1,5))
Type ID
1 6
2 4
2 3
1 2
1 1
2 5
For each of the type 2 rows, I want to find the IDs of the type 1 rows just below and above them. For the above dataset, the output will be:
Type ID IDabove IDbelow
1 6 NA NA
2 4 6 2
2 3 6 2
1 2 NA NA
1 1 NA NA
2 5 1 NA
Naively, I can write a for loop to achieve this, but that would be too time consuming for the dataset I am dealing with.
One approach uses dplyr's lead and lag to get the next and previous values, respectively, and data.table's rleid to create groups of consecutive Type values.
library(dplyr)
library(data.table)
df %>%
mutate(IDabove = ifelse(Type == 2, lag(ID), NA),
IDbelow = ifelse(Type == 2, lead(ID), NA),
grp = rleid(Type)) %>%
group_by(grp) %>%
mutate(IDabove = first(IDabove),
IDbelow = last(IDbelow)) %>%
ungroup() %>%
select(-grp)
# Type ID IDabove IDbelow
# <dbl> <dbl> <dbl> <dbl>
#1 1 6 NA NA
#2 2 4 6 2
#3 2 3 6 2
#4 1 2 NA NA
#5 1 1 NA NA
#6 2 5 1 NA
A dplyr only solution:
You could create your own rleid function and then apply the logic provided by Ronak (many thanks, upvoted).
library(dplyr)
my_func <- function(x) {
x <- rle(x)$lengths
rep(seq_along(x), times=x)
}
# this part is the same as provided by Ronak.
df %>%
mutate(IDabove = ifelse(Type == 2, lag(ID), NA),
IDbelow = ifelse(Type == 2, lead(ID), NA),
grp = my_func(Type)) %>%
group_by(grp) %>%
mutate(IDabove = first(IDabove),
IDbelow = last(IDbelow)) %>%
ungroup() %>%
select(-grp)
Output:
Type ID IDabove IDbelow
<dbl> <dbl> <dbl> <dbl>
1 1 6 NA NA
2 2 4 6 2
3 2 3 6 2
4 1 2 NA NA
5 1 1 NA NA
6 2 5 1 NA
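To see what the run-length helper contributes, it can be useful to look at the group ids it produces for the Type column of the example. A short base R illustration, with type, runs and run_id as names chosen here:

```r
type <- c(1, 2, 2, 1, 1, 2)

# rle() compresses the vector into runs of identical consecutive values
runs <- rle(type)
runs$values   # value of each run
runs$lengths  # length of each run

# Repeat each run's index by its length to get one id per row
run_id <- rep(seq_along(runs$lengths), times = runs$lengths)
run_id
```

Rows 2 and 3 share an id, so taking first(IDabove) and last(IDbelow) within that group spreads the neighbouring Type 1 IDs across the whole run of Type 2 rows.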

How do I count values in a column and match them with a specific row?

I have the dataset that looks like this, where ID and emails correspond to a unique person. The remaining columns represent people named by that person/row. For example, a person with ID 1 and email address alex#gmail.com named Pete, Jane, and Tim when asked a question.
id email john_b alex_a pete jane tim
1 alex#gmail.com NA NA 1 1 1
2 pete#yahoo.com NA 1 1 NA NA
3 jane#q.com NA NA 1 NA 1
4 bea#mail.co NA 1 1 NA NA
5 tim#q.com NA NA 1 NA 1
I need the new dataset to look like this, where a new column nomination represents the number of times that person/row was named in the rest of the dataset. For example, Pete was named by 5 people and gets 5 in the nomination column, on the row with the relevant email address. Jane was named once (by alex#gmail.com) and gets 1 in the nomination column, on the row with Jane's email address.
id email john_b alex_a pete jane tim nomination
1 alex#gmail.com NA NA 1 1 1 0
2 pete#yahoo.com NA 1 1 NA NA 5
3 jane#q.com NA NA 1 NA 1 1
4 bea#mail.co NA 1 1 NA NA 0
5 tim#q.com NA NA 1 NA 1 3
I have a sense that I need a combination of case-when and grepl here, but can't wrap my head around it.
Thanks for any help!
Hi, I finally came up with code that I hope gets you what you expect. However, I could not think of any way to match bea#mail.co to john_b. It would take a far brighter mind than mine, but if I think of anything I will update my code here:
library(dplyr)
library(tidyr)
library(stringr)
library(purrr) # map() used below comes from purrr
df <- tribble(
~email, ~john_b, ~alex_a, ~pete, ~jane, ~tim,
"alex#gmail.com", NA, NA, 1, 1, 1,
"pete#yahoo.com", NA , 1, 1, NA, NA,
"jane#q.com", NA , NA, 1, NA, 1,
"bea#mail.co", NA, 1, 1, NA, NA,
"tim#q.com", NA , NA, 1, NA, 1
)
# First we count the number of times each person is named
nm <- df %>%
summarise(across(john_b:tim, ~ sum(.x, na.rm = TRUE))) %>%
pivot_longer(everything(), names_to = "names", values_to = "nominations")
nm
# A tibble: 5 x 2
names nominations
<chr> <dbl>
1 john_b 0
2 alex_a 2
3 pete 5
4 jane 1
5 tim 3
Then we try to partially match every name with its corresponding email. The only problem here is john_b, as I mentioned before.
nm2 <- nm %>%
rowwise() %>%
mutate(emails = map(names, ~ df$email[str_detect(df$email, str_sub(.x, 1L, 4L))])) %>%
unnest(cols = c(emails))
nm2
# A tibble: 4 x 3
names nominations emails
<chr> <dbl> <chr>
1 alex_a 2 alex#gmail.com
2 pete 5 pete#yahoo.com
3 jane 1 jane#q.com
4 tim 3 tim#q.com
And in the end we join these two data frames by emails:
df %>%
full_join(nm2, by = c("email" = "emails"))
# A tibble: 5 x 8
email john_b alex_a pete jane tim names nominations
<chr> <lgl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1 alex#gmail.com NA NA 1 1 1 alex_a 2
2 pete#yahoo.com NA 1 1 NA NA pete 5
3 jane#q.com NA NA 1 NA 1 jane 1
4 bea#mail.co NA 1 1 NA NA NA NA
5 tim#q.com NA NA 1 NA 1 tim 3
You can also omit the names column if you like. I just left it there so that you can compare the two. If you could modify John's email address, they would match perfectly.
If you organize your name columns in the same order as your email column then you could simply:
nomination <- colSums(df[, -(1:2)], na.rm = TRUE)
names(nomination) <- NULL
df <- cbind(df, nomination)
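When the columns are not in the same order as the rows, the totals can be matched to rows by name instead. The following base R sketch assumes that a person's column name starts with the part of the email before the "#" (which holds for alex, pete, jane and tim in the example, but not for bea); name_cols, totals, local_part and col_for_row are names introduced here. Rows without a matching column get 0.

```r
df <- data.frame(
  email  = c("alex#gmail.com", "pete#yahoo.com", "jane#q.com",
             "bea#mail.co", "tim#q.com"),
  john_b = c(NA, NA, NA, NA, NA),
  alex_a = c(NA, 1, NA, 1, NA),
  pete   = c(1, 1, 1, 1, 1),
  jane   = c(1, NA, NA, NA, NA),
  tim    = c(1, NA, 1, NA, 1)
)

name_cols <- setdiff(names(df), "email")

# Number of times each person (column) was named
totals <- colSums(df[name_cols], na.rm = TRUE)

# Part of the email before the "#"
local_part <- sub("#.*$", "", df$email)

# For each row, the single column whose name starts with that local part
col_for_row <- vapply(local_part, function(p) {
  hit <- name_cols[startsWith(name_cols, p)]
  if (length(hit) == 1) hit else NA_character_
}, character(1))

# Rows with no matching column are given 0 nominations
df$nomination <- unname(ifelse(is.na(col_for_row), 0, totals[col_for_row]))
df
```

Prefix matching is fragile with real names, so an explicit lookup table from email to column name would be safer if one is available.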

R function for grouping rows based on patterns across columns?

I would like to group rows of a dataframe based on the pattern of each row across columns. Here is a very simple example.
df <- data.frame("gene" = 1:5,
"stg 1" = c("up", "up", NA, NA, NA),
"stg 2" = c("up", "up", NA, NA, NA),
"stg 3" = c("up", "up", NA, NA, NA),
"stg 4" = c("down", "down", "up", "up", NA))
> df
gene stg.1 stg.2 stg.3 stg.4
1 1 up up up down
2 2 up up up down
3 3 <NA> <NA> <NA> up
4 4 <NA> <NA> <NA> up
5 5 <NA> <NA> <NA> <NA>
In this case, gene 1 and 2 would be grouped, and genes 3 and 4 would be grouped. I would like the names of the genes in each pattern group, and what the pattern is for that group. I hope that is clear. Thanks in advance!
Try this approach. Create a variable that collects the values across each row using c_across() and toString(). After that, convert it to a factor and add the prefix 'Group.' to the group number. Here is the code using tidyverse functions:
library(tidyverse)
#Code
dfnew <- df %>% group_by(gene) %>%
mutate(Var=toString(c_across(stg.1:stg.4))) %>%
ungroup() %>%
mutate(Var=paste0('Group.',as.numeric(factor(Var,levels = unique(Var),ordered = T))))
Output:
# A tibble: 5 x 6
gene stg.1 stg.2 stg.3 stg.4 Var
<int> <fct> <fct> <fct> <fct> <chr>
1 1 up up up down Group.1
2 2 up up up down Group.1
3 3 NA NA NA up Group.2
4 4 NA NA NA up Group.2
5 5 NA NA NA NA Group.3
If you only need a pattern, try this:
#Code 2
dfnew <- df %>% group_by(gene) %>%
mutate(Var=toString(c_across(stg.1:stg.4)))
Output:
# A tibble: 5 x 6
# Groups: gene [5]
gene stg.1 stg.2 stg.3 stg.4 Var
<int> <fct> <fct> <fct> <fct> <chr>
1 1 up up up down up, up, up, down
2 2 up up up down up, up, up, down
3 3 NA NA NA up NA, NA, NA, up
4 4 NA NA NA up NA, NA, NA, up
5 5 NA NA NA NA NA, NA, NA, NA
We can do this in a vectorized way with unite
library(dplyr)
library(tidyr)
df %>%
unite(grp, starts_with('stg'), na.rm = TRUE, remove = FALSE) %>%
mutate(grp = match(grp, unique(grp)))
# gene grp stg.1 stg.2 stg.3 stg.4
#1 1 1 up up up down
#2 2 1 up up up down
#3 3 2 <NA> <NA> <NA> up
#4 4 2 <NA> <NA> <NA> up
#5 5 3 <NA> <NA> <NA> <NA>
Though not specifically asked for, a data.table solution is as follows:
library(data.table)
setDT(df)
df[,group:= paste0(stg.1,stg.2,stg.3,stg.4),by= gene][,group:= match(group, unique(group))]
> df
gene stg.1 stg.2 stg.3 stg.4 group
1: 1 up up up down 1
2: 2 up up up down 1
3: 3 <NA> <NA> <NA> up 2
4: 4 <NA> <NA> <NA> up 2
5: 5 <NA> <NA> <NA> <NA> 3
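The question also asks for the gene names in each pattern group together with the pattern itself. One way to get both at once, sketched in base R with stage_cols, pattern and genes_by_pattern as names chosen for this example, is to paste the stage columns into one string per row and split the genes by that string:

```r
df <- data.frame(
  gene  = 1:5,
  stg.1 = c("up", "up", NA, NA, NA),
  stg.2 = c("up", "up", NA, NA, NA),
  stg.3 = c("up", "up", NA, NA, NA),
  stg.4 = c("down", "down", "up", "up", NA)
)

stage_cols <- grep("^stg", names(df), value = TRUE)

# One pattern string per row; NA is kept as the text "NA" so positions stay distinct
pattern <- do.call(paste, c(df[stage_cols], sep = ", "))

# Named list: pattern -> genes showing that pattern, in order of first appearance
genes_by_pattern <- split(df$gene, factor(pattern, levels = unique(pattern)))
genes_by_pattern
```

Keeping NA in the pattern string distinguishes, say, "up" in stage 1 only from "up" in stage 4 only, which dropping the NAs would merge.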

How to merge variables looping through by variable number in R

I have a dataframe with a lot of variables seen in multiple conditions. I'd like to merge each variable by condition.
The example data frame is a simplified version of what I have (3 variables over 2 conditions).
VAR.B_1 <- c(1, 2, 3, 4, 5, 'NA', 'NA', 'NA', 'NA', 'NA')
VAR.B_2 <- c(2, 2, 3, 4, 5,'NA', 'NA', 'NA', 'NA', 'NA')
VAR.B_3 <- c(1, 1, 1, 1, 1,'NA', 'NA', 'NA', 'NA', 'NA')
VAR.E_1 <- c(NA, NA, NA, NA, NA, 1, 1, 1, 1, 1)
VAR.E_2 <- c(NA, NA, NA, NA, NA, 1, 2, 3, 4, 5)
VAR.E_3 <- c(NA, NA, NA, NA, NA, 1, 1, 1, 1, 1)
Condition <- c("B", "B","B","B","B","E","E","E","E","E")
#Example dataset
data<-as.data.frame(cbind(VAR.B_1,VAR.B_2,VAR.B_3, VAR.E_1,VAR.E_2, VAR.E_3, Condition))
I want to end up with this, appended to the original data frame:
VAR_1 VAR_2 VAR_3
1 2 1
2 2 1
3 3 1
4 4 1
5 5 1
1 1 1
1 2 1
1 3 1
1 4 1
1 5 1
I understand that R won't work with i inside the variable name, but I have an example of the kind of for loop I was trying to do. I would rather not call variables by column location, since there will be a lot of variables.
##Example of how I want to merge - this code does not work
for(i in 1:3) {
data$VAR_[,i] <-ifelse(data$Condition == "B", VAR.B_[,i],
ifelse(data$Condition == "E", VAR.E_[,i], NA))
}
This might work for your situation:
library(tidyverse)
library(stringr)
data %>%
mutate_all(as.character) %>%
gather(key, value, -Condition) %>%
filter(!is.na(value), value != "NA") %>%
mutate(key = str_replace(key, paste0("\\.", Condition), "")) %>%
group_by(Condition, key) %>%
mutate(rowid = 1:n()) %>%
spread(key, value) %>%
bind_cols(data)
#> # A tibble: 10 x 12
#> # Groups: Condition [2]
#> Condition rowid VAR_1 VAR_2 VAR_3 VAR.B_1 VAR.B_2 VAR.B_3 VAR.E_1
#> <chr> <int> <chr> <chr> <chr> <fctr> <fctr> <fctr> <fctr>
#> 1 B 1 1 2 1 1 2 1 NA
#> 2 B 2 2 2 1 2 2 1 NA
#> 3 B 3 3 3 1 3 3 1 NA
#> 4 B 4 4 4 1 4 4 1 NA
#> 5 B 5 5 5 1 5 5 1 NA
#> 6 E 1 1 1 1 NA NA NA 1
#> 7 E 2 1 2 1 NA NA NA 1
#> 8 E 3 1 3 1 NA NA NA 1
#> 9 E 4 1 4 1 NA NA NA 1
#> 10 E 5 1 5 1 NA NA NA 1
#> # ... with 3 more variables: VAR.E_2 <fctr>, VAR.E_3 <fctr>,
#> # Condition1 <fctr>
data.frame(lapply(split.default(data[-NCOL(data)], gsub("\\D+", "", head(names(data), -1))),
function(a){
a = sapply(a, function(x) as.numeric(as.character(x)))
rowSums(a, na.rm = TRUE)
}))
# X1 X2 X3
#1 1 2 1
#2 2 2 1
#3 3 3 1
#4 4 4 1
#5 5 5 1
#6 1 1 1
#7 1 2 1
#8 1 3 1
#9 1 4 1
#10 1 5 1
#Warning messages:
#1: In FUN(X[[i]], ...) : NAs introduced by coercion
#2: In FUN(X[[i]], ...) : NAs introduced by coercion
#3: In FUN(X[[i]], ...) : NAs introduced by coercion
Your data appears to have two kinds of NA values in it. It has NA, or R's NA value, and it also has the string 'NA'. In my solution below, I replace both with zero, cast each column in the data frame to numeric, and then just sum together like-numbered VAR columns. Then, drop the original columns which you don't want anymore.
data <- as.data.frame(cbind(VAR.B_1,VAR.B_2,VAR.B_3, VAR.E_1,VAR.E_2, VAR.E_3),
stringsAsFactors=FALSE)
data[is.na(data)] <- 0
data[data == 'NA'] <- 0
data <- as.data.frame(lapply(data, as.numeric))
data$VAR_1 <- data$VAR.B_1 + data$VAR.E_1
data$VAR_2 <- data$VAR.B_2 + data$VAR.E_2
data$VAR_3 <- data$VAR.B_3 + data$VAR.E_3
data <- data[c("VAR_1", "VAR_2", "VAR_3")]
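The loop in the question can also be made to work more or less as written by building the column names as strings and indexing with [[ ]]. This is a sketch that assumes the columns are numeric with real NA values (not the string 'NA') and that the data frame is built with data.frame() rather than cbind(); b_col and e_col are names introduced here.

```r
data <- data.frame(
  VAR.B_1 = c(1, 2, 3, 4, 5, NA, NA, NA, NA, NA),
  VAR.B_2 = c(2, 2, 3, 4, 5, NA, NA, NA, NA, NA),
  VAR.B_3 = c(1, 1, 1, 1, 1, NA, NA, NA, NA, NA),
  VAR.E_1 = c(NA, NA, NA, NA, NA, 1, 1, 1, 1, 1),
  VAR.E_2 = c(NA, NA, NA, NA, NA, 1, 2, 3, 4, 5),
  VAR.E_3 = c(NA, NA, NA, NA, NA, 1, 1, 1, 1, 1),
  Condition = rep(c("B", "E"), each = 5)
)

for (i in 1:3) {
  b_col <- paste0("VAR.B_", i)  # e.g. "VAR.B_1"
  e_col <- paste0("VAR.E_", i)  # e.g. "VAR.E_1"
  # Pick the value from the column that belongs to the row's condition
  data[[paste0("VAR_", i)]] <- ifelse(
    data$Condition == "B", data[[b_col]],
    ifelse(data$Condition == "E", data[[e_col]], NA)
  )
}
data[c("VAR_1", "VAR_2", "VAR_3")]
```

Unlike the summing approaches, this selects by Condition, so it still gives the intended value if both the B and the E column happen to be filled in for a row.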