Fill empty rows with values from other rows - r

I have a dataset with a number of cases. Every case has two observations. The first observation for case number 1 has value 3 and the second observation has value 7. The two observations for case number 2 have missing values. I need to write code that fills the empty cells with the values from case number 1, so that the first row for case 2 has the same value as case 1 for obs = 1 and the second row has the same value as case 1 for obs = 2. Of course, this is a very short version of a much bigger dataset, so I need something flexible enough to accommodate a couple of hundred cases, where the values to use as fillers change for every subject.
Here is a toy data set:
# toy dataset
df <- data.frame(
case = c(1, 1, 2, 2),
obs = c(1, 2, NA, NA),
value = c(3, 7, NA, NA)
)
# case obs value
# 1 1 1 3
# 2 1 2 7
# 3 2 NA NA
# 4 2 NA NA
#Desired output:
case obs value
1 1 1 3
2 1 2 7
3 2 1 3
4 2 2 7

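We may use fill, grouping on the row sequence within each case (rowid(case), from data.table), so that the first rows of all cases share a group, the second rows share another, and so on: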
library(dplyr)
library(data.table)
library(tidyr)
df %>%
group_by(grp = rowid(case)) %>%
fill(obs, value) %>%
ungroup %>%
select(-grp)
Output:
# A tibble: 4 × 3
case obs value
<dbl> <dbl> <dbl>
1 1 1 3
2 1 2 7
3 2 1 3
4 2 2 7
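For readers who would rather avoid the extra packages, the same idea can be sketched in base R. This is only a sketch under assumptions, not part of the answer above: pos reproduces what rowid(case) returns, and fill_down is a hypothetical helper written for this illustration that carries the last non-missing value forward.

```r
# Toy data from the question
df <- data.frame(
  case  = c(1, 1, 2, 2),
  obs   = c(1, 2, NA, NA),
  value = c(3, 7, NA, NA)
)

# Position of each row within its case (the base R analogue of rowid(case))
pos <- ave(seq_along(df$case), df$case, FUN = seq_along)

# Hypothetical helper: carry the last non-missing value forward
fill_down <- function(x) {
  idx <- cummax(ifelse(is.na(x), 0L, seq_along(x)))
  idx[idx == 0L] <- NA  # leading NAs have nothing to copy, so they stay NA
  x[idx]
}

# Fill each column separately within groups of equal position
for (col in c("obs", "value")) {
  df[[col]] <- ave(df[[col]], pos, FUN = fill_down)
}
df
```

As with fill(), values are copied from the nearest earlier row that shares the same position within its case, so the rows need to be ordered with the complete cases first.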

Related

Use replicate to create new variable

I have the following code:
Ni <- 133 # number of individuals
MXmeas <- 10 # number of measurements
# simulate number of observations for each individual
Nmeas <- round(runif(Ni, 1, MXmeas))
# simulate observation moments (under the assumption that everybody has at least one observation)
obs <- unlist(sapply(Nmeas, function(x) c(1, sort(sample(2:MXmeas, x-1, replace = FALSE)))))
# set up dataframe (id, observations)
dat <- data.frame(ID = rep(1:Ni, times = Nmeas), observations = obs)
This results in the following output:
ID observations
1 1
1 3
1 4
1 5
1 6
1 8
However, I also want a variable 'times' that numbers the measurements within each individual. Since every ID has a different number of rows, I am not sure how to implement this. Does anybody know how to include that? I want it to look like this:
ID observations times
1 1 1
1 3 2
1 4 3
1 5 4
1 6 5
1 8 6
Using dplyr you could group by ID and use the row number for times:
library(dplyr)
dat |>
group_by(ID) |>
mutate(times = row_number()) |>
ungroup()
In base R we could build the sequence from the run lengths of the ID variable (this assumes the rows of each ID are contiguous, as they are here):
dat$times <- sequence(rle(dat$ID)$lengths)
Output:
# A tibble: 734 × 3
ID observations times
<int> <dbl> <int>
1 1 1 1
2 1 3 2
3 1 9 3
4 2 1 1
5 2 5 2
6 2 6 3
7 2 8 4
8 3 1 1
9 3 2 2
10 3 5 3
# … with 724 more rows
Data (using a seed):
set.seed(1)
Ni <- 133 # number of individuals
MXmeas <- 10 # number of measurements
# simulate number of observations for each individual
Nmeas <- round(runif(Ni, 1, MXmeas))
# simulate observation moments (under the assumption that everybody has at least one observation)
obs <- unlist(sapply(Nmeas, function(x) c(1, sort(sample(2:MXmeas, x-1, replace = FALSE)))))
# set up dataframe (id, observations)
dat <- data.frame(ID = rep(1:Ni, times = Nmeas), observations = obs)
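If the rows of an ID are not guaranteed to be contiguous, the rle() one-liner would restart the count at every run. A possible alternative, sketched here on a small made-up data frame (the toy values are assumptions, not the simulated data above), is ave(), which numbers rows within each ID regardless of where they sit:

```r
# Toy data in which the rows of each ID are interleaved
dat <- data.frame(
  ID           = c(1, 2, 1, 2, 2),
  observations = c(1, 1, 3, 4, 6)
)

# Number the rows within each ID, in order of appearance
dat$times <- ave(seq_along(dat$ID), dat$ID, FUN = seq_along)
dat
```

On contiguous data this gives the same result as sequence(rle(dat$ID)$lengths).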

Difference between rows in long format for R based on other column variables

I have an R dataframe such as:
df <- data.frame(ID = rep(c(1, 1, 2, 2), 2), Condition = rep(c("A", "B"),4),
Variable = c(rep("X", 4), rep("Y", 4)),
Value = c(3, 5, 6, 6, 3, 8, 3, 6))
ID Condition Variable Value
1 1 A X 3
2 1 B X 5
3 2 A X 6
4 2 B X 6
5 1 A Y 3
6 1 B Y 8
7 2 A Y 3
8 2 B Y 6
I want to obtain the difference between each value of Condition (A - B) for each Variable and ID while keeping the long format. That would mean the value must appear every two rows, like this:
ID Condition Variable Value diff_value
1 1 A X 3 -2
2 1 B X 5 -2
3 2 A X 6 0
4 2 B X 6 0
5 1 A Y 3 -5
6 1 B Y 8 -5
7 2 A Y 3 -3
8 2 B Y 6 -3
So far, I managed to do something relatively similar using the dplyr package, but it does not work if I want to maintain the long format:
df %>%
  group_by(Variable, ID) %>%
  mutate(diff_value = lag(Value, default = Value[1]) - Value)
# A tibble: 8 x 5
# Groups: Variable, ID [4]
ID Condition Variable Value diff_value
<dbl> <chr> <chr> <dbl> <dbl>
1 1 A X 3 0
2 1 B X 5 -2
3 2 A X 6 0
4 2 B X 6 0
5 1 A Y 3 0
6 1 B Y 8 -5
7 2 A Y 3 0
8 2 B Y 6 -3
You don't need lag; use diff instead (this relies on each group having exactly two rows, ordered A then B):
df %>%
group_by(Variable,ID) %>%
mutate(diff = -diff(Value))
Output:
# A tibble: 8 x 5
# Groups: Variable, ID [4]
ID Condition Variable Value diff
<dbl> <chr> <chr> <dbl> <dbl>
1 1 A X 3 -2
2 1 B X 5 -2
3 2 A X 6 0
4 2 B X 6 0
5 1 A Y 3 -5
6 1 B Y 8 -5
7 2 A Y 3 -3
8 2 B Y 6 -3
You don't need to create a lag variable; just use Value[Condition == "A"] - Value[Condition == "B"], as below:
df %>%
  group_by(ID, Variable) %>%
  mutate(diff_value = Value[Condition == "A"] - Value[Condition == "B"])
# A tibble: 8 x 5
# Groups: ID, Variable [4]
ID Condition Variable Value diff_value
<dbl> <chr> <chr> <dbl> <dbl>
1 1 A X 3 -2
2 1 B X 5 -2
3 2 A X 6 0
4 2 B X 6 0
5 1 A Y 3 -5
6 1 B Y 8 -5
7 2 A Y 3 -3
8 2 B Y 6 -3
This should work:
# Step one: create a new column of df, where we store the "Value" we need
# to add/subtract, as you required (same "ID", same "Variable", different
# "Condition").
temp.fun = function(x, dta)
{
# Given a row x of dta, this function selects the value corresponding to the row
# with same "ID", same "Variable" and different "Condition".
# Notice that if "Condition" is not binary, we need to generalize this function.
# Notice also that this function is super specific to your case, and that it has
# been thought to be used within apply().
# INPUTS:
# - x, a row of a data frame.
# - dta, the data frame (df, in your case).
# OUTPUT:
# - temp.corresponding, "Value" you want for each row.
# Saving information.
temp.id = as.numeric(x["ID"])
temp.condition = as.character(x["Condition"])
temp.variable = as.character(x["Variable"])
# Index for selecting row.
temp.row = dta$ID == temp.id & dta$Condition != temp.condition & dta$Variable == temp.variable
# Selecting "Value".
temp.corresponding = dta$Value[temp.row]
return(temp.corresponding)
}
df$corr_value = apply(df, MARGIN = 1, FUN = temp.fun, dta = df)
# Step two: add/subtract to create the column "diff_value".
# Key: the operand order depends on "Condition" so that every row gets A - B.
df$diff_value = NA
df$diff_value[df$Condition == "A"] = df$Value[df$Condition == "A"] - df$corr_value[df$Condition == "A"]
df$diff_value[df$Condition == "B"] = df$corr_value[df$Condition == "B"] - df$Value[df$Condition == "B"]
Note that this solution is tailored to the specifics of your problem and may be neither elegant nor efficient.
The comments in the code explain how it works. The idea is to first write temp.fun(), which operates on a single row: for each row it receives, it finds df$Value in the row that meets your criteria (same ID, same Variable, different Condition). We then use apply() to pass every row to temp.fun(), which creates a new column in df holding that value.
We are now ready to compute df$diff_value. First, we initialize the column with NA. Then we perform the subtraction. Be careful: both cases subtract, but the order of the operands depends on Condition, so that every row ends up with A - B. When Condition equals A we compute df$Value - df$corr_value, whereas when Condition equals B we compute df$corr_value - df$Value.
Final warning: if Condition is not binary, this solution must be generalized in order to work.
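The two-step apply() approach can also be condensed in base R. The following is a minimal sketch, assuming exactly one A row and one B row per ID/Variable pair: giving B values a negative sign and summing within each group yields A - B on every row, whatever the row order.

```r
df <- data.frame(
  ID        = rep(c(1, 1, 2, 2), 2),
  Condition = rep(c("A", "B"), 4),
  Variable  = c(rep("X", 4), rep("Y", 4)),
  Value     = c(3, 5, 6, 6, 3, 8, 3, 6)
)

# Signed value: +Value for condition A, -Value for condition B
signed <- ifelse(df$Condition == "A", df$Value, -df$Value)

# Summing the signed values within each ID/Variable group gives A - B,
# and ave() repeats that result on every row of the group
df$diff_value <- ave(signed, df$ID, df$Variable, FUN = sum)
df
```

If a group had several A or B rows, the sum would silently combine them, so the one-row-per-condition assumption is worth checking first.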

Coding dichotomous variable based on changes in the relative highest score between a set of variables

I want to code a new variable in a dataframe based on a set of rules. I have a dataframe df1 with a subject variable, a time variable, and variables A, B and C, like this:
subject <- c(1,1,1,1,1,1,2,2,2,2,2,2)
time <- c(1,2,3,4,5,6,1,2,3,4,5,6)
A <- c(1,7,7,6,6,5,1,2,3,NA,NA,NA)
B <- c(2,1,1,1,1,1,6,5,4,NA,NA,NA)
C <-c(7,1,6,1,6,1,6,2,4,NA,NA,NA)
df1 <- data.frame(subject,time,A,B,C)
Values in A, B, and C range from 1 (lowest) to 7 (highest), there are also some NA. Now I want to code a new dichotomous variable, newvar. The first row for every subject should always be coded 0. 1 should be coded whenever the variable/s with the highest score (A,B or C) within one row change/s to one or more different variable/s in the next row. It doesn't matter if the value changes from one row to the next within one variable, only if there is a change in which of the three variables has the highest score within one row compared to the previous row.
The examples from df1 should make this clearer:
Row 1 is coded 0 because it is the first row for subject 1. C has the
highest score among the three variables A, B, and C.
In row 2, A has the highest score. Therefore, newvar = 1.
In row 3, A still has the highest score, therefore, newvar = 0.
In row 4, A still has the highest score --> newvar = 0.
In row 5, now A and C both have the highest score, therefore,
newvar = 1.
In row 6, only A has the highest score again, therefore, newvar = 1.
Row 7 is the first row for subject 2, therefore newvar is coded 0.
In row 8, newvar should be coded 1, because in the previous row, B
and C equally had the highest score, now it is only B.
In row 9, newvar should again be coded 1, because now B and C have
the highest score again within the row.
Rows 10 to 12 should be coded NA.
This is what it should look like:
newvar <-c(0,1,0,0,1,1,0,1,1,NA,NA,NA)
df2 <- data.frame(subject,time,A,B,C,newvar)
I would greatly appreciate any input in how to go about this!
Here is one approach using tidyverse. First, pivot your data into long form. Then, for each subject time combination, collect the column names that are equal to the highest value for that combination. This is stored as highest_values.
Then, change the group to subject. For each subject check if the time is the lowest value of time - if it is, code as 0 (there are alternative options if you just want to code 0 for the first row independent of time value). If the highest_values does not have any column names, code as NA. If there is a difference between highest_values and the previous row (a change), code as 1. Otherwise, it assumes highest_values has not changed, and code as 0.
library(tidyverse)
df1 %>%
pivot_longer(cols = -c(subject, time)) %>%
group_by(subject, time) %>%
summarise(highest_values = toString(name[which(value == max(value))])) %>%
group_by(subject) %>%
mutate(newvar = case_when(
time == min(time) ~ 0,
highest_values == "" ~ NA_real_,
highest_values != lag(highest_values) ~ 1,
TRUE ~ 0
)) %>%
right_join(df1)
Output
subject time highest_values newvar A B C
<dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 1 "C" 0 1 2 7
2 1 2 "A" 1 7 1 1
3 1 3 "A" 0 7 1 6
4 1 4 "A" 0 6 1 1
5 1 5 "A, C" 1 6 1 6
6 1 6 "A" 1 5 1 1
7 2 1 "B, C" 0 1 6 6
8 2 2 "B" 1 2 5 2
9 2 3 "B, C" 1 3 4 4
10 2 4 "" NA NA NA NA
11 2 5 "" NA NA NA NA
12 2 6 "" NA NA NA NA
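The same rules can be sketched in base R without reshaping. This is only an illustration of this first version (rows with all-missing scores are coded NA), and the helper names highest, prev and first are introduced here for the sketch:

```r
subject <- c(1,1,1,1,1,1,2,2,2,2,2,2)
time <- c(1,2,3,4,5,6,1,2,3,4,5,6)
A <- c(1,7,7,6,6,5,1,2,3,NA,NA,NA)
B <- c(2,1,1,1,1,1,6,5,4,NA,NA,NA)
C <- c(7,1,6,1,6,1,6,2,4,NA,NA,NA)
df1 <- data.frame(subject, time, A, B, C)

vars <- c("A", "B", "C")

# Label of the column(s) holding the row maximum; NA when the row is all missing
highest <- apply(df1[vars], 1, function(r) {
  if (all(is.na(r))) NA_character_
  else toString(vars[which(r == max(r, na.rm = TRUE))])
})

# Label of the previous row within the same subject
prev <- ave(highest, df1$subject, FUN = function(h) c(NA, head(h, -1)))

# First row of each subject is always 0; otherwise 1 when the label changed
first <- !duplicated(df1$subject)
df1$newvar <- ifelse(first, 0,
                     ifelse(is.na(highest), NA, as.numeric(highest != prev)))
df1
```

Because the comparison is with the immediately preceding row, a complete row that follows an all-missing row would come out as NA here.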
Edit (2/11/21): Based on the comment below, sometimes there are rows with missing data. In these cases, newvar should reflect the last or most recent "highest_values" excluding those rows.
To do this, filter out the rows without a "highest_values" value before the second group_by. The most recent "highest_values" will then be the last value that is not missing. Also, you won't need to set newvar to NA yourself; this happens automatically with the right_join.
Here is the revised code:
df1 %>%
pivot_longer(cols = -c(subject, time)) %>%
group_by(subject, time) %>%
summarise(highest_values = toString(name[which(value == max(value))])) %>%
filter(highest_values != "") %>%
group_by(subject) %>%
mutate(newvar = case_when(
time == min(time) ~ 0,
highest_values != lag(highest_values) ~ 1,
TRUE ~ 0
)) %>%
right_join(df1) %>%
arrange(subject, time)
I added a row of data to demonstrate with an example.
Output
subject time highest_values newvar A B C
<dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 1 C 0 1 2 7
2 1 2 A 1 7 1 1
3 1 3 A 0 7 1 6
4 1 4 A 0 6 1 1
5 1 5 A, C 1 6 1 6
6 1 6 A 1 5 1 1
7 2 1 B, C 0 1 6 6
8 2 2 B 1 2 5 2
9 2 3 B, C 1 3 4 4
10 2 4 NA NA NA NA NA
11 2 5 NA NA NA NA NA
12 2 6 B, C 0 2 3 3

Identify groups separated by NA

I want to add a grouping column to my data. My data is a text column in which NA separates the groups. Here is an example; group is the result I would like to achieve. I don't know how many rows each group will contain, but there is always an NA separating groups (except after the last group). How can I create the group column?
library(tidyverse)
data <- tibble(raw = c("This", "Is", "First", NA, "This", "Is", "Second", NA, "And", "Third"),
group = c(1,1,1,1,2,2,2,2,3,3))
Take the cumulative sum of the NAs and add one if the current value is not NA.
data %>% mutate(group = cumsum(is.na(raw)) + !is.na(raw))
One option is to create a logical vector based on the NA value and use cumsum
library(dplyr)
data %>%
mutate(groupNew = cumsum(lag(is.na(raw), default = TRUE)) )
# A tibble: 10 x 3
# raw group groupNew
# <chr> <dbl> <int>
# 1 This 1 1
# 2 Is 1 1
# 3 First 1 1
# 4 <NA> 1 1
# 5 This 2 2
# 6 Is 2 2
# 7 Second 2 2
# 8 <NA> 2 2
# 9 And 3 3
#10 Third 3 3
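The lag-and-cumsum idea does not depend on dplyr. A minimal base R sketch of the same logic, where starts is a name introduced only for this illustration: a row starts a new group when it is the first row or when the row before it is NA.

```r
raw <- c("This", "Is", "First", NA, "This", "Is", "Second", NA, "And", "Third")

# TRUE for the first row and for every row that follows an NA
starts <- c(TRUE, head(is.na(raw), -1))

# Each TRUE bumps the running group counter by one
group <- cumsum(starts)
group

# The groups can then be collected, dropping the NA separators
split(raw[!is.na(raw)], group[!is.na(raw)])
```

The NA separators are assigned to the group they close, matching the desired group column in the question.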

For each group find observations with max value of several columns

Assume I have a data frame like so:
set.seed(4)
df<-data.frame(
group = rep(1:10, each=3),
id = rep(sample(1:3), 10),
x = sample(c(rep(0, 15), runif(15))),
y = sample(c(rep(0, 15), runif(15))),
z = sample(c(rep(0, 15), runif(15)))
)
As seen above, some elements of x, y, z vectors take value of zero, the rest being drawn from the uniform distribution between 0 and 1.
For each group, determined by the first column, I want to find three IDs from the second column, one each for the rows holding the highest value of x, y and z in the group. Assume there are no ties, except when a variable is 0 in all observations of a given group; in that case I don't want any id returned as the row with the maximum value.
The output would look like so:
group x y z
1 2 2 1
2 2 3 1
... .........
My first thought is to select rows with maximum values separately for each variable and then use merge to put it in one table. However, I'm wondering if it can be done without merge, for example with standard dplyr functions.
Here is my proposed solution using plyr:
library(plyr)
# NA if a column is all zero within the group, else the id at its maximum
ddply(df, .variables = "group",
      .fun = function(t) apply(X = t[, c(-1, -2)], MARGIN = 2,
        function(z) ifelse(sum(abs(z)) == 0, yes = NA, no = t$id[which.max(z)])))
# group x y z
#1 1 2 2 1
#2 2 2 3 1
#3 3 1 3 2
#4 4 3 3 1
#5 5 2 3 NA
#6 6 3 1 3
#7 7 1 1 2
#8 8 NA 2 3
#9 9 2 1 3
#10 10 2 NA 2
Here is a solution using dplyr and tidyr. Notice that if all values are the same, we cannot decide which id should be selected, so filter(n_distinct(Value) > 1) is added to remove those records. In the final output df2, NA marks the cases where all values are the same. We can decide whether to impute those NA later if we want. This solution should work for any number of ids or columns (x, y, z, ...).
library(dplyr)
library(tidyr)
df2 <- df %>%
gather(Column, Value, -group, -id) %>%
arrange(group, Column, desc(Value)) %>%
group_by(group, Column) %>%
# If all values from a group-Column are all the same, remove that group-Column
filter(n_distinct(Value) > 1) %>%
slice(1) %>%
select(-Value) %>%
spread(Column, id)
If you want to stick with just dplyr, you can use the multiple-column summarize/mutate functions. This should work regardless of the form of id; my initial attempt was slightly cleaner but assumed that an id of zero was invalid.
df %>%
group_by(group) %>%
mutate_at(vars(-id),
# If the row is the max within the group, set the value
# to the id and use NA otherwise
funs(ifelse(max(.) != 0 & . == max(.),
id,
NA))) %>%
select(-id) %>%
summarize_all(funs(
# There are zero or one non-NA values per group, so handle both cases
if(any(!is.na(.)))
na.omit(.) else NA))
## # A tibble: 10 x 4
## group x y z
## <int> <int> <int> <int>
## 1 1 2 2 1
## 2 2 2 3 1
## 3 3 1 3 2
## 4 4 3 3 1
## 5 5 2 3 NA
## 6 6 3 1 3
## 7 7 1 1 2
## 8 8 NA 2 3
## 9 9 2 1 3
## 10 10 2 NA 2
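funs() has been deprecated since dplyr 0.8.0, and the scoped verbs (mutate_at, summarize_all) were superseded by across() in dplyr 1.0.0. With a recent dplyr, the same logic might be written as below. This is a sketch on a small made-up data frame (toy and its values are assumptions, chosen so the result can be checked by eye), not a rerun of the seeded example:

```r
library(dplyr)

# Toy data: two groups of three rows, ids deliberately not in order
toy <- data.frame(
  group = rep(1:2, each = 3),
  id    = rep(c(3L, 1L, 2L), 2),
  x     = c(0, 0.4, 0.9, 0, 0, 0),
  y     = c(0.2, 0, 0, 0, 0.7, 0.1)
)

res <- toy %>%
  group_by(group) %>%
  # For each of x and y: NA if the column is all zero within the group,
  # otherwise the id of the row holding the maximum
  summarise(across(c(x, y),
                   ~ if (all(.x == 0)) NA_integer_ else id[which.max(.x)]))
res
```

Here group 1 returns id 2 for x and id 3 for y, while group 2 returns NA for x (all zeros) and id 1 for y. which.max() picks the first maximum, which fits the question's assumption that there are no ties.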
