Use a dynamically created variable to select column in mutate - r

I am trying to use the value of vector_of_names[position] in the code below to dynamically select a column from data and use it as the value of "age" in mutate.
vector_of_names <- c("one","two","three")
id <- c(1,2,3,4,5,6)
position <- c(1,1,2,2,1,1)
one <- c(32,34,56,77,87,98)
two <- c(45,67,87,NA,33,56)
three <- c(NA,NA,NA,NA,NA,60)
data <- data.frame(id,position,one,two,three)
library(dplyr)
attempt <- data %>%
mutate(age=vector_of_names[position])
I see a similar question here, but the various answers fail because I am using a variable within the data ("position") to select the column from the vector of names. It is never recognised, as I suspect it is being looked up outside of the data.
I am taking this approach because the number of columns "one", "two" and "three" is not known beforehand, but the vector of their names is, so they need to be selected dynamically.

You could do:
data %>%
rowwise() %>%
mutate(age = c_across(all_of(vector_of_names))[position])
id position one two three age
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 32 45 NA 32
2 2 1 34 67 NA 34
3 3 2 56 87 NA 87
4 4 2 77 NA NA NA
5 5 1 87 33 NA 87
6 6 1 98 56 60 98
If you want to be more explicit about what values should be returned:
named_vector_of_names <- setNames(seq_along(vector_of_names), vector_of_names)
data %>%
rowwise() %>%
mutate(age = get(names(named_vector_of_names)[match(position, named_vector_of_names)]))

Base R vectorized option using matrix subsetting.
data$age <- data[vector_of_names][cbind(1:nrow(data), data$position)]
data
# id position one two three age
#1 1 1 32 45 NA 32
#2 2 1 34 67 NA 34
#3 3 2 56 87 NA 87
#4 4 2 77 NA NA NA
#5 5 1 87 33 NA 87
#6 6 1 98 56 60 98
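To see why the original attempt returns column names rather than values, and how matrix subsetting bridges the gap, here is a minimal base R sketch. The intermediate names col_names, col_idx and age are only illustrative, and it assumes every column of data is numeric so that as.matrix() keeps the values as numbers.

```r
vector_of_names <- c("one", "two", "three")
data <- data.frame(
  id       = c(1, 2, 3, 4, 5, 6),
  position = c(1, 1, 2, 2, 1, 1),
  one      = c(32, 34, 56, 77, 87, 98),
  two      = c(45, 67, 87, NA, 33, 56),
  three    = c(NA, NA, NA, NA, NA, 60)
)

# Indexing the name vector only yields the column *names*, one per row
col_names <- vector_of_names[data$position]

# Translate each name into its column position within data
col_idx <- match(col_names, names(data))

# A two-column (row, column) index matrix picks one cell per row
age <- as.matrix(data)[cbind(seq_len(nrow(data)), col_idx)]
age
```

The lookup is vectorised, so no rowwise() step is needed.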

Related

Create "row" from first non-NA value in an R data frame

I want to create a "row" containing the first non-NA value that appears in a data frame. So for example, given this test data frame:
test.df <- data.frame(a=c(11,12,13,14,15,16),b=c(NA,NA,23,24,25,26), c=c(31,32,33,34,35,36), d=c(NA,NA,NA,NA,45,46))
test.df
a b c d
1 11 NA 31 NA
2 12 NA 32 NA
3 13 23 33 NA
4 14 24 34 NA
5 15 25 35 45
6 16 26 36 46
I know that I can detect the first appearance of a non-NA like this:
first.appearance <- as.numeric(sapply(test.df, function(col) min(which(!is.na(col)))))
first.appearance
[1] 1 3 1 5
This tells me that the first element in column 1 is not NA, the third element in column 2 is not NA, the first element in column 3 is not NA, and the fifth element in column 4 is not NA. But when I put the pieces together, it yields this (which is logical, but not what I want):
> test.df[first.appearance,]
a b c d
1 11 NA 31 NA
3 13 23 33 NA
1.1 11 NA 31 NA
5 15 25 35 45
I would like the output to be the first non-NA in each column. What is a base or dplyr way to do this? I am not seeing it. Thanks in advance.
a b c d
1 11 23 31 45
We can use
library(dplyr)
test.df %>%
slice(first.appearance) %>%
summarise_all(~ first(.[!is.na(.)]))
# a b c d
#1 11 23 31 45
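Or it can be (note that this returns each column's minimum, which equals the first non-NA value here only because every column is increasing)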
test.df %>%
summarise_all(~ min(na.omit(.)))
# a b c d
#1 11 23 31 45
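Or with colMins from matrixStats (this is the column minimum, which coincides with the first non-NA value only for increasing columns)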
library(matrixStats)
colMins(as.matrix(test.df), na.rm = TRUE)
#[1] 11 23 31 45
You can use :
library(tidyverse)
test.df %>% fill(everything(), .direction = "up") %>% head(1)
a b c d
<dbl> <dbl> <dbl> <dbl>
1 11 23 31 45
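For a base R route that does not depend on the columns being sorted, one possible sketch is to drop the NAs from each column and take the first remaining element. The helper name first_non_na is made up for this example; a column that is entirely NA would yield NA.

```r
test.df <- data.frame(a = c(11, 12, 13, 14, 15, 16),
                      b = c(NA, NA, 23, 24, 25, 26),
                      c = c(31, 32, 33, 34, 35, 36),
                      d = c(NA, NA, NA, NA, 45, 46))

# Hypothetical helper: first element left after removing NAs
first_non_na <- function(col) col[!is.na(col)][1]

# Apply it to every column and collect the results as a one-row data frame
first_row <- as.data.frame(lapply(test.df, first_non_na))
first_row
```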

How can I label rows of one dataframe according to a range specified in 2 columns (start and end) of another dataframe?

Apologies if this has been asked before - I tried to search but I might not know the right terms to search for.
I have data in the following format:
In one data frame (utterances) I have the start and end frames of utterances in my data set:
id <- c(1,1,1,2,2,2,2)
utterance_number <- c(1,2,3,1,2,3,4)
start_frame <- c(20,35,67,10,44,56,72)
end_frame <- c(29,44,72,15,52,69,82)
utterances <- cbind(id, utterance_number, start_frame, end_frame)
utterances
In another data frame I have all of the frames:
id <- c(rep(1,80), rep(2,90))
frame <- c(seq(1:80), seq(1:90))
val1 <- sample(170)
val2 <- sample(170)
values <- cbind(id, frame, val1, val2)
values
I want to label each frame in values with its utterance_number, or with NA if it is not part of an utterance. So in a new column "Utterance_number" in values, the first 19 frames would be NA, frames 20-29 would be labelled "1" and so on.
What is the best way of doing this?
You can use merge and expand utterances using apply.
merge(values,
      do.call(rbind, apply(utterances, 1, function(x)
        cbind(id = x[1], frame = x[3]:x[4], utterance_number = x[2]))),
      all.x = TRUE)
# id frame val1 val2 utterance_number
#1 1 1 166 138 NA
#2 1 2 54 109 NA
#3 1 3 71 103 NA
#4 1 4 9 48 NA
#...
#17 1 17 32 22 NA
#18 1 18 170 100 NA
#19 1 19 57 112 NA
#20 1 20 45 110 1
#21 1 21 25 148 1
#22 1 22 13 25 1
#...
#28 1 28 56 62 1
#29 1 29 130 47 1
#30 1 30 163 15 NA
#31 1 31 110 64 NA
#...
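As an alternative to expanding and merging, the labelling can be sketched as a plain loop over the utterances. This assumes both objects are built as data frames (the question builds them with cbind, which gives matrices) and that ranges within an id do not overlap. The val1 and val2 columns are left out because they play no part in the labelling.

```r
utterances <- data.frame(
  id               = c(1, 1, 1, 2, 2, 2, 2),
  utterance_number = c(1, 2, 3, 1, 2, 3, 4),
  start_frame      = c(20, 35, 67, 10, 44, 56, 72),
  end_frame        = c(29, 44, 72, 15, 52, 69, 82)
)
values <- data.frame(id = c(rep(1, 80), rep(2, 90)),
                     frame = c(1:80, 1:90))

# Start with NA everywhere, then fill in one utterance at a time
values$utterance_number <- NA_real_
for (i in seq_len(nrow(utterances))) {
  u <- utterances[i, ]
  hit <- values$id == u$id &
    values$frame >= u$start_frame &
    values$frame <= u$end_frame
  values$utterance_number[hit] <- u$utterance_number
}

values[18:31, ]
```

The loop runs once per utterance rather than once per frame, so it stays cheap when there are far more frames than utterances.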

Returning value from specific column in data.frame

I have a data.frame of 14 columns made up of test scores at 13 time periods, all numeric. The last column, say X, denotes the specific time point that each student (rows) received a failing grade. I would like to create a separate column that has each student's failing test score from their specific failing time point.
dataframe<-data.frame(TestA=c(58,92,65,44,88),
TestB=c(17,22,58,46,98),
TestC=c(88,98,2,45,80), TestD=c(33,25,65,66,5),
TestE=c(98,100,100,100,100), X=c(2,2,3,NA,4))
Above is a condensed version with mock data. The first student failed at time point two, etc., but the fourth student never failed. The resulting column should be 17, 22, 2, NA, 5. How can I accomplish this?
You can try
dataframe[cbind(1:nrow(dataframe), dataframe$X)]
#[1] 17 22 2 NA 5
From ?`[`
A third form of indexing is via a numeric matrix with the one column for each dimension: each row of the index matrix then selects a single element of the array, and the result is a vector. Negative indices are not allowed in the index matrix. NA and zero values are allowed: rows of an index matrix containing a zero are ignored, whereas rows containing an NA produce an NA in the result.
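The NA and zero rules quoted above can be checked on a tiny matrix. This is only an illustrative sketch; the names m, idx_na and idx_zero are made up for the example.

```r
# 3 x 2 matrix: column 1 holds 1, 2, 3 and column 2 holds 4, 5, 6
m <- matrix(1:6, nrow = 3)

# Each row of the index matrix is a (row, column) pair
idx_na   <- cbind(c(1, 2, 3), c(2, NA, 1))  # second pair has an NA column
idx_zero <- cbind(c(1, 2, 3), c(2, 0, 1))   # second pair has a zero column

with_na   <- m[idx_na]    # an NA pair yields NA in the result
with_zero <- m[idx_zero]  # a zero pair is dropped, so the result is shorter

with_na
with_zero
```

This is why the student with X = NA ends up with NA instead of an error.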
Two alternative solutions.
One using map function from purrr package
library(tidyverse)
dataframe %>%
group_by(student_id = row_number()) %>%
nest() %>%
mutate(fail_score = map(data, ~c(.$TestA, .$TestB, .$TestC, .$TestD, .$TestE)[.$X])) %>%
unnest()
# # A tibble: 5 x 8
# student_id fail_score TestA TestB TestC TestD TestE X
# <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 17 58 17 88 33 98 2
# 2 2 22 92 22 98 25 100 2
# 3 3 2 65 58 2 65 100 3
# 4 4 NA 44 46 45 66 100 NA
# 5 5 5 88 98 80 5 100 4
And the other one uses rowwise
dataframe %>%
rowwise() %>%
mutate(fail_score = c(TestA, TestB, TestC, TestD, TestE)[X]) %>%
ungroup()
# # A tibble: 5 x 7
# TestA TestB TestC TestD TestE X fail_score
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 58 17 88 33 98 2 17
# 2 92 22 98 25 100 2 22
# 3 65 58 2 65 100 3 2
# 4 44 46 45 66 100 NA NA
# 5 88 98 80 5 100 4 5
I'm posting both because I have a feeling that the map approach would be faster if you have many students (i.e. rows) and tests (i.e. columns).

Merging content of repeated variables in a dataframe in R [duplicate]

This question already has answers here:
How to implement coalesce efficiently in R
I merged various dataframes in R, which had variables with the same name. In the merged file I got variables names as varA, varA.x, varA.x1, varA.x.y, etc. I want to create a file merging the content of all these variables in a single column. As an example of my file:
ID weight age varA varA.x varA.x.y varA.x.y.1
1 50 30 2 NA NA NA
2 78 34 NA 3 NA NA
3 56 56 NA NA NA 6
4 56 67 NA NA 7 NA
I want a file that looks like:
ID weight age varA
1 50 30 2
2 78 34 3
3 56 56 6
4 56 67 7
It is not feasible to use ifelse, e.g. data$varA = ifelse(is.na(varA.x), varA.y, varA.x), because the statement would be too long as I have so many repeated variables.
Can you help me, please? Thank you so much.
We can use coalesce from dplyr
library(tidyverse)
df1 %>%
mutate(varA = coalesce(varA, varA.x, varA.x.y, varA.x.y.1)) %>%
select(1:4)
# ID weight age varA
#1 1 50 30 2
#2 2 78 34 3
#3 3 56 56 6
#4 4 56 67 7
Or use pmax from base R
cbind(df1[1:3], varA=do.call(pmax, c(df1[grep("varA", names(df1))], na.rm = TRUE)))
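When there are too many repeated variables to list by hand, a base R coalesce can be sketched with Reduce over every column whose name starts with "varA". The data frame below recreates the example from the question. The "^varA" pattern and the names dup_cols, merged and result are assumptions for illustration.

```r
df1 <- data.frame(ID         = 1:4,
                  weight     = c(50, 78, 56, 56),
                  age        = c(30, 34, 56, 67),
                  varA       = c(2, NA, NA, NA),
                  varA.x     = c(NA, 3, NA, NA),
                  varA.x.y   = c(NA, NA, NA, 7),
                  varA.x.y.1 = c(NA, NA, 6, NA))

# All repeated variables, found by name rather than typed out
dup_cols <- grep("^varA", names(df1), value = TRUE)

# Fold the columns left to right, keeping the first non-NA value per row
merged <- Reduce(function(a, b) ifelse(is.na(a), b, a), df1[dup_cols])

result <- cbind(df1[c("ID", "weight", "age")], varA = merged)
result
```

Unlike pmax, this keeps the leftmost non-NA value if a row has more than one.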

Conditional filtering of data.frame with preceding and trailing NA observations

I have a data.frame composed of observations and modelled predictions of data. A minimal example dataset could look like this:
myData <- data.frame(tree=c(rep("A", 20)), doy=c(seq(75, 94)), count=c(NA,NA,NA,NA,0,NA,NA,NA,NA,1,NA,NA,NA,NA,2,NA,NA,NA,NA,NA), pred=c(0,0,0,0,1,1,1,2,2,2,2,3,3,3,3,6,9,12,20,44))
The count column represents when observations were made and predictions are modelled over a complete set of days, in effect interpolating the data to a day level (from every 5 days).
I would like to conditionally filter this dataset so that I end up truncating the predictions to the same range as the observations, in effect keeping all predictions between when count starts and ends (i.e. removing preceding and trailing rows/values of pred when they correspond to an NA in the count column). For this example, the ideal outcome would be:
tree doy count pred
5 A 79 0 1
6 A 80 NA 1
7 A 81 NA 1
8 A 82 NA 2
9 A 83 NA 2
10 A 84 1 2
11 A 85 NA 2
12 A 86 NA 3
13 A 87 NA 3
14 A 88 NA 3
15 A 89 2 3
I have tried to solve this problem by combining filter with first and last. I have also considered using a conditional mutate (probably with lag) to create a 1/0 column that flags whether there is an observation in the previous doy and then filtering on it, or creating a second data.frame that contains the proper doy range and can be joined to this data.
In my searches on StackOverflow I have come across the following questions that seemed close, but were not quite what I needed:
Select first observed data and utilize mutate
Conditional filtering based on the level of a factor R
My actual dataset is much larger with multiple trees over multiple years (with each tree/year having different period of observation depending on elevation of the sites, etc.). I am currently implementing the dplyr package across my code, so an answer within that framework would be great but would be happy with any solutions at all.
I think you're just looking to limit the rows to fall between the first and last non-NA count value:
myData[seq(min(which(!is.na(myData$count))), max(which(!is.na(myData$count)))),]
# tree doy count pred
# 5 A 79 0 1
# 6 A 80 NA 1
# 7 A 81 NA 1
# 8 A 82 NA 2
# 9 A 83 NA 2
# 10 A 84 1 2
# 11 A 85 NA 2
# 12 A 86 NA 3
# 13 A 87 NA 3
# 14 A 88 NA 3
# 15 A 89 2 3
In dplyr syntax, grouping by the tree variable:
library(dplyr)
myData %>%
group_by(tree) %>%
filter(seq_along(count) >= min(which(!is.na(count))) &
seq_along(count) <= max(which(!is.na(count))))
# Source: local data frame [11 x 4]
# Groups: tree
#
# tree doy count pred
# 1 A 79 0 1
# 2 A 80 NA 1
# 3 A 81 NA 1
# 4 A 82 NA 2
# 5 A 83 NA 2
# 6 A 84 1 2
# 7 A 85 NA 2
# 8 A 86 NA 3
# 9 A 87 NA 3
# 10 A 88 NA 3
# 11 A 89 2 3
Try
indx <- which(!is.na(myData$count))
myData[seq(indx[1], indx[length(indx)]),]
# tree doy count pred
#5 A 79 0 1
#6 A 80 NA 1
#7 A 81 NA 1
#8 A 82 NA 2
#9 A 83 NA 2
#10 A 84 1 2
#11 A 85 NA 2
#12 A 86 NA 3
#13 A 87 NA 3
#14 A 88 NA 3
#15 A 89 2 3
If this is based on groups
ind <- with(myData, ave(!is.na(count), tree,
FUN=function(x) cumsum(x)>0 & rev(cumsum(rev(x))>0)))
myData[ind,]
# tree doy count pred
#5 A 79 0 1
#6 A 80 NA 1
#7 A 81 NA 1
#8 A 82 NA 2
#9 A 83 NA 2
#10 A 84 1 2
#11 A 85 NA 2
#12 A 86 NA 3
#13 A 87 NA 3
#14 A 88 NA 3
#15 A 89 2 3
Or using na.trim from zoo
library(zoo)
do.call(rbind,by(myData, myData$tree, FUN=na.trim))
Or using data.table
library(data.table)
setDT(myData)[, .SD[do.call(`:`, as.list(range(which(!is.na(count)))))], tree]
# tree doy count pred
#1: A 79 0 1
#2: A 80 NA 1
#3: A 81 NA 1
#4: A 82 NA 2
#5: A 83 NA 2
#6: A 84 1 2
#7: A 85 NA 2
#8: A 86 NA 3
#9: A 87 NA 3
#10: A 88 NA 3
#11: A 89 2 3
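To check the per-group behaviour with more than one tree, here is a minimal base R sketch using split and a small helper. The helper name trim_na_edges and the two-tree data are invented for illustration and are not from the question.

```r
# Hypothetical helper: keep rows between the first and last non-NA of `col`
trim_na_edges <- function(d, col) {
  keep <- which(!is.na(d[[col]]))
  if (length(keep) == 0) return(d[0, , drop = FALSE])  # nothing observed
  d[seq(min(keep), max(keep)), , drop = FALSE]
}

# Made-up data with two trees observed over different ranges
myData <- data.frame(tree  = rep(c("A", "B"), each = 6),
                     doy   = rep(75:80, 2),
                     count = c(NA, 0, NA, 1, NA, NA,
                               NA, NA, 2, NA, NA, 3))

# Trim each tree separately, then stack the pieces back together
trimmed <- do.call(rbind,
                   lapply(split(myData, myData$tree), trim_na_edges,
                          col = "count"))
trimmed
```

Each tree is trimmed within its own subset, so the row positions are always relative to that group.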
