dplyr mutate column with nearest value in external list - r

I'm trying to mutate a column and populate it with exact matches from a list if those occur, and if not, the closest match possible.
My data frame looks like this:
index <- seq(1, 10, 1)
blockID <- c(100, 120, 132, 133, 201, 207, 210, 238, 240, 256)
df <- as.data.frame(cbind(index, blockID))
index blockID
1 1 100
2 2 120
3 3 132
4 4 133
5 5 201
6 6 207
7 7 210
8 8 238
9 9 240
10 10 256
I want to mutate a new column that checks whether blockID is in a list. If yes, it should just keep the value of blockID. If not, it should return the nearest value in blocklist:
blocklist <- c(100, 120, 130, 150, 201, 205, 210, 238, 240, 256)
so the additional column should contain
100 (match),
120 (match),
130 (no match for 132--nearest value is 130),
130 (no match for 133--nearest value is 130),
201,
205 (no match for 207--nearest value is 205),
210,
238,
240,
256
Here's what I've tried:
df2 <- df %>% mutate(blockmatch = ifelse(blockID %in% blocklist, blockID, ifelse(match.closest(blockID, blocklist, tolerance = Inf), "missing")))
I just put in "missing" to complete the ifelse() statements, but it shouldn't actually be returned anywhere since the preceding cases will be fulfilled for every value of blockID. However, the resulting df2 just has "missing" in all the cells where it should have substituted the nearest number. I know there are base R alternatives to match.closest but I'm not sure that's the problem. Any ideas?

You don't need if..else. Your rule can be simplified: always take the blocklist element with the smallest absolute difference from blockID. If a value matches exactly, the absolute difference is 0, which is always the smallest.
With that here's a simple base R solution -
df$blockmatch <- sapply(df$blockID, function(x) blocklist[order(abs(x - blocklist))][1])
index blockID blockmatch
1 1 100 100
2 2 120 120
3 3 132 130
4 4 133 130
5 5 201 201
6 6 207 205
7 7 210 210
8 8 238 238
9 9 240 240
10 10 256 256
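As a small variation on the line above, the sorted lookup order(...)[1] can be replaced by which.min(), which returns the position of the first smallest difference without sorting. This is only a sketch; the helper name nearest_value is made up for illustration and is not from the original answer.

```r
blockID   <- c(100, 120, 132, 133, 201, 207, 210, 238, 240, 256)
blocklist <- c(100, 120, 130, 150, 201, 205, 210, 238, 240, 256)

# Hypothetical helper: for each element of x, return the candidate with the
# smallest absolute difference (the first one wins if there is a tie).
nearest_value <- function(x, candidates) {
  vapply(x,
         function(v) candidates[which.min(abs(v - candidates))],
         numeric(1))
}

blockmatch <- nearest_value(blockID, blocklist)
blockmatch
```

vapply() is used instead of sapply() so the result is guaranteed to be a numeric vector of the same length as the input.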
Here are a couple of ways with dplyr -
df %>%
rowwise() %>%
mutate(
blockmatch = blocklist[order(abs(blockID - blocklist))][1]
)
df %>%
mutate(
blockmatch = sapply(blockID, function(x) blocklist[order(abs(x - blocklist))][1])
)
Thanks to @Onyambu, here's a faster way -
df$blockmatch <- blocklist[max.col(-abs(sapply(blocklist, '-', df$blockID)))]
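To see what the one-liner does, it may help to build the distance matrix explicitly. The sketch below uses outer() instead of sapply() (a swapped-in but equivalent way to get one row per blockID and one column per blocklist entry) and sets ties.method = "first", because max.col() breaks ties at random by default. The variable names dist_mat and nearest_col are illustrative only.

```r
blockID   <- c(100, 120, 132, 133, 201, 207, 210, 238, 240, 256)
blocklist <- c(100, 120, 130, 150, 201, 205, 210, 238, 240, 256)

# Rows correspond to blockID values, columns to blocklist values.
dist_mat <- abs(outer(blockID, blocklist, "-"))

# The largest negated distance is the smallest distance;
# ties.method = "first" makes the choice deterministic.
nearest_col <- max.col(-dist_mat, ties.method = "first")

blockmatch <- blocklist[nearest_col]
blockmatch
```

With the example data there are no ties, so the default random tie-breaking would give the same result here; the explicit setting only matters when a value sits exactly halfway between two list entries.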


How to include number of rows aggregated using aggregate() in R [duplicate]

I have dataset with a parentID variable and a childIQ variable which represents the IQ of the children of that specific parent:
df <- data.frame("parentID"=c(101,101,101,204,204,465,465),
"childIQ"=c(98,90,81,96,87,71,65))
parentID, childIQ
101, 98
101, 90
101, 81
204, 96
204, 87
465, 71
465, 65
I ran an aggregate() function so there is only 1 row per parent, and the childIQ value becomes the mean IQ of that parent's children:
df_agg <- aggregate(childIQ ~ parentID , data = df, mean)
parentID, avg_childIQ
101, 89.67
204, 91.5
465, 68
However, I want to add another column that represents the number of children for that parent, like this:
parentID, avg_childIQ, num_children
101, 89.67, 3
204, 91.5, 2
465, 68, 2
I'm not sure how to do this using data.table once I have already created df_agg?
aggregate() can apply several summary functions at once if you pass it a function(x) c(...) that returns a named vector:
df_agg <- aggregate(childIQ ~ parentID , data = df,
function(x) c(mean = mean(x),
n = length(x)))
#> parentID childIQ.mean childIQ.n
#> 1 101 89.66667 3.00000
#> 2 204 91.50000 2.00000
#> 3 465 68.00000 2.00000
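One thing to be aware of: although the printout above looks like three columns, aggregate() stores the result of a vector-valued function as a single matrix column. A common way to flatten it into ordinary columns is do.call(data.frame, ...). The sketch below shows this; the name df_flat is just an illustrative choice.

```r
df <- data.frame(parentID = c(101, 101, 101, 204, 204, 465, 465),
                 childIQ  = c(98, 90, 81, 96, 87, 71, 65))

df_agg <- aggregate(childIQ ~ parentID, data = df,
                    FUN = function(x) c(mean = mean(x), n = length(x)))

# df_agg has only two columns; childIQ is a matrix with columns "mean" and "n".
ncol(df_agg)

# Flatten the matrix column into separate data frame columns.
df_flat <- do.call(data.frame, df_agg)
names(df_flat)
df_flat
```

After flattening, the columns can be renamed or used like any others, for example df_flat$childIQ.n for the number of children.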
Using dplyr:
library(dplyr)
df %>% group_by(parentID) %>% summarise(avg_childIQ = mean(childIQ), num_children = n())
# A tibble: 3 x 3
parentID avg_childIQ num_children
<dbl> <dbl> <int>
1 101 89.7 3
2 204 91.5 2
3 465 68 2
Using data.table:
library(data.table)
setDT(df)[, list(avg_childIQ = mean(childIQ), num_children = .N), by = parentID]
parentID avg_childIQ num_children
1: 101 89.66667 3
2: 204 91.50000 2
3: 465 68.00000 2

How can I differentiate between months using pairs in R

data(airquality)
a=airquality
convert_fahr_to_kelvin <- function(temp) {
kelvin <- ((temp - 32) * (5 / 9)) + 273.15
return(kelvin)
}
a[,4]=
convert_fahr_to_kelvin(a[,4])
oz=a[,1]
sr=a[,2]
wv=a[,3]
te=a[,4]
pairs(~oz+sr+wv+te,
col = c("orange") ,
pch = c(18),
labels = c("Ozono", "Irradiancia Solar", "Velocidad del viento","Temperatura"),
main = "Diagramas de dispersión por parejas")
This is what I have so far. What I would actually like is to differentiate between months: for example, the first 31 rows of my matrix a (across all columns) should be coloured green, and likewise a different colour for each of the other months. I tried to split the rows into groups using a group vector:
group <- NA
group[sr[1:31]]<-1
group[sr[32:61]]<-2
group[sr[62:92]]<-3
group[sr[93:123]]<-4
group[sr[124:153]]<-5
group[sr[1:31]]
group[sr[32:61]]
group[sr[62:92]]
group[sr[93:123]]
group[sr[124:153]]
The numbers in the output are repeated. What I get is that rows with the same value in a column end up in the same group. I have tried to solve it in other ways, but I still don't get what I want.
It is easier to create a group with gl
group <- as.integer(gl(length(sr), 31, length(sr)))
table(group)
#group
#1 2 3 4 5
#31 31 31 31 29
In the OP's code, 'group' is initialized as NA of length 1. Then, it is assigned based on values of 'sr' instead of just
group <- integer(length(sr))
group[1:31] <- 1
group[32:61] <- 2
...
whereas if we use sr values as index
sr[1:31]
#[1] 190 118 149 313 NA NA 299 99 19 194 NA 256 290 274 65 334 307 78 322 44 8 320 25 92 66 266 NA 13 252 223 279
then group values that are changed to 1 are at positions 190, 118, 149, 313, ....
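Note that fixed blocks of 31 rows only approximate the calendar months, since June and September have 30 days. The airquality data set already has a Month column, so a sketch that colours the points by it directly might look like the following. The colour names and the variable names month_group and month_cols are arbitrary choices for illustration, not part of the original post.

```r
data(airquality)

# Map the months 5..9 onto group numbers 1..5.
month_group <- as.integer(factor(airquality$Month))
table(month_group)

# One colour per month (arbitrary picks).
month_cols <- c("green", "orange", "blue", "red", "purple")

# Colour each point according to the month of its row.
pairs(~ Ozone + Solar.R + Wind + Temp, data = airquality,
      col = month_cols[month_group], pch = 18,
      main = "Pairwise scatter plots by month")
```

Indexing month_cols by month_group gives one colour per row, which is what the col argument needs to colour points individually.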

In a series of series, how to subtract every 1st number in each sub-series event from every nth number in those events?

I have multiple series of timepoints. Some series have five timepoints, others have ten or fifteen timepoints. The series are in multiples of five because the event I am measuring is always five timepoints long; some recordings have multiple events in succession. For instance:
Series 1:
0
77
98
125
174
Series 2:
0
69
95
117
179
201
222
246
277
293
0 marks the beginning of each series. Series 1 is a single event, but Series 2 is two events in succession. The 6th timepoint in Series 2 is the start of the second event in that series.
I have an R dataframe that contains every timepoint in one column:
dd <- data.frame(
timepoint=c(0, 77, 98, 125, 174,
0, 69, 95, 117, 179, 201, 222, 246, 277, 293)
)
I need to know the duration from the start of each event to the 4th timepoint in each event. For the above data, that means:
Duration 1: 125 - 0 = 125
Duration 2: 117 - 0 = 117
Duration 3: 277 - 201 = 76
How can I write a simple piece of R code that will tell me the duration of that interval regardless of how many series or events there are, i.e. regardless of how many numbers are in the column?
I tried using diff() and seq_along(), but that seems only useful for every nth number, which doesn't work in this case.
diff(vec[seq_along(vec) %% 4 == 1])
This is maybe one way to do it with dplyr. We break up the data into "runs", which reset at each 0, and then into "sequences", which reset every 5 values.
dd %>%
group_by(run =cumsum(timepoint==0)) %>%
mutate(seq = (row_number()-1) %/% 5 + 1) %>%
group_by(run, seq) %>%
summarize(diff=timepoint[4]-timepoint[1])
# run seq diff
# <int> <dbl> <dbl>
# 1 1 1 125
# 2 2 1 117
# 3 2 2 76
It makes it somewhat easy to tie the value back to where it came from.
If you just wanted to use indexing, here's a helper function
diff4v1 <- function(x) {
idx <- (seq_along(x) - 1) %% 5 + 1
x[idx==4] - x[idx==1]
}
diff4v1(dd$timepoint)
# [1] 125 117 76
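Because every event is exactly five timepoints long, the same indexing idea can also be sketched by reshaping the vector into a matrix with one event per column. This assumes the total number of timepoints is a multiple of five, which the code checks; the names events and durations are illustrative.

```r
timepoint <- c(0, 77, 98, 125, 174,
               0, 69, 95, 117, 179, 201, 222, 246, 277, 293)

# Assumption: the data consists only of complete five-point events.
stopifnot(length(timepoint) %% 5 == 0)

# matrix() fills column by column, so each column holds one event.
events <- matrix(timepoint, nrow = 5)

# 4th timepoint minus 1st timepoint, per event.
durations <- events[4, ] - events[1, ]
durations
```

Changing the row indices (for example events[5, ] - events[2, ]) gives any other within-event interval without touching the rest of the code.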
This is your data frame (hypothetical)
df = data.frame(series = round(rnorm(40, 100, 50)))
head(df)
series
1 16
2 35
3 75
4 125
5 190
6 85
And these are your differences
idx = c(1:nrow(df))
df[which(idx %% 5 == 4), "series"] - df[which(idx %% 5 == 1), "series"]
[1] 109 -38 -101 -47 34 -52 -63 -5

How to write a loop that looks for a condition in two columns then adds the value in the third of a data frame?

I have a data frame with four columns, and I need to find a way to sum the values in the third column, but only for rows where the numbers in the first two columns are different. The only way I can think of is a loop with an if statement. Is that something that can be done, or is there a better way?
Genotype summary
Dnov1a Dnov1b Freq rel_geno_freq
1 220 220 1 0.003367003
7 220 224 4 0.013468013
8 224 224 8 0.026936027
13 220 228 14 0.047138047
This is a portion of the data as an example. I need to sum the third column, Freq, for rows 7 and 13 because the values in their first two columns differ.
Here's a tidyverse way of doing it:
library(tidyverse)
data <- tribble(
~Dnov1a, ~Dnov1b, ~Freq, ~rel_geno_freq,
220, 220, 1, 0.003367003,
220, 224, 4, 0.013468013,
224, 224, 8, 0.026936027,
220, 228, 14, 0.047138047)
data %>%
filter(Dnov1a != Dnov1b) %>%
summarise(Total = sum(Freq))
# A tibble: 1 x 1
Total
<dbl>
1 18
data$new = data$Dnov1a != data$Dnov1b
data
Dnov1a Dnov1b Freq rel_geno_freq new
<int> <int> <int> <dbl> <lgl>
1 220 220 1 0.00337 FALSE
2 220 224 4 0.0135 TRUE
3 224 224 8 0.0269 FALSE
4 220 228 14 0.0471 TRUE
sum(data$Freq[data$new])
18
Is this what you are looking for?

Breaking down a timed sequence into episodes

I'm trying to break down a vector of event times into episodes. An episode must meet 2 criteria. 1) It consists of 3 or more events and 2) those events have inter-event times of 25 time units or less. My data is organized in a data frame as shown below.
So far, I have figured out that I can find the difference between events with diff(EventTime). By creating a logical vector that marks the events that meet the 2nd (inter-event) criterion, I can use rle(EpisodeTimeCriterion) to get the total number and the lengths of the episodes.
EventTime TimeDifferenceBetweenNextEvent EpisodeTimeCriterion
25 NA NA
75 50 TRUE
100 25 TRUE
101 1 TRUE
105 4 TRUE
157 52 FALSE
158 1 TRUE
160 2 TRUE
167 7 TRUE
169 2 TRUE
170 1 TRUE
175 5 TRUE
178 3 TRUE
278 100 FALSE
302 24 TRUE
308 6 TRUE
320 12 TRUE
322 459 FALSE
However, I would like to know the timing of the episodes and 'rle()' doesn’t let me do that.
Ideally I would like to generate a data frame that looks like this:
Episode EventsPerEpisode EpisodeStartTime EpisodeEndTime
1 4 75 105
2 7 158 178
3 3 302 322
I know that this is probably a simple problem but, being new to R, the only solution I can envision is some series of loops. Is there a way of doing this without loops? Or is there a package that lends itself to this sort of analysis?
Thanks!
Edited for clarity. Added a desired outcome data frame and expanded the example data to make it clearer.
You've got the pieces you need. You really just need a variable that gives each episode a number/name so you can group by it. rle(...)$lengths gives you the run lengths, so use rep to repeat each run number that many times:
runs <- rle(df$EpisodeTimeCriterion)$lengths # You don't need this extra variable, but it makes the code more readable
df$Episode <- rep(1:length(runs), runs)
so df looks like
> head(df)
EventTime TimeDifferenceBetweenNextEvent EpisodeTimeCriterion Episode
1 25 NA NA 1
2 75 50 TRUE 2
3 100 25 TRUE 2
4 101 1 TRUE 2
5 105 4 TRUE 2
6 157 52 FALSE 3
Now use dplyr to summarize the data:
library(dplyr)
df2 <- df %>% filter(EpisodeTimeCriterion) %>% group_by(Episode) %>%
summarise(EventsPerEpisode = n(),
EpisodeStartTime = min(EventTime),
EpisodeEndTime = max(EventTime))
which returns
> df2
Source: local data frame [3 x 4]
Episode EventsPerEpisode EpisodeStartTime EpisodeEndTime
(int) (int) (dbl) (dbl)
1 2 4 75 105
2 4 7 158 178
3 6 3 302 320
If you want your episode numbers to be integers starting with one, you can clean up with
df2$Episode <- 1:nrow(df2)
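The run lengths from rle() can also be turned into start and end positions directly with cumsum(), which answers the "timing" part without any grouping. The sketch below additionally enforces the first criterion (at least 3 events per episode), which the summary above does not check explicitly. The vectors are copied from the example data; the names starts, ends, keep and episodes are illustrative.

```r
event_time <- c(25, 75, 100, 101, 105, 157, 158, 160, 167, 169, 170,
                175, 178, 278, 302, 308, 320, 322)
criterion  <- c(NA, TRUE, TRUE, TRUE, TRUE, FALSE,
                TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE,
                TRUE, TRUE, TRUE, FALSE)

r      <- rle(criterion)
ends   <- cumsum(r$lengths)        # last position of each run
starts <- ends - r$lengths + 1     # first position of each run

# Keep runs of TRUE that contain at least 3 events (NA runs are dropped).
keep <- !is.na(r$values) & r$values & r$lengths >= 3

episodes <- data.frame(
  Episode          = seq_len(sum(keep)),
  EventsPerEpisode = r$lengths[keep],
  EpisodeStartTime = event_time[starts[keep]],
  EpisodeEndTime   = event_time[ends[keep]]
)
episodes
```

For the example data this yields the same three episodes as the dplyr summary, already numbered from one.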
Data
If someone wants to play with the data, the results of dput(df) before running the above code:
df <- structure(list(EventTime = c(25, 75, 100, 101, 105, 157, 158,
160, 167, 169, 170, 175, 178, 278, 302, 308, 320, 322), TimeDifferenceBetweenNextEvent = c(NA,
50, 25, 1, 4, 52, 1, 2, 7, 2, 1, 5, 3, 100, 24, 6, 12, 459),
EpisodeTimeCriterion = c(NA, TRUE, TRUE, TRUE, TRUE, FALSE,
TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE,
TRUE, FALSE)), .Names = c("EventTime", "TimeDifferenceBetweenNextEvent",
"EpisodeTimeCriterion"), row.names = c(NA, -18L), class = "data.frame")
Here is one approach I came up with using a combination of cut2 from Hmisc package and cumsum to label episodes into numbers:
library(Hmisc)
library(dplyr)
df$episodeCut <- cut2(df$TimeDifferenceBetweenNextEvent, c(26))
df$episode <- cumsum((df$episodeCut == '[ 1,26)' & lag(df$episodeCut) != '[ 1,26)') | df$episodeCut != '[ 1,26)')
Output is as follows:
EventTime TimeDifferenceBetweenNextEvent EpisodeTimeCriterion episodeCut episode
1 25 50 FALSE [26,52] 1
2 75 25 TRUE [ 1,26) 2
3 100 1 TRUE [ 1,26) 2
4 101 4 TRUE [ 1,26) 2
5 105 52 TRUE [26,52] 3
6 157 52 FALSE [26,52] 4
As you can see, it tags rows 2, 3, 4 as belonging to a single episode.
Is this what you are looking for? Not sure from your description. So, my answer may be wrong.
