Check if a value from one dataframe is within a range in another dataframe in R

I am looking for a way to look up information from one dataframe in another dataframe, get a value from that other dataframe, and pass it back to the first one.
example data:
I've got a dataframe named "x"
x <- structure(list(from = c(1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L
), to = c(2L, 3L, 4L, 5L, 6L, 2L, 3L, 4L, 5L, 6L), number = c(30,
30, 30, 33, 34, 35, 36, 37, 38, 39), name = c("region 1", "region 2",
"region 3", "region 4", "region 5", "region 6", "region 7", "region 8",
"region 9", "region 10")), .Names = c("from", "to", "number",
"name"), row.names = c(NA, -10L), class = "data.frame")
# from to number name
#1 1 2 30 region 1
#2 2 3 30 region 2
#3 3 4 30 region 3
#4 4 5 33 region 4
#5 5 6 34 region 5
#6 1 2 35 region 6
#7 2 3 36 region 7
#8 3 4 37 region 8
#9 4 5 38 region 9
#10 5 6 39 region 10
This dataframe holds information about certain regions (1-10)
I've got another dataframe "y"
y <- structure(list(location = c(1.5, 2.8, 10, 3.5, 2), id_number =
c(30, 30, 38, 40, 36)), .Names = c("location", "id_number"), row.names
= c(NA, -5L), class = "data.frame")
# location id_number
#1 1.5 30
#2 2.8 30
#3 10.0 38
#4 3.5 40
#5 2.0 36
This one contains information about locations.
What I need is a function (or command, or whatever I can throw at R ;-) ) that:
for every row in y: checks whether y$location falls between x$from and x$to AND y$id_number == x$number.
If a match is found (a location in y can only fall in 1 row of x, or in 0; it is impossible for a row of y to match two rows of x), return x$name to a new column in y, named "name".
desired output:
# location id_number name
#1 1.5 30 region 1
#2 2.8 30 region 2
#3 10.0 38 <NA>
#4 3.5 40 <NA>
#5 2.0 36 region 7
I'm pretty new to R, so my first idea was to use for-loops to tackle this problem (as I'm used to doing in VB). But then I thought: "noooooo", I have to vectorise it, like all the people tell me good R programmers do ;-)
So I came up with a function, and called it with adply (from the plyr-package).
Problem is: It does not work, throws me an error I don't understand, and now I'm stuck...
Can anyone point me in the right direction?
require("plyr")   # adply() comes from plyr, not dplyr
require("dplyr")
getValue <- function(y, x) {
  tmp <- x %>%
    filter(from <= y$location, to > y$location, number == y$id_number)
  return(tmp$name)
}
y["name"] <- adply(y, 1, getValue, x = x)
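For completeness, this kind of range lookup can also be done in one step with a data.table non-equi join; a sketch, assuming the bounds are inclusive (from <= location <= to):

```r
library(data.table)

# the example data from the question
x <- data.frame(from = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
                to   = c(2, 3, 4, 5, 6, 2, 3, 4, 5, 6),
                number = c(30, 30, 30, 33, 34, 35, 36, 37, 38, 39),
                name = paste("region", 1:10))
y <- data.frame(location = c(1.5, 2.8, 10, 3.5, 2),
                id_number = c(30, 30, 38, 40, 36))

xt <- as.data.table(x)
yt <- as.data.table(y)

# non-equi join: match rows on equal id numbers AND location inside [from, to];
# rows of y without a match get NA automatically
yt[, name := xt[yt,
                on = .(number = id_number, from <= location, to >= location),
                name]]
yt
```

Each row of y keeps its position, so the result can be assigned straight back as a new column; note the OP's own filter used `to > location`, so adjust the `>=` if you need a half-open interval.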

Another base method (mostly):
# we need magrittr for the pipe in the last step - if you don't use it,
# just wrap the sapply around the lapply
library(magrittr)
# a list of logical vectors: is each location in y inside each from/to range in x?
locationok <- lapply(y$location, function(z) z >= x$from & z <= x$to)
# another list of logical vectors: does each id_number in y match x$number?
idok <- lapply(y$id_number, function(z) z == x$number)
# combine the two lists and use the combined vectors as an index on x$name
lapply(1:nrow(y), function(i) {
  x$name[locationok[[i]] & idok[[i]]]
}) %>%
  # replace zero-length results with NA values
  sapply(function(x) ifelse(length(x) == 0, NA, x))

Here's a simple base method that uses the OP's logic:
f <- function(vec, id) {
  if (length(.x <- which(vec >= x$from & vec <= x$to & id == x$number))) .x else NA
}
y$name <- x$name[mapply(f, y$location, y$id_number)]
y
# location id_number name
#1 1.5 30 region 1
#2 2.8 30 region 2
#3 10.0 38 <NA>
#4 3.5 40 <NA>
#5 2.0 36 region 7

Since you want to match the id_number and number columns, you can join x and y on those columns and then set name to NA wherever the location doesn't fall between from and to. Here is a dplyr option:
library(dplyr)
y %>%
  left_join(x, by = c("id_number" = "number")) %>%
  mutate(name = if_else(location >= from & location <= to, as.character(name), NA_character_)) %>%
  select(-from, -to) %>%
  arrange(name) %>%
  distinct(location, id_number, .keep_all = TRUE)
# location id_number name
# 1 1.5 30 region 1
# 2 2.8 30 region 2
# 3 2.0 36 region 7
# 4 10.0 38 <NA>
# 5 3.5 40 <NA>

Related

Finding minimum by groups and among columns

I am trying to find the minimum value among different columns and group.
A small sample of my data looks something like this:
group cut group_score_1 group_score_2
1 a 1 3 5.0
2 b 2 2 4.0
3 a 0 2 2.5
4 b 3 5 4.0
5 a 2 3 6.0
6 b 1 5 1.0
I want to group by group and, for each group, find the row which contains the minimum score across both group_score columns, and then also get the name of the column which contains that minimum (group_score_1 or group_score_2),
so basically my result should be something like this:
group cut group_score_1 group_score_2
1 a 0 2 2.5
2 b 1 5 1.0
I tried a few ideas, and eventually ended up splitting the data into several new data frames, filtering by group, selecting the relevant columns and then using which.min(), but I'm sure there's a much more efficient way to do it. Not sure what I am missing.
We can use data.table methods
library(data.table)
setDT(df)[df[, .I[which.min(do.call(pmin, .SD))],
group, .SDcols = patterns('^group_score')]$V1]
# group cut group_score_1 group_score_2
#1: a 0 2 2.5
#2: b 1 5 1.0
For each group, you can calculate the minimum value and select the row in which that value exists in one of the columns.
library(dplyr)
df %>%
  group_by(group) %>%
  filter({tmp = min(group_score_1, group_score_2);
          group_score_1 == tmp | group_score_2 == tmp})
# group cut group_score_1 group_score_2
# <chr> <int> <int> <dbl>
#1 a 0 2 2.5
#2 b 1 5 1
The above works well when you have only two group_score columns. If you have many such columns, it is not practical to list each one of them as group_score_1 == tmp | group_score_2 == tmp etc. In that case, get the data in long format, get the cut value corresponding to the minimum value, and join back to the data. This assumes cut is unique within each group.
df %>%
  tidyr::pivot_longer(cols = starts_with('group_score')) %>%
  group_by(group) %>%
  summarise(cut = cut[which.min(value)]) %>%
  left_join(df, by = c("group", "cut"))
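With dplyr >= 1.0 the same idea can be written with slice_min(); a sketch assuming just the two score columns from the question:

```r
library(dplyr)

df <- data.frame(group = c("a", "b", "a", "b", "a", "b"),
                 cut = c(1L, 2L, 0L, 3L, 2L, 1L),
                 group_score_1 = c(3L, 2L, 2L, 5L, 3L, 5L),
                 group_score_2 = c(5, 4, 2.5, 4, 6, 1))

# per group, keep the row whose smaller of the two scores is the group minimum
out <- df %>%
  group_by(group) %>%
  slice_min(pmin(group_score_1, group_score_2), n = 1) %>%
  ungroup()
out
```

This does not by itself tell you which column held the minimum; combine it with which.min over the two columns if you also need the column name.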
Here is a base R option using pmin + ave + subset
subset(
df,
as.logical(ave(
do.call(pmin, df[grep("group_score_\\d+", names(df))]),
group,
FUN = function(x) x == min(x)
))
)
which gives
group cut group_score_1 group_score_2
3 a 0 2 2.5
6 b 1 5 1.0
Data
> dput(df)
structure(list(group = c("a", "b", "a", "b", "a", "b"), cut = c(1L,
2L, 0L, 3L, 2L, 1L), group_score_1 = c(3L, 2L, 2L, 5L, 3L, 5L
), group_score_2 = c(5, 4, 2.5, 4, 6, 1)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))

Calculating a sum based on custom rules in a data frame

Preferably using data.table in R: I want to calculate the sum of DIAM by ID and CYCLE based on the following rules:
if any DIAM for a particular subject and cycle is NE, then the sum can't be calculated (must return NA)
if any DIAM is NA, then calculate the sum ignoring the NA (i.e. as if it were 0)
if none is NA, then calculate the sum as normal
Also, I would like to convert CYCLE to numeric, with BASELINE representing 0.
dfin <-
ID CYCLE    NUM DIAM
1  BASELINE 1   8
1  BASELINE 2   4
1  CYCLE 1  1   6
1  CYCLE 1  2   2
1  CYCLE 2  1   6
1  CYCLE 2  2   NE
1  CYCLE 3  1   6
1  CYCLE 3  2   NA
dfout <-
ID CYCLE SUM
1  0     12
1  1     8
1  2     NA
1  3     6
This needs to be applied for every subject. There are many cycles, but this is just an example.
Here is one option. Grouped by 'ID' and by the matched index of 'CYCLE' (as shown in the expected output), change the 'DIAM' values to NA if any of them is "NE", then summarise by taking the sum of 'DIAM' while making sure that if all of the values are NA, the result is NA.
library(tidyverse)
dfin %>%
  group_by(ID, CYCLE = match(CYCLE, unique(CYCLE)) - 1) %>%
  mutate(DIAM = as.numeric(replace(DIAM, any(DIAM == "NE"), NA))) %>%
  summarise(SUM = NA^all(is.na(DIAM)) * sum(DIAM, na.rm = TRUE))
# A tibble: 4 x 3
# Groups: ID [?]
# ID CYCLE SUM
# <int> <dbl> <dbl>
#1 1 0 12
#2 1 1 8
#3 1 2 NA
#4 1 3 6
Or use an if/else condition after the group_by step
dfin %>%
  group_by(ID, CYCLE = match(CYCLE, unique(CYCLE)) - 1) %>%
  summarise(SUM = if ("NE" %in% DIAM) NA else sum(as.numeric(DIAM), na.rm = TRUE))
Or using the same logic with data.table
library(data.table)
setDT(dfin)[, .(SUM = if ("NE" %in% DIAM) NA_real_ else
                  sum(as.numeric(DIAM), na.rm = TRUE)),
            .(ID, CYCLE = rleid(CYCLE) - 1)]
# ID CYCLE SUM
#1: 1 0 12
#2: 1 1 8
#3: 1 2 NA
#4: 1 3 6
data
dfin <- structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L),
CYCLE = c("BASELINE",
"BASELINE", "CYCLE 1", "CYCLE 1", "CYCLE 2", "CYCLE 2", "CYCLE 3",
"CYCLE 3"), NUM = c(1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L), DIAM = c("8",
"4", "6", "2", "6", "NE", "6", NA)), row.names = c(NA, -8L),
class = "data.frame")
# Data created
dfin <- data.table(
  "ID" = rep(x = 1, times = 8),
  "CYCLE" = c("BASELINE", "BASELINE", "CYCLE 1", "CYCLE 1",
              "CYCLE 2", "CYCLE 2", "CYCLE 3", "CYCLE 3"),
  "NUM" = rep(x = c(1, 2), times = 4),
  "DIAM" = c(8, 4, 6, 2, 6, "NE", 6, NA)
)
# CYCLE transformed
dfin[, CYCLE := as.numeric(ifelse(CYCLE == "BASELINE", "0",
                                  substr(x = CYCLE, start = 7, stop = 7)))]
# SUM computed
dfin2 <- dfin[, .(SUM = if (CYCLE == 0) {
  NA_real_
} else if ("NE" %in% DIAM) {
  NA_real_
} else {
  sum(as.numeric(DIAM), na.rm = TRUE)
}), by = c("ID", "CYCLE")]
# IDs with CYCLE = 0 present have SUM updated to NA
dfin2[ID %in% ID[which(CYCLE == 0)], SUM := NA]
Hope this helps!

Remove all rows of a category if one row meets a condition [duplicate]

This question already has answers here:
Remove group from data.frame if at least one group member meets condition
(4 answers)
Closed 1 year ago.
Problem:
I want to remove all the rows of a specific category if one of the rows has a certain value in another column (similar to the problems in the links below). However, the main difference is that I would like it to only apply when another column matches a criterion.
Making a practice df
prac_df <- data.frame(
  subj = rep(1:4, each = 2),
  ias = rep(c('A', 'B'), times = 4),
  fixations = c(17, 14, 0, 0, 15, 0, 8, 6)
)
So my data frame looks like this.
subj ias fixations
1 1 A 17
2 1 B 14
3 2 A 0
4 2 B 0
5 3 A 15
6 3 B 0
7 4 A 8
8 4 B 6
And I want to remove all of subject 2 because it has a value of 0 in the fixations column in a row where ias has a value of A. However, I want to do this without removing subject 3, because even though there is a 0, it is in a row where the ias column has a value of B.
My attempt so far.
new.df <- prac_df[with(prac_df, ave(prac_df$fixations != 0, subj, FUN = all)),]
However, this is missing the part that only removes a subject if the 0 appears where the ias column has the value A. I've attempted various uses of & and if, but I feel like there's likely a clever and clean way I just don't know of.
My goal is to make a df like this.
subj ias fixations
1 1 A 17
2 1 B 14
3 3 A 15
4 3 B 0
5 4 A 8
6 4 B 6
Thank you very much!
Related questions:
R: Remove rows from data frame based on values in several columns
How to remove all rows belonging to a particular group when only one row fulfills the condition in R?
We group by 'subj' and then filter based on the logical condition created with any and !
library(dplyr)
df1 %>%
  group_by(subj) %>%
  filter(!any(fixations == 0 & ias == "A"))
# subj ias fixations
# <int> <chr> <int>
#1 1 A 17
#2 1 B 14
#3 3 A 15
#4 3 B 0
#5 4 A 8
#6 4 B 6
Or use all with |
df1 %>%
  group_by(subj) %>%
  filter(all(fixations != 0 | ias != "A"))
The same approach can be used with ave from base R
df1[with(df1, !ave(fixations==0 & ias =="A", subj, FUN = any)),]
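The same group-wise condition also translates directly to data.table; a sketch, where a group's rows are kept only when the if condition returns .SD:

```r
library(data.table)

df1 <- data.frame(subj = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L),
                  ias = c("A", "B", "A", "B", "A", "B", "A", "B"),
                  fixations = c(17L, 14L, 0L, 0L, 15L, 0L, 8L, 6L))

dt <- as.data.table(df1)
# keep a subject's rows only if no row has fixations == 0 where ias == "A"
res <- dt[, if (!any(fixations == 0 & ias == "A")) .SD, by = subj]
res
```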
data
df1 <- structure(list(subj = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L), ias = c("A",
"B", "A", "B", "A", "B", "A", "B"), fixations = c(17L, 14L, 0L,
0L, 15L, 0L, 8L, 6L)), .Names = c("subj", "ias", "fixations"),
class = "data.frame", row.names = c("1", "2", "3", "4", "5", "6", "7", "8"))

Transform a dataframe to use first column values as column names

I have a dataframe with 2 columns:
.id vals
1 A 10
2 B 20
3 C 30
4 A 100
5 B 200
6 C 300
dput(tst_df)
structure(list(.id = structure(c(1L, 2L, 3L, 1L, 2L, 3L), .Label = c("A",
"B", "C"), class = "factor"), vals = c(10, 20, 30, 100, 200,
300)), .Names = c(".id", "vals"), row.names = c(NA, -6L), class = "data.frame")
Now I want the .id column to become my column names and the vals to become two rows.
Like this:
A B C
10 20 30
100 200 300
Basically, .id is my grouping variable and I want all values belonging to one group collected under that group's column. I expected something simple like melt and cast to work, but after many tries I still haven't succeeded. Is anyone familiar with a function that will accomplish this?
You can do this in base R with unstack:
unstack(df, form=vals~.id)
A B C
1 10 20 30
2 100 200 300
The first argument is the name of the data.frame and the second is a formula which determines the unstacked structure.
You can also use tapply,
do.call(cbind, tapply(df$vals, df$.id, I))
# A B C
#[1,] 10 20 30
#[2,] 100 200 300
or wrap it in data frame, i.e.
as.data.frame(do.call(cbind, tapply(df$vals, df$.id, I)))
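A tidyr alternative is pivot_wider(); it needs a helper row index so it knows which values end up on the same row. A sketch using the data from the question:

```r
library(dplyr)
library(tidyr)

tst_df <- data.frame(.id = rep(c("A", "B", "C"), times = 2),
                     vals = c(10, 20, 30, 100, 200, 300))

# number the occurrences within each .id, spread .id into columns,
# then drop the helper column
out <- tst_df %>%
  group_by(.id) %>%
  mutate(row = row_number()) %>%
  ungroup() %>%
  pivot_wider(names_from = .id, values_from = vals) %>%
  select(-row)
out
```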

Average column values across all rows of a data frame

I've got a data frame that I read from a file like this:
name, points, wins, losses, margin
joe, 1, 1, 0, 1
bill, 2, 3, 0, 4
joe, 5, 2, 5, -2
cindy, 10, 2, 3, -2.5
etc.
I want to average out the column values across all rows of this data. Is there an easy way to do this in R?
For example, I want to get the average column values for all "Joe's", coming out with something like
joe, 3, 1.5, 2.5, -.5
After loading your data:
df <- structure(list(name = structure(c(3L, 1L, 3L, 2L), .Label = c("bill", "cindy", "joe"), class = "factor"), points = c(1L, 2L, 5L, 10L), wins = c(1L, 3L, 2L, 2L), losses = c(0L, 0L, 5L, 3L), margin = c(1, 4, -2, -2.5)), .Names = c("name", "points", "wins", "losses", "margin"), class = "data.frame", row.names = c(NA, -4L))
Just use the aggregate function:
> aggregate(. ~ name, data = df, mean)
name points wins losses margin
1 bill 2 3.0 0.0 4.0
2 cindy 10 2.0 3.0 -2.5
3 joe 3 1.5 2.5 -0.5
Obligatory plyr and reshape solutions:
library(plyr)
ddply(df, "name", function(x) mean(x[-1]))
library(reshape)
cast(melt(df), name ~ ..., mean)
And a data.table solution for easy syntax and memory efficiency
library(data.table)
DT <- data.table(df)
DT[,lapply(.SD, mean), by = name]
I have yet another way, which I will show on a different example.
Suppose we have a matrix xt:
a b c d
A 1 2 3 4
A 5 6 7 8
A 9 10 11 12
A 13 14 15 16
B 17 18 19 20
B 21 22 23 24
B 25 26 27 28
B 29 30 31 32
C 33 34 35 36
C 37 38 39 40
C 41 42 43 44
C 45 46 47 48
One can compute the mean over the duplicated rownames in a few steps:
1. Compute the means using the aggregate function.
2. Make two modifications: aggregate writes the group names as a new (first) column, so you have to set them back as the rownames...
3. ...and remove that column, by selecting columns 2 through the number of columns of the xa object.
xa = aggregate(xt, by = list(rownames(xt)), FUN = mean)
rownames(xa) = xa[, 1]
xa = xa[, 2:5]
After that we get:
a b c d
A 7 8 9 10
B 23 24 25 26
C 39 40 41 42
You can simply use functions from the tidyverse to group your data by name, and then summarise all remaining columns by a given function (eg. mean):
df <- tibble(name = c("joe", "bill", "joe", "cindy"),
             points = c(1, 2, 5, 10),
             wins = c(1, 3, 2, 2),
             losses = c(0, 0, 5, 3),
             margin = c(1, 4, -2, -2.5))
df %>% dplyr::group_by(name) %>% dplyr::summarise_all(mean)
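One more base R option: rowsum() gives group-wise column sums, and dividing by the group sizes turns those into means. Both rowsum() and table() order the groups alphabetically, so the rows line up. A sketch:

```r
df <- data.frame(name = c("joe", "bill", "joe", "cindy"),
                 points = c(1, 2, 5, 10),
                 wins = c(1, 3, 2, 2),
                 losses = c(0, 0, 5, 3),
                 margin = c(1, 4, -2, -2.5))

# sum the numeric columns per name, then divide by how often each name occurs
sums <- rowsum(df[-1], df$name)
means <- sums / as.vector(table(df$name))
means
```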
