Subset of data with criteria of two columns


I would like to create a subset of data that consists of Units that have a higher score in QTR 4 than QTR 1 (upward trend). Doesn't matter if QTR 2 or 3 are present.
Unit QTR Score
5 4 34
1 1 22
5 3 67
2 4 78
3 2 39
5 2 34
1 2 34
5 1 67
1 3 70
1 4 89
3 4 19
Subset would be:
Unit QTR Score
1 1 22
1 2 34
1 3 70
1 4 89
I've tried variants of something like this:
upward_subset <- subset(mydata,Unit if QTR=4~Score > QTR=1~Score)
Thank you for your time

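If the data frame is named "d", split it by Unit, test whether each Unit scores higher in QTR 4 than in QTR 1, and keep the rows of the Units that pass. Units missing either quarter are dropped:
keep <- sapply(split(d, d$Unit), function(dd) {
  q4 <- dd$Score[dd$QTR == 4]
  q1 <- dd$Score[dd$QTR == 1]
  length(q4) == 1 && length(q1) == 1 && q4 > q1  # FALSE if a quarter is missing
})
d[d$Unit %in% names(keep)[keep], ]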
#-------------
Unit QTR Score
2 1 1 22
7 1 2 34
9 1 3 70
10 1 4 89

An alternative in two steps, with the data frame named "test":
result <- unlist(
by(
test,
test$Unit,
function(x) x$Score[x$QTR==4] > x$Score[x$QTR==1])
)
test[test$Unit %in% names(result[result==TRUE]),]
Unit QTR Score
2 1 1 22
7 1 2 34
9 1 3 70
10 1 4 89

A solution using data.table (there may well be better versions than this one).
Note: this assumes each Unit has at most one row per QTR.
Data:
df <- structure(list(Unit = c(5L, 1L, 5L, 2L, 3L, 5L, 1L, 5L, 1L, 1L,
3L), QTR = c(4L, 1L, 3L, 4L, 2L, 2L, 2L, 1L, 3L, 4L, 4L), Score = c(34L,
22L, 67L, 78L, 39L, 34L, 34L, 67L, 70L, 89L, 19L)), .Names = c("Unit",
"QTR", "Score"), class = "data.frame", row.names = c(NA, -11L
))
Solution:
library(data.table)
dt <- data.table(df, key=c("Unit", "QTR"))
dt[, Score[Score[QTR == 4] > Score[QTR == 1]], by=Unit]
Unit V1
1: 1 22
2: 1 34
3: 1 70
4: 1 89
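
The answers above all compare the QTR 4 score with the QTR 1 score per Unit. As a rough base R sketch of the same idea, the two quarters can be joined side by side with merge, which makes the treatment of Units missing a quarter explicit (they simply do not appear in the join). The names q1, q4, both, up_units and upward_subset are illustrative only, and the sketch assumes at most one row per Unit and QTR:

```r
# Example data copied from the question
df <- data.frame(
  Unit  = c(5L, 1L, 5L, 2L, 3L, 5L, 1L, 5L, 1L, 1L, 3L),
  QTR   = c(4L, 1L, 3L, 4L, 2L, 2L, 2L, 1L, 3L, 4L, 4L),
  Score = c(34L, 22L, 67L, 78L, 39L, 34L, 34L, 67L, 70L, 89L, 19L)
)

# One row per Unit that has both a QTR 1 and a QTR 4 score
q1 <- df[df$QTR == 1, c("Unit", "Score")]
q4 <- df[df$QTR == 4, c("Unit", "Score")]
both <- merge(q1, q4, by = "Unit", suffixes = c(".q1", ".q4"))

# Units with an upward trend, then all rows belonging to those Units
up_units <- both$Unit[both$Score.q4 > both$Score.q1]
upward_subset <- df[df$Unit %in% up_units, ]
upward_subset
```

On the question's data only Unit 1 qualifies, so upward_subset holds its four rows.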

Related

Find the "top N" in a group and find the average of the "top N" in R

Rank Laps Average Time
1 1 1 30
2 2 1 34
3 3 1 35
4 1 2 32
5 2 2 33
6 3 2 56
7 4 1 43
8 5 1 23
9 6 1 31
10 4 2 23
11 5 2 88
12 6 2 54
I would like to know how I can group ranks 1-3 and ranks 4-6 and get an average of the "average time" for each lap. Also, I would like this to extend if I have groups 7-9, 10-13, etc.

One option is to use cut to put the different ranks into groups, and add Laps as a grouping variable. Then, you can summarize the data to get the mean.
library(tidyverse)
df %>%
group_by(gr = cut(Rank, breaks = seq(0, 6, by = 3)), Laps) %>%
summarize(avg = mean(Average_Time))
Output
gr Laps avg
<fct> <int> <dbl>
1 (0,3] 1 33
2 (0,3] 2 40.3
3 (3,6] 1 32.3
4 (3,6] 2 55
Or another option if you want the range of ranks displayed for the group:
df %>%
group_by(gr = cut(Rank, breaks = seq(0, 6, by = 3))) %>%
mutate(Rank_gr = paste0(min(Rank), "-", max(Rank))) %>%
group_by(Rank_gr, Laps) %>%
summarize(avg = mean(Average_Time))
Output
Rank_gr Laps avg
<chr> <int> <dbl>
1 1-3 1 33
2 1-3 2 40.3
3 4-6 1 32.3
4 4-6 2 55
Since you will have uneven groups, you might want to use case_when to define them:
df %>%
group_by(gr=case_when(Rank %in% 1:3 ~ "1-3",
Rank %in% 4:6 ~ "4-6",
Rank %in% 7:9 ~ "7-9",
Rank %in% 10:13 ~ "10-13"),
Laps) %>%
summarize(avg = mean(Average_Time))
Data
df <- structure(list(Rank = c(1L, 2L, 3L, 1L, 2L, 3L, 4L, 5L, 6L, 4L,
5L, 6L), Laps = c(1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L,
2L), Average_Time = c(30L, 34L, 35L, 32L, 33L, 56L, 43L, 23L,
31L, 23L, 88L, 54L)), class = "data.frame", row.names = c(NA,
-12L))
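
For uneven groups, cut also accepts an explicit vector of break points and labels, which may be shorter than listing every range in case_when. The following is a minimal base R sketch under that assumption; the breaks and labels vectors are illustrative and would need to match the real rank ranges:

```r
# Data copied from the answer above
df <- data.frame(
  Rank = c(1L, 2L, 3L, 1L, 2L, 3L, 4L, 5L, 6L, 4L, 5L, 6L),
  Laps = c(1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L),
  Average_Time = c(30L, 34L, 35L, 32L, 33L, 56L, 43L, 23L, 31L, 23L, 88L, 54L)
)

# Illustrative uneven groups: 1-3, 4-6, 7-9, 10-13 (intervals are right-closed)
breaks <- c(0, 3, 6, 9, 13)
labels <- c("1-3", "4-6", "7-9", "10-13")
df$gr <- cut(df$Rank, breaks = breaks, labels = labels)

# Mean of Average_Time for each rank group and lap
res <- aggregate(Average_Time ~ gr + Laps, data = df, FUN = mean)
res
```

Groups with no rows (here 7-9 and 10-13) do not appear in the result.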

how to remove part of a string without interrupting a data frame?

I have data that looks like this, but much bigger:
df<- structure(list(names = c("bests-1", "trible-1", "crazy-1", "cool-1",
"nonsense-1", "Mean-1", "Lose-1", "Trye-1", "Trified-1"), Col = c(1L,
2L, NA, 4L, 47L, 294L, 2L, 1L, 3L), col2 = c(2L, 4L, 5L, 7L,
9L, 9L, 0L, 2L, 3L)), class = "data.frame", row.names = c(NA,
-9L))
As an example, I am trying to remove -1 from all strings in the first column.
I can do this with
as.data.frame(str_remove_all(df$names, "-1"))
The problem is that this drops all the other columns as well.
I don't want to split the data and merge it again, because I am afraid of creating a mismatch.
Is there any way to get rid of specific strings without disturbing the rest of the data frame?
For instance, the output should look like this:
names Col col2
bests 1 2
trible 2 4
crazy NA 5
cool 4 7
nonsense 47 9
Mean 294 9
Lose 2 0
Trye 1 2
Trified 3 3

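Use gsub with $ to anchor the pattern at the end of the string, so that only a trailing -1 is removed. The hyphen is not special outside a character class, so the \\ escape in front of it is optional.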
transform(df, names=gsub('\\-1$', '', names))
# names Col col2
# 1 bests 1 2
# 2 trible 2 4
# 3 crazy NA 5
# 4 cool 4 7
# 5 nonsense 47 9
# 6 Mean 294 9
# 7 Lose 2 0
# 8 Trye 1 2
# 9 Trified 3 3
Data:
df <- structure(list(names = c("bests-1", "trible-1", "crazy-1", "cool-1",
"nonsense-1", "Mean-1", "Lose-1", "Trye-1", "Trified-1"), Col = c(1L,
2L, NA, 4L, 47L, 294L, 2L, 1L, 3L), col2 = c(2L, 4L, 5L, 7L,
9L, 9L, 0L, 2L, 3L)), class = "data.frame", row.names = c(NA,
-9L))

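Using the stringr package (note that str_remove_all removes every occurrence of -1, not only a trailing one):
library(stringr)
df$names <- str_remove_all(df$names, '-1')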
names Col col2
1 bests 1 2
2 trible 2 4
3 crazy NA 5
4 cool 4 7
5 nonsense 47 9
6 Mean 294 9
7 Lose 2 0
8 Trye 1 2
9 Trified 3 3

We could use trimws from base R (the whitespace argument requires R 3.6.0 or later):
df$names <- trimws(df$names, whitespace = "-\\d+")
Output
> df
names Col col2
1 bests 1 2
2 trible 2 4
3 crazy NA 5
4 cool 4 7
5 nonsense 47 9
6 Mean 294 9
7 Lose 2 0
8 Trye 1 2
9 Trified 3 3
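
The answers above differ in whether the pattern is anchored to the end of the string. A small sketch of the difference, using a made-up value "a-10-1" that is not in the original data:

```r
# "a-10-1" is a hypothetical value with "-1" in the middle as well as at the end
x <- c("bests-1", "a-10-1")

# Anchored: removes only a trailing "-1"
anchored <- sub("-1$", "", x)

# Unanchored, literal match: removes every "-1" it finds
unanchored <- gsub("-1", "", x, fixed = TRUE)

anchored    # "bests" "a-10"
unanchored  # "bests" "a0"
```

On the data in the question both give the same result, since "-1" only ever appears as a suffix there.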

How to get the difference between groups with a dataframe in long format in R?

I have a simple data frame with 2 IDs (N = 2) and 2 periods (T = 2), for example:
year id points
1 1 10
1 2 12
2 1 20
2 2 18
How does one obtain the following data frame (preferably using dplyr or another tidyverse solution)?
id points_difference
1 10
2 6
Note that the points_difference column is the change in points for each ID across time (namely T2 - T1).
Additionally, how can this be generalized to multiple columns and multiple IDs (with only 2 periods)?
year id points scores
1 1 10 7
1 ... ... ...
1 N 12 8
2 1 20 9
2 ... ... ...
2 N 12 9
id points_difference scores_difference
1 10 2
... ... ...
N 0 1

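If you are on dplyr 1.0.0 (or higher), summarise can return multiple rows per group, so this also works if you have more than 2 periods. (From dplyr 1.1.0 onward, returning more than one row per group from summarise is deprecated in favour of reframe.) You can do: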
library(dplyr)
df %>%
arrange(id, year) %>%
group_by(id) %>%
summarise(across(c(points, scores), diff, .names = '{col}_difference'))
# id points_difference scores_difference
# <int> <int> <int>
#1 1 10 2
#2 1 -7 1
#3 2 6 2
#4 2 -3 3
data
df <- structure(list(year = c(1L, 1L, 2L, 2L, 3L, 3L), id = c(1L, 2L,
1L, 2L, 1L, 2L), points = c(10L, 12L, 20L, 18L, 13L, 15L), scores = c(2L,
3L, 4L, 5L, 5L, 8L)), class = "data.frame", row.names = c(NA, -6L))
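
For the exact two-period case in the question, a base R sketch without dplyr could look like the following. It assumes every id has exactly one row for year 1 and one for year 2; the names df2p, ord and res are illustrative:

```r
# Two-period data copied from the question
df2p <- data.frame(
  year   = c(1L, 1L, 2L, 2L),
  id     = c(1L, 2L, 1L, 2L),
  points = c(10L, 12L, 20L, 18L)
)

# Sort so that, within each id, period 1 comes before period 2
ord <- df2p[order(df2p$id, df2p$year), ]

# T2 - T1 for each id
res <- aggregate(points ~ id, data = ord, FUN = function(p) p[2] - p[1])
names(res)[names(res) == "points"] <- "points_difference"
res
```

For several value columns, putting cbind(points, scores) on the left-hand side of the formula should extend the same approach.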

Merging two datasets by an ID without adding new columns that say ".x" or ".y"

Suppose I have two datasets. One main dataset, with many columns of metadata, and one new dataset which will be used to fill in some of the gaps in concentrations in the main dataset:
Main dataset:
study_id timepoint age occupation concentration1 concentration2
1 1 21 0 3 7
1 2 21 0 4 6
1 3 22 0 NA NA
1 4 22 0 NA NA
2 1 36 3 0 4
2 2 36 3 2 11
2 3 37 3 NA NA
2 4 37 3 NA NA
New data set to merge:
study_id timepoint concentration1 concentration2
1 3 11 20
1 4 21 35
2 3 7 17
2 4 14 25
Whenever I merge by "study_id" and "timepoint", I get two new columns that are "concentration1.y" and "concentration2.y" while the original columns get renamed as "concentration1.x" and "concentration2.x". I don't want this.
This is what I want:
study_id timepoint age occupation concentration1 concentration2
1 1 21 0 3 7
1 2 21 0 4 6
1 3 22 0 11 20
1 4 22 0 21 35
2 1 36 3 0 4
2 2 36 3 2 11
2 3 37 3 7 17
2 4 37 3 14 25
In other words, I want to merge by "study_id" and "timepoint" AND merge the two concentration columns so the data are within the same columns. Please note that both datasets do not have identical columns (dataset 1 has 1000 columns with metadata while dataset2 just has study id, timepoint, and concentration columns that match the concentration columns in dataset1).
Thanks so much in advance.

Using coalesce (from the dplyr package) is one option. The join still adds the .x and .y columns for concentration 1 and 2; they are dropped once the NA values have been filled in.
library(tidyverse)
df1 %>%
left_join(df2, by = c("study_id", "timepoint")) %>%
mutate(concentration1 = coalesce(concentration1.x, concentration1.y),
concentration2 = coalesce(concentration2.x, concentration2.y)) %>%
select(-concentration1.x, -concentration1.y, -concentration2.x, -concentration2.y)
Or to generalize with multiple concentration columns:
df1 %>%
left_join(df2, by = c("study_id", "timepoint")) %>%
split.default(str_remove(names(.), "\\.x|\\.y")) %>%
map_df(reduce, coalesce)
Edit: split.default returns the columns in alphabetical order. To keep the original order, you can add an intermediate step that reorders the list by the first data frame's column names.
df3 <- df1 %>%
left_join(df2, by = c("study_id", "timepoint")) %>%
split.default(str_remove(names(.), "\\.x|\\.y"))
df3[names(df1)] %>%
map_df(reduce, coalesce)
Output
study_id timepoint age occupation concentration1 concentration2
1 1 1 21 0 3 7
2 1 2 21 0 4 6
3 1 3 22 0 11 20
4 1 4 22 0 21 35
5 2 1 36 3 0 4
6 2 2 36 3 2 11
7 2 3 37 3 7 17
8 2 4 37 3 14 25
Data
df1 <- structure(list(study_id = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
timepoint = c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L), age = c(21L,
21L, 22L, 22L, 36L, 36L, 37L, 37L), occupation = c(0L, 0L,
0L, 0L, 3L, 3L, 3L, 3L), concentration1 = c(3L, 4L, NA, NA,
0L, 2L, NA, NA), concentration2 = c(7L, 6L, NA, NA, 4L, 11L,
NA, NA)), class = "data.frame", row.names = c(NA, -8L))
df2 <- structure(list(study_id = c(1L, 1L, 2L, 2L), timepoint = c(3L,
4L, 3L, 4L), concentration1 = c(11L, 21L, 7L, 14L), concentration2 = c(20L,
35L, 17L, 25L)), class = "data.frame", row.names = c(NA, -4L))
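
If the goal is only to fill the gaps, a join can be avoided altogether. The sketch below is a base R alternative, not the method from the answer above: it matches rows on a combined key and overwrites only the NA cells in df1. It assumes each (study_id, timepoint) pair occurs at most once in df2; the names key1, key2, idx and fill are illustrative:

```r
# Data copied from the answer above
df1 <- data.frame(
  study_id = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
  timepoint = c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L),
  age = c(21L, 21L, 22L, 22L, 36L, 36L, 37L, 37L),
  occupation = c(0L, 0L, 0L, 0L, 3L, 3L, 3L, 3L),
  concentration1 = c(3L, 4L, NA, NA, 0L, 2L, NA, NA),
  concentration2 = c(7L, 6L, NA, NA, 4L, 11L, NA, NA)
)
df2 <- data.frame(
  study_id = c(1L, 1L, 2L, 2L),
  timepoint = c(3L, 4L, 3L, 4L),
  concentration1 = c(11L, 21L, 7L, 14L),
  concentration2 = c(20L, 35L, 17L, 25L)
)

# Row of df2 that corresponds to each row of df1 (NA when there is none)
key1 <- paste(df1$study_id, df1$timepoint)
key2 <- paste(df2$study_id, df2$timepoint)
idx <- match(key1, key2)

# Fill only cells that are NA in df1 and have a matching row in df2
for (col in c("concentration1", "concentration2")) {
  fill <- is.na(df1[[col]]) & !is.na(idx)
  df1[[col]][fill] <- df2[[col]][idx[fill]]
}
df1
```

Because df1 is modified in place, its column order and the other metadata columns are left untouched.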

Two different id values for the same individuals in different datasets

I have two vectors of id values associated with two different datasets. The two vectors correspond to the same individuals, but the id vectors are unrelated (and there are multiple observations for each individual in each dataset). My goal is to merge them by id, but because the ids are different and they are different lengths there is no way to do that without matching on id. There's obviously a lot more data than what I included in the example.
a <- c(4033,4833,681,9567,6175,7112,3889,264,3918,7685)
b <- c(1,4,7,10,14,18,22,26,27,37)
So 4033 = 1; 4833 = 4...etc.
dummy dataset1:
id day y
1 1 10
1 2 4
1 3 2
4 1 9
4 2 10
4 3 6
dummy dataset2:
id day y1
4033 1 100
4033 1 120
4033 2 150
4033 3 200
4833 1 120
4833 2 100
4833 2 50
4833 3 100
4833 3 200
What I would like is an easy way to get:
dummy dataset1 output:
id day y id.2
1 1 10 4033
1 2 4 4033
1 3 2 4033
4 1 9 4833
4 2 10 4833
4 3 6 4833
I'm trying a solution in a forloop like:
for (i in length(dataset)) {
dataset$id[dataset[[1]] %in% int] <- int1
}
But that's not working correctly (probably for an obvious reason I'm missing).

Since a and b are parallel vectors, we can build a lookup with a named vector in base R:
df1$id.2 <- setNames(a, b)[as.character(df1$id)]
df1
# id day y id.2
#1 1 1 10 4033
#2 1 2 4 4033
#3 1 3 2 4033
#4 4 1 9 4833
#5 4 2 10 4833
#6 4 3 6 4833
Or another base R option is match
df1$id.2 <- a[match(df1$id, b)]
data
df1 <- structure(list(id = c(1L, 1L, 1L, 4L, 4L, 4L), day = c(1L, 2L,
3L, 1L, 2L, 3L), y = c(10L, 4L, 2L, 9L, 10L, 6L)),
class = "data.frame", row.names = c(NA,
-6L))
df2 <- structure(list(id = c(4033L, 4033L, 4033L, 4033L, 4833L, 4833L,
4833L, 4833L, 4833L), day = c(1L, 1L, 2L, 3L, 1L, 2L, 2L, 3L,
3L), y1 = c(100L, 120L, 150L, 200L, 120L, 100L, 50L, 100L, 200L
)), class = "data.frame", row.names = c(NA, -9L))

Another approach is to make a data.frame of the IDs and use merge.
datasetID <- data.frame(id = b, id.2 = a)
merge(dataset1,datasetID)
id day y id.2
1 1 1 10 4033
2 1 2 4 4033
3 1 3 2 4033
4 4 1 9 4833
5 4 2 10 4833
6 4 3 6 4833
Data
a <- c(4033,4833,681,9567,6175,7112,3889,264,3918,7685)
b <- c(1,4,7,10,14,18,22,26,27,37)
dataset1 <- structure(list(id = c(1L, 1L, 1L, 4L, 4L, 4L), day = c(1L, 2L,
3L, 1L, 2L, 3L), y = c(10L, 4L, 2L, 9L, 10L, 6L)), class = "data.frame", row.names = c(NA,
-6L))
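
One thing worth checking with any of these lookups is what happens to ids that have no counterpart. A short sketch with match, where the id 99 is made up for illustration and does not occur in b:

```r
a <- c(4033, 4833, 681, 9567, 6175, 7112, 3889, 264, 3918, 7685)
b <- c(1, 4, 7, 10, 14, 18, 22, 26, 27, 37)

# 99 is a hypothetical id with no entry in b
ids <- c(1, 4, 99)
res <- a[match(ids, b)]
res  # 4033 4833 NA
```

Unmatched ids come back as NA with match, whereas merge with its default arguments drops those rows, so the two approaches can return a different number of rows.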
