Subsetting a data.frame to return the first 200 rows for a specific condition in R

I have a data.frame with 3.3 million rows and 9 columns. Below is an example with the 3 relevant columns.
StimulusName Subject PupilMeans
1 1 101 3.270000
2 1 101 3.145000
3 1 101 3.265000
4 2 101 3.015000
5 2 101 3.100000
6 2 101 3.051250
7 1 102 3.035000
8 1 102 3.075000
9 1 102 3.050000
10 2 102 3.056667
11 2 102 3.059167
12 2 102 3.060000
13 1 103 3.085000
14 1 103 3.125000
15 1 103 3.115000
I want to subset the data based on StimulusName and Subject and then take either the first few or the last few rows of that subset. So, for example, rows 10 and 11 would be returned by taking the first 2 rows where df$StimulusName == 2 & df$Subject == 102.
The actual data frame contains thousands of observations per StimulusName and Subject. I want to use it to plot the first 200 and the last 200 observations of a stimulus separately.

I have not tested this, but it should work.
First 200
df_filtered <- subset(df, StimulusName == 2 & Subject == 102)
df_filtered <- df_filtered[1:200,]
Then plot df_filtered.
Last 200
df_filtered <- subset(df, StimulusName == 2 & Subject == 102)
df_filtered <- df_filtered[(nrow(df_filtered)-199):nrow(df_filtered),]
Then plot df_filtered.
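A slightly more defensive variant, not in the original answer, uses head() and tail(); unlike numeric indexing, these simply return the whole subset instead of padding with NA rows when fewer than 200 observations are available:
# First and last 200 rows of the subset (or all rows if there are fewer than 200)
first200 <- head(subset(df, StimulusName == 2 & Subject == 102), 200)
last200  <- tail(subset(df, StimulusName == 2 & Subject == 102), 200)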

Perhaps you want something like this:
subCond <- function(x, r, c) {
  m <- x[x[, 1] == r & x[, 2] == c, ]
  return(m)
}
Yields e.g.:
> subCond(df, 1, 102)
StimulusName Subject PupilMeans
7 1 102 3.035
8 1 102 3.075
9 1 102 3.050
or
> subCond(df, 2, 101)
StimulusName Subject PupilMeans
4 2 101 3.01500
5 2 101 3.10000
6 2 101 3.05125
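To pull just the first or last few rows of such a subset, the helper can be combined with head() or tail() (a sketch, assuming the column order shown above):
head(subCond(df, 2, 102), 200)   # first 200 rows for StimulusName 2, Subject 102
tail(subCond(df, 2, 102), 200)   # last 200 rows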


Is there code to determine the number of criteria met by a row in R?

I am trying to figure out a way to add a column that lists the number of criteria met by a given row. For example, I am looking at how many risk factors for heart disease someone has and trying to run an ordinal regression on those values. I have tried
cvd_status <- ifelse( data_tot$X5_A_01_d_Heart.Disease=="1"|data_tot$X5_A_01_e_Stroke=="1"|data_tot$X5_A_01_f_Chronic.Kidney.Disease==1, 1,0)
but that only gives me whether people have any risk factors, not how many risk factors they have. Is there any way to figure out how many risk factors someone would have?
Edit: The variables are not simply binary, but are either 1s or 2s or ranges of numbers.
If the variables contain only 0 or 1, then the following could be used:
with(data_tot,
     rowSums(cbind(X5_A_01_d_Heart.Disease,
                   X5_A_01_e_Stroke,
                   X5_A_01_f_Chronic.Kidney.Disease))
)
Edit:
And if they are coded as 1 (yes) and 2 (no), plus if other risk factors such as blood pressure and cholesterol level are to be included, AND there are no missing values in these risk factor variables, then you can use something like the following:
library(dplyr)

data_tot %>%
  mutate(CVD_Risk.Factors =
           (Heart == 1) +
           (Stroke == 1) +
           (CKD == 1) +
           (Systolic_BP >= 130) + (Diastolic_BP >= 80) +
           (Cholesterol > 150))
Heart Stroke CKD Systolic_BP Diastolic_BP Cholesterol CVD_Risk.Factors
1 1 1 2 118 90 200 4
2 2 1 2 125 65 150 1
3 2 1 1 133 95 190 5
4 1 1 2 120 87 250 4
5 2 2 2 155 110 NA NA
6 2 2 2 130 105 140 2
You can see that if there are any missing values, then this would not work. One solution is to use rowwise and then sum.
data_tot %>%
  rowwise() %>% # This tells R to apply a function by the rows of the selected inputs
  mutate(CVD_Risk.Factors = sum( # This function has an "na.rm" argument
    (Heart == 1),
    (Stroke == 1),
    (CKD == 1),
    (Systolic_BP >= 130), (Diastolic_BP >= 80),
    (Cholesterol > 150), na.rm = TRUE)) # Omit NA in the summations
# A tibble: 6 x 7
Heart Stroke CKD Systolic_BP Diastolic_BP Cholesterol CVD_Risk.Factors
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 1 1 2 118 90 200 4
2 2 1 2 125 65 150 1
3 2 1 1 133 95 190 5
4 1 1 2 120 87 250 4
5 2 2 2 155 110 NA 2 # not NA
6 2 2 2 130 105 140 2
Data:
data_tot <- data.frame(Heart = c(1, 2, 2, 1, 2, 2),
                       Stroke = c(1, 1, 1, 1, 2, 2),
                       CKD = c(2, 2, 1, 2, 2, 2),
                       Systolic_BP = c(118, 125, 133, 120, 155, 130),
                       Diastolic_BP = c(90, 65, 95, 87, 110, 105),
                       Cholesterol = c(200, 150, 190, 250, NA, 140))
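A vectorized alternative (a sketch, not from the original answer) hands the logical columns to rowSums(), whose na.rm argument drops the NAs without the per-row overhead of rowwise():
library(dplyr)
data_tot %>%
  mutate(CVD_Risk.Factors = rowSums(cbind(Heart == 1, Stroke == 1, CKD == 1,
                                          Systolic_BP >= 130, Diastolic_BP >= 80,
                                          Cholesterol > 150), na.rm = TRUE))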

Subset specific row and last row from data frame

I have a data frame which contains data relating to the score of different events. There can be a number of scoring events for one game. What I would like to do is subset the occasions when the score goes above 5 or below -5. I would also like to get the last row for each ID. So for each ID, I would have one or more rows depending on whether the score goes above 5 or below -5. My actual data set contains many other columns of information, but if I learn how to do this then I'll be able to apply it to anything else I may want to do.
Here is a data set
ID Score Time
1 0 0
1 3 5
1 -2 9
1 -4 17
1 -7 31
1 -1 43
2 0 0
2 -3 15
2 0 19
2 4 25
2 6 29
2 9 33
2 3 37
3 0 0
3 5 3
3 2 11
So for this data set, I would hopefully get this output:
ID Score Time
1 -7 31
1 -1 43
2 6 29
2 9 33
2 3 37
3 2 11
So at the very least, for each ID there will be one line printed with the last score for that ID, regardless of whether the score goes above 5 or below -5 during the event (this occurs for ID 3).
My attempt can subset when the value goes above 5 or below -5, but I don't know how to write code to get the last line for each ID:
Data[Data$Score > 5 | Data$Score < -5, ]
Let me know if you need anymore information.
You can use rle to grab the last row for each ID. Check out ?rle for more information about this useful function.
Data2 <- Data[cumsum(rle(Data$ID)$lengths), ]
Data2
# ID Score Time
#6 1 -1 43
#13 2 3 37
#16 3 2 11
To combine the two conditions, use rbind.
Data2 <- rbind(Data[Data$Score > 5 | Data$Score < -5, ], Data[cumsum(rle(Data$ID)$lengths), ])
To get rid of rows that satisfy both conditions, you can use duplicated and rownames.
Data2 <- Data2[!duplicated(rownames(Data2)), ]
You can also sort if desired, of course.
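An equivalent way to pick out each ID's last row, not in the original answer, is duplicated() with fromLast = TRUE, which keeps the last occurrence of each ID:
Data[!duplicated(Data$ID, fromLast = TRUE), ]   # last row per ID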
Here's a go at it in data.table, where df is your original data frame.
library(data.table)
setDT(df)
df[df[, c(.I[!between(Score, -5, 5)], .I[.N]), by = ID]$V1]
# ID Score Time
# 1: 1 -7 31
# 2: 1 -1 43
# 3: 2 6 29
# 4: 2 9 33
# 5: 2 3 37
# 6: 3 2 11
We are grouping by ID. The between function finds the values between -5 and 5, and we negate that to get our desired values outside that range. We then use a .I subset to get the indices per group for those. Then .I[.N] gives us the row number of the last entry, per group. We use the V1 column of that result as our row subset for the entire table. You can take unique values if unique rows are desired.
Note: .I[c(which(!between(Score, -5, 5)), .N)] could also be used in the j entry of the first operation. Not sure if it's more or less efficient.
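To see what drives the subset, you can run the inner call on its own (a sketch, assuming the toy data above has already been converted with setDT(df)). Per ID, it should return the original row numbers of the out-of-range scores plus the last row:
df[, c(.I[!between(Score, -5, 5)], .I[.N]), by = ID]
#    ID V1
# 1:  1  5
# 2:  1  6
# 3:  2 11
# 4:  2 12
# 5:  2 13
# 6:  3 16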
Addition: Another method, one that uses only logical values and will never produce duplicate rows in the output, is
df[df[, .I == .I[.N] | !between(Score, -5, 5), by = ID]$V1]
# ID Score Time
# 1: 1 -7 31
# 2: 1 -1 43
# 3: 2 6 29
# 4: 2 9 33
# 5: 2 3 37
# 6: 3 2 11
Here is another base R solution.
df[as.logical(ave(df$Score, df$ID,
FUN=function(i) abs(i) > 5 | seq_along(i) == length(i))), ]
ID Score Time
5 1 -7 31
6 1 -1 43
11 2 6 29
12 2 9 33
13 2 3 37
16 3 2 11
abs(i) > 5 | seq_along(i) == length(i) constructs a logical vector that returns TRUE for each element that fits your criteria. ave applies this function to each ID. The resulting logical vector is used to select the rows of the data.frame.
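Note that ave() returns a vector of the same type as its first argument, so the per-group logical flags come back as 0/1; that is why the result is wrapped in as.logical() before it is used to index the rows. A minimal illustration (a sketch):
ave(c(1, -7, 2), c("a", "a", "b"),
    FUN = function(i) abs(i) > 5 | seq_along(i) == length(i))
# [1] 0 1 1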
Here's a tidyverse solution. Not as concise as some of the above, but easier to follow.
library(tidyverse)
lastrows <- Data %>% group_by(ID) %>% top_n(1, Time)
scorerows <- Data %>% group_by(ID) %>% filter(!between(Score, -5, 5))
bind_rows(scorerows, lastrows) %>% arrange(ID, Time) %>% unique()
# A tibble: 6 x 3
# Groups: ID [3]
# ID Score Time
# <int> <int> <int>
# 1 1 -7 31
# 2 1 -1 43
# 3 2 6 29
# 4 2 9 33
# 5 2 3 37
# 6 3 2 11
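In more recent dplyr releases (1.0.0 and later) top_n() has been superseded; slice_max() gives the same last-row-by-Time selection (a sketch):
lastrows <- Data %>% group_by(ID) %>% slice_max(Time, n = 1)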

How to use apply function once for each unique factor value

I'm trying out some commands on the built-in R dataset ChickWeight. The data looks as follows.
weight Time Chick Diet
1 42 0 1 1
2 51 2 1 1
3 59 4 1 1
4 64 6 1 1
5 76 8 1 1
6 93 10 1 1
7 106 12 1 1
8 125 14 1 1
9 149 16 1 1
10 171 18 1 1
11 199 20 1 1
12 205 21 1 1
13 40 0 2 1
14 49 2 2 1
15 58 4 2 1
Now what I would like to do is simply output, for each Chick, the difference between the weight at Time 0 and at Time 21 (the last time value), i.e. the weight the chick has put on.
I've been trying tapply(ChickWeight$weight, ChickWeight$Chick, function(x) x[length(x)] - x[1]). But this of course applies the value to all rows.
How do I make it so that it applies only once for each unique Chick-value?
If we need a single value for each group (assuming that 'Chick' and 'Diet' are the grouping columns):
library(data.table)
setDT(df1)[, list(Diff= abs(weight[Time==21]-weight[Time==0])) ,.(Chick, Diet)]
and if we need to create a column:
setDT(df1)[, Diff:= abs(weight[Time==21]-weight[Time==0]) ,.(Chick, Diet)]
I noticed that in the example Time = 21 is not found for Chick no. 2; maybe in that case we need just the one value that is present:
setDT(df1)[, {tmp <- Time %in% c(0,21)
list(Diff= if(sum(tmp)>1) abs(diff(weight[tmp])) else weight[tmp]) } ,
by = .(Chick, Diet)]
# Chick Diet Diff
#1: 1 1 163
#2: 2 1 40
If we are taking the difference of 'weight' based on the max and min 'Time' for each group
setDT(df1)[, list(Diff=weight[which.max(Time)]-
weight[which.min(Time)]), .(Chick, Diet)]
# Chick Diet Diff
#1: 1 1 163
#2: 2 1 18
Also, if the 'Time' is ordered
setDT(df1)[, list(Diff= abs(diff(weight[c(1L,.N)]))), by =.(Chick, Diet)]
Using by from base R
by(df1[1:2], df1[3:4], FUN= function(x) with(x,
abs(weight[which.max(Time)]-weight[which.min(Time)])))
#Chick: 1
#Diet: 1
#[1] 163
#------------------------------------------------------------
#Chick: 2
#Diet: 1
#[1] 18
Here's a solution using dplyr:
library(dplyr)

ChickWeight %>%
  group_by(Chick = as.numeric(as.character(Chick))) %>%
  summarise(weight_gain = last(weight) - first(weight), final_time = last(Time))
(first and last as suggested by @ulfelder.)
Note that ChickWeight$Chick is an ordered factor so without coercing it into numeric the final order looks odd.
Using base R:
ChickWeight$Chick <- as.numeric(as.character(ChickWeight$Chick))
tapply(ChickWeight$weight, ChickWeight$Chick, function(x) x[length(x)] - x[1])
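An aggregate() version of the same idea (a sketch, relying on the rows being ordered by Time within each Chick, as they are in ChickWeight):
aggregate(weight ~ Chick, data = ChickWeight,
          FUN = function(x) x[length(x)] - x[1])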

cut with varied intervals?

I have a dataset with two variables, one a grouping variable and the other a value. The data are sorted by value within each group. I want to cut the value variable into a factor within each group, starting a new level whenever a value is 10 or more above the first value of the current level; that is, as soon as the difference from the level's starting value exceeds 9, a new level is created. Below is demo data, where newgrp is the new variable I want. Maybe filter() is what is needed here, but I have been in a daze with it for quite a while. Any thoughts?
grp val newgrp
a 101 1
a 101 1
a 102 1
a 110 1
a 111 2 <-- a new level is created since 111 - 101 > 9
a 112 2
a 148 3 <-- a new level is created since 148 - 111 > 9
a 157 3
a 158 4 <-- a new level is created since 158 - 148 > 9
b 8 1 <-- levels start over for group b
b 9 1
b 12 1
b 17 1
b 18 2
Edit
I don't think there's any way to avoid defining a function first that will loop through each vector, since two numbers (the "base" and "new group") need to be reset every time a large enough difference is encountered.
NewGroup = function(x)
{
  base = x[1]
  new = 1
  newgrp = c()
  for (i in seq_along(x))
  {
    if (x[i] - base > 9)
    {
      base = x[i]
      new = new + 1
    }
    newgrp[i] <- new
  }
  return(newgrp)
}
dt[, newgrp := NewGroup(val), by = grp]
grp val newgrp
1: a 101 1
2: a 101 1
3: a 102 1
4: a 110 1
5: a 111 2
6: a 112 2
7: a 148 3
8: a 157 3
9: a 158 4
10: b 8 1
11: b 9 1
12: b 12 1
13: b 17 1
14: b 18 2
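A loop-free base R variant of the same resetting logic (a sketch, not from the original answers): Reduce() carries the running "base" value forward, and a new group starts wherever that base changes.
NewGroup2 <- function(x) {
  base <- Reduce(function(b, v) if (v - b > 9) v else b, x, accumulate = TRUE)
  cumsum(c(TRUE, diff(base) != 0))
}
dt[, newgrp2 := NewGroup2(val), by = grp]  # should match NewGroup()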
You can also do this per group with by() in base R, reusing the NewGroup() function defined above:
do.call(rbind, by(yourdf, yourdf$grp, function(df) within(df, newgrp <- NewGroup(val))))
Replace yourdf with your data frame. (Note that a simple cumsum(diff(val) > 9) would not reproduce the desired grouping, because each value has to be compared against the first value of the current group, not the previous value.)

Using one data frame to sum a range of data from another data frame in R

I'm migrating from SAS to R. I need help figuring out how to sum up weather data for date ranges. In SAS, I take the date ranges, use a data step to create a record for every date (with startdate, enddate, date) in the range, merge with weather and then summarize (VAR hdd cdd; CLASS=startdate enddate sum=) to sum up the values for the date range.
R code:
startdate <- c(100,103,107)
enddate <- c(105,104,110)
billperiods <- data.frame(startdate, enddate)
to get:
> billperiods
startdate enddate
1 100 105
2 103 104
3 107 110
R code:
weatherdate <- c(100:103,105:110)
hdd <- c(0,0,4,5,0,0,3,1,9,0)
cdd <- c(4,1,0,0,5,6,0,0,0,10)
weather <- data.frame(weatherdate,hdd,cdd)
to get:
> weather
weatherdate hdd cdd
1 100 0 4
2 101 0 1
3 102 4 0
4 103 5 0
5 105 0 5
6 106 0 6
7 107 3 0
8 108 1 0
9 109 9 0
10 110 0 10
Note: weatherdate = 104 is missing. I may not have weather for a day.
I can't figure out how to get to:
> billweather
startdate enddate sumhdd sumcdd
1 100 105 9 10
2 103 104 5 0
3 107 110 13 10
where sumhdd is the sum of the hdd's from startdate to enddate in the weather data.frame.
Any ideas?
Here's a method using IRanges and data.table. For this question it may seem like overkill, but in general I find it convenient to use IRanges to deal with intervals, however simple they may be.
# load packages
require(IRanges)
require(data.table)
# convert data.frames to data.tables
dt1 <- data.table(billperiods)
dt2 <- data.table(weather)
# construct Ranges to get overlaps
ir1 <- IRanges(dt1$startdate, dt1$enddate)
ir2 <- IRanges(dt2$weatherdate, width=1) # start = end
# find Overlaps
olaps <- findOverlaps(ir1, ir2)
# Hits of length 10
# queryLength: 3
# subjectLength: 10
# queryHits subjectHits
# <integer> <integer>
# 1 1 1
# 2 1 2
# 3 1 3
# 4 1 4
# 5 1 5
# 6 2 4
# 7 3 7
# 8 3 8
# 9 3 9
# 10 3 10
# get billweather (final output)
billweather <- cbind(dt1[queryHits(olaps)],
                     dt2[subjectHits(olaps),
                         list(hdd, cdd)])[, list(sumhdd = sum(hdd),
                                                 sumcdd = sum(cdd)),
                                          by = list(startdate, enddate)]
# startdate enddate sumhdd sumcdd
# 1: 100 105 9 10
# 2: 103 104 5 0
# 3: 107 110 13 10
Code breakdown for the last step: first I use queryHits, subjectHits and cbind to construct an intermediate data.table; then I group by startdate and enddate and take the sum of hdd and the sum of cdd. It is easier to follow when the line is split in two, as shown below.
# split for easier understanding
billweather <- cbind(dt1[queryHits(olaps)],
                     dt2[subjectHits(olaps), list(hdd, cdd)])
billweather <- billweather[, list(sumhdd = sum(hdd),
                                  sumcdd = sum(cdd)),
                           by = list(startdate, enddate)]
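For the record, since data.table 1.9.8 a non-equi join can do the same aggregation without IRanges (a sketch; the first two output columns carry the startdate/enddate bounds):
dt2[dt1, on = .(weatherdate >= startdate, weatherdate <= enddate),
    .(sumhdd = sum(hdd), sumcdd = sum(cdd)), by = .EACHI]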
A base R alternative, looping over the bill periods with apply and summing the matching weather rows:
cbind(billperiods, t(sapply(apply(billperiods, 1, function(x)
  weather[weather$weatherdate >= x[1] &
          weather$weatherdate <= x[2], c("hdd", "cdd")]), colSums)))
startdate enddate hdd cdd
1 100 105 9 10
2 103 104 5 0
3 107 110 13 10
Or, selecting the matching dates with %in%:
billweather <- cbind(billperiods,
                     t(apply(billperiods, 1, function(x) {
                       colSums(weather[weather[, 1] %in% c(x[1]:x[2]), 2:3])
                     })))
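A Map-based variant (a sketch, not from the original answers) avoids apply()'s coercion of the data frame to a matrix by iterating over the start/end pairs directly:
sums <- Map(function(s, e) colSums(weather[weather$weatherdate >= s &
                                           weather$weatherdate <= e, c("hdd", "cdd")]),
            billperiods$startdate, billperiods$enddate)
cbind(billperiods, do.call(rbind, sums))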
