How to create a new variable on condition of others in R

I have the following data frame:
ID  Measurement A  Measurement B  Date of Measurements A and B  Date of Measurement C
1   23             24             12                            16
1   22             23             12                            15
1   24             22             12                            17
1   21             20             12                            11
1   27             29             12                            17
This is an example using one identifier (ID); in reality I have thousands.
I want to create a variable which encapsulates
"if this ID's Measurement A OR Measurement B is > xxx, before the date of Measurement C, ON MORE THAN TWO OCCASIONS, then designate
them a 1 in a new column called new_var".
So far, I have removed all rows where the Date of Measurements A and B is not before the Date of Measurement C:
measurements <- subset(measurements, dateofmeasurementsAandB < dateofmeasurementC)
And then I added the cut-offs in an ifelse statement:
measurements$new_var<- ifelse(measurements$measurementA >= xxx | measurements$measurementB >= xxx, 1, 0)
But I can't factor in the 'on more than one occasion' bit (as you can see from the example, each ID has multiple rows/occasions).
Any help would be great, especially if it could be done more simply!

If I understand what you're asking, I think I would use dplyr's count function:
#Starting from your dataframe
library(tidyverse)
df <- measurements %>%
  # inside filter(), refer to columns by bare name rather than measurements$...
  filter(dateofmeasurementsAandB < dateofmeasurementC,
         measurementA >= xxx | measurementB >= xxx)
After this filter, the data frame contains only the rows that meet your conditions, so now we count them per ID and filter the result:
df <- df %>% count(ID) %>% filter(n >= 2)
The vector df$ID now contains only the IDs that met the conditions at least twice. You can feed that back into your original measurements data frame in several ways, but I'm partial to this:
measurements$new_var <- 0
measurements$new_var[measurements$ID %in% df$ID] <- 1
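For comparison, here is a minimal base R sketch of the same idea that keeps every row and flags the ID instead of subsetting first. The example values, the second ID, and the `threshold` variable (standing in for the unspecified cut-off xxx) are assumptions made for illustration only, and "at least two occasions" is assumed as the rule.

```r
# Hypothetical example data; column names follow the question's code
measurements <- data.frame(
  ID = c(1, 1, 1, 2, 2, 2),
  measurementA = c(23, 22, 24, 10, 27, 11),
  measurementB = c(24, 23, 22, 12, 9, 10),
  dateofmeasurementsAandB = c(12, 12, 12, 12, 12, 18),
  dateofmeasurementC = c(16, 15, 17, 16, 15, 17)
)
threshold <- 22  # assumed value standing in for the question's "xxx"

# TRUE for rows measured before measurement C with A or B at or above the cut-off
hit <- with(measurements,
            dateofmeasurementsAandB < dateofmeasurementC &
              (measurementA >= threshold | measurementB >= threshold))

# Number of qualifying rows per ID, repeated on every row of that ID
n_hits <- ave(as.integer(hit), measurements$ID, FUN = sum)

# Flag IDs with at least two qualifying occasions
measurements$new_var <- as.integer(n_hits >= 2)
```

Because no rows are dropped, IDs that never qualify still appear in the result with new_var equal to 0. Change `n_hits >= 2` if the rule should be stricter.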

Related

How to filter rows where there are changes in a categorical variable

We have a dataframe called data with 2 columns: Time which is arranged in ascending order, and Place which describes where the individual was:
data.frame(Time = seq(1,20,1),
Place = rep(letters[c(1:3,1)], c(5,5,3,7)))
Since this data is in ascending order with respect to Time, we want to subset the rows where Place changes from the previous observation.
The resulting dataframe for this data would look like this:
Time Place
1 a
6 b
11 c
14 a
Notice that the same Place can show up later, like Place == a did in this example. How can we perform this kind of subset in R?
Apply duplicated() to the run-length ID of Place, computed with data.table's rleid():
library(dplyr)
library(data.table)
df1 %>%
filter(!duplicated(rleid(Place)))
Or in base R with rle
subset(df1, !duplicated(with(rle(Place), rep(seq_along(values), lengths))))
-output
Time Place
1 1 a
6 6 b
11 11 c
14 14 a
Another base R option using subset + tail + head
subset(
  df1,
  c(TRUE, tail(Place, -1) != head(Place, -1))
)
which gives
Time Place
1 1 a
6 6 b
11 11 c
14 14 a
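To make the rle approach easier to follow, here is a small base R sketch that builds the run IDs step by step. It assumes the question's data frame has been assigned to df1, as the answers above do; the intermediate names (`runs`, `run_id`, `first_of_run`) are only illustrative.

```r
df1 <- data.frame(Time = seq(1, 20, 1),
                  Place = rep(letters[c(1:3, 1)], c(5, 5, 3, 7)),
                  stringsAsFactors = FALSE)

# rle() collapses consecutive repeats into (value, length) pairs
runs <- rle(df1$Place)

# Give every row the index of the run it belongs to: 1,1,1,1,1,2,2,...
run_id <- rep(seq_along(runs$values), runs$lengths)

# The first row of each run is where Place changed from the previous row
first_of_run <- !duplicated(run_id)
result <- df1[first_of_run, ]
```

Since the run ID increases each time Place changes, a place that shows up again later (like a here) gets a new ID and is kept.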

Re-bin a data frame in R

I have a data frame which holds activity (A) data across time (T) for a number of subjects (S) in different groups (G). The activity data were sampled every 10 minutes. What I would like to do is to re-bin the data into, say, 30-minute bins (either adding or averaging values) keeping the subject Id and group information.
Example. I have something like this:
S G T A
1 A 30 25
1 A 40 20
1 A 50 15
1 A 60 20
1 A 70 5
1 A 80 20
2 B 30 10
2 B 40 10
2 B 50 10
2 B 60 20
2 B 70 20
2 B 80 20
And I'd like something like this:
S G T A
1 A 40 20
1 A 70 15
2 B 40 10
2 B 70 20
Whether time is the average time (as in the example) or the first/last time point and whether the activity is averaged (again, as in the example) or summed is not important for now.
I will appreciate any help you can provide on this. I was thinking about creating a script in Python to re-bin this particular dataframe, but I thought that there may be a way of doing it in R in a way that may be applied to any dataframe with differing numbers of columns, etc.
There are several ways to get the desired data frame.
I have reproduced your dataframe:
df <- data.frame(S = c(rep(1,6),rep(2,6)),
G = c(rep("A",6),rep("B",6)),
T = rep(seq(30,80,10),2),
A = c(25, 20, 15, 20, 5, 20, 10, 10, 10, 20, 20, 20))
The classical way could be:
df[df$T == 40 | df$T == 70,]
The more modern tidyverse way is
library(tidyverse)
df %>% filter(T == 40 | T ==70)
If you want to get the average of each group of G filtered for T==40 and 70:
df %>% filter(T == 40 | T == 70) %>%
group_by(G) %>%
mutate(A = mean(A))
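Note that filtering on T == 40 | T == 70 keeps only the rows at those time points; it does not average the three samples in each 30-minute window. A possible base R sketch of an actual re-binning is below. The `bin_width` variable and the `bin` helper column are names introduced here for illustration, and bins are assumed to start at the earliest time point in the data.

```r
df <- data.frame(S = c(rep(1, 6), rep(2, 6)),
                 G = c(rep("A", 6), rep("B", 6)),
                 T = rep(seq(30, 80, 10), 2),
                 A = c(25, 20, 15, 20, 5, 20, 10, 10, 10, 20, 20, 20))

bin_width <- 30  # minutes per bin (assumed)

# Label each row with a bin index, counting from the earliest time point
df$bin <- (df$T - min(df$T)) %/% bin_width

# Average time and activity within each subject/group/bin
binned <- aggregate(cbind(T, A) ~ S + G + bin, data = df, FUN = mean)

# Restore a readable row order and drop the helper column
binned <- binned[order(binned$S, binned$bin), c("S", "G", "T", "A")]
```

Swapping FUN = mean for FUN = sum would add the values instead, although T would then be summed too, so it may be better to aggregate A alone in that case.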

How do I get the difference of two groups in one dataframe (longtable) in R?

I have this given dataframe:
days classtype scores
1 1 a 49
2 1 b 47
3 2 a 36
4 2 b 41
It is produced by this code:
days=c(1,1,2,2)
classtype=c("a","b","a","b")
scores=c(49,47,36,41)
myData=data.frame(days,classtype,scores)
print(myData)
What lines do I need to add to the code in order to calculate the difference in scores of the two classes for each day? I want to get this output:
days difference_in_scores
1 1 2
2 2 -5
If the format of your data is consistently as you have shown then you can accomplish this very neatly using data.table:
setDT(myData)
myData[, diff(scores), by = days]
days V1
1: 1 -2
2: 2 5
Or using just base-R:
aggregate(scores ~ days, myData, FUN = diff)
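One thing to watch: diff() subtracts each element from the one after it, so with rows ordered a, b it returns b - a. That is why the output above shows -2 and 5 rather than the 2 and -5 asked for. A small base R sketch that computes a - b explicitly, assuming exactly one score per class per day:

```r
myData <- data.frame(days = c(1, 1, 2, 2),
                     classtype = c("a", "b", "a", "b"),
                     scores = c(49, 47, 36, 41))

# Split by class and order by day so the rows line up
a <- myData[myData$classtype == "a", ]
b <- myData[myData$classtype == "b", ]
a <- a[order(a$days), ]
b <- b[order(b$days), ]
stopifnot(identical(a$days, b$days))  # assumes one score per class per day

result <- data.frame(days = a$days,
                     difference_in_scores = a$scores - b$scores)
```

This does not depend on the row order within each day, which the diff() versions do.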
One approach you could take
library(dplyr)
library(reshape2)
days=c(1,1,2,2)
classtype=c("a","b","a","b")
scores=c(49,47,36,41)
myData=data.frame(days,classtype,scores)
myData %>%
# convert the data to wide format
dcast(days ~ classtype,
value.var = "scores") %>%
# calculate differences
mutate(difference_in_scores = a - b) %>%
# remove columns (just to match your desired output)
select(days, difference_in_scores)

How to re-arrange a data.frame

I am interested in re-arranging a data.frame in R. Bear with me as I stumble through a reproducible example.
I have a nominal variable which can have one of two values. Currently this nominal variable is a column. Instead I would like to have two columns, representing the two values this nominal variable can have. Here is an example data frame. s is the nominal variable with values t and c.
n <- c(1,1,2,2,3,3,4,4)
s <- c("t","c","t","c","t","c","t","c")
b <- c(11,23,6,5,12,16,41,3)
mydata <- data.frame(n, s, b)
I would rather have a data frame that looked like this
n.n <- c(1,2,3,4)
trt <- c(11,6,12,41)
cnt <- c(23,5,16,3)
new.data <- data.frame(n.n, trt, cnt)
I am sure there is a way to use mutate or possibly tidyr but I am not sure what the best route is and my data frame that I would like to re-arrange is quite large.
You want spread:
library(dplyr)
library(tidyr)
new.data <- mydata %>% spread(s,b)
n c t
1 1 23 11
2 2 5 6
3 3 16 12
4 4 3 41
How about unstack(mydata, b~s):
c t
1 23 11
2 5 6
3 16 12
4 3 41
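If you would rather not load any packages, base R's reshape() can do the same long-to-wide step. This is only a sketch: the final renaming to n.n, trt and cnt just mirrors the column names in the question.

```r
mydata <- data.frame(n = c(1, 1, 2, 2, 3, 3, 4, 4),
                     s = c("t", "c", "t", "c", "t", "c", "t", "c"),
                     b = c(11, 23, 6, 5, 12, 16, 41, 3))

# One row per n, one column of b per level of s (named b.t and b.c)
wide <- reshape(mydata, idvar = "n", timevar = "s", direction = "wide")

# Rename to match the layout asked for in the question
new.data <- data.frame(n.n = wide$n, trt = wide$b.t, cnt = wide$b.c)

# In tidyr 1.0.0 and later, spread() is superseded by pivot_wider():
# tidyr::pivot_wider(mydata, names_from = s, values_from = b)
```

Unlike unstack(), this keeps the n column, which matters if some n values are missing one of the two levels.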

generate the subset based on a given index set

I have a data set, A, like
id grade
1 10
2 20
3 30
4 40
In addition, there is another index data set, B, like
id
2
3
I would like to extract the subset of A based on B; the result would look like
id grade
2 20
3 30
Here's a data.table solution. This will be much faster if your dataset A is large, or if you have to do this a large number of times.
set.seed(1) # for reproducible example
A <- data.frame(id=1:1e6,grade=10*(1:1e6)) # 1,000,000 rows
B <- data.frame(id=sample(1:1e6,1000)) # random sample of 1000 ids
library(data.table)
setDT(A) # convert A to a data.table
setkey(A,id) # set the key
result <- A[J(B$id)] # extract records based on id
In this example data.table is about 20 times faster than either %in% or merge(...).
Note also that while all three retrieve the same records, they are not necessarily in the same order.
A$id %in% B$id
creates a logical vector the length of A$id, whose elements are TRUE if that element is found in B$id; that vector is then used to subset A. So the records in the result are in the same order as in A.
merge(A,B)
sorts the result by the common column (id), so the result is sorted by increasing value of id. In your example and this example, these first two are the same.
A[J(B$id)]
returns a result ordered as B$id (which is random in this example, but would be the same as the other two approaches in your example).
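The ordering difference is easier to see on a tiny example where B is deliberately out of order. This sketch uses base R only; match() is used here as a stand-in that returns rows in the order of B$id, like the keyed lookup does. It is not the data.table join itself.

```r
A <- data.frame(id = 1:4, grade = c(10, 20, 30, 40))
B <- data.frame(id = c(3L, 2L))  # deliberately not in ascending order

by_in    <- A[A$id %in% B$id, ]     # rows in the order they appear in A
by_merge <- merge(A, B)             # rows sorted by the common column id
by_match <- A[match(B$id, A$id), ]  # rows in the order of B$id
```

Here by_in and by_merge both come back with id 2 then 3, while by_match comes back with id 3 then 2. Note that match() yields NA rows for any id in B that is missing from A, so it assumes every id in B exists in A.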
Try this:
> x <- data.frame(id=1:4, grade=(1:4)*10)
> x
id grade
1 1 10
2 2 20
3 3 30
4 4 40
> id <- 2:3
> x[ x$id %in% id, ]
id grade
2 2 20
3 3 30
Alternatively, you can also use merge:
> id <- data.frame(id=2:3)
> merge(x, id)
id grade
1 2 20
2 3 30
