I'm new and learning R. I'm trying to ask a question that I don't know the words for.
Suppose I have a data frame such that:
df<-data.frame(ID=c("A","A","A","B","B","B","C","C","C"),
Week=c(1,2,3,1,2,3,1,2,3),
Variable=c(30,25,27,42,44,45,30,50,19))
ID Week Variable
1 A 1 30
2 A 2 25
3 A 3 27
4 B 1 42
5 B 2 44
6 B 3 45
7 C 1 30
8 C 2 50
9 C 3 19
How can I find the average Variable at Week 2 for all IDs that had Variable = 30 at Week 1?
For example, I would like the output in this example to be 37.5.
This might be easier to read/see.
library(tidyverse)
df %>%
  spread(Week, Variable) %>%
  filter(`1` == 30) %>%
  with(mean(`2`))
[1] 37.5
I think tidyverse code is easier to understand because you can read it left to right, like any non-code text, and the pipe %>% makes the order of operations clear: no nested parentheses to parse.
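If you are on a newer tidyr, spread() has been superseded by pivot_wider(); a minimal sketch of the same pipeline, assuming tidyr >= 1.0:
df %>%
  pivot_wider(names_from = Week, values_from = Variable) %>%  # one column per week
  filter(`1` == 30) %>%                                       # keep IDs with 30 in week 1
  with(mean(`2`))                                             # average their week-2 values
[1] 37.5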
Step 1: Obtain the IDs which had Variable=30 in Week1
res<- subset( df,Variable==30 & Week==1, ID )
The output is:
> res
ID
1 A
7 C
Step 2:
Get all their variables at week 2:
dt <- subset(df, ID %in% as.vector(unlist(res)) & Week==2, select=c(ID,Variable))
The output is:
ID Variable
2 A 25
8 C 50
Step 3: Get the mean:
mean(dt$Variable)
The final output is:
37.5
In step 2 we have ID %in% as.vector(unlist(res)). What does that mean?
The %in% operator returns TRUE for each element of the left-hand side that is found in the right-hand side vector. For example, run the sample below:
a<- 1:10
b<-c(0,4,6,8,16)
b %in% a
and the result is:
FALSE TRUE TRUE TRUE FALSE
So %in% returns a logical value for each element of b: TRUE if that element exists in a, otherwise FALSE. As you can see, 0 and 16 give FALSE.
The point is that the right-hand side should be a vector, while res is a data.frame, so I need to first unlist it and then treat it as a vector (as.vector).
In conclusion, ID %in% as.vector(unlist(res)) checks whether each ID exists in res or not.
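As a side note (a hedged simplification): res$ID already extracts that column as a vector, so step 2 can skip the unlist/as.vector call:
dt <- subset(df, ID %in% res$ID & Week == 2, select = c(ID, Variable))  # same rows as step 2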
First we need the IDs that have an entry with Variable=30 AND Week=1, then for those IDs we take the Week=2 rows and compute avg(Variable).
Base R Solution:
mean(df[df$ID %in% (df[df$Week==1 & df$Variable==30,1]) & df$Week==2,3])
Output:
[1] 37.5
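For readability, here is the same one-liner unpacked into named steps (a sketch; column positions 1 and 3 are ID and Variable):
ids <- df$ID[df$Week == 1 & df$Variable == 30]     # IDs with Variable 30 in week 1
mean(df$Variable[df$ID %in% ids & df$Week == 2])   # average of their week-2 values
[1] 37.5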
OR (another approach)
Using sqldf:
library(sqldf)
sqldf("select avg(Variable) from df where ID IN (select ID from df where variable=30 AND week=1) AND Week=2")
Output:
avg(Variable)
1 37.5
Related
This should be very simple but I can't figure out how to do it properly.
Given the following example dataframe:
telar <- data.frame(name=c("uno","dos","tres","cuatro","cinco"), id=c(1,2,3,1,2), test=c(10,11,12,13,14))
telar
name id test
1 uno 1 10
2 dos 2 11
3 tres 3 12
4 cuatro 1 13
5 cinco 2 14
I am trying to select all the rows that, for example, have a value of test that is below the average of all the values in the dataframe telar that have the same id value.
I have already properly grouped the values by id and computed their average like this, but I do not know how to perform the comparison.
> summarise(group_by(telar, id), test=mean(test))
A tibble: 3 x 2
id test
<dbl> <dbl>
1 1 11.5
2 2 12.5
3 3 12
Thank you!
You can simply create your groups and keep the values that are less than the mean, i.e.
library(dplyr)
telar %>%
  group_by(id) %>%
  filter(test < mean(test)) %>%
  ungroup()
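With the example telar above this should keep the rows whose test is below their id-group mean, i.e. uno (10 < 11.5) and dos (11 < 12.5); a rough check of the expected output:
telar %>% group_by(id) %>% filter(test < mean(test)) %>% ungroup()
# A tibble: 2 x 3
#   name     id  test
# 1 uno       1    10
# 2 dos       2    11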
There is undoubtedly a way to do this without using data.table, but I offer it as a solution
library(data.table)
setDT(telar)
telar[, avg := mean(test), by = id][test < avg]
Note: if you're doing further analysis on a data.frame after this, I recommend converting back with setDF(telar).
Using base R, this can be done with ave
telar[with(telar, test < ave(test, id)), ]
I have a data frame like this:
x=data.frame(type = c('a','b','c','a','b','a','b','c'),
value=c(5,2,3,2,10,6,7,8))
Every item has attributes a, b, c, but some items may be missing records, i.e. only have a and b.
The desired output is
y=data.frame(item=c(1,2,3), a=c(5,2,6), b=c(2,10,7), c=c(3,NA,8))
How can I transform x to y? Thanks
We can use dcast
library(data.table)
out <- dcast(setDT(x), rowid(type) ~ type, value.var = 'value')
setnames(out, 'type', 'item')
out
# item a b c
#1: 1 5 2 3
#2: 2 2 10 8
#3: 3 6 7 NA
Create a grouping vector g assuming each occurrence of a starts a new group, use tapply to create a table tab and coerce that to a data frame. No packages are used.
g <- cumsum(x$type == "a")
tab <- with(x, tapply(value, list(g, type), c))
as.data.frame(tab)
giving:
a b c
1 5 2 3
2 2 10 NA
3 6 7 8
An alternate definition of the grouping vector, slightly more complex but needed if some groups are missing their a row, is the following (a quick check on the example follows below). It assumes that x lists the type values in order of their levels within each group, so that if a level is less than the prior level it must be the start of a new group.
g <- cumsum(c(-1, diff(as.numeric(x$type))) < 0)
Note that ultimately there must be some restriction on missingness; otherwise, the problem is ambiguous. For example if one group can have b and c missing and then next group can have a missing then whether b and c in the second group actually form a second group or are part of the first group is not determinable.
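As a quick check of that alternate grouping vector on the example x (a hedged walk-through, assuming type is a factor; in R >= 4.0 you may need x$type <- factor(x$type) first):
as.numeric(x$type)                               # 1 2 3 1 2 1 2 3
g <- cumsum(c(-1, diff(as.numeric(x$type))) < 0)
g                                                # 1 1 1 2 2 3 3 3  -- same groups as before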
I have long vector of patient statuses in R that are chronologically sorted, and a label of associated patient IDs. This vector is an element of a dataframe. I would like to label consecutive rows of data for which the patient status is the same. If the status changes, then reverts to its original value, that would be three separate events. This is different than most situations I have searched where duplicated or match would suffice.
An example would be along the lines of:
s <- c(0,0,0,1,1,1,0,0,2,1,1,0,0)
id <- c(1,1,1,1,1,1,1,2,2,2,2,2,2)
and the desired output would be
flag <- c(1,1,1,2,2,2,3,1,2,3,3,4,4)
or
flag <- c(1,1,1,2,2,2,3,4,5,6,6,7,7)
One inelegant approach would be to generate the sequence:
unlist(tapply(s, id, function(x) cumsum(c(T, x[-1] != rev(rev(x)[-1])))))
Is there a better way?
I think you could use rleid from data.table for this:
library(data.table)
rleid(s,id)
Output:
1 1 1 2 2 2 3 4 5 6 6 7 7
Or for the first sequence:
data.table(s,id)[,rleid(s),id]$V1
Output:
1 1 1 2 2 2 3 1 2 3 3 4 4
Run Length Encoding - rle()
tapply(s, id, function(x) {
  v <- rle(x)$lengths
  rep(1:length(v), v)
})
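To turn that list back into a single flag vector like the ones above, unlist the result (a sketch; assumes the rows are already sorted by id, as in the example):
flag <- unname(unlist(tapply(s, id, function(x) {
  v <- rle(x)$lengths
  rep(1:length(v), v)
})))
flag
# [1] 1 1 1 2 2 2 3 1 2 3 3 4 4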
I need to pull records from a first data set (called df1 here) based on a combination of specific dates, ID#s, event start time, and event end time that match with a second data set (df2). Everything works fine when there is just 1 date, ID, and event start and end time, but some of the matching records between the data sets contain multiple IDs, dates, or times, and I can't get the records from df1 to subset properly in those cases. I ultimately want to put this in a FOR loop or independent function since I have a rather large dataset. Here's what I've got so far:
I started just by matching the dates between the two data sets as follows:
match_dates <- as.character(intersect(df1$Date, df2$Date))
Then I selected the records in df2 based on the first matching date, also keeping the other columns so I have the other ID and time information I need:
records <- df2[which(df2$Date == match_dates[1]), ]
The date, ID, start, and end time from records are then:
[1] "01-04-2009" "599091" "12:00" "17:21"
Finally I subset df1 for before and after the event based on the date, ID, and times in records and combined them into a new data frame called final to get at the data contained in df1 that I ultimately need.
before <- subset(df1, NUM==records$ID & Date==records$Date & Time<records$Start)
after <- subset(df1, NUM==records$ID & Date==records$Date & Time>records$End)
final <- rbind(before, after)
Here's the real problem - some of the matching dates have more than 1 corresponding row in df2, and return multiple IDs or times. Here is what an example of multiple records looks like:
records <- df2[which(df2$Date == match_dates[25]), ]
> records$ID
[1] 507646 680845 680845
> records$Date
[1] "04-02-2009" "04-02-2009" "04-02-2009"
> records$Start
[1] "09:43" "05:37" "11:59"
> records$End
[1] "05:19" "11:29" "16:47"
When I try to subset df1 based on this I get an error:
before <- subset(df1, NUM==records$ID & Date==records$Date & Time<records$Start)
Warning messages:
1: In NUM == records$ID :
longer object length is not a multiple of shorter object length
2: In Date == records$Date :
longer object length is not a multiple of shorter object length
3: In Time < records$Start :
longer object length is not a multiple of shorter object length
Trying to do it manually for each ID-date-time combination would be way too tedious. I have 9 years' worth of data, all with multiple matching dates for a given year between the data sets, so ideally I would like to set this up as a FOR loop, or a function with a FOR loop in it, but I can't get past this. Thanks in advance for any tips!
If you're asking what I think you are, the filter() function from the dplyr package combined with the %in% operator (which is built on match) does what you're looking for.
> df1 <- data.frame(A = c(rep(1,4),rep(2,4),rep(3,4)), B = c(rep(1:4,3)))
> df1
A B
1 1 1
2 1 2
3 1 3
4 1 4
5 2 1
6 2 2
7 2 3
8 2 4
9 3 1
10 3 2
11 3 3
12 3 4
> df2 <- data.frame(A = c(1,2), B = c(3,4))
> df2
A B
1 1 3
2 2 4
> library(dplyr)
> filter(df1, A %in% df2$A, B %in% df2$B)
A B
1 1 3
2 1 4
3 2 3
4 2 4
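If you do need the before/after subsetting from the question when records has several rows, one hedged sketch is to handle each row of records separately and stack the results (the column names NUM, Date, Time, ID, Start and End are taken from the question and assumed to exist as shown):
final <- do.call(rbind, lapply(seq_len(nrow(records)), function(i) {
  r <- records[i, ]                                    # one ID / date / time window
  before <- subset(df1, NUM == r$ID & Date == r$Date & Time < r$Start)
  after  <- subset(df1, NUM == r$ID & Date == r$Date & Time > r$End)
  rbind(before, after)
}))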
I have a dataframe containing a set of parts and test results. The parts are tested on 3 sites (North, Centre and South). Sometimes those parts are re-tested. I want to eventually create some charts that compare the results from the first time that a part was tested with the second (or third, etc.) time that it was tested, e.g. to look at tester repeatability.
As an example, I've come up with the below code. I've explicitly removed the "Experiment" column from the morley data set, as this is the column I'm effectively trying to recreate. The code works, however it seems that there must be a more elegant way to approach this problem. Any thoughts?
Edit - I realise that the example given was overly simplistic for my actual needs (I was trying to generate a reproducible example as easily as possible).
New example:
part<-as.factor(c("A","A","A","B","B","B","A","A","A","C","C","C"))
site<-as.factor(c("N","C","S","C","N","S","N","C","S","N","S","C"))
result<-c(17,20,25,51,50,49,43,45,47,52,51,56)
data<-data.frame(part,site,result)
data$index<-1
repeat {
if(!anyDuplicated(data[,c("part","site","index")]))
{ break }
data$index<-ifelse(duplicated(data[,1:2]),data$index+1,data$index)
}
data
part site result index
1 A N 17 1
2 A C 20 1
3 A S 25 1
4 B C 51 1
5 B N 50 1
6 B S 49 1
7 A N 43 2
8 A C 45 2
9 A S 47 2
10 C N 52 1
11 C S 51 1
12 C C 56 1
Old example:
#Generate a trial data frame from the morley dataset
df<-morley[,c(2,3)]
#Set up an iterative variable
#Create the index column and initialise to 1
df$index<-1
# Loop through the dataframe looking for duplicate pairs of
# Runs and Indices and increment the index if it's a duplicate
repeat {
if(!anyDuplicated(df[,c(1,3)]))
{ break }
df$index<-ifelse(duplicated(df[,c(1,3)]),df$index+1,df$index)
}
# Check - The below vector should all be true
df$index==morley$Expt
We may use diff and cumsum on the 'Run' column to get the expected output. With this method we don't create a column of 1s (i.e. 'index'), and we assume that the sequence in 'Run' is ordered as shown in the OP's example.
indx <- cumsum(c(TRUE,diff(df$Run)<0))
identical(indx, morley$Expt)
#[1] TRUE
Or we can use ave
indx2 <- with(df, ave(Run, Run, FUN=seq_along))
identical(indx2, morley$Expt)
#[1] TRUE
Update
Using the new example
with(data, ave(seq_along(part), part, site, FUN=seq_along))
#[1] 1 1 1 1 1 1 2 2 2 1 1 1
Or we can use getanID from library(splitstackshape)
library(splitstackshape)
getanID(data, c('part', 'site'))[]
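If I recall correctly, getanID appends a .id counter within each part/site group, so for the new example it should give the same sequence as the ave call above (a hedged note):
library(splitstackshape)
getanID(data, c('part', 'site'))$.id
# [1] 1 1 1 1 1 1 2 2 2 1 1 1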
I think this is a job for make.unique, with some manipulation.
index <- 1L + as.integer(sub("\\d+(\\.)?","",make.unique(as.character(morley$Run))))
index <- ifelse(is.na(index),1L,index)
identical(index,morley$Expt)
[1] TRUE
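To see why this works, a minimal illustration of what make.unique produces on a small vector (hedged sketch):
make.unique(as.character(c(1, 1, 2, 2, 2)))
# [1] "1"   "1.1" "2"   "2.1" "2.2"
sub("\\d+(\\.)?", "", make.unique(as.character(c(1, 1, 2, 2, 2))))
# [1] ""  "1" ""  "1" "2"   -- the suffix counts the repeats; as.integer turns "" into NA, marking first occurrences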
Details of your actual data.frame may matter. However, a couple of options working with your example:
#this works if each group starts with 1:
df$index<-cumsum(df$Run==1)
#this is maybe more general, with data.table
require(data.table)
dt<-as.data.table(df)
dt[,index:=seq_along(Speed),by=Run]
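Both options can be checked against the original column, as in the earlier answers (a hedged check):
identical(as.integer(cumsum(df$Run == 1)), morley$Expt)
#[1] TRUE
identical(dt$index, morley$Expt)
#[1] TRUE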